The author and publisher have provided this e-book to you for your personal use only. You may not make this e-book publicly available in any way. Copyright infringement is against the law. If you believe the copy of this e-book you are reading infringes on the author’s copyright, please notify the publisher at: us.macmillanusa.com/piracy.
  
To my parents, who taught me how to be a thinking human, and so much more
  
==Prologue: Terrified ==
  
Computers seem to be getting smarter at an alarming rate, but one thing they still can’t do is appreciate irony. That’s what was on my mind a few years ago, when, on my way to a discussion about artificial intelligence (AI), I got lost in the capital of searching and finding—the Googleplex, Google’s world headquarters in Mountain View, California. What’s more, I was lost inside the Google Maps building. Irony squared.
  
The Maps building itself had been easy to find. A Google Street View car was parked by the front door, a hulking appendage crowned by a red-and-black soccer ball of a camera sticking up from its roof. However, once inside, with my prominent “Visitor” badge assigned by security, I wandered, embarrassed, among warrens of cubicles occupied by packs of Google workers, headphones over ears, intently typing on Apple desktops. After some (map-less) random search, I finally found the conference room assigned for the daylong meeting and joined the group gathered there.  
  
The meeting, in May 2014, had been organized by Blaise Agüera y Arcas, a young computer scientist who had recently left a top position at Microsoft to help lead Google’s machine intelligence effort. Google started out in 1998 with one “product”: a website that used a novel, extraordinarily successful method for searching the web. Over the years, Google has evolved into the world’s most important tech company and now offers a vast array of products and services, including Gmail, Google Docs, Google Translate, YouTube, Android, many more that you might use every day, and some that you’ve likely never heard of.  
  
Google’s founders, Larry Page and Sergey Brin, have long been motivated by the idea of creating artificial intelligence in computers, and this quest has become a major focus at Google. In the last decade, the company has hired a profusion of AI experts, most notably Ray Kurzweil, a well-known inventor and a controversial futurist who promotes the idea of an AI Singularity, a time in the near future when computers will become smarter than humans. Google hired Kurzweil to help realize this vision. In 2011, Google created an internal AI research group called Google Brain; since then, the company has also acquired an impressive array of AI start-up companies with equally optimistic names: Applied Semantics, DeepMind, and Vision Factory, among others.  
  
In short, Google is no longer merely a web-search portal—not by a long shot. It is rapidly becoming an applied AI company. AI is the glue that unifies the diverse products, services, and blue-sky research efforts offered by Google and its parent company, Alphabet. The company’s ultimate aspiration is reflected in the original mission statement of its DeepMind group: “Solve intelligence and use it to solve everything else.”1
  
===AI and GEB ===
  
I was pretty excited to attend an AI meeting at Google. I had been working on various aspects of AI since graduate school in the 1980s and had been tremendously impressed by what Google had accomplished. I also thought I had some good ideas to contribute. But I have to admit that I was there only as a tagalong. The meeting was happening so that a group of select Google AI researchers could hear from and converse with Douglas Hofstadter, a legend in AI and the author of a famous book cryptically titled Gödel, Escher, Bach: an Eternal Golden Braid, or more succinctly, GEB (pronounced “gee-ee-bee”). If you’re a computer scientist, or a computer enthusiast, it’s likely you’ve heard of it, or read it, or tried to read it.
  
Written in the 1970s, GEB was an outpouring of Hofstadter’s many intellectual passions—mathematics, art, music, language, humor, and wordplay, all brought together to address the deep questions of how intelligence, consciousness, and the sense of self-awareness that each human experiences so fundamentally can emerge from the non-intelligent, nonconscious substrate of biological cells. It’s also about how intelligence and self-awareness might eventually be attained by computers. It’s a unique book; I don’t know of any other book remotely like it. It’s not an easy read, and yet it became a bestseller and won both the Pulitzer Prize and the National Book Award. Without a doubt, GEB inspired more young people to pursue AI than any other book. I was one of those young people.  
  
In the early 1980s, after graduating from college with a math degree, I was living in New York City, teaching math in a prep school, unhappy, and casting about for what I really wanted to do in life. I discovered GEB after reading a rave review in Scientific American. I went out and bought the book immediately. Over the next several weeks, I devoured it, becoming increasingly convinced that not only did I want to become an AI researcher but I specifically wanted to work with Douglas Hofstadter. I had never before felt so strongly about a book, or a career choice.  
  
At the time, Hofstadter was a professor in computer science at Indiana University, and my quixotic plan was to apply to the computer science PhD program there, arrive, and then persuade Hofstadter to accept me as a student. One minor problem was that I had never taken even one computer science course. I had grown up with computers; my father was a hardware engineer at a 1960s tech start-up company, and as a hobby he built a mainframe computer in our family’s den. The refrigerator-sized Sigma 2 machine wore a magnetic button proclaiming “I pray in FORTRAN,” and as a child I was half-convinced it did, quietly at night, while the rest of the family was asleep. Growing up in the 1960s and ’70s, I learned a bit of each of the popular languages of the day: FORTRAN, then BASIC, then Pascal, but I knew next to nothing about proper programming techniques, not to mention anything else an incoming computer science graduate student needs to know.
  
To speed along my plan, I quit my teaching job at the end of the school year, moved to Boston, and started taking introductory computer science courses to prepare for my new career. A few months into my new life, I was on the campus of the Massachusetts Institute of Technology, waiting for a class to begin, and I caught sight of a poster advertising a lecture by Douglas Hofstadter, to take place in two days on that very campus. I did a double take; I couldn’t believe my good fortune. I went to the lecture, and after a long wait for my turn in a crowd of admirers I managed to speak to Hofstadter. It turned out he was in the middle of a yearlong sabbatical at MIT, after which he was moving from Indiana to the University of Michigan in Ann Arbor.  
  
To make a long story short, after some persistent pursuit on my part, I persuaded Hofstadter to take me on as a research assistant, first for a summer, and then for the next six years as a graduate student, after which I graduated with a doctorate in computer science from Michigan. Hofstadter and I have kept in close touch over the years and have had many discussions about AI. He knew of my interest in Google’s AI research and was nice enough to invite me to accompany him to the Google meeting.  
  
===Chess and the First Seed of Doubt ===
  
The group in the hard-to-locate conference room consisted of about twenty Google engineers (plus Douglas Hofstadter and myself), all of whom were members of various Google AI teams. The meeting started with the usual going around the room and having people introduce themselves. Several noted that their own careers in AI had been spurred by reading GEB at a young age. They were all excited and curious to hear what the legendary Hofstadter would say about AI. Then Hofstadter got up to speak. “I have some remarks about AI research in general, and here at Google in particular.” His voice became passionate. “I am terrified. Terrified.”
  
Hofstadter went on.2 He described how, when he first started working on AI in the 1970s, it was an exciting prospect but seemed so far from being realized that there was no “danger on the horizon, no sense of it actually happening.” Creating machines with humanlike intelligence was a profound intellectual adventure, a long-term research project whose fruition, it had been said, lay at least “one hundred Nobel prizes away.”3 Hofstadter believed AI was possible in principle: “The ‘enemy’ were people like John Searle, Hubert Dreyfus, and other skeptics, who were saying it was impossible. They did not understand that a brain is a hunk of matter that obeys physical law and the computer can simulate anything … the level of neurons, neurotransmitters, et cetera. In theory, it can be done.” Indeed, Hofstadter’s ideas about simulating intelligence at various levels—from neurons to consciousness—were discussed at length in GEB and had been the focus of his own research for decades. But in practice, until recently, it seemed to Hofstadter that general “human-level” AI had no chance of occurring in his (or even his children’s) lifetime, so he didn’t worry much about it.
  
Near the end of GEB, Hofstadter had listed “Ten Questions and Speculations” about artificial intelligence. Here’s one of them: “Will there be chess programs that can beat anyone?” Hofstadter’s speculation was “no.” “There may be programs which can beat anyone at chess, but they will not be exclusively chess players. They will be programs of general intelligence.”4
  
At the Google meeting in 2014, Hofstadter admitted that he had been “dead wrong.” The rapid improvement in chess programs in the 1980s and ’90s had sown the first seed of doubt in his appraisal of AI’s short-term prospects. Although the AI pioneer Herbert Simon had predicted in 1957 that a chess program would be world champion “within 10 years,” by the mid-1970s, when Hofstadter was writing GEB, the best computer chess programs played only at the level of a good (but not great) amateur. Hofstadter had befriended Eliot Hearst, a chess champion and psychology professor who had written extensively on how human chess experts differ from computer chess programs. Experiments showed that expert human players rely on quick recognition of patterns on the chessboard to decide on a move rather than the extensive brute-force look-ahead search that all chess programs use. During a game, the best human players can perceive a configuration of pieces as a particular “kind of position” that requires a certain “kind of strategy.” That is, these players can quickly recognize particular configurations and strategies as instances of higher-level concepts. Hearst argued that without such a general ability to perceive patterns and recognize abstract concepts, chess programs would never reach the level of the best humans. Hofstadter was persuaded by Hearst’s arguments.
  
However, in the 1980s and ’90s, computer chess saw a big jump in improvement, mostly due to the steep increase in computer speed. The best programs still played in a very unhuman way: performing extensive look-ahead to decide on the next move. By the mid-1990s, IBM’s Deep Blue machine, with specialized hardware for playing chess, had reached the Grandmaster level, and in 1997 the program defeated the reigning world chess champion, Garry Kasparov, in a six-game match. Chess mastery, once seen as a pinnacle of human intelligence, had succumbed to a brute-force approach.
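
To make the idea of brute-force look-ahead concrete, here is a minimal, purely illustrative Python sketch of minimax search: it tries every legal move, recursively explores the opponent’s replies, and keeps whichever move guarantees the best outcome. To stay self-contained it plays the toy game of Nim rather than chess, and the function names are invented for this example; engines such as Deep Blue layered sophisticated evaluation functions, pruning, and special-purpose hardware on top of the same basic search.

<syntaxhighlight lang="python">
# A toy minimax search: brute-force look-ahead over every legal move.
# Illustrative only; this is NOT Deep Blue's code. It plays Nim:
# players alternately remove 1-3 stones, and whoever takes the last stone wins.

def minimax(stones, my_turn):
    """Score the position from our point of view: +1 if we win with best play, -1 if we lose."""
    if stones == 0:
        # The previous player took the last stone and won the game.
        return -1 if my_turn else +1
    scores = [minimax(stones - take, not my_turn)
              for take in (1, 2, 3) if take <= stones]
    # Maximize on our turn, minimize on the opponent's.
    return max(scores) if my_turn else min(scores)

def best_move(stones):
    """Look ahead over all legal moves and pick the one with the best guaranteed outcome."""
    return max((take for take in (1, 2, 3) if take <= stones),
               key=lambda take: minimax(stones - take, my_turn=False))

print(best_move(10))  # prints 2: leaving 8 stones puts the opponent in a losing position
</syntaxhighlight>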
  
===Music: The Bastion of Humanity ===
  
Although Deep Blue’s win generated a lot of hand-wringing in the press about the rise of intelligent machines, “true” AI still seemed quite distant. Deep Blue could play chess, but it couldn’t do anything else. Hofstadter had been wrong about chess, but he still stood by the other speculations in GEB, especially the one he had listed first:
  
QUESTION: Will a computer ever write beautiful music?
  
SPECULATION: Yes but not soon.  
  
Hofstadter continued,

:Music is a language of emotions, and until programs have emotions as complex as ours, there is no way a program will write anything beautiful. There can be “forgeries”—shallow imitations of the syntax of earlier music—but despite what one might think at first, there is much more to musical expression than can be captured in syntactic rules.… To think … that we might soon be able to command a preprogrammed mass-produced mail-order twenty-dollar desk-model “music box” to bring forth from its sterile circuitry pieces which Chopin or Bach might have written had they lived longer is a grotesque and shameful misestimation of the depth of the human spirit.5
  
Hofstadter described this speculation as “one of the most important parts of GEB—I would have staked my life on it.”
  
In the mid-1990s, Hofstadter’s confidence in his assessment of AI was again shaken, this time quite profoundly, when he encountered a program written by a musician, David Cope. The program was called Experiments in Musical Intelligence, or EMI (pronounced “Emmy”). Cope, a composer and music professor, had originally developed EMI to aid him in his own composing process by automatically creating pieces in Cope’s specific style. However, EMI became famous for creating pieces in the style of classical composers such as Bach and Chopin. EMI composes by following a large set of rules, developed by Cope, that are meant to capture a general syntax of composition. These rules are applied to copious examples from a particular composer’s opus in order to produce a new piece “in the style” of that composer.
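
To give a flavor of what rule-driven recombination of an existing opus can look like, here is a deliberately tiny Python sketch. It is emphatically not Cope’s actual system or data; EMI’s musical representations and rules are far richer. The toy below simply chops a handful of made-up example “pieces” (lists of note names) into short fragments and stitches them back together under one crude constraint, hinting at how purely syntactic recombination can produce new material “in the style” of its sources.

<syntaxhighlight lang="python">
# A toy recombination sketch, NOT Cope's actual EMI system or data.
# "Pieces" are just lists of note names; the single "rule" is that each new
# fragment must begin on the note the previous fragment ended on.

import random

EXAMPLE_PIECES = [                      # hypothetical stand-ins for a composer's opus
    ["C", "E", "G", "E", "D", "C"],
    ["G", "B", "D", "C", "A", "G"],
    ["E", "G", "C", "B", "A", "G"],
]

def fragments(piece, size=3):
    """Chop a piece into overlapping fragments of `size` consecutive notes."""
    return [piece[i:i + size] for i in range(len(piece) - size + 1)]

def compose(length=12, size=3, seed=0):
    """Stitch fragments from the example pieces into a new note sequence."""
    rng = random.Random(seed)
    pool = [f for piece in EXAMPLE_PIECES for f in fragments(piece, size)]
    result = list(rng.choice(pool))
    while len(result) < length:
        # Prefer fragments that start on the note we just ended on.
        candidates = [f for f in pool if f[0] == result[-1]]
        nxt = rng.choice(candidates) if candidates else rng.choice(pool)
        result.extend(nxt[1:])          # drop the shared boundary note
    return result

print(" ".join(compose()))
</syntaxhighlight>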
  
Back at our Google meeting, Hofstadter spoke with extraordinary emotion about his encounters with EMI:  
  
:I sat down at my piano and I played one of EMI’s mazurkas “in the style of Chopin.” It didn’t sound exactly like Chopin, but it sounded enough like Chopin, and like coherent music, that I just felt deeply troubled.  
  
:Ever since I was a child, music has thrilled me and moved me to the very core. And every piece that I love feels like it’s a direct message from the emotional heart of the human being who composed it. It feels like it is giving me access to their innermost soul. And it feels like there is nothing more human in the world than that expression of music. Nothing. The idea that pattern manipulation of the most superficial sort can yield things that sound as if they are coming from a human being’s heart is very, very troubling. I was just completely thrown by this.
  
Hofstadter then recounted a lecture he gave at the prestigious Eastman School of Music, in Rochester, New York. After describing EMI, Hofstadter had asked the Eastman audience—including several music theory and composition faculty—to guess which of two pieces a pianist played for them was a (little-known) mazurka by Chopin and which had been composed by EMI. As one audience member described later, “The first mazurka had grace and charm, but not ‘true-Chopin’ degrees of invention and large-scale fluidity … The second was clearly the genuine Chopin, with a lyrical melody; large-scale, graceful chromatic modulations; and a natural, balanced form.”6
  
Many of the faculty agreed and, to Hofstadter’s shock, voted EMI for the first piece and “real-Chopin” for the second piece. The correct answers were the reverse.  
  
In the Google conference room, Hofstadter paused, peering into our faces. No one said a word. At last he went on. “I was terrified by EMI. Terrified. I hated it, and was extremely threatened by it. It was threatening to destroy what I most cherished about humanity. I think EMI was the most quintessential example of the fears that I have about artificial intelligence.”  
  
===Google and the Singularity ===
  
Hofstadter then spoke of his deep ambivalence about what Google itself was trying to accomplish in AI—self-driving cars, speech recognition, natural-language understanding, translation between languages, computer-generated art, music composition, and more. Hofstadter’s worries were underlined by Google’s embrace of Ray Kurzweil and his vision of the Singularity, in which AI, empowered by its ability to improve itself and learn on its own, will quickly reach, and then exceed, human-level intelligence. Google, it seemed, was doing everything it could to accelerate that vision. While Hofstadter strongly doubted the premise of the Singularity, he admitted that Kurzweil’s predictions still disturbed him. “I was terrified by the scenarios. Very skeptical, but at the same time, I thought, maybe their timescale is off, but maybe they’re right. We’ll be completely caught off guard. We’ll think nothing is happening and all of a sudden, before we know it, computers will be smarter than us.”
  
If this actually happens, “we will be superseded. We will be relics. We will be left in the dust.  
  
:“Maybe this is going to happen, but I don’t want it to happen soon. I don’t want my children to be left in the dust.”
  
Hofstadter ended his talk with a direct reference to the very Google engineers in that room, all listening intently: “I find it very scary, very troubling, very sad, and I find it terrible, horrifying, bizarre, baffling, bewildering, that people are rushing ahead blindly and deliriously in creating these things.”
  
===Why Is Hofstadter Terrified? ===
  
I looked around the room. The audience appeared mystified, embarrassed even. To these Google AI researchers, none of this was the least bit terrifying. In fact, it was old news. When Deep Blue beat Kasparov, when EMI started composing Chopin-like mazurkas, and when Kurzweil wrote his first book on the Singularity, many of these engineers had been in high school, probably reading GEB and loving it, even though its AI prognostications were a bit out of date. The reason they were working at Google was precisely to make AI happen—not in a hundred years, but now, as soon as possible. They didn’t understand what Hofstadter was so stressed out about.
  
People who work in AI are used to encountering the fears of people outside the field, who have presumably been influenced by the many science fiction movies depicting superintelligent machines that turn evil. AI researchers are also familiar with the worries that increasingly sophisticated AI will replace humans in some jobs, that AI applied to big data sets could subvert privacy and enable subtle discrimination, and that ill-understood AI systems allowed to make autonomous decisions have the potential to cause havoc.  
  
Hofstadter’s terror was in response to something entirely different. It was not about AI becoming too smart, too invasive, too malicious, or even too useful. Instead, he was terrified that intelligence, creativity, emotions, and maybe even consciousness itself would be too easy to produce—that what he valued most in humanity would end up being nothing more than a “bag of tricks,” that a superficial set of brute-force algorithms could explain the human spirit.  
  
As GEB made abundantly clear, Hofstadter firmly believes that the mind and all its characteristics emerge wholly from the physical substrate of the brain and the rest of the body, along with the body’s interaction with the physical world. There is nothing immaterial or incorporeal lurking there. The issue that worries him is really one of complexity. He fears that AI might show us that the human qualities we most value are disappointingly simple to mechanize. As Hofstadter explained to me after the meeting, here referring to Chopin, Bach, and other paragons of humanity, “If such minds of infinite subtlety and complexity and emotional depth could be trivialized by a small chip, it would destroy my sense of what humanity is about.”
  
===I Am Confused ===
  
Following Hofstadter’s remarks, there was a short discussion, in which the nonplussed audience prodded Hofstadter to further explain his fears about AI and about Google in particular. But a communication barrier remained. The meeting continued, with project presentations, group discussion, coffee breaks, the usual—none of it really touching on Hofstadter’s comments. Close to the end of the meeting, Hofstadter asked the participants for their thoughts about the near-term future of AI. Several of the Google researchers predicted that general human-level AI would likely emerge within the next thirty years, in large part due to Google’s own advances on the brain-inspired method of “deep learning.”
  
I left the meeting scratching my head in confusion. I knew that Hofstadter had been troubled by some of Kurzweil’s Singularity writings, but I had never before appreciated the degree of his emotion and anxiety. I also had known that Google was pushing hard on AI research, but I was startled by the optimism several people there expressed about how soon AI would reach a general “human” level. My own view had been that AI had progressed a lot in some narrow areas but was still nowhere close to having the broad, general intelligence of humans, and it would not get there in a century, let alone thirty years. And I had thought that people who believed otherwise were vastly underestimating the complexity of human intelligence. I had read Kurzweil’s books and had found them largely ridiculous. However, listening to all the comments at the meeting, from people I respected and admired, forced me to critically examine my own views. While assuming that these AI researchers underestimated humans, had I in turn underestimated the power and promise of current-day AI?
  
Over the months that followed, I started paying more attention to the discussion surrounding these questions. I started to notice the slew of articles, blog posts, and entire books by prominent people suddenly telling us we should start worrying, right now, about the perils of “superhuman” AI. In 2014, the physicist Stephen Hawking proclaimed, “The development of full artificial intelligence could spell the end of the human race.”7 In the same year, the entrepreneur Elon Musk, founder of the Tesla and SpaceX companies, said that artificial intelligence is probably “our biggest existential threat” and that “with artificial intelligence we are summoning the demon.”8 Microsoft’s cofounder Bill Gates concurred: “I agree with Elon Musk and some others on this and don’t understand why some people are not concerned.”9 The philosopher Nick Bostrom’s book Superintelligence, on the potential dangers of machines becoming smarter than humans, became a surprise bestseller, despite its dry and ponderous style.
  
Other prominent thinkers were pushing back. Yes, they said, we should make sure that AI programs are safe and don’t risk harming humans, but any reports of near-term superhuman AI are greatly exaggerated. The entrepreneur and activist Mitchell Kapor advised, “Human intelligence is a marvelous, subtle, and poorly understood phenomenon. There is no danger of duplicating it anytime soon.”10 The roboticist (and former director of MIT’s AI Lab) Rodney Brooks agreed, stating that we “grossly overestimate the capabilities of machines—those of today and of the next few decades.”11 The psychologist and AI researcher Gary Marcus went so far as to assert that in the quest to create “strong AI”—that is, general human-level AI—“there has been almost no progress.”12
  
I could go on and on with dueling quotations. In short, what I found is that the field of AI is in turmoil. Either a huge amount of progress has been made, or almost none at all. Either we are within spitting distance of “true” AI, or it is centuries away. AI will solve all our problems, put us all out of a job, destroy the human race, or cheapen our humanity. It’s either a noble quest or “summoning the demon.”  
  
===What This Book Is About ===
  
This book arose from my attempt to understand the true state of affairs in artificial intelligence—what computers can do now, and what we can expect from them over the next decades. Hofstadter’s provocative comments at the Google meeting were something of a wake-up call for me, as were the Google researchers’ confident responses about AI’s near-term future. In the chapters that follow, I try to sort out how far artificial intelligence has come, as well as elucidate its disparate—and sometimes conflicting—goals. In doing so, I consider how some of the most prominent AI systems actually work, and investigate how successful they are and where their limitations lie. I look at the extent to which computers can now do things that we believe to require high levels of intelligence—beating humans at the most intellectually demanding games, translating between languages, answering complex questions, navigating vehicles in challenging terrain. And I examine how they fare at the things we take for granted, the everyday tasks we humans perform without conscious thought: recognizing faces and objects in images, understanding spoken language and written text, and using the most basic common sense.  
  
I also try to make sense of the broader questions that have fueled debates about AI since its inception: What do we actually mean by “general human” or even “superhuman” intelligence? Is current AI close to this level, or even on a trajectory to get there? What are the dangers? What aspects of our intelligence do we most cherish, and to what extent would human-level AI challenge how we think about our own humanness? To use Hofstadter’s terms, how terrified should we be?
  
This book is not a general survey or history of artificial intelligence. Rather, it is an in-depth exploration of some of the AI methods that probably affect your life, or will soon, as well as the AI efforts that perhaps go furthest in challenging our sense of human uniqueness. My aim is for you to share in my own exploration and, like me, to come away with a clearer sense of what the field has accomplished and how much further there is to go before our machines can argue for their own humanity.
  
=Part I Background =
  
== 1 - The Roots of Artificial Intelligence ==
  
===Two Months and Ten Men at Dartmouth ===
  
The dream of creating an intelligent machine—one that is as smart as or smarter than humans—is centuries old but became part of modern science with the rise of digital computers. In fact, the ideas that led to the first programmable computers came out of mathematicians’ attempts to understand human thought—particularly logic—as a mechanical process of “symbol manipulation.” Digital computers are essentially symbol manipulators, pushing around combinations of the symbols 0 and 1. To pioneers of computing like Alan Turing and John von Neumann, there were strong analogies between computers and the human brain, and it seemed obvious to them that human intelligence could be replicated in computer programs.  
  
Most people in artificial intelligence trace the field’s official founding to a small workshop in 1956 at Dartmouth College organized by a young mathematician named John McCarthy.  
 
In 1955, McCarthy, aged twenty-eight, joined the mathematics faculty at Dartmouth. As an undergraduate, he had learned a bit about both psychology and the nascent field of “automata theory” (later to become computer science) and had become intrigued with the idea of creating a thinking machine. In graduate school in the mathematics department at Princeton, McCarthy had met a fellow student, Marvin Minsky, who shared his fascination with the potential of intelligent computers. After graduating, McCarthy had short-lived stints at Bell Labs and IBM, where he collaborated, respectively, with Claude Shannon, the inventor of information theory, and Nathaniel Rochester, a pioneering electrical engineer. Once at Dartmouth, McCarthy persuaded Minsky, Shannon, and Rochester to help him organize “a 2 month, 10 man study of artificial intelligence to be carried out during the summer of 1956.”1 The term artificial intelligence was McCarthy’s invention; he wanted to distinguish this field from a related effort called cybernetics.2 McCarthy later admitted that no one really liked the name—after all, the goal was genuine, not “artificial,” intelligence—but “I had to call it something, so I called it ‘Artificial Intelligence.’”3
  
The four organizers submitted a proposal to the Rockefeller Foundation asking for funding for the summer workshop. The proposed study was, they wrote, based on “the conjecture that every aspect of learning or any other feature of intelligence can be in principle so precisely described that a machine can be made to simulate it.”4 The proposal listed a set of topics to be discussed—natural-language processing, neural networks, machine learning, abstract concepts and reasoning, creativity—that have continued to define the field to the present day.
  
Even though the most advanced computers in 1956 were about a million times slower than today’s smartphones, McCarthy and colleagues were optimistic that AI was in close reach: “We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”5
  
Obstacles soon arose that would be familiar to anyone organizing a scientific workshop today. The Rockefeller Foundation came through with only half the requested amount of funding. And it turned out to be harder than McCarthy had thought to persuade the participants to actually come and then stay, not to mention agree on anything. There were lots of interesting discussions but not a lot of coherence. As usual in such meetings, “Everyone had a different idea, a hearty ego, and much enthusiasm for their own plan.”6

However, the Dartmouth summer of AI did produce a few very important outcomes. The field itself was named, and its general goals were outlined. The soon-to-be “big four” pioneers of the field—McCarthy, Minsky, Allen Newell, and Herbert Simon—met and did some planning for the future. And for whatever reason, these four came out of the meeting with tremendous optimism for the field. In the early 1960s, McCarthy founded the Stanford Artificial Intelligence Project, with the “goal of building a fully intelligent machine in a decade.”7 Around the same time, the future Nobel laureate Herbert Simon predicted, “Machines will be capable, within twenty years, of doing any work that a man can do.”8 Soon after, Marvin Minsky, founder of the MIT AI Lab, forecast that “within a generation … the problems of creating ‘artificial intelligence’ will be substantially solved.”9
  
Most people in artificial intelligence trace the field’s official founding to a small workshop in 1956 at Dartmouth College organized by a young mathematician named John McCarthy.
+
===Definitions, and Getting On with It ===
  
In 1955, McCarthy, aged twenty-eight, joined the mathematics faculty at Dartmouth. As an undergraduate, he had learned a bit about both psychology and the nascent field of “automata theory” (later to become computer science) and had become intrigued with the idea of creating a thinking machine. In graduate school in the mathematics department at Princeton, McCarthy had met a fellow student, Marvin Minsky, who shared his fascination with the potential of intelligent computers. After graduating, McCarthy had short-lived stints at Bell Labs and IBM, where he collaborated, respectively, with Claude Shannon, the inventor of information theory, and Nathaniel Rochester, a pioneering electrical engineer. Once at Dartmouth, McCarthy persuaded Minsky, Shannon, and Rochester to help him organize “a 2 month, 10 man study of artificial intelligence to be carried out during the
+
None of these predicted events have yet come to pass. So how far do we remain from the goal of building a “fully intelligent machine”? Would such a machine require us to reverse engineer the human brain in all its complexity, or is there a shortcut, a clever set of yet-unknown algorithms, that can produce what we recognize as full intelligence? What does “full intelligence” even mean?
  
“Define your terms … or we shall never understand one another.”10 This admonition from the eighteenth-century philosopher Voltaire is a challenge for anyone talking about artificial intelligence, because its central notion—intelligence—remains so ill-defined. Marvin Minsky himself coined the phrase “suitcase word”11 for terms like intelligence and its many cousins, such as thinking, cognition, consciousness, and emotion. Each is packed like a suitcase with a jumble of different meanings. Artificial intelligence inherits this packing problem, sporting different meanings in different contexts.
  
Most people would agree that humans are intelligent and specks of dust are not. Likewise, we generally believe that humans are more intelligent than worms. As for human intelligence, IQ is measured on a single scale, but we also talk about the different dimensions of intelligence: emotional, verbal, spatial, logical, artistic, social, and so forth. Thus, intelligence can be binary (something is or is not intelligent), on a continuum (one thing is more intelligent than another thing), or multidimensional (someone can have high verbal intelligence but low emotional intelligence). Indeed, the word intelligence is an over-packed suitcase, zipper on the verge of breaking.
  
For better or worse, the field of AI has largely ignored these various distinctions. Instead, it has focused on two efforts: one scientific and one practical. On the scientific side, AI researchers are investigating the mechanisms of “natural” (that is, biological) intelligence by trying to embed it in computers. On the practical side, AI proponents simply want to create computer programs that perform tasks as well as or better than humans, without worrying about whether these programs are actually thinking in the way humans think. When asked if their motivations are practical or scientific, many AI people joke that it depends on where their funding currently comes from.  
  
In a recent report on the current state of AI, a committee of prominent researchers defined the field as “a branch of computer science that studies the properties of intelligence by synthesizing intelligence.”12 A bit circular, yes. But the same committee also admitted that it’s hard to define the field, and that may be a good thing: “The lack of a precise, universally accepted definition of AI probably has helped the field to grow, blossom, and advance at an ever-accelerating pace.”13 Furthermore, the committee notes, “Practitioners, researchers, and developers of AI are instead guided by a rough sense of direction and an imperative to ‘get on with it.’”
  
===An Anarchy of Methods ===
  
At the 1956 Dartmouth workshop, different participants espoused divergent opinions about the correct approach to take to develop AI. Some people—generally mathematicians—promoted mathematical logic and deductive reasoning as the language of rational thought. Others championed inductive methods in which programs extract statistics from data and use probabilities to deal with uncertainty. Still others believed firmly in taking inspiration from biology and psychology to create brain-like programs. What you may find surprising is that the arguments among proponents of these various approaches persist to this day. And each approach has generated its own panoply of principles and techniques, fortified by specialty conferences and journals, with little communication among the subspecialties. A recent AI survey paper summed it up: “Because we don’t deeply understand intelligence or know how to produce general AI, rather than cutting off any avenues of exploration, to truly make progress we should embrace AI’s ‘anarchy of methods.’”14
   
  
But since the 2010s, one family of AI methods—collectively called deep learning (or deep neural networks)—has risen above the anarchy to become the dominant AI paradigm. In fact, in much of the popular media, the term artificial intelligence itself has come to mean “deep learning.” This is an unfortunate inaccuracy, and I need to clarify the distinction. AI is a field that includes a broad set of approaches, with the goal of creating machines with intelligence. Deep learning is only one such approach. Deep learning is itself one method among many in the field of machine learning, a subfield of AI in which machines “learn” from data or from their own “experiences.” To better understand these various distinctions, it’s important to understand a philosophical split that occurred early in the AI research community: the split between so-called symbolic and subsymbolic AI.
  
===Symbolic AI ===
  
First let’s look at symbolic AI. A symbolic AI program’s knowledge consists of words or phrases (the “symbols”), typically understandable to a human, along with rules by which the program can combine and process these symbols in order to perform its assigned task.  
  
I’ll give you an example. One early AI program was confidently called the General Problem Solver,15 or GPS for short. (Sorry about the confusing acronym; the General Problem Solver predated the Global Positioning System.) GPS could solve problems such as the “Missionaries and Cannibals” puzzle, which you might have tackled yourself as a child. In this well-known conundrum, three missionaries and three cannibals all need to cross a river, but their boat holds only two people. If at any time the (hungry) cannibals outnumber the (tasty-looking) missionaries on one side of the river … well, you probably know what happens. How do all six get across the river intact?
  
The creators of the General Problem Solver, the cognitive scientists Herbert Simon and Allen Newell, had recorded several students “thinking out loud” while solving this and other logic puzzles. Simon and Newell then designed their program to mimic what they believed were the students’ thought processes.
  
I won’t go into the details of how GPS worked, but its symbolic nature can be seen by the way the program’s instructions were encoded. To set up the problem, a human would write code for GPS that looked something like this:
  
:CURRENT STATE:
:LEFT-BANK = [3 MISSIONARIES, 3 CANNIBALS, 1 BOAT]
:RIGHT-BANK = [EMPTY]

:DESIRED STATE:
:LEFT-BANK = [EMPTY]
:RIGHT-BANK = [3 MISSIONARIES, 3 CANNIBALS, 1 BOAT]
  
In English, these lines represent the fact that initially the left bank of the river “contains” three missionaries, three cannibals, and one boat, whereas the right bank doesn’t contain any of these. The desired state represents the goal of the program—get everyone to the right bank of the river.  
  
At each step in its procedure, GPS attempts to change its current state to make it more similar to the desired state. In its code, the program has “operators” (in the form of subprograms) that can transform the current state into a new state and “rules” that encode the constraints of the task. For example, there is an operator that moves some number of missionaries and cannibals from one side of the river to the other:
  
:MOVE (#MISSIONARIES, #CANNIBALS, FROM-SIDE, TO-SIDE)
  
The words inside the parentheses are called arguments, and when the program runs, it replaces these words with numbers or other words. That is, #MISSIONARIES is replaced with the number of missionaries to move, #CANNIBALS with the number of cannibals to move, and FROM-SIDE and TO-SIDE are replaced with “LEFT-BANK” or “RIGHT-BANK,” depending on which riverbank the missionaries and cannibals are to be moved from. Encoded into the program is the knowledge that the boat is moved along with the missionaries and cannibals.
  
Before being able to apply this operator with specific values replacing the arguments, the program must check its encoded rules; for example, the maximum number of people that can move at a time is two, and the operator cannot be used if it will result in cannibals outnumbering missionaries on a riverbank.  
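To make this kind of symbolic encoding more concrete, here is a minimal sketch in Python of how the state, the MOVE operator, and the rule checks described above might be represented. It is only an illustration under my own assumptions about data layout and function names; it is not the original GPS program, which predates Python by decades.

 # Illustrative symbolic state and MOVE operator for Missionaries and Cannibals.
 CURRENT_STATE = {"LEFT-BANK":  {"MISSIONARIES": 3, "CANNIBALS": 3, "BOAT": 1},
                  "RIGHT-BANK": {"MISSIONARIES": 0, "CANNIBALS": 0, "BOAT": 0}}
 DESIRED_STATE = {"LEFT-BANK":  {"MISSIONARIES": 0, "CANNIBALS": 0, "BOAT": 0},
                  "RIGHT-BANK": {"MISSIONARIES": 3, "CANNIBALS": 3, "BOAT": 1}}
 
 def legal(state):
     # Rule: cannibals may never outnumber missionaries on a bank
     # that has at least one missionary on it.
     for bank in state.values():
         if 0 < bank["MISSIONARIES"] < bank["CANNIBALS"]:
             return False
     return True
 
 def move(state, n_missionaries, n_cannibals, from_side, to_side):
     # Rule: the boat carries one or two people and must start on from_side.
     if n_missionaries + n_cannibals not in (1, 2):
         return None
     if state[from_side]["BOAT"] != 1:
         return None
     if (state[from_side]["MISSIONARIES"] < n_missionaries or
             state[from_side]["CANNIBALS"] < n_cannibals):
         return None
     new_state = {side: dict(bank) for side, bank in state.items()}
     new_state[from_side]["MISSIONARIES"] -= n_missionaries
     new_state[from_side]["CANNIBALS"] -= n_cannibals
     new_state[from_side]["BOAT"] = 0
     new_state[to_side]["MISSIONARIES"] += n_missionaries
     new_state[to_side]["CANNIBALS"] += n_cannibals
     new_state[to_side]["BOAT"] = 1        # the boat moves with the passengers
     return new_state if legal(new_state) else None
 
 # Example: ferry one missionary and one cannibal to the right bank.
 next_state = move(CURRENT_STATE, 1, 1, "LEFT-BANK", "RIGHT-BANK")

The point of the sketch is simply that everything such a program “knows” is carried by explicit, human-readable symbols and rules, which is the hallmark of the symbolic approach.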
  
While these symbols represent human-interpretable concepts such as missionaries, cannibals, boat, and left bank, the computer running this program of course has no knowledge of the meaning of these symbols. You could replace all occurrences of “MISSIONARIES” with “Z372B” or any other nonsense string, and the program would work in exactly the same way. This is part of what the term General refers to in General Problem Solver. To the computer, the “meaning” of the symbols derives from the ways in which they can be combined, related to one another, and operated on.
  
Advocates of the symbolic approach to AI argued that to attain intelligence in computers, it would not be necessary to build programs that mimic the brain. Instead, the argument goes, general intelligence can be captured entirely by the right kind of symbol-processing program. Agreed, the workings of such a program would be vastly more complex than the Missionaries and Cannibals example, but it would still consist of symbols, combinations of symbols, and rules and operations on symbols. Symbolic AI of the kind illustrated by GPS ended up dominating the field for its first three decades, most notably in the form of expert systems, in which human experts devised rules for computer programs to use in tasks such as medical diagnosis and legal decision-making. There are several active branches of AI that still employ symbolic AI; I’ll describe examples of it later, particularly in discussions of AI approaches to reasoning and common sense.  
  
===Subsymbolic AI: Perceptrons ===
  
Symbolic AI was originally inspired by mathematical logic as well as by the way people described their conscious thought processes. In contrast, subsymbolic approaches to AI took inspiration from neuroscience and sought to capture the sometimes-unconscious thought processes underlying what some have called fast perception, such as recognizing faces or identifying spoken words. Subsymbolic AI programs do not contain the kind of human-understandable language we saw in the Missionaries and Cannibals example above. Instead, a subsymbolic program is essentially a stack of equations—a thicket of often hard-to-interpret operations on numbers. As I’ll explain shortly, such systems are designed to learn from data how to perform a task.
  
An early example of a subsymbolic, brain-inspired AI program was the perceptron, invented in the late 1950s by the psychologist Frank Rosenblatt.16 The term perceptron may sound a bit 1950s science-fiction-y to our modern ears (as we’ll see, it was soon followed by the “cognitron” and the “neocognitron”), but the perceptron was an important milestone in AI and was the influential great-grandparent of modern AI’s most successful tool, deep neural networks.
  
Rosenblatt’s invention of perceptrons was inspired by the way in which neurons process information. A neuron is a cell in the brain that receives electrical or chemical input from other neurons that connect to it. Roughly speaking, a neuron sums up all the inputs it receives from other neurons, and if the total sum reaches a certain threshold level, the neuron fires. Importantly, different connections (synapses) from other neurons to a given neuron have different strengths; in calculating the sum of its inputs, the given neuron gives more weight to inputs from stronger connections than inputs from weaker connections. Neuroscientists believe that adjustments to the strengths of connections between neurons are a key part of how learning takes place in the brain.
  
FIGURE 1: A, a neuron in the brain; B, a simple perceptron
  
To a computer scientist (or, in Rosenblatt’s case, a psychologist), information processing in neurons can be simulated by a computer program—a perceptron—that has multiple numerical inputs and one output. The analogy between a neuron and a perceptron is illustrated in figure 1. Figure 1A shows a neuron, with its branching dendrites (fibers that carry inputs to the cell), cell body, and axon (that is, output channel) labeled. Figure 1B shows a simple perceptron. Analogous to the neuron, the perceptron adds up its inputs, and if the resulting sum is equal to or greater than the perceptron’s threshold, the perceptron outputs the value 1 (it “fires”); otherwise it outputs the value 0 (it “does not fire”). To simulate the different strengths of connections to a neuron, Rosenblatt proposed that a numerical weight be assigned to each of a perceptron’s inputs; each input is multiplied by its weight before being added to the sum. A perceptron’s threshold is simply a number set by the programmer (or, as we’ll see, learned by the perceptron itself).
  
In short, a perceptron is a simple program that makes a yes-or-no (1 or 0) decision based on whether the sum of its weighted inputs meets a threshold value. You probably make some decisions like this in your life. For example, you might get input from several friends on how much they liked a particular movie, but you trust some of those friends’ taste in movies more than others. If the total amount of “friend enthusiasm”—giving more weight to your more trusted friends—is high enough (that is, greater than some unconscious threshold), you decide to go to the movie. This is how a perceptron would decide about movies, if only it had friends.
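As a concrete illustration of that firing rule, here is a minimal Python sketch that multiplies each input by its weight, sums the results, and compares the sum with the threshold, exactly as described above. The particular numbers (the friends’ enthusiasm, the trust weights, and the threshold) are made up for illustration.

 # Minimal perceptron "firing rule": weighted sum compared with a threshold.
 def perceptron_output(inputs, weights, threshold):
     total = sum(x * w for x, w in zip(inputs, weights))
     return 1 if total >= threshold else 0   # 1 = "fires", 0 = "does not fire"
 
 # Toy example: three friends' movie enthusiasm (inputs) and how much
 # you trust each friend's taste (weights); all numbers are arbitrary.
 friend_enthusiasm = [0.9, 0.4, 0.7]
 trust_weights = [0.8, 0.3, 0.5]
 print(perceptron_output(friend_enthusiasm, trust_weights, threshold=1.0))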
  
FIGURE 2: Examples of handwritten digits
  
Inspired by networks of neurons in the brain, Rosenblatt proposed that networks of perceptrons could perform visual tasks such as recognizing faces and objects. To get a flavor of how that might work, let’s explore how a perceptron might be used for a particular visual task: recognizing handwritten digits like those in figure 2.
  
In particular, let’s design a perceptron to be an 8 detector—that is, to output a 1 if its inputs are from an image depicting an 8, and to output a 0 if the image depicts some other digit. Designing such a detector requires us to (1) figure out how to turn an image into a set of numerical inputs, and (2) determine numbers to use for the perceptron’s weights and threshold, so that it will give the correct output (1 for 8s, 0 for other digits). I’ll go into some detail here because many of the same ideas will arise later in my discussions of neural networks and their applications in computer vision.
  
===Our Perceptron’s Inputs ===
  
Figure 3A shows an enlarged handwritten 8. Each grid square is a pixel with a numerical “intensity” value: white squares have an intensity of 0, black squares have an intensity of 1, and gray squares are in between. Let’s assume that the images we give to our perceptron have been adjusted to be the same size as this one: 18 × 18 pixels. Figure 3B illustrates a perceptron for recognizing 8s. This perceptron has 324 (that is, 18 × 18) inputs, each of which corresponds to one of the pixels in the 18 × 18 grid. Given an image like the one in figure 3A, each of the perceptron’s inputs is set to the corresponding pixel’s intensity. Each of the inputs would have its own weight value (not shown in the figure).
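For readers who like to see the bookkeeping, here is a small sketch, under the same assumptions as the text (an 18 × 18 grid of intensities between 0 for white and 1 for black), of how such an image could be flattened into the 324 numerical inputs. The random stand-in image, the random initial weights, and the placeholder threshold are illustrative assumptions, not data from the book.

 import random
 
 # Stand-in 18 x 18 image of pixel intensities (0 = white, 1 = black).
 image = [[random.random() for _ in range(18)] for _ in range(18)]
 inputs = [pixel for row in image for pixel in row]   # one input per pixel
 assert len(inputs) == 18 * 18 == 324
 
 # Each of the 324 inputs gets its own weight; here they start out random.
 weights = [random.uniform(-1, 1) for _ in range(324)]
 threshold = 0.0                                      # placeholder value
 output = 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0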
  
FIGURE 3: An illustration of a perceptron that recognizes handwritten 8s. Each pixel in the 18 × 18–pixel image corresponds to an input for the perceptron, yielding 324 (= 18 × 18) inputs.
  
===Learning the Perceptron’s Weights and Threshold ===
  
Unlike the symbolic General Problem Solver system that I described earlier, a perceptron doesn’t have any explicit rules for performing its task; all of its “knowledge” is encoded in the numbers making up its weights and threshold. In his various papers, Rosenblatt showed that given the correct weight and threshold values, a perceptron like the one in figure 3B can perform fairly well on perceptual tasks such as recognizing simple handwritten digits. But how, exactly, can we determine the correct weights and threshold for a given task? Again, Rosenblatt proposed a brain-inspired answer: the perceptron should learn these values on its own. And how is it supposed to learn the correct values? Like the behavioral psychology theories popular at the time, Rosenblatt’s idea was that perceptrons should learn via conditioning. Inspired in part by the behaviorist psychologist B. F. Skinner, who trained rats and pigeons to perform tasks by giving them positive and negative reinforcement, Rosenblatt proposed that the perceptron should similarly be trained on examples: it should be rewarded when it fires correctly and punished when it errs. This form of conditioning is now known in AI as supervised learning. During training, the learning system is given an example, it produces an output, and it is then given a “supervision signal,” which tells how much the system’s output differs from the correct output. The system then uses this signal to adjust its weights and threshold.
  
The concept of supervised learning is a key part of modern AI, so it’s worth discussing in more detail. Supervised learning typically requires a large set of positive examples (for instance, a collection of 8s written by different people) and negative examples (for instance, a collection of other handwritten digits, not including 8s). Each example is labeled by a human with its category—here, 8 or not-8. This label will be used as the supervision signal. Some of the positive and negative examples are used to train the system; these are called the training set. The remainder—the test set—is used to evaluate the system’s performance after it has been trained, to see how well it has learned to answer correctly in general, not just on the training examples.  
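A minimal sketch of this bookkeeping might look like the following, with made-up stand-in data, labels 1 and 0 standing for “8” and “not-8,” and an assumed 80/20 split between training and test sets (the split ratio is my choice, not the book’s).

 import random
 
 # Stand-in dataset: (inputs, label) pairs, where inputs is a list of 324 pixel
 # intensities and label is 1 for "8" and 0 for "not-8" (all values made up).
 examples = [([random.random() for _ in range(324)], random.choice([0, 1]))
             for _ in range(1000)]
 
 random.shuffle(examples)
 split = int(0.8 * len(examples))   # assumed 80% training / 20% test
 training_set = examples[:split]
 test_set = examples[split:]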
  
Perhaps the most important term in computer science is algorithm, which refers to a “recipe” of steps a computer can take in order to solve a particular problem. Frank Rosenblatt’s primary contribution to AI was his design of a specific algorithm, called the perceptron-learning algorithm, by which a perceptron could be trained from examples to determine the weights and threshold that would produce correct answers. Here’s how it works: Initially, the weights and threshold are set to random values between −1 and 1. In our example, the weight on the first input might be set to 0.2, the weight on the second input set to −0.6, and so on, and the threshold set to 0.7. A computer program called a random-number generator can easily generate these initial values.  
  
Now we can start the training process. The first training example is given to the perceptron; at this point, the perceptron doesn’t see the correct category label. The perceptron multiplies each input by its weight, sums up all the results, compares the sum with the threshold, and outputs either 1 or 0. Here, the output 1 means a guess of 8, and the output 0 means a guess of not-8. Now, the training process compares the perceptron’s output with the correct answer given by the human-provided label (that is, 8 or not-8). If the perceptron is correct, the weights and threshold don’t change. But if the perceptron is wrong, the weights and threshold are changed a little bit, making the perceptron’s sum on this training example closer to producing the right answer. Moreover, the amount each weight is changed depends on its associated input value; that is, the blame for the error is meted out depending on which inputs had the most impact. For example, in the 8 of figure 3A, the higher-intensity (here, black) pixels would have the most impact, and the pixels with 0 intensity (here, white) would have no impact. (For interested readers, I have included some mathematical details in the notes.17)
  
The whole process is repeated for the next training example. The training process goes through all the training examples multiple times, modifying the weights and threshold a little bit each time the perceptron makes an error. Just as the psychologist B. F. Skinner found when training pigeons, it’s better to learn gradually over many trials; if the weights and threshold are changed too much on any one trial, then the system might end up learning the wrong thing (such as an overgeneralization that “the bottom and top halves of an 8 are always equal in size”). After many repetitions on each training example, the system eventually (we hope) settles on a set of weights and a threshold that result in correct answers for all the training examples. At that point, we can evaluate the perceptron on the test examples to see how it performs on images it hasn’t been trained on.  
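Putting the last few paragraphs together, here is a compact sketch of a perceptron-learning loop of the kind described: random starting values, and a small correction to the weights and threshold whenever the perceptron answers incorrectly, with each weight’s correction scaled by its input. The learning rate and the number of passes are illustrative choices of mine, not values given in the text.

 import random
 
 def train_perceptron(training_set, num_inputs, learning_rate=0.1, passes=50):
     # Start with random weights and threshold between -1 and 1.
     weights = [random.uniform(-1, 1) for _ in range(num_inputs)]
     threshold = random.uniform(-1, 1)
     for _ in range(passes):                    # repeat over the training set
         for inputs, label in training_set:     # label: 1 for "8", 0 for "not-8"
             total = sum(x * w for x, w in zip(inputs, weights))
             output = 1 if total >= threshold else 0
             error = label - output             # 0 if correct, +1 or -1 if wrong
             if error != 0:
                 # Nudge each weight in proportion to its input's intensity;
                 # nudge the threshold in the opposite direction.
                 weights = [w + learning_rate * error * x
                            for w, x in zip(weights, inputs)]
                 threshold -= learning_rate * error
     return weights, threshold
 
 # Usage (with a labeled training set of 324-pixel images):
 # weights, threshold = train_perceptron(training_set, num_inputs=324)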
  
An 8 detector is useful if you care only about 8s. But what about recognizing other digits? It’s fairly straightforward to extend our perceptron to have ten outputs, one for each digit. Given an example handwritten digit, the output corresponding to that digit should be 1, and all the other outputs should be 0. This extended perceptron can learn all of its weights and thresholds using the perceptron-learning algorithm; the system just needs enough examples.  
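A sketch of that ten-output arrangement, assuming one set of 324 weights and one threshold per digit (my assumption about how to organize the numbers), could look like this; in practice each digit’s weights and threshold would be learned with the same perceptron-learning procedure sketched above rather than left random.

 import random
 
 weights_per_digit = {d: [random.uniform(-1, 1) for _ in range(324)] for d in range(10)}
 threshold_per_digit = {d: random.uniform(-1, 1) for d in range(10)}
 
 def digit_outputs(pixels):
     # One output per digit: 1 if that digit's weighted sum reaches its threshold.
     outputs = []
     for d in range(10):
         total = sum(x * w for x, w in zip(pixels, weights_per_digit[d]))
         outputs.append(1 if total >= threshold_per_digit[d] else 0)
     return outputs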
  
Rosenblatt and others showed that networks of perceptrons could learn to perform relatively simple perceptual tasks; moreover, Rosenblatt proved mathematically that for a certain, albeit very limited, class of tasks, perceptrons with sufficient training could, in principle, learn to perform these tasks without error. What wasn’t clear was how well perceptrons could perform on more general AI tasks. This uncertainty didn’t seem to stop Rosenblatt and his funders at the Office of Naval Research from making ridiculously optimistic predictions about their algorithm. Reporting on a press conference Rosenblatt held in July 1958, The New York Times featured this recap:
  
:The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself, and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language, it was predicted.18
  
Yes, even at its beginning, AI suffered from a hype problem. I’ll talk more about the unhappy results of such hype shortly. But for now, I want to use perceptrons to highlight a major difference between symbolic and subsymbolic approaches to AI.
  
The fact that a perceptron’s “knowledge” consists of a set of numbers—namely, the weights and threshold it has learned—means that it is hard to uncover the rules the perceptron is using in performing its recognition task. The perceptron’s rules are not symbolic; unlike the General Problem Solver’s symbols, such as LEFT-BANK, #MISSIONARIES, and MOVE, a perceptron’s weights and threshold don’t stand for particular concepts. It’s not easy to translate these numbers into rules that are understandable by humans. The situation gets much worse with modern neural networks that have millions of weights.  
  
One might make a rough analogy between perceptrons and the human brain. If I could open up your head and watch some subset of your hundred billion neurons firing, I would likely not get any insight into what you were thinking or the “rules” you used to make a particular decision. However, the human brain has given rise to language, which allows you to use symbols (words and phrases) to tell me—often imperfectly—what your thoughts are about or why you did a certain thing. In this sense, our neural firings can be considered subsymbolic, in that they underlie the symbols our brains somehow create. Perceptrons, as well as more complicated networks of simulated neurons, have been dubbed “subsymbolic” in analogy to the brain. Their advocates believe that to achieve artificial intelligence, language-like symbols and the rules that govern symbol processing cannot be programmed directly, as was done in the General Problem Solver, but must emerge from neural-like architectures similar to the way that intelligent symbol processing emerges from the brain.  
  
===The Limitations of Perceptrons ===
  
After the 1956 Dartmouth meeting, the symbolic camp dominated the AI landscape. In the early 1960s, while Rosenblatt was working avidly on the perceptron, the big four “founders” of AI, all strong devotees of the symbolic camp, had created influential—and well-funded—AI laboratories: Marvin Minsky at MIT, John McCarthy at Stanford, and Herbert Simon and Allen Newell at Carnegie Mellon. (Remarkably, these three universities remain to this day among the most prestigious places to study AI.) Minsky, in particular, felt that Rosenblatt’s brain-inspired approach to AI was a dead end, and moreover was stealing away research dollars from more worthy symbolic AI efforts.19 In 1969, Minsky and his MIT colleague Seymour Papert published a book, Perceptrons,20 in which they gave a mathematical proof showing that the types of problems a perceptron could solve perfectly were very limited and that the perceptron-learning algorithm would not do well in scaling up to tasks requiring a large number of weights and thresholds.
  
Minsky and Papert pointed out that if a perceptron is augmented by adding a “layer” of simulated neurons, the types of problems that the device can solve are, in principle, much broader.21 A perceptron with such an added layer is called a multilayer neural network. Such networks form the foundations of much of modern AI; I’ll describe them in detail in the next chapter. But for now, I’ll note that at the time of Minsky and Papert’s book, multilayer neural networks were not broadly studied, largely because there was no general algorithm, analogous to the perceptron-learning algorithm, for learning weights and thresholds.
  
The limitations Minsky and Papert proved for simple perceptrons were already known to people working in this area.22 Frank Rosenblatt himself had done extensive work on multilayer perceptrons and recognized the difficulty of training them.23 It wasn’t Minsky and Papert’s mathematics that put the final nail in the perceptron’s coffin; rather, it was their speculation on multilayer neural networks:
  
:[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.24
  
Ouch. In today’s vernacular that final sentence might be termed “passive-aggressive.” Such negative speculations were at least part of the reason that funding for neural network research dried up in the late 1960s, at the same time that symbolic AI was flush with government dollars. In 1971, at the age of forty-three, Frank Rosenblatt died in a boating accident. Without its most prominent proponent, and without much government funding, research on perceptrons and other subsymbolic AI methods largely halted, except in a few isolated academic groups.  
  
===AI Winter===
  
In the meantime, proponents of symbolic AI were writing grant proposals promising impending breakthroughs in areas such as speech and language understanding, commonsense reasoning, robot navigation, and autonomous vehicles. By the mid-1970s, while some very narrowly focused expert systems were successfully deployed, the more general AI breakthroughs that had been promised had not materialized.  
  
The funding agencies noticed. Two reports, solicited respectively by the Science Research Council in the U.K. and the Department of Defense in the United States, reported very negatively on the progress and prospects for AI research. The U.K. report in particular acknowledged that there was promise in the area of specialized expert systems—“programs written to perform in highly specialised problem domains, when the programming takes very full account of the results of human experience and human intelligence within the relevant domain”—but concluded that the results to date were “wholly discouraging about general-purpose programs seeking to mimic the problem-solving aspects of human [brain] activity over a rather wide field. Such a general-purpose program, the coveted long-term goal of AI activity, seems as remote as ever.”25 This report led to a sharp decrease in government funding for AI research in the U.K.; similarly, the Department of Defense drastically cut funding for basic AI research in the United States.
  
This was an early example of a repeating cycle of bubbles and crashes in the field of AI. The two-part cycle goes like this. Phase 1: New ideas create a lot of optimism in the research community. Results of imminent AI breakthroughs are promised, and often hyped in the news media. Money pours in from government funders and venture capitalists for both academic research and commercial start-ups. Phase 2: The promised breakthroughs don’t occur, or are much less impressive than promised. Government funding and venture capital dry up. Start-up companies fold, and AI research slows. This pattern became familiar to the AI community: “AI spring,” followed by overpromising and media hype, followed by “AI winter.” This has happened, to various degrees, in cycles of five to ten years. When I got out of graduate school in 1990, the field was in one of its winters and had garnered such a bad image that I was even advised to leave the term “artificial intelligence” off my job applications.  
  
===Easy Things Are Hard ===
  
The cold AI winters taught practitioners some important lessons. The simplest lesson was noted by John McCarthy, fifty years after the Dartmouth conference: “AI was harder than we thought.”26 Marvin Minsky pointed out that in fact AI research had uncovered a paradox: “Easy things are hard.” The original goals of AI—computers that could converse with us in natural language, describe what they saw through their camera eyes, learn new concepts after seeing only a few examples—are things that young children can easily do, but, surprisingly, these “easy things” have turned out to be harder for AI to achieve than diagnosing complex diseases, beating human champions at chess and Go, and solving complex algebraic problems. As Minsky went on, “In general, we’re least aware of what our minds do best.”27 The attempt to create artificial intelligence has, at the very least, helped elucidate how complex and subtle are our own minds.
  
==2 Neural Networks and the Ascent of Machine Learning ==
  
Spoiler alert: Multilayer neural networks—the extension of perceptrons that was dismissed by Minsky and Papert as likely to be “sterile”—have instead turned out to form the foundation of much of modern artificial intelligence. Because they are the basis of several of the methods I’ll describe in later chapters, I’ll take some time here to describe how these networks work.  
  
===Multilayer Neural Networks ===
  
Rosenblatt and others showed that networks of perceptrons could learn to perform relatively simple perceptual tasks; moreover, Rosenblatt proved mathematically that for a certain, albeit very limited, class of tasks, perceptrons with sufficient training could, in principle, learn to perform these tasks without error. What wasn’t clear was how well perceptrons could perform on more general AI tasks. This uncertainty didn’t seem to stop Rosenblatt and his funders at the Office of Naval Research from making ridiculously optimistic predictions about their algorithm. Reporting on a press conference Rosenblatt held in July 1958, The New York Times featured this recap:
+
A network is simply a set of elements that are connected to one another in various ways. We’re all familiar with social networks, in which the elements are people, and computer networks, in which the elements are, naturally, computers. In neural networks, the elements are simulated neurons akin to the perceptrons I described in the previous chapter.  
  
The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself, and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one
+
FIGURE 4: A two-layer neural network for recognizing handwritten digits
  
language to speech and writing in another language, it was predicted.18
+
In figure 4, I’ve sketched a simple multilayer neural network, designed to recognize handwritten digits. The network has two columns (layers) of perceptron-like simulated neurons (circles). For simplicity (and probably to the relief of any neuroscientists reading this), I’ll use the term unit instead of simulated neuron to describe the elements of this network. Like the 8-detecting perceptron from chapter 1, the network in figure 4 has 324 (18 × 18) inputs, each of which is set to the intensity value of the corresponding pixel in the input image. But unlike the perceptron, this network has a layer of three so-called hidden units, along with its layer of ten output units. Each output unit corresponds to one of the possible digit categories.
  
Yes, even at its beginning, AI suffered from a hype problem. I’ll talk more about the unhappy results of such hype shortly. But for now, I want to use perceptrons to highlight a major difference between symbolic and subsymbolic approaches to AI.
+
The large gray arrows signify that each input has a weighted connection to each hidden unit, and each hidden unit has a weighted connection to each output unit. The mysterious-sounding term hidden unit comes from the neural network literature; it simply means a non-output unit. A better name might have been interior unit.  
  
The fact that a perceptron’s “knowledge” consists of a set of numbers—namely, the weights and threshold it has learned—means that it is hard to uncover the rules the perceptron is using in performing its recognition task. The perceptron’s rules are not symbolic; unlike the General Problem Solver’s symbols, such as LEFT-BANK, #MISSIONARIES, and MOVE, a perceptron’s weights and threshold don’t stand for particular concepts. It’s not easy to translate these numbers into rules that are understandable by humans. The situation gets much worse with modern neural networks that have millions of weights.
+
Think of the structure of your brain, in which some neurons directly control “outputs” such as your muscle movements but most neurons simply communicate with other neurons. These could be called the brain’s hidden neurons.  
  
One might make a rough analogy between perceptrons and the human brain. If I could open up your head and watch some subset of your hundred billion neurons firing, I would likely not get any insight into what you were thinking or the “rules” you used to make a particular decision. However, the human brain has given rise to language, which allows you to use symbols (words and phrases) to tell me—often imperfectly—what your thoughts are about or why you did a certain thing. In this sense, our neural firings can be considered subsymbolic, in that they underlie the symbols our brains somehow create. Perceptrons, as well as more complicated networks of simulated neurons, have been dubbed “subsymbolic” in analogy to the brain. Their advocates believe that to achieve artificial intelligence, language-like symbols and the rules that govern symbol processing cannot be programmed directly, as was done in the General Problem Solver, but must emerge from neural-like architectures similar to the way that intelligent symbol processing emerges from the brain.
+
The network shown in figure 4 is referred to as “multilayered” because it has two layers of units (hidden and output) instead of just an output layer. In principle, a multilayer network can have multiple layers of hidden units; networks that have more than one layer of hidden units are called deep networks. The “depth” of a network is simply its number of hidden layers. I’ll have much more to say about deep networks in upcoming chapters.  
  
===The Limitations of Perceptrons ===

After the 1956 Dartmouth meeting, the symbolic camp dominated the AI landscape. In the early 1960s, while Rosenblatt was working avidly on the perceptron, the big four “founders” of AI, all strong devotees of the symbolic camp, had created influential—and well-funded—AI laboratories: Marvin Minsky at MIT, John McCarthy at Stanford, and Herbert Simon and Allen Newell at Carnegie Mellon. (Remarkably, these three universities remain to this day among the most prestigious places to study AI.) Minsky, in particular, felt that Rosenblatt’s brain-inspired approach to AI was a dead end, and moreover was stealing away research dollars from more worthy symbolic AI efforts.19 In 1969, Minsky and his MIT colleague Seymour Papert published a book, Perceptrons,20 in which they gave a mathematical proof showing that the types of problems a perceptron could solve perfectly were very limited and that the perceptron-learning algorithm would not do well in scaling up to tasks requiring a large number of weights and thresholds.

Minsky and Papert pointed out that if a perceptron is augmented by adding a “layer” of simulated neurons, the types of problems that the device can solve is, in principle, much broader.21 A perceptron with such an added layer is called a multilayer neural network. Such networks form the foundations of much of modern AI; I’ll describe them in detail in the next chapter. But for now, I’ll note that at the time of Minsky and Papert’s book, multilayer neural networks were not broadly studied, largely because there was no general algorithm, analogous to the perceptron-learning algorithm, for learning weights and thresholds.

The limitations Minsky and Papert proved for simple perceptrons were already known to people working in this area.22 Frank Rosenblatt himself had done extensive work on multilayer perceptrons and recognized the difficulty of training them.23 It wasn’t Minsky and Papert’s mathematics that put the final nail in the perceptron’s coffin; rather, it was their speculation on multilayer neural networks:

:[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.24

Ouch. In today’s vernacular that final sentence might be termed “passive-aggressive.” Such negative speculations were at least part of the reason that funding for neural network research dried up in the late 1960s, at the same time that symbolic AI was flush with government dollars. In 1971, at the age of forty-three, Frank Rosenblatt died in a boating accident. Without its most prominent proponent, and without much government funding, research on perceptrons and other subsymbolic AI methods largely halted, except in a few isolated academic groups.
 
===AI Winter ===

In the meantime, proponents of symbolic AI were writing grant proposals promising impending breakthroughs in areas such as speech and language understanding, commonsense reasoning, robot navigation, and autonomous vehicles. By the mid-1970s, while some very narrowly focused expert systems were successfully deployed, the more general AI breakthroughs that had been promised had not materialized.

The funding agencies noticed. Two reports, solicited respectively by the Science Research Council in the U.K. and the Department of Defense in the United States, reported very negatively on the progress and prospects for AI research. The U.K. report in particular acknowledged that there was promise in the area of specialized expert systems—“programs written to perform in highly specialised problem domains, when the programming takes very full account of the results of human experience and human intelligence within the relevant domain”—but concluded that the results to date were “wholly discouraging about general-purpose programs seeking to mimic the problem-solving aspects of human [brain] activity over a rather wide field. Such a general-purpose program, the coveted long-term goal of AI activity, seems as remote as ever.”25 This report led to a sharp decrease in government funding for AI research in the U.K.; similarly, the Department of Defense drastically cut funding for basic AI research in the United States.

This was an early example of a repeating cycle of bubbles and crashes in the field of AI. The two-part cycle goes like this. Phase 1: New ideas create a lot of optimism in the research community. Results of imminent AI breakthroughs are promised, and often hyped in the news media. Money pours in from government funders and venture capitalists for both academic research and commercial start-ups. Phase 2: The promised breakthroughs don’t occur, or are much less impressive than promised. Government funding and venture capital dry up. Start-up companies fold, and AI research slows. This pattern became familiar to the AI community: “AI spring,” followed by overpromising and media hype, followed by “AI winter.” This has happened, to various degrees, in cycles of five to ten years. When I got out of graduate school in 1990, the field was in one of its winters and had garnered such a bad image that I was even advised to leave the term “artificial intelligence” off my job applications.
 
===Easy Things Are Hard ===

The cold AI winters taught practitioners some important lessons. The simplest lesson was noted by John McCarthy, fifty years after the Dartmouth conference: “AI was harder than we thought.”26 Marvin Minsky pointed out that in fact AI research had uncovered a paradox: “Easy things are hard.” The original goals of AI—computers that could converse with us in natural language, describe what they saw through their camera eyes, learn new concepts after seeing only a few examples—are things that young children can easily do, but, surprisingly, these “easy things” have turned out to be harder for AI to achieve than diagnosing complex diseases, beating human champions at chess and Go, and solving complex algebraic problems. As Minsky went on, “In general, we’re least aware of what our minds do best.”27 The attempt to create artificial intelligence has, at the very least, helped elucidate how complex and subtle are our own minds.

==2 Neural Networks and the Ascent of Machine Learning ==

Spoiler alert: Multilayer neural networks—the extension of perceptrons that was dismissed by Minsky and Papert as likely to be “sterile”—have instead turned out to form the foundation of much of modern artificial intelligence. Because they are the basis of several of the methods I’ll describe in later chapters, I’ll take some time here to describe how these networks work.

===Multilayer Neural Networks ===

A network is simply a set of elements that are connected to one another in various ways. We’re all familiar with social networks, in which the elements are people, and computer networks, in which the elements are, naturally, computers. In neural networks, the elements are simulated neurons akin to the perceptrons I described in the previous chapter.

FIGURE 4: A two-layer neural network for recognizing handwritten digits

In figure 4, I’ve sketched a simple multilayer neural network, designed to recognize handwritten digits. The network has two columns (layers) of perceptron-like simulated neurons (circles). For simplicity (and probably to the relief of any neuroscientists reading this), I’ll use the term unit instead of simulated neuron to describe the elements of this network. Like the 8-detecting perceptron from chapter 1, the network in figure 4 has 324 (18 × 18) inputs, each of which is set to the intensity value of the corresponding pixel in the input image. But unlike the perceptron, this network has a layer of three so-called hidden units, along with its layer of ten output units. Each output unit corresponds to one of the possible digit categories.

The large gray arrows signify that each input has a weighted connection to each hidden unit, and each hidden unit has a weighted connection to each output unit. The mysterious-sounding term hidden unit comes from the neural network literature; it simply means a non-output unit. A better name might have been interior unit.

Think of the structure of your brain, in which some neurons directly control “outputs” such as your muscle movements but most neurons simply communicate with other neurons. These could be called the brain’s hidden neurons.

The network shown in figure 4 is referred to as “multilayered” because it has two layers of units (hidden and output) instead of just an output layer. In principle, a multilayer network can have multiple layers of hidden units; networks that have more than one layer of hidden units are called deep networks. The “depth” of a network is simply its number of hidden layers. I’ll have much more to say about deep networks in upcoming chapters.

Similar to perceptrons, each unit here multiplies each of its inputs by the weight on that input’s connection and then sums the results. However, unlike in a perceptron, a unit here doesn’t simply “fire” or “not fire” (that is, produce 1 or 0) based on a threshold; instead, each unit uses its sum to compute a number between 0 and 1 that is called the unit’s “activation.” If the sum that a unit computes is low, the unit’s activation is close to 0; if the sum is high, the activation is close to 1. (For interested readers, I’ve included some of the mathematical details in the notes.1)

To process an image such as the handwritten 8 in figure 4, the network performs its computations layer by layer, from left to right. Each hidden unit computes its activation value; these activation values then become the inputs for the output units, which then compute their own activations. In the network of figure 4, the activation of an output unit can be thought of as the network’s confidence that it is “seeing” the corresponding digit; the digit category with the highest confidence can be taken as the network’s answer—its classification.
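
As a concrete illustration of this layer-by-layer computation, here is a minimal sketch of a network like the one in figure 4, assuming the 324-input, three-hidden-unit, ten-output architecture described above and a standard sigmoid squashing function for the activations (the text leaves the exact function to the notes). The weights here are random placeholders; in a real network they would be learned. All names are hypothetical.

<syntaxhighlight lang="python">
import math
import random

NUM_INPUTS, NUM_HIDDEN, NUM_OUTPUTS = 324, 3, 10

def sigmoid(total):
    """Squash a weighted sum into an activation between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-total))

# Placeholder weights; a trained network would have learned these values.
hidden_weights = [[random.uniform(-0.5, 0.5) for _ in range(NUM_INPUTS)]
                  for _ in range(NUM_HIDDEN)]
output_weights = [[random.uniform(-0.5, 0.5) for _ in range(NUM_HIDDEN)]
                  for _ in range(NUM_OUTPUTS)]

def classify(pixels):
    """Process an image layer by layer and return the digit with the highest confidence."""
    # Each hidden unit: weighted sum of all 324 pixel intensities, then squash.
    hidden_acts = [sigmoid(sum(w * x for w, x in zip(ws, pixels)))
                   for ws in hidden_weights]
    # Each output unit: weighted sum of the hidden activations, then squash.
    output_acts = [sigmoid(sum(w * h for w, h in zip(ws, hidden_acts)))
                   for ws in output_weights]
    # The network's answer is the digit category with the highest activation.
    return max(range(NUM_OUTPUTS), key=lambda digit: output_acts[digit])
</syntaxhighlight>

Calling classify on a list of 324 intensities returns the digit whose output unit is most confident, which is the network’s classification in the sense described above.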
  
In principle, a multilayer neural network can learn to use its hidden units to recognize more abstract features (for example, visual shapes, such as the top and bottom “circles” on a handwritten 8) than the simple features (for example, pixels) encoded by the input. In general, it’s hard to know ahead of time how many layers of hidden units are needed, or how many hidden units should be included in a layer, for a network to perform well on a given task. Most neural network researchers use a form of trial and error to find the best settings.

===Learning via Back-Propagation ===

In their book Perceptrons, Minsky and Papert were skeptical that a successful algorithm could be designed for learning the weights in a multilayer neural network. Their skepticism (along with doubts from others in the symbolic AI community) was largely responsible for the sharp decrease in funding for neural network research in the 1970s. But despite the chilling effect of Minsky and Papert’s book on the field, a small core of neural network researchers persisted, especially in Frank Rosenblatt’s own field of cognitive psychology. And by the late 1970s and early ’80s, several of these groups had definitively rebutted Minsky and Papert’s speculations on the “sterility” of multilayer neural networks by developing a general learning algorithm—called back-propagation—for training these networks.

As its name implies, back-propagation is a way to take an error observed at the output units (for example, a high confidence for the wrong digit in the example of figure 4) and to “propagate” the blame for that error backward (in figure 4, this would be from right to left) so as to assign proper blame to each of the weights in the network. This allows back-propagation to determine how much to change each weight in order to reduce the error. Learning in neural networks simply consists in gradually modifying the weights on connections so that each output’s error gets as close to 0 as possible on all training examples. While the mathematics of back-propagation is beyond the scope of my discussion here, I’ve included some details in the notes.2
 
Back-propagation will work (in principle at least) no matter how many inputs, hidden units, or output units your neural network has. While there is no mathematical guarantee that back-propagation will settle on the correct weights for a network, in practice it has worked very well on many tasks that are too hard for simple perceptrons. For example, I trained both a perceptron and a two-layer neural network, each with 324 inputs and 10 outputs, on the handwritten-digit-recognition task, using sixty thousand examples, and then tested how well each was able to recognize ten thousand new examples. The perceptron was correct on about 80 percent of the new examples, whereas the neural network, with 50 hidden units, was correct on a whopping 94 percent of those new examples. Kudos to the hidden units! But what exactly has the neural network learned that allowed it to soar past the perceptron? I don’t know. It’s possible that I could find a way to visualize the neural network’s 16,700 weights3 to get some insight into its performance, but I haven’t done so, and in general it’s not at all easy to understand how these networks make their decisions.
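
The following sketch shows what a single back-propagation step could look like for a network of the size just described (324 inputs, 50 hidden units, 10 outputs, which is where the figure of 16,700 weights comes from). It is an illustration under simplifying assumptions rather than the experiment from the text: it uses a sigmoid activation and a squared-error measure, omits threshold (bias) terms, and the learning rate is an arbitrary choice.

<syntaxhighlight lang="python">
import math
import random

NUM_INPUTS, NUM_HIDDEN, NUM_OUTPUTS = 324, 50, 10   # 324*50 + 50*10 = 16,700 weights
LEARNING_RATE = 0.1

def sigmoid(total):
    return 1.0 / (1.0 + math.exp(-total))

hidden_weights = [[random.uniform(-0.1, 0.1) for _ in range(NUM_INPUTS)]
                  for _ in range(NUM_HIDDEN)]
output_weights = [[random.uniform(-0.1, 0.1) for _ in range(NUM_HIDDEN)]
                  for _ in range(NUM_OUTPUTS)]

def train_on_example(pixels, target_digit):
    """One back-propagation step: a forward pass, then blame propagated right to left."""
    # Forward pass, layer by layer.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, pixels))) for ws in hidden_weights]
    outputs = [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in output_weights]
    targets = [1.0 if d == target_digit else 0.0 for d in range(NUM_OUTPUTS)]

    # Blame ("delta") for each output unit: its error scaled by the sigmoid's slope.
    out_deltas = [(t - o) * o * (1 - o) for t, o in zip(targets, outputs)]

    # Blame for each hidden unit: its share of every output unit's blame,
    # weighted by the connection it feeds into.
    hid_deltas = [h * (1 - h) * sum(out_deltas[k] * output_weights[k][j]
                                    for k in range(NUM_OUTPUTS))
                  for j, h in enumerate(hidden)]

    # Nudge every weight in the direction that reduces the error.
    for k in range(NUM_OUTPUTS):
        for j in range(NUM_HIDDEN):
            output_weights[k][j] += LEARNING_RATE * out_deltas[k] * hidden[j]
    for j in range(NUM_HIDDEN):
        for i in range(NUM_INPUTS):
            hidden_weights[j][i] += LEARNING_RATE * hid_deltas[j] * pixels[i]
</syntaxhighlight>

Each weight is adjusted in proportion to the blame assigned to the unit it feeds into, which is the right-to-left blame assignment described above; repeating this step over many examples gradually reduces the errors at the output units.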
  
It’s important to note that while I’ve used the example of handwritten digits, neural networks can be applied not just to images but to any kind of data. Neural networks have been applied in areas as diverse as speech recognition, stock-market prediction, language translation, and music composition.  
  
===Connectionism ===
  
In the 1980s, the most visible group working on neural networks was a team at the University of California at San Diego headed by two psychologists, David Rumelhart and James McClelland. What we now call neural networks were then generally referred to as connectionist networks, where the term connectionist refers to the idea that knowledge in these networks resides in weighted connections between units. The team led by Rumelhart and McClelland is known for writing the so-called bible of connectionism—a two-volume treatise, published in 1986, called Parallel Distributed Processing. In the midst of an AI landscape dominated by symbolic AI, the book was a pep talk for the subsymbolic approach, arguing that “people are smarter than today’s computers because the brain employs a basic computational architecture that is more suited to … the natural information-processing tasks that people are so good at,” for example, “perceiving objects in natural scenes and noting their relations,… understanding language, and retrieving contextually appropriate information from memory.”4 The authors speculated that “symbolic systems such as those favored by Minsky and Papert”5 would not be able to capture these humanlike abilities.
  
Indeed, by the mid-1980s, expert systems—symbolic AI approaches that rely on humans to create rules that reflect expert knowledge of a particular domain—were increasingly revealing themselves to be brittle: that is, error-prone and often unable to generalize or adapt when presented with new situations. In analyzing the limitations of these systems, researchers were discovering how much the human experts writing the rules actually rely on subconscious knowledge—what you might call common sense—in order to act intelligently. This kind of common sense could not easily be captured in programmed rules or logical deduction, and the lack of it severely limited any broad application of symbolic AI methods. In short, after a cycle of grand promises, immense funding, and media hype, symbolic AI was facing yet another AI winter.
  
According to the proponents of connectionism, the key to intelligence was an appropriate computational architecture—inspired by the brain—and the ability of the system to learn on its own from data or from acting in the world. Rumelhart, McClelland, and their team constructed connectionist networks (in software) as scientific models of human learning, perception, and language development. While these networks did not exhibit anywhere near human-level performance, the various networks described in the Parallel Distributed Processing books and elsewhere were interesting enough as AI artifacts that many people took notice, including those at funding agencies. In 1988, a top official at the Defense Advanced Research Projects Agency (DARPA), which provided the lion’s share of AI funding, proclaimed, “I believe that this technology which we are about to embark upon [that is, neural networks] is more important than the atom bomb.”6 Suddenly neural networks were “in” again.
  
===Bad at Logic, Good at Frisbee ===
  
Over the last six decades of AI research, people have repeatedly debated the relative advantages and disadvantages of symbolic and subsymbolic approaches. Symbolic systems can be engineered by humans, be imbued with human knowledge, and use human-understandable reasoning to solve problems. For example, MYCIN, an expert system developed in the early 1970s, was given about six hundred rules that it used to help physicians diagnose and treat blood diseases. MYCIN’s programmers developed these rules after painstaking interviews with expert physicians. Given a patient’s symptoms and medical test results, MYCIN was able to use both logic and probabilistic reasoning together with its rules in order to come up with a diagnosis, and it was able to explain its reasoning process. In short, MYCIN was a paradigmatic example of symbolic AI.  
  
In contrast, as we’ve seen, subsymbolic systems tend to be hard to interpret, and no one knows how to directly program complex human knowledge or logic into these systems. Subsymbolic systems seem much better suited to perceptual or motor tasks for which humans can’t easily define rules. You can’t easily write down rules for identifying handwritten digits, catching a baseball, or recognizing your mother’s voice; you just seem to do it automatically, without conscious thought. As the philosopher Andy Clark put it, the nature of subsymbolic systems is to be “bad at logic, good at Frisbee.”7
  
So, why not just use symbolic systems for tasks that require high-level language-like descriptions and logical reasoning, and use subsymbolic systems for the low-level perceptual tasks such as recognizing faces and voices? To some extent, this is what has been done in AI, with very little connection between the two areas. Each of these approaches has had important successes in narrow areas but has serious limitations in achieving the original goals of AI. While there have been some attempts to construct hybrid systems that integrate subsymbolic and symbolic methods, none have yet led to any striking success.  
  
===The Ascent of Machine Learning ===
  
Inspired by statistics and probability theory, AI researchers developed numerous algorithms that enable computers to learn from data, and the field of machine learning became its own independent subdiscipline of AI, intentionally separate from symbolic AI. Machine-learning researchers disparagingly referred to symbolic AI methods as good old-fashioned AI, or GOFAI (pronounced “go-fye”),8 and roundly rejected them.  
  
Over the next two decades, machine learning had its own cycles of optimism, government funding, start-ups, and overpromising, followed by the inevitable winters. Training neural networks and similar methods to solve real-world problems could be glacially slow, and often didn’t work very well, given the limited amount of data and computer power available at the time. But more data and computing power were coming shortly. The explosive growth of the internet would see to that. The stage was set for the next big AI revolution.
  
==3 AI Spring ==

===Spring Fever ===
  
Have you ever taken a video of your cat and uploaded it to YouTube? If so, you are not alone. More than a billion videos have been uploaded to YouTube, and a lot of them feature cats. In 2012, an AI team at Google constructed a multilayer neural network with over a billion weights that “viewed” millions of random YouTube videos while it adjusted these weights in order to successfully compress, and then decompress, selected frames from the videos. The Google researchers didn’t tell the system to learn about any particular objects, but after a week of training, when they probed the innards of the network, what did they find? A “neuron” (unit) that seemed to encode cats.1

This self-taught cat-recognition machine was one of a series of impressive AI feats that have captured the public’s attention over the last decade. Most of these achievements rely on a set of neural network algorithms known as deep learning. Until recently, AI’s popular image came largely from the many movies and TV shows in which it played a starring role; think 2001: A Space Odyssey or The Terminator. Real-world AI wasn’t very noticeable in our everyday lives or mainstream media. If you came of age in the 1990s or earlier, you might recall frustrating encounters with customer service speech-recognition systems, the robotic word-learning toy Furby, or Microsoft’s annoying and ill-fated Clippy, the paper-clip virtual assistant. Full-blown AI didn’t seem imminent.
 
Maybe this is why so many people were shocked and upset when, in 1997, IBM’s Deep Blue chess-playing system defeated the world chess champion Garry Kasparov. This event so stunned Kasparov that he accused the IBM team of cheating; he assumed that for the machine to play so well, it must have received help from human experts.2 (In a nice bit of irony, during the 2006 World Chess Championship matches the tables were turned, with one player accusing the other of cheating by receiving help from a computer chess program.3)
  
Our collective human angst over Deep Blue quickly receded. We accepted that chess could yield to brute-force machinery; playing chess well, we allowed, didn’t require general intelligence after all. This seems to be a common response when computers surpass humans on a particular task; we conclude that the task doesn’t actually require intelligence. As John McCarthy lamented, “As soon as it works, no one calls it AI anymore.”4
  
However, by the mid-2000s and beyond, a more pervasive succession of AI accomplishments started sneaking up on us and then proliferating at a dizzying pace. Google launched its automated language-translation service, Google Translate. It wasn’t perfect, but it worked surprisingly well, and it has since improved significantly. Shortly thereafter, Google’s self-driving cars showed up on the roads of Northern California, careful and timid, but commuting on their own in full traffic. Virtual assistants such as Apple’s Siri and Amazon’s Alexa were installed on our phones and in our homes and could deal with many of our spoken requests. YouTube started providing impressively accurate automated subtitles for videos, and Skype offered simultaneous translation between languages in video calls. Suddenly Facebook could recognize your face eerily well in uploaded photos, and the photo-sharing website Flickr began automatically labeling photos with text describing their content.
  
In 2011, IBM’s Watson program roundly defeated human champions on television’s Jeopardy! game show, adroitly interpreting pun-laden clues and prompting its challenger Ken Jennings to “welcome our new computer overlords.” Just five years later, millions of internet viewers were introduced to the complex game of Go, a longtime grand challenge for AI, when a program called AlphaGo stunningly defeated one of the world’s best players in four out of five games.
  
The buzz over artificial intelligence was quickly becoming deafening, and the commercial world took notice. All of the largest technology companies have poured billions of dollars into AI research and development, either hiring AI experts directly or acquiring smaller start-up companies for the sole purpose of grabbing (“acqui-hiring”) their talented employees. The potential of being acquired, with its promise of instant millionaire status, has fueled a proliferation of start-ups, often founded and run by former university professors, each with his or her own twist on AI. As the technology journalist Kevin Kelly observed, “The business plans of the next 10,000 startups are easy to forecast: Take X and add AI.”5 And, crucially, for nearly all of these companies, AI has meant “deep learning.”
  
AI spring is once again in full bloom.  
  
===AI: Narrow and General, Weak and Strong ===
  
Like every AI spring before it, our current one features experts predicting that “general AI”—AI that equals or surpasses humans in most ways—will be here soon. “Human level AI will be passed in the mid-2020s,”6 predicted Shane Legg, cofounder of Google DeepMind, in 2008. In 2015, Facebook’s CEO, Mark Zuckerberg, declared, “One of our goals for the next five to 10 years is to basically get better than human level at all of the primary human senses: vision, hearing, language, general cognition.”7 The AI philosophers Vincent Müller and Nick Bostrom published a 2013 poll of AI researchers in which many assigned a 50 percent chance of human-level AI by the year 2040.8
  
While much of this optimism is based on the recent successes of deep learning, these programs—like all instances of AI to date—are still examples of what is called “narrow” or “weak” AI. These terms are not as derogatory as they sound; they simply refer to a system that can perform only one narrowly defined task (or a small set of related tasks). AlphaGo is possibly the world’s best Go player, but it can’t do anything else; it can’t even play checkers, tic-tac-toe, or Candy Land. Google Translate can render an English movie review into Chinese, but it can’t tell you if the reviewer liked the movie or not, and it certainly can’t watch and review the movie itself.  
  
The terms narrow and weak are used to contrast with strong, human-level, general, or full-blown AI (sometimes called AGI, or artificial general intelligence)—that is, the AI that we see in movies, that can do most everything we humans can do, and possibly much more. General AI might have been the original goal of the field, but achieving it has turned out to be much harder than expected. Over time, efforts in AI have become focused on particular well-defined tasks—speech recognition, chess playing, autonomous driving, and so on. Creating machines that perform such functions is useful and often lucrative, and it could be argued that each of these tasks individually requires “intelligence.” But no AI program has been created yet that could be called intelligent in any general sense. A recent appraisal of the field stated this well: “A pile of narrow intelligences will never add up to a general intelligence. General intelligence isn’t about the number of abilities, but about the integration between those abilities.”9
  
But wait. Given the rapidly increasing pile of narrow intelligences, how long will it be before someone figures out how to integrate them and produce all of the broad, deep, and subtle features of human intelligence? Do we believe the cognitive scientist Steven Pinker, who thinks all this is business as usual? “Human-level AI is still the standard fifteen to twenty-five years away, just as it always has been, and many of its recently touted advances have shallow roots,” Pinker declared.10 Or should we pay more attention to the AI optimists, who are certain that this time around, this AI spring, things will be different?
  
Not surprisingly, in the AI research community there is considerable controversy over what human-level AI would entail. How can we know if we have succeeded in building such a “thinking machine”? Would such a system be required to have consciousness or self-awareness in the way humans do? Would it need to understand things in the same way a human understands them? Given that we’re talking about a machine here, would we be more correct to say it is “simulating thought,” or could we say it is truly thinking?
  
===Could Machines Think? ===
  
Such philosophical questions have dogged the field of AI since its inception. Alan Turing, the British mathematician who in the 1930s sketched out the first framework for programmable computers, published a paper in 1950 asking what we might mean when we ask, “Can machines think?” After proposing his famous “imitation game” (now called the Turing test—more on this in a bit), Turing listed nine possible objections to the prospect of a machine actually thinking, all of which he tried to refute. These imagined objections range from the theological—“Thinking is a function of man’s immortal soul. God has given an immortal soul to every man and woman, but not to any other animal or to machines. Hence no animal or machine can think”—to the parapsychological, something along the lines of “Humans can use telepathy to communicate while machines cannot.” Strangely enough, Turing judged this last argument as “quite a strong one,” because “the statistical evidence, at least for telepathy, is overwhelming.”  
  
From the vantage of many decades, my own vote for the strongest of Turing’s possible arguments is the “argument from consciousness,” which he summarizes by quoting the neurologist Geoffrey Jefferson:
  
:Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain—that is, not only write it but know that it had written it. No mechanism could feel (and not merely artificially signal, an easy contrivance) pleasure at its successes, grief when its valves fuse, be warmed by flattery, be made miserable by its mistakes, be charmed by sex, be angry or depressed when it cannot get what it wants.11
  
Note that this argument is saying the following: (1) Only when a machine feels things and is aware of its own actions and feelings—in short, is conscious—could we consider it actually thinking, and (2) No machine could ever do this. Ergo, no machine could ever actually think.  
  
I think it’s a strong argument, even though I don’t agree with it. It resonates with our intuitions about what machines are and how they are limited. Over the years, I’ve talked with any number of friends, relatives, and students about the possibility of machine intelligence, and this is the argument many of them stand by. For example, I was recently talking with my mother, a retired lawyer, after she had read a New York Times article about advances in the Google Translate program:
  
:MOM: The problem with people in the field of AI is that they anthropomorphize so much!
  
:ME: What do you mean, anthropomorphize?
:MOM: The language they use implies that machines might be able to actually think, rather than to just simulate thinking.
:ME: What’s the difference between “actually thinking” and “simulating thinking”?
:MOM: Actual thinking is done with a brain, and simulating is done with computers.
:ME: What’s so special about a brain that it allows “actual” thinking? What’s missing in computers?
:MOM: I don’t know. I think there’s a human quality to thinking that can’t ever be completely mimicked by computers.
  
My mother isn’t the only one who has this intuition. In fact, to many people it seems so obvious as to require no argument. And like many of these people, my mother would claim to be a philosophical materialist; that is, she doesn’t believe in any nonphysical “soul” or “life force” that imbues living things with intelligence. It’s just that she doesn’t think machines could ever have the right stuff to “actually think.”
  
In the academic realm, the most famous version of this argument was put forth by the philosopher John Searle. In 1980, Searle published an article called “Minds, Brains, and Programs”12 in which he vigorously argued against the possibility of machines actually thinking. In this widely read, controversial piece, Searle introduced the concepts of “strong” and “weak” AI in order to distinguish between two philosophical claims made about AI programs. While many people today use the phrase strong AI to mean “AI that can perform most tasks as well as a human” and weak AI to mean the kind of narrow AI that currently exists, Searle meant something different by these terms. For Searle, the strong AI claim would be that “the appropriately programmed digital computer does not just simulate having a mind; it literally has a mind.”13 In contrast, in Searle’s terminology, weak AI views computers as tools to simulate human intelligence and does not make any claims about them “literally” having a mind.14 We’re back to the philosophical question I was discussing with my mother: Is there a difference between “simulating a mind” and “literally having a mind”? Like my mother, Searle believes there is a fundamental difference, and he argued that strong AI is impossible even in principle.15
 +
  
===The Turing Test ===
  
Searle’s article was spurred in part by Alan Turing’s 1950 paper, “Computing Machinery and Intelligence,” which had proposed a way to cut through the Gordian knot of “simulated” versus “actual” intelligence. Declaring that “the original question ‘Can a machine think?’ is too meaningless to deserve discussion,” Turing proposed an operational method to give it meaning. In his “imitation game,” now called the Turing test, there are two contestants: a computer and a human. Each is questioned separately by a (human) judge who tries to determine which is which. The judge is physically separated from the two contestants, and so cannot rely on visual or auditory cues; only typed text is communicated.
  
Turing suggested the following: “The question, ‘Can machines think?’ should be replaced by ‘Are there imaginable digital computers which would do well in the imitation game?’” In other words, if a computer is sufficiently humanlike to be indistinguishable from humans, aside from its physical appearance or what it sounds like (or smells or feels like, for that matter), why shouldn’t we consider it to actually think? Why should we require an entity to be created out of a particular kind of material (for example, biological cells) to grant it “thinking” status?  
  
As the computer scientist Scott Aaronson put it bluntly, Turing’s proposal is “a plea against meat chauvinism.”16
  
The devil is always in the details, and the Turing test is no exception. Turing did not specify the criteria for selecting the human contestant and the judge, or stipulate how long the test should last, or what conversational topics should be allowed. However, he did make an oddly specific prediction: “I believe that in about 50 years’ time it will be possible to programme computers … to make them play the imitation game so well that an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of questioning.” In other words, in a five-minute session, the average judge will be fooled 30 percent of the time.
  
Turing’s prediction has turned out to be pretty accurate. Several Turing tests have been staged over the years, in which the computer contestants are chatbots—programs specifically built to carry on conversations (they can’t do anything else). In 2014, the Royal Society in London was host to a Turing test demonstration featuring five computer programs, thirty human contestants, and thirty human judges of different ages and walks of life, including computer experts and nonexperts, as well as native and nonnative English speakers. Each judge conducted several rounds of five-minute conversations in which he or she conversed (by typing) in parallel with a pair of contestants— one human and one machine—after which the judge had to guess which was which. A chatbot named “Eugene Goostman,” created by a group of Russian and Ukrainian programmers, won the competition by fooling ten (or 33.3 percent) of the judges. The competition organizers, following Turing’s “more than 30 percent fooled in five minutes” criterion, quickly flooded the media with reports that the Turing test had finally been passed.
  
The Eugene Goostman chatbot works in much the same way as most other chatbots. It stores a large set of sentence templates that can be filled in based on a set of programmed rules that are applied to the input text it gets from its conversation partner. The chatbot’s programmers have given it linguistic rules that allow it to pinpoint key information in its input and to store that information for later use. In addition, the chatbot stores a database of “commonsense knowledge,” encoded by human programmers, along with some logic rules; these would presumably allow it to answer questions like “What color is the sky over Vladivostok?” with the human-sounding “Blue, I suppose, unless it’s a cloudy day.” If none of the chatbot’s rules apply to an input, it just changes the subject. The system’s rules also encode its “personality”—in this case, a thirteen-year-old Ukrainian boy whose English is good but (conveniently) not perfect. Here is a sample of Eugene Goostman’s conversation with one of the judges who was fooled:
  
:JUDGE: What is your gender

:EUGENE: I’m a young boy, if you care to know. And not very ugly, by the way!

:JUDGE: what are you aware of at this moment

:EUGENE: I would rather not talk about it if you don’t mind. Could you tell me what are you? I mean your profession.

:JUDGE: how old are you

:EUGENE: I’m a little boy 13 years old.

:JUDGE: what are your qualifications

:EUGENE: My qualifications? I’m not gonna tell you. Oooh. Anything else?17
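
To make the template-and-rules mechanism described above concrete, here is a minimal sketch of this style of chatbot in Python. It is not Eugene Goostman’s actual code; the patterns, canned replies, and persona are invented for illustration.

<syntaxhighlight lang="python">
import random
import re

# A toy rule-based chatbot in the style described above: pattern rules select
# canned reply templates, and when no rule matches, the bot dodges by changing
# the subject. The rules and replies below are invented examples.
RULES = [
    (re.compile(r"\bhow old are you\b", re.I),
     ["I'm a little boy, 13 years old."]),
    (re.compile(r"\bwhere (are you from|do you live)\b", re.I),
     ["I live in Odessa. Have you ever been to Ukraine?"]),
    (re.compile(r"\byour? (job|profession|qualifications)\b", re.I),
     ["My qualifications? I'm not gonna tell you. Anything else?"]),
]

DODGES = [
    "I would rather not talk about it if you don't mind.",
    "Could you tell me what you do for a living?",
    "By the way, do you like jokes?",
]

def reply(user_input: str) -> str:
    """Return a canned reply if a rule matches; otherwise change the subject."""
    for pattern, templates in RULES:
        if pattern.search(user_input):
            return random.choice(templates)
    return random.choice(DODGES)

print(reply("how old are you"))                       # matches a rule
print(reply("what are you aware of at this moment"))  # no rule matches -> dodge
</syntaxhighlight>

Nothing in such a program models meaning; it only maps surface patterns in the input to prewritten strings, which is why a few minutes of evasive small talk can be enough to fool a casual judge.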
  
After the competition, the organizers issued a press release announcing, “The 65-year-old iconic Turing Test was passed for the very first time by computer programme Eugene Goostman,” and stating, “It is fitting that such an important landmark has been reached at the Royal Society in London, the home of British Science and the scene of many great advances in human understanding over the centuries. This milestone will go down in history as one of the most exciting.”18
  
AI experts unanimously scoffed at this characterization. To anyone familiar with how chatbots are programmed, it’s glaringly obvious from the competition transcripts that Eugene Goostman is a program, and not even a very sophisticated one. The result seemed to reveal more about the judges and the test itself than about the machines. Given five minutes and a propensity to avoid hard questions by changing the subject or by responding with a new question, the program had a surprisingly easy time fooling a nonexpert judge into believing he or she was conversing with a real person. This has been demonstrated with many chatbots, ranging from the 1960s ELIZA, which mimicked a psychotherapist, to today’s malevolent Facebook bots, which use short text exchanges to trick people into revealing personal information.
  
These bots are, of course, leveraging our very human tendency to anthropomorphize (you were right, Mom!).
  
We are all too willing to ascribe understanding and consciousness to computers, based on little evidence.
  
For these reasons, most AI experts hate the Turing test, at least as it has been carried out to date. They see  such competitions as publicity stunts whose results say nothing about progress in AI. But while Turing might have overestimated the ability of an “average interrogator” to see through superficial trickery, could the test still be a useful indicator of actual intelligence if the conversation time is extended and the required expertise of the judges is raised?
  
Ray Kurzweil, who is now director of engineering at Google, believes that a properly designed version of the Turing test will indeed reveal machine intelligence; he predicts that a computer will pass this test by 2029, a milestone event on the way to Kurzweil’s forecasted Singularity.
  
===The Singularity ===
  
Ray Kurzweil has long been AI’s leading optimist. A former student of Marvin Minsky’s at MIT, Kurzweil has had a distinguished career as an inventor: he invented the first text-to-speech machine as well as one of the world’s best music synthesizers. In 1999, President Bill Clinton awarded Kurzweil the National Medal of Technology and Innovation for these and other inventions.
  
Yet Kurzweil is best known not for his inventions but for his futurist prognostications, most notably the idea of the Singularity: “a future period during which the pace of technological change will be so rapid, its impact so deep, that human life will be irreversibly transformed.”19 Kurzweil uses the term singularity in the sense of “a unique event with … singular implications”; in particular, “an event capable of rupturing the fabric of human history.”20 For Kurzweil, this singular event is the point in time when AI exceeds human intelligence.
  
Kurzweil’s ideas were spurred by the mathematician I. J. Good’s speculations on the potential of an intelligence explosion: “Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind.”21
  
Kurzweil was also influenced by the mathematician and science fiction writer Vernor Vinge, who believed this event was close at hand: “The evolution of human intelligence took millions of years. We will devise an equivalent advance in a fraction of that time. We will soon create intelligences greater than our own. When this happens, human history will have reached a kind of singularity … and the world will pass far beyond our understanding.”22
  
Kurzweil takes the intelligence explosion as his starting point and then turns up the sci-fi intensity, moving from AI to nanoscience, then to virtual reality and “brain uploading,” all in the same calm, confident tone of a Delphic oracle looking at a calendar and pointing to specific dates. To give you the flavor of all this, here are some of Kurzweil’s predictions:  
  
:By the 2020s molecular assembly will provide tools to effectively combat poverty, clean up our environment, overcome disease, [and] extend human longevity.
  
:By the end of the 2030s … brain implants based on massively distributed intelligent nanobots will greatly expand our memories and otherwise vastly improve all our sensory, pattern-recognition, and cognitive abilities.  
  
:Uploading a human brain means scanning all of its salient details and then reinstantiating those details into a suitably powerful computational substrate.… The end of the 2030s is a conservative projection for successful [brain] uploading.23
  
:A computer will pass the Turing test by 2029.24
  
:As we get to the 2030s, artificial consciousness will be very realistic. That’s what it means to pass the Turing test.25
  
:I set the date for the Singularity … as 2045. The nonbiological intelligence created in that year will be one billion times more powerful than all human intelligence today.26
  
The writer Andrian Kreye wryly referred to Kurzweil’s Singularity prediction as “nothing more than the belief in a technological Rapture.”27
  
Kurzweil bases all of his predictions on the idea of “exponential progress” in many areas of science and technology, especially computers. To unpack this idea, let’s consider how exponential growth works.  
  
===An Exponential Fable ===
  
For a simple illustration of exponential growth, I’ll recount an old fable. Long ago, a renowned sage from a poor and starving village visited a distant and rich kingdom where the king challenged him to a game of chess. The sage was reluctant to accept, but the king insisted, offering the sage a reward “of anything you desire, if you are able to defeat me in a game.” For the sake of his village, the sage finally accepted and (as sages usually do) won the game. The king asked the sage to name his reward. The sage, who enjoyed mathematics, said, “All I ask for is that you take this chessboard, put two grains of rice on the first square, four grains on the second square, eight grains on the third, and so on, doubling the number of grains on each successive square. After you complete each row, package up the rice on that row and ship it to my village.” The mathematically naive king laughed. “Is that all you want? I will have my men bring in some rice and fulfill your request posthaste.”
  
The king’s men brought in a large bag of rice. After several minutes they had completed the first eight squares of the board with the requisite grains of rice: 2 on the first square, 4 on the second, 8 on the third, and so on, with 256 grains on the eighth square. They put the collection of grains (510, to be exact) in a tiny bag and sent it off by horseback to the sage’s village. They then proceeded on to the second row, with 512 grains on the first square of that row, 1,024 grains on the next square, and 2,048 grains on the following. Each pile of rice no longer fit on a chessboard square, so it was counted into a large bowl instead. By the end of the second row, the counting of grains was taking far too much time, so the court mathematicians started estimating the amounts by weight. They calculated that for the sixteenth square, 65,536 grains—about a kilogram (just over two pounds)—were required. The bag of rice shipped off for the second row weighed about two kilograms.
  
The king’s men started on the third row. The seventeenth square required 2 kilos, the eighteenth required 4, and so on; by the end of the third row (square 24), 512 kilos were needed. The king’s subjects were conscripted to bring in additional giant bags of rice. The situation had become dire by the second square of the fourth row (square 26), when the mathematicians calculated that 2,048 kilos (over two tons) of rice were required. This would exhaust the entire rice harvest of the kingdom, even though the chessboard was not even half completed. The king, now realizing the trick that had been played on him, begged the sage to relent and save the kingdom from starvation. The sage, satisfied that the rice already received by his village would be enough, agreed.  
  
Figure 5A plots the number of kilos of rice required on each chess square, up to the twenty-fourth square. The first square, with two rice grains, has a scant fraction of a kilo. Similarly, the squares up through 16 have less than 1 kilo. But after square 16, you can see the plot shoot up rapidly, due to the doubling effect. Figure 5B shows the values for the twenty-fourth through the sixty-fourth chess square, going from 512 kilos to more than 30 trillion kilos.
  
The mathematical function describing this graph is y = 2<sup>x</sup>, where x is the chess square (numbered from 1 to 64) and y is the number of rice grains required on that square. This is called an exponential function, because x is the exponent of the number 2. No matter what scale is plotted, the function will have a characteristic point at which the curve seems to change from slow to explosively fast growth.
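
The fable’s arithmetic is easy to check. Here is a short Python sketch (mine, not the book’s), assuming, as the text does, that square n holds 2<sup>n</sup> grains and that roughly 65,536 grains weigh about one kilogram:

<syntaxhighlight lang="python">
# Rice-on-the-chessboard arithmetic from the fable above.
# Assumptions (taken from the text): square n holds 2**n grains,
# and about 65,536 grains weigh roughly one kilogram.
GRAINS_PER_KG = 65_536

def grains_on_square(n: int) -> int:
    """Grains on square n (1-64), doubling from 2 grains on square 1."""
    return 2 ** n

def kilos_on_square(n: int) -> float:
    return grains_on_square(n) / GRAINS_PER_KG

def kilos_up_to(n: int) -> float:
    """Cumulative kilograms over squares 1 through n."""
    return sum(kilos_on_square(i) for i in range(1, n + 1))

print(kilos_on_square(16))  # ~1 kg: where the court switched from counting to weighing
print(kilos_up_to(26))      # ~2,048 kg in total: the point where the kingdom gave up
print(kilos_on_square(64))  # hundreds of trillions of kg on the final square alone
</syntaxhighlight>

Plotting kilos_on_square over all 64 squares reproduces the characteristic shape of figure 5: nearly flat at first, then an apparent explosion.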
  
FIGURE 5: Plots showing how many kilos of rice are needed for each chess square in order to fulfill the sage’s request; A, squares 1–24 (with y-axis showing hundreds of kilos); B, squares 24–64 (with y-axis showing tens of trillions of kilos)
  
===Exponential Progress in Computers ===
  
For Ray Kurzweil, the computer age has provided a real-world counterpart to the exponential fable. In 1965, Gordon Moore, cofounder of Intel Corporation, identified a trend that has come to be known as Moore’s law: the number of components on a computer chip doubles approximately every one to two years. In other words, the components are getting exponentially smaller (and cheaper), and computer speed and memory are increasing at an exponential rate.  
  
Kurzweil’s books are full of graphs like the ones in figure 5, and extrapolations of these trends of exponential progress, along the lines of Moore’s law, are at the heart of his forecasts for AI. Kurzweil points out that if the trends continue (as he believes they will), a $1,000 computer will “achieve human brain capability (10<sup>16</sup> calculations per second) … around the year 2023.”28 At that point, in Kurzweil’s view, human-level AI will just be a matter of reverse engineering the brain.
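
The style of extrapolation behind such claims is just compound doubling. The sketch below is purely illustrative (the baseline figure and doubling time are assumptions of mine, not Kurzweil’s published numbers): it counts how many doublings are needed to get from one level of computing power to another and converts that into years.

<syntaxhighlight lang="python">
from math import log2

# Toy Moore's-law-style extrapolation. All numbers below are illustrative
# assumptions, not measured values or Kurzweil's own figures.
def years_to_reach(start: float, target: float, doubling_time_years: float) -> float:
    """Years needed to grow from `start` to `target` via repeated doubling."""
    doublings_needed = log2(target / start)
    return doublings_needed * doubling_time_years

# e.g., going from an assumed 1e11 to 1e16 calculations per second per $1,000,
# with an assumed doubling time of 1.5 years:
print(years_to_reach(1e11, 1e16, 1.5))  # about 25 years
</syntaxhighlight>

The answer scales directly with the assumed doubling time, which is one reason Kurzweil’s critics and admirers can look at the same curves and arrive at very different dates.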
  
===Neural Engineering ===
  
Reverse engineering the brain means understanding enough about its workings in order to duplicate it, or at least to use the brain’s underlying principles to replicate its intelligence in a computer. Kurzweil believes that such reverse engineering is a practical, near-term approach to creating human-level AI. Most neuroscientists would vehemently disagree, given how little is currently known about how the brain works. But Kurzweil’s argument again rests on exponential trends—this time in advancements in neuroscience. In 2002 he wrote, “A careful analysis of the requisite trends shows that we will understand the principles of operation of the human brain and be in a position to recreate its powers in synthetic substances well within thirty years.”29
  
Few if any neuroscientists agree on this optimistic prediction for their field. But even if a machine operating on the brain’s principles can be created, how will it learn all the stuff it needs to know to be considered intelligent? After all, a newborn baby has a brain, but it doesn’t yet have what we’d call human-level intelligence. Kurzweil agrees: “Most of [the brain’s] complexity comes from its own interaction with a complex world. Thus, it will be necessary to provide an artificial intelligence with an education just as we do with a natural intelligence.”30
  
Of course, providing an education can take many years. Kurzweil thinks that the process can be vastly sped up. “Contemporary electronics is already more than ten million times faster than the human nervous system’s electrochemical information processing. Once an AI masters human basic language skills, it will be in a position to expand its language skills and general knowledge by rapidly reading all human literature and by absorbing the knowledge contained on millions of web sites.”31
  
Kurzweil is vague on how all this will happen but assures us that to achieve human-level AI, “we will not program human intelligence link by link as in some massive expert system. Rather, we will set up an intricate hierarchy of self-organizing systems, based largely on the reverse engineering of the human brain, and then provide for its education … hundreds if not thousands of times faster than the comparable process for humans.”32
  
===Singularity Skeptics and Adherents ===
  
Responses to Kurzweil’s books The Age of Spiritual Machines (1999) and The Singularity Is Near (2005) are often one of two extremes: enthusiastic embrace or dismissive skepticism. When I read Kurzweil’s books, I was (and still am) in the latter camp. I wasn’t at all convinced by his surfeit of exponential curves or his arguments for reverse engineering the brain. Yes, Deep Blue had defeated Kasparov in chess, but AI was far below the level of humans in most other domains. Kurzweil’s predictions that AI would equal us in a mere couple of decades seemed to me ridiculously optimistic.
  
Most of the people I know are similarly skeptical. Mainstream AI’s attitude is perfectly captured in an article by the journalist Maureen Dowd: she describes how Andrew Ng, a famous AI researcher from Stanford, rolled his eyes at her mention of Kurzweil, saying, “Whenever I read Kurzweil’s Singularity, my eyes just naturally do that.”33 On the other hand, Kurzweil’s ideas have many adherents. Most of his books have been bestsellers and have been positively reviewed in serious publications. Time magazine declared of the Singularity, “It’s not a fringe idea; it’s a serious hypothesis about the future of life on Earth.”34
  
Kurzweil’s thinking has been particularly influential in the tech industry, where people often believe in exponential technological progress as the means to solve all of society’s problems. Kurzweil is not only a director of engineering at Google but also a cofounder (with his fellow futurist entrepreneur Peter Diamandis) of Singularity University (SU), a “trans-humanist” think tank, start-up incubator, and sometime summer camp for the tech elite. SU’s published mission is “to educate, inspire, and empower leaders to apply exponential technologies to address humanity’s grand challenges.”35 The organization is partially underwritten by Google; Larry Page (cofounder of Google) was an early supporter and is a frequent speaker at SU’s programs. Several other big-name technology companies have joined as sponsors.
  
Douglas Hofstadter is one thinker who—again surprising me—straddles the fence between Singularity skepticism and worry. He was disturbed, he told me, that Kurzweil’s books “mixed in the zaniest science fiction scenarios with things that were very clearly true.” When I argued, Hofstadter pointed out that from the vantage of several years later, for every seemingly crazy prediction Kurzweil made, he also often predicted something that has surprisingly come true or will soon. By the 2030s, will “‘experience beamers’ … send the entire flow of their sensory experiences as well as the neurological correlates of their emotional reactions out onto the Web”?36 Sounds crazy. But in the late 1980s, Kurzweil, relying on his exponential curves, predicted that by 1998 “a computer will defeat the human world chess champion … and we’ll think less of chess as a result.”37 At the time, many thought that sounded crazy too. But this event occurred a year earlier than Kurzweil predicted.  
  
Hofstadter has noted Kurzweil’s clever use of what Hofstadter calls the “Christopher Columbus ploy,”38 referring to the Ira Gershwin song “They All Laughed,” which includes the line “They all laughed at Christopher Columbus.” Kurzweil cites numerous quotations from prominent people in history who completely underestimated the progress and impact of technology. Here are a few examples. IBM’s chairman, Thomas J. Watson, in 1943: “I think there is a world market for maybe five computers.” Digital Equipment Corporation’s cofounder Ken Olsen in 1977: “There’s no reason for individuals to have a computer in their home.” Bill Gates in 1981: “640,000 bytes of memory ought to be enough for anybody.”39 Hofstadter, having been stung by his own wrong predictions on computer chess, was hesitant to dismiss Kurzweil’s ideas out of hand, as crazy as they sounded. “Like Deep Blue’s defeat of Kasparov, it certainly gives one pause for thought.”40
  
===Wagering on the Turing Test ===
  
As a career choice, “futurist” is nice work if you can get it. You write books making predictions that can’t be evaluated for decades and whose ultimate validity won’t affect your reputation—or your book sales—in the here and now. In 2002, a website called Long Bets was created to help keep futurists honest. Long Bets is “an arena for competitive, accountable predictions,”41 allowing a predictor to make a long-term prediction that specifies a date and a challenger to challenge the prediction, both putting money on a wager that will be paid off after the prediction’s date is passed. The site’s very first predictor was the software entrepreneur Mitchell Kapor. He made a negative prediction: “By 2029 no computer—or ‘machine intelligence’—will have passed the Turing Test.” Kapor, who had founded the successful software company Lotus and who is also a longtime activist on internet civil liberties, knew Kurzweil well and was on the “highly skeptical” side of the Singularity divide. Kurzweil agreed to be the challenger for this public bet, with $20,000 going to the Electronic Frontier Foundation (cofounded by Kapor) if Kapor wins and to the Kurzweil Foundation if Kurzweil wins. The test to determine the winner will be carried out before the end of 2029.  
  
In making this wager, Kapor and Kurzweil had to—unlike Turing—specify carefully in writing how their Turing test would work. They begin with a few necessary definitions. “A Human is a biological human person as that term is understood in the year 2001 whose intelligence has not been enhanced through the use of machine (i.e., nonbiological) intelligence.” “A Computer is any form of nonbiological intelligence (hardware and software) and may include any form of technology, but may not be a biological Human (enhanced or otherwise) nor biological neurons (however, nonbiological emulations of biological neurons are allowed).”42
  
The terms of the wager also specify that the test will be carried out by three human judges who will interview the computer contestant as well as three human “foils.” All four contestants will try to convince the judges that they are humans. The judges and human foils will be chosen by a “Turing test committee,” made up of Kapor, Kurzweil (or their designees), and a third member. Instead of five-minute chats, each of the four contestants will be interviewed by each judge for a grueling two hours. At the end of all these interviews, each judge will give his or her verdict (“human” or “machine”) for each contestant. “The Computer will be deemed to have passed the ‘Turing Test Human Determination Test’ if the Computer has fooled two or more of the three Human Judges into thinking that it is a human.”43
  
But we’re not done yet:
  
:In addition, each of the three Turing Test Judges will rank the four Candidates with a rank from 1 (least human) to 4 (most human). The computer will be deemed to have passed the “Turing Test Rank Order Test” if the median rank of the Computer is equal to or greater than the median rank of two or more of the three Turing Test Human Foils.  
  
:The Computer will be deemed to have passed the Turing Test if it passes both the Turing Test Human Determination Test and the Turing Test Rank Order Test.  
  
:If a Computer passes the Turing Test, as described above, prior to the end of the year 2029, then Ray Kurzweil wins the wager. Otherwise Mitchell Kapor wins the wager.44
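
The scoring rules above are strict enough to state as a small program. The sketch below is my own rendering of the two tests as described, not code from the actual wager, and the example numbers are invented.

<syntaxhighlight lang="python">
from statistics import median

# Sketch of the Kapor-Kurzweil scoring rules quoted above (illustrative only).
# verdicts: for each of the 3 judges, True if that judge labeled the Computer "human".
# computer_ranks: the 1 (least human) to 4 (most human) rank each judge gave the Computer.
# foil_ranks: for each of the 3 human foils, the ranks given by the 3 judges.
def passes_turing_test(verdicts, computer_ranks, foil_ranks):
    # Turing Test Human Determination Test: fool two or more of the three judges.
    determination = sum(verdicts) >= 2

    # Turing Test Rank Order Test: the Computer's median rank must equal or exceed
    # the median rank of two or more of the three human foils.
    computer_median = median(computer_ranks)
    equaled_or_beaten = sum(computer_median >= median(ranks) for ranks in foil_ranks)
    rank_order = equaled_or_beaten >= 2

    # The wager requires passing both tests.
    return determination and rank_order

print(passes_turing_test(
    verdicts=[True, True, False],                  # fooled 2 of 3 judges
    computer_ranks=[3, 4, 2],                      # median rank 3
    foil_ranks=[[2, 3, 1], [4, 4, 3], [3, 2, 4]],  # foil medians 2, 4, 3
))  # True: both tests pass in this invented example
</syntaxhighlight>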
  
Wow, pretty strict. Eugene Goostman wouldn’t stand a chance. I’d have to (cautiously) agree with this assessment from Kurzweil: “In my view, there is no set of tricks or simpler algorithms (i.e., methods simpler than those underlying human intelligence) that would enable a machine to pass a properly designed Turing Test without actually possessing intelligence at a fully human level.”45 In addition to laying out the rules of their long bet, both Kapor and Kurzweil wrote accompanying essays giving the reasons each thinks he will win. Kurzweil’s essay summarizes the arguments laid out in his books: exponential progress in computation, neuroscience, and nanotechnology, which taken together will allow for reverse engineering of the brain.
  
Kapor doesn’t buy it. His main argument centers on the influence of our (human) physical bodies and emotions on our cognition. “Perception of and [physical] interaction with the environment is the equal partner of cognition in shaping experience.… [Emotions] bound and shape the envelope of what is thinkable.”46 Kapor asserts that without the equivalent of a human body, and everything that goes along with it, a machine will never be able to learn all that’s needed to pass his and Kurzweil’s strict Turing test.  
  
I assert that the fundamental mode of learning of human beings is experiential. Book learning is a layer on top of that.… If human knowledge, especially knowledge about experience, is largely tacit, i.e., never directly and explicitly expressed, it will not be found in books, and the Kurzweil approach to knowledge acquisition will fail.… It is not in what the computer knows but what the computer does not know and cannot know wherein the problem resides.47
  
Kurzweil responds that he agrees with Kapor on the role of experiential learning, tacit knowledge, and emotions but believes that before the 2030s virtual reality will be “totally realistic,”48 enough to re-create the physical experiences needed to educate a developing artificial intelligence. (Welcome to the Matrix.) Moreover, this artificial intelligence will have a reverse-engineered artificial brain with emotion as a key component.  
  
Are you, like Kapor, skeptical of Kurzweil’s predictions? Kurzweil says it’s because you don’t understand exponentials. “Generally speaking, the core of a disagreement I’ll have with a critic is, they’ll say, Oh Kurzweil is underestimating the complexity of reverse-engineering the human brain or the complexity of biology. But I don’t believe I’m underestimating the challenge. I think they’re underestimating the power of exponential growth.”49
  
Kurzweil’s doubters point out a couple of holes in this argument. Indeed, computer hardware has seen exponential progress over the last five decades, but there are many reasons to believe this trend will not hold up in the future. (Kurzweil of course disputes this.) But more important, computer software has not shown the same exponential progress; it would be hard to argue that today’s software is exponentially more sophisticated, or brain-like, than the software of fifty years ago, or that such a trend has ever existed. Kurzweil’s claims about exponential trends in neuroscience and virtual reality are also widely disputed.
  
But as Singularitarians have pointed out, sometimes it’s hard to see an exponential trend if you’re in the midst of it. If you look at an exponential curve like the ones in figure 5, Kurzweil and his adherents imagine that we’re at that point where the curve is increasing slowly, and it looks like incremental progress to us, but it’s deceptive: the growth is about to explode.  
  
Is the current AI spring, as many have claimed, the first harbinger of a coming explosion? Or is it simply a waypoint on a slow, incremental growth curve that won’t result in human-level AI for at least another century? Or yet another AI bubble, soon to be followed by another AI winter?
  
To help us get some bearing on these questions, we need to take a careful look at some of the crucial abilities underlying our distinctive human intelligence, such as perception, language, decision-making, commonsense reasoning, and learning. In the next chapters, we’ll see how far AI has come in capturing these abilities, and we’ll assess its prospects, for 2029 and beyond.
  
=Part II Looking and Seeing =
  
==4 Who, What, When, Where, Why ==
  
Look at the photo in figure 6 and tell me what you see. A woman petting a dog. A soldier petting a dog. A soldier who has just returned from war being welcomed by her dog, with flowers and a “Welcome Home” balloon. The soldier’s face shows her complex emotions. The dog is happily wagging its tail.  
  
When was this photo taken? Most likely within the past ten years. Where does this photo take place? Probably an airport. Why is the soldier petting the dog? She has probably been away for a long time, experienced many things, both good and bad, missed her dog a great deal, and is very happy to be home. Perhaps the dog is a symbol of all that is “home.” What happened just before this photo was taken? The soldier probably got off an airplane and walked through the secure part of the airport to the place where passengers can be greeted. Her family or friends greeted her with hugs, handed her the flowers and balloon, and let go of the dog’s leash. The dog came over to the soldier, who put down everything she was carrying and knelt down, carefully putting the balloon’s string under her knee to keep it from floating off. What will happen next? She’ll probably stand up, maybe wipe away some tears, gather her flowers, balloon, and laptop computer, take the dog’s leash, and walk with the dog and her family or friends to the baggage claim area.
  
FIGURE 6: What do you see in this photo?
  
When you look at this picture, at the most basic level you’re seeing bits of ink on a page (or pixels on a screen). Somehow your eyes and brain are able to take in this raw information and, within a few seconds, transform it into a detailed story involving living things, objects, relationships, places, emotions, motivations, and past and future actions. We look, we see, we understand. Crucially, we know what to ignore. There are many aspects of the photo that aren’t strictly relevant to the story we extract from it: the pattern on the carpet, the hanging straps on the soldier’s backpack, the whistle clipped to her pack’s shoulder pad, the barrettes in her hair.
  
We humans perform this vast amount of information processing in hardly any time at all, and we have very little, if any, conscious awareness of what we’re doing or how we do it. Unless you’ve been blind since birth, visual processing, at various levels of abstraction, dominates your brain.  
  
Surely, the ability to describe the contents of a photograph (or a video, or a real-time stream from a camera) in this way would be one of the first things we would require for general human-level AI.  
  
===Easy Things Are Hard (Especially in Vision) ===
  
Since the 1950s, AI researchers have been trying to get computers to make sense of visual data. In the early days of AI, achieving this goal seemed relatively straightforward. In 1966, Marvin Minsky and Seymour Papert—the symbolic-AI-promoting MIT professors whom you’ll recall from chapter 1—proposed the Summer Vision Project, in which they would assign undergraduates to work on “the construction of a significant part of a visual system.”1 In the words of one AI historian, “Minsky hired a first-year undergraduate and assigned him a problem to solve over the summer: connect a television camera to a computer and get the machine to describe what it sees.”2
  
The undergraduate didn’t get very far. And while the subfield of AI called computer vision has progressed substantially over the many decades since this summer project, a program that can look at and describe photographs in the way humans do still seems far out of reach. Vision—both looking and seeing—turns out to be one of the hardest of all “easy” things.
  
Responses to Kurzweil’s books The Age of Spiritual Machines (1999) and The Singularity Is Near (2005) are often one of two extremes: enthusiastic embrace or dismissive skepticism. When I read Kurzweil’s books, I was (and still am) in the latter camp. I wasn’t at all convinced by his surfeit of exponential curves or his arguments for reverse engineering the brain. Yes, Deep Blue had defeated Kasparov in chess, but AI was far below the level of humans in most other domains. Kurzweil’s predictions that AI would equal us in a mere couple of decades seemed to me ridiculously optimistic.
  
Most of the people I know are similarly skeptical. Mainstream AI’s attitude is perfectly captured in an article by the journalist Maureen Dowd: she describes how Andrew Ng, a famous AI researcher from Stanford, rolled his eyes at her mention of Kurzweil, saying, “Whenever I read Kurzweil’s Singularity, my eyes just naturally do that.”33 On the other hand, Kurzweil’s ideas have many adherents. Most of his books have been bestsellers and have been positively reviewed in serious publications. Time magazine declared of the Singularity, “It’s not a fringe idea; it’s a serious hypothesis about the future of life on Earth.”34
  
Kurzweil’s thinking has been particularly influential in the tech industry, where people often believe in exponential technological progress as the means to solve all of society’s problems. Kurzweil is not only a director of engineering at Google but also a cofounder (with his fellow futurist entrepreneur Peter Diamandis) of Singularity University (SU), a “trans-humanist” think tank, start-up incubator, and sometime summer camp for the tech elite. SU’s published mission is “to educate, inspire, and empower leaders to apply exponential technologies to address humanity’s grand challenges.”35 The organization is partially underwritten by Google; Larry Page (cofounder of Google) was an early supporter and is a frequent speaker at SU’s programs. Several other big-name technology companies have joined as sponsors.
  
Douglas Hofstadter is one thinker who—again surprising me—straddles the fence between Singularity skepticism and worry. He was disturbed, he told me, that Kurzweil’s books “mixed in the zaniest science fiction scenarios with things that were very clearly true.” When I argued, Hofstadter pointed out that from the vantage of several years later, for every seemingly crazy prediction Kurzweil made, he also often predicted something that has surprisingly come true or will soon. By the 2030s, will “‘experience beamers’ … send the entire flow of their sensory experiences as well as the neurological correlates of their emotional reactions out onto the Web”?36 Sounds crazy. But in the late 1980s, Kurzweil, relying on his exponential curves, predicted that by 1998 “a computer will defeat the human world chess champion … and we’ll think less of chess as a result.”37 At the time, many thought that sounded crazy too. But this event occurred a year earlier than Kurzweil predicted.
  
Hofstadter has noted Kurzweil’s clever use of what Hofstadter calls the “Christopher Columbus ploy,”38 referring to the Ira Gershwin song “They All Laughed,” which includes the line “They all laughed at Christopher Columbus.” Kurzweil cites numerous quotations from prominent people in history who completely underestimated the progress and impact of technology. Here are a few examples. IBM’s chairman, Thomas J. Watson, in 1943: “I think there is a world market for maybe five computers.” Digital Equipment Corporation’s cofounder Ken Olsen in 1977: “There’s no reason for individuals to have a computer in their home.” Bill Gates in 1981: “640,000 bytes of memory ought to be enough for anybody.”39 Hofstadter, having been stung by his own wrong predictions on computer chess, was hesitant to dismiss Kurzweil’s ideas out of hand, as crazy as they sounded. “Like Deep Blue’s defeat of Kasparov, it certainly gives one pause for thought.”40
  
===Wagering on the Turing Test ===
  
As a career choice, “futurist” is nice work if you can get it. You write books making predictions that can’t be evaluated for decades and whose ultimate validity won’t affect your reputation—or your book sales—in the here and now. In 2002, a website called Long Bets was created to help keep futurists honest. Long Bets is “an arena for competitive, accountable predictions,”41 allowing a predictor to make a long-term prediction that specifies a date and a challenger to challenge the prediction, both putting money on a wager that will be paid off after the prediction’s date is passed. The site’s very first predictor was the software entrepreneur Mitchell Kapor. He made a negative prediction: “By 2029 no computer—or ‘machine intelligence’—will have passed the Turing Test.” Kapor, who had founded the successful software company Lotus and who is also a longtime activist on internet civil liberties, knew Kurzweil well and was on the “highly skeptical” side of the Singularity divide. Kurzweil agreed to be the challenger for this public bet, with $20,000 going to the Electronic Frontier Foundation (cofounded by Kapor) if Kapor wins and to the Kurzweil Foundation if Kurzweil wins. The test to determine the winner will be carried out before the end of 2029.
  
In making this wager, Kapor and Kurzweil had to—unlike Turing—specify carefully in writing how their Turing test would work. They begin with a few necessary definitions. “A Human is a biological human person as that term is understood in the year 2001 whose intelligence has not been enhanced through the use of machine (i.e., nonbiological) intelligence.… A Computer is any form of nonbiological intelligence (hardware and software) and may include any form of technology, but may not be a biological Human (enhanced or otherwise) nor biological neurons (however, nonbiological emulations of biological neurons are allowed).”42
  
The terms of the wager also specify that the test will be carried out by three human judges who will interview the computer contestant as well as three human “foils.” All four contestants will try to convince the judges that they are humans. The judges and human foils will be chosen by a “Turing test committee,” made up of Kapor, Kurzweil (or their designees), and a third member. Instead of five-minute chats, each of the four contestants will be interviewed by each judge for a grueling two hours. At the end of all these interviews, each judge will give his or her verdict (“human” or “machine”) for each contestant. “The Computer will be deemed to have passed the ‘Turing Test Human Determination Test’ if the Computer has fooled two or more of the three Human Judges into thinking that it is a human.”43
  
But we’re not done yet:
  
In addition, each of the three Turing Test Judges will rank the four Candidates with a rank from 1 (least human) to 4 (most human). The computer will be deemed to have passed the “Turing Test Rank Order Test” if the median rank of the Computer is equal to or greater than the median rank of two or more of the three Turing Test Human Foils.
  
The Computer will be deemed to have passed the Turing Test if it passes both the Turing Test Human Determination Test and the Turing Test Rank Order Test.
  
If a Computer passes the Turing Test, as described above, prior to the end of the year 2029, then Ray Kurzweil wins the wager. Otherwise Mitchell Kapor wins the wager.44
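
The pass/fail rule in the wager is precise enough to write down as a short procedure. The sketch below is my own illustration, not part of the wager's text, and the variable names are hypothetical; it simply checks the two conditions, the Human Determination Test and the Rank Order Test, given each judge's verdict and ranking.

<syntaxhighlight lang="python">
from statistics import median

def computer_passes_turing_test(fooled_verdicts, ranks):
    # fooled_verdicts: one boolean per judge, True if that judge labeled the
    # computer "human." ranks: for the computer and each of the three foils,
    # the three judges' ranks (1 = least human, 4 = most human).
    # Human Determination Test: fool two or more of the three judges.
    determination = sum(fooled_verdicts) >= 2
    # Rank Order Test: the computer's median rank must equal or exceed the
    # median rank of two or more of the three human foils.
    computer_median = median(ranks["computer"])
    foils_matched = sum(computer_median >= median(ranks[f])
                        for f in ("foil1", "foil2", "foil3"))
    rank_order = foils_matched >= 2
    # The computer passes only if it passes both tests.
    return determination and rank_order

# Example: fooling one judge and out-ranking only one foil is not enough.
print(computer_passes_turing_test(
    [True, False, False],
    {"computer": [2, 1, 3], "foil1": [4, 4, 2],
     "foil2": [3, 2, 4], "foil3": [1, 3, 1]}))   # False
</syntaxhighlight>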
 
Wow, pretty strict. Eugene Goostman wouldn’t stand a chance. I’d have to (cautiously) agree with this assessment from Kurzweil: “In my view, there is no set of tricks or simpler algorithms (i.e., methods simpler than those underlying human intelligence) that would enable a machine to pass a properly designed Turing Test without actually possessing intelligence at a fully human level.”45
 
In addition to laying out the rules of their long bet, both Kapor and Kurzweil wrote accompanying essays giving the reasons each thinks he will win. Kurzweil’s essay summarizes the arguments laid out in his books: exponential progress in computation, neuroscience, and nanotechnology, which taken together will allow for reverse engineering of the brain.
 
Kapor doesn’t buy it. His main argument centers on the influence of our (human) physical bodies and emotions on our cognition. “Perception of and [physical] interaction with the environment is the equal partner of cognition in shaping experience.… [Emotions] bound and shape the envelope of what is thinkable.”46 Kapor asserts that without the equivalent of a human body, and everything that goes along with it, a machine will never be able to learn all that’s needed to pass his and Kurzweil’s strict Turing test.
 
I assert that the fundamental mode of learning of human beings is experiential. Book learning is a layer on top of that.… If human knowledge, especially knowledge about experience, is largely tacit, i.e., never directly and explicitly expressed, it will not be found in books, and the Kurzweil approach to knowledge acquisition will fail.… It is not in what the computer knows but what the computer does not know and cannot know wherein the problem resides.47
 
Kurzweil responds that he agrees with Kapor on the role of experiential learning, tacit knowledge, and emotions but believes that before the 2030s virtual reality will be “totally realistic,”48 enough to re-create the physical experiences needed to educate a developing artificial intelligence. (Welcome to the Matrix.) Moreover, this artificial intelligence will have a reverse-engineered artificial brain with emotion as a key component.
 
Are you, like Kapor, skeptical of Kurzweil’s predictions? Kurzweil says it’s because you don’t understand exponentials. “Generally speaking, the core of a disagreement I’ll have with a critic is, they’ll say, Oh Kurzweil is underestimating the complexity of reverse-engineering the human brain or the complexity of biology. But I don’t believe I’m underestimating the challenge. I think they’re underestimating the power of exponential growth.”49
 
Kurzweil’s doubters point out a couple of holes in this argument. Indeed, computer hardware has seen exponential progress over the last five decades, but there are many reasons to believe this trend will not hold up in the future. (Kurzweil of course disputes this.) But more important, computer software has not shown the same exponential progress; it would be hard to argue that today’s software is exponentially more sophisticated, or brain-like, than the software of fifty years ago, or that such a trend has ever existed. Kurzweil’s claims about exponential trends in neuroscience and virtual reality are also widely disputed.
 
But as Singularitarians have pointed out, sometimes it’s hard to see an exponential trend if you’re in the midst of it. If you look at an exponential curve like the ones in figure 5, Kurzweil and his adherents imagine that we’re at that point where the curve is increasing slowly, and it looks like incremental progress to us, but it’s deceptive: the growth is about to explode.
 
Is the current AI spring, as many have claimed, the first harbinger of a coming explosion? Or is it simply a waypoint on a slow, incremental growth curve that won’t result in human-level AI for at least another century? Or yet another AI bubble, soon to be followed by another AI winter?
 
To help us get some bearing on these questions, we need to take a careful look at some of the crucial abilities underlying our distinctive human intelligence, such as perception, language, decision-making, commonsense reasoning, and learning. In the next chapters, we’ll see how far AI has come in capturing these abilities, and we’ll assess its prospects, for 2029 and beyond.
 
==Part II Looking and Seeing ==
  
==4 Who, What, When, Where, Why ==
  
Look at the photo in figure 6 and tell me what you see. A woman petting a dog. A soldier petting a dog. A soldier who has just returned from war being welcomed by her dog, with flowers and a “Welcome Home” balloon. The soldier’s face shows her complex emotions. The dog is happily wagging its tail.
  
When was this photo taken? Most likely within the past ten years. Where does this photo take place? Probably an airport. Why is the soldier petting the dog? She has probably been away for a long time, experienced many things, both good and bad, missed her dog a great deal, and is very happy to be home. Perhaps the dog is a symbol of all that is “home.” What happened just before this photo was taken? The soldier probably got off an airplane and walked through the secure part of the airport to the place where passengers can be greeted. Her family or friends greeted her with hugs, handed her the flowers and balloon, and let go of the dog’s leash. The dog came over to the soldier, who put down everything she was carrying and knelt down, carefully putting the balloon’s string under her knee to keep it from floating off. What will happen next? She’ll probably stand up, maybe wipe away some tears, gather her flowers, balloon, and laptop computer, take the dog’s leash, and walk with the dog and her family or friends to the baggage claim area.
  
FIGURE 6: What do you see in this photo?
  
When you look at this picture, at the most basic level you’re seeing bits of ink on a page (or pixels on a screen). Somehow your eyes and brain are able to take in this raw information and, within a few seconds, transform it into a detailed story involving living things, objects, relationships, places, emotions, motivations, and past and future actions. We look, we see, we understand. Crucially, we know what to ignore. There are many aspects of the photo that aren’t strictly relevant to the story we extract from it: the pattern on the carpet, the hanging straps on the soldier’s backpack, the whistle clipped to her pack’s shoulder pad, the barrettes in her hair.
 
We humans perform this vast amount of information processing in hardly any time at all, and we have very little, if any, conscious awareness of what we’re doing or how we do it. Unless you’ve been blind since birth, visual processing, at various levels of abstraction, dominates your brain.
 
Surely, the ability to describe the contents of a photograph (or a video, or a real-time stream from a camera) in this way would be one of the first things we would require for general human-level AI.
 
===Easy Things Are Hard (Especially in Vision) ===
 
Since the 1950s, AI researchers have been trying to get computers to make sense of visual data. In the early days of AI, achieving this goal seemed relatively straightforward. In 1966, Marvin Minsky and Seymour Papert—the symbolic-AI-promoting MIT professors whom you’ll recall from chapter 1—proposed the Summer Vision Project, in which they would assign undergraduates to work on “the construction of a significant part of a visual system.”1 In the words of one AI historian, “Minsky hired a first-year undergraduate and assigned him a problem to solve over the summer: connect a television camera to a computer and get the machine to describe what it sees.”2
 
The undergraduate didn’t get very far. And while the subfield of AI called computer vision has progressed substantially over the many decades since this summer project, a program that can look at and describe photographs in the way humans do still seems far out of reach. Vision—both looking and seeing—turns out to be one of the hardest of all “easy” things.
 
One prerequisite to describing visual input is object recognition—that is, recognizing a particular group of pixels in an image as a particular object category, such as “woman,” “dog,” “balloon,” or “laptop computer.” Object recognition is typically so immediate and effortless for us as humans that it didn’t seem as though it would be a particularly hard problem for computers, until AI researchers actually tried to get computers to do it.
 
What’s so hard about object recognition? Well, consider the problem of getting a computer program to recognize dogs in photographs. Figure 7 illustrates some of the difficulties. If the input is simply the pixels of the image, then the program first has to figure out which are “dog” pixels and which are “non-dog” pixels (for example, background, shadows, other objects). Moreover, different dogs look very different: they can have diverse coloring, shape, and size; they can be facing in various directions; the lighting can vary considerably between images; parts of the dog can be blocked by other objects (for example, fences, people). What’s more, “dog pixels” might look a lot like “cat pixels” or other animals. Under some lighting conditions, a cloud in the sky might even look very much like a dog.
 
FIGURE 7: Object recognition: easy for humans, hard for computers
 
Since the 1950s, the field of computer vision has struggled with these and other issues. Until recently, a major job of computer-vision researchers was to develop specialized image-processing algorithms that would identify “invariant features” of objects that could be used to recognize these objects in spite of the difficulties I sketched above. But even with sophisticated image processing, the abilities of object-recognition programs remained far below those of humans.
 
===The Deep-Learning Revolution ===
 
The ability of machines to recognize objects in images and videos underwent a quantum leap in the 2010s due to advances in the area called deep learning.
 
Deep learning simply refers to methods for training “deep neural networks,” which in turn refers to neural networks with more than one hidden layer. Recall that hidden layers are those layers of a neural network between the input and the output. The depth of a network is its number of hidden layers: a “shallow” network—like the one we saw in chapter 2—has only one hidden layer; a “deep” network has more than one hidden layer. It’s worth emphasizing this definition: the deep in deep learning doesn’t refer to the sophistication of what is learned; it refers only to the depth in layers of the network being trained.
 
Research on deep neural networks has been going on for several decades. What makes these networks a revolution is their recent phenomenal success in many AI tasks. Interestingly, researchers have found that the most successful deep networks are those whose structure mimics parts of the brain’s visual system. The “traditional” multilayer neural networks I described in chapter 2 were inspired by the brain, but their structure is very un-brain-like. In contrast, the neural networks dominating deep learning are directly modeled after discoveries in neuroscience.
 
===The Brain, the Neocognitron, and Convolutional Neural Networks ===
 
About the same time that Minsky and Papert were proposing their Summer Vision Project, two neuroscientists were in the midst of a decades-long study that would radically remake our understanding of vision—and particularly object recognition—in the brain. David Hubel and Torsten Wiesel were later awarded a Nobel Prize for their discoveries of hierarchical organization in the visual systems of cats and primates (including humans) and for their explanation of how the visual system transforms light striking the retina into information about what is in the scene.
 
Hubel and Wiesel’s discoveries inspired a Japanese engineer named Kunihiko Fukushima, who in the 1970s developed one of the earliest deep neural networks, dubbed the cognitron, and its successor, the neocognitron. In his papers,3 Fukushima reported some success training the neocognitron to recognize handwritten digits (like the ones I showed in chapter 1), but the specific learning methods he used did not seem to extend to more complex visual tasks. Nonetheless, the neocognitron was an important inspiration for later approaches to deep neural networks, including today’s most influential and widely used approach: convolutional neural networks, or (as most people in the field call them) ConvNets.
 
ConvNets are the driving force behind today’s deep-learning revolution in computer vision, and in other areas as well. Although they have been widely heralded as the next big thing in AI, ConvNets are actually not very new: they were first proposed in the 1980s by the French computer scientist Yann LeCun, who had been inspired by Fukushima’s neocognitron.
 
FIGURE 8: Pathway of visual input from eyes to visual cortex
 
I’ll spend some time here describing how ConvNets work, because understanding them is crucial for making sense of where computer vision—as well as much else about AI—is today and what its limits are.
 
===Object Recognition in the Brain and in ConvNets ===
 
Like the neocognitron, the design of ConvNets is based on several key insights about the brain’s visual system that were discovered by Hubel and Wiesel in the 1950s and ’60s. When your eyes focus on a scene, what they receive is light of different wavelengths that has been reflected by the objects and surfaces in the scene. Light falling on the eyes activates cells in each retina—essentially a grid of neurons in the back of the eye. These neurons communicate their activation through the optic nerves and into the brain, eventually activating neurons in the visual cortex, which resides in the back of the head (figure 8). The visual cortex is roughly organized as a hierarchical series of layers of neurons, like the stacked layers of a wedding cake, where the neurons in each layer communicate their activations to neurons in the succeeding layer.
 
FIGURE 9: Sketch of visual features detected by neurons in different layers of the visual cortex
 
Hubel and Wiesel found evidence that neurons in different layers of this hierarchy act as “detectors” that respond to increasingly complex features appearing in the visual scene, as illustrated in figure 9: neurons at initial layers become active (that is, fire at a higher rate) in response to edges; their activation feeds into layers of neurons that respond to simple shapes made up of these edges; and so on, up through more complex shapes and finally entire objects and specific faces. Note that the arrows in figure 9 indicate a bottom-up (or feed-forward) flow of information, representing connections from lower to higher layers (in the figure, left to right). It’s important to note that a top-down (or feed-backward) flow of information (from higher to lower layers) also occurs in the visual cortex; in fact, there are about ten times as many feed-backward connections as feed-forward ones. However, the role of these backward connections is not well understood by neuroscientists, although it is well established that our prior knowledge and expectations, presumably stored in higher brain layers, strongly influence what we perceive.
 
Like the feed-forward hierarchical structure illustrated in figure 9, a ConvNet consists of a sequence of layers of simulated neurons. I’ll again refer to these simulated neurons as units. Units in each layer provide input to units in the next layer. Just like the neural network I described in chapter 2, when a ConvNet processes an image, each unit takes on a particular activation value—a real number that is computed from the unit’s inputs and their weights.
 
Let’s make this discussion more specific by imagining a hypothetical ConvNet, with four layers plus a “classification module,” that we want to train to recognize dogs and cats in images. Assume for simplicity that each input image depicts exactly one dog or cat. Figure 10 illustrates our ConvNet’s structure. It’s a bit complicated, so I’ll go through it carefully step-by-step to explain how it works.
 
FIGURE 10: Illustration of a four-layer convolutional neural network (ConvNet) designed to recognize dogs and cats in photos
 
===Input and Output===
 
The input to our ConvNet is an image—that is, an array of numbers, corresponding to the brightness and color of the image’s pixels.4 Our ConvNet’s final output is the network’s confidence (0 percent to 100 percent) for each category: “dog” and “cat.” Our goal is to have the network learn to output a high confidence for the correct category and a low confidence for the other category. In doing so, the network will learn what set of features of the input image are most useful for this task.
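
To make the input and output concrete, here is a minimal sketch, my own illustration with a made-up 64-by-64 image size and made-up confidence values, of what the network receives and what it is asked to produce.

<syntaxhighlight lang="python">
import numpy as np

# Input: an array of numbers giving the brightness and color of each pixel,
# here a hypothetical 64x64 color image with three channels (values 0-255).
image = np.random.randint(0, 256, size=(64, 64, 3))

# Output: one confidence per category, expressed as percentages.
# A well-trained network should give the correct category the high value.
confidences = {"dog": 92.0, "cat": 8.0}          # made-up numbers for illustration
print(max(confidences, key=confidences.get))     # "dog"
</syntaxhighlight>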
 
===Activation Maps ===
 
Notice in figure 10 that each layer of the network is represented by a set of three overlapping rectangles. These rectangles represent activation maps, inspired by similar “maps” found in the brain’s visual system. Hubel and Wiesel discovered that neurons in the lower layers of the visual cortex are physically arranged so that they form a rough grid, with each neuron in the grid responding to a corresponding small area of the visual field. Imagine flying at night in an airplane over Los Angeles and taking a photo; the lights seen in your photo form a rough map of the features of the lit-up city. Analogously, the activations of the neurons in each grid-like layer of the visual cortex form a rough map of the important features in the visual scene. Now imagine that you had a very special camera that could produce separate photos for house lights, building lights, and car lights. This is something like what the visual cortex does: each important visual feature has its own separate neural map. The combination of these maps is a key part of what gives rise to our perception of a scene.
 
FIGURE 11: Activation maps in the first layer of our ConvNet
 
Like neurons in the visual cortex, the units in a ConvNet act as detectors for important visual features, each unit looking for its designated feature in a specific part of the visual field. And (very roughly) like the visual cortex, each layer in a ConvNet consists of several grids of these units, with each grid forming an activation map for a specific visual feature.
 
What visual features should ConvNet units detect? Let’s look to the brain first. Hubel and Wiesel found that neurons in lower layers of the visual cortex act as edge detectors, where an edge refers to a boundary between two contrasting image regions. Each neuron receives input corresponding to a specific small region of the visual scene; this region is called the neuron’s receptive field. The neuron becomes active (that is, starts firing more rapidly) only if its receptive field contains a particular kind of edge.
 
In fact, these neurons are quite specific about what kind of edge they respond to. Some neurons become active only when there is a vertical edge in their receptive field; some respond only to a horizontal edge; others fire only for edges at other specific angles. One of Hubel and Wiesel’s most important findings was that each small region of your visual field corresponds to the receptive fields of many different such “edge detector” neurons. That is, at a low level of visual processing, your neurons are figuring out what edge orientations occur in every part of the scene you are looking at. Edge-detecting neurons feed into higher layers of the visual cortex, the neurons of which seem to be detectors for specific shapes, objects, and faces.5
 
Similarly, the first layer of our hypothetical ConvNet consists of edge-detecting units. Figure 11 gives a closer view of layer 1 of our ConvNet. This layer consists of three activation maps, each of which is a grid of units. Each unit in a map corresponds to the analogous location in the input image, and each unit gets its input from a small region around that location—its receptive field. (The receptive fields of neighboring units typically overlap.) Each unit in each map calculates an activation value that measures the degree to which the region matches the unit’s preferred edge orientation—for example, vertical, horizontal, or slanted at various degrees.
 
FIGURE 12: Illustration of how convolutions are used to detect vertical edges. For example, a convolution of the upper receptive field with the weights is (200 × 1) + (110 × 0) + (70 × −1) + (190 × 1) + (90 × 0) + (80 × −1) + (220 × 1) + (70 × 0) + (50 × −1) = 410.
 
Figure 12 illustrates in detail how the units in map 1—those that detect vertical edges—calculate their activations. The small white squares in the input image represent the receptive fields of two different units. The image patches inside these receptive fields, when enlarged, are shown as arrays of pixel values. Here, for simplicity, I’ve displayed each patch as a three-by-three set of pixels (the values, by convention, range from 0 to 255—the lighter the pixel, the higher the value). Each unit receives as input the pixel values in its receptive field. The unit then multiplies each input by its weight and sums the results to produce the unit’s activation.
 
The weights shown in figure 12 are designed to produce a high positive activation when there is a light-to-dark vertical edge in the receptive field (that is, high contrast between the left and the right sides of the input patch). The upper receptive field contains a vertical edge: the dog’s light fur next to the darker grass. This is reflected in the high activation value (410). The lower receptive field does not contain such an edge, only dark grass, and the activation (−10) is closer to 0. Note that a dark-to-light vertical edge will yield a “high” negative value (that is, a negative value far from 0).
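
The arithmetic in figure 12 is easy to reproduce. The sketch below is my own code, and the second patch's pixel values are invented (the figure reports only its activation); it multiplies each pixel in a three-by-three receptive field by the corresponding vertical-edge weight and sums the results.

<syntaxhighlight lang="python">
import numpy as np

# Vertical-edge weights: +1 on the left column, 0 in the middle, -1 on the
# right, so a light-to-dark vertical edge gives a large positive sum.
weights = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]])

# Pixel values of the upper receptive field in figure 12 (light fur on the
# left, darker grass on the right).
upper_patch = np.array([[200, 110, 70],
                        [190,  90, 80],
                        [220,  70, 50]])
print(np.sum(upper_patch * weights))   # 410

# A roughly uniform patch of dark grass (values invented here) gives an
# activation near 0, like the -10 reported for the lower receptive field.
lower_patch = np.array([[60, 62, 58],
                        [61, 59, 60],
                        [63, 60, 61]])
print(np.sum(lower_patch * weights))   # 5, close to 0
</syntaxhighlight>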
 
This calculation—multiplying each value in a receptive field by its corresponding weight and summing the results—is called a convolution. Hence the name “convolutional neural network.” I mentioned above that in a ConvNet, an activation map is a grid of units corresponding to receptive fields all over the image. Each unit in a given activation map uses the same weights to compute a convolution with its receptive field; imagine the input image with the white square sliding along every patch of the image.6 The result is the activation map in figure 12: the center pixel of a unit’s receptive field is colored white for high positive and negative activations and darker for activations close to 0. You can see that the white areas highlight the locations where vertical edges exist. Maps 2 and 3 in figure 11 were created in the same way, but with weights that highlight horizontal and slanted edges, respectively. Taken together, the maps of edge-detecting units in layer 1 provide the ConvNet with a representation of the input image in terms of oriented edges in different regions, something like what an edge-detection program would produce.
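
Sliding that same calculation across the whole image is all it takes to build an activation map. The following sketch is a bare-bones illustration under my own assumptions: a small grayscale image, no padding or stride options, and hand-fixed weights rather than learned ones.

<syntaxhighlight lang="python">
import numpy as np

def activation_map(image, weights):
    # Convolve a 3x3 weight kernel with every receptive field of the image,
    # producing one activation per position.
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]   # the unit's receptive field
            out[i, j] = np.sum(patch * weights)
    return out

vertical = np.array([[1, 0, -1]] * 3)            # detects vertical edges
horizontal = vertical.T                          # detects horizontal edges
image = np.random.randint(0, 256, size=(8, 8))   # a toy 8x8 grayscale image
maps = [activation_map(image, k) for k in (vertical, horizontal)]
print(maps[0].shape)   # (6, 6): one activation per receptive field
</syntaxhighlight>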
 
Let’s take a moment to talk about the word map here. In everyday use, map refers to a spatial representation of a geographic area, such as a city. A road map of Paris, say, shows a particular feature of the city—its layout of streets, avenues, and alleys—but doesn’t include the city’s many other features, such as buildings, houses, lampposts, trash cans, apple trees, and fishponds. Other kinds of maps focus on other features; you can find maps that highlight Paris’s bike lanes, its vegetarian restaurants, its dog-friendly parks. Whatever your interests, there is quite possibly a map that shows where to find them. If you wanted to explain Paris to a friend who had never been there, a creative approach might be to show your friend a collection of such “special interest” maps of the city.
 
A ConvNet (like the brain) represents the visual scene as a collection of maps, reflecting the specific “interests” of a set of detectors. In my example in figure 11, these interests are different edge orientations. However, as we’ll see below, in ConvNets the network itself learns what its interests (that is, detectors) should be; these depend on the specific task it is trained for.
 
Making maps isn’t limited to layer 1 of our ConvNet. As you can see in figure 10, a similar structure applies at all of the layers: each layer has a set of detectors, each of which creates its own activation map. A key to the ConvNet’s success is that—again, inspired by the brain—these maps are hierarchical: the inputs to the units at layer 2 are the activation maps of layer 1, the inputs to the units at layer 3 are the activation maps of layer 2, and so on up the layers. In our hypothetical network, in which layer 1 units respond to edges, the layer 2 units would be sensitive to specific combinations of edges, such as corners and T shapes. Layer 3 detectors would be sensitive to combinations of combinations of edges. As you go up the hierarchy, the detectors become sensitive to increasingly more complex features, just as Hubel, Wiesel, and others saw in the brain.
 
Our hypothetical ConvNet has four layers, each with three maps, but in the real world these networks can have many more layers—sometimes hundreds—each with different numbers of activation maps. Determining these and many other aspects of a ConvNet’s structure is part of the art of getting these complex networks to work for a given task. In chapter 3, I described I. J. Good’s vision of a future “intelligence explosion” in which machines themselves create increasingly intelligent machines. We’re not there yet. For the time being, getting ConvNets to work well requires a lot of human ingenuity.
 
===Classification in ConvNets ===
 
Layers 1 to 4 of our network are called convolutional layers because each performs convolutions on the preceding layer (and layer 1 performs convolutions on the input). Given an input image, each layer successively performs its calculations, and finally at layer 4 the network has produced a set of activation maps for relatively complex features. These might include eyes, leg shapes, tail shapes, or anything else that the network has learned is useful for classifying the objects it is trained on (here dogs and cats). At this point, it’s time for the classification module to use these features to predict what object the image depicts.
 
The classification module is actually an entire traditional neural network, similar to the kind I described in chapter 2.7 The inputs to the classification module are the activation maps from the highest convolutional layer. The module’s output is a set of percentage values, one for each possible category, rating the network’s confidence that the input depicts an image of that category (here dog or cat).
 
Let me summarize this brief explanation of ConvNets: Inspired by Hubel and Wiesel’s findings on the brain’s visual cortex, a ConvNet takes an input image and transforms it—via convolutions—into a set of activation maps with increasingly complex features. The features at the highest convolutional layer are fed into a traditional neural network (which I’ve called the classification module), which outputs confidence percentages for the network’s known object categories. The object category with the highest confidence is returned as the network’s classification of the image.8
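
For readers who want to see the shape of such a network in code, here is a minimal sketch in PyTorch. It is not the book's network: the channel counts, the 64-by-64 input size, and the size of the classification module are all my own assumptions; only the overall shape, four convolutional layers feeding a small traditional network that outputs two confidences, follows the description above.

<syntaxhighlight lang="python">
import torch
from torch import nn

convnet = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),    # layer 1: edge-like features
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),   # layer 2: combinations of edges
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # layer 3
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),  # layer 4: complex features
    nn.Flatten(),
    # classification module: a traditional fully connected network
    nn.Linear(32 * 64 * 64, 64), nn.ReLU(),
    nn.Linear(64, 2),                      # one score each for "dog" and "cat"
)

batch = torch.rand(1, 3, 64, 64)             # one 64x64 color image
print(torch.softmax(convnet(batch), dim=1))  # e.g. [[0.51, 0.49]] before any training
</syntaxhighlight>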
 
Would you like to experiment with a well-trained ConvNet? Simply take a photo of an object, and upload it to Google’s “search by image” engine.9 Google will run a ConvNet on your image and, based on the resulting confidences (over thousands of possible object categories), will tell you its “best guess” for the image.
 
===Training a ConvNet ===
 
Our hypothetical ConvNet consists of edge detectors at its first layer, but in real-world ConvNets edge detectors aren’t built in. Instead, ConvNets learn from training examples what features should be detected at each layer, as well as how to set the weights in the classification module so as to produce a high confidence for the correct answer. And, just as in traditional neural networks, all the weights can be learned from data via the same back-propagation algorithm that I described in chapter 2.
 
More specifically, here is how you could train our ConvNet to identify a given image as a dog or cat. First, collect many example images of dogs and cats—this is your “training set.” Also, create a file that gives a label for each image—that is, “dog” or “cat.” (Or better, take a hint from computer-vision researchers: Hire a graduate student to do all this for you. If you are a graduate student, then recruit an undergrad. No one enjoys this labeling chore!) Your training program initially sets all the weights in the network to random values. Then your program commences training: one by one, each image is given as the input to the network; the network performs its layer-by-layer calculations and finally outputs confidence percentages for “dog” and “cat.” For each image, your training program compares these output values to the “correct” values; for example, if the image is a dog, then “dog” confidence should be 100 percent and “cat” confidence should be 0 percent. Then the training program uses the back-propagation algorithm to change the weights throughout the network just a bit, so that the next time this image is seen, the confidences will be closer to the correct values.
 
Following this procedure—input the image, then calculate the error at the output, then change the weights—for every image in your training set is called one “epoch” of training. Training a ConvNet requires many epochs, during which the network processes each image over and over again. Initially, the network will be very bad at recognizing dogs and cats, but slowly, as it changes its weights over many epochs, it will get increasingly better at the task. Finally, at some point, the network “converges”; that is, the weights stop changing much from one epoch to the next, and the network is (in principle!) very good at recognizing dogs and cats in the images in the training set. But we won’t know if the network is actually good at this task in general until we see if it can apply what it has learned to identify images from outside its training set. What’s really interesting is that, even though ConvNets are not constrained by a programmer to learn to detect any particular feature, when trained on large sets of real-world photographs, they indeed seem to learn a hierarchy of detectors similar to what Hubel and Wiesel found in the brain’s visual system.
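
A rough sketch of that procedure in PyTorch follows, reusing the hypothetical convnet from the earlier sketch; the random placeholder images, the label encoding (0 for "dog", 1 for "cat"), the learning rate, and the number of epochs are all my own assumptions, not values from the book.

<syntaxhighlight lang="python">
import torch
from torch import nn

# Placeholder "training set": random tensors standing in for labeled photos.
training_set = [(torch.rand(3, 64, 64), i % 2) for i in range(10)]

loss_fn = nn.CrossEntropyLoss()                    # error between output and correct label
optimizer = torch.optim.SGD(convnet.parameters(), lr=0.01)

for epoch in range(5):                             # one epoch = one pass over the training set
    for image, label in training_set:
        output = convnet(image.unsqueeze(0))       # layer-by-layer forward pass
        loss = loss_fn(output, torch.tensor([label]))
        optimizer.zero_grad()
        loss.backward()                            # back-propagation computes the adjustments
        optimizer.step()                           # change the weights just a bit
</syntaxhighlight>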
 
In the next chapter, I’ll recount the extraordinary ascent of ConvNets from relative obscurity to near-complete dominance in machine vision, a transformation made possible by a concurrent technological revolution: that of “big data.”
==5 ConvNets and ImageNet ==
Yann LeCun, the inventor of ConvNets, has worked on neural networks all of his professional life, starting in the 1980s and continuing through the winters and springs of the field. As a graduate student and postdoctoral fellow, he was fascinated by Rosenblatt’s perceptrons and Fukushima’s neocognitron, but noted that the latter lacked a good supervised-learning algorithm. Along with other researchers (most notably, his postdoctoral advisor Geoffrey Hinton), LeCun helped develop such a learning method—essentially the same form of back-propagation used on ConvNets today.1
 
In the 1980s and ’90s, while working at Bell Labs, LeCun turned to the problem of recognizing handwritten digits and letters. He combined ideas from the neocognitron with the back-propagation algorithm to create the semi-eponymous “LeNet”—one of the earliest ConvNets. LeNet’s handwritten-digit-recognition abilities made it a commercial success: in the 1990s and into the 2000s it was used by the U.S. Postal Service for automated zip code recognition, as well as in the banking industry for automated reading of digits on checks.
 
LeNet and its successor ConvNets did not do well in scaling up to more complex vision tasks. By the mid-1990s, neural networks started falling out of favor in the AI community, and other methods came to dominate the field. But LeCun, still a believer, kept working on ConvNets, gradually improving them. As Geoffrey Hinton later said of LeCun, “He kind of carried the torch through the dark ages.”2
 
LeCun, Hinton, and other neural network loyalists believed that improved, larger versions of ConvNets and other deep networks would conquer computer vision if only they could be trained with enough data. Stubbornly, they kept working on the sidelines throughout the 2000s. In 2012, the torch carried by ConvNet researchers suddenly lit the vision world afire, by winning a computer-vision competition on an image data set called ImageNet.
 
===Building ImageNet ===
 
AI researchers are a competitive bunch, so it’s no surprise that they like to organize competitions to drive the field forward. In the field of visual object recognition, researchers have long held annual contests to determine whose program performs the best. Each of these contests features a “benchmark data set”: a collection of photos, along with human-created labels that name objects in the photos.
 
From 2005 to 2010, the most prominent of these annual contests was the PASCAL Visual Object Classes competition, which by 2010 featured about fifteen thousand photographs (downloaded from the photo-sharing site Flickr), with human-created labels for twenty object categories, such as “person,” “dog,” “horse,” “sheep,” “car,” “bicycle,” “sofa,” and “potted plant.”
 
The entries to the “classification” part of this contest3 were computer-vision programs that could take a photograph as input (without seeing its human-created label) and could then output, for each of the twenty categories, whether an object of that category was present in the image.
 
Here’s how the competition worked. The organizers would split the photographs into a training set that contestants could use to train their programs and a test set, not released to contestants, that would be used to gauge the programs’ performance on images outside the training set. Prior to the competition, the training set would be offered online, and when the contest was held, researchers would submit their trained programs to be tested on the secret test set. The winning entry was the one that had the highest accuracy recognizing objects in the test-set images.
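
In code, the scoring idea amounts to nothing more than checking a trained program's outputs against the hidden labels. The sketch below is my own simplification of that idea, not the contest's actual scoring code.

<syntaxhighlight lang="python">
def test_set_accuracy(program, secret_test_set):
    # Fraction of held-out images for which the program's predicted label
    # matches the human-created label. (A simplified stand-in for the
    # contest's actual scoring procedure.)
    correct = sum(program(image) == label for image, label in secret_test_set)
    return correct / len(secret_test_set)
</syntaxhighlight>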
 
The annual PASCAL competitions were a very big deal and did a lot to spur research in object recognition. Over the years of the challenge, the competing programs gradually got better (curiously, potted plants remained the hardest objects to recognize). However, some researchers were frustrated by the shortcomings of the PASCAL benchmark as a way to move computer vision forward. Contestants were focusing too much on PASCAL’s specific twenty object categories and were not building systems that could scale up to the huge number of object categories recognized by humans. Furthermore, there just weren’t enough photos in the data set for the competing systems to learn all the many possible variations in what the objects look like so as to be able to generalize well.
 
To move ahead, the field needed a new benchmark image collection, one featuring a much larger set of categories and vastly more photos. Fei-Fei Li, a young computer-vision professor at Princeton, was particularly focused on this goal. By serendipity, she learned of a project led by a fellow Princeton professor, the psychologist George Miller, to create a database of English words, arranged in a hierarchy moving from most specific to most general, with groupings among synonyms. For example, consider the word cappuccino. The database, called WordNet, contains the following information about this term (where an arrow means “is a kind of”):
 
cappuccino ⇒ coffee ⇒ beverage ⇒ food ⇒ substance ⇒ physical entity ⇒ entity
 
The database also contains information that, say, beverage, drink, and potable are synonyms, that beverage is part of another chain including liquid, and so forth.
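
WordNet itself is easy to explore programmatically. The sketch below uses the NLTK interface, which is an assumption on my part: it requires the nltk package and its WordNet data to be installed, and the exact chain it prints may differ slightly from the one shown above.

<syntaxhighlight lang="python">
from nltk.corpus import wordnet as wn

synset = wn.synsets("cappuccino")[0]      # WordNet's entry for "cappuccino"
chain = [synset]
while chain[-1].hypernyms():              # follow "is a kind of" links upward
    chain.append(chain[-1].hypernyms()[0])
print(" => ".join(s.name().split(".")[0] for s in chain))
# e.g. cappuccino => coffee => beverage => ... => entity

print(wn.synsets("beverage")[0].lemma_names())   # synonyms such as 'beverage', 'drink', 'potable'
</syntaxhighlight>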
 
WordNet had been (and continues to be) used extensively in research by psychologists and linguists as well as in AI natural-language processing systems, but Fei-Fei Li had a new idea: create an image database that is structured according to the nouns in WordNet, where each noun is linked to a large number of images containing examples of that noun. Thus the idea for ImageNet was born.
 
+
Li and her collaborators soon commenced collecting a deluge of images by using WordNet nouns as queries on image search engines such as Flickr and Google image search. However, if you’ve ever used an image search engine, you know that the results of a query are often far from perfect. For example, if you type “macintosh apple” into Google image search, you get photos not only of apples and Mac computers but also of apple-shaped candles, smartphones, bottles of apple wine, and any number of other nonrelevant items. Thus, Li and her colleagues had to have humans figure out which images were not actually illustrations of a given noun and get rid of them. At first, the humans who did this were mainly undergraduates. The work was agonizingly slow and taxing. Li soon figured out
that at the rate they were going, it would take ninety years to complete the task.4
+
 
+
Li and her collaborators brainstormed about possible ways to automate this work, but of course the problem of deciding if a photo is an instance of a particular noun is the task of object recognition itself! And computers were nowhere near to being reliable at this task, which was the whole reason for constructing ImageNet in the first place.
+
 
+
The group was at an impasse, until Li, by chance, stumbled upon a three-year-old website that could deliver the human smarts that ImageNet required. The website had the strange name Amazon Mechanical Turk.
+
 
+
===Mechanical Turk ===
+
 
+
According to Amazon, its Mechanical Turk service is “a marketplace for work that requires human intelligence.” The service connects requesters, people who need a task accomplished that is hard for computers, with workers, people who are willing to lend their human intelligence to a requester’s task, for a small fee (for example, labeling the objects in a photo, for ten cents per photo). Hundreds of thousands of workers have signed up, from all over the world. Mechanical Turk is the embodiment of Marvin Minsky’s “Easy things are hard” dictum: the human workers are hired to perform the “easy” tasks that are currently too hard for computers.
+
 
+
The name Mechanical Turk comes from a famous eighteenth-century AI hoax: the original Mechanical Turk was a chess-playing “intelligent machine,” which secretly hid a human who controlled a puppet (the “Turk,” dressed like an Ottoman sultan) that made the moves. Evidently, it fooled many prominent people of the time, including Napoleon Bonaparte. Amazon’s service, while not meant to fool anyone, is, like the original Mechanical Turk, “Artificial Artificial Intelligence.”5
+
 
+
Fei-Fei Li realized that if her group paid tens of thousands of workers on Mechanical Turk to sort out irrelevant images for each of the WordNet terms, the whole data set could be completed within a few years at a relatively low cost. In a mere two years, more than three million images were labeled with corresponding WordNet nouns to form the ImageNet data set. For the ImageNet project, Mechanical Turk was “a godsend.”6 The service continues to be widely used by AI researchers for creating data sets; nowadays, academic grant proposals in AI
commonly include a line item for “Mechanical Turk workers.”
+
 
+
===The ImageNet Competitions ===
+
 
+
In 2010, the ImageNet project launched the first ImageNet Large Scale Visual Recognition Challenge, in order to spur progress toward more general object-recognition algorithms. Thirty-five programs competed, representing computer-vision researchers from academia and industry around the world. The competitors were given labeled training images—1.2 million of them—and a list of possible categories. The task for the trained programs was to output the correct category of each input image. The ImageNet competition had a thousand possible categories, compared with PASCAL’s twenty.
+
 
+
The thousand possible categories were a subset of WordNet terms chosen by the organizers. The categories  are a random-looking assembly of nouns, ranging from the familiar and commonplace (“lemon,” “castle,” “grand piano”) to the somewhat less common (“viaduct,” “hermit crab,” “metronome”), and on to the downright obscure (“Scottish deerhound,” “ruddy turnstone,” “hussar monkey”). In fact, obscure animals and plants—at least ones I wouldn’t be able to distinguish—constitute at least a tenth of the thousand target categories.
+
 
+
Some of the photographs contain only one object; others contain many objects, including the “correct” one. Because of this ambiguity, a program gets to guess five categories for each image, and if the correct one is in this list, the program is said to be correct on this image. This is called the “top-5” accuracy metric.
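
To make the top-5 metric concrete, here is a minimal sketch, using NumPy and made-up scores rather than a real program's outputs, of how such an accuracy figure is computed:

<syntaxhighlight lang="python">
# Sketch of the "top-5" metric: a prediction counts as correct if the true
# category is anywhere among the program's five highest-scoring guesses.
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """scores: (n_images, n_categories) array of category scores;
       labels: (n_images,) array of correct category indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k best guesses
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return np.mean(hits)

rng = np.random.default_rng(0)
scores = rng.random((10, 1000))       # fake scores over 1,000 ImageNet categories
labels = rng.integers(0, 1000, 10)    # fake ground-truth labels
print(top_k_accuracy(scores, labels, k=5))
</syntaxhighlight>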
+
 
+
The highest-scoring program in 2010 used a so-called support vector machine, the predominant object-recognition algorithm of the day, which employed sophisticated mathematics to learn how to assign a category to each input image. Using the top-5 accuracy metric, this winning program was correct on 72 percent of the 150,000 test images. Not a bad showing, though this means that the program was wrong, even with five guesses allowed, on more than 40,000 of the test images, leaving a lot of room for improvement. Notably, there were no neural networks among the top-scoring programs.
+
 
+
The following year, the highest-scoring program—also using support vector machines—showed a respectable but modest improvement, getting 74 percent of the test images correct. Most people in the field expected this trend to continue; computer-vision research would chip away at the problem, with gradual improvement at each annual competition.
+
 
+
However, these expectations were upended in the 2012 ImageNet competition: the winning entry achieved an amazing 85 percent correct. Such a jump in accuracy was a shocking development. What’s more, the winning entry did not use support vector machines or any of the other dominant computer-vision methods of the day. Instead, it was a convolutional neural network. This particular ConvNet has come to be known as AlexNet, named after its main creator, Alex Krizhevsky, then a graduate student at the University of Toronto, supervised by the eminent neural network researcher Geoffrey Hinton. Krizhevsky, working with Hinton and a fellow student, Ilya Sutskever, created a scaled-up version of Yann LeCun’s LeNet from the 1990s; training such a large network was now made possible by increases in computer power. AlexNet had eight layers, with about sixty million weights whose values
were learned via back-propagation from the million-plus training images.7 The Toronto group came up with some clever methods for making the network training work better, and it took a cluster of powerful computers about a week to train AlexNet.
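
Readers who want to see those sixty million weights for themselves can instantiate the AlexNet architecture that ships with the torchvision library (a close relative of the original network, not Krizhevsky's exact code) and count its parameters; the weights= argument name assumes a recent torchvision version:

<syntaxhighlight lang="python">
# Sketch: build torchvision's (untrained) AlexNet and count its weights.
# The total comes out to roughly sixty million parameters.
from torchvision.models import alexnet

model = alexnet(weights=None)          # architecture only, no pretrained weights
n_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {n_params:,}")   # about 61 million
</syntaxhighlight>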
+
 
+
AlexNet’s success sent a jolt through the computer-vision and broader AI communities, suddenly waking people up to the potential power of ConvNets, which most AI researchers hadn’t considered a serious contender in modern computer vision. In a 2015 article, the journalist Tom Simonite interviewed Yann LeCun about the unexpected triumph of ConvNets:
+
 
+
LeCun recalls seeing the community that had mostly ignored neural networks pack into the room where the winners presented a paper on their results. “You could see right there a lot of senior people in the community just flipped,” he says. “They said, ‘Okay, now we buy it.
That’s it, now—you won.’”8
+
 
+
At almost the same time, Geoffrey Hinton’s group was also demonstrating that deep neural networks, trained on huge amounts of labeled data, were significantly better than the current state of the art in speech recognition. The Toronto group’s ImageNet and speech-recognition results had substantial ripple effects. Within a year, a small company started by Hinton was acquired by Google, and Hinton and his students Krizhevsky and Sutskever became Google employees. This acqui-hire instantly put Google at the forefront of deep learning.
+
 
+
Soon after, Yann LeCun was lured away from his full-time New York University professorship by Facebook to head up its newly formed AI lab. It didn’t take long before all the big tech companies (as well as many smaller ones) were snapping up deep-learning experts and their graduate students as fast as possible. Seemingly overnight,
deep learning became the hottest part of AI, and expertise in deep learning guaranteed computer scientists a large salary in Silicon Valley or, better yet, venture capital funding for their proliferating deep-learning start-up companies.
+
 
+
The annual ImageNet competition began to see wider coverage in the media, and it quickly morphed from a friendly academic contest into a high-profile sparring match for tech companies commercializing computer vision. Winning at ImageNet would guarantee coveted respect from the vision community, along with free publicity, which might translate into product sales and higher stock prices. The pressure to produce programs that outperformed competitors was notably manifest in a 2015 cheating incident involving the giant Chinese internet company Baidu. The cheating involved a subtle example of what people in machine learning call data snooping.
+
 
+
Here’s what happened: Before the competition, each team competing on ImageNet was given training images labeled with correct object categories. They were also given a large test set—a collection of images not in the training set—without any labels. Once a program was trained, a team could see how well their method performed on this test set. This helps test how well a program has learned to generalize (as opposed to, say, memorizing the training images and their labels). Only the performance on the test set counts. The way a team could find out how well their program did on the test set was to run their program on each test-set image, collect the top five guesses for each image, and submit this list to a “test server”—a computer run by the contest organizers. The test server would compare the submitted list with the (secret) correct answers and spit out the percentage correct.
+
 
+
Each team could sign up for an account on the test server and use it to see how well various versions of their programs were scoring; this would allow them to publish (and publicize) their results before the official results were announced.
+
 
+
A cardinal rule in machine learning is “Don’t train on the test data.” It seems obvious: If you include test data in any part of training your program, you won’t get a good measure of the program’s generalization abilities. It would be like giving students the questions on the final exam before they take the test. But it turns out that there are subtle ways that this rule can be unintentionally (or intentionally) broken to make your program’s performance look better than it actually is.
+
 
+
One such method would be to submit your program’s test-set answers to the test server and, based on the result, tweak your program. Then submit again. Repeat this many times, until you have tweaked it to do better on the test set. This doesn’t require seeing the actual labels in the test set, but it does require getting feedback on accuracy and adjusting your program accordingly. It turns out that if you can do this enough times, it can be very effective in improving your program’s performance on the test set. But because you’re using information from the test set to change your program, you’ve now destroyed the ability to use the test set to see if your program generalizes well. It would be like allowing students to take a final exam many times, each time getting back a single grade, but using that single grade to try to improve their performance the next time around. Then, at the end, the students submit the version of their answers that got them the best score. This is no longer a good measure of how well the students have learned the subject, just a measure of how they adapted their answers to particular test questions.
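
A toy simulation makes the danger concrete. In the sketch below (pure NumPy, with an invented binary task), every "submission" is random guessing, yet simply keeping the best of two hundred test-server scores produces an apparent improvement over chance:

<syntaxhighlight lang="python">
# Toy simulation of data snooping: submit many randomly "tweaked" models to a
# fixed test set and report only the best score. Even chance-level models
# appear to improve, because the reported maximum overfits the test set.
import numpy as np

rng = np.random.default_rng(1)
n_test, n_submissions = 1000, 200
true_labels = rng.integers(0, 2, n_test)            # binary task, chance = 50%

best = 0.0
for _ in range(n_submissions):                      # each "tweak" is pure guessing
    guesses = rng.integers(0, 2, n_test)
    best = max(best, np.mean(guesses == true_labels))

print(f"best reported accuracy after {n_submissions} submissions: {best:.3f}")
# typically around 0.54, even though every submission was chance-level
</syntaxhighlight>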
+
 
+
To prevent this kind of data snooping while still allowing the ImageNet competitors to see how well their programs are doing, the organizers set a rule saying that each team could submit answers to the test server at most twice per week. This would limit the amount of feedback the teams could glean from the test runs.
+
 
+
The great ImageNet battle of 2015 was fought over a fraction of a percentage point—seemingly trivial but potentially very lucrative. Early in the year, a team from Baidu announced a method that achieved the highest (top-5) accuracy yet on an ImageNet test set: 94.67 percent, to be exact. But on the very same day, a team from Microsoft announced a better accuracy with their method: 95.06 percent. A few days later, a rival team from Google announced a slightly different method that did even better: 95.18 percent. This record held for a few months, until Baidu made a new announcement: it had improved its method and now could boast a new record, 95.42 percent. This result was widely publicized by Baidu’s public relations team.
+
 
+
But within a few weeks, a terse announcement came from the ImageNet organizers: “During the period of November 28th, 2014 to May 13th, 2015, there were at least 30 accounts used by a team from Baidu to submit to the test server at least 200 times, far exceeding the specified limit of two submissions per week.”9 In short, the Baidu team had been caught data snooping.
+
 
+
The two hundred points of feedback potentially allowed the Baidu team to determine which tweaks to their
program would make it perform best on this test set, gaining it the all-important fraction of a percentage point that made the win. As punishment, Baidu was disqualified from entering its program in the 2015 competition.
+
 
+
Baidu, hoping to minimize bad publicity, promptly apologized and then laid the blame on a rogue employee:
“We found that a team leader had directed junior engineers to submit more than two submissions per week, a breach of the current ImageNet rules.”10 The employee, though disputing that he had broken any rules, was promptly fired from the company.
+
 
+
While this story is merely an interesting footnote to the larger history of deep learning in computer vision, I
tell it to illustrate the extent to which the ImageNet competition came to be seen as the key symbol of progress in computer vision, and AI in general.
+
 
+
Cheating aside, progress on ImageNet continued. The final competition was held in 2017, with a winning top-5 accuracy of 98 percent. As one journalist commented, “Today, many consider ImageNet solved,”11 at least for the classification task. The community is moving on to new benchmark data sets and new problems, especially ones that integrate vision and language.
+
 
+
What was it that enabled ConvNets, which seemed to be at a dead end in the 1990s, to suddenly dominate the
ImageNet competition, and subsequently most of computer vision in the last half a decade? It turns out that the recent success of deep learning is due less to new breakthroughs in AI than to the availability of huge amounts of data (thank you, internet!) and very fast parallel computer hardware. These factors, along with improvements in training methods, allow hundred-plus-layer networks to be trained on millions of images in just a few days.
+
 
+
Yann LeCun himself was taken by surprise at how fast things turned around for his ConvNets: “It’s rarely the case where a technology that has been around for 20, 25 years—basically unchanged—turns out to be the best. The speed at which people have embraced it is nothing short of amazing. I’ve never seen anything like this before.”12
+
 
+
===The ConvNet Gold Rush ===
+
 
+
Once ImageNet and other large data sets gave ConvNets the vast amount of training examples they needed to work well, companies were suddenly able to apply computer vision in ways never seen before. As Google’s Blaise Agüera y Arcas remarked, “It’s been a sort of gold rush—attacking one problem after another with the same set of techniques.”13 Using ConvNets trained with deep learning, image search engines offered by Google, Microsoft, and others were able to vastly improve their “find similar images” feature. Google offered a photo-storage system that
would tag your photos by describing the objects they contained, and Google’s Street View service could recognize and blur out street addresses and license plates in its images. A proliferation of mobile apps enabled smartphones to perform object and face recognition in real time.
+
 
+
Facebook labeled your uploaded photos with names of your friends and registered a patent on classifying the emotions behind facial expressions in uploaded photos; Twitter developed a filter that could screen tweets for pornographic images; and several photo- and video-sharing sites started applying tools to detect imagery associated with terrorist groups. ConvNets can be applied to video and used in self-driving cars to track pedestrians, or to read lips and classify body language. ConvNets can even diagnose breast and skin cancer from medical images, determine the stage of diabetic retinopathy, and assist physicians in treatment planning for prostate cancer.
+
 
+
These are just a few examples of the many existing (or soon-to-exist) commercial applications powered by ConvNets. In fact, there’s a good chance that any modern computer-vision application you use employs ConvNets. Moreover, there’s an excellent chance it was “pretrained” on images from ImageNet to learn generic visual features before being “fine-tuned” for more specific tasks.
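
The pretrain-then-fine-tune recipe is straightforward to express in code. Here is a hedged sketch using torchvision's ResNet-18 as a stand-in for whatever network a given product actually uses; the weights argument assumes a recent torchvision version, and the five-class head is an arbitrary example:

<syntaxhighlight lang="python">
# Sketch of "pretrain on ImageNet, then fine-tune for a specific task."
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # generic ImageNet features
for p in model.parameters():
    p.requires_grad = False                 # freeze the pretrained backbone

model.fc = nn.Linear(model.fc.in_features, 5)   # new head for a 5-category task
# ...then train only model.fc on the task-specific labeled images.
</syntaxhighlight>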
+
 
+
Given that the extensive training required by ConvNets is feasible only with specialized computer hardware—typically, powerful graphics processing units (GPUs)—it is not surprising that the stock price of the NVIDIA Corporation, the most prominent maker of GPUs, increased by over 1,000 percent between 2012 and 2017.
+
 
+
===Have ConvNets Surpassed Humans at Object Recognition? ===
+
 
+
As I learned more about the remarkable success of ConvNets, I wondered how close they were to rivaling our own human object-recognition abilities. A 2015 paper from Baidu (post–cheating scandal) carried the subtitle “Surpassing Human-Level Performance on ImageNet Classification.”14 At about the same time, Microsoft announced in a research blog “a major advance in technology designed to identify the objects in a photograph or video, showcasing a system whose accuracy meets and sometimes exceeds human-level performance.”15 While both companies made it clear they were talking about accuracy specifically on ImageNet, the media were not so careful, giving way to sensational headlines such as “Computers Now Better than Humans at Recognising and Sorting
Images” and “Microsoft Has Developed a Computer System That Can Identify Objects Better than Humans.”16
+
 
+
Let’s look a bit harder at the specific contention that machines are now “better than humans” at object recognition on ImageNet. This assertion is based on a claim that humans have an error rate of about 5 percent, whereas the error rate of machines is (at the time of this writing) close to 2 percent. Doesn’t this confirm that machines are better than humans at this task? As is often the case for highly publicized claims about AI, the claim comes with a few caveats.
+
 
+
Here’s one caveat. When you read about a machine “identifying objects correctly,” you’d think that, say, given an image of a basketball, the machine would output “basketball.” But of course, on ImageNet, correct identification means only that the correct category is in the machine’s top-five categories. If, given an image of a basketball, the machine outputs “croquet ball,” “bikini,” “warthog,” “basketball,” and “moving van,” in that order, it is considered correct. I don’t know how often this kind of thing happens, but it’s notable that the best top-1 accuracy
—the fraction of test images on which the correct category is at the top of the list—was about 82 percent, compared with 98 percent top-5 accuracy, in the 2017 ImageNet competition. No one, as far as I know, has reported a comparison between machines and humans on top-1 accuracy.
+
 
+
Here’s another caveat. Consider the claim, “Humans have an error rate of about 5% on ImageNet.” It turns out that saying “humans” is not quite accurate; this result is from an experiment involving a single human, one Andrej Karpathy, who was at the time a graduate student at Stanford, researching deep learning. Karpathy wanted to see if he could train himself to compete against the best ConvNets on ImageNet. Considering that ConvNets train on 1.2 million images and then are run on 150,000 test images, this is a daunting task for a human. Karpathy, who has a popular blog about AI, wrote about his experience:
+
 
+
I ended up training [myself] on 500 images and then switched to [a reduced] test set of 1,500 images. The labeling [that is, Karpathy’s guessing five categories per image] happened at a rate of about 1 per minute, but this decreased over time. I only enjoyed the first ~200, and the rest I only did #forscience.… Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds,
or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs.17
+
 
+
Karpathy found that he was wrong on about 75 of his 1,500 test images, and he went on to analyze the errors he made, finding that they were largely due to images with multiple objects, images with specific breeds of dogs, species of birds or plants, and so on, and object categories that he didn’t realize were included in the target categories. The kinds of errors made by ConvNets are different: while they also get confused by images containing multiple objects, unlike humans they tend to miss objects that are small in the image, objects that have been distorted by color or contrast filters the photographer applied to the image, and “abstract representations” of objects, such as a painting or statue of a dog, or a stuffed toy dog. Thus, the claim that computers have bested humans on ImageNet needs to be taken with a large grain of salt.
+
 
+
Here’s a caveat that might surprise you. When a human says that a photo contains, say, a dog, we assume it’s because the human actually saw a dog in the photo. But if a ConvNet correctly says “dog,” how do we know it actually is basing this classification on the dog in the image? Maybe there’s something else in the image—a tennis ball, a Frisbee, a chewed-up shoe—that was often associated with dogs in the training images, and the ConvNet is recognizing these and assuming there is a dog in the photo. These kinds of correlations have often ended up fooling machines.
+
 
+
One thing we could do is ask the machine to not only output an object category for an image but also learn to draw a box around the target object, so we know the machine has actually “seen” the object. This is precisely what the ImageNet competition started doing in its second year with its “localization challenge.” The localization task provided training images with such boxes drawn (by Mechanical Turk workers) around the target object(s) in each image; on the test images, the task for competing programs was to predict five object categories each with the coordinates of a corresponding box. What may be surprising is that while deep convolutional neural networks have performed very well at localization, their performance has remained significantly worse than their performance on categorization, although newer competitions are focusing on precisely this problem.
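
The usual way to score a predicted box against the human-drawn one is "intersection over union" (IoU); in localization challenges, a guess commonly counts as correct only if the category matches and the boxes overlap sufficiently (an IoU of at least 0.5 is the conventional threshold). Here is a minimal sketch, with made-up box coordinates:

<syntaxhighlight lang="python">
# Sketch: intersection over union (IoU), a standard score for comparing a
# predicted bounding box with the human-drawn one. Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)        # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((10, 10, 60, 60), (30, 30, 80, 80)))   # about 0.22: poor localization
print(iou((10, 10, 60, 60), (15, 12, 62, 58)))   # about 0.80: good localization
</syntaxhighlight>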
+
 
+
Probably the most important differences between today’s ConvNets and humans when it comes to recognizing objects are in how learning takes place and in how robust and reliable that learning turns out to be. I’ll explore these differences in the next chapter.
+
 
+
The caveats I described above aren’t meant to diminish the amazing recent progress in computer vision. There is no question that convolutional neural networks have been stunningly successful in this and other areas, and these successes have not only produced commercial products but also resulted in a real sense of optimism in the AI
community. My discussion is meant to illustrate how challenging vision turns out to be and to add some perspective on the progress made so far. Object recognition is not yet close to being “solved” by artificial intelligence.
+
 
+
===Beyond Object Recognition ===
+
 
+
I have focused on object recognition in this chapter because this has been the area in which computer vision has recently seen the most progress. However, there’s obviously a lot more to vision than just recognizing objects. If the goal of computer vision is to “get a machine to describe what it sees,” then machines will need to recognize not only objects but also their relationships to one another and how they interact with the world. If the “objects” in question are living beings, the machines will need to know something about their actions, goals, emotions, likely next steps, and all the other aspects that figure into telling the story of a visual scene. Moreover, if we really want the machines to describe what they see, they will need to use language. AI researchers are actively working on getting machines to do these things, but as usual these “easy” things are very hard. As the computer-vision expert Ali Farhadi told The New York Times, “We’re still very, very far from visual intelligence, understanding scenes and actions the way
humans do.”18

Why are we still so far from this goal? It seems that visual intelligence isn’t easily separable from the rest of intelligence, especially general knowledge, abstraction, and language—abilities that, interestingly, involve parts of the brain that have many feedback connections to the visual cortex. Additionally, it could be that the knowledge needed for humanlike visual intelligence—for example, making sense of the “soldier and dog” photo at the beginning of the previous chapter—can’t be learned from millions of pictures downloaded from the web, but has to be experienced in some way in the real world.

In the next chapter, I’ll look more closely at machine learning in vision, focusing in particular on the differences between the ways humans and machines learn and trying to tease out just what the machines we have trained have actually learned.

FIGURE 12: Illustration of how convolutions are used to detect vertical edges. For example, a convolution of the upper receptive field with the weights is (200 × 1) + (110 × 0) + (70 × −1) + (190 × 1) + (90 × 0) + (80 × −1) + (220 × 1) + (70 × 0) + (50 × −1) = 410.

Figure 12 illustrates in detail how the units in map 1—those that detect vertical edges—calculate their activations. The small white squares in the input image represent the receptive fields of two different units. The image patches inside these receptive fields, when enlarged, are shown as arrays of pixel values. Here, for simplicity, I’ve displayed each patch as a three-by-three set of pixels (the values, by convention, range from 0 to 255—the lighter the pixel, the higher the value). Each unit receives as input the pixel values in its receptive field. The unit then multiplies each input by its weight and sums the results to produce the unit’s activation.

The weights shown in figure 12 are designed to produce a high positive activation when there is a light-to-dark vertical edge in the receptive field (that is, high contrast between the left and the right sides of the input patch). The upper receptive field contains a vertical edge: the dog’s light fur next to the darker grass. This is reflected in the high activation value (410). The lower receptive field does not contain such an edge, only dark grass, and the activation (−10) is closer to 0. Note that a dark-to-light vertical edge will yield a “high” negative value (that is, a negative value far from 0).

This calculation—multiplying each value in a receptive field by its corresponding weight and summing the results—is called a convolution. Hence the name “convolutional neural network.” I mentioned above that in a ConvNet, an activation map is a grid of units corresponding to receptive fields all over the image. Each unit in a given activation map uses the same weights to compute a convolution with its receptive field; imagine the input image with the white square sliding along every patch of the image.

The result is the activation map in figure 12: the center pixel of a unit’s receptive field is colored white for high positive and negative activations and darker for activations close to 0. You can see that the white areas highlight the locations where vertical edges exist. Maps 2 and 3 in figure 11 were created in the same way, but with weights that highlight horizontal and slanted edges, respectively. Taken together, the maps of edge-detecting units in layer 1 provide the ConvNet with a representation of the input image in terms of oriented edges in different regions, something like what an edge-detection program would produce.
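
The arithmetic in figure 12 is easy to reproduce. Here is a brief NumPy sketch of that single convolution, using the vertical-edge weights and the pixel values from the upper receptive field:

<syntaxhighlight lang="python">
# Sketch of the convolution from figure 12: a 3x3 patch of pixel values is
# multiplied element-wise by the edge-detector weights and the results summed.
import numpy as np

weights = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]])           # light-to-dark vertical-edge detector

patch = np.array([[200, 110, 70],
                  [190,  90, 80],
                  [220,  70, 50]])         # pixel values from the upper receptive field

print(np.sum(patch * weights))             # 410: a strong vertical edge
</syntaxhighlight>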
  
Let’s take a moment to talk about the word map here. In everyday use, map refers to a spatial representation of a geographic area, such as a city. A road map of Paris, say, shows a particular feature of the city—its layout of streets, avenues, and alleys—but doesn’t include the city’s many other features, such as buildings, houses, lampposts, trash cans, apple trees, and fishponds. Other kinds of maps focus on other features; you can find maps that highlight Paris’s bike lanes, its vegetarian restaurants, its dog-friendly parks. Whatever your interests, there is quite possibly a map that shows where to find them. If you wanted to explain Paris to a friend who had never been there, a creative approach might be to show your friend a collection of such “special interest” maps of the city.

A ConvNet (like the brain) represents the visual scene as a collection of maps, reflecting the specific “interests” of a set of detectors. In my example in figure 11, these interests are different edge orientations. However, as we’ll see below, in ConvNets the network itself learns what its interests (that is, detectors) should be; these depend on the specific task it is trained for.

Making maps isn’t limited to layer 1 of our ConvNet. As you can see in figure 10, a similar structure applies at all of the layers: each layer has a set of detectors, each of which creates its own activation map. A key to the ConvNet’s success is that—again, inspired by the brain—these maps are hierarchical: the inputs to the units at layer 2 are the activation maps of layer 1, the inputs to the units at layer 3 are the activation maps of layer 2, and so on up the layers. In our hypothetical network, in which layer 1 units respond to edges, the layer 2 units would be sensitive to specific combinations of edges, such as corners and T shapes. Layer 3 detectors would be sensitive to combinations of combinations of edges. As you go up the hierarchy, the detectors become sensitive to increasingly more complex features, just as Hubel, Wiesel, and others saw in the brain.

Our hypothetical ConvNet has four layers, each with three maps, but in the real world these networks can have many more layers—sometimes hundreds—each with different numbers of activation maps. Determining these and many other aspects of a ConvNet’s structure is part of the art of getting these complex networks to work for a given task. In chapter 3, I described I. J. Good’s vision of a future “intelligence explosion” in which machines themselves create increasingly intelligent machines. We’re not there yet. For the time being, getting ConvNets to work well requires a lot of human ingenuity.

===Classification in ConvNets ===

Layers 1 to 4 of our network are called convolutional layers because each performs convolutions on the preceding layer (and layer 1 performs convolutions on the input). Given an input image, each layer successively performs its calculations, and finally at layer 4 the network has produced a set of activation maps for relatively complex features. These might include eyes, leg shapes, tail shapes, or anything else that the network has learned is useful for classifying the objects it is trained on (here dogs and cats). At this point, it’s time for the classification module to use these features to predict what object the image depicts.

==A Closer Look at Machines That Learn ==

The deep-learning pioneer Yann LeCun has received many awards and accolades, but perhaps his ultimate (if geeky) honor is being the subject of a widely followed and very funny parody Twitter account sporting the name “Bored Yann LeCun.” With the description “Musing on the rise of deep learning in Yann’s downtime,” the anonymously authored account frequently ends its clever in-joke tweets with the hashtag #FeelTheLearn.1

Indeed, media reports on cutting-edge AI have been “feeling the learn” by celebrating the power of deep learning—emphasis on “learning.” We are told, for example, that “we can now build systems that learn how to perform tasks on their own,”2 that “deep learning [enables] computers to literally teach themselves,”3 and that deep-learning systems learn “in a way similar to the human brain.”4

In this chapter, I’ll look in more detail at how machines—particularly ConvNets—learn and how their learning processes contrast with those of humans. Furthermore, I’ll explore how differences between learning in ConvNets and in humans affect the robustness and trustworthiness of what is learned.
 
+
===Learning on One’s Own ===
+
 
+
The learning-from-data approach of deep neural networks has generally proved to be more successful than the “good old-fashioned AI” strategy, in which human programmers construct explicit rules for intelligent behavior. However, contrary to what some media have reported, the learning process of ConvNets is not very humanlike.
+
 
+
As we’ve seen, the most successful ConvNets learn via a supervised-learning procedure: they gradually change their weights as they process the examples in the training set again and again, over many epochs (that is, many passes through the training set), learning to classify each input as one of a fixed set of possible output categories. In contrast, even the youngest children learn an open-ended set of categories and can recognize instances of most categories after seeing only a few examples. Moreover, children don’t learn passively: they ask questions, they demand information on the things they are curious about, they infer abstractions of and connections between concepts, and, above all, they actively explore the world.
+
 
+
It is inaccurate to say that today’s successful ConvNets learn “on their own.” As we saw in the previous chapter, in order for a ConvNet to learn to perform a task, a huge amount of human effort is required to collect, curate, and label the data, as well as to design the many aspects of the ConvNet’s architecture. While ConvNets use back-propagation to learn their “parameters” (that is, weights) from training examples, this learning is enabled by a collection of what are called “hyperparameters”—an umbrella term that refers to all the aspects of the network that need to be set up by humans to allow learning to even begin. Examples of hyperparameters include the number of layers in the network, the size of the units’ “receptive fields” at each layer, how large the change in each weight should be during learning (called the learning rate), and many other technical details of the training process. This part of setting up a ConvNet is called tuning the hyperparameters. There are many values to set as well as complex design decisions to be made, and these settings and designs interact with one another in complex ways to affect the ultimate performance of the network. Moreover, these settings and designs must typically be decided anew for each task a network is trained on.
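
To give a flavor of what these choices look like in practice, here is an illustrative (and entirely arbitrary) set of hyperparameters for a small ConvNet; none of these numbers come from any particular published network:

<syntaxhighlight lang="python">
# Illustrative only: the kinds of settings a human must choose before a ConvNet
# can begin learning its weights. The values here are arbitrary placeholders.
hyperparameters = {
    "num_conv_layers": 4,                  # depth of the network
    "kernel_size": 3,                      # size of each unit's receptive field
    "maps_per_layer": [32, 64, 128, 256],  # number of activation maps at each layer
    "learning_rate": 0.001,                # how much each weight changes per update
    "batch_size": 128,                     # images processed between weight updates
    "epochs": 30,                          # passes through the training set
}
print(hyperparameters)
</syntaxhighlight>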
+
 
+
Tuning the hyperparameters might sound like a pretty mundane activity, but doing it well is absolutely crucial to the success of ConvNets and other machine-learning systems. Because of the open-ended nature of designing these networks, in general it is not possible to automatically set all the parameters and designs, even with automated search. Often it takes a kind of cabalistic knowledge that students of machine learning gain both from their apprenticeships with experts and from hard-won experience. As Eric Horvitz, director of Microsoft’s research lab,
characterized it, “Right now, what we are doing is not a science but a kind of alchemy.”5 And the people who can do this kind of “network whispering” form a small, exclusive club: according to Demis Hassabis, cofounder of Google DeepMind, “It’s almost like an art form to get the best out of these systems.… There’s only a few hundred people in the world that can do that really well.”6
+
 
+
Actually, the number of deep-learning experts is growing quickly; many universities now offer courses in the subject, and a growing list of companies have started their own deep-learning training programs for employees. Membership in the deep-learning club can be quite lucrative. At a recent conference I attended, a leader of Microsoft’s AI product group spoke to the audience about the company’s efforts to hire young deep-learning engineers: “If a kid knows how to train five layers of neural networks, the kid can demand five figures. If the kid
knows how to train fifty layers, the kid can demand seven figures.”7 Lucky for this soon-to-be-wealthy kid, the networks can’t yet teach themselves.
+
 
+
===Big Data ===
+
 
+
It’s no secret: deep learning requires big data. Big in the sense of the million-plus labeled training images in ImageNet. Where does all this data come from? The answer is, of course, you—and probably everyone you know. Modern computer-vision applications are possible only because of the billions of images that internet users have uploaded and (sometimes) tagged with text identifying what is in the image. Have you ever put a photo of a friend on your Facebook page and commented on it? Facebook thanks you! That image and text might have been used to train its face-recognition system. Have you ever uploaded an image to Flickr? If so, it’s possible your image is part of the ImageNet training set. Have you ever identified a picture in order to prove to a website that you’re not a robot? Your identification might have helped Google tag an image for use in training its image search system.
+
 
+
Big tech companies offer many services for free on your computer and smartphone: web search, video calling, email, social networking, automated personal assistants—the list goes on. What’s in it for these companies? The answer you might have heard is that their true product is their users (like you and me); their customers are the advertisers who grab our attention and information about us while we use these “free” services. But there’s a second answer: when we use services provided by tech companies such as Google, Amazon, and Facebook, we are directly providing these companies with examples—in the form of our images, videos, text, or speech—that they can utilize to better train their AI programs. And these improved programs attract more users (and thus more data), helping advertisers to target their ads more effectively. Moreover, the training examples we provide them can be used to train and offer enterprise services such as computer vision and natural-language processing to businesses for a fee.
+
 
+
Much has been written about the ethics of these big companies using data you have created (such as all the images, videos, and text that you upload to Facebook) to train programs and sell products without informing or compensating you. This is an important discussion but beyond the scope of this book.8 The point I want to make here is that the reliance on extensive collections of labeled training data is one more way in which deep learning differs from human learning.
+
 
+
With the proliferation of deep-learning systems in real-world applications, companies are finding themselves in need of new labeled data sets for training deep neural networks. Self-driving cars are a noteworthy example. These cars need sophisticated computer vision in order to recognize lanes in the road, traffic lights, stop signs, and so on, and to distinguish and track different kinds of potential obstacles, such as other cars, pedestrians, bicyclists, animals, traffic cones, knocked-over garbage cans, tumbleweeds, and anything else that you might not want your car to hit. Self-driving cars need to learn what these various objects look like—in sun, rain, snow, or fog, day or night— and which objects are likely to move and which will stay put. Deep learning has helped make this task possible, at least in part, but deep learning, as always, requires a profusion of training examples.
+
 
+
Self-driving car companies collect these training examples from countless hours of video taken by cameras mounted on actual cars driving in traffic on highways and city streets. These cars may be self-driving prototypes being tested by companies or, in the case of Tesla, cars driven by customers who, upon purchase of a Tesla vehicle, must agree to a data-sharing policy with the company.9
+
 
+
Tesla owners aren’t required to label every object on the videos taken by their cars. But someone has to. In
2017, the Financial Times reported that “most companies working on this technology employ hundreds or even thousands of people, often in offshore outsourcing centres in India or China, whose job it is to teach the robo-cars to recognize pedestrians, cyclists and other obstacles. The workers do this by manually marking up or ‘labeling’
thousands of hours of video footage, often frame by frame.”10 New companies have sprung up to offer labeling data as a service; Mighty AI, for example, offers “the labeled data you need to train your computer vision models” and promises “known, verified, and trusted annotators who specialize in autonomous driving data.”11
+
 
+
===The Long Tail ===
+
 
+
The supervised-learning approach, using large data sets and armies of human annotators, works well for at least some of the visual abilities needed for self-driving cars (many companies are also exploring the use of video-game-like driving-simulation programs to augment supervised training). But what about in the rest of life? Virtually everyone working in the AI field agrees that supervised learning is not a viable path to general-purpose AI. As the renowned AI researcher Andrew Ng has warned, “Requiring so much data is a major limitation of [deep learning] today.”12 Yoshua Bengio, another high-profile AI researcher, agrees: “We can’t realistically label everything in the world and meticulously explain every last detail to the computer.”13
+
 
+
FIGURE 13: Possible situations a self-driving car might encounter, ranked by likelihood, illustrating the “long tail” of unlikely scenarios
+
 
+
This issue is compounded by the so-called long-tail problem: the vast range of possible unexpected situations an AI system could be faced with. Figure 13 illustrates this phenomenon by giving the likelihood of various hypothetical situations that a self-driving car might encounter during, say, a day’s worth of driving. Very common situations, such as encountering a red traffic light or a stop sign, are rated as having high likelihood; medium- likelihood situations include broken glass and wind-whipped plastic bags—not encountered every day (depending on where you drive), but not uncommon. It is less likely that your self-driving car would encounter a flooded road or lane markings obscured by snow, and even less likely that you would face a snowman in the middle of a high-speed
road.
+
 
+
I conjured up these different scenarios and guessed at their relative likelihood; I’m sure you can come up with
many more of your own. Any individual car is probably safe: after all, taken together, experimental autonomous cars have driven millions of miles and have caused a relatively small number of accidents (albeit a few high-profile fatal ones). But once self-driving cars are widespread, while each individual unlikely situation is, by definition, very unlikely, there are so many possible scenarios in the world of driving and so many cars that some self-driving car somewhere is likely to encounter one of them at some point.
+
 
+
The term long tail comes from statistics, in which certain probability distributions are shaped like the one in figure 13: the long list of very unlikely (but possible) situations is called the “tail” of the distribution. (The situations in the tail are sometimes called edge cases.) Most real-world domains for AI exhibit this kind of long-tail phenomenon: events in the real world are usually predictable, but there remains a long tail of low-probability, unexpected occurrences. This is a problem if we rely solely on supervised learning to provide our AI system with its knowledge of the world; the situations in the tail don’t show up in the training data often enough, if at all, so the system is more likely to make errors when faced with such unexpected cases.
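
A toy simulation illustrates why the tail matters. In the sketch below, "situations" are drawn from a heavy-tailed (Zipf-like) distribution: each individual tail situation is vanishingly rare, yet across a hundred thousand draws the tail as a whole turns up hundreds of times (the distribution and numbers are invented for illustration):

<syntaxhighlight lang="python">
# Toy illustration of the long tail: individually rare situations are,
# taken together, not rare at all.
import numpy as np

rng = np.random.default_rng(0)
# 1 = the most common situation (say, a red light); large values = rare edge cases
situations = rng.zipf(a=2.0, size=100_000)

common_share = np.mean(situations <= 10)   # the ten most common situations
tail_hits = np.sum(situations > 100)       # everything deep in the tail
print(f"ten most common situations cover roughly {common_share:.0%} of draws")
print(f"situations from deep in the tail still occur {tail_hits} times")
</syntaxhighlight>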
+
 
+
Here are two real-world examples. In March 2016, there was a massive snowstorm forecast in the Northeast of the United States, and reports appeared on Twitter that Tesla vehicles’ Autopilot mode, which enables limited autonomous driving, was getting confused between lane markings and salt lines laid out on the highway in anticipation of the storm (figure 14). In February 2016, one of Google’s prototype self-driving cars, while making a right turn, had to veer to the left to avoid sandbags on the right side of a California road, and the car’s left front struck a public bus driving in the left lane. Each vehicle had expected the other to yield (perhaps the bus driver expected a human driver who would be more intimidated by the much larger bus).
+
 
+
Companies working on autonomous-vehicle technology are acutely aware of the long-tail problem: their teams brainstorm possible long-tail scenarios and actively create extra training examples as well as specially coded strategies for all the unlikely scenarios they can come up with. But of course it is impossible to train or code a system for all the possible situations it might encounter.
+
 
+
FIGURE 14: Salt lines on a highway, in advance of a forecasted snowstorm, were reported to be confusing Tesla’s Autopilot feature.
+
 
+
A commonly proposed solution is for AI systems to use supervised learning on small amounts of labeled data and learn everything else via unsupervised learning. The term unsupervised learning refers to a broad group of methods for learning categories or actions without labeled data. Examples include methods for clustering examples based on their similarity or learning a new category via analogy to known categories. As I’ll describe in a later chapter, perceiving abstract similarity and analogies is something at which humans excel, but to date there are no very successful AI methods for this kind of unsupervised learning. Yann LeCun himself acknowledges that “unsupervised learning is the dark matter of AI.” In other words, for general AI, almost all learning will have to be unsupervised, but no one has yet come up with the kinds of algorithms needed to perform successful unsupervised learning.
+
 
+
Humans make mistakes all the time, even (or especially) in driving; any one of us might have hit that public bus, had we been the one veering around sandbags. But humans also have a fundamental competence lacking in all current AI systems: common sense. We have vast background knowledge of the world, both its physical and its social aspects. We have a good sense of how objects—both inanimate and living—are likely to behave, and we use this knowledge extensively in making decisions about how to act in any given situation. We can infer the reason behind salt lines on the road even if we have never driven in snow before. We know how to interact socially with other humans, so we can use eye contact, hand signals, and other body language to deal with broken traffic lights during a power failure. We generally know to yield the road to a large public bus, even if we technically have the right of way. I’ve used driving as an example here, but we humans use common sense—usually subconsciously—in every facet of life. Many people believe that until AI systems have common sense as humans do, we won’t be able to trust them to be fully autonomous in complex real-world situations.
+
 
+
===What Did My Network Learn? ===
+
 
+
A few years ago, Will Landecker, then a graduate student in my research group, trained a deep neural network to classify photographs into one of two categories: “contains an animal” and “does not contain an animal.” The network was trained on photos like the ones in figure 15, and it performed very well on this task on the test set. But what did the network actually learn? By performing a careful study, Will found an unexpected answer: in part, the network learned to classify images with blurry backgrounds as “contains an animal,” whether or not the image
actually contained an animal.14 The nature photos in the training and test sets obeyed an important rule of photography: focus on the subject of the photo. When the subject of the photo is an animal, the animal is the focus and the background is blurred, as in figure 15A. When the subject of the photo is the background, as in figure 15B, nothing is blurred. To Will’s chagrin, his network hadn’t learned to recognize animals; instead, it used simpler cues
—such as blurry backgrounds—that were statistically associated with animals.
+
 
+
FIGURE 15: Illustration of “animal” versus “no animal” classification task. Note the blurry background in the image on the left.
+
 
+
This is an example of a common phenomenon seen in machine learning. The machine learns what it observes in the data rather than what you (the human) might observe. If there are statistical associations in the training data, even if irrelevant to the task at hand, the machine will happily learn those instead of what you wanted it to learn. If
the machine is tested on new data with the same statistical associations, it will appear to have successfully learned to solve the task. However, the machine can fail unexpectedly, as Will’s network did on images of animals without a blurry background. In machine-learning jargon, Will’s network “overfitted” to its specific training set, and thus can’t do a good job of applying what it learned to images that differ from those it was trained on.
+
 
+
In recent years, several research teams have investigated whether ConvNets trained on ImageNet and other large data sets have likewise overfitted to their training data. One group showed that if ConvNets are trained on images downloaded from the web (like those in ImageNet), they perform poorly on images that were taken by a robot moving around a house with a camera.15 It seems that random views of household objects can look very different from photos that people put on the web. Other groups have shown that superficial changes to images, such
as slightly blurring or speckling an image, changing some colors, or rotating objects in the scene, can cause ConvNets to make significant errors even when these perturbations don’t affect humans’ recognition of objects.16 This unexpected fragility of ConvNets—even those that have been said to “surpass humans at object recognition”— indicates that they are overfitting to their training data and learning something different from what we are trying to teach them.
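
This kind of fragility is easy to probe for yourself. The sketch below (not the procedure used in the studies cited above) blurs and slightly rotates an image and checks whether a pretrained network changes its answer; the image path is a placeholder, and the torchvision weights API assumes a recent version of the library:

<syntaxhighlight lang="python">
# Sketch: apply superficial changes (blur, small rotation) to an image and
# compare a pretrained ConvNet's predictions before and after.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.IMAGENET1K_V1
model = resnet18(weights=weights).eval()
preprocess = weights.transforms()                  # standard ImageNet preprocessing

image = Image.open("dog.jpg")                      # placeholder image path
perturbed = transforms.GaussianBlur(9)(transforms.RandomRotation(15)(image))

with torch.no_grad():
    for name, img in [("original", image), ("perturbed", perturbed)]:
        pred = model(preprocess(img).unsqueeze(0)).argmax(1).item()
        print(name, weights.meta["categories"][pred])
</syntaxhighlight>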
+
 
+
FIGURE 16: Labels assigned to photos by Google’s automated photo tagger, including the infamous “Gorillas” tag
+
 
+
===Biased AI ===
+
 
+
The unreliability of ConvNets can result in embarrassing—and potentially damaging—errors. Google suffered a public relations nightmare in 2015 after it rolled out an automated photo-tagging feature (using a ConvNet) in its Photos app. In addition to correctly tagging images with generic descriptions such as “Airplanes,” “Cars,” and “Graduation,” the neural network tagged a selfie featuring two African Americans as “Gorillas,” as shown in figure
16. (After profuse apologies, the company’s short-term solution was to remove the “Gorillas” tag from the network’s list of possible categories.)
+
 
+
FIGURE 17: Example of a camera face-detection program identifying an Asian face as “blinking”
+
 
+
Such repellent and widely mocked misclassifications are embarrassing for the companies involved, but more subtle errors due to racial or gender biases have been noted frequently in vision systems powered by deep learning. Commercial face-recognition systems, for example, tend to be more accurate on white male faces than on female or nonwhite faces.17 Camera software for face detection is sometimes prone to missing faces with dark skin and to classifying Asian faces as “blinking” (figure 17).
+
 
+
Kate Crawford, a researcher at Microsoft and an activist for fairness and transparency in AI, pointed out that one widely used data set for training face-recognition systems contains faces that are 77.5 percent male and 83.5 percent white. This is not surprising, because the images were downloaded from online image searches, and photos of faces that appear online are skewed toward featuring famous or powerful people, who are predominately white and male.
+
 
+
Of course, these biases in AI training data reflect biases in our society, but the spread of real-world AI systems trained on biased data can magnify these biases and do real damage. Face-recognition systems, for example, are increasingly being deployed as a “secure” way to identify people in credit-card transactions, airport screening, and security cameras, and it may be only a matter of time before they are used to verify identity in voting systems, among other applications. Even small differences in accuracy between racial groups can have damaging repercussions in civil rights and access to vital services.
+
 
+
Such biases can be mitigated in individual data sets by having humans make sure that the photos (or other kinds of data) are balanced in their representation of, say, racial or gender groups. But this requires awareness and effort on the part of the humans curating the data. Moreover, it is often hard to tease out subtle biases and their effects. For example, one research group noted that their AI system—trained on a large data set of photos of people in different situations—would sometimes mistakenly classify a man as “woman” when the man was standing in a
kitchen, an environment in which the data set had more examples of women.18 In general, this kind of subtle bias can be apparent after the fact but hard to detect ahead of time.
+
 
+
The problem of bias in applications of AI has been getting a lot of attention recently, with many articles, workshops, and even academic research institutes devoted to this topic. Should the data sets being used to train AI accurately mirror our own biased society—as they often do now—or should they be tinkered with specifically to achieve social reform aims? And who should be allowed to specify the aims or do the tinkering?
+
 
+
===Show Your Work ===
+
 
+
Remember back in school when your teacher would write “show your work” in red on your math homework? For me, showing my work was the least fun part of learning math but probably the most important, because showing how I derived my answer demonstrated that I had actually understood what I was doing, had grasped the correct abstractions, and had arrived at the answer for the right reasons. Showing my work also helped my teacher figure out why I made particular errors.
+
 
+
More generally, you can often trust that people know what they are doing if they can explain to you how they arrived at an answer or a decision. However, “showing their work” is something that deep neural networks—the bedrock of modern AI systems—cannot easily do. Let’s consider the “dog” and “cat” object-recognition task I described in chapter 4. Recall that a convolutional neural network decides what object is contained in an input image by performing a sequence of mathematical operations (convolutions) propagated through many layers. For a reasonably sized network, these can amount to billions of arithmetic operations. While it would be easy to program the computer to print out a list of all the additions and multiplications performed by a network for a given input, such a list would give us humans zero insight into how the network arrived at its answer. A list of a billion operations is not an explanation that a human can understand. Even the humans who train deep networks generally cannot look under the hood and provide explanations for the decisions their networks make. MIT’s Technology
Review magazine called this impenetrability “the dark secret at the heart of AI.”19 The fear is that if we don’t understand how AI systems work, we can’t really trust them or predict the circumstances under which they will make errors.
+
 
+
Humans can’t always explain their thought processes either, and you generally can’t look “under the hood” into other people’s brains (or into their “gut feelings”) to figure out how they came to any particular decision. But humans tend to trust that other humans have correctly mastered basic cognitive tasks such as object recognition and language comprehension. In part, you trust other people when you believe that their thinking is like your own. You assume, most often, that other humans you encounter have had sufficiently similar life experiences to your own, and thus you assume they are using the same basic background knowledge, beliefs, and values that you do in perceiving, describing, and making decisions about the world. In short, where other people are concerned, you have what psychologists call a theory of mind—a model of the other person’s knowledge and goals in particular situations. None of us have a similar “theory of mind” for AI systems such as deep networks, which makes it harder to trust them.
+
 
+
It shouldn’t come as a surprise then that one of the hottest new areas of AI is variously called “explainable AI,” “transparent AI,” or “interpretable machine learning.” These terms refer to research on getting AI systems— particularly deep networks—to explain their decisions in a way that humans can understand. Researchers in this area have come up with clever ways to visualize the features that a given convolutional neural network has learned and, in some cases, to determine which parts of the input are most responsible for the output decision. Explainable AI is a field that is progressing quickly, but a deep-learning system that can successfully explain itself in human terms is still elusive.
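As a concrete illustration, one widely used visualization technique computes a “saliency map”: the gradient of the network’s output score with respect to the input pixels, which highlights the pixels that most influenced the decision. The minimal sketch below assumes the PyTorch and torchvision libraries; the pretrained model, preprocessing, and input file name are illustrative assumptions, not the specific systems discussed here.

<pre>
# A minimal saliency-map sketch (illustrative model and input, not the specific
# systems discussed in the text). Assumes PyTorch and torchvision are installed.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(pretrained=True)   # any pretrained ImageNet ConvNet will do
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)  # hypothetical input file
img.requires_grad_(True)

scores = model(img)                    # scores for the 1,000 ImageNet categories
top = scores.argmax().item()           # the network's chosen category
scores[0, top].backward()              # gradient of that score with respect to the pixels

# Pixels with large gradient magnitude had the most influence on the decision.
saliency = img.grad.abs().max(dim=1)[0]
print(saliency.shape)                  # torch.Size([1, 224, 224])
</pre>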
+
 
+
Fooling Deep Neural Networks
+
 
+
There is yet another dimension to the AI trustworthiness question: Researchers have discovered that it is surprisingly easy for humans to surreptitiously trick deep neural networks into making errors. That is, if you want to deliberately fool such a system, there turn out to be an alarming number of ways to do so.
+
 
+
Fooling AI systems is not new. Email spammers, for example, have been in an arms race with spam-detection programs for decades. But the kinds of attacks to which deep-learning systems seem to be vulnerable are at once subtler and more troubling.
+
 
+
Remember AlexNet, which I discussed in chapter 5? It was the convolutional neural network that won the 2012 ImageNet challenge and that set in motion the dominance of ConvNets in much of today’s AI world. If you’ll recall, AlexNet’s (top-5) accuracy on ImageNet was 85 percent, which blew every other competitor out of the water and shocked the computer-vision community. However, a year after AlexNet’s win, a research paper appeared, authored by Christian Szegedy of Google and several others, with the deceptively mild title “Intriguing Properties of
Neural Networks.”20 One of the “intriguing properties” described in the paper was that AlexNet could easily be fooled.
+
 
+
In particular, the paper’s authors had discovered that they could take an ImageNet photo that AlexNet classified correctly with high confidence (for example, “School Bus”) and distort it by making very small, specific
changes to its pixels so that the distorted image looked completely unchanged to humans but was now classified with very high confidence by AlexNet as something completely different (for example, “Ostrich”). The authors called the distorted image an “adversarial example.” Figure 18 shows a few samples of original images and their adversarial twins. Can’t tell the difference? Congratulations! It seems that you are human.
+
 
+
Szegedy and his collaborators created a computer program that could, given any photo from ImageNet that was correctly classified by AlexNet, find specific changes to the photo to create a new adversarial example that looked unchanged to humans but caused AlexNet to give highest confidence to an incorrect category.
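The optimization Szegedy and his colleagues used (a box-constrained numerical search) is more involved than can be shown here, but a later, simpler technique, the “fast gradient sign method,” captures the same idea: nudge every pixel a tiny step in the direction that most increases the network’s error. The sketch below assumes PyTorch, a classifier whose inputs are pixel values scaled between 0 and 1, and an illustrative step size; it is a stand-in for the general approach, not the paper’s exact procedure.

<pre>
# A sketch of the "fast gradient sign method," one simple way to build an adversarial
# image. This is NOT the exact optimization used by Szegedy and colleagues; it is a
# later, simpler technique that illustrates the same idea. Assumes PyTorch, a model
# that takes pixel values in the range [0, 1], and a label tensor of shape [1].
import torch
import torch.nn.functional as F

def adversarial_example(model, image, true_label, epsilon=0.007):
    """Return a slightly perturbed copy of `image` that the model tends to misclassify."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)   # how wrong the model is right now
    loss.backward()
    # Nudge every pixel a tiny step in the direction that increases the loss.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()              # keep pixel values valid
</pre>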
+
 
+
FIGURE 18: Original and “adversarial” examples for AlexNet. The left image in each pair shows the original image, which was correctly classified by AlexNet. The right image in each pair shows the adversarial example derived from this image (small changes have been made to the pixels, but the new image appears to humans to be identical to the original). Each adversarial example was confidently classified by AlexNet as “Ostrich.”
+
 
+
Importantly, Szegedy and his collaborators found that this susceptibility to adversarial examples wasn’t special to AlexNet; they showed that several other convolutional neural networks—with different architectures, hyperparameters, and training sets—had similar vulnerabilities. Calling this an “intriguing property” of neural networks is a little like calling a hole in the hull of a fancy cruise liner a “thought-provoking facet” of the ship. Intriguing, yes, and more investigation is needed, but if the leak is not fixed, this ship is going down.
+
 
+
Not long after the paper by Szegedy and his colleagues appeared, a group from the University of Wyoming published an article with a more direct title: “Deep Neural Networks Are Easily Fooled.”21 By using a biologically inspired computational method called genetic algorithms,22 the Wyoming group was able to computationally “evolve” images that look like random noise to humans but for which AlexNet and other convolutional neural networks assigned specific object categories with greater than 99 percent confidence. Figure 19 shows some examples. The Wyoming group noted that deep neural networks (DNNs) “see these objects as near-perfect examples of recognizable images,” which “[raises] questions about the true generalization capabilities of DNNs and the potential for costly exploits [that is, malicious applications] of solutions that use DNNs.”23
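A full genetic algorithm with the image encodings the Wyoming group used is beyond a short example, but the core idea can be conveyed by a much cruder stand-in: start from random noise and keep any random change that raises the network’s confidence in a chosen target category. The sketch below assumes PyTorch and a classifier that takes 224-by-224 color images; the step size and iteration count are arbitrary illustrative choices.

<pre>
# A crude stand-in for "evolving" a fooling image: start from random noise and keep
# any random change that raises the network's confidence in a chosen target category.
# The published work used genetic algorithms with more sophisticated image encodings;
# this simple hill climber only conveys the idea.
import torch
import torch.nn.functional as F

def evolve_fooling_image(model, target_class, steps=5000, noise=0.05):
    with torch.no_grad():
        image = torch.rand(1, 3, 224, 224)          # start from random "static"
        best = F.softmax(model(image), dim=1)[0, target_class].item()
        for _ in range(steps):
            candidate = (image + noise * torch.randn_like(image)).clamp(0, 1)
            conf = F.softmax(model(candidate), dim=1)[0, target_class].item()
            if conf > best:                          # keep changes that fool the net more
                image, best = candidate, conf
    return image, best                               # the image still looks like noise to us
</pre>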
+
 
+
FIGURE 19: Examples of images created by a genetic algorithm specifically to fool a convolutional neural network. In each case, AlexNet (trained on the ImageNet training set) assigned a confidence greater than 99 percent that the image was an instance of the category shown.
+
 
+
Indeed, these two papers and subsequent related discoveries raised not only questions but also genuine alarm in the deep-learning community. If deep-learning systems, so successful at computer vision and other tasks, can easily be fooled by manipulations to which humans are not susceptible, how can we say that these networks “learn like humans” or “equal or surpass humans” in their abilities? It’s clear that something very different from human perception is going on here. And if these networks are going to be used for computer vision in the real world, we’d better be darn sure that they are safeguarded from hackers using these kinds of manipulations to fool them.
+
 
+
All this has reenergized the small research community focusing on “adversarial learning”—that is, developing strategies that defend against potential (human) adversaries who could attack machine-learning systems. Adversarial-learning researchers often start their work by demonstrating possible ways in which existing systems can be attacked, and some of the recent demonstrations have been stunning. In the domain of computer vision, one group of researchers developed a program that could create eyeglass frames with specific patterns that fool a face-recognition system into confidently misclassifying the wearer as another person (figure 20).24 Another group developed small, inconspicuous stickers that could be placed on a traffic sign, causing a ConvNet-based vision system—similar to those used in self-driving cars—to misclassify the sign (for example, a stop sign is classified as a speed-limit sign).25 Yet another group demonstrated a possible adversarial attack on deep neural networks for medical image analysis: they showed that it is not hard to alter an X-ray or microscopy image in a way that is
imperceptible to humans but that causes a network to change its classification from, say, 99 percent confidence that the image shows no cancer to 99 percent confidence that cancer is present.26 This group noted that such attacks could potentially be used by hospital personnel or others to create fraudulent diagnoses in order to charge insurance companies for additional (lucrative) diagnostic tests.
+
 
+
FIGURE 20: An AI researcher (left) wearing eyeglass frames with a pattern specially designed to cause a deep neural network face recognizer, trained on celebrity faces, to confidently classify the left photo as the actress Milla Jovovich (right). The paper describing this study gives many other examples of impersonation using “adversarial” eyeglass-frame patterns.
+
 
+
These are just a few examples of possible attacks that have been concocted by various research groups. Many of the possible attacks have been shown to be surprisingly robust: they work on several different networks, even when these networks are trained on different data sets. And computer vision isn’t the only domain in which networks can be fooled; researchers have also designed attacks that fool deep neural networks that deal with language, including speech recognition and text analysis. We can expect that as these systems become more widely deployed in the real world, malicious users will discover many other vulnerabilities in these systems.
+
 
+
Understanding and defending against such potential attacks are a major area of research right now, but while researchers have found solutions for specific kinds of attacks, there is still no general defense method. Like any domain of computer security, progress so far has a “whack-a-mole” quality, where one security hole is detected and defended, but others are discovered that require new defenses. Ian Goodfellow, an AI expert who is part of the Google Brain team, says, “Almost anything bad you can think of doing to a machine-learning model can be done
right now … and defending it is really, really hard.”27
+
 
+
Beyond the immediate issue of how to defend against attacks, the existence of adversarial examples amplifies the question I asked earlier: What, precisely, are these networks learning? In particular, what are they learning that allows them to be so easily fooled? Or perhaps more important, are we fooling ourselves when we think these networks have actually learned the concepts we are trying to teach them?
+
 
+
To my mind, the ultimate problem is one of understanding. Consider figure 18, where AlexNet mistakes a school bus for an ostrich. Why would this be very unlikely to happen to a human? Even though AlexNet performs very well on ImageNet, we humans understand many things about the objects we see that are unknown to AlexNet or any other current AI system. We know what objects look like in three dimensions and can imagine this from a two-dimensional photo. We know what the function of a given object is, what role the object’s parts play in its overall function, and in what contexts an object usually appears. Seeing an object brings up memories of seeing such objects in other circumstances, from other viewpoints, as well as in other sensory modalities (we remember what a given object feels like, smells like, perhaps what it sounds like when dropped, and so on). All of this background knowledge feeds into the human ability to robustly recognize a given object. Even the most successful AI vision systems lack this kind of understanding and the robustness that it confers.
+
 
+
FIGURE 21: A visual illusion for humans: the horizontal line segments in A and B are the same length, but most people perceive the segment in A to be longer than the one in B.
+
 
+
I’ve heard some AI researchers argue that humans are also susceptible to our own types of “adversarial examples”: visual illusions. Like AlexNet classifying a school bus as an ostrich, humans are susceptible to perceptual errors (for example, we perceive the upper line in figure 21 to be longer than the lower line, even though both are actually the same length). But the kinds of errors that humans make are quite different from those that convolutional neural networks are susceptible to: our ability to recognize objects in everyday scenes has evolved to be very robust, because our survival depends on it. Unlike today’s ConvNets, human (and animal) perception is highly regulated by cognition—the kind of context-dependent understanding that I described above. Moreover, ConvNets used in today’s computer-vision applications are typically completely feed-forward, whereas the human visual system has many more feedback (that is, reverse direction) connections than feed-forward connections. Although neuroscientists don’t yet understand the function of all this feedback, one might speculate that at least some of those feedback connections effectively prevent vulnerability to the kinds of adversarial examples that ConvNets are susceptible to. So why not just give ConvNets the same kind of feedback? This is an area of active research, but it turns out to be very difficult and hasn’t produced the kind of success seen with feed-forward networks.
+
 
+
Jeff Clune, an AI researcher at the University of Wyoming, made a very provocative analogy when he noted that there is “a lot of interest in whether Deep Learning is ‘real intelligence’ or a ‘Clever Hans.’”28 Clever Hans was a horse in early twentieth-century Germany who could—his owner claimed—perform arithmetic calculations as well as understand German. The horse responded to questions such as “What is fifteen divided by three?” by tapping his hoof the correct number of times. After Clever Hans became an international celebrity, a careful investigation
eventually revealed that the horse did not actually understand the questions or mathematical concepts put to him, but was tapping in response to subtle, unconscious cues given by the questioner. Clever Hans has become a metaphor for any individual (or program!) that gives the appearance of understanding but is actually responding to unintentional cues given by a trainer. Does deep learning exhibit “true understanding,” or is it instead a computational Clever Hans responding to superficial cues in the data? This is currently the subject of heated debates in the AI community, compounded by the fact that AI researchers don’t necessarily agree on the definition of “true understanding.”
+
 
+
On the one hand, deep neural networks, trained via supervised learning, perform remarkably well (though still far from perfectly) on many problems in computer vision, as well as in other domains such as speech recognition and language translation. Because of their impressive abilities, these networks are rapidly being taken from research settings and employed in real-world applications such as web search, self-driving cars, face recognition, virtual assistants, and recommendation systems, and it’s getting hard to imagine life without these AI tools. On the other hand, it’s misleading to say that deep networks “learn on their own” or that their training is “similar to human learning.” Recognition of the success of these networks must be tempered with a realization that they can fail in unexpected ways because of overfitting to their training data, long-tail effects, and vulnerability to hacking. Moreover, the reasons for decisions made by deep neural networks are often hard to understand, which makes their failures hard to predict or fix. Researchers are actively working on making deep neural networks more reliable and transparent, but the question remains: Will the fact that these systems lack humanlike understanding inevitably render them fragile, unreliable, and vulnerable to attacks? And how should this factor into our decisions about applying AI systems in the real world? The next chapter explores some of the formidable challenges of balancing the benefits of AI with the risks of its unreliability and misuse.
==7 On Trustworthy and Ethical AI==
Imagine yourself in a self-driving car, late at night, after the office Christmas party. It’s dark out, and snow is falling. “Car, take me home,” you say, tired and a little tipsy. You lean back, gratefully allowing your eyes to close as the car starts itself up and pulls into traffic.
+
 
+
All good, but how safe should you feel? The success of self-driving cars is crucially dependent on machine learning (especially deep learning), particularly for the cars’ computer-vision and decision-making components. How can we determine if these cars have successfully learned all that they need to know?
+
 
+
This is the billion-dollar question for the self-driving car industry. I’ve encountered conflicting opinions from experts on how soon we can expect self-driving cars to play a significant role in daily life, with predictions ranging (at the time of this writing) from a few years to many decades. Self-driving cars have the potential to vastly improve our lives. Automated vehicles could substantially reduce the millions of annual deaths and injuries due to auto accidents, many of them caused by intoxicated or distracted drivers. In addition, automated vehicles would allow their human passengers to be productive rather than idle during commute times. These vehicles also have the potential to be more energy efficient than cars with human drivers and will be a godsend for blind or handicapped people who can’t drive. But all this will come to pass only if we humans are willing to trust these vehicles with our lives.
+
 
+
Machine learning is being deployed to make decisions affecting the lives of humans in many domains. What assurances do you have that the machines creating your news feed, diagnosing your diseases, evaluating your loan applications, or—God forbid—recommending your prison sentence have learned enough to be trustworthy decision makers?
+
 
+
These are vexing questions not just for AI researchers but also for society as a whole, which must eventually weigh the many current and future positive uses of AI against concerns about its trustworthiness and misuse.
+
 
+
===Beneficial AI===
+
 
+
When one considers the role of AI in our society, it might be easy to focus on the downsides. However, it’s essential to remember that there are huge benefits that AI systems already bring to society and that they have the potential to be even more beneficial. Current AI technology is central to services you yourself might use all the time, sometimes without even knowing that AI is involved, including speech transcription, GPS navigation and trip planning, email spam filters, language translation, credit-card fraud alerts, book and music recommendations, protection against computer viruses, and optimizing energy usage in buildings.
+
 
+
If you are a photographer, filmmaker, fine artist, or musician, you might be using AI systems that assist you in creative projects, such as programs that help photographers edit their photos or assist composers in music notation or arrangements. If you are a student, you might benefit from “intelligent tutoring systems” that adapt to your particular learning style. If you are a scientist, there’s a good chance you have used one of the many available AI tools that help analyze your data. If you are blind or otherwise visually disabled, you might use smartphone computer-vision apps that read handwritten or printed text (for example, on signs, restaurant menus, or money). If you are hearing- impaired, you can now see quite accurate captions on YouTube videos and, in some cases, get real-time speech transcription during a lecture. These are just a few examples of the ways in which current AI tools are improving people’s lives. Many additional AI technologies are still in research mode but are on the verge of becoming mainstream.
+
 
+
In the near future, AI applications will likely be widespread in health care. We will see AI systems assisting physicians in diagnosing diseases and in suggesting treatments; discovering new drugs; and monitoring the health
and safety of the elderly in their homes. Scientific modeling and data analysis will increasingly rely on AI tools—for example, in improving models of climate change, population growth and demographic change, ecological and food science, and other major issues that society will be facing over the next century. For Demis Hassabis, the cofounder of Google’s DeepMind group, this is the most important potential benefit of AI:
+
 
+
We might have to come to the sobering realisation that even with the smartest set of humans on the planet working on these problems, these [problems] may be so complex that it’s difficult for individual humans and scientific experts to have the time they need in their lifetimes to
even innovate and advance.… It’s my belief we’re going to need some assistance and I think AI is the solution to that.1
+
 
+
We’ve all heard that in the future AI will take over the jobs that humans hate—low-wage jobs that are boring, exhausting, degrading, exploitative, or downright dangerous. If this actually happens, it could be a true boon for human well-being. (Later I’ll discuss the other side of this coin—AI taking away too many human jobs.) Robots are already widely used for menial and repetitive factory tasks, though there are many such jobs still beyond the abilities of today’s robots. But as AI progresses, more and more of these jobs could be taken over by automation. Examples of future AI workplace applications include self-driving trucks and taxis, as well as robots for harvesting fruits, fighting fires, removing land mines, and performing environmental cleanups. In addition, robots will likely see an even larger role than they have now in planetary and space exploration.
+
 
+
Will it actually benefit society for AI systems to take over such jobs? We can look to the history of technology to give us some perspective. Here are a few examples of jobs that humans used to do but that technology automated long ago, at least in developed countries: clothes washer; rickshaw driver; elevator operator; punkawallah (a servant in India whose sole job was to work a manual fan for cooling the room, before the days of electric fans); computer (a human, usually female, who performed tedious calculations by hand, particularly during World War II). Most people will agree that in those instances replacing humans with machines in such jobs made life better all around. One could argue that today’s AI is simply extending that same arc of progress: improving life for humans by increasingly automating the necessary jobs that no one wants to do.
+
 
+
===The Great AI Trade-Off===
+
 
+
The AI researcher Andrew Ng has optimistically proclaimed, “AI is the new electricity.” Ng explains further: “Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.”2 This is an appealing analogy: the idea that soon AI will be as necessary—and as invisible—in our electronic devices as electricity itself. However, a major difference is that the science of electricity was well understood before it was widely commercialized. We are good at predicting
the behavior of electricity. This is not the case for many of today’s AI systems.
+
 
+
This brings us to what you might call the Great AI Trade-Off. Should we embrace the abilities of AI systems, which can improve our lives and even help save lives, and allow these systems to be employed ever more extensively? Or should we be more cautious, given current AI’s unpredictable errors, susceptibility to bias, vulnerability to hacking, and lack of transparency in decision-making? To what extent should humans be required to remain in the loop in different AI applications? What should we require of an AI system in order to trust it enough to let it work autonomously? These questions are still hotly debated, even as AI is increasingly deployed and its promised future applications (for example, self-driving cars) are touted as being just over the horizon.
+
 
+
The lack of general agreement on these issues was underscored by a recent study carried out by the Pew Research Center.3 In 2018, Pew analysts canvassed nearly one thousand “technology pioneers, innovators, developers, business and policy leaders, researchers and activists,” asking them to reply to these questions:
+
 
+
By 2030, do you think it is most likely that advancing AI and related technology systems will enhance human capacities and empower them? That is, most of the time, will most people be better off than they are today? Or is it most likely that advancing AI and related technology systems will lessen human autonomy and agency to such an extent that most people will not be better off than the way things are today?
+
 
+
The respondents were divided: 63 percent predicted that progress in AI would leave humans better off by 2030, while 37 percent disagreed. Opinions ranged from the view that AI “can virtually eliminate global poverty, massively reduce disease and provide better education to almost everyone on the planet” to predictions of an apocalyptic future: legions of jobs taken over by automation, erosion of privacy and civil rights due to AI surveillance, amoral autonomous weapons, unchecked decisions by opaque and untrustworthy computer programs,
magnification of racial and gender bias, manipulation of the mass media, increase of cybercrime, and what one respondent called “true, existential irrelevance” for humans.
+
 
+
Machine intelligence presents a knotty array of ethical issues, and discussions related to the ethics of AI and big data have filled several books.4 In order to illustrate the complexity of the issues, I’ll dig deeper into one example that is getting a lot of attention these days: automated face recognition.
+
 
+
===The Ethics of Face Recognition===
+
 
+
Face recognition is the task of labeling a face in an image or video (or real-time video stream) with a name. Facebook, for example, applies a face-recognition algorithm to every photo that is uploaded to its site, trying to detect the faces in the photo and to match them with known users (at least those users who haven’t disabled this feature).5 If you are on Facebook and someone posts a photo that includes your face, the system might ask you if you want to “tag yourself” in the photo. The accuracy of Facebook’s face-recognition algorithm can be
simultaneously impressive and creepy. Not surprisingly, this accuracy comes from using deep convolutional neural networks. The software can often recognize faces not only when the face is front and center in a photo but even when a person is one of many in a crowd.
+
 
+
Face-recognition technology has many potential upsides, including helping people search through their photo collections, enabling users with vision impairments to identify the people they encounter, locating missing children or criminal fugitives by scanning photos and videos for their faces, and detecting identity theft. However, it’s just as easy to imagine applications that many people find offensive or threatening. Amazon, for example, markets its face- recognition system (with the strangely dystopian-sounding name Rekognition) to police departments, which can compare, say, security-camera footage with a database of known offenders or likely suspects.
+
 
+
Privacy is an obvious issue. Even if I’m not on Facebook (or any other social media platform with face recognition), photos including me might be tagged and later automatically recognized on the site, without my permission. Consider FaceFirst, a company that offers face-recognition services for a fee. As reported by the magazine New Scientist, “Face First … is rolling out a system for retailers that it says will ‘boost sales by recognizing high-value customers each time they shop’ and send ‘alerts when known litigious individuals enter any
of your locations.’”6 Many other companies offer similar services.
+
 
+
Loss of privacy is not the only danger here. An even larger worry is reliability: face-recognition systems can make errors. If your face is matched in error, you might be barred from a store or an airplane flight or wrongly accused of a crime. What’s more, present-day face-recognition systems have been shown to have a significantly higher error rate on people of color than on white people. The American Civil Liberties Union (ACLU), which vigorously opposes the use of face-recognition technology for law enforcement on civil rights grounds, tested Amazon’s Rekognition system (using its default settings) on the 535 members of the U.S. Congress, comparing a photo of each member against a database of people who have been arrested on criminal charges. They found that the system incorrectly matched 28 out of the 535 members of Congress with people in the criminal database. Twenty- one percent of the errors were on photos of African American representatives (African Americans make up only
about 9 percent of Congress).7
+
 
+
Amid the fallout from the ACLU’s tests and other studies showing the unreliability and biases of face recognition, several high-tech companies have announced that they oppose using face recognition for law enforcement and surveillance. For example, Brian Brackeen, the CEO of the face-recognition company Kairos, wrote the following in a widely circulated article:
+
 
+
Facial recognition technologies, used in the identification of suspects, negatively affects people of color. To deny this fact would be a lie.… I (and my company) have come to believe that the use of commercial facial recognition in law enforcement or in government surveillance of any kind is wrong—and that it opens the door for gross misconduct by the morally corrupt.… We deserve a world where we’re not
empowering governments to categorize, track and control citizens.8
+
 
+
In a blog post on his company’s website, Microsoft’s president and chief legal officer, Brad Smith, called for Congress to regulate face recognition:
+
 
+
Facial recognition technology raises issues that go to the heart of fundamental human rights protections like privacy and freedom of expression. These issues heighten responsibility for tech companies that create these products. In our view, they also call for thoughtful government regulation and for the development of norms around acceptable uses. Facial recognition will require the public and private
sectors alike to step up—and to act.9
+
 
+
Google followed suit, announcing that it would not offer general-purpose face-recognition services via its cloud AI platform until the company can “ensure its use is aligned with our principles and values, and avoids abuse and harmful outcomes.”10
+
 
+
The response of these companies is encouraging, but it brings to the forefront another vexing issue: To what
extent should AI research and development be regulated, and who should do the regulating?
+
 
+
===Regulating AI===
+
 
+
Given the risks of AI technologies, many practitioners of AI, myself included, are in favor of some kind of regulation. But the regulation shouldn’t be left solely in the hands of AI researchers and companies. The problems surrounding AI—trustworthiness, explainability, bias, vulnerability to attack, and morality of use—are social and political issues as much as they are technical ones. Thus, it is essential that the discussion around these issues include people with different perspectives and backgrounds. Simply leaving regulation up to AI practitioners would be as unwise as leaving it solely up to government agencies.
+
 
+
In one example of the complexity of crafting such regulations, in 2018 the European Parliament enacted a regulation on AI that some have called the “right to explanation.”11 This regulation requires, in the case of “automated decision making,” “meaningful information about the logic involved” in any decision that affects an EU citizen. This information is required to be communicated “in a concise, transparent, intelligible and easily accessible form, using clear and plain language.”12 This opens the floodgates for interpretation. What counts as “meaningful information” or “the logic involved”? Does this regulation prohibit the use of hard-to-explain deep-learning methods in making decisions that affect individuals (such as loans and face recognition)? Such uncertainties will no doubt ensure gainful employment for policy makers and lawyers for a long time to come.
+
 
+
I believe that regulation of AI should be modeled on the regulation of other technologies, particularly those in biological and medical sciences, such as genetic engineering. In those fields, regulation—such as quality assurance and the analysis of risks and benefits of technologies—occurs via cooperation among government agencies, companies, nonprofit organizations, and universities. Moreover, there are now established fields of bioethics and medical ethics, which have considerable influence on decisions about the development and application of technologies. AI research and its applications very much need a well-thought-out regulatory and ethics infrastructure.
+
 
+
This infrastructure is just beginning to be formed. In the United States, state governments are starting to look into creating regulations, such as those for face recognition or self-driving vehicles. However, for the most part, the universities and the companies that create AI systems have been left to regulate themselves.
+
 
+
A number of nonprofit think tanks have cropped up to fill the void, often funded by wealthy tech  entrepreneurs who are worried about AI. These organizations—with names such as Future of Humanity Institute, Future of Life Institute, and Centre for the Study of Existential Risk—hold workshops, sponsor research, and create educational materials and policy suggestions on the topics of safe and ethical uses of AI. An umbrella organization, called the Partnership on AI, has been trying to bring together such groups to “serve as an open platform for
discussion and engagement about AI and its influences on people and society.”13
+
 
+
One stumbling block is that there is no general agreement in the field on what the priorities for developing regulation and ethics should be. Should the immediate focus be on algorithms that can explain their reasoning? On data privacy? On robustness of AI systems to malicious attacks? On bias in AI systems? On the potential “existential risk” from superintelligent AI? My own opinion is that too much attention has been given to the risks from superintelligent AI and far too little to deep learning’s lack of reliability and transparency and its vulnerability to attacks. I will say more about the idea of superintelligence in the final chapter.
+
 
+
===Moral Machines===
+
 
+
So far, my discussion has focused on ethical issues of how humans use AI. But there’s another important question: Could machines themselves have their own sense of morality, complete enough for us to allow them to make ethical decisions on their own, without humans having to oversee them? If we are going to give decision-making autonomy to face-recognition systems, self-driving cars, elder-care robots, or even robotic soldiers, don’t we need to give these machines the same ability to deal with ethical and moral questions that we humans have?
+
 
+
People have been thinking about “machine morality” for as long as they’ve been thinking about AI.14 Probably the best-known discussion of machine morality comes from Isaac Asimov’s science fiction stories, in which he proposed the three “fundamental Rules of Robotics”:
+
 
+
1. A robot may not injure a human being, or, through inaction, allow a human being to come to harm.
+
 
+
2. A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law.
+
 
+
3. A robot must protect its own existence, as long as such protection does not conflict with the First or Second Law.15
+
 
+
These laws have become famous, but in truth, Asimov’s purpose was to show how such a set of rules would inevitably fail. “Runaround,” the 1942 story in which Asimov first introduced these laws, features a situation in which a robot, following the second law, moves toward a dangerous substance, at which point the third law kicks in, so the robot moves away, at which point the second law kicks in again, trapping the robot in an endless loop, resulting in a near disaster for the robot’s human masters. Asimov’s stories often focused on the unintended consequences of programming ethical rules into robots. Asimov was prescient: as we’ve seen, the problem of incomplete rules and unintended consequences has hamstrung all approaches to rule-based AI intelligence; moral reasoning is no different.
+
 
+
The science fiction writer Arthur C. Clarke used a similar plot device in his 1968 book, 2001: A Space Odyssey.16 The artificially intelligent computer HAL is programmed to always be truthful to humans, but at the same time to withhold the truth from human astronauts about the actual purpose of their space mission. HAL, unlike Asimov’s clueless robot, suffers from the psychological pain of this cognitive dissonance: “He was … aware of the conflict that was slowly destroying his integrity—the conflict between truth, and concealment of truth.”17 The result is a computer “neurosis” that turns HAL into a killer. Reflecting on real-life machine morality, the mathematician Norbert Wiener noted as long ago as 1960 that “we had better be quite sure that the purpose put into the machine is the purpose which we really desire.”18
+
 
+
Wiener’s comment captures what is called the value alignment problem in AI: the challenge for AI
programmers to ensure that their systems’ values align with those of humans. But what are the values of humans? Does it even make sense to assume that there are universal values that society shares?
+
 
+
Welcome to Moral Philosophy 101. We’ll start with every moral philosophy student’s favorite thought experiment, the trolley problem: You are driving a speeding trolley down a set of tracks, and just ahead you see five workers standing together in the middle of the tracks. You step on the brakes, but you find that they don’t work. Fortunately, there is a spur of tracks leading off to the right. You can steer the trolley onto the spur and avoid hitting the five workers. Unfortunately, there is a single worker standing in the middle of the spur. If you do nothing, the trolley will drive straight into the five workers and kill them all. If you steer the trolley to the right, the trolley will kill the single worker. What is the moral thing to do?
+
 
+
The trolley problem has been a staple of undergraduate ethics classes for the last century. Most people answer that it would be morally preferable for the driver to steer onto the spur, killing the single worker and saving the group of five. But philosophers have found that a different framing of essentially the same dilemma can lead people to the opposite answer.19 Human reasoning about moral dilemmas turns out to be very sensitive to the way in which the dilemmas are presented.
+
 
+
The trolley problem has recently reemerged as part of the media’s coverage of self-driving cars,20 and the question of how an autonomous vehicle should be programmed to deal with such problems has become a central talking point in discussions on AI ethics. Many AI ethics thinkers have pointed out that the trolley problem itself, in which the driver has only two horrible options, is a highly contrived scenario that no real-world driver will ever encounter. But the trolley problem has become a kind of symbol for asking about how we should program self- driving cars to make moral decisions on their own.
+
 
+
In 2016, three researchers published results from surveys of several hundred people who were given trolley- problem-like scenarios that involved self-driving cars, and were asked for their views of the morality of different actions. In one survey, 76 percent of participants answered that it would be morally preferable for a self-driving car to sacrifice one passenger rather than killing ten pedestrians. But when asked if they would buy a self-driving car programmed to sacrifice its passengers in order to save a much larger number of pedestrians, the overwhelming
majority of survey takers responded that they themselves would not buy such a car.21 According to the authors, “We
found that participants in six Amazon Mechanical Turk studies approved of utilitarian AVs [autonomous vehicles] (that is, AVs that sacrifice their passengers for the greater good) and would like others to buy them, but they would themselves prefer to ride in AVs that protect their passengers at all costs.” In his commentary on this study, the psychologist Joshua Greene noted, “Before we can put our values into machines, we have to figure out how to make our values clear and consistent.”22 This seems to be harder than we might have thought.
+
 
+
Some AI ethics researchers have suggested that we give up trying to directly program moral rules for machines, and instead have machines learn moral values on their own by observing human behavior.23 However, this self-learning approach inherits all of the problems of machine learning that I described in the previous chapter.
+
 
+
To my mind, progress on giving computers moral intelligence cannot be separated from progress on other
kinds of intelligence: the true challenge is to create machines that can actually understand the situations that they confront. As Isaac Asimov’s stories demonstrate, a robot can’t reliably follow an order to avoid harming a human unless it can understand the concept of harm in different situations. Reasoning about morality requires one to recognize cause-and-effect relationships, to imagine different possible futures, to have a sense of the beliefs and goals of others, and to predict the likely outcomes of one’s actions in whatever situation one finds oneself. In other words, a prerequisite to trustworthy moral reasoning is general common sense, which, as we’ve seen, is missing in even the best of today’s AI systems.
+
 
+
So far in this book we’ve seen how deep neural networks, trained on enormous data sets, can rival the visual abilities of humans in particular tasks. We’ve also seen some of the weaknesses of these networks, including their reliance on massive quantities of human-labeled data and their propensity to fail in very un-humanlike ways. How can we create an AI system that truly learns on its own—one that is more trustworthy because, like humans, it can reason about its current situation and plan for the future? In the next part of the book, I’ll describe how AI researchers are using games such as chess, Go, and even Atari video games as “microcosms” in order to develop machines with more humanlike learning and reasoning capabilities, and I’ll assess how the resulting superhuman game-playing machines might transfer their skills to the real world.
+
 
+
==Part III: Learning to Play==
+
 
+
 
==8 Rewards for Robots==
When the journalist Amy Sutherland was doing research for a book on exotic animal trainers, she learned that their primary method is preposterously simple: “reward behavior I like and ignore behavior I don’t.” And as she wrote in The New York Times’ Modern Love column, “Eventually it hit me that the same techniques might work on that stubborn but lovable species, the American husband.” Sutherland wrote about how, after years of futile nagging, sarcasm, and resentment, she used this simple method to covertly train her oblivious husband to pick up his socks,
find his own car keys, show up to restaurants on time, and shave more regularly.1
+
 
+
This classic training technique, known in psychology as operant conditioning, has been used for centuries on animals and humans. Operant conditioning inspired an important machine-learning approach called reinforcement learning. Reinforcement learning contrasts with the supervised-learning method I’ve described in previous chapters: in its purest form, reinforcement learning requires no labeled training examples. Instead, an agent—the learning program—performs actions in an environment (usually a computer simulation) and occasionally receives rewards from the environment. These intermittent rewards are the only feedback the agent uses for learning. In the case of Amy Sutherland’s husband, the rewards were her smiles, kisses, and words of praise. While a computer program might not respond to a kiss or an enthusiastic “you’re the greatest,” it can be made to respond to a machine equivalent of such appreciation—such as positive numbers added to its memory.
+
 
+
FIGURE 22: A Sony Aibo robotic dog, about to kick a robot soccer ball
+
 
+
While reinforcement learning has been part of the AI toolbox for decades, it has long been overshadowed by neural networks and other supervised-learning methods. This changed in 2016 when reinforcement learning played a central role in a stunning and momentous achievement in AI: a program that learned to beat the best humans at the complex game of Go. In order to explain that program, as well as other recent achievements of reinforcement learning, I’ll first take you through a simple example to illustrate how reinforcement learning works.
+
 
+
===Training Your Robo-Dog===
+
 
+
For our illustrative example, let’s look to the fun game of robot soccer, in which humans (usually college students) program robots to play a simplified version of soccer on a room-sized “field.” Sometimes the players are cute doglike Aibo robots like the one shown in figure 22. An Aibo robot (made by Sony) has a camera to capture visual inputs, an internal programmable computer, and a collection of sensors and motors that enable it to walk, kick, head-butt, and even wag its plastic tail.
+
 
+
Imagine that we want to teach our robo-dog the simplest soccer skill: when facing the ball, walk over to it, and kick it. A traditional AI approach would be to program the robot with the following rules: Take a step toward the ball. Repeat until one of your feet is touching the ball. Then kick the ball with that foot. Of course, shorthand descriptions such as “take a step toward the ball,” “until one of your feet is touching the ball,” and “kick the ball” must be carefully translated into detailed sensor and motor operations built into the Aibo.
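Written out as a sketch, this hand-coded approach is just a short loop; the sensing and motor routines named below (steps_to_ball, step_forward, kick) are hypothetical stand-ins for the Aibo’s built-in operations, not a real API.

<pre>
# A sketch of the rule-based approach described above. The functions steps_to_ball(),
# step_forward(), and kick() are hypothetical stand-ins for the robot's built-in
# sensing and motor routines, not a real Aibo API.
def hand_coded_soccer_skill():
    while steps_to_ball() > 0:   # repeat until one of your feet is touching the ball
        step_forward()           # take a step toward the ball
    kick()                       # then kick the ball with that foot
</pre>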
+
 
+
Such explicit rules might be sufficient for a task as simple as this one. However, the more “intelligent” you want your robot to be, the harder it is to manually specify rules for behavior. And of course, it’s impossible to devise a set of rules that will work in every situation. What if there is a large puddle between the robot and the ball? What if a soccer cone is blocking the robot’s vision? What if a rock is blocking the ball’s movement? As always, the real world is awash with hard-to-predict edge cases. The promise of reinforcement learning is that the agent—here our robo-dog—can learn flexible strategies on its own simply by performing actions in the world and occasionally receiving rewards (that is, reinforcement) without humans having to manually write rules or directly teach the agent every possible circumstance.
+
 
+
Let’s call our robo-dog Rosie, after my favorite television robot, the wry robotic housekeeper from the classic cartoon The Jetsons.2 To make things easier for this example, let’s assume that Rosie comes from the factory preprogrammed with the following ability: if a soccer ball is in Rosie’s line of sight, she can estimate the number of steps she would need to take to get to the ball. This number is called the “state.” In general, the state of an agent at a given time is the agent’s perception of its current situation. Rosie is the simplest of possible agents, in that her state
is a single number. When I say that Rosie is “in” a given state x, I mean that she is currently estimating that she is x
steps away from the ball.
+
 
+
In addition to being able to identify her state, Rosie has three built-in actions she can perform: she can take a step Forward, take a step Backward, and she can Kick. (If Rosie happens to step out-of-bounds, she is programmed to immediately step back in.) In the spirit of operant conditioning, let’s give Rosie a reward only when she succeeds in kicking the ball. Note that Rosie doesn’t know ahead of time which, if any, states or actions will lead to rewards.
+
 
+
Given that Rosie is a robot, her “reward” is simply a number, say, 10, added to her “reward memory.” We can consider the number 10 the robot equivalent of a dog treat. Or perhaps not. Unlike a real dog, Rosie has no intrinsic desire for treats, positive numbers, or anything else. As I’ll detail below, in reinforcement learning, a human-created algorithm guides Rosie’s process of learning in response to rewards; that is, the algorithm tells Rosie how to learn from her experiences.
+
 
+
Reinforcement learning occurs by having Rosie take actions over a series of learning episodes, each of which consists of some number of iterations. At each iteration, Rosie determines her current state and chooses an action to take. If Rosie receives a reward, she then learns something, as I’ll illustrate below. Here I’ll let each episode last until Rosie manages to kick the ball, at which time she receives a reward. This might take a long time. As in training a real dog, we have to be patient.
+
 
+
Figure 23 illustrates a hypothetical learning episode. The episode begins with the trainer (me) placing Rosie and the ball in some initial locations on the field, with Rosie facing the ball (figure 23A). Rosie determines her current state: twelve steps away from the ball. Because Rosie hasn’t learned anything yet, our dog, an innocent “tabula rasa,” doesn’t know which action should be preferred, so she chooses an action at random from her three possibilities: Forward, Backward, Kick. Let’s say she chooses Backward and takes a step back. We humans can see that Backward is a bad action to take, but remember, we’re letting Rosie figure out on her own how to perform this task.
+
 
+
FIGURE 23: A hypothetical first episode of reinforcement learning
+
 
+
At iteration 2 (figure 23B), Rosie determines her new state: thirteen steps from the ball. She then chooses a new action to take, again at random: Forward. At iteration 3 (figure 23C), Rosie determines her “new” state: twelve steps away from the ball. She’s back to square one, but Rosie doesn’t even know that she has been in this state before! In the purest form of reinforcement learning, the learning agent doesn’t remember its previous states. In general, remembering previous states might take a lot of memory and doesn’t turn out to be necessary.
+
 
+
At iteration 3, Rosie—again at random—chooses the action Kick, but because she’s kicking empty air, she
doesn’t get a reward. She has yet to learn that kicking gives a reward only if she’s next to the ball.
+
 
+
Rosie continues to choose random actions, without any feedback, for many iterations. But at some point, let’s say at iteration 351, just by dumb luck Rosie ends up next to the ball and chooses Kick (figure 23D). Finally, she gets a reward and uses it to learn something.
+
 
+
What does Rosie learn? Here we take the simplest approach to reinforcement learning: upon receiving a reward, Rosie learns only about the state and action that immediately preceded the reward. In particular, Rosie learns that if she is in that state (for example, zero steps from the ball), taking that action (for example, Kick) is a good idea. But that’s all she learns. She doesn’t learn, for example, that if she is zero steps from the ball, Backward would be a bad choice. After all, she hasn’t tried that yet. For all she knows, taking a step backward in that state might lead to a much bigger reward! Rosie also doesn’t learn at this point that if she is one step away, Forward would be a good choice. She has to wait for the next episode for that. Learning too much at one time can be detrimental; if Rosie happens to kick the air two steps away from the ball, we don’t want her to learn that this ineffective kick was actually a necessary step toward getting the reward. In humans, this kind of behavior might be called superstition— namely, erroneously believing that a particular action can help cause a particular good or bad outcome. In reinforcement learning, superstition is something that you have to be careful to avoid.
+
 
+
A crucial notion in reinforcement learning is that of the value of performing a particular action in a given state. The value of action A in state S is a number reflecting the agent’s current prediction of how much reward it will eventually obtain if, when in state S, it performs action A, and then continues performing high-value actions. Let me explain. If your current state is “holding a chocolate in your hand,” an action with high value would be to bring your hand to your mouth. Subsequent actions with high value would be to open your mouth, put the chocolate inside, and chew. Your reward is the delicious sensation of eating the chocolate. Bringing your hand to your mouth doesn’t immediately produce this reward, but this action is on the right path, and if you’ve eaten chocolate before, you can predict the intensity of the upcoming reward. The goal of reinforcement learning is for the agent to learn values that are good predictions of upcoming rewards (assuming that the agent keeps doing the right thing after
taking the action in question).3 As we’ll see, the process of learning the values of particular actions in a given state typically takes many steps of trial and error.
+
 
+
FIGURE 24: Rosie’s Q-table after her first episode of reinforcement learning
+
 
+
Rosie keeps track of the values of actions in a big table in her computer memory. This table, illustrated in figure 24, lists all the possible states for Rosie (that is, all possible distances she could be from the ball, up to the length of the field), and for each state, her possible actions. Given a state, each action in that state has a numerical value; these values will change—becoming more accurate predictions of upcoming rewards—as Rosie continues to learn. This table of states, actions, and values is called the Q-table. This form of reinforcement learning is sometimes called Q-learning. The letter Q is used because the letter V (for value) was used for something else in the original
paper on Q-learning.4
+
 
+
At the beginning of Rosie’s training, I initialize the Q-table by setting all the values to 0—a “blank slate.” When Rosie receives a reward for kicking the ball at the end of episode 1, the value of the action Kick when in state “zero steps away” is updated to 10, the value of the reward. In the future, when Rosie is in the “zero steps away” state, she can look at the Q-table, see that Kick has the highest value—that is, it predicts the highest reward—and decide to choose Kick rather than choosing randomly. That’s all that “learning” means here!
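In code, the Q-table just described is a simple lookup structure. A minimal sketch (assuming, as in the simulation described later, that states range from zero to twenty-five steps away) might look like this:

<pre>
# A sketch of Rosie's Q-table: one entry per state (number of steps to the ball,
# assumed here to range from 0 to 25) and per action, with every value starting at 0.
ACTIONS = ["Forward", "Backward", "Kick"]
MAX_STEPS = 25

q_table = {state: {action: 0.0 for action in ACTIONS}
           for state in range(MAX_STEPS + 1)}

# End of episode 1: Rosie kicked the ball while zero steps away and received a
# reward of 10, so that single entry is updated.
q_table[0]["Kick"] = 10.0
</pre>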
+
 
+
Episode 1 ended with Rosie finally kicking the ball. We now move on to episode 2 (figure 25), which starts with Rosie and the ball in new locations (figure 25A). Just as before, at each iteration Rosie determines her current
state—initially, six steps away—and chooses an action, now by looking in her Q-table. But at this point, the values of actions in her current state are still all 0s; there’s no information yet to help her choose among them. So Rosie again chooses an action at random: Backward. And she chooses Backward again at the next iteration (figure 25B). Our robo-dog’s training has a long way to go.
+
 
+
FIGURE 25: The second episode of reinforcement learning
+
 
+
Everything continues as before, until Rosie’s floundering random trial-and-error actions happen to land her one step away from the ball (figure 25C), and she happens to choose Forward. Suddenly Rosie finds her foot next to the ball (figure 25D), and the Q-table has something to say about this state. In particular, it says that her current state
—zero steps from the ball—has an action—Kick—that is predicted to lead to a reward of 10. Now she can use this information, learned at the previous episode, to choose an action to perform, namely Kick. But here’s the essence of Q-learning: Rosie can now learn something about the action (Forward) she took in the immediately previous state (one step away). That is what led her to be in the excellent position she is in now! Specifically, the value of action Forward in the state “one step away” is updated in the Q-table to have a higher value, some fraction of the value of the action “Kick when zero steps away,” which directly leads to a reward. Here I’ve updated this value to 8 (figure 26).
 
FIGURE 26: Rosie’s Q-table after her second episode of reinforcement learning
 
The Q-table now tells Rosie that it’s really good to kick when in the “zero steps away” state and that it’s almost as good to step forward when in the “one step away” state. The next time Rosie finds herself in the “one step away” state, she’ll have some information about what action she should take, as well as the ability to learn an update for the immediately past action—the Forward action in the “two steps away” state. Note that it is important for these learned action values to be reduced (“discounted”) as they go back in time from the actual reward; this allows the system to learn an efficient path to an actual reward.
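To make the discounting concrete with an assumed number (the book does not state the factor it uses): if the discount factor is 0.8, the value of Forward in the “one step away” state becomes 0.8 × 10 = 8, matching figure 26, and the value of Forward in the “two steps away” state would in turn settle at about 0.8 × 8 = 6.4, and so on back along the path to the ball.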
 
Reinforcement learning—here, the gradual updating of values in the Q-table—continues, episode to episode, until Rosie has finally learned to perform her task from any initial starting point. The Q-learning algorithm is a way to assign values to actions in a given state, including those actions that don’t lead directly to rewards but that set the stage for the relatively rare states in which the agent does receive rewards.
 
I wrote a program that simulated Rosie’s Q-learning process as described above. At the beginning of each episode, Rosie was placed, facing the ball, a random number of steps away (with a maximum of twenty-five and a minimum of zero steps away). As I mentioned earlier, if Rosie stepped out of bounds, my program simply had her step back in. Each episode ended when Rosie succeeded in reaching and kicking the ball. I found that it took about three hundred episodes for her to learn to perform this task perfectly, no matter where she started.
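For readers who want to see the moving parts, here is my own minimal Python sketch of this kind of simulation. It is not the author’s actual program; the reward of 10, the discount factor of 0.8, the learning rate, and the small exploration rate are all assumed values, and it uses the textbook form of the Q-learning update rather than the book’s exact procedure.

<pre>
import random

N_STATES = 26                             # possible distances: 0 to 25 steps from the ball
ACTIONS = ["Forward", "Backward", "Kick"]
ALPHA, GAMMA, EPSILON = 0.5, 0.8, 0.1     # assumed hyperparameters

# The Q-table: one row per state, one learned value per action, all starting at 0
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

def step(state, action):
    """Rosie's simple world: return (next_state, reward, episode_over)."""
    if action == "Kick":
        return (state, 10.0, True) if state == 0 else (state, 0.0, False)
    if action == "Forward":
        return (max(state - 1, 0), 0.0, False)
    return (min(state + 1, N_STATES - 1), 0.0, False)   # Backward, bounded by the field

for episode in range(300):
    state = random.randint(0, N_STATES - 1)              # random starting distance
    done = False
    while not done:
        if random.random() < EPSILON:                    # occasionally explore at random
            action = random.randrange(len(ACTIONS))
        else:                                            # otherwise exploit the best-known action
            action = max(range(len(ACTIONS)), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, ACTIONS[action])
        # Q-learning update: nudge the value toward reward + discounted best future value
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state
</pre>

With these assumed settings, the learned values settle toward 10 for Kick at zero steps away, 8 for Forward at one step away, about 6.4 at two steps away, and so on, consistent with the pattern in figure 26.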
 
This “training Rosie” example captures much of the essence of reinforcement learning, but I left out many issues that reinforcement-learning researchers face for more complex tasks.5 For example, in real-world tasks, the agent’s perception of its state is often uncertain, unlike Rosie’s perfect knowledge of how many steps she is from the ball. A real soccer-playing robot might have only a rough estimate of distance, or even some uncertainty about which light-colored, small object on the soccer field is actually the ball. The effects of performing an action can also
be uncertain: for example, a robot’s Forward action might move it different distances depending on the terrain, or even result in the robot falling down or colliding with an unseen obstacle. How can reinforcement learning deal with uncertainties like these?
 
Additionally, how should the learning agent choose an action at each time step? A naive strategy would be to always choose the action with the highest value for the current state in the Q-table. But this strategy has a problem: it’s possible that other, as-yet-unexplored actions will lead to a higher reward. How often should you explore— taking actions that you haven’t yet tried—and how often should you choose actions that you already expect to lead to some reward? When you go to a restaurant, do you always order the meal you’ve already tried and found to be good, or do you try something new, because the menu might contain an even better option? Deciding how much to explore new actions and how much to exploit (that is, stick with) tried-and-true actions is called the exploration
versus exploitation balance. Achieving the right balance is a core issue for making reinforcement learning successful.
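One simple and widely used recipe for this balance (a standard technique I am adding here, not one the book names) is to explore a lot at first and then gradually shift toward exploiting what has been learned, for example by letting the exploration probability decay from episode to episode. The fixed exploration rate in the earlier sketch could be replaced by something like this:

<pre>
import random

def choose_action(q_values, episode, eps_start=1.0, eps_end=0.05, decay=0.01):
    """Epsilon-greedy choice with an exploration rate that shrinks as episodes go by."""
    epsilon = eps_end + (eps_start - eps_end) * (1 - decay) ** episode
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore: try something random
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit: best-known action
</pre>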
 
These are samples of ongoing research topics among the growing community of people working on reinforcement learning. Just as in the field of deep learning, designing successful reinforcement-learning systems is still a difficult (and sometimes lucrative!) art, mastered by a relatively small group of experts who, like their deep-learning counterparts, spend a lot of time tuning hyperparameters. (How many learning episodes should be allowed? How many iterations per episode should be allowed? How much should a reward be “discounted” as it is spread back in time? And so on.)
 
===Stumbling Blocks in the Real World ===
 
Setting these issues aside for now, let’s look at two major stumbling blocks that might arise in extrapolating our “training Rosie” example to reinforcement learning in real-world tasks. First, there’s the Q-table. In complex real-world tasks—think, for example, of a robot car learning to drive in a crowded city—it’s impossible to define a small set of “states” that could be listed in a table. A single state for a car at a given time would be something like the entirety of the data from its cameras and other sensors. This means that a self-driving car effectively faces an infinite number of possible states. Learning via a Q-table like the one in the “Rosie” example is out of the question. For this reason, most modern approaches to reinforcement learning use a neural network instead of a Q-table. The neural network’s job is to learn what values should be assigned to actions in a given state. In particular, the network is given the current state as input, and its outputs are its estimates of the values of all the possible actions the agent can take in that state. The hope is that the network can learn to group related states into general concepts (It’s safe to drive forward or Stop immediately to avoid hitting an obstacle).
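As a rough sketch of the idea (PyTorch here; the 64-number state vector and the layer sizes are placeholders I chose for illustration, and a real self-driving system would feed camera images through convolutional layers instead), a neural network standing in for the Q-table looks like this:

<pre>
import torch
import torch.nn as nn

STATE_SIZE, N_ACTIONS = 64, 3      # assumed sizes: a summary of sensor data, three possible actions

q_network = nn.Sequential(
    nn.Linear(STATE_SIZE, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, N_ACTIONS),     # output: one estimated value per possible action
)

state = torch.randn(1, STATE_SIZE)          # a made-up example state
action_values = q_network(state)            # the network's estimate of each action's value
best_action = action_values.argmax(dim=1)   # index of the highest-value action
</pre>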
 
The second stumbling block is the difficulty, in the real world, of actually carrying out the learning process over many episodes, using a real robot. Even our “Rosie” example isn’t feasible. Imagine yourself initializing a new episode—walking out on the field to set up the robot and the ball—hundreds of times, not to mention waiting around for the robot to perform its hundreds of actions per episode. You just wouldn’t have enough time. Moreover, you might risk the robot damaging itself by choosing the wrong action, such as kicking a concrete wall or stepping forward over a cliff.
 
Just as I did for Rosie, reinforcement-learning practitioners almost always deal with this problem by building simulations of robots and environments and performing all the learning episodes in the simulation rather than in the real world. Sometimes this approach works well. Robots have been trained using simulations to walk, hop, grasp objects, and drive a remote-control car, among other tasks, and the robots were able, with various levels of success, to transfer the skills learned during simulation to the real world.6 However, the more complex and unpredictable the
environment, the less successful are the attempts to transfer what is learned in simulation to the real world. Because of these difficulties, it makes sense that to date the greatest successes of reinforcement learning have been not in robotics but in domains that can be perfectly simulated on a computer. In particular, the best-known reinforcement-learning successes have been in the domain of game playing. Applying reinforcement learning to games is the topic of the next chapter.
 
 
Google will run a ConvNet on your image and, based on the resulting confidences (over thousands of possible object categories), will tell you its “best guess” for the image.
  
==9 Game On ==
===Training a ConvNet ===
  
Since the earliest days of AI, enthusiasts have been obsessed with creating programs that can beat humans at games. In the late 1940s, both Alan Turing and Claude Shannon, two founders of the computer age, wrote programs to play chess before there were even computers that could run their code. In the decades that followed, many a young game fanatic has been driven to learn to program in order to get computers to play their favorite game, whether it be checkers, chess, backgammon, Go, poker, or, more recently, video games.
Our hypothetical ConvNet consists of edge detectors at its first layer, but in real-world ConvNets edge detectors aren’t built in. Instead, ConvNets learn from training examples what features should be detected at each layer, as well as how to set the weights in the classification module so as to produce a high confidence for the correct answer. And, just as in traditional neural networks, all the weights can be learned from data via the same back-propagation algorithm that I described in chapter 2.  
  
In 2010, a young British scientist and game enthusiast named Demis Hassabis, along with two close friends, launched a company in London called DeepMind Technologies. Hassabis is a colorful and storied figure in the modern AI world. A chess prodigy who was winning championships by the age of six, he started programming video games professionally at fifteen and founded his own video game company at twenty-two. In addition to his entrepreneurial activities, he obtained a PhD in cognitive neuroscience from University College London in order to further his goal of building brain-inspired AI. Hassabis and his colleagues founded DeepMind Technologies in order
More specifically, here is how you could train our ConvNet to identify a given image as a dog or cat. First, collect many example images of dogs and cats—this is your “training set.” Also, create a file that gives a label for each image—that is, “dog” or “cat.” (Or better, take a hint from computer-vision researchers: Hire a graduate student to do all this for you. If you are a graduate student, then recruit an undergrad. No one enjoys this labeling chore!) Your training program initially sets all the weights in the network to random values. Then your program commences training: one by one, each image is given as the input to the network; the network performs its layer-by-layer calculations and finally outputs confidence percentages for “dog” and “cat.” For each image, your training program compares these output values to the “correct” values; for example, if the image is a dog, then “dog” confidence should be 100 percent and “cat” confidence should be 0 percent. Then the training program uses the back-propagation algorithm to change the weights throughout the network just a bit, so that the next time this image is seen, the confidences will be closer to the correct values.
  
to “tackle [the] really fundamental questions” about artificial intelligence.1 Perhaps not surprisingly, the DeepMind group saw video games as the proper venue for tackling those questions. Video games are, in Hassabis’s view, “like microcosms of the real world, but … cleaner and more constrained.”2
Following this procedure—input the image, then calculate the error at the output, then change the weights— for every image in your training set is called one “epoch” of training. Training a ConvNet requires many epochs, during which the network processes each image over and over again. Initially, the network will be very bad at recognizing dogs and cats, but slowly, as it changes its weights over many epochs, it will get increasingly better at the task. Finally, at some point, the network “converges”; that is, the weights stop changing much from one epoch to the next, and the network is (in principle!) very good at recognizing dogs and cats in the images in the training set. But we won’t know if the network is actually good at this task in general until we see if it can apply what it has learned to identify images from outside its training set. What’s really interesting is that, even though ConvNets are not constrained by a programmer to learn to detect any particular feature, when trained on large sets of real-world photographs, they indeed seem to learn a hierarchy of detectors similar to what Hubel and Wiesel found in the brain’s visual system.  
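For concreteness, here is a rough sketch of what such a training loop looks like in a modern framework (PyTorch). Everything here is a stand-in: the tiny network is not the book’s hypothetical ConvNet, and the random tensors play the role of labeled dog/cat images.

<pre>
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in "photos" (3-channel, 64 x 64) and "dog"(0)/"cat"(1) labels
images = torch.randn(200, 3, 64, 64)
labels = torch.randint(0, 2, (200,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

model = nn.Sequential(                               # a toy ConvNet-shaped classifier
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),                                # two outputs: confidence for "dog" and "cat"
)
loss_fn = nn.CrossEntropyLoss()                      # measures how far the outputs are from the correct label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):                              # one epoch = one pass over every training image
    for batch_images, batch_labels in loader:
        outputs = model(batch_images)                # forward pass: confidences for each image
        loss = loss_fn(outputs, batch_labels)        # compare to the human-given labels
        optimizer.zero_grad()
        loss.backward()                              # back-propagation of the error
        optimizer.step()                             # change every weight just a bit
</pre>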
  
FIGURE 27: An illustration of Atari’s Breakout game
In the next chapter, I’ll recount the extraordinary ascent of ConvNets from relative obscurity to near-complete dominance in machine vision, a transformation made possible by a concurrent technological revolution: that of “big data.”
  
Whatever your stance on video games, if you are going more for “clean and constrained” and less for “real world,” you might consider creating AI programs to play Atari video games from the 1970s and ’80s. This is exactly
==5 ConvNets and ImageNet ==
  
what the group at DeepMind decided to do. Depending on your age and interests, you might remember some of these classic games, such as Asteroids, Space Invaders, Pong, and Ms. Pac-Man. Are any of these ringing a bell? With their uncomplicated graphics and joystick controls, the games were easy enough for young children to learn but challenging enough to hold adults’ interest.
Yann LeCun, the inventor of ConvNets, has worked on neural networks all of his professional life, starting in the 1980s and continuing through the winters and springs of the field. As a graduate student and postdoctoral fellow, he was fascinated by Rosenblatt’s perceptrons and Fukushima’s neocognitron, but noted that the latter lacked a good supervised-learning algorithm. Along with other researchers (most notably, his postdoctoral advisor Geoffrey Hinton), LeCun helped develop such a learning method—essentially the same form of back-propagation used on
  
Consider the single-player game called Breakout, illustrated in figure 27. The player uses the joystick to move a “paddle” (white rectangle at lower right) back and forth. A “ball” (white circle) can be bounced off the paddle to hit different-colored rectangular “bricks.” The ball can also bounce off the gray “walls” at the sides. If the ball hits one of the bricks (patterned rectangles), the brick disappears, the player gains points, and the ball bounces back. Bricks in higher layers are worth more points than those in lower layers. If the ball hits the “ground” (bottom of the screen), the player loses one of five “lives,” and if any “lives” remain, a new ball shoots into play. The player’s goal is to maximize the score over the five lives.
ConvNets today.1
  
There’s an interesting side note here. Breakout was the result of Atari’s effort to create a single-player version of its successful game Pong. The design and implementation of Breakout were originally assigned in 1975 to a twenty-year-old employee named Steve Jobs. Yes, that Steve Jobs (later, cofounder of Apple). Jobs lacked sufficient engineering skills to do a good job on Breakout, so he enlisted his friend Steve Wozniak, aged twenty-five (later, the other cofounder of Apple), to help on the project. Wozniak and Jobs completed the hardware design of Breakout in four nights, starting work each night after Wozniak had completed his day job at Hewlett-Packard. Once released, Breakout, like Pong, was hugely popular among gamers.
In the 1980s and ’90s, while working at Bell Labs, LeCun turned to the problem of recognizing handwritten digits and letters. He combined ideas from the neocognitron with the back-propagation algorithm to create the semi-eponymous “LeNet”—one of the earliest ConvNets. LeNet’s handwritten-digit-recognition abilities made it a commercial success: in the 1990s and into the 2000s it was used by the U.S. Postal Service for automated zip code recognition, as well as in the banking industry for automated reading of digits on checks.
  
If you’re getting nostalgic but neglected to hang on to your old Atari 2600 game console, you can still find many websites offering Breakout and other games. In 2013, a group of Canadian AI researchers released a software platform called the Arcade Learning Environment that made it easy to test machine-learning systems on forty-nine of these games.3 This was the platform used by the DeepMind group in their work on reinforcement learning.
LeNet and its successor ConvNets did not do well in scaling up to more complex vision tasks. By the mid-1990s, neural networks started falling out of favor in the AI community, and other methods came to dominate the field. But LeCun, still a believer, kept working on ConvNets, gradually improving them. As Geoffrey Hinton later said of LeCun, “He kind of carried the torch through the dark ages.”2
  
===Deep Q-Learning ===
LeCun, Hinton, and other neural network loyalists believed that improved, larger versions of ConvNets and other deep networks would conquer computer vision if only they could be trained with enough data. Stubbornly, they kept working on the sidelines throughout the 2000s. In 2012, the torch carried by ConvNet researchers suddenly lit the vision world afire, by winning a computer-vision competition on an image data set called ImageNet.
  
The DeepMind group combined reinforcement learning—in particular Q-learning—with deep neural networks to create a system that could learn to play Atari video games. The group called their approach deep Q-learning. To explain how deep Q-learning works, I’ll use Breakout as a running example, but DeepMind used the same method on all the Atari games they tackled. Things will get a bit technical here, so fasten your seat belt (or skip to the next section).
===Building ImageNet ===
  
FIGURE 28: Illustration of a Deep Q-Network (DQN) for Breakout
AI researchers are a competitive bunch, so it’s no surprise that they like to organize competitions to drive the field forward. In the field of visual object recognition, researchers have long held annual contests to determine whose program performs the best. Each of these contests features a “benchmark data set”: a collection of photos, along with human-created labels that name objects in the photos.
  
Recall how we used Q-learning to train Rosie the robo-dog. In an episode of Q-learning, at each iteration the learning agent (Rosie) does the following: it figures out its current state, looks up that state in the Q-table, uses the values in the table to choose an action, performs that action, possibly receives a reward, and—the learning step— updates the values in its Q-table.
From 2005 to 2010, the most prominent of these annual contests was the PASCAL Visual Object Classes competition, which by 2010 featured about fifteen thousand photographs (downloaded from the photo-sharing site Flickr), with human-created labels for twenty object categories, such as “person,” “dog,” “horse,” “sheep,” “car,” “bicycle,” “sofa,” and “potted plant.”
  
DeepMind’s deep Q-learning is exactly the same, except that a convolutional neural network takes the place of the Q-table. Following DeepMind, I’ll call this network the Deep Q-Network (DQN). Figure 28 illustrates a DQN that is similar to (though simpler than) the one used by DeepMind for learning to play Breakout. The input to the DQN is the state of the system at a given time, which here is defined to be the current “frame”—the pixels of the current screen—plus three prior frames (screen pixels from three previous time steps). This definition of state provides the system with a small amount of memory, which turns out to be useful here. The outputs of the network are the estimated values for each possible action, given the input state. The possible actions are the following: move the paddle Left, move the paddle Right, and No-Op (“no operation,” that is, don’t move the paddle). The network itself is a ConvNet virtually identical to the one I described in chapter 4. Instead of the values in a Q-table, as we saw in the “Rosie” example, in deep Q-learning it is the weights in this network that are learned.
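A sketch of a network in the spirit of figure 28, written in PyTorch: the 84-by-84 frame size and the layer sizes are assumptions on my part rather than details taken from the book, but the overall shape (four stacked frames in, one value per joystick action out) matches the description above.

<pre>
import torch
import torch.nn as nn

N_ACTIONS = 3   # Left, Right, No-Op

dqn = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # input: 4 stacked grayscale frames
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, N_ACTIONS),                               # one estimated value per action
)

state = torch.randn(1, 4, 84, 84)   # a made-up "state": current frame plus three previous frames
action_values = dqn(state)          # shape (1, 3): estimated values for Left, Right, No-Op
</pre>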
The entries to the “classification” part of this contest3 were computer-vision programs that could take a photograph as input (without seeing its human-created label) and could then output, for each of the twenty categories, whether an object of that category was present in the image.  
  
DeepMind’s system learns to play Breakout over many episodes. Each episode corresponds to a play of the game, and each iteration during an episode corresponds to the system performing a single action. In particular, at each iteration the system inputs its state to the DQN and chooses an action based on the DQN’s output values. The system doesn’t always choose the action with the highest estimated value; as I mentioned above, reinforcement learning requires a balance between exploration and exploitation.4 The system performs its chosen action (for
Here’s how the competition worked. The organizers would split the photographs into a training set that contestants could use to train their programs and a test set, not released to contestants, that would be used to gauge the programs’ performance on images outside the training set. Prior to the competition, the training set would be offered online, and when the contest was held, researchers would submit their trained programs to be tested on the secret test set. The winning entry was the one that had the highest accuracy recognizing objects in the test-set images.  
  
example, moving the paddle some amount to the left) and possibly receives a reward if the ball happens to hit one of the bricks. The system then performs a step of learning—that is, updating the weights in the DQN via back-propagation.
The annual PASCAL competitions were a very big deal and did a lot to spur research in object recognition. Over the years of the challenge, the competing programs gradually got better (curiously, potted plants remained the hardest objects to recognize). However, some researchers were frustrated by the shortcomings of the PASCAL benchmark as a way to move computer vision forward. Contestants were focusing too much on PASCAL’s specific twenty object categories and were not building systems that could scale up to the huge number of object categories recognized by humans. Furthermore, there just weren’t enough photos in the data set for the competing systems to learn all the many possible variations in what the objects look like so as to be able to generalize well.  
  
How are the weights updated? This is the crux of the difference between supervised learning and reinforcement learning. As you’ll recall from earlier chapters, back-propagation works by changing a neural network’s weights so as to reduce the error in the network’s outputs. With supervised learning, measuring this error is straightforward. Remember our hypothetical ConvNet back in chapter 4 whose goal was to learn to classify photos as “dog” or “cat”? If an input training photo pictured a dog but the “dog” output confidence was only 20 percent, then the error for that output would be 100% − 20% = 80%; that is, ideally, the output should have been 80 points higher. The network could calculate the error because it had a label provided by a human.
To move ahead, the field needed a new benchmark image collection, one featuring a much larger set of categories and vastly more photos. Fei-Fei Li, a young computer-vision professor at Princeton, was particularly focused on this goal. By serendipity, she learned of a project led by a fellow Princeton professor, the psychologist George Miller, to create a database of English words, arranged in a hierarchy moving from most specific to most general, with groupings among synonyms. For example, consider the word cappuccino. The database, called WordNet, contains the following information about this term (where an arrow means “is a kind of”):
  
However, in reinforcement learning we have no labels. A given frame from the game doesn’t come labeled with the action that should be taken. How then do we assign an error to an output in this case?
:cappuccino ⇒ coffee ⇒ beverage ⇒ food ⇒ substance ⇒ physical entity ⇒ entity
  
Here’s the answer. Recall that if you are the learning agent, the value of an action in the current state is your estimate of how much reward you will receive by the end of the episode, if you choose this action (and continue choosing high-value actions). This estimate should be better the closer you get to the end of the episode, when you can tally up the actual rewards you received! The trick is to assume that the network’s outputs at the current iteration are closer to being correct than its outputs at the previous iteration. Then learning consists in adjusting the network weights (via back-propagation) so as to minimize the difference between the current and the previous iteration’s
The database also contains information that, say, beverage, drink, and potable are synonyms, that beverage is part of another chain including liquid, and so forth.  
  
outputs. Richard Sutton, one of the originators of this method, calls this “learning a guess from a guess.”5 I’ll amend that to “learning a guess from a better guess.”
WordNet had been (and continues to be) used extensively in research by psychologists and linguists as well as in AI natural-language processing systems, but Fei-Fei Li had a new idea: create an image database that is structured according to the nouns in WordNet, where each noun is linked to a large number of images containing examples of that noun. Thus the idea for ImageNet was born.  
  
In short, instead of learning to match its outputs to human-given labels, the network learns to make its outputs consistent from one iteration to the next, assuming that later iterations give better estimates of value than earlier iterations. This learning method is called temporal difference learning.
Li and her collaborators soon commenced collecting a deluge of images by using WordNet nouns as queries on image search engines such as Flickr and Google image search. However, if you’ve ever used an image search engine, you know that the results of a query are often far from perfect. For example, if you type “macintosh apple” into Google image search, you get photos not only of apples and Mac computers but also of apple-shaped candles, smartphones, bottles of apple wine, and any number of other nonrelevant items. Thus, Li and her colleagues had to have humans figure out which images were not actually illustrations of a given noun and get rid of them. At first, the humans who did this were mainly undergraduates. The work was agonizingly slow and taxing. Li soon figured out that at the rate they were going, it would take ninety years to complete the task.4
  
To recap, here’s how deep Q-learning works for the game of Breakout (and all the other Atari games). The system gives its current state as input to the Deep Q-Network. The Deep Q-Network outputs a value for each possible action. The system chooses and performs an action, resulting in a new state. Now the learning step takes place: the system inputs its new state to the network, which outputs a new set of values for each action. The difference between the new set of values and the previous set of values is considered the “error” of the network; this error is used by back-propagation to change the weights of the network. These steps are repeated over many episodes (plays of the game). Just to be clear, everything here—the Deep Q-Network, the virtual “joystick,” and the game itself—is software running in a computer.
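A bare-bones sketch of that learning step in PyTorch follows. The tiny fully connected network, the discount factor, and the learning rate are my own stand-ins (a real system would use a convolutional network like the one sketched earlier, and DeepMind’s version adds further refinements beyond this).

<pre>
import torch
import torch.nn as nn

GAMMA = 0.99                                                        # assumed discount factor
dqn = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 3))  # stand-in Deep Q-Network
optimizer = torch.optim.SGD(dqn.parameters(), lr=1e-3)

def deep_q_step(state, action, reward, next_state, done):
    """One weight update: move Q(state, action) toward reward + discounted best value of the next state."""
    q_current = dqn(state)[0, action]                 # the network's current guess
    with torch.no_grad():                             # the newer, "better" guess is treated as a fixed target
        q_target = reward + (0.0 if done else GAMMA * dqn(next_state).max())
    loss = (q_current - q_target) ** 2                # the squared difference plays the role of the "error"
    optimizer.zero_grad()
    loss.backward()                                   # back-propagation adjusts the weights a little
    optimizer.step()

# Example call with made-up numbers: an 8-number state, action index 1, no reward yet
deep_q_step(torch.randn(1, 8), 1, 0.0, torch.randn(1, 8), done=False)
</pre>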
Li and her collaborators brainstormed about possible ways to automate this work, but of course the problem of deciding if a photo is an instance of a particular noun is the task of object recognition itself! And computers were nowhere near to being reliable at this task, which was the whole reason for constructing ImageNet in the first place.  
  
This is essentially the algorithm developed by DeepMind’s researchers, although they used some tricks to improve it and speed it up.6 At first, before much learning has happened, the network’s outputs are quite random, and the system’s game playing looks quite random as well. But gradually, as the network learns weights that improve its outputs, the system’s playing ability improves, in many cases quite dramatically.
The group was at an impasse, until Li, by chance, stumbled upon a three-year-old website that could deliver the human smarts that ImageNet required. The website had the strange name Amazon Mechanical Turk.  
  
===The $650 Million Agent ===
===Mechanical Turk ===
  
The DeepMind group applied their deep Q-learning method to the forty-nine different Atari games in the Arcade Learning Environment. While DeepMind’s programmers used the same network architecture and hyperparameter settings for each game, their system learned each game from scratch; that is, the system’s knowledge (the network weights) learned for one game was not transferred when the system started learning to play the next game. Each game required training for thousands of episodes, but this could be done relatively quickly on the company’s advanced computer hardware.
According to Amazon, its Mechanical Turk service is “a marketplace for work that requires human intelligence.” The service connects requesters, people who need a task accomplished that is hard for computers, with workers, people who are willing to lend their human intelligence to a requester’s task, for a small fee (for example, labeling the objects in a photo, for ten cents per photo). Hundreds of thousands of workers have signed up, from all over the world. Mechanical Turk is the embodiment of Marvin Minsky’s “Easy things are hard” dictum: the human workers are hired to perform the “easy” tasks that are currently too hard for computers.  
  
FIGURE 29: DeepMind’s Breakout player discovered the strategy of tunneling through the bricks, which allowed it to quickly destroy high-value top bricks by bouncing off the “ceiling.”

The name Mechanical Turk comes from a famous eighteenth-century AI hoax: the original Mechanical Turk was a chess-playing “intelligent machine,” which secretly hid a human who controlled a puppet (the “Turk,” dressed like an Ottoman sultan) that made the moves. Evidently, it fooled many prominent people of the time, including Napoleon Bonaparte. Amazon’s service, while not meant to fool anyone, is, like the original Mechanical Turk, “Artificial Artificial Intelligence.”5
  
After a Deep Q-Network for each game was trained, DeepMind compared the machine’s level of play with that of a human “professional games tester,” who was allowed two hours of practice playing each game before being evaluated. Sound like a fun job? Only if you like being humiliated by a computer! DeepMind’s deep Q-learning programs turned out to be better players than the human tester on more than half the games. And on half of those games, the programs were more than twice as good as the human. And on half of those games, the programs were more than five times better. One stunning example was on Breakout, where the DQN program scored on average more than ten times the human’s average score.
Fei-Fei Li realized that if her group paid tens of thousands of workers on Mechanical Turk to sort out irrelevant images for each of the WordNet terms, the whole data set could be completed within a few years at a relatively low cost. In a mere two years, more than three million images were labeled with corresponding WordNet nouns to form the ImageNet data set. For the ImageNet project, Mechanical Turk was “a godsend.”6 The service continues to be widely used by AI researchers for creating data sets; nowadays, academic grant proposals in AI commonly include a line item for “Mechanical Turk workers.”
  
What, exactly, did these superhuman programs learn to do? Upon investigation, DeepMind found that their programs had discovered some very clever strategies. For example, the trained Breakout program had discovered a devious trick, illustrated in figure 29. The program learned that if the ball was able to knock out bricks so as to build a narrow tunnel through the edge of the brick layer, then the ball would bounce back and forth between the “ceiling” and the top of the brick layer, knocking out high-value top bricks very quickly without the player having to move the paddle at all.
===The ImageNet Competitions ===
  
DeepMind first presented this work in 2013 at an international machine-learning conference.7 The audience was dazzled. Less than a year later, Google announced that it was acquiring DeepMind for £440 million (about $650 million at the time), presumably because of these results. Yes, reinforcement learning occasionally leads to big rewards.
In 2010, the ImageNet project launched the first ImageNet Large Scale Visual Recognition Challenge, in order to spur progress toward more general object-recognition algorithms. Thirty-five programs competed, representing computer-vision researchers from academia and industry around the world. The competitors were given labeled training images—1.2 million of them—and a list of possible categories. The task for the trained programs was to output the correct category of each input image. The ImageNet competition had a thousand possible categories, compared with PASCAL’s twenty.  
  
With a lot of money in their pockets and the resources of Google behind them, DeepMind—now called Google DeepMind—took on a bigger challenge, one that had in fact long been considered one of AI’s “grand challenges”: creating a program that learns to play the game Go better than any human. DeepMind’s program
The thousand possible categories were a subset of WordNet terms chosen by the organizers. The categories  are a random-looking assembly of nouns, ranging from the familiar and commonplace (“lemon,” “castle,” “grand piano”) to the somewhat less common (“viaduct,” “hermit crab,” “metronome”), and on to the downright obscure (“Scottish deerhound,” “ruddy turnstone,” “hussar monkey”). In fact, obscure animals and plants—at least ones I wouldn’t be able to distinguish—constitute at least a tenth of the thousand target categories.  
  
AlphaGo builds on a long history of AI in board games. Let’s start with a brief survey of that history, which will help in explaining how AlphaGo works and why it is so significant.
Some of the photographs contain only one object; others contain many objects, including the “correct” one. Because of this ambiguity, a program gets to guess five categories for each image, and if the correct one is in this list, the program is said to be correct on this image. This is called the “top-5” accuracy metric.  
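In code, the metric is nothing more than a membership test; the helper below and the made-up guess list are mine, purely for illustration.

<pre>
def top_k_correct(guesses, correct_label, k=5):
    """True if the correct label appears among the program's first k guesses."""
    return correct_label in guesses[:k]

# Five hypothetical guesses, in order of confidence, for a photo of a basketball
guesses = ["croquet ball", "bikini", "warthog", "basketball", "moving van"]
print(top_k_correct(guesses, "basketball", k=5))   # True  -> counts as correct under top-5
print(top_k_correct(guesses, "basketball", k=1))   # False -> wrong under the stricter top-1 measure
</pre>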
  
===Checkers and Chess ===
The highest-scoring program in 2010 used a so-called support vector machine, the predominant object-recognition algorithm of the day, which employed sophisticated mathematics to learn how to assign a category to each input image. Using the top-5 accuracy metric, this winning program was correct on 72 percent of the 150,000 test images. Not a bad showing, though this means that the program was wrong, even with five guesses allowed, on more than 40,000 of the test images, leaving a lot of room for improvement. Notably, there were no neural networks among the top-scoring programs.
  
In 1949, the engineer Arthur Samuel joined IBM’s laboratory in Poughkeepsie, New York, and immediately set about programming an early version of IBM’s 701 computer to play checkers. If you yourself have any computer programming experience, you will appreciate the challenge he faced: as noted by one historian, “Samuel was the first person to do any serious programming on the 701 and as such had no system utilities [that is, essentially no operating system!] to call on. In particular he had no assembler and had to write everything using the op codes and
The following year, the highest-scoring program—also using support vector machines—showed a respectable but modest improvement, getting 74 percent of the test images correct. Most people in the field expected this trend to continue; computer-vision research would chip away at the problem, with gradual improvement at each annual competition.  
  
addresses.”8 To translate for my nonprogrammer readers, this is something like building a house using only a handsaw and a hammer. Samuel’s checkers-playing program was among the earliest machine-learning programs; indeed, it was Samuel who coined the term machine learning.
However, these expectations were upended in the 2012 ImageNet competition: the winning entry achieved an amazing 85 percent correct. Such a jump in accuracy was a shocking development. What’s more, the winning entry did not use support vector machines or any of the other dominant computer-vision methods of the day. Instead, it was a convolutional neural network. This particular ConvNet has come to be known as AlexNet, named after its main creator, Alex Krizhevsky, then a graduate student at the University of Toronto, supervised by the eminent neural network researcher Geoffrey Hinton. Krizhevsky, working with Hinton and a fellow student, Ilya Sutskever, created a scaled-up version of Yann LeCun’s LeNet from the 1990s; training such a large network was now made possible by increases in computer power. AlexNet had eight layers, with about sixty million weights whose values were learned via back-propagation from the million-plus training images.7 The Toronto group came up with some clever methods for making the network training work better, and it took a cluster of powerful computers about a week to train AlexNet.  
  
FIGURE 30: Part of a game tree for checkers. For simplicity, this figure shows only three possible moves from each board position. The white arrows point from a moved piece’s previous square to its current square.
AlexNet’s success sent a jolt through the computer-vision and broader AI communities, suddenly waking people up to the potential power of ConvNets, which most AI researchers hadn’t considered a serious contender in modern computer vision. In a 2015 article, the journalist Tom Simonite interviewed Yann LeCun about the unexpected triumph of ConvNets: LeCun recalls seeing the community that had mostly ignored neural networks pack into the room where the winners presented a paper on their results. “You could see right there a lot of senior people in the community just flipped,” he says. “They said, ‘Okay, now we buy it. That’s it, now—you won.’”8
  
Samuel’s checkers player was based on the method of searching a game tree, which is the basis of all programs for playing board games to this day (including AlphaGo, which I’ll describe below). Figure 30 illustrates part of a game tree for checkers. The “root” of the tree (by convention drawn at the top, unlike the root of a natural tree) shows the initial checkerboard, before either player has moved. The “branches” from the root lead to all possible moves for the first player (here, Black). There are seven possible moves (for simplicity, the figure shows only three of these). For each of those seven moves for Black, there are seven possible response moves for White (not all shown in the figure), and so on. Each of the boards in figure 30, showing a possible arrangement of pieces, is called a board position.
At almost the same time, Geoffrey Hinton’s group was also demonstrating that deep neural networks, trained on huge amounts of labeled data, were significantly better than the current state of the art in speech recognition. The Toronto group’s ImageNet and speech-recognition results had substantial ripple effects. Within a year, a small company started by Hinton was acquired by Google, and Hinton and his students Krizhevsky and Sutskever became Google employees. This acqui-hire instantly put Google at the forefront of deep learning.  
  
Imagine yourself playing a game of checkers. At each turn, you might construct a small part of this tree in your mind. You might say to yourself, “If I make this move, then my opponent could make that move, in which case I could make that move, which will set me up for a jump.” Most people, including the best players, consider only a few possible moves, looking ahead only a few steps before choosing which move to make. A fast computer, on the
Soon after, Yann LeCun was lured away from his full-time New York University professorship by Facebook to head up its newly formed AI lab. It didn’t take long before all the big tech companies (as well as many smaller ones) were snapping up deep-learning experts and their graduate students as fast as possible. Seemingly overnight, deep learning became the hottest part of AI, and expertise in deep learning guaranteed computer scientists a large salary in Silicon Valley or, better yet, venture capital funding for their proliferating deep-learning start-up companies.  
  
other hand, has the potential to perform this kind of look-ahead on a much larger scale. What’s stopping the computer from looking at every possible move and seeing which sequence of moves most quickly leads to a win? The problem is the same kind of exponential increase we saw back in chapter 3 (remember the king, the sage, and the grains of rice?). The average game of checkers has about fifty moves, which means that the game tree in figure 30 might extend down for fifty levels. At each level, there are on average six or seven branches from each possible board position. This means that the total number of board positions in the tree could be more than six raised to the fiftieth power—a ridiculously huge number. A hypothetical computer that could look at a trillion board positions per
The annual ImageNet competition began to see wider coverage in the media, and it quickly morphed from a friendly academic contest into a high-profile sparring match for tech companies commercializing computer vision. Winning at ImageNet would guarantee coveted respect from the vision community, along with free publicity, which might translate into product sales and higher stock prices. The pressure to produce programs that outperformed competitors was notably manifest in a 2015 cheating incident involving the giant Chinese internet company Baidu. The cheating involved a subtle example of what people in machine learning call data snooping.  
  
second would take more than 10<sup>19</sup> years to consider all the board positions in a single game tree. (As is often done, we can compare this number with the age of the universe, which is merely on the order of 10<sup>10</sup> years.) Clearly a complete search of the game tree is not feasible.
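As a rough check of that arithmetic: <math>6^{50}</math> is roughly <math>8 \times 10^{38}</math> board positions; at <math>10^{12}</math> positions per second that is about <math>8 \times 10^{26}</math> seconds, and since a year is roughly <math>3 \times 10^{7}</math> seconds, the search would indeed take on the order of <math>10^{19}</math> years.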
Here’s what happened: Before the competition, each team competing on ImageNet was given training images labeled with correct object categories. They were also given a large test set—a collection of images not in the training set—without any labels. Once a program was trained, a team could see how well their method performed on this test set. This helps test how well a program has learned to generalize (as opposed to, say, memorizing the training images and their labels). Only the performance on the test set counts. The way a team could find out how well their program did on the test set was to run their program on each test-set image, collect the top five guesses for each image, and submit this list to a “test server”—a computer run by the contest organizers. The test server would compare the submitted list with the (secret) correct answers and spit out the percentage correct.  
  
Fortunately, it’s possible for computers to play well without doing this kind of exhaustive search. On each of
Each team could sign up for an account on the test server and use it to see how well various versions of their programs were scoring; this would allow them to publish (and publicize) their results before the official results were announced.  
  
its turns, Samuel’s checkers-playing program created (in the computer’s memory) a small part of a game tree like the one in figure 30. The root of the tree was the player’s current board position, and the program, using its built-in knowledge of the rules of checkers, generated all the legal moves it could make from this current board position. It then generated all the legal moves that the opponent could make from each of the resulting positions, and so on, up to four or five turns (or “plies”) of look-ahead.9
A cardinal rule in machine learning is “Don’t train on the test data.” It seems obvious: If you include test data in any part of training your program, you won’t get a good measure of the program’s generalization abilities. It would be like giving students the questions on the final exam before they take the test. But it turns out that there are subtle ways that this rule can be unintentionally (or intentionally) broken to make your program’s performance look better than it actually is.  
  
The program then evaluated board positions that appeared at the end of the look-ahead process; in figure 30, these would be the board positions in the bottom row in the partial tree. Evaluating a board position means assigning it a numerical value that estimates how likely it is to lead to a win for the program. Samuel’s program used an evaluation function that gave points for various features of the board, such as Black’s advantage in total number of pieces, Black’s number of kings, and how many of Black’s pieces were close to being kinged. These specific features were chosen by Samuel using his knowledge of checkers. Once each of the bottom-row board positions was thus evaluated, the program employed a classic algorithm, called minimax, which used these values—from the end of the look-ahead process—in order to rate the program’s immediate possible moves from its current board position. The program then chose the highest-rated move.
One such method would be to submit your program’s test-set answers to the test server and, based on the result, tweak your program. Then submit again. Repeat this many times, until you have tweaked it to do better on the test set. This doesn’t require seeing the actual labels in the test set, but it does require getting feedback on accuracy and adjusting your program accordingly. It turns out that if you can do this enough times, it can be very effective in improving your program’s performance on the test set. But because you’re using information from the test set to change your program, you’ve now destroyed the ability to use the test set to see if your program generalizes well. It would be like allowing students to take a final exam many times, each time getting back a single grade, but using that single grade to try to improve their performance the next time around. Then, at the end, the students submit the version of their answers that got them the best score. This is no longer a good measure of how well the students have learned the subject, just a measure of how they adapted their answers to particular test questions.  
  
The intuition here is that the evaluation function will be more accurate when applied to board positions further along in the game; thus the program’s strategy is to first look at all possible move sequences a few steps into the future and then apply the evaluation function to the resulting board positions. The evaluations are then propagated back up the tree by minimax, which produces a rating of all the possible immediate moves from the current board position.10
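To show the shape of that procedure in code, here is a compact sketch of depth-limited minimax. To keep it self-contained I use a toy game (players alternately remove one, two, or three stones from a pile; whoever takes the last stone wins) instead of checkers, so the move generation, the win test, and the placeholder evaluation function are my own stand-ins rather than anything from Samuel’s program.

<pre>
def legal_moves(pile):
    return [m for m in (1, 2, 3) if m <= pile]

def evaluate(pile):
    # Heuristic guess used when the search stops early; a checkers program would
    # score features here (piece advantage, number of kings, and so on).
    return 0.0

def minimax(pile, depth, maximizing):
    """Rate a position by looking `depth` plies ahead, assuming both sides play their best."""
    if pile == 0:                           # the previous player took the last stone and won
        return -1.0 if maximizing else 1.0
    if depth == 0:
        return evaluate(pile)
    values = [minimax(pile - m, depth - 1, not maximizing) for m in legal_moves(pile)]
    return max(values) if maximizing else min(values)

def choose_move(pile, depth=5):
    """Pick the move whose resulting position gets the best rating for the current player."""
    return max(legal_moves(pile), key=lambda m: minimax(pile - m, depth - 1, False))

print(choose_move(9))   # prints 1: taking one stone leaves a pile of 8, a losing position for the opponent
</pre>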
To prevent this kind of data snooping while still allowing the ImageNet competitors to see how well their programs are doing, the organizers set a rule saying that each team could submit answers to the test server at most twice per week. This would limit the amount of feedback the teams could glean from the test runs.  
  
What the program learned was which features of the board should be included in the evaluation function at a given turn, as well as how to weight these different features when summing their points. Samuel experimented with several methods for learning in his system. In the most interesting version, the system learned while playing itself! The method for learning was somewhat complicated, and I won’t detail it here, but it had some aspects that foreshadowed modern reinforcement learning.11
The great ImageNet battle of 2015 was fought over a fraction of a percentage point—seemingly trivial but potentially very lucrative. Early in the year, a team from Baidu announced a method that achieved the highest (top-5) accuracy yet on an ImageNet test set: 94.67 percent, to be exact. But on the very same day, a team from Microsoft announced a better accuracy with their method: 95.06 percent. A few days later, a rival team from Google announced a slightly different method that did even better: 95.18 percent. This record held for a few months, until Baidu made a new announcement: it had improved its method and now could boast a new record, 95.42 percent. This result was widely publicized by Baidu’s public relations team.  
  
In the end, Samuel’s checkers player impressively rose to the level of a “better-than-average player,” though by no means a champion. It was characterized by some amateur players as “tricky but beatable.”12 But notably, the program was a publicity windfall for IBM: the day after Samuel demonstrated it on national television in 1956, IBM’s stock price rose by fifteen points. This was the first of several times IBM saw its stock price increase after a demonstration of a game-playing program beating humans; as a more recent example, IBM’s stock price similarly
But within a few weeks, a terse announcement came from the ImageNet organizers: “During the period of November 28th, 2014 to May 13th, 2015, there were at least 30 accounts used by a team from Baidu to submit to the test server at least 200 times, far exceeding the specified limit of two submissions per week.”9 In short, the Baidu team had been caught data snooping.
  
rose after the widely viewed TV broadcasts in which its Watson program won in the game show Jeopardy!
The two hundred points of feedback potentially allowed the Baidu team to determine which tweaks to their program would make it perform best on this test set, gaining it the all-important fraction of a percentage point that made the win. As punishment, Baidu was disqualified from entering its program in the 2015 competition.
  
While Samuel’s checkers player was an important milestone in AI history, I made this historical digression primarily to introduce three all-important concepts that it illustrates: the game tree, the evaluation function, and learning by self-play.
Baidu, hoping to minimize bad publicity, promptly apologized and then laid the blame on a rogue employee: “We found that a team leader had directed junior engineers to submit more than two submissions per week, a breach of the current ImageNet rules.”10 The employee, though disputing that he had broken any rules, was promptly fired from the company.  
  
===Deep Blue ===
While this story is merely an interesting footnote to the larger history of deep learning in computer vision, I tell it to illustrate the extent to which the ImageNet competition came to be seen as the key symbol of progress in computer vision, and AI in general.
  
Although Samuel’s “tricky but beatable” checkers program was remarkable, especially for its time, it hardly challenged people’s idea of themselves as uniquely intelligent. Even if a machine could win against human checkers champions (as one finally did in 1994<sup>13</sup>), mastering the game of checkers was never seen as a proxy for general intelligence. Chess is a different story. In the words of DeepMind’s Demis Hassabis, “For decades, leading computer scientists believed that, given the traditional status of chess as an exemplary demonstration of human
Cheating aside, progress on ImageNet continued. The final competition was held in 2017, with a winning top-5 accuracy of 98 percent. As one journalist commented, “Today, many consider ImageNet solved,”11 at least for the classification task. The community is moving on to new benchmark data sets and new problems, especially ones that integrate vision and language.
  
intellect, a competent computer chess player would soon also surpass all other human abilities.”14 Many people, including the early pioneers of AI Allen Newell and Herbert Simon, professed this exalted view of chess; in 1958 Newell and Simon wrote, “If one could devise a successful chess machine, one would seem to have penetrated to the
What was it that enabled ConvNets, which seemed to be at a dead end in the 1990s, to suddenly dominate the ImageNet competition, and subsequently most of computer vision in the last half a decade? It turns out that the recent success of deep learning is due less to new breakthroughs in AI than to the availability of huge amounts of data (thank you, internet!) and very fast parallel computer hardware. These factors, along with improvements in training methods, allow hundred-plus-layer networks to be trained on millions of images in just a few days.
  
core of human intellectual endeavor.”15
Yann LeCun himself was taken by surprise at how fast things turned around for his ConvNets: “It’s rarely the case where a technology that has been around for 20, 25 years—basically unchanged—turns out to be the best. The speed at which people have embraced it is nothing short of amazing. I’ve never seen anything like this before.”12
  
Chess is significantly more complex than checkers. For example, I said above that in checkers there are, on average, six or seven possible moves from any given board position. In contrast, chess has on average thirty-five moves from any given board position. This makes the chess game tree enormously larger than that of checkers. Over the decades, chess-playing programs kept improving, in lockstep with improvements in the speed of computer hardware. In 1997, IBM had its second big game-playing triumph with Deep Blue, a chess-playing program that beat the world champion Garry Kasparov in a widely broadcast multigame match.
=== The ConvNet Gold Rush ===
  
Deep Blue used much the same method as Samuel’s checkers player: at a given turn, it created a partial game tree using the current board position as the root; it applied its evaluation function to the furthest layer in the tree and then used the minimax algorithm to propagate the values up the tree in order to determine which move it should make. The major differences between Samuel’s program and Deep Blue were Deep Blue’s deeper look-ahead in its game tree, its more complex (chess-specific) evaluation function, hand-programmed chess knowledge, and specialized parallel hardware to make it run very fast. Furthermore, unlike Samuel’s checkers-playing program, Deep Blue did not use machine learning in any central way.
Once ImageNet and other large data sets gave ConvNets the vast amount of training examples they needed to work well, companies were suddenly able to apply computer vision in ways never seen before. As Google’s Blaise Agüera y Arcas remarked, “It’s been a sort of gold rush—attacking one problem after another with the same set of techniques.”13 Using ConvNets trained with deep learning, image search engines offered by Google, Microsoft, and others were able to vastly improve their “find similar images” feature. Google offered a photo-storage system that would tag your photos by describing the objects they contained, and Google’s Street View service could recognize and blur out street addresses and license plates in its images. A proliferation of mobile apps enabled smartphones to perform object and face recognition in real time.  
  
Like Samuel’s checkers player before it, Deep Blue’s defeat of Kasparov spurred a significant increase in IBM’s stock price.16 This defeat also generated considerable consternation in the media about the implications for superhuman intelligence as well as doubts about whether humans would still be motivated to play chess. But in the decades since Deep Blue, humanity has adapted. As Claude Shannon wrote presciently in 1950, a machine that can surpass humans at chess “will force us either to admit the possibility of mechanized thinking or to further restrict our concept of thinking.”17 The latter happened. Superhuman chess playing is now seen as something that doesn’t require general intelligence. Deep Blue isn’t intelligent in any sense we mean today. It can’t do anything but play chess, and it doesn’t have any conception of what “playing a game” or “winning” means to humans. (I once heard a speaker say, “Deep Blue may have beat Kasparov, but it didn’t get any joy out of it.”) Moreover, chess has survived
Facebook labeled your uploaded photos with names of your friends and registered a patent on classifying the emotions behind facial expressions in uploaded photos; Twitter developed a filter that could screen tweets for pornographic images; and several photo- and video-sharing sites started applying tools to detect imagery associated with terrorist groups. ConvNets can be applied to video and used in self-driving cars to track pedestrians, or to read lips and classify body language. ConvNets can even diagnose breast and skin cancer from medical images, determine the stage of diabetic retinopathy, and assist physicians in treatment planning for prostate cancer.  
  
—even prospered—as a challenging human activity. Nowadays, computer-chess programs are used by human
These are just a few examples of the many existing (or soon-to-exist) commercial applications powered by ConvNets. In fact, there’s a good chance that any modern computer-vision application you use employs ConvNets. Moreover, there’s an excellent chance it was “pretrained” on images from ImageNet to learn generic visual features before being “fine-tuned” for more specific tasks.
  
players as a kind of training aid, in the way a baseball player might practice using a pitching machine. Is this a result of our evolving notion of intelligence, which advances in AI help to clarify? Or is it another example of John McCarthy’s maxim: “As soon as it works, no one calls it AI anymore”?18
+
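
To illustrate the “pretrained, then fine-tuned” recipe, here is a minimal sketch using PyTorch and torchvision (my choice of library; the text does not name one). It loads a ResNet-18 ConvNet whose weights were learned on ImageNet, freezes those generic visual features, and attaches a new final layer for a hypothetical five-category task.

<syntaxhighlight lang="python">
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a ConvNet pretrained on ImageNet (torchvision 0.13+ weights API).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained layers so only the new classifier head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1,000-way ImageNet classifier with a head for our own task
# (a hypothetical 5-category problem).
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer's parameters are handed to the optimizer.
optimizer = optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# Fine-tuning then loops over (image, label) batches from the new data set:
#   optimizer.zero_grad()
#   loss = loss_fn(model(images), labels)
#   loss.backward()
#   optimizer.step()
</syntaxhighlight>
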
Given that the extensive training required by ConvNets is feasible only with specialized computer hardware—typically, powerful graphics processing units (GPUs)—it is not surprising that the stock price of the NVIDIA Corporation, the most prominent maker of GPUs, increased by over 1,000 percent between 2012 and 2017.
  
===Have ConvNets Surpassed Humans at Object Recognition? ===
  
As I learned more about the remarkable success of ConvNets, I wondered how close they were to rivaling our own human object-recognition abilities. A 2015 paper from Baidu (post–cheating scandal) carried the subtitle “Surpassing Human-Level Performance on ImageNet Classification.”14 At about the same time, Microsoft announced in a research blog “a major advance in technology designed to identify the objects in a photograph or video, showcasing a system whose accuracy meets and sometimes exceeds human-level performance.”15 While both companies made it clear they were talking about accuracy specifically on ImageNet, the media were not so careful, giving way to sensational headlines such as “Computers Now Better than Humans at Recognising and Sorting Images” and “Microsoft Has Developed a Computer System That Can Identify Objects Better than Humans.”16
  
Let’s look a bit harder at the specific contention that machines are now “better than humans” at object recognition on ImageNet. This assertion is based on a claim that humans have an error rate of about 5 percent, whereas the error rate of machines is (at the time of this writing) close to 2 percent. Doesn’t this confirm that machines are better than humans at this task? As is often the case for highly publicized claims about AI, the claim comes with a few caveats.  
  
Here’s one caveat. When you read about a machine “identifying objects correctly,” you’d think that, say, given an image of a basketball, the machine would output “basketball.” But of course, on ImageNet, correct identification means only that the correct category is in the machine’s top-five categories. If, given an image of a basketball, the machine outputs “croquet ball,” “bikini,” “warthog,” “basketball,” and “moving van,” in that order, it is considered correct. I don’t know how often this kind of thing happens, but it’s notable that the best top-1 accuracy—the fraction of test images on which the correct category is at the top of the list—was about 82 percent, compared with 98 percent top-5 accuracy, in the 2017 ImageNet competition. No one, as far as I know, has reported a comparison between machines and humans on top-1 accuracy.
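
To make the top-1 versus top-5 distinction concrete, here is a small sketch (mine, not part of any official ImageNet tooling) that computes both scores from a model’s ranked guesses; the example predictions below are invented for illustration.

<syntaxhighlight lang="python">
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of test images whose correct category appears among
    the model's k highest-ranked guesses."""
    hits = sum(1 for guesses, truth in zip(ranked_predictions, true_labels)
               if truth in guesses[:k])
    return hits / len(true_labels)

# Hypothetical output for two test images, best guess first.
ranked_predictions = [
    ["croquet ball", "bikini", "warthog", "basketball", "moving van"],
    ["tabby cat", "tiger cat", "Egyptian cat", "lynx", "remote control"],
]
true_labels = ["basketball", "dog"]

print(top_k_accuracy(ranked_predictions, true_labels, k=1))  # 0.0: neither first guess is right
print(top_k_accuracy(ranked_predictions, true_labels, k=5))  # 0.5: "basketball" appears in the first top five
</syntaxhighlight>
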
  
Here’s another caveat. Consider the claim, “Humans have an error rate of about 5% on ImageNet.” It turns out that saying “humans” is not quite accurate; this result is from an experiment involving a single human, one Andrej Karpathy, who was at the time a graduate student at Stanford, researching deep learning. Karpathy wanted to see if he could train himself to compete against the best ConvNets on ImageNet. Considering that ConvNets train on 1.2 million images and then are run on 150,000 test images, this is a daunting task for a human. Karpathy, who has a popular blog about AI, wrote about his experience:
<blockquote>I ended up training [myself] on 500 images and then switched to [a reduced] test set of 1,500 images. The labeling [that is, Karpathy’s guessing five categories per image] happened at a rate of about 1 per minute, but this decreased over time. I only enjoyed the first ~200, and the rest I only did #forscience.… Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs.17</blockquote>
  
Karpathy found that he was wrong on about 75 of his 1,500 test images. Analyzing his errors, he found that they were largely due to images containing multiple objects; images of fine-grained categories, such as specific breeds of dogs or species of birds or plants; and object categories that he didn’t realize were included in the target categories. The kinds of errors made by ConvNets are different: while they also get confused by images containing multiple objects, unlike humans they tend to miss objects that are small in the image, objects that have been distorted by color or contrast filters the photographer applied to the image, and “abstract representations” of objects, such as a painting or statue of a dog, or a stuffed toy dog. Thus, the claim that computers have bested humans on ImageNet needs to be taken with a large grain of salt.
  
Here’s a caveat that might surprise you. When a human says that a photo contains, say, a dog, we assume it’s because the human actually saw a dog in the photo. But if a ConvNet correctly says “dog,” how do we know it actually is basing this classification on the dog in the image? Maybe there’s something else in the image—a tennis ball, a Frisbee, a chewed-up shoe—that was often associated with dogs in the training images, and the ConvNet is recognizing these and assuming there is a dog in the photo. These kinds of correlations have often ended up fooling machines.
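
One quick way to probe this concern (my illustration; the chapter does not prescribe a method) is an occlusion test: slide a blank patch across the image and watch how the network’s confidence in “dog” changes. If the confidence collapses only when the patch covers the tennis ball rather than the dog, the network is probably keying on context. A minimal sketch, assuming a PyTorch image classifier:

<syntaxhighlight lang="python">
import torch

def occlusion_map(model, image, target_class, patch=32, stride=16, fill=0.0):
    """Slide a square patch over the image and record the model's confidence
    in target_class each time; low values mark the regions the prediction
    actually depends on. `image` is a (3, H, W) tensor, already normalized."""
    model.eval()
    _, height, width = image.shape
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                occluded = image.clone()
                y, x = i * stride, j * stride
                occluded[:, y:y + patch, x:x + patch] = fill  # cover this region
                probs = torch.softmax(model(occluded.unsqueeze(0)), dim=1)
                heat[i, j] = probs[0, target_class]
    return heat
</syntaxhighlight>
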
  
One thing we could do is ask the machine to not only output an object category for an image but also learn to draw a box around the target object, so we know the machine has actually “seen” the object. This is precisely what the ImageNet competition started doing in its second year with its “localization challenge.” The localization task provided training images with such boxes drawn (by Mechanical Turk workers) around the target object(s) in each image; on the test images, the task for competing programs was to predict five object categories, each with the coordinates of a corresponding box. What may be surprising is that while deep convolutional neural networks have performed very well at localization, their performance on it has remained significantly worse than their performance on categorization, although newer competitions are focusing on precisely this problem.
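
Scoring the localization task requires deciding when a predicted box “matches” the annotators’ box. The usual measure in benchmarks of this kind is intersection over union (IoU); a prediction typically counts as correct only if the category is right and the IoU with a ground-truth box exceeds a threshold, commonly 0.5. Here is a small sketch of that computation, with boxes given as (x1, y1, x2, y2) corners and made-up coordinates:

<syntaxhighlight lang="python">
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted "dog" box versus the annotator's box (invented coordinates):
predicted = (50, 40, 200, 180)
ground_truth = (60, 50, 210, 190)
print(iou(predicted, ground_truth) >= 0.5)  # True: this localization would count as a hit
</syntaxhighlight>
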
  
Probably the most important differences between today’s ConvNets and humans when it comes to recognizing objects are in how learning takes place and in how robust and reliable that learning turns out to be. I’ll explore these differences in the next chapter.
  
The caveats I described above aren’t meant to diminish the amazing recent progress in computer vision. There is no question that convolutional neural networks have been stunningly successful in this and other areas, and these successes have not only produced commercial products but also resulted in a real sense of optimism in the AI community. My discussion is meant to illustrate how challenging vision turns out to be and to add some perspective on the progress made so far. Object recognition is not yet close to being “solved” by artificial intelligence.

===The Grand Challenge of Go ===

The game of Go has been around for more than two thousand years and is considered among the most difficult of all board games. If you’re not a Go player, don’t worry; none of my discussion here will require any prior knowledge of the game. But it’s useful to know that the game has serious status, especially in East Asia, where it is extremely popular. “Go is a pastime beloved by emperors and generals, intellectuals and child prodigies,” writes the scholar and journalist Alan Levinovitz, who goes on to quote the South Korean Go champion Lee Sedol: “There is chess in the western world, but Go is incomparably more subtle and intellectual.”19

Go is a game that has fairly simple rules but produces what you might call emergent complexity. At each turn, a player places a piece of his or her color (black or white) on a nineteen-by-nineteen-square board, following rules for where pieces may be placed and how to capture one’s opponent’s pieces. Unlike chess, with its hierarchy of pawns, bishops, queens, and so on, pieces in Go (“stones”) are all equal. It’s the configuration of stones on the board that a player must quickly analyze to decide on a move.

Creating a program to play Go well has been a focus of AI since the field’s early days. However, Go’s complexity made this task remarkably hard. In 1997, the same year Deep Blue beat Kasparov, the best Go programs could still be easily defeated by average players. Deep Blue, you’ll recall, was able to do a significant amount of look-ahead from any board position and then use its evaluation function to assign values to future board positions, where each value predicted whether a particular board position would lead to a win. Go programs are not able to use this strategy for two reasons. First, the size of a look-ahead tree in Go is dramatically larger than that in chess. Whereas a chess player must choose from on average 35 possible moves from a given board position, a Go player has on average 250 such possibilities. Even with special-purpose hardware, a Deep Blue–style brute-force search of the Go game tree is just not feasible. Second, no one has succeeded in creating a good evaluation function for Go board positions. That is, no one has been able to construct a successful formula that examines a board position in Go and predicts who is going to win. The best (human) Go players rely on their pattern-recognition skills and their ineffable intuition.

AI researchers haven’t yet figured out how to encode intuition into an evaluation function. This is why, in 1997, the same year that Deep Blue beat Kasparov, the journalist George Johnson wrote in The New York Times, “When or if a computer defeats a human Go champion, it will be a sign that artificial intelligence is truly beginning to become as good as the real thing.”20 This may sound familiar—just like what people used to say about chess! Johnson quoted one Go enthusiast’s prediction: “It may be a hundred years before a computer beats humans at Go—maybe even longer.” A mere twenty years later, AlphaGo, which learned to play Go via deep Q-learning, beat Lee Sedol in a five-game match.

===AlphaGo Versus Lee Sedol ===

Before I explain how AlphaGo works, let’s first commemorate its spectacular wins against Lee Sedol, one of the world’s best Go players. Even after watching AlphaGo defeat the then European Go champion Fan Hui half a year earlier, Lee remained confident that he would prevail: “I think [AlphaGo’s] level doesn’t match mine.… Of course, there would have been many updates in the last four or five months, but that isn’t enough time to challenge me.”21

Perhaps you were one of the more than two hundred million people who