|
<html><head> |
|
<title>The Bitter Lesson</title> |
|
|
</head> |
|
<body> |
|
<span class="style1"> |
|
<h1>The Bitter Lesson<br> |
|
</h1> |
|
<h2>Rich Sutton</h2> |
|
<h3>March 13, 2019<br> |
|
</h3> |
|
The biggest lesson that can be read from 70 years of AI research is |
|
that general methods that leverage computation are ultimately the most |
|
effective, and by a large margin. The ultimate reason for this is |
|
Moore's law, or rather its generalization of continued exponentially |
|
falling cost per unit of computation. Most AI research has been |
|
conducted as if the computation available to the agent were constant |
|
(in which case leveraging human knowledge would be one of the few ways
|
to improve performance) but, over a slightly longer time than a typical |
|
research project, massively more computation inevitably becomes |
|
available. Seeking an improvement that makes a difference in the |
|
shorter term, researchers try to leverage their human knowledge of the
|
domain, but the only thing that matters in the long run is the |
|
leveraging of computation. These two need not run counter to each |
|
other, but in practice they tend to. Time spent on one is time not |
|
spent on the other. There are psychological commitments to investment |
|
in one approach or the other. And the human-knowledge approach tends to |
|
complicate methods in ways that make them less suited to taking |
|
advantage of general methods leveraging computation. There have been
|
many examples of AI researchers' belated learning of this bitter |
|
lesson, |
|
and it is instructive to review some of the most prominent.<br> |
|
<br> |
|
In computer chess, the methods that defeated the world champion, |
|
Kasparov, in 1997, were based on massive, deep search. At the time, |
|
this was looked upon with dismay by the majority of computer-chess |
|
researchers who had pursued methods that leveraged human understanding |
|
of the special structure of chess. When a simpler, search-based |
|
approach with special hardware and software proved vastly more |
|
effective, these human-knowledge-based chess researchers were not good |
|
losers. They said that "brute force" search may have won this time,
|
but it was not a general strategy, and anyway it was not how people |
|
played chess. These researchers wanted methods based on human input to |
|
win and were disappointed when they did not.<br> |
|
<br> |
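For readers who want to see what "massive, deep search" looks like mechanically, below is a minimal sketch of negamax search with alpha-beta pruning, the family of techniques behind brute-force game search. It is emphatically not Deep Blue's program: the game is a toy Nim variant (take 1, 2, or 3 stones; whoever takes the last stone wins) chosen only so the example is self-contained, and the names and depth settings are ad hoc. The only knowledge built in is the rules and the terminal result; what sharpens the root value is searching deeper, that is, spending more computation.<br>
<pre>
# A minimal negamax search with alpha-beta pruning (an illustrative sketch of
# deep game search in general, not Deep Blue's actual program).  The game is a
# toy Nim variant: take 1, 2, or 3 stones, and whoever takes the last stone
# wins.  It is used only so the example is self-contained and runnable.

import math

def legal_moves(stones):
    return [take for take in (1, 2, 3) if take <= stones]

def negamax(stones, depth, alpha=-math.inf, beta=math.inf):
    """Value of the position for the player to move: +1 is a proven win,
    -1 a proven loss, 0 means the search ran out of depth before deciding."""
    if stones == 0:
        return -1                 # the previous player took the last stone
    if depth == 0:
        return 0                  # no heuristic knowledge, just "undecided"
    best = -math.inf
    for take in legal_moves(stones):
        best = max(best, -negamax(stones - take, depth - 1, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:         # prune branches the opponent avoids anyway
            break
    return best

def best_move(stones, depth):
    """Deeper search means more computation and better play, with nothing
    game-specific added beyond the rules themselves."""
    return max(legal_moves(stones),
               key=lambda take: -negamax(stones - take, depth - 1))

if __name__ == "__main__":
    for depth in (1, 2, 4, 6):    # shallow searches cannot resolve the position
        print("depth", depth, "-> root value", negamax(9, depth))
    print("best move from 9 stones, searched to the end:", best_move(9, 9))
</pre>
<br>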
|
A similar pattern of research progress was seen in computer Go, only |
|
delayed by a further 20 years. Enormous initial efforts went into |
|
avoiding search by taking advantage of human knowledge, or of the |
|
special features of the game, but all those efforts proved irrelevant, |
|
or worse, once search was applied effectively at scale. Also important |
|
was the use of learning by self play to learn a value function (as it |
|
was in many other games and even in chess, although learning did not |
|
play a big role in the 1997 program that first beat a world champion). |
|
Learning by self play, and learning in general, is like search in that |
|
it enables massive computation to be brought to bear. Search and |
|
learning are the two most important classes of techniques for utilizing |
|
massive amounts of computation in AI research. In computer Go, as in |
|
computer chess, researchers' initial effort was directed towards |
|
utilizing human understanding (so that less search was needed) and only |
|
much later did much greater success come from embracing search and
learning.<br>
|
<br> |
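To make the idea of learning a value function by self-play concrete, below is a sketch of it in the smallest setting that still runs end to end: a tabular value function for tic-tac-toe, improved by temporal-difference updates from games the program plays against itself. It shows only the shape of the technique the paragraph refers to, not the method of any particular Go or chess program; the learning rate, exploration rate, and game budget are arbitrary choices made for the example.<br>
<pre>
# Self-play value learning on tic-tac-toe (a minimal, illustrative sketch of
# the general "learn a value function by self-play" idea, not AlphaGo or any
# specific program).  The value table maps a board to the estimated outcome
# for X (1 win, 0 loss, 0.5 draw); both sides improve it by playing each other.

import random

EMPTY, X, O = ".", "X", "O"
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return None

def outcome(board):
    """1.0 if X has won, 0.0 if O has won, 0.5 for a draw, None if unfinished."""
    w = winner(board)
    if w == X: return 1.0
    if w == O: return 0.0
    if EMPTY not in board: return 0.5
    return None

V = {}                      # board string -> estimated outcome for X
ALPHA, EPSILON = 0.2, 0.1   # ad hoc learning rate and exploration rate

def value(board):
    done = outcome(board)
    return done if done is not None else V.setdefault(board, 0.5)

def choose(board, player):
    """Epsilon-greedy move: X maximizes the value estimate, O minimizes it."""
    empties = [i for i, c in enumerate(board) if c == EMPTY]
    if random.random() < EPSILON:
        return random.choice(empties)
    scored = [(value(board[:i] + player + board[i+1:]), i) for i in empties]
    best = max(scored) if player == X else min(scored)
    return best[1]

def self_play_game():
    board, player = EMPTY * 9, X
    while outcome(board) is None:
        i = choose(board, player)
        nxt = board[:i] + player + board[i+1:]
        # TD(0) update: move the old state's value toward the new state's value.
        V[board] = value(board) + ALPHA * (value(nxt) - value(board))
        board, player = nxt, (O if player == X else X)

if __name__ == "__main__":
    for _ in range(20000):      # more games (more computation), better estimates
        self_play_game()
    print("learned value of the empty board:", round(value(EMPTY * 9), 3))
</pre>
<br>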
|
In speech recognition, there was an early competition, sponsored by |
|
DARPA, in the 1970s. Entrants included a host of special methods that |
|
took |
|
advantage of human knowledge: knowledge of words, of phonemes, of the
|
human vocal tract, etc. On the other side were newer methods that were |
|
more statistical in nature and did much more computation, based on |
|
hidden Markov models (HMMs). Again, the statistical methods won out |
|
over the human-knowledge-based methods. This led to a major change in |
|
all of natural language processing, gradually over decades, where |
|
statistics and computation came to dominate the field. The recent rise |
|
of deep learning in speech recognition is the most recent step in this |
|
consistent direction. Deep learning methods rely even less on human |
|
knowledge, and use even more computation, together with learning on |
|
huge training sets, to produce dramatically better speech recognition |
|
systems. As in the games, researchers always tried to make systems that
worked the way they thought their own minds worked; they tried to put
that knowledge into their systems. But it proved ultimately
counterproductive, and a colossal waste of researchers' time, when,
through Moore's law, massive computation became available and a means
was found to put it to good use.<br>
|
<br> |
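For a sense of what the statistical side actually computed, below is a sketch of the forward algorithm for a hidden Markov model: it sums the probability of an observation sequence over every possible sequence of hidden states, which is the basic recursion the HMM recognizers were built on. The two-state model and all of its probabilities are invented for the example; a real recognizer would have far more states and parameters estimated from data.<br>
<pre>
# Forward algorithm for a tiny hidden Markov model (an illustrative sketch of
# the statistical machinery the DARPA-era systems built on, not any actual
# recognizer).  Every probability below is made up for the example.

states = ("s0", "s1")
start = {"s0": 0.6, "s1": 0.4}                 # initial state distribution
trans = {"s0": {"s0": 0.7, "s1": 0.3},         # transition probabilities
         "s1": {"s0": 0.4, "s1": 0.6}}
emit  = {"s0": {"a": 0.9, "b": 0.1},           # emission probabilities
         "s1": {"a": 0.2, "b": 0.8}}

def likelihood(observations):
    """P(observations | model), summed over all hidden-state paths."""
    # Initialize with the first observation.
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    # Fold in each later observation, one step of the forward recursion at a time.
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

if __name__ == "__main__":
    print("P(a, b, b, a) =", likelihood(("a", "b", "b", "a")))
</pre>
<br>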
|
In computer vision, there has been a similar pattern. Early methods |
|
conceived of vision as searching for edges, or generalized cylinders, |
|
or in terms of SIFT features. But today all this is discarded. Modern |
|
deep-learning neural networks use only the notions of convolution and |
|
certain kinds of invariances, and perform much better.<br> |
|
<br> |
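The structural assumption that displaced the hand-designed features is small enough to write out. Below is a sketch that applies a single convolution filter to a toy image by hand: the same few weights are reused at every position, so whatever pattern the filter responds to is picked up wherever it appears. The filter values here are arbitrary; in a deep network they would be learned from data and stacked in many layers.<br>
<pre>
# A single 2-D convolution applied by hand (an illustrative sketch of the one
# structural assumption modern vision networks build in: the same small filter
# is applied at every position).  The filter weights are arbitrary here; in a
# real network they are learned from data.

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution (really cross-correlation, as in deep nets)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # The same weights are reused at every (r, c): weight sharing.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

if __name__ == "__main__":
    # A toy 5x5 "image" with a bright vertical stripe in the middle.
    image = [[0, 0, 9, 0, 0] for _ in range(5)]
    # An arbitrary vertical-edge-style filter.
    kernel = [[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]]
    for row in convolve2d(image, kernel):
        print(row)
</pre>
<br>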
|
This is a big lesson. As a field, we still have not thoroughly learned |
|
it, as we are continuing to make the same kind of mistakes. To see |
|
this, and to effectively resist it, we have to understand the appeal of |
|
these mistakes. We have to learn the bitter lesson that building in how |
|
we think we think does not work in the long run. The bitter lesson is |
|
based on the historical observations that 1) AI researchers have often |
|
tried to build knowledge into their agents, 2) this always helps in the |
|
short term, and is personally satisfying to the researcher, but 3) in |
|
the long run it plateaus and even inhibits further progress, and 4) |
|
breakthrough progress eventually arrives by an opposing approach based |
|
on scaling computation by search and learning. The eventual success is |
|
tinged with bitterness, and often incompletely digested, because it is |
|
success over a favored, human-centric approach. <br> |
|
<br> |
|
One thing that should be learned from the bitter lesson is the great |
|
power of general purpose methods, of methods that continue to scale |
|
with increased computation even as the available computation becomes |
|
very great. The two methods that seem to scale arbitrarily in this way |
|
are <span style="font-style: italic;">search</span> and <span style="font-style: italic;">learning</span>. <br> |
|
<br> |
|
The second general point to be learned from the bitter lesson is that |
|
the actual contents of minds are tremendously, irredeemably complex; we |
|
should stop trying to find simple ways to think about the contents of |
|
minds, such as simple ways to think about space, objects, multiple |
|
agents, or symmetries. All these are part of the arbitrary, |
|
intrinsically complex outside world. They are not what should be built
|
in, as their complexity is endless; instead we should build in only the |
|
meta-methods that can find and capture this arbitrary complexity. |
|
Essential to these methods is that they can find good approximations, |
|
but the search for them should be by our methods, not by us. We want AI |
|
agents that can discover as we can, not ones that contain what we have
|
discovered. Building in our discoveries only makes it harder to see how |
|
the discovering process can be done.<br> |
|
<br> |
|
</span> |
|
|
|
|
|
</body></html> |