Sutton, Richard S.,
Reinforcement learning : an introduction /
Richard S. Sutton and Andrew G. Barto.
- Second edition.
- xxii, 526 pages : illustrations (some color) ; 24 cm.
- Adaptive computation and machine learning series.
The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence. Reinforcement learning is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms. This second edition adds new topics and expands coverage of others. Like the first edition, it focuses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case, for which exact solutions can be found. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and Double Learning. Part II extends these ideas to function approximation, with new sections on topics such as artificial neural networks and the Fourier basis, and offers expanded treatment of off-policy learning and policy-gradient methods. Part III adds new chapters on reinforcement learning's relationships to psychology and neuroscience, as well as an updated case-studies chapter that includes AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson's wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.
Includes bibliographical references (pages 481-518) and index.
Contents:
Preface to the Second Edition -- Preface to the First Edition -- Summary of Notation
1 Introduction: 1.1 Reinforcement Learning -- 1.2 Examples -- 1.3 Elements of Reinforcement Learning -- 1.4 Limitations and Scope -- 1.5 An Extended Example: Tic-Tac-Toe -- 1.6 Summary -- 1.7 Early History of Reinforcement Learning
Part I: Tabular Solution Methods
2 Multi-armed Bandits: 2.1 A k-armed Bandit Problem -- 2.2 Action-value Methods -- 2.3 The 10-armed Testbed -- 2.4 Incremental Implementation -- 2.5 Tracking a Nonstationary Problem -- 2.6 Optimistic Initial Values -- 2.7 Upper-Confidence-Bound Action Selection -- 2.8 Gradient Bandit Algorithms -- 2.9 Associative Search (Contextual Bandits) -- 2.10 Summary
3 Finite Markov Decision Processes: 3.1 The Agent–Environment Interface -- 3.2 Goals and Rewards -- 3.3 Returns and Episodes -- 3.4 Unified Notation for Episodic and Continuing Tasks -- 3.5 Policies and Value Functions -- 3.6 Optimal Policies and Optimal Value Functions -- 3.7 Optimality and Approximation -- 3.8 Summary
4 Dynamic Programming: 4.1 Policy Evaluation (Prediction) -- 4.2 Policy Improvement -- 4.3 Policy Iteration -- 4.4 Value Iteration -- 4.5 Asynchronous Dynamic Programming -- 4.6 Generalized Policy Iteration -- 4.7 Efficiency of Dynamic Programming -- 4.8 Summary
5 Monte Carlo Methods: 5.1 Monte Carlo Prediction -- 5.2 Monte Carlo Estimation of Action Values -- 5.3 Monte Carlo Control -- 5.4 Monte Carlo Control without Exploring Starts -- 5.5 Off-policy Prediction via Importance Sampling -- 5.6 Incremental Implementation -- 5.7 Off-policy Monte Carlo Control -- 5.8 *Discounting-aware Importance Sampling -- 5.9 *Per-decision Importance Sampling -- 5.10 Summary
6 Temporal-Difference Learning: 6.1 TD Prediction -- 6.2 Advantages of TD Prediction Methods -- 6.3 Optimality of TD(0) -- 6.4 Sarsa: On-policy TD Control -- 6.5 Q-learning: Off-policy TD Control -- 6.6 Expected Sarsa -- 6.7 Maximization Bias and Double Learning -- 6.8 Games, Afterstates, and Other Special Cases -- 6.9 Summary
7 n-step Bootstrapping: 7.1 n-step TD Prediction -- 7.2 n-step Sarsa -- 7.3 n-step Off-policy Learning -- 7.4 *Per-decision Methods with Control Variates -- 7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm -- 7.6 *A Unifying Algorithm: n-step Q(σ) -- 7.7 Summary
8 Planning and Learning with Tabular Methods: 8.1 Models and Planning -- 8.2 Dyna: Integrated Planning, Acting, and Learning -- 8.3 When the Model Is Wrong -- 8.4 Prioritized Sweeping -- 8.5 Expected vs. Sample Updates -- 8.6 Trajectory Sampling -- 8.7 Real-time Dynamic Programming -- 8.8 Planning at Decision Time -- 8.9 Heuristic Search -- 8.10 Rollout Algorithms -- 8.11 Monte Carlo Tree Search -- 8.12 Summary of the Chapter -- 8.13 Summary of Part I: Dimensions
Part II: Approximate Solution Methods
9 On-policy Prediction with Approximation: 9.1 Value-function Approximation -- 9.2 The Prediction Objective (VE) -- 9.3 Stochastic-gradient and Semi-gradient Methods -- 9.4 Linear Methods -- 9.5 Feature Construction for Linear Methods (9.5.1 Polynomials; 9.5.2 Fourier Basis; 9.5.3 Coarse Coding; 9.5.4 Tile Coding; 9.5.5 Radial Basis Functions) -- 9.6 Selecting Step-Size Parameters Manually -- 9.7 Nonlinear Function Approximation: Artificial Neural Networks -- 9.8 Least-Squares TD -- 9.9 Memory-based Function Approximation -- 9.10 Kernel-based Function Approximation -- 9.11 Looking Deeper at On-policy Learning: Interest and Emphasis -- 9.12 Summary
10 On-policy Control with Approximation: 10.1 Episodic Semi-gradient Control -- 10.2 Semi-gradient n-step Sarsa -- 10.3 Average Reward: A New Problem Setting for Continuing Tasks -- 10.4 Deprecating the Discounted Setting -- 10.5 Differential Semi-gradient n-step Sarsa -- 10.6 Summary
11 *Off-policy Methods with Approximation: 11.1 Semi-gradient Methods -- 11.2 Examples of Off-policy Divergence -- 11.3 The Deadly Triad -- 11.4 Linear Value-function Geometry -- 11.5 Gradient Descent in the Bellman Error -- 11.6 The Bellman Error is Not Learnable -- 11.7 Gradient-TD Methods -- 11.8 Emphatic-TD Methods -- 11.9 Reducing Variance -- 11.10 Summary
12 Eligibility Traces: 12.1 The λ-return -- 12.2 TD(λ) -- 12.3 n-step Truncated λ-return Methods -- 12.4 Redoing Updates: Online λ-return Algorithm -- 12.5 True Online TD(λ) -- 12.6 *Dutch Traces in Monte Carlo Learning -- 12.7 Sarsa(λ) -- 12.8 Variable λ and γ -- 12.9 Off-policy Traces with Control Variates -- 12.10 Watkins’s Q(λ) to Tree-Backup(λ) -- 12.11 Stable Off-policy Methods with Traces -- 12.12 Implementation Issues -- 12.13 Conclusions
13 Policy Gradient Methods: 13.1 Policy Approximation and its Advantages -- 13.2 The Policy Gradient Theorem -- 13.3 REINFORCE: Monte Carlo Policy Gradient -- 13.4 REINFORCE with Baseline -- 13.5 Actor–Critic Methods -- 13.6 Policy Gradient for Continuing Problems -- 13.7 Policy Parameterization for Continuous Actions -- 13.8 Summary
Part III: Looking Deeper
14 Psychology: 14.1 Prediction and Control -- 14.2 Classical Conditioning (14.2.1 Blocking and Higher-order Conditioning; 14.2.2 The Rescorla–Wagner Model; 14.2.3 The TD Model; 14.2.4 TD Model Simulations) -- 14.3 Instrumental Conditioning -- 14.4 Delayed Reinforcement -- 14.5 Cognitive Maps -- 14.6 Habitual and Goal-directed Behavior -- 14.7 Summary
15 Neuroscience: 15.1 Neuroscience Basics -- 15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors -- 15.3 The Reward Prediction Error Hypothesis -- 15.4 Dopamine -- 15.5 Experimental Support for the Reward Prediction Error Hypothesis -- 15.6 TD Error/Dopamine Correspondence -- 15.7 Neural Actor–Critic -- 15.8 Actor and Critic Learning Rules -- 15.9 Hedonistic Neurons -- 15.10 Collective Reinforcement Learning -- 15.11 Model-based Methods in the Brain -- 15.12 Addiction -- 15.13 Summary
16 Applications and Case Studies: 16.1 TD-Gammon -- 16.2 Samuel’s Checkers Player -- 16.3 Watson’s Daily-Double Wagering -- 16.4 Optimizing Memory Control -- 16.5 Human-level Video Game Play -- 16.6 Mastering the Game of Go (16.6.1 AlphaGo; 16.6.2 AlphaGo Zero) -- 16.7 Personalized Web Services -- 16.8 Thermal Soaring
17 Frontiers: 17.1 General Value Functions and Auxiliary Tasks -- 17.2 Temporal Abstraction via Options -- 17.3 Observations and State -- 17.4 Designing Reward Signals -- 17.5 Remaining Issues -- 17.6 Reinforcement Learning and the Future of Artificial Intelligence
References
Index
"Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms."--
ISBN: 9780262039246 (hardcover : alk. paper)
LCCN: 2018023826
Subjects--Topical Terms: Reinforcement learning.
LC Class. No.: Q325.6 .R45 2018
Dewey Class. No.: 006.31 SUT