CS 294: Deep Reinforcement Learning, Fall 2015
Instructors: John Schulman, Pieter Abbeel
GSI: Rocky Duan
Lectures: Mondays and Wednesdays, Session 1: 10:00am-11:30am in 405 Soda Hall / Session 2: 2:30pm-4:00pm in 250 Sutardja Dai Hall.
Office Hours: Tuesday 4pm-5pm, Thursday 11am-12pm, both in 511 Soda Hall.
Communication: Piazza will be used for announcements, general questions about the course, clarifications about assignments, student questions to each other, discussions about material, and so on. To sign up, go to the Piazza website and sign up with “UC Berkeley” and “CS294-112” for your school and class.
Prerequisites
This course will assume some familiarity with reinforcement learning, numerical optimization, and machine learning. Students who are not familiar with the concepts below are encouraged to brush up using the references provided just below this list. We’ll review this material in class, but the review will be rather cursory.
- Reinforcement learning and MDPs
- Definition of MDPs
- Exact algorithms: policy and value iteration
- Search algorithms
- Numerical Optimization
- Gradient descent, stochastic gradient descent (see the short SGD sketch at the end of this section)
- Backpropagation algorithm
- Machine Learning
- Classification and regression problems: what loss functions are used, how to fit linear and nonlinear models
- Training/test error, overfitting.
For introductory material on RL and MDPs, see
- CS188 EdX course, starting with Markov Decision Processes I
- Sutton & Barto, Ch 3 and 4.
- For a concise intro to MDPs, see Ch 1-2 of Andrew Ng’s thesis
- David Silver’s course, links below
For introductory material on machine learning and neural networks, see the machine learning and neural network courses listed under Related Materials below.
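As a concrete refresher on the optimization and machine learning prerequisites, here is a minimal sketch of stochastic gradient descent fit to a linear regression problem, using only numpy. The synthetic data, step size, and epoch count are made up for illustration; this is not part of any assignment.

```python
# Minimal SGD-on-linear-regression sketch (illustrative data and hyperparameters).
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data: y = X w_true + noise
n, d = 1000, 5
X = rng.randn(n, d)
w_true = rng.randn(d)
y = X.dot(w_true) + 0.1 * rng.randn(n)

w = np.zeros(d)        # parameters to fit
step_size = 0.01

for epoch in range(20):
    for i in rng.permutation(n):
        # Squared-error loss on one example: 0.5 * (x_i . w - y_i)^2
        residual = X[i].dot(w) - y[i]
        grad = residual * X[i]     # gradient with respect to w
        w -= step_size * grad      # SGD update

print("estimation error:", np.linalg.norm(w - w_true))
```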
Syllabus
Below is a tentative outline of the course. Slides, videos, and references will be posted as the course proceeds, and dates may shift.
Course Introduction and Overview
- Date: 8/26
- Topics:
- What is deep reinforcement learning?
- Current applications of RL
- Frontiers: where might deep RL be applied?
- Slides
- References and further reading
- See the Powell textbook for more information on applications in operations research.
- See Stephane Ross’ thesis (Introduction) for more on structured prediction as reinforcement learning and on how RL differs from supervised learning.
Markov Decision Processes
- Date: 8/31
- MDP Cheatsheet
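To make the prerequisite on exact algorithms concrete, here is a minimal value-iteration sketch on a small random tabular MDP, using only numpy. The transition model, rewards, discount factor, and sizes are all made up for illustration.

```python
# Value iteration on a made-up tabular MDP (illustrative only).
import numpy as np

rng = np.random.RandomState(0)
n_states, n_actions, gamma = 4, 2, 0.95

# P[a, s] is a distribution over next states; R[a, s] is the reward for taking a in s.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.randn(n_actions, n_states)

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
    Q = R + gamma * P.dot(V)      # shape (n_actions, n_states)
    V_new = Q.max(axis=0)         # greedy over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("optimal values:", V)
print("greedy policy:", Q.argmax(axis=0))
```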
Review of Backpropagation and Numerical Optimization
- Date: 9/2
- For a very thorough analysis of reverse mode automatic differentiation, see Griewank and Walther’s textbook Evaluating Derivatives. Chances are, you’re computing derivatives all day, so it pays off to go into some depth on this topic!
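As a small illustration of reverse-mode differentiation, here is a sketch of manual backpropagation through a tiny two-layer network, checked against a finite-difference approximation. The layer sizes, tanh nonlinearity, and squared-error loss are arbitrary choices for illustration.

```python
# Manual backprop for a tiny two-layer net, with a finite-difference check.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(3)      # single input
y = rng.randn(2)      # single target
W1 = rng.randn(4, 3)
W2 = rng.randn(2, 4)

def loss_and_grads(W1, W2):
    # Forward pass
    h_pre = W1.dot(x)
    h = np.tanh(h_pre)
    y_hat = W2.dot(h)
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass (reverse mode): propagate dloss/d(...) back through the graph
    d_y_hat = y_hat - y
    d_W2 = np.outer(d_y_hat, h)
    d_h = W2.T.dot(d_y_hat)
    d_h_pre = d_h * (1 - np.tanh(h_pre) ** 2)
    d_W1 = np.outer(d_h_pre, x)
    return loss, d_W1, d_W2

loss, d_W1, d_W2 = loss_and_grads(W1, W2)

# Finite-difference check on one entry of W1
eps = 1e-6
W1_perturbed = W1.copy()
W1_perturbed[0, 0] += eps
numeric = (loss_and_grads(W1_perturbed, W2)[0] - loss) / eps
print("analytic:", d_W1[0, 0], "numeric:", numeric)
```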
Policy Gradient Methods
- General references on policy gradients (see the short REINFORCE sketch at the end of this section):
- Peters & Schaal: Reinforcement learning of motor skills with policy gradients: solid review article on policy gradient methods.
- Sham Kakade’s thesis, chapter 4: nice historical overview and theoretical study, with an alternative perspective based on advantage functions.
- Greensmith, Baxter, & Bartlett: Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning; also see the paper Infinite-Horizon Policy-Gradient Estimation, which introduces the theoretical framework.
- Trust region / proximal / batch methods
- Kakade & Langford: Approximately optimal approximate reinforcement learning – the conservative policy iteration algorithm, which provides some nice intuition about policy gradient methods and some generally useful theoretical ideas.
- Kakade: A Natural Policy Gradient
- Schulman, Levine, Moritz, Jordan, Abbeel: Trust Region Policy Optimization: combines theoretical ideas from the conservative policy iteration algorithm to prove that monotonic improvement is guaranteed when one solves a series of subproblems, each optimizing a bound on the policy performance. The conclusion is that one should use a KL-divergence constraint.
- Schulman, Moritz, Levine, Jordan, Abbeel: High-Dimensional Continuous Control Using Generalized Advantage Estimation: Better estimation of advantage function for policy gradient algorithms, using a λ parameter.
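To make the score-function estimator concrete, here is a minimal REINFORCE-style sketch with a baseline, applied to a toy multi-armed bandit with a softmax policy. The bandit, batch size, step size, and mean-reward baseline are made up for illustration; the methods above apply the same estimator to multi-step trajectories and add the variance-reduction and trust-region machinery discussed in the references.

```python
# Score-function (REINFORCE) gradient with a baseline on a toy bandit (illustrative).
import numpy as np

rng = np.random.RandomState(0)
true_means = np.array([0.2, 0.5, 1.0])   # expected reward of each arm
theta = np.zeros(3)                      # softmax policy parameters
step_size = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for it in range(2000):
    probs = softmax(theta)
    # Sample a batch of actions and noisy rewards from the current policy
    actions = rng.choice(3, size=32, p=probs)
    rewards = true_means[actions] + rng.randn(32)
    baseline = rewards.mean()            # simple variance-reducing baseline
    grad = np.zeros(3)
    for a, r in zip(actions, rewards):
        # For a softmax policy, grad log pi(a) = one_hot(a) - probs
        score = -probs.copy()
        score[a] += 1.0
        grad += (r - baseline) * score
    theta += step_size * grad / len(actions)

print("final action probabilities:", softmax(theta))
```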
Approximate Dynamic Programming Methods
- Q-Learning / Q-Value Iteration
- The Q-learning convergence result is originally due to Watkins. A compact and general alternative proof is provided by Jaakkola, Jordan, and Singh: On the Convergence of Stochastic Iterative Dynamic Programming Algorithms, which also applies to TD(λ). A short tabular Q-learning sketch appears at the end of this section.
- Neural Fitted Q Iteration (NFQ) by Riedmiller
- Deep Q Network (DQN) by Mnih et al. of DeepMind: ArXiv, Nature
- Approximate Policy Iteration methods
- Scherrer et al., Approximate Modified Policy Iteration. A very general framework that subsumes many other ADP algorithms as special cases. Also see the related paper with more practical tips: Approximate Dynamic Programming Finally Performs Well in the Game of Tetris.
- See Bertsekas’ textbook (2nd edition) for a very extensive treatment of approximate value function estimation methods and approximate policy iteration.
- Least Squares Policy Iteration: policy iteration with a linearly parameterized Q function.
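As a concrete companion to the Q-learning references above, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration on a made-up random MDP; all sizes, dynamics, and hyperparameters are illustrative. NFQ and DQN replace the table with a neural network (DQN also adds a replay buffer and a target network).

```python
# Tabular Q-learning (Watkins) with epsilon-greedy exploration on a made-up MDP.
import numpy as np

rng = np.random.RandomState(0)
n_states, n_actions, gamma = 5, 2, 0.95

# Random tabular MDP: P[a, s] is a distribution over next states, R[a, s] a reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.randn(n_actions, n_states)

Q = np.zeros((n_actions, n_states))
alpha, epsilon = 0.1, 0.1
s = 0
for step in range(50000):
    # Epsilon-greedy action selection
    a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[:, s].argmax())
    r = R[a, s]
    s_next = rng.choice(n_states, p=P[a, s])
    # Q-learning update: bootstrap off the greedy value of the next state
    td_target = r + gamma * Q[:, s_next].max()
    Q[a, s] += alpha * (td_target - Q[a, s])
    s = s_next

print("greedy policy:", Q.argmax(axis=0))
```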
Search + Supervised Learning
- DAGGER and related ideas based on querying an expert (or search algorithm) while executing the agent’s policy (a short DAGGER loop sketch appears at the end of this section):
- A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning - DAGGER
- Reinforcement and Imitation Learning via Interactive No-Regret Learning – AGGREVATE; same authors as DAGGER, with a cleaner and more general framework (in my opinion).
- Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning – Monte-Carlo Tree Search + DAGGER.
- SEARN in Practice - similar to DAGGER/AGGREVATE but using a stochastic policy, and targeted at structured prediction problems. Stephane Ross’ thesis has a nice explanation of SEARN.
- Trajectory Optimization + Supervised Learning:
- Guided Policy Search: uses (a modification of) importance sampling to obtain a policy gradient, where samples are obtained via trajectory optimization.
- Constrained Guided Policy Search: formulates an objective that jointly includes a collection of trajectories and a policy, and encourages them to become consistent. End-To-End Training of Deep Visuomotor Policies uses CGPS to learn a mapping from image pixels to low-level control signals in robotic manipulation problems.
- Combining the Benefits of Function Approximation and Trajectory Optimization: jointly optimizes trajectories and policies, with some design choices that differ from CGPS. (Igor has written a new NIPS paper on this topic; it will be linked when it’s ready.)
- Slides from lecture on DAGGER and friends
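To show the overall structure of DAGGER, here is a minimal sketch of the loop on a toy one-dimensional control problem: roll out the current learned policy, have the expert label the states it visits, aggregate the labeled data, and retrain. The environment, expert, and threshold classifier used as the learned policy are all made up for illustration; only the loop structure reflects the algorithm.

```python
# Minimal DAGGER loop on a toy 1-D problem (environment, expert, and learner are illustrative).
import numpy as np

rng = np.random.RandomState(0)

def env_step(s, a):
    # Toy dynamics: action 0 pushes the state up, action 1 pushes it down.
    return s + (0.5 if a == 0 else -0.5) + 0.1 * rng.randn()

def expert_action(s):
    # The expert tries to drive the state toward 0.
    return 1 if s > 0 else 0

def fit_policy(states, actions):
    # Trivial supervised learner: pick the threshold that minimizes training error.
    thresholds = np.linspace(-2, 2, 81)
    errors = [np.mean((states > t).astype(int) != actions) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

data_states, data_actions = [], []
threshold = 1.0   # initial (untrained) policy parameter

for iteration in range(5):
    # Roll out the CURRENT learned policy, but record the EXPERT's labels.
    s = rng.randn()
    for t in range(50):
        a_learner = int(s > threshold)
        data_states.append(s)
        data_actions.append(expert_action(s))   # expert relabels visited states
        s = env_step(s, a_learner)
    # Retrain on the aggregated dataset from all iterations so far.
    threshold = fit_policy(np.array(data_states), np.array(data_actions))

print("learned threshold (the expert effectively uses 0.0):", threshold)
```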
Frontiers
Lecture Videos
We did not record lecture videos for the course, but I (John) gave a lecture series at MLSS, and videos are available:
- Lecture 1: intro, derivative free optimization
- Lecture 2: score function gradient estimation and policy gradients
- Lecture 3: actor critic methods
- Lecture 4: trust region and natural gradient methods, open problems
Related Materials
Courses
- David Silver’s course on reinforcement learning / Lecture Videos
- Nando de Freitas’ course on machine learning
- Andrej Karpathy’s course on neural networks
Textbooks
- Sutton & Barto, Reinforcement Learning: An Introduction
- Szepesvari, Algorithms for Reinforcement Learning
- Bertsekas, Dynamic Programming and Optimal Control, Vols I and II
- Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Powell, Approximate Dynamic Programming
Misc Links
Feedback
Send feedback to the instructor. Feel free to remain anonymous.