CS 294: Deep Reinforcement Learning, Fall 2015
Instructors: John Schulman, Pieter Abbeel 
GSI: Rocky Duan 
Lectures: Mondays and Wednesdays, Session 1: 10:00am-11:30am in 405 Soda Hall / Session 2: 2:30pm-4:00pm in 250 Sutardja Dai Hall. 
Office Hours: Tuesday 4pm-5pm, Thursday 11am-12pm, both in 511 Soda Hall. 
Communication: Piazza will be used for announcements, general questions about the course, clarifications about assignments, student questions to each other, discussions about material, and so on. To sign up, go to the Piazza website and sign up with “UC Berkeley” and “CS294-112” for your school and class. 
Prerequisites
This course will assume some familiarity with reinforcement learning, numerical optimization, and machine learning. Students who are not familiar with the concepts below are encouraged to brush up using the references provided below this list. We’ll review this material in class, but the review will be rather cursory.
 Reinforcement learning and MDPs
 Definition of MDPs
 Exact algorithms: policy and value iteration
 Search algorithms
 Numerical Optimization
 gradient descent, stochastic gradient descent
 backpropagation algorithm
 Machine Learning
 Classification and regression problems: what loss functions are used, how to fit linear and nonlinear models
 Training/test error, overfitting.
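As a quick refresher on the exact algorithms mentioned above, here is a minimal value iteration sketch on a toy two-state MDP (the transition table and rewards are made-up numbers for illustration only):

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[s][a] = list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9  # discount factor

def value_iteration(P, gamma, n_states=2, n_actions=2, tol=1e-8):
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = expected reward plus discounted value of the next state
        Q = np.array([[sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)] for s in range(n_states)])
        V_new = Q.max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and greedy policy
        V = V_new

V, pi = value_iteration(P, gamma)
```

Policy iteration works on the same Bellman equations but alternates full policy evaluation with greedy policy improvement.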
For introductory material on RL and MDPs, see
 CS188 EdX course, starting with Markov Decision Processes I
 Sutton & Barto, Ch 3 and 4.
 For a concise intro to MDPs, see Ch 12 of Andrew Ng’s thesis
 David Silver’s course, links below
For introductory material on machine learning and neural networks, see
Syllabus
Below you can find a tentative outline of the course. Slides, videos, and references will be posted as the course proceeds. Dates are tentative.
Course Introduction and Overview
 Date: 8/26
 Topics:
 What is deep reinforcement learning?
 Current applications of RL
 Frontiers: where might deep RL be applied?
 Slides
 References and further reading
 See Powell’s textbook for more information on applications in operations research.
 See Stephane Ross’ thesis (Introduction) for more info on structured prediction as reinforcement learning, and on how RL differs from supervised learning.
Markov Decision Processes
 Date: 8/31
 MDP Cheatsheet
Review of Backpropagation and Numerical Optimization
 Date: 9/2
 For a very thorough analysis of reverse mode automatic differentiation, see Griewank and Walther’s textbook Evaluating Derivatives. Chances are, you’re computing derivatives all day, so it pays off to go into some depth on this topic!
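To make the reverse-mode idea concrete, here is a toy scalar reverse-mode autodiff sketch (my own minimal illustration, not the implementation from Evaluating Derivatives): each node records how to push gradients back to its parents.

```python
# Minimal reverse-mode automatic differentiation on scalars (a toy sketch).
# Each Var stores its value plus (parent, local_gradient) pairs; backward()
# propagates the chain rule from outputs back to inputs.
class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
    def __mul__(self, other):
        # d(xy)/dx = y, d(xy)/dy = x
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])
    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = xy + x, so dz/dx = y + 1, dz/dy = x
z.backward()
```

A real implementation would traverse the graph in topological order rather than recursing, so shared subexpressions are visited once; the gradients here are still correct.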
Policy Gradient Methods

General references on policy gradients:
 Peters & Schaal: Reinforcement learning of motor skills with policy gradients: solid review article on policy gradient methods.
 Sham Kakade’s thesis, chapter 4: nice historical overview and theoretical study, with an alternative perspective based on advantage functions.
 Greensmith, Baxter, & Bartlett: Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning; also see the paper Infinite-Horizon Policy-Gradient Estimation, which introduces the theoretical framework.
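The score-function estimator these references analyze can be illustrated on a toy two-armed bandit with a softmax policy (the reward values and sample count below are made up for illustration):

```python
import numpy as np

# Monte Carlo estimate of the policy gradient via the score function
# (REINFORCE) identity: grad E[r] = E[r * grad log pi(a)].
rng = np.random.default_rng(0)
theta = np.zeros(2)                    # softmax logits over 2 actions
true_rewards = np.array([1.0, 3.0])    # made-up expected rewards

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

grad = np.zeros(2)
n = 5000
for _ in range(n):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = true_rewards[a] + rng.normal()   # noisy reward sample
    score = -probs                        # grad log pi(a) for softmax:
    score[a] += 1.0                       # one-hot(a) - probs
    grad += r * score
grad /= n
# grad now points toward increasing the logit of the higher-reward arm
```

Subtracting a baseline from r leaves the estimator unbiased while reducing variance, which is the theme of the Greensmith et al. paper.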

Trust region / proximal / batch methods
 Kakade & Langford: Approximately optimal approximate reinforcement learning – Conservative policy iteration algorithm, providing some nice intuition about policy gradient methods, and some generally useful theoretical ideas.
 Kakade: A Natural Policy Gradient
 Schulman, Levine, Moritz, Jordan, Abbeel: Trust Region Policy Optimization: combines theoretical ideas from the conservative policy iteration algorithm to prove that monotonic improvement is guaranteed when one solves a series of subproblems optimizing a bound on the policy performance. The conclusion is that one should use a KL-divergence constraint.
 Schulman, Moritz, Levine, Jordan, Abbeel: High-Dimensional Continuous Control Using Generalized Advantage Estimation: better estimation of the advantage function for policy gradient algorithms, using a λ parameter.
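The λ-weighted advantage estimator from the GAE paper is a short computation: A_t = Σ_l (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t). A minimal sketch (the rewards and value estimates are made-up numbers):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    values has length len(rewards) + 1: it includes a bootstrap value
    for the state after the last reward.
    """
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * lam * acc   # discounted sum of residuals
        advantages[t] = acc
    return advantages

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3, 0.0])   # V(s_0..s_3); last entry bootstraps
adv = gae(rewards, values)
```

Setting λ=0 recovers the one-step TD residual (low variance, more bias); λ=1 recovers the Monte Carlo return minus the value baseline (high variance, low bias).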
Approximate Dynamic Programming Methods
 Q-Learning / Q-Value Iteration
 The Q-learning convergence result is originally due to Watkins. A compact and general alternative proof is provided by Jaakkola, Jordan, and Singh: On the Convergence of Stochastic Iterative Dynamic Programming Algorithms, which also applies to TD(λ).
 Neural Fitted Q Iteration (NFQ) by Riedmiller
 Deep Q Network (DQN) by Mnih et al. of DeepMind: ArXiv, Nature
 Approximate Policy Iteration methods
 Scherrer et al., Approximate Modified Policy Iteration. A very general framework that subsumes many other ADP algorithms as special cases. Also see the related paper with more practical tips: Approximate Dynamic Programming Finally Performs Well in the Game of Tetris.
 See Bertsekas’ textbook, 2ed for a very extensive treatment of approximate value function estimation methods, and approximate policy iteration.
 Least Squares Policy Iteration: policy iteration with a Q-function.
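The tabular form of Watkins’ Q-learning update discussed above is easy to state in a few lines. Here is a sketch on a toy deterministic chain MDP (the environment and hyperparameters are made up for illustration):

```python
import numpy as np

# Tabular Q-learning on a 4-state chain: action 0 moves left, action 1 moves
# right; reaching the rightmost state gives reward 1 and ends the episode.
n_states, n_actions, gamma, alpha, eps = 4, 2, 0.9, 0.5, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for episode in range(500):
    s = 0
    for _ in range(50):
        # epsilon-greedy exploration
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Watkins' Q-learning update: bootstrap off the max over next actions
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
        if done:
            break
```

NFQ and DQN replace the table Q with a neural network fit to the same bootstrapped targets, which is where the convergence guarantees cited above no longer directly apply.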
Search + Supervised Learning
 DAGGER and related ideas based on querying an expert (or search algorithm) while executing the agent’s policy:
 A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (DAGGER)
 Reinforcement and Imitation Learning via Interactive No-Regret Learning (AGGREVATE): same authors as DAGGER; a cleaner and more general framework (in my opinion).
 Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning: Monte-Carlo Tree Search + DAGGER
 SEARN in Practice: similar to DAGGER/AGGREVATE but using a stochastic policy, and targeted at structured prediction problems. Stephane Ross’ thesis has a nice explanation of SEARN.
 Trajectory Optimization + Supervised Learning:
 Guided Policy Search: use (modification of) importance sampling to get policy gradient, where samples are obtained via trajectory optimization.
 Constrained Guided Policy Search: formulates an objective that jointly includes a collection of trajectories and a policy, and encourages them to become consistent. End-to-End Training of Deep Visuomotor Policies uses CGPS to learn a mapping from image pixels to low-level control signals in robotic manipulation problems.
 Combining the Benefits of Function Approximation and Trajectory Optimization: jointly optimizes trajectories and policies, with some design choices that differ from CGPS. (Igor has written a new NIPS paper on this topic; it will be linked when it’s ready.)
 Slides from lecture on DAGGER and friends
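The DAGGER loop from the references above can be sketched in a few lines: execute the current learner’s policy, query the expert on the states actually visited, aggregate the labeled data, and retrain. The expert, dynamics, and “learner” below are hypothetical stand-ins for illustration.

```python
import numpy as np
from collections import defaultdict

def expert_action(s):
    return s % 2              # made-up "expert": parity of the state

def rollout(policy, horizon=10):
    s, states = 0, []
    for _ in range(horizon):
        states.append(s)
        s = (s + policy(s) + 1) % 8   # toy deterministic dynamics, 8 states
    return states

def dagger(n_iters=5):
    counts = defaultdict(lambda: np.zeros(2))  # per-state expert-label counts
    def policy(s):
        # toy "learner": majority vote over aggregated expert labels
        return int(counts[s].argmax()) if counts[s].sum() else 0
    for _ in range(n_iters):
        for s in rollout(policy):              # run the learner's own policy
            counts[s][expert_action(s)] += 1   # query expert, aggregate data
    return policy

pi = dagger()
```

The key point, versus plain behavior cloning, is that labels are collected on the state distribution induced by the learner’s policy, which is what yields the no-regret guarantees.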
Frontiers
Lecture Videos
We did not record lecture videos for the course, but I (John) gave a lecture series at MLSS, and videos are available:
 Lecture 1: intro, derivative free optimization
 Lecture 2: score function gradient estimation and policy gradients
 Lecture 3: actor critic methods
 Lecture 4: trust region and natural gradient methods, open problems
Related Materials
Courses
 David Silver’s course on reinforcement learning / Lecture Videos
 Nando de Freitas’ course on machine learning
 Andrej Karpathy’s course on neural networks
Textbooks
 Sutton & Barto, Reinforcement Learning: An Introduction
 Szepesvari, Algorithms for Reinforcement Learning
 Bertsekas, Dynamic Programming and Optimal Control, Vols I and II
 Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
 Powell, Approximate Dynamic Programming
Misc Links
Feedback
Send feedback to the instructor. Feel free to remain anonymous.