AlphaGoZero
Thank you to Marc Lanctot, Hugo Larochelle, Katherine Lee, and Tim Lillicrap for contributions to this guide.
Additionally, this would not have been possible without the generous support of Prof. Joan Bruna and his class at NYU, The Mathematics of Deep Learning. Special thanks to him, as well as Martin Arjovsky, my colleague in leading this recitation, and my fellow students Ojas Deshpande, Anant Gupta, Xintian Han, Sanyam Kapoor, Chen Li, Yixiang Luo, Chirag Maheshwari, Zsolt PajorGyulai, Roberta Raileanu, Ryan Saxe, and Liang Zhuo.
Why
AlphaGoZero was a big splash when it debuted and for good reason. The grand effort was led by David Silver at DeepMind and was an extension of work that he started during his PhD. The main idea is to solve the game of Go and the approach taken is to use an algorithm called Monte Carlo Tree Search (MCTS). This algorithm acts as an expert guide to teach a deep neural network how to approximate the value of each state. The convergence properties of MCTS provides the neural network with a founded way to reduce the search space.
In this curriculum, you will focus on the study of twoperson zerosum perfect information games and develop understanding so that you can completely grok AlphaGoZero.
Common Resources:
 Knuth: An Analysis of AlphaBeta Pruning
 SB: Reinforcement Learning: An Introduction, Sutton & Barto.
 Kun: Jeremy Kun: Optimizing in the Face of Uncertainty.
 Vodopivec: On Monte Carlo Tree Search and Reinforcement Learning.
1 Minimax & Alpha Beta Pruning
Motivation: Minimax and AlphaBeta Pruning are original ideas that blossomed from the study of games starting in the 50s. To this day, they are components in strong gameplaying computer engines like Stockfish. In this class, we will go over these foundations, learn from Prof. Knuth’s work analyzing their properties, and prove that these algorithms are theoretically sound solutions to twoplayer games.
Topics:
 Perfect Information Games.
 Minimax.
 AlphaBeta Pruning.
Required Reading:
 Cornell Recitation on Minimax & AB Pruning.
 Knuth: Section 6 (Theorems 1&2, Corollaries 1&3).
Optional Reading:
 CMU’s Mathematical Foundations of AI Lecture 1.
 Knuth: Sections 13.
 Chess Programming on Minimax.
 Chess Programming on AB Pruning.
Questions:
 Knuth: Show that AlphaBetaMin AlphaBetaMax. (p. 300)
Solution
If \(d = 0\), then \(F2(p, \alpha, \beta) = f(p)\) and \(G2(p, \alpha, \beta) = g(p) = f(p)\) as desired, where the last step follows from equation 2 on p. 295.
Otherwise, \(d > 0\) and we proceed by induction on the height \(h\). The base case of \(h = 0\) is trivial because then the tree is a single root and consequently is the \(d = 0\) case. Assume it is true for height \(< h\), then for \(p\) of height \(h\), we have that \(m = a\) at the start of \(F2(p, \alpha, \beta)\) and \(m\prime = \alpha\) at the start of \(G2(p, \beta, \alpha)\). So \(m = m\prime\).
In the ith iteration of the loop, let's label the resulting value of \(m\) as \(m_{n}\). We have that \(t = G2(p_{i}, m , \beta) = F2(p_i, \beta, m) = t\) by the inductive assumption. Then, \(t > m \iff t < m \iff t\prime < m\prime \iff m_{n} = t = m_{n}\prime\), which means that every time there is an update to the value of \(m\), it will be preserved across both functions. Further, because \(m \geq \beta \iff m \leq \beta \iff m\prime \leq \beta\), we have that \(G2\) and \(F2\) will have the same stopping criterion. Together, these imply that \(G2(p, \alpha, \beta) = F2(p, \beta, \alpha)\) after each iteration of the loop as desired.
 Knuth: For Theorem 1.1, why are the successor positions of type 2? (p. 305)
Solution
By the definition of being type 1, \(p = a_{1} a_{2} \ldots a_{l}\), where each \(a_{k} = 1\). Its successor positions \(p_{l+1} = p (l+1)\) all have length \(l + 1\) and their first term \(> 1\) is at position \(l+1\), the last entry. Consequently, \((l+1)  (l+1) = 0\) is even and they are type 2.
 Knuth: For Theorem 1.2, why is it that p’s successor position is of type 3
if p is not terminal?
Solution
If \(p\) is type 2 and size \(l\), then for \(j\) s.t. \(a_j\) is the first entry where \(a_j > 1\), we have that \(l  j\) is even. When it's not terminal, then its successor position \(p_1 = a_{1} \ldots a_{j} \dots a_{l} 1\) has a length of size \(l + 1\), which implies that \(l + 1  j\) is odd and so \(p_1\) is type 3.
 Knuth: For Theorem 1.3, why is it that p’s successor positions are of type 2
if p is not terminal?
Hint
This is similar to the above two.
 Knuth: Show that the three inductive steps of Theorem 2 are correct.
2 MultiArmed Bandits & Upper Confidence Bounds
Motivation: The multiarmed bandits problem is a framework for understanding the exploitation vs exploration tradeoff. Upper Confidence Bounds, or UCB, is an algorithmically tight approach to addressing that tradeoff under certain constraints. Together, they are important components of how Monte Carlo Tree Search (MCTS), a key aspect of AlphaGoZero, was originally formalized. For example, in MCTS there is a notion of node selection where UCB is used extensively. In this section, we will cover bandits and UCB.
Topics:
 Basics of reinforcement learning.
 Multiarmed bandit algorithms and their bounds.
Required Reading:
 SB: Sections 2.1  2.7.
 Kun.
Optional Reading:
Questions:
 SB: Exercises 2.3, 2.4, 2.6.
 SB: What are the pros and cons of the optimistic initial values method? (Section 2.6)
 Kun: In the proof for the expected cumulative regret of UCB1, why is
a trivial regret bound if the deltas are all the same?
Solution
\( \begin{align} \mathbb{E}[R_{A}(T)] &= \mu^{*}T  \mathbb{E}[G_{A}(T)] \\ &= \mu^{*}T  \sum_{i} \mu_{i}\mathbb{E}[P_{i}(T)] \\ &= \sum_{i} (\mu^{*}  \mu_{i})\mathbb{E}[P_{i}(T)] \\ &= \sum_{i} \delta_{i} \mathbb{E}[P_{i}(T)] \\ &\leq \delta \sum_{i} \mathbb{E}[P_{i}(T)] \\ &= \delta * T \end{align} \)
The third line follows from \(sum_{i} \mathbb{E}[P_{i}(T)] = T\) and the fifth line from the definition of \(\delta\).
 Kun: Do you understand the argument for why the regret bound is ?
Hint
What happens if you break the arms into those with regret \(< \sqrt{K(\log{T})/T}\) and those with regret \(\geq \sqrt{K(\log{T})/T}\)? Can we use this to bound the total regret?
 Reproduce the UCB1 algorithm in code with minimal supervision.
3 Policy & Value Functions
Motivation: Policy and value functions are at the core of reinforcement learning. The policy function is the representative probabilities that our policy assigns to each action. When we sample from these, we would like for better actions to have higher probability. The value function is our estimate of the strength of the current state. In AlphaGoZero, a single network calculates both a value and a policy, then later updates its weights according to how well the agent performs in the game.
Topics:
 Bellman equation.
 Policy gradient.
 Onpolicy / offpolicy.
 Policy iteration.
 Value iteration.
Required Reading:
 Value Function:
 SB: Sections 3.5, 3.6, 3.7.
 SB: Sections 9.1, 9.2, 9.3*.
 Policy Function:
 SB: Sections 4.1, 4.2, 4.3.
 SB: Sections 13.1, 13.2*, 13.3, 13.4.
Optional Reading:
 Sergey Levine: Berkeley Fall’17: Policy Gradients → This is really good.
 Sergey Levine: Berkeley Fall’17: Value Functions → This is really good.
 Karpathy does Pong.
 David Silver on PG.
 David Silver on Value.
Questions:
 Why does policy gradient have such high variance?
 What is the difference between offpolicy and onpolicy?
Solution
Onpolicy algorithms learn from the current policy's action decisions. Offpolicy algorithms learn from another arbitrary policy's actions. An example of an onpolicy algorithm is SARSA or REINFORCE. An example of an offpolicy algorithm is QLearning.
 SB: Exercises 3.13, 3.14, 3.15, 3.20, 4.3.
 SB: Exercise 4.6  How would policy iteration be defined for action values?
Give a complete algorithm for computing , analogous to that on page 65
for computing .
Solution
The solution follows the proof (page 65) for \(v^{*}\), with the following modifications:
 Consider a randomly initialized Q(s, a) and a random policy \( \pi(s) \).
 Policy Evaluation : Update Q(s, a) \( \leftarrow \sum_{s'} P_{ss'}^{a} R_{ss'}^{a} + \gamma \sum_{s'} \sum_{a'} P_{ss'}^{a} Q^{\pi}(s', a') \pi(a'  s') \)
Note that \( P_{ss'}^{a} \leftarrow P(s' s, a) , R_{ss'}^{a} \leftarrow R(s, a, s').\)  Policy Improvement : Update \( \pi(s) = {argmax}_{a} Q^{\pi}(s, a) \). If \(unstable\), go to step 2. Here, \( unstable \), implies \( \pi_{before\_update}(s) \neq \pi_{after\_update}(s) \)
 \( q^{*} \leftarrow Q(s, a) \)
 SB: Exercise 13.2  Prove that the eligibility vector
using the definitions and elementary calculus. Here, = softmax( ).
Solution
By definition, we have \( \pi( a s, \theta) = \frac{e^{ \theta^{T} \mathbf{x}( s, a) }}{ \sum_b e^{ \theta^{T}\mathbf{x}(s, b)) }} \), where \( \mathbf{x}(s, a) \) is the stateaction feature representation. Consequently:
\( \begin{align} \nabla_{\theta} \ln \pi (a  s, \theta) &= \nabla_\theta \Big( \theta^{T}\mathbf{x}(s, a)  \ln \sum_b e^{ \theta^{T}\mathbf{x}(s, b) } \Big) \\ &= \mathbf{x}(s, a)  \sum_b \mathbf{x}(s, b) \frac{ e^{ \theta^{T}\mathbf{x}(s, b) } }{ \sum_b e^{ \theta^{T}\mathbf{x}(s, b) } } \\ &= \mathbf{x}(s, a)  \sum_{b} \pi (b  s, \theta)\mathbf{x}(s, b) \\ \end{align} \)
4 MCTS & UCT
Motivation: Monte Carlo Tree Search (MCTS) forms the backbone of AlphaGoZero. It is what lets the algorithm reliably explore and then hone in on the best policy. UCT (UCB for Trees) combines MCTS and UCB so that we get reliable convergence guarantees. In this section, we will explore how MCTS works and how to make it excel for our purposes in solving Go, a game with an enormous branching factor.
Topics:
 Conceptual understanding of Monte Carlo Tree Search.
 Optimality of UCT.
Required Reading:
 SB: Section 8.11
 Browne: Sections 2.2, 2.4, 3.13.5, 8.28.4.
 Silver Thesis: Sections 1.4.2 and 3.
Optional Reading:
 Jess Hamrick on Browne.
 Original MCTS Paper.
 Original UCT Paper.
 Browne:
 Section 4.8: MCTS applied to Stochastic or Imperfect Information Games.
 Sections 7.2, 7.3, 7.5, 7.7: Applications of MCTS.
Questions:
 Can you detail each of the four parts of the MCTS algorithm?
Solution
 Selection: Select child node from the current node based on the tree policy.
 Expansion: Expand the child node based on the exploration / exploitation tradeoff.
 Simulation: Simulate from the child node until termination or upon reaching a suitably small future reward (like from reward decay).
 Backup: Backup the reward along the path taken according to the tree policy.
 What happens to the information gained from the Tree Search after each run?
Solution
We can reuse the accumulated statistics in subsequent runs. We could also ignore those statistics and build fresh each subsequent root. Both are used in actual implementations.
 What characteristics of a domain would make MCTS a good algorithmic choice?
Solution
A few such characteristics are:
 MCTS is aheuristic, meaning that it does not require any domainspecific knowledge. Consequently, if it is difficult to produce game heuristics for your target domain (e.g. Go), then it can perform much better than alternatives like Minimax. And on the flip side, if you did have domainspecific knowledge, MCTS can incorporate it and will improve dramatically.
 If the target domain needs actions online, then MCTS is a good choice as all values are always up to date. Go does not have this property but digital games like in the General Game Playing suite may. If the target domain's game tree is of a nontrivial size, then MCTS may be a much better choice than other algorithms as it tends to build unbalanced trees that explore the more promising routes rather than consider all routes. </li>
 If there is noise or delayed rewards in the target domain, then MCTS is a good choice because it is robust to these effects which can gravely impact other algorithms such as modern Deep Reinforcement Learning.
 What are examples of domain knowledge default policies in Go?
Solution
 Crazy Stone, an early program that won the 2006 9x9 Computer Go Olympiad, used an urgency heuristic value for each of the moves on the board.

MoGo, the algorithm that introduced UCT, bases its default policies on this
sequence:
 Respond to ataris by playing a saving move at random.
 If one of the eight intersections surrounding the last move matches a simple pattern for cutting or hane, randomly play one.
 If there are capturing moves, play one at random.
 Play a random move.
 The second version of Crazy Stone used an algorithm learned from actual game play to learn a library of strong patterns. It incorporated this into its default policy.
 Why is UCT optimal? For a finitehorizon MDP with rewards scaled to lie in
, can you prove that the failure probability at the root converges
to zero at a polynomial rate in the number of games simulated?
Hint
Try using induction on \(D\), the horizon of the MDP. At \(D=1\), to what result does this correspond?
Hint 2
Assume that the result holds for a horizon up to depth \(D  1\) and consider a tree of depth \(D\). We can keep the cumulative rewards bounded in the interval by dividing by \(D\). Now can you show that the UCT payoff sequences at the root satisfy the drift conditions, repeated below?
 The payoffs are bounded  \(0 \leq X_{it} \leq 1\), where \(i\) is the arm number and \(t\) is the time step.
 The expected values of the averages, \(\overline{X_{it}} = \frac{1}{n} \sum_{t=1}^{n} X_{it}\), converge.
 Define \(\mu_{in} = \mathbb{E}[\overline{X_{in}}]\) and \(\mu_{i} = \lim_{n\to\inf} \mu_{in}\). Then, for \(c_{t, s} = 2C_{p}\sqrt{\frac{\ln{t}}{s}}\), where \(C_p\) is a suitable constant, both \(\mathbb{P}(\overline{X_{is}} \geq \mu_{i} + c_{t, s}) \leq t^{4}\) and \(\mathbb{P}(\overline{X_{is}} \leq \mu_{i}  c_{t, s}) \leq t^{4}\) hold.
Solution
For a complete detail of the proof, see the original UCT paper.
5 MCTS & RL
Motivation: Up to this point, we have learned a lot about how games can be solved and how reinforcement learning works on a foundational level. Before we jump into the paper, one last foray contrasting and unifying the games vs learning perspective is worthwhile for understanding the domain more fully. In particular, we will focus on a paper from Vodopivec et al. After completing this section, you should have an understanding of what research directions in this field have been thoroughly explored and which still have open directions.
Topics:
 Integrating MCTS and reinforcement learning.
Required Reading:
 Vodopivec: 1. Section 3.13.4: Connection between MCTS and RL. 2. Section 4.14.3: Integrating MCTS and RL.
 Why did TDGammon Work?
Optional Reading:
 Vodopivec: Section 5: Survey of research inspired by both fields.
Questions:
 What are key differences between MCTS and RL?
 UCT can be described in RL terms as the following “The original UCT searches
identically as an offline onpolicy everyvisit MC control algorithm that uses
UCB1 as the policy.” What do each of these terms mean?
Solution
 UCT is trained onpolicy, which means it improves the policy used to make the action decisions, i.e. UCB1.
 The offline means that we can't learn until after the episode is completed. An alternative online algorithm would learn while the episode was running.
 Everyvisit versus firstvisit decides if we are going to update a state for every time it's accessed in an episode or just the first time. The original UCT algorithm did everyvisit. Subsequent versions relaxed this.
 MC control means that we are using Monte Carlo as the policy, i.e. we use the average value of the state as the true value.
 What is a Representation Policy? Give an example not described in the text.
Solution
A Representation Policy defines the model of the state space (e.g. in the form of a value function) and the boundary between memorized and nonmemorized parts of the space.
 What is a Control Policy? Give an example not described in the text.
Solution
A Control Policy dictates what actions will be performed and (consequently) which states will be visited. In MCTS, it includes the tree and default policies.
6 The Paper
Motivation: Let’s read the paper! We have a deep understanding of the background, so let’s delve into the apex result. Note that we don’t just focus on the final AlphaGoZero paper but also explore a related paper written coincidentally by a team at UCL using Hex as the game of choice. Their algorithm is very similar to the AlphaGoZero algorithm and considering both in context is important to gauging what was really the most important aspects of this research.
Topics:
 MCTS learning and computational capacity.
Required Reading:
 Mastering the Game of Go Without Human Knowledge
 Thinking Fast and Slow with Deep Learning and Tree Search
Optional Reading:
 Deep Learning for RealTime Atari Game Play Using Offline MonteCarlo Tree Search Planning
 Silver Thesis: Section 4.6
 Mastering the game of Go with deep neural networks and tree search
 Mastering Chess and Shogi by SelfPlay with a General Reinforcement Learning Algorithm
Questions:
 What were the differences between the two papers “Mastering the Game of Go
Without Human Knowledge” and “Thinking Fast and Slow with Deep Learning and Tree Search”?
Solution
Some differences between the former (AG0) and the latter (ExIt) are:
 AG0 uses MC value estimates from the expert for the value network where ExIt uses estimates from the apprentice. This requires more computation by AG0 but produces better estimates.
 The losses were different. For the value network, AG0 uses an MSE loss with L2 regularization and ExIt uses a cross entropy loss with early stopping. For the policy part, AG0 used cross entropy while ExIt uses a weighted crossentropy that takes into account how confident MCTS is in the action based on the state count.
 AG0 uses the value network to evaluate moves; ExIt uses RAVE and rollouts, plus warm starts from the MCTS.
 AG0 adds in Dirichlet noise to the prior probability at the root node.
 AG0 elevates a new network as champion only when it's markedly better than the prior champion; ExIt replaces the old network without verification of if it is better.
 What was common to both of “Mastering the Game of Go Without Human Knowledge”
and “Thinking Fast and Slow with Deep Learning and Tree Search”?
Solution
The most important commonality is that they both use MCTS as an expert guide to help a neural network learn through selfplay.
 Will the system get stuck if the current neural network can’t beat the previous ones?
Solution
No. The algorithm won’t accept a policy that is worse than the current best and MCTS’s convergence properties imply that it will eventually tend towards the equilibrium solution in a zerosum two player game
 Why include both a policy and a value head in these algorithms? Why not just use policy?
Solution
Value networks reduce the required search depth. This helps tremendously because a rollout approach without the value network is inaccurate and spends too much time on suboptimal directions.