<p>Depth First Learning: Resurrecting the Sigmoid: Theory and Practice (2020-04-07). https://www.depthfirstlearning.com/2020/Resurrecting-Sigmoid</p>
<p>[Editor’s Note: This class was a part of the 2019 DFL Jane Street Fellowship.]</p>
<p>This guide would not have been possible without the help and feedback from many people.</p>
<p>Special thanks to Yasaman Bahri for her feedback, support, and mentoring.</p>
<p>Thank you to Kumar Krishna Agrawal, Sam Schoenholz, and Jeffrey Pennington for their valuable input and guidance.</p>
<p>Finally, thank you to our group members Chris Akers, Brian Friedenberg, Sajel Shah, Vincent Su, and Witold Szejgis for their curiosity, commitment to the course material, and feedback on the curriculum.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/resurrecting-the-sigmoid.svg" width="200"></iframe>
<div>Concept dependency graph. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>As deep networks continue to make progress in a variety of tasks such as vision and language processing,
it is important to understand how to properly train very deep networks with gradient-based methods.
This paper studies, from a rigorous theoretical perspective, which combinations of network weight initializations and network activation functions result in trainable deep networks.
The analysis framework used is broadly applicable to general network architectures.</p>
<p>In this curriculum, we will go through all the background topics necessary to understand this mathematically heavy paper. By the end, you will have an understanding of the dynamics of signal propagation in very wide neural networks, as well as an introduction to random matrix theory.</p>
<p><br /></p>
<h1 id="general-resources">General resources</h1>
<p>This paper is founded upon Random Matrix Theory (RMT), and mean-field analysis of signal propagation.
The first resource below is a friendly introduction to RMT, while the second and third are the
papers in which the mean-field analysis for deep neural networks was developed.
These are good resources to return to throughout the course. For Deep Learning, we recommend Goodfellow et al.,
listed as the fourth resource. Finally, the course outline is listed below in case you want an offline copy.</p>
<ol>
<li>Livan, Novaes & Vivo: <a href="https://arxiv.org/abs/1712.07903">Introduction to Random Matrices - Theory and Practice</a>.</li>
<li>Poole, Lahiri, Raghu, Sohl-Dickstein & Ganguli: <a href="https://arxiv.org/abs/1606.05340">Exponential expressivity in deep neural networks through transient chaos</a>.</li>
<li>Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein: <a href="https://arxiv.org/pdf/1611.01232.pdf">Deep information propagation</a>.</li>
<li>Goodfellow, Bengio & Courville: <a href="http://www.deeplearningbook.org">Deep Learning</a>.</li>
<li><a href="/assets/sigmoid/misc/Course Outline.docx">Course Outline</a>.</li>
</ol>
<h1 id="1-introduction-to-trainability">1 Introduction [to Trainability].</h1>
<p><strong>Motivation</strong>: The paper we will study here is part of a body of
work with the broad goal of understanding which combinations of network architecture
and initialization allow a neural network to be trained with gradient-based methods. This week, you will read about this problem, specifically its manifestation in
deep neural networks.</p>
<p>We also suggest that you skim the paper itself, focusing on the introductory sections,
to understand the relevance of vanishing/exploding gradients to the trainability of neural networks.</p>
<p><strong>Objectives</strong>: Understand the following background.</p>
<ul>
<li>Explain the vanishing/exploding gradient problem and why it worsens with network depth.</li>
<li>Relate vanishing/exploding gradients to the spectra of various Jacobians.</li>
<li>Understand heuristics used by the community to circumvent the vanishing/exploding gradients, e.g.
<ul>
<li>Common initialization schemes, such as Xavier initialization.</li>
<li>Pre-training.</li>
<li>Skip connections / residual neural networks.</li>
<li>Non-saturating activation functions (ReLU and its variants).</li>
</ul>
</li>
</ul>
<p>We would also like you to have an overview of the paper’s structure and the problem the paper is trying to solve - how to concentrate the entire spectrum of the network’s Jacobian around unity.
Understand why mean-field signal propagation analysis and random matrix theory are necessary for this task.</p>
<p><strong>Topics</strong>:</p>
<ul>
<li>Trainability of networks, specifically the vanishing/exploding gradient problem.</li>
<li>Introduction to the paper and course overview.</li>
</ul>
<p><strong>Required Reading</strong></p>
<p>Prerequisite:
For familiarity with deep learning, please read the following sections from the <a href="http://www.deeplearningbook.org">Deep Learning book</a>.</p>
<p>Preliminaries:</p>
<ul>
<li>2.7 (Eigendecomposition).</li>
<li>2.8 (Singular value decomposition).</li>
<li>3.2 (Random variables).</li>
<li>3.3 (Probability distributions).</li>
<li>3.7 (Independence and conditional independence).</li>
<li>3.8 (Expectation, variance, and covariance).</li>
<li>5.7 (Supervised learning algorithms).</li>
</ul>
<p>Initialization:</p>
<ul>
<li>8.2 (Challenges in neural network optimization).</li>
<li>8.4 (Parameter initialization strategies).</li>
</ul>
<p>Other:</p>
<ul>
<li><a href="https://arxiv.org/pdf/1511.06422.pdf">All you need is a good init</a> by Mishkin et al., sections 1 and 2.</li>
<li><a href="http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf">Understanding the difficulty of training deep feedforward neural networks</a> by Glorot and Bengio.</li>
<li>Wikipedia <a href="https://en.wikipedia.org/wiki/Residual_neural_network">article</a> on residual networks (skip connections).</li>
</ul>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1512.03385.pdf">Deep Residual Learning for Image Recognition</a>.</li>
<li><a href="http://www.depthfirstlearning.com/2019/NeuralODEs#3-resnets">Depth-first learning : NeuralODEs</a>, section 3 on ResNets.</li>
</ol>
<p><br /></p>
<h1 id="2-signal-propagation">2 Signal propagation</h1>
<p><strong>Motivation</strong>: The <em>Resurrecting the Sigmoid</em> paper relies on signal propagation in wide neural networks.
Understanding this framework connects us to more recent investigations of neural networks as Gaussian processes and the neural tangent kernel (<a href="https://arxiv.org/abs/1806.07572">Jacot et al., 2018</a> and <a href="https://arxiv.org/pdf/1902.06720">Lee et al., 2019</a>).</p>
<p><strong>Topics</strong>:</p>
<ul>
<li>Mean-field analysis of signal propagation in deep neural networks.</li>
</ul>
<p><strong>Required Reading</strong>:</p>
<p>Since this analysis is relatively new, the main sources of information online are the original papers in which it was developed, namely:</p>
<ul>
<li>Poole, Lahiri, Raghu, Sohl-Dickstein & Ganguli: <a href="https://arxiv.org/abs/1606.05340">Exponential expressivity in deep neural networks through transient chaos
</a> (Sections 1, 2, and 3).</li>
<li>Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein: <a href="https://arxiv.org/pdf/1611.01232.pdf">Deep information propagation</a> (Sections 1, 2, 3, and 5).</li>
</ul>
<p>These are very useful references, but not necessarily pedagogical for those unfamiliar with the field.
The problem set below is designed to walk you through the formalism in a self-contained manner. We <strong>strongly</strong> suggest doing the problem set before reading the papers above, consulting them only afterwards or for reference.
Note that certain problems point to sections of the above papers for hints.</p>
<p><strong>Optional Reading</strong>:</p>
<p>Once you understand the mean-field analysis framework, you will have a good foundation for the following papers. These are ‘bonus’ and not connected to the target paper.</p>
<ol>
<li><a href="https://arxiv.org/abs/1711.00165">Deep Neural Networks as Gaussian Processes</a></li>
<li><a href="https://arxiv.org/abs/1902.06720">Wide Neural Networks of Any Depth Evolve as Linear Models under Gradient Descent</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<p>This week’s problem set is <a href="/assets/sigmoid/problem-sets/pdfs/1.pdf">here</a>. In this section we highlight a couple of the problems.</p>
<ol>
<li>
<p>Problem 2: The mean field approximation.</p>
<p>In this problem, we use the knowledge we gained in problem 1 to properly choose to initialize the weights and biases according to \(W^l \sim \mathcal{N}(0, \sigma_w^2/N)\) and \(b^l \sim \mathcal{N}(0, \sigma_b^2)\). We’ll investigate some techniques that will be useful in understanding precisely how the network’s random initialization influences what the net does to its inputs; specifically, we’ll be able to take a look at how the <em>depth</em> of the network together with the initialization governs the propagation of an input point as it flows forward through the network’s layers.</p>
<ol>
<li>A natural property to study in a network is the length of its activation vectors. Intuitively, this is closely related to how the net transforms the input space, and to how its depth relates to that transformation. Compute the length of the activation vector output by layer \(l\). When considering non-rectangular nets, where layer \(l\) has width \(N_l\), we want to separate the scale of the activations from the width of individual layers. What’s a more appropriate quantity \(q^l\) we can track to understand how the lengths of activation vectors change in the net?
<details><summary>Solution</summary>
<p>
The length is simply the Euclidean magnitude, whose square is \(\sum_{i = 1}^{N_l} (h_i^l)^2\). We can stabilize this quantity, especially when the width differs across layers, by normalizing:
$$ q^l = \frac{1}{N_l} \sum_{i = 1}^{N_l} (h_i^l)^2 $$
</p>
</details>
</li>
<li>What probabilistic quantity of the neuronal activations does \(q^l\) approximate (with the approximation improving for larger \(N\))?
<details><summary>Hint</summary>
Recall that all neuronal activations \(h^l_i\) are zero-mean, and consider the definition of \(q^l\) from part (a) in terms of the empirical distribution of \(h^l_i\).
</details>
<details><summary>Solution</summary>
<p>
\(q^l\) is the second moment of the empirical distribution of layer \(l\) activations, and hence approximates the variance. Indeed, as \(N \to \infty\), the empirical average converges to \(\mathbb{E} \left( (h^l_i)^2 \right) = \text{Var}(h^l_i)\), where the last equality uses the fact that the \(h^l_i\) are zero-mean.
</p>
</details>
</li>
<li>Calculate the variance of an individual neuron’s pre-activations, that is, the variance of \(h_i^l\). Your answer should be a recurrence relation, expressing this variance in terms of \(h^{l-1}\) (and the parameters \(\sigma_w\) and \(\sigma_b\)).
<details><summary>Solution</summary>
<p>
Because the means of both the weight and bias distributions are zero, to calculate the variance we just need to calculate the second moment. We can use the fact that the weights and biases are initialized independently, so that the cross terms vanish and the variance of \(h_i^l\) is the sum of a weight term and a bias term:
$$ \begin{align*}
\langle (h_i^l)^2 \rangle &= \left\langle \left( \sum_j W_{ij}^l x_j^{l-1} + b_i^l \right) ^2 \right\rangle \\
&= \left\langle \sum_{jj'} W_{ij}^l W_{ij'}^l x_j^{l-1} x_{j'}^{l-1} \right\rangle + \langle (b_i^l)^2 \rangle\\
&= \left\langle \sum_{jj'} W_{ij}^l W_{ij'}^l x_j^{l-1} x_{j'}^{l-1} \right\rangle + \sigma_b^2\\
&= \frac{\sigma_w^2}{N} \sum_j \langle (x_j^{l-1})^2 \rangle + \sigma_b^2\\
&= \sigma_w^2 \langle (x^{l-1})^2 \rangle + \sigma_b^2\\
&= \sigma_w^2 \langle \phi(h^{l - 1})^2 \rangle + \sigma_b^2
\end{align*} $$
</p>
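<p>As a sanity check on this recurrence, here is a minimal Monte Carlo sketch. It assumes a fixed, hypothetical vector of previous-layer activations <code>x_prev</code> and redraws the weights and bias of a single neuron many times; the function names and parameter values are illustrative and not part of the problem set.</p>

```python
import numpy as np

def empirical_variance(x_prev, sigma_w, sigma_b, n_draws=5000, seed=0):
    """Variance of h_i = sum_j W_ij x_j + b_i over independent draws of (W, b)."""
    rng = np.random.default_rng(seed)
    n = x_prev.shape[0]
    # Each row is one independent draw of the weights feeding a single neuron.
    w = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n_draws, n))
    b = rng.normal(0.0, sigma_b, size=n_draws)
    h = w @ x_prev + b
    return h.var()

def predicted_variance(x_prev, sigma_w, sigma_b):
    """Recurrence from the derivation: sigma_w^2 * mean(x^2) + sigma_b^2."""
    return sigma_w**2 * np.mean(x_prev**2) + sigma_b**2

# Hypothetical previous-layer activations: tanh of Gaussian pre-activations.
x_prev = np.tanh(np.random.default_rng(1).normal(size=200))
empirical = empirical_variance(x_prev, sigma_w=1.5, sigma_b=0.3)
predicted = predicted_variance(x_prev, sigma_w=1.5, sigma_b=0.3)
print(f"empirical {empirical:.4f}  predicted {predicted:.4f}")
```

<p>The two numbers should agree up to sampling noise, which shrinks as <code>n_draws</code> grows.</p>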
</details>
</li>
<li>Now consider the limit in which the number of hidden neurons, \(N\), approaches infinity. Use the central limit theorem to argue that in this limit, the pre-activations will be zero-mean Gaussian distributed. Be explicit about the conditions under which this result holds.
<details><summary>Solution</summary>
<p>
The basic idea here is to use the central limit theorem since the pre-activation is a sum of a large number of random variables, i.e.:
$$ h_i^l = \sum_j^N W_{ij}^l x_j^{l-1} + b_i^l $$
There are \(N\) terms in the sum, so as \(N\) goes to infinity, we should have a sum of a large number of random variables which should be well-approximated by a Gaussian.
However, there are a few things we need to be careful of:
1. CLT can show that the sum \(\sum_j^N W_{ij}^l x_j^{l-1}\) is Gaussian-distributed, but there is still the bias term \(b_i^l\). So we do have to assume that the bias term is Gaussian-distributed as well.
2. In order to use CLT, we need each of the variables being added to have finite variance. These individual variables are \(W_{ij}^{l}x_j^{l-1}\). By construction the weights have finite variance; what about the previous layer's activations, \(x_j^{l-1}\)? Unless the activation function \(\phi\) is pathological, if we <em>assume that the previous-layer pre-activations have finite variance</em>, there should not be a problem here. In fact, if we just assume that the input distribution, i.e. \(x^0\), has finite variance, all the layers' activations do too. Certainly the commonly used activation functions sigmoid, ReLU, etc. cannot turn a finite-variance sample of pre-activations into an infinite-variance sample of activations.
3. In order to use the CLT, we also need each of the variables being added to have identical distributions. This is true by symmetry.
4. The final condition for use of CLT is that the variables being added are all independent. Taking another look at the definition of \(q^l\),
$$ q^l = \frac{1}{N} \sum_{i = 1}^N (h_i^l)^2 $$
we want to show that each \(h_i^l\) is independent (from which the independence of their squares follows). Each \(h_i^l\) is in turn defined
$$ h_i^l = \sum_{j = 1}^N W^l_{ij} \phi(h^{l - 1}_j) + b^l_i $$
By assumption, \( W^l_{ij} \) and \(b^l_i\) are independent from each other and, over all \(i, j\), from any quantities in previous layers, including \(\phi(h^{l - 1}_j)\). But are the \(W^l_{ij}\) independent of the \(h^{l - 1}_j\)? To justify this, observe that we can view the sum above as a linear combination of the random variables \(W^l_{ij}\); even though, technically, the linear combination is also over random variables \(\phi(h^{l - 1}_j)\), the key is that over \(1 \leq i \leq N\), all the \(h^{l - 1}_j\)'s are the same. In other words, each neuronal activation in layer \(l\) depends on the same exact realization of the random variables that are the activations of the previous layer. So, \(h^l_i\) is essentially a linear combination of the (independent) \(W^l_{ij}\) with deterministic weights, at least with respect to \(i\). So, we can justify the use of the CLT in analyzing \(\lim_{N \to \infty} q^l\).
</p>
</details>
</li>
<li>With this zero-mean Gaussian approximation of the pre-activations, we have a single parameter characterizing this aspect of signal propagation in the net: the variance, \(q^l\), of individual neuronal activations (a proxy for squared activation vector lengths). Let’s now look at how this variance changes from layer to layer, by deriving the relationship between \(q^l\) and \(q^{l - 1}\). In part (c), your answer should have included a term \(\langle (x^{l-1})^2 \rangle\). In terms of the activation function \(\phi\) and the variance \(q^{l-1}\), write this expectation value as an integral over the standard Gaussian measure.
<details><summary>Solution</summary>
<p>
Since \(x_i^{l-1} = \phi(h_i^{l-1})\), we can write the variance \(\langle (x^{l-1})^2 \rangle\) as
$$ \begin{align*}
\langle (x^{l-1})^2 \rangle &= \langle \phi(h^{l-1})^2 \rangle \\
&= \int_\mathbb{R}~dx ~\phi(x)^2 ~p_{h^{l-1}}(x),
\end{align*} $$
where \(p_{h^{l-1}}(x)\) is the pdf of the pre-activations \(h^{l-1}\). By assumption this is a zero-mean Gaussian of variance \(q^{l-1}\), i.e.
$$ p_{h^{l-1}}(x) = \frac{1}{\sqrt{2\pi q^{l-1}}} e^{-\frac{x^2}{2q^{l-1}}} $$
This can be written in terms of the standard Gaussian distribution \(\rho(x)\) via the change of variables
$$ p_{h^{l-1}}(x) = \frac{1}{\sqrt{q^{l-1}}} \rho(x/\sqrt{q^{l-1}}) $$
meaning that the variance \(\langle (x^{l-1})^2 \rangle\) becomes
$$ \langle (x^{l-1})^2 \rangle = \frac{1}{\sqrt{q^{l-1}}} \int_\mathbb{R}~dx ~\phi(x)^2 ~\rho(x/\sqrt{q^{l-1}}) $$
Let \(y = x/\sqrt{q^{l-1}}\), then
$$\langle (x^{l-1})^2 \rangle =\int_\mathbb{R}~dy ~\phi(y\sqrt{q^{l-1}})^2 ~\rho(y).$$
</p>
</details>
</li>
<li>Use this result to write a recursion relation for \(q^l\) in terms of \(q^{l-1}\), \(\sigma_w\), and \(\sigma_b\).
<details><summary>Solution</summary>
<p>
We just plug in, to get
$$ q^l = \sigma_w^2 \int_\mathbb{R}~dy ~\phi(y\sqrt{q^{l-1}})^2 ~\rho(y) + \sigma_b^2 $$
</p>
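<p>The recursion can be compared against a single wide random network. The sketch below is only an illustration under stated assumptions: it evaluates the Gaussian integral with Gauss-Hermite quadrature (a choice made here, not something the problem set prescribes), treats the input as a vector of layer-0 pre-activations <code>h0</code>, and uses hypothetical values for the width, depth, \(\sigma_w\), and \(\sigma_b\).</p>

```python
import numpy as np

# Probabilists' Gauss-Hermite rule: integrates against exp(-z^2 / 2).
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(80)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)  # normalize to the standard Gaussian

def length_map(q_prev, sigma_w, sigma_b, phi=np.tanh):
    """Mean-field recursion q^l = sigma_w^2 E[phi(sqrt(q^{l-1}) z)^2] + sigma_b^2."""
    return sigma_w**2 * np.sum(WEIGHTS * phi(np.sqrt(q_prev) * NODES) ** 2) + sigma_b**2

def simulate_lengths(h0, depth, sigma_w, sigma_b, phi=np.tanh, seed=0):
    """Empirical q^l = mean_i (h_i^l)^2 in one randomly initialized network."""
    rng = np.random.default_rng(seed)
    n = h0.shape[0]
    h, qs = h0, [np.mean(h0**2)]
    for _ in range(depth):
        w = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n))
        b = rng.normal(0.0, sigma_b, size=n)
        h = w @ phi(h) + b
        qs.append(np.mean(h**2))
    return qs

n, depth, sigma_w, sigma_b = 1000, 5, 1.5, 0.3
h0 = np.random.default_rng(1).normal(size=n)
empirical = simulate_lengths(h0, depth, sigma_w, sigma_b)
theory = [empirical[0]]
for _ in range(depth):
    theory.append(length_map(theory[-1], sigma_w, sigma_b))
for layer, (e, t) in enumerate(zip(empirical, theory)):
    print(layer, round(e, 3), round(t, 3))
```

<p>For a width of this size the empirical and predicted values should track each other to within a few percent per layer; the agreement is expected to improve as the width grows.</p>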
</details>
</li>
</ol>
</li>
<li>
<p>Problem 3: Fixed points and stability.</p>
<p>In the previous problem, we found a recurrence relation relating the length of a vector at layer \(l\) of a network to the length of the vector at the previous layer, \(l-1\) of the network. In this problem, we are interested in studying the properties of this recurrence relation. In the <em>Resurrecting the sigmoid</em> paper, the results of this problem are used to understand at which bias point to evaluate the Jacobian of the input-output map of the network.</p>
<p>Note that in this problem, we are just taking the recurrence relation as a given, i.e. we do not need to worry about random variables or probabilities; all of that went into determining the recurrence relation. Instead, we’ll use tools from the theory of dynamical systems to investigate the properties - in particular, the asymptotics - of this recurrence relation.</p>
<ol>
<li>A simple example of a dynamical system is a recurrence defined by some initial value \(x_0\) and a relation \(x_n = f(x_{n-1})\) for all \(n>0\). This system defines the resulting sequence \(x_n\). Sometimes, these systems have <em>fixed points</em>, which are values \(x^*\) such that \(f(x^*) = x^*\). If the value of the system, \(x_m\), at some time-step \(m\) , happens to be a fixed point \(x^*\), what is the subsequent evolution of the system?
<details><summary>Solution</summary>
<p>
Since \(f(x^*) = x^*\), for all times greater than \(m\), the system simply stays at \(x^*\).
</p>
</details>
</li>
<li>For the recurrence relation you derived in the previous problem, what is the equation which a fixed-point of the variance, \(q^*\), must satisfy? Under some conditions (i.e. for some values of \(\sigma_w\) and \(\sigma_b\)), the value \(q^*=0\) is a fixed point of the system. What are these conditions?
<details><summary>Solution</summary>
<p>
A fixed point has to satisfy
$$ q^* = \sigma_w^2 \int_\mathbb{R} \phi \left( \rho \sqrt{q^*} \right)^2 \text{ d}\rho + \sigma_b^2 $$
where \( \text{d}\rho \) is the standard Gaussian measure. If \(\sigma_b = 0\), i.e. there is no bias term, and the nonlinearity has a zero y-intercept, then there is a trivial fixed point of \(q^* = 0\).
</p>
</details>
</li>
<li>Now let us be concrete, and look at the recurrence relation in the special case of a nonlinearity \(\phi(h)\) which is both monotonically increasing and satisfies \(\phi(0) = 0\). Note that both of the nonlinearities considered in the paper we are studying, the \(\tanh\) and ReLU nonlinearities, satisfy this property. Show that those two properties (monotonicity and \(\phi(0)=0\)) imply that the length map \(q^l(q^{l-1})\) is monotonically increasing. What is the maximum number of times any concave function can intersect the line \(y = x\)? What does this imply about the number of fixed points the length map \(q^l(q^{l-1})\) can have?
<details><summary>Solution</summary>
<p>
To prove that the function is monotonically increasing with its argument \(q\), we take the derivative:
$$ \begin{align*}
f(q) &= \sigma_w^2 \int_\mathbb{R}~\phi(\rho\sqrt{q})^2~d\rho + \sigma_b^2 \\
f'(q) &= \frac{\sigma_w^2}{\sqrt{q}} \int_\mathbb{R}~\phi(\rho\sqrt{q}) \phi ' (\rho\sqrt{q}) \rho d\rho
\end{align*} $$
The derivative is nonnegative: by assumption \(\phi '\) is nonnegative everywhere, and because \(\phi\) is monotonically increasing with \(\phi(0) = 0\), \(\phi(\rho\sqrt{q})\) has the same sign as \(\rho\), so the product \(\phi(\rho\sqrt{q}) \rho\) is nonnegative as well. So the function is monotonically increasing.
Note that since a fixed point is defined as a point, \(x^*\), such that \(f(x^*) = x^*\), graphically the fixed point can be found from the intersection of the length map \(q^l(q^{l-1})\) with the line \(y = x\).
If you think about the definition of a concave function (specifically, the version of the definition which states that between any two points \(x=a\) and \(x=b\), the graph of the function must lie above the chord connecting \((a, f(a))\) and \((b, f(b))\)), you will realize that a concave function cannot intersect any line more than twice. Thus, concavity implies that the function can have at most two fixed points.
</p>
</details>
</li>
<li>Let’s be concrete now and consider the nonlinearity to be a ReLU. Compute (analytically) the length map \(q^l = f(q^{l-1})\), which will also depend on \(\sigma_w\) and \(\sigma_b\) . For what values of \(\sigma_w\) and \(\sigma_b\) does the system have fixed point(s)? How does the value of the fixed point depend on \(\sigma_w\) and \(\sigma_b\)?
<details><summary>Solution</summary>
<p>
Starting from
$$ f(q) = \sigma_w^2 \int_\mathbb{R}~\phi(\rho\sqrt{q})^2~d\rho + \sigma_b^2 $$
and explicitly inserting the nonlinearity \(\phi\) gives
$$ f(q) = \sigma_w^2 \int_0^\infty \rho^2 q~d\rho + \sigma_b^2 $$
Note that since the ReLU nonlinearity is zero when the argument is zero and just the identity function when the argument is greater than zero, we can take its effect into account simply by changing the above limits of integration so that we only integrate over the region in which the argument is positive. Now we can pull \(q\) out of the integral,
$$ f(q) = q \sigma_w^2 \int_0^\infty \rho^2 ~d\rho + \sigma_b^2 $$
and to evaluate the integral, note that by symmetry of the Gaussian distribution, it's half of what it would be if we had the limits from \(-\infty\) to \(\infty\), in which case it would just be the variance of a standard Gaussian, and so
$$ f(q) = q \frac{\sigma_w^2}{2} + \sigma_b^2 $$
The important things to note here are that because \(f(q)\) is a simple linear function, there is at most a single fixed point of the system. If \(\sigma_b^2\) is zero, that fixed point is at \(q=0\). If \(\sigma_b^2 > 0\), then there is a fixed point only if \(\sigma_w < \sqrt{2}\); solving \(q^* = q^* \sigma_w^2 / 2 + \sigma_b^2\) gives \(q^* = \sigma_b^2 / (1 - \sigma_w^2 / 2)\), which grows with \(\sigma_b\) and diverges as \(\sigma_w \to \sqrt{2}\). Otherwise, the system does not have any fixed point. This is a qualitative difference from the \(\tanh\) case, in which there is always a fixed point.
A slightly strange case is when \(\sigma_w = \sqrt{2}\) exactly, and \(\sigma_b = 0\). In this case, the recurrence relation gives \(q^l(q^{l-1}) = q^{l-1}\), meaning that every point is a fixed point.
</p>
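<p>A small numerical sketch of the ReLU case, under the assumption that the closed form above is what we want to check: a Monte Carlo estimate of the half-variance integral, plus a helper that returns the fixed point of the linear map. The helper names and the sample values are hypothetical.</p>

```python
import numpy as np

def relu_length_map(q_prev, sigma_w, sigma_b):
    """Closed form from the solution: f(q) = q * sigma_w^2 / 2 + sigma_b^2."""
    return 0.5 * sigma_w**2 * q_prev + sigma_b**2

def relu_fixed_point(sigma_w, sigma_b):
    """Nonnegative solution of q = f(q), or None when there is no unique one."""
    slope = 0.5 * sigma_w**2
    if np.isclose(slope, 1.0):
        return None  # sigma_w = sqrt(2): every q is fixed (sigma_b = 0) or none is
    if sigma_b == 0:
        return 0.0
    if slope > 1.0:
        return None  # the line would cross y = x only at negative q
    return sigma_b**2 / (1.0 - slope)

# Monte Carlo check that E[relu(sqrt(q) z)^2] = q / 2 for standard Gaussian z.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
q = 3.0
monte_carlo = np.mean(np.maximum(np.sqrt(q) * z, 0.0) ** 2)
print(monte_carlo, q / 2)
print(relu_fixed_point(sigma_w=1.0, sigma_b=0.5))
```

<p>Returning <code>None</code> for the degenerate and divergent cases is a design choice of this sketch; it keeps the three situations discussed in the solution apart.</p>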
</details>
</li>
<li>Now let’s consider the sigmoid nonlinearity \(\phi(h) = \tanh(h)\). In this case the length map cannot be computed analytically, but it can be done numerically. Numerically plot the length map, \(q^l=f(q^{l-1})\), for a few values of \(\sigma_w\) and \(\sigma_b\) in the following regimes: (i) \(\sigma_b=0\) and \(\sigma_w < 1\), (ii) \(\sigma_b = 0\) and \(\sigma_w > 1\), and (iii) \(\sigma_b > 0\). Describe qualitatively the fixed points of the map in each regime.
<details><summary>Solution</summary>
<p>
The following Python code should work:
<pre>
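import numpy as np
import scipy.integrate as integrate

def integrand(x, q):
    # Standard Gaussian density times tanh(sqrt(q) * x)^2
    gaussian = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    return np.tanh(x * np.sqrt(q))**2 * gaussian

def fint(q):
    # Integrate over the real line; quad returns (value, error estimate)
    result = integrate.quad(integrand, -np.inf, np.inf, args=(q,))
    return result[0]

def lengthmap(q, sigma_w, sigma_b):
    # Length map: q^l = sigma_w^2 * E[tanh(sqrt(q) z)^2] + sigma_b^2
    return sigma_w**2 * fint(q) + sigma_b**2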
</pre>
The behavior that should be seen is the following, as described in the transient chaos paper (ignoring the parts about stability, which are covered in the next part of the problem):
<em>For \(\sigma_b = 0\) and \(\sigma_w < 1\), the only intersection is at \(q^*=0\). In this bias-free, small weight regime, the network shrinks all inputs to the origin. For \(\sigma_w > 1\) and \(\sigma_b = 0\), the \(q^*=0\) fixed point becomes unstable and the length map acquires a second nonzero fixed point, which is stable. In this bias-free, large weight regime, the network expands small inputs and contracts large inputs. Also, for any nonzero bias \(\sigma_b\), the length map has a single stable non-zero fixed point. In such a regime, even with small weights, the injected biases at each layer prevent signals from decaying to 0.</em>
</p>
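<p>To locate the fixed points without reading them off a plot, one option is simply to iterate the length map until it stops changing. The sketch below assumes Gauss-Hermite quadrature in place of <code>scipy.integrate.quad</code>, and the three parameter pairs are hypothetical representatives of regimes (i), (ii), and (iii). Iteration only finds the fixed point that attracts the starting value.</p>

```python
import numpy as np

# Probabilists' Gauss-Hermite rule, normalized to the standard Gaussian measure.
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(100)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)

def tanh_length_map(q, sigma_w, sigma_b):
    """q^l as a function of q^{l-1} for the tanh nonlinearity."""
    return sigma_w**2 * np.sum(WEIGHTS * np.tanh(np.sqrt(q) * NODES) ** 2) + sigma_b**2

def iterate_to_fixed_point(sigma_w, sigma_b, q0=1.0, max_iter=10_000, tol=1e-12):
    """Apply the length map repeatedly until successive values differ by less than tol."""
    q = q0
    for _ in range(max_iter):
        q_next = tanh_length_map(q, sigma_w, sigma_b)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

# One illustrative (sigma_w, sigma_b) pair per regime.
regimes = {"(i)": (0.8, 0.0), "(ii)": (2.0, 0.0), "(iii)": (0.8, 0.3)}
fixed_points = {name: iterate_to_fixed_point(sw, sb) for name, (sw, sb) in regimes.items()}
for name, q_star in fixed_points.items():
    print(name, q_star)
```

<p>In regime (i) the iterates should decay toward zero, while in regimes (ii) and (iii) they should settle at a positive value.</p>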
</details>
</li>
<li>Let’s now talk about the stability of fixed points. In a dynamical system, once the system reaches (or starts at) a fixed point, by definition it can never leave. But what happens if the system gets or starts near a fixed point? In real physical systems, this question is very relevant because physical systems almost always have some noise which pushes the system away from a fixed point. In general, the fixed point can be either stable or unstable. For a stable fixed point, initializing the system near the fixed point will result in behavior which converges to the fixed point, i.e. reducing the magnitude of the perturbation away from the fixed point. Conversely, for an unstable fixed point, the system initialized nearby will be repelled from the fixed point. Use the derivative of the length map at a fixed point to derive conditions on the stability of the fixed point.
<details><summary>Solution</summary>
<p>
If the absolute value of the derivative \(\frac{df}{dx}\), evaluated at the fixed point \(x^*\), is less than \(1\), then the fixed point is stable. This can be seen from considering initializing the system near the fixed point, say at \(x^* + \Delta x\). After going through the length map, the value will be
$$ \begin{aligned}
f(x^* + \Delta x) &\approx f(x^*) + f'(x^*) \Delta x \\
&= x^* + f'(x^*) \Delta x
\end{aligned} $$
So the deviation from the fixed point \(x^*\) has changed to \(f'(x^*) \Delta x\). If the magnitude of \(f'(x^*)\) is less than \(1\), then the magnitude of this deviation is lower than \(\Delta x\), the system is getting closer to the fixed point, and the fixed point is said to be stable.
Conversely, if the magnitude of \(f'(x^*)\) is greater than \(1\), then the deviations away from equilibrium grow, and the equilibrium is unstable.
</p>
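<p>The slope criterion can be tried out on any one-dimensional map. The following sketch estimates the derivative by central differences and also tracks how a small perturbation evolves; the two linear maps are hypothetical toy examples chosen so that the slopes are known exactly.</p>

```python
def classify_fixed_point(f, x_star, eps=1e-6, tol=1e-6):
    """Classify x_star by the slope of f there, estimated by central differences."""
    slope = (f(x_star + eps) - f(x_star - eps)) / (2.0 * eps)
    if abs(slope) < 1.0 - tol:
        return "stable", slope
    if abs(slope) > 1.0 + tol:
        return "unstable", slope
    return "marginal", slope

def perturbation_sizes(f, x_star, delta, steps):
    """Track |x_n - x_star| when the system starts at x_star + delta."""
    x, sizes = x_star + delta, []
    for _ in range(steps):
        x = f(x)
        sizes.append(abs(x - x_star))
    return sizes

def contracting(x):
    return 0.5 * x + 1.0  # fixed point at x* = 2, slope 0.5

def expanding(x):
    return 2.0 * x - 1.0  # fixed point at x* = 1, slope 2

print(classify_fixed_point(contracting, 2.0)[0], perturbation_sizes(contracting, 2.0, 0.1, 5))
print(classify_fixed_point(expanding, 1.0)[0], perturbation_sizes(expanding, 1.0, 0.1, 5))
```

<p>Note that the central difference evaluates the map on both sides of the fixed point, so for a length map with a fixed point at \(q^* = 0\) a one-sided difference would be needed instead.</p>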
</details>
</li>
<li>With this understanding of stability, revisit your result in part (e) for the \(\tanh\) nonlinearity. Specifically, discuss the stability of the fixed points in each of the three regimes. You can estimate the derivative of the length map by looking at the graphs.
<details><summary>Solution</summary>
<p>
See the italicized paragraph in the solutions above, from the transient chaos paper. In regime (i), there is a single fixed point, \(q^*=0\), and it is stable. In regime (ii), there are two fixed points, \(q^*=0\) (unstable) and some other positive value (stable), and in regime (iii), there is only a positive fixed point, which is stable.
</p>
</details>
</li>
<li>Do the same stability analysis for the ReLU network.
<details><summary>Solution</summary>
<p>
In the \(\sigma_b = 0\) case, where the only fixed point is at \(q=0\), that point is stable if \(\sigma_w < \sqrt{2}\) (because then the slope of the line is less than unity) and unstable if \(\sigma_w > \sqrt{2}\). Even for non-zero \(\sigma_b\), the fixed point (which will now be non-zero) is stable if \(\sigma_w < \sqrt{2}\).
The slightly strange case is when \(\sigma_w = \sqrt{2}\) exactly, and \(\sigma_b = 0\). In this case, the recurrence relation gives \(q^l(q^{l-1}) = q^{l-1}\), meaning that every point is a fixed point. In this case, the fixed points are neither stable nor unstable, since perturbations from them will neither grow nor shrink.
</p>
</details>
</li>
<li>(Optional) You should have found above that both the ReLU and \(\tanh\) systems never had more than one stable fixed point. Show that this is a consequence of the concavity of the length map.
<details><summary>Hint</summary>
You can just draw a picture for this one. Consider using the fact that the length map is concave, which we discussed in part c).
</details>
<details><summary>Solution</summary>
<p>
Having two stable fixed points would mean having two intersection points with the line \(y=x\) at which the slope of the function is less than unity. But this means that at both points the function crosses the line \(y=x\) from above to below, which means that there must have been a third intersection point in between, where it crosses back from below to above. But we already proved that because of the concavity of the length map, the system can have at most two fixed points.
<a href="../assets/sigmoid/problem-sets/source/1/stable_fixed_point.png">[stable fixed point]</a>
</p>
</details>
</li>
</ol>
</li>
</ol>
<p><br /></p>
<h1 id="3-random-matrix-theory-introduction">3 Random Matrix Theory: Introduction.</h1>
<p><strong>Motivation</strong>: The crux of the paper uses tools from random matrix theory, which studies ensembles of matrix-valued random variables.
Here, we will take a first stab at analyzing some relevant questions and get a feel for how the spectra of random matrices differ from those of deterministic matrices.
We will also see that the spectra depend on how we sample the matrices. Finally, we will investigate what random matrices from different ensembles have in common.</p>
<p><strong>Objectives</strong>:</p>
<ul>
<li>Gain familiarity with working with the spectra of random matrices.</li>
<li>Understand the typical behavior of a random matrix’s eigenvalues.</li>
<li>Understand how standard RMT eigenvalue distributions are influenced by both level repulsion and confinement.</li>
<li>Understand why RMT is used in the Resurrecting the Sigmoid paper.</li>
</ul>
<p><strong>Topics</strong>:</p>
<ul>
<li>Eigenvalue spacing in random matrices.</li>
</ul>
<p><strong>Readings</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1712.07903.pdf">Livan RMT textbook, sections 2.1 - 2.3</a>.</li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="http://math.mit.edu/~edelman/publications/random_matrix_theory_innovative.pdf">Random Matrix Theory and its Innovative Applications</a> by Edelman and Yang.</li>
<li><a href="https://arxiv.org/pdf/1712.07903.pdf">Livan RMT textbook, chapters 3, 6, and 7</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<p>The full problem set, from which the problems below are taken, is <a href="/assets/sigmoid/problem-sets/pdfs/2.pdf">here</a>.</p>
<ol>
<li>
<p>Avoided crossings in the spectra of random matrices.</p>
<p>In the first DFL session’s intro to RMT, we mentioned that eigenvalues of random matrices tend to repel each other. Indeed, as one of the recommended textbooks on RMT states, this interplay between confinement and repulsion is the physical mechanism at the heart of many results in RMT. This problem explores that statement, relating it to a concept which comes up often in physics - the avoided crossing.</p>
<ol>
<li>
<p>The simplest example of an avoided crossing is in a \(2 \times 2\) matrix. Let’s take the matrix</p>
\[\begin{pmatrix}
\Delta & J \\
J & -\Delta
\end{pmatrix}\]
<ol>
<li>Since this matrix is symmetric, its eigenvalues will be real. What are its eigenvalues?
<details><summary>Solution</summary>
<p>
The polynomial to solve for the eigenvalues \(\lambda\) is
$$ \begin{eqnarray}
(\Delta - \lambda) (-\Delta - \lambda ) - J^2 &=& 0 \\
\lambda^2 - (\Delta^2 + J^2) &=& 0
\end{eqnarray} $$
So the eigenvalues are \(\pm \sqrt{\Delta^2 + J^2}\).
</p>
</details>
</li>
<li>To see the avoided crossing here, plot the eigenvalues as a function of \(\Delta\), first for \(J=0\), then for a few non-zero values of \(J\).
<details><summary>Solution</summary>
<p>
Here is an example graph, showing \(J\) values \(0\), \(1\), and \(2\). The blue line shows no gap when \(J = 0\), and the gap opens up when \(J\) is non-zero.
<a href="../assets/sigmoid/problem-sets/source/2/avoided_crossing.png">[avoided crossing]</a>
</p>
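<p>As a rough sketch of how such a graph could be produced, the snippet below tabulates the two eigenvalue branches over a range of \(\Delta\) and reports the minimal distance between them. The function name <code>eigenvalue_branches</code> and the grid of \(\Delta\) values are illustrative choices, not part of the original problem set.</p>

```python
import numpy as np

def eigenvalue_branches(delta, J):
    """Return the lower and upper eigenvalues of [[delta, J], [J, -delta]]."""
    upper = np.sqrt(delta**2 + J**2)
    return -upper, upper

# Grid of Delta values; the step is chosen so that Delta = 0 lies on the grid.
deltas = np.linspace(-3.0, 3.0, 601)

for J in (0.0, 1.0, 2.0):
    lower, upper = eigenvalue_branches(deltas, J)
    gap = np.min(upper - lower)  # minimal distance between the two curves
    print(f"J = {J}: minimal gap = {gap:.3f}")
    # To draw the curves, pass (deltas, lower) and (deltas, upper)
    # to a plotting library of your choice.
```

<p>The printed minimal gap should stay close to \(2|J|\), which is the quantity asked for in the next part.</p>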
</details>
</li>
<li>You should see a gap (i.e. the minimal distance between the eigenvalue curves) open up as \(J\) becomes non-zero. What is the size of this gap?
<details><summary>Solution</summary>
<p>
To get the gap, evaluate the expression for the eigenvalues when \(\Delta\) is zero, and you find that the gap is \(2|J|\).
</p>
</details>
</li>
</ol>
<p><br /></p>
</li>
<li>
<p>Now take a matrix of the form</p>
\[\begin{pmatrix}
A & C \\
C & D
\end{pmatrix}.\]
<p>In terms of \(A\), \(C\), and \(D\), what is the absolute value of the difference between the two eigenvalues of this matrix?</p>
<details><summary>Solution</summary>
<p>
The difference in eigenvalues won't shift if we add a multiple of the identity matrix to our original matrix, meaning that the eigenvalue difference is the same as that of the matrix
$$ \begin{pmatrix}
\frac{1}{2}\left(A - D\right) & C \\
C & -\frac{1}{2}\left(A - D\right)
\end{pmatrix} $$
The eigenvalue difference is thus (using the eigenvalues we calculated from the previous part):
$$ s = \sqrt{4C^2 + \left( A - D \right) ^2} $$
</p>
</details>
</li>
<li>
<p>Now let’s make the matrix a random matrix. We will take \(A\), \(C\), and \(D\) to be independent random variables, where the diagonal entries \(A\) and \(D\) are distributed according to a normal distribution with mean zero and variance one, while the off-diagonal entry \(C\) is also a zero-mean Gaussian but with a variance of \(\frac{1}{2}\).</p>
<ol>
<li>Use the formula you derived in the previous part of the question to calculate the probability distribution function for the spacing between the two eigenvalues of the matrix.
<details><summary>Solution</summary>
<p>
From the previous part we know the spacing as a function of the random variables \(A\), \(C\), and \(D\):
$$ s = \sqrt{4C^2 + \left( A - D \right) ^2} $$
Writing \(a\), \(b\), and \(c\) for the values taken by \(A\), \(D\), and \(C\), respectively, we can express the density of \(s\) in terms of their joint probability density function:
$$ \begin{eqnarray}
p_s(x) &=& \int~da~db~dc~p_{A,D,C}(a, b, c) ~ \delta(x - s(a,b,c)) \\
&=& \frac{1}{2\pi \sqrt{\pi}}\int_{-\infty}^{\infty}da \int_{-\infty}^{\infty}db \int_{-\infty}^{\infty}dc~ e^{-a^2/2} e^{-b^2/2} e^{-c^2} ~ \delta(x - s(a,b,c))
\end{eqnarray} $$
where \(\delta\) is the Dirac delta function. Now perform the change of variables to cylindrical coordinates \(r\), \(\theta\), \(z\), with
$$ \begin{eqnarray}
r \cos{\theta} &=& a - b \\
r \sin{\theta} &=& 2c \\
z &=& a + b
\end{eqnarray} $$
The inverse of this transformation is
$$ \begin{eqnarray}
a &=& \frac{1}{2} ( r\cos(\theta) + z) \\
b &=& \frac{1}{2} ( z - r\cos(\theta)) \\
c &=& \frac{1}{2} r\sin(\theta)
\end{eqnarray} $$
For later, note that
$$ a^2 + b^2 = \frac{1}{2} (r^2 \cos^2(\theta) + z^2) $$
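The Jacobian determinant of this transformation can be calculated as \(-r/4\), so the volume element becomes \(da~db~dc = \frac{r}{4}~dr~d\theta~dz\) (see the Livan book, section 1.2).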
In terms of the new variables, the spacing \(s\) becomes \(r\), and the integration becomes
$$ \begin{eqnarray}
p_s(x) &=& \frac{1}{2\pi \sqrt{\pi}}\int_{-\infty}^{\infty}da \int_{-\infty}^{\infty}db \int_{-\infty}^{\infty}dc~ e^{-a^2/2} e^{-b^2/2} e^{-c^2} ~ \delta(x - s(a,b,c)) \\
&=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-a^2/2} e^{-b^2/2} e^{-c^2} ~ \delta(x - r) \\
&=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-\frac{1}{2} (a^2 + b^2 + 2c^2)} ~ \delta(x - r) \\
&=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-\frac{1}{4} (r^2 \cos^2(\theta) + z^2 + r^2\sin^2\theta)} ~ \delta(x - r) \\
&=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-\frac{1}{4} (r^2 + z^2)} ~ \delta(x - r) \\
&=& \frac{1}{4\sqrt{\pi}}\int_{0}^{\infty} re^{-\frac{1}{4} r^2}~\delta(x - r)~dr \int_{-\infty}^{\infty}dz~ e^{-\frac{z^2}{4}} ~ \\
&=& \frac{1}{2}\int_{0}^{\infty} re^{-\frac{1}{4} r^2}~\delta(x - r)~dr \\
&=& \frac{x}{2}e^\frac{-x^2}{4}
\end{eqnarray} $$
</p>
</details>
</li>
<li>What is the behavior of this pdf at zero? How does this relate to the avoided crossing you calculated earlier?
<details><summary>Solution</summary>
<p>
Clearly the pdf we calculated above is exactly zero at \(s=0\), and grows linearly with \(s\) for small \(s\). This absence of spacings at zero is the same phenomenon as the avoided crossing noted above for deterministic matrices. Another way to see this is to note that, from the spacing formula, the only way to have a spacing of zero is for the diagonal elements to equal each other and for the off-diagonal element to be zero. The set of points satisfying both conditions is a line in the full 3D space of \((A, C, D)\), so it has probability zero of occurring.
</p>
</details>
</li>
<li>Verify using numerical simulation that the pdf you found in the previous part is correct.
<details><summary>Solution</summary>
<p>
The following Python code should work; by generating plots using the two functions, we can verify that they match.
<pre>
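import numpy as np

def eigenvalue_spacing():
    # Sample one symmetric 2x2 matrix and return the gap between its eigenvalues.
    A = np.random.normal(scale=1)
    D = np.random.normal(scale=1)
    C = np.random.normal(scale=np.sqrt(0.5))  # variance 1/2
    M = np.array([[A, C], [C, D]])
    eigenvalues = np.linalg.eigvalsh(M)  # M is symmetric, so its eigenvalues are real
    return abs(eigenvalues[0] - eigenvalues[1])

def pdf(x):
    # Spacing density derived in the earlier part.
    return x / 2 * np.exp(-x**2 / 4)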
</pre>
</p>
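<p>One possible way to carry out the comparison without plotting is sketched below: it draws many spacings at once using the closed-form spacing \(\sqrt{4C^2 + (A-D)^2}\), builds a normalized histogram, and measures how far the histogram is from the predicted pdf. The sample size, bin count, and seed are arbitrary choices made for this illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_samples = 200_000

A = rng.normal(scale=1.0, size=n_samples)
D = rng.normal(scale=1.0, size=n_samples)
C = rng.normal(scale=np.sqrt(0.5), size=n_samples)  # variance 1/2
spacings = np.sqrt(4 * C**2 + (A - D)**2)  # spacing of [[A, C], [C, D]]

def pdf(x):
    return x / 2 * np.exp(-x**2 / 4)

# Normalized histogram of the sampled spacings.
counts, edges = np.histogram(spacings, bins=40, range=(0.0, 8.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

max_abs_error = np.max(np.abs(counts - pdf(centers)))
print(f"largest gap between histogram and pdf: {max_abs_error:.4f}")
print(f"sample mean: {spacings.mean():.4f}, predicted mean: {np.sqrt(np.pi):.4f}")
```

<p>The predicted mean \(\sqrt{\pi}\) follows from integrating \(x \cdot \frac{x}{2} e^{-x^2/4}\) over \(x \geq 0\); the histogram error should shrink as the number of samples grows.</p>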
</details>
</li>
</ol>
</li>
</ol>
</li>
</ol>
<p><br /></p>
<h1 id="4-random-matrix-theory-central-concepts">4 Random Matrix Theory: Central concepts.</h1>
<p><strong>Motivation</strong>: In this section we cover the final topic before we can get to the calculations in the paper - free probability.
Specifically, we discuss its instantiation in random matrix theory.
This is a huge topic, but to understand the paper, we luckily don’t need to cover too much space.
The basic question to think about is, “<strong>Given two random matrices whose spectral densities we know, when can we calculate the spectral density of their sum or product?</strong>”</p>
<p>We also address canonical results in random matrix theory, like the semicircular law.</p>
<p><strong>Objectives</strong>:</p>
<ul>
<li>Understand some basic properties that are of interest when working with random matrices.</li>
<li>Specifically, for this paper, understand why we are interested in the eigenvalue/singular-value distribution of matrices.</li>
<li>Be able to describe some canonical ensembles of random matrices, and their properties.</li>
<li>Be able to explain why we need the theory of freely-independent matrices in this paper.</li>
</ul>
<p><strong>Topics</strong>:</p>
<ul>
<li>Free independence.</li>
<li>The \(R\)- and \(S\)-transforms.</li>
<li>The semicircle law.</li>
</ul>
<p><strong>Reading</strong>:</p>
<p>The primary learning tool is again the problem set. The following readings will help contextualize the problems.</p>
<ol>
<li><a href="https://arxiv.org/pdf/1712.07903.pdf">Livan RMT textbook, chapter 17</a>.</li>
<li>Section 2.3 of the <a href="https://arxiv.org/pdf/1711.04735.pdf">Resurrecting the Sigmoid</a> paper.</li>
<li><a href="https://arxiv.org/abs/1204.2257">Partial Freeness of Random Matrices</a> by Chen et al., Sections 1, 2, 3, and 5.</li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<p>It is tough to find an exposition of free probability theory (i.e., the theory of non-commuting random variables) at an elementary level. The chapter in the Livan textbook listed above is a great resource, and the following papers might also help.</p>
<ol>
<li><a href="https://arxiv.org/pdf/0910.1205.pdf">Financial Applications of Random Matrix Theory: a short review</a> by Bouchaud and Potters, section III.</li>
<li><a href="https://arxiv.org/pdf/physics/0603024.pdf">Applying Free Random Variables to Random Matrix Analysis of Financial Data Part I: A Gaussian Case</a> by Burda et al.</li>
</ol>
<p><strong>Questions</strong>:</p>
<p>The full problem set, from which the below problems are taken, is <a href="/assets/sigmoid/problem-sets/pdfs/3.pdf">here</a>.</p>
<ol>
<li>
<p>Why we need free probability.</p>
<p>In the upcoming lectures, we will encounter the concept of free independence of random matrices. As a reminder, in standard probability theory (of scalar-valued random variables), two random variables \(X\) and \(Y\) are said to be independent if their joint pdf is simply the product of the individual marginals, i.e.</p>
\[p_{X,Y}(x,y) = p_X(x) p_Y(y)\]
<p>When we have independent scalar random variables \(X\) and \(Y\), then in principle it is possible to calculate the distribution of any function of these variables, say the sum \(X + Y\) or the product \(XY\).</p>
<p>When it comes to random matrices, we are often interested in calculating the spectral density (the probability density of eigenvalues) of the sum or product of random matrices. In the <em>Resurrecting the Sigmoid</em> paper, for example, we will calculate the spectral density of the network’s input-output Jacobian, which is the product of several matrices for each layer. So we need an analogue of independent variables for matrices (this condition is known as <em>free independence</em>), such that if we know the spectral densities of each one, we can calculate spectral densities of sums and products.</p>
<p>The simplest condition we might imagine under which two matrix-valued random variables (or, equivalently, two matrix ensembles) would be freely independent is that all of the entries of each matrix are mutually independent. However, it turns out that this condition is not good enough! In other words, independent entries are sometimes not enough to destroy all possible angular correlations between the eigenbases of two matrices. Instead, the property that generalizes statistical independence to random matrices is stronger and is known as <em>freeness</em>.</p>
<p>In this problem, we will see a concrete example of matrix ensembles with mutually independent entries, yet knowing the eigenvalue spectral density of each ensemble is not enough to determine the eigenvalue spectral density of the sum.</p>
<p>Define two different ensembles of 2 by 2 matrices:</p>
<ul>
<li>Ensemble 1: To sample a matrix from ensemble 1, sample a standard Gaussian scalar random variable \(z\) and multiply it by each element in the matrix \(\sigma_z\), where</li>
</ul>
\[\sigma_z = \left( \begin{array}{cc} 1 & 0 \\ 0 & -1 \end{array} \right)\]
<p>Thus the sampled matrix will be \(z \sigma_z\).</p>
<ul>
<li>Ensemble 2: To sample a matrix from ensemble 2, sample a standard Gaussian scalar random variable \(z\) and multiply it by each element in the matrix \(\sigma_x\), where</li>
</ul>
\[\sigma_x = \left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right)\]
<p>Thus the sampled matrix will be \(z \sigma_x\).</p>
<ol>
<li>
<p>What is the spectral density \(\rho_1(x)\) of eigenvalues of matrices sampled from ensemble 1?</p>
<details><summary>Solution</summary>
<p>
The eigenvalues of \(\sigma_z\) are \(\pm 1\), so the spectral density \(\rho_1(x)\) will be identical to the probability density of \(z\) except for a factor of \(2\) (because it has to integrate to \(2\), the number of eigenvalues, instead of \(1\)), namely
$$\rho_1(x) = \frac{\sqrt{2}}{\sqrt{\pi}} e^{-x^2/2}$$
</p>
</details>
</li>
<li>
<p>What is the spectral density \(\rho_2(x)\) of eigenvalues of matrices sampled from ensemble 2?</p>
<details><summary>Solution</summary>
<p>
The eigenvalues of \(\sigma_x\) are exactly the same as those of \(\sigma_z\), namely \(\pm 1\), so the spectral density \(\rho_2(x)\) is the same as \(\rho_1(x)\):
$$\rho_2(x) = \frac{\sqrt{2}}{\sqrt{\pi}} e^{-x^2/2}$$
</p>
</details>
<p><br />
You should have found above that the spectral densities of both ensembles are the same. However, we will see now that simply knowing the spectral density is not enough to determine the spectral density of the sum.</p>
</li>
<li>
<p>Let \(A\) and \(B\) be two matrices independently sampled from ensemble 1. Calculate <em>analytically</em> the spectral density of the sum, \(A + B\).</p>
<details><summary>Solution</summary>
<p>
We can write this matrix as \((z_1 + z_2)\sigma_z\), where \(z_1\) and \(z_2\) are standard normal variables. The eigenvalues are thus \(\pm (z_1 + z_2)\). Since \(z_1\) and \(z_2\) are independent, their sum will be a zero-mean Gaussian with variance \(2\). So the spectral density of eigenvalues will be twice that of such a Gaussian, namely:
$$\rho_{A+B}(\lambda) = \frac{1}{\sqrt{\pi}} e^{-\lambda^2/4}$$
</p>
</details>
</li>
<li>
<p>Now let \(C\) be a matrix sampled from ensemble 2. In the next part, you will calculate the spectral density of the sum \(A + C\), where \(A\) is drawn from ensemble 1 and \(C\) is drawn from ensemble 2. However, to see immediately that the distributions of \(A+B\) and \(A+C\) will be different, consider the behavior of the spectral density of \(A+C\) at zero. Based on your knowledge of avoided crossings from the previous problem set, describe the spectral density of \(A+C\) at \(\lambda =0\) and contrast this to the spectral density of \(A+B\).</p>
<details><summary>Solution</summary>
<p>
Notice that the matrix \(A+C\) will have the same form as the matrix considered in the previous problem set, and we found that for such a matrix the presence of the off-diagonal term caused there to be a level repulsion. So, the eigenvalue spectral density should go to zero as \(\lambda\) approaches zero, for the matrix \(A+C\). However, in the above part, we calculated that for matrices \(A+B\), there is no avoided crossing and the pdf is finite at \(\lambda = 0\).
</p>
</details>
</li>
<li>
<p>Now let \(C\) be a matrix sampled from ensemble 2. Calculate the spectral density of the sum, \(A + C\). Make sure this is consistent with what you argued above about the behavior at \(\lambda = 0\).</p>
<details><summary>Solution</summary>
<p>
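Write \(A + C = z_1 \sigma_z + z_2 \sigma_x\), where \(z_1\) and \(z_2\) are independent standard normal variables. This is the \(2 \times 2\) matrix from the previous problem set with \(\Delta = z_1\) and \(J = z_2\), so its eigenvalues are \(\pm\sqrt{z_1^2 + z_2^2}\). The radius \(r = \sqrt{z_1^2 + z_2^2}\) of a two-dimensional standard Gaussian follows the Rayleigh distribution \(p(r) = r e^{-r^2/2}\) for \(r \geq 0\). Each sample contributes one eigenvalue at \(+r\) and one at \(-r\), so the spectral density (normalized to integrate to \(2\)) is
$$\rho_{A+C}(\lambda) = |\lambda| e^{-\lambda^2/2}$$
This density vanishes linearly as \(\lambda\) approaches zero, consistent with the level repulsion argued above, whereas \(\rho_{A+B}(0) = 1/\sqrt{\pi}\) is finite.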
</p>
</details>
<p>Notice that the answers you got in the previous two parts were different, even though the underlying matrices that were being added had the same spectral density and independent entries.</p>
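<p>The difference between the two sums can also be seen numerically. The sketch below samples the two ensembles, diagonalizes \(A+B\) and \(A+C\) with NumPy, and compares how many eigenvalues fall in a small window around zero. The helper <code>sample_ensemble</code>, the window width, and the sample size are assumptions made for this illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
sigma_z = np.array([[1.0, 0.0], [0.0, -1.0]])
sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])

def sample_ensemble(sigma):
    # One matrix per sample: a standard Gaussian scalar times the fixed matrix sigma.
    z = rng.normal(size=n_samples)
    return z[:, None, None] * sigma

A = sample_ensemble(sigma_z)
B = sample_ensemble(sigma_z)
C = sample_ensemble(sigma_x)

# eigvalsh accepts a stack of symmetric matrices and returns all their eigenvalues.
eig_AB = np.linalg.eigvalsh(A + B).ravel()
eig_AC = np.linalg.eigvalsh(A + C).ravel()

window = 0.1  # half-width of the interval around zero
frac_AB = np.mean(np.abs(eig_AB) < window)
frac_AC = np.mean(np.abs(eig_AC) < window)
print(f"fraction of eigenvalues within {window} of zero:")
print(f"  A + B: {frac_AB:.4f}")
print(f"  A + C: {frac_AC:.4f}")
```

<p>The fraction for \(A+C\) should come out much smaller than the one for \(A+B\), reflecting the level repulsion in \(A+C\).</p>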
</li>
</ol>
</li>
<li>
<p>Using free probability theory.</p>
<p>From the last problem, you learned that if you’re given two different random matrix ensembles, and you know the spectral density of the eigenvalues of each one, that might not be enough to determine the eigenvalue distribution of the sum (or product) of the two random matrices, <em>even if all of the entries of the two matrices are mutually independent!</em> As we mentioned in the last problem, the (stronger) condition that we are after is known as <em>free independence</em>. In general, proving that two matrix ensembles are “free” (freely independent) is quite tough, so we will not do that here. Instead, we will look at the tools we use to do calculations <em>assuming</em> we have random matrix ensembles which are freely independent.</p>
<p>Specifically, we will show that the sum of two freely independent random matrices, each of whose spectral density is given by a semicircle, is also described by the semicircle distribution.</p>
<ol>
<li>
<p>Recall that the spectral density of the Gaussian orthogonal ensemble (in the large \(N\) limit) is given by the semicircle law:</p>
\[\rho_{sc}(x) = \frac{1}{\pi}\sqrt{2-x^2}\]
<p>(sometimes you see this with a \(4\) or \(8\) in the square root and a different factor accompanying \(\pi\) in the denominator. This is just a matter of choosing which Gaussian ensemble—orthogonal, unitary, or symplectic—to use, and doesn’t really matter for this problem)</p>
<p>In a previous problem set, you calculated the Stieltjes transform associated with the spectral density for the Gaussian <em>unitary</em> ensemble. Recall that the Stieltjes transform, \(G(z)\), is defined via the relation</p>
\[G(z) = \int_\mathbb{R}~dt \frac{\rho(t)}{z - t}\]
<p>(In the previous problem set, this was called \(s_{\mu_N}(z)\). In literature you often see the \(G(z)\) notation, since the Stieltjes transform is also known as the <em>resolvent</em> or <em>Green’s function</em>.)</p>
<p>In the last problem set, you should have found that under the Stieltjes transform,</p>
\[\frac{1}{2\pi}\sqrt{4-x^2} \mapsto \frac{z - \sqrt{z^2 - 4}}{2}\]
<p>Use the above fact to calculate the Stieltjes transform of the GOE semicircle given at the beginning of this problem (part (a)). This is the first step to calculating the spectral density of the sum.</p>
<details><summary>Solution</summary>
<p>
Define
$$\begin{eqnarray}
f(x) &=& \frac{1}{2\pi}\sqrt{4 - x^2} \\
g(x) &=& \frac{1}{\pi}\sqrt{2 - x^2}
\end{eqnarray}$$
Notice that
\begin{equation}
g(x) = \sqrt{2} f(x\sqrt{2}).
\end{equation}
If we define \(G_g(z)\) and \(G_f(z)\) as the Green's functions corresponding to \(g(x)\) and \(f(x)\), respectively, then we can get a relation between the two:
\begin{eqnarray}
G_g(z) &=& \int~dt~\frac{g(t)}{z - t} \\
&=& \sqrt{2}\int~dt~\frac{f(t\sqrt{2})}{z - t} \\
&=& \sqrt{2}\int~\frac{dy}{\sqrt{2}} \frac{f(y)}{z - y/\sqrt{2}} \\
&=& \sqrt{2}\int~dy~\frac{f(y)}{z\sqrt{2} - y} \\
&=& \sqrt{2}G_f(z\sqrt{2}).
\end{eqnarray}
Since we have previously calculated that
\begin{equation}
G_f(z) = \frac{z - \sqrt{z^2 - 4}}{2},
\end{equation}
this immediately gives us that
\begin{equation}
G_g(z) = z - \sqrt{z^2 - 2}.
\end{equation}
</p>
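<p>As a hedged numerical check of this result, the sketch below evaluates the Stieltjes integral of \(g\) directly for a few real \(z > \sqrt{2}\) and compares it with \(z - \sqrt{z^2 - 2}\). It uses the substitution \(t = \sqrt{2}\sin\theta\) (a choice made here for numerical smoothness, not taken from the problem set) and a hand-written trapezoid rule.</p>

```python
import numpy as np

def stieltjes_numeric(z, n_points=20001):
    # With t = sqrt(2) * sin(theta), g(t) dt becomes (2/pi) * cos(theta)^2 dtheta.
    theta = np.linspace(-np.pi / 2, np.pi / 2, n_points)
    integrand = (2 / np.pi) * np.cos(theta)**2 / (z - np.sqrt(2) * np.sin(theta))
    # Trapezoid rule written out explicitly.
    dtheta = theta[1] - theta[0]
    return dtheta * (np.sum(integrand) - 0.5 * (integrand[0] + integrand[-1]))

def stieltjes_closed_form(z):
    return z - np.sqrt(z**2 - 2)

# Real z outside the support [-sqrt(2), sqrt(2)] keeps the integrand finite.
for z in (1.5, 2.0, 3.0):
    print(z, stieltjes_numeric(z), stieltjes_closed_form(z))
```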
</details>
<p><br /></p>
</li>
<li>
<p>We have calculated the Stieltjes transform or Green’s function of the semicircle. Now we proceed to calculate the so-called Blue’s function, which is just defined as the functional inverse of the Green’s function. That is, the Green’s function \(G(z)\) and the Blue’s function \(B(z)\) satisfy</p>
\[G(B(z)) = B(G(z)) = z\]
<p>Calculate the Blue’s function corresponding to the semicircle Green’s function you derived above.</p>
<details><summary>Solution</summary>
<p>
The inverse function is defined by the relation
\begin{equation}
z = B - \sqrt{B^2 - 2}.
\end{equation}
Then
\begin{eqnarray}
\sqrt{B^2 - 2} &=& B - z \\
B^2 - 2 &=& B^2 - 2Bz + z^2 \\
2Bz &=& z^2 + 2 \\
B &=& \frac{z}{2} + \frac{1}{z}
\end{eqnarray}
</p>
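<p>A small numerical round trip can confirm that this \(B\) inverts the Green's function. The sketch below restricts to real \(0 &lt; z &lt; \sqrt{2}\), an assumption made so that \(B(z) &gt; \sqrt{2}\) and the square root stays on the real branch used above.</p>

```python
import numpy as np

def green(z):
    # Green's function of the GOE semicircle, valid for real z > sqrt(2).
    return z - np.sqrt(z**2 - 2)

def blue(z):
    # Candidate functional inverse of green.
    return z / 2 + 1 / z

# On 0 < z < sqrt(2), blue(z) exceeds sqrt(2), so green(blue(z)) is real.
zs = np.linspace(0.1, 1.3, 13)
round_trip = green(blue(zs))
max_error = np.max(np.abs(round_trip - zs))
print(max_error)
```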
</details>
</li>
<li>
<p>You should have noticed that the Blue’s function you calculated had a singularity at the origin, that is, a term given by \(1/z\). The \(R\)-transform is defined as the Blue’s function minus that singularity; that is,</p>
\[R(z) = B(z) - \frac{1}{z}\]
<p>What is the \(R\)-transform of the GOE semicircle?</p>
<details><summary>Solution</summary>
<p>
Since
\begin{equation}
B = \frac{z}{2} + \frac{1}{z},
\end{equation}
we can immediately write
\begin{equation}
R(z) = \frac{z}{2}
\end{equation}
</p>
</details>
</li>
<li>
<p>Finally we come to the law of addition of freely independent random matrices: If we are given freely independent random matrices \(X\) and \(Y\), whose \(R\)-transforms are \(R_X(z)\) and \(R_Y(z)\), respectively, then the \(R\)-transform of the sum (or more precisely, the \(R\)-transform of the spectral density of the sum \(X + Y\)) is simply given by \(R_X(z) + R_Y(z)\).</p>
<p>Assume that two standard GOE matrices, say \(H_1\) and \(H_2\), are freely independent. What is the \(R\)-transform of the spectral density of the sum \(H_+ = pH_1 + (1 - p) H_2\)?</p>
<details><summary>Solution</summary>
<p>
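For the plain sum \(H_1 + H_2\), the \(R\)-transforms simply add:
$$R_{H_1 + H_2}(z) = \frac{z}{2} + \frac{z}{2} = z$$
For the weighted sum, each term must first be rescaled. Using the scaling rule \(R_{pH}(z) = p R_H(pz)\), which is derived in the next part, we get
$$R_{H_+}(z) = \frac{p^2 + (1-p)^2}{2} z$$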
</p>
</details>
</li>
<li>
<p>Using the results above, argue that the sum of two freely-independent ensembles described by the semicircular law is also described by the semicircular law.</p>
<details><summary>Solution</summary>
<p>
The \(R\)-transform of the sum (\(z\)) has the same functional form as the individual \(R\)-transforms (\(z/2\)), so it seems plausible that this means that when we invert it, we get a semicircle. Let's make sure of this fact.
We should first figure out how the scaling of the \(R\)-transform affects the scaling of the matrix it is describing. We can guess that this amounts to a simple scaling of the matrix itself. Under the scaling of a general matrix \(H \mapsto cH\), the eigenvalue distribution goes from \(\rho(\lambda) \mapsto \rho(\lambda/c) /c\). Then, by the same logic we used in part (a) of this problem, the Green's function goes \(G(z) \mapsto G(z/c)/c\) (in part (a), \(c\) was \(1/\sqrt{2}\)). To figure out the change in the Blue's function, we can write:
\begin{eqnarray}
G_{pH}(B_{pH}(z)) &=& \frac{1}{p} G_{H}(B_{pH}(z)/p) = z \\
G_{H}(B_{pH}(z)/p) &=& pz \\
B_{pH}(z) &=& p B_H(pz)
\end{eqnarray}
And finally we can get the scaling of the \(R\)-transform:
\begin{eqnarray}
R_{pH}(z) &=& pB_H(pz) - \frac{1}{z} \\
&=& p\left( R_H(pz) + \frac{1}{pz} \right) - \frac{1}{z} \\
&=& p R_H(pz)
\end{eqnarray}
With this result, we know that if we multiply the GOE matrix by \(\sqrt{2}\), the \(R\)-transform goes from \(z/2\) to \(z\). This means that the spectral density of a sum of two GOE matrices is still semicircular, just with a \(\sqrt{2}\) scaling. Another way of saying this is that the semicircular law is stable under free addition.
</p>
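<p>This stability can be illustrated numerically, under the assumption (not proven here) that two independently sampled large GOE matrices behave as freely independent. The sketch below uses a hypothetical helper <code>sample_goe</code>, normalized so that the limiting density is \(\frac{1}{\pi}\sqrt{2 - x^2}\), and compares the second moment and spectral edge of a single matrix with those of a sum of two.</p>

```python
import numpy as np

def sample_goe(n, rng):
    # Symmetric Gaussian matrix with off-diagonal variance 1/(2n), so that the
    # limiting spectral density is (1/pi) * sqrt(2 - x^2) on [-sqrt(2), sqrt(2)].
    g = rng.normal(size=(n, n))
    return (g + g.T) / (2 * np.sqrt(n))

rng = np.random.default_rng(0)
n = 400
h1, h2 = sample_goe(n, rng), sample_goe(n, rng)

eig_single = np.linalg.eigvalsh(h1)
eig_sum = np.linalg.eigvalsh(h1 + h2)

# A semicircle of radius R has second moment R^2 / 4.
m2_single, edge_single = np.mean(eig_single**2), eig_single.max()
m2_sum, edge_sum = np.mean(eig_sum**2), eig_sum.max()
print(f"single: second moment {m2_single:.3f}, largest eigenvalue {edge_single:.3f}")
print(f"sum:    second moment {m2_sum:.3f}, largest eigenvalue {edge_sum:.3f}")
```

<p>For large \(n\), the single matrix should show a second moment near \(1/2\) and an edge near \(\sqrt{2}\), while the sum should show a second moment near \(1\) and an edge near \(2\), i.e. a semicircle stretched by \(\sqrt{2}\).</p>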
</details>
</li>
</ol>
</li>
</ol>
<p><br /></p>
<h1 id="5-calculations-in-resurrecting-the-sigmoid">5 Calculations in Resurrecting the Sigmoid.</h1>
<p><strong>Motivation</strong>: We are ready to actually perform the calculations from the paper using RMT and building off of the signal propagation concepts from section two.
Using this analysis, we will be able to predict under what conditions dynamical isometry is achievable. This principle is the one that guarantees that inputs and gradients neither vanish nor explode as they pass through the net.</p>
<p><strong>Objectives</strong>:</p>
<ul>
<li>Be able to use the \(S\)-transform to calculate \(\sigma_{JJ^T}^2\) and \(\lambda_\text{max}\) for Gaussian nets.</li>
<li>For Gaussian-initialized neural networks, explain why dynamical isometry is unattainable.</li>
<li>Be able to use the \(S\)-transform to calculate \(\sigma_{JJ^T}^2\) and \(\lambda_\text{max}\) for orthogonal nets.</li>
<li>Explain why orthogonally initialized neural networks can attain dynamical isometry when used with a sigmoidal activation function.</li>
<li>Understand how to choose initialization parameters of an orthogonal, sigmoidal net of a given depth to ensure dynamical isometry.</li>
</ul>
<p><strong>Topics</strong>:</p>
<ul>
<li>Jacobian spectra of neural networks with Gaussian- and orthogonal- initialized random weight matrices.</li>
<li>Decomposing neural network Jacobians via weight matrices and diagonal “nonlinearity” matrices.</li>
</ul>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1711.04735.pdf"><em>Resurrecting the Sigmoid</em>, sections 2.2 and 2.5</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<p>The full problem set, from which the below problems are taken, is <a href="/assets/sigmoid/problem-sets/pdfs/4.pdf">here</a>.</p>
<ol>
<li>
<p>Setting up the calculations.</p>
<p>In this problem set, we perform the main calculations from the <em>Resurrecting the Sigmoid</em> paper. The ultimate aim is to look for conditions under which we can achieve <em>dynamical isometry</em>, the condition that all of the singular values of the network’s Jacobian have magnitude \(1\). Thus, the problems in this set are all aimed at calculating the eigenvalue spectral density \(\rho_{JJ^T}(\lambda)\) of nets’ Jacobians for specific choices of nonlinearities and weight-matrix initializations. We accomplish this by using the rule we learned from free probability: \(S\)-transforms of freely-independent matrix ensembles multiply under matrix multiplication. Following this logic, we will calculate \(S\)-transforms for the matrices \(WW^T\) and \(D^2\), combine these results to arrive at \(S_{JJ^T}\), and from that calculate \(\rho_{JJ^T}(\lambda)\). In this problem set, as in the paper, we do not prove that the matrices are freely independent, but instead take that as an assumption.</p>
<p>Recall that our neural network is defined by the relations:</p>
\[\begin{aligned}
h^l &= W^l x^{l-1} + b^l \\
x^l &= \phi(h^l)
\end{aligned}\]
<p>where the input is denoted \(h^0\) and the output is given by \(x^L\).</p>
<ol>
<li>
<p>What is the Jacobian \(J\) of the input-output relation of this network?</p>
<details><summary>Hint</summary>
See eq. 2 of the paper.
</details>
<details><summary>Solution</summary>
<p>
Using the chain rule gives:
$$ J = \prod_{l=1}^L D^l W^l$$
where \(D^l\) is a matrix of pointwise derivatives of the nonlinearity \(\phi\) at layer \(l\):
\begin{equation}
(D^l)_{ij} = \frac{dx^l_j}{dh^l_i} = \phi ' (h^l_i)\delta_{ij}.
\end{equation}
</p>
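<p>To make the chain-rule product concrete, here is a minimal sketch that builds \(J\) layer by layer for a small random network and checks it against central finite differences. The \(\tanh\) nonlinearity, the width, depth, weight scale, and the treatment of the input as \(x^0\) are all assumptions made for this illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 5, 3
weights = [rng.normal(scale=1 / np.sqrt(width), size=(width, width)) for _ in range(depth)]
biases = [rng.normal(scale=0.1, size=width) for _ in range(depth)]

def forward(x):
    # x^l = phi(W^l x^{l-1} + b^l) with phi = tanh.
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x

def jacobian_chain_rule(x):
    J = np.eye(width)
    for W, b in zip(weights, biases):
        h = W @ x + b
        D = np.diag(1 - np.tanh(h)**2)  # phi'(h) on the diagonal
        J = D @ W @ J                   # later layers multiply from the left
        x = np.tanh(h)
    return J

x0 = rng.normal(size=width)
J = jacobian_chain_rule(x0)

# Central finite differences, one input coordinate per column.
eps = 1e-6
J_fd = np.column_stack([
    (forward(x0 + eps * e) - forward(x0 - eps * e)) / (2 * eps)
    for e in np.eye(width)
])
max_diff = np.max(np.abs(J - J_fd))
print(max_diff)
```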
</details>
</li>
<li>
<p>As the paper discusses, we are interested in the spectrum of singular values of \(J\), but all of the tools we have developed so far deal with the eigenvalue spectrum.</p>
<p>In terms of the singular values of \(J\), what are the eigenvalues of \(JJ^T\)?</p>
<details><summary>Solution</summary>
<p>
The singular values of a matrix \(A\) are the square roots of the eigenvalues of \(AA^T\), so the eigenvalues of \(JJ^T\) are the squared singular values of \(J\).
Quick proof: By SVD, a real matrix can be written as \(A=U \Sigma V^\dagger\), where \(V^\dagger\) is matrix \(V\)'s conjugate transpose, and so
\(AA^T = (U\Sigma V^\dagger)(U\Sigma V^\dagger)^\dagger = U \Sigma V^\dagger V \Sigma^T U^\dagger = U \Sigma \Sigma^T U^\dagger = U \Sigma^2 U^\dagger\) where \(\Sigma^2\) is composed of squared singular values. Note that \(\Sigma^2\) equals the diagonal matrix of eigenvalues in a spectral decomposition of \(AA^T\). Thus the squared singular values of \(A\) equal the eigenvalues of \(AA^T\). \(\square\)
</p>
</details>
<p>What is the dynamical isometry condition in terms of the eigenvalues of \(JJ^T\)?</p>
<details><summary>Solution</summary>
<p>
The definition of dynamical isometry, the condition we're after, is that the magnitude of the singular values of \(J\) should concentrate around 1. Since the eigenvalues of \(JJ^T\) are the squared singular values of \(J\), the dynamical isometry condition is that the spectrum of eigenvalues of \(JJ^T\) concentrates around unity.
</p>
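<p>The relation between singular values and eigenvalues is easy to confirm numerically. The following sketch, which uses an arbitrary random \(6 \times 6\) matrix as a stand-in for \(J\), compares the squared singular values with the eigenvalues of \(JJ^T\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(6, 6))  # arbitrary real matrix standing in for a Jacobian

singular_values = np.linalg.svd(J, compute_uv=False)  # descending order
eigenvalues = np.linalg.eigvalsh(J @ J.T)             # ascending order

squared_sorted = np.sort(singular_values**2)
print(squared_sorted)
print(eigenvalues)
```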
</details>
</li>
</ol>
<p><br /></p>
</li>
<li>
<p>Now that we’re focused on \(JJ^T\) instead of \(J\), read the following section reproduced from the main paper, about the \(S\)-transform of \(JJ^T\)’s spectral density:</p>
<p>\(S_{JJ^T} = \prod_{l=1}^L S_{W_lW_l^T} S_{D_l^2} = S_{WW^T}^L S_{D^2}^L\) where we have used the identical distribution of the weights to define \(S_{WW^T} = S_{W_l W_l^T}\) for all \(l\), and we have also used the fact the pre-activations are distributed independently of depth as \(h_l \sim \mathcal{N}(0,q^*)\), which implies that \(S_{D_l^2} = S_{D^2}\) for all \(l\). Eqn.(12) provides a method to compute the spectrum \(\rho_{JJ^T} (\lambda)\). Starting from \(\rho_{W^T W} (\lambda)\) and \(\rho_{D^2}\), we compute their respective \(S\)-transforms through the sequence of equations eqns. (7), (9), and (10), take the product in eqn. (12), and then reverse the sequence of steps to go from \(S_{JJ^T}\) to \(\rho_{JJ^T} (\lambda)\) through the inverses of eqns. (10), (9), and (8). Thus we must calculate the \(S\)-transforms of \(WW^T\) and \(D^2\), which we attack next for specific nonlinearities and weight ensembles in the following sections. In principle, this procedure can be carried out numerically for an arbitrary choice of nonlinearity, but we postpone this investigation to future work.</p>
<p>Prove the equation at the top of the box.</p>
<details><summary>Hint</summary>
This is done in the first appendix of the paper. Note that you should assume free independence of the \(D\)'s and \(W\)'s.
</details>
<p>The upshot of this problem is that we need to calculate the quantities \(S_{WW^T}\) and \(S_{D^2}\) for whatever nonlinearities and weight initialization schemes we’re interested in.</p>
<details><summary>Solution</summary>
<p>
\(JJ^T =\left( \prod_{l=1}^L D_l W_l\right) \left(\prod_{l=1}^L D_l W_l\right)^T = \left(D_L W_L \ldots D_1 W_1\right) \left(D_L W_L \ldots D_1 W_1\right)^T\) by expanding the product. So
$$ S_{JJ^T} = S_{\left( D_L W_L \ldots D_1 W_1\right) \left( D_L W_L \ldots D_1 W_1\right)^T} $$
Since the \(S\)-transform is defined in terms of moments of the eigenvalue distribution, it is invariant to cyclic permutations (since the trace, which defines moments, is invariant to cyclic permutations). So, we can re-order matrices in the product, yielding:
$$ S_{JJ^T} = S_{(W_L^T D_L^T D_L W_L) (D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T} $$
Then, assuming free independence, the \(S\)-transforms multiply:
$$ S_{JJ^T} = S_{(W_L^T D_L^T D_L W_L)} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$
Again using invariance to cyclic permutations:
$$ S_{JJ^T} = S_{(D_L^T D_L W_L W_L^T)} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$
And again assuming free independence:
$$ S_{JJ^T} = S_{D_L^T D_L} S_{W_L W_L^T} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$
Since \(D\) is diagonal,
$$ S_{JJ^T} = S_{D_L^2} S_{W_L W_L^T} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$
Continuing this procedure we get
$$ S_{JJ^T} = \prod_{l=1}^L S_{D_l^2} S_{W_l W_l^T} $$
Since the weight matrices \(W_l\) for each layer are identically distributed, their \(S\)-transforms are equal, so we can drop the subscript and write:
$$ S_{JJ^T} = S_{WW^T}^L \prod_{l=1}^L S_{D_l^2}$$
Finally, using the fact that the \(D_l\) matrices are identically distributed gives the desired expression:
$$ S_{JJ^T} = \left(S_{WW^T}\right)^L \left(S_{D^2}\right)^L$$
</p>
</details>
<p><br /></p>
</li>
<li>
<p>\(S_{D^2}\) for ReLU and hard-tanh networks</p>
<p>In this problem, we turn to networks with nonlinearities. We look at two nonlinearities here, the ReLU function and a piecewise approximation to the sigmoid known as the hard-tanh. These functions are defined as follows:</p>
<p>\(f_{\mathrm{ReLU}}(x) = \begin{cases}
0 & x\leq 0 \\
x & x\geq 0
\end{cases}\)
\(f_{\mathrm{HardTanh}}(x) = \begin{cases}
-1 & x\leq -1 \\
x & -1\leq x\leq 1 \\
1 & x\geq 1
\end{cases}\)</p>
<p>We want the spectral density, \(\rho_{JJ^T}(\lambda)\), of \(JJ^T\), where \(J\) is the Jacobian. We will find this by first calculating its \(S\)-transform, \(S_{JJ^T}\). As discussed in the introduction, this involves two separate steps: finding \(S_{D^2}\) and finding \(S_{WW^T}\). Note that finding \(S_{D^2}\)’s closed form relies primarily on choice of nonlinearity, and finding \(S_{WW^T}\)’s closed form relies only on choice of weight initialization (and not on choice of nonlinearity). In this problem, we focus on the nonlinearities (\(S_{D^2}\)); the next problems focus on the weight initializations (\(S_{WW^T}\)), and how to combine these to get the \(S\) transform of the Jacobian.</p>
<ol>
<li>
<p>The probability density function of the \(D\) matrix depends on the distributions of inputs to the nonlinearity. To calculate this, we will make a couple of simplifying assumptions. The first assumption is that we initialize the network at a critical point (defined in problem set 2).</p>
<p>If we are interested in finding conditions for achieving dynamical isometry, why is it a good assumption that the network is initialized at criticality?</p>
<details><summary>Solution</summary>
<p>
The criticality condition, \(\chi = 1\), implies that the <em>mean</em> squared singular value of \(J\), or equivalently the mean eigenvalue of \(JJ^T\), is unity. Dynamical isometry means that the <em>entire</em> spectrum of squared singular values of \(J\) is concentrated around unity. So criticality is a prerequisite for dynamical isometry.
</p>
</details>
</li>
<li>The second assumption we make in calculating the distribution of inputs to the nonlinearity is that we have settled to a stationary point of the length map (the variance map). Reread section 2.2 of <em>Resurrecting the Sigmoid</em>, and argue why this is also a good assumption.
<details><summary>Solution</summary>
<p>
As described in both <em>Exponential Expressivity in Deep Neural Networks Through Transient Chaos</em> and in section 2.2 of <em>Resurrecting the Sigmoid</em>, the empirical distribution of network pre-activations approximates a \(0\)-mean, \(q^l\)-variance Gaussian distribution in the large-width limit. The length map describing the evolution of \(q^l\) has a fixed point, to which the papers show empirically that \(q^l\) converges rapidly. Because of this rapid convergence, it is natural to assume that only a few initial layers are not characterized by this variance, and that we can neglect them in computing the spectrum of the network's Jacobian. Conveniently, assuming we are at a fixed point makes \(D^2\) independent of \(l\), greatly simplifying our analysis.
</p>
</details>
</li>
<li>
<p>To find the critical points of both the ReLU and hard-tanh networks, recall from problem set 2 that criticality was defined by the condition \(\chi = 1\), where \(\chi\) is defined in eqn. (5) of the main paper.
As in the paper, define \(p(q^*)\) as the probability, given the variance \(q^*\), that a given neuron in a layer is in its linear (i.e. not constant) regime. Show that \(\chi = \sigma_w^2 p(q^*)\).</p>
<details><summary>Hint</summary>
Plug the nonlinearity into the equation for \(\chi\) and reduce.
</details>
<details><summary>Solution</summary>
<p>
$$\chi = \sigma_w^2 \int \mathcal{D}h \, \phi'\left(\sqrt{q^*}h\right)^2$$
where \(\mathcal{D}h\) is the standard Gaussian measure. For both ReLU and hard-tanh, the slope \(\phi'\) only takes values in \(\{0,1\}\): it is \(1\) in the linear regime and \(0\) where the activation is constant. Hence \(\phi'^2 = \phi'\) is the indicator of the linear regime, and the Gaussian integral is exactly the probability that \(\phi' = 1\), which is \(p(q^*)\). So \(\chi = \sigma_w^2 p(q^*)\).
</p>
</details>
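<p>As a sanity check on \(\chi = \sigma_w^2 p(q^*)\), here is a minimal Monte Carlo sketch. The function names and the values of \(q^*\) and \(\sigma_w\) are illustrative assumptions, not part of the paper; the exact hard-tanh probability uses the error-function expression derived later in this problem.</p>

```python
import math
import numpy as np

def relu_slope(x):
    # ReLU slope: 1 where x is positive, 0 otherwise.
    return (x > 0).astype(float)

def hard_tanh_slope(x):
    # Hard-tanh slope: 1 strictly inside (-1, 1), 0 in the saturated regions.
    return (1.0 > np.abs(x)).astype(float)

def chi_monte_carlo(slope, q_star, sigma_w, n_samples=200_000, seed=0):
    # chi = sigma_w^2 * E_h[ phi'(sqrt(q*) h)^2 ] with h ~ N(0, 1).
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n_samples)
    return sigma_w**2 * np.mean(slope(np.sqrt(q_star) * h) ** 2)

q_star, sigma_w = 0.7, 1.3  # arbitrary illustrative values (assumptions)
chi_relu = chi_monte_carlo(relu_slope, q_star, sigma_w)
chi_hard_tanh = chi_monte_carlo(hard_tanh_slope, q_star, sigma_w)

# Exact linear-regime probabilities for comparison.
p_relu = 0.5
p_hard_tanh = math.erf(1.0 / math.sqrt(2.0 * q_star))
print(chi_relu, sigma_w**2 * p_relu)
print(chi_hard_tanh, sigma_w**2 * p_hard_tanh)
```

<p>Each printed pair should agree up to Monte Carlo error, which shrinks like the inverse square root of the number of samples.</p>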
</li>
<li>In terms of \(p(q^*)\), what is the spectral density \(\rho_{D^2}(z)\) (for both ReLU and hard-tanh networks) of the eigenvalues of \(D^2\)?
<details><summary>Solution</summary>
<p>
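The eigenvalues of the diagonal matrix \(D^2\) are the squared slopes of the nonlinearity at each neuron, each of which is \(1\) with probability \(p(q^*)\) (linear regime) and \(0\) otherwise. The spectral density is therefore a Bernoulli distribution with parameter equal to the probability of being in the linear regime, written with Dirac deltas as $$\rho_{D^2}(z) = (1-p(q^*))\,\delta(z) + p(q^*)\,\delta(z-1).$$ The Dirac deltas let us express a discrete distribution (in this case with two values, \(0\) and \(1\)) as a density; they appear because both ReLU and hard-tanh are piecewise linear with slopes \(0\) or \(1\).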
</p>
</details>
</li>
<li>Following equations 7-10 in the main paper, derive the Stieltjes transform \(G_{D^2}(z)\), the moment-generating function \(M_{D^2}(z)\), and the \(S\)-transform \(S_{D^2}(z)\) in terms of \(p(q^*)\).
Note: This should be the same for both ReLU and hard-tanh networks.
<details><summary>Solution</summary>
<p>
Recall that:
$$ \rho_{D^2} (z) = (1-p(q^*)) \delta (z) + p(q^*) \delta(z-1)$$
Then recall the definition of the Stieltjes transform. Integrating against the two delta functions gives
$$ G_{D^2} (z) = \int_\mathbb{R} \frac{\rho_{D^2} (t) \, dt}{z-t} = \frac{1-p(q^*)}{z} + \frac{p(q^*)}{z-1}$$
Then
$$ \begin{eqnarray}
M_{D^2}(z) &=& z G_{D^2}(z) - 1 \\
&=& z \left(\frac{1-p(q^*)}{z} + \frac{p(q^*)}{z-1}\right) - 1 \\
&=& -p(q^*) + \frac{z p(q^*)}{z-1} \\
&=& p(q^*) \left(\frac{z}{z-1} - 1\right) \\
&=& \frac{p(q^*)}{z-1} \\
\end{eqnarray} $$
Next use the definition
\begin{equation*}
S_{D ^2} (z) = \frac{1+z}{z M_{D^2}^{-1} (z)}.
\end{equation*}
The inverse \(M_{D^2}^{-1}(z)\) is \(\frac{p(q^*)}{z} + 1\). Thus:
$$ S_{D^2}(z) = \frac{1+z}{z \left(\frac{p(q^*)}{z} + 1\right)} = \frac{z+1}{z+ p(q^*)}$$
</p>
</details>
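<p>The chain of transforms above can be checked numerically. The sketch below simply encodes the formulas from the solution as plain Python functions (the function names are our own) and verifies that they are mutually consistent at a few arbitrary test points.</p>

```python
import math

def G(z, p):
    # Stieltjes transform of rho(t) = (1 - p) delta(t) + p delta(t - 1).
    return (1 - p) / z + p / (z - 1)

def M(z, p):
    # Moment-generating function M(z) = z G(z) - 1.
    return z * G(z, p) - 1

def M_inv(m, p):
    # Functional inverse of M, obtained by solving m = p / (z - 1) for z.
    return p / m + 1

def S(z, p):
    # S-transform S(z) = (1 + z) / (z M^{-1}(z)).
    return (1 + z) / (z * M_inv(z, p))

p = 0.5  # e.g. the ReLU value of p(q*)
for z in (2.0, 3.5, -4.0):
    assert math.isclose(M(z, p), p / (z - 1))
    assert math.isclose(M_inv(M(z, p), p), z)
    assert math.isclose(S(z, p), (z + 1) / (z + p))
print("transforms are consistent")
```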
</li>
<li>Now that we’ve calculated the transforms we wanted in terms of \(p(q^*)\), let us see what the critical point (which determines \(q^*\) and \(p(q^*)\)) looks like for our two nonlinearity options. For ReLU networks, what is \(p(q^*)\)? Show that this implies that the only critical point for ReLU networks is \((\sigma_w, \sigma_b) = (\sqrt{2},0).\)
<details><summary>Solution</summary>
<p>
For ReLUs, the nonlinearity is half in the positive linear regime and half at \(0\). Assuming \(0\)-mean symmetric activation distributions, the probability of being in the linear regime is \(p(q^*) = \frac{1}{2}\).
Setting \(\chi = 1\) in the above result \( \chi = \sigma_w^2 p(q^*) \) immediately tells us that \( \sigma_w^2 = 2 \).
Using equation (4) in the <em>Resurrecting the Sigmoid</em> paper,
$$q^* = \sigma_w^2 \int \mathcal{D} h ~\phi(\sqrt{q^*}h)^2 + \sigma_b^2$$
and using the fact that \(\phi\) is a ReLU, we can write
$$q^* = q^* \sigma_w^2 \int_{h>0} \mathcal{D} h~ h^2 + \sigma_b^2.$$
Since the integrand is an even function, it can be evaluated easily
$$q^* = \frac{1}{2} q^* \sigma_w^2 \int \mathcal{D} h~ h^2 + \sigma_b^2.$$
The integral now is the variance of \(h\), which is unity by construction, so we simply get
$$q^* = \frac{1}{2} q^* \sigma_w^2 + \sigma_b^2.$$
Plugging in \(\sigma_w^2=2\) gives \(q^* = q^* + \sigma_b^2\), meaning \(\sigma_b^2 = 0\).
</p>
</details>
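<p>The ReLU length map derived above, \(q^{l+1} = \frac{1}{2}\sigma_w^2 q^l + \sigma_b^2\), is simple enough to iterate directly. The following sketch (function names and the starting value are illustrative assumptions) shows that at \(\sigma_w^2 = 2\) the variance is preserved only when \(\sigma_b = 0\); with any nonzero bias variance it grows by \(\sigma_b^2\) per layer and never settles.</p>

```python
import math

def relu_length_map(q, sigma_w, sigma_b):
    # q_{l+1} = (1/2) sigma_w^2 q_l + sigma_b^2 for a ReLU network.
    return 0.5 * sigma_w**2 * q + sigma_b**2

def iterate(q0, sigma_w, sigma_b, depth):
    # Apply the length map `depth` times starting from variance q0.
    q = q0
    for _ in range(depth):
        q = relu_length_map(q, sigma_w, sigma_b)
    return q

sqrt2 = math.sqrt(2.0)
q_critical = iterate(1.0, sqrt2, 0.0, depth=50)   # stays at the initial value
q_with_bias = iterate(1.0, sqrt2, 0.1, depth=50)  # grows linearly with depth
print(q_critical, q_with_bias)
```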
</li>
<li>For hard-tanh networks, the behavior is a bit more complex, but we can calculate it numerically. As we saw in problem set 2, for the smooth tanh network there is a 1D curve in the \((\sigma_w, \sigma_b)\) plane which satisfies criticality. The same is true for the hard tanh network, as we’ll now see. We are interested in three quantities, all of which are functions of \(\sigma_w\) and \(\sigma_b\): \(q^*\), \(p(q^*)\), and \(\chi\). We’ve already seen (in part (c) above) that if we know \(\sigma_w\) and \(p(q^*)\), we can easily determine \(\chi\). It turns out that there is also a simple relation between \(q^*\) and \(p(q^*)\). Show that for the hard tanh network, \(p(q^*) = \mathrm{erf}(1/\sqrt{2q^*})\).
<details><summary>Solution</summary>
<p>
For hard-tanh, \(p(q^*)\) is the probability that a normally distributed pre-activation falls in hard-tanh's linear regime (recall this is between \(-1\) and \(1\)). With pre-activations distributed as a zero-mean Gaussian with variance \(q^*\), $$p(q^*) = \int_{-1}^{1} \frac{1}{\sqrt{2\pi q^*}} e^{-x^2/(2q^*)} \, dx.$$ Substituting \(t = x/\sqrt{2q^*}\) and using the symmetry of the integrand gives $$p(q^*) = \frac{2}{\sqrt{\pi}} \int_0^{1/\sqrt{2q^*}} e^{-t^2} \, dt = \mathrm{erf}\left(\frac{1}{\sqrt{2q^*}}\right),$$ where the last step is the standard definition \(\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt\).
</p>
</details>
<p>Now all that’s left is to determine \(q^*\) as a function of \(\sigma_w\) and \(\sigma_b\), and then we can get both \(q^*\) and \(p(q^*)\). Remember that in problem set 2, you derived the relation</p>
<p>\(q^* = \sigma_w^2 \int~ \mathcal{D}h~ \phi(\sqrt{q^*}h)^2 + \sigma_b^2\)
Use this relation to get an implicit expression for \(q^*\) in terms of \(\sigma_w\) and \(\sigma_b\).</p>
<details><summary>Solution</summary>
<p>
$$ q^* = \sigma_w^2 \int~ \mathcal{D}h~ \phi(\sqrt{q^*}h)^2 + \sigma_b^2 $$
The squared hard-tanh nonlinearity equals \(q^* h^2\) when \(|\sqrt{q^*} h| \leq 1\) (the linear regime), and equals unity otherwise. Splitting the integral accordingly and using \(\int \mathcal{D}h = 1\), we can write
$$ q^* = \sigma_w^2 \left[ 1 + \int_{-1/\sqrt{q^*}}^{+1/\sqrt{q^*}} \frac{q^*h^2 - 1}{\sqrt{2\pi}} e^{-h^2/2} \, dh \right] + \sigma_b^2 $$
</p>
</details>
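<p>The implicit equation for \(q^*\) can be solved numerically. The sketch below is one possible approach, not the paper's code: it evaluates the truncated Gaussian integral in closed form (using the standard identity \(\int_{-a}^{a} h^2 \, \mathcal{D}h = \mathrm{erf}(a/\sqrt{2}) - 2a\,e^{-a^2/2}/\sqrt{2\pi}\)) and then iterates the length map until it settles. The values of \(\sigma_w\) and \(\sigma_b\) are arbitrary, and plain fixed-point iteration is assumed to converge for them.</p>

```python
import math

def hard_tanh_second_moment(q):
    # E[phi(sqrt(q) h)^2] for h ~ N(0, 1) and phi = hard-tanh, in closed form.
    a = 1.0 / math.sqrt(q)                 # linear regime is h in [-a, a]
    p = math.erf(a / math.sqrt(2.0))       # probability of the linear regime
    pdf_a = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    inside = q * (p - 2.0 * a * pdf_a)     # q * E[h^2 restricted to [-a, a]]
    outside = 1.0 - p                      # saturated part contributes 1
    return inside + outside

def fixed_point(sigma_w, sigma_b, q0=1.0, n_iter=500):
    # Iterate q -> sigma_w^2 E[phi^2] + sigma_b^2 (the length map).
    q = q0
    for _ in range(n_iter):
        q = sigma_w**2 * hard_tanh_second_moment(q) + sigma_b**2
    return q

sigma_w, sigma_b = 1.2, 0.2  # illustrative values (assumptions)
q_star = fixed_point(sigma_w, sigma_b)
p_star = math.erf(1.0 / math.sqrt(2.0 * q_star))
chi = sigma_w**2 * p_star
print(q_star, p_star, chi)
```

<p>Scanning \(\sigma_w\) and \(\sigma_b\) over a grid and recording where \(\chi = 1\) would trace out the critical curve in the \((\sigma_w, \sigma_b)\) plane.</p>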
</li>
</ol>
<p><br /></p>
</li>
<li>
<p>Can Gaussian initialization achieve dynamical isometry?</p>
<p>In this problem, we will consider weights with a Gaussian initialization, and use the results from the previous problems to investigate whether dynamical isometry can be achieved for such nets over our two main activation functions of interest (ReLU and hard-tanh).</p>
<ol>
<li>
<p>As we’ve seen in the decomposition from the previous problems, the \(S\)-transform of \(\mathbf{J} \mathbf{J}^T\) depends on the \(S\)-transform of \(D^2\), which was computed above, and that of \( WW^T \), which is a <em>Wishart random matrix</em>, i.e. the product of a random Gaussian matrix with its own transpose.</p>
<p>Prove that \(S_{WW^T}(z) = \frac{1}{\sigma_w^2 \cdot (z + 1)}\), using the following connection between the moments of a Wishart matrix and the Catalan numbers:
\(m_k = \frac{\sigma_w^{2k}}{k + 1} {2k \choose k}\) where \(m_k\) is the \(k^\text{th}\) moment of \(WW^T\).</p>
<details><summary>Solution</summary>
<p>
Given the moments, we can easily form the moment-generating function
$$ M_{WW^T}(z) := \sum_{k = 1}^\infty \frac{m_k}{z^k} = \sum_{k = 1}^\infty \left( \frac{\sigma_w^2}{z} \right)^k \frac{1}{k + 1} {2k \choose k} = \sum_{k = 1}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_k $$
where \(C_k\) is the \(k^\text{th}\) Catalan number. So, we can now exploit the defining recurrence relation for the Catalan numbers, that \(C_k = \sum_{j = 0}^{k - 1} C_j C_{k - j - 1}\) (if you think of the \(k^\text{th}\) Catalan number as the number of ways to balance \(2k\) parentheses, this recurrence is pretty intuitive). To start off, this recurrence starts with the \(C_0\), though our MGF does not, and this might make the calculation more difficult; let's temporarily work with
$$ f(x) := \sum_{k = 0}^\infty x^k C_k = 1 + M_{WW^T}(z), \qquad x := \frac{\sigma_w^2}{z} $$
Next, the recurrence is in a sum of products of Catalan numbers; specifically, products whose indices have a constant sum. Seeing as \(f(x)\) is basically an infinitely long polynomial, and polynomial multiplication also involves such product sums, a good first attempt to apply this recurrence is to square our function. Indeed, we have:
$$ f(x)^2 = \sum_{k = 0}^\infty \sum_{j = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^{k + j} C_k C_j $$
which after collecting like terms, is
$$ f(x)^2 = \sum_{k = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^k \sum_{j = 0}^{k} C_j C_{k - j} = \sum_{k = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_{k + 1} $$
Thus,
$$ \frac{\sigma_w^2}{z} f(x)^2 = \frac{\sigma_w^2}{z} \left( M_{WW^T}(z) + 1 \right)^2 = \sum_{k = 1}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_k = M_{WW^T}(z) $$
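Writing \(x = \sigma_w^2/z\), this is the quadratic equation \(x \left( M_{WW^T} + 1 \right)^2 = M_{WW^T}\). Solving it, and choosing the root that vanishes as \(z \to \infty\), yields
$$ M_{WW^T}(z) = \frac{z}{2 \sigma_w^2} - 1 - \frac{z}{2 \sigma_w^2} \sqrt{1 - \frac{4 \sigma_w^2}{z}} $$
To invert the MGF, it is easiest to return to the quadratic relation and solve it for \(z\) in terms of \(M_{WW^T}\), which gives \(z = \sigma_w^2 (M_{WW^T} + 1)^2 / M_{WW^T}\). Relabeling the argument, we are left with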
$$ M_{WW^T}^{-1}(z) = \sigma_w^2 \frac{(z + 1)^2}{z} $$
$$ S_{WW^T}(z) = \left( \sigma_w^2 \cdot (z + 1) \right)^{-1} $$
</p>
</details>
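<p>The Catalan-number argument can be checked numerically without any random matrices. The sketch below (our own helper names; \(\sigma_w\) and the evaluation points are arbitrary assumptions) sums the moment series directly, confirms the quadratic relation \(\frac{\sigma_w^2}{z}(1 + M)^2 = M\), and confirms that the stated inverse and \(S\)-transform are consistent with it. The series only converges for \(z > 4\sigma_w^2\), so the evaluation point is chosen accordingly.</p>

```python
import math

def catalan(k):
    # k-th Catalan number, C_k = binom(2k, k) / (k + 1).
    return math.comb(2 * k, k) // (k + 1)

def M_series(z, sigma_w, n_terms=80):
    # Truncated M(z) = sum over k from 1 of m_k / z^k, m_k = sigma_w^{2k} C_k.
    x = sigma_w**2 / z
    return sum(catalan(k) * x**k for k in range(1, n_terms + 1))

sigma_w, z = 1.2, 20.0
x = sigma_w**2 / z
m = M_series(z, sigma_w)

# Quadratic relation from the Catalan recurrence: x (1 + M)^2 = M.
lhs = x * (1.0 + m) ** 2
# Functional inverse M^{-1}(m) = sigma_w^2 (m + 1)^2 / m should return z.
z_back = sigma_w**2 * (m + 1.0) ** 2 / m
# S-transform from its definition versus the closed form, at a test point s.
s = 0.3
S_direct = (1.0 + s) / (s * (sigma_w**2 * (s + 1.0) ** 2 / s))
S_closed = 1.0 / (sigma_w**2 * (s + 1.0))
print(lhs, m, z_back, S_direct, S_closed)
```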
</li>
<li>
<p>We now have enough pieces to begin attacking the calculation of the Jacobian singular value distribution - recall that due to the decomposition</p>
\[S_{JJ^T} = (S_{WW^T})^L \cdot (S_{D^2})^L\]
<p>once we’ve calculated the \(S\)-transforms for \(D^2\) and \(WW^T\), we can easily obtain the \(S\)-transform of \(\mathbf{J} \mathbf{J}^T\).</p>
<p>Using your solution to the previous part and the calculation of \(S_{D^2}\) from the earlier problems, show that</p>
\[S_{JJ^T} = \sigma_w^{-2L} \cdot (z + p(q^*))^{-L} .\]
<details><summary>Solution</summary>
<p>
We calculated \(S_{D^2}\) in part (e) of problem 3, showing that
$$S_{D^2}(z) = \frac{z+1}{z+ p(q^*)}$$
And from the previous part we know that
$$S_{WW^T}(z) = \frac{1}{\sigma_w^2(z+1)},$$
so combining these gives
$$ S_{JJ^T} = (S_{WW^T})^L (S_{D^2})^L = \left( \sigma_w^{-2} (1 + z)^{-1} \right)^L \left( \frac{1 + z}{z + p(q^*)} \right)^L = \sigma_w^{-2L} (z + p(q^*))^{-L} $$
</p>
</details>
</li>
<li>
<p>From the \(S\)-transform, one route to getting information about the spectrum of \(JJ^T\) is to compute the spectral density \(\rho_{JJ^T}(\lambda)\). While that calculation is too involved, we can get the answer to the question of achieving dynamical isometry by a slightly more indirect route.</p>
<p>Use the \(S\)-transform you calculated above to calculate \(M_{JJ^T}^{-1}\) (the inverse of the moment-generating function for \(\mathbf{J} \mathbf{J}^T\)).</p>
<details><summary>Hint</summary>
To compute the inverse MGF, recall the definition of the \(S\)-transform given in the paper (section 2.3, eqn. 10).
</details>
<details><summary>Solution</summary>
<p>
The \(S\)-transform is defined so that \(S_{JJ^T} = \frac{1 + z}{z M^{-1}_{JJ^T}(z)}\), so
$$ M^{-1}_{JJ^T}(z) = \frac{1 + z}{z S_{JJ^T}(z)} = \frac{1 + z}{z} \left(z + p(q^*)\right)^L \sigma_w^{2L} $$
</p>
</details>
</li>
<li>
<p>We can now compute the variance of the \(JJ^T\) eigenvalue distribution, \(\sigma_{JJ^T}^2\). You should have calculated above that</p>
\[M_{JJ^T}^{-1}(z) = \frac{1 + z}{z} \cdot (z + p(q^*))^L \cdot \sigma_w^{2L}\]
<p>Using the definition that</p>
\[M_{JJ^T}(z) = \sum_{k = 1}^\infty \frac{m_k}{z^k}\]
<p>and the expression for the functional inverse of \(M_{JJ^T}\), show that the first two moments are</p>
<p>\(m_1 = \sigma_w^{2L} p(q^*)^L\)
\(m_2 = m_1^2 \cdot \frac{L + p(q^*)}{p(q^*)}\)</p>
<details><summary>Hint</summary>
Use the <a href="https://en.wikipedia.org/wiki/Lagrange_inversion_theorem">Lagrange inversion theorem</a> (eqn. 18 in the paper) to obtain a power series for the inverse MGF and equate corresponding coefficients with our calculated expressions.
</details>
<details><summary>Solution</summary>
<p>
Note that we have a formula for \(M^{-1}(z)\) (suppressing the \(JJ^T\) subscript for clarity), but the moments are defined in terms of \(M(z)\). In the paper, the Lagrange inversion theorem is used to express the constant and \(1/z\) coefficients of \(M^{-1}(z)\) in terms of \(m_1\) and \(m_2\). Here is a slightly hand-wavy proof of that result (the rigorous proof turns out to be quite difficult):
<br />
Assume that the \(M^{-1}(z)\) can be written as a Taylor series with an additional \(1/z\) term (This assumption is one of the weaknesses of this proof). So
$$
M^{-1}(z) = \frac{a}{z} + b + cz + dz^2 + \cdots
$$
Since we know that
$$
M(z) = \frac{m_1}{z} + \frac{m_2}{z^2} + \cdots,
$$
we can write
$$
z = \frac{m_1}{M^{-1}(z)} + \frac{m_2}{M^{-1}(z)^2} + \cdots
$$
Plugging in our ansatz above gives
$$
z = \frac{m_1}{\left( \frac{a}{z} + b + cz + dz^2 + \cdots \right)} + \frac{m_2}{\left( \frac{a}{z} + b + cz + dz^2 + \cdots \right)^2} + \cdots
$$
We'll expand the RHS of the above equation assuming \(z\) to be small, and then equate coefficients of the RHS and LHS. Specifically, we will expand the RHS to second order in \(z\). Factoring \(a/z\) out of each denominator gives
$$
z = \frac{m_1 z}{a} \bigg( 1 + (b/a)z + (c/a)z^2 + \cdots \bigg)^{-1} + \frac{m_2 z^2}{a^2} \bigg( 1 + (b/a)z + (c/a)z^2 + \cdots \bigg)^{-2} + \cdots
$$
and expanding each factor in powers of \(z\) yields
$$
z = \frac{m_1}{a} z - \frac{m_1 b}{a^2} z^2 + \frac{m_2}{a^2} z^2 + O(z^3)
$$
Since the coefficient of \(z\) above has to be unity, and the coefficient of \(z^2\) has to be zero, this implies that \(a = m_1\) and \(b = m_2/m_1\). This implies that our sought-after expression for \(M^{-1}(z)\) is
$$M^{-1}(z) = \frac{m_1}{z} + \frac{m_2}{m_1} + \cdots$$
With this expression in hand, we can directly extract the constant and \(1/z\) coefficients of the function \(M^{-1}(z)\):
Given the result of our earlier calculation that
$$ M_{JJ^T}^{-1}(z) = \left(1+\frac{1}{z}\right) \cdot (z + p(q^*))^L \cdot \sigma_w^{2L}, $$
we see that the only place to get a \(1/z\) term here is from the constant term when the \((z+p(q^*))^L\) is expanded. This constant term will simply be \(p(q^*)^L\), so the \(1/z\) term here, which is \(m_1\), is
$$
m_1 = \sigma_w^{2L} p(q^*)^L
$$
The constant term comes from two places. First, the \(p(q^*)^L\) multiplies the \(1\) in the first factor; second, the term \(Lzp(q^*)^{L-1}\), coming from the binomial expansion, multiplies the \(1/z\) in the first factor.
So this means that the constant coefficient, \(m_2/m_1\), is given by
$$
\frac{m_2}{m_1} = \sigma_w^{2L} p(q^*)^L + \sigma_w^{2L} L p(q^*)^{L-1}.
$$
Recognizing the first term in the RHS sum as \(m_1\), we can factor to get
$$
\frac{m_2}{m_1} = m_1 \left( 1 + \frac{L}{p(q^*)} \right),
$$
or,
$$
m_2 = m_1^2 \left( 1 + \frac{L}{p(q^*)} \right),
$$
as desired.
</p>
</details>
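<p>The coefficient extraction above can be reproduced with polynomial arithmetic. Since \(z \, M_{JJ^T}^{-1}(z) = \sigma_w^{2L}(1+z)(z+p(q^*))^L\) is a polynomial, its constant coefficient is \(m_1\) and its linear coefficient is \(m_2/m_1\). The sketch below (the helper name is our own) uses NumPy's polynomial module and evaluates the result for a ReLU network at criticality, where the variance \(m_2 - m_1^2\) should come out as \(L/p(q^*) = 2L\).</p>

```python
import math
import numpy as np
from numpy.polynomial import polynomial as P

def jacobian_moments(sigma_w, p, L):
    # z * M^{-1}(z) = sigma_w^{2L} (1 + z) (z + p)^L is a polynomial in z.
    # Coefficients are stored lowest order first.
    poly = sigma_w ** (2 * L) * P.polymul([1.0, 1.0], P.polypow([p, 1.0], L))
    m1 = poly[0]        # coefficient of 1/z in M^{-1}(z)
    m2 = m1 * poly[1]   # constant term of M^{-1}(z) equals m2 / m1
    return m1, m2

# ReLU at criticality: sigma_w^2 = 2, p(q*) = 1/2, depth L = 10.
L = 10
m1, m2 = jacobian_moments(math.sqrt(2.0), 0.5, L)
variance = m2 - m1**2
print(m1, m2, variance)
```

<p>The variance grows linearly with depth, so for Gaussian weights the spectrum of \(JJ^T\) cannot stay concentrated around unity as \(L\) increases.</p>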
<p><br /></p>
</li>
</ol>
</li>
</ol>
<h1 id="6-experimental-results--future-work">6 Experimental Results & Future Work.</h1>
<p><strong>Motivation</strong>: Before wrapping up, we will programmatically validate the theoretical results we derived above. You can find starter code in an IPython notebook <a href="https://drive.google.com/open?id=1ocuk_mH4fJrFkDKhMT4ZdAqIMIgnFDet">here</a>.</p>
<p><strong>Objectives</strong>:</p>
<ul>
<li>Experimentally confirm the linear dependence on depth of the variance of the squared singular value spectrum of a neural net’s Jacobian under various random initializations.</li>
<li>Experimentally confirm the positive impact of dynamical isometry at initialization on the trainability of a neural net.</li>
</ul>
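<p>As a rough illustration of the first objective, here is a minimal NumPy sketch that is independent of the linked starter notebook. It builds the input-output Jacobian of a bias-free ReLU network with Gaussian weights at \(\sigma_w^2 = 2\) and compares the empirical ratio \(m_2/m_1^2\) of the squared singular values with the large-width prediction \(1 + L/p(q^*)\) derived earlier. The width, depths, and function name are arbitrary assumptions, and finite-width fluctuations are expected.</p>

```python
import numpy as np

def relu_jacobian(width, depth, sigma_w, rng):
    # Input-output Jacobian J = D_L W_L ... D_1 W_1 of a bias-free ReLU
    # network with i.i.d. Gaussian weights of variance sigma_w^2 / width.
    h = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        pre = W @ h
        D = (pre > 0).astype(float)   # diagonal of D_l: ReLU slopes
        J = (D[:, None] * W) @ J      # left-multiply by D_l W_l
        h = np.maximum(pre, 0.0)
    return J

rng = np.random.default_rng(0)
width, sigma_w = 400, np.sqrt(2.0)
for depth in (2, 4, 8):
    J = relu_jacobian(width, depth, sigma_w, rng)
    sq_sv = np.linalg.svd(J, compute_uv=False) ** 2   # eigenvalues of J J^T
    m1, m2 = sq_sv.mean(), (sq_sv**2).mean()
    print(depth, m1, m2 / m1**2, 1 + depth / 0.5)
```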
<p><strong>Follow-up reading</strong>:
Here are a few papers you might enjoy, which build upon the results of this paper:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1806.05393.pdf">Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks</a> by Xiao et al.</li>
<li><a href="https://arxiv.org/pdf/1802.09979.pdf">The Emergence of Spectral Universality in Deep Networks</a>.</li>
<li><a href="https://www.annualreviews.org/doi/full/10.1146/annurev-conmatphys-031119-050745">Statistical Mechanics of Deep
Learning</a>.
<br /></li>
</ol>
<p>Posted by piyush.</p>
<p>Stein Variational Gradient Descent, published 2020-03-02 (https://www.depthfirstlearning.com/2020/SVGD)</p>
<p>[Editor’s Note: This class was a part of the 2019 DFL Jane Street Fellowship.]</p>
<p>This guide exists thanks to many different people, all of whom took the time to give feedback, write reviews, and provide their own insights to the curriculum.</p>
<p>Special thanks to Cinjon Resnick, who was incredibly helpful throughout the iterations of the class, curriculum, and final notes. A special thanks as well to Professor Qiang Liu, who took the time to help shape the curriculum.</p>
<p>Thank you to Calvin Woo, Sanyam Kapoor, Thomas Pinder, Swapneel Mehta, and Avital Oliver for useful contributions to this guide, as well as countless insights during our discussions.</p>
<p>A special thanks to the many outside guests who offered to provide their time, including Dilin Wang, Tongzheng Ren, and Haoran Tang.</p>
<p>Finally, thank you to all my fellow students who attended the recitations and provided valuable feedback.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/svgd-deps.svg" width="200"></iframe>
<div>Concepts used in SVGD. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>Stein’s Method is a powerful statistical method, one that is at the disposal (and the focus) of many statisticians today. Recently, Stein’s Method has made its way into machine learning and has already proved to be a fruitful research area. Stein’s Method has deep connections to many machine learning problems of interest, and by the end of this guide, you should be able to understand the relevant mathematics behind this powerful tool.</p>
<p><br /></p>
<h1 id="1-basics-behind-kernelized-stein-discrepancy">1 Basics Behind Kernelized Stein Discrepancy</h1>
<p><strong>Motivation</strong>: Before jumping into all the math and methodology, we have to be able to understand the basics of what’s going on. Most importantly, we will review the basics of measure theory and reproducing kernel Hilbert spaces. Measure theory allows us to understand the notion of discrepancy measures between distributions, which we will use later on to quantify the difference between two arbitrary distributions of interest. Our other topic, Reproducing Kernel Hilbert Spaces (RKHS), will serve as the connection between measure theory and a practical machine learning algorithm. With RKHS, we will be able to define and optimize intractable measures which previously were only useful for theoretical analysis or for a restrictive class of functions. These two together set the foundation for defining a tractable Kernelized Stein Discrepancy, which serves as the driving factor behind Stein Variational Gradient Descent.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Measure Theory</li>
<li>Kernels</li>
<li>Reproducing Kernel Hilbert Space</li>
<li>Machine Learning Basics</li>
</ol>
<p><strong>Notes</strong>: In this class, we went over the basic mathematical concepts we will need throughout the rest of the curriculum. See here for the notes in <a href="https://colab.research.google.com/drive/1x3bgKtYWaYRTV1VGaf0bKRyQ_qxNZpjh">Colab</a> or here for the <a href="/assets/svgd_notes/week01.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/lecture4_introToRKHS.pdf">Reproducing Kernel Hilbert Spaces Tutorial, Section 1 - 3</a>.</li>
<li><a href="https://www.win.tue.nl/~rvhassel/Onderwijs/Old-Onderwijs/2DE08-1011/ConTeXt-OWN-FA-201209-Bib/Literature/sigma-algebra/gc_06_measure_theory.pdf">A gentle introduction to Measure Theory (Chandalia)</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://mlss.tuebingen.mpg.de/2015/slides/gretton/part_1.pdf">Slides on RKHS from Arthur Gretton</a>.</li>
<li><a href="http://cs231n.github.io/python-numpy-tutorial/">CS231n’s Numpy and Python Tutorial</a>.</li>
<li><a href="http://cs229.stanford.edu/section/cs229-linalg.pdf">CS229’s Linear Algebra Refresher</a>.</li>
<li><a href="https://xavierbourretsicotte.github.io/Kernel_feature_map.html">Xavier Bourret Sicotte’s Blog on Kernels and Feature Maps</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>“However, Cauchy sequences are not the same as convergent sequences”, but a property of Cauchy sequences is that they are bounded. What’s the difference?
<details><summary>Solution</summary>
<p>
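A convergent sequence has a limit in the space. A Cauchy sequence only requires that its terms eventually become arbitrarily close to one another; in a space that is not complete, it need not converge. Every Cauchy sequence is, however, bounded. But what exactly does bounded mean? Here's a proof that Cauchy sequences are bounded, which might shed some light on the definition itself: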
<br />
<p>
<b>a.</b> There exists \(N\) such that \(|a_n - a_m| < 1 \quad \forall m, n \geq N\) (Property of Cauchy Sequence iterates getting closer)
<br />
<b>b.</b> \(\implies \forall n \geq N, |a_n - a_N| < 1\)
<br />
<b>c.</b> \(a_n \in (a_N - 1, a_N + 1) \; \forall n \geq N\), so the tail of the sequence (\(n \geq N\)) is bounded.
<br />
<b>d.</b> Since the set of terms with \(n < N\) is finite (since \(N\) is finite), it is also bounded.
</p>
Therefore the Cauchy sequence \(\{ a_n \}\) is bounded \(\square\)
</p>
</details>
</li>
<li>“The open interval (0, 1) is not complete whereas the closed interval [0, 1] is complete.” Why? Can we use this example to get an intuitive definition of complete?
<details><summary>Solution</summary>
<p>
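Intuitively, a space is complete if there are no "points missing" from it (inside or at the boundary). For the interval example, the sequence \(a_n = 1/n\) (for \(n \geq 2\)) lies in \((0, 1)\) and is Cauchy, but its limit \(0\) is not in \((0, 1)\), so \((0, 1)\) is not complete; \([0, 1]\) contains the limits of all its Cauchy sequences. Similarly, the set of rational numbers is not complete, because e.g. \(\sqrt{2}\) is "missing" from it, even though one can construct a Cauchy sequence of rational numbers that converges to it. More information can be found at <a href="https://en.wikipedia.org/wiki/Complete_metric_space">Wikipedia: Complete Metric Space.</a>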
</p>
</details>
</li>
<li>Explain the difference between a Banach and Hilbert Space. Is every Hilbert space a Banach space?
<details><summary>Solution</summary>
<p>
A Banach space is a complete normed vector space: each vector has a non-negative length, or norm, and every Cauchy sequence converges to a point of the space.
A Hilbert space is a Banach space whose norm is induced by an inner product, \(\|x\| = \sqrt{\langle x, x \rangle}\). So every Hilbert space is a Banach space, but not every Banach space is a Hilbert space.
</p>
</details>
</li>
<li>In Machine Learning, kernels can be thought of as a “dot product” (a kind of similarity score) in high-dimensional space. Why would this be useful? Given a feature map, do we always have a corresponding kernel? Given any kernel, can we always explicitly write out the elements of the corresponding feature map?
<details><summary>Solution</summary>
<p>
Kernels (and the corresponding kernel trick) allow us to compute similarities in high-dimensional space without explicitly writing out and computing the dot product.
Given a feature map into an inner product space, the inner product of the mapped points always defines a valid kernel. The caveat runs the other way: not every similarity function is a kernel, because a kernel must be symmetric and positive semi-definite.
Likewise, given a kernel, it may be the case that we cannot explicitly write out the full corresponding feature map, for instance because it is infinite-dimensional. A good example of this is the popular exponential kernel.
</p>
</details>
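<p>To make the kernel trick concrete, here is a small sketch using the homogeneous degree-2 polynomial kernel \(k(x, y) = (x \cdot y)^2\), chosen here purely for illustration because its feature map can be written down explicitly. The kernel evaluation needs only one dot product in the original space, while the explicit route builds all \(d^2\) pairwise products first.</p>

```python
import math
import numpy as np

def poly2_kernel(x, y):
    # Homogeneous degree-2 polynomial kernel k(x, y) = (x . y)^2.
    return float(np.dot(x, y)) ** 2

def poly2_features(x):
    # Explicit feature map: all pairwise products x_i * x_j (d^2 features).
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
k_implicit = poly2_kernel(x, y)   # O(d) work
k_explicit = float(np.dot(poly2_features(x), poly2_features(y)))   # O(d^2) work
print(k_implicit, k_explicit)
```

<p>Both numbers agree up to floating-point rounding, which is the whole point: the kernel gives the high-dimensional inner product without ever forming the features.</p>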
</li>
<li>Assume that we just need the log-likelihood in many machine learning tasks so that we can compute \(KL(q||p)\) , and iteratively fit our model \(p\) to the underlying, generating data distribution \(q\). Why is this already too large of an assumption (“We assume that we have the ability to calculate the log-likelihood under the model that we specify”)?
<details><summary>Solution</summary>
<p>
The dreaded normalization constant! Most models we see will give an unnormalized likelihood, and the normalization constant (which we will see in a few weeks, often denoted as \(Z\)) is intractable to compute. We need the normalization constant to turn an unnormalized density into a proper probability density function.
</p>
</details>
</li>
<li>What is the use of Monte-Carlo methods in machine learning?
<details><summary>Solution</summary>
<p>
They are a way to estimate quantities in the presence of complex, many-random-variable situations. They do so by repeatedly generating (via simulation) instances from which they estimate the quantities.
</p>
</details>
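<p>A minimal sketch of the idea, with an example of our own choosing: estimate \(\mathbf{E}[\cos(Z)]\) for a standard normal \(Z\) by averaging over simulated draws, and compare with the known exact value \(e^{-1/2}\). The helper name and sample size are illustrative assumptions.</p>

```python
import numpy as np

def monte_carlo_expectation(f, sampler, n_samples, rng):
    # Estimate E[f(X)] by averaging f over n_samples independent draws of X.
    # Also return the standard error of the estimate.
    samples = sampler(rng, n_samples)
    values = f(samples)
    return values.mean(), values.std(ddof=1) / np.sqrt(n_samples)

rng = np.random.default_rng(0)
estimate, std_error = monte_carlo_expectation(
    f=np.cos,
    sampler=lambda generator, n: generator.standard_normal(n),
    n_samples=100_000,
    rng=rng,
)
exact = np.exp(-0.5)   # E[cos(Z)] for Z ~ N(0, 1)
print(estimate, std_error, exact)
```

<p>The standard error shrinks like \(1/\sqrt{n}\) regardless of the dimension of the random variable, which is why these methods remain usable when exact integration is hopeless.</p>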
</li>
<li>Explain the reproducing property in your own words.
<details><summary>Solution</summary>
<p>
Sanyam Kapoor's answer from our class was: "Every feature map is a linear combination of the full Hilbert space weighted by the kernel evaluations."
</p>
</details>
</li>
</ol>
<p><br /></p>
<h1 id="2-steins-method">2 Stein’s Method</h1>
<p><strong>Motivation</strong>: Most of the theory we will see in this curriculum builds off the general theoretical framework of Stein’s Method, a tool to obtain bounds on distances between distributions. In Machine Learning (as we shall later see), distances between distributions can be used to quantify how well (or poorly) a model approximates a certain distribution of interest. We shall start from Stein’s Identity and Operator, while explaining their theoretical significance and working through some proofs to get an understanding of some terms (Stein’s Method, Stein’s Discrepancy) we’ll see in the coming weeks. Lastly, we will discuss why Stein’s Method has historically been a theoretical tool, and hint at how ideas from Week 1 (particularly RKHS) can be used in combination with Stein’s Method to build the tractable discrepancy measure at the center of Week 3’s discussion.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Stein’s Method</li>
<li>The Stein Operator</li>
<li>Stein Equation</li>
<li>Stein’s Identity</li>
</ol>
<p><strong>Notes</strong>: In this class, we discussed the theoretical concepts behind Stein’s method, and discussed different ways to interpret the core ideas. See here for the notes in <a href="https://colab.research.google.com/drive/1HqHSP9x01te7e33-zDR00vAsPX_M19h2">Colab</a> or here for the <a href="/assets/svgd_notes/week02.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1602.03253">Section 2 of Kernelized Stein Discrepancy</a>.</li>
<li><a href="https://en.wikipedia.org/wiki/Stein%27s_method">Stein’s Method on Wikipedia</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1404.1392">A Short History of Stein’s Method</a>.</li>
<li><a href="http://www.ims.nus.edu.sg/Programs/stein09/files/A%20Gentle%20Introduction%201.pdf">Gentle Introduction to Stein’s Method</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Prove Stein’s Identity for a standard Gaussian random variable \(Z\).
<details><summary>Solution</summary>
<p>
Recall that Stein's Identity tells us that for a unit-normal random variable \(Z\) (i.e \(Z \sim \mathcal{N}(0, 1)\)):
$$ \mathbf{E}f'(Z) = \mathbf{E}Zf(Z)$$
for all absolutely continuous functions \(f\) with \( \mathbf{E}|f'(Z)| < \infty \).
To start, we state, without proof, that the density function of the unit normal Gaussian:
$$ p(z) = \frac{1}{\sqrt{2\pi}}e^{\frac{-z^2}{2}} $$
satisfies \( p'(z) = -zp(z) \), so that \( p(z) = \int_z^\infty yp(y)dy \).
For standard normal \(Z\), we can break the left hand side of the original identity into two integrals:
$$\mathbf{E}f'(Z) = \int_0^\infty f'(z)p(z)dz + \int_{-\infty}^0 f'(z)p(z)dz $$
For each right-hand side integral, we substitute the integral expression for \(p(z)\) and then swap the order of integration using <a href="https://en.wikipedia.org/wiki/Fubini%27s_theorem">Fubini's Theorem</a>:
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty f'(z) \int_z^\infty yp(y)dydz $$
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty \int_z^\infty f'(z)yp(y)dydz $$
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty \int_0^y f'(z)yp(y)dzdy $$
Leading us to our final integral:
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty [f(y) - f(0)] yp(y)dy $$
The second integral evaluates, by the analogous argument (using \( p(z) = -\int_{-\infty}^z yp(y)dy \)), to \( \int_{-\infty}^0 [f(y) - f(0)] yp(y)dy \).
When we combine the two results and use \(\mathbf{E}Z = 0\), we get:
$$ \mathbf{E}f'(Z) = \mathbf{E}Z[f(Z) - f(0)] = \mathbf{E}Zf(Z)$$
which proves the forward direction.
</p>
</details>
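<p>Stein's Identity is easy to check numerically for specific test functions. The sketch below uses Gauss-Hermite quadrature from NumPy to evaluate Gaussian expectations; the choice of test functions (\(\sin\) and \(\tanh\)) and the number of quadrature nodes are our own assumptions. For \(f = \sin\), both sides equal \(e^{-1/2}\) exactly.</p>

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(g, n_nodes=60):
    # E[g(Z)] for Z ~ N(0, 1) via Gauss-Hermite quadrature
    # (probabilists' weight exp(-x^2 / 2), normalized by sqrt(2 pi)).
    nodes, weights = hermegauss(n_nodes)
    return np.sum(weights * g(nodes)) / np.sqrt(2.0 * np.pi)

test_functions = [
    (np.sin, np.cos),
    (np.tanh, lambda z: 1.0 / np.cosh(z) ** 2),
]
for f, f_prime in test_functions:
    lhs = gaussian_expectation(f_prime)               # E[f'(Z)]
    rhs = gaussian_expectation(lambda z: z * f(z))    # E[Z f(Z)]
    print(lhs, rhs)
```

<p>Repeating the experiment with samples from a non-normal distribution in place of the quadrature rule would generally make the two sides disagree, which is the observation that Stein discrepancies build on.</p>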
</li>
<li>Explain why Stein’s Identity is useful.
<details><summary>Solution</summary>
<p>
Stein's Identity also holds in the converse direction: if the identity holds for a random variable \(W\) (for all suitable \(f\)), we can conclude that \(W\) is normal. Moreover, if the two quantities in Stein's Identity are approximately equal, then we can conclude that \(W\) is approximately normal. Stein's Identity and Method are used to quantify this "approximately" term, which we briefly discuss below.
Probability metrics (between two random variables \(X\) and \(Y\)) take the general form of:
$$d(X, Y) = \sup_{h \in \mathcal{H}} | \mathbf{E}h(X) - \mathbf{E}h(Y) |$$
for some class of functions \( \mathcal{H} \). We normally want to bound the distances between the corresponding distribution functions \(P \) and \(Q \), but that choice is less important for this brief discussion.
When we choose different classes of functions, we can recover various distances that we often use (in machine learning) to compare probability distributions, such as the Kolmogorov or Wasserstein distance.
We get to the Stein Discrepancy by measuring the distance from \(W\) to our standard normal \(Z\) via:
$$ \mathbf{E}h(W) - \mathcal{N}h $$
where \(\mathcal{N}h = \mathbf{E}h(Z)\) for \(h \in \mathcal{H}\).
Stein's Identity tells us that the discrepancy can also be measured by:
$$ \mathbf{E}[f'(W) - Wf(W)]$$
which, when we evaluate at \(w\), gives us the Stein Equation:
$$ f'(w) - wf(w) = h(w) - \mathcal{N}h $$
Since we're trying to bound: \(\mathbf{E}h(W) - \mathcal{N}h\), we can now instead bound the LHS, which turns out to be a lot easier once we account for all of the boundary conditions.
</p>
</details>
<p><br /></p>
</li>
</ol>
<h1 id="3-kernelized-stein-discrepancy">3 Kernelized Stein Discrepancy</h1>
<p><strong>Motivation</strong>: The main theoretical meat comes from a single 2016 paper titled Kernelized Stein Discrepancy (KSD). KSD takes the powerful Stein’s Identity, and uses RKHS theory to define a tractable discrepancy between a ground truth distribution and samples from an arbitrary one. Most importantly, KSD defines a discrepancy function that does not involve calculating the normalizing constant, allowing it to be much more widely applicable in practical tasks. We will discuss the difference between likelihood-free and likelihood-based methods in machine learning, how this normalization constant proves to be problematic in machine learning, and how KSD allows us to sidestep this issue with a new, tractable discrepancy. KSD will serve as the launch pad for the algorithm at the focus of this curriculum, Stein Variational Gradient Descent.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>A Stein Discrepancy</li>
<li>Goodness of Fit</li>
<li>Tractable Optimization of the Stein Discrepancy</li>
</ol>
<p><strong>Notes</strong>: In this class, we worked through the Kernelized Stein Discrepancy paper, focusing on the optimization and use cases of such a method. See here for the notes in <a href="https://colab.research.google.com/drive/1V7zpm9U3TCjIM9DxeRWo6IEypDkZObrH">Colab</a> or here for the <a href="/assets/svgd_notes/week03.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="http://www.cs.dartmouth.edu/~qliu/PDF/ksd_short.pdf">A Short Introduction to Kernelized Stein Discrepancy</a>.</li>
<li><a href="https://www.cs.dartmouth.edu/~qliu/PDF/slides_ksd_icml2015.pdf">ICML 2015 Slides on KSD</a>.</li>
<li><a href="https://arxiv.org/abs/1602.03253">Kernelized Stein Discrepancy</a>.</li>
<li><a href="https://stats.stackexchange.com/questions/276497/maximum-mean-discrepancy-distance-distribution">What is Maximum Mean Discrepancy?</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<p>Although we focus on the work leading up to Stein Variational Gradient Descent, this week’s optional reading provides historical context on how Stein’s Method was introduced into the context of machine learning.</p>
<ol>
<li><a href="https://arxiv.org/abs/1506.03039">Measuring Sample Quality with Stein’s Method</a>.</li>
<li><a href="https://arxiv.org/abs/1602.02964">A Kernel Test of Goodness of Fit</a></li>
<li><a href="https://arxiv.org/abs/1703.01717">Measuring Sample Quality with Kernels</a></li>
</ol>
<p>The first reference, from Gorham and Mackey, introduced the notion of a Stein Discrepancy. Kernelized Stein Discrepancy, the paper of focus for this week, built upon that idea with kernels, enabling the use of kernel functions in the Stein Discrepancy. The latter two references are also works that independently developed kernel-based Stein Discrepancies.</p>
<p><strong>Questions</strong>:</p>
<ol>
<li>What determines the choice of kernel in KSD?</li>
</ol>
<details><summary>Solution</summary>
<p>
Since KSD requires an RKHS for optimization, the kernel must be positive definite. However, whenever given a positive definite kernel \(K\), we can always build an associated RKHS as follows.
If we take \(H\) as the Hilbert space of functions \(f: \mathcal{X} \rightarrow \mathbf{R}\) defined on some set \(\mathcal{X}\) with some inner product \( \langle \cdot, \cdot \rangle_H \) defined on \(H\), then we can define the evaluation functional \(e_x: H \rightarrow \mathbf{R}\) as \(f \rightarrow e_x(f) = f(x) \).
Using the above definitions, our space \( H\) is an RKHS iff the evaluation functionals are continuous. As we saw in the notes, we call the given kernel \(K\) a reproducing kernel if:
<br />
<b>1.</b> \(K(x, \cdot) \in H, \; \forall x \in \mathcal{X}\)
<br />
<b>2.</b> \(\langle f, K_x \rangle = f(x) \; \forall f \in H, \forall x \in \mathcal{X}\).
<br />
Thus, every reproducing kernel \( K\) induces a unique RKHS given the kernel is positive definite.
Excitingly, in the context of machine learning, positive definite kernels themselves can be defined in terms of inner products. Therefore, we can generate arbitrary kernels and RKHS with some feature map \( \Phi: \mathcal{X} \rightarrow \mathcal{F}\) where feature space \( \mathcal{F}\) is a Hilbert space with some inner product \( \langle \cdot, \cdot \rangle \).
</p>
</details>
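<p>To make the "no normalizing constant" point concrete, here is a minimal one-dimensional sketch of a U-statistic estimate of KSD with an RBF kernel. It only touches the target through its score function. The bandwidth \(h = 1\), the target \(p = \mathcal{N}(0, 1)\), the shifted sampler, and all function names are assumptions made for this illustration, not code from the notes.</p>

```python
import math
import random


def stein_kernel(x, y, score, h=1.0):
    """u_p(x, y) for the RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    d = x - y
    k = math.exp(-d * d / (2.0 * h * h))
    dk_dx = -d / (h * h) * k                        # derivative of k in x
    dk_dy = d / (h * h) * k                         # derivative of k in y
    d2k_dxdy = (1.0 / (h * h) - d * d / h**4) * k   # mixed second derivative
    return (score(x) * score(y) * k
            + score(x) * dk_dy
            + score(y) * dk_dx
            + d2k_dxdy)


def ksd_u_statistic(samples, score, h=1.0):
    """Average of u_p over all ordered pairs of distinct samples."""
    n = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += stein_kernel(samples[i], samples[j], score, h)
    return total / (n * (n - 1))


def score_p(x):
    # Score of p = N(0, 1). The normalizing constant of p never appears.
    return -x


rng = random.Random(0)
from_p = [rng.gauss(0.0, 1.0) for _ in range(300)]   # samples from the target
from_q = [rng.gauss(1.5, 1.0) for _ in range(300)]   # samples from a shifted q

ksd_same = ksd_u_statistic(from_p, score_p)
ksd_shifted = ksd_u_statistic(from_q, score_p)
print(f"samples from p: {ksd_same:.4f}")
print(f"samples from q: {ksd_shifted:.4f}")
```

<p>The estimate is close to zero when the samples come from \(p\) and clearly positive when they come from the shifted distribution. Because it is a U-statistic, it can be slightly negative in the first case.</p>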
<p><br /></p>
<h1 id="4-stein-variational-gradient-descent">4 Stein Variational Gradient Descent</h1>
<p><strong>Motivation</strong>: Stein Variational Gradient Descent (SVGD) is a popular, non-parametric Bayesian Inference algorithm that’s been applied to Variational Inference, Reinforcement Learning, GANs, and much more. This week, we study the algorithm in its entirety, building off of last week’s work on KSD, and seeing how viewing KSD from a KL-Divergence-minimization lens induces a powerful, practical algorithm. We discuss the benefits of SVGD over other similar approximators, and look at a practical implementation of the algorithm.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Stein Variational Gradient Descent</li>
<li>Implementing the Algorithm</li>
</ol>
<p><strong>Notes</strong>: In this class, we go over the core paper, Stein Variational Gradient Descent. At the end of the notes, we provide links to implementations in a variety of different languages. See here for the notes in <a href="https://colab.research.google.com/drive/0B2rVTvobCLlWNEY4SENKdG1OQ3c">Colab</a> or here for the <a href="/assets/svgd_notes/week04.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.dartmouth.edu/~qliu/PDF/steinslides16.pdf">SVGD Slides</a>.</li>
<li><a href="https://arxiv.org/abs/1608.04471">SVGD Paper</a>.</li>
<li><a href="https://www.sanyamkapoor.com/machine-learning/stein-gradient/">Sanyam Kapoor’s great notebook on Stein Gradients</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.dartmouth.edu/~qliu/PDF/svgd_aabi2016.pdf">SVGD: Theory and Applications</a>.</li>
<li><a href="https://arxiv.org/abs/1707.06626">Learning to Sample with Amortized SVGD</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Compare and contrast the method shown here and MCMC. What are some advantages MCMC still has over SVGD?
<details><summary>Solution</summary>
<p>
Below are some ideas we discussed in our class. <br />
<b>1.</b> SVGD requires a compact subspace \( \mathcal{X} \), and as noted <a href="http://proceedings.mlr.press/v97/chen19b/chen19b.pdf">here in Chen '19</a>, requires the number of particles to be fixed a priori.<br />
<b>2.</b> SVGD has a lot less theoretical understanding compared to MCMC (which is potentially due to the recency of the result). SVGD has had analysis done in the infinite-particle regime, but minimal work done in finite-particle scenarios (an example of such work can be found <a href="https://papers.nips.cc/paper/8101-stein-variational-gradient-descent-as-moment-matching.pdf">here</a>). A concern of theoretical analysis is the complexity of analyzing the interacting particle updates, so the works covered here either view it from a dynamical systems / differential equation perspective (which concerns the smooth transformation of density), or discuss the properties of the final particles, regardless of how they were algorithmically attained.<br />
<b>3.</b> SVGD still seems to collapse in high-dimensional spaces, leading to exciting new research in <a href="https://arxiv.org/abs/1902.03394">why this occurs</a> and <a href="https://arxiv.org/abs/1711.04425">ideas on how to get around it</a>.
</p>
</details>
</li>
<li>Prove that the discrepancy in Equation 3 of the Stein Variational Gradient Descent Paper only equals 0 when \(p\) and \(q\) are equal.
<details><summary>Solution</summary>
<p>
Recall the operator form of Stein's Identity:
$$ \mathbf{E}_p[\mathcal{A}_pf(x)] = 0$$
where the Stein operator is
$$\mathcal{A}_pf(x) = s_p(x)f(x) + \nabla_x f(x)$$
and the score function \( s_p(x) \) is just \( \nabla_x \log p(x) \).
If \( p \neq q \), we instead look at \( \mathbf{E}_q[\mathcal{A}_pf(x)] \) for some choice of function \( f \).
Since \( \mathbf{E}_q[\mathcal{A}_qf(x)] = 0 \) by Stein's Identity applied to \(q\), we can subtract it without changing the value:
$$\mathbf{E}_q[\mathcal{A}_pf(x)] = \mathbf{E}_q[\mathcal{A}_pf(x)] - \mathbf{E}_q[\mathcal{A}_qf(x)]$$
The \( \nabla_x f(x) \) terms cancel, and we are left with
$$\mathbf{E}_q[(s_p(x) - s_q(x))f(x)]$$
If \(p = q\), then \(s_p(x) = s_q(x) \; \forall x \in \mathcal{X} \) and this quantity vanishes for every \(f\). Otherwise the two score functions differ, and we can always find some function \(f\) for which the above quantity is nonzero.
</p>
</details>
</li>
<li>
<p>Implement SVGD in your favorite language (see the notes for links to different implementations). Then, let’s take a look at the role of the kernel in SVGD:</p>
<ul>
<li>
<p>Remove the repulsive kernel term and observe how particles collapse to modes.</p>
</li>
<li>
<p>Remove the kernel’s contribution in the first term.</p>
</li>
</ul>
<p>What happens?</p>
</li>
</ol>
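<p>For readers who want a starting point for the implementation exercise above, here is a minimal one-dimensional sketch of the SVGD update with an RBF kernel. The target \(\mathcal{N}(2, 1)\), the fixed bandwidth \(h = 1\), the step size, the particle count, and the <code>repulsive</code> switch are all assumptions made for this illustration. It is not one of the implementations linked in the notes.</p>

```python
import math
import random
import statistics


def svgd(particles, score, steps=500, step_size=0.2, h=1.0, repulsive=True):
    """Run SVGD updates x_i <- x_i + step_size * phi(x_i) on 1-D particles."""
    x = list(particles)
    n = len(x)
    for _ in range(steps):
        phi = []
        for xi in x:
            total = 0.0
            for xj in x:
                k = math.exp(-(xj - xi) ** 2 / (2.0 * h * h))
                total += k * score(xj)                 # kernel-weighted score
                if repulsive:
                    total += -(xj - xi) / (h * h) * k  # gradient of k in xj
            phi.append(total / n)
        x = [xi + step_size * p for xi, p in zip(x, phi)]
    return x


def score(x):
    # Score of the assumed target N(2, 1).
    return -(x - 2.0)


rng = random.Random(0)
init = [rng.gauss(-5.0, 1.0) for _ in range(20)]  # start far from the target

with_repulsion = svgd(init, score, repulsive=True)
without_repulsion = svgd(init, score, repulsive=False)

for name, xs in [("with repulsion", with_repulsion),
                 ("without repulsion", without_repulsion)]:
    print(f"{name:18s} mean={statistics.mean(xs):.3f} "
          f"spread={statistics.pstdev(xs):.3f}")
```

<p>With the repulsive term the particles spread out around the target mean. Without it they contract towards the mode, which is the first ablation the exercise asks about.</p>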
<p><br /></p>
<h1 id="5-svgd-as-gradient-flow">5. SVGD as Gradient Flow</h1>
<p><strong>Motivation</strong>: SVGD as Gradient Flow is one of the first papers that analyzes the dynamics and theoretical properties of SVGD. This paper covers an incredible amount of seemingly-disparate topics, connecting them in a succinct explanation. Due to the relative difficulty of the material, especially the necessary background, the attached notes are self-contained and should be read alongside the paper.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Large Sample Regime of SVGD</li>
<li>Continuous Time Analysis of SVGD</li>
<li>Optimal Transport, Wasserstein Distances, and Differential Geometry</li>
<li>SVGD as a Gradient Flow</li>
</ol>
<p><strong>Notes</strong>: In this class, we try to understand the geometric implications of SVGD. The notes are structured relatively differently - with the amount of background needed, relevant material is introduced in-line. As a result, the ideal way to understand this week requires reading the notes alongside the paper, using the background sections to understand the concepts and their connections within the paper. See here for the notes in <a href="/assets/svgd_notes/week05.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1704.07520">Stein Variational Gradient Descent as Gradient Flow</a>.</li>
</ol>
<p><br /></p>
<h1 id="6-stein-in-reinforcement-learning">6. Stein in Reinforcement Learning</h1>
<p><strong>Motivation</strong>: One of the most exciting use cases of SVGD is in reinforcement learning, due to its connection to maximum entropy reinforcement learning. This week, we study two key techniques in reinforcement learning that use SVGD as the underlying mechanism. In reinforcement learning, the target distribution is not known, so we derive gradient updates to our parameters using policy gradients. As we derive the gradient estimators in the maximum-entropy framework of reinforcement learning, we will start to see what benefits SVGD-based methods have. In particular, we will focus on the explore-exploit tradeoff, as well as normalization constants for intractable distributions, and see how SVGD helps us get around complicated problems regarding both.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Reinforcement Learning</li>
<li>Explore vs. Exploit</li>
<li>Maximum Entropy Reinforcement Learning</li>
</ol>
<p><strong>Notes</strong>: In this class, we look at the application area of reinforcement learning, and see how the diversity induced by SVGD (and its connection to maximum entropy reinforcement learning) generates strongly-exploring policies. See here for the notes in <a href="https://colab.research.google.com/drive/178X8BgGrUmPaRTLulL_ETUKaBf-MfrgS">Colab</a> or here for the <a href="/assets/svgd_notes/week06.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1704.02399">Stein Variational Policy Gradient</a>.</li>
<li><a href="https://arxiv.org/abs/1702.08165">Reinforcement Learning with Energy-Based Policies</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html">A Long Peek into Reinforcement Learning</a>.</li>
<li><a href="https://arxiv.org/abs/1707.06626">Learning to Draw Samples with Amortized SVGD - Same as W4</a>.</li>
<li><a href="https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/">Soft Q-Learning BAIR Blogpost</a>.</li>
<li><a href="https://arxiv.org/abs/1805.10309">Learning Self-Imitating Diverse Policies (an improved SVPG)</a>.</li>
<li><a href="https://arxiv.org/abs/1806.03836">Bayesian MAML</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What are some of the issues with using the RBF kernel when comparing RL policies? Is parameter space appropriate for comparing policies?
<details><summary>Solution</summary>
<p>
While it works in practice, the networks used for particles in the original SVPG paper were reasonably small. With larger numbers of parameters (e.g., those needed when working with image-based observations), parameter-based discrepancies start to make even less sense. This is one of two core ideas that drove the formulation of the Self-Imitating Diverse Policies paper, seen as Resource 4 in Optional Reading.
</p>
</details>
</li>
<li>
<p>In SVPG, the introduction of a prior (and priors in RL) is one active area of research. To incorporate priors in this framework, what “space” does the prior need to be over?</p>
<details><summary>Solution</summary>
<p>
SVPG incorporates a prior over \(q \), which is actually a prior over the distribution of particle parameters \(\theta\). Since this space is uninterpretable, the prior term is set to be a constant, generating an "improper" prior that, in most use cases, can get dropped out of the optimization. Even if you were to use an old set of particles as a prior, the term is basically unusable, because in order to estimate the density of \(q\), you'd need to fit high-dimensional (\( d = |\theta| \)) kernel-density estimators. In addition, the number of particles used is usually much smaller than the number of parameters each one has, making the density estimation an ill-posed problem.
</p>
</details>
</li>
<li>With the code implementation linked in the notes (or your own), ablate on the architecture of each SVPG particle. What types of behavioral differences do you see in the different policies as you increase or decrease the number of neurons? Try adding a second layer instead; for example, how does a 2-layer, 200 neuron-per-layer network compare to a single-layer, 400 neuron particle?</li>
</ol>
<p><br /></p>
<p>bhairav</p>
<p>Neural ODEs, 2019-09-23T10:00:00+00:00, https://www.depthfirstlearning.com/2019/NeuralODEs</p>
<p>This guide would not have been possible without the help and feedback from many people.</p>
<p>Special thanks to Prof. Joan Bruna and his class at NYU, <a href="https://github.com/joanbruna/MathsDL-spring19">Mathematics of Deep Learning</a>, and to Cinjon Resnick, who introduced me to DFL and helped complete this guide.</p>
<p>Thank you to Avital Oliver, Matt Johnson, Dougal MacClaurin, David Duvenaud, and Ricky Chen for useful contributions to this guide.</p>
<p>Thank you to Tinghao Li, Chandra Prakash Konkimalla, Manikanta Srikar Yellapragada, Shan-Conrad Wolf, Deshana Desai, Yi Tang, Zhonghui Hu for helping me prepare the notes.</p>
<p>Finally, thank you to all my fellow students who attended the recitations and provided valuable feedback.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/nodes-deps.svg" width="200"></iframe>
<div>Concepts used in Neural ODEs. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>Neural ODEs are neural network models which generalize standard layer-to-layer propagation to continuous depth models. Starting from the observation that the forward propagation in neural networks is equivalent to one step of discretisation of an ODE, we can construct and efficiently train models via ODEs. On top of providing a novel family of architectures, notably for invertible density models and continuous time series, neural ODEs also provide a memory efficiency gain in supervised learning tasks.</p>
<p>In this curriculum, we will go through all the background topics necessary to understand these models. At the end, you should be able to implement neural ODEs and apply them to different tasks.</p>
<p><br /></p>
<h1 id="common-resources">Common resources:</h1>
<ol>
<li>Süli & Mayers: <a href="https://www.cambridge.org/core/books/an-introduction-to-numerical-analysis/FD8BCAD7FE68002E2179DFF68B8B7237#">An Introduction to Numerical Analysis</a>.</li>
<li>Quarteroni et al.: <a href="https://www.springer.com/us/book/9783540346586?token=holiday18&utm_campaign=3_fjp8312_us_dsa_springer_holiday18&gclid=Cj0KCQiAvebhBRD5ARIsAIQUmnlViB7VsUn-2tABSAhIvYaJgSEqmJXD7F4A7EgyDQtY9v_GeUsNif8aArGAEALw_wcB">Numerical Mathematics</a>.</li>
</ol>
<h1 id="1-numerical-solution-of-odes---part-1">1 Numerical solution of ODEs - Part 1</h1>
<p><strong>Motivation</strong>: ODEs are used to mathematically model a number of natural processes and phenomena. The study of their numerical
simulations is one of the main topics in numerical analysis and of fundamental importance in applied sciences. To understand Neural ODEs, we need to first understand how ODEs are solved with numerical techniques.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Initial value problems.</li>
<li>One-step methods.</li>
<li>Consistency and convergence.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week1.pdf">class</a>, we touched upon one-step methods and their analysis. We also looked at some illustrative examples.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Sections 12.1-4 from Süli & Mayers.</li>
<li>Sections 11.1-3 from Quarteroni et al.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Runge-Kutta methods: Section 12.5 from Süli & Mayers.</li>
<li><a href="http://podcasts.ox.ac.uk/odes-and-nonlinear-dynamics-42">Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.2</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Exercise 1 in Section 11.12 of Quarteroni et al.
<details><summary>Solution</summary>
<p>
The truncation error can be split as
$$h\tau_{n+1} = y_{n+1} - y_n - h\Phi(t_n,y_n;h) = E_1 + E_2$$
where
$$E_1 = \int_{t_n}^{t_{n+1}} f(s, y(s))\,ds - \frac{h}{2}\left( f(t_n,y_n) + f(t_{n+1},y_{n+1}) \right)$$
and
$$E_2 = \frac{h}{2}\left( f(t_{n+1},y_{n+1}) - f(t_{n+1},y_n + hf(t_n,y_n)) \right)$$
We can bound \(E_2\) as
$$|E_2| = \frac{h}{2} \left| f(t_{n+1},y_{n+1}) - f(t_{n+1}, y_n + h f(t_n,y_n)) \right| \leq \frac{hL}{2}|y_{n+1}-y_{n} - hf(t_n,y_n)| = \frac{hL}{2}O(h^2) = O(h^3)$$
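where \(L\) is the Lipschitz constant of \(f\) and the \(O(h^2)\) term comes from the Taylor expansion of \(y\) around \(t_n\). On the other hand, \(E_1\) is the error of the trapezoidal quadrature rule, which is \(O(h^3)\); see this <a href="https://en.wikipedia.org/wiki/Trapezoidal_rule#Error_analysis">link</a> for a proof. It follows that \(h\tau_{n+1} = O(h^3)\), that is, \(\tau_{n+1} = O(h^2)\).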
</p>
</details>
</li>
<li>Exercises 12.3,12.4, 12.7 in Section 12 of Süli & Mayers.
<details><summary>Solution to Exercise 12.3</summary>
<p>
Notice that we can write
$$\left(y + \frac{q}{p}\right)'=p\left(y + \frac{q}{p}\right)$$
It follows that \(y(t) = Ce^{pt} - q/p\) for some constant \(C\). Imposing the initial condition \(y(0)=1\), we get \(C = 1 + q/p\), that is, \(y(t)=e^{pt} + \frac{q}{p}(e^{pt}-1)\). In particular, we expand \(y\) in its Taylor series:
$$y(t) = 1 + \left(1 + \frac{q}{p}\right)\sum_{k=1}^\infty \frac{(pt)^k}{k!}$$
To conclude the exercise we only need to notice that
$$y_n(t) = 1 + \left(1 + \frac{q}{p}\right)\sum_{k=1}^n \frac{(pt)^k}{k!}$$
satisfies Picard's iteration: \(y_0 \equiv 1\), \(y_{n+1}(t) = y_0 + \int_0^t (py_n(s) + q)\,ds\).
</p>
</details>
<details><summary>Solution to Exercise 12.4</summary>
<p>
Applying Euler's method with step-size \(h\), we get \(\hat{y}(0) = 0\), \(\hat{y}(h) = \hat{y}(0) + h \hat{y}(0)^{1/5} = 0\), \(\hat{y}(2h) = \hat{y}(h) + h \hat{y}(h)^{1/5} =0\). Iterating, we see that \(\hat{y}(nh) = 0\) for all \(n\geq 0\). On the other hand, the implicit Euler's method says that
$$\hat{y}_{n+1} = \hat{y}_n + h \hat{y}_{n+1}^{1/5}$$
for \(n \geq 0\) and \(\hat{y}_0 = 0\). After substituting \(\hat{y}_{n} = (C_nh)^{5/4}\) in the above relation, we only need to check that there exists a sequence \(C_n\) satisfying the requirements.
</p>
</details>
<details><summary>Solution to Exercise 12.7</summary>
<p>
First, notice that
$$e_{n+1} = y(x_{n+1}) - y_{n} - \frac{1}{2}h(f_{n+1} + f_n)= e_n - \frac{1}{2}h (f_{n+1}+f_n) + \int_{x_n}^{x_{n+1}} f(s,y(s))\,ds$$
and that the second component of the RHS is the same as \(E_1\) in Exercise 1 above. Therefore the first bound follows. The last inequality is simply obtained by re-arranging the terms.
</p>
</details>
</li>
<li>
<p>Consider the following method for solving \(y' = f(y)\):</p>
\[y_{n+1} = y_n + h(\theta f(y_n) + (1-\theta) f(y_{n+1}))\]
<p>Assuming sufficient smoothness of \(y\) and \(f\), for what value of \(0 \leq\theta\leq 1\) is the truncation error the smallest? What does this mean about the accuracy of the method?</p>
<details><summary>Solution</summary>
<p>
By definition, it holds that
$$h\tau_n = y_{n+1} - y_n - h (\theta f_n + (1-\theta) f_{n+1}) = y_{n+1} - y_n - h \theta y_n' - h(1-\theta) y_{n+1}'$$
Taylor-expanding, we get
$$h\tau_n = y_{n} + hy_n' + h^2/2y_n'' + O(h^3) - y_n - h \theta y_n' - h(1-\theta) y_{n}' - h^2(1-\theta) y_{n}'' + O(h^3) = h^2(\theta - 1/2)y_n''+O(h^3)$$
It follows that the truncation error is the smallest for \(\theta=1/2\). For \(\theta = 1/2\), the method has order \(2\), otherwise it has order \(1\).
</p>
</details>
</li>
<li><a href="https://colab.research.google.com/drive/1bNg-RzZoelB3w8AUQ6mefRQuN3AdrIqX">Colab notebook</a>.
<details><summary>Solution</summary>
<p>
See this <a href="https://colab.research.google.com/drive/1wTQXy2_4InQH51rEmiCtvl5Q7MiyrC4k">Colab</a> for the solution.
</p>
</details>
</li>
</ol>
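<p>The Picard iteration from Exercise 12.3 can be checked with exact rational arithmetic. The sketch below represents each iterate as a list of polynomial coefficients and compares it against the partial sums \(1 + (1 + q/p)\sum_{k=1}^n (pt)^k/k!\). The values \(p = 2\), \(q = 3\) and the helper names are assumptions chosen for the example.</p>

```python
from fractions import Fraction
from math import factorial


def picard_step(coeffs, p, q):
    """Map y_n to y_{n+1}(t) = 1 + integral_0^t (p * y_n(s) + q) ds.

    Polynomials are lists of coefficients: coeffs[k] multiplies t**k.
    """
    integrand = [p * c for c in coeffs]
    integrand[0] += q
    # Integrate term by term: c * s**k  ->  c / (k + 1) * t**(k + 1).
    integrated = [Fraction(0)] + [c / (k + 1) for k, c in enumerate(integrand)]
    integrated[0] += 1  # initial condition y(0) = 1
    return integrated


def partial_sum(n, p, q):
    """Coefficients of 1 + (1 + q/p) * sum_{k=1}^{n} (p t)^k / k!."""
    return [Fraction(1)] + [(1 + q / p) * p**k / factorial(k)
                            for k in range(1, n + 1)]


p, q = Fraction(2), Fraction(3)  # assumed example values with p != 0
iterate = [Fraction(1)]          # y_0(t) = 1
matches = []
for n in range(1, 6):
    iterate = picard_step(iterate, p, q)
    matches.append(iterate == partial_sum(n, p, q))

print(matches)
print([str(c) for c in iterate])
```

<p>Each iterate agrees with the corresponding partial sum of the Taylor series of the exact solution.</p>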
<p><br /></p>
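<p>The order claim for the \(\theta\)-method in the questions above can also be observed numerically. The sketch below applies the method to the linear test problem \(y' = \lambda y\), where the implicit step has a closed form, and estimates the convergence order by halving the step size. The test problem, \(\lambda = -1\), and the step sizes are assumptions chosen for this illustration.</p>

```python
import math


def theta_method_linear(lam, theta, h, t_end=1.0, y0=1.0):
    """Solve y' = lam * y with y_{n+1} = y_n + h (theta f(y_n) + (1 - theta) f(y_{n+1})).

    For a linear right-hand side the implicit equation is solved exactly:
    y_{n+1} = y_n * (1 + h theta lam) / (1 - h (1 - theta) lam).
    """
    steps = round(t_end / h)
    growth = (1.0 + h * theta * lam) / (1.0 - h * (1.0 - theta) * lam)
    return y0 * growth**steps


def observed_order(lam, theta, h):
    """Estimate the order from the errors at t = 1 with step h and h / 2."""
    exact = math.exp(lam)
    err_h = abs(theta_method_linear(lam, theta, h) - exact)
    err_half = abs(theta_method_linear(lam, theta, h / 2.0) - exact)
    return math.log2(err_h / err_half)


lam = -1.0
orders = {theta: observed_order(lam, theta, h=0.01) for theta in (0.0, 0.5, 1.0)}
for theta, order in orders.items():
    print(f"theta = {theta:.1f}: observed order ~ {order:.2f}")
```

<p>The observed order is close to 2 for \(\theta = 1/2\) and close to 1 for the two Euler variants \(\theta = 0\) and \(\theta = 1\).</p>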
<h1 id="2-numerical-solution-of-odes---part-2">2 Numerical solution of ODEs - Part 2</h1>
<p><strong>Motivation</strong>: In the previous class, we introduced some simple schemes to numerically solve ODEs. In order to understand which numerical scheme is most appropriate for a given problem, it is important to know and understand their different properties. For this reason, in this class, we go through some more involved schemes and analyze them with regard to convergence and stability.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Runge-Kutta methods.</li>
<li>Multi-step methods.</li>
<li>Systems of ODEs and absolute stability.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week2.pdf">class</a>, we went through different ways to construct multi-step methods and their convergence analysis. We then looked into absolute stability regions for different methods.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Runge-Kutta methods: Section 11.8 from Quarteroni et al. or Sections 12.{5,12} from Süli & Mayers.</li>
<li>Multi-step methods: Sections 11.5-6 from Quarteroni et al. or Sections 12.6-9 from Süli & Mayers.</li>
<li>Systems of ODEs: Sections 11.9-10 from Quarteroni et al. or Sections 12.10-11 from Süli & Mayers.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://podcasts.ox.ac.uk/odes-and-nonlinear-dynamics-41">Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.1</a>.</li>
<li>Predictor-corrector methods: Section 11.7 from Quarteroni et al.</li>
<li>Richardson extrapolation: Section 16.4 from <a href="http://numerical.recipes/">Numerical Recipes</a>.</li>
<li><a href="https://epubs.siam.org/doi/pdf/10.1137/0904010?">Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equations</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Exercises 12.11, 12.12, 12.19 in Section 12 of Süli & Mayers.
<details><summary>Solution to Exercise 12.11</summary>
<p>
By definition, the truncation error is given by
$$h\tau_n = y_{n+3} + \alpha y_{n+2} -\alpha y_{n+1} - y_n -h\beta y_{n+2}' - h\beta y_{n+1}'$$
Taylor-expanding, we have that
$$y_{n+3} = y_n + 3hy_n' + 9/2h^2 y_n'' + 9/2h^3 y_n''' + 27/8h^4 y_n^{(4)} + O(h^5)$$
$$y_{n+2} = y_n + 2hy_n' + 2h^2 y_n'' + 4/3h^3 y_n''' + 2/3h^4 y_n^{(4)} + O(h^5)$$
$$y_{n+1} = y_n + hy_n' + 1/2h^2 y_n'' + 1/6h^3 y_n''' + 1/24h^4y_n^{(4)} + O(h^5)$$
$$y_{n+2}' = y_n' + 2hy_n'' + 2h^2y_n''' + 4/3 h^3 y_{n}^{(4)} + O(h^4)$$
$$y_{n+1}' = y_n' + hy_n'' + 1/2h^2y_n''' + 1/6h^3 y_{n}^{(4)} + O(h^4)$$
Substituting these in the first equation and imposing the terms in \(h^i\), \(i = 0,1,2,3,4\), to be \(0\), we get the equations
$$3 + \alpha - 2\beta = 0$$
$$27 + 7\alpha - 15\beta = 0$$
$$27 + 5\alpha - 12\beta = 0$$
Solving for these, we find \(\alpha = 9\) and \(\beta = 6\). The resulting method reads
$$y_{n+3} + 9(y_{n+2} - y_{n+1}) - y_n = 6h(f_{n+2} + f_{n+1})$$
The characteristic polynomial is given by
$$\rho(z) = z^3 +9z^2 - 9z -1$$
One of the roots of this polynomial satisfies \(|z|>1\) and this implies that the method is not zero-stable.
</p>
</details>
<details><summary>Solution to Exercise 12.12</summary>
<p>
By definition, the truncation error is given by
$$h\tau_n = y_{n+1} + b y_{n-1} +a y_{n-2} -h y_{n}'$$
Taylor-expanding, we have that
$$y_{n+1} = y_n + hy_n' + 1/2h^2 y_n'' + O(h^3)$$
$$y_{n-1} = y_n - hy_n' + 1/2h^2 y_n'' + O(h^3)$$
$$y_{n-2} = y_n - 2hy_n' + 2h^2 y_n'' + O(h^3)$$
Substituting these in the first equation and solving for the terms in \(h^i\), \(i = 0,1\), to be \(0\), we get \(a=1\) and \(b=-2\). In particular
$$\tau_n = 3/2h y_n'' + O(h^2)$$
and thus the method has order of accuracy \(1\).
The resulting method reads
$$y_{n+1} -2 y_{n-1} + y_{n-2} = h f_{n}$$
The characteristic polynomial is given by
$$\rho(z) = z^3 -2z +1$$
One of the roots of this polynomial satisfies \(|z|>1\) and this implies that the method is not zero-stable.
</p>
</details>
<details><summary>Solution to Exercise 12.19</summary>
<p>
The first equation can be found by substituting \(f(t,y) = \lambda y\) in equation (12.51) in the book and by solving for \(k_1,k_2\) (it is a \(2\times 2\) linear system). Substituting the values of \(A\) and \(b\) from the Butcher tableau in this formula and in the one right before equation (12.51) in the book, and simplifying, we get the formula for \(R(\lambda h)\). Finally, \(p\) and \(q\) are given by \(p,q=-3\pm i \sqrt{3}\). One can see that this implies \(|R(z)|<1\) if \(Re(z) <0\) and thus the method is A-stable.
</p>
</details>
</li>
</ol>
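<p>The order and zero-stability computations in Exercises 12.11 and 12.12 can be double-checked mechanically. The sketch below evaluates the standard order-condition constants \(C_q\) of a linear multistep method \(\sum_j \alpha_j y_{n+j} = h\sum_j \beta_j f_{n+j}\) with exact fractions, and finds the roots of the first characteristic polynomial after dividing out the root \(z = 1\). The function names and the coefficient-list encoding are assumptions made for this example.</p>

```python
import cmath
from fractions import Fraction
from math import factorial


def order_and_error_constant(alpha, beta, q_max=8):
    """Return (order p, first nonzero constant C_{p+1}) of the method."""
    for q in range(q_max + 1):
        c = sum(Fraction(j**q, factorial(q)) * a for j, a in enumerate(alpha))
        if q >= 1:
            c -= sum(Fraction(j**(q - 1), factorial(q - 1)) * b
                     for j, b in enumerate(beta))
        if c != 0:
            return q - 1, c
    raise ValueError("all constants up to q_max vanish")


def nontrivial_roots(alpha):
    """Roots of the cubic rho(z) = sum_j alpha_j z^j other than z = 1."""
    a0, a1, a2, a3 = alpha
    # Synthetic division by (z - 1): rho(z) = (z - 1)(b2 z^2 + b1 z + b0).
    b2 = a3
    b1 = a2 + b2
    b0 = a1 + b1
    if a0 + b0 != 0:
        raise ValueError("rho(1) != 0, the method is not consistent")
    disc = cmath.sqrt(b1 * b1 - 4 * b2 * b0)
    return [(-b1 + disc) / (2 * b2), (-b1 - disc) / (2 * b2)]


# Exercise 12.11: y_{n+3} + 9 (y_{n+2} - y_{n+1}) - y_n = 6 h (f_{n+2} + f_{n+1})
alpha_11, beta_11 = [-1, -9, 9, 1], [0, 6, 6, 0]
# Exercise 12.12, indices shifted by two: y_{n+3} - 2 y_{n+1} + y_n = h f_{n+2}
alpha_12, beta_12 = [1, -2, 0, 1], [0, 0, 1, 0]

order_11 = order_and_error_constant(alpha_11, beta_11)
order_12 = order_and_error_constant(alpha_12, beta_12)
max_root_11 = max(abs(z) for z in nontrivial_roots(alpha_11))
max_root_12 = max(abs(z) for z in nontrivial_roots(alpha_12))

print("12.11:", order_11, "largest |root| =", round(max_root_11, 3))
print("12.12:", order_12, "largest |root| =", round(max_root_12, 3))
```

<p>Both methods have a characteristic root of modulus greater than one, so neither is zero-stable, even though the first one reaches order 4.</p>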
<p><br /></p>
<h1 id="3-resnets">3 ResNets</h1>
<p><strong>Motivation</strong>: The introduction of Residual Networks (ResNets) made it possible to train very deep networks. In this section, we study residual architectures and their properties. We then look into how ResNets approximate ODEs and how this interpretation can motivate neural net architectures and new training approaches. This is important in order to understand the basic models underlying Neural ODEs and gain some insights into their connection to numerical solutions of ODEs.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>ResNets.</li>
<li>ResNets and ODEs.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week3.pdf">class</a>, we defined and briefly discussed the residual network architecture. We then looked at a stability notion for ResNets, derived from the connection with the discretisation of ODEs, and at a simple way to make such architectures reversible.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>ResNets:
<ul>
<li><a href="https://www.coursera.org/lecture/convolutional-neural-networks/resnets-HAhz9">ResNets</a>.</li>
<li><a href="https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035">An Overview of ResNet and its Variants</a>.</li>
</ul>
</li>
<li>ResNets and ODEs:
<ul>
<li>Sections 1-3 from <a href="https://arxiv.org/pdf/1710.10348.pdf">Multi-level Residual Networks from Dynamical Systems View</a>.</li>
<li><a href="https://arxiv.org/abs/1709.03698">Reversible Architectures for Arbitrarily Deep Residual Neural Networks</a>.</li>
<li>Invertible ResNets: <a href="https://arxiv.org/pdf/1707.04585.pdf">The Reversible Residual Network: Backpropagation Without Storing Activations</a></li>
<li><a href="https://arxiv.org/pdf/1705.03341.pdf">Stable Architectures for Deep Neural Networks</a>.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>The original ResNets paper: <a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a>.</li>
<li>Another blog post on ResNets: <a href="https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624">Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Do you understand why adding ‘residual layers’ should not degrade the network performance?
<details><summary>Solution</summary>
<p>
Let
$$x_k = x_{k-1} + f(W_k, x_{k-1})$$
be the output of the \(k\)-th layer of a residual net. Then, adding a residual layer consists of considering $$x_{k+1} = x_{k} + f(W_{k+1}, x_{k})$$ instead of \(x_k\) as the network output. For most common architectures, it holds that \(f(W, x) \equiv 0\) for \(W=0\). This is why adding a layer should not degrade the performance: any residual network with \(k\) layers can also be written as a residual network with \(k+1\) layers, by simply taking \(W_{k+1}=0\).
</p>
</details>
</li>
<li>How do the authors of (Multi-level Residual Networks from Dynamical Systems View) explain the phenomenon that residual networks still perform almost as well when a layer is removed?
<details><summary>Solution</summary>
<p>
Viewing each layer of the network as one time-step of the forward Euler method, we have that
$$x^{(n+1)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta)$$
where \(x^{(n)}(x_i)\) is the output of the \(n\)-th layer of the network evaluated on the input point \(x_i\). Then
$$x^{(n+2)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta) + h F(x^{(n+1)}(x_i); \theta)$$
Therefore, removing layer \(n+1\) consists of taking
$$x^{(n+2)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta)$$
instead. As \(h\) is small (and this is motivated by the experiments in Section 3.2), the removed term is small and so is the variation in the output layer. Nevertheless, it must be noticed that this analysis is only based on empirical evaluations.
</p>
</details>
</li>
<li>Implement your favourite ResNet variant.
<details><summary>Example</summary>
<p>
See this <a href="https://keras.io/examples/cifar10_resnet/">tutorial</a> for an example of implementation of a ResNet.
</p>
</details>
</li>
</ol>
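<p>Both answers above can be illustrated with a toy scalar residual network, where each layer is a forward Euler step \(x \leftarrow x + h\tanh(wx)\). The residual function, the weights, and the step sizes below are assumptions chosen for this sketch and are not taken from the papers.</p>

```python
import math


def resnet_forward(x, weights, h, skip=None):
    """Apply residual layers x <- x + h * tanh(w * x), optionally skipping one."""
    for k, w in enumerate(weights):
        if k == skip:
            continue  # removing a layer means dropping its Euler step
        x = x + h * math.tanh(w * x)
    return x


weights = [0.5, -0.3, 0.8, 0.6, -0.4, 0.7]  # assumed toy weights
x0 = 1.0

# 1) A new layer with zero weight is the identity, so depth cannot hurt.
base = resnet_forward(x0, weights, h=0.1)
extended = resnet_forward(x0, weights + [0.0], h=0.1)

# 2) Removing one layer changes the output by roughly h * F, small for small h.
def removal_gap(h, layer=2):
    full = resnet_forward(x0, weights, h)
    pruned = resnet_forward(x0, weights, h, skip=layer)
    return abs(full - pruned)


gap_large_h = removal_gap(0.1)
gap_small_h = removal_gap(0.01)
print("zero-weight layer changes output:", extended != base)
print(f"gap with h = 0.10: {gap_large_h:.5f}")
print(f"gap with h = 0.01: {gap_small_h:.5f}")
```

<p>The gap shrinks roughly in proportion to \(h\), matching the argument that a removed layer only drops one small Euler step.</p>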
<p><br /></p>
<h1 id="4-normalising-flows">4 Normalising Flows</h1>
<p><strong>Motivation</strong>: In this class, we take a little detour to learn about Normalising Flows. These are used for density estimation and generative modeling, and their implementation is motivated by a discretisation of an ODE. Understanding them at a basic level is necessary to understand continuous normalizing flows, a central application of neural ODEs.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Normalising Flows.</li>
<li>End-to-end implementations with neural nets.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week4.pdf">class</a>, we defined normalising flows, starting from the non-parametric form and then deriving their algorithmic (and parametric) implementation. We concluded by discussing some architectures proposed in the literature and their trade-offs.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><em>DE</em>: <a href="https://math.nyu.edu/faculty/tabak/publications/CMSV8-1-10.pdf">Density Estimation by Dual Ascent of the Log-likelihood</a> (Skip Section 3).</li>
<li><a href="https://math.nyu.edu/faculty/tabak/publications/Tabak-Turner.pdf">A family of non-parametric density estimation algorithms</a>.</li>
<li><a href="http://akosiorek.github.io/ml/2018/04/03/norm_flows.html">A post on Normalising flow</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1505.05770.pdf">Variational Inference with Normalizing Flows</a>.</li>
<li><a href="https://arxiv.org/pdf/1302.5125.pdf">High-Dimensional Probability Estimation with Deep Density Models</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>In <em>DE</em>, what is the difference between \(\rho_t\) and \(\tilde{\rho}_t\), i.e. what do they represent?
<details><summary>Solution</summary>
<p>
The function \(\tilde{\rho}_t\) is the density of the distribution of the random variable \(\phi_t^{-1}(y)\) where \(y\sim \mu\). The function \(\rho_t\) is the density of the distribution of the random variable \(\phi_t(x)\) where \(x\sim \rho\).
</p>
</details>
</li>
<li>What is the computational complexity of evaluating a determinant of an \(N\times N\) matrix, and why is that relevant in this context?
<details><summary>Solution</summary>
<p>
In general, the cost of computing the determinant of an \(N\times N\) matrix is \(O(N^3)\). To compute densities transported by normalising flows, we need to compute the determinants of the Jacobians. An important feature of practical normalising flows is therefore that the structure of the Jacobian must allow its determinant to be computed efficiently. See this week's notes for more discussion on this.
</p>
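<p>
As a rough illustration of the point above, the sketch below compares a general \(O(N^3)\) log-determinant (Gaussian elimination) with the \(O(N)\) shortcut available when the Jacobian is triangular. The matrix <code>jac</code> and both function names are made up for this example; they are not taken from the notes or the readings.
</p>

```python
import math

def logabsdet_general(a):
    """log|det(a)| by Gaussian elimination with partial pivoting: O(N^3)."""
    a = [row[:] for row in a]  # work on a copy
    n = len(a)
    total = 0.0
    for k in range(n):
        # pick the largest pivot in column k for numerical stability
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            return float("-inf")  # singular matrix
        a[k], a[p] = a[p], a[k]
        total += math.log(abs(a[k][k]))
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
    return total

def logabsdet_triangular(a):
    """log|det(a)| for a triangular matrix: sum of log|diagonal|, O(N)."""
    return sum(math.log(abs(a[i][i])) for i in range(len(a)))

# Hypothetical lower-triangular Jacobian (an assumption for illustration).
jac = [[2.0, 0.0, 0.0],
       [0.5, 0.5, 0.0],
       [1.0, -3.0, 4.0]]

print(logabsdet_general(jac), logabsdet_triangular(jac))
```

<p>
Both functions agree on a triangular input, but only the second avoids the cubic cost, which is why many flow architectures are designed to have triangular Jacobians.
</p>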
</details>
</li>
</ol>
<p><br /></p>
<h1 id="5-the-adjoint-method-and-auto-diff">5 The Adjoint Method (and Auto-Diff)</h1>
<p><strong>Motivation</strong>: The adjoint method is a numerical method for efficiently computing the gradient of a function in numerical optimization problems. Understanding this method is essential to understand how to train ‘continuous depth’ nets. We also review the basics of Automatic Differentiation, which will help us understand the efficiency of the algorithm proposed in the NeuralODE paper.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Adjoint Method.</li>
<li>Auto-Diff.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week5.pdf">class</a>, we discussed the adjoint method. We started from the case of linear systems and went through non-linear equations and recurrence relations. We concluded by discussing their application to ODE-constrained optimization problems, which is the case of interest for Neural ODEs.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Section 8.7 from <em>CSE</em>: <a href="http://math.mit.edu/~gs/cse/">Computational Science and Engineering</a>.</li>
<li>Sections 2 and 3 from <a href="http://www.jmlr.org/papers/volume18/17-468/17-468.pdf">Automatic Differentiation in Machine Learning: a Survey</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://math.mit.edu/~stevenj/notes.html">Prof. Steven G. Johnson’s notes on adjoint method</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Exercises 1,2,3 from Section 8.7 of <em>CSE</em>.
<details><summary>Solution to Exercise 1</summary>
<p>
This follows immediately by noticing that the number of multiply-add operations of multiplying an \(N\times M\) matrix with an \(M\times P\) matrix is given by \(O(NMP)\).
</p>
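<p>
The \(O(NMP)\) count is what makes the order of multiplication matter. The sketch below tallies multiply-adds for the two ways of bracketing a product of three matrices. The shapes and helper names are assumptions chosen for illustration, not the exact setting of the exercise.
</p>

```python
def matmul_cost(n, m, p):
    """Multiply-adds needed for an (n x m) times (m x p) product."""
    return n * m * p

def chain_cost_left(n, m, p, q):
    """Cost of (A B) C with A: n x m, B: m x p, C: p x q."""
    return matmul_cost(n, m, p) + matmul_cost(n, p, q)

def chain_cost_right(n, m, p, q):
    """Cost of A (B C) with the same shapes."""
    return matmul_cost(m, p, q) + matmul_cost(n, m, q)

N = 1000
# A and B are N x N, C is a single column vector (N x 1).
left = chain_cost_left(N, N, N, 1)    # N^3 + N^2
right = chain_cost_right(N, N, N, 1)  # 2 N^2

print(left, right)
```

<p>
Multiplying by the vector first keeps every intermediate result a vector, which is the same saving that reverse-mode differentiation and the adjoint method exploit.
</p>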
</details>
<details><summary>Solution to Exercise 2</summary>
<p>
Apply the chain rule. Since \(\frac{\partial C}{\partial S} = 2S\) and \(\frac{dT}{dS} = \frac{\partial T}{\partial S} + \frac{\partial T}{\partial C}\frac{\partial C}{\partial S}\), we get \(\frac{d T}{d S} = 1 -2S\).
</p>
</details>
<details><summary>Solution to Exercise 3</summary>
<p>
This follows from Exercise 1 by seeing \(u^T\) and \(w^T\) as \(1\times N\) matrices and \(v\) as an \(N\times 1\) matrix.
</p>
</details>
</li>
<li>Consider the problem of optimizing a real-valued function \(g\) over the solution of the ODE \(y'(t) = A(p)y(t)\), \(y(0) = b(p)\) at time \(T>0\): \(\min_p\, g(T) \doteq g(y(T; p))\). Find \(\frac{dg(T)}{dp}\) by solving the ODE and applying the chain rule. Check the correctness of equations (16-17) in <em>CSE</em>.
<details><summary>Solution</summary>
<p>
It holds that
$$y(t) = e^{tA(p)}y(0)$$
Applying the chain rule, and assuming for simplicity that \(A(p)\) commutes with \(\frac{\partial A}{\partial p}\) (as in the scalar case), we get
$$\frac{dg}{dp} = \frac{dg}{dy}e^{TA(p)}\frac{db}{dp} + T\frac{dg}{dy}\frac{\partial A}{\partial p}e^{TA(p)}b(p)$$
On the other hand, the adjoint ODE reads
$$\lambda'(t) = -A(p)^T\lambda(t)$$
with the final condition \(\lambda(T) = \left(\frac{\partial g}{\partial y}\right)^T\), which gives \(\lambda(t) = e^{A(p)^T(T-t)}\left(\frac{\partial g}{\partial y}\right)^T\). Equation (17) from <i>CSE</i> gives
$$\frac{dg}{dp} = \left(e^{TA(p)^T}\left(\frac{\partial g}{\partial y}\right)^T\right)^T\frac{\partial b}{\partial p} + \int_0^T \frac{\partial g}{\partial y} e^{A(p)(T-t)}\frac{\partial A}{\partial p}e^{tA(p)}b(p)\,dt$$
which coincides with the above, since under the same commutation assumption the integrand does not depend on \(t\).
</p>
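<p>
The agreement between the two expressions can also be checked numerically. The sketch below does this in the scalar case, where the commutation assumption holds trivially. The choices \(A(p) = -p\), \(b(p) = 1 + p\), \(g(y) = y^2\) and \(T = 1.5\) are arbitrary assumptions made for this example.
</p>

```python
import math

T = 1.5

def a(p):        # hypothetical scalar "matrix" A(p)
    return -p

def b(p):        # hypothetical initial condition b(p)
    return 1.0 + p

def g_of_p(p):
    """g(y(T; p)) with g(y) = y^2 and y(T) = exp(T a(p)) b(p)."""
    y_T = math.exp(T * a(p)) * b(p)
    return y_T ** 2

def grad_closed_form(p):
    """dg/dp from the chain-rule formula in the solution above."""
    y_T = math.exp(T * a(p)) * b(p)
    dg_dy = 2.0 * y_T
    da_dp, db_dp = -1.0, 1.0
    return dg_dy * math.exp(T * a(p)) * db_dp + T * dg_dy * da_dp * y_T

def grad_adjoint(p, steps=2000):
    """dg/dp as lambda(0) db/dp + int_0^T lambda(t) dA/dp y(t) dt."""
    y_T = math.exp(T * a(p)) * b(p)
    dg_dy = 2.0 * y_T
    da_dp, db_dp = -1.0, 1.0

    def lam(t):  # solves lambda' = -a lambda with lambda(T) = dg/dy
        return math.exp(a(p) * (T - t)) * dg_dy

    def y(t):    # solves y' = a y with y(0) = b
        return math.exp(t * a(p)) * b(p)

    h = T / steps
    # midpoint rule for the integral term
    integral = h * sum(lam((k + 0.5) * h) * da_dp * y((k + 0.5) * h)
                       for k in range(steps))
    return lam(0.0) * db_dp + integral

def grad_finite_difference(p, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (g_of_p(p + eps) - g_of_p(p - eps)) / (2.0 * eps)

print(grad_closed_form(0.7), grad_adjoint(0.7), grad_finite_difference(0.7))
```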
</details>
</li>
<li>Prove equations (14-15) in Section 8.7 of <em>CSE</em>.
<details><summary>Solution</summary>
<p>
By definition, it holds that
$$\frac{dG}{dp} = \int_0^T\left(\frac{\partial g}{\partial p} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial p}\right)\,dt $$
On the other hand, since \(\lambda(T) = 0\), it holds that
$$\lambda(0)^T\frac{\partial u}{\partial p}(0) + \int_0^T\lambda^T \frac{\partial f}{\partial p}\,dt = \int_0^T \left( \lambda^T\frac{\partial f}{\partial p} -\frac{d}{dt}\left( \lambda^T \frac{\partial u}{\partial p}\right) \right)\,dt $$
Using equation (14) from <i>CSE</i> and the equality \(\frac{d}{dt}\frac{\partial u}{\partial p} = \frac{\partial f}{\partial p} + \frac{\partial f}{\partial u}\frac{\partial u}{\partial p}\), we get
$$\int_0^T \left( \lambda^T\frac{\partial f}{\partial p} -\frac{d}{dt}\left( \lambda^T \frac{\partial u}{\partial p}\right) \right)\,dt = \int_0^T \left( \lambda^T\frac{\partial f}{\partial p} + \lambda^T \frac{\partial f}{\partial u}\frac{\partial u}{\partial p} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial p} - \lambda^T \frac{\partial f}{\partial p} -\lambda^T \frac{\partial f}{\partial u}\frac{\partial u}{\partial p} \right)\,dt$$
which gives
$$
\lambda(0)^T\frac{\partial u}{\partial p}(0) + \int_0^T \lambda^T\frac{\partial f}{\partial p}\,dt = \int_0^T \frac{\partial g}{\partial u}\frac{\partial u}{\partial p}\,dt
$$
and thus completes the proof.
</p>
</details>
</li>
</ol>
<p><br /></p>
<h1 id="6-the-paper">6 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the paper! Here is a summary of what’s going on to help with your understanding:</p>
<p>Any residual network can be seen as the Explicit Euler’s method discretisation of a certain ODE; given the network parameters, any numerical ODE solver can be used to evaluate the output layer. The application of the adjoint method makes it possible to efficiently back-propagate (and thus train) these models. The same idea can be used to train time-continuous normalising flows. In this case, moving to the continuous formulation allows us to avoid the computation of the determinant of the Jacobian, one of the major bottlenecks of normalising flows. Neural ODEs can also be used to model latent dynamics in time-series modeling, allowing us to easily tackle irregularly sampled data.</p>
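<p>The correspondence between residual blocks and Euler steps can be sketched in a few lines. The dynamics \(f(h, t) = -h\) are a stand-in chosen so that the exact solution is known; in a Neural ODE, \(f\) would be a small neural network. The function names are hypothetical.</p>

```python
import math

def f(h, t):
    """Hypothetical dynamics; in a Neural ODE this would be a small neural net."""
    return -h

def resnet_forward(h0, num_layers, t_end=1.0):
    """Residual updates h <- h + dt * f(h, t): explicit Euler on dh/dt = f(h, t)."""
    dt = t_end / num_layers
    h, t = h0, 0.0
    for _ in range(num_layers):
        h = h + dt * f(h, t)   # one residual block == one Euler step
        t += dt
    return h

exact = math.exp(-1.0)               # true solution h(1) for h(0) = 1
coarse = resnet_forward(1.0, 10)     # 10 "layers"
fine = resnet_forward(1.0, 1000)     # 1000 "layers"

print(exact, coarse, fine)
```

<p>Adding layers shrinks the step size and brings the output closer to the ODE solution; replacing the loop with an adaptive solver is the step that the paper takes.</p>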
<p><strong>Topics</strong>:</p>
<ol>
<li>Normalising Flows.</li>
<li>End-to-end implementations with neural nets.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week6.pdf">class</a>, we defined Neural ODEs and derived the respective adjoint method, essential for their implementation. We then discussed continuous normalising flows and the computational advantages offered by Neural ODEs in this setting.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1806.07366">Neural Ordinary Differential Equations</a>.</li>
<li><a href="https://rkevingibson.github.io/blog/neural-networks-as-ordinary-differential-equations/">A blog post on NeuralODEs</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>A follow-up paper by the authors on scalable continuous normalizing flows: <a href="https://arxiv.org/abs/1810.01367">Free-form Continuous Dynamics for Scalable Reversible Generative Models</a>.</li>
</ol>
<p><br /></p>
<p>Author: luca</p>
<p>Wasserstein GAN. 2019-05-02T14:00:00+00:00. https://www.depthfirstlearning.com/2019/WassersteinGAN</p>
<p>[Editor’s Note: We are especially proud of this one. James and his group went above and beyond the call of duty and made a guide from their class that we feel is especially superb for understanding their target paper. Moving forward, he has forced us to up our game because it will be hard to release a curriculum that is not as strong as this one. We highly recommend earnestly studying with this at hand.]</p>
<p>A number of people need to be thanked for their parts in making this happen. Thank you to Martin Arjovsky, Avital Oliver, Cinjon Resnick, Marco Cuturi, Kumar Krishna Agrawal, and Ishaan Gulrajani for contributing to this guide.</p>
<p>Of course, thank you to Sasha Naidoo, Egor Lakomkin, Taliesin Beynon, Sebastian Bodenstein, Julia Rozanova, Charline Le Lan, Paul Cresswell, Timothy Reeder, and Michał Królikowski for beta-testing the guide and giving invaluable feedback. A special thank you to Martin Arjovsky, Tim Salimans, and Ishaan Gulrajani for joining us for the weekly meetings.</p>
<p>Finally, thank you to Ulrich Paquet and Stephan Gouws for introducing many of us to Cinjon.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/wgan-deps.svg" width="400"></iframe>
<div>Concepts used in Wasserstein GAN. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>The Wasserstein GAN (WGAN) is a GAN variant which uses the 1-Wasserstein distance, rather than the JS-Divergence, to measure the difference between the model and target distributions. This seemingly simple change has big consequences! Not only does WGAN train more easily (a common struggle with GANs), but it also achieves very impressive results, generating some stunning images. By studying the WGAN, and its variant the WGAN-GP, we can learn a lot about GANs and generative models in general. After completing this curriculum you should have an intuitive grasp of why the WGAN and WGAN-GP work so well, as well as a thorough understanding of the mathematical reasons for their success. You should be able to apply this knowledge to understanding cutting-edge research into GANs and other generative models.</p>
<p><br /></p>
<h1 id="1-basics-of-probability--information-theory">1 Basics of Probability & Information Theory</h1>
<p><strong>Motivation</strong>: To understand GAN training (and eventually WGAN & WGAN-GP) we need to first have some understanding of probability and information theory. In particular, we will focus on Maximum Likelihood Estimation and the KL-Divergence. This week we will make sure that we understand the basics so that we can build upon them in the following weeks.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Probability Theory</li>
<li>Information Theory</li>
<li>Mean Squared Error (MSE)</li>
<li>Maximum Likelihood Estimation (MLE)</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Chs 3.1 - 3.5 of <a href="https://www.deeplearningbook.org/">Deep Learning</a> by Goodfellow <em>et al.</em> (the DL book)
<ul>
<li>These chapters are here to introduce fundamental concepts such as random variables, probability distributions, marginal probability, and conditional probability. If you have the time, reading the whole of chapter 3 is highly recommended. A solid grasp of these concepts will be important foundations for what we will cover over the next 5 weeks.</li>
</ul>
</li>
<li>Ch 3.13 of the DL book
<ul>
<li>This chapter covers KL-Divergence & the idea of distances between probability distributions which will also be a key concept going forward.</li>
</ul>
</li>
<li>Chs 5.1.4 and 5.5 of the DL book
<ul>
<li>The aim of these chapters is to make sure that everyone understands maximum likelihood estimation (MLE) which is a fundamental concept in machine learning. It is used explicitly or implicitly in both supervised and unsupervised learning as well as in both discriminative and generative methods. In fact, many methods using gradient descent are doing approximate MLE. It is important to understand MLE as a fundamental concept, and its use in machine learning in practice. Note that, if you are not familiar with the notation used in these chapters, you might want to start at the beginning of the chapter. Also note that, if you are not familiar with the concept of estimators, you might want to read Ch 5.4. However, you can probably get by simply knowing that minimizing mean squared error (MSE) is a method for optimizing some approximation for a function we are trying to learn (an estimator).</li>
</ul>
</li>
<li>The first 3 sections of <a href="https://colinraffel.com/blog/gans-and-divergence-minimization.html">GANs and Divergence Minimization</a> (check out the rest after week 3)
<ul>
<li>This blog gives a great description of the connections between the KL divergence and MLE. It also provides a nice teaser for what is to come in the following weeks, particularly with regards to the difficulties of training generative models.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Ch 2 from <a href="http://www.inference.org.uk/itprnn/book.pdf">Information Theory, Inference & Learning Algorithms by David MacKay</a> (MacKay’s book)
<ul>
<li>This is worth reading if you feel like you didn’t quite grok the probability and information theory content in the DL book. MacKay provides a different perspective on these ideas which might help make things click. These concepts are going to be crucial going forward so it is definitely worth making sure you are comfortable with them.</li>
</ul>
</li>
<li>Chs 1.6 and 10.1 of <a href="https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf">Pattern Recognition and Machine Learning by Christopher M. Bishop</a> (PRML)
<ul>
<li>Similarly, this is worth reading if you don’t feel comfortable with the KL-Divergence and want another perspective.</li>
</ul>
</li>
<li>Aurélien Géron’s video <a href="https://www.youtube.com/watch?v=ErfnhcEV1O8">A Short Introduction to Entropy, Cross-Entropy and KL-Divergence</a>
<ul>
<li>An introductory, but interesting video that describes the KL-Divergence.</li>
</ul>
</li>
<li><a href="http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf">Notes on MLE and MSE</a>
<ul>
<li>An alternative discussion on the links between MLE and MSE.</li>
</ul>
</li>
<li>The first 37ish minutes of Arthur Gretton’s MLSS Africa talk on comparing probability distributions — <a href="https://www.youtube.com/watch?v=5sijxSg8P14">video</a>, <a href="https://drive.google.com/file/d/1RNrgDs5xw-9HTjikFU1L0iO1PBMDaGwE/view">slides</a>
<ul>
<li>An interesting take on comparing probability distributions. The first 37 minutes are fairly general and give some nice insights as well as some foreshadowing of what we will be covering in the following weeks. The rest of the talk is also very interesting and ends up covering another GAN called the MMD-GAN, but it isn’t all that relevant for us.</li>
</ul>
</li>
<li><a href="https://pdfs.semanticscholar.org/6af2/fa8887a2cb0386f79e3a2822b661e2dc8369.pdf">On Integral Probability Metrics, φ-Divergences and Binary Classification</a>
<ul>
<li>For those of you whose curiosity was piqued by Arthur’s talk, this paper goes into depth describing IPMs (such as MMD and the 1-Wasserstein distance) and comparing them to the φ-divergences (such as the KL-Divergence). <em>This paper is fairly heavy mathematically so don’t be discouraged if you struggle to follow it</em>.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The questions this week are here to make sure that you can put all the theory you’ve been reading about to a little practice. For example, do you understand how to perform calculations on probabilities, or, what Bayes’ rule is and how to use it?</em></p>
<ol>
<li>Examples/Exercises 2.3, 2.4, 2.5, 2.6, and 2.26 in MacKay’s book
<ul>
<li>Bonus: 2.35, and 2.36</li>
</ul>
<details><summary>Solutions</summary>
<p>
Examples 2.3, 2.5, and 2.6 have their solutions directly following them.
</p>
<p>
Exercise 2.26 has a solution on page 44.
</p>
<p>
Exercise 2.35 has a solution on page 45.
</p>
<p>
Exercise 2.36: 1/2 and 2/3.
</p>
<p>
(Page numbers from Version 7.2 (fourth printing) March 28, 2005, of MacKay's book.)
</p>
</details>
</li>
<li>Derive Bayes’ rule using the definition of conditional probability.
<details><summary>Solution</summary>
<p>
The definition of conditional probability tells us that
$$p(y|x) = \frac{p(y,x)}{p(x)}$$
and that
$$p(x|y) = \frac{p(y,x)}{p(y)}.$$
From this we can see that \(p(y,x) = p(y|x)p(x) = p(x|y)p(y)\). Finally if we divide everything by \(p(x)\) we get
$$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$
which is Bayes' rule.
</p>
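<p>
A small numerical check of the derivation: the sketch below computes \(p(y|x)\) once directly from the definition and once through Bayes' rule. The joint table is invented purely for illustration.
</p>

```python
# Hypothetical joint distribution p(y, x) over two binary variables.
joint = {
    ("rain", "wet"): 0.27, ("rain", "dry"): 0.03,
    ("sun", "wet"): 0.07,  ("sun", "dry"): 0.63,
}

def marginal_y(y):
    return sum(p for (yy, _), p in joint.items() if yy == y)

def marginal_x(x):
    return sum(p for (_, xx), p in joint.items() if xx == x)

def cond_y_given_x(y, x):
    """p(y|x) straight from the definition p(y, x) / p(x)."""
    return joint[(y, x)] / marginal_x(x)

def cond_x_given_y(x, y):
    """p(x|y) straight from the definition p(y, x) / p(y)."""
    return joint[(y, x)] / marginal_y(y)

def bayes_y_given_x(y, x):
    """p(y|x) via Bayes' rule: p(x|y) p(y) / p(x)."""
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)

print(cond_y_given_x("rain", "wet"), bayes_y_given_x("rain", "wet"))
```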
</details>
</li>
<li>Exercise 1.30 in PRML
<details><summary>Solution</summary>
<p>
<a href="https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians?rq=1">Here</a> is a solution.
</p>
<p>
The result should be \(\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\).
</p>
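<p>
If you want to sanity-check your own derivation, the sketch below compares the closed form above against a brute-force numerical integral of \(p(x)\log\frac{p(x)}{q(x)}\). The integration range, step count and function names are assumptions picked for this example.
</p>

```python
import math

def kl_gauss_closed_form(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) from the formula above."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def log_normal_pdf(x, mu, s):
    return -0.5 * math.log(2 * math.pi * s**2) - (x - mu)**2 / (2 * s**2)

def kl_gauss_numeric(mu1, s1, mu2, s2, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule approximation of int p(x) log(p(x)/q(x)) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        log_p = log_normal_pdf(x, mu1, s1)
        log_q = log_normal_pdf(x, mu2, s2)
        total += math.exp(log_p) * (log_p - log_q)
    return total * h

print(kl_gauss_closed_form(0.0, 1.0, 1.0, 2.0), kl_gauss_numeric(0.0, 1.0, 1.0, 2.0))
```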
</details>
</li>
<li>Prove that minimizing MSE is equivalent to maximizing likelihood (assuming Gaussian distributed data).
<details><summary>Solution</summary>
<p>
Mean squared error is defined as
$$MSE = \frac{1}{N}\sum^N_{n=1}(\hat{y}_n - y_n)^2$$
where \(N\) is the number of examples, \(y_n\) are the true labels, and \(\hat{y}_n\) are the predicted labels.
Log-likelihood is defined as \(LL = \log(p(\mathbf{y}|\mathbf{x}))\). Assuming that the examples are independent and identically distributed (i.i.d.) we get
$$ LL = \log\prod_{n=1}^Np(y_n|x_n) = \sum_{n=1}^{N}\log p(y_n|x_n). $$
Now, substituting in the definition of the normal distribution
$$ \mathcal{N}(y;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$
for \(p(y_n|x_n)\) and simplifying the expression, we get
$$ LL = \sum_{n=1}^{N} \left( -\frac{1}{2}\log(2\pi) - \log\sigma - \frac{(y_n - \mu_n)^2}{2\sigma^2} \right).$$
Finally, replacing \(\mu\) with \(\hat{y}\) (because we use the mean as our prediction), and noticing that maximizing the expression above depends only on the third term (because the others are constants), we arrive at the conclusion that to maximize the log-likelihood we must minimize
$$ \sum_{n=1}^{N}\frac{(y_n - \hat{y}_n)^2}{2\sigma^2} $$
which, since \(\sigma\) and \(N\) are constants, is the same as minimising the MSE.
</p>
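<p>
The equivalence can be seen on a toy problem. The sketch below scans a grid of weights for a one-parameter model and checks that the weight with the lowest MSE is also the one with the highest Gaussian log-likelihood. The data, the model \(\hat{y} = wx\) and the fixed noise level are all assumptions for illustration.
</p>

```python
import math

# Hypothetical 1-D regression data and a one-parameter model y_hat = w * x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2]
SIGMA = 0.5  # assumed fixed noise standard deviation

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def log_likelihood(w):
    """Sum of log N(y_n; w * x_n, SIGMA^2) over the data set."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(SIGMA)
        - (y - w * x) ** 2 / (2 * SIGMA ** 2)
        for x, y in zip(xs, ys)
    )

grid = [k / 1000 for k in range(0, 2001)]   # candidate weights in [0, 2]
w_min_mse = min(grid, key=mse)
w_max_ll = max(grid, key=log_likelihood)

print(w_min_mse, w_max_ll)
```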
</details>
</li>
<li>Prove that maximizing likelihood is equivalent to minimizing KL-Divergence.
<details><summary>Solution</summary>
<p>
KL-Divergence is defined as
$$ D_{KL}(p||q) = \sum_x p(x) \log\frac{p(x)}{q(x|\bar{\theta})}$$
where \(p(x)\) is the true data distribution, \(q(x|\bar{\theta})\) is our model distribution, and \(\bar{\theta}\) are the parameters of our model. We can rewrite this as
$$ D_{KL}(p||q) = \mathbb{E}_p[\log p(x)] - \mathbb{E}_p[\log q(x|\bar{\theta})]$$
where the notation \(\mathbb{E}_p[f(x)]\) means that we are taking the expected value of \(f(x)\) by sampling \(x\) from \(p(x)\). We notice that minimizing \(D_{KL}(p||q)\) means maximizing \(\mathbb{E}_p[\log q(x|\bar{\theta})]\) since the first term in the expression above is constant (we can't change the true data distribution). Now, to maximize the likelihood of our model, we need to maximize
$$q(\bar{x}|\bar{\theta}) = \prod_{n=1}^Nq(x_n|\bar{\theta}).$$
Recall that taking a logarithm does not change the result of optimization which means that we can maximize
$$\log q(\bar{x}|\bar{\theta}) = \sum_{n=1}^N\log q(x_n|\bar{\theta}).$$
If we divide this term by the constant factor \(N\), we get a sample-based estimate of \(\mathbb{E}_p[\log q(x|\bar{\theta})]\), which is exactly the term we need to maximize in order to minimize the KLD.
</p>
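<p>
A tiny example makes the link concrete. The sketch below fits a Bernoulli model to made-up coin flips and checks that the parameter maximizing the average log-likelihood is the one minimizing the KL-Divergence. As a simplifying assumption, the empirical distribution of the data stands in for the true distribution \(p\).
</p>

```python
import math

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
p_heads = sum(data) / len(data)   # empirical distribution, here 0.7

def avg_log_likelihood(theta):
    """(1/N) sum_n log q(x_n | theta) for a Bernoulli model."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def kl_empirical_to_model(theta):
    """D_KL(p || q_theta) with p the empirical distribution of the data."""
    return (p_heads * math.log(p_heads / theta)
            + (1 - p_heads) * math.log((1 - p_heads) / (1 - theta)))

grid = [k / 100 for k in range(1, 100)]   # theta in (0, 1)
theta_mle = max(grid, key=avg_log_likelihood)
theta_min_kl = min(grid, key=kl_empirical_to_model)

print(theta_mle, theta_min_kl)
```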
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week1.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!</p>
<p><br /></p>
<h1 id="2-generative-models">2 Generative Models</h1>
<p><strong>Motivation</strong>: This week we’ll take a look at generative models. We will aim to understand how they are similar and how they differ from the discriminative models covered last week. In particular, we want to understand the challenges that come with training generative models.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Generative Models</li>
<li>Evaluation of Generative Models</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>The “Overview”, “What are generative models?”, and “Differentiable inference” sections of the webpage for David Duvenaud’s <a href="https://www.cs.toronto.edu/~duvenaud/courses/csc2541/index.html">course on Differentiable Inference and Generative Models</a>.
<ul>
<li>Here we want to get a sense of the big picture of what generative models are all about. There are also some fantastic resources here for further reading if you are interested.</li>
</ul>
</li>
<li><a href="https://arxiv.org/pdf/1511.01844.pdf">A note on the evaluation of generative models</a>
<ul>
<li>This paper is the real meat of this week’s content. After reading this paper you should have a good idea of the challenges involved in evaluating (and therefore training) generative models. Understanding these issues will be important for appreciating what the WGAN is all about. Don’t worry too much if some sections don’t completely make sense yet - we’ll be returning to the key ideas in the coming weeks.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Ch 20 of the DL book, particularly:
<ul>
<li>Differentiable Generator Networks (Ch 20.10.2)
<ul>
<li>Description of a broad class of generative models to which GANs belong which will help contextualize GANs when we look at them next week.</li>
</ul>
</li>
<li>Variational Autoencoders (Ch 20.10.3)
<ul>
<li>Description of another popular class of differentiable generative model which might be nice to contrast to GANs next week.</li>
</ul>
</li>
<li>Evaluating Generative Models (Ch 20.14)
<ul>
<li>Summary of techniques and challenges for evaluating generative models which might put Theis <em>et al.</em>’s paper into context.</li>
</ul>
</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The first two questions are here to make sure that you understand what a generative model is and how it differs from a discriminative model. The last two questions are a good barometer for determining your understanding of the challenges involved in training generative models.</em></p>
<ol>
<li>Fit a <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.multivariate_normal.html">multivariate Gaussian distribution</a> to the <a href="https://scikit-learn.org/stable/datasets/index.html#iris-plants-dataset">Fisher Iris dataset</a> using maximum likelihood estimation (see Section 2.3.4 of PRML for help) then:
<ol>
<li>Determine the probability of seeing a flower with a sepal length of 7.9, a sepal width of 4.4, a petal length of 6.9, and a petal width of 2.5.</li>
<li>Determine the distribution of flowers with a sepal length of 6.3, a sepal width of 4.8, and a petal length of 6.0 (see section 2.3.2 of PRML for help).</li>
<li>Generate 20 flower measurements.</li>
<li>Generate 20 flower measurements with a sepal length of 6.3.</li>
</ol>
<p>(Congrats, you’ve just trained and used a generative model!)</p>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/JamesAllingham/DFL-WGAN/blob/master/DFL_WGAN_week2.ipynb">Here</a> is a Jupyter notebook with solutions. Open the notebook on your computer or Google colab to render the characters properly.
</p>
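<p>
For orientation before opening the notebook, here is a minimal pure-Python sketch of the same steps in two dimensions: fit by maximum likelihood, evaluate the density, and condition on one coordinate. The six data points are invented and are <em>not</em> the Iris measurements; the function names are hypothetical, and the linked notebook may well be organised differently.
</p>

```python
import math
import random

# Hypothetical 2-D measurements (NOT the Iris data): (length, width) pairs.
data = [(5.1, 3.5), (4.9, 3.0), (6.3, 3.3), (5.8, 2.7), (7.1, 3.0), (6.5, 3.2)]
n = len(data)

# Maximum likelihood estimates: sample mean and covariance with a 1/N factor.
mu = [sum(row[d] for row in data) / n for d in range(2)]
cov = [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in data) / n
        for j in range(2)] for i in range(2)]

def log_density(x):
    """Log of the fitted 2-D Gaussian density at x (a density, not a probability)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form dx^T cov^{-1} dx using the explicit 2x2 inverse
    quad = (cov[1][1] * dx[0] ** 2 - 2 * cov[0][1] * dx[0] * dx[1]
            + cov[0][0] * dx[1] ** 2) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def conditional_width_given_length(length):
    """Mean and variance of width given length (conditional Gaussian formulas)."""
    mean = mu[1] + cov[1][0] / cov[0][0] * (length - mu[0])
    var = cov[1][1] - cov[1][0] ** 2 / cov[0][0]
    return mean, var

def sample_width_given_length(length, rng=random):
    """Draw one width from the conditional distribution."""
    mean, var = conditional_width_given_length(length)
    return rng.gauss(mean, math.sqrt(var))

print(mu, cov, conditional_width_given_length(6.3))
```

<p>
Note that the first sub-question asks for a "probability", but for a continuous model what you can evaluate at a point is a density, as <code>log_density</code> does here.
</p>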
</details>
</li>
<li>Describe in your own words the difference between a generative and a discriminative model.
<details><summary>Solution</summary>
<p>
This is an open ended question but here are some of the differences:
<ul>
<li>In the generative setting, we usually model \(p(x)\), while in the discriminative setting we usually model \(p(y|x)\).</li>
<li>Generative models are usually non-deterministic, and we can sample from them, while discriminative models are often deterministic, and we can't necessarily sample from them.</li>
<li>Discriminative models need labels while generative models typically do not.</li>
<li>In generative modelling the goal is often to learn some latent variables that describe the data in a compact manner, this is not usually the case for discriminative models.</li>
</ul>
</p>
</details>
</li>
<li>Theis <em>et al.</em> claim that “a model with zero KL divergence will produce perfect samples” — why is this the case?
<details><summary>Solution</summary>
<p>
As we showed last week, \(D_{KL}(p||q) = 0\) if and only if \(p(x)\), the true data distribution, and \(q(x)\) the model distribution, are the same.
</p>
<p>
Therefore, if \(D_{KL}(p||q) = 0\), samples from our model will be indistinguishable from the real data.
</p>
</details>
</li>
<li>Explain why a high log-likelihood of a generative model might not correspond to realistic samples.
<details><summary>Solution</summary>
<p>
Theis <i>et al.</i> outlined two scenarios where this is the case:
<ul>
<li><b>Low likelihood & good samples</b>: our model can overfit to the training data and produce good samples, however, because the model has overfitted it will have a low likelihood for unseen test data.</li>
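<li><b>High likelihood & poor samples</b>: in high dimensions, log-likelihoods scale with the dimensionality of the data. A mixture that puts 99% of its mass on a poor (e.g. noise) model and only 1% on a good model therefore loses at most \(\log 100 \approx 4.61\) nats relative to the good model, which is a negligible difference, even though almost all of its samples are poor.</li>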
</ul>
</p>
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week2.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Tim Salimans sit in on the session!</p>
<p><br /></p>
<h1 id="3-generative-adversarial-networks">3 Generative Adversarial Networks</h1>
<p><strong>Motivation</strong>: Let’s read the original GAN paper. Our main goal this week is to understand how GANs solve some of the problems with training generative models, as well as some of the new issues that come with training GANs.</p>
<p><em>The second paper this week is actually optional but <strong>highly</strong> recommended — we think that it contains some interesting material and sets the stage for looking at WGAN in week 4, however, the core concepts will be repeated again. Depending on your interest you might want to spend more or less time on this paper (we recommend that most people don’t spend too much time).</em></p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Generative Adversarial Networks</li>
<li>The Jensen-Shannon Divergence (JSD)</li>
<li>Why training GANs is hard</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1406.2661.pdf">Goodfellow’s GAN paper</a>
<ul>
<li>This is the paper that started it all and if we want to understand WGAN & WGAN-GP we’d better understand the original GAN.</li>
</ul>
</li>
<li><a href="https://arxiv.org/pdf/1701.04862.pdf">Toward Principled Methods for Generative Adversarial Network Training</a>
<ul>
<li>This paper explores the difficulties in training GANs and is a precursor to the WGAN paper that we will look at next week. The paper is quite math heavy so unless math is your cup of tea you shouldn’t spend too much time trying to understand the details of the proofs, corollaries, and lemmas. The important things to understand here are: what is the problem, and how do the proposed solutions solve the problem. Focus on the introduction, the English descriptions of the theorems and the figures. <strong>Don’t spend too much time on this paper</strong>.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s tutorial on GANs</a>
<ul>
<li>A more in-depth explanation of GANs from the man himself.</li>
</ul>
</li>
<li>The GAN chapter in the DL book (20.10.4)
<ul>
<li>A summary of what a GAN is and some of the issues involved in GAN training.</li>
</ul>
</li>
<li>Coursera (Stanford) course on game theory videos: <a href="https://www.youtube.com/watch?v=-j44yHK0nn4&index=5&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh">1-05</a>, <a href="https://www.youtube.com/watch?v=BsgnKTfOxTs&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh&index=11">2-01</a>, <a href="https://www.youtube.com/watch?v=FU6ax5K9HIA&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh&index=12">2-02</a>, and <a href="https://www.youtube.com/watch?v=RIneClCKgAw&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh&index=22">3-04b</a>
<ul>
<li>This is really here just for people who are interested in the game theory ideas such as minimax.</li>
</ul>
</li>
<li>Finish reading <a href="https://colinraffel.com/blog/gans-and-divergence-minimization.html">GANs and Divergence Minimization</a>.
<ul>
<li>Now that we know what a GAN is it will be worth it to go back and finish reading this blog. It should help to tie together many of the concepts we’ve covered so far. It also has some great resources for extra reading at the end.</li>
</ul>
</li>
<li><a href="https://ahmedhanibrahim.wordpress.com/2017/01/17/generative-adversarial-networks-when-deep-learning-meets-game-theory/comment-page-1/">Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory</a>
<ul>
<li>A short blog post which briefly summarises many of the topics we’ve covered so far.</li>
</ul>
</li>
<li><a href="https://www.inference.vc/how-to-train-your-generative-models-why-generative-adversarial-networks-work-so-well-2/">How to Train your Generative Models? And why does Adversarial Training work so well?</a> and <a href="https://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/">An Alternative Update Rule for Generative Adversarial Networks</a>
<ul>
<li>Two great blog posts from Ferenc Huszár that discuss the challenges in training GANs as well as the differences between the JSD, KLD and reverse KLD.</li>
</ul>
</li>
<li><a href="https://github.com/HIPS/autograd/blob/master/examples/generative_adversarial_net.py">Simple Python GAN example</a>
<ul>
<li>This example illustrates how simple GANs are to implement by doing it in 145 lines of Python using Numpy and a simple autograd library.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The first three questions this week are here to make sure that you understand some of the most important points in the GAN paper. The last question is to make sure you understood the overall picture of what a GAN is, and to get your hands dirty with some of the practical difficulties of training GANs.</em></p>
<ol>
<li>Prove that minimizing the optimal discriminator loss, with respect to the generator model parameters, is equivalent to minimizing the JSD.
<ul>
<li>Hint, it may help to somehow introduce the distribution \(p_m(x) = \frac{p_d(x) + p_g(x)}{2}\).</li>
</ul>
<details><summary>Solution</summary>
<p>
The loss we are minimizing is
$$\mathbb{E}_{x \sim p_d(x)}[\log D^*(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D^*(G(z)))]$$
where \(p_d(x)\) is the true data distribution, \(p_z(z)\) is the noise distribution from which we draw samples to pass through our generator, \(D\) and \(G\) are the discriminator and generator, and \(D^*\) is the optimal discriminator which has the form:
$$ D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}.$$
Here \(p_g(x)\) is the distribution of the data sampled from the generator. Substituting in \(D^*(x)\) and \(p_g(x)\), we can rewrite the loss as
$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_d(x) + p_g(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_d(x) + p_g(x)}]. $$
Now we can multiply the values inside the logs by \(1 = \frac{0.5}{0.5}\) to get
$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{0.5 p_d(x)}{0.5(p_d(x) + p_g(x))}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{0.5 p_g(x)}{0.5(p_d(x) + p_g(x))}]. $$
Recalling that \(\log(ab) = \log(a) + \log(b)\) and defining \(p_m(x) = \frac{p_d(x) + p_g(x)}{2}\), we now get
$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_m(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_m(x)}] - 2\log2. $$
Using the definition of the KL-Divergence, this simplifies to
$$ D_{KL}(p_d||p_m) + D_{KL}(p_g||p_m) - 2\log2. $$
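Finally, using the definition of the JS-Divergence, \(D_{JS}(p_d||p_g) = \frac{1}{2}D_{KL}(p_d||p_m) + \frac{1}{2}D_{KL}(p_g||p_m)\), the loss equals \(2D_{JS}(p_d||p_g) - 2\log2\). For the purposes of minimization the constant factor and the \(2\log2\) term can be ignored, so we are minimizing
$$ D_{JS}(p_d||p_g).$$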
</p>
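<p>
The identity can be verified numerically on a small discrete example. The sketch below plugs the optimal discriminator into the loss and compares the result with \(2D_{JS}(p_d||p_g) - 2\log 2\), taking \(D_{JS}\) to be the average of the two KL terms. Both distributions are invented for illustration.
</p>

```python
import math

# Hypothetical discrete data and generator distributions over three outcomes.
p_d = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """JS-Divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Optimal discriminator D*(x) = p_d(x) / (p_d(x) + p_g(x)).
d_star = [d / (d + g) for d, g in zip(p_d, p_g)]

# Loss at the optimal discriminator: E_{p_d}[log D*] + E_{p_g}[log(1 - D*)].
loss = (sum(d * math.log(ds) for d, ds in zip(p_d, d_star))
        + sum(g * math.log(1 - ds) for g, ds in zip(p_g, d_star)))

print(loss, 2 * jsd(p_d, p_g) - 2 * math.log(2))
```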
</details>
</li>
<li>Explain why Goodfellow says that \(D\) and \(G\) are playing a two-player minimax game and derive the definition of the value function \(V(G,D)\).
<details><summary>Solution</summary>
<p>
\(G\) wants to maximize the probability that \(D\) thinks the generated samples are real, \(\mathbb{E}_{z \sim p_z(z)}[D(G(z))]\). This is the same as minimizing the probability that \(D\) thinks the generated samples are fake, \(\mathbb{E}_{z \sim p_z(z)}[1 - D(G(z))]\).
</p>
<p>
On the other hand, \(D\) wants to maximise the probability that it assigns the labels correctly \(\mathbb{E}_{x \sim p_d(x)}[D(x)] + \mathbb{E}_{z \sim p_z(z)}[1 - D(G(z))]\). Note that \(D(x)\) should be 1 if \(x\) is real, and 0 if \(x\) is fake.
</p>
<p>
Taking logs of the probabilities inside the expectations leaves the goals of the two players unchanged and gives the minimax problem
$$ \min_G\max_D V(G,D), \qquad V(G,D) = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. $$
</p>
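<p>
To make the inner maximization concrete, the following sketch evaluates the value function on a finite sample space, writing the expectation over \(z\) as an expectation over the generator's distribution \(p_g\). It then checks that \(D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}\) scores at least as high as a batch of randomly drawn discriminators. The distributions and the random search are illustrative assumptions.
</p>

```python
import numpy as np

def value(p_d, p_g, d):
    """V(G, D) = E_{x~p_d}[log D(x)] + E_{x~p_g}[log(1 - D(x))] on a finite sample space."""
    return float(np.sum(p_d * np.log(d)) + np.sum(p_g * np.log(1.0 - d)))

# Hypothetical data and generator distributions on three points.
p_d = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

# Candidate maximizer for a fixed generator.
d_star = p_d / (p_d + p_g)
v_star = value(p_d, p_g, d_star)

# Compare against many random discriminators with outputs in (0, 1).
rng = np.random.default_rng(0)
best_random = max(value(p_d, p_g, rng.uniform(0.01, 0.99, size=3)) for _ in range(1000))

print("V at D*:", v_star)
print("best V over random D:", best_random)
```

<p>
The generator's side of the game then amounts to choosing \(p_g\) so that this maximal value is as small as possible.
</p>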
</details>
</li>
<li>Why is it important to carefully tune the amount that the generator and discriminator are trained in the original GAN formulation?
<ul>
<li>Hint, it has to do with the approximation for the JSD & the dimensionality of the data manifolds.</li>
</ul>
<details><summary>Solution</summary>
<p>
If we train the discriminator too much we get vanishing gradients. This is due to the fact that when the true data distribution and model distribution lie on low dimensional manifolds (or have disjoint support almost everywhere), the optimal discriminator will be perfect — i.e. the gradient will be zero almost everywhere. This is something that almost always happens.
</p>
<p>
On the other hand, if we train the discriminator too little, then the loss for the generator no longer approximates the JSD. This is because the approximation only holds if the discriminator is near the optimal \(D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}\).
</p>
</details>
</li>
<li>Implement a GAN and train it on Fashion MNIST.
<ul>
<li><a href="https://colab.research.google.com/drive/1OWZEeF-SB0r1f6mHm-7-hfxd2zsecEwq#scrollTo=Q8YoJ4mejp97">This notebook</a> contains a skeleton with boilerplate code and hints.</li>
<li>Try various settings of hyper-parameters, other than those suggested, and see if the model converges.</li>
<li>Examine samples from various stages of the training. Rank them without looking at the corresponding loss and see if your ranking agrees with the loss.</li>
</ul>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/eriklindernoren/Keras-GAN/blob/master/dcgan/dcgan.py">Here</a> is a GAN implementation using Keras.
</p>
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week3.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!</p>
<p><br /></p>
<h1 id="4-wasserstein-gan">4 Wasserstein GAN</h1>
<p><strong>Motivation</strong>: Last week we saw how GANs solve some problems in training generative models but also that they bring in new problems. This week we’ll look at the Wasserstein GAN which goes a long way to solving these problems.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Wasserstein Distance vs KLD/JSD</li>
<li>Wasserstein GAN</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1701.07875.pdf">The WGAN paper</a>
<ul>
<li>This should be pretty self-explanatory! We’re doing a DFL on Wasserstein GANs so we’d better read the paper! (This isn’t the end of the road, however: next week we’ll look at WGAN-GP.) The paper builds upon an intuitive idea: the family of Wasserstein distances is a nice family of distances between probability distributions that is well grounded in theory. The authors propose to use the 1-Wasserstein distance to estimate generative models. More specifically, they propose to use the 1-Wasserstein distance in place of the JSD in a standard GAN, that is, to measure the difference between the true distribution and the model distribution of the data. They show that the 1-Wasserstein distance is an integral probability metric (IPM) with a meaningful set of constraints (1-Lipschitz functions), and can, therefore, be optimized by focusing on discriminators that are “well behaved” (meaning that their output does not change too much if you perturb the input, i.e. they are Lipschitz!).</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.alexirpan.com/2017/02/22/wasserstein-gan.html">Summary blog for the paper</a>
<ul>
<li>This is a brilliant blog post that summarises almost all of the key points we’ve covered over the last 4 weeks and puts them in the context of the WGAN paper. In particular, if any of the more theoretic aspects of the WGAN paper were a bit much for you then this post is worth reading.</li>
</ul>
</li>
<li><a href="https://mindcodec.ai/2018/09/23/an-intuitive-guide-to-optimal-transport-part-ii-the-wasserstein-gan-made-easy/">Another good summary of the paper</a></li>
<li>Wasserstein / Earth Mover distance <a href="https://vincentherrmann.github.io/blog/wasserstein/">blog</a> <a href="https://mindcodec.ai/2018/09/19/an-intuitive-guide-to-optimal-transport-part-i-formulating-the-problem/">posts</a></li>
<li><a href="https://www.youtube.com/watch?v=6iR1E6t1MMQ">Set of</a> <a href="https://www.youtube.com/watch?v=1ZiP_7kmIoc">three</a> <a href="https://www.youtube.com/watch?v=SZHumKEhgtA">lectures</a> by Marco Cuturi on optimal transport (with accompanying <a href="https://drive.google.com/file/d/1oYX41dIAXhU6EShcid6eYrrK7svi5NXW/view">slides</a>)
<ul>
<li>If you are interested in the history of optimal transport and would like to see where the Wasserstein distance and the KR duality come from (the duality is the crucial argument in the WGAN paper, connecting the 1-Wasserstein distance to an IPM with a Lipschitz constraint), or if you feel like you need a different explanation of what the Wasserstein distance and the Kantorovich-Rubinstein duality are, then watching these lectures is recommended. There are some really cool applications of optimal transport here too, and a more exhaustive description of other families of Wasserstein distances (such as the quadratic one) and their dual formulation.</li>
</ul>
</li>
<li>The first 15 or so minutes of <a href="https://www.youtube.com/watch?v=eDWjfrD7nJY">this lecture on GANs</a> by Sebastian Nowozin
<ul>
<li>Great description of WGAN, including Lipschitz and KR duality. This lecture is actually part 2 of a series of 3 lectures from MLSS Africa. Watching the whole series is also highly recommended if you are interested in knowing more about the bigger picture for GANs (including other interesting developments and future work) and how WGAN relates to other GAN variants. However, to avoid spoilers for next week, you should wait to watch the rest of part 2.</li>
</ul>
</li>
<li><a href="https://arxiv.org/pdf/1803.00567.pdf">Computational Optimal Transport</a> by Peyré and Cuturi (Chapters 2 and 3 in particular)
<ul>
<li>If you enjoyed Marco’s lectures above, or want a more thorough theoretical understanding of the Wasserstein distance, then this textbook is for you! However, please keep in mind that this textbook is somewhat mathematically involved, so if you don’t have a mathematics background you may struggle with it.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The first two questions are here to highlight the key difference between the WGAN and the original GAN formulation. As before, the last question is to make sure you understood the overall picture of what a WGAN is and to get your hands dirty with how they differ from standard GANs in practice.</em></p>
<ol>
<li>What happens to the KLD/JSD when the real data and the generator’s data lie on low dimensional manifolds?
<details><summary>Solution</summary>
<p>
The true distribution and model distribution tend to have different supports which causes the KLD and JSD to saturate.
</p>
</details>
</li>
<li>With this in mind, how does using the Wasserstein distance, rather than JSD, reduce the sensitivity to careful scheduling of the generator and discriminator?
<details><summary>Solution</summary>
<p>
The Wasserstein distance does not saturate or blow up for distributions with different supports. This means that we still get useful gradient signals in these cases, which in turn means that we don’t have to worry about over-training the discriminator (or critic). In fact, we <i>want</i> to train it to optimality since it will give better signals.
</p>
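<p>
The difference can be seen in a toy setting: a "real" point mass at 0 and a "generated" point mass at some shift \(k\) on the integers. The sketch below is an assumed example, not taken from the paper; it computes the JSD and the 1-Wasserstein distance for several shifts. For distributions on consecutive integers, the 1-Wasserstein distance is the sum of absolute differences between the two CDFs, which is what the helper uses.
</p>

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                       # terms with a = 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_on_integers(p, q):
    """1-Wasserstein distance between distributions on 0, 1, ..., n - 1."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def point_mass(position, size=10):
    """Distribution on 0..size-1 that puts all of its mass on one point."""
    weights = np.zeros(size)
    weights[position] = 1.0
    return weights

real = point_mass(0)
rows = [(shift,
         js_divergence(real, point_mass(shift)),
         emd_on_integers(real, point_mass(shift)))
        for shift in (1, 3, 6, 9)]

for shift, jsd, emd in rows:
    print(f"shift={shift}  JSD={jsd:.4f}  EMD={emd:.1f}")
```

<p>
The JSD stays at \(\log 2\) for every non-zero shift, so it gives no indication of whether the generator is getting closer, while the EMD grows linearly with the shift.
</p>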
</details>
</li>
<li>Let’s compare the 1-Wasserstein Distance (aka Earth Mover’s Distance — EMD) with the KLD for a few simple discrete distributions. We want to build up an intuition for the differences between these two metrics and why one might be better than another in certain scenarios. You might find it useful to use the Scipy implementations for <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html">1-Wasserstein</a> and <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kl_div.html">KLD</a>.
<ol>
<li>Let \(P(x)\), \(Q(x)\) and \(R(x)\) be discrete distributions on \(Z\) with:
<ul>
<li>\(P(0) = 0.5\), \(P(1) = 0.5\),</li>
<li>\(Q(0) = 0.75\), \(Q(1) = 0.25\), and</li>
<li>\(R(0) = 0.25\) and \(R(1) = 0.75\).
<br /> Calculate both the KLD and EMD for the following pairs of distributions. You should notice that while Wasserstein is a proper distance metric, KLD is not (\(D_{KL}(P||Q) \ne D_{KL}(Q||P)\)).
<ol>
<li>\(P\) and \(Q\)</li>
<li>\(Q\) and \(P\)</li>
<li>\(P\) and \(P\)</li>
<li>\(P\) and \(R\)</li>
<li>\(Q\) and \(R\)</li>
</ol>
</li>
</ul>
</li>
<li>Let \(P(x)\), \(Q(x)\), \(R(x)\), \(S(x)\) be discrete distributions on \(Z\) with:
<ul>
<li>\(P(0) = 0.5\), \(P(1) = 0.5\), \(P(2) = 0\),</li>
<li>\(Q(0) = 0.33\), \(Q(1) = 0.33\), \(Q(2) = 0.33\),</li>
<li>\(R(0) = 0.5\), \(R(1) = 0.5\), \(R(2) = 0\), \(R(3) = 0\), and</li>
<li>\(S(0) = 0\), \(S(1) = 0\), \(S(2) = 0.5\), \(S(3) = 0.5\).
<br /> Calculate the KLD and EMD between the following pairs of distributions. You should notice that the EMD is well behaved for distributions with disjoint support while the KLD is not.
<ol>
<li>\(P\) and \(Q\)</li>
<li>\(Q\) and \(P\)</li>
<li>\(R\) and \(S\)</li>
</ol>
</li>
</ul>
</li>
<li>Let \(P(x)\), \(Q(x)\), \(R(x)\), and \(S(x)\) be discrete distributions on \(Z\) with:
<ul>
<li>\(P(0) = 0.25\), \(P(1) = 0.75\), \(P(2) = 0\),</li>
<li>\(Q(0) = 0\), \(Q(1) = 0.75\), \(Q(2) = 0.25\),</li>
<li>\(R(0) = 0\), \(R(1) = 0.25\), \(R(2) = 0.75\), and</li>
<li>\(S(0) = 0\), \(S(1) = 0\), \(S(2) = 0.25\), \(S(3) = 0.75\).
<br /> Calculate the EMD between the following pairs of distributions. Here we just want to get more of a sense for the EMD.
<ol>
<li>\(P\) and \(Q\)</li>
<li>\(P\) and \(R\)</li>
<li>\(Q\) and \(R\)</li>
<li>\(P\) and \(S\)</li>
<li>\(R\) and \(S\)</li>
</ol>
</li>
</ul>
</li>
</ol>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/JamesAllingham/DFL-WGAN/blob/master/DFL_WGAN_week4_q3.ipynb">Here</a> is a Jupyter notebook with solutions.
</p>
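<p>
For a quick start before opening the notebook, here is a minimal sketch of how the two SciPy functions linked in the question can be combined. It covers only \(P\) and \(Q\) from the first part and \(R\) and \(S\) from the second part; the wrapper names <code>kld</code> and <code>emd</code> are our own. Note that <code>scipy.special.kl_div</code> is element-wise, so the terms have to be summed.
</p>

```python
import numpy as np
from scipy.special import kl_div
from scipy.stats import wasserstein_distance

def kld(p, q):
    """D_KL(p || q) in nats; the element-wise terms sum to the KLD when p and q each sum to 1."""
    return float(np.sum(kl_div(p, q)))

def emd(p, q):
    """1-Wasserstein distance between two distributions on the points 0, 1, ..., len(p) - 1."""
    points = np.arange(len(p))
    return float(wasserstein_distance(points, points, u_weights=p, v_weights=q))

# Part 1: two distributions on {0, 1}.
P = np.array([0.5, 0.5])
Q = np.array([0.75, 0.25])

# Part 2: two distributions on {0, 1, 2, 3} with disjoint support.
R = np.array([0.5, 0.5, 0.0, 0.0])
S = np.array([0.0, 0.0, 0.5, 0.5])

print("KLD(P||Q) =", kld(P, Q), " KLD(Q||P) =", kld(Q, P))
print("EMD(P,Q)  =", emd(P, Q), " EMD(Q,P)  =", emd(Q, P))
print("KLD(R||S) =", kld(R, S), " EMD(R,S)  =", emd(R, S))
```

<p>
The KLD is asymmetric for \(P\) and \(Q\) and infinite for the disjoint pair \(R\) and \(S\), while the EMD is symmetric and stays finite in both cases.
</p>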
</details>
</li>
<li>Based on the GAN implementation from week 3, implement a WGAN for FashionMNIST.
<ul>
<li>Try various settings of hyper-parameters. Does this model seem more resilient to the choice of hyper-parameters?</li>
<li>Examine samples from various stages of the training. Rank them without looking at the corresponding loss and see if your ranking agrees with the loss.</li>
</ul>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/eriklindernoren/Keras-GAN/blob/master/wgan/wgan.py">Here</a> is a WGAN implementation using Keras.
</p>
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week4.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!</p>
<p><br /></p>
<h1 id="5-wgan-gp">5 WGAN-GP</h1>
<p><strong>Motivation</strong>: Let’s read the WGAN-GP paper (Improved Training of Wasserstein GANs). As has been the trend over the last few weeks, we’ll see how this method solves a problem with the standard WGAN: weight clipping, as well as a potential problem in the standard GAN: overfitting.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>WGAN-GP</li>
<li>Weight clipping vs gradient penalties</li>
<li>Measuring GAN performance</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1704.00028.pdf">WGAN-GP paper</a>
<ul>
<li>This is our final required reading. The paper suggests improvements to the training of Wasserstein GANs with some great theoretical justifications and actual results.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1709.08894.pdf">On the Regularization of Wasserstein GANs</a>
<ul>
<li>This paper came out after the WGAN-GP paper but gives a thorough discussion of why the weight clipping in the original WGAN was an issue (see Appendix B). In addition, they propose other solutions for how to get around doing so and provide other interesting discussions of GANs and WGANs. </li>
</ul>
</li>
<li><a href="https://medium.com/@jonathan_hui/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490">Wasserstein GAN & WGAN-GP blog post</a>
<ul>
<li>Another blog that summarises many of the key points we’ve covered and includes WGAN-GP.</li>
</ul>
</li>
<li><a href="https://medium.com/@jonathan_hui/gan-how-to-measure-gan-performance-64b988c47732">GAN — How to measure GAN performance?</a>
<ul>
<li>A blog that discusses a number of approaches to measuring the performance of GANs, including the Inception score, which is useful to know about when reading the WGAN-GP paper.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>This week’s questions follow the same pattern as last week’s. How does the formulation of WGAN-GP differ from that of the original GAN or WGAN (and how is it similar)? What does this mean in practice?</em></p>
<ol>
<li>Why does weight clipping lead to instability in the training of a WGAN & how does the gradient penalty avoid this problem?
<details><summary>Solution</summary>
<p>
The instability comes from the fact that if we choose the weight clipping hyper-parameter poorly, we end up with either exploding or vanishing gradients. This is because weight clipping encourages the optimizer to push the absolute values of all of the weights very close to the clipping value. Figure 1b in the paper shows this happening. To explain this phenomenon, consider a simple logistic regression model. Here, if any of the features is highly predictive of a particular class, it will be assigned as positive a weight as possible; similarly, if a feature is not predictive of a particular class, it will be assigned as negative a weight as possible. Now, depending on our choice of the weight clipping value, we either get exploding or vanishing gradients.
<ul>
<li> Vanishing gradients: this is similar to the issue of vanishing gradients in a vanilla RNN, or a very deep feed-forward NN without residual connections. If we choose the weight clipping value to be too small, during back-propagation, the error signal going to each layer will be multiplied by small values before being propagated to the previous layer. This results in exponential decay in the error signal as it propagates farther backward. </li>
<li> Exploding gradients: similarly, if we choose a weight clipping value that is too large, the error signals will get repeatedly multiplied by large numbers as they propagate backward, resulting in exponential growth. </li>
</ul>
</p>
<p>
This phenomenon is also related to the reason we use weight initialization schemes such as Xavier and He, and to why batch normalization is important: both of these methods help to ensure that information is propagated through the network without decaying or exploding.
</p>
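<p>
The exponential decay and growth described above can be reproduced in a deliberately simplified setting. The sketch below assumes a stack of purely linear layers (no nonlinearities or biases) whose weights have all been pushed to plus or minus the clipping value, and tracks the norm of an error signal as it is propagated backward. The depth, width, and clipping values are arbitrary choices for illustration.
</p>

```python
import numpy as np

def backprop_norms(clip_value, depth=20, width=64, seed=0):
    """Norm of an error signal after each backward step through linear layers
    whose weights all sit at +clip_value or -clip_value (a simplifying assumption)."""
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(width)
    grad /= np.linalg.norm(grad)           # start from a unit-norm error signal
    norms = []
    for _ in range(depth):
        weights = clip_value * rng.choice([-1.0, 1.0], size=(width, width))
        grad = weights.T @ grad            # backward pass through one linear layer
        norms.append(float(np.linalg.norm(grad)))
    return norms

small = backprop_norms(clip_value=0.01)    # clipping value chosen too small
large = backprop_norms(clip_value=1.0)     # clipping value chosen too large

print("clip=0.01: norm after 20 layers =", small[-1])
print("clip=1.0:  norm after 20 layers =", large[-1])
```

<p>
In this toy model each layer rescales the signal by roughly the clipping value times the square root of the width, so only a narrow range of clipping values keeps the gradient norm stable.
</p>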
</details>
</li>
<li>Explain how WGAN-GP addresses issues of overfitting in GANs.
<details><summary>Solution</summary>
<p>
Both WGAN-GP, and indeed the original weight-clipped WGAN, have the property that the discriminator/critic loss corresponds to the quality of the samples from the generator, which lets us use the loss to detect overfitting (we can compare the negative discriminator/critic loss for a validation set to that of the training set of real images; when the two diverge, we have overfitted). The correspondence between the loss and the sample quality can be explained by a number of factors.
<ul>
<li> With a WGAN we can train our discriminator to optimality. This means that if the critic is struggling to tell the difference between real and generated images we can conclude that the real and generated images are similar. In other words, the loss is meaningful.</li>
<li> In addition, in a standard GAN where we cannot train the discriminator to optimality, our loss no longer approximates the JSD. We do not know what function our loss is actually approximating and as a result we cannot say (and in practice we do not see) that the loss is a meaningful measure of sample quality. </li>
<li> Finally, there are arguments to be made that even if the loss for a standard GAN was approximating the JSD, the Wasserstein distance is a better distance measure for image distributions than the JSD. </li>
</ul>
</p>
</details>
</li>
<li>Based on the WGAN implementation from week 4, implement an improved WGAN for MNIST.
<ul>
<li>Compare the results, ease of hyper-parameter tuning, and correlation between loss and your subjective ranking of samples, with the previous two models.</li>
<li><em>The Keras implementation of WGAN-GP can be tricky. If you are familiar with another framework like TensorFlow or PyTorch it might be easier to use that instead. If not, don’t be too hesitant to check the solution if you get stuck.</em></li>
</ul>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/eriklindernoren/Keras-GAN/blob/master/wgan_gp/wgan_gp.py">Here</a> is a WGAN-GP implementation using Keras.
</p>
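<p>
If the penalty term itself is the sticking point, the following framework-free sketch may help. It computes the WGAN-GP penalty \(\lambda\,\mathbb{E}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]\) at random interpolates \(\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}\) between real and generated samples. The one-hidden-layer critic and its hand-written input gradient are assumptions made so that the example runs with NumPy alone; a real implementation would obtain the gradient from the framework's automatic differentiation.
</p>

```python
import numpy as np

def critic(x, W, v):
    """Toy critic f(x) = v . tanh(W x) for a batch x of shape (batch, dim)."""
    return np.tanh(x @ W.T) @ v

def critic_input_grad(x, W, v):
    """Analytic gradient of the toy critic with respect to each input row."""
    h = np.tanh(x @ W.T)                   # hidden activations, shape (batch, hidden)
    return ((1.0 - h ** 2) * v) @ W        # chain rule, shape (batch, dim)

def gradient_penalty(real, fake, W, v, rng, lam=10.0):
    """lam * mean((||grad_x f(x_hat)||_2 - 1)^2) at random interpolates x_hat."""
    eps = rng.uniform(size=(real.shape[0], 1))       # one mixing weight per sample
    x_hat = eps * real + (1.0 - eps) * fake          # points between real and fake
    grad_norms = np.linalg.norm(critic_input_grad(x_hat, W, v), axis=1)
    return lam * float(np.mean((grad_norms - 1.0) ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))            # hypothetical critic parameters
v = rng.standard_normal(8)
real = rng.standard_normal((16, 4))        # stand-ins for real and generated batches
fake = rng.standard_normal((16, 4))

penalty = gradient_penalty(real, fake, W, v, rng)
print("gradient penalty:", penalty)
```

<p>
This term is added to the critic loss in place of weight clipping; it pulls the norm of the critic's gradient toward 1 at the sampled points instead of constraining the weights directly.
</p>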
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week5.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Ishaan Gulrajani sit in on the session!</p>
<p>Author: james</p>
<p>[Editor’s Note: We are especially proud of this one. James and his group went above and beyond the call of duty and made a guide from their class that we feel is especially superb for understanding their target paper. Moving forward, he has forced us to up our game because it will be hard to release a curriculum that is not as strong as this one. We highly recommend earnestly studying with this at hand.]</p>
<hr />
<p><b>Announcing the 2019 DFL Fellows</b> (2019-04-15, <a href="https://www.depthfirstlearning.com/2019/Announcing-DFL-Fellows">https://www.depthfirstlearning.com/2019/Announcing-DFL-Fellows</a>)</p>
<p>After we launched Depth First Learning last year, we wanted to keep the momentum
and continue outputting high-quality study guides for machine learning.
Subsequently, we launched the <a href="http://fellowship.depthfirstlearning.com">Depth First Learning Fellowship</a> with funding provided by <a href="https://www.janestreet.com/">Jane Street</a>.</p>
<p>We were blown away by the response. With over 100 applicants from 5 continents, we had a tremendously hard time selecting only four proposals. After speaking with many of the applicants, we could not be more thrilled with the groups we selected. See below for bios of the inaugural class, as well as the papers that their groups will be respectively learning.</p>
<p>What’s the process now, you ask? The fellows are hard at work constructing their curricula and will soon begin online classes. Participants will meet weekly to discuss and go beyond the material.</p>
<div class="welcome">
<b>We are looking for participants for these groups.
<br />If you’re interested, please let us know by filling out <a href="https://docs.google.com/forms/d/e/1FAIpQLSdNsXeJn0Osc1m5A_Rj7tTE3yzPINuL09xbaqFdHZGmUUBMqA/viewform">this form</a>.</b>
</div>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Steve Kroon - Stellenbosch (South Africa)</b></p>
<p><img src="/assets/kroon.png" style="width: 35%; padding-left: 20px; padding-bottom: 20px;" align="right" /></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1505.05770">Variational Inference with Normalizing Flows</a>”, by Rezende and Mohamed (ICML 2015)</p>
<p>Dr Steve Kroon obtained MCom (Computer Science) and PhD (Mathematical Statistics) degrees while studying at Stellenbosch University. He joined the Stellenbosch University Computer Science department in 2008. His PhD thesis considered aspects of statistical learning theory, and his subsequent research has focused on decision making in
artificial intelligence, including machine learning, reinforcement learning, and search techniques. He has supervised and co-supervised 5 graduated and 3 current master’s students, and has published 3 journal articles and 8 peer-reviewed conference and conference workshop
articles. He has served as a reviewer for the journals Algorithmica, the Journal of Universal Computer Science, and the South African Computer Journal, as well as on the programme committee for 2 conferences. He holds a Diploma in Actuarial Techniques, and is a member of the Centre for Artificial Intelligence Research, the Institute of Electrical and Electronics Engineers (IEEE) and the IEEE Computational Intelligence Society, the International Computer Games
Association, the South African Statistical Association, and the South African Institute for Computer Scientists and Information Technologists.</p>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Sandhya Prabhakaran - New York (USA)</b></p>
<p><img src="/assets/prabhakaran.jpeg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" /></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1801.10130">Spherical CNNs</a>” by Cohen, Geiger, Köhler and Welling (ICLR 2018)</p>
<p>Dr. Sandhya Prabhakaran is a Research Fellow at Memorial Sloan Kettering Cancer Centre, NYC. Before that she was a Research Scientist at Columbia University in the City of New York.</p>
<p>She received her Ph.D. from the Department of Mathematics and Computer Science, University of Basel (Switzerland) and her Masters in Intelligent Systems (Robotics) from School of Informatics, University of Edinburgh (Scotland). Her research deals with developing statistical theory and inference models, particularly to problems in Cancer Biology and Computer Vision.</p>
<p>Prior to academics, she was an Assembler programmer working with the Mainframe Operating System (z/OS) at IBM Software Laboratories, Bangalore and has developed Mainframe applications at UST Global, Thiruvananthapuram.</p>
<p>She is an avid hiker and distance runner and has completed 4 out of the 6 World Marathon Majors.</p>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Bhairav Mehta - Montreal (Canada)</b></p>
<p><img src="/assets/mehta.jpg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" /></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1608.04471">Stein Variational Gradient Descent</a>” by Liu and Wang (NIPS 2016)</p>
<p>After finishing my undergraduate studies at the University of Michigan, I migrated north to Montreal, where I’m now a graduate student at Mila. I work mostly on reinforcement learning and robotics, but continue to find that teaching is the most rewarding part of graduate (and undergraduate) studies. I’ve been serving as a tutor, TA, and now, GSI, for over a decade, and I’m incredibly excited by the opportunity to lead a DFL course online. In my free time, you can find me helping ducks waddle across the street at Duckietown, or building deep learning models for my nonprofit tackling core problems in K-12 education.</p>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Vinay Ramasesh, Piyush Patil, and Riley Edmunds - Berkeley (USA)</b></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1711.04735">Resurrecting the sigmoid in deep learning through dynamical isometry</a>” by Pennington, Schoenholz and Ganguli (NIPS 2017)</p>
<div style="overflow: hidden;">
<img src="/assets/ramasesh.jpg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" />
<p><b>Vinay:</b> I am finishing up a Ph. D. in physics at UC Berkeley, where I have worked on building and testing small quantum processors made from superconducting circuits. At Berkeley, I work in the Quantum Nanoelectronics Lab under the guidance of Dr. Irfan Siddiqi. My experience with machine learning comes from Berkeley's machine learning student group, ML@B, which I joined in 2017. Previously, I studied physics and electrical engineering at MIT, working in the group of Dr. Martin Zwierlein to build up an experiment to cool, trap, and image strongly-interacting atomic gases.
</p>
</div>
<p><br /></p>
<div style="overflow: hidden;">
<img src="/assets/patil.jpg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" />
<b>Piyush:</b> I graduated from UC Berkeley last May, where I studied electrical engineering and computer science and mathematics. While at Berkeley, I helped to get the university's student-run machine learning club, ML@B, up and running, serving as the vice president of projects during the last couple years. I was involved with research in quantum machine learning, adversarial examples, and natural language understanding. After graduating, I joined Nuro, a robotics startup working to build autonomous vehicles. Outside of ML, I enjoy reading philosophy, going hiking and backpacking, and spending time with friends.
</div>
<p><br /></p>
<div style="overflow: hidden;">
<img src="/assets/edmunds.png" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" />
<b>Riley:</b> I'm currently finishing up my undergrad degree in computer science at UC Berkeley. I was one of the early members of ML@B, where as vice president of research, I helped club members form teams to work on ML research projects. At UC Berkeley, I've worked under Prof. Dawn Song, Alice Agogino and Stella Yu. With a couple friends, in February 2018 I co-founded an ML consulting company, Alinea AI. You can find more on my background at rileyedmunds.com. In my spare time, I enjoy traveling, playing spikeball, and discussing thought-provoking books.
</div>
<p>Author: dfl</p>
<hr />
<p><b>The DFL Fellowship</b> (2018-12-05, <a href="https://www.depthfirstlearning.com/2018/DFL-Fellowship">https://www.depthfirstlearning.com/2018/DFL-Fellowship</a>)</p>
<p>When we began Depth First Learning during the Google AI Residency, we wanted to find a
better way to study and understand important machine learning papers and ideas.
We found that many papers often assumed a set of requisite knowledge, which
prevented us from deeply appreciating the contribution or novelty of the work.</p>
<p>To this end, we designed Depth First Learning, a pedagogy for diving deep by
carefully tailoring a curriculum around a particular ML paper or concept and
leading small, focused discussion groups. So far, we’ve created guides for
<a href="http://www.depthfirstlearning.com/2018/InfoGAN">InfoGAN</a>, <a href="http://www.depthfirstlearning.com/2018/TRPO">TRPO</a>, <a href="http://www.depthfirstlearning.com/2018/AlphaGoZero">AlphaGoZero</a>, and <a href="http://www.depthfirstlearning.com/2018/DeepStack">DeepStack</a>.</p>
<p>Since our launch, we’ve received very positive feedback from students and
researchers around the world. <strong>Now, we want to run new, online classes around the
world.</strong></p>
<p>We intimately understand that the process of curating a meaningful curriculum
with reading materials, practice problems, and instructive discussion points can
be very rewarding, but also time-consuming and difficult. We wanted to make sure
that the people compiling the content understood that their efforts were well
worth their time and consequently decided to launch a fellowship program.</p>
<p><strong>Thanks to the generosity of <a href="http://www.janestreet.com">Jane Street</a>, we will provide 4 fellows
with a $4000 grant each to build a 6-week curriculum and run weekly online discussions.</strong></p>
<p><del>
If you’d like to lead a class about an important paper in machine learning, please visit <a href="http://fellowship.depthfirstlearning.com">http://fellowship.depthfirstlearning.com</a> to apply. We look forward to hearing from you!
</del></p>
<p><b>Thanks for all of the applications! We received interest from an astounding 113 people, and we are now going over the list. If you applied, you should have received an email from us. Applications are now closed.</b></p>
<ul>
<li><a href="http://twitter.com/avitaloliver">Avital</a>, <a href="http://twitter.com/suryabhupa">Surya</a>,
<a href="http://twitter.com/kumarkagrawal">Kumar</a>, <a href="http://twitter.com/cinjoncin">Cinjon</a></li>
</ul>
<p>Author: dfl</p>
<hr />
<p><b>DeepStack</b> (2018-07-10, <a href="https://www.depthfirstlearning.com/2018/DeepStack">https://www.depthfirstlearning.com/2018/DeepStack</a>)</p>
<p>Thank you to Michael Bowling, Michael Johanson, and Marc Lanctot for contributions to this guide.</p>
<p>Additionally, this would not have been possible without the generous support of
Prof. Joan Bruna and his class at NYU, <a href="https://github.com/joanbruna/MathsDL-spring18">The Mathematics of Deep Learning</a>.
Special thanks to him, as well as Martin Arjovsky, my colleague in leading this
recitation, and my fellow students Ojas Deshpande, Anant Gupta, Xintian Han,
Sanyam Kapoor, Chen Li, Yixiang Luo, Chirag Maheshwari, Zsolt Pajor-Gyulai,
Roberta Raileanu, Ryan Saxe, and Liang Zhuo.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/deepstack-deps.svg" width="200"></iframe>
<div>Concepts used in DeepStack. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>Along with Libratus, DeepStack is one of two approaches to solving No-Limit
Texas Hold’em that debuted at around the same time. This game is notoriously difficult
to solve: its branching factor is just as large
as that of Go, and it is additionally a game of imperfect information.</p>
<p>The main idea behind both DeepStack and Libratus is to use Counterfactual Regret
Minimization (CFR) to find a mixed strategy that approximates a Nash Equilibrium
strategy. CFR’s convergence properties guarantee that we will obtain such a strategy
and the closer we are to it, the better our outcome will be. They differ in
their implementation. In particular, DeepStack uses deep neural networks
to approximate the counterfactual value of each hand at specific points in the
game. While still being mathematically tight, this lets it cut short
the necessary computation to reach convergence.</p>
<p>In this curriculum, you will explore the study of games with a tour through
game theory and counterfactual regret minimization while building up the
requisite understanding to tackle DeepStack. Along the way, you will learn
all of the necessary topics, including what is the
<a href="https://en.wikipedia.org/wiki/Branching_factor">branching factor</a>, all about
<a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibria</a>, and
<a href="https://www.quora.com/What-is-an-intuitive-explanation-of-counterfactual-regret-minimization">CFR</a>.</p>
<p><br /></p>
<h1 id="common-resources">Common Resources:</h1>
<ol>
<li>MAS: <a href="http://www.masfoundations.org/mas.pdf">Multi Agent Systems</a>.</li>
<li>LT: <a href="http://mlanctot.info/files/papers/PhD_Thesis_MarcLanctot.pdf">Marc Lanctot’s Thesis</a>.</li>
<li>ICRM: <a href="http://modelai.gettysburg.edu/2013/cfr/cfr.pdf">Introduction to Counterfactual Regret Minimization</a>.</li>
<li>PLG: <a href="http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf">Prediction, Learning, and Games</a>.</li>
</ol>
<p><br /></p>
<h1 id="1-normal-form-games--poker">1 Normal Form Games & Poker</h1>
<p><strong>Motivation</strong>: Most of Game Theory, as well as the particular techniques used in
DeepStack and Libratus, is built on the framework of Normal Form
Games. These are game descriptions and are familiarly represented as a matrix,
a famous example being the Prisoner’s Dilemma. In this section, we cover
the basics of Normal Form Games. In addition, we go over the rules of Poker and
why it has proved so difficult to solve.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>MAS: Sections 3.1 & 3.2.</li>
<li>LT: Pages 5-7.</li>
<li><a href="https://arxiv.org/pdf/1701.01724.pdf">The Game of Poker</a>: Supplementary #1 on pages 16-17.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.cmu.edu/~sandholm/solving%20games.aimag11.pdf">The State of Solving Large Incomplete-Information Games, and Application to Poker</a> (2010)</li>
<li><a href="https://www.youtube.com/watch?v=2dX0lwaQRX0">Why Poker is Difficult</a>
Very good video by Noam Brown, the main author of Libratus. The first eighteen
minutes are the most relevant.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>LT: Prove that in a zero-sum game, the Nash Equilibrium strategies are interchangeable.
<details><summary>Hint</summary>
<p>Use the definition of a Nash Equilibrium along with the fact that
\(\mu_{i}(\sigma_{i}, \sigma_{-i}) + \mu_{-i}(\sigma_{i}, \sigma_{-i}) = c\).
</p>
</details>
</li>
<li>LT: Prove that in a zero-sum game, the expected payoff to each player is the same for every equilibrium.
<details><summary>Solution</summary>
<p>We will solve both this problem and the one above here. Suppose that
\(\sigma = (\sigma_{i}, \sigma_{-i})\) and
\(\sigma' = (\sigma_{i}', \sigma_{-i}')\) are both
Nash Equilibria. Then:</p>
<p>\(
\begin{align}
\mu_{i}(\sigma_{i}, \sigma_{-i}) &\geq \mu_{i}(\sigma_{i}', \sigma_{-i}) \\
&= c - \mu_{-i}(\sigma_{i}', \sigma_{-i}) \\
&\geq c - \mu_{-i}(\sigma_{i}', \sigma_{-i}') \\
&= \mu_{i}(\sigma_{i}', \sigma_{-i}')
\end{align}
\)
</p>
<p>In a similar fashion, we can show that
\(\mu_{i}(\sigma_{i}', \sigma_{-i}') \geq \mu_{i}(\sigma_{i}, \sigma_{-i})\).
</p>
<p>Consequently, \(\mu_{i}(\sigma_{i}', \sigma_{-i}') = \mu_{i}(\sigma_{i}, \sigma_{-i})\),
so every inequality in the chain above holds with equality. This also implies
that the strategies are interchangeable, i.e.
\(\mu_{i}(\sigma_{i}', \sigma_{-i}') = \mu_{i}(\sigma_{i}', \sigma_{-i})\).</p>
</details>
</li>
<li>MAS: Prove Lemma 3.1.6. <br />
\(\textit{Lemma}\): If a preference relation \(\succeq\) satisfies the axioms of
completeness, transitivity, decomposability, and monotonicity, and if \(o_1 \succ o_2\)
and \(o_2 \succ o_3\), then there exists a probability \(p\) s.t. for all \(p' < p\),
\(o_2 \succ [p': o_1; (1 - p'): o_3]\) and for all \(p'' > p\),
\([p'': o_1; (1 - p''): o_3] \succ o_2.\)</li>
<li>MAS: Theorem 3.1.8 ensures that rational agents need only maximize the expectation
of single-dimensional utility functions. Prove this result as a good test of your
understanding. <br />
\(\textit{Theorem}\): If a preference relation \(\succeq\) satisfies the axioms of completeness,
transitivity, substitutability, decomposability, monotonicity, and continuity, then
there exists a function \(u: \mathbb{L} \mapsto [0, 1]\) with the properties that:
<ol>
<li>\(u(o_1) \geq u(o_2)\) iff \(o_1 \succeq o_2\).</li>
<li>\(u([p_1 : o_1, ..., p_k: o_k]) = \sum_{i=1}^k p_{i}u(o_i)\).</li>
</ol>
</li>
</ol>
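<p>As a concrete illustration of a Normal Form Game represented as a matrix, the
following Python sketch enumerates the pure-strategy Nash Equilibria of a two-player
game by checking every unilateral deviation. The payoff numbers for the Prisoner's
Dilemma and the names <code>PAYOFFS</code> and <code>pure_nash_equilibria</code> are
illustrative assumptions, not taken from MAS or LT.</p>

```python
from itertools import product

# Assumed payoff table for the Prisoner's Dilemma (illustrative numbers).
# Actions: 0 = cooperate, 1 = defect. Entries are (row payoff, column payoff).
PAYOFFS = {
    (0, 0): (-1, -1),
    (0, 1): (-4, 0),
    (1, 0): (0, -4),
    (1, 1): (-3, -3),
}


def pure_nash_equilibria(payoffs, n_actions=2):
    """Return pure action profiles where no player gains by deviating alone."""
    equilibria = []
    for a_row, a_col in product(range(n_actions), repeat=2):
        u_row, u_col = payoffs[(a_row, a_col)]
        # The row player considers every alternative action d against a_col.
        row_ok = all(payoffs[(d, a_col)][0] <= u_row for d in range(n_actions))
        # The column player considers every alternative action d against a_row.
        col_ok = all(payoffs[(a_row, d)][1] <= u_col for d in range(n_actions))
        if row_ok and col_ok:
            equilibria.append((a_row, a_col))
    return equilibria


print(pure_nash_equilibria(PAYOFFS))
```

<p>With these assumed payoffs, mutual defection, the profile <code>(1, 1)</code>, is the
only pure profile that survives the deviation check.</p>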
<p><br /></p>
<h1 id="2-optimality--equilibrium">2 Optimality & Equilibrium</h1>
<p><strong>Motivation</strong>: How do you reason about games? The best strategies in multi-agent
scenarios depend on the choices of others. Game theory deals with this problem
by identifying subsets of outcomes called solution concepts. In this section, we
discuss the fundamental solution concepts: Nash Equilibrium, Pareto Optimality,
and Correlated Equilibrium. For each solution concept, we cover what it implies
for a given game and how difficult it is to discover a representative strategy.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>MAS: Sections 3.3, 3.4.5, 3.4.7, 4.1, 4.2.4, 4.3, & 4.6.</li>
<li>LT: Section 2.1.1.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>MAS: Section 3.4.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Why must every game have a Pareto optimal strategy?
<details><summary>Solution</summary>
<p>Suppose that a game does not have a Pareto optimal outcome. Then, for every
outcome \(O\), there is another outcome \(O'\) that Pareto-dominates \(O\).
Say \(O_2 > O_1\). Because \(O_2\) is not Pareto optimal, there is some
\(O_k > O_2\). There cannot be a max in this chain (because that max would
be Pareto optimal), so with finitely many outcomes the chain must eventually
revisit an outcome and form a cycle. Because Pareto domination is transitive,
there then exists an outcome \(O_j\) s.t. \(O_j > O_j\), which is a
contradiction.
</p>
</details>
</li>
<li>Why must there always exist at least one Pareto optimal strategy in which
all players adopt pure strategies?</li>
<li>Why in common-payoff games do all Pareto optimal strategies have the same payoff?
<details><summary>Solution</summary>
<p>Say two strategies \(S\) and \(S'\) are Pareto optimal. Then neither
dominates the other, so either \(\forall i \; \mu_{i}(S) = \mu_{i}(S')\)
or there are two players \(i, j\) for which \(\mu_{i}(S) < \mu_{i}(S')\)
and \(\mu_{j}(S) > \mu_{j}(S')\). In the former case, we see that the
two strategies have the same payoff as desired. In the latter case, we have
a contradiction because \(\mu_{j}(S') = \mu_{i}(S') > \mu_{i}(S)
= \mu_{j}(S) > \mu_{j}(S')\). Thus, all of the Pareto optimal strategies
must have the same payoff.
</p>
</details>
</li>
<li>MAS: Why does definition 3.3.12 imply that the vertices of a simplex must
all receive different labels?
<details><summary>Solution</summary>
<p>This follows from the definitions of \(\mathbb{L}(v)\) and \(\chi(v)\).
At a vertex of the simplex, \(\chi(v)\) contains only a single value,
determined by the vertex itself. Consequently, \(\mathbb{L}(v)\) must
as well.
</p>
</details>
</li>
<li>MAS: Why in definition 3.4.12 does it not matter that the mapping is to
pure strategies rather than to mixed strategies?</li>
<li>Take your favorite normal-form game, find a Nash Equilibrium, and then find
a corresponding Correlated Equilibrium.</li>
</ol>
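<p>To make Pareto domination concrete, here is a small Python sketch that filters the
pure outcomes of a game down to those that no other pure outcome Pareto-dominates. The
Prisoner's Dilemma payoffs and the helper names <code>dominates</code> and
<code>pareto_optimal</code> are assumptions made for illustration, and the sketch only
considers pure outcomes.</p>

```python
# Assumed payoff table for the Prisoner's Dilemma (illustrative numbers).
# Actions: 0 = cooperate, 1 = defect. Entries are (row payoff, column payoff).
PAYOFFS = {
    (0, 0): (-1, -1),
    (0, 1): (-4, 0),
    (1, 0): (0, -4),
    (1, 1): (-3, -3),
}


def dominates(u, v):
    """True if payoff vector u Pareto-dominates v:
    nobody is worse off under u and somebody is strictly better off."""
    no_worse = all(a >= b for a, b in zip(u, v))
    strictly_better = any(a > b for a, b in zip(u, v))
    return no_worse and strictly_better


def pareto_optimal(payoffs):
    """Return the pure profiles whose payoffs no other pure profile dominates."""
    return sorted(
        profile
        for profile, u in payoffs.items()
        if not any(dominates(v, u) for v in payoffs.values())
    )


print(pareto_optimal(PAYOFFS))
```

<p>Under these assumed payoffs, every profile except mutual defection is Pareto optimal,
since mutual cooperation dominates it. A Nash Equilibrium therefore need not be Pareto
optimal.</p>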
<p><br /></p>
<h1 id="3-extensive-form-games">3 Extensive Form Games</h1>
<p><strong>Motivation</strong>: What happens when players don’t act simultaneously?
Extensive Form Games are an answer to this question. While this representation
of a game always has a comparable Normal Form, it’s much more natural to reason
about sequential games in this format. Examples include familiar ones like Go,
but also more exotic games like Magic: The Gathering and Civilization. This
section is imperative as Poker is best described as an Extensive Form Game.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>MAS: Sections 5.1.1 - 5.1.3.</li>
<li>MAS: Sections 5.2.1 - 5.2.3.</li>
<li><a href="http://martin.zinkevich.org/publications/ijcai2011_rgbr.pdf">Accelerating Best Response Calculation in Large Extensive Games</a>:
This is important for understanding how to evaluate Poker algorithms.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>LT: Section 2.1.2.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What is the intuition for why not all normal form games can be transformed
into perfect-information extensive-form games?
<details><summary>Solution</summary>
<p>The problem is one of modeling simultaneity. Perfect information
extensive form games have trouble modeling concurrent moves because they
have an explicit temporal structure of moves.
</p>
</details>
</li>
<li>Why does that change when the transformation is to imperfect-information extensive-form games?</li>
<li>How is the set of behavioral strategies different from the set of mixed strategies?
<details><summary>Solution</summary>
<p>Each mixed strategy is a single distribution over pure strategies.
Each behavioral strategy is a vector of distributions over actions, with one
distribution assigned independently at each Information Set.
</p>
</details>
</li>
<li>Succinctly describe the technique demonstrated in the Accelerating Best Response paper.</li>
</ol>
<p><br /></p>
<h1 id="4-counterfactual-regret-minimization-1">4 Counterfactual Regret Minimization #1</h1>
<p><strong>Motivation</strong>: Counterfactual Regret Minimization (CFR) is only a decade old
but has already achieved huge success as the foundation underlying DeepStack
and Libratus. In the first of two weeks dedicated to CFR, we learn how the
algorithm works practically and get our hands dirty coding up our implementation.</p>
<p>The optional readings are papers introducing CFR-D and CFR+, further
iterations upon CFR. These are both used in DeepStack.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>ICRM: Sections 2.1-2.4.</li>
<li>ICRM: Sections 3.1-3.4.</li>
<li>LT: Section 2.2.</li>
<li><a href="http://poker.cs.ualberta.ca/publications/NIPS07-cfr.pdf">Regret Minimization in Games with Incomplete Information</a>.</li>
</ol>
<p><strong>Optional Reading</strong>: These two papers are CFR extensions used in DeepStack.</p>
<ol>
<li><a href="https://pdfs.semanticscholar.org/8216/0cbdcbeb13d53db85da928d8c42a789fdd69.pdf">Solving Imperfect Information Games Using Decomposition</a>: CFR-D.</li>
<li><a href="https://arxiv.org/pdf/1407.5042.pdf">Solving Large Imperfect Information Games Using CFR+</a>: CFR+.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What is the difference between external regret, internal regret, swap regret,
and counterfactual regret?
<details><summary>Hint</summary>
<p>The definitions of the four are the following:</p>
<ul>
<li><b>External Regret</b>: How much the algorithm regrets not taking the best
single decision in hindsight. We compare to a policy that performs a single
action in all timesteps.</li>
<li><b>Internal Regret</b>: How much the algorithm regrets making one choice
over another in all instances. An example is whenever you bought Amazon stock,
you instead bought Microsoft stock.</li>
<li><b>Swap Regret</b>: Similar to Internal Regret but instead of one categorical
action being replaced wholesale with another categorical action, now we allow
for any number of categorical swaps.</li>
<li><b>Counterfactual Regret</b>: Assuming that your actions take you to a
node, this is the expectation of that node over your opponents' strategies.
The counterfactual component is that we assume you get to that node with a
probability of one.</li>
</ul>
</details>
</li>
<li>Why is Swap Regret important?
<details><summary>Hint</summary>
<p>Swap Regret is connected to Correlated Equilibrium. Can you see why?</p>
</details>
</li>
<li>Implement CFR (or CFR+ / CFR-D) in your favorite programming language to play
Leduc Poker or Liar’s Dice.</li>
<li>How do you know if you’ve implemented CFR correctly?
<details><summary>Solution</summary>
<p>One way is to test it by implementing Local Best Response. It should
perform admirably against that algorithm, which is meant to best it.</p>
</details>
</li>
</ol>
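<p>Before implementing full CFR, it can help to see regret matching, the per-decision
update that CFR applies at each information set, in isolation. The Python sketch below
runs regret matching for Rock-Paper-Scissors against a fixed opponent strategy. It is
an assumption-laden simplification: it uses expected regrets rather than sampled
actions, the opponent never adapts, and the names <code>PAYOFF</code>,
<code>strategy_from_regrets</code>, and <code>regret_matching</code> are hypothetical.</p>

```python
# Payoff to the learner: PAYOFF[my_action][opponent_action].
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def strategy_from_regrets(cumulative_regret):
    """Regret matching: play in proportion to positive cumulative regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No positive regret yet: fall back to the uniform strategy.
    return [1.0 / len(positive)] * len(positive)


def regret_matching(opponent_strategy, iterations=1000):
    """Return the learner's average strategy against a fixed opponent."""
    n = len(PAYOFF)
    cumulative_regret = [0.0] * n
    strategy_sum = [0.0] * n
    # Expected payoff of each pure action against the fixed opponent.
    action_value = [
        sum(PAYOFF[a][b] * opponent_strategy[b] for b in range(n))
        for a in range(n)
    ]
    for _ in range(iterations):
        strategy = strategy_from_regrets(cumulative_regret)
        expected = sum(strategy[a] * action_value[a] for a in range(n))
        for a in range(n):
            # Regret of not having played a instead of the current mix.
            cumulative_regret[a] += action_value[a] - expected
            strategy_sum[a] += strategy[a]
    return [s / iterations for s in strategy_sum]


average = regret_matching([0.4, 0.3, 0.3])
print(average)
```

<p>Against an opponent who overplays rock, the average strategy concentrates on paper,
the best response. In self-play CFR, both players run this update at every information
set, and it is the average strategy, not the current one, that approaches a Nash
Equilibrium.</p>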
<p><br /></p>
<h1 id="5-counterfactual-regret-minimization-2">5 Counterfactual Regret Minimization #2</h1>
<p><strong>Motivation</strong>: In the last section, we saw the practical side of CFR and how effective it
can be. In this section, we’ll understand the theory underlying it. This will culminate
with Blackwell’s Approachability Theorem, a generalization of repeated two-player
zero-sum games. This is a challenging session but the payoff will be a much
keener understanding of CFR’s strengths.</p>
<p><strong>Required</strong>:</p>
<ol>
<li>PLG: Sections 7.3 - 7.7, 7.9.</li>
</ol>
<p><strong>Optional</strong>:</p>
<ol>
<li><a href="http://wwwf.imperial.ac.uk/~dturaev/Hart0.pdf">A Simple Adaptive Procedure Leading to Correlated Equilibrium</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture11_2007.pdf">Prof. Johari’s 2007 Class - 11</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture13_2007.pdf">Prof. Johari’s 2007 Class - 13</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture14_2007.pdf">Prof. Johari’s 2007 Class - 14</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture15_2007.pdf">Prof. Johari’s 2007 Class - 15</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>
<p>PLG: Prove Lemma 7.1. <br />
\(\textit{Lemma}\): A probability distribution \(P\) over the set of all \(K\)-tuples
\(i = (i_{1}, ..., i_{K})\) of actions is a correlated equilibrium iff, for every
player \(k \in \{1, ..., K\}\) and actions \(j, j' \in \{1, ..., N_{k}\}\), we have</p>
\[\sum_{i: i_{k} = j} P(i)\big(\ell(i) - \ell(i^{-}, j')\big) \leq 0\]
<p>where \((i^{-}, j') = (i_{1}, ..., i_{k-1}, j', i_{k+1}, ..., i_{K})\).</p>
</li>
<li>
<p>It’s brushed over in the proof of Theorem 7.5 in PLG, but prove that if set
\(S\) is approachable, then every halfspace \(H\) containing \(S\) is approachable.</p>
<details><summary>Solution</summary>
<p>Because \(S \subseteq H\), the distance from any point to \(H\) is at most its
distance to \(S\). Since \(S\) is approachable, there is a strategy for player one s.t.
the necessary approachability clauses hold for \(S\) (see Johari's Lecture 13). That
same strategy drives the distance to \(H\) to zero, so \(H\) is approachable.</p>
</details>
</li>
</ol>
<p><br /></p>
<h1 id="6-deepstack">6 DeepStack</h1>
<p><strong>Motivation</strong>: Let’s read the paper! A summary of what’s going on to help with your
understanding:</p>
<p>DeepStack runs counterfactual regret minimization at every decision. However, it uses
two separate neural networks, one for after the flop and one for after the turn, to
estimate the counterfactual values without having to continue running CFR after those
moments. This approach is trained beforehand and helps greatly with cutting short the
search space at inference time. Each of the networks takes as input the size of the pot
and the current Bayesian ranges for each player across all hands. They output the
counterfactual values for each hand for each player.</p>
<p>In addition to DeepStack, we also include Libratus as required reading. This paper
highlights Game Theory and CFR as the really important concepts in this curriculum;
deep learning is not necessary to build a champion Poker bot.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36630/t/58b7a3dce3df28761dd25e54/1488430045412/DeepStack.pdf">DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker</a>.</li>
<li><a href="https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36630/t/58bed28de3df287015e43277/1488900766618/DeepStackSupplement.pdf">DeepStack Supplementary Materials</a>.</li>
<li><a href="https://arxiv.org/pdf/1705.02955.pdf">Libratus</a>.</li>
<li><a href="https://vimeo.com/212288252">Michael Bowling on DeepStack</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://github.com/lifrordi/DeepStack-Leduc">DeepStack Implementation for Leduc Hold’em</a>.</li>
<li><a href="https://www.youtube.com/watch?v=2dX0lwaQRX0">Noam Brown on Libratus</a>.</li>
<li><a href="https://arxiv.org/abs/1805.08195">Depth-Limited Solving for Imperfect-Information Games</a>: This paper is fascinating because it achieves a poker-playing bot almost as good as Libratus but using a fraction of the necessary computation and disk space.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What are the differences between the approaches taken in DeepStack and in Libratus?
<details><summary>Solution</summary>
<p>Here are some differences:</p>
<ul>
<li>A clear difference is that DeepStack uses a deep neural network to reduce the necessary search space, and Libratus does not.</li>
<li>DeepStack does not use any action abstraction and instead melds those considerations into the pot size input. Libratus does use a dense action abstraction but adapts it each game and additionally constructs new sub-games on the fly for actions not in its abstraction.</li>
<li>DeepStack uses card abstraction by first clustering the hands into 1000 buckets and then considering probabilities over that range. Libratus does not use any card abstraction preflop or on the flop, but does use it on later rounds such that the game's \(10^{61}\) decision points are reduced to \(10^{12}\).</li>
<li>DeepStack does not have a way to learn from recent games without further neural network training. On the other hand, Libratus improves via a background process that adds novel opponent actions to its action abstraction.</li>
</ul>
</details>
</li>
<li>Can you succinctly explain “Continual Re-solving”?</li>
<li>Can you succinctly explain AIVAT?</li>
</ol>
<p>Author: cinjon. Thank you to Michael Bowling, Michael Johanson, and Marc Lanctot for contributions to this guide.</p>
<p>AlphaGoZero (2018-06-27): https://www.depthfirstlearning.com/2018/AlphaGoZero</p>
<p>Thank you to Marc Lanctot, Hugo Larochelle, Katherine Lee, and Tim Lillicrap for contributions to this guide.</p>
<p>Additionally, this would not have been possible without the generous support of
Prof. Joan Bruna and his class at NYU, <a href="https://github.com/joanbruna/MathsDL-spring18">The Mathematics of Deep Learning</a>.
Special thanks to him, as well as Martin Arjovsky, my colleague in leading this
recitation, and my fellow students Ojas Deshpande, Anant Gupta, Xintian Han,
Sanyam Kapoor, Chen Li, Yixiang Luo, Chirag Maheshwari, Zsolt Pajor-Gyulai,
Roberta Raileanu, Ryan Saxe, and Liang Zhuo.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/ag0-deps.svg" width="200"></iframe>
<div>Concepts used in AlphaGoZero. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>AlphaGoZero made a big splash when it debuted and for good reason. The grand effort
was led by David Silver at DeepMind and was an extension of work that he started
during his PhD. The main idea is to solve the game of Go and the approach taken
is to use an algorithm called Monte Carlo Tree Search (MCTS). This algorithm acts as an expert guide to teach
a deep neural network how to approximate the value of each state. The convergence
properties of MCTS provide the neural network with a principled way to reduce the
search space.</p>
<p>In this curriculum, you will focus on the study of two-person zero-sum perfect
information games and develop understanding so that you can completely grok
AlphaGoZero.</p>
<p><br /></p>
<h1 id="common-resources">Common Resources:</h1>
<ol>
<li>Knuth: <a href="https://pdfs.semanticscholar.org/dce2/6118156e5bc287bca2465a62e75af39c7e85.pdf">An Analysis of Alpha-Beta Pruning</a></li>
<li>SB: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Sutton & Barto</a>.</li>
<li>Kun: <a href="https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/">Jeremy Kun: Optimism in the Face of Uncertainty</a>.</li>
<li>Vodopivec: <a href="https://pdfs.semanticscholar.org/3d78/317f8aaccaeb7851507f5256fdbc5d7a6b91.pdf">On Monte Carlo Tree Search and Reinforcement Learning</a>.</li>
</ol>
<p><br /></p>
<h1 id="1-minimax--alpha-beta-pruning">1 Minimax & Alpha Beta Pruning</h1>
<p><strong>Motivation</strong>: Minimax and Alpha-Beta Pruning are original ideas that blossomed
from the study of games starting in the 50s. To this day, they are components in
strong game-playing computer engines like Stockfish. In this class, we will go
over these foundations, learn from Prof. Knuth’s work analyzing their properties,
and prove that these algorithms are theoretically sound solutions to two-player
games.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Perfect Information Games.</li>
<li>Minimax.</li>
<li>Alpha-Beta Pruning.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="http://www.cs.cornell.edu/courses/cs4700/2019sp/lectures/Lecture9.pdf">Cornell Recitation on Minimax & AB Pruning</a>.</li>
<li>Knuth: Section 6 (Theorems 1&2, Corollaries 1&3).</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.cmu.edu/~arielpro/mfai_papers/lecture1.pdf">CMU’s Mathematical Foundations of AI Lecture 1</a>.</li>
<li>Knuth: Sections 1-3.</li>
<li><a href="https://www.chessprogramming.org/index.php?title=Minimax">Chess Programming on Minimax</a>.</li>
<li><a href="https://www.chessprogramming.org/Alpha-Beta">Chess Programming on AB Pruning</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Knuth: Show that AlphaBetaMin \(= G2(p, \alpha, \beta) = -F2(p, -\beta, -\alpha) = -\)AlphaBetaMax. (p. 300)
<details><summary>Solution</summary>
<p>If \(d = 0\), then \(F2(p, \alpha, \beta) = f(p)\) and \(G2(p, -\beta, -\alpha) = g(p) = -f(p)\)
as desired, where the last step follows from equation 2 on p. 295.
</p>
<p>Otherwise, \(d > 0\) and we proceed by induction on the height \(h\). The
base case of \(h = 0\) is trivial because then the tree is a single root and
consequently is the \(d = 0\) case. Assume it is true for height \(< h\),
then for \(p\) of height \(h\), we have that \(m = \alpha\) at the start of
\(F2(p, \alpha, \beta)\) and \(m\prime = -\alpha\) at the start of \(G2(p, -\beta, -\alpha)\). So
\(m = -m\prime\).
</p>
<p>In the i-th iteration of the loop, let's label the resulting value of \(m\)
as \(m_{n}\). We have that \(t = G2(p_{i}, m , \beta) = -F2(p_i, -\beta, -m) = -t\prime\)
by the inductive assumption. Then,
\(t > m \iff -t < -m \iff t\prime < m\prime\), in which case \(m_{n} = t = -t\prime = -m_{n}\prime\),
which means that every time there is an update to the value of \(m\), it will
be preserved across both functions. Further, because
\(m \geq \beta \iff -m \leq -\beta \iff m\prime \leq -\beta\), we have that \(G2\) and
\(F2\) will have the same stopping criterion. Together, these imply that
\(G2(p, \alpha, \beta) = -F2(p, -\beta, -\alpha)\) after each iteration of the
loop as desired.
</p>
</details>
</li>
<li>Knuth: For Theorem 1.1, why are the successor positions of type 2? (p. 305)
<details><summary>Solution</summary>
<p>By the definition of being type 1, \(p = a_{1} a_{2} \ldots a_{l}\), where
each \(a_{k} = 1\). Its successor positions \(p_{j} = p\,j\) with \(j > 1\) all have length
\(l + 1\) and their first term \(> 1\) is at position \(l+1\), the last entry.
Consequently, \((l+1) - (l+1) = 0\) is even and they are type 2.
</p>
</details>
</li>
<li>Knuth: For Theorem 1.2, why is it that p’s successor position is of type 3
if p is not terminal?
<details><summary>Solution</summary>
<p>If \(p\) is type 2 and size \(l\), then for \(j\) s.t. \(a_j\) is the first entry where
\(a_j > 1\), we have that \(l - j\) is even. When it's not terminal, then its
successor position \(p_1 = a_{1} \ldots a_{j} \ldots a_{l} 1\) has length
\(l + 1\), which implies that \(l + 1 - j\) is odd and so \(p_1\) is
type 3.
</p>
</details>
</li>
<li>Knuth: For Theorem 1.3, why is it that p’s successor positions are of type 2
if p is not terminal?
<details><summary>Hint</summary>
<p>This is similar to the above two.</p>
</details>
</li>
<li>Knuth: Show that the three inductive steps of Theorem 2 are correct.</li>
</ol>
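<p>The relationship between plain minimax and Alpha-Beta Pruning can be sketched in a few
lines of Python. The version below follows the negamax convention, in the spirit of
Knuth's \(F2\), on a game tree written as nested lists whose leaves are values from the
point of view of the player to move at that leaf. The tree, the
<code>visited</code> bookkeeping, and the function names are assumptions chosen for
illustration.</p>

```python
import math


def negamax(node):
    """Plain minimax in negamax form: a node's value is the best of the
    negated values of its children."""
    if isinstance(node, (int, float)):
        return node
    return max(-negamax(child) for child in node)


def alphabeta(node, alpha=-math.inf, beta=math.inf, visited=None):
    """Negamax with alpha-beta pruning. Leaves that get evaluated are
    appended to `visited` (if given) so pruning can be observed."""
    if isinstance(node, (int, float)):
        if visited is not None:
            visited.append(node)
        return node
    m = alpha
    for child in node:
        # The child is searched with the negated, swapped window.
        t = -alphabeta(child, -beta, -m, visited)
        if t > m:
            m = t
        if m >= beta:
            break  # cutoff: the opponent will never allow this line
    return m


tree = [[3, 5], [2, 9], [0, 7]]
seen = []
print(negamax(tree), alphabeta(tree, visited=seen), seen)
```

<p>With a full window, both functions return the same root value, while
<code>seen</code> shows that the pruned search skips the leaves 9 and 7 in this
particular tree.</p>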
<p><br /></p>
<h1 id="2-multi-armed-bandits--upper-confidence-bounds">2 Multi-Armed Bandits & Upper Confidence Bounds</h1>
<p><strong>Motivation</strong>: The multi-armed bandits problem is a framework for understanding
the exploitation vs exploration tradeoff. Upper Confidence Bounds, or UCB, is
an algorithmically tight approach to addressing that tradeoff under certain
constraints. Together, they are important components of how Monte Carlo Tree
Search (MCTS), a key aspect of AlphaGoZero, was originally formalized. For
example, in MCTS there is a notion of node selection where UCB is used extensively.
In this section, we will cover bandits and UCB.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Basics of reinforcement learning.</li>
<li>Multi-armed bandit algorithms and their bounds.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>SB: Sections 2.1 - 2.7.</li>
<li>Kun.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf">Original UCB1 Paper</a></li>
<li><a href="https://courses.cs.washington.edu/courses/cse599s/14sp/scribes/lecture15/lecture15_draft.pdf">UW Lecture Notes</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>SB: Exercises 2.3, 2.4, 2.6.</li>
<li>SB: What are the pros and cons of the optimistic initial values method? (Section 2.6)</li>
<li>Kun: In the proof for the expected cumulative regret of UCB1, why is \(\delta *T\)
a trivial regret bound if the deltas are all the same?
<details><summary>Solution</summary>
<p>\(
\begin{align}
\mathbb{E}[R_{A}(T)] &= \mu^{*}T - \mathbb{E}[G_{A}(T)] \\
&= \mu^{*}T - \sum_{i} \mu_{i}\mathbb{E}[P_{i}(T)] \\
&= \sum_{i} (\mu^{*} - \mu_{i})\mathbb{E}[P_{i}(T)] \\
&= \sum_{i} \delta_{i} \mathbb{E}[P_{i}(T)] \\
&\leq \delta \sum_{i} \mathbb{E}[P_{i}(T)] \\
&= \delta * T
\end{align}
\)
</p>
<p>The third line follows from \(\sum_{i} \mathbb{E}[P_{i}(T)] = T\) and the
fifth line from the definition of \(\delta\).
</p>
</details>
</li>
<li>Kun: Do you understand the argument for why the regret bound is \(O(\sqrt{KT\log(T)})\)?
<details><summary>Hint</summary>
<p>
What happens if you break the arms into those with regret \(< \sqrt{K(\log{T})/T}\)
and those with regret \(\geq \sqrt{K(\log{T})/T}\)? Can we use this to bound
the total regret?
</p>
</details>
</li>
<li>Reproduce the UCB1 algorithm in code with minimal supervision.</li>
</ol>
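<p>For readers who want something to compare against after attempting the last exercise
themselves, here is one possible minimal Python sketch of UCB1. It plays each arm once
and then picks the arm maximizing the empirical mean plus \(\sqrt{2 \ln t / n_i}\). The
function names, the <code>pull</code> callback interface, and the Bernoulli arm means are
assumptions for illustration, not part of Kun's post.</p>

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds.

    `pull(arm)` must return a reward in [0, 1]. Returns the per-arm pull counts.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: play every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def make_bernoulli_pull(means, seed=0):
    """Build a reward function for Bernoulli arms with the given means."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < means[arm] else 0.0


pull = make_bernoulli_pull([0.2, 0.5, 0.8])
print(ucb1(pull, n_arms=3, horizon=2000))
```

<p>The exploration bonus shrinks as an arm is pulled more often, so suboptimal arms are
still tried occasionally but at a rate that grows only logarithmically in the number of
rounds.</p>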
<p><br /></p>
<h1 id="3-policy--value-functions">3 Policy & Value Functions</h1>
<p><strong>Motivation</strong>: Policy and value functions are at the core of reinforcement
learning. The policy function gives the probabilities that our
policy assigns to each action. When we sample from these, we would like for
better actions to have higher probability. The value function is our estimate
of the strength of the current state. In AlphaGoZero, a single network
calculates both a value and a policy, then later updates its weights according
to how well the agent performs in the game.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Bellman equation.</li>
<li>Policy gradient.</li>
<li>On-policy / off-policy.</li>
<li>Policy iteration.</li>
<li>Value iteration.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Value Function:
<ol>
<li>SB: Sections 3.5, 3.6, 3.7.</li>
<li>SB: Sections 9.1, 9.2, 9.3*.</li>
</ol>
</li>
<li>Policy Function:
<ol>
<li>SB: Sections 4.1, 4.2, 4.3.</li>
<li>SB: Sections 13.1, 13.2*, 13.3, 13.4.</li>
</ol>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Sergey Levine: <a href="https://www.youtube.com/watch?v=tWNpiNzWuO8&feature=youtu.be">Berkeley Fall’17: Policy Gradients</a> → This is really good.</li>
<li>Sergey Levine: <a href="https://www.youtube.com/watch?v=k1vNh4rNYec&feature=youtu.be">Berkeley Fall’17: Value Functions</a> → This is really good.</li>
<li><a href="http://karpathy.github.io/2016/05/31/rl/">Karpathy does Pong</a>.</li>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf">David Silver on PG</a>.</li>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf">David Silver on Value</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Why does policy gradient have such high variance?</li>
<li>What is the difference between off-policy and on-policy?
<details><summary>Solution</summary>
<p>
On-policy algorithms learn from the current policy's action decisions.
Off-policy algorithms learn from another arbitrary policy's actions. Examples
of on-policy algorithms are SARSA and REINFORCE. An example of an
off-policy algorithm is Q-Learning.
</p>
</details>
</li>
<li>SB: Exercises 3.15, 3.16, 3.17, 3.23, 4.3.</li>
<li>SB: Exercise 4.5 - How would policy iteration be defined for action values?
Give a complete algorithm for computing \(q^{*}\), analogous to that on page 65
for computing \(v^{*}\).
<details><summary>Solution</summary>
<p> The solution follows the algorithm (page 65) for \(v^{*}\), with the following modifications:
<ol>
<li>Consider a randomly initialized Q(s, a) and a random policy \( \pi(s) \). </li>
<li><b> Policy Evaluation </b> : Update Q(s, a) \( \leftarrow \sum_{s'} P_{ss'}^{a} R_{ss'}^{a} + \gamma \sum_{s'} \sum_{a'} P_{ss'}^{a} Q^{\pi}(s', a') \pi(a' | s') \) <br />
Note that \( P_{ss'}^{a} \leftarrow P(s' |s, a) , R_{ss'}^{a} \leftarrow R(s, a, s').\)</li>
<li><b> Policy Improvement </b> : Update \( \pi(s) = \arg\max_{a} Q^{\pi}(s, a) \). If \(unstable\), go to step 2. Here, \( unstable \) means that \( \pi_{before\_update}(s) \neq \pi_{after\_update}(s) \) for some state \(s\).</li>
<li> \( q^{*} \leftarrow Q(s, a) \) </li>
</ol> </p>
</details>
</li>
<li>SB: Exercise 13.3 - Prove that the eligibility vector
\(\nabla_{\theta} \ln \pi (a | s, \theta) = x(s, a) - \sum_{b} \pi (b | s, \theta)x(s, b)\)
using the definitions and elementary calculus. Here, \(\pi (a | s, \theta)\) = softmax( \(\theta^{T}x(s, a)\) ).
<details><summary>Solution</summary>
<p align="center">
By definition, we have \( \pi( a| s, \theta) = \frac{e^{ \theta^{T}
\mathbf{x}( s, a) }}{ \sum_b e^{ \theta^{T}\mathbf{x}(s, b) }} \), where
\( \mathbf{x}(s, a) \) is the state-action feature representation. Consequently:
<br />
\(
\begin{align}
\nabla_{\theta} \ln \pi (a | s, \theta) &= \nabla_\theta \Big( \theta^{T}\mathbf{x}(s, a) - \ln \sum_b e^{ \theta^{T}\mathbf{x}(s, b) } \Big) \\
&= \mathbf{x}(s, a) - \sum_b \mathbf{x}(s, b) \frac{ e^{ \theta^{T}\mathbf{x}(s, b) } }{ \sum_{c} e^{ \theta^{T}\mathbf{x}(s, c) } } \\
&= \mathbf{x}(s, a) - \sum_{b} \pi (b | s, \theta)\mathbf{x}(s, b) \\
\end{align}
\)
</p>
</details>
</li>
</ol>
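<p>The eligibility vector identity from the last exercise can also be checked
numerically. The Python sketch below compares the closed form
\(x(s, a) - \sum_{b} \pi(b | s, \theta) x(s, b)\) against a central finite-difference
estimate of \(\nabla_{\theta} \ln \pi(a | s, \theta)\) for a linear softmax policy. The
feature vectors, parameter values, and function names are made up for illustration.</p>

```python
import math


def softmax_policy(theta, features):
    """pi(b | s, theta) for a linear softmax policy.
    `features[b]` is the feature vector x(s, b) of action b."""
    logits = [sum(t * x for t, x in zip(theta, xb)) for xb in features]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def eligibility(theta, features, a):
    """Closed form: x(s, a) - sum_b pi(b | s, theta) x(s, b)."""
    pi = softmax_policy(theta, features)
    dim = len(theta)
    expected = [
        sum(pi[b] * features[b][k] for b in range(len(features)))
        for k in range(dim)
    ]
    return [features[a][k] - expected[k] for k in range(dim)]


def numeric_grad_log_pi(theta, features, a, eps=1e-6):
    """Central finite-difference estimate of grad_theta ln pi(a | s, theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        lp_up = math.log(softmax_policy(up, features)[a])
        lp_down = math.log(softmax_policy(down, features)[a])
        grad.append((lp_up - lp_down) / (2.0 * eps))
    return grad


theta = [0.5, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(eligibility(theta, features, 0))
print(numeric_grad_log_pi(theta, features, 0))
```

<p>The two printed vectors should agree up to finite-difference error, which is a quick
sanity check on the derivation above.</p>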
<p><br /></p>
<h1 id="4-mcts--uct">4 MCTS & UCT</h1>
<p><strong>Motivation</strong>: Monte Carlo Tree Search (MCTS) forms the backbone of AlphaGoZero.
It is what lets the algorithm reliably explore and then home in on the best policy.
UCT (UCB for Trees) combines MCTS and UCB so that we get reliable convergence
guarantees. In this section, we will explore how MCTS works and how to make
it excel for our purposes in solving Go, a game with an enormous branching factor.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Conceptual understanding of Monte Carlo Tree Search.</li>
<li>Optimality of UCT.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>SB: Section 8.11</li>
<li><a href="https://gnunet.org/sites/default/files/Browne%20et%20al%20-%20A%20survey%20of%20MCTS%20methods.pdf">Browne</a>: Sections 2.2, 2.4, 3.1-3.5, 8.2-8.4.</li>
<li><a href="http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/1029/paper_thesis.pdf">Silver Thesis</a>: Sections 1.4.2 and 3.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://jhamrick.github.io/quals/planning%20and%20decision%20making/2015/12/16/Browne2012.html">Jess Hamrick on Browne</a>.</li>
<li><a href="https://hal.archives-ouvertes.fr/file/index/docid/116992/filename/CG2006.pdf">Original MCTS Paper</a>.</li>
<li><a href="http://ggp.stanford.edu/readings/uct.pdf">Original UCT Paper</a>.</li>
<li>Browne:
<ol>
<li>Section 4.8: MCTS applied to Stochastic or Imperfect Information Games.</li>
<li>Sections 7.2, 7.3, 7.5, 7.7: Applications of MCTS.</li>
</ol>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Can you detail each of the four parts of the MCTS algorithm?
<details><summary>Solution</summary>
<ol>
<li><b>Selection</b>: Select child node from the current node based on the tree policy.</li>
<li><b>Expansion</b>: Expand the child node based on the exploration / exploitation trade-off.</li>
<li><b>Simulation</b>: Simulate from the child node until termination or upon reaching a suitably small future reward (like from reward decay).</li>
<li><b>Backup</b>: Back up the reward along the path taken according to the tree policy.</li>
</ol>
</details>
</li>
<li>What happens to the information gained from the Tree Search after each run?
<details><summary>Solution</summary>
<p>We can reuse the accumulated statistics in subsequent runs. We could also
ignore those statistics and build a fresh tree from each subsequent root. Both
approaches are used in actual implementations.
</p>
</details>
</li>
<li>What characteristics of a domain would make MCTS a good algorithmic choice?
<details><summary>Solution</summary>
<p>
A few such characteristics are:
</p>
<ul>
<li>MCTS is aheuristic, meaning that it does not require any domain-specific
knowledge. Consequently, if it is difficult to produce game heuristics for
your target domain (e.g. Go), then it can perform much better than alternatives
like Minimax. And on the flip side, if you did have domain-specific knowledge,
MCTS can incorporate it and will improve dramatically.
</li>
<li>
If the target domain needs actions online, then MCTS is a good choice as all
values are always up to date. Go does not have this property but digital games
like in the <a href="http://ggp.stanford.edu/">General Game Playing</a> suite
may.
</li>
<li>
If the target domain's game tree is of a nontrivial size, then MCTS may be
a much better choice than other algorithms as it tends to build unbalanced
trees that explore the more promising routes rather than consider all routes.
</li>
<li>
If there is noise or delayed rewards in the target domain, then MCTS is a
good choice because it is robust to these effects which can gravely impact
other algorithms such as modern Deep Reinforcement Learning.
</li>
</ul>
</details>
</li>
<li>What are examples of domain knowledge default policies in Go?
<details><summary>Solution</summary>
<ul>
<li>Crazy Stone, an early program that won the 2006 9x9 Computer Go Olympiad,
used an
<a href="https://www.researchgate.net/figure/Examples-of-move-urgency_fig2_220174551">urgency</a>
heuristic value for each of the moves on the board.
</li>
<li>
MoGo, the first Go program to use UCT, bases its default policy on this
sequence:
<ol>
<li>Respond to ataris by playing a saving move at random.</li>
<li>If one of the eight intersections surrounding the last move matches a
simple pattern for cutting or <i>hane</i>, randomly play one.</li>
<li>If there are capturing moves, play one at random.</li>
<li>Play a random move.</li>
</ol>
</li>
<li>The second version of Crazy Stone used an algorithm trained on actual
game play to learn a library of strong patterns, and incorporated this
library into its default policy.
</li>
</ul>
</details>
</li>
<li>Why is UCT optimal? For a finite-horizon MDP with rewards scaled to lie in
\([0, 1]\), can you prove that the failure probability at the root converges
to zero at a polynomial rate in the number of games simulated?
<details><summary>Hint</summary>
<p>
Try using induction on \(D\), the horizon of the MDP. At \(D=1\), to what
result does this correspond?
</p>
</details>
<details><summary>Hint 2</summary>
<p>
Assume that the result holds for a horizon up to depth \(D - 1\) and
consider a tree of depth \(D\). We can keep the cumulative rewards bounded
in the interval by dividing by \(D\). Now can you show that the UCT payoff
sequences at the root satisfy the drift conditions, repeated below?
</p>
<ul>
<li>The payoffs are bounded - \(0 \leq X_{it} \leq 1\), where \(i\) is the
arm number and \(t\) is the time step.</li>
<li>The expected values of the averages, \(\overline{X_{in}} =
\frac{1}{n} \sum_{t=1}^{n} X_{it}\), converge.</li>
<li>Define \(\mu_{in} = \mathbb{E}[\overline{X_{in}}]\) and \(\mu_{i} = \lim_{n\to\infty} \mu_{in}\).
Then, for \(c_{t, s} = 2C_{p}\sqrt{\frac{\ln{t}}{s}}\), where \(C_p\) is a
suitable constant, both
\(\mathbb{P}(\overline{X_{is}} \geq \mu_{i} + c_{t, s}) \leq t^{-4}\) and
\(\mathbb{P}(\overline{X_{is}} \leq \mu_{i} - c_{t, s}) \leq t^{-4}\) hold.
</li>
</ul>
</details>
<details><summary>Solution</summary>
<p>
For a complete detail of the proof, see the original
<a href="http://ggp.stanford.edu/readings/uct.pdf">UCT</a> paper.
</p>
</details>
</li>
</ol>
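<p>The tree policy discussed in the questions above can be sketched in a few lines. This is a minimal illustration of UCB1 child selection with the bonus \(c_{t,s} = 2C_{p}\sqrt{\ln t / s}\), not the implementation from any of the readings. The function names, the <code>(total_reward, visits)</code> representation of a child, and the default \(C_p = 1/\sqrt{2}\) are assumptions made for the example.</p>

```python
import math


def ucb1_score(mean_reward, child_visits, parent_visits, c_p=1 / math.sqrt(2)):
    """Average reward of a child plus the exploration bonus 2*C_p*sqrt(ln t / s)."""
    if child_visits == 0:
        return math.inf  # assumption: unvisited children are always tried first
    bonus = 2 * c_p * math.sqrt(math.log(parent_visits) / child_visits)
    return mean_reward + bonus


def select_child(children, parent_visits, c_p=1 / math.sqrt(2)):
    """children is a list of (total_reward, visits); returns the index maximising UCB1."""
    scores = []
    for total_reward, visits in children:
        mean = total_reward / visits if visits else 0.0
        scores.append(ucb1_score(mean, visits, parent_visits, c_p))
    return max(range(len(children)), key=scores.__getitem__)
```

<p>With <code>c_p=0</code> the rule is purely greedy. Larger values push the search towards rarely visited children, which is what produces the unbalanced trees mentioned above.</p>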
<p><br /></p>
<h1 id="5-mcts--rl">5 MCTS & RL</h1>
<p><strong>Motivation</strong>: Up to this point, we have learned a lot about how games can be
solved and how reinforcement learning works on a foundational level. Before we
jump into the paper, one last foray contrasting and unifying the games vs
learning perspective is worthwhile for understanding the domain more fully. In
particular, we will focus on a paper from Vodopivec et al. After completing
this section, you should have an understanding of what research directions in
this field have been thoroughly explored and which still have open directions.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Integrating MCTS and reinforcement learning.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Vodopivec:
<ol>
<li>Sections 3.1-3.4: Connection between MCTS and RL.</li>
<li>Sections 4.1-4.3: Integrating MCTS and RL.</li>
</ol>
</li>
<li><a href="https://papers.nips.cc/paper/1292-why-did-td-gammon-work.pdf">Why did TD-Gammon Work?</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Vodopivec: Section 5: Survey of research inspired by both fields.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What are key differences between MCTS and RL?</li>
<li>UCT can be described in RL terms as follows: “The original UCT searches
identically as an offline on-policy every-visit MC control algorithm that uses
UCB1 as the policy.” What does each of these terms mean?
<details><summary>Solution</summary>
<ul>
<li>
UCT is trained on-policy, which means it improves the policy used to make the
action decisions, i.e. UCB1.
</li>
<li>
Offline means that we can't learn until after the episode is completed.
An alternative online algorithm would learn while the episode was running.
</li>
<li>
Every-visit versus first-visit decides if we are going to update a state for
every time it's accessed in an episode or just the first time. The original
UCT algorithm did every-visit. Subsequent versions relaxed this.
</li>
<li>
MC control means that we use Monte Carlo returns to evaluate the policy,
i.e. we use the average return observed from a state as the estimate of its
true value.
</li>
</ul>
</details>
</li>
<li>What is a Representation Policy? Give an example not described in the text.
<details><summary>Solution</summary>
<p>A Representation Policy defines the model of the state space (e.g. in
the form of a value function) and the boundary between memorized and
non-memorized parts of the space.
</p>
</details>
</li>
<li>What is a Control Policy? Give an example not described in the text.
<details><summary>Solution</summary>
<p>A Control Policy dictates what actions will be performed and (consequently)
which states will be visited. In MCTS, it includes the tree and default policies.
</p>
</details>
</li>
</ol>
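<p>To make the every-visit versus first-visit distinction from the questions above concrete, here is a small sketch of Monte Carlo value estimation from a single episode. The episode format (a list of <code>(state, reward)</code> pairs, where the reward is the one received after leaving that state) and the function name are assumptions made for this illustration.</p>

```python
def mc_value_estimates(episode, gamma=1.0, first_visit=False):
    """Average the returns observed from each state in one episode."""
    # Walk backwards so that each step's return is reward + gamma * (next return).
    per_step = []
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g
        per_step.append((state, g))
    per_step.reverse()

    returns = {}
    seen = set()
    for state, g in per_step:
        if first_visit and state in seen:
            continue  # first-visit: only the earliest occurrence counts
        seen.add(state)
        returns.setdefault(state, []).append(g)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}
```

<p>For the episode <code>[("A", 1.0), ("B", 0.0), ("A", 1.0)]</code> the returns from the two visits to <code>A</code> are 2 and 1, so the every-visit estimate averages them while the first-visit estimate keeps only the first.</p>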
<p><br /></p>
<h1 id="6-the-paper">6 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the paper! We have a deep understanding of the background,
so let’s delve into the apex result. Note that we don’t just focus on the final
AlphaGoZero paper but also explore a related paper written concurrently by
a team at UCL using Hex as the game of choice. Their algorithm is very similar
to the AlphaGoZero algorithm, and considering both in context is important for
gauging which aspects of this research were really the most important.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>MCTS learning and computational capacity.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://deepmind.com/documents/119/agz_unformatted_nature.pdf">Mastering the Game of Go Without Human Knowledge</a></li>
<li><a href="https://arxiv.org/pdf/1705.08439.pdf">Thinking Fast and Slow with Deep Learning and Tree Search</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning.pdf">Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning</a></li>
<li><a href="http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/1029/paper_thesis.pdf">Silver Thesis</a>: Section 4.6</li>
<li><a href="https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf">Mastering the game of Go with deep neural networks and tree search</a></li>
<li><a href="https://arxiv.org/abs/1712.01815">Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What were the differences between the two papers “Mastering the Game of Go
Without Human Knowledge” and “Thinking Fast and Slow with Deep Learning and Tree Search”?
<details><summary>Solution</summary>
<p>Some differences between the former (AG0) and the latter (ExIt) are:</p>
<ul>
<li>AG0 uses MC value estimates from the expert for the value network,
whereas ExIt uses estimates from the apprentice. This requires more computation
by AG0 but produces better estimates.</li>
<li>The losses are different. For the value network, AG0 uses an MSE loss
with L2 regularization and ExIt uses a cross-entropy loss with early stopping.
For the policy part, AG0 uses cross-entropy while ExIt uses a weighted
cross-entropy that takes into account how confident MCTS is in the action
based on the state count.</li>
<li>AG0 uses the value network to evaluate moves; ExIt uses RAVE and rollouts,
plus warm starts from the MCTS.</li>
<li>AG0 adds in Dirichlet noise to the prior probability at the root node.</li>
<li>AG0 elevates a new network as champion only when it's markedly better than
the prior champion; ExIt replaces the old network without verifying that
it is better.</li>
</ul>
</details>
</li>
<li>What was common to both of “Mastering the Game of Go Without Human Knowledge”
and “Thinking Fast and Slow with Deep Learning and Tree Search”?
<details><summary>Solution</summary>
<p>The most important commonality is that they both use MCTS as an expert
guide to help a neural network learn through self-play.</p>
</details>
</li>
<li>Will the system get stuck if the current neural network can’t beat the previous ones?
<details><summary>Solution</summary>
<p>No. The algorithm won’t accept a policy that is worse than the current best,
and MCTS’s convergence properties imply that it will eventually tend towards
the equilibrium solution in a zero-sum two-player game.
</p>
</details>
</li>
<li>Why include both a policy and a value head in these algorithms? Why not just use policy?
<details><summary>Solution</summary>
<p>Value networks reduce the required search depth. This helps tremendously
because a rollout approach without the value network is inaccurate and spends
too much time on sub-optimal directions.
</p>
</details>
</li>
</ol>
cinjon
Thank you to Marc Lanctot, Hugo Larochelle, Katherine Lee, and Tim Lillicrap for contributions to this guide.
Trust Region Policy Optimization
2018-06-19T16:00:00+00:00
https://www.depthfirstlearning.com/2018/TRPO
<p>Thank you to Nic Ford, Ethan Holly, Matthew Johnson, Avital Oliver, John Schulman, George Tucker, and Charles Weill for contributing to this guide.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/trpo-deps.svg" width="200"></iframe>
<div>Concepts used in TRPO. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>TRPO is a scalable algorithm for optimizing policies in reinforcement learning by
gradient descent. Model-free algorithms such as policy gradient methods do not
require access to a model of the environment and often enjoy better
practical stability. However, while straightforward to apply to new
problems, they have trouble scaling to large, nonlinear policies. TRPO couples
insights from reinforcement learning and optimization theory to develop an
algorithm which, under certain assumptions, provides guarantees for monotonic
improvement. It is now commonly used as a strong baseline when developing new
algorithms.</p>
<p><br /></p>
<h1 id="1-policy-gradient">1 Policy Gradient</h1>
<p><strong>Motivation</strong>: Policy gradient methods (e.g. TRPO) are a class
of algorithms that allow us to directly optimize the parameters of a policy by
gradient descent. In this section, we formalize the notion of Markov Decision Processes (MDP),
action and state spaces, and on-policy vs off-policy approaches. This leads to the
REINFORCE algorithm, the simplest instantiation of the policy gradient method.</p>
<p><a href="https://drive.google.com/file/d/1KFQ-NvcYHL0Pi9TUM96iTEGZzt9GLffO/view?usp=sharing" class="colab-root">Reproduce in a <span>Notebook</span></a></p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Markov Decision Processes.</li>
<li>Continuous action spaces.</li>
<li>On-policy and off-policy algorithms.</li>
<li>REINFORCE / likelihood ratio methods.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li>Deep RL Course at UC Berkeley (CS 294); Policy Gradient Lecture
<ol>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_4_policy_gradient.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=tWNpiNzWuO8&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=4">Video</a></li>
</ol>
</li>
<li>David Silver’s course at UCL; Policy Gradient Lecture
<ol>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=KHZVXao4qXs&index=7&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT">Video</a></li>
</ol>
</li>
<li>Reinforcement Learning by Sutton and Barto, 2nd Edition; pages 265 - 273</li>
<li><a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Simple statistical gradient-following algorithms for connectionist reinforcement learning</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="http://rl-gym-doc.s3-website-us-west-2.amazonaws.com/mlss/2016-MLSS-RL.pdf">John Schulman introduction at MLSS Cadiz</a></li>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcoursesp17/docs/lec6.pdf">Lecture on Variance Reduction for Policy Gradient</a></li>
<li><a href="http://karpathy.github.io/2016/05/31/rl/">Introduction to policy gradient and motivations by Andrej Karpathy</a></li>
<li><a href="https://papers.nips.cc/paper/3922-on-a-connection-between-importance-sampling-and-the-likelihood-ratio-policy-gradient.pdf">Connection Between Importance Sampling and Likelihood Ratio</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>At its core, REINFORCE computes an
approximate gradient of the reward with respect to the parameters. Why
can’t we just use the familiar stochastic gradient descent?
<details><summary>Hint</summary>
<p>
Just as in reinforcement learning, when we use stochastic gradient descent, we also compute an estimate of the gradient, which looks something like $$\nabla_{\theta} \mathbb{E}_{x \sim p(x)} [ f_{\theta}(x) ] = \mathbb{E}_{x \sim p(x)} [\nabla_{\theta} f_{\theta}(x)]$$ where we move the gradient into the expectation (an operation made precise by the <a href="">Leibniz integral rule</a>). This works because the sampling distribution \(p(x)\) does not depend on \(\theta\) and \(f_{\theta}\) is differentiable with respect to the parameters \(\theta\). In reinforcement learning, our objective is non-differentiable -- we actually <b>select</b> an action and act on it, and the distribution we sample from depends on \(\theta\). To convince yourself that this isn't something we can differentiate through, write out explicitly the full expansion of the training objective for policy gradient before we move the gradient into the expectation. Is sampling an action really non-differentiable? (spoiler: yes, but we can work around it in various ways, such as using REINFORCE or <a href="https://arxiv.org/abs/1611.01144">other methods</a>).
</p>
</details>
</li>
<li>Does the REINFORCE gradient estimator resemble maximum likelihood estimation (MLE)?
Why or why not?
<details><summary>Solution</summary>
<p>
The term \( \log \pi (a | s) \) should look like a very familiar tool in
statistical learning: the log-likelihood! When we do MLE, we try to maximize
the log-likelihood \( \log p(D | \theta) \) or, as in supervised learning,
we try to maximize $$\log p(y_i^* | x_i, \theta).$$
Normally, because we have the true label \( y_i^* \), this paradigm aligns
perfectly with what we are ultimately trying to do with MLE. However, this
naive strategy of maximizing the likelihood \( \pi(a | s) \) won't work in
reinforcement learning, because we do not have a label for the correct action
to be taken at a given time step (if we did, we should just do supervised
learning!). If we tried doing this, we would find that we would simply
maximize the probability of every action; make sure you convince yourself
that this is true. Instead, the only (imperfect) evidence we have of good or
bad actions is the reward we receive at that time step. Thus, a reasonable
thing to do is to scale the log-likelihood by how good or bad the
action was, as measured by the reward. We would then maximize $$r(a, s) \log \pi (a | s).$$
Look familiar? Taking the gradient gives exactly the REINFORCE term in our
expectation: $$ \mathbb{E}_{s,a} [ \nabla \log \pi (a | s) \, r(a, s) ] $$
</p>
</details>
</li>
<li>In its original formulation, REINFORCE is an on-policy algorithm. Why?
Can we make REINFORCE work off-policy as well?
<details><summary>Solution</summary>
<p>
We can tell that REINFORCE is on-policy by looking at the expectation a bit
more closely: $$ \mathbb{E}_{s,a} [ \nabla \log \pi (a | s) \, r(a, s) ]. $$ When we
see any expectation in an equation, we should always ask what exactly the
expectation is <b>over</b>. In this case, if we expand the expectation, we
have $$\mathbb{E}_{s \sim p_{\theta}(s), a \sim \pi_{\theta}(a|s)}
[ \nabla_{\theta} \log \pi_{\theta} (a | s) \, r(a, s) ],$$ and we see that
the states are sampled from the empirical state visitation
distribution induced by the current policy, and the actions \( a \)
come directly from the current policy. Because we learn from the current
policy, and not some arbitrary policy, REINFORCE is an on-policy algorithm. To change
REINFORCE to use off-policy data, we simply change the sampling distribution to some
other policy \( \pi_{\beta} \) and use importance sampling to correct for
this disparity. For more details, see
<a href="https://scholarworks.umass.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1079&context=cs_faculty_pubs">a classic paper</a>
on this subject and <a href="https://arxiv.org/abs/1606.02647">a recent paper</a>
with new insights on off-policy learning with policy gradient methods.
</p>
</details>
</li>
<li>Do policy gradient methods work for discrete and continuous action spaces?
If not, why not?</li>
</ol>
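<p>The likelihood ratio estimator discussed in the questions above can be checked numerically on a toy problem. The sketch below assumes a bandit (a single state) with a softmax policy over logits \(\theta\), for which \(\nabla_{\theta} \log \pi(a) = e_a - \pi\). The function names and the bandit setup are illustrative assumptions, not code from the readings or the linked notebook.</p>

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate of E_a[ grad log pi(a) * r(a) ] for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]  # sample on-policy
        for k in range(len(logits)):
            score = (1.0 if k == a else 0.0) - probs[k]  # d log pi(a) / d theta_k
            grad[k] += rewards[a] * score / num_samples
    return grad


def exact_gradient(logits, rewards):
    """Closed-form gradient of the expected reward: pi_k * (r_k - E[r])."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]
```

<p>Comparing <code>reinforce_gradient</code> with <code>exact_gradient</code> for a growing number of samples shows the estimator converging to the true gradient, even though no derivative of the reward itself is ever taken.</p>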
<p><br /></p>
<h1 id="2-variance-reduction-and-advantage-estimate">2 Variance Reduction and Advantage Estimate</h1>
<p><strong>Motivation</strong>: One major shortcoming of policy gradient methods is that the
simplest instantiation of REINFORCE suffers from high variance in the gradients
it computes. This results from the fact that rewards are sparse, we only visit a finite
set of states, and that we only take one action at each state rather than try all actions.
In order to properly scale our methods to harder problems, we need to reduce this variance.
In this section, we study common tools for reducing variance for REINFORCE. These include
a causality result, baselines, and advantages. Note that the TRPO paper does not introduce
new methods for variance reduction, but we cover them here for complete understanding.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Causality for REINFORCE.</li>
<li>Baselines and control variates.</li>
<li>Advantage estimation.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li>Deep RL Course at UC Berkeley (CS 294); Actor-Critic Methods Lecture
<ol>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_5_actor_critic_pdf.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=PpVhtJn-iZI&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=5">Video</a></li>
</ol>
</li>
<li><a href="/assets/gjt-var-red-notes.pdf">George Tucker’s notes on Variance Reduction</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li>Reinforcement Learning by Sutton and Barto, 2nd Edition; pages 273 - 275</li>
<li><a href="https://arxiv.org/abs/1506.02438">High-dimensional continuous control using generalized advantage estimation</a></li>
<li><a href="https://arxiv.org/abs/1602.01783">Asynchronous Methods for Deep Reinforcement Learning</a></li>
<li><a href="https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf">Monte Carlo theory, methods, and examples by Art B. Owen; Chapter 8</a>
(in-depth treatment of variance reduction; suitable for independent study)</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What is the intuition for using advantages instead of rewards as the
learning signal? Note on terminology: the learning signal is the factor by
which we multiply \(\log \pi (a | s)\) inside the expectation in REINFORCE.</li>
<li>What are some assumptions we make by using baselines as a variance
reduction method?</li>
<li>What are other methods of variance reduction?
<details><summary>Solution</summary>
<p>
Check out the optional reading <a href="https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf">Monte Carlo theory, methods, and examples by Art B. Owen; Chapter 8</a> if you're interested! Broadly speaking, other techniques for doing variance reduction for Monte Carlo integration include stratified sampling, antithetic sampling, common random numbers, and conditioning.
</p>
</details>
</li>
<li>The theory of control variates tells us that our control variate should
be correlated with the quantity we are trying to lower the variance of.
Can we construct a better control variate that is even more correlated
than a learned state-dependent value function? Why or why not?
<details><summary>Hint</summary>
<p>
Right now, the typical control variate \( b(s) \) depends only on the state. Can we also have the control variate depend on the action? What extra work do we have to do to make sure this is okay? Check <a href="https://arxiv.org/abs/1611.02247">this paper</a> if you're interested in one way to extend this, and <a href="https://arxiv.org/abs/1802.10031">this paper</a> if you're interested in why adding dependence on more than just the state can be tricky and hard to implement in practice.
</p>
</details>
</li>
<li>We use control variates as a method to reduce variance in our gradient
estimate. Why don’t we use these for supervised learning problems such as
classification? Are we implicitly using them?
<details><summary>Solution</summary>
<p>
Reducing variance in our gradient estimates seems like an important thing
to do, but we don't often see explicit variance reduction methods when we
do supervised learning. However, there is a line of work around
<b>stochastic variance reduced gradient</b> descent called
<a href="https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf">SVRG</a>
that tries to construct gradient estimators with reduced variance. See
<a href="http://ranger.uta.edu/~heng/CSE6389_15_slides/SGD2.pdf">these
slides</a> and <a href="https://arxiv.org/abs/1202.6258">these</a>
<a href="https://arxiv.org/abs/1209.1873">papers</a> for more on this topic.
</p>
<p>
The reason that we don't often see these being used in the supervised
learning setting is that we're not necessarily looking to reduce the variance
of SGD and smoothly converge to a minimum. This is
because we're actually interested in looking for minima that have low
<b>generalization error</b> and don't want to overfit to solutions with very
small training error. In fact, we often rely on the noise introduced by
using minibatches in SGD to help us to escape premature minima.
On the other hand, in reinforcement learning, the variance of our gradient
estimates is so high that it's often the foremost problem.
</p>
<p>
Beyond supervised learning, control variates are used often in Monte Carlo
integration, which is ubiquitous throughout Bayesian methods. They are also
used for problems in hard attention, discrete latent random variables, and
general stochastic computation graphs.
</p>
</details>
</li>
</ol>
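<p>The effect of a baseline can be seen exactly on a tiny example. The sketch below assumes a bandit with a softmax policy and enumerates all actions to compute the mean and variance of the first component of the single-sample estimator \((r(a) - b)\,\nabla_{\theta} \log \pi(a)\). The function name and the setup are hypothetical, chosen only to illustrate the idea.</p>

```python
def gradient_moments(probs, rewards, baseline=0.0):
    """Exact mean and variance of the first component of the one-sample estimator."""
    samples = []
    for a, (p, r) in enumerate(zip(probs, rewards)):
        score0 = (1.0 if a == 0 else 0.0) - probs[0]  # d log pi(a) / d theta_0
        samples.append((p, (r - baseline) * score0))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var
```

<p>With <code>probs = [0.5, 0.5]</code> and <code>rewards = [11.0, 10.0]</code>, subtracting the baseline <code>10.5</code> (the expected reward) leaves the mean of the estimator unchanged while its variance drops sharply. This is the intuition behind using advantages rather than raw rewards as the learning signal.</p>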
<p><br /></p>
<h1 id="3-fisher-information-matrix-and-natural-gradient-descent">3 Fisher Information Matrix and Natural Gradient Descent</h1>
<p><img src="/assets/fisher-steepest.png" /></p>
<p><strong>Motivation</strong>: While gradient descent is able to solve many optimization problems,
it suffers from a basic problem - performance is dependent on the model’s parameterization.
Natural gradient descent, on the other hand, is invariant to model parameterization.
This is achieved by multiplying gradient vectors by the inverse of the Fisher
information matrix, which is a measure of how much model predictions change with
local parameter changes.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Fisher information matrix.</li>
<li>Natural gradient descent.</li>
<li>(Optional) K-Fac.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="/assets/k-fac-tutorial.pdf">Matt Johnson’s Natural Gradient Descent and K-Fac Tutorial</a>: Sections 1-7, Section A, B</li>
<li><a href="https://arxiv.org/pdf/1412.1193.pdf">New insights and perspectives on the natural gradient method</a>: Sections 1-11.</li>
<li><a href="https://web.archive.org/web/20170807004738/https://hips.seas.harvard.edu/blog/2013/04/08/fisher-information/">Fisher Information Matrix</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="http://ipvs.informatik.uni-stuttgart.de/mlr/wp-content/uploads/2015/01/mathematics_for_intelligent_systems_lecture12_notes_I.pdf">8-page intro to natural gradients</a></li>
<li><a href="http://www.yaroslavvb.com/papers/amari-why.pdf">Why Natural Gradient Descent / Amari and Douglas</a></li>
<li><a href="https://personalrobotics.ri.cmu.edu/files/courses/papers/Amari1998a.pdf">Natural Gradient Works Efficiently in Learning / Amari</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Consider classifiers \(p(y | x; \theta_{1})\) and \(p(y | x; \theta_{2})\),
such that \(l_2(\theta_1, \theta_2)\) is large, where \(l_2\) indicates the
Euclidean distance metric. Does this imply the difference in accuracy of the
classifiers is high?
<details><summary>Solution</summary>
The accuracy of the classifier depends on the function defined by
\(p(y|x;\theta) \). The distance between the parameters does not inform us
about distance between the two functions. Hence, we cannot draw any conclusions
about the difference in accuracy of the classifiers.
</details>
</li>
<li>How is the Fisher matrix similar to and different from the Hessian?</li>
<li>How does natural gradient descent compare to Newton’s method?</li>
<li>Why is the natural gradient slow to compute?</li>
<li>How can one efficiently compute the product of the Fisher information matrix with an arbitrary vector?</li>
</ol>
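<p>As a small concrete case of the ideas above, consider a categorical distribution parameterized by softmax logits. Its Fisher information matrix is \(F = \mathrm{diag}(\pi) - \pi\pi^{\top}\), so the product \(Fv\) can be computed in linear time without ever forming \(F\). The sketch below covers only this special case, with hypothetical function names. The general technique for neural networks is the subject of the readings.</p>

```python
def fisher_matrix(probs):
    """Dense Fisher matrix of a softmax-parameterized categorical: diag(pi) - pi pi^T."""
    n = len(probs)
    return [
        [(probs[i] if i == j else 0.0) - probs[i] * probs[j] for j in range(n)]
        for i in range(n)
    ]


def fisher_vector_product(probs, v):
    """Computes F v in O(n) time and memory, without building the n x n matrix."""
    dot = sum(p * x for p, x in zip(probs, v))  # pi . v
    return [p * (x - dot) for p, x in zip(probs, v)]
```

<p>The dense version needs \(n^2\) memory while the matrix-free version needs only \(n\). That gap is what makes iterative solvers based on matrix-vector products attractive.</p>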
<p><br /></p>
<h1 id="4-conjugate-gradient">4 Conjugate Gradient</h1>
<p><strong>Motivation</strong>: The conjugate gradient method (CG) is an iterative algorithm for finding
approximate solutions to \(Ax=b\), where \(A\) is a symmetric and positive-definite matrix (such
as the Fisher information matrix). The method works by iteratively computing matrix-vector
products \(Ax_i\) and is particularly well-suited for matrices with computationally
tractable matrix-vector products.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Solving systems of linear equations.</li>
<li>Efficiently computing matrix-vector products.</li>
<li>Computational complexity of second-order optimization methods.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf">An Introduction to the Conjugate Gradient Method Without the Agonizing Pain</a>: Sections 7-9</li>
<li><a href="https://ee227c.github.io/notes/ee227c-notes.pdf">Convex Optimization and
Approximation</a>, UC Berkeley, Section 7.4</li>
<li>Convex Optimization II by Stephen Boyd:
<ol>
<li><a href="https://www.youtube.com/watch?feature=player_embedded&v=cHVpwyYU_LY#t=2230">Lecture 12, from 37:10 to 1:05:00</a></li>
<li><a href="https://www.youtube.com/watch?feature=player_embedded&v=E4gl91l0l40#t=1266">Lecture 13, from 21:20 to 29:30</a></li>
</ol>
</li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li>Numerical Optimization by Nocedal and Wright; Section 5.1, pages 101-120</li>
<li><a href="https://metacademy.org/graphs/concepts/conjugate_gradient">Metacademy</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Remember that the natural gradient of a model is the inverse of the Fisher information matrix of that model times the
vanilla gradient (\(F^{-1}g\)). How does CG allow us to approximate the natural gradient?
<ol>
<li>What is the naive way to compute \(F^{-1}g\)? How much memory and time would it take?
<details><summary>Solution</summary>
Assume we have an estimate of \(F\) (by the process
in section 7 of <a href="/assets/k-fac-tutorial.pdf">Matt Johnson's tutorial</a>).
Storing \(F\) would take space proportional to \(n^2\), and inverting \(F\) would take
time proportional to \(n^3\) (or slightly lower with the Strassen algorithm).
</details>
</li>
<li>How long would a run of CG to <em>exactly</em> compute \(F^{-1}g\) take? How does that compare
to the naive process?
<details><summary>Solution</summary>
Each iteration of CG computes \(Fv\) for some vector \(v\), which would take time
proportional to \(n^2\). CG converges to the true answer after \(n\) steps, so in total
it would take time proportional to \(n^3\). This process ends up being slower than directly inverting
the Fisher naively and uses the same amount of memory.
</details>
</li>
<li>How can we use CG and bring down the time and memory to compute the natural gradient \(F^{-1}g\)?
<details><summary>Solution</summary>
<ol>
<li>
Use the closed-form estimate of \(Fv\) for arbitrary \(v\), as described in section A of <a href="/assets/k-fac-tutorial.pdf">Matt Johnson's tutorial</a>.
</li>
<li>
Take fewer CG iteration steps, which leads to an approximation of the natural gradient that may be sufficient.
</li>
</ol>
</details>
</li>
</ol>
</li>
<li>In pre-conditioned conjugate gradient, how does scaling the pre-conditioner
matrix \(M\) by a constant \(c\) impact the convergence?</li>
<li>Exercises 5.1 to 5.10 in Chapter 5, Numerical Optimization
(<b>Exercises 5.2 and 5.9 are particularly recommended.</b>)</li>
</ol>
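<p>The point of the questions above is that CG only ever touches the matrix through products \(Av\). A minimal sketch of the standard algorithm, written against a hypothetical <code>matvec</code> callback and plain Python lists, might look as follows. The iteration cap and tolerance are arbitrary illustrative defaults.</p>

```python
def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A, given v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs_old < tol:
            break  # residual is small enough
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

<p>Passing a matrix-free Fisher-vector product as <code>matvec</code> and the vanilla gradient as <code>b</code>, with a small value of <code>iters</code>, gives an approximation to \(F^{-1}g\) without storing or inverting \(F\).</p>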
<p><br /></p>
<h1 id="5-trust-region-methods">5 Trust Region Methods</h1>
<p><strong>Motivation</strong>: Trust region methods are a class of methods used in general
optimization problems to constrain the update size. While
TRPO does not use the full gamut of tools from the trust region literature,
studying them provides good intuition for the problem that TRPO
addresses and how we might improve the algorithm even more. In this
section, we focus on understanding trust regions and line search methods.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Trust regions and subproblems.</li>
<li>Line search methods.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods">A friendly introduction to Trust Region Methods</a></li>
<li>Numerical Optimization by Nocedal and Wright: Chapter 2, Chapter 4, Section 4.1, 4.2</li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li>Numerical Optimization by Nocedal and Wright: Chapter 4, Section 4.3</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Instead of directly imposing constraints on the updates, what
alternatives could force an algorithm to make bounded updates?
<details><summary>Hint</summary>
<p>
Recall the method of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>. How does this method move between two types of optimization problems?
</p>
</details>
</li>
<li>Each step of a trust region optimization method updates parameters to
the optimal setting given some constraint. Can we solve this in closed
form using Lagrange multipliers? In what way would this be similar, or
different, from the trust region methods we just discussed?</li>
<li>Exercises 4.1 to 4.10 in Chapter 4, Numerical Optimization.
(<b>Exercise 4.10 is particularly recommended</b>)</li>
</ol>
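<p>As a small illustration of the trust region subproblem from the readings, the sketch below computes the Cauchy point (Nocedal and Wright, Section 4.1): the minimizer of the quadratic model along the steepest descent direction, clipped to the trust region. The function name <code>cauchy_point</code> and the toy inputs are illustrative assumptions. TRPO itself does not use this step; it is only meant to build intuition for how the radius bounds an update.</p>

```python
import math

def cauchy_point(g, B, delta):
    """Cauchy point for: min_p g.p + 0.5 * p.B.p  subject to ||p|| <= delta.

    g is the gradient (list of floats), B a symmetric matrix (list of rows),
    and delta > 0 the trust region radius. All names are illustrative.
    """
    n = len(g)
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    if g_norm == 0.0:
        return [0.0] * n  # already stationary: take no step
    # Curvature of the model along the gradient direction: g^T B g.
    Bg = [sum(B[i][j] * g[j] for j in range(n)) for i in range(n)]
    gBg = sum(gi * bgi for gi, bgi in zip(g, Bg))
    if gBg <= 0.0:
        tau = 1.0  # non-positive curvature: go all the way to the boundary
    else:
        tau = min(g_norm ** 3 / (delta * gBg), 1.0)
    scale = tau * delta / g_norm
    return [-scale * gi for gi in g]

# Toy quadratic model with gradient g and Hessian approximation B.
g = [1.0, 0.0]
B = [[2.0, 0.0], [0.0, 2.0]]
large_radius_step = cauchy_point(g, B, delta=1.0)   # interior minimizer
small_radius_step = cauchy_point(g, B, delta=0.1)   # clipped to the boundary
print(large_radius_step, small_radius_step)
```

<p>With the large radius the step has length 0.5, the unconstrained minimizer along the negative gradient. With the small radius the step is cut back to length 0.1, which is exactly the bounded-update behavior this section is about.</p>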
<p><br /></p>
<h1 id="6-the-paper">6 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the <a href="https://arxiv.org/abs/1502.05477">paper</a>.
We’ve built a good foundation for the various tools and mathematical ideas
used by TRPO. In this section, we focus on the parts of the paper that aren’t
explicitly covered by the above topics and together result in the practical
algorithm used by many today. These are monotonic policy improvement and the
two different implementation approaches: vine and single-path.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>What is the problem with policy gradients that TRPO addresses?</li>
<li>What were the bottlenecks to addressing that problem in the approaches that existed when TRPO debuted?</li>
<li>Policy improvement bounds and theory.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization</a></li>
<li>Deep RL Course at UC Berkeley (CS 294); Advanced Policy Gradient Methods (TRPO)
<ol>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=ycCtmp4hcUs&feature=youtu.be&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3">Video</a></li>
</ol>
</li>
<li><a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf">Approximately Optimal Approximate Reinforcement Learning</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/">TRPO Tutorial</a></li>
<li><a href="https://arxiv.org/abs/1708.05144">ACKTR</a></li>
<li><a href="https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf">A Natural Policy Gradient</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>How is the trust region set in TRPO? Can we do better? Under what
assumptions is imposing the trust region constraint not required?</li>
<li>Why do we use conjugate gradient methods for optimization in TRPO? Can we
exploit the fact that conjugate gradient optimization is differentiable?</li>
<li>How is line search used in TRPO?</li>
<li>How does TRPO differ from natural policy gradient?
<details><summary>Solution</summary>
<p> See slides 30-34 from <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">this lecture</a>. </p>
</details>
</li>
<li>What are the pros and cons of using the vine and single-path methods?</li>
<li>In practice, TRPO is really slow. What is the main computational
bottleneck and how might we remove it? Can we approximate this bottleneck?</li>
<li>TRPO makes a series of approximations that deviate from the policy
improvement theory that is cited. What are the assumptions that are made
that allow these approximations to be reasonable? Should we still expect
monotonic improvement in our policy?</li>
<li>TRPO is a general procedure to directly optimize parameters from rewards,
even though the procedure is “non-differentiable”. Does it make sense to
apply TRPO to other non-differentiable problems, like problems involving
hard attention or discrete random variables?</li>
</ol>
<p>Author: surya</p>
<p>Thank you to Nic Ford, Ethan Holly, Matthew Johnson, Avital Oliver, John Schulman, George Tucker, and Charles Weill for contributing to this guide.</p>
<p>InfoGAN (2018-05-28T14:00:00+00:00): https://www.depthfirstlearning.com/2018/InfoGAN</p>
<p>Thank you to Kumar Krishna Agrawal, Yasaman Bahri, Peter Chen, Nic Ford, Roy Frostig, Xinyang Geng, Rein Houthooft, Ben Poole, Colin Raffel and Supasorn Suwajanakorn for contributing to this guide.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/infogan-deps.svg" width="200"></iframe>
<div>Concepts used in InfoGAN. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>InfoGAN is an extension of GANs that learns to represent unlabeled data as codes,
i.e., it performs representation learning. Compare this to vanilla GANs, which can only generate
samples, or to VAEs, which learn to generate both codes and samples. Representation
learning is an important direction for unsupervised learning, and GANs are a
flexible and powerful framework. This makes InfoGAN an interesting stepping
stone towards research in representation learning.</p>
<p><a href="https://colab.research.google.com/drive/1JkCI_n2U2i6DFU8NKk3P6EkPo3ZTKAaq#forceEdit=true&offline=true&sandboxMode=true" class="colab-root">Reproduce in a
<span>Notebook</span></a></p>
<p><br /></p>
<h1 id="1-information-theory">1 Information Theory</h1>
<p><strong>Motivation</strong>: Information theory formalizes the concept of the “amount of randomness” or
“amount of information”. These concepts can be extended to relative quantities
among random variables. This section leads to Mutual Information (MI), a concept core to
InfoGAN. MI extends entropy to the amount of additional information you gain from
observing a joint sample of two random variables as compared to the baseline of
observing each variable separately.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Entropy</li>
<li>Differential Entropy</li>
<li>Conditional Entropy</li>
<li>Jensen’s Inequality</li>
<li>KL divergence</li>
<li>Mutual Information</li>
</ol>
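<p>The quantities in the topics above can be computed directly for small discrete distributions. The following is a minimal sketch in plain Python, measuring everything in bits. The helper names (<code>entropy</code>, <code>kl_divergence</code>, <code>mutual_information</code>) and the two toy joint distributions are illustrative choices, not taken from the readings.</p>

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; infinite if q is zero somewhere p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p(x) = 0 contribute nothing
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

def mutual_information(joint):
    """I(x; y) = KL(p(x, y) || p(x) p(y)) for a joint given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [pxy for row in joint for pxy in row]
    flat_product = [a * b for a in px for b in py]  # same row-major order
    return kl_divergence(flat_joint, flat_product)

independent = [[0.25, 0.25], [0.25, 0.25]]   # two independent fair coins
copied = [[0.5, 0.0], [0.0, 0.5]]            # y is always equal to x

print(entropy([0.5, 0.5]))              # entropy of one fair coin
print(mutual_information(independent))  # independent variables share nothing
print(mutual_information(copied))       # y reveals x completely
```

<p>The independent pair has zero mutual information, while the copied pair has one full bit, the entropy of a single fair coin.</p>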
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Chapter 1.6 from <a href="https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf">Pattern Recognition and Machine Learning / Bishop. (“PRML”)</a></li>
<li>A good <a href="https://www.quora.com/What-is-an-intuitive-explanation-of-the-concept-of-entropy-in-information-theory/answer/Peter-Gribble">intuitive explanation of Entropy</a>, from Quora.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1404.2000.pdf">Notes on Kullback-Leibler Divergence and Likelihood Theory</a></li>
<li>For more perspectives and deeper dependencies, see Metacademy:
<ol>
<li><a href="https://metacademy.org/graphs/concepts/entropy">Entropy</a></li>
<li><a href="https://metacademy.org/graphs/concepts/mutual_information">Mutual Information</a></li>
<li><a href="https://metacademy.org/graphs/concepts/kl_divergence">KL divergence</a></li>
</ol>
</li>
<li><a href="https://colah.github.io/posts/2015-09-Visual-Information/">Visual Information Theory</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>From PRML: 1.31, 1.36, 1.37, 1.38, 1.39, 1.41.
<details><summary>Solution</summary>
<p>
PRML 1.31: Consider two variables \(x\) and \(y\) having joint distribution \(p(x,y)\). Show that the differential entropy of this pair of variables satisfies \(H(x,y) \le H(x) + H(y)\) with equality if, and only if, \(x\) and \(y\) are statistically independent.
</p><p>
If \(x\) and \(y\) are independent, then the joint distribution factorizes:<br />
\(p(x,y) = p(x)p(y)\)
</p><p>
In that case conditioning changes nothing, so the conditional entropies \(H(x|y)\) and \(H(y|x)\) equal the marginal entropies, and the joint entropy follows from the chain rule:<br />
\(H(x|y) = H(x)\)<br />
\(H(y|x) = H(y) \to\)<br />
\(H(x,y) = H(x) + H(y|x) = H(y) + H(x|y) \to\)<br />
\(H(x,y) = H(x) + H(y)\)
</p><p>
Therefore, the mutual information \(I(x,y)\) is zero if \(x\) and \(y\) are independent:<br />
\(H(x,y) = H(x) + H(y)\ \to\)<br />
\(I(x,y) = H(x) + H(y) - H(x,y) = H(x,y) - H(x,y) = 0\)
</p><p>
If \(x\) and \(y\) are dependent, recall that the mutual information is the KL divergence between \(p(x,y)\) and \(p(x)p(y)\), which is strictly positive whenever the two distributions differ:<br />
\(I(x,y) = KL(p(x,y) || p(x)p(y)) > 0\ \to\)<br />
\(H(x,y) = H(x) + H(y) - I(x,y) < H(x) + H(y)\)
</p><p>
The independent and dependent cases can be combined into a general form:<br />
\(I(x,y) = H(x) + H(y) - H(x,y) \ge 0\ \to\)<br />
\(H(x,y) \le H(x) + H(y)\)
</p>
</details>
</li>
<li>How is Mutual Information similar to correlation? How are they different? Are they directly related under some conditions?
<details><summary>Solution</summary>
<p>Start <a href="https://stats.stackexchange.com/questions/81659/mutual-information-versus-correlation">here</a>.
</p>
</details>
</li>
<li>In classification problems, <a href="https://ai.stackexchange.com/questions/3065/why-has-cross-entropy-become-the-classification-standard-loss-function-and-not-k/4185">minimizing cross-entropy loss is the same as minimizing the KL divergence
of the predicted class distribution from the true class distribution</a>. Why do we minimize the KL, rather
than other measures, such as L2 distance?
<details><summary>Solution</summary>
<p>
In a classification problem, one natural measure of “goodness” is the likelihood of the
observed labels given the inputs. By definition, it’s \(P(Y | X; params)\), which is
\(\prod_i P(Y_i = y_i | X; params)\).
This says that we want to maximize the probability of producing the “correct” \(y_i\)
class only, and don’t really care to push down the probability of each incorrect class like
L2 loss would.
</p><p>
E.g., suppose the true label is \(y = [0, 1, 0]\) (a one-hot encoding over the class labels {1, 2, 3}),
and the softmax of the final layer in the NN is \(y' = [0.2, 0.5, 0.3]\).
One could use L2 between these two distributions, but if instead we minimize the KL
divergence \(KL(y || y')\), which is equivalent to minimizing the cross-entropy
loss (the standard loss everyone uses to solve this problem),
we would compute \(0 \cdot \log(0.2) + 1 \cdot \log(0.5) + 0 \cdot \log(0.3) = \log(0.5)\),
which is exactly the log likelihood of the label being class 2
for this particular training example. The cross-entropy loss is its negative, \(-\log(0.5)\).
</p><p>
Here choosing to minimize KL means we’re maximizing the data likelihood.
I think it could also be reasonable to use L2, but we would be maximizing
the data likelihood + “unobserved anti-likelihood” :) (my made up word)
meaning we want to kill off all those probabilities of predicting wrong
labels as well.
</p><p>
Another reason L2 is less preferred might be that L2 involves looping over all
class labels whereas KL can look only at the correct class when computing the loss.
</p>
</details>
</li>
</ol>
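<p>The comparison between cross-entropy, KL divergence and L2 in the last question can be checked numerically. This is a minimal sketch using the one-hot label \([0, 1, 0]\) and the prediction \([0.2, 0.5, 0.3]\) from the solution above. The second prediction, <code>y_alt</code>, and all function names are illustrative additions.</p>

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross-entropy in nats; terms with a zero target are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def kl(y_true, y_pred):
    """KL(y_true || y_pred) in nats, with the convention 0 * log(0 / p) = 0."""
    return sum(t * math.log(t / p) for t, p in zip(y_true, y_pred) if t > 0)

def squared_l2(y_true, y_pred):
    """Squared Euclidean distance between the two probability vectors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

y = [0.0, 1.0, 0.0]        # one-hot label: class 2
y_pred = [0.2, 0.5, 0.3]   # softmax output from the example
y_alt = [0.5, 0.5, 0.0]    # hypothetical: same mass on class 2, rest moved

for name, pred in [("y_pred", y_pred), ("y_alt", y_alt)]:
    print(name, cross_entropy(y, pred), kl(y, pred), squared_l2(y, pred))
```

<p>For a one-hot label the entropy of the target is zero, so cross-entropy and KL agree, and both depend only on the probability given to the correct class. The squared L2 distance differs between the two predictions because it also depends on how the remaining mass is spread over the wrong classes.</p>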
<p><br /></p>
<h1 id="2-generative-adversarial-networks-gan">2 Generative Adversarial Networks (GAN)</h1>
<p><strong>Motivation</strong>: GANs are a framework for constructing models that learn to sample
from a probability distribution, given a finite sample from that distribution.
More concretely, after training on a finite unlabeled dataset (say of images),
a GAN can generate new images of the same “kind” that aren’t in the original
training set.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>JS (Jensen-Shannon) divergence</li>
<li>How are GANs trained?</li>
<li>Various possible GAN objectives. Why are they needed?</li>
<li>GAN training minimizes the JS divergence between the data-generating distribution and the distribution of samples from the generator part of the GAN</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">JS Divergence</a></li>
<li><a href="https://arxiv.org/abs/1406.2661">The original GAN paper</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1701.00160">NIPS 2016 Tutorial: Generative Adversarial Networks</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Prove that \(0 \leq JSD(P||Q) \leq 1\) bit for all P, Q. When are the bounds achieved?
<details><summary>Solution</summary>
<p>Start <a href="https://en.wikipedia.org/wiki/Jensen-Shannon_divergence#Relation_to_mutual_information">here</a>.
</p>
</details>
</li>
<li>What are the bounds for KL divergence? When are those bounds achieved?
<details><summary>Solution</summary>
<p>Start <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">here</a>.
</p><p>
The Kullback–Leibler divergence \(D_{KL}(P||Q)\) between \(P\) and \(Q\) is defined as:<br />
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right)\)
</p><p>
The lower bound is \(0\). It is reached when \(P(x) = Q(x)\) for all \(x\), because then \(\frac{P(x)}{Q(x)} = 1\):<br />
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2(1) = \sum_{x}P(x) \cdot 0 = 0\)
</p><p>
There is no finite upper bound. The divergence is infinite when \(Q(x_i) = 0\) at some point \(x_i\) where \(P(x_i) > 0\) (for example, when \(P\) and \(Q\) have disjoint supports), because then the log-ratio \(\log_2\left(\frac{P(x_i)}{Q(x_i)}\right)\) becomes \(\infty\):<br />
\(P(x_i) \log_2\left(\frac{P(x_i)}{Q(x_i)}\right) = P(x_i) \log_2\left(\frac{P(x_i)}{0}\right) = \infty \to\)<br />
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right) = \infty\)
</p>
</details>
</li>
<li>In the paper, why do they say “In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, \(\log(1 - D(G(z)))\) saturates”?
<details><summary>Solution</summary>
<p><a href="/assets/gan_gradient.pdf">Understanding the vanishing generator gradients point in the GAN paper</a></p>
</details>
</li>
<li>Implement a <a href="https://colab.research.google.com/">Colab</a> that trains a GAN for MNIST. Try both the saturating and non-saturating generator loss.
<details><summary>Solution</summary>
<p>An implementation can be found <a href="https://colab.research.google.com/drive/1joM97ITFowvWU_qgRjQRiOKajHQKKH80#forceEdit=true&offline=tru&sandboxMode=true">here</a>.
</p>
</details>
</li>
</ol>
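<p>The bounds discussed in the first two questions can be probed numerically on tiny discrete distributions. The sketch below is an illustration, not a proof. It computes both divergences in bits, and the helper names <code>kl_bits</code> and <code>jsd_bits</code> are made up for this example.</p>

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits; infinite if q is zero somewhere p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits, using the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m is positive wherever p or q is, so both KL terms are finite.
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

same = [0.5, 0.5]
left = [1.0, 0.0]    # all mass on the first outcome
right = [0.0, 1.0]   # all mass on the second outcome

print(jsd_bits(same, same))    # identical distributions
print(jsd_bits(left, right))   # disjoint supports
print(kl_bits(left, right))    # KL for the same disjoint pair
```

<p>Identical distributions give a JS divergence of 0 and disjoint ones give exactly 1 bit, while the KL divergence for the disjoint pair is infinite. This boundedness is one reason the JS divergence is convenient to reason about in the GAN analysis.</p>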
<p><br /></p>
<h1 id="3-the-paper">3 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the <a href="https://arxiv.org/abs/1606.03657">paper</a>. Keep
in mind that InfoGAN modifies the original GAN objective in this way:</p>
<ol>
<li>Split the incoming noise vector z into two parts - noise and code. The goal
is to learn meaningful codes for the dataset.</li>
<li>In addition to the discriminator, it adds another prediction head to the
network that tries to predict the code from the generated sample. The loss
is a combination of the normal GAN loss and the prediction loss.</li>
<li>This new loss term can be interpreted as a lower bound on the mutual
information between the generated samples and the code.</li>
</ol>
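<p>The three modifications above can be sketched without any deep learning library. The code below assumes a single uniform categorical code; it builds a generator input from noise plus a one-hot code, and estimates the lower bound \(E[\log Q(c|x)] + H(c)\) from the logits of the extra prediction head. No networks are trained here. The logits are hand-written stand-ins for the output of \(Q\), and every name (<code>sample_latent</code>, <code>mi_lower_bound</code>, and so on) is hypothetical.</p>

```python
import math
import random

def sample_latent(noise_dim, num_categories, rng):
    """Return (generator input, code index): Gaussian noise plus a one-hot code."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    code_index = rng.randrange(num_categories)
    code = [1.0 if i == code_index else 0.0 for i in range(num_categories)]
    return noise + code, code_index

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def mi_lower_bound(q_logits_batch, code_indices, num_categories):
    """Monte Carlo estimate of E[log Q(c|x)] + H(c), in nats.

    Assumes c is uniform over num_categories, so H(c) = log(num_categories).
    """
    log_q = [log_softmax(logits)[c]
             for logits, c in zip(q_logits_batch, code_indices)]
    return sum(log_q) / len(log_q) + math.log(num_categories)

rng = random.Random(0)
K = 10
codes = [sample_latent(noise_dim=4, num_categories=K, rng=rng)[1] for _ in range(8)]

# Stand-in Q outputs: one head that ignores the sample, one that recovers the code.
uninformative = [[0.0] * K for _ in codes]
confident = [[50.0 if i == c else 0.0 for i in range(K)] for c in codes]

print(mi_lower_bound(uninformative, codes, K))
print(mi_lower_bound(confident, codes, K))
```

<p>When the head ignores the generated sample the bound is 0, and when it recovers the code with near certainty the bound approaches \(H(c) = \log K\). In a real implementation this term would be added to the usual GAN loss with a weighting coefficient.</p>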
<p><strong>Topics</strong>:</p>
<ol>
<li>The InfoGAN objective</li>
<li>Why can’t we directly optimize for the mutual information \(I(c; G(z,c))\)?</li>
<li>Variational Information Maximization</li>
<li>Possible choices for classes of random variables for dimensions of the code c</li>
</ol>
<p><strong>Reproduce</strong>:</p>
<p><a href="https://colab.research.google.com/drive/1JkCI_n2U2i6DFU8NKk3P6EkPo3ZTKAaq#forceEdit=true&offline=true&sandboxMode=true" class="colab-root">Reproduce in a
<span>Notebook</span></a></p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1606.03657">InfoGAN</a></li>
<li><a href="http://aoliver.org/assets/correct-proof-of-infogan-lemma.pdf">A correction to a proof in the paper</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://towardsdatascience.com/infogan-generative-adversarial-networks-part-iii-380c0c6712cd">A blog post explaining InfoGAN</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>How does one compute \(\log Q(c|x)\) in practice? How does this answer change based on the choice of the type of random variables in \(c\)?
<details><summary>Solution</summary>
<p>What is \(\log Q(c|x)\) when c is a Gaussian centered at \(f_\theta(x)\)? What about when c is the output of a softmax?
</p><p>
See section 6 in the paper.
</p>
</details>
</li>
<li>Which objective in the paper can actually be optimized with gradient-based algorithms? How? (An answer to this needs to refer to “the reparameterization trick”)</li>
<li>Why is an auxiliary \(Q\) distribution necessary?</li>
<li>Draw a neural network diagram for InfoGAN
<details><summary>Solution</summary>
<p>There is a good diagram in <a href="https://towardsdatascience.com/infogan-generative-adversarial-networks-part-iii-380c0c6712cd">this blog post</a></p>
</details>
</li>
<li>In the paper they say “However, in this paper we opt for
simplicity by fixing the latent code distribution and we will treat \(H(c)\) as a constant.” What if you want to learn
the latent code distribution (say, if you don’t know whether the classes are balanced in the dataset)? Can you still optimize for this with gradient-based algorithms? Can you implement this on an intentionally class-imbalanced variant of MNIST?
<details><summary>Solution</summary>
<p>
You could imagine learning the parameters of the distribution of c, if you can get H(c) to be a differentiable function of those parameters.
</p>
</details>
</li>
<li>In the paper they say “the lower bound … is quickly maximized to … and maximal mutual information is achieved”. How do they know this is the maximal value?</li>
<li>Open-ended question: Is InfoGAN guaranteed to find disentangled representations? How would you tell if a representation is disentangled?</li>
</ol>
<p>Author: avital</p>