<p>Stein Variational Gradient Descent (Depth First Learning, 2020-03-02): https://www.depthfirstlearning.com/2020/SVGD</p>
<p>[Editor’s Note: This class was a part of the 2019 DFL Jane Street Fellowship.]</p>
<p>This guide is thanks to many different people, all of whom took the time to give feedback, write reviews, and contribute their own insights to the curriculum.</p>
<p>Special thanks to Cinjon Resnick, who was incredibly helpful throughout the iterations of the class, curriculum, and final notes. A special thanks as well to Professor Qiang Liu, who took the time to help shape the curriculum.</p>
<p>Thank you to Calvin Woo, Sanyam Kapoor, Thomas Pinder, Swapneel Mehta, and Avital Oliver for useful contributions to this guide, as well as countless insights during our discussions.</p>
<p>A special thanks to the many outside guests who offered to provide their time, including Dilin Wang, Tongzheng Ren, and Haoran Tang.</p>
<p>Finally, thank you to all my fellow students who attended the recitations and provided valuable feedback.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/svgd-deps.svg" width="200"></iframe>
<div>Concepts used in SVGD. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>Stein’s Method is a powerful statistical method, one that is at the disposal (and the focus) of many statisticians today. Recently, Stein’s Method has made its way into machine learning and has already proved to be a fruitful research area. Stein’s Method has deep connections to many machine learning problems of interest, and by the end of this guide, you should be able to understand the relevant mathematics behind this powerful tool.</p>
<p><br /></p>
<h1 id="1-basics-behind-kernelized-stein-discrepancy">1 Basics Behind Kernelized Stein Discrepancy</h1>
<p><strong>Motivation</strong>: Before jumping into all the math and methodology, we have to understand the basics of what’s going on. Most importantly, we will review the basics of measure theory and Reproducing Kernel Hilbert Spaces (RKHS). Measure theory allows us to understand the notion of discrepancy measures between distributions, which we will use later on to quantify the difference between two arbitrary distributions of interest. RKHS will serve as the connection between measure theory and a practical machine learning algorithm. With RKHS, we will be able to define and optimize otherwise intractable measures which were previously only useful for theoretical analysis or for a restrictive class of functions. Together, these two topics set the foundation for defining a tractable Kernelized Stein Discrepancy, which serves as the driving factor behind Stein Variational Gradient Descent.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Measure Theory</li>
<li>Kernels</li>
<li>Reproducing Kernel Hilbert Space</li>
<li>Machine Learning Basics</li>
</ol>
<p><strong>Notes</strong>: In this class, we went over the basic mathematical concepts we will need throughout the rest of the curriculum. See here for the notes in <a href="https://colab.research.google.com/drive/1x3bgKtYWaYRTV1VGaf0bKRyQ_qxNZpjh">Colab</a> or here for the <a href="/assets/svgd_notes/week01.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/lecture4_introToRKHS.pdf">Reproducing Kernel Hilbert Spaces Tutorial, Section 1 - 3</a>.</li>
<li><a href="https://www.win.tue.nl/~rvhassel/Onderwijs/Old-Onderwijs/2DE08-1011/ConTeXt-OWN-FA-201209-Bib/Literature/sigma-algebra/gc_06_measure_theory.pdf">A gentle introduction to Measure Theory (Chandalia)</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://mlss.tuebingen.mpg.de/2015/slides/gretton/part_1.pdf">Slides on RKHS from Arthur Gretton</a>.</li>
<li><a href="http://cs231n.github.io/python-numpy-tutorial/">CS231n’s Numpy and Python Tutorial</a>.</li>
<li><a href="http://cs229.stanford.edu/section/cs229-linalg.pdf">CS229’s Linear Algebra Refresher</a>.</li>
<li><a href="https://xavierbourretsicotte.github.io/Kernel_feature_map.html">Xavier Bourret Sicotte’s Blog on Kernels and Feature Maps</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>“However, Cauchy sequences are not the same as convergent sequences”, but a property of Cauchy sequences is that they are bounded. What’s the difference?
<details><summary>Solution</summary>
<p>
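A convergent sequence has a limit inside the space. A Cauchy sequence only requires its terms to become arbitrarily close to one another, so its would-be limit may be missing from the space. Every Cauchy sequence is, however, bounded. But what exactly does bounded mean? Here's a proof that shows that they are bounded, which might shed some light on the definition itself: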
<br />
<p>
<b>a.</b> There exists \(N\) such that \(|a_n - a_m| < 1 \quad \forall m, n \geq N\) (Property of Cauchy Sequence iterates getting closer)
<br />
<b>b.</b> \(\implies \forall n \geq N, |a_n - a_N| < 1\)
<br />
<b>c.</b> \(a_n \in (a_N - 1, a_N + 1) \; \forall n \geq N\), so the tail of the sequence (\(n \geq N\)) is bounded.
<br />
<b>d.</b> Since there are only finitely many terms with \(n < N\) (because \(N\) is finite), that part of the sequence is also bounded.
</p>
Therefore the Cauchy sequence \(\{ a_n \}\) is bounded \(\square\)
</p>
</details>
</li>
<li>“The open interval (0, 1) is not complete whereas the closed interval [0, 1] is complete.” Why? Can we use this example to get an intuitive definition of complete?
<details><summary>Solution</summary>
<p>
Intuitively, a space is complete if there are no "points missing" from it (inside or at the boundary). In the open interval (0, 1), the sequence \(a_n = 1/n\) (for \(n \geq 2\)) is Cauchy, but its limit 0 is not in the interval; in [0, 1] that boundary point is included. Similarly, the set of rational numbers is not complete, because e.g. \(\sqrt{2}\) is "missing" from it, even though one can construct a Cauchy sequence of rational numbers that converges to it. More information can be found at <a href="https://en.wikipedia.org/wiki/Complete_metric_space">Wikipedia: Complete Metric Space.</a>
</p>
</details>
</li>
<li>Explain the difference between a Banach and Hilbert Space. Is every Hilbert space a Banach space?
<details><summary>Solution</summary>
<p>
A Banach space is a vector space in which each vector has a non-negative length, or norm, and in which every Cauchy sequence converges to a point of the space. It is also known as a complete normed linear space.
A Hilbert space is a Banach space whose norm is defined by an inner product. Every Hilbert space is therefore a Banach space, but not every Banach space is a Hilbert space.
</p>
</details>
</li>
<li>In Machine Learning, kernels can be thought of as a “dot product” (a kind of similarity score) in high-dimensional space. Why would this be useful? Given a feature map, do we always have a corresponding kernel? Given any kernel, can we always explicitly write out the elements of the corresponding feature map?
<details><summary>Solution</summary>
<p>
Kernels (and the corresponding kernel trick) allow us to compute similarities in high-dimensional space without explicitly writing out the features and computing their dot product.
Any feature map \(\phi\) into a Hilbert space does give a kernel, via \(k(x, y) = \langle \phi(x), \phi(y) \rangle\). The restriction runs the other way: not every similarity function is a kernel, since a kernel must be symmetric and positive definite.
Likewise, given a kernel, it may be the case that we can never write out (explicitly) all elements of the corresponding feature map, because the feature space is infinite-dimensional. A good example of this is the popular exponential (RBF) kernel.
</p>
</details>
</li>
<li>Assume that we just need the log-likelihood in many machine learning tasks so that we can compute <script type="math/tex">KL(q||p)</script>, and iteratively fit our model <script type="math/tex">p</script> to the underlying, generating data distribution <script type="math/tex">q</script>. Why is this already too large of an assumption (“We assume that we have the ability to calculate the log-likelihood under the model that we specify”)?
<details><summary>Solution</summary>
<p>
The dreaded normalization constant! Most models we see will give an unnormalized likelihood, and the normalization constant (which we will see in a few weeks, often denoted as \(Z\)) is intractable to compute. We need the normalization constant to turn an unnormalized function into a valid probability density function.
</p>
</details>
</li>
<li>What is the use of Monte-Carlo methods in machine learning?
<details><summary>Solution</summary>
<p>
They are a way to estimate quantities in the presence of complex, many-random-variable situations. They do so by repeatedly generating (via simulation) instances from which they estimate the quantities.
</p>
</details>
</li>
<li>Explain the reproducing property in your own words.
<details><summary>Solution</summary>
<p>
Sanyam Kapoor's answer from our class was: "Every feature map is a linear combination of the full Hilbert space weighted by the kernel evaluations."
</p>
</details>
</li>
</ol>
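<p>As a concrete illustration of the kernel trick discussed in Question 4, here is a minimal sketch. The feature map <code>phi</code> and the degree-2 polynomial kernel are illustrative choices made for this example (they are not taken from the class notes); the point is only that the kernel value equals the inner product in feature space, without ever forming the features.</p>

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(y)))  # inner product computed in feature space
implicit = poly_kernel(x, y)              # same quantity, no feature map needed

print(explicit, implicit)
```

<p>For this kernel the feature map is small enough to write down. For the RBF kernel the same identity holds, but the feature space is infinite-dimensional, so only the kernel side of the computation is available in practice.</p>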
<p><br /></p>
<h1 id="2-steins-method">2 Stein’s Method</h1>
<p><strong>Motivation</strong>: Most of the theory we will see in this curriculum builds off the general theoretical framework of Stein’s Method, a tool to obtain bounds on distances between distributions. In Machine Learning (as we shall later see), distances between distributions can be used to quantify how well (or poorly) a model approximates a certain distribution of interest. We shall start from Stein’s Identity and Operator, while explaining their theoretical significance and working through some proofs to get an understanding of some terms (Stein’s Method, Stein’s Discrepancy) we’ll see in the coming weeks. Lastly, we will discuss why Stein’s Method has historically been a theoretical tool, and hint at how ideas from Week 1 (particularly RKHS) can be used in combination with Stein’s Method to build the tractable discrepancy measure at the center of Week 3’s discussion.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Stein’s Method</li>
<li>The Stein Operator</li>
<li>Stein Equation</li>
<li>Stein’s Identity</li>
</ol>
<p><strong>Notes</strong>: In this class, we discussed the theoretical concepts behind Stein’s method, and discussed different ways to interpret the core ideas. See here for the notes in <a href="https://colab.research.google.com/drive/1HqHSP9x01te7e33-zDR00vAsPX_M19h2">Colab</a> or here for the <a href="/assets/svgd_notes/week02.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1602.03253">Section 2 of Kernelized Stein Discrepancy</a>.</li>
<li><a href="https://en.wikipedia.org/wiki/Stein%27s_method">Stein’s Method on Wikipedia</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1404.1392">A Short History of Stein’s Method</a>.</li>
<li><a href="http://www.ims.nus.edu.sg/Programs/stein09/files/A%20Gentle%20Introduction%201.pdf">Gentle Introduction to Stein’s Method</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Prove Stein’s Identity for a standard Gaussian random variable <script type="math/tex">Z</script>.
<details><summary>Solution</summary>
<p>
Recall that Stein's Identity tells us that for a standard normal random variable \(Z\) (i.e., \(Z \sim \mathcal{N}(0, 1)\)):
$$ \mathbf{E}f'(Z) = \mathbf{E}Zf(Z)$$
for all absolutely continuous functions \(f\) with \( \mathbf{E}|f'(Z)| < \infty \).
To start, we state, without proof, that the density function of the standard Gaussian:
$$ p(z) = \frac{1}{\sqrt{2\pi}}e^{\frac{-z^2}{2}} $$
satisfies \( p'(z) = -zp(z) \), so that \( p(z) = \int_z^\infty yp(y)dy = -\int_{-\infty}^z yp(y)dy \).
For standard normal \(Z\), we can break the left hand side of the original identity into two integrals:
$$\mathbf{E}f'(Z) = \int_0^\infty f'(z)p(z)dz + \int_{-\infty}^0 f'(z)p(z)dz $$
For each integral on the right-hand side, we substitute the corresponding expression for \(p(z)\) and swap the order of integration using <a href="https://en.wikipedia.org/wiki/Fubini%27s_theorem">Fubini's Theorem</a>:
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty f'(z) \int_z^\infty yp(y)dydz $$
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty \int_z^\infty f'(z)yp(y)dydz $$
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty \int_0^y f'(z)yp(y)dzdy $$
Leading us to our final integral:
$$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty [f(y) - f(0)] yp(y)dy $$
The second integral similarly evaluates to \( \int_{-\infty}^0 [f(y) - f(0)] yp(y)dy \).
When we combine the two results, we get:
$$ \mathbf{E}f'(Z) = \mathbf{E}Z[f(Z) - f(0)] = \mathbf{E}Zf(Z)$$
where the last equality uses \( \mathbf{E}Z = 0 \). This proves the forward direction.
</p>
</details>
</li>
<li>Explain why Stein’s Identity is useful.
<details><summary>Solution</summary>
<p>
Stein's Identity also holds in the converse: if the identity holds for all suitable \(f\), we can conclude that the random variable, which we call \(W\), is standard normal. Moreover, if the two quantities in Stein's Identity are only approximately equal, we can conclude that \(W\) is approximately normal. Stein's Identity and Method are used to quantify this "approximately", which we briefly discuss below.
Probability metrics (between two random variables \(X\) and \(Y\)) take the general form of:
$$d(X, Y) = \sup_{h \in \mathcal{H}} | \mathbf{E}h(X) - \mathbf{E}h(Y) |$$
for some class of functions \( \mathcal{H} \). We normally want to bound the distances between the corresponding distribution functions \(P \) and \(Q \), but that choice is less important for this brief discussion.
When we choose different classes of functions, we can recover various distances that we often use (in machine learning) to compare probability distributions, such as the Kolmogorov or Wasserstein distance.
We get to the Stein Discrepancy by measuring the distance from \(W\) to our standard normal \(Z\) via:
$$ \mathbf{E}h(W) - \mathcal{N}h $$
where \(\mathcal{N}h = \mathbf{E}h(Z)\) for \(h \in \mathcal{H}\).
Stein's Identity tells us that the discrepancy can also be measured by:
$$ \mathbf{E}[f'(W) - Wf(W)]$$
where, for a given \(h\), the function \(f\) solves the Stein Equation at every point \(w\):
$$ f'(w) - wf(w) = h(w) - \mathcal{N}h $$
Since we're trying to bound \(\mathbf{E}h(W) - \mathcal{N}h\), we can now instead bound the expectation of the LHS, \(\mathbf{E}[f'(W) - Wf(W)]\), which turns out to be a lot easier once we account for all of the boundary conditions.
</p>
</details>
<p><br /></p>
</li>
</ol>
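<p>The identity proved in Question 1 can be checked numerically with a Monte Carlo estimate. The sketch below is illustrative only: the test function \(f(z) = \sin(z)\), the sample size, and the uniform comparison variable \(W\) are assumptions chosen for this example. Both sides of the identity agree for a standard normal \(Z\), while a clear gap appears for the non-Gaussian \(W\), which is the quantity Stein's Method goes on to bound.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Test function f(z) = sin(z), so f'(z) = cos(z). (Illustrative choice.)
z = rng.standard_normal(n)
lhs = np.mean(np.cos(z))       # Monte Carlo estimate of E[f'(Z)]
rhs = np.mean(z * np.sin(z))   # Monte Carlo estimate of E[Z f(Z)]

# The same two expectations for a non-Gaussian W ~ Uniform(-1, 1).
w = rng.uniform(-1.0, 1.0, size=n)
gap_w = np.mean(np.cos(w)) - np.mean(w * np.sin(w))

print("Gaussian:", lhs, rhs)
print("Uniform gap:", gap_w)
```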
<h1 id="3-kernelized-stein-discrepancy">3 Kernelized Stein Discrepancy</h1>
<p><strong>Motivation</strong>: The main theoretical meat comes from a single 2016 paper titled Kernelized Stein Discrepancy (KSD). KSD takes the powerful Stein’s Identity, and uses RKHS theory to define a tractable discrepancy between a ground truth distribution and samples from an arbitrary one. Most importantly, KSD defines a discrepancy function that does not involve calculating the normalizing constant, allowing it to be much more widely applicable in practical tasks. We will discuss the difference between likelihood-free and likelihood-based methods in machine learning, how this normalization constant proves to be problematic in machine learning, and how KSD allows us to sidestep this issue with a new, tractable discrepancy. KSD will serve as the launch pad for the algorithm at the focus of this curriculum, Stein Variational Gradient Descent.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>A Stein Discrepancy</li>
<li>Goodness of Fit</li>
<li>Tractable Optimization of the Stein Discrepancy</li>
</ol>
<p><strong>Notes</strong>: In this class, we worked through the Kernelized Stein Discrepancy paper, focusing on the optimization and use cases of such a method. See here for the notes in <a href="https://colab.research.google.com/drive/1V7zpm9U3TCjIM9DxeRWo6IEypDkZObrH">Colab</a> or here for the <a href="/assets/svgd_notes/week03.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="http://www.cs.dartmouth.edu/~qliu/PDF/ksd_short.pdf">A Short Introduction to Kernelized Stein Discrepancy</a>.</li>
<li><a href="https://www.cs.dartmouth.edu/~qliu/PDF/slides_ksd_icml2015.pdf">ICML 2015 Slides on KSD</a>.</li>
<li><a href="https://arxiv.org/abs/1602.03253">Kernelized Stein Discrepancy</a>.</li>
<li><a href="https://stats.stackexchange.com/questions/276497/maximum-mean-discrepancy-distance-distribution">What is Maximum Mean Discrepancy?</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<p>Although we focus on the work leading up to Stein Variational Gradient Descent, this week’s optional reading provides historical context on how Stein’s Method was introduced into the context of machine learning.</p>
<ol>
<li><a href="https://arxiv.org/abs/1506.03039">Measuring Sample Quality with Stein’s Method</a>.</li>
<li><a href="https://arxiv.org/abs/1602.02964">A Kernel Test of Goodness of Fit</a></li>
<li><a href="https://arxiv.org/abs/1703.01717">Measuring Sample Quality with Kernels</a></li>
</ol>
<p>The first reference, from Gorham and Mackey, introduced the notion of a Stein Discrepancy. Kernelized Stein Discrepancy, the paper of focus for this week, built upon that idea with kernels, enabling the use of kernel functions in the Stein Discrepancy. The latter two references are also works that independently developed kernel-based Stein Discrepancies.</p>
<p><strong>Questions</strong>:</p>
<ol>
<li>What determines the choice of kernel in KSD?</li>
</ol>
<details><summary>Solution</summary>
<p>
Since KSD requires an RKHS for optimization, the kernel must be positive definite. However, whenever given a positive definite kernel \(K\), we can always build an associated RKHS as follows.
If we take \(H\) as the Hilbert space of functions \(f: \mathcal{X} \rightarrow \mathbf{R}\) defined on some set \(\mathcal{X}\) with some inner product \( \langle \cdot, \cdot \rangle_H \) defined on \(H\), then we can define the evaluation functional \(e_x: H \rightarrow \mathbf{R}\) as \(f \rightarrow e_x(f) = f(x) \).
Using the above definitions, our space \( H\) is an RKHS iff the evaluation functionals are continuous. As we saw in the notes, we call the given kernel \(K\) a reproducing kernel if:
<br />
<b>1.</b> \(K(x, \cdot) \in H, \; \forall x \in \mathcal{X}\)
<br />
<b>2.</b> \(\langle f, K(x, \cdot) \rangle_H = f(x) \; \forall f \in H, \forall x \in \mathcal{X}\).
<br />
Thus, every positive definite kernel \( K\) induces a unique RKHS for which it is the reproducing kernel (this is the Moore-Aronszajn theorem).
Excitingly, in the context of machine learning, positive definite kernels themselves can be defined in terms of inner products. Therefore, we can generate arbitrary kernels and RKHS with some feature map \( \Phi: \mathcal{X} \rightarrow \mathcal{F}\) where feature space \( \mathcal{F}\) is a Hilbert space with some inner product \( \langle \cdot, \cdot \rangle \).
</p>
</details>
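<p>To make the "no normalizing constant" point concrete, here is a minimal sketch of a U-statistic estimate of the squared KSD in one dimension. The function name <code>ksd_u_statistic</code>, the RBF kernel with fixed bandwidth <code>h = 1</code>, and the Gaussian target are assumptions made for this example rather than the setup from the paper's experiments. Only the target's score function \(\nabla_x \log p(x)\) is needed, never its normalizing constant.</p>

```python
import numpy as np

def ksd_u_statistic(x, score, h=1.0):
    """U-statistic estimate of the squared KSD for 1-D samples x.

    score(x) is the target's score d/dx log p(x); it does not depend on the
    normalizing constant of p. The kernel is RBF with bandwidth h (assumed).
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    d = x[:, None] - x[None, :]                  # d[i, j] = x_i - x_j
    k = np.exp(-d ** 2 / (2.0 * h ** 2))         # k(x_i, x_j)
    dk_dx = -d / h ** 2 * k                      # derivative in the first argument
    dk_dy = d / h ** 2 * k                       # derivative in the second argument
    d2k = (1.0 / h ** 2 - d ** 2 / h ** 4) * k   # mixed second derivative
    s = score(x)
    u = (s[:, None] * s[None, :] * k
         + s[:, None] * dk_dy
         + s[None, :] * dk_dx
         + d2k)
    np.fill_diagonal(u, 0.0)                     # U-statistic: drop the i == j terms
    return u.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
target_score = lambda x: -x                      # score of N(0, 1)

matched = ksd_u_statistic(rng.normal(0.0, 1.0, 500), target_score)
shifted = ksd_u_statistic(rng.normal(2.0, 1.0, 500), target_score)
print(matched, shifted)
```

<p>Samples drawn from the target give an estimate close to zero (it can be slightly negative, since the U-statistic is unbiased rather than non-negative), while samples from a shifted Gaussian give a clearly positive value.</p>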
<p><br /></p>
<h1 id="4-stein-variational-gradient-descent">4 Stein Variational Gradient Descent</h1>
<p><strong>Motivation</strong>: Stein Variational Gradient Descent (SVGD) is a popular, non-parametric Bayesian Inference algorithm that’s been applied to Variational Inference, Reinforcement Learning, GANs, and much more. This week, we study the algorithm in its entirety, building off of last week’s work on KSD, and seeing how viewing KSD from a KL-Divergence-minimization lens induces a powerful, practical algorithm. We discuss the benefits of SVGD over other similar approximators, and look at a practical implementation of the algorithm.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Stein Variational Gradient Descent</li>
<li>Implementing the Algorithm</li>
</ol>
<p><strong>Notes</strong>: In this class, we go over the core paper, Stein Variational Gradient Descent. At the end of the notes, we provide links to implementations in a variety of different languages. See here for the notes in <a href="https://colab.research.google.com/drive/0B2rVTvobCLlWNEY4SENKdG1OQ3c">Colab</a> or here for the <a href="/assets/svgd_notes/week04.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.dartmouth.edu/~qliu/PDF/steinslides16.pdf">SVGD Slides</a>.</li>
<li><a href="https://arxiv.org/abs/1608.04471">SVGD Paper</a>.</li>
<li><a href="https://www.sanyamkapoor.com/machine-learning/stein-gradient/">Sanyam Kapoor’s great notebook on Stein Gradients</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.dartmouth.edu/~qliu/PDF/svgd_aabi2016.pdf">SVGD: Theory and Applications</a>.</li>
<li><a href="https://arxiv.org/abs/1707.06626">Learning to Sample with Amortized SVGD</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Compare and contrast the method shown here and MCMC. What are some advantages MCMC still has over SVGD?
<details><summary>Solution</summary>
<p>
Below are some ideas we discussed in our class. <br />
<b>1.</b> SVGD requires a compact subspace \( \mathcal{X} \), and as noted <a href="http://proceedings.mlr.press/v97/chen19b/chen19b.pdf">here in Chen '19</a>, requires the number of particles to be fixed a priori.<br />
<b>2.</b> SVGD has a lot less theoretical understanding compared to MCMC (which is potentially due to how recent the method is). SVGD has had analysis done in the infinite-particle regime, but minimal work done in finite-particle scenarios (an example of such work can be found <a href="https://papers.nips.cc/paper/8101-stein-variational-gradient-descent-as-moment-matching.pdf">here</a>). A concern of theoretical analysis is the complexity of analyzing the interacting particle updates, so the works covered here either view it from a dynamical systems / differential equation perspective (which concerns the smooth transformation of density), or discuss the properties of the final particles, regardless of how they were algorithmically attained.<br />
<b>3.</b> SVGD still seems to collapse in high-dimensional spaces, leading to exciting new research in <a href="https://arxiv.org/abs/1902.03394">why this occurs</a> and <a href="https://arxiv.org/abs/1711.04425">ideas on how to get around it</a>.
</p>
</details>
</li>
<li>Prove that the discrepancy in Equation 3 of the Stein Variational Gradient Descent Paper only equals 0 when <script type="math/tex">p</script> and <script type="math/tex">q</script> are equal.
<details><summary>Solution</summary>
<p>
Recall the operator form of Stein's Identity:
$$ \mathbf{E}_p[\mathcal{A}_pf(x)] = 0$$
If \( p \neq q \), consider instead \( \mathbf{E}_q[\mathcal{A}_pf(x)] \) for some choice of function \( f \).
Since \( \mathbf{E}_q[\mathcal{A}_qf(x)] = 0 \) by Stein's Identity applied to \( q \), we can rewrite this as:
$$\mathbf{E}_q[\mathcal{A}_pf(x)] = \mathbf{E}_q[\mathcal{A}_pf(x)] - \mathbf{E}_q[\mathcal{A}_qf(x)]$$
Recalling the full definition of the operator:
$$\mathcal{A}_pf(x) = s_p(x)f(x) + \nabla_x f(x)$$
where the score function \( s_p(x) \) is just \( \nabla_x \log p(x) \), the \( \nabla_x f(x) \) terms cancel and we are left with
$$\mathbf{E}_q[(s_p(x) - s_q(x))f(x)]$$
This means that unless \( s_p(x) = s_q(x) \; \forall x \in \mathcal{X} \) (which, for normalized densities on \( \mathcal{X} \), forces \( p = q \)), we can always find some function \(f\) for which the above quantity is nonzero.
</p>
</details>
</li>
<li>
<p>Implement SVGD in your favorite language (see the notes for links to different implementations). Then, let’s take a look at the role of the kernel in SVGD:</p>
<ul>
<li>
<p>Remove the repulsive kernel term and observe how particles collapse to modes.</p>
</li>
<li>
<p>Remove the kernel’s contribution in the first term.</p>
</li>
</ul>
<p>What happens?</p>
</li>
</ol>
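<p>As a starting point for Question 3, the SVGD update can be sketched in a few lines of NumPy. This is a minimal one-dimensional sketch under stated assumptions, not the reference implementation linked in the notes: the names <code>svgd_step</code> and <code>run_svgd</code>, the fixed RBF bandwidth <code>h = 1</code> (the paper uses a median heuristic), the plain gradient step (the paper uses AdaGrad), and the Gaussian target \(\mathcal{N}(2, 1)\) are all choices made for this example. The <code>repulsive</code> flag switches off the kernel-gradient term so the first ablation in the question can be run directly.</p>

```python
import numpy as np

def svgd_step(x, score, h=1.0, step=0.1, repulsive=True):
    """One SVGD update for 1-D particles x with an RBF kernel of bandwidth h."""
    d = x[:, None] - x[None, :]               # d[j, i] = x_j - x_i
    k = np.exp(-d ** 2 / (2.0 * h ** 2))      # k(x_j, x_i)
    drift = k * score(x)[:, None]             # k(x_j, x_i) * score(x_j)
    # d/dx_j k(x_j, x_i): pushes particle i away from particle j.
    grad_k = -d / h ** 2 * k if repulsive else np.zeros_like(k)
    phi = (drift + grad_k).mean(axis=0)       # average over j for each particle i
    return x + step * phi

def run_svgd(x0, score, n_iters=1000, **kwargs):
    """Apply svgd_step repeatedly, starting from particles x0."""
    x = x0.copy()
    for _ in range(n_iters):
        x = svgd_step(x, score, **kwargs)
    return x

score = lambda x: -(x - 2.0)                  # score of the target N(2, 1)
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-2.0, scale=0.5, size=50)

full = run_svgd(x0, score)
no_repulsion = run_svgd(x0, score, repulsive=False)

print("with repulsion:   ", full.mean(), full.std())
print("without repulsion:", no_repulsion.mean(), no_repulsion.std())
```

<p>With the repulsive term, the particles spread out around the target mean. Without it, every particle follows a kernel-smoothed gradient of the log density and the ensemble contracts toward the mode.</p>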
<p><br /></p>
<h1 id="5-svgd-as-gradient-flow">5 SVGD as Gradient Flow</h1>
<p><strong>Motivation</strong>: SVGD as Gradient Flow is one of the first papers that analyzes the dynamics and theoretical properties of SVGD. This paper covers an incredible amount of seemingly-disparate topics, connecting them in a succinct explanation. Due to the relative difficulty of the material, especially the necessary background, the attached notes are self-contained and should be read alongside the paper.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Large Sample Regime of SVGD</li>
<li>Continuous Time Analysis of SVGD</li>
<li>Optimal Transport, Wasserstein Distances, and Differential Geometry</li>
<li>SVGD as a Gradient Flow</li>
</ol>
<p><strong>Notes</strong>: In this class, we try to understand the geometric implications of SVGD. The notes are structured relatively differently - with the amount of background needed, relevant material is introduced in-line. As a result, the ideal way to understand this week requires reading the notes alongside the paper, using the background sections to understand the concepts and their connections within the paper. See here for the notes in <a href="/assets/svgd_notes/week05.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1704.07520">Stein Variational Gradient Descent as Gradient Flow</a>.</li>
</ol>
<p><br /></p>
<h1 id="6-stein-in-reinforcement-learning">6 Stein in Reinforcement Learning</h1>
<p><strong>Motivation</strong>: One of the most exciting use cases of SVGD is in reinforcement learning, due to its connection to maximum entropy reinforcement learning. This week, we study two key techniques in reinforcement learning that use SVGD as the underlying mechanism. In reinforcement learning, the target distribution is not known, so we derive gradient updates to our parameters using policy gradients. As we derive the gradient estimators in the maximum-entropy framework of reinforcement learning, we will start to see what benefits SVGD-based methods have. In particular, we will focus on the explore-exploit tradeoff, as well as normalization constants for intractable distributions, and see how SVGD helps us get around complicated problems regarding both.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Reinforcement Learning</li>
<li>Explore vs. Exploit</li>
<li>Maximum Entropy Reinforcement Learning</li>
</ol>
<p><strong>Notes</strong>: In this class, we look at the application area of reinforcement learning, and see how the diversity induced by SVGD (and its connection to maximum entropy reinforcement learning) generates strongly-exploring policies. See here for the notes in <a href="https://colab.research.google.com/drive/178X8BgGrUmPaRTLulL_ETUKaBf-MfrgS">Colab</a> or here for the <a href="/assets/svgd_notes/week06.pdf">PDF</a>.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1704.02399">Stein Variational Policy Gradient</a>.</li>
<li><a href="https://arxiv.org/abs/1702.08165">Reinforcement Learning with Energy-Based Policies</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html">A Long Peek into Reinforcement Learning</a>.</li>
<li><a href="https://arxiv.org/abs/1707.06626">Learning to Draw Samples with Amortized SVGD - Same as W4</a>.</li>
<li><a href="https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/">Soft Q-Learning BAIR Blogpost</a>.</li>
<li><a href="https://arxiv.org/abs/1805.10309">Learning Self-Imitating Diverse Policies (an improved SVPG)</a>.</li>
<li><a href="https://arxiv.org/abs/1806.03836">Bayesian MAML</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What are some of the issues with using the RBF kernel when comparing RL policies? Is parameter space appropriate for comparing policies?
<details><summary>Solution</summary>
<p>
While it works in practice, the networks used for particles in the original SVPG paper were reasonably small. With larger numbers of parameters (e.g., those needed when working with image-based observations), parameter-based discrepancies start to make even less sense as a measure of how differently two policies behave. This is one of two core ideas that drove the formulation of the Self-Imitating Diverse Policies paper, seen as Resource 4 in Optional Reading.
</p>
</details>
</li>
<li>
<p>In SVPG, the introduction of a prior (and priors in RL) is one active area of research. To incorporate priors in this framework, what “space” does the prior need to be over?</p>
<details><summary>Solution</summary>
<p>
SVPG incorporates a prior over \(q \), which is actually a prior over the distribution of particle parameters \(\theta\). Since this space is uninterpretable, the prior term is set to be a constant, generating an "improper" prior that, in most use cases, can be dropped from the optimization. Even if you were to use an old set of particles as a prior, the term is basically unusable, because in order to estimate the density of \(q\), you'd need to fit high-dimensional kernel-density estimators (over \( \mathbf{R}^{|\theta|} \)). In addition, the number of particles used is usually much smaller than the number of parameters each one has, making the density estimation an ill-posed problem.
</p>
</details>
</li>
<li>With the code implementation linked in the notes (or your own), ablate on the architecture of each SVPG particle. What types of behavioral differences do you see in the different policies as you increase or decrease the network size? Try adding a second layer instead; for example, how does a 2-layer, 200 neuron-per-layer network compare to a single-layer, 400 neuron particle?</li>
</ol>
<p><br /></p>
<p>Author: bhairav</p>
<p>Neural ODEs (Depth First Learning, 2019-09-23): https://www.depthfirstlearning.com/2019/NeuralODEs</p>
<p>This guide would not have been possible without the help and feedback from many people.</p>
<p>Special thanks to Prof. Joan Bruna and his class at NYU, <a href="https://github.com/joanbruna/MathsDL-spring19">Mathematics of Deep Learning</a>, and to Cinjon Resnick, who introduced me to DFL and helped complete this guide.</p>
<p>Thank you to Avital Oliver, Matt Johnson, Dougal MacClaurin, David Duvenaud, and Ricky Chen for useful contributions to this guide.</p>
<p>Thank you to Tinghao Li, Chandra Prakash Konkimalla, Manikanta Srikar Yellapragada, Shan-Conrad Wolf, Deshana Desai, Yi Tang, Zhonghui Hu for helping me prepare the notes.</p>
<p>Finally, thank you to all my fellow students who attended the recitations and provided valuable feedback.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/nodes-deps.svg" width="200"></iframe>
<div>Concepts used in Neural ODEs. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>Neural ODEs are neural network models which generalize standard layer-to-layer propagation to continuous-depth models. Starting from the observation that the forward propagation in neural networks is equivalent to one step of a discretization of an ODE, we can construct and efficiently train models via ODEs. On top of providing a novel family of architectures, notably for invertible density models and continuous time series, neural ODEs also provide a memory efficiency gain in supervised learning tasks.</p>
<p>In this curriculum, we will go through all the background topics necessary to understand these models. At the end, you should be able to implement neural ODEs and apply them to different tasks.</p>
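<p>The observation that a forward pass looks like one step of an ODE discretization can be made concrete with a small sketch. Everything here is hypothetical and chosen only for illustration: the layer function <code>f</code>, its random weights <code>W</code>, and the helper names. A residual update \(h_{t+1} = h_t + f(h_t)\) coincides with one explicit Euler step of size 1 for \(dh/dt = f(h)\), and taking many smaller steps over the same interval gives a finer approximation of the continuous-depth limit.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))   # hypothetical layer weights

def f(h):
    """Hypothetical layer function; plays the role of the ODE right-hand side."""
    return np.tanh(W @ h)

def residual_block(h):
    """Standard residual update h_{t+1} = h_t + f(h_t)."""
    return h + f(h)

def euler_integrate(h, t_end=1.0, n_steps=1):
    """Explicit Euler for dh/dt = f(h) on [0, t_end] with n_steps equal steps."""
    dt = t_end / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h)
    return h

h0 = rng.normal(size=4)
one_layer = residual_block(h0)                       # discrete network layer
one_euler_step = euler_integrate(h0, n_steps=1)      # identical by construction
fine_solution = euler_integrate(h0, n_steps=1000)    # finer discretization

print(np.allclose(one_layer, one_euler_step))
print(np.abs(one_layer - fine_solution).max())
```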
<p><br /></p>
<h1 id="common-resources">Common resources:</h1>
<ol>
<li>Süli & Mayers: <a href="https://www.cambridge.org/core/books/an-introduction-to-numerical-analysis/FD8BCAD7FE68002E2179DFF68B8B7237#">An Introduction to Numerical Analysis</a>.</li>
<li>Quarteroni et al.: <a href="https://www.springer.com/us/book/9783540346586?token=holiday18&utm_campaign=3_fjp8312_us_dsa_springer_holiday18&gclid=Cj0KCQiAvebhBRD5ARIsAIQUmnlViB7VsUn-2tABSAhIvYaJgSEqmJXD7F4A7EgyDQtY9v_GeUsNif8aArGAEALw_wcB">Numerical Mathematics</a>.</li>
</ol>
<h1 id="1-numerical-solution-of-odes---part-1">1 Numerical solution of ODEs - Part 1</h1>
<p><strong>Motivation</strong>: ODEs are used to mathematically model a number of natural processes and phenomena. The study of their numerical
simulations is one of the main topics in numerical analysis and of fundamental importance in applied sciences. To understand Neural ODEs, we need to first understand how ODEs are solved with numerical techniques.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Initial value problems.</li>
<li>One-step methods.</li>
<li>Consistency and convergence.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week1.pdf">class</a>, we touched upon one-step methods and their analysis. We also looked at some illustrative examples.</p>
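<p>To see consistency and convergence order in action, the following sketch compares two one-step methods, explicit Euler and Heun's method, on the test problem \(y' = y\), \(y(0) = 1\) on \([0, 1]\). The test problem, step counts, and function names are assumptions made for this example. Halving the step size should roughly halve the Euler error (order 1) and quarter the Heun error (order 2).</p>

```python
import math

def euler(f, t, y, h):
    """Explicit Euler step."""
    return y + h * f(t, y)

def heun(f, t, y, h):
    """Heun's method: Euler predictor followed by a trapezoidal corrector."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + 0.5 * h * (k1 + k2)

def integrate(step_fn, f, y0, t_end, n_steps):
    """Apply a one-step method n_steps times on [0, t_end]."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step_fn(f, t, y, h)
        t += h
    return y

f = lambda t, y: y          # test problem y' = y, exact solution e^t
exact = math.e              # y(1)

def error(step_fn, n_steps):
    return abs(integrate(step_fn, f, 1.0, 1.0, n_steps) - exact)

# Ratio of errors when the step size is halved: about 2^p for an order-p method.
ratio_euler = error(euler, 100) / error(euler, 200)
ratio_heun = error(heun, 100) / error(heun, 200)
print(ratio_euler, ratio_heun)
```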
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Sections 12.1-4 from Süli & Mayers.</li>
<li>Sections 11.1-3 from Quarteroni et al.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Runge-Kutta methods: Section 12.5 from Süli & Mayers.</li>
<li><a href="http://podcasts.ox.ac.uk/odes-and-nonlinear-dynamics-42">Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.2</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Exercise 1 in Section 11.12 of Quarteroni et al.
<details><summary>Solution</summary>
<p>
The truncation error can be split as
$$h\tau_{n+1} = y_{n+1} - y_n - h\Phi(t_n,y_n;h) = E_1 + E_2$$
where
$$E_1 = \int_{t_n}^{t_{n+1}} f(s, y(s))\,ds - \frac{h}{2}\left( f(t_n,y_n) + f(t_{n+1},y_{n+1}) \right)$$
and
$$E_2 = \frac{h}{2}\left( f(t_{n+1},y_{n+1}) - f(t_{n+1},y_n + hf(t_n,y_n)) \right)$$
We can bound \(E_2\) as
$$|E_2| = \frac{h}{2} \left| f(t_{n+1},y_{n+1}) - f(t_{n+1}, y_n + h f(t_n,y_n)) \right| \leq \frac{hL}{2}|y_{n+1}-y_{n} - hf(t_n,y_n)| = \frac{hL}{2}O(h^2) = O(h^3)$$
where \(L\) is the Lipschitz constant of \(f\). On the other hand, \(E_1\) is bounded above by \(O(h^3)\); see this <a href="https://en.wikipedia.org/wiki/Trapezoidal_rule#Error_analysis">link</a> for a proof. It follows that \(\tau_{n} = O(h^2)\).
</p>
</details>
</li>
<li>Exercises 12.3, 12.4, 12.7 in Section 12 of Süli & Mayers.
<details><summary>Solution to Exercise 12.3</summary>
<p>
Notice that we can write
$$\left(y + \frac{q}{p}\right)'=p\left(y + \frac{q}{p}\right)$$
It follows that \(y(t) = Ce^{pt} - q/p\) for some constant \(C\). Imposing the initial condition \(y(0)=1\), we get \(C = 1 + q/p\) and hence \(y(t)=e^{pt} + \frac{q}{p}(e^{pt}-1)\). In particular, we expand \(y\) in its Taylor series:
$$y(t) = 1 + \left(1 + \frac{q}{p}\right)\sum_{k=1}^\infty \frac{(pt)^k}{k!}$$
To conclude the exercise we only need to notice that
$$y_n(t) = 1 + \left(1 + \frac{q}{p}\right)\sum_{k=1}^n \frac{(pt)^k}{k!}$$
satisfies Picard's iteration: \(y_0 \equiv 1\), \(y_{n+1}(t) = y_0 + \int_0^t (py_n(s) + q)\,ds\).
</p>
</details>
<details><summary>Solution to Exercise 12.4</summary>
<p>
Applying Euler's method with step-size \(h\), we get \(\hat{y}(0) = 0\), \(\hat{y}(h) = \hat{y}(0) + h \hat{y}(0)^{1/5} = 0\), \(\hat{y}(2h) = \hat{y}(h) + h \hat{y}(h)^{1/5} =0\). Iterating, we see that \(\hat{y}(nh) = 0\) for all \(n\geq 0\). On the other hand, the implicit Euler's method says that
$$\hat{y}_{n+1} = \hat{y}_n + h \hat{y}_{n+1}^{1/5}$$
for \(n \geq 0\) and \(\hat{y}_0 = 0\). After substituting \(\hat{y}_{n} = (C_nh)^{5/4}\) in the above relation, we only need to check that there exists a sequence \(C_n\) satisfying the requirements.
</p>
</details>
<details><summary>Solution to Exercise 12.7</summary>
<p>
First, notice that
$$e_{n+1} = y(x_{n+1}) - y_{n} - \frac{1}{2}h(f_{n+1} + f_n)= e_n - \frac{1}{2}h (f_{n+1}+f_n) + \int_{x_n}^{x_{n+1}} f(s,y(s))\,ds$$
and that the second component of the RHS is the same as \(E_1\) in Exercise 1 above. Therefore the first bound follows. The last inequality is simply obtained by re-arranging the terms.
</p>
</details>
</li>
<li>
<p>Consider the following method for solving <script type="math/tex">y' = f(y)</script>:</p>
<script type="math/tex; mode=display">y_{n+1} = y_n + h(\theta f(y_n) + (1-\theta) f(y_{n+1}))</script>
<p>Assuming sufficient smoothness of <script type="math/tex">y</script> and <script type="math/tex">f</script>, for what value of <script type="math/tex">0 \leq\theta\leq 1</script> is the truncation error the smallest? What does this mean about the accuracy of the method?</p>
<details><summary>Solution</summary>
<p>
By definition, it holds that
$$h\tau_n = y_{n+1} - y_n - h (\theta f_n + (1-\theta) f_{n+1}) = y_{n+1} - y_n - h \theta y_n' - h(1-\theta) y_{n+1}'$$
Taylor-expanding, we get
$$h\tau_n = y_{n} + hy_n' + \frac{h^2}{2}y_n'' + O(h^3) - y_n - h \theta y_n' - h(1-\theta) y_{n}' - h^2(1-\theta) y_{n}'' + O(h^3) = h^2\left(\theta - \frac{1}{2}\right)y_n''+O(h^3)$$
It follows that the truncation error is the smallest for \(\theta=1/2\). For \(\theta = 1/2\), the method has order \(2\), otherwise it has order \(1\).
</p>
</details>
</li>
<li><a href="https://colab.research.google.com/drive/1bNg-RzZoelB3w8AUQ6mefRQuN3AdrIqX">Colab notebook</a>.
<details><summary>Solution</summary>
<p>
See this <a href="https://colab.research.google.com/drive/1wTQXy2_4InQH51rEmiCtvl5Q7MiyrC4k">Colab</a> for the solution.
</p>
</details>
</li>
</ol>
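<p>As a numerical complement to the exercises above, the following minimal Python sketch estimates the observed order of convergence of the explicit Euler method and of Heun's method. The test problem \(y' = -y\), \(y(0) = 1\), the step counts, and all function names are assumptions chosen for illustration; they are not taken from the course notebook.</p>

```python
import numpy as np

def euler_step(f, t, y, h):
    # Explicit Euler: one evaluation of f per step.
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    # Heun's method: Euler predictor followed by a trapezoidal corrector.
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + 0.5 * h * (k1 + k2)

def solve(step, f, y0, t_end, n_steps):
    # Apply a one-step method n_steps times on [0, t_end].
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y

def observed_order(step, f, y0, t_end, exact, n_steps=100):
    # Halving h divides the global error by about 2**order.
    err_h = abs(solve(step, f, y0, t_end, n_steps) - exact)
    err_h2 = abs(solve(step, f, y0, t_end, 2 * n_steps) - exact)
    return np.log2(err_h / err_h2)

# Test problem (an assumption for illustration): y' = -y, y(0) = 1.
f = lambda t, y: -y
exact = np.exp(-1.0)
order_euler = observed_order(euler_step, f, 1.0, 1.0, exact)
order_heun = observed_order(heun_step, f, 1.0, 1.0, exact)
print(f"Euler: {order_euler:.2f}, Heun: {order_heun:.2f}")
```

<p>Halving the step size should roughly halve the error of a first-order method and quarter the error of a second-order method, so the two estimates should come out close to \(1\) and \(2\) respectively.</p>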
<p><br /></p>
<h1 id="2-numerical-solution-of-odes---part-2">2 Numerical solution of ODEs - Part 2</h1>
<p><strong>Motivation</strong>: In the previous class, we introduced some simple schemes to numerically solve ODEs. In order to understand which numerical scheme is most appropriate for a given problem, it is important to know and understand their different properties. For this reason, in this class, we go through some more involved schemes and analyze them with regard to convergence and stability.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Runge-Kutta methods.</li>
<li>Multi-step methods.</li>
<li>Systems of ODEs and absolute stability.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week2.pdf">class</a>, we went through different ways to construct multi-step methods and their convergence analysis. We then looked into absolute stability regions for different methods.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Runge-Kutta methods: Section 11.8 from Quarteroni et al. or Sections 12.{5,12} from Süli & Mayers.</li>
<li>Multi-step methods: Sections 11.5-6 from Quarteroni et al. or Sections 12.6-9 from Süli & Mayers.</li>
<li>Systems of ODEs: Sections 11.9-10 from Quarteroni et al. or Sections 12.10-11 from Süli & Mayers.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://podcasts.ox.ac.uk/odes-and-nonlinear-dynamics-41">Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.1</a>.</li>
<li>Predictor-corrector methods: Section 11.7 from Quarteroni et al.</li>
<li>Richardson extrapolation: Section 16.4 from <a href="http://numerical.recipes/">Numerical Recipes</a>.</li>
<li><a href="https://epubs.siam.org/doi/pdf/10.1137/0904010?">Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equations</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Exercises 12.11, 12.12, 12.19 in Section 12 of Süli & Mayers.
<details><summary>Solution to Exercise 12.11</summary>
<p>
By definition, the truncation error is given by
$$h\tau_n = y_{n+3} + \alpha y_{n+2} -\alpha y_{n+1} - y_n -h\beta y_{n+2}' - h\beta y_{n+1}'$$
Taylor-expanding, we have that
$$y_{n+3} = y_n + 3hy_n' + 9/2h^2 y_n'' + 9/2h^3 y_n''' + 27/8h^4 y_n^{(4)} + O(h^5)$$
$$y_{n+2} = y_n + 2hy_n' + 2h^2 y_n'' + 4/3h^3 y_n''' + 2/3h^4 y_n^{(4)} + O(h^5)$$
$$y_{n+1} = y_n + hy_n' + 1/2h^2 y_n'' + 1/6h^3 y_n''' + 1/24h^4y_n^{(4)} + O(h^5)$$
$$y_{n+2}' = y_n' + 2hy_n'' + 2h^2y_n''' + 4/3 h^3 y_{n}^{(4)} + O(h^4)$$
$$y_{n+1}' = y_n' + hy_n'' + 1/2h^2y_n''' + 1/6h^3 y_{n}^{(4)} + O(h^4)$$
Substituting these in the first equation and imposing the terms in \(h^i\), \(i = 0,1,2,3,4\), to be \(0\), we get the equations
$$3 + \alpha - 2\beta = 0$$
$$27 + 7\alpha - 15\beta = 0$$
$$27 + 5\alpha - 12\beta = 0$$
Solving for these, we find \(\alpha = 9\) and \(\beta = 6\). The resulting method reads
$$y_{n+3} + 9(y_{n+2} - y_{n+1}) - y_n = 6h(f_{n+2} + f_{n+1})$$
The characteristic polynomial is given by
$$\rho(z) = z^3 +9z^2 - 9z -1$$
One of the roots of this polynomial satisfies \(|z|>1\) and this implies that the method is not zero-stable.
</p>
</details>
<details><summary>Solution to Exercise 12.12</summary>
<p>
By definition, the truncation error is given by
$$h\tau_n = y_{n+1} + b y_{n-1} +a y_{n-2} -h y_{n}'$$
Taylor-expanding, we have that
$$y_{n+1} = y_n + hy_n' + 1/2h^2 y_n'' + O(h^3)$$
$$y_{n-1} = y_n - hy_n' + 1/2h^2 y_n'' + O(h^3)$$
$$y_{n-2} = y_n - 2hy_n' + 2h^2 y_n'' + O(h^3)$$
Substituting these in the first equation and solving for the terms in \(h^i\), \(i = 0,1\), to be \(0\), we get \(a=1\) and \(b=-2\). In particular
$$\tau_n = 3/2h y_n'' + O(h^2)$$
and thus the method has order of accuracy \(1\).
The resulting method reads
$$y_{n+1} -2 y_{n-1} + y_{n-2} = h f_{n}$$
The characteristic polynomial is given by
$$\rho(z) = z^3 -2z +1$$
One of the roots of this polynomial satisfies \(|z|>1\) and this implies that the method is not zero-stable.
</p>
</details>
<details><summary>Solution to Exercise 12.19</summary>
<p>
The first equation can be found by substituting \(f(t,y) = \lambda y\) in equation (12.51) in the book and by solving for \(k_1,k_2\) (it is a \(2\times 2\) linear system). Substituting the values of \(A\) and \(b\) from the Butcher tableau in this formula and in the one right before equation (12.51) in the book, and simplifying, we get the formula for \(R(\lambda h)\). Finally, \(p\) and \(q\) are given by \(p,q=-3\pm i \sqrt{3}\). One can see that this implies \(|R(z)|<1\) if \(Re(z) <0\) and thus the method is A-stable.
</p>
</details>
</li>
</ol>
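<p>The zero-stability arguments in Exercises 12.11 and 12.12 rely on the root condition: every root of the first characteristic polynomial must lie in the closed unit disc, and roots of modulus one must be simple. The sketch below checks this numerically. The helper name <code>is_zero_stable</code>, the tolerances, and the two-step Adams-Bashforth comparison case are assumptions added for illustration; the coefficients for Exercise 12.12 are read off from the method \(y_{n+1} - 2y_{n-1} + y_{n-2} = hf_n\).</p>

```python
import numpy as np

def is_zero_stable(rho_coeffs, tol=1e-6):
    """Check the root condition for a first characteristic polynomial.

    rho_coeffs lists the coefficients from the highest power down,
    which is the convention used by numpy.roots.
    """
    roots = np.roots(rho_coeffs)
    moduli = np.abs(roots)
    # No root may lie outside the closed unit disc.
    if np.any(moduli > 1.0 + tol):
        return False
    # Roots on the unit circle must be simple.
    on_circle = roots[np.isclose(moduli, 1.0, atol=tol)]
    for i in range(len(on_circle)):
        for j in range(i + 1, len(on_circle)):
            if np.isclose(on_circle[i], on_circle[j], atol=1e-4):
                return False
    return True

examples = {
    "Exercise 12.11 (alpha = 9)": [1, 9, -9, -1],
    "Exercise 12.12 (a = 1, b = -2)": [1, 0, -2, 1],
    "Adams-Bashforth 2 (assumed comparison case)": [1, -1, 0],
}
for name, coeffs in examples.items():
    print(name, "zero-stable:", is_zero_stable(coeffs))
```

<p>A numerical root finder only locates roots approximately, so the multiplicity test uses a tolerance; for borderline cases an exact factorisation is the safer argument.</p>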
<p><br /></p>
<h1 id="3-resnets">3 ResNets</h1>
<p><strong>Motivation</strong>: The introduction of Residual Networks (ResNets) made it possible to train very deep networks. In this section, we study residual architectures and their properties. We then look into how ResNets approximate ODEs and how this interpretation can motivate neural net architectures and new training approaches. This is important in order to understand the basic models underlying Neural ODEs and gain some insights into their connection to numerical solutions of ODEs.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>ResNets.</li>
<li>ResNets and ODEs.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week3.pdf">class</a>, we defined and briefly discussed residual network architectures. We then looked at a stability notion for ResNets, derived from the connection with the discretisation of ODEs, and at a simple way to make such architectures reversible.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>ResNets:
<ul>
<li><a href="https://www.coursera.org/lecture/convolutional-neural-networks/resnets-HAhz9">ResNets</a>.</li>
<li><a href="https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035">An Overview of ResNet and its Variants</a>.</li>
</ul>
</li>
<li>ResNets and ODEs:
<ul>
<li>Sections 1-3 from <a href="https://arxiv.org/pdf/1710.10348.pdf">Multi-level Residual Networks from Dynamical Systems View</a>.</li>
<li><a href="https://arxiv.org/abs/1709.03698">Reversible Architectures for Arbitrarily Deep Residual Neural Networks</a>.</li>
<li>Invertible ResNets: <a href="https://arxiv.org/pdf/1707.04585.pdf">The Reversible Residual Network: Backpropagation Without Storing Activations</a></li>
<li><a href="https://arxiv.org/pdf/1705.03341.pdf">Stable Architectures for Deep Neural Networks</a>.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>The original ResNets paper: <a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a>.</li>
<li>Another blog post on ResNets: <a href="https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624">Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Do you understand why adding ‘residual layers’ should not degrade the network performance?
<details><summary>Solution</summary>
<p>
Let
$$x_k = x_{k-1} + f(W_k, x_{k-1})$$
be the output of the \(k\)-th layer of a residual net. Then, adding a residual layer consists of considering $$x_{k+1} = x_{k} + f(W_{k+1}, x_{k})$$ instead of \(x_k\). For most common architectures, it holds that \(f(W, x) \equiv 0\) for \(W=0\). This is why adding a layer should not degrade performance: any residual network with \(k\) layers can also be written as a residual network with \(k+1\) layers, by simply taking \(W_{k+1}=0\).
</p>
</details>
</li>
<li>How do the authors of <em>Multi-level Residual Networks from Dynamical Systems View</em> explain the phenomenon that residual networks still perform almost as well after a layer is removed?
<details><summary>Solution</summary>
<p>
Viewing each layer of the network as a time step of the forward Euler method, we have that
$$x^{(n+1)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta)$$
where \(x^{(n)}(x_i)\) is the output of the \(n\)-th layer of the network evaluated on the input point \(x_i\). Then
$$x^{(n+2)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta) + h F(x^{(n+1)}(x_i); \theta)$$
Therefore, removing layer \(n+1\) consists of taking
$$x^{(n+2)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta)$$
instead. As \(h\) is small (and this is motivated by the experiments in Section 3.2), the removed term is small and so is the variation in the output layer. Nevertheless, it must be noticed that this analysis is only based on empirical evaluations.
</p>
</details>
</li>
<li>Implement your favourite ResNet variant.
<details><summary>Example</summary>
<p>
See this <a href="https://keras.io/examples/cifar10_resnet/">tutorial</a> for an example of implementation of a ResNet.
</p>
</details>
</li>
</ol>
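<p>The two arguments in the solutions above (a zero-weight residual layer acts as the identity, and removing one layer drops a single forward Euler increment of size \(h\)) can be illustrated with a small NumPy sketch. The residual branch \(f(W, x) = W_2\,\mathrm{relu}(W_1 x)\), the dimensions, and the step size are assumptions chosen for illustration, not the architecture of any of the cited papers.</p>

```python
import numpy as np

def residual_branch(W, x):
    # f(W, x) = W2 @ relu(W1 @ x); it vanishes identically when W2 = 0.
    W1, W2 = W
    return W2 @ np.maximum(W1 @ x, 0.0)

def resnet_forward(weights, x, h=1.0, skip=None):
    # x_{k+1} = x_k + h * f(W_k, x_k): forward Euler steps of size h.
    for k, W in enumerate(weights):
        if k == skip:
            continue  # dropping a layer removes one Euler increment
        x = x + h * residual_branch(W, x)
    return x

rng = np.random.default_rng(0)
d, depth = 4, 20
weights = [(rng.normal(size=(d, d)), rng.normal(size=(d, d)))
           for _ in range(depth)]
x0 = rng.normal(size=d)

# Appending a layer with zero weights leaves the output unchanged.
zero_layer = (np.zeros((d, d)), np.zeros((d, d)))
out = resnet_forward(weights, x0, h=0.01)
out_plus_zero = resnet_forward(weights + [zero_layer], x0, h=0.01)

# Removing one layer perturbs the output by a term proportional to h.
out_removed = resnet_forward(weights, x0, h=0.01, skip=10)
change = np.linalg.norm(out - out_removed)
print("identity check:", np.allclose(out, out_plus_zero))
print("relative change after removal:", change / np.linalg.norm(out))
```

<p>With a larger \(h\) the same experiment gives a larger change, which matches the caveat that the layer-removal argument depends on the increments being small in practice.</p>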
<p><br /></p>
<h1 id="4-normalising-flows">4 Normalising Flows</h1>
<p><strong>Motivation</strong>: In this class, we take a little detour to learn about Normalising Flows. These are used for density estimation and generative modeling, and their implementation is motivated by a discretisation of an ODE. Understanding them at a basic level is necessary for understanding continuous normalizing flows, a central application of neural ODEs.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Normalising Flows.</li>
<li>End-to-end implementations with neural nets.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week4.pdf">class</a>, we defined normalising flows, starting from the non-parametric form and then deriving their algorithmic (and parametric) implementation. We concluded by discussing some architectures proposed in the literature and their trade-offs.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><em>DE</em>: <a href="https://math.nyu.edu/faculty/tabak/publications/CMSV8-1-10.pdf">Density Estimation by Dual Ascent of the Log-likelihood</a> (Skip Section 3).</li>
<li><a href="https://math.nyu.edu/faculty/tabak/publications/Tabak-Turner.pdf">A family of non-parametric density estimation algorithms</a>.</li>
<li><a href="http://akosiorek.github.io/ml/2018/04/03/norm_flows.html">A post on Normalising flow</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1505.05770.pdf">Variational Inference with Normalizing Flows</a>.</li>
<li><a href="https://arxiv.org/pdf/1302.5125.pdf">High-Dimensional Probability Estimation with Deep Density Models</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>In <em>DE</em>, what is the difference between <script type="math/tex">\rho_t</script> and <script type="math/tex">\tilde{\rho}_t</script>, i.e. what do they represent?
<details><summary>Solution</summary>
<p>
The function \(\tilde{\rho}_t\) is the density of the distribution of the random variable \(\phi_t^{-1}(y)\) where \(y\sim \mu\). The function \(\rho_t\) is the density of the distribution of the random variable \(\phi_t(x)\) where \(x\sim \rho\).
</p>
</details>
</li>
<li>What is the computational complexity of evaluating a determinant of an <script type="math/tex">N\times N</script> matrix, and why is that relevant in this context?
<details><summary>Solution</summary>
<p>
In general, the cost of computing the determinant of an \(N\times N\) matrix with standard methods such as LU decomposition is \(O(N^3)\). To compute densities transported by normalising flows, we need to compute the determinants of the Jacobians; therefore, an important feature of practical normalising flows is that the structure of the Jacobian must allow an efficient computation of its determinant. See this week's notes for more discussion on this.
</p>
</details>
</li>
</ol>
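<p>To make the point about determinant cost concrete, here is a minimal sketch comparing a general-purpose log-determinant with the shortcut available for a triangular Jacobian, together with the change-of-variables formula for the density. The linear map \(y = Lx\) with lower-triangular \(L\), the standard normal base density, and the variable names are assumptions for illustration; they stand in for the structured Jacobians used by practical flows.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A linear flow y = L @ x with lower-triangular L and positive diagonal,
# a stand-in for the triangular Jacobians of structured flows.
L = np.tril(rng.normal(size=(n, n)))
np.fill_diagonal(L, np.exp(rng.normal(size=n)))

# General-purpose route: LU-based log-determinant, O(n^3).
_, logdet_general = np.linalg.slogdet(L)

# Structured route: sum of the log-diagonal entries, O(n).
logdet_triangular = np.sum(np.log(np.diag(L)))

def log_density_y(y):
    # Change of variables with a standard normal base density:
    # log p_Y(y) = log p_X(L^{-1} y) - log|det L|.
    x = np.linalg.solve(L, y)
    log_px = -0.5 * x @ x - 0.5 * n * np.log(2.0 * np.pi)
    return log_px - logdet_triangular

print(logdet_general, logdet_triangular)
```

<p>Both routes give the same number here; the difference is only in cost, which is what matters when the determinant has to be evaluated for every sample and every layer during training.</p>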
<p><br /></p>
<h1 id="5-the-adjoint-method-and-auto-diff">5 The Adjoint Method (and Auto-Diff)</h1>
<p><strong>Motivation</strong>: The adjoint method is a numerical method for efficiently computing the gradient of a function in numerical optimization problems. Understanding this method is essential to understand how to train ‘continuous depth’ nets. We also review the basics of Automatic Differentiation, which will help us understand the efficiency of the algorithm proposed in the NeuralODE paper.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Adjoint Method.</li>
<li>Auto-Diff.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week5.pdf">class</a>, we discussed the adjoint method. We started from the case of linear systems and went through non-linear equations and recurrence relations. We concluded by discussing their application to ODE constrained optimization problems, which is the case of interest for Neural ODEs.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Section 8.7 from <em>CSE</em>: <a href="http://math.mit.edu/~gs/cse/">Computational Science and Engineering</a>.</li>
<li>Sections 2 and 3 from <a href="http://www.jmlr.org/papers/volume18/17-468/17-468.pdf">Automatic Differentiation in Machine Learning: a Survey</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://math.mit.edu/~stevenj/notes.html">Prof. Steven G. Johnson’s notes on adjoint method</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Exercises 1, 2, 3 from Section 8.7 of <em>CSE</em>.
<details><summary>Solution to Exercise 1</summary>
<p>
This follows immediately by noticing that the number of multiply-add operations of multiplying an \(N\times M\) matrix with an \(M\times P\) matrix is given by \(O(NMP)\).
</p>
</details>
<details><summary>Solution to Exercise 2</summary>
<p>
Apply the chain rule. Since \(\frac{\partial C}{\partial S} = 2S\) and \(\frac{dT}{dS} = \frac{\partial T}{\partial S} + \frac{\partial T}{\partial C}\frac{\partial C}{\partial S}\), we get \(\frac{d T}{d S} = 1 -2S\).
</p>
</details>
<details><summary>Solution to Exercise 3</summary>
<p>
This follows from Exercise 1 by seeing \(u^T\) and \(w^T\) as \(1\times N\) matrices and \(v\) as an \(N\times 1\) matrix.
</p>
</details>
</li>
<li>Consider the problem of optimizing a real-valued function <script type="math/tex">g</script> over the solution of the ODE <script type="math/tex">y'(t) = A(p)y(t)</script>, <script type="math/tex">y(0) = b(p)</script> at time <script type="math/tex">T>0</script>: <script type="math/tex">\min_p\, g(T) \doteq g(y(T; p))</script>. Find <script type="math/tex">\frac{dg(T)}{dp}</script> by solving the ODE and by applying chain rule. Check the correctness of equations (16-17) in <em>CSE</em>.
<details><summary>Solution</summary>
<p>
It holds that
$$y(t) = e^{tA(p)}y(0)$$
Applying the chain rule, we get
$$\frac{dg}{dp} = \frac{dg}{dy}e^{TA(p)}\frac{db}{dp} + T\frac{dg}{dy}\frac{\partial A}{\partial p}e^{TA(p)}b(p)$$
On the other hand, the adjoint ODE reads
$$\lambda'(t) = -A(p)^T\lambda(t)$$
with the final condition \(\lambda(T) = \left(\frac{\partial g}{\partial y}\right)^T\), which gives \(\lambda(t) = e^{A(p)^T(T-t)}\left(\frac{\partial g}{\partial y}\right)^T\). Equation (17) from <i>CSE</i> gives
$$\frac{dg}{dp} = \left(e^{TA(p)^T}\left(\frac{\partial g}{\partial y}\right)^T\right)^T\frac{\partial b}{\partial p} + \int_0^T \frac{\partial g}{\partial y} e^{A(p)(T-t)}\frac{\partial A}{\partial p}e^{tA(p)}b(p)\,dt$$
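which coincides with the above whenever \(A(p)\) and \(\frac{\partial A}{\partial p}\) commute (for instance in the scalar case): the integrand then equals \(\frac{\partial g}{\partial y}\frac{\partial A}{\partial p}e^{TA(p)}b(p)\) for every \(t\), and the integral contributes the factor \(T\).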
</p>
</details>
</li>
<li>Prove equations (14-15) in Section 8.7 of <em>CSE</em>.
<details><summary>Solution</summary>
<p>
By definition, it holds that
$$\frac{dG}{dp} = \int_0^T\left(\frac{\partial g}{\partial p} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial p}\right)\,dt $$
On the other hand, it holds that
$$\lambda(0)^T\frac{\partial u}{\partial p}(0) + \int_0^T\lambda^T \frac{\partial f}{\partial p}\,dt = \int_0^T \left( \lambda^T\frac{\partial f}{\partial p} -\frac{d}{dt}\left( \lambda^T \frac{\partial u}{\partial p}\right) \right)\,dt $$
Using equation (14) from <i>CSE</i> and the equality \(\frac{d}{dt}\frac{\partial u}{\partial p} = \frac{\partial f}{\partial p} + \frac{\partial f}{\partial u}\frac{\partial u}{\partial p}\), we get
$$\int_0^T \left( \lambda^T\frac{\partial f}{\partial p} -\frac{d}{dt}\left( \lambda^T \frac{\partial u}{\partial p}\right) \right)\,dt = \int_0^T \left( \lambda^T\frac{\partial f}{\partial p} + \lambda^T \frac{\partial f}{\partial u}\frac{\partial u}{\partial p} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial p} - \lambda^T \frac{\partial f}{\partial p} -\lambda^T \frac{\partial f}{\partial u}\frac{\partial u}{\partial p} \right)\,dt$$
which gives
$$
\lambda(0)^T\frac{\partial u}{\partial p}(0) + \int_0^T \lambda^T\frac{\partial f}{\partial p}\,dt = \int_0^T \frac{\partial g}{\partial u}\frac{\partial u}{\partial p}\,dt
$$
and thus completes the proof.
</p>
</details>
</li>
</ol>
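<p>The adjoint recipe for the linear problem \(y' = A(p)y\) can also be checked numerically. The sketch below solves the forward ODE, integrates the adjoint ODE \(\lambda' = -A(p)^T\lambda\) backward from \(\lambda(T) = (\partial g/\partial y)^T\), and compares \(\int_0^T \lambda^T \frac{\partial A}{\partial p} y\,dt\) with a central finite difference. The damped-oscillator matrix, the objective \(g = \frac{1}{2}\|y(T)\|^2\), the \(p\)-independent initial condition, the RK4 integrator, and all names are assumptions made for illustration.</p>

```python
import numpy as np

def rk4_linear(M, z0, h, n_steps):
    # Integrate z' = M z with classical RK4; h may be negative
    # (backward in time). Returns all grid values, shape (n_steps + 1, dim).
    zs = [z0]
    z = z0
    for _ in range(n_steps):
        k1 = M @ z
        k2 = M @ (z + 0.5 * h * k1)
        k3 = M @ (z + 0.5 * h * k2)
        k4 = M @ (z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        zs.append(z)
    return np.array(zs)

def A(p):
    # Damped oscillator with stiffness p (an illustrative choice).
    return np.array([[0.0, 1.0], [-p, -0.1]])

dA_dp = np.array([[0.0, 0.0], [-1.0, 0.0]])
b = np.array([1.0, 0.0])  # initial condition, taken independent of p
T, n_steps = 2.0, 2000
h = T / n_steps

def g_of_p(p):
    # Objective g = 0.5 * |y(T)|^2.
    yT = rk4_linear(A(p), b, h, n_steps)[-1]
    return 0.5 * yT @ yT

def adjoint_gradient(p):
    ys = rk4_linear(A(p), b, h, n_steps)            # forward solve y(t_k)
    lam_T = ys[-1]                                  # lambda(T) = (dg/dy)^T = y(T)
    lams = rk4_linear(-A(p).T, lam_T, -h, n_steps)  # backward solve from T to 0
    lams = lams[::-1]                               # reorder to t_0, ..., t_N
    integrand = np.einsum("ki,ij,kj->k", lams, dA_dp, ys)
    # Composite trapezoidal rule for the integral of lambda^T (dA/dp) y.
    return h * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

p0, eps = 2.0, 1e-5
fd = (g_of_p(p0 + eps) - g_of_p(p0 - eps)) / (2 * eps)
adj = adjoint_gradient(p0)
print("finite difference:", fd, "adjoint:", adj)
```

<p>The adjoint route needs one forward and one backward solve regardless of how many parameters there are, whereas finite differences need two extra forward solves per parameter; with a single scalar \(p\) the example only demonstrates agreement, not the cost advantage.</p>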
<p><br /></p>
<h1 id="6-the-paper">6 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the paper! Here is a summary of what’s going on to help with your understanding:</p>
<p>Any residual network can be seen as the explicit Euler discretisation of a certain ODE; given the network parameters, any numerical ODE solver can be used to evaluate the output layer. The application of the adjoint method makes it possible to efficiently back-propagate (and thus train) these models. The same idea can be used to train time-continuous normalising flows. In this case, moving to the continuous formulation allows us to avoid the computation of the determinant of the Jacobian, one of the major bottlenecks of normalising flows. Neural ODEs can also be used to model latent dynamics in time-series modeling, allowing us to easily tackle irregularly sampled data.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Neural ODEs and their adjoint method.</li>
<li>Continuous normalising flows.</li>
</ol>
<p><strong>Notes</strong>: In this <a href="/assets/nodes_notes/week6.pdf">class</a>, we defined Neural ODEs and derived the respective adjoint method, essential for their implementation. We then discussed continuous normalising flows and the computational advantages offered by Neural ODEs in this setting.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1806.07366">Neural Ordinary Differential Equations</a>.</li>
<li><a href="https://rkevingibson.github.io/blog/neural-networks-as-ordinary-differential-equations/">A blog post on NeuralODEs</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>A follow-up paper by the authors on scalable continuous normalizing flows: <a href="https://arxiv.org/abs/1810.01367">Free-form Continuous Dynamics for Scalable Reversible Generative Models</a>.</li>
</ol>
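<p>The claim that the continuous formulation replaces a determinant with a cheaper quantity can be illustrated on linear dynamics \(z' = Wz\). The Neural ODE paper's instantaneous change of variables states \(\frac{\partial \log p(z(t))}{\partial t} = -\mathrm{tr}\left(\frac{\partial f}{\partial z}\right)\), so only a trace is needed. The sketch below compares the accumulated log-determinants of many small Euler (residual) steps with \(T\,\mathrm{tr}(W)\). The matrix \(W\), the horizon \(T\), and the number of steps are assumptions for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = 0.5 * rng.normal(size=(d, d))  # linear dynamics z' = W z (illustrative)
T, n_steps = 1.0, 10000
h = T / n_steps

# Discrete view: each Euler step z -> (I + h W) z is one residual layer,
# and a normalising flow pays one log-determinant per layer.
_, logdet_layer = np.linalg.slogdet(np.eye(d) + h * W)
logdet_discrete = n_steps * logdet_layer

# Continuous view: d(log p)/dt = -tr(df/dz) = -tr(W), so the total change in
# log-density over [0, T] only needs a trace.
logdet_continuous = T * np.trace(W)

print(logdet_discrete, logdet_continuous)
```

<p>For non-linear dynamics the trace depends on \(z(t)\) and has to be integrated along the trajectory together with the state, but the per-step cost is still that of a trace rather than a determinant.</p>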
<p><br /></p>
luca
This guide would not have been possible without the help and feedback from many people.
Wasserstein GAN
2019-05-02T14:00:00+00:00
https://www.depthfirstlearning.com/2019/WassersteinGAN
<p>[Editor’s Note: We are especially proud of this one. James and his group went above and beyond the call of duty and made a guide from their class that we feel is especially superb for understanding their target paper. Moving forward, he has forced us to up our game because it will be hard to release a curriculum that is not as strong as this one. We highly recommend earnestly studying with this at hand.]</p>
<p>A number of people need to be thanked for their parts in making this happen. Thank you to Martin Arjovsky, Avital Oliver, Cinjon Resnick, Marco Cuturi, Kumar Krishna Agrawal, and Ishaan Gulrajani for contributing to this guide.</p>
<p>Of course, thank you to Sasha Naidoo, Egor Lakomkin, Taliesin Beynon, Sebastian Bodenstein, Julia Rozanova, Charline Le Lan, Paul Cresswell, Timothy Reeder, and Michał Królikowski for beta-testing the guide and giving invaluable feedback. A special thank you to Martin Arjovsky, Tim Salimans, and Ishaan Gulrajani for joining us for the weekly meetings.</p>
<p>Finally, thank you to Ulrich Paquet and Stephan Gouws for introducing many of us to Cinjon.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/wgan-deps.svg" width="400"></iframe>
<div>Concepts used in Wasserstein GAN. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>The Wasserstein GAN (WGAN) is a GAN variant which uses the 1-Wasserstein distance, rather than the JS-Divergence, to measure the difference between the model and target distributions. This seemingly simple change has big consequences! Not only does WGAN train more easily (a common struggle with GANs) but it also achieves very impressive results, generating some stunning images. By studying the WGAN, and its variant the WGAN-GP, we can learn a lot about GANs and generative models in general. After completing this curriculum you should have an intuitive grasp of why the WGAN and WGAN-GP work so well, as well as a thorough understanding of the mathematical reasons for their success. You should be able to apply this knowledge to understanding cutting edge research into GANs and other generative models.</p>
<p><br /></p>
<h1 id="1-basics-of-probability--information-theory">1 Basics of Probability & Information Theory</h1>
<p><strong>Motivation</strong>: To understand GAN training (and eventually WGAN & WGAN-GP) we need to first have some understanding of probability and information theory. In particular, we will focus on Maximum Likelihood Estimation and the KL-Divergence. This week we will make sure that we understand the basics so that we can build upon them in the following weeks.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Probability Theory</li>
<li>Information Theory</li>
<li>Mean Squared Error (MSE)</li>
<li>Maximum Likelihood Estimation (MLE)</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Chs 3.1 - 3.5 of <a href="https://www.deeplearningbook.org/">Deep Learning</a> by Goodfellow <em>et al.</em> (the DL book)
<ul>
<li>These chapters are here to introduce fundamental concepts such as random variables, probability distributions, marginal probability, and conditional probability. If you have the time, reading the whole of chapter 3 is highly recommended. A solid grasp of these concepts will be important foundations for what we will cover over the next 5 weeks.</li>
</ul>
</li>
<li>Ch 3.13 of the DL book
<ul>
<li>This chapter covers KL-Divergence & the idea of distances between probability distributions which will also be a key concept going forward.</li>
</ul>
</li>
<li>Chs 5.1.4 and 5.5 of the DL book
<ul>
<li>The aim of these chapters is to make sure that everyone understands maximum likelihood estimation (MLE) which is a fundamental concept in machine learning. It is used explicitly or implicitly in both supervised and unsupervised learning as well as in both discriminative and generative methods. In fact, many methods using gradient descent are doing approximate MLE. It is important to understand MLE both as a fundamental concept and in its practical use in machine learning. Note that, if you are not familiar with the notation used in these chapters, you might want to start at the beginning of the chapter. Also note that, if you are not familiar with the concept of estimators, you might want to read Ch 5.4. However, you can probably get by simply knowing that minimizing mean squared error (MSE) is a method for optimizing some approximation for a function we are trying to learn (an estimator).</li>
</ul>
</li>
<li>The first 3 sections of <a href="https://colinraffel.com/blog/gans-and-divergence-minimization.html">GANs and Divergence Minimization</a> (check out the rest after week 3)
<ul>
<li>This blog gives a great description of the connections between the KL divergence and MLE. It also provides a nice teaser for what is to come in the following weeks, particularly with regards to the difficulties of training generative models.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Ch 2 from <a href="http://www.inference.org.uk/itprnn/book.pdf">Information Theory, Inference & Learning Algorithms by David MacKay</a> (MacKay’s book)
<ul>
<li>This is worth reading if you feel like you didn’t quite grok the probability and information theory content in the DL book. MacKay provides a different perspective on these ideas which might help make things click. These concepts are going to be crucial going forward so it is definitely worth making sure you are comfortable with them.</li>
</ul>
</li>
<li>Chs 1.6 and 10.1 of <a href="https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf">Pattern Recognition and Machine Learning by Christopher M. Bishop</a> (PRML)
<ul>
<li>Similarly, this is worth reading if you don’t feel comfortable with the KL-Divergence and want another perspective.</li>
</ul>
</li>
<li>Aurélien Géron’s video <a href="https://www.youtube.com/watch?v=ErfnhcEV1O8">A Short Introduction to Entropy, Cross-Entropy and KL-Divergence</a>
<ul>
<li>An introductory, but interesting video that describes the KL-Divergence.</li>
</ul>
</li>
<li><a href="http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf">Notes on MLE and MSE</a>
<ul>
<li>An alternative discussion on the links between MLE and MSE.</li>
</ul>
</li>
<li>The first 37ish minutes of Arthur Gretton’s MLSS Africa talk on comparing probability distributions — <a href="https://www.youtube.com/watch?v=5sijxSg8P14">video</a>, <a href="https://drive.google.com/file/d/1RNrgDs5xw-9HTjikFU1L0iO1PBMDaGwE/view">slides</a>
<ul>
<li>An interesting take on comparing probability distributions. The first 37 minutes are fairly general and give some nice insights as well as some foreshadowing of what we will be covering in the following weeks. The rest of the talk is also very interesting and ends up covering another GAN called the MMD-GAN, but it isn’t all that relevant for us.</li>
</ul>
</li>
<li><a href="https://pdfs.semanticscholar.org/6af2/fa8887a2cb0386f79e3a2822b661e2dc8369.pdf">On Integral Probability Metrics, φ-Divergences and Binary Classification</a>
<ul>
<li>For those of you whose curiosity was piqued by Arthur’s talk, this paper goes into depth describing IPMs (such as MMD and the 1-Wasserstein distance) and comparing them to the φ-divergences (such as the KL-Divergence). <em>This paper is fairly heavy mathematically so don’t be discouraged if you struggle to follow it</em>.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The questions this week are here to make sure that you can put all the theory you’ve been reading about to a little practice. For example, do you understand how to perform calculations on probabilities, or, what Bayes’ rule is and how to use it?</em></p>
<ol>
<li>Examples/Exercises 2.3, 2.4, 2.5, 2.6, and 2.26 in MacKay’s book
<ul>
<li>Bonus: 2.35, and 2.36</li>
</ul>
<details><summary>Solutions</summary>
<p>
Examples 2.3, 2.5, and 2.6 have their solutions directly following them.
</p>
<p>
Exercise 2.26 has a solution on page 44.
</p>
<p>
Exercise 2.35 has a solution on page 45.
</p>
<p>
Exercise 2.36: 1/2 and 2/3.
</p>
<p>
(Page numbers from Version 7.2 (fourth printing) March 28, 2005, of MacKay's book.)
</p>
</details>
</li>
<li>Derive Bayes’ rule using the definition of conditional probability.
<details><summary>Solution</summary>
<p>
The definition of conditional probability tells us that
$$p(y|x) = \frac{p(y,x)}{p(x)}$$
and that
$$p(x|y) = \frac{p(y,x)}{p(y)}.$$
From this we can see that \(p(y,x) = p(y|x)p(x) = p(x|y)p(y)\). Finally if we divide everything by \(p(x)\) we get
$$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$
which is Bayes' rule.
</p>
</details>
</li>
<li>Exercise 1.30 in PRML
<details><summary>Solution</summary>
<p>
<a href="https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians?rq=1">Here</a> is a solution.
</p>
<p>
The result should be \(\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\).
</p>
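<p>
The closed form above can be sanity-checked with a Monte Carlo estimate of \(\mathbb{E}_p[\log p(x) - \log q(x)]\). The parameter values, sample size, and function names below are assumptions chosen for illustration.
</p>

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    # Closed form for KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ).
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)

def log_normal_pdf(x, mu, sigma):
    # Log-density of a univariate Gaussian with mean mu and std sigma.
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu)**2 / (2.0 * sigma**2)

# Monte Carlo estimate of E_p[log p(x) - log q(x)] with x drawn from p.
rng = np.random.default_rng(0)
mu1, sigma1, mu2, sigma2 = 0.0, 1.0, 1.0, 2.0
x = rng.normal(mu1, sigma1, size=200_000)
kl_mc = np.mean(log_normal_pdf(x, mu1, sigma1) - log_normal_pdf(x, mu2, sigma2))
kl_exact = kl_gaussians(mu1, sigma1, mu2, sigma2)
print(kl_mc, kl_exact)
```

<p>
Swapping the roles of the two Gaussians gives a different value, a quick reminder that the KL-Divergence is not symmetric.
</p>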
</details>
</li>
<li>Prove that minimizing MSE is equivalent to maximizing likelihood (assuming Gaussian distributed data).
<details><summary>Solution</summary>
<p>
Mean squared error is defined as
$$MSE = \frac{1}{N}\sum^N_{n=1}(\hat{y}_n - y_n)^2$$
where \(N\) is the number of examples, \(y_n\) are the true labels, and \(\hat{y}_n\) are the predicted labels.
Log-likelihood is defined as \(LL = \log(p(\mathbf{y}|\mathbf{x}))\). Assuming that the examples are independent and identically distributed (i.i.d.) we get
$$ LL = \log\prod_{n=1}^Np(y_n|x_n) = \sum_{n=1}^{N}\log p(y_n|x_n). $$
Now, substituting in the definition of the normal distribution
$$ \mathcal{N}(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$
for \(p(y_n|x_n)\) and simplifying the expression, we get
$$ LL = \sum_{n=1}^{N} \left[-\frac{1}{2}\log(2\pi) - \log\sigma - \frac{(y_n - \mu_n)^2}{2\sigma^2}\right].$$
Finally, replacing \(\mu_n\) with \(\hat{y}_n\) (because we use the mean as our prediction), and noticing that only the third term depends on the predictions (the others are constants), we arrive at the conclusion that to maximize the log-likelihood we must minimize
$$ \sum_{n=1}^{N}\frac{(y_n - \hat{y}_n)^2}{2\sigma^2} = \frac{N}{2\sigma^2}\,MSE, $$
which, because \(\frac{N}{2\sigma^2}\) is a positive constant, is the same as minimizing the MSE.
</p>
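<p>
The equivalence can also be seen numerically. The sketch below uses a hypothetical one-parameter model \(\hat{y} = wx\), made-up data, and a fixed noise level \(\sigma\), and checks that the slope minimizing the MSE is also the slope maximizing the Gaussian log-likelihood.
</p>

```python
import math

# Hypothetical 1-D regression data and a fixed, known noise level sigma.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
sigma = 0.5

def mse(w):
    # Mean squared error of the predictions y_hat = w * x.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def log_likelihood(w):
    # Sum over examples of log N(y_n; w * x_n, sigma^2).
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(sigma)
        - (y - w * x) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

# Grid search over candidate slopes from 0 to 2 in steps of 0.001.
candidates = [i / 1000 for i in range(0, 2001)]
w_min_mse = min(candidates, key=mse)
w_max_ll = max(candidates, key=log_likelihood)
print(f"argmin MSE: {w_min_mse}, argmax log-likelihood: {w_max_ll}")
```

<p>
A grid search is used here only to keep the sketch dependency-free; any optimizer would do.
</p>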
</details>
</li>
<li>Prove that maximizing likelihood is equivalent to minimizing KL-Divergence.
<details><summary>Solution</summary>
<p>
KL-Divergence is defined as
$$ D_{KL}(p||q) = \sum_x p(x) \log\frac{p(x)}{q(x|\bar{\theta})}$$
where \(p(x)\) is the true data distribution, \(q(x|\bar{\theta})\) is our model distribution, and \(\bar{\theta}\) are the parameters of our model. We can rewrite this as
$$ D_{KL}(p||q) = \mathbb{E}_p[\log p(x)] - \mathbb{E}_p[\log q(x|\bar{\theta})]$$
where the notation \(\mathbb{E}_p[f(x)]\) means that we are taking the expected value of \(f(x)\) by sampling \(x\) from \(p(x)\). We notice that minimizing \(D_{KL}(p||q)\) means maximizing \(\mathbb{E}_p[\log q(x|\bar{\theta})]\) since the first term in the expression above is constant (we can't change the true data distribution). Now, to maximize the likelihood of our model, we need to maximize
$$q(\bar{x}|\bar{\theta}) = \prod_{n=1}^Nq(x_n|\bar{\theta}).$$
Recall that taking a logarithm does not change the result of optimization which means that we can maximize
$$\log q(\bar{x}|\bar{\theta}) = \sum_{n=1}^N\log q(x_n|\bar{\theta}).$$
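If we divide this term by the constant factor \(N\), we get \(\frac{1}{N}\sum_{n=1}^N\log q(x_n|\bar{\theta})\), which is a sample-based estimate of \(\mathbb{E}_p[\log q(x|\bar{\theta})]\). This is exactly the term that we need to maximize in order to minimize the KLD.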
</p>
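<p>
Here is a minimal sketch of this equivalence for a Bernoulli model. The dataset is hypothetical, and its empirical distribution stands in for the true data distribution \(p(x)\).
</p>

```python
import math

# Hypothetical binary dataset: three ones and seven zeros, so the empirical p(1) is 0.3.
data = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p = {1: data.count(1) / len(data), 0: data.count(0) / len(data)}

def q(x, theta):
    # Bernoulli model with parameter theta = q(1 | theta).
    return theta if x == 1 else 1.0 - theta

def avg_log_likelihood(theta):
    # (1/N) * sum_n log q(x_n | theta)
    return sum(math.log(q(x, theta)) for x in data) / len(data)

def kl_divergence(theta):
    # D_KL(p || q) = sum_x p(x) log(p(x) / q(x | theta))
    return sum(p[x] * math.log(p[x] / q(x, theta)) for x in (0, 1))

# Grid of candidate parameters 0.01, 0.02, ..., 0.99.
thetas = [i / 100 for i in range(1, 100)]
theta_mle = max(thetas, key=avg_log_likelihood)
theta_kl = min(thetas, key=kl_divergence)
print(f"argmax likelihood: {theta_mle}, argmin KL: {theta_kl}")
```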
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week1.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!</p>
<p><br /></p>
<h1 id="2-generative-models">2 Generative Models</h1>
<p><strong>Motivation</strong>: This week we’ll take a look at generative models. We will aim to understand how they are similar and how they differ from the discriminative models covered last week. In particular, we want to understand the challenges that come with training generative models.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Generative Models</li>
<li>Evaluation of Generative Models</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>The “Overview”, “What are generative models?”, and “Differentiable inference” sections of the webpage for David Duvenaud’s <a href="https://www.cs.toronto.edu/~duvenaud/courses/csc2541/index.html">course on Differentiable Inference and Generative Models</a>.
<ul>
<li>Here we want to get a sense of the big picture of what generative models are all about. There are also some fantastic resources here for further reading if you are interested.</li>
</ul>
</li>
<li><a href="https://arxiv.org/pdf/1511.01844.pdf">A note on the evaluation of generative models</a>
<ul>
<li>This paper is the real meat of this week’s content. After reading this paper you should have a good idea of the challenges involved in evaluating (and therefore training) generative models. Understanding these issues will be important for appreciating what the WGAN is all about. Don’t worry too much if some sections don’t completely make sense yet - we’ll be returning to the key ideas in the coming weeks.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Ch 20 of the DL book, particularly:
<ul>
<li>Differentiable Generator Networks (Ch 20.10.2)
<ul>
<li>Description of a broad class of generative models to which GANs belong which will help contextualize GANs when we look at them next week.</li>
</ul>
</li>
<li>Variational Autoencoders (Ch 20.10.3)
<ul>
<li>Description of another popular class of differentiable generative model which might be nice to contrast to GANs next week.</li>
</ul>
</li>
<li>Evaluating Generative Models (Ch 20.14)
<ul>
<li>Summary of techniques and challenges for evaluating generative models which might put Theis <em>et al.</em>’s paper into context.</li>
</ul>
</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The first two questions are here to make sure that you understand what a generative model is and how it differs from a discriminative model. The last two questions are a good barometer for determining your understanding of the challenges involved in training generative models.</em></p>
<ol>
<li>Fit a <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.multivariate_normal.html">multivariate Gaussian distribution</a> to the <a href="https://scikit-learn.org/stable/datasets/index.html#iris-plants-dataset">Fisher Iris dataset</a> using maximum likelihood estimation (see Section 2.3.4 of PRML for help) then:
<ol>
<li>Determine the probability of seeing a flower with a sepal length of 7.9, a sepal width of 4.4, a petal length of 6.9, and a petal width of 2.5.</li>
<li>Determine the distribution of flowers with a sepal length of 6.3, a sepal width of 4.8, and a petal length of 6.0 (see section 2.3.2 of PRML for help).</li>
<li>Generate 20 flower measurements.</li>
<li>Generate 20 flower measurements with a sepal length of 6.3.</li>
</ol>
<p>(Congrats, you’ve just trained and used a generative model!)</p>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/JamesAllingham/DFL-WGAN/blob/master/DFL_WGAN_week2.ipynb">Here</a> is a Jupyter notebook with solutions. Open the notebook on your computer or Google colab to render the characters properly.
</p>
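<p>
For a rough idea of the moving parts before opening the notebook, here is a two-feature sketch of the same workflow: maximum likelihood fitting, conditioning on one feature, and sampling. The six data points are made up and are not taken from the Iris dataset, and the function names are hypothetical.
</p>

```python
import random

# Hypothetical 2-D measurements (feature 1, feature 2); not the real Iris data.
data = [(5.1, 1.4), (4.9, 1.5), (6.3, 4.9), (5.8, 4.0), (7.1, 5.9), (6.5, 5.5)]
n = len(data)

# Maximum likelihood estimates: sample mean and covariance with a 1/N factor.
mu1 = sum(a for a, _ in data) / n
mu2 = sum(b for _, b in data) / n
s11 = sum((a - mu1) ** 2 for a, _ in data) / n
s22 = sum((b - mu2) ** 2 for _, b in data) / n
s12 = sum((a - mu1) * (b - mu2) for a, b in data) / n

def conditional(x1):
    # Mean and variance of feature 2 given feature 1 = x1 for a bivariate Gaussian.
    mean = mu2 + s12 / s11 * (x1 - mu1)
    var = s22 - s12 ** 2 / s11
    return mean, var

def sample_conditional(x1, count, seed=0):
    # Draw samples of feature 2 from the conditional Gaussian.
    rng = random.Random(seed)
    mean, var = conditional(x1)
    return [rng.gauss(mean, var ** 0.5) for _ in range(count)]

samples = sample_conditional(6.3, 20)
print(conditional(6.3))
print(samples[:3])
```

<p>
The full exercise uses the same formulas with a 4-dimensional mean vector and covariance matrix, which is where a library such as NumPy or SciPy becomes convenient.
</p>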
</details>
</li>
<li>Describe in your own words the difference between a generative and a discriminative model.
<details><summary>Solution</summary>
<p>
This is an open ended question but here are some of the differences:
<ul>
<li>In the generative setting, we usually model \(p(x)\), while in the discriminative setting we usually model \(p(y|x)\).</li>
<li>Generative models are usually non-deterministic, and we can sample from them, while discriminative models are often deterministic, and we can't necessarily sample from them.</li>
<li>Discriminative models need labels while generative models typically do not.</li>
<li>In generative modelling the goal is often to learn some latent variables that describe the data in a compact manner; this is not usually the case for discriminative models.</li>
</ul>
</p>
</details>
</li>
<li>Theis <em>et al.</em> claim that “a model with zero KL divergence will produce perfect samples” — why is this the case?
<details><summary>Solution</summary>
<p>
As we showed last week, \(D_{KL}(p||q) = 0\) if and only if \(p(x)\), the true data distribution, and \(q(x)\) the model distribution, are the same.
</p>
<p>
Therefore, if \(D_{KL}(p||q) = 0\), samples from our model will be indistinguishable from the real data.
</p>
</details>
</li>
<li>Explain why a high log-likelihood for a generative model might not correspond to realistic samples.
<details><summary>Solution</summary>
<p>
Theis <i>et al.</i> outlined two scenarios where this is the case:
<ul>
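<li><b>Low likelihood & good samples</b>: our model can overfit to the training data and produce good samples. However, because the model has overfitted, it will have a low likelihood for unseen test data.</li>
<li><b>High likelihood & poor samples</b>: here the issue is that log-likelihoods of high dimensional data are typically very large in magnitude, so even a drastic change to the model barely registers. For example, a mixture that samples from a good model only 1% of the time and from noise the other 99% of the time loses at most \(\log 100 \approx 4.61\) nats of log-likelihood compared with the good model alone, yet almost all of its samples are noise.</li>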
</ul>
</p>
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week2.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Tim Salimans sit in on the session!</p>
<p><br /></p>
<h1 id="3-generative-adversarial-networks">3 Generative Adversarial Networks</h1>
<p><strong>Motivation</strong>: Let’s read the original GAN paper. Our main goal this week is to understand how GANs solve some of the problems with training generative models, as well as some of the new issues that come with training GANs.</p>
<p><em>The second paper this week is actually optional but <strong>highly</strong> recommended. We think that it contains some interesting material and sets the stage for looking at WGAN in week 4; however, the core concepts will be repeated then. Depending on your interest you might want to spend more or less time on this paper (we recommend that most people don’t spend too much time).</em></p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Generative Adversarial Networks</li>
<li>The Jensen-Shannon Divergence (JSD)</li>
<li>Why training GANs is hard</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1406.2661.pdf">Goodfellow’s GAN paper</a>
<ul>
<li>This is the paper that started it all, and if we want to understand WGAN & WGAN-GP we’d better understand the original GAN.</li>
</ul>
</li>
<li><a href="https://arxiv.org/pdf/1701.04862.pdf">Toward Principled Methods for Generative Adversarial Network Training</a>
<ul>
<li>This paper explores the difficulties in training GANs and is a precursor to the WGAN paper that we will look at next week. The paper is quite math heavy so unless math is your cup of tea you shouldn’t spend too much time trying to understand the details of the proofs, corollaries, and lemmas. The important things to understand here are: what is the problem, and how do the proposed solutions solve the problem. Focus on the introduction, the English descriptions of the theorems and the figures. <strong>Don’t spend too much time on this paper</strong>.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1701.00160.pdf">Goodfellow’s tutorial on GANs</a>
<ul>
<li>A more in-depth explanation of GANs from the man himself.</li>
</ul>
</li>
<li>The GAN chapter in the DL book (20.10.4)
<ul>
<li>A summary of what a GAN is and some of the issues involved in GAN training.</li>
</ul>
</li>
<li>Coursera (Stanford) course on game theory videos: <a href="https://www.youtube.com/watch?v=-j44yHK0nn4&index=5&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh">1-05</a>, <a href="https://www.youtube.com/watch?v=BsgnKTfOxTs&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh&index=11">2-01</a>, <a href="https://www.youtube.com/watch?v=FU6ax5K9HIA&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh&index=12">2-02</a>, and <a href="https://www.youtube.com/watch?v=RIneClCKgAw&list=PLGdMwVKbjVQ8DhP8dgrBO1B5etE81Hxxh&index=22">3-04b</a>
<ul>
<li>This is really here just for people who are interested in game theory ideas such as minimax.</li>
</ul>
</li>
<li>Finish reading <a href="https://colinraffel.com/blog/gans-and-divergence-minimization.html">GANs and Divergence Minimization</a>.
<ul>
<li>Now that we know what a GAN is it will be worth it to go back and finish reading this blog. It should help to tie together many of the concepts we’ve covered so far. It also has some great resources for extra reading at the end.</li>
</ul>
</li>
<li><a href="https://ahmedhanibrahim.wordpress.com/2017/01/17/generative-adversarial-networks-when-deep-learning-meets-game-theory/comment-page-1/">Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory</a>
<ul>
<li>A short blog post which briefly summarises many of the topics we’ve covered so far.</li>
</ul>
</li>
<li><a href="https://www.inference.vc/how-to-train-your-generative-models-why-generative-adversarial-networks-work-so-well-2/">How to Train your Generative Models? And why does Adversarial Training work so well?</a> and <a href="https://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/">An Alternative Update Rule for Generative Adversarial Networks</a>
<ul>
<li>Two great blog posts from Ferenc Huszár that discuss the challenges in training GANs as well as the differences between the JSD, KLD and reverse KLD.</li>
</ul>
</li>
<li><a href="https://github.com/HIPS/autograd/blob/master/examples/generative_adversarial_net.py">Simple Python GAN example</a>
<ul>
<li>This example illustrates how simple GANs are to implement by doing it in 145 lines of Python using Numpy and a simple autograd library.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The first three questions this week are here to make sure that you understand some of the most important points in the GAN paper. The last question is to make sure you understood the overall picture of what a GAN is, and to get your hands dirty with some of the practical difficulties of training GANs.</em></p>
<ol>
<li>Prove that minimizing the optimal discriminator loss, with respect to the generator model parameters, is equivalent to minimizing the JSD.
<ul>
<li>Hint, it may help to somehow introduce the distribution <script type="math/tex">p_m(x) = \frac{p_d(x) + p_g(x)}{2}</script>.</li>
</ul>
<details><summary>Solution</summary>
<p>
The loss we are minimizing is
$$\mathbb{E}_{x \sim p_d(x)}[\log D^*(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D^*(G(z)))]$$
where \(p_d(x)\) is the true data distribution, \(p_z(z)\) is the noise distribution from which we draw samples to pass through our generator, \(D\) and \(G\) are the discriminator and generator, and \(D^*\) is the optimal discriminator which has the form:
$$ D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}.$$
Here \(p_g(x)\) is the distribution of the data sampled from the generator. Substituting in \(D^*(x)\) and \(p_g(x)\), we can rewrite the loss as
$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_d(x) + p_g(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_d(x) + p_g(x)}]. $$
Now we can multiply the values inside the logs by \(1 = \frac{0.5}{0.5}\) to get
$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{0.5 p_d(x)}{0.5(p_d(x) + p_g(x))}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{0.5 p_g(x)}{0.5(p_d(x) + p_g(x))}]. $$
Recalling that \(\log(ab) = \log(a) + \log(b)\) and defining \(p_m(x) = \frac{p_d(x) + p_g(x)}{2}\), we now get
$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_m(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_m(x)}] - 2\log2. $$
Using the definition of the KL-Divergence, this simplifies to
$$ D_{KL}(p_d||p_m) + D_{KL}(p_g||p_m) - 2\log2. $$
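Finally, using the definition of the JS-Divergence, \(D_{JS}(p_d||p_g) = \frac{1}{2}D_{KL}(p_d||p_m) + \frac{1}{2}D_{KL}(p_g||p_m)\), the loss becomes
$$ 2D_{JS}(p_d||p_g) - 2\log2. $$
For the purposes of minimization the factor of 2 and the \(2\log2\) term can be ignored, so minimizing the loss is equivalent to minimizing \(D_{JS}(p_d||p_g)\).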
</p>
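<p>
The relationship between the loss at the optimal discriminator and the JSD can be checked numerically. The sketch below uses two hypothetical discrete distributions on three outcomes and confirms that the loss equals \(2D_{JS}(p_d||p_g) - 2\log 2\).
</p>

```python
import math

# Hypothetical data and generator distributions on the same three outcomes.
p_d = [0.5, 0.3, 0.2]
p_g = [0.1, 0.3, 0.6]

def kl(p, q):
    # D_KL(p || q) for discrete distributions, skipping outcomes where p is zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence, using the mixture m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def loss_at_optimal_discriminator(p_d, p_g):
    # D*(x) = p_d(x) / (p_d(x) + p_g(x)); the expectation over z is
    # rewritten as an expectation over x ~ p_g.
    total = 0.0
    for pd, pg in zip(p_d, p_g):
        d_star = pd / (pd + pg)
        total += pd * math.log(d_star) + pg * math.log(1 - d_star)
    return total

loss = loss_at_optimal_discriminator(p_d, p_g)
print(f"loss: {loss:.6f}, 2*JSD - 2*log(2): {2 * jsd(p_d, p_g) - 2 * math.log(2):.6f}")
```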
</details>
</li>
<li>Explain why Goodfellow says that <script type="math/tex">D</script> and <script type="math/tex">G</script> are playing a two-player minimax game and derive the definition of the value function <script type="math/tex">V(G,D)</script>.
<details><summary>Solution</summary>
<p>
\(G\) wants to maximize the probability that \(D\) thinks the generated samples are real, \(\mathbb{E}_{z \sim p_z(z)}[D(G(z))]\). This is the same as minimizing the probability that \(D\) thinks the generated samples are fake, \(\mathbb{E}_{z \sim p_z(z)}[1 - D(G(z))]\).
</p>
<p>
On the other hand, \(D\) wants to maximise the probability that it assigns the labels correctly \(\mathbb{E}_{x \sim p_d(x)}[D(x)] + \mathbb{E}_{z \sim p_z(z)}[1 - D(G(z))]\). Note that \(D(x)\) should be 1 if \(x\) is real, and 0 if \(x\) is fake.
</p>
<p>
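Replacing the probabilities with log-probabilities, as in the usual cross-entropy loss, gives the value function
$$ V(G,D) = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], $$
which \(D\) tries to maximize and \(G\) tries to minimize. This is the two-player game \(\min_G\max_D V(G,D)\).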
</p>
</details>
</li>
<li>Why is it important to carefully tune the amount that the generator and discriminator are trained in the original GAN formulation?
<ul>
<li>Hint, it has to do with the approximation for the JSD & the dimensionality of the data manifolds.</li>
</ul>
<details><summary>Solution</summary>
<p>
If we train the discriminator too much we get vanishing gradients. This is due to the fact that when the true data distribution and model distribution lie on low dimensional manifolds (or have disjoint support almost everywhere), the optimal discriminator will be perfect — i.e. the gradient will be zero almost everywhere. This is something that almost always happens.
</p>
<p>
On the other hand, if we train the discriminator too little, then the loss for the generator no longer approximates the JSD. This is because the approximation only holds if the discriminator is near the optimal \(D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}\).
</p>
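<p>
The vanishing-gradient effect can be illustrated with a toy calculation. Assume, purely for illustration, that the discriminator ends in a sigmoid applied to a logit \(a\), so that \(D(G(z)) = \sigma(a)\). The gradient of the generator's loss term \(\log(1 - \sigma(a))\) with respect to \(a\) is \(-\sigma(a)\), which shrinks towards zero as the discriminator becomes confident that the sample is fake.
</p>

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_loss_grad(logit):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a), where a is the discriminator's
    # logit on a generated sample (a toy setup, not a full GAN).
    return -sigmoid(logit)

# More negative logits mean a more confident (better trained) discriminator.
for logit in (0.0, -2.0, -5.0, -10.0):
    d_fake = sigmoid(logit)
    grad = generator_loss_grad(logit)
    print(f"D(G(z))={d_fake:.5f}  gradient wrt logit={grad:.5f}")
```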
</details>
</li>
<li>Implement a GAN and train it on Fashion MNIST.
<ul>
<li><a href="https://colab.research.google.com/drive/1OWZEeF-SB0r1f6mHm-7-hfxd2zsecEwq#scrollTo=Q8YoJ4mejp97">This notebook</a> contains a skeleton with boilerplate code and hints.</li>
<li>Try various settings of hyper-parameters, other than those suggested, and see if the model converges.</li>
<li>Examine samples from various stages of the training. Rank them without looking at the corresponding loss and see if your ranking agrees with the loss.</li>
</ul>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/eriklindernoren/Keras-GAN/blob/master/dcgan/dcgan.py">Here</a> is a GAN implementation using Keras.
</p>
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week3.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!</p>
<p><br /></p>
<h1 id="4-wasserstein-gan">4 Wasserstein GAN</h1>
<p><strong>Motivation</strong>: Last week we saw how GANs solve some problems in training generative models but also that they bring in new problems. This week we’ll look at the Wasserstein GAN which goes a long way to solving these problems.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Wasserstein Distance vs KLD/JSD</li>
<li>Wasserstein GAN</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1701.07875.pdf">The WGAN paper</a>
<ul>
<li>This should be pretty self-explanatory! We’re doing a DFL on Wasserstein GANs so we’d better read the paper! (This isn’t the end of the road, however: next week we’ll look at WGAN-GP.) The paper builds upon an intuitive idea: the family of Wasserstein distances is a nice family of distances between probability distributions that is well grounded in theory. The authors propose to use the 1-Wasserstein distance to estimate generative models. More specifically, they propose to use the 1-Wasserstein distance in place of the JSD in a standard GAN, that is, to measure the difference between the true distribution and the model distribution of the data. They show that the 1-Wasserstein distance is an integral probability metric (IPM) with a meaningful set of constraints (1-Lipschitz functions), and can, therefore, be optimized by focusing on discriminators that are “well behaved” (meaning that their output does not change too much if you perturb the input, i.e. they are Lipschitz!).</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.alexirpan.com/2017/02/22/wasserstein-gan.html">Summary blog for the paper</a>
<ul>
<li>This is a brilliant blog post that summarises almost all of the key points we’ve covered over the last 4 weeks and puts them in the context of the WGAN paper. In particular, if any of the more theoretic aspects of the WGAN paper were a bit much for you then this post is worth reading.</li>
</ul>
</li>
<li><a href="https://mindcodec.ai/2018/09/23/an-intuitive-guide-to-optimal-transport-part-ii-the-wasserstein-gan-made-easy/">Another good summary of the paper</a></li>
<li>Wasserstein / Earth Mover distance <a href="https://vincentherrmann.github.io/blog/wasserstein/">blog</a> <a href="https://mindcodec.ai/2018/09/19/an-intuitive-guide-to-optimal-transport-part-i-formulating-the-problem/">posts</a></li>
<li><a href="https://www.youtube.com/watch?v=6iR1E6t1MMQ">Set of</a> <a href="https://www.youtube.com/watch?v=1ZiP_7kmIoc">three</a> <a href="https://www.youtube.com/watch?v=SZHumKEhgtA">lectures</a> by Marco Cuturi on optimal transport (with accompanying <a href="https://drive.google.com/file/d/1oYX41dIAXhU6EShcid6eYrrK7svi5NXW/view">slides</a>)
<ul>
<li>If you are interested in the history of optimal transport and would like to see where the KR duality comes from (that’s the crucial argument in the WGAN paper which connects the 1-Wasserstein distance to an IPM with a Lipschitz constraint), the Wasserstein distance, or if you feel like you need a different explanation of what the Wasserstein distance and the Kantorovich-Rubinstein duality are, then watching these lectures is recommended. There are some really cool applications of optimal transport here too, and a more exhaustive description of other families of Wasserstein distances (such as the quadratic one) and their dual formulation.</li>
</ul>
</li>
<li>The first 15 or so minutes of <a href="https://www.youtube.com/watch?v=eDWjfrD7nJY">this lecture on GANs</a> by Sebastian Nowozin
<ul>
<li>Great description of WGAN, including Lipschitz and KR duality. This lecture is actually part 2 of a series of 3 lectures from MLSS Africa. Watching the whole series is also highly recommended if you are interested in knowing more about the bigger picture for GANs (including other interesting developments and future work) and how WGAN relates to other GAN variants. However, to avoid spoilers for next week, you should wait to watch the rest of part 2.</li>
</ul>
</li>
<li><a href="https://arxiv.org/pdf/1803.00567.pdf">Computational Optimal Transport</a> by Peyré and Cuturi (Chapters 2 and 3 in particular)
<ul>
<li>If you enjoyed Marco’s lectures above, or want a more thorough theoretical understanding of the Wasserstein distance, then this textbook is for you! However, please keep in mind that this textbook is somewhat mathematically involved, so if you don’t have a mathematics background you may struggle with it.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>The first two questions are here to highlight the key difference between the WGAN and the original GAN formulation. As before, the last question is to make sure you understood the overall picture of what a WGAN is and to get your hands dirty with how they differ from standard GANs in practice.</em></p>
<ol>
<li>What happens to the KLD/JSD when the real data and the generator’s data lie on low dimensional manifolds?
<details><summary>Solution</summary>
<p>
The true distribution and model distribution tend to have different supports which causes the KLD and JSD to saturate.
</p>
</details>
</li>
<li>With this in mind, how does using the Wasserstein distance, rather than JSD, reduce the sensitivity to careful scheduling of the generator and discriminator?
<details><summary>Solution</summary>
<p>
The Wasserstein distance does not saturate or blow up for distributions with different supports. This means that we still get signals in these cases which in turn means that we don’t have to worry about training the discriminator (or critic) to optimality — in fact, we <i>want</i> to train it to optimality since it will give better signals.
</p>
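<p>
A small sketch makes the contrast concrete. It compares the JSD and the 1-Wasserstein distance (EMD) between a point mass at 0 and a point mass that is moved further and further away on an integer grid. The helper functions are hypothetical and written from scratch; the EMD is computed as the sum of absolute differences between the two CDFs, which is valid for 1-D distributions on a unit-spaced grid.
</p>

```python
import math

def kl(p, q):
    # D_KL(p || q) for discrete distributions, skipping outcomes where p is zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence, using the mixture m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd(p, q):
    # 1-Wasserstein distance on the integer grid 0, 1, 2, ...:
    # the sum of absolute differences between the two CDFs.
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size):
    # All probability on a single grid point.
    return [1.0 if i == position else 0.0 for i in range(size)]

size = 6
p = point_mass(0, size)
for shift in range(1, size):
    q = point_mass(shift, size)
    print(f"shift={shift}: JSD={jsd(p, q):.4f}, EMD={emd(p, q):.1f}")
```

<p>
The JSD stays stuck at \(\log 2\) no matter how far apart the two point masses are, so it gives no signal about which direction to move in, while the EMD grows linearly with the shift. The KLD is not shown because it is infinite for every one of these pairs.
</p>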
</details>
</li>
<li>Let’s compare the 1-Wasserstein Distance (aka Earth Mover’s Distance — EMD) with the KLD for a few simple discrete distributions. We want to build up an intuition for the differences between these two metrics and why one might be better than another in certain scenarios. You might find it useful to use the Scipy implementations for <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html">1-Wasserstein</a> and <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kl_div.html">KLD</a>.
<ol>
<li>Let <script type="math/tex">P(x)</script>, <script type="math/tex">Q(x)</script> and <script type="math/tex">R(x)</script> be discrete distributions on <script type="math/tex">Z</script> with:
<ul>
<li><script type="math/tex">P(0) = 0.5</script>, <script type="math/tex">P(1) = 0.5</script>,</li>
<li><script type="math/tex">Q(0) = 0.75</script>, <script type="math/tex">Q(1) = 0.25</script>, and</li>
<li><script type="math/tex">R(0) = 0.25</script> and <script type="math/tex">R(1) = 0.75</script>.
<br /> Calculate both the KLD and EMD for the following pairs of distributions. You should notice that while Wasserstein is a proper distance metric, KLD is not (<script type="math/tex">D_{KL}(P||Q) \ne D_{KL}(Q||P)</script>).
<ol>
<li><script type="math/tex">P</script> and <script type="math/tex">Q</script></li>
<li><script type="math/tex">Q</script> and <script type="math/tex">P</script></li>
<li><script type="math/tex">P</script> and <script type="math/tex">P</script></li>
<li><script type="math/tex">P</script> and <script type="math/tex">R</script></li>
<li><script type="math/tex">Q</script> and <script type="math/tex">R</script></li>
</ol>
</li>
</ul>
</li>
<li>Let <script type="math/tex">P(x)</script>, <script type="math/tex">Q(x)</script>, <script type="math/tex">R(x)</script>, <script type="math/tex">S(x)</script> be discrete distributions on <script type="math/tex">Z</script> with:
<ul>
<li><script type="math/tex">P(0) = 0.5</script>, <script type="math/tex">P(1) = 0.5</script>, <script type="math/tex">P(2) = 0</script>,</li>
<li><script type="math/tex">Q(0) = 0.33</script>, <script type="math/tex">Q(1) = 0.33</script>, <script type="math/tex">Q(2) = 0.33</script>,</li>
<li><script type="math/tex">R(0) = 0.5</script>, <script type="math/tex">R(1) = 0.5</script>, <script type="math/tex">R(2) = 0</script>, <script type="math/tex">R(3) = 0</script>, and</li>
<li><script type="math/tex">S(0) = 0</script>, <script type="math/tex">S(1) = 0</script>, <script type="math/tex">S(2) = 0.5</script>, <script type="math/tex">S(3) = 0.5</script>.
<br /> Calculate the KLD and EMD between the following pairs of distributions. You should notice that the EMD is well behaved for distributions with disjoint support while the KLD is not.
<ol>
<li><script type="math/tex">P</script> and <script type="math/tex">Q</script></li>
<li><script type="math/tex">Q</script> and <script type="math/tex">P</script></li>
<li><script type="math/tex">R</script> and <script type="math/tex">S</script></li>
</ol>
</li>
</ul>
</li>
<li>Let <script type="math/tex">P(x)</script>, <script type="math/tex">Q(x)</script>, <script type="math/tex">R(x)</script>, and <script type="math/tex">S(x)</script> be discrete distributions on <script type="math/tex">Z</script> with:
<ul>
<li><script type="math/tex">P(0) = 0.25</script>, <script type="math/tex">P(1) = 0.75</script>, <script type="math/tex">P(2) = 0</script>,</li>
<li><script type="math/tex">Q(0) = 0</script>, <script type="math/tex">Q(1) = 0.75</script>, <script type="math/tex">Q(2) = 0.25</script>,</li>
<li><script type="math/tex">R(0) = 0</script>, <script type="math/tex">R(1) = 0.25</script>, <script type="math/tex">R(2) = 0.75</script>, and</li>
<li><script type="math/tex">S(0) = 0</script>, <script type="math/tex">S(1) = 0</script>, <script type="math/tex">S(2) = 0.25</script>, <script type="math/tex">S(3) = 0.75</script>.
<br /> Calculate the EMD between the following pairs of distributions. Here we just want to get more of a sense for the EMD.
<ol>
<li><script type="math/tex">P</script> and <script type="math/tex">Q</script></li>
<li><script type="math/tex">P</script> and <script type="math/tex">R</script></li>
<li><script type="math/tex">Q</script> and <script type="math/tex">R</script></li>
<li><script type="math/tex">P</script> and <script type="math/tex">S</script></li>
<li><script type="math/tex">R</script> and <script type="math/tex">S</script></li>
</ol>
</li>
</ul>
</li>
</ol>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/JamesAllingham/DFL-WGAN/blob/master/DFL_WGAN_week4_q3.ipynb">Here</a> is a Jupyter notebook with solutions.
</p>
</details>
</li>
<li>Based on the GAN implementation from week 3, implement a WGAN for FashionMNIST.
<ul>
<li>Try various settings of hyper-parameters. Does this model seem more resilient to the choice of hyper-parameters?</li>
<li>Examine samples from various stages of the training. Rank them without looking at the corresponding loss and see if your ranking agrees with the loss.</li>
</ul>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/eriklindernoren/Keras-GAN/blob/master/wgan/wgan.py">Here</a> is a WGAN implementation using Keras.
</p>
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week4.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!</p>
<p><br /></p>
<h1 id="5-wgan-gp">5 WGAN-GP</h1>
<p><strong>Motivation</strong>: Let’s read the WGAN-GP paper (Improved Training of Wasserstein GANs). As has been the trend over the last few weeks, we’ll see how this method solves a problem with the standard WGAN: weight clipping, as well as a potential problem in the standard GAN: overfitting.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>WGAN-GP</li>
<li>Weight clipping vs gradient penalties</li>
<li>Measuring GAN performance</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1704.00028.pdf">WGAN-GP paper</a>
<ul>
<li>This is our final required reading. The paper suggests improvements to the training of Wasserstein GANs with some great theoretical justifications and actual results.</li>
</ul>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1709.08894.pdf">On the Regularization of Wasserstein GANs</a>
<ul>
<li>This paper came out after the WGAN-GP paper but gives a thorough discussion of why the weight clipping in the original WGAN was an issue (see Appendix B). In addition, they propose other solutions for how to get around doing so and provide other interesting discussions of GANs and WGANs. </li>
</ul>
</li>
<li><a href="https://medium.com/@jonathan_hui/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490">Wasserstein GAN & WGAN-GP blog post</a>
<ul>
<li>Another blog that summarises many of the key points we’ve covered and includes WGAN-GP.</li>
</ul>
</li>
<li><a href="https://medium.com/@jonathan_hui/gan-how-to-measure-gan-performance-64b988c47732">GAN — How to measure GAN performance?</a>
<ul>
<li>A blog that discusses a number of approaches to measuring the performance of GANs, including the Inception score, which is useful to know about when reading the WGAN-GP paper.</li>
</ul>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<p><em>This week’s questions follow the same pattern as last week’s. How does the formulation of WGAN-GP differ from that of the original GAN or WGAN (and how is it similar)? What does this mean in practice?</em></p>
<ol>
<li>Why does weight clipping lead to instability in the training of a WGAN & how does the gradient penalty avoid this problem?
<details><summary>Solution</summary>
<p>
The instability comes from the fact that if we choose the weight clipping hyper-parameter poorly we end up with either exploding or vanishing gradients. This is because weight clipping encourages the optimizer to push the absolute values of all of the weights very close to the clipping value. Figure 1b in the paper shows this happening. To explain this phenomenon, consider a simple logistic regression model. Here, if a feature is highly predictive of a particular class, it will be assigned as positive a weight as possible; similarly, if a feature is highly predictive against that class, it will be assigned as negative a weight as possible. Now, depending on our choice of the weight clipping value, we get either exploding or vanishing gradients.
<ul>
<li> Vanishing gradients: this is similar to the issue of vanishing gradients in a vanilla RNN, or a very deep feed-forward NN without residual connections. If we choose the weight clipping value to be too small, during back-propagation, the error signal going to each layer will be multiplied by small values before being propagated to the previous layer. This results in exponential decay in the error signal as it propagates farther backward. </li>
<li> Exploding gradients: similarly, if we choose a weight clipping value that is too large, the error signals will get repeatedly multiplied by large numbers as they propagate backward, resulting in exponential growth. </li>
</ul>
</p>
<p>
This phenomenon is also related to the reason we use weight initialization schemes such as Xavier and He, and to why batch normalization is important: both of these methods help to ensure that information is propagated through the network without decaying or exploding.
</p>
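<p>The exponential decay or growth described above can be seen in a toy calculation. The sketch below assumes a chain of scalar linear layers whose weights all sit exactly at the clipping value (a deliberate simplification; <code>backprop_scale</code> is a hypothetical helper, not code from the paper), so the backpropagated signal is scaled by the clipping value once per layer.</p>

```python
def backprop_scale(clip_value, depth):
    """Factor by which an error signal is scaled after passing backward
    through `depth` scalar linear layers whose weights all equal
    `clip_value` (toy assumption: no nonlinearities, no biases)."""
    scale = 1.0
    for _ in range(depth):
        scale *= clip_value
    return scale


# Small clipping values shrink the signal, large ones blow it up.
for c in (0.1, 1.0, 2.0):
    print(f"clip={c}: scale after 10 layers = {backprop_scale(c, depth=10)}")
```

<p>Only a clipping value close to one keeps the signal at a usable scale in this toy setting, which is why the hyper-parameter is so delicate to tune.</p>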
</details>
</li>
<li>Explain how WGAN-GP addresses issues of overfitting in GANs.
<details><summary>Solution</summary>
<p>
Both WGAN-GP, and indeed the original weight-clipped WGAN, have the property that the discriminator/critic loss corresponds to the quality of the samples from the generator, which lets us use the loss to detect overfitting (we can compare the negative discriminator/critic loss for a validation set to that of the training set of real images; when the two diverge, we have overfitted). The correspondence between the loss and the sample quality can be explained by a number of factors.
<ul>
<li> With a WGAN we can train our discriminator to optimality. This means that if the critic is struggling to tell the difference between real and generated images we can conclude that the real and generated images are similar. In other words, the loss is meaningful.</li>
<li> In addition, in a standard GAN where we cannot train the discriminator to optimality, our loss no longer approximates the JSD. We do not know what function our loss is actually approximating and as a result we cannot say (and in practice we do not see) that the loss is a meaningful measure of sample quality. </li>
<li> Finally, there are arguments to be made that even if the loss for a standard GAN was approximating the JSD, the Wasserstein distance is a better distance measure for image distributions than the JSD. </li>
</ul>
</p>
</details>
</li>
<li>Based on the WGAN implementation from week 4, implement an improved WGAN for MNIST.
<ul>
<li>Compare the results, ease of hyper-parameter tuning, and correlation between loss and your subjective ranking of samples, with the previous two models.</li>
<li><em>The Keras implementation of WGAN-GP can be tricky. If you are familiar with another framework like TensorFlow or PyTorch it might be easier to use that instead. If not, don’t be too hesitant to check the solution if you get stuck.</em></li>
</ul>
<details><summary>Solution</summary>
<p>
<a href="https://github.com/eriklindernoren/Keras-GAN/blob/master/wgan_gp/wgan_gp.py">Here</a> is a WGAN-GP implementation using Keras.
</p>
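<p>For orientation before reading a full implementation, here is a minimal framework-free sketch of the gradient penalty term for a single interpolated sample. It assumes a toy critic \(f(x) = \tanh(w \cdot x)\) with a hand-written input gradient; in a real model the critic is a neural network, the gradient comes from automatic differentiation, and the penalty is averaged over a batch. All function names are illustrative, and \(\lambda = 10\) is the value used in the WGAN-GP paper.</p>

```python
import math
import random


def critic(x, w):
    """Toy critic f(x) = tanh(w . x); stands in for a neural network."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def critic_grad(x, w):
    """Analytic gradient of the toy critic with respect to its input x."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return [(1.0 - math.tanh(s) ** 2) * wi for wi in w]


def gradient_penalty(real, fake, w, lam=10.0, rng=random):
    """lam * (||grad_x f(x_hat)||_2 - 1)^2 at one random interpolate x_hat
    on the straight line between a real and a generated sample."""
    eps = rng.random()
    x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
    norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w)))
    return lam * (norm - 1.0) ** 2


real, fake, w = [1.0, 2.0], [3.0, 4.0], [0.3, -0.2]
# Per-sample critic loss: f(fake) - f(real) + gradient penalty.
loss = critic(fake, w) - critic(real, w) + gradient_penalty(real, fake, w)
print(loss)
```

<p>The penalty is zero exactly when the input gradient has unit norm at the sampled point, which is how the soft Lipschitz constraint replaces weight clipping.</p>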
</details>
</li>
</ol>
<p><strong>Notes</strong>: Here is a <a href="/assets/wgan_notes/week5.pdf">link</a> to our notes for the lesson. We were fortunate enough to have Ishaan Gulrajani sit in on the session!</p>
james
[Editor’s Note: We are especially proud of this one. James and his group went above and beyond the call of duty and made a guide from their class that we feel is especially superb for understanding their target paper. Moving forward, he has forced us to up our game because it will be hard to release a curriculum that is not as strong as this one. We highly recommend earnestly studying with this at hand.]
Announcing the 2019 DFL Fellows
2019-04-15T16:00:00+00:00
https://www.depthfirstlearning.com/2019/Announcing-DFL-Fellows
<p>After we launched Depth First Learning last year, we wanted to keep the momentum
and continue outputting high-quality study guides for machine learning.
Subsequently, we launched the <a href="http://fellowship.depthfirstlearning.com">Depth First Learning Fellowship</a> with funding provided by <a href="https://www.janestreet.com/">Jane Street</a>.</p>
<p>We were blown away by the response. With over 100 applicants from 5 continents, we had a tremendously hard time selecting only four proposals. After speaking with many of the applicants, we could not be more thrilled with the groups we selected. See below for bios of the inaugural class, as well as the papers that their groups will be respectively learning.</p>
<p>What’s the process now, you ask? The fellows are hard at work constructing their curricula and will soon begin online classes. Participants will meet weekly to discuss and go beyond the material.</p>
<div class="welcome">
<b>We are looking for participants for these groups.
<br />If you’re interested, please let us know by filling out <a href="https://docs.google.com/forms/d/e/1FAIpQLSdNsXeJn0Osc1m5A_Rj7tTE3yzPINuL09xbaqFdHZGmUUBMqA/viewform">this form</a>.</b>
</div>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Steve Kroon - Stellenbosch (South Africa)</b></p>
<p><img src="/assets/kroon.png" style="width: 35%; padding-left: 20px; padding-bottom: 20px;" align="right" /></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1505.05770">Variational Inference with Normalizing Flows</a>”, by Rezende and Mohamed (ICML 2015)</p>
<p>Dr Steve Kroon obtained MCom (Computer Science) and PhD (Mathematical Statistics) degrees while studying at Stellenbosch University. He joined the Stellenbosch University Computer Science department in 2008. His PhD thesis considered aspects of statistical learning theory, and his subsequent research has focused on decision making in
artificial intelligence, including machine learning, reinforcement learning, and search techniques. He has supervised and co-supervised 5 graduated and 3 current master’s students, and has published 3 journal articles and 8 peer-reviewed conference and conference workshop
articles. He has served as a reviewer for the journals Algorithmica, the Journal of Universal Computer Science, and the South African Computer Journal, as well as on the programme committee for 2 conferences. He holds a Diploma in Actuarial Techniques, and is a member of the Centre for Artificial Intelligence Research, the Institute of Electrical and Electronics Engineers (IEEE) and the IEEE Computational Intelligence Society, the International Computer Games
Association, the South African Statistical Association, and the South African Institute for Computer Scientists and Information Technologists.</p>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Sandhya Prabhakaran - New York (USA)</b></p>
<p><img src="/assets/prabhakaran.jpeg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" /></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1801.10130">Spherical CNNs</a>” by Cohen, Geiger, Köhler and Welling (ICLR 2018)</p>
<p>Dr. Sandhya Prabhakaran is a Research Fellow at Memorial Sloan Kettering Cancer Centre, NYC. Before that she was a Research Scientist at Columbia University in the City of New York.</p>
<p>She received her Ph.D. from the Department of Mathematics and Computer Science, University of Basel (Switzerland) and her Masters in Intelligent Systems (Robotics) from School of Informatics, University of Edinburgh (Scotland). Her research deals with developing statistical theory and inference models, particularly to problems in Cancer Biology and Computer Vision.</p>
<p>Prior to academics, she was an Assembler programmer working with the Mainframe Operating System (z/OS) at IBM Software Laboratories, Bangalore and has developed Mainframe applications at UST Global, Thiruvananthapuram.</p>
<p>She is an avid hiker and distance runner and has completed 4 out of the 6 World Marathon Majors.</p>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Bhairav Mehta - Montreal (Canada)</b></p>
<p><img src="/assets/mehta.jpg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" /></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1608.04471">Stein Variational Gradient Descent</a>” by Liu and Wang (NIPS 2016)</p>
<p>After finishing my undergraduate studies at the University of Michigan, I migrated north to Montreal, where I’m now a graduate student at Mila. I work mostly on reinforcement learning and robotics, but continue to find that teaching is the most rewarding part of graduate (and undergraduate) studies. I’ve been serving as a tutor, TA, and now, GSI, for over a decade, and I’m incredibly excited by the opportunity to lead a DFL course online. In my free time, you can find me helping ducks waddle across the street at Duckietown, or building deep learning models for my nonprofit tackling core problems in K-12 education.</p>
<hr style="margin-bottom: 25px; margin-top: 25px; " />
<p><b>Vinay Ramasesh, Piyush Patil, and Riley Edmunds - Berkeley (USA)</b></p>
<p><b>Target paper:</b> “<a href="https://arxiv.org/abs/1711.04735">Resurrecting the sigmoid in deep learning through dynamical isometry</a>” by Pennington, Schoenholz and Ganguli (NIPS 2017)</p>
<div style="overflow: hidden;">
<img src="/assets/ramasesh.jpg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" />
<p><b>Vinay:</b> I am finishing up a Ph. D. in physics at UC Berkeley, where I have worked on building and testing small quantum processors made from superconducting circuits. At Berkeley, I work in the Quantum Nanoelectronics Lab under the guidance of Dr. Irfan Siddiqi. My experience with machine learning comes from Berkeley's machine learning student group, ML@B, which I joined in 2017. Previously, I studied physics and electrical engineering at MIT, working in the group of Dr. Martin Zwierlein to build up an experiment to cool, trap, and image strongly-interacting atomic gases.
</p>
</div>
<p><br /></p>
<div style="overflow: hidden;">
<img src="/assets/patil.jpg" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" />
<b>Piyush:</b> I graduated from UC Berkeley last May, where I studied electrical engineering and computer science and mathematics. While at Berkeley, I helped to get the university's student-run machine learning club, ML@B, up and running, serving as the vice president of projects during the last couple years. I was involved with research in quantum machine learning, adversarial examples, and natural language understanding. After graduating, I joined Nuro, a robotics startup working to build autonomous vehicles. Outside of ML, I enjoy reading philosophy, going hiking and backpacking, and spending time with friends.
</div>
<p><br /></p>
<div style="overflow: hidden;">
<img src="/assets/edmunds.png" style="width: 42%; padding-left: 20px; padding-bottom: 20px;" align="right" />
<b>Riley:</b> I'm currently finishing up my undergrad degree in computer science at UC Berkeley. I was one of the early members of ML@B, where as vice president of research, I helped club members form teams to work on ML research projects. At UC Berkeley, I've worked under Prof. Dawn Song, Alice Agogino and Stella Yu. With a couple friends, in February 2018 I co-founded an ML consulting company, Alinea AI. You can find more on my background at rileyedmunds.com. In my spare time, I enjoy traveling, playing spikeball, and discussing thought-provoking books.
</div>
dfl
The DFL Fellowship
2018-12-05T16:00:00+00:00
https://www.depthfirstlearning.com/2018/DFL-Fellowship
<p>When we began Depth First Learning during the Google AI Residency, we wanted to find a
better way to study and understand important machine learning papers and ideas.
We found that many papers often assumed a set of requisite knowledge, which
prevented us from deeply appreciating the contribution or novelty of the work.</p>
<p>To this end, we designed Depth First Learning, a pedagogy for diving deep by
carefully tailoring a curriculum around a particular ML paper or concept and
leading small, focused discussion groups. So far, we’ve created guides for
<a href="http://www.depthfirstlearning.com/2018/InfoGAN">InfoGAN</a>, <a href="http://www.depthfirstlearning.com/2018/TRPO">TRPO</a>, <a href="http://www.depthfirstlearning.com/2018/AlphaGoZero">AlphaGoZero</a>, and <a href="http://www.depthfirstlearning.com/2018/DeepStack">DeepStack</a>.</p>
<p>Since our launch, we’ve received very positive feedback from students and
researchers around the world. <strong>Now, we want to run new, online classes around the
world.</strong></p>
<p>We intimately understand that the process of curating a meaningful curriculum
with reading materials, practice problems, and instructive discussion points can
be very rewarding, but also time-consuming and difficult. We wanted to make sure
that the people compiling the content understood that their efforts were well
worth their time and consequently decided to launch a fellowship program.</p>
<p><strong>Thanks to the generosity of <a href="http://www.janestreet.com">Jane Street</a>, we will provide 4 fellows
with a $4000 grant each to build a 6 week curriculum and run weekly on-line discussions.</strong></p>
<p><del>
If you’d like to lead a class about an important paper in machine learning, please visit <a href="http://fellowship.depthfirstlearning.com">http://fellowship.depthfirstlearning.com</a> to apply. We look forward to hearing from you!
</del></p>
<p><b>Thanks for all of the applications! We received interest from an astounding 113 people, and we are now going over the list. If you applied, you should have received an email from us. Applications are now closed.</b></p>
<ul>
<li><a href="http://twitter.com/avitaloliver">Avital</a>, <a href="http://twitter.com/suryabhupa">Surya</a>,
<a href="http://twitter.com/kumarkagrawal">Kumar</a>, <a href="http://twitter.com/cinjoncin">Cinjon</a></li>
</ul>
dfl
DeepStack
2018-07-10T16:00:00+00:00
https://www.depthfirstlearning.com/2018/DeepStack
<p>Thank you to Michael Bowling, Michael Johanson, and Marc Lanctot for contributions to this guide.</p>
<p>Additionally, this would not have been possible without the generous support of
Prof. Joan Bruna and his class at NYU, <a href="https://github.com/joanbruna/MathsDL-spring18">The Mathematics of Deep Learning</a>.
Special thanks to him, as well as Martin Arjovsky, my colleague in leading this
recitation, and my fellow students Ojas Deshpande, Anant Gupta, Xintian Han,
Sanyam Kapoor, Chen Li, Yixiang Luo, Chirag Maheshwari, Zsolt Pajor-Gyulai,
Roberta Raileanu, Ryan Saxe, and Liang Zhuo.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/deepstack-deps.svg" width="200"></iframe>
<div>Concepts used in DeepStack. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>Along with Libratus, DeepStack is one of two approaches to solving No-Limit
Texas Hold’em that debuted at around the same time. This game was notoriously difficult
to solve as it has just as large a branching factor
as Go, but additionally is a game of imperfect information.</p>
<p>The main idea behind both DeepStack and Libratus is to use Counterfactual Regret
Minimization (CFR) to find a mixed strategy that approximates a Nash Equilibrium
strategy. CFR’s convergence properties guarantee that we approach such a strategy,
and the closer we are to it, the better our outcome will be. They differ in
their implementation. In particular, DeepStack uses deep neural networks
to approximate the counterfactual value of each hand at specific points in the
game. While still being mathematically tight, this lets it cut short
the necessary computation to reach convergence.</p>
<p>In this curriculum, you will explore the study of games with a tour through
game theory and counterfactual regret minimization while building up the
requisite understanding to tackle DeepStack. Along the way, you will learn
all of the necessary topics, including the
<a href="https://en.wikipedia.org/wiki/Branching_factor">branching factor</a>,
<a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibria</a>, and
<a href="https://www.quora.com/What-is-an-intuitive-explanation-of-counterfactual-regret-minimization">CFR</a>.</p>
<p><br /></p>
<h1 id="common-resources">Common Resources:</h1>
<ol>
<li>MAS: <a href="http://www.masfoundations.org/mas.pdf">Multi Agent Systems</a>.</li>
<li>LT: <a href="http://mlanctot.info/files/papers/PhD_Thesis_MarcLanctot.pdf">Marc Lanctot’s Thesis</a>.</li>
<li>ICRM: <a href="http://modelai.gettysburg.edu/2013/cfr/cfr.pdf">Introduction to Counterfactual Regret Minimization</a>.</li>
<li>PLG: <a href="http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf">Prediction, Learning, and Games</a>.</li>
</ol>
<p><br /></p>
<h1 id="1-normal-form-games--poker">1 Normal Form Games & Poker</h1>
<p><strong>Motivation</strong>: Most of Game Theory, as well as the particular techniques used in
DeepStack and Libratus, is built on the framework of Normal Form
Games. These are game descriptions and are familiarly represented as a matrix,
a famous example being the Prisoner’s Dilemma. In this section, we cover
the basics of Normal Form Games. In addition, we go over the rules of Poker and
why it has proved so difficult to solve.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>MAS: Sections 3.1 & 3.2.</li>
<li>LT: Pages 5-7.</li>
<li><a href="https://arxiv.org/pdf/1701.01724.pdf">The Game of Poker</a>: Supplementary #1 on pages 16-17.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.cmu.edu/~sandholm/solving%20games.aimag11.pdf">The State of Solving Large Incomplete-Information Games, and Application to Poker</a> (2010)</li>
<li><a href="https://www.youtube.com/watch?v=2dX0lwaQRX0">Why Poker is Difficult</a>:
A very good video by Noam Brown, the main author of Libratus. The first eighteen
minutes are the most relevant.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>LT: Prove that in a zero-sum game, the Nash Equilibrium strategies are interchangeable.
<details><summary>Hint</summary>
<p>Use the definition of a Nash Equilibrium along with the fact that
\(\mu_{i}(\sigma_{i}, \sigma_{-i}) + \mu_{-i}(\sigma_{i}, \sigma_{-i}) = c\).
</p>
</details>
</li>
<li>LT: Prove that in a zero-sum game, the expected payoff to each player is the same for every equilibrium.
<details><summary>Solution</summary>
<p>We will solve both this problem and the one above here. If
\(\sigma = (\sigma_{i}, \sigma_{-i})\) and
\(\sigma' = (\sigma_{i}', \sigma_{-i}')\) are both
Nash Equilibria, then:</p>
<p>\(
\begin{align}
\mu_{i}(\sigma_{i}, \sigma_{-i}) &\geq \mu_{i}(\sigma_{i}', \sigma_{-i}) \\
&= c - \mu_{-i}(\sigma_{i}', \sigma_{-i}) \\
&\geq c - \mu_{-i}(\sigma_{i}', \sigma_{-i}') \\
&= \mu_{i}(\sigma_{i}', \sigma_{-i}')
\end{align}
\)
</p>
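<p>In a similar fashion, we can show that
\(\mu_{i}(\sigma_{i}', \sigma_{-i}') \geq \mu_{i}(\sigma_{i}, \sigma_{-i})\).
</p>
Consequently, \(\mu_{i}(\sigma_{i}', \sigma_{-i}') = \mu_{i}(\sigma_{i}, \sigma_{-i})\),
so every inequality in the chain above holds with equality. In particular,
\(\mu_{i}(\sigma_{i}', \sigma_{-i}) = \mu_{i}(\sigma_{i}, \sigma_{-i})\), so \(\sigma_{i}'\) is a best response to \(\sigma_{-i}\),
and \(\mu_{-i}(\sigma_{i}', \sigma_{-i}) = \mu_{-i}(\sigma_{i}', \sigma_{-i}')\), so \(\sigma_{-i}\) is a best response to \(\sigma_{i}'\).
Hence \((\sigma_{i}', \sigma_{-i})\) is also a Nash Equilibrium, i.e. the strategies are interchangeable.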
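<p>To make both properties concrete, the following sketch enumerates the pure-strategy equilibria (saddle points) of a small zero-sum matrix game and checks that they share one value and can be mixed and matched. The payoff matrix is a made-up example and <code>pure_equilibria</code> is a hypothetical helper; it only covers pure strategies, whereas the proof above holds for mixed strategies too.</p>

```python
from itertools import product


def pure_equilibria(A):
    """Pure Nash equilibria of a zero-sum game in which the row player
    receives A[i][j] and the column player receives -A[i][j]."""
    rows, cols = range(len(A)), range(len(A[0]))
    equilibria = []
    for i, j in product(rows, cols):
        row_cannot_improve = all(A[i][j] >= A[k][j] for k in rows)
        col_cannot_improve = all(A[i][j] <= A[i][l] for l in cols)
        if row_cannot_improve and col_cannot_improve:
            equilibria.append((i, j))
    return equilibria


A = [[1, 2, 1],
     [0, 3, 0],
     [1, 4, 1]]

eq = pure_equilibria(A)
values = {A[i][j] for i, j in eq}
# Interchangeability: the row choice of one equilibrium paired with the
# column choice of any other equilibrium is again an equilibrium.
interchangeable = all((i, j2) in eq for (i, _), (_, j2) in product(eq, eq))

print(eq, values, interchangeable)
```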
</details>
</li>
<li>MAS: Prove Lemma 3.1.6. <br />
<script type="math/tex">\textit{Lemma}</script>: If a preference relation <script type="math/tex">\succeq</script> satisfies the axioms
completeness, transitivity, decomposability, and monotonicity, and if <script type="math/tex">o_1 \succ o_2</script>
and <script type="math/tex">o_2 \succ o_3</script>, then there exists a probability <script type="math/tex">p</script> s.t. <script type="math/tex">% <![CDATA[
\forall p' < p %]]></script>,
<script type="math/tex">o_2 \succ [p': o_1; (1 - p'): o_3]</script> and for all <script type="math/tex">p'' > p</script>,
<script type="math/tex">[p'': o_1; (1 - p''): o_3] \succ o_2.</script></li>
<li>MAS: Theorem 3.1.8 ensures that rational agents need only maximize the expectation
of single-dimensional utility functions. Prove this result as a good test of your
understanding. <br />
<script type="math/tex">\textit{Theorem}</script>: If a preference relation <script type="math/tex">\succeq</script> satisfies the axioms completeness,
transitivity, substitutability, decomposability, monotonicity, and continuity, then
there exists a function <script type="math/tex">u: \mathbb{L} \mapsto [0, 1]</script> with the properties that:
<ol>
<li><script type="math/tex">u(o_1) \geq u(o_2)</script> iff <script type="math/tex">o_1 \succeq o_2</script>.</li>
<li><script type="math/tex">u([p_1 : o_1, ..., p_k: o_k]) = \sum_{i=1}^k p_{i}u(o_i)</script>.</li>
</ol>
</li>
</ol>
<p><br /></p>
<h1 id="2-optimality--equilibrium">2 Optimality & Equilibrium</h1>
<p><strong>Motivation</strong>: How do you reason about games? The best strategies in multi-agent
scenarios depend on the choices of others. Game theory deals with this problem
by identifying subsets of outcomes called solution concepts. In this section, we
discuss the fundamental solution concepts: Nash Equilibrium, Pareto Optimality,
and Correlated Equilibrium. For each solution concept, we cover what it implies
for a given game and how difficult it is to discover a representative strategy.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>MAS: Sections 3.3, 3.4.5, 3.4.7, 4.1, 4.2.4, 4.3, & 4.6.</li>
<li>LT: Section 2.1.1.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>MAS: Section 3.4.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Why must every game have a Pareto optimal strategy?
<details><summary>Solution</summary>
<p>Say that a game does not have a Pareto optimal outcome. Then, for every
outcome \(O\), there is another outcome \(O'\) that Pareto-dominates \(O\).
Say \(O_2 > O_1\). Because \(O_2\) is not Pareto optimal, there is some
\(O_k > O_2\), and so on. There cannot be a max in this chain (because that max would
be Pareto optimal), so, assuming finitely many outcomes, the chain must eventually
revisit an outcome and form a cycle. By transitivity of Pareto domination, there
exists an outcome \(O_j\) s.t. \(O_j > O_j\), i.e. some agent strictly prefers \(O_j\) to itself, which is a
contradiction.
</p>
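<p>As a small illustration of the definition used in this argument, the sketch below filters the outcomes of a Prisoner's Dilemma down to the Pareto optimal ones. The payoff numbers are illustrative assumptions, and <code>dominates</code> and <code>pareto_optimal</code> are hypothetical helper names.</p>

```python
def dominates(u, v):
    """True if payoff vector u Pareto-dominates v: nobody is worse off
    under u and at least one player is strictly better off."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))


def pareto_optimal(payoffs):
    """Outcomes whose payoff vector is not Pareto-dominated by any other."""
    return [outcome for outcome, u in payoffs.items()
            if not any(dominates(v, u) for v in payoffs.values())]


# Prisoner's Dilemma with made-up payoffs: C = cooperate, D = defect.
prisoners_dilemma = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-4, 0),
    ("D", "C"): (0, -4),
    ("D", "D"): (-3, -3),
}

print(pareto_optimal(prisoners_dilemma))
```

<p>Mutual defection is the only outcome removed, because mutual cooperation makes both players better off; every other outcome survives since improving one player's payoff would hurt the other.</p>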
</details>
</li>
<li>Why must there always exist at least one Pareto optimal strategy in which
all players adopt pure strategies?</li>
<li>Why in common-payoff games do all Pareto optimal strategies have the same payoff?
<details><summary>Solution</summary>
<p>Say two strategies \(S\) and \(S'\) are Pareto optimal. Then neither
dominates the other, so either \(\forall i \mu_{i}(S) = \mu_{i}(S')\)
or there are two players \(i, j\) for which \(\mu_{i}(S) < \mu_{i}(S')\)
and \(\mu_{j}(S) > \mu_{j}(S')\). In the former case, we see that the
two strategies have the same payoff as desired. In the latter case, we have
a contradiction because \(\mu_{j}(S') = \mu_{i}(S') > \mu_{i}(S)
= \mu_{j}(S) > \mu_{j}(S')\). Thus, all of the Pareto optimal strategies
must have the same payoff.
</p>
</details>
</li>
<li>MAS: Why does definition 3.3.12 imply that the vertices of a simplex must
all receive different labels?
<details><summary>Solution</summary>
<p>This follows from the definitions of \(\mathbb{L}(v)\) and \(\chi(v)\).
At a vertex of the simplex, \(\chi(v)\) is a singleton set
determined by the vertex itself. Because \(\mathbb{L}(v)\) must lie in \(\chi(v)\),
each vertex is forced to take its own distinct label.
</p>
</details>
</li>
<li>MAS: Why in definition 3.4.12 does it not matter that the mapping is to
pure strategies rather than to mixed strategies?</li>
<li>Take your favorite normal-form game, find a Nash Equilibrium, and then find
a corresponding Correlated Equilibrium.</li>
</ol>
<p><br /></p>
<h1 id="3-extensive-form-games">3 Extensive Form Games</h1>
<p><strong>Motivation</strong>: What happens when players don’t act simultaneously?
Extensive Form Games are an answer to this question. While this representation
of a game always has a comparable Normal Form, it’s much more natural to reason
about sequential games in this format. Examples include familiar ones like Go,
but also more exotic games like Magic: The Gathering and Civilization. This
section is imperative as Poker is best described as an Extensive Form Game.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>MAS: Sections 5.1.1 - 5.1.3.</li>
<li>MAS: Sections 5.2.1 - 5.2.3.</li>
<li><a href="http://martin.zinkevich.org/publications/ijcai2011_rgbr.pdf">Accelerating Best Response Calculation in Large Extensive Games</a>:
This is important for understanding how to evaluate Poker algorithms.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>LT: Section 2.1.2.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What is the intuition for why not all normal form games can be transformed
into perfect-information extensive-form games?
<details><summary>Solution</summary>
<p>The problem is one of modeling simultaneity. Perfect information
extensive form games have trouble modeling concurrent moves because they
have an explicit temporal structure of moves.
</p>
</details>
</li>
<li>Why does that change when the transformation is to imperfect-information extensive-form games?</li>
<li>How is the set of behavioral strategies different from the set of mixed strategies?
<details><summary>Solution</summary>
<p>Each mixed strategy is a single distribution over pure strategies.
Each behavioral strategy is a vector of distributions over actions, one per
Information Set, and the action at each Information Set is drawn independently.
</p>
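<p>The difference can be seen by expanding a behavioral strategy into the mixed strategy it induces. The sketch below assumes a player with two information sets and two actions at each; the information-set names, action names, and the helper <code>behavioral_to_mixed</code> are all hypothetical.</p>

```python
from itertools import product


def behavioral_to_mixed(behavioral):
    """Expand a behavioral strategy (info set -> action -> probability)
    into the product distribution it induces over pure strategies.
    A pure strategy is a tuple with one action per info set, ordered by
    sorted info-set name."""
    info_sets = sorted(behavioral)
    mixed = {}
    for choice in product(*(behavioral[s] for s in info_sets)):
        prob = 1.0
        for s, action in zip(info_sets, choice):
            prob *= behavioral[s][action]  # independent draw per info set
        mixed[choice] = prob
    return mixed


behavioral = {
    "I1": {"L": 0.5, "R": 0.5},
    "I2": {"l": 0.25, "r": 0.75},
}
mixed = behavioral_to_mixed(behavioral)
print(mixed)

# A mixed strategy that is not a product of per-info-set distributions:
# it correlates the choices made at I1 and I2.
correlated_mixed = {("L", "l"): 0.5, ("R", "r"): 0.5}
```

<p>Here the behavioral strategy has two free parameters, while a general mixed strategy over the four pure strategies has three, which is why mixed strategies such as <code>correlated_mixed</code> have no direct counterpart of this product form.</p>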
</details>
</li>
<li>Succinctly describe the technique demonstrated in the Accelerating Best Response paper.</li>
</ol>
<p><br /></p>
<h1 id="4-counterfactual-regret-minimization-1">4 Counterfactual Regret Minimization #1</h1>
<p><strong>Motivation</strong>: Counterfactual Regret Minimization (CFR) is only a decade old
but has already achieved huge success as the foundation underlying DeepStack
and Libratus. In the first of two weeks dedicated to CFR, we learn how the
algorithm works practically and get our hands dirty coding up our implementation.</p>
<p>The optional readings are papers introducing CFR-D and CFR+, further
iterations upon CFR. These are both used in DeepStack.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>ICRM: Sections 2.1-2.4.</li>
<li>ICRM: Sections 3.1-3.4.</li>
<li>LT: Section 2.2.</li>
<li><a href="http://poker.cs.ualberta.ca/publications/NIPS07-cfr.pdf">Regret Minimization in Games with Incomplete Information</a>.</li>
</ol>
<p><strong>Optional Reading</strong>: These two papers are CFR extensions used in DeepStack.</p>
<ol>
<li><a href="https://pdfs.semanticscholar.org/8216/0cbdcbeb13d53db85da928d8c42a789fdd69.pdf">Solving Imperfect Information Games Using Decomposition</a>: CFR-D.</li>
<li><a href="https://arxiv.org/pdf/1407.5042.pdf">Solving Large Imperfect Information Games Using CFR+</a>: CFR+.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What is the difference between external regret, internal regret, swap regret,
and counterfactual regret?
<details><summary>Hint</summary>
<p>The definitions of the four are the following:</p>
<ul>
<li><b>External Regret</b>: How much the algorithm regrets not taking the best
single decision in hindsight. We compare to a policy that performs a single
action in all timesteps.</li>
<li><b>Internal Regret</b>: How much the algorithm regrets making one choice
over another in all instances. An example: every time you bought Amazon stock,
you had instead bought Microsoft stock.</li>
<li><b>Swap Regret</b>: Similar to Internal Regret but instead of one categorical
action being replaced wholesale with another categorical action, now we allow
for any number of categorical swaps.</li>
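<li><b>Counterfactual Regret</b>: The counterfactual value of a node is its
expected utility to you, weighted by the probability that your opponents (and chance)
play to reach it. The counterfactual component is that we assume you yourself play to
reach that node with a probability of one. Counterfactual regret is then how much more
counterfactual value you would have obtained by always taking a particular action at
that node instead of following your strategy.</li>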
</ul>
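<p>The first three notions can be computed directly from a play history. The sketch below is phrased in terms of losses rather than gains and uses a made-up two-action history; the function names are illustrative, and counterfactual regret is left out because it needs a full game tree.</p>

```python
def external_regret(losses, actions):
    """Loss incurred minus the loss of the best single action in hindsight."""
    n = len(losses[0])
    incurred = sum(l[a] for l, a in zip(losses, actions))
    best_fixed = min(sum(l[b] for l in losses) for b in range(n))
    return incurred - best_fixed


def pair_regret(losses, actions, a, b):
    """Loss saved by playing b on every round where a was played."""
    return sum(l[a] - l[b] for l, played in zip(losses, actions) if played == a)


def internal_regret(losses, actions):
    """Best single replacement of one action a by one other action b."""
    n = len(losses[0])
    return max(pair_regret(losses, actions, a, b)
               for a in range(n) for b in range(n))


def swap_regret(losses, actions):
    """Best replacement chosen separately for every action a."""
    n = len(losses[0])
    return sum(max(pair_regret(losses, actions, a, b) for b in range(n))
               for a in range(n))


# losses[t][a] is the loss of action a at round t; actions[t] is what was played.
losses = [[0, 1], [1, 0], [0, 1], [1, 0]]
actions = [1, 0, 1, 0]
print(external_regret(losses, actions),
      internal_regret(losses, actions),
      swap_regret(losses, actions))
```

<p>In this history every round is played badly, so swapping both actions at once recovers more loss than any single replacement, and swap regret exceeds internal regret.</p>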
</details>
</li>
<li>Why is Swap Regret important?
<details><summary>Hint</summary>
<p>Swap Regret is connected to Correlated Equilibrium. Can you see why?</p>
</details>
</li>
<li>Implement CFR (or CFR+ / CFR-D) in your favorite programming language to play
Leduc Poker or Liar’s Dice.</li>
<li>How do you know if you’ve implemented CFR correctly?
<details><summary>Solution</summary>
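<p>One way is to test it by implementing Local Best Response, which approximates a
best response to your strategy and so gives a lower bound on its exploitability. A correct
CFR strategy should lose very little against it, and that loss should shrink as you run more iterations.</p>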
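<p>The same kind of sanity check can be tried on a much smaller problem first. The sketch below is not Local Best Response: it runs plain regret matching in self-play on a zero-sum matrix game and measures exact exploitability, which is feasible only because the game is tiny. The payoff matrix (a rock-paper-scissors variant with one changed entry) and all function names are assumptions made for illustration.</p>

```python
def regret_matching_strategy(regrets):
    """Play in proportion to positive cumulative regret (uniform if none)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n


def solve(A, iterations=10000):
    """Self-play regret matching; A[i][j] is the row player's payoff and the
    column player receives -A[i][j]. Returns the average strategies."""
    n, m = len(A), len(A[0])
    reg_row, reg_col = [0.0] * n, [0.0] * m
    sum_row, sum_col = [0.0] * n, [0.0] * m
    for _ in range(iterations):
        x = regret_matching_strategy(reg_row)
        y = regret_matching_strategy(reg_col)
        u_row = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        u_col = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        v = sum(x[i] * u_row[i] for i in range(n))  # row's expected payoff
        for i in range(n):
            reg_row[i] += u_row[i] - v
            sum_row[i] += x[i]
        for j in range(m):
            reg_col[j] += u_col[j] + v  # column's expected payoff is -v
            sum_col[j] += y[j]
    return ([s / iterations for s in sum_row],
            [s / iterations for s in sum_col])


def exploitability(A, x, y):
    """Total amount that best responses gain against the profile (x, y)."""
    n, m = len(A), len(A[0])
    best_row = max(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    best_col = min(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    return best_row - best_col


A = [[0, -1, 2],
     [1, 0, -1],
     [-1, 1, 0]]
x, y = solve(A)
print(x, y, exploitability(A, x, y))
```

<p>If the regret updates were wrong (for example, a flipped sign for the column player), the exploitability of the average strategies would typically stay large instead of shrinking as the iteration count grows.</p>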
</details>
</li>
</ol>
<p><br /></p>
<h1 id="5-counterfactual-regret-minimization-2">5 Counterfactual Regret Minimization #2</h1>
<p><strong>Motivation</strong>: In the last section, we saw the practical side of CFR and how effective it
can be. In this section, we’ll understand the theory underlying it. This will culminate
with Blackwell’s Approachability Theorem, a generalization of the minimax theorem to repeated two-player
zero-sum games with vector-valued payoffs. This is a challenging session but the payoff will be a much
keener understanding of CFR’s strengths.</p>
<p><strong>Required</strong>:</p>
<ol>
<li>PLG: Sections 7.3 - 7.7, 7.9.</li>
</ol>
<p><strong>Optional</strong>:</p>
<ol>
<li><a href="http://wwwf.imperial.ac.uk/~dturaev/Hart0.pdf">A Simple Adaptive Procedure Leading to Correlated Equilibrium</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture11_2007.pdf">Prof. Johari’s 2007 Class - 11</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture13_2007.pdf">Prof. Johari’s 2007 Class - 13</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture14_2007.pdf">Prof. Johari’s 2007 Class - 14</a>.</li>
<li><a href="http://web.stanford.edu/~rjohari/teaching/notes/336_lecture15_2007.pdf">Prof. Johari’s 2007 Class - 15</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>
<p>PLG: Prove Lemma 7.1. <br />
<script type="math/tex">\textit{Lemma}</script>: A probability distribution <script type="math/tex">P</script> over the set of all <script type="math/tex">K</script>-tuples
<script type="math/tex">i = (i_{1}, ..., i_{K})</script> of actions is a correlated equilibrium iff, for every
player <script type="math/tex">k \in \{1, ..., K\}</script> and actions <script type="math/tex">j, j' \in \{1, ..., N_{k}\}</script>, we have</p>
<script type="math/tex; mode=display">\sum_{i: i_{k} = j} P(i)\big(\ell^{(k)}(i) - \ell^{(k)}(i^{-}, j')\big) \leq 0</script>
<p>where <script type="math/tex">(i^{-}, j') = (i_{1}, ..., i_{k-1}, j', i_{k+1}, ..., i_{K})</script> and <script type="math/tex">\ell^{(k)}</script> is the loss function of player <script type="math/tex">k</script>.</p>
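<p>The condition in the lemma is easy to check numerically for a small game, which can help when working toward the proof. The sketch below tests it for the game of Chicken with illustrative payoffs (losses are taken to be negated utilities); the payoff numbers and the helper <code>is_correlated_equilibrium</code> are assumptions, not taken from PLG.</p>

```python
from fractions import Fraction
from itertools import product


def is_correlated_equilibrium(P, loss, num_actions):
    """Check, for every player k and every pair of actions (j, j'), that the
    sum over profiles i with i_k = j of P(i) * (loss_k(i) - loss_k(i^-, j'))
    is at most zero."""
    for k, n_k in enumerate(num_actions):
        for j, j_alt in product(range(n_k), repeat=2):
            total = 0
            for profile, prob in P.items():
                if profile[k] != j:
                    continue
                deviated = profile[:k] + (j_alt,) + profile[k + 1:]
                total += prob * (loss[profile][k] - loss[deviated][k])
            if total > 0:
                return False
    return True


# Chicken: action 0 = dare, action 1 = chicken out (made-up utilities).
utility = {(0, 0): (0, 0), (0, 1): (7, 2), (1, 0): (2, 7), (1, 1): (6, 6)}
loss = {profile: tuple(-u for u in us) for profile, us in utility.items()}

third = Fraction(1, 3)
traffic_light = {(0, 1): third, (1, 0): third, (1, 1): third}
always_chicken = {(1, 1): Fraction(1)}

print(is_correlated_equilibrium(traffic_light, loss, (2, 2)))
print(is_correlated_equilibrium(always_chicken, loss, (2, 2)))
```

<p>Exact fractions are used so that borderline cases with a sum of exactly zero are not misclassified by floating-point error.</p>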
</li>
<li>
<p>It’s brushed over in the proof of Theorem 7.5 in PLG, but prove that if set
<script type="math/tex">S</script> is approachable, then every halfspace <script type="math/tex">H</script> containing <script type="math/tex">S</script> is approachable.</p>
<details><summary>Solution</summary>
<p>Because \(S \subseteq H\), the distance from any point to \(H\) is at most its distance to \(S\).
So player one can simply play the strategy that makes \(S\) approachable (see Johari's Lecture 13):
the average payoff then converges to \(S\), and therefore also to \(H\).</p>
</details>
</li>
</ol>
<p><br /></p>
<h1 id="6-deepstack">6 DeepStack</h1>
<p><strong>Motivation</strong>: Let’s read the paper! A summary of what’s going on to help with your
understanding:</p>
<p>DeepStack runs counterfactual regret minimization at every decision. However, it uses
two separate neural networks, one for after the flop and one for after the turn, to
estimate the counterfactual values without having to continue running CFR after those
moments. This approach is trained beforehand and helps greatly with cutting short the
search space at inference time. Each network takes as input the size of the pot
and the current Bayesian ranges for each player across all hands. They output the
counterfactual values for each hand for each player.</p>
<p>In addition to DeepStack, we also include Libratus as required reading. This paper
highlights Game Theory and CFR as the really important concepts in this curriculum;
deep learning is not necessary to build a champion Poker bot.</p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36630/t/58b7a3dce3df28761dd25e54/1488430045412/DeepStack.pdf">DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker</a>.</li>
<li><a href="https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36630/t/58bed28de3df287015e43277/1488900766618/DeepStackSupplement.pdf">DeepStack Supplementary Materials</a>.</li>
<li><a href="https://arxiv.org/pdf/1705.02955.pdf">Libratus</a>.</li>
<li><a href="https://vimeo.com/212288252">Michael Bowling on DeepStack</a>.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://github.com/lifrordi/DeepStack-Leduc">DeepStack Implementation for Leduc Hold’em</a>.</li>
<li><a href="https://www.youtube.com/watch?v=2dX0lwaQRX0">Noam Brown on Libratus</a>.</li>
<li><a href="https://arxiv.org/abs/1805.08195">Depth-Limited Solving for Imperfect-Information Games</a>: This paper is fascinating because it achieves a poker-playing bot almost as good as Libratus while using a fraction of the computation and disk space.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What are the differences between the approaches taken in DeepStack and in Libratus?
<details><summary>Solution</summary>
<p>Here are some differences:</p>
<ul>
<li>A clear difference is that DeepStack uses a deep neural network to reduce the necessary search space, and Libratus does not.</li>
<li>DeepStack does not use any action abstraction and instead melds those considerations into the pot size input. Libratus does use a dense action abstraction but adapts it each game and additionally constructs new sub-games on the fly for actions not in its abstraction.</li>
<li>DeepStack uses card abstraction by first clustering the hands into 1000 buckets and then considering probabilities over that range. Libratus does not use any card abstraction preflop or on the flop, but does use it on later rounds such that the game's \(10^{61}\) decision points are reduced to \(10^{12}\).</li>
<li>DeepStack does not have a way to learn from recent games without further neural network training. On the other hand, Libratus improves via a background process that adds novel opponent actions to its action abstraction.</li>
</ul>
</details>
</li>
<li>Can you succinctly explain “Continual Re-solving”?</li>
<li>Can you succinctly explain AIVAT?</li>
</ol>
<p>Author: cinjon. Thank you to Michael Bowling, Michael Johanson, and Marc Lanctot for contributions to this guide.</p>
<p><strong>AlphaGoZero</strong> (published 2018-06-27T15:55:00+00:00, https://www.depthfirstlearning.com/2018/AlphaGoZero)</p>
<p>Thank you to Marc Lanctot, Hugo Larochelle, Katherine Lee, and Tim Lillicrap for contributions to this guide.</p>
<p>Additionally, this would not have been possible without the generous support of
Prof. Joan Bruna and his class at NYU, <a href="https://github.com/joanbruna/MathsDL-spring18">The Mathematics of Deep Learning</a>.
Special thanks to him, as well as Martin Arjovsky, my colleague in leading this
recitation, and my fellow students Ojas Deshpande, Anant Gupta, Xintian Han,
Sanyam Kapoor, Chen Li, Yixiang Luo, Chirag Maheshwari, Zsolt Pajor-Gyulai,
Roberta Raileanu, Ryan Saxe, and Liang Zhuo.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/ag0-deps.svg" width="200"></iframe>
<div>Concepts used in AlphaGoZero. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>AlphaGoZero made a big splash when it debuted, and for good reason. The grand effort
was led by David Silver at DeepMind and was an extension of work that he started
during his PhD. The main idea is to solve the game of Go, and the approach taken
is to use an algorithm called Monte Carlo Tree Search (MCTS). This algorithm acts as an expert guide to teach
a deep neural network how to approximate the value of each state. The convergence
properties of MCTS provide the neural network with a principled way to reduce the
search space.</p>
<p>In this curriculum, you will focus on the study of two-person zero-sum perfect
information games and develop understanding so that you can completely grok
AlphaGoZero.</p>
<p><br /></p>
<h1 id="common-resources">Common Resources:</h1>
<ol>
<li>Knuth: <a href="https://pdfs.semanticscholar.org/dce2/6118156e5bc287bca2465a62e75af39c7e85.pdf">An Analysis of Alpha-Beta Pruning</a></li>
<li>SB: <a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">Reinforcement Learning: An Introduction, Sutton & Barto</a>.</li>
<li>Kun: <a href="https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/">Jeremy Kun: Optimism in the Face of Uncertainty</a>.</li>
<li>Vodopivec: <a href="https://pdfs.semanticscholar.org/3d78/317f8aaccaeb7851507f5256fdbc5d7a6b91.pdf">On Monte Carlo Tree Search and Reinforcement Learning</a>.</li>
</ol>
<p><br /></p>
<h1 id="1-minimax--alpha-beta-pruning">1 Minimax & Alpha Beta Pruning</h1>
<p><strong>Motivation</strong>: Minimax and Alpha-Beta Pruning are original ideas that blossomed
from the study of games starting in the 50s. To this day, they are components in
strong game-playing computer engines like Stockfish. In this class, we will go
over these foundations, learn from Prof. Knuth’s work analyzing their properties,
and prove that these algorithms are theoretically sound solutions to two-player
games.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Perfect Information Games.</li>
<li>Minimax.</li>
<li>Alpha-Beta Pruning.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="http://www.cs.cornell.edu/courses/cs4700/2019sp/lectures/Lecture9.pdf">Cornell Recitation on Minimax & AB Pruning</a>.</li>
<li>Knuth: Section 6 (Theorems 1&2, Corollaries 1&3).</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://www.cs.cmu.edu/~arielpro/mfai_papers/lecture1.pdf">CMU’s Mathematical Foundations of AI Lecture 1</a>.</li>
<li>Knuth: Sections 1-3.</li>
<li><a href="https://www.chessprogramming.org/index.php?title=Minimax">Chess Programming on Minimax</a>.</li>
<li><a href="https://www.chessprogramming.org/Alpha-Beta">Chess Programming on AB Pruning</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Knuth: Show that AlphaBetaMin <script type="math/tex">= G2(p, \alpha, \beta) = -F2(p, -\beta, -\alpha) = -</script>AlphaBetaMax. (p. 300)
<details><summary>Solution</summary>
<p>It is equivalent to show that \(F2(p, \alpha, \beta) = -G2(p, -\beta, -\alpha)\) for all
\(\alpha, \beta\) (substitute \(-\beta, -\alpha\) for \(\alpha, \beta\)).
If \(d = 0\), then \(F2(p, \alpha, \beta) = f(p)\) and \(G2(p, -\beta, -\alpha) = g(p) = -f(p)\)
as desired, where the last step follows from equation 2 on p. 295.
</p>
<p>Otherwise, \(d > 0\) and we proceed by induction on the height \(h\). The
base case of \(h = 0\) is trivial because then the tree is a single root and
consequently is the \(d = 0\) case. Assume it is true for height \(< h\).
Then for \(p\) of height \(h\), we have that \(m = \alpha\) at the start of
\(F2(p, \alpha, \beta)\) and \(m\prime = -\alpha\) at the start of \(G2(p, -\beta, -\alpha)\). So
\(m = -m\prime\).
</p>
<p>In the i-th iteration of the loop, suppose \(m = -m\prime\) on entry and label the updated values
\(m_{n}\) and \(m_{n}\prime\). \(F2\) computes \(t = G2(p_{i}, m, \beta)\) and \(G2\) computes
\(t\prime = F2(p_{i}, -\beta, m\prime) = F2(p_i, -\beta, -m)\), so \(t = -t\prime\)
by the inductive assumption. Then
\(t > m \iff -t < -m \iff t\prime < m\prime\), and in that case \(m_{n} = t = -t\prime = -m_{n}\prime\),
which means that every time there is an update to the value of \(m\), it will
be mirrored in both functions. Further, because
\(m \geq \beta \iff -m \leq -\beta \iff m\prime \leq -\beta\), we have that \(G2\) and
\(F2\) will have the same stopping criterion. Together, these imply that
\(m = -m\prime\) after each iteration of the loop, and so
\(F2(p, \alpha, \beta) = -G2(p, -\beta, -\alpha)\) as desired.
</p>
</details>
</li>
<li>Knuth: For Theorem 1.1, why are the successor positions of type 2? (p. 305)
<details><summary>Solution</summary>
<p>By the definition of being type 1, \(p = a_{1} a_{2} \ldots a_{l}\), where
each \(a_{k} = 1\). Its successor positions other than the first, \(p_{k} = a_{1} \ldots a_{l}\, k\) with \(k > 1\), all have length
\(l + 1\) and their first term \(> 1\) is at position \(j = l+1\), the last entry.
Consequently, \((l+1) - j = 0\) is even and they are type 2.
</p>
</details>
</li>
<li>Knuth: For Theorem 1.2, why is it that p’s successor position is of type 3
if p is not terminal?
<details><summary>Solution</summary>
<p>If \(p\) is type 2 and size \(l\), then for \(j\) s.t. \(a_j\) is the first entry where
\(a_j > 1\), we have that \(l - j\) is even. When it's not terminal, then its
successor position \(p_1 = a_{1} \ldots a_{j} \ldots a_{l} 1\) has length
\(l + 1\), which implies that \(l + 1 - j\) is odd and so \(p_1\) is
type 3.
</p>
</details>
</li>
<li>Knuth: For Theorem 1.3, why is it that p’s successor positions are of type 2
if p is not terminal?
<details><summary>Hint</summary>
<p>This is similar to the above two.</p>
</details>
</li>
<li>Knuth: Show that the three inductive steps of Theorem 2 are correct.</li>
</ol>
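<p>The identity in the first question can also be checked numerically. The following is a minimal
sketch of the mutually recursive \(F2\) / \(G2\) procedures, assuming a game tree stored as nested
Python lists whose leaves are the values \(f(p)\) from the view of the player to move. The tree
representation and the helpers <code>random_tree</code> and <code>negamax</code> are assumptions
made for illustration and are not from Knuth's paper.</p>

```python
import math
import random


def is_leaf(p):
    # Assumption: internal positions are lists of successors, leaves are numbers.
    return not isinstance(p, list)


def F2(p, alpha, beta):
    """Maximizing procedure; a leaf returns f(p)."""
    if is_leaf(p):
        return p
    m = alpha
    for child in p:
        t = G2(child, m, beta)
        if t > m:
            m = t
        if m >= beta:  # cutoff
            break
    return m


def G2(p, alpha, beta):
    """Minimizing procedure; a leaf returns g(p) = -f(p)."""
    if is_leaf(p):
        return -p
    m = beta
    for child in p:
        t = F2(child, alpha, m)
        if t < m:
            m = t
        if m <= alpha:  # cutoff
            break
    return m


def negamax(p):
    """Full search without pruning, for comparison."""
    if is_leaf(p):
        return p
    return max(-negamax(child) for child in p)


def random_tree(rng, depth):
    """Random tree with integer leaf values (hypothetical test helper)."""
    if depth == 0 or rng.random() < 0.2:
        return rng.randint(-10, 10)
    return [random_tree(rng, depth - 1) for _ in range(rng.randint(1, 3))]


rng = random.Random(0)
for _ in range(100):
    tree = random_tree(rng, 4)
    a = rng.randint(-10, 9)
    b = rng.randint(a + 1, 10)
    assert G2(tree, a, b) == -F2(tree, -b, -a)
    assert F2(tree, -math.inf, math.inf) == negamax(tree)
```

<p>The loop only spot-checks the identity on random trees; it is a sanity check, not a substitute
for the induction argument.</p>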
<p><br /></p>
<h1 id="2-multi-armed-bandits--upper-confidence-bounds">2 Multi-Armed Bandits & Upper Confidence Bounds</h1>
<p><strong>Motivation</strong>: The multi-armed bandits problem is a framework for understanding
the exploitation vs exploration tradeoff. Upper Confidence Bounds, or UCB, is
an algorithmically tight approach to addressing that tradeoff under certain
constraints. Together, they are important components of how Monte Carlo Tree
Search (MCTS), a key aspect of AlphaGoZero, was originally formalized. For
example, in MCTS there is a notion of node selection where UCB is used extensively.
In this section, we will cover bandits and UCB.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Basics of reinforcement learning.</li>
<li>Multi-armed bandit algorithms and their bounds.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>SB: Sections 2.1 - 2.7.</li>
<li>Kun.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf">Original UCB1 Paper</a></li>
<li><a href="https://courses.cs.washington.edu/courses/cse599s/14sp/scribes/lecture15/lecture15_draft.pdf">UW Lecture Notes</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>SB: Exercises 2.3, 2.4, 2.6.</li>
<li>SB: What are the pros and cons of the optimistic initial values method? (Section 2.6)</li>
<li>Kun: In the proof for the expected cumulative regret of UCB1, why is <script type="math/tex">\delta *T</script>
a trivial regret bound if the deltas are all the same?
<details><summary>Solution</summary>
<p>\(
\begin{align}
\mathbb{E}[R_{A}(T)] &= \mu^{*}T - \mathbb{E}[G_{A}(T)] \\
&= \mu^{*}T - \sum_{i} \mu_{i}\mathbb{E}[P_{i}(T)] \\
&= \sum_{i} (\mu^{*} - \mu_{i})\mathbb{E}[P_{i}(T)] \\
&= \sum_{i} \delta_{i} \mathbb{E}[P_{i}(T)] \\
&\leq \delta \sum_{i} \mathbb{E}[P_{i}(T)] \\
&= \delta * T
\end{align}
\)
</p>
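<p>The third line follows from \(\sum_{i} \mathbb{E}[P_{i}(T)] = T\), the
fifth line from \(\delta_{i} \leq \delta\) (with equality when the deltas are all the same),
and the last line uses \(\sum_{i} \mathbb{E}[P_{i}(T)] = T\) again.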
</p>
</details>
</li>
<li>Kun: Do you understand the argument for why the regret bound is <script type="math/tex">O(\sqrt{KT\log(T)})</script>?
<details><summary>Hint</summary>
<p>
What happens if you break the arms into those with regret \(< \sqrt{K(\log{T})/T}\)
and those with regret \(\geq \sqrt{K(\log{T})/T}\)? Can we use this to bound
the total regret?
</p>
</details>
</li>
<li>Reproduce the UCB1 algorithm in code with minimal supervision.</li>
</ol>
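<p>For checking your own attempt at the last question, here is one possible minimal sketch of UCB1
as described in Kun's post: play every arm once, then always play the arm with the largest
empirical mean plus \(\sqrt{2 \ln t / n_{i}}\). The function signature, the <code>pull</code>
callback, and the Bernoulli arms in the usage example are assumptions made for illustration.</p>

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(i)` must return a reward in [0, 1]."""
    counts = [0] * n_arms    # n_i: number of times arm i was played
    means = [0.0] * n_arms   # empirical mean reward of arm i
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1      # initialization: play each arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental average
    return counts, means


# Hypothetical usage: three Bernoulli arms with success probabilities 0.1, 0.5, 0.9.
rng = random.Random(0)
probs = [0.1, 0.5, 0.9]
counts, means = ucb1(lambda i: 1.0 if rng.random() < probs[i] else 0.0, 3, 10000)
print(counts)
```

<p>The printed counts should show the suboptimal arms being played only a small fraction of the
time, in line with the logarithmic regret bound discussed above.</p>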
<p><br /></p>
<h1 id="3-policy--value-functions">3 Policy & Value Functions</h1>
<p><strong>Motivation</strong>: Policy and value functions are at the core of reinforcement
learning. The policy function gives the probabilities that our
agent assigns to each action in a given state. When we sample from these, we would like
better actions to have higher probability. The value function is our estimate
of the strength of the current state. In AlphaGoZero, a single network
calculates both a value and a policy, then later updates its weights according
to how well the agent performs in the game.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Bellman equation.</li>
<li>Policy gradient.</li>
<li>On-policy / off-policy.</li>
<li>Policy iteration.</li>
<li>Value iteration.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Value Function:
<ol>
<li>SB: Sections 3.5, 3.6, 3.7.</li>
<li>SB: Sections 9.1, 9.2, 9.3*.</li>
</ol>
</li>
<li>Policy Function:
<ol>
<li>SB: Sections 4.1, 4.2, 4.3.</li>
<li>SB: Sections 13.1, 13.2*, 13.3, 13.4.</li>
</ol>
</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Sergey Levine: <a href="https://www.youtube.com/watch?v=tWNpiNzWuO8&feature=youtu.be">Berkeley Fall’17: Policy Gradients</a> → This is really good.</li>
<li>Sergey Levine: <a href="https://www.youtube.com/watch?v=k1vNh4rNYec&feature=youtu.be">Berkeley Fall’17: Value Functions</a> → This is really good.</li>
<li><a href="http://karpathy.github.io/2016/05/31/rl/">Karpathy does Pong</a>.</li>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf">David Silver on PG</a>.</li>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf">David Silver on Value</a>.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Why does policy gradient have such high variance?</li>
<li>What is the difference between off-policy and on-policy?
<details><summary>Solution</summary>
<p>
On-policy algorithms learn about the same policy that is choosing the actions.
Off-policy algorithms learn about a target policy from actions chosen by a different
(behavior) policy. Examples of on-policy algorithms are SARSA and REINFORCE. An example of an
off-policy algorithm is Q-Learning.
</p>
</details>
</li>
<li>SB: Exercises 3.13, 3.14, 3.15, 3.20, 4.3.</li>
<li>SB: Exercise 4.6 - How would policy iteration be defined for action values?
Give a complete algorithm for computing <script type="math/tex">q^{*}</script>, analogous to that on page 65
for computing <script type="math/tex">v^{*}</script>.
<details><summary>Solution</summary>
<p>The solution follows the policy iteration algorithm (page 65) for \(v^{*}\), with the following modifications:</p>
<ol>
<li>Initialize \(Q(s, a)\) and the policy \( \pi(s) \) arbitrarily.</li>
<li><b>Policy Evaluation</b>: Repeat the update \( Q(s, a) \leftarrow \sum_{s'} P_{ss'}^{a} R_{ss'}^{a} + \gamma \sum_{s'} \sum_{a'} P_{ss'}^{a} Q(s', a') \pi(a' | s') \) for all \((s, a)\) until the values stop changing. <br />
Here \( P_{ss'}^{a} = P(s' |s, a) \) and \( R_{ss'}^{a} = R(s, a, s') \).</li>
<li><b>Policy Improvement</b>: Update \( \pi(s) \leftarrow \arg\max_{a} Q(s, a) \). If the policy is \(unstable\), go to step 2. Here, \(unstable\) means \( \pi_{before\_update}(s) \neq \pi_{after\_update}(s) \) for some state \(s\).</li>
<li>\( q^{*} \leftarrow Q \)</li>
</ol>
</details>
</li>
<li>SB: Exercise 13.2 - Prove that the eligibility vector
<script type="math/tex">\nabla_{\theta} \ln \pi (a | s, \theta) = x(s, a) - \sum_{b} \pi (b | s, \theta)x(s, b)</script>
using the definitions and elementary calculus. Here, <script type="math/tex">\pi (a | s, \theta)</script> = softmax( <script type="math/tex">\theta^{T}x(s, a)</script> ).
<details><summary>Solution</summary>
<p align="center">
By definition, we have \( \pi( a| s, \theta) = \frac{e^{ \theta^{T}
\mathbf{x}( s, a) }}{ \sum_b e^{ \theta^{T}\mathbf{x}(s, b) }} \), where
\( \mathbf{x}(s, a) \) is the state-action feature representation. Consequently:
<br />
\(
\begin{align}
\nabla_{\theta} \ln \pi (a | s, \theta) &= \nabla_\theta \Big( \theta^{T}\mathbf{x}(s, a) - \ln \sum_b e^{ \theta^{T}\mathbf{x}(s, b) } \Big) \\
&= \mathbf{x}(s, a) - \sum_b \mathbf{x}(s, b) \frac{ e^{ \theta^{T}\mathbf{x}(s, b) } }{ \sum_{b'} e^{ \theta^{T}\mathbf{x}(s, b') } } \\
&= \mathbf{x}(s, a) - \sum_{b} \pi (b | s, \theta)\mathbf{x}(s, b) \\
\end{align}
\)
</p>
</details>
</li>
</ol>
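<p>As a companion to Exercise 4.6, the following is a minimal sketch of policy iteration on
action values for a tabular MDP with a deterministic policy, so that
\(\sum_{a'} \pi(a' | s') Q(s', a')\) reduces to \(Q(s', \pi(s'))\). The transition format
<code>P[s][a] = [(prob, next_state, reward), ...]</code>, the function name, and the two-state
example are assumptions made for illustration; they are not taken from Sutton and Barto.</p>

```python
def q_policy_iteration(P, gamma, theta=1e-10):
    """Policy iteration on Q for a tabular MDP.

    P[s][a] is a list of (probability, next_state, reward) transitions.
    Returns the converged action values and the greedy deterministic policy.
    """
    n_states = len(P)
    n_actions = len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    policy = [0] * n_states
    while True:
        # Policy evaluation: sweep until Q for the current policy stops changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    new_q = sum(
                        p * (r + gamma * Q[s2][policy[s2]]) for p, s2, r in P[s][a]
                    )
                    delta = max(delta, abs(new_q - Q[s][a]))
                    Q[s][a] = new_q
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to Q.
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: Q[s][a])
            # Only switch on a strict improvement, to avoid flip-flopping on ties.
            if Q[s][best] > Q[s][policy[s]] + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return Q, policy


# Hypothetical two-state example: action 0 moves to state 0 with reward 0;
# action 1 moves to state 1 with reward 1 (from state 0) or 2 (from state 1).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 0, 0.0)], [(1.0, 1, 2.0)]],
]
Q, policy = q_policy_iteration(P, gamma=0.5)
print(policy, Q)
```

<p>In this example the greedy policy always takes action 1, and solving the Bellman equations by
hand gives \(q^{*}(1, 1) = 2 / (1 - 0.5) = 4\) and \(q^{*}(0, 1) = 1 + 0.5 \cdot 4 = 3\), which the
sketch should reproduce up to the convergence threshold.</p>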
<p><br /></p>
<h1 id="4-mcts--uct">4 MCTS & UCT</h1>
<p><strong>Motivation</strong>: Monte Carlo Tree Search (MCTS) forms the backbone of AlphaGoZero.
It is what lets the algorithm reliably explore and then home in on the best policy.
UCT (UCB for Trees) combines MCTS and UCB so that we get reliable convergence
guarantees. In this section, we will explore how MCTS works and how to make
it excel for our purposes in solving Go, a game with an enormous branching factor.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Conceptual understanding of Monte Carlo Tree Search.</li>
<li>Optimality of UCT.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>SB: Section 8.11</li>
<li><a href="https://gnunet.org/sites/default/files/Browne%20et%20al%20-%20A%20survey%20of%20MCTS%20methods.pdf">Browne</a>: Sections 2.2, 2.4, 3.1-3.5, 8.2-8.4.</li>
<li><a href="http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/1029/paper_thesis.pdf">Silver Thesis</a>: Sections 1.4.2 and 3.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://jhamrick.github.io/quals/planning%20and%20decision%20making/2015/12/16/Browne2012.html">Jess Hamrick on Browne</a>.</li>
<li><a href="https://hal.archives-ouvertes.fr/file/index/docid/116992/filename/CG2006.pdf">Original MCTS Paper</a>.</li>
<li><a href="http://ggp.stanford.edu/readings/uct.pdf">Original UCT Paper</a>.</li>
<li>Browne:
<ol>
<li>Section 4.8: MCTS applied to Stochastic or Imperfect Information Games.</li>
<li>Sections 7.2, 7.3, 7.5, 7.7: Applications of MCTS.</li>
</ol>
</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Can you detail each of the four parts of the MCTS algorithm?
<details><summary>Solution</summary>
<ol>
<li><b>Selection</b>: Starting from the root, repeatedly select a child node according to the tree policy (which balances exploration and exploitation) until reaching a node that can be expanded.</li>
<li><b>Expansion</b>: Add one or more child nodes to the tree for actions not yet tried from the selected node.</li>
<li><b>Simulation</b>: Simulate from the new node with the default policy until termination or upon reaching a suitably small future reward (like from reward decay).</li>
<li><b>Backup</b>: Back up the simulation result along the path taken through the tree, updating the statistics of each node on it.</li>
</ol>
</details>
</li>
<li>What happens to the information gained from the Tree Search after each run?
<details><summary>Solution</summary>
<p>We can reuse the accumulated statistics in subsequent runs. We could also
ignore those statistics and build a fresh tree from each subsequent root. Both are used
in actual implementations.
</p>
</details>
</li>
<li>What characteristics of a domain would make MCTS a good algorithmic choice?
<details><summary>Solution</summary>
<p>
A few such characteristics are:
</p>
<ul>
<li>MCTS is aheuristic, meaning that it does not require any domain-specific
knowledge. Consequently, if it is difficult to produce game heuristics for
your target domain (e.g. Go), then it can perform much better than alternatives
like Minimax. And on the flip side, if you do have domain-specific knowledge,
MCTS can incorporate it and often improves dramatically.
</li>
<li>
If the target domain needs actions online, then MCTS is a good choice as all
values are always up to date. Go does not have this property but digital games
like in the <a href="http://ggp.stanford.edu/">General Game Playing</a> suite
may.
</li>
<li>
If the target domain's game tree is of a nontrivial size, then MCTS may be
a much better choice than other algorithms as it tends to build unbalanced
trees that explore the more promising routes rather than consider all routes.
</li>
<li>
If there is noise or delayed rewards in the target domain, then MCTS is a
good choice because it is robust to these effects which can gravely impact
other algorithms such as modern Deep Reinforcement Learning.
</li>
</ul>
</details>
</li>
<li>What are examples of domain knowledge default policies in Go?
<details><summary>Solution</summary>
<ul>
<li>Crazy Stone, an early program that won the 2006 9x9 Computer Go Olympiad,
used an
<a href="https://www.researchgate.net/figure/Examples-of-move-urgency_fig2_220174551">urgency</a>
heuristic value for each of the moves on the board.
</li>
<li>
MoGo, an early Go program built on UCT, bases its default policy on this
sequence:
<ol>
<li>Respond to ataris by playing a saving move at random.</li>
<li>If one of the eight intersections surrounding the last move matches a
simple pattern for cutting or <i>hane</i>, randomly play one.</li>
<li>If there are capturing moves, play one at random.</li>
<li>Play a random move.</li>
</ol>
</li>
<li>The second version of Crazy Stone learned a library of strong patterns from
actual game play and incorporated it into
its default policy.
</li>
</ul>
</details>
</li>
<li>Why is UCT optimal? For a finite-horizon MDP with rewards scaled to lie in
<script type="math/tex">[0, 1]</script>, can you prove that the failure probability at the root converges
to zero at a polynomial rate in the number of games simulated?
<details><summary>Hint</summary>
<p>
Try using induction on \(D\), the horizon of the MDP. At \(D=1\), to what
result does this correspond?
</p>
</details>
<details><summary>Hint 2</summary>
<p>
Assume that the result holds for a horizon up to depth \(D - 1\) and
consider a tree of depth \(D\). We can keep the cumulative rewards bounded
in the interval \([0, 1]\) by dividing by \(D\). Now can you show that the UCT payoff
sequences at the root satisfy the drift conditions, repeated below?
</p>
<ul>
<li>The payoffs are bounded - \(0 \leq X_{it} \leq 1\), where \(i\) is the
arm number and \(t\) is the time step.</li>
<li>The expected values of the averages, \(\overline{X_{in}} =
\frac{1}{n} \sum_{t=1}^{n} X_{it}\), converge.</li>
<li>Define \(\mu_{in} = \mathbb{E}[\overline{X_{in}}]\) and \(\mu_{i} = \lim_{n\to\infty} \mu_{in}\).
Then, for \(c_{t, s} = 2C_{p}\sqrt{\frac{\ln{t}}{s}}\), where \(C_p\) is a
suitable constant, both
\(\mathbb{P}(\overline{X_{is}} \geq \mu_{i} + c_{t, s}) \leq t^{-4}\) and
\(\mathbb{P}(\overline{X_{is}} \leq \mu_{i} - c_{t, s}) \leq t^{-4}\) hold.
</li>
</ul>
</details>
<details><summary>Solution</summary>
<p>
For the complete proof, see the original
<a href="http://ggp.stanford.edu/readings/uct.pdf">UCT</a> paper.
</p>
</details>
</li>
</ol>
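<p>The four phases from the first question can be sketched as a small UCT implementation. This is
a minimal illustration on a toy game, not the search used in any Go program: the game (single-pile
Nim where players remove 1 to 3 stones and whoever takes the last stone wins), the
<code>Node</code> class, and the exploration constant \(\sqrt{2}\) are all assumptions chosen to
keep the example short.</p>

```python
import math
import random


class Node:
    def __init__(self, stones, player_just_moved, parent=None, move=None):
        self.stones = stones
        self.player_just_moved = player_just_moved
        self.parent = parent
        self.move = move
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0  # wins counted for the player who moved into this node

    def uct_child(self, c=math.sqrt(2)):
        # UCB1 applied to the children: mean win rate plus exploration bonus.
        return max(
            self.children,
            key=lambda ch: ch.wins / ch.visits
            + c * math.sqrt(math.log(self.visits) / ch.visits),
        )


def uct_search(stones, iterations, rng):
    root = Node(stones, player_just_moved=1)  # player 0 moves first
    for _ in range(iterations):
        node = root
        # 1. Selection: descend with the tree policy while fully expanded.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one child for an untried move.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, 1 - node.player_just_moved, node, move)
            node.children.append(child)
            node = child
        # 3. Simulation: uniformly random default policy until the game ends.
        stones_left, last = node.stones, node.player_just_moved
        while stones_left > 0:
            stones_left -= rng.randint(1, min(3, stones_left))
            last = 1 - last
        winner = last  # whoever took the last stone
        # 4. Backup: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            if winner == node.player_just_moved:
                node.wins += 1.0
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.move, root


move, root = uct_search(5, 3000, random.Random(0))
print(move, [(ch.move, ch.visits) for ch in root.children])
```

<p>From a pile of 5 the winning move is to take 1 stone, leaving the opponent a multiple of 4, so
the most visited child at the root should correspond to that move. Returning the most visited child
rather than the one with the highest mean is a common design choice because visit counts are less
noisy.</p>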
<p><br /></p>
<h1 id="5-mcts--rl">5 MCTS & RL</h1>
<p><strong>Motivation</strong>: Up to this point, we have learned a lot about how games can be
solved and how reinforcement learning works on a foundational level. Before we
jump into the paper, one last foray contrasting and unifying the games vs
learning perspective is worthwhile for understanding the domain more fully. In
particular, we will focus on a paper from Vodopivec et al. After completing
this section, you should have an understanding of what research directions in
this field have been thoroughly explored and which still have open directions.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Integrating MCTS and reinforcement learning.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Vodopivec:
<ol>
<li>Sections 3.1-3.4: Connection between MCTS and RL.</li>
<li>Sections 4.1-4.3: Integrating MCTS and RL.</li>
</ol>
</li>
<li><a href="https://papers.nips.cc/paper/1292-why-did-td-gammon-work.pdf">Why did TD-Gammon Work?</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li>Vodopivec: Section 5: Survey of research inspired by both fields.</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What are key differences between MCTS and RL?</li>
<li>UCT can be described in RL terms as follows: “The original UCT searches
identically as an offline on-policy every-visit MC control algorithm that uses
UCB1 as the policy.” What does each of these terms mean?
<details><summary>Solution</summary>
<ul>
<li>
UCT is trained on-policy, which means it improves the policy used to make the
action decisions, i.e. UCB1.
</li>
<li>
Offline means that we don't learn until after the episode is completed.
An alternative online algorithm would learn while the episode was running.
</li>
<li>
Every-visit versus first-visit decides if we are going to update a state for
every time it's accessed in an episode or just the first time. The original
UCT algorithm did every-visit. Subsequent versions relaxed this.
</li>
<li>
MC control means that we estimate values with Monte Carlo, i.e. we use
the average of the sampled returns from a state as the estimate of its value, and
improve the policy based on those estimates.
</li>
</ul>
</details>
</li>
<li>What is a Representation Policy? Give an example not described in the text.
<details><summary>Solution</summary>
<p>A Representation Policy defines the model of the state space (e.g. in
the form of a value function) and the boundary between memorized and
non-memorized parts of the space.
</p>
</details>
</li>
<li>What is a Control Policy? Give an example not described in the text.
<details><summary>Solution</summary>
<p>A Control Policy dictates what actions will be performed and (consequently)
which states will be visited. In MCTS, it includes the tree and default policies.
</p>
</details>
</li>
</ol>
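<p>The every-visit versus first-visit distinction from the second question is easy to see in a
few lines of code. This is a minimal sketch of Monte Carlo state-value estimation; the function
name <code>mc_state_values</code> and the episode format (a list of
<code>(state, reward)</code> pairs, where the reward is received on leaving that state) are
assumptions made for illustration.</p>

```python
from collections import defaultdict


def mc_state_values(episodes, gamma=1.0, first_visit=True):
    """Estimate state values by averaging Monte Carlo returns.

    episodes: list of episodes, each a list of (state, reward) pairs.
    first_visit: if True, only the first occurrence of a state in an episode
    contributes a return; otherwise every occurrence does.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        G = 0.0
        returns_at_step = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_at_step.append((state, G))
        returns_at_step.reverse()
        seen = set()
        for state, G in returns_at_step:
            if first_visit and state in seen:
                continue
            seen.add(state)
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}


# Hypothetical episode that visits state "A" twice.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
print(mc_state_values([episode], first_visit=True))
print(mc_state_values([episode], first_visit=False))
```

<p>With \(\gamma = 1\) the returns following the three steps are 3, 2 and 2. First-visit therefore
estimates the value of "A" as 3, while every-visit averages both occurrences and gives 2.5.</p>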
<p><br /></p>
<h1 id="6-the-paper">6 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the paper! We have a deep understanding of the background,
so let’s delve into the apex result. Note that we don’t just focus on the final
AlphaGoZero paper but also explore a related paper written concurrently by
a team at UCL using Hex as the game of choice. Their algorithm is very similar
to the AlphaGoZero algorithm, and considering both in context is important for
gauging which aspects of this research were really the most important.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>MCTS learning and computational capacity.</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://deepmind.com/documents/119/agz_unformatted_nature.pdf">Mastering the Game of Go Without Human Knowledge</a></li>
<li><a href="https://arxiv.org/pdf/1705.08439.pdf">Thinking Fast and Slow with Deep Learning and Tree Search</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="http://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning.pdf">Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning</a></li>
<li><a href="http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/1029/paper_thesis.pdf">Silver Thesis</a>: Section 4.6</li>
<li><a href="https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf">Mastering the game of Go with deep neural networks and tree search</a></li>
<li><a href="https://arxiv.org/abs/1712.01815">Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What were the differences between the two papers “Mastering the Game of Go
Without Human Knowledge” and “Thinking Fast and Slow with Deep Learning and Tree Search”?
<details><summary>Solution</summary>
<p>Some differences between the former (AG0) and the latter (ExIt) are:</p>
<ul>
<li>AG0 uses MC value estimates from the expert for the value network
where ExIt uses estimates from the apprentice. This requires more computation
by AG0 but produces better estimates.</li>
<li>The losses were different. For the value network, AG0 uses an MSE loss
with L2 regularization and ExIt uses a cross entropy loss with early stopping.
For the policy part, AG0 used cross entropy while ExIt uses a weighted
cross-entropy that takes into account how confident MCTS is in the action
based on the state count.</li>
<li>AG0 uses the value network to evaluate moves; ExIt uses RAVE and rollouts,
plus warm starts from the MCTS.</li>
<li>AG0 adds in Dirichlet noise to the prior probability at the root node.</li>
<li>AG0 elevates a new network as champion only when it's markedly better than
the prior champion; ExIt replaces the old network without verifying that
it is better.</li>
</ul>
</details>
</li>
<li>What was common to both of “Mastering the Game of Go Without Human Knowledge”
and “Thinking Fast and Slow with Deep Learning and Tree Search”?
<details><summary>Solution</summary>
<p>The most important commonality is that they both use MCTS as an expert
guide to help a neural network learn through self-play.</p>
</details>
</li>
<li>Will the system get stuck if the current neural network can’t beat the previous ones?
<details><summary>Solution</summary>
<p>No. The algorithm won’t accept a policy that is worse than the current best,
and MCTS’s convergence properties imply that it will eventually tend towards
the equilibrium solution in a two-player zero-sum game.
</p>
</details>
</li>
<li>Why include both a policy and a value head in these algorithms? Why not just use policy?
<details><summary>Solution</summary>
<p>Value networks reduce the required search depth. This helps tremendously
because a rollout approach without the value network is inaccurate and spends
too much time on sub-optimal directions.
</p>
</details>
</li>
</ol>
<p>Author: cinjon</p>
<p><strong>Trust Region Policy Optimization</strong> (published 2018-06-19T16:00:00+00:00, https://www.depthfirstlearning.com/2018/TRPO)</p>
<p>Thank you to Nic Ford, Ethan Holly, Matthew Johnson, Avital Oliver, John Schulman, George Tucker, and Charles Weill for contributing to this guide.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/trpo-deps.svg" width="200"></iframe>
<div>Concepts used in TRPO. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>TRPO is a scalable algorithm for optimizing policies in reinforcement learning by
gradient descent. Model-free algorithms such as policy gradient methods do not
require access to a model of the environment and often enjoy better
practical stability. However, while straightforward to apply to new
problems, they have trouble scaling to large, nonlinear policies. TRPO couples
insights from reinforcement learning and optimization theory to develop an
algorithm which, under certain assumptions, provides guarantees for monotonic
improvement. It is now commonly used as a strong baseline when developing new
algorithms.</p>
<p><br /></p>
<h1 id="1-policy-gradient">1 Policy Gradient</h1>
<p><strong>Motivation</strong>: Policy gradient methods (e.g. TRPO) are a class
of algorithms that allow us to directly optimize the parameters of a policy by
gradient descent. In this section, we formalize the notion of Markov Decision Processes (MDP),
action and state spaces, and on-policy vs off-policy approaches. This leads to the
REINFORCE algorithm, the simplest instantiation of the policy gradient method.</p>
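<p>As a preview of the likelihood ratio idea behind REINFORCE, here is a minimal sketch on a
one-step problem (a bandit) with a softmax policy over action preferences \(\theta\). It compares
the exact gradient of the expected reward with the Monte Carlo estimate that averages
\(r(a) \nabla_{\theta} \ln \pi(a)\) over sampled actions. The function names and the three-action
reward vector are assumptions made for illustration.</p>

```python
import math
import random


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(theta, rewards):
    """Gradient of J(theta) = sum_a pi(a) r(a) for a softmax policy."""
    pi = softmax(theta)
    J = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - J) for p, r in zip(pi, rewards)]


def reinforce_gradient(theta, rewards, n_samples, rng):
    """Likelihood ratio estimate: the sample mean of r(a) * grad log pi(a)."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for _ in range(n_samples):
        a = rng.choices(range(n), weights=pi)[0]
        for b in range(n):
            # d/d theta_b of log pi(a) is 1[a == b] - pi(b) for a softmax policy.
            grad[b] += rewards[a] * ((1.0 if a == b else 0.0) - pi[b])
    return [g / n_samples for g in grad]


theta = [0.0, 0.0, 0.0]
rewards = [1.0, 2.0, 3.0]
print(exact_gradient(theta, rewards))
print(reinforce_gradient(theta, rewards, 20000, random.Random(0)))
```

<p>The estimator only needs sampled actions, their rewards, and the gradient of the log
probability, which is why it still applies when the reward itself cannot be differentiated.</p>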
<p><a href="https://drive.google.com/file/d/1KFQ-NvcYHL0Pi9TUM96iTEGZzt9GLffO/view?usp=sharing" class="colab-root">Reproduce in a <span>Notebook</span></a></p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Markov Decision Processes.</li>
<li>Continuous action spaces.</li>
<li>On-policy and off-policy algorithms.</li>
<li>REINFORCE / likelihood ratio methods.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li>Deep RL Course at UC Berkeley (CS 294); Policy Gradient Lecture
<ol>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_4_policy_gradient.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=tWNpiNzWuO8&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=4">Video</a></li>
</ol>
</li>
<li>David Silver’s course at UCL; Policy Gradient Lecture
<ol>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=KHZVXao4qXs&index=7&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT">Video</a></li>
</ol>
</li>
<li>Reinforcement Learning by Sutton and Barto, 2nd Edition; pages 265 - 273</li>
<li><a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Simple statistical gradient-following algorithms for connectionist reinforcement learning</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="http://rl-gym-doc.s3-website-us-west-2.amazonaws.com/mlss/2016-MLSS-RL.pdf">John Schulman introduction at MLSS Cadiz</a></li>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcoursesp17/docs/lec6.pdf">Lecture on Variance Reduction for Policy Gradient</a></li>
<li><a href="http://karpathy.github.io/2016/05/31/rl/">Introduction to policy gradient and motivations by Andrej Karpathy</a></li>
<li><a href="https://papers.nips.cc/paper/3922-on-a-connection-between-importance-sampling-and-the-likelihood-ratio-policy-gradient.pdf">Connection Between Importance Sampling and Likelihood Ratio</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>At its core, REINFORCE computes an
approximate gradient of the reward with respect to the parameters. Why
can’t we just use the familiar stochastic gradient descent?
<details><summary>Hint</summary>
<p>
Just as in reinforcement learning, when we use stochastic gradient descent, we also compute an estimate of the gradient, which looks something like $$\nabla_{\theta} \mathbb{E}_{x \sim p(x)} [ f_{\theta}(x) ] = \mathbb{E}_{x \sim p(x)} [\nabla_{\theta} f_{\theta}(x)]$$ where we move the gradient into the expectation (an operation made precise by the <a href="">Leibniz integral rule</a>). In other words, the objective we're trying to take the gradient of is indeed differentiable with respect to the parameters \( \theta \). In reinforcement learning, our objective is non-differentiable -- we actually <b>select</b> an action and act on it. To convince yourself that this isn't something we can differentiate through, write out explicitly the full expansion of the training objective for policy gradient before we move the gradient into the expectation. Is sampling an action really non-differentiable? (spoiler: yes, but we can work around it in various ways, such as using REINFORCE or <a href="https://arxiv.org/abs/1611.01144">other methods</a>).
</p>
</details>
</li>
<li>Does the REINFORCE gradient estimator resemble maximum likelihood estimation (MLE)?
Why or why not?
<details><summary>Solution</summary>
<p>
The term \( \log \pi (a | s) \) should look like a very familiar tool in
statistical learning: the log-likelihood! When we do MLE, we try to maximize
the log-likelihood \( \log p(D | \theta) \), or, as in supervised learning,
we try to maximize $$\log p(y_i^* | x_i, \theta).$$
Normally, because we have the true label \( y_i^* \), this paradigm aligns
perfectly with what we are ultimately trying to do with MLE. However, this
naive strategy of maximizing the likelihood \( \pi(a | s) \) won't work in
reinforcement learning, because we do not have a label for the correct action
to be taken at a given time step (if we did, we should just do supervised
learning!). If we tried doing this, we would find that we would simply
maximize the probability of every action; make sure you convince yourself
that this is true. Instead, the only (imperfect) evidence we have of good or
bad actions is the reward we receive at that time step. Thus, a reasonable
thing to do is to scale the log-likelihood of each action by its reward,
which tells us how good or bad the action was. We would then maximize $$r(a, s) \log \pi (a | s).$$
Look familiar? Its gradient is just the REINFORCE term in our
expectation: $$ \mathbb{E}_{s,a} [ r(a, s) \nabla \log \pi (a | s) ] $$
</p>
</details>
</li>
<li>In its original formulation, REINFORCE is an on-policy algorithm. Why?
Can we make REINFORCE work off-policy as well?
<details><summary>Solution</summary>
<p>
We can tell that REINFORCE is on-policy by looking at the expectation a bit
closer: $$ \mathbb{E}_{s,a} [ \nabla \log \pi (a | s) r(a, s) ]. $$ When we
see any expectation in an equation, we should always ask what exactly the
expectation is <b>over</b>. In this case, if we expand the expectation, we
have $$\mathbb{E}_{s \sim p_{\theta}(s), a \sim \pi_{\theta}(a|s)}
[ \nabla_{\theta} \log \pi_{\theta} (a | s) r(a, s) ], $$ and we see that
the states are sampled from the empirical state visitation
distribution induced by the current policy, and the actions \( a \)
come directly from the current policy. Because we learn from the current
policy, and not some arbitrary policy, REINFORCE is an on-policy algorithm. To change
REINFORCE to use off-policy data, we simply change the sampling distribution to some
other policy \( \pi_{\beta} \) and use importance sampling to correct for
this disparity. For more details, see
<a href="https://scholarworks.umass.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1079&context=cs_faculty_pubs">a classic paper</a>
on this subject and <a href="https://arxiv.org/abs/1606.02647">a recent paper</a>
with new insights on off-policy learning with policy gradient methods.
</p>
</details>
</li>
<li>Do policy gradient methods work for discrete and continuous action spaces?
If not, why not?</li>
</ol>
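<p>The REINFORCE update discussed in this section can be sketched on a toy problem. The example below is a minimal illustration, not the setup from the readings: it assumes a two-armed bandit (a single state), a softmax policy over logits, and made-up reward means, step size, and seed. The helper names <code>softmax</code>, <code>grad_log_softmax</code>, and <code>reinforce_bandit</code> are hypothetical.</p>

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE on a bandit with noisy rewards (illustrative values)."""
    rng = random.Random(seed)
    logits = [0.0] * len(mean_rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: the action is sampled from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        # Single-sample estimate of the policy gradient: r * grad log pi(a).
        grad = grad_log_softmax(logits, action)
        # Gradient ascent on the expected reward.
        logits = [z + lr * reward * g for z, g in zip(logits, grad)]
    return softmax(logits)


final_policy = reinforce_bandit([0.2, 1.0])
```

<p>Note that the action itself is never differentiated through; only the log-probability of the sampled action is, which is the point of the likelihood ratio trick.</p>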
<p><br /></p>
<h1 id="2-variance-reduction-and-advantage-estimate">2 Variance Reduction and Advantage Estimate</h1>
<p><strong>Motivation</strong>: One major shortcoming of policy gradient methods is that the
simplest instantiation of REINFORCE suffers from high variance in the gradients
it computes. This results from the fact that rewards are sparse, that we only visit a finite
set of states, and that we only take one action at each state rather than try all actions.
In order to properly scale our methods to harder problems, we need to reduce this variance.
In this section, we study common tools for reducing variance for REINFORCE. These include
a causality result, baselines, and advantages. Note that the TRPO paper does not introduce
new methods for variance reduction, but we cover them here for completeness.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Causality for REINFORCE.</li>
<li>Baselines and control variates.</li>
<li>Advantage estimation.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li>Deep RL Course at UC Berkeley (CS 294); Actor-Critic Methods Lecture
<ol>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_5_actor_critic_pdf.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=PpVhtJn-iZI&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=5">Video</a></li>
</ol>
</li>
<li><a href="/assets/gjt-var-red-notes.pdf">George Tucker’s notes on Variance Reduction</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li>Reinforcement Learning by Sutton and Barto, 2nd Edition; pages 273 - 275</li>
<li><a href="https://arxiv.org/abs/1506.02438">High-dimensional continuous control using generalized advantage estimation</a></li>
<li><a href="https://arxiv.org/abs/1602.01783">Asynchronous Methods for Deep Reinforcement Learning</a></li>
<li><a href="https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf">Monte Carlo theory, methods, and examples by Art B. Owen; Chapter 8</a>
(in-depth treatment of variance reduction; suitable for independent study)</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>What is the intuition for using advantages instead of rewards as the
learning signal? Note on terminology: the learning signal is the factor by
which we multiply <script type="math/tex">\log \pi (a | s)</script> inside the expectation in REINFORCE.</li>
<li>What are some assumptions we make by using baselines as a variance
reduction method?</li>
<li>What are other methods of variance reduction?
<details><summary>Solution</summary>
<p>
Check out the optional reading <a href="https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf">Monte Carlo theory, methods, and examples by Art B. Owen; Chapter 8</a> if you're interested! Broadly speaking, other techniques for variance reduction in Monte Carlo integration include stratified sampling, antithetic sampling, common random numbers, and conditioning.
</p>
</details>
</li>
<li>The theory of control variates tells us that our control variate should
be correlated with the quantity we are trying to lower the variance of.
Can we construct a better control variate that is even more correlated
than a learned state-dependent value function? Why or why not?
<details><summary>Hint</summary>
<p>
Right now, the typical control variate \( b(s) \) depends only on the state. Can we also have the control variate depend on the action? What extra work do we have to do to make sure this is okay? Check <a href="https://arxiv.org/abs/1611.02247">this paper</a> if you're interested in one way to extend this, and <a href="https://arxiv.org/abs/1802.10031">this paper</a> if you're interested in why adding dependence on more than just the state can be tricky and hard to implement in practice.
</p>
</details>
</li>
<li>We use control variates as a method to reduce variance in our gradient
estimate. Why don’t we use these for supervised learning problems such as
classification? Are we implicitly using them?
<details><summary>Solution</summary>
<p>
Reducing variance in our gradient estimates seems like an important thing
to do, but we don't often see explicit variance reduction methods when we
do supervised learning. However, there is a line of work around
<b>stochastic variance reduced gradient</b> descent called
<a href="https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf">SVRG</a>
that tries to construct gradient estimators with reduced variance. See
<a href="http://ranger.uta.edu/~heng/CSE6389_15_slides/SGD2.pdf">these
slides</a> and <a href="https://arxiv.org/abs/1202.6258">these</a>
<a href="https://arxiv.org/abs/1209.1873">papers</a> for more on this topic.
</p>
<p>
The reason that we don't often see these being used in the supervised
learning setting is that we're not necessarily looking to reduce the variance
of SGD and smoothly converge to a minimum. We're actually interested in
finding minima that have low
<b>generalization error</b> and don't want to overfit to solutions with very
small training error. In fact, we often rely on the noise introduced by
using minibatches in SGD to help us escape premature minima.
On the other hand, in reinforcement learning, the variance of our gradient
estimates is so high that it's often the foremost problem.
</p>
<p>
Beyond supervised learning, control variates are used often in Monte Carlo
integration, which is ubiquitous throughout Bayesian methods. They are also
used for problems in hard attention, discrete latent random variables, and
general stochastic computation graphs.
</p>
</details>
</li>
</ol>
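<p>The effect of a baseline can be checked exactly on a small example. The sketch below assumes a three-armed bandit with a fixed softmax policy and deterministic rewards (all values are illustrative) and enumerates the actions to compute the mean and variance of the REINFORCE estimator with and without a baseline. The function name <code>estimator_stats</code> is hypothetical.</p>

```python
def estimator_stats(probs, rewards, baseline):
    """Exact mean and variance, per logit, of the single-sample estimator
    g_k(a) = (r_a - b) * (1[a == k] - pi_k) for a softmax policy."""
    n = len(probs)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a in range(n):
        for k in range(n):
            g = (rewards[a] - baseline) * ((1.0 if a == k else 0.0) - probs[k])
            mean[k] += probs[a] * g
            second_moment[k] += probs[a] * g * g
    variance = [s - m * m for s, m in zip(second_moment, mean)]
    return mean, variance


probs = [0.5, 0.3, 0.2]
rewards = [10.0, 11.0, 12.0]

# No baseline: rewards are all large and positive, so the estimator is noisy.
mean_plain, var_plain = estimator_stats(probs, rewards, 0.0)

# Baseline equal to the expected reward under the policy (the "value").
value = sum(p * r for p, r in zip(probs, rewards))
mean_base, var_base = estimator_stats(probs, rewards, value)
```

<p>The two means agree because the expected score \( \mathbb{E}_a[\nabla \log \pi(a)] \) is zero, so subtracting a constant baseline leaves the estimator unbiased, while the variance drops substantially in this example.</p>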
<p><br /></p>
<h1 id="3-fisher-information-matrix-and-natural-gradient-descent">3 Fisher Information Matrix and Natural Gradient Descent</h1>
<p><img src="/assets/fisher-steepest.png" /></p>
<p><strong>Motivation</strong>: While gradient descent is able to solve many optimization problems,
it suffers from a basic problem - performance is dependent on the model’s parameterization.
Natural gradient descent, on the other hand, is invariant to model parameterization.
This is achieved by multiplying gradient vectors by the inverse of the Fisher
information matrix, which is a measure of how much model predictions change with
local parameter changes.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Fisher information matrix.</li>
<li>Natural gradient descent.</li>
<li>(Optional) K-Fac.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="/assets/k-fac-tutorial.pdf">Matt Johnson’s Natural Gradient Descent and K-Fac Tutorial</a>: Sections 1-7, Section A, B</li>
<li><a href="https://arxiv.org/pdf/1412.1193.pdf">New insights and perspectives on the natural gradient method</a>: Sections 1-11.</li>
<li><a href="https://web.archive.org/web/20170807004738/https://hips.seas.harvard.edu/blog/2013/04/08/fisher-information/">Fisher Information Matrix</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="http://ipvs.informatik.uni-stuttgart.de/mlr/wp-content/uploads/2015/01/mathematics_for_intelligent_systems_lecture12_notes_I.pdf">8-page intro to natural gradients</a></li>
<li><a href="http://www.yaroslavvb.com/papers/amari-why.pdf">Why Natural Gradient Descent / Amari and Douglas</a></li>
<li><a href="https://personalrobotics.ri.cmu.edu/files/courses/papers/Amari1998a.pdf">Natural Gradient Works Efficiently in Learning / Amari</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Consider classifiers <script type="math/tex">p(y | x; \theta_{1})</script> and <script type="math/tex">p(y | x; \theta_{2})</script>,
such that <script type="math/tex">l_2(\theta_1, \theta_2)</script> is large, where <script type="math/tex">l_2</script> indicates the
Euclidean distance metric. Does this imply the difference in accuracy of the
classifiers is high?
<details><summary>Solution</summary>
The accuracy of the classifier depends on the function defined by
\(p(y|x;\theta) \). The distance between the parameters does not tell us
the distance between the two functions. Hence, we cannot draw any conclusions
about the difference in accuracy of the classifiers.
</details>
</li>
<li>How is the Fisher matrix similar to and different from the Hessian?</li>
<li>How does natural gradient descent compare to Newton’s method?</li>
<li>Why is the natural gradient slow to compute?</li>
<li>How can one efficiently compute the product of the Fisher information matrix with an arbitrary vector?</li>
</ol>
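<p>The parameterization invariance described in the motivation can be illustrated with a one-parameter model. The sketch below assumes a Bernoulli distribution written two ways, with a logit parameter \( \theta \) and with a mean parameter \( p = \sigma(\theta) \), and a made-up linear objective \( J(p) = c\,p \). It compares how fast \( p \) moves under the vanilla and the natural gradient in each parameterization. All function names are hypothetical.</p>

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fisher_logit(theta):
    """Fisher information E_y[(d/dtheta log p(y))^2] in the logit
    parameterization, by enumerating y in {0, 1}."""
    p = sigmoid(theta)
    score_y1 = 1.0 - p   # d/dtheta log sigmoid(theta)
    score_y0 = -p        # d/dtheta log (1 - sigmoid(theta))
    return p * score_y1 ** 2 + (1.0 - p) * score_y0 ** 2


def fisher_mean(p):
    """Fisher information in the mean parameterization."""
    score_y1 = 1.0 / p
    score_y0 = -1.0 / (1.0 - p)
    return p * score_y1 ** 2 + (1.0 - p) * score_y0 ** 2


theta = 1.5
p = sigmoid(theta)
c = 2.0                      # dJ/dp for the illustrative objective J(p) = c * p
dp_dtheta = p * (1.0 - p)    # chain rule factor between the parameterizations

grad_p = c
grad_theta = c * dp_dtheta

# Rate of change of p induced by each update direction.
vanilla_in_p = grad_p
vanilla_in_theta = dp_dtheta * grad_theta
natural_in_p = grad_p / fisher_mean(p)
natural_in_theta = dp_dtheta * grad_theta / fisher_logit(theta)
```

<p>The vanilla gradient moves \( p \) at different rates depending on the parameterization, while the natural gradient gives the same rate in both.</p>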
<p><br /></p>
<h1 id="4-conjugate-gradient">4 Conjugate Gradient</h1>
<p><strong>Motivation</strong>: The conjugate gradient method (CG) is an iterative algorithm for finding
approximate solutions to <script type="math/tex">Ax=b</script>, where <script type="math/tex">A</script> is a symmetric and positive-definite matrix (such
as the Fisher information matrix). The method works by iteratively computing matrix-vector
products <script type="math/tex">Ax_i</script> and is particularly well-suited for matrices with computationally
tractable matrix-vector products.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Solving systems of linear equations.</li>
<li>Efficiently computing matrix-vector products.</li>
<li>Computational complexity of second-order optimization methods.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf">An Introduction to the Conjugate Gradient Method Without the Agonizing Pain</a>: Section 7-9</li>
<li><a href="https://ee227c.github.io/notes/ee227c-notes.pdf">Convex Optimization and
Approximation</a>, UC Berkeley, Section 7.4</li>
<li>Convex Optimization II by Stephen Boyd:
<ol>
<li><a href="https://www.youtube.com/watch?feature=player_embedded&v=cHVpwyYU_LY#t=2230">Lecture 12, from 37:10 to 1:05:00</a></li>
<li><a href="https://www.youtube.com/watch?feature=player_embedded&v=E4gl91l0l40#t=1266">Lecture 13, from 21:20 to 29:30</a></li>
</ol>
</li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li>Numerical Optimization by Nocedal and Wright; Section 5.1, pages 101-120</li>
<li><a href="https://metacademy.org/graphs/concepts/conjugate_gradient">Metacademy</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Remember that the natural gradient of a model is the inverse of the Fisher information matrix of that model times the
vanilla gradient (<script type="math/tex">F^{-1}g</script>). How does CG allow us to approximate the natural gradient?
<ol>
<li>What is the naive way to compute <script type="math/tex">F^{-1}g</script>? How much memory and time would it take?
<details><summary>Solution</summary>
Assume we have an estimate of \(F\) (by the process
in section 7 of <a href="/assets/k-fac-tutorial.pdf">Matt Johnson's tutorial</a>).
Storing \(F\) would take space proportional to \(n^2\), and inverting \(F\) would take
time proportional to \(n^3\) (or slightly lower with the Strassen algorithm).
</details>
</li>
<li>How long would a run of CG to <em>exactly</em> compute <script type="math/tex">F^{-1}g</script> take? How does that compare
to the naive process?
<details><summary>Solution</summary>
Each iteration of CG computes \(Fv\) for some vector \(v\), which would take time
proportional to \(n^2\). CG converges to the true answer after \(n\) steps, so in total
it would take time proportional to \(n^3\). This process ends up being slower than directly inverting
the Fisher naively and uses the same amount of memory.
</details>
</li>
<li>How can we use CG and bring down the time and memory to compute the natural gradient <script type="math/tex">F^{-1}g</script>?
<details><summary>Solution</summary>
<ol>
<li>
Use the closed form estimate of \(Fv\) for arbitrary \(v\), as described in section A of <a href="/assets/k-fac-tutorial.pdf">Matt Johnson's tutorial</a>.
</li>
<li>
Take fewer CG iteration steps, which leads to an approximation of the natural gradient that may be sufficient.
</li>
</ol>
</details>
</li>
</ol>
</li>
<li>In pre-conditioned conjugate gradient, how does scaling the pre-conditioner
matrix <script type="math/tex">M</script> by a constant <script type="math/tex">c</script> impact the convergence?</li>
<li>Exercises 5.1 to 5.10 in Chapter 5, Numerical Optimization
(<b>Exercises 5.2 and 5.9 are particularly recommended.</b>)</li>
</ol>
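<p>The key property used above, that CG only needs matrix-vector products, can be seen in a short implementation. The sketch below is a plain textbook CG in pure Python, not code from the readings. The <code>matvec</code> callback stands in for a Fisher-vector product, and the 2x2 system is an illustrative example.</p>

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A,
    using only products A v supplied by `matvec`."""
    x = [0.0] * len(b)
    r = list(b)           # residual b - A x, with x = 0
    p = list(r)           # first search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def matvec(v):
    # The matrix is never inverted; CG only ever calls this product.
    return [dot(row, v) for row in A]


x = conjugate_gradient(matvec, b)
```

<p>Stopping after fewer iterations than the dimension returns an approximate solution, which is the trade-off the last sub-question points at.</p>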
<p><br /></p>
<h1 id="5-trust-region-methods">5 Trust Region Methods</h1>
<p><strong>Motivation</strong>: Trust region methods are a class of methods used in general
optimization problems to constrain the update size. While
TRPO does not use the full gamut of tools from the trust region literature,
studying them provides good intuition for the problem that TRPO
addresses and how we might improve the algorithm even more. In this
section, we focus on understanding trust regions and line search methods.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Trust regions and subproblems.</li>
<li>Line search methods.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods">A friendly introduction to Trust Region Methods</a></li>
<li>Numerical Optimization by Nocedal and Wright: Chapter 2, Chapter 4, Section 4.1, 4.2</li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li>Numerical Optimization by Nocedal and Wright: Chapter 4, Section 4.3</li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Instead of directly imposing constraints on the updates, what are
alternative ways to force an algorithm to make bounded updates?
<details><summary>Hint</summary>
<p>
Recall the method of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>. How does this method move between two types of optimization problems?
</p>
</details>
</li>
<li>Each step of a trust region optimization method updates parameters to
the optimal setting given some constraint. Can we solve this in closed
form using Lagrange multipliers? In what way would this be similar, or
different, from the trust region methods we just discussed?</li>
<li>Exercises 4.1 to 4.10 in Chapter 4, Numerical Optimization.
(<b>Exercise 4.10 is particularly recommended</b>)</li>
</ol>
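<p>As one concrete instance of a trust region subproblem, consider maximizing a linear model \( g^\top d \) subject to a quadratic constraint \( \tfrac{1}{2} d^\top F d \le \delta \). Lagrange multipliers give the closed form \( d = \sqrt{2\delta / (g^\top F^{-1} g)}\, F^{-1} g \). The sketch below assumes a 2x2 matrix \( F \), a made-up gradient \( g \), and a made-up concave objective, and follows the step with a simple backtracking line search. The helper names are hypothetical.</p>

```python
import math


def solve_2x2(F, g):
    """Solve F s = g for a 2x2 matrix F by Cramer's rule."""
    (a, b), (c, d) = F
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]


def quad_form(F, v):
    return sum(v[i] * F[i][j] * v[j] for i in range(2) for j in range(2))


def trust_region_step(F, g, delta):
    """Maximize g^T d subject to 0.5 * d^T F d <= delta (closed form)."""
    s = solve_2x2(F, g)                       # F^{-1} g
    g_Finv_g = g[0] * s[0] + g[1] * s[1]      # g^T F^{-1} g
    scale = math.sqrt(2.0 * delta / g_Finv_g)
    return [scale * si for si in s]


def backtracking_line_search(objective, x, step, shrink=0.5, max_tries=10):
    """Shrink the step until the true objective actually improves."""
    f0 = objective(x)
    for k in range(max_tries):
        candidate = [xi + (shrink ** k) * si for xi, si in zip(x, step)]
        if objective(candidate) > f0:
            return candidate
    return list(x)                            # no improving step found


F = [[2.0, 0.5], [0.5, 1.0]]
g = [1.0, -1.0]
delta = 0.01
step = trust_region_step(F, g, delta)


def objective(x):
    # Illustrative concave objective whose gradient at the origin is g.
    return g[0] * x[0] + g[1] * x[1] - 20.0 * (x[0] ** 2 + x[1] ** 2)


x0 = [0.0, 0.0]
x_new = backtracking_line_search(objective, x0, step)
```

<p>The full step sits exactly on the boundary of the constraint. The line search is needed because the linear model overestimates the improvement of the curved objective.</p>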
<p><br /></p>
<h1 id="6-the-paper">6 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the <a href="https://arxiv.org/abs/1502.05477">paper</a>.
We’ve built a good foundation for the various tools and mathematical ideas
used by TRPO. In this section, we focus on the parts of the paper that aren’t
explicitly covered by the above topics and together result in the practical
algorithm used by many today. These are monotonic policy improvement and the
two different implementation approaches: vine and single-path.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>What is the problem with policy gradients that TRPO addresses?</li>
<li>What were the bottlenecks to addressing that problem in the approaches that existed when TRPO debuted?</li>
<li>Policy improvement bounds and theory.</li>
</ol>
<p><strong>Required Readings</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization</a></li>
<li>Deep RL Course at UC Berkeley (CS 294); Advanced Policy Gradient Methods (TRPO)
<ol>
<li><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=ycCtmp4hcUs&feature=youtu.be&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3">Video</a></li>
</ol>
</li>
<li><a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf">Approximately Optimal Approximate Reinforcement Learning</a></li>
</ol>
<p><strong>Optional Readings</strong>:</p>
<ol>
<li><a href="https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/">TRPO Tutorial</a></li>
<li><a href="https://arxiv.org/abs/1708.05144">ACKTR</a></li>
<li><a href="https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf">A Natural Policy Gradient</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>How is the trust region set in TRPO? Can we do better? Under what
assumptions is imposing the trust region constraint not required?</li>
<li>Why do we use conjugate gradient methods for optimization in TRPO? Can we
exploit the fact that conjugate gradient optimization is differentiable?</li>
<li>How is line search used in TRPO?</li>
<li>How does TRPO differ from natural policy gradient?
<details><summary>Solution</summary>
<p> See slides 30-34 from <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">this lecture</a>. </p>
</details>
</li>
<li>What are the pros and cons of using the vine and single-path methods?</li>
<li>In practice, TRPO is really slow. What is the main computational
bottleneck and how might we remove it? Can we approximate this bottleneck?</li>
<li>TRPO makes a series of approximations that deviate from the policy
improvement theory that is cited. What are the assumptions that are made
that allow these approximations to be reasonable? Should we still expect
monotonic improvement in our policy?</li>
<li>TRPO is a general procedure to directly optimize parameters from rewards,
even though the procedure is “non-differentiable”. Does it make sense to
apply TRPO to other non-differentiable problems, like problems involving
hard attention or discrete random variables?</li>
</ol>
<p>surya</p>
<p>Thank you to Nic Ford, Ethan Holly, Matthew Johnson, Avital Oliver, John Schulman, George Tucker, and Charles Weill for contributing to this guide.</p>
<p>InfoGAN, 2018-05-28T14:00:00+00:00, https://www.depthfirstlearning.com/2018/InfoGAN</p>
<p>Thank you to Kumar Krishna Agrawal, Yasaman Bahri, Peter Chen, Nic Ford, Roy Frostig, Xinyang Geng, Rein Houthooft, Ben Poole, Colin Raffel and Supasorn Suwajanakorn for contributing to this guide.</p>
<div class="deps-graph">
<iframe class="deps" src="/assets/infogan-deps.svg" width="200"></iframe>
<div>Concepts used in InfoGAN. Click to navigate.</div>
</div>
<h1 id="why">Why</h1>
<p>InfoGAN is an extension of GANs that learns to represent unlabeled data as codes,
i.e., it performs representation learning. Compare this to vanilla GANs, which can only generate
samples, or to VAEs, which learn to generate both codes and samples. Representation
learning is an important direction for unsupervised learning, and GANs are a
flexible and powerful framework. This makes InfoGAN an interesting stepping
stone towards research in representation learning.</p>
<p><a href="https://colab.research.google.com/drive/1JkCI_n2U2i6DFU8NKk3P6EkPo3ZTKAaq#forceEdit=true&offline=true&sandboxMode=true" class="colab-root">Reproduce in a
<span>Notebook</span></a></p>
<p><br /></p>
<h1 id="1-information-theory">1 Information Theory</h1>
<p><strong>Motivation</strong>: Information theory formalizes the concept of the “amount of randomness” or
“amount of information”. These concepts can be extended to relative quantities
among random variables. This section leads to Mutual Information (MI), a concept core to
InfoGAN. MI builds on entropy to measure how much information two random variables
share: it is the difference between the information in observing each variable
separately and the information in observing a joint sample of the two.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>Entropy</li>
<li>Differential Entropy</li>
<li>Conditional Entropy</li>
<li>Jensen’s Inequality</li>
<li>KL divergence</li>
<li>Mutual Information</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li>Chapter 1.6 from <a href="https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf">Pattern Recognition and Machine Learning / Bishop. (“PRML”)</a></li>
<li>A good <a href="https://www.quora.com/What-is-an-intuitive-explanation-of-the-concept-of-entropy-in-information-theory/answer/Peter-Gribble">intuitive explanation of Entropy</a>, from Quora.</li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1404.2000.pdf">Notes on Kullback-Leibler Divergence and Likelihood Theory</a></li>
<li>For more perspectives and deeper dependencies, see Metacademy:
<ol>
<li><a href="https://metacademy.org/graphs/concepts/entropy">Entropy</a></li>
<li><a href="https://metacademy.org/graphs/concepts/mutual_information">Mutual Information</a></li>
<li><a href="https://metacademy.org/graphs/concepts/kl_divergence">KL divergence</a></li>
</ol>
</li>
<li><a href="https://colah.github.io/posts/2015-09-Visual-Information/">Visual Information Theory</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>From PRML: 1.31, 1.36, 1.37, 1.38, 1.39, 1.41.
<details><summary>Solution</summary>
<p>
PRML 1.31: Consider two variables \(x\) and \(y\) having joint distribution \(p(x,y)\). Show that the differential entropy of this pair of variables satisfies \(H(x,y) \le H(x) + H(y)\) with equality if, and only if, \(x\) and \(y\) are statistically independent.
</p><p>
If \(x\) and \(y\) are independent, then the joint distribution factorizes:<br />
\(p(x,y) = p(x)p(y)\)
</p><p>
In that case, the joint entropy follows from the conditional entropies \(H(x|y)\) and \(H(y|x)\):<br />
\(H(x|y) = H(x)\)<br />
\(H(y|x) = H(y) \to\)<br />
\(H(x,y) = H(x) + H(y|x) = H(y) + H(x|y) \to\)<br />
\(H(x,y) = H(x) + H(y)\)
</p><p>
Therefore, the mutual information \(I(x,y)\) is zero if \(x\) and \(y\) are independent:<br />
\(H(x,y) = H(x) + H(y)\ \to\)<br />
\(I(x,y) = H(x) + H(y) - H(x,y) = H(x,y) - H(x,y) = 0\)
</p><p>
If \(x\) and \(y\) are dependent, recall that the mutual information is the KL divergence between \(p(x,y)\) and \(p(x)p(y)\), which is strictly positive whenever the two distributions differ:<br />
\(I(x,y) = H(x) + H(y) - H(x,y) > 0\ \to\)<br />
\(H(x,y) < H(x) + H(y)\)
</p><p>
The independent and dependent cases can be combined into a general form:<br />
\(H(x,y) \le H(x) + H(y)\ \to\)<br />
\(I(x,y) = H(x) + H(y) - H(x,y) \ge 0\)
</p>
</details>
</li>
<li>How is Mutual Information similar to correlation? How are they different? Are they directly related under some conditions?
<details><summary>Solution</summary>
<p>Start <a href="https://stats.stackexchange.com/questions/81659/mutual-information-versus-correlation">here</a>.
</p>
</details>
</li>
<li>In classification problems, <a href="https://ai.stackexchange.com/questions/3065/why-has-cross-entropy-become-the-classification-standard-loss-function-and-not-k/4185">minimizing cross-entropy loss is the same as minimizing the KL divergence
of the predicted class distribution from the true class distribution</a>. Why do we minimize the KL, rather
than other measures, such as L2 distance?
<details><summary>Solution</summary>
<p>
In a classification problem, one natural measure of “goodness” is the likelihood
of the observed values. By definition, it’s \(P(Y | X; params)\), which is
\(\prod_i P(Y_i = y_i | X; params)\).
This says that we want to maximize the probability of producing the “correct” \(y_i\)
class only, and don’t really care to push down the probability of incorrect classes like
L2 loss would.
</p><p>
E.g., suppose the true label is \(y = [0, 1, 0]\) (a one-hot encoding over class labels {1, 2, 3}),
and the softmax of the final layer in the NN is \(y' = [0.2, 0.5, 0.3]\).
One could use L2 between these two distributions, but if instead we minimize the KL
divergence \(KL(y || y')\), which is equivalent to minimizing cross-entropy
loss (the standard loss everyone uses to solve this problem),
we would compute \(0 \cdot \log(0.2) + 1 \cdot \log (0.5) + 0 \cdot \log(0.3) = \log(0.5)\),
which is exactly the log likelihood of the label being class 2
for this particular training example; the loss is its negative.
</p><p>
Here choosing to minimize KL means we’re maximizing the data likelihood.
I think it could also be reasonable to use L2, but we would be maximizing
the data likelihood + “unobserved anti-likelihood” :) (my made up word)
meaning we want to kill off all those probabilities of predicting wrong
labels as well.
</p><p>
Another reason L2 is less preferred might be that L2 involves looping over all
class labels whereas KL can look only at the correct class when computing the loss.
</p>
</details>
</li>
</ol>
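<p>The quantities in this section are easy to compute for small discrete distributions. The sketch below uses base-2 logarithms (bits) and two made-up 2x2 joint distributions to check that mutual information is zero under independence and positive otherwise, and that cross-entropy against a one-hot label equals the KL divergence. The function names are hypothetical.</p>

```python
import math


def entropy(p):
    """Shannon entropy in bits; terms with zero probability contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl(p, q):
    """KL(p || q) in bits, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(x, y) = KL(p(x, y) || p(x) p(y)) for a joint given as a table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]
    return kl(flat_joint, flat_product)


dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]
mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)

# H(x, y) = H(x) + H(y) - I(x, y), so H(x, y) <= H(x) + H(y).
h_joint = entropy([v for row in dependent for v in row])
h_x = entropy([sum(row) for row in dependent])
h_y = entropy([sum(col) for col in zip(*dependent)])

# Cross-entropy against a one-hot label equals KL(y || y'), because the
# entropy of a one-hot distribution is zero.
y_true = [0.0, 1.0, 0.0]
y_pred = [0.2, 0.5, 0.3]
cross_entropy = -sum(t * math.log2(p) for t, p in zip(y_true, y_pred) if t > 0)
kl_label = kl(y_true, y_pred)
```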
<p><br /></p>
<h1 id="2-generative-adversarial-networks-gan">2 Generative Adversarial Networks (GAN)</h1>
<p><strong>Motivation</strong>: GANs are a framework for constructing models that learn to sample
from a probability distribution, given a finite sample from that distribution.
More concretely, after training on a finite unlabeled dataset (say of images),
a GAN can generate new images of the same “kind” that aren’t in the original
training set.</p>
<p><strong>Topics</strong>:</p>
<ol>
<li>JS (Jensen-Shannon) divergence</li>
<li>How are GANs trained?</li>
<li>Various possible GAN objectives. Why are they needed?</li>
<li>GAN training minimizes the JS divergence between the data-generating distribution and the distribution of samples from the generator part of the GAN</li>
</ol>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">JS Divergence</a></li>
<li><a href="https://arxiv.org/abs/1406.2661">The original GAN paper</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1701.00160">NIPS 2016 Tutorial: Generative Adversarial Networks</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>Prove that <script type="math/tex">0 \leq JSD(P||Q) \leq 1</script> bit for all P, Q. When are the bounds achieved?
<details><summary>Solution</summary>
<p>Start <a href="https://en.wikipedia.org/wiki/Jensen-Shannon_divergence#Relation_to_mutual_information">here</a>.
</p>
</details>
</li>
<li>What are the bounds for KL divergence? When are those bounds achieved?
<details><summary>Solution</summary>
<p>Start <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">here</a>.
</p><p>
The Kullback–Leibler divergence \(D_{KL}(P||Q)\) between \(P\) and \(Q\) is defined as:<br />
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right)\)
</p><p>
The lower bound is \(0\), reached when \(P(x) = Q(x)\) for all \(x\), because then \(\frac{P(x)}{Q(x)} = 1\):<br />
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2(1) = \sum_{x}P(x) \cdot 0 = 0\)
</p><p>
There is no finite upper bound. The divergence becomes infinite when \(Q(x)\) is zero somewhere that \(P(x)\) is not zero (for example, when \(Q\) is disjoint from \(P\)), because then the log-ratio \(\log_2\left(\frac{P(x_i)}{Q(x_i)}\right)\) becomes \(\infty\) for some \(x_i\):<br />
\(P(x_i) \log_2\left(\frac{P(x_i)}{Q(x_i)}\right) = P(x_i) \log_2\left(\frac{P(x_i)}{0}\right) = \infty \to\)<br />
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right) = \infty\)
</p>
</details>
</li>
<li>In the paper, why do they say “In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, <script type="math/tex">\log(1 - D(G(z)))</script> saturates”?
<details><summary>Solution</summary>
<p><a href="/assets/gan_gradient.pdf">Understanding the vanishing generator gradients point in the GAN paper</a></p>
</details>
</li>
<li>Implement a <a href="https://colab.research.google.com/">Colab</a> that trains a GAN for MNIST. Try both the saturating and non-saturating generator loss.
<details><summary>Solution</summary>
<p>An implementation can be found <a href="https://colab.research.google.com/drive/1joM97ITFowvWU_qgRjQRiOKajHQKKH80#forceEdit=true&offline=tru&sandboxMode=true">here</a>.
</p>
</details>
</li>
</ol>
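<p>As a rough numerical check on the first two questions, the sketch below computes both divergences in bits for small discrete distributions. The function names <code>kl_bits</code> and <code>jsd_bits</code> and the two example distributions are illustrative choices, not taken from the readings.</p>

```python
import math

def kl_bits(p, q):
    """KL divergence D_KL(p || q) in bits for two discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: a term with p(x) = 0 contributes 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log2(pi / qi)
    return total

def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits, using the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Illustrative distributions over four outcomes (hypothetical data).
same = [0.5, 0.5, 0.0, 0.0]
disjoint = [0.0, 0.0, 0.5, 0.5]

print(kl_bits(same, same))       # identical distributions
print(kl_bits(same, disjoint))   # disjoint supports
print(jsd_bits(same, same))      # identical distributions
print(jsd_bits(same, disjoint))  # disjoint supports
```

<p>Identical distributions give 0 for both divergences. Disjoint supports give an infinite KL divergence but a JSD of 1 bit, because the mixture always covers the support of both arguments.</p>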
<p><br /></p>
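<p>The saturation question above can also be probed numerically. The sketch below assumes, purely for illustration, that the discriminator output on a generated sample is a sigmoid of a scalar logit <code>a</code>, and compares the gradient with respect to <code>a</code> of the two generator losses. The function names are hypothetical.</p>

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)), the generator term minimized in equation 1.

    Assumes D(G(z)) = sigmoid(a) for a scalar logit a (illustrative assumption).
    """
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)), the alternative generator loss."""
    return sigmoid(a) - 1.0

# A very negative logit means D confidently rejects the generated sample.
for logit in (-10.0, 0.0, 10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

<p>When the discriminator confidently rejects a sample (very negative logit), the saturating gradient is close to zero while the non-saturating gradient stays close to -1 in this setup.</p>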
<h1 id="3-the-paper">3 The Paper</h1>
<p><strong>Motivation</strong>: Let’s read the <a href="https://arxiv.org/abs/1606.03657">paper</a>. Keep
in mind that InfoGAN modifies the original GAN objective in three ways:</p>
<ol>
<li>Split the incoming noise vector z into two parts: noise and code. The goal
is to learn meaningful codes for the dataset.</li>
<li>In addition to the discriminator, it adds another prediction head to the
network that tries to predict the code from the generated sample. The loss
is a combination of the normal GAN loss and the prediction loss.</li>
<li>This new loss term can be interpreted as a lower bound on the mutual
information between the generated samples and the code.</li>
</ol>
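<p>The extra loss term described in the list above can be sketched as a Monte Carlo estimate of the lower bound \(E[\log Q(c|x)] + H(c)\). This is a minimal sketch, not the paper's implementation: it assumes a softmax head for a categorical code and a Gaussian head with a fixed standard deviation for a continuous code, and all function names are hypothetical.</p>

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def categorical_log_q(logits, code_index):
    """log Q(c|x) when the prediction head outputs softmax logits (assumed head)."""
    return log_softmax(logits)[code_index]

def gaussian_log_q(mean, std, code_value):
    """log Q(c|x) when the head predicts a Gaussian mean with fixed std (assumed head)."""
    z = (code_value - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def info_lower_bound(log_q_values, code_entropy):
    """Monte Carlo estimate of E[log Q(c|x)] + H(c) from sampled log Q values."""
    return sum(log_q_values) / len(log_q_values) + code_entropy

num_classes = 10
entropy = math.log(num_classes)  # H(c) in nats for a uniform categorical code

# Hypothetical head outputs for a sample generated with code index 3.
uninformed = [categorical_log_q([0.0] * num_classes, 3)]
confident = [categorical_log_q([20.0 if k == 3 else 0.0 for k in range(num_classes)], 3)]

print(info_lower_bound(uninformed, entropy))
print(info_lower_bound(confident, entropy))
```

<p>For a discrete code, \(\log Q(c|x)\) is never positive, so this estimate cannot exceed \(H(c)\). With a uniform 10-way code it ranges from 0 nats for an uninformative head up to \(\log 10 \approx 2.30\) nats for a near-perfect one.</p>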
<p><strong>Topics</strong>:</p>
<ol>
<li>The InfoGAN objective</li>
<li>Why can’t we directly optimize for the mutual information <script type="math/tex">I(c; G(z,c))</script>?</li>
<li>Variational Information Maximization</li>
<li>Possible choices for classes of random variables for dimensions of the code c</li>
</ol>
<p><strong>Reproduce</strong>:</p>
<p><a href="https://colab.research.google.com/drive/1JkCI_n2U2i6DFU8NKk3P6EkPo3ZTKAaq#forceEdit=true&offline=true&sandboxMode=true" class="colab-root">Reproduce in a
<span>Notebook</span></a></p>
<p><strong>Required Reading</strong>:</p>
<ol>
<li><a href="https://arxiv.org/abs/1606.03657">InfoGAN</a></li>
<li><a href="http://aoliver.org/assets/correct-proof-of-infogan-lemma.pdf">A correction to a proof in the paper</a></li>
</ol>
<p><strong>Optional Reading</strong>:</p>
<ol>
<li><a href="https://towardsdatascience.com/infogan-generative-adversarial-networks-part-iii-380c0c6712cd">A blog post explaining InfoGAN</a></li>
</ol>
<p><strong>Questions</strong>:</p>
<ol>
<li>How does one compute <script type="math/tex">\log Q(c|x)</script> in practice? How does this answer change based on the choice of the type of random variables in c?
<details><summary>Solution</summary>
<p>What is \(\log Q(c|x)\) when c is a Gaussian centered at \(f_\theta(x)\)? What about when c is the output of a softmax?
</p><p>
See section 6 in the paper.
</p>
</details>
</li>
<li>Which objective in the paper can actually be optimized with gradient-based algorithms? How? (An answer to this needs to refer to “the reparameterization trick”)</li>
<li>Why is an auxiliary <script type="math/tex">Q</script> distribution necessary?</li>
<li>Draw a neural network diagram for InfoGAN.
<details><summary>Solution</summary>
<p>There is a good diagram in <a href="https://towardsdatascience.com/infogan-generative-adversarial-networks-part-iii-380c0c6712cd">this blog post</a>.</p>
</details>
</li>
<li>In the paper they say “However, in this paper we opt for
simplicity by fixing the latent code distribution and we will treat <script type="math/tex">H(c)</script> as a constant.” What if you want to learn
the latent code distribution (say, if you don’t know whether the classes are balanced in the dataset)? Can you still optimize for this with gradient-based algorithms? Can you implement this on an intentionally class-imbalanced variant of MNIST?
<details><summary>Solution</summary>
<p>
You could imagine learning the parameters of the distribution of c, if you can get H(c) to be a differentiable function of those parameters.
</p>
</details>
</li>
<li>In the paper they say “the lower bound … is quickly maximized to … and maximal mutual information is achieved”. How do they know this is the maximal value?</li>
<li>Open-ended question: Is InfoGAN guaranteed to find disentangled representations? How would you tell if a representation is disentangled?</li>
</ol>
<p>Author: avital</p>
<p>Thank you to Kumar Krishna Agrawal, Yasaman Bahri, Peter Chen, Nic Ford, Roy Frostig, Xinyang Geng, Rein Houthooft, Ben Poole, Colin Raffel and Supasorn Suwajanakorn for contributing to this guide.</p>
<h1>What Is This</h1>
<p>Published 2018-05-27T14:00:00+00:00 at https://www.depthfirstlearning.com/2018/What-This-Is</p>
<p>Welcome! We made this for you.</p>
<p>Why? Because there’s a flood of information out there on machine learning and it
is woefully insufficient in two ways that we all recognize.</p>
<p>The first is that while there is a terrific amount of resources online for
picking up machine learning in 24 hours, there is not nearly as much that treats
this as a ten-year endeavor. When you start thinking that way, you really want
to consider the trunk of your tree foremost before each convolutional leaf.</p>
<p>The second is that there is no bridge from where we came from to where we
are today. Without that, it is really hard to insightfully propose directions on
where to go. For example, it is one thing to understand how an algorithm like TRPO
works but another altogether to grasp why the authors made the choices that they
did and what they already knew that influenced those choices.</p>
<p>This public repository of depth-first study plans is the outcome of an effort to
address those shortcomings in the machine learning research world. DFL was
designed to be shared with and grown by the community. If you are interested in developing your own
curriculum for a paper that we do not already host, <strong>we will help you!</strong> Check out
<a href="https://github.com/depthfirstlearning/depthfirstlearning.com#contributing">Github</a>
for examples and open an <a href="https://github.com/depthfirstlearning/depthfirstlearning.com/issues">issue</a>
detailing which paper you want to study. We want your help in growing this resource
and making it richer.</p>
<p>We were inspired to create DFL to fill a gap left by other resources.
Each of the following was a north star and is complementary to DFL:</p>
<ol>
<li><strong>Textbooks</strong>: The de facto method of learning. We cannot match the depth that
they cover. Instead, we aim to create a curriculum for a single modern paper.</li>
<li><strong>Online courses</strong>: Online courses are fantastic at reaching almost everyone
in the world while having a similar profile to textbooks.</li>
<li><strong>Blogs</strong>: Blog posts are a copious resource in our community. They provide a
counter to textbooks in that they frequently cover just a single paper or idea
and can be a very welcome accompaniment to our own reading. In contrast to DFL,
they generally are one-off and insufficient for deep understanding.</li>
<li><strong><a href="https://distill.pub/">distill.pub</a></strong>: Distill.pub has beautiful and thorough
explanations for machine learning phenomena, oftentimes even with new insights
about the underlying mathematics. Each topic may span multiple papers and
cover a broader area of interest to machine learning practitioners. In contrast, we
try to unify resources needed to understand a single paper deeply by providing
the right resources and topics from which to study.</li>
<li><strong><a href="https://metacademy.org/">Metacademy</a></strong>: Metacademy is the closest parallel
to Depth First Learning. We have similar goals in improving the learning process
through breaking down concepts into their parents. However, we ultimately have
different foci as DFL is built for understanding significant machine learning
papers and consequently increasing the strength of the field itself.</li>
</ol>
<ul>
<li><a href="http://twitter.com/avitaloliver">Avital</a>, <a href="http://twitter.com/suryabhupa">Surya</a>,
<a href="http://twitter.com/kumarkagrawal">Kumar</a>, <a href="http://twitter.com/cinjoncin">Cinjon</a></li>
</ul>
<p>Author: dfl</p>