Depth First Learning

Bringing the people back in by Emily Denton

2021-09-14T08:00:00+00:00

Bringing the people back in: Contesting benchmark machine learning datasets was a paper from July, 2020, by Emily Denton and team. It is an interrogation of how datasets in machine learning are made and how they influence the field. The work motivates the need for genealogical methods for datasets so that we can trace their history and ensure that users are sufficiently aware of what biases they introduce into resulting infrastructure. This is an ongoing journey for Emily in her blossoming career. Listen to her describe that journey here.

Learning the Optimizer by Luke Metz

2021-08-18T13:00:00+00:00

Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves was a paper from September, 2020, by Luke Metz and co. It is another step towards replacing hand-designed features with learned functions, this time the optimizer. This has been a three-year journey for Luke; listen to him describe what he’s learned along the way and where the pain points are, from research to engineering.

Characterising Bias in Compressed Models by Sara Hooker

2021-08-10T14:00:00+00:00

Characterising Bias in Compressed Models by Sara Hooker et al highlighted who ends up paying for the lunch when it comes to modern deep learning compression techniques. All of the models we use pervasively, in our phones, on social feeds, and so on, rely on compression. Are we compromising what we want when we apply this everywhere? More particularly, are we affecting some groups more than others?

T5 by Colin Raffel

2021-08-03T14:00:00+00:00

T5 by Colin Raffel et al is an important work in the NLP literature. The idea behind it was to perform a gigantic study on a wide array of methods and scientifically assess what worked. They then combined those working methods into a single model called … T5. It performs well in every common NLP task, from summarization to translation to question answering.

Variational Inference with Normalizing Flows

2021-02-09T10:00:00+00:00

[Editor’s Note: This class was a part of the 2019 DFL Jane Street Fellowship.]

Firstly, a huge thank-you to the participants in the study group that led to this guide, for their enthusiastic participation, interesting perspectives and insights, and useful feedback and contributions: Scott Cameron, Jean Michel Sarr, Suvarna Kadam, James Allingham, Bharati Srinivasan, Lood van Niekerk, and Witold Szejgis.

Thank you too to the Depth First Learning team for bringing me on board, and especially to Avital Oliver for helping get things started, keeping them on the rails, organizing guests for study group sessions, and gently but insistently nudging me to wrap things up after the study group had concluded.

Finally, thank you to Laurent Dinh and Rianne van den Berg for sitting in on our discussion sessions and sharing their inputs, and to them, Avital, and the study group members for their feedback on and contributions to various drafts of this material.

Concept dependency graph.

Why

Variational inference forms a cornerstone of large-scale Bayesian inference. Large-scale neural architectures making use of variational inference have been enabled by approaches allowing computationally and statistically efficient approximate gradient-based techniques for the optimization required by variational inference - the prototypical resulting model is the variational autoencoder.

A complementary objective to efficient variational inference in a given variational family is maintaining efficiency while allowing a richer variational family of approximate posteriors. Normalizing flows are an elegant approach to representing complex densities as transformations from a simple density.

This curriculum develops key concepts in inference and variational inference, leading up to the variational autoencoder, and considers the relevant computational requirements for tackling certain tasks with normalizing flows. While it provides good background for studying a variety of papers on VI and generative modeling, the key focus of the curriculum is the paper Variational inference with normalizing flows, which uses normalizing flows to enrich the representation used for the approximate posterior in amortized variational inference.

Outline

The paper that we are working towards combines two key ideas: (1) amortized variational inference, and (2) normalizing flows.

We first introduce the challenge of Bayesian inference in latent variable models (Section 1), then explain variational inference (VI) as an approach for approximate inference (Section 2). In Section 3, we develop some key ideas from the past decade extending the range of problems and problem sizes where VI can be applied. These ideas are then combined with the idea of an inference network to develop amortized VI, showcased by the variational autoencoder (VAE), in Section 4.

Normalizing flows (NFs) are a modelling approach which represent a density of interest by a sequence of invertible transformations from a reference distribution, for example a standard Gaussian. NFs can enable one to model a rich class of distributions by specifying parameters for these transformations. We introduce the key ideas of NFs in Section 5, and then move on to the main paper (Section 6), which leverages NFs to improve the richness of the family of approximate latent distributions used in amortized VI.
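As a compact reference for the change of variables behind this construction, the following sketch uses notation that is an assumption on our part rather than something fixed by the curriculum: $z_0$ is a draw from the reference density $p_0$, and $g_1, \ldots, g_K$ are the invertible transformations, so that $x = g_K \circ \cdots \circ g_1(z_0)$ with intermediate values $z_k = g_k(z_{k-1})$.

\[\log p_X(x) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial g_k}{\partial z_{k-1}} \right| \enspace .\]

Evaluating the density therefore requires each Jacobian determinant to be cheap to compute, which is the computational requirement that Sections 5 and 6 return to.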

A Google Doc containing an expanded version of this curriculum is also available. It contains more information on assumed prerequisites, additional rationale for and commentary on various assigned readings, links to supporting material to help with mastering the required reading, a couple of extra exercises that did not make the final curriculum, and scribe notes from the group discussion sessions.

1 Bayesian inference and latent variable models

Synopsis: This part’s material covers some general background from probability theory, including Bayes rule. With this background, students should be able to formulate a probabilistic model and understand the inference and learning problems. Of particular interest in this course are latent variable models, where the model includes variables which are never observed (and are arguably only modelling artifacts). In some special cases, Bayesian inference (using Bayes rule to update beliefs about variables based on observations) leads to tractable posteriors for the variables, where we can conveniently calculate expectations as required for further inference or decision-making. Many models make use of exponential families of distributions to obtain tractable posteriors through a property called conjugacy. In most practical cases of interest, however, the posterior will be more complicated than we can deal with exactly. Monte Carlo methods based on sampling from the posterior are one approach for dealing with this. Our focus in the coming parts, however, will be another major approach, variational inference.
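To make the contrast between exact conjugate inference and sampling concrete, here is a minimal sketch using a Beta prior with Bernoulli observations. This example is our own choice and does not come from the readings; the data, prior parameters, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 coin flips with an unknown heads probability theta.
flips = rng.binomial(1, 0.7, size=20)
heads = int(flips.sum())
tails = int(len(flips) - heads)

# Beta(a, b) prior on theta. Conjugacy means the posterior is again a Beta,
# obtained by adding the observed counts to the prior parameters.
a_prior, b_prior = 2.0, 2.0
a_post, b_post = a_prior + heads, b_prior + tails

# Exact posterior expectation of theta, available in closed form.
exact_mean = a_post / (a_post + b_post)

# Monte Carlo estimate of the same expectation from posterior samples.
samples = rng.beta(a_post, b_post, size=100_000)
mc_mean = samples.mean()

print(f"exact posterior mean: {exact_mean:.4f}")
print(f"Monte Carlo estimate: {mc_mean:.4f}")
```

Here the sampler is unnecessary because the posterior is tractable; the point is only that the same averaging recipe still applies when no closed form exists, provided one can sample from the posterior.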

Objectives: After this part, you should:

be able to apply the change of variable formula to calculate the distribution of a transformation of a random variable;
understand the tasks of inference of variables and learning of parameters in a probabilistic model;
be comfortable with manipulating the core quantities used in Bayes rule (prior, likelihood, evidence, posterior) and key information-theoretic quantities;
be able to convert between a Bayes network representation and a factored joint distribution;
understand the principle of conjugate priors and the relevance of the exponential family w.r.t. conjugacy; and
be aware of sampling techniques and how a sampler can be used to evaluate a posterior expectation.
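As a warm-up for the first objective, the following sketch applies the one-dimensional change of variable formula to $Y = \exp(X)$ with $X \sim \mathcal{N}(0, 1)$ and checks the result numerically. The choice of transformation, the grid sizes, and the function names are illustrative assumptions.

```python
import numpy as np

def standard_normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def density_of_exp_x(y):
    # Y = exp(X) with X ~ N(0, 1): the inverse map is x = log(y), and
    # |dx/dy| = 1 / y, so p_Y(y) = p_X(log y) / y.
    return standard_normal_pdf(np.log(y)) / y

def trapezoid(values, grid):
    # Plain trapezoidal rule, written out to avoid depending on a NumPy version.
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))

# Check 1: the transformed density integrates to roughly one.
grid = np.linspace(1e-6, 100.0, 400_001)
total_mass = trapezoid(density_of_exp_x(grid), grid)

# Check 2: it agrees with the empirical frequency of transformed samples in [1, 2].
rng = np.random.default_rng(0)
y_samples = np.exp(rng.standard_normal(200_000))
empirical = np.mean((y_samples >= 1.0) & (y_samples <= 2.0))
sub_grid = np.linspace(1.0, 2.0, 10_001)
predicted = trapezoid(density_of_exp_x(sub_grid), sub_grid)

print("total mass:", total_mass)
print("P(1 <= Y <= 2): empirical", empirical, "vs. formula", predicted)
```

Dropping the $1/y$ factor breaks both checks, which is a quick way to see why the Jacobian term cannot be omitted.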

Topics:

Important concepts in probability and information theory (Bayes rule, latent variables, multivariate change of variables formula, Kullback-Leibler divergence and entropy)
(Exact) Bayesian inference, conjugacy, and the exponential family
Introduction to approximate inference

Required Reading

Important concepts in probability and information theory:

Ian Goodfellow et al., Deep Learning, the following portions of Chapter 3: Sections 3.9.6 and 3.11–3.13 (excluding the portion in Section 3.12 on measure theory). [Note that the content of Chapter 3 before Section 3.9.3 is assumed background knowledge.]

(Exact) Bayesian inference, conjugacy, and the exponential family:

David MacKay, Information Theory, Inference, and Learning Algorithms, Section 3.2.
David Blei, The Exponential Family, sections titled “Definition” and “Conjugacy” (until Formula (49), before the subsection “Posterior predictive distribution”)

Introduction to approximate inference:

Dimitris G. Tzikas, Aristidis C. Likas, and Nikolaos P. Galatsanos, The Variational Approximation for Bayesian Inference, until the end of the section titled “An alternative view of the EM algorithm”.
David MacKay, Information Theory, Inference, and Learning Algorithms, Section 29.1 (excluding the portion on uniform sampling).

Additional Reading:

The rest of David Blei, The Exponential Family
More of Chapters 29 and 30 of David MacKay, Information Theory, Inference, and Learning Algorithms

Questions:

Density transformation formula. Use the formula for transformation of variables to derive the density of the multivariate Gaussian distribution from an invertible linear transformation of the standard multivariate Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
Belief networks. Complete part 1 of Exercise 35 at the end of Chapter 3 in this PDF pre-print version of David Barber’s “Bayesian Reasoning and Machine Learning”.
Posterior inference via conjugacy. Suppose you have data $D$ consisting of i.i.d. observations $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2 =1)$.

a. Specify the likelihood of the observations $p(D; \mu)$.

b. Derive the maximum likelihood estimate of $\mu$.

c. Suppose we model our uncertainty about the mean with $\mu \sim \mathcal{N}(0, \sigma_{\mu}^2 = 1)$. Derive the posterior distribution by making use of conjugacy, and use this to obtain the MAP estimate of $\mu$.
Prove that the KL divergence $\mathrm{KL}(q \| p)$ is nonnegative.
Hint
Apply the bound $$\log t \leq t-1$$ to $$t=p(x)/q(x)$$.
KL divergence for simple normal distributions. Show that
\[\text{KL}\left(\mathcal{N}\left((\mu_1, \ldots, \mu_k)^\mathsf{T}, \operatorname{diag} (\sigma_1^2, \ldots, \sigma_k^2)\right) \parallel \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)\right) = {1 \over 2} \sum_{i=1}^k (\sigma_i^2 + \mu_i^2 - \ln(\sigma_i^2) - 1) \enspace .\]
Derive Equation (7) in The Variational Approximation for Bayesian Inference.
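The closed-form expression for the KL divergence between a diagonal Gaussian and the standard Gaussian stated in the questions above can be sanity-checked numerically before attempting the derivation. A minimal sketch, with arbitrary illustrative values for the means and variances, compares it with a Monte Carlo estimate of $\mathbb{E}_q[\log q(x) - \log p(x)]$:

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, sigma2):
    # Closed form: 0.5 * sum_i (sigma_i^2 + mu_i^2 - ln(sigma_i^2) - 1).
    return 0.5 * np.sum(sigma2 + mu**2 - np.log(sigma2) - 1.0)

def log_diag_gaussian(x, mu, sigma2):
    # Log-density of N(mu, diag(sigma2)) evaluated at each row of x.
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (x - mu) ** 2 / sigma2, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])        # illustrative means
sigma2 = np.array([0.3, 1.5, 0.8])     # illustrative variances

# Samples from q = N(mu, diag(sigma2)).
x = mu + np.sqrt(sigma2) * rng.standard_normal((200_000, 3))

log_q = log_diag_gaussian(x, mu, sigma2)
log_p = log_diag_gaussian(x, np.zeros(3), np.ones(3))

kl_mc = np.mean(log_q - log_p)
kl_exact = kl_diag_gaussian_to_standard(mu, sigma2)

print("closed form:", kl_exact)
print("Monte Carlo:", kl_mc)
```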

Solutions

Solutions to these exercises can be found here

2 Introduction to Variational Inference (VI)

Synopsis: In practice, Bayesian inference yields posteriors which do not have convenient forms. The traditional approach to calculating or estimating posteriors or posterior expectations is to use Monte Carlo methods based on posterior sampling. These are asymptotically exact but computationally intensive, particularly in high dimensions. An alternative approach is variational inference (VI), which trades off exactness for tractability. In this part, we introduce the core ideas of VI approaches in the context of mean field VI. The VI approach loses exactness by approximating the true posterior with a representative from a variational family. There is a tradeoff between richness of the approximation family (impacting the resulting estimate quality) and the tractability of the VI scheme. Mean-field factorization assumptions on the variational family yield an approach for optimizing the variational parameters through coordinate ascent. Much of the rest of this curriculum will focus on trying to improve the behaviour of VI in terms of scalability, broadness of applicability, and accuracy (by using more sophisticated variational families).
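The tradeoff described above rests on the identity $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q(z) \| p(z \mid x))$. The sketch below checks it numerically on a hypothetical toy model with a three-valued latent variable; all numbers are made up for illustration.

```python
import numpy as np

# Hypothetical toy model: latent z in {0, 1, 2} and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])            # p(z)
likelihood = np.array([0.10, 0.40, 0.70])    # p(x | z) for the observed x

joint = prior * likelihood                   # p(x, z)
evidence = joint.sum()                       # p(x)
posterior = joint / evidence                 # p(z | x)

def elbo(q):
    # E_q[log p(x, z) - log q(z)]
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.2, 0.3, 0.5])                # an arbitrary variational distribution

print("log evidence      :", np.log(evidence))
print("ELBO(q) + KL(q||p):", elbo(q) + kl(q, posterior))
print("ELBO at posterior :", elbo(posterior))
```

Because the KL term is nonnegative, the ELBO never exceeds the log evidence, and the gap closes exactly when $q$ equals the true posterior.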

Objectives: After this part, you should:

have an idea of the relationships between the (variational) EM algorithm and (variational) Bayesian inference;
be able to describe coordinate ascent variational inference (CAVI), and explain its shortcomings in terms of scalability to large models; and
understand and follow the steps required in deriving a CAVI algorithm for a conditionally conjugate model.

Topics:

Variational expectation-maximization
Variational inference
Mean-field variational inference
Co-ordinate ascent variational inference

Required Reading:

Variational expectation-maximization:

Dimitris G. Tzikas, Aristidis C. Likas, and Nikolaos P. Galatsanos, The Variational Approximation for Bayesian Inference, the section titled “The Variational EM framework”.

Variational inference:

David Blei, Alp Kucukelbir, and Jon D. McAuliffe, Variational Inference: A Review for Statisticians, until the end of Section 4.2.

Additional Reading:

The rest of Dimitris G. Tzikas, Aristidis C. Likas, and Nikolaos P. Galatsanos, The Variational Approximation for Bayesian Inference.

Questions:

Forward vs reverse KL. Consider the univariate distribution $P$ formed by an equal mixture of unit variance Gaussians with means at -5 and 5. Think about how a Gaussian distribution $Q$ would look that minimizes (i) $\mathrm{KL}(Q\|P)$ and (ii) $\mathrm{KL}(P\|Q)$. Explain your answers. Which approximation behaviour do you think is preferable for posterior inference, and why? Which approach do you think will be more tractable, and why? Additional: implement the required KL calculations - sampling or other tricks will be required - and numerically optimize to fit the optimal Q in each case.
EM vs. variational inference. Describe how Bayesian inference of latent variables and unknown parameters can be seen as a special case of the EM algorithm. Extend this analogy to compare coordinate ascent variational inference to mean-field variational EM.
ELBO as a KL divergence? Looking at Equation (13) of Variational Inference: A Review for Statisticians, it seems one can write $\mathrm{ELBO}(q) = -\mathrm{KL}(q(\mathbf{z})\|p(\mathbf{z},\mathbf{x}))$. Explain what the problem is with this. (Note that this is also essentially done in Equation 15 of The Variational Approximation for Bayesian Inference.) Warning: some would argue this is just nitpicking about a technicality!
ELBO derivations. Show that the expression $\mathbb{E}[\log p(x_i \mid c_i,\mathbf{\mu}; \phi_i, \mathbf{m}, \mathbf{s}^2)]$ in Equation (21) of Variational Inference: A Review for Statisticians equals $-\frac{1}{2}[\log 2\pi + \sum_{k=1}^K \phi_{ik}(x_i^2 +m_k^2 + s_k^2 -2x_i m_k)]$.
What do you think is the biggest challenge to scalability of CAVI?
What is the benefit of your model having complete conditionals in the exponential family if you would like to apply CAVI?
Calculate the rest of the terms in the ELBO of Equation (21) in Variational Inference: A Review for Statisticians, and verify the CAVI update equations by setting the components of the ELBO gradient to zero. (Additional)
Implement CAVI for the example in Sections 2-3 of Variational Inference: A Review for Statisticians using PyTorch or a similar package. Think about how to visualize the behaviour of the algorithm and/or its results. If you have done the previous exercise, use a threshold on the relative change in the ELBO to control when to terminate; otherwise you can monitor changes in the variational parameters, or the log-predictive density on a hold-out set. If you have implemented the ELBO, compare the behaviour of CAVI to directly optimizing the ELBO by gradient descent. (Additional)

Solutions

Solutions to these exercises can be found here

3 Doubly stochastic estimation: VI by Monte-Carlo mini-batch gradient estimation

Synopsis: In this part we consider two techniques used to address major limitations on the applicability and scalability of CAVI. The first challenge (to scalability) is that each global parameter update requires a full pass through the complete data set, which is problematic for very large data sets. This is resolved through stochastic variational inference, which uses the same ideas from stochastic approximation that enable the use of stochastic gradient descent in training other machine learning models. The second challenge (to applicability) is that the updates by CAVI need to be determined manually for each model. This is addressed through black-box variational inference (BBVI), which uses Monte Carlo estimates to replace the manual derivation. Since the naive Monte Carlo estimator has very high variance, variance reduction techniques for Monte Carlo estimation must be applied to make this approach effective. When BBVI is combined with SVI by using mini-batches for the gradient estimation, we speak of doubly stochastic estimation.
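To illustrate the Monte Carlo gradient estimation that BBVI relies on, the sketch below estimates $\nabla_\sigma \mathbb{E}_{\mathcal{N}(0, \sigma^2)}[x^2] = 2\sigma$ with the score function estimator, with and without a constant baseline acting as a control variate. The test function, the choice of differentiating with respect to the scale, the sample sizes, and all names are illustrative assumptions rather than anything prescribed by the readings.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5                       # current value of the parameter
exact_grad = 2.0 * sigma          # d/d(sigma) of E[x^2] = sigma^2

def f(x):
    return x**2

def score(x, sigma):
    # d/d(sigma) log N(x; 0, sigma^2) = x^2 / sigma^3 - 1 / sigma
    return x**2 / sigma**3 - 1.0 / sigma

# Baseline estimated from an independent pilot sample, so that subtracting it
# leaves the estimator unbiased (the expected score is zero).
pilot = sigma * rng.standard_normal(1_000)
baseline = f(pilot).mean()

x = sigma * rng.standard_normal(200_000)
naive_terms = f(x) * score(x, sigma)
baseline_terms = (f(x) - baseline) * score(x, sigma)

print("exact gradient:", exact_grad)
print("naive         :", naive_terms.mean(), "per-sample variance", naive_terms.var())
print("with baseline :", baseline_terms.mean(), "per-sample variance", baseline_terms.var())
```

Both estimators target the same gradient. In this toy problem the baseline should lower the per-sample variance, though the heavy tails of the terms make the empirical variances themselves noisy.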

Objectives: After this part, you should:

be aware of the concept of natural gradient;
be aware of the Robbins-Monro conditions for stochastic optimization;
understand how SVI uses mini-batch gradients to efficiently scale up CAVI;
understand the score function Monte Carlo gradient estimator of the ELBO;
be aware of what is required to apply BBVI and doubly stochastic estimation;
be aware of Rao-Blackwellization/conditioning and control variates as variance reduction techniques in Monte Carlo estimation; and
be able to explain the impact of doubly stochastic estimation on scalability, and what issues further limit scalability.
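For reference, the Robbins-Monro conditions on a step-size sequence can be written as below. The symbols are assumptions borrowed from the notation of the Stochastic Variational Inference paper listed under Additional Reading: $\rho_t$ is the step size at iteration $t$, and $\tau \geq 0$ and $\kappa$ are a delay and a forgetting rate.

\[\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty \enspace .\]

A common schedule satisfying both conditions is $\rho_t = (t + \tau)^{-\kappa}$ with $\kappa \in (0.5, 1]$.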

Topics:

Fisher information and natural gradient
Stochastic variational inference
Variance reduction methods for Monte Carlo estimation
Black box variational inference

Required Reading:

Fisher information and natural gradient:

Andrew Miller, Natural Gradients and Stochastic Variational Inference, until the start of the section “Gaussian example”.

Stochastic variational inference:

David Blei, Alp Kucukelbir, and Jon D. McAuliffe, Variational Inference: A Review for Statisticians, Section 4.3.

Variance reduction methods for Monte Carlo estimation:

Martin Haugh, Simulation Efficiency and an Introduction to Variance Reduction Methods. Read from the beginning until the end of Example 1 on page 4, and then Section 4 until the end of Example 9 on page 12.

Black box variational inference:

Rajesh Ranganath, Sean Gerrish, and David M. Blei, Black Box Variational Inference. Section 5 is optional, but note the dramatic effect of the variance reduction techniques shown in Figure 2. (Also check the derivation of the ELBO gradient in Equation 2 presented in Section 7, but note that there is a missing gradient sign in the expectation in the line where Equation (13) is labelled.)

Additional Reading:

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley, Stochastic Variational Inference. (The most relevant portion is until the end of Section 2, with Section 3 discussing applications to two topic models: latent Dirichlet allocation and the hierarchical Dirichlet process.)

Questions:

Extend your CAVI implementation from the previous section to VI using natural gradient descent, and consider the impact of the minibatch size on the convergence time in terms of number of examples seen. Use the autodifferentiation capability of PyTorch to perform stochastic gradient descent on the ELBO (i.e. not following the natural gradient), and compare the performance of this to the previous approach. (Additional)
The score function. For a parameterized distribution $p(x; \theta)$, the score is defined as the gradient (w.r.t. $\theta$) of the log-density, and the covariance matrix of the score under this distribution is called the Fisher information matrix.

a. Derive the score function for a univariate Gaussian.

b. Show that the expected score (w.r.t. $p$) is zero.
Fisher as the Hessian of relative entropy. Assuming $\log q_{\lambda}$ is twice differentiable, one has that the entries of the Fisher can also be written as $[F_\lambda]_{ij} = -\mathbb{E}_{x \sim q_{\lambda}}[\frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \log q_{\lambda} (x)]$. (Additional: derive this.) Use this formulation to show that the Fisher is the Hessian (w.r.t. $\lambda^{\prime}$) of the KL divergence $\mathrm{KL}(q_\lambda \| q_{\lambda^\prime})$ at $\lambda^\prime = \lambda$.
Fisher for exponential families. Given that $F_\eta = - \mathbb{E}_{x} \nabla_\eta^2 \log p(x \mid \eta)$ (the matrix form of the representation in the previous exercise), show that the Fisher equals the Hessian of the log normalizer ($\nabla_\eta^2 a(\eta)$) when $p(x \mid \eta)$ is from an exponential family.
Score function gradient estimation, a.k.a. the log-derivative trick. Consider the problem of using gradient descent to find the mean of a unit variance Gaussian with minimum second moment $\mathbb{E}(X^2)$. We thus seek the value of $\nabla_{\mu} \mathbb{E}_{N(\mu,1)}(X^2)$ at a candidate value $\mu_0$. Exchange the order of differentiation and integration, and then use the score function to obtain an expression for this derivative that is an expectation amenable to Monte Carlo estimation. Note how the derivation of the ELBO gradient for BBVI used this approach, along with the expectation of the score being zero. (This idea is essentially the key idea enabling BBVI, so it is probably the most important of this part’s exercises to get your head around.)
Incremental SVI. Suppose you have already fit a model to a huge data set with doubly stochastic VI, and then receive new data. How would you go about obtaining the estimated posterior over the latent variables for the new data? How would you go about updating the model to incorporate the new data?
Law of total variance. Derive the formula in Equation 5 on page 10 of Simulation Efficiency and an Introduction to Variance Reduction Methods.
Hint
Begin by writing the variance as a difference in the traditional way, and applying the law of total expectation (the formula above Equation 5) to each term. From there you should be able to manipulate expectations and variances w.r.t. $$Z$$ and $$X|Z$$ to get the required expression - i.e. there should be no need to write these out as integrals.
Efficacy of conditional Monte Carlo. Answer Exercise 2 on page 11 of Simulation Efficiency and an Introduction to Variance Reduction Methods
Implement naive Monte Carlo sampling as well as using the control variate and conditioning methods as per Examples 1 and 9 in Simulation Efficiency and an Introduction to Variance Reduction Methods to see the variance reduction effect of these strategies. (Additional)
Consider mean-field variational inference of a hierarchical Bayesian model as in Equation (12) of Black Box Variational Inference. Note that $\beta$ appears in all terms of the log-joint, while any specific $z_i$ only appears in two terms. What effect does this have when one calculates Rao-Blackwellized estimators of the gradient component for the variational parameters corresponding to $\beta$ vs. those for the $z_i$ according to Equation (6) of the paper? How does incorporating stochastic estimation via minibatching/observation sampling make these updates more efficient? (Focus on the overall effect; equations are not required!)
Implement BBVI for the Bayesian Gaussian mixture model, and compare its performance to the previous techniques (both with and without variance reduction techniques). (Additional)

Solutions

Solutions to these exercises can be found here

4 Inference networks and amortized VI

Synopsis: This part presents developments in VI allowing further scalability as well as use in online settings. Traditional VI analyses all the data together, and individually optimizes the latent variables corresponding to each observation. This means that new observations require refitting the entire model. A way to bypass this is to model the transformation from an observation to its posterior distribution using an inference network or recognition model. Instead of optimizing variational parameters for each observation, those variational parameters are output by the inference network when it is given the observation as input, and the model parameters of the inference network are trained to optimize these predictions during the learning phase. This allows direct, efficient prediction of the latent variable posterior (i.e. inference) on previously unseen samples - so-called amortized VI. Earlier work had already trained such inference networks; the further development here was combining the inference and generative networks end-to-end in a neural network, and using the evidence lower bound (ELBO) as a combined training objective. This was enabled, for continuous variables, by an alternative Monte Carlo estimator of the gradient, based on the so-called reparameterization trick. The most well-known such model now is the variational autoencoder.
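As a small illustration of the reparameterization trick mentioned above, the sketch below estimates $\nabla_\sigma \mathbb{E}_{\mathcal{N}(0, \sigma^2)}[x^2] = 2\sigma$ by writing $x = \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ and differentiating through the sample. A score function estimator of the same gradient is included for comparison. The test function, the choice of the scale as the parameter, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
exact_grad = 2.0 * sigma             # d/d(sigma) of E[x^2] = sigma^2

eps = rng.standard_normal(200_000)   # noise whose distribution does not depend on sigma
x = sigma * eps                      # reparameterized sample, x ~ N(0, sigma^2)

# Pathwise (reparameterization) terms: d/d(sigma) (sigma * eps)^2 = 2 * sigma * eps^2
pathwise_terms = 2.0 * sigma * eps**2

# Score function terms: f(x) * d/d(sigma) log N(x; 0, sigma^2)
score_terms = x**2 * (x**2 / sigma**3 - 1.0 / sigma)

print("exact gradient:", exact_grad)
print("pathwise      :", pathwise_terms.mean(), "per-sample variance", pathwise_terms.var())
print("score function:", score_terms.mean(), "per-sample variance", score_terms.var())
```

In this particular problem the pathwise terms have much lower variance than the score function terms. That ordering is not guaranteed for every integrand, so it is worth checking empirically in other settings.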

Objectives: After this part, you should be comfortable with:

explaining the reparameterization trick and what problem it tries to solve;
understanding in principle how the reparameterization trick is implemented in machine learning libraries with auto-differentiation facilities;
combining an inference network with a generator network, and training them end to end;
the idea of the inference network outputting parameters describing the posterior distribution corresponding to the network input;
the specific choice of loss function used for end-to-end training;
the use of amortized VI for variational autoencoders and deep latent variable models; and
discussing the scalability of such systems, and their limitations.

Topics:

Inference networks
Amortized VI
The reparameterization trick
Variational autoencoders

Required Reading:

Diederik Kingma and Max Welling, An Introduction to Variational Autoencoders, Sections 1.7-2.8 (but you can omit Section 2.6).

Additional Reading:

The first two papers listed below were independent proposals of the variational autoencoder.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra, Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
Diederik Kingma and Max Welling, Auto-Encoding Variational Bayes.
Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih, Monte Carlo Gradient Estimation in Machine Learning. Studies various approaches to estimating gradients of function expectations with respect to parameters defining the distribution with Monte Carlo methods. Properties of the score function and pathwise (i.e. reparameterization trick) gradient estimators are discussed in considerable detail.
James Allingham, Deep Learning Indaba Practical 3b on Deep Generative Models - Colab notebook introducing VAEs and Generative Adversarial Networks (GANs).

Questions:

Reparameterization trick. Explain the reparameterization trick in your own words and what problem it tries to solve.
Applying the reparameterization trick. Exercise 5 of Section 3 uses the score function gradient estimator. Now use the reparameterization trick to get an alternative expression for this gradient in terms of an expectation w.r.t. a standard Gaussian distribution (i.e. zero-mean and unit variance). Implement both estimators (the one from the previous section and the one from this section), and plot the variance of each against the number of Monte Carlo samples. (To obtain the variances, repeatedly estimate the gradient with independent Monte Carlo samples of the relevant size.)
Discrete latent variables and the reparameterization trick. Black-box variational inference can fit models with discrete latent variables, but the VAE cannot. Explain why.
The ELBO as training objective. In previous sections, we considered the situation where the generative model was known, and we focused on estimating the variational parameters by optimizing the ELBO. In the VAE, the ELBO is used to jointly optimize the parameters of the encoder and the decoder. Consider the decomposition of the marginal likelihood in Equation 2.8 of An Introduction to Variational Autoencoders.

Suppose $\theta$ is held fixed, and $\phi$ is optimized w.r.t. the ELBO. This is similar to other VI approaches, except that an inference network is now used for amortized analysis. This has no effect on the marginal likelihood of the generative model (which should be expected, since $\theta$ is fixed), but makes the variational posterior better.

Suppose now that $\phi$ is held fixed, and $\theta$ is optimized w.r.t. the ELBO. This may make the variational posterior less accurate. Why is it nevertheless a good idea?

Finally, note that end-to-end optimization of the ELBO across the encoder and decoder essentially corresponds to interleaving stochastic gradient descent w.r.t. the two above steps.
VAE implementation and exploration. Complete the VAE implementation in vae.py.

a. Note how the provided code uses the VAE to sample new images.

b. Plot the variational parameters (means and log-variances) for a number of MNIST digits. Do they seem to have some kind of information about the classes present in the data set? (Additional)
Relationship to nonlinear PCA. An earlier approach to constructing low-dimensional representations (for compression or further analysis) was nonlinear PCA. This used a low-dimensional bottleneck layer in an autoencoder model, and then extracted the representation at this layer for the lower-dimensional representation. Modify your VAE implementation above by ignoring the log-variances, and simply returning the predicted mean in the reparameterization step. This corresponds to setting the variance for the latent Gaussian to zero, and the resulting model then almost corresponds to non-linear PCA. The final adjustment to obtain nonlinear PCA is to set the loss function to only use the reconstruction loss, and not to also penalize deviations of the variational family from the prior. (Additional)

a. Compare the sampling output for nonlinear PCA and the VAE, and contrast their suitability for sampling.

b. Contrast nonlinear PCA and the VAE w.r.t. their suitability for compression.

Solutions

Solutions to these exercises can be found here

5 Normalizing Flows

Synopsis: There are various approaches to probabilistic modelling of complex phenomena. In the previous parts, we have considered variational inference for directed graphical models with latent variables. These models postulate meaningful latent variables and are amenable to ancestral sampling once we have fit the required conditional distributions, but a challenge for this approach is that the posterior distribution of latent variables may exhibit complex dependencies, which may not be well modeled by the variational family. In this part, we consider a different approach to probabilistic modelling which dispenses with the latent variables, and directly models the data density as a sequence of parameterized invertible transformations starting from a (simple) base density. Such a sequence of transformations (from a complicated to a simple density) is called a normalizing flow. A key aspect of this approach is to ensure that applying the transformations and obtaining their gradients are computationally efficient to allow efficient training and sampling. Thus, normalizing flows in the machine learning literature usually refers to an approach to parameterizing a fairly complex distribution as a sequential transformation of a simple one with some attractive computational properties. In the setting we consider here, a single flow is fitted directly to the (often high-dimensional) data. The next section will combine these modelling approaches by using normalizing flows to refine the posteriors in amortized VI.
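The sketch below illustrates the ingredients described above with a single additive coupling layer followed by a diagonal scaling, loosely in the spirit of NICE. It is a toy with random, untrained parameters: the dimensions, the small `tanh` network `m`, and all function names are hypothetical choices made for illustration, not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8          # data dimension and hidden width (illustrative)
d = D // 2           # size of the part that passes through unchanged

# Parameters of a tiny coupling function m: R^d -> R^(D - d), random and untrained.
W1, b1 = rng.normal(size=(d, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, D - d)), np.zeros(D - d)
log_scale = rng.normal(scale=0.1, size=D)    # diagonal scaling layer

def m(x1):
    return np.tanh(x1 @ W1 + b1) @ W2 + b2

def forward(x):
    """Data -> latent (the inference direction). Returns z and log|det dz/dx|."""
    x1, x2 = x[:, :d], x[:, d:]
    h = np.concatenate([x1, x2 + m(x1)], axis=1)   # additive coupling: log-det is 0
    z = h * np.exp(log_scale)                      # scaling: log-det is sum(log_scale)
    log_det = np.full(x.shape[0], log_scale.sum())
    return z, log_det

def inverse(z):
    """Latent -> data (the sampling direction); m itself is never inverted."""
    h = z * np.exp(-log_scale)
    h1, h2 = h[:, :d], h[:, d:]
    return np.concatenate([h1, h2 - m(h1)], axis=1)

def log_prob(x):
    # Change of variables with a standard Gaussian base density.
    z, log_det = forward(x)
    log_base = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=1)
    return log_base + log_det

x = rng.normal(size=(5, D))
z, _ = forward(x)
x_reconstructed = inverse(z)
samples = inverse(rng.normal(size=(3, D)))

print("max reconstruction error:", np.abs(x - x_reconstructed).max())
print("log-densities:", log_prob(x))
```

Both directions cost one evaluation of `m`, and the log-determinant is available without forming the Jacobian, which is what makes inference and sampling equally cheap for this kind of layer.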

Objectives: After this part, you should:

be comfortable with the change of variable formula and the use of the Jacobian when transforming densities through nonlinear maps;
understand the distinction between inference and sampling in flow models, and how inference enables density estimation;
know which operations need to be efficient for efficient inference vs efficient sampling in flow models; and
understand how the coupling layers used in NICE enable both efficient inference and efficient sampling.

Topics:

Normalizing flows
Efficient sampling vs. efficient inference with normalizing flows

Required Reading:

Normalizing Flows:

Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker, Normalizing Flows: Introduction and Ideas, until midway through Section 3.2.1, “Triangular”. (Skip Section 2.1.1.) This review introduces the foundational concepts of normalizing flows, their main forms of application, and the properties we desire for efficient computation with normalizing flows.

Efficient sampling vs. efficient inference with normalizing flows:

Laurent Dinh, David Krueger, and Yoshua Bengio, NICE: Non-linear independent components estimation. (Feel free to skim over portions in the Related Methods section that you are not familiar with.)

Additional Reading:

Eric Jang, Tips for Training Likelihood Models.
Eric Jang, Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows.
Lilian Weng, Flow-based Deep Generative Models.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio, Density Estimation using Real NVP.
Gustavo Deco and Wilfried Brauer, Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. This is an early forerunner of normalizing flows, with proposed flows that seem to match (volume-preserving) autoregressive flows.
George Ho, Autoregressive Models in Deep Learning - A Brief Survey - an introduction to a variety of deep autoregressive networks.

Questions:

Figure 2 of NICE: Non-linear independent components estimation labels the computation graph of a coupling layer using concepts from cryptography. Explain why this is a suitable metaphor.
Consider a VAE where we use a standard isotropic Gaussian as the prior for the latent variable, and where the conditional $p(x|z) \sim \mathcal{N}(f_\theta(z), I)$. Consider the following perspective on the forward pass through a VAE. The first (encoder) phase takes as input a pair $(x, \epsilon)$, and outputs a pair $(x,z)$ - this can be seen as an affine coupling layer (a la NICE). The second (decoder) phase takes as input the pair $(x,z)$ and outputs the pair $(\varepsilon, z)$ (where $\varepsilon = x - f_{\theta}(z)$ in a sense encodes how $x$ might be generated from $f_{\theta}(z)$ with a change of variables) - this can also be seen as an affine coupling layer. The VAE estimates its parameters by optimizing Monte Carlo estimates of the ELBO with the reparameterization trick, while the normalizing flow estimates its parameters by optimizing the data log-likelihood (assuming isotropic Gaussian priors on $z$ and $\varepsilon$). Considering that in the above, the input data points to the normalizing flow are $(x,\epsilon)$ (and not just $x$), show/convince yourself that these two approaches to estimating the parameters are equivalent.
Suppose one fitted a normalizing flow with a Gaussian base density for some domain. Consider a model using this normalizing flow as an encoder, and the inverse of the flow as a decoder. Discuss the relationships between this model and a VAE (and nonlinear PCA, if you tackled Exercise 6 in the previous section).
Implement NICE in PyTorch using affine coupling layers. Prevent the multiplicative factor in the scaling of each layer from being zero by exponentiating the output of a ReLU MLP. This approach, also used in RealNVP, removes the need for the final scaling layer in NICE. (Additional)
Use your NICE implementation from the previous question (or modify an implementation from online) to allow you to experiment with varying numbers of coupling layers while trying to model some somewhat complicated distributions. If you are doing it from scratch yourself, begin by modelling 2-D distributions, like that in the example at the bottom of https://blog.evjang.com/2018/01/nf1.html, or that from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html, before considering tackling higher-dimensional cases such as MNIST. (Additional)
Consider Table 1 and Figure 3 of Variational inference with normalizing flows. In this setting, we have the (unnormalized) target density, but we do not have samples from the density. Thus we cannot fit a normalizing flow by optimizing the data log-likelihood w.r.t. the flow parameters. Yet Figures 3(b) and 3(c) present results for fitted flows. Can you think of a sensible objective function to fit the parameters of a normalizing flow in this case?

Hint
A Gaussian is a flow with zero transformations - how might you fit a Gaussian to such a distribution?
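As a warm-up for the coupling-layer exercises above, here is a minimal scalar sketch of one affine coupling layer acting on a pair $(x_1, x_2)$. The function `scale_and_shift` is a hypothetical stand-in for the MLP: in a real implementation it would be a trained network, and the inputs would be vectors split into two halves.

```python
import math

def scale_and_shift(x1):
    """Hypothetical stand-in for the coupling MLP (fixed functions, not trained).

    Returns (log_scale, shift), both depending only on the untouched half x1.
    """
    return math.tanh(x1), 0.5 * x1 * x1

def coupling_forward(x1, x2):
    """Affine coupling: x1 passes through; x2 is scaled and shifted using x1."""
    log_scale, shift = scale_and_shift(x1)
    y1 = x1
    y2 = x2 * math.exp(log_scale) + shift
    # The Jacobian is triangular with diagonal (1, exp(log_scale)).
    log_det = log_scale
    return y1, y2, log_det

def coupling_inverse(y1, y2):
    """Exact inverse: recompute the same scale and shift from y1 = x1."""
    log_scale, shift = scale_and_shift(y1)
    return y1, (y2 - shift) * math.exp(-log_scale)

x1, x2 = 0.3, -1.2
y1, y2, log_det = coupling_forward(x1, x2)
x1_rec, x2_rec = coupling_inverse(y1, y2)
print((y1, y2), log_det, (x1_rec, x2_rec))
```

Neither direction requires inverting `scale_and_shift` itself, so both passes cost one evaluation of that function. Exponentiating the log-scale keeps the multiplicative factor strictly positive.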

Solutions

Solutions to these exercises can be found here

6 Normalizing flows for variational inference

Synopsis: We now turn to the main paper considered in this curriculum. The techniques covered so far allow training combined generative and inference networks by stochastic backpropagation. However, the posterior family was generally fairly simple to ensure scalable inference. This paper leverages the normalizing flows considered in the previous section to transform the simple distributions whose parameters were originally output by the inference network to much more complex posterior distributions. As before, computational efficiency of the normalizing flow is essential, but due to the way in which the flows are deployed in the VI setting, the requirements for efficiency differ somewhat from those for the normalizing flows considered above.

Objectives: After this part you should:

understand the idea of using a normalizing flow to obtain a richer family of variational posteriors;
understand why the flow parameters should also be output by the encoder, rather than being learnt separately;
have an appreciation for the different requirements on the flows that are tractable for direct density modelling vs. for use with variational inference; and
understand the decomposition of the inference gap into the approximation and amortization gap, and have some intuition about the effects of the choice of variational posterior family, encoder architecture, and decoder architecture on these gaps.

Topics:

Normalizing flows for variational inference
Understanding the inference gap

Disclaimer:

In the reading for this part, there are a few concepts we have not yet covered - if you are not familiar with them, simply skim over the relevant portions - they are not crucial.

What you should know:

Auxiliary variables (see this section’s optional Section 3.2.1 in An Introduction to Variational Autoencoders) are an alternative technique for adding additional latent variables to a model which allow a richer class of variational posteriors. It can also be combined with normalizing flows.
Annealed importance sampling is an approach that can be used to estimate the marginal likelihood/evidence. The resulting estimate is with high probability a lower bound on the actual marginal likelihood. One can also use the importance weighted autoencoder (IWAE) objective (which we skipped over in Section 2.6 of An Introduction to Variational Autoencoders) as an estimate - this is also a lower bound, which becomes tighter as the number of samples used to calculate it increases.
Real NVP is an extension of NICE which incorporates various enhancements which are particularly appropriate for image data.
Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo approach which uses the mathematics of Hamiltonian dynamics from physics to propose transitions. Hamiltonian dynamics describe motion in terms of kinetic and potential energy. For HMC, the potential energy corresponds to the distribution we wish to sample from, while the kinetic energy helps control how the space is explored. If one views the dynamics in continuous time, the parameters of the potential and kinetic energy will correspond to an infinitesimal flow for the latent variables and auxiliary latent variables, respectively.
Stochastic differential equations can be used to model the evolution of a probability distribution over time.
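To make the remark about the IWAE bound tightening with more samples concrete, the sketch below estimates the bound on a toy model where the true evidence is known in closed form. The model ($p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$, with the prior used as the proposal) is an assumption chosen only so that the marginal $p(x) = \mathcal{N}(0, 2)$ is available for comparison.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def iwae_bound(x, num_samples, rng):
    """One Monte Carlo estimate of the K-sample importance weighted bound.

    Toy model (an assumption for illustration): p(z) = N(0, 1),
    p(x|z) = N(z, 1), and the proposal q(z|x) is simply the prior.
    """
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        # log w = log p(x, z) - log q(z|x); the prior terms cancel here.
        log_weights.append(log_normal(x, z, 1.0))
    m = max(log_weights)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights) / num_samples)

def average_bound(x, num_samples, repeats, seed):
    """Average many independent estimates to approximate the expected bound."""
    rng = random.Random(seed)
    return sum(iwae_bound(x, num_samples, rng) for _ in range(repeats)) / repeats

x = 1.5
true_log_evidence = log_normal(x, 0.0, 2.0)   # the marginal of x is N(0, 2)
elbo = average_bound(x, num_samples=1, repeats=4000, seed=0)
iwae_50 = average_bound(x, num_samples=50, repeats=4000, seed=1)
print(elbo, iwae_50, true_log_evidence)
```

With one sample the estimate reduces to the usual ELBO. With 50 samples the averaged bound should sit much closer to the true log-evidence while still approaching it from below in expectation.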

Required reading:

Normalizing flows for variational inference:

The first reading reviews what is required of the inference network, before presenting the key idea of normalizing flows for variational inference. Pay attention to how the proposed flows keep the required operations efficient. The second reading is the main paper for this curriculum.

Diederik Kingma and Max Welling, An Introduction to Variational Autoencoders, Chapter 3 until the end of Section 3.2 (with Section 3.2.1 optional).
Danilo Rezende and Shakir Mohamed, Variational inference with normalizing flows. (Only skim Section 3.2 and other portions discussing infinitesimal flows.) [Note: Equation (20) has a missing $\beta_t$ coefficient in the last term of the first line.]

Understanding the inference gap:

Chris Cremer, Xuechen Li, and David Duvenaud, Inference Suboptimality in Variational Autoencoders. [Note: In Equation 11, the T’s in the first factor in the denominator of the log should be zeros, and there should be a product over t from 1 to T of the ensuing determinants.]

Additional Reading/Resources:

Ben Lambert, The intuition behind the Hamiltonian Monte Carlo algorithm.
Diederik Kingma and Max Welling, An Introduction to Variational Autoencoders: The rest of Chapter 3 and Chapter 4 give an overview of further developments using amortized VI for deep generative models beyond the introduction of normalizing flows.
David Duvenaud’s University of Toronto course on Differentiable Inference and Generative Models.
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan, Normalizing Flows for Probabilistic Modeling and Inference: a review on the use of normalizing flows in modeling and inference, which came out after completion of the reading group this curriculum was based on.

Questions:

Explain why it is necessary that the flow parameters, and not just the parameters of the base density used in the flow, also be output by the inference network, rather than simply having global parameters for the flow parameters that are optimized.
Hint
How can the latter case be viewed as a regular VAE without a normalizing flow?
What is the impact of having the encoder output the flow parameters on using the trained model as a generative model, i.e. for sampling new observations, compared to a VAE?
Reproduce figures similar to those in Figure 1 of Variational inference with normalizing flows with your own implementation. (Additional)
Two key aims of general generative models are density estimation and sampling. In normalizing flow models for density estimation, we need to evaluate $p(x)$ for any potential choice of $x$. This requires that it be efficient to move from the observation space to the latent space, where the base density can be evaluated, i.e. efficient inference. In sampling, we wish to efficiently move from the latent space to the observation space. Requiring both of these operations be efficient constrains the choice of possible flows - in general, one must sacrifice efficiency in one of these tasks, or have an easily invertible flow (such as in NICE). The planar and radial flows used for variational inference in the main paper are not easily invertible, yet we can efficiently perform the sampling and density estimation that we require.

a. Explain how this is achieved in light of which “observations” we perform density estimation on.

b. How does this influence the choice of flows we can use for variational inference compared to those where we require general efficient density estimation?
Implement VI with NFs, and experiment with your implementation. (Additional)
An Introduction to Variational Autoencoders points out that the change to $z$ in planar flows can be viewed as a single-hidden-layer multi-layer perceptron (MLP) with a single hidden unit, and says this “does not scale well to a high-dimensional latent space: since information goes through the single bottleneck, a long chain of transformations is required to capture high-dimensional dependencies.” One way to tackle this is to change the MLP to have more hidden units.

a. Give the resulting modified formula for these generalized flows.

b. Note that one can no longer use the vanilla form of the matrix determinant lemma to calculate the determinant of this generalized transformation’s Jacobian. Fortunately, there is a generalized matrix determinant lemma which enables us to calculate the determinant. Write down the determinant, and specify the order complexity of calculating it in terms of the number of hidden units. (As with planar flows, not all such flows will be invertible. Sylvester normalizing flows arise as special forms of the above transformations where one obtains invertibility based on specific assumed forms for the weight matrices in the MLP - note that these forms also need to be maintained throughout training.) (Additional)
Inequality 12 of Inference Suboptimality in Variational Autoencoders gives the IWAE lower bound on the marginal likelihood. Derive this result by using Jensen’s inequality after using $q(z|x)$ as a proposal distribution for importance sampling from $p(z|x)$. (If you are not familiar with importance sampling, the relevant formula (with $q$ as proposal for $p$) is the second one on this page.) (Additional)
How do you think the authors might have gotten the “true posteriors” in Figure 2 of Inference Suboptimality in Variational Autoencoders?
Try to explain in your own words the issue of encoder overfitting discussed in Section 5.5.1 of Inference Suboptimality in Variational Autoencoders, and when you should prefer using flows to increase the complexity of the variational approximation to increasing the expressiveness of the encoder.
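The questions above refer to planar flows and the matrix determinant lemma. As a sanity check on that lemma, the sketch below computes the log-determinant of a single planar transformation $f(z) = z + u \tanh(w^\top z + b)$ in two ways: with the lemma, which needs only dot products, and by brute-force finite differences in two dimensions. The parameter values are arbitrary assumptions chosen so that the transformation is invertible ($u^\top w \geq -1$).

```python
import math

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w . z + b), with z, u, w given as lists of floats."""
    activation = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    return [zi + ui * activation for zi, ui in zip(z, u)]

def planar_log_abs_det(z, u, w, b):
    """log|det J| via the matrix determinant lemma: det(I + u psi^T) = 1 + u . psi."""
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    h_prime = 1.0 - math.tanh(pre) ** 2          # derivative of tanh
    u_dot_psi = h_prime * sum(ui * wi for ui, wi in zip(u, w))
    return math.log(abs(1.0 + u_dot_psi))

def finite_difference_log_abs_det_2d(z, u, w, b, eps=1e-6):
    """Brute-force check in 2-D: estimate the Jacobian column by column."""
    columns = []
    for k in range(2):
        plus, minus = list(z), list(z)
        plus[k] += eps
        minus[k] -= eps
        f_plus = planar_flow(plus, u, w, b)
        f_minus = planar_flow(minus, u, w, b)
        columns.append([(p - m) / (2.0 * eps) for p, m in zip(f_plus, f_minus)])
    # columns[k][i] approximates d f_i / d z_k.
    det = columns[0][0] * columns[1][1] - columns[1][0] * columns[0][1]
    return math.log(abs(det))

# Arbitrary illustrative parameters (assumptions), with u . w = -0.1 >= -1.
z, u, w, b = [0.4, -0.7], [0.5, 0.3], [1.0, -2.0], 0.1
lemma_value = planar_log_abs_det(z, u, w, b)
brute_value = finite_difference_log_abs_det_2d(z, u, w, b)
print(lemma_value, brute_value)
```

The lemma-based version costs time linear in the dimension of $z$, whereas forming and factorizing a full Jacobian would cost cubic time in general.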

Solutions

Solutions to these exercises can be found here

Resurrecting the Sigmoid: Theory and Practice

2020-04-07T10:00:00+00:00

[Editor’s Note: This class was a part of the 2019 DFL Jane Street Fellowship.]

This guide would not have been possible without the help and feedback from many people.

Special thanks to Yasaman Bahri for her feedback, support, and mentoring.

Thank you to Kumar Krishna Agrawal, Sam Schoenholz, and Jeffrey Pennington for their valuable input and guidance.

Finally, thank you to our group members Chris Akers, Brian Friedenberg, Sajel Shah, Vincent Su, Witold Szejgis, for their curiosity, commitment to the course material, and feedback on the curriculum.

Concept dependency graph. Click to navigate.

Why

As deep networks continue to make progress in a variety of tasks such as vision and language processing, it is important to understand how to properly train very deep networks with gradient-based methods. This paper studies, from a rigorous theoretical perspective, which combinations of network weight initializations and network activation functions result in trainable deep networks. The analysis framework used is broadly applicable to general network architectures.

In this curriculum, we will go through all the background topics necessary to understand this mathematically heavy paper. By the end, you will have an understanding of the dynamics of signal propagation in very wide neural networks, as well as an introduction to random matrix theory.

General resources

This paper is founded upon Random Matrix Theory (RMT) and mean-field analysis of signal propagation. The first resource below is a friendly introduction to RMT, while the second and third are the papers in which the mean-field analysis for deep neural networks was developed. These are good resources to return to throughout the course. For deep learning, we recommend Goodfellow et al., listed as the fourth resource. Finally, the course outline is listed below in case you want an offline copy.

Livan, Novaes & Vivo: Introduction to Random Matrices - Theory and Practice.
Poole, Lahiri, Raghu, Sohl-Dickstein & Ganguli: Exponential expressivity in deep neural networks through transient chaos.
Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein: Deep information propagation.
Goodfellow, Bengio & Courville: Deep Learning.
Course Outline.

1 Introduction [to Trainability].

Motivation: The paper we will study here is part of a body of work with the broad goal of understanding which combinations of network architecture and initialization allow a neural network to be trained with gradient-based methods. This week, you will read about this problem, specifically its manifestation in deep neural networks.

We also suggest that you skim the paper itself, focusing on the introductory sections, to understand the relevance of vanishing/exploding gradients to the trainability of neural networks.

Objectives: Understand the following background.

Explain the vanishing/exploding gradient problem and why it worsens with network depth.
Relate vanishing/exploding gradients to the spectra of various Jacobians.
Understand heuristics used by the community to circumvent the vanishing/exploding gradients, e.g.
- Common initialization schemes, such as Xavier initialization.
- Pre-training.
- Skip connections / residual neural networks.
- Non-saturating activation functions (ReLU and its variants).
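To see why depth makes vanishing gradients worse, the sketch below backpropagates through a scalar chain $x_{l+1} = \mathrm{sigmoid}(w x_l)$. The scalar chain and the choice $w = 1$ are simplifying assumptions made for illustration; in a real network each factor is a Jacobian matrix rather than a number.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(depth, weight, x0=0.5):
    """d(output)/d(input) for the scalar chain x_{l+1} = sigmoid(weight * x_l)."""
    x, grad = x0, 1.0
    for _ in range(depth):
        x = sigmoid(weight * x)
        # Chain rule: each layer contributes the factor weight * sigmoid'(pre),
        # and sigmoid'(pre) = x * (1 - x) is at most 0.25.
        grad *= weight * x * (1.0 - x)
    return grad

for depth in (1, 5, 20, 50):
    print(depth, chain_gradient(depth, weight=1.0))
```

Because every factor is at most 0.25 when the weight is 1, the gradient shrinks at least geometrically with depth. Products of factors larger than one would instead grow geometrically, which is the exploding case.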

We would also like you to have an overview of the paper’s structure and the problem the paper is trying to solve - how to concentrate the entire spectrum of the network’s Jacobian around unity. Understand why mean-field signal propagation analysis and random matrix theory are necessary for this task.

Topics:

Trainability of networks, specifically the vanishing/exploding gradient problem.
Introduction to the paper and course overview.

Required Reading

Prerequisite: For familiarity with deep learning, please read the following sections from the Deep Learning book.

Preliminaries:

2.7 (Eigendecomposition).
2.8 (Singular value decomposition).
3.2 (Random variables).
3.3 (Probability distributions).
3.7 (Independence and conditional independence).
3.8 (Expectation, variance, and covariance).
5.7 (Supervised learning algorithms).

Initialization:

8.2 (Challenges in neural network optimization).
8.4 (Parameter initialization strategies).

Other:

All you need is a good init by Mishkin et al., sections 1 and 2.
Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio.
Wikipedia article on residual networks (skip connections).

Optional Reading:

Deep Residual Learning for Image Recognition.
Depth First Learning: NeuralODEs, section 3 on ResNets.

2 Signal propagation

Motivation: The Resurrecting the Sigmoid paper relies on signal propagation in wide neural networks. Understanding this framework connects us to more recent investigations of neural networks as Gaussian processes and the neural tangent kernel (Jacot et al., 2018 and Lee et al., 2019).

Topics:

Mean-field analysis of signal propagation in deep neural networks.

Required Reading:

Since this analysis is relatively new, the main sources of information online are the original papers in which it was developed, namely:

Poole, Lahiri, Raghu, Sohl-Dickstein & Ganguli: Exponential expressivity in deep neural networks through transient chaos (Sections 1, 2, and 3).
Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein: Deep information propagation (Sections 1, 2, 3, and 5).

These are very useful references, but not necessarily pedagogical for those unfamiliar with the field. The problem set below is designed to walk you through understanding the formalism in a self-contained manner. We strongly suggest doing the problem set before reading the papers above, and consulting them only afterwards or for reference. Note that certain problems point to sections of the above papers for hints.

Optional Reading:

Once you understand the mean-field analysis framework, you will have a good foundation for the following papers. These are ‘bonus’ and not connected to the target paper.

Questions:

This week’s problem set is here. In this section we highlight a couple of the problems.

Problem 2: The mean field approximation.

In this problem, we use the knowledge we gained in problem 1 to properly choose to initialize the weights and biases according to $W^l \sim \mathcal{N}(0, \sigma_w^2/N)$ and $b^l \sim \mathcal{N}(0, \sigma_b^2)$. We’ll investigate some techniques that will be useful in understanding precisely how the network’s random initialization influences what the net does to its inputs; specifically, we’ll be able to take a look at how the depth of the network together with the initialization governs the propagation of an input point as it flows forward through the network’s layers.
1. A natural property to study in a network is the length of its activation vectors. Intuitively, this is closely related to how the net transforms the input space, and to how its depth relates to that transformation. Compute the length $q^l$ of the activation vector output by layer $l$. When considering non-rectangular nets, where layer $l$ has width $N_l$, we want to distinguish this activation norm from the width of individual layers. What’s a more appropriate quantity we can track to understand how the lengths of activation vectors change in the net?
  Solution
  
  The length is simply the squared Euclidean norm, i.e. $\sum_{i = 1}^N (h_i^l)^2$. We can stabilize this quantity, especially when $N$ differs across layers, by normalizing: $$ q^l = \frac{1}{N_l} \sum_{i = 1}^{N_l} (h_i^l)^2 $$
2. What probabilistic quantity of the neuronal activations does $q^l$ approximate (with the approximation improving for larger $N$)?
  Hint
  Recall that all neuronal activations $h^l_i$ are zero-mean, and consider the definition of $q^l$ from part 1 in terms of the empirical distribution of $h^l_i$.
  
  Solution
  
  $q^l$ is the second moment of the empirical distribution of layer $l$ activations, and hence approximates the variance. Indeed, as $N \to \infty$, the empirical average can be written $q^l = \mathbb{E} \left( (h^l_i)^2 \right) = \text{Var}(h^l_i)$.
3. Calculate the variance of an individual neuron’s pre-activations, that is, the variance of $h_i^l$. Your answer should be a recurrence relation, expressing this variance in terms of $h^{l-1}$ (and the parameters $\sigma_w$ and $\sigma_b$).
  Solution
  
  Because the means of both the weight and bias distributions are zero, to calculate the variance we just need to calculate the second moment. We can use the fact that the weights and biases are initialized independently, so that the cross term vanishes and the variance of $h_i^l$ is the sum of a weight term and a bias term: $$ \begin{align*} \langle (h_i^l)^2 \rangle &= \left\langle \left( \sum_j W_{ij}^l x_j^{l-1} + b_i^l \right) ^2 \right\rangle \\ &= \left\langle \sum_{jj'} W_{ij}^l W_{ij'}^l x_j^{l-1} x_{j'}^{l-1} \right\rangle + \langle (b_i^l)^2 \rangle\\ &= \left\langle \sum_{jj'} W_{ij}^l W_{ij'}^l x_j^{l-1} x_{j'}^{l-1} \right\rangle + \sigma_b^2\\ &= \frac{\sigma_w^2}{N} \sum_j \langle (x_j^{l-1})^2 \rangle + \sigma_b^2\\ &= \sigma_w^2 \langle (x^{l-1})^2 \rangle + \sigma_b^2\\ &= \sigma_w^2 \langle \phi(h^{l - 1})^2 \rangle + \sigma_b^2 \end{align*} $$ The fourth line uses $\langle W_{ij}^l W_{ij'}^l \rangle = \delta_{jj'} \sigma_w^2 / N$ together with the independence of the layer-$l$ weights from the previous layer's activations.
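The recurrence just derived can be checked by simulation. The sketch below fixes a vector of previous-layer activations, repeatedly draws one row of weights and a bias, and compares the empirical second moment of the pre-activation with $\sigma_w^2 \frac{1}{N} \sum_j (x_j^{l-1})^2 + \sigma_b^2$. Conditioning on a fixed previous layer, the layer width of 100, and the parameter values are all assumptions made to keep the example small.

```python
import math
import random

def simulate_preactivation_variance(x_prev, sigma_w, sigma_b, draws, seed=0):
    """Empirical variance of h = sum_j W_j x_j + b over random draws of (W, b)."""
    rng = random.Random(seed)
    n = len(x_prev)
    weight_std = sigma_w / math.sqrt(n)      # W_j ~ N(0, sigma_w^2 / N)
    total = 0.0
    for _ in range(draws):
        h = rng.gauss(0.0, sigma_b)          # b ~ N(0, sigma_b^2)
        for xj in x_prev:
            h += rng.gauss(0.0, weight_std) * xj
        total += h * h                       # zero mean, so variance = second moment
    return total / draws

def predicted_variance(x_prev, sigma_w, sigma_b):
    """sigma_w^2 * (1/N) sum_j x_j^2 + sigma_b^2, matching the derivation above."""
    mean_square = sum(xj * xj for xj in x_prev) / len(x_prev)
    return sigma_w ** 2 * mean_square + sigma_b ** 2

# Hypothetical previous-layer activations: tanh of standard normal draws.
rng = random.Random(1)
x_prev = [math.tanh(rng.gauss(0.0, 1.0)) for _ in range(100)]
empirical = simulate_preactivation_variance(x_prev, sigma_w=1.5, sigma_b=0.5, draws=5000)
predicted = predicted_variance(x_prev, sigma_w=1.5, sigma_b=0.5)
print(empirical, predicted)
```

The two numbers should agree up to Monte Carlo error of a few percent with 5000 draws.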
4. Now consider the limit that the number of hidden neurons, $N$, approaches infinity. Use the central limit theorem to argue that in this limit, the pre-activations will be zero-mean Gaussian distributed. Be explicit about the conditions under which this result holds.
  Solution
  
  The basic idea here is to use the central limit theorem (CLT), since the pre-activation is a sum of a large number of random variables, i.e.: $$ h_i^l = \sum_j^N W_{ij}^l x_j^{l-1} + b_i^l $$ There are $N$ terms in the sum, so as $N$ goes to infinity, we should have a sum of a large number of random variables which should be well-approximated by a Gaussian. However, there are a few things we need to be careful of:
  1. The CLT can show that the sum $\sum_j^N W_{ij}^l x_j^{l-1}$ is Gaussian-distributed, but there is still the bias term $b_i^l$. So we do have to assume that the bias term is Gaussian-distributed as well.
  2. In order to use the CLT, we need each of the variables being added to have finite variance. These individual variables are $W_{ij}^{l}x_j^{l-1}$. By construction the weights have finite variance; what about the previous layer's activations, $x_j^{l-1}$? Unless the activation function $\phi$ is pathological, if we assume that the previous-layer pre-activations have finite variance, there should not be a problem here. In fact, if we just assume that the input distribution, i.e. $x^0$, has finite variance, all the layers' activations do too. Certainly the commonly used activation functions (sigmoid, ReLU, etc.) cannot turn a finite-variance sample of pre-activations into an infinite-variance sample of activations.
  3. In order to use the CLT, we also need each of the variables being added to have identical distributions. This is true by symmetry.
  4. The final condition for use of the CLT is that the variables being added are all independent. Taking another look at the definition of $q^l$, $$ q^l = \frac{1}{N} \sum_{i = 1}^N (h_i^l)^2 $$ we want to show that each $h_i^l$ is independent (from which the independence of their squares follows). Each $h_i^l$ is in turn defined as $$ h_i^l = \sum_{j = 1}^N W^l_{ij} \phi(h^{l - 1}_j) + b^l_i $$ By assumption, $ W^l_{ij} $ and $b^l_i$ are independent from each other and, over all $i, j$, from any quantities in previous layers, including $\phi(h^{l - 1}_j)$. But are the $W^l_{ij}$ independent of the $h^{l - 1}_j$? To justify this, observe that we can view the sum above as a linear combination of the random variables $W^l_{ij}$; even though, technically, the linear combination is also over random variables $\phi(h^{l - 1}_j)$, the key is that over $1 \leq i \leq N$, all the $h^{l - 1}_j$'s are the same. In other words, each neuronal activation in layer $l$ depends on the same exact realization of the random variables that are the activations of the previous layer. So, $h^l_i$ is essentially a linear combination of the (independent) $W^l_{ij}$ with deterministic weights, at least with respect to $i$. So, we can justify the use of the CLT in analyzing $\lim_{N \to \infty} q^l$.
5. With this zero-mean Gaussian approximation of the pre-activations, we have a single parameter characterizing this aspect of signal propagation in the net: the variance, $q^l$, of individual neuronal activations (a proxy for squared activation vector lengths). Let’s now look at how this variance changes from layer to layer, by deriving the relationship between $q^l$ and $q^{l - 1}$. In part 3, your answer should have included a term $\langle (x^{l-1})^2 \rangle$. In terms of the activation function $\phi$ and the variance $q^{l-1}$, write this expectation value as an integral over the standard Gaussian measure.
  Solution
  
  Since $x_i^{l-1} = \phi(h_i^{l-1})$, we can write the variance $\langle (x^{l-1})^2 \rangle$ as $$ \begin{align*} \langle (x^{l-1})^2 \rangle &= \langle \phi(h^{l-1})^2 \rangle \\ &= \int_\mathbb{R}~dx ~\phi(x)^2 ~p_{h^{l-1}}(x), \end{align*} $$ where $p_{h^{l-1}}(x)$ is the pdf of the pre-activations $h^{l-1}$. By assumption this is a zero-mean Gaussian of variance $q^{l-1}$, i.e. $$ p_{h^{l-1}}(x) = \frac{1}{\sqrt{2\pi q^{l-1}}} e^{-\frac{x^2}{2q^{l-1}}} $$ This can be written in terms of the standard Gaussian distribution $\rho(x)$ via the change of variables $$ p_{h^{l-1}}(x) = \frac{1}{\sqrt{q^{l-1}}} \rho(x/\sqrt{q^{l-1}}) $$ meaning that the variance $\langle (x^{l-1})^2 \rangle$ becomes $$ \langle (x^{l-1})^2 \rangle = \frac{1}{\sqrt{q^{l-1}}} \int_\mathbb{R}~dx ~\phi(x)^2 ~\rho(x/\sqrt{q^{l-1}}) $$ Let $y = x/\sqrt{q^{l-1}}$, then $$\langle (x^{l-1})^2 \rangle =\int_\mathbb{R}~dy ~\phi(y\sqrt{q^{l-1}})^2 ~\rho(y).$$
6. Use this result to write a recursion relation for $q^l$ in terms of $q^{l-1}$, $\sigma_w$, and $\sigma_b$.
  Solution
  
  We just plug in, to get $$ q^l = \sigma_w^2 \int_\mathbb{R}~dy ~\phi(y\sqrt{q^{l-1}})^2 ~\rho(y) + \sigma_b^2 $$
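The recursion for $q^l$ is easy to iterate numerically. The sketch below approximates the Gaussian integral with a trapezoid rule on a truncated interval; the helper names, the integration grid, and the parameter values are assumptions for illustration, and any accurate quadrature would serve equally well.

```python
import math

def gaussian_expectation(g, num_points=4001, limit=8.0):
    """Approximate the integral of g(y) * rho(y) dy, with rho the standard
    normal pdf, using the trapezoid rule on [-limit, limit]."""
    step = 2.0 * limit / (num_points - 1)
    total = 0.0
    for k in range(num_points):
        y = -limit + k * step
        weight = 0.5 if k in (0, num_points - 1) else 1.0
        total += weight * g(y) * math.exp(-0.5 * y * y)
    return total * step / math.sqrt(2.0 * math.pi)

def length_map(q_prev, sigma_w, sigma_b, phi=math.tanh):
    """q^l = sigma_w^2 * E[phi(sqrt(q^{l-1}) * y)^2] + sigma_b^2, y ~ N(0, 1)."""
    root = math.sqrt(q_prev)
    return sigma_w ** 2 * gaussian_expectation(lambda y: phi(root * y) ** 2) + sigma_b ** 2

def propagate(q0, depth, sigma_w, sigma_b, phi=math.tanh):
    """Return the list [q^0, q^1, ..., q^depth]."""
    qs = [q0]
    for _ in range(depth):
        qs.append(length_map(qs[-1], sigma_w, sigma_b, phi))
    return qs

qs = propagate(q0=2.0, depth=10, sigma_w=1.5, sigma_b=0.3)
print(qs)
```

Passing a different `phi` changes the nonlinearity. With the identity function the map reduces to $\sigma_w^2 q^{l-1} + \sigma_b^2$, which gives a convenient check on the quadrature.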
Problem 3: Fixed points and stability.

In the previous problem, we found a recurrence relation relating the length of a vector at layer $l$ of a network to the length of the vector at the previous layer, $l-1$ of the network. In this problem, we are interested in studying the properties of this recurrence relation. In the Resurrecting the sigmoid paper, the results of this problem are used to understand at which bias point to evaluate the Jacobian of the input-output map of the network.

Note that in this problem, we are just taking the recurrence relation as a given, i.e. we do not need to worry about random variables or probabilities; all of that went into determining the recurrence relation. Instead, we’ll use tools from the theory of dynamical systems to investigate the properties - in particular, the asymptotics - of this recurrence relation.
1. A simple example of a dynamical system is a recurrence defined by some initial value $x_0$ and a relation $x_n = f(x_{n-1})$ for all $n>0$. This system defines the resulting sequence $x_n$. Sometimes, these systems have fixed points, which are values $x^*$ such that $f(x^*) = x^*$. If the value of the system, $x_m$, at some time-step $m$, happens to be a fixed point $x^*$, what is the subsequent evolution of the system?
  Solution
  
  Since $f(x^*) = x^*$, for all times greater than $m$, the system simply stays at $x^*$.
2. For the recurrence relation you derived in the previous problem, what is the equation which a fixed-point of the variance, $q^*$, must satisfy? Under some conditions (i.e. for some values of $\sigma_w$ and $\sigma_b$), the value $q^*=0$ is a fixed point of the system. What are these conditions?
  Solution
  
  A fixed point has to satisfy $$ q^* = \sigma_w^2 \int_\mathbb{R} \phi \left( \rho \sqrt{q^*} \right)^2 \text{ d}\rho + \sigma_b^2 $$ where $ \text{d}\rho $ is the standard Gaussian measure. If $\sigma_b = 0$, i.e. there is no bias term, and the nonlinearity has a zero y-intercept, then there is a trivial fixed point of $q^* = 0$.
3. Now let us be concrete, and look at the recurrence relation in the special case of a nonlinearity $\phi(h)$ which is both monotonically increasing and satisfies $\phi(0) = 0$. Note that both of the nonlinearities considered in the paper we are studying, the $\tanh$ and ReLU nonlinearities, satisfy this property. Show that those two properties (monotonicity and $\phi(0)=0$) imply that the length map $q^l(q^{l-1})$ is monotonically increasing. What is the maximum number of times any concave function can intersect the line $y = x$? What does this imply about the number of fixed points the length map $q^l(q^{l-1})$ can have?
  Solution
  
  To prove that the function is monotonically increasing with its argument $q$, we take the derivative: $$ \begin{align*} f(q) &= \sigma_w^2 \int_\mathbb{R}~\phi(\rho\sqrt{q})^2~d\rho + \sigma_b^2 \\ f'(q) &= \frac{\sigma_w^2}{\sqrt{q}} \int_\mathbb{R}~\phi(\rho\sqrt{q}) \phi ' (\rho\sqrt{q}) \rho d\rho \end{align*} $$ The integrand is non-negative: by assumption $\phi '$ is non-negative everywhere, and $\phi(\rho\sqrt{q}) \rho$ is also non-negative everywhere, because a monotonically increasing $\phi$ with $\phi(0) = 0$ has the same sign as its argument. So the function is monotonically increasing. Note that since a fixed point is defined as a point, $x^*$, such that $f(x^*) = x^*$, graphically the fixed point can be found from the intersection of the length map $q^l(q^{l-1})$ with the line $y = x$. If you think about the definition of a concave function (specifically, the version of the definition which states that between any two points $x=a$ and $x=b$, the graph of the function must lie above the line through $(a, f(a))$ and $(b, f(b))$), you will realize that a concave function cannot intersect any line more than twice. Thus, concavity implies that the function can have at most two fixed points.
4. Let’s be concrete now and consider the nonlinearity to be a ReLU. Compute (analytically) the length map $q^l = f(q^{l-1})$, which will also depend on $\sigma_w$ and $\sigma_b$. For what values of $\sigma_w$ and $\sigma_b$ does the system have fixed point(s)? How does the value of the fixed point depend on $\sigma_w$ and $\sigma_b$?
  Solution
  
  Starting from $$ f(q) = \sigma_w^2 \int_\mathbb{R}~\phi(\rho\sqrt{q})^2~d\rho + \sigma_b^2 $$ and explicitly inserting the nonlinearity $\phi$ gives $$ f(q) = \sigma_w^2 \int_0^\infty \rho^2 q~d\rho + \sigma_b^2 $$ Note that since the ReLU nonlinearity is zero when the argument is non-positive and just the identity function when the argument is greater than zero, we can take its effect into account simply by changing the above limits of integration so that we only integrate over the region in which the argument is positive. Now we can pull $q$ out of the integral, $$ f(q) = q \sigma_w^2 \int_0^\infty \rho^2 ~d\rho + \sigma_b^2 $$ and to evaluate the integral, note that by symmetry of the Gaussian distribution, it's half of what it would be if we had the limits from $-\infty$ to $\infty$, in which case it would just be the variance of a standard Gaussian, and so $$ f(q) = q \frac{\sigma_w^2}{2} + \sigma_b^2 $$ The important thing to note here is that because $f(q)$ is a simple linear function, there is at most a single fixed point of the system (apart from the degenerate case discussed below). If $\sigma_b^2$ is zero, that fixed point is at $q=0$. If $\sigma_b^2 > 0$, then there is a fixed point only if $\sigma_w < \sqrt{2}$, in which case solving $q = q\sigma_w^2/2 + \sigma_b^2$ gives $q^* = \sigma_b^2 / (1 - \sigma_w^2/2)$; the fixed point grows with $\sigma_b$ and diverges as $\sigma_w \to \sqrt{2}$. Otherwise, the system does not have any fixed point. This is a qualitative difference from the $\tanh$ case, in which there is always a fixed point. A slightly strange case is when $\sigma_w = \sqrt{2}$ exactly, and $\sigma_b = 0$. In this case, the recurrence relation gives $q^l(q^{l-1}) = q^{l-1}$, meaning that every point is a fixed point.
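  As a quick numerical sanity check of the linear form above, the following sketch evaluates the Gaussian average with Gauss-Hermite quadrature and compares it with $q\sigma_w^2/2 + \sigma_b^2$. It assumes, as in the derivation, that the integral over $\rho$ is taken with respect to the standard Gaussian measure. The function names and parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

# Gauss-Hermite nodes and weights for the weight function exp(-x^2 / 2).
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(80)
WEIGHTS = WEIGHTS / np.sqrt(2 * np.pi)  # turn sums into standard Gaussian averages

def relu_length_map_numeric(q, sigma_w, sigma_b):
    # f(q) = sigma_w^2 * E[relu(sqrt(q) * rho)^2] + sigma_b^2, with rho ~ N(0, 1)
    relu = np.maximum(np.sqrt(q) * NODES, 0.0)
    return sigma_w**2 * np.sum(WEIGHTS * relu**2) + sigma_b**2

def relu_length_map_analytic(q, sigma_w, sigma_b):
    # Closed form derived above.
    return q * sigma_w**2 / 2 + sigma_b**2

def relu_fixed_point(sigma_w, sigma_b):
    # Solves q = q * sigma_w^2 / 2 + sigma_b^2; only meaningful for sigma_w < sqrt(2).
    return sigma_b**2 / (1 - sigma_w**2 / 2)

# Illustrative parameter values (assumptions for this sketch).
q, sigma_w, sigma_b = 1.7, 1.2, 0.3
print(relu_length_map_numeric(q, sigma_w, sigma_b))
print(relu_length_map_analytic(q, sigma_w, sigma_b))
print(relu_fixed_point(sigma_w, sigma_b))
```

  The two printed length-map values should agree to numerical precision, and applying the length map to the printed fixed point should return the same value.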
5. Now let’s consider the sigmoid nonlinearity $\phi(h) = \tanh(h)$. In this case the length map cannot be computed analytically, but it can be done numerically. Numerically plot the length map, $q^l=f(q^{l-1})$, for a few values of $\sigma_w$ and $\sigma_b$ in the following regimes: (i) $\sigma_b=0$ and $\sigma_w < 1$, (ii) $\sigma_b = 0$ and $\sigma_w > 1$, and (iii) $\sigma_b > 0$. Describe qualitatively the fixed points of the map in each regime.
  Solution
  
  The following Python code should work:
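```
import numpy as np
import scipy.integrate as integrate

def integrand(x, q):
    # Standard Gaussian density times tanh(sqrt(q) * x)^2.
    gaussian = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    return np.tanh(x * np.sqrt(q))**2 * gaussian

def fint(q):
    # Gaussian average of tanh^2, integrated over the whole real line.
    result = integrate.quad(integrand, -np.inf, np.inf, args=(q,))
    return result[0]

def lengthmap(q, sigma_w, sigma_b):
    # One application of the length map, q^l = f(q^{l-1}).
    return sigma_w**2 * fint(q) + sigma_b**2
```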
  The behavior that should be seen is the following, as described in the transient chaos paper (ignoring the parts about stability, which we have not covered yet; see the next part of the problem): _For $\sigma_b = 0$ and $\sigma_w < 1$, the only intersection is at $q^*=0$. In this bias-free, small weight regime, the network shrinks all inputs to the origin. For $\sigma_w > 1$ and $\sigma_b = 0$, the $q^*=0$ fixed point becomes unstable and the length map acquires a second nonzero fixed point, which is stable. In this bias-free, large weight regime, the network expands small inputs and contracts large inputs. Also, for any nonzero bias $\sigma_b$, the length map has a single stable non-zero fixed point. In such a regime, even with small weights, the injected biases at each layer prevent signals from decaying to 0._
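  One way to locate the fixed points in each regime without reading them off a plot is to apply the length map repeatedly, which mimics passing a signal through many layers. The sketch below is a self-contained variant that uses Gauss-Hermite quadrature instead of `scipy.integrate.quad`; the helper names, the starting value $q^0 = 1$, and the specific $(\sigma_w, \sigma_b)$ pairs are assumptions chosen for illustration.

```python
import numpy as np

# Gauss-Hermite quadrature for averages over a standard Gaussian variable.
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(100)
WEIGHTS = WEIGHTS / np.sqrt(2 * np.pi)

def tanh_length_map(q, sigma_w, sigma_b):
    # f(q) = sigma_w^2 * E[tanh(sqrt(q) * rho)^2] + sigma_b^2, with rho ~ N(0, 1)
    average = np.sum(WEIGHTS * np.tanh(np.sqrt(q) * NODES)**2)
    return sigma_w**2 * average + sigma_b**2

def iterate_length_map(q0, sigma_w, sigma_b, n_layers=200):
    # Apply the length map n_layers times, starting from q0.
    q = q0
    for _ in range(n_layers):
        q = tanh_length_map(q, sigma_w, sigma_b)
    return q

# One example per regime: (i) small weights, (ii) large weights, (iii) nonzero bias.
for sigma_w, sigma_b in [(0.5, 0.0), (2.0, 0.0), (0.5, 0.3)]:
    q_final = iterate_length_map(1.0, sigma_w, sigma_b)
    print(f"sigma_w={sigma_w}, sigma_b={sigma_b}: q after 200 layers = {q_final:.6f}")
```

  Because iteration only converges to stable fixed points, this finds $q^*=0$ in regime (i) and the nonzero fixed point in regimes (ii) and (iii); it does not reveal the unstable fixed point at the origin in regime (ii).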
6. Let’s now talk about the stability of fixed points. In a dynamical system, once the system reaches (or starts at) a fixed point, by definition it can never leave. But what happens if the system gets or starts near a fixed point? In real physical systems, this question is very relevant because physical systems almost always have some noise which pushes the system away from a fixed point. In general, the fixed point can be either stable or unstable. For a stable fixed point, initializing the system near the fixed point will result in behavior which converges to the fixed point, i.e., the magnitude of the perturbation away from the fixed point shrinks. Conversely, for an unstable fixed point, the system initialized nearby will be repelled from the fixed point. Use the derivative of the length map at a fixed point to derive conditions on the stability of the fixed point.
  Solution
  
  If the absolute value of the derivative $\frac{df}{dx}$, evaluated at the fixed point $x^*$, is less than $1$, then the system is stable. This can be seen from considering initializing the system near the fixed point, say at $x^* + \Delta x$. After going through the length map, the value will be $$ \begin{aligned} f(x^* + \Delta x) &\approx f(x^*) + f'(x^*) \Delta x \\ &= x^* + f'(x^*) \Delta x \end{aligned} $$ So the deviation from the fixed point $x^*$ has changed to $f'(x^*) \Delta x$. If the magnitude of $f'(x^*)$ is less than $1$, then the magnitude of this deviation is lower than $\Delta x$, the system is getting closer to the fixed point, and the fixed point is said to be stable. Conversely, if the magnitude of $f'(x^*)$ is greater than $1$, then the deviations away from equilibrium grow, and the equilibrium is unstable.
7. With this understanding of stability, revisit your result in part (e) for the $\tanh$ nonlinearity. Specifically, discuss the stability of the fixed points in each of the three regimes. You can estimate the derivative of the length map by looking at the graphs.
  Solution
  
  See the italicized paragraph in the solutions above, from the transient chaos paper. In regime (i), there is a single fixed point, $q^*=0$, and it is stable. In regime (ii), there are two fixed points, $q^*=0$ (unstable) and some other positive value (stable), and in regime (iii), there is only a positive fixed point, which is stable.
8. Do the same stability analysis for the ReLU network.
  Solution
  
  In the $\sigma_b = 0$ case, where the only fixed point is at $q=0$, that point is stable if $\sigma_w < \sqrt{2}$ (because then the slope of the line is less than unity) and unstable if $\sigma_w > \sqrt{2}$. Even for non-zero $\sigma_b$, the fixed point (which will now be non-zero) is stable if $\sigma_w < \sqrt{2}$. The slightly strange case is when $\sigma_w = \sqrt{2}$ exactly, and $\sigma_b = 0$. In this case, the recurrence relation gives $q^l(q^{l-1}) = q^{l-1}$, meaning that every point is a fixed point. In this case, the fixed points are neither stable nor unstable, since perturbations from them will neither grow nor shrink.
9. (Optional) You should have found above that both the ReLU and $\tanh$ systems never had more than one stable fixed point. Show that this is a consequence of the concavity of the length map.
  Hint
  You can just draw a picture for this one. Consider using the fact that the length map is concave, which we discussed in part c).
  
  Solution
  
  Having two stable fixed points would mean having two intersection points with the line $y=x$ at which the slope of the function is less than unity. But this means that in both cases we approach the function from above, which means that there must have been a third intersection point in the middle. But we already proved that because of the concavity of the length map, the system can have at most two fixed points. [stable fixed point]
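  The stability criterion used throughout this problem, that a fixed point $x^*$ is stable when $|f'(x^*)| < 1$ and unstable when $|f'(x^*)| > 1$, can be checked numerically with a finite-difference derivative. The sketch below applies it to the linear ReLU length map $f(q) = q\sigma_w^2/2 + \sigma_b^2$; the function names, the step size, and the tolerance used to label the borderline case are assumptions made for illustration.

```python
import math

def derivative(f, x, eps=1e-6):
    # Central finite-difference estimate of f'(x).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def classify_fixed_point(f, x_star, tol=1e-6):
    # Compare |f'(x*)| with 1, allowing a small tolerance for the borderline case.
    slope = abs(derivative(f, x_star))
    if slope < 1 - tol:
        return "stable"
    if slope > 1 + tol:
        return "unstable"
    return "marginal"

def relu_length_map(sigma_w, sigma_b):
    # ReLU length map: f(q) = q * sigma_w^2 / 2 + sigma_b^2.
    return lambda q: q * sigma_w**2 / 2 + sigma_b**2

# sigma_w < sqrt(2) with bias: fixed point at sigma_b^2 / (1 - sigma_w^2 / 2) = 0.5
print(classify_fixed_point(relu_length_map(1.0, 0.5), 0.5))
# sigma_w > sqrt(2) without bias: fixed point at the origin
print(classify_fixed_point(relu_length_map(2.0, 0.0), 0.0))
# sigma_w = sqrt(2) without bias: every point is a fixed point
print(classify_fixed_point(relu_length_map(math.sqrt(2), 0.0), 1.0))
```

  The same `classify_fixed_point` helper can be pointed at any one-dimensional map, including a numerically evaluated $\tanh$ length map.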

3 Random Matrix Theory: Introduction.

Motivation: The crux of the paper uses tools from random matrix theory, which studies ensembles of matrix-valued random variables. Here, we will take a first stab at analyzing some relevant questions and get a feel for how the spectra of random matrices differ from those of deterministic matrices. We will also see that the spectra depend on how we sample the matrices. Finally, we will investigate what random matrices from different ensembles have in common.

Objectives:

Gain familiarity with working with the spectra of random matrices.
Understand the typical behavior of a random matrix’s eigenvalues.
Understand how standard RMT eigenvalue distributions are influenced by both level repulsion and confinement.
Understand why RMT is used in the Resurrecting the Sigmoid paper.

Topics:

Eigenvalue spacing in random matrices.

Readings:

Livan RMT textbook, sections 2.1 - 2.3.

Optional Readings:

Random Matrix Theory and its Innovative Applications by Edelman and Yang.
Livan RMT textbook, chapters 3, 6, and 7.

Questions:

The full problem set, from which the problems below are taken, is here.

Avoided crossings in the spectra of random matrices.

In the first DFL session’s intro to RMT, we mentioned that eigenvalues of random matrices tend to repel each other. Indeed, as one of the recommended textbooks on RMT states, this interplay between confinement and repulsion is the physical mechanism at the heart of many results in RMT. This problem explores that statement, relating it to a concept which comes up often in physics - the avoided crossing.
1. The simplest example of an avoided crossing is in a $2 \times 2$ matrix. Let’s take the matrix
  \[\begin{pmatrix} \Delta & J \\ J & -\Delta \end{pmatrix}\]
  1. Since this matrix is symmetric, its eigenvalues will be real. What are its eigenvalues?
    Solution
    
    The polynomial to solve for the eigenvalues $\lambda$ is $$ \begin{eqnarray} (\Delta - \lambda) (-\Delta - \lambda ) - J^2 &=& 0 \\ \lambda^2 - (\Delta^2 + J^2) &=& 0 \end{eqnarray} $$ So the eigenvalues are $\pm \sqrt{\Delta^2 + J^2}$.
  2. To see the avoided crossing here, plot the eigenvalues as a function of $\Delta$, first for $J=0$, then for a few non-zero values of $J$.
    Solution
    
    Here is an example graph, showing $J$ values $0$, $1$, and $2$. The blue line shows no gap when $J = 0$, and the gap opens up when $J$ is non-zero. [avoided crossing]
  3. You should see a gap (i.e. the minimal distance between the eigenvalue curves) open up as $J$ becomes non-zero. What is the size of this gap?
    Solution
    
    To get the gap, evaluate the expression for the eigenvalues when $\Delta$ is zero, and you find that the gap is $2|J|$.
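    The avoided crossing can also be seen numerically. The sketch below diagonalizes the $2 \times 2$ matrix on a grid of $\Delta$ values and reports the smallest separation between the two eigenvalue branches, which should match $2|J|$. The grid and the helper name `eigenvalue_branches` are illustrative assumptions.

```python
import numpy as np

def eigenvalue_branches(deltas, J):
    # Eigenvalues of [[Delta, J], [J, -Delta]] for each Delta, sorted ascending.
    mats = np.array([[[d, J], [J, -d]] for d in deltas])
    return np.linalg.eigvalsh(mats)  # shape (len(deltas), 2)

deltas = np.linspace(-3.0, 3.0, 121)
for J in [0.0, 1.0, 2.0]:
    lam = eigenvalue_branches(deltas, J)
    gap = np.min(lam[:, 1] - lam[:, 0])
    print(f"J = {J}: minimal gap = {gap:.4f}, 2|J| = {2 * abs(J):.4f}")
```

    Plotting `lam[:, 0]` and `lam[:, 1]` against `deltas` for each $J$ gives the two branches: they touch at $\Delta = 0$ for $J = 0$ and are pushed apart otherwise.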
2. Now take a matrix of the form
  \[\begin{pmatrix} A & C \\ C & D \end{pmatrix}.\]
  In terms of $A$, $C$, and $D$, what is the absolute value of the difference between the two eigenvalues of this matrix?
  
  Solution
  
  The difference in eigenvalues won't shift if we add a multiple of the identity matrix to our original matrix, meaning that the eigenvalue difference is the same as that of the matrix $$ \begin{pmatrix} \frac{1}{2}\left(A - D\right) & C \\ C & -\frac{1}{2}\left(A - D\right) \end{pmatrix} $$ The eigenvalue difference is thus (using the eigenvalues we calculated from the previous part): $$ s = \sqrt{4C^2 + \left( A - D \right) ^2} $$
3. Now let’s make the matrix a random matrix. We will take $A$, $C$, and $D$ to be independent random variables, where the diagonal entries $A$ and $D$ are distributed according to a normal distribution with mean zero and variance one, while the off-diagonal entry $C$ is also a zero-mean Gaussian but with a variance of $\frac{1}{2}$.
  1. Use the formula you derived in the previous part of the question to calculate the probability distribution function for the spacing between the two eigenvalues of the matrix.
    Solution
    
    From the previous part we know the spacing as a function of the random variables $A$, $D$, $C$: $$ s = \sqrt{4C^2 + \left( A - D \right) ^2} $$ So we can write in terms of the joint probability density function of $A$, $D$, and $C$ (writing $a$, $b$, and $c$ for the values taken by $A$, $D$, and $C$, respectively) that $$ \begin{eqnarray} p_s(x) &=& \int~da~db~dc~p_{A,D,C}(a, b, c) ~ \delta(x - s(a,b,c)) \\ &=& \frac{1}{2\pi \sqrt{\pi}}\int_{-\infty}^{\infty}da \int_{-\infty}^{\infty}db \int_{-\infty}^{\infty}dc~ e^{-a^2/2} e^{-b^2/2} e^{-c^2} ~ \delta(x - s(a,b,c)) \end{eqnarray} $$ where $\delta$ is the Dirac delta function. Now perform the change of variables to cylindrical coordinates $r$, $\theta$, $z$, with $$ \begin{eqnarray} r \cos{\theta} &=& a - b \\ r \sin{\theta} &=& 2c \\ z &=& a + b \end{eqnarray} $$ The inverse of this transformation is $$ \begin{eqnarray} a &=& \frac{1}{2} ( r\cos(\theta) + z) \\ b &=& \frac{1}{2} ( z - r\cos(\theta)) \\ c &=& \frac{1}{2} r\sin(\theta) \end{eqnarray} $$ For later, note that $$ a^2 + b^2 = \frac{1}{2} (r^2 \cos^2(\theta) + z^2) $$ And the Jacobian can be calculated as $J=-r/4$ (see the Livan book, section 1.2), so that $da~db~dc = \frac{r}{4}~dr~d\theta~dz$. 
In terms of the new variables, the spacing $s$ becomes $r$, and the integration becomes $$ \begin{eqnarray} p_s(x) &=& \frac{1}{2\pi \sqrt{\pi}}\int_{-\infty}^{\infty}da \int_{-\infty}^{\infty}db \int_{-\infty}^{\infty}dc~ e^{-a^2/2} e^{-b^2/2} e^{-c^2} ~ \delta(x - s(a,b,c)) \\ &=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-a^2/2} e^{-b^2/2} e^{-c^2} ~ \delta(x - r) \\ &=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-\frac{1}{2} (a^2 + b^2 + 2c^2)} ~ \delta(x - r) \\ &=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-\frac{1}{4} (r^2 \cos^2(\theta) + z^2 + r^2\sin^2\theta)} ~ \delta(x - r) \\ &=& \frac{1}{8\pi \sqrt{\pi}}\int_{0}^{\infty} r~dr \int_{0}^{2\pi}d\theta \int_{-\infty}^{\infty}dz~ e^{-\frac{1}{4} (r^2 + z^2)} ~ \delta(x - r) \\ &=& \frac{1}{4\sqrt{\pi}}\int_{0}^{\infty} re^{-\frac{1}{4} r^2}~\delta(x - r)~dr \int_{-\infty}^{\infty}dz~ e^{-\frac{z^2}{4}} ~ \\ &=& \frac{1}{2}\int_{0}^{\infty} re^{-\frac{1}{4} r^2}~\delta(x - r)~dr \\ &=& \frac{x}{2}e^\frac{-x^2}{4} \end{eqnarray} $$
  2. What is the behavior of this pdf at zero? How does this relate to the avoided crossing you calculated earlier?
    Solution
    
    Clearly the pdf we calculated above is exactly zero at $s=0$, and grows linearly with $s$ for small $s$. This absence of spacings at zero is the same phenomenon as the avoided crossing noted above for deterministic matrices. Another way to see this is to note that from the first part of the problem, the only way to have a spacing of zero is to have the diagonal elements equal each other while the off-diagonal element is zero. The set of points satisfying both conditions is a line in the full 3D space of $(A, C, D)$, which has zero volume, so the probability density of the spacing must vanish there.
  3. Verify using numerical simulation that the pdf you found in the previous part is correct.
    Solution
    
    The following Python code should work; by generating plots using the two functions, we can verify that they match.
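```
import numpy as np

def eigenvalue_spacing():
    # Diagonal entries have variance 1; the off-diagonal entry has variance 1/2.
    A = np.random.normal(scale=1)
    D = np.random.normal(scale=1)
    C = np.random.normal(scale=np.sqrt(0.5))

    M = np.array([[A, C], [C, D]])
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    eigenvalues = np.linalg.eigvalsh(M)
    return abs(eigenvalues[0] - eigenvalues[1])

def pdf(x):
    # Analytic spacing density derived above.
    return x / 2 * np.exp(-x**2 / 4)
```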
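    To carry out the comparison itself, one option is to draw many spacings at once and compare a normalized histogram with the analytic pdf. The sketch below is a vectorized, self-contained version of the sampling above; the sample size, seed, and binning are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Sample the entries with the variances from the problem statement.
A = rng.normal(scale=1.0, size=n_samples)
D = rng.normal(scale=1.0, size=n_samples)
C = rng.normal(scale=np.sqrt(0.5), size=n_samples)

# Stack all 2x2 symmetric matrices and diagonalize them in one call.
M = np.stack([np.stack([A, C], axis=-1), np.stack([C, D], axis=-1)], axis=-2)
eigs = np.linalg.eigvalsh(M)          # shape (n_samples, 2), sorted ascending
spacings = eigs[:, 1] - eigs[:, 0]

# Empirical density versus the analytic result x/2 * exp(-x^2 / 4).
hist, edges = np.histogram(spacings, bins=40, range=(0.0, 6.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
analytic = centers / 2 * np.exp(-centers**2 / 4)
max_abs_diff = np.max(np.abs(hist - analytic))
print(f"largest deviation between histogram and pdf: {max_abs_diff:.4f}")
print(f"mean spacing: {spacings.mean():.4f} (analytic mean is sqrt(pi))")
```

    The analytic pdf is a Rayleigh distribution with scale $\sqrt{2}$, whose mean is $\sqrt{\pi}$, which gives a second quick check on the samples.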

4 Random Matrix Theory: Central concepts.

Motivation: In this section we cover the final topic before we can get to the calculations in the paper - free probability. Specifically, we discuss its instantiation in random matrix theory. This is a huge topic, but to understand the paper, we luckily don’t need to cover too much ground. The basic question to think about is, “Given two random matrices whose spectral densities we know, when can we calculate the spectral density of their sum or product?”

We also address canonical results in random matrix theory, like the semicircular law.

Objectives:

Understand some basic properties that are of interest when working with random matrices.
Specifically, for this paper, understand why we are interested in the eigenvalue/singular-value distribution of matrices.
Be able to describe some canonical ensembles of random matrices, and their properties.
Be able to explain why we need the theory of freely-independent matrices in this paper.

Topics:

Free independence.
The $R$- and $S$-transforms.
The semicircle law.

Reading:

The primary learning tool is again the problem set. The following readings will help contextualize the problems.

Livan RMT textbook, chapter 17.
Section 2.3 of the Resurrecting the Sigmoid paper.
Partial Freeness of Random Matrices by Chen et al., Sections 1, 2, 3, and 5.

Optional Readings:

It is tough to find an exposition of free probability theory (i.e., the theory of non-commuting random variables) at an elementary level. The chapter in the Livan textbook listed above is a great resource, and the following papers might also help.

Financial Applications of Random Matrix Theory: a short review by Bouchaud and Potters, section III.
Applying Free Random Variables to Random Matrix Analysis of Financial Data Part I: A Gaussian Case by Burda et al.

Questions:

The full problem set, from which the below problems are taken, is here.

Why we need free probability.

In the upcoming lectures, we will encounter the concept of free independence of random matrices. As a reminder, in standard probability theory (of scalar-valued random variables), two random variables $X$ and $Y$ are said to be independent if their joint pdf is simply the product of the individual marginals, i.e.
\[p_{X,Y}(x,y) = p_X(x) p_Y(y)\]
When we have independent scalar random variables $X$ and $Y$, then in principle it is possible to calculate the distribution of any function of these variables, say the sum $X + Y$ or the product $XY$.

When it comes to random matrices, we are often interested in calculating the spectral density (the probability density of eigenvalues) of the sum or product of random matrices. In the Resurrecting the Sigmoid paper, for example, we will calculate the spectral density of the network’s input-output Jacobian, which is the product of several matrices for each layer. So we need an analogue of independent variables for matrices (this condition is known as free independence), such that if we know the spectral densities of each one, we can calculate spectral densities of sums and products.

The simplest condition we might imagine under which two matrix-valued random variables (or, equivalently, two matrix ensembles) would be freely independent is that all of the entries of each matrix are mutually independent. However, it turns out that this condition is not good enough! In other words, independent entries are sometimes not enough to destroy all possible angular correlations between the eigenbases of two matrices. Instead, the property that generalizes statistical independence to random matrices is stronger and is known as freeness.

In this problem, we will see a concrete example of matrix ensembles with mutually independent entries for which knowing the eigenvalue spectral density of each ensemble is nevertheless not enough to determine the eigenvalue spectral density of the sum.

Define two different ensembles of 2 by 2 matrices:
- Ensemble 1: To sample a matrix from ensemble 1, sample a standard Gaussian scalar random variable $z$ and multiply it by each element in the matrix $\sigma_z$, where
\[\sigma_z = \left( \begin{array}{cc} 1 & 0 \\ 0 & -1 \end{array} \right)\]
Thus the sampled matrix will be $z \sigma_z$.
- Ensemble 2: To sample a matrix from ensemble 2, sample a standard Gaussian scalar random variable $z$ and multiply it by each element in the matrix $\sigma_x$, where
\[\sigma_x = \left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right)\]
Thus the sampled matrix will be $z \sigma_x$.
1. What is the spectral density $\rho_1(x)$ of eigenvalues of matrices sampled from ensemble 1?
  
  Solution
  
  The eigenvalues of $\sigma_z$ are $\pm 1$, so the spectral density $\rho_1(x)$ will be identical to the probability density of $z$ except for a factor of $2$ (because it has to integrate to $2$, the number of eigenvalues, instead of $1$), namely $$\rho_1(x) = \frac{\sqrt{2}}{\sqrt{\pi}} e^{-x^2/2}$$
2. What is the spectral density $\rho_2(x)$ of eigenvalues of matrices sampled from ensemble 2?
  
  Solution
  
  The eigenvalues of $\sigma_x$ are exactly the same as those of $\sigma_z$, namely $\pm 1$, so the spectral density $\rho_2(x)$ is the same as $\rho_1(x)$: $$\rho_2(x) = \frac{\sqrt{2}}{\sqrt{\pi}} e^{-x^2/2}$$
  
  You should have found above that the spectral densities of both ensembles are the same. However, we will see now that simply knowing the spectral density is not enough to determine the spectral density of the sum.
3. Let $A$ and $B$ be two matrices independently sampled from ensemble 1. Calculate analytically the spectral density of the sum, $A + B$.
  
  Solution
  
  We can write this matrix as $(z_1 + z_2)\sigma_z$, where $z_1$ and $z_2$ are standard normal variables. The eigenvalues are thus $\pm (z_1 + z_2)$. Since $z_1$ and $z_2$ are independent, their sum will be a zero-mean Gaussian with variance $2$. So the spectral density of eigenvalues will be twice that of such a Gaussian, namely: $$\rho_{A+B}(\lambda) = \frac{1}{\sqrt{\pi}} e^{-\lambda^2/4}$$
4. Now let $C$ be a matrix sampled from ensemble 2. In the next part, you will calculate the spectral density of the sum $A + C$, where $A$ is drawn from ensemble 1 and $C$ is drawn from ensemble 2. However, to see immediately that the distributions of $A+B$ and $A+C$ will be different, consider the behavior of the spectral density of $A+C$ at zero. Based on your knowledge of avoided crossings from the previous problem set, describe the spectral density of $A+C$ at $\lambda =0$ and contrast this to the spectral density of $A+B$.
  
  Solution
  
  Notice that the matrix $A+C$ will have the same form as the matrix considered in the previous problem set, and we found that for such a matrix the presence of the off-diagonal term caused there to be a level repulsion. So, the eigenvalue spectral density should go to zero as $\lambda$ approaches zero, for the matrix $A+C$. However, in the above part, we calculated that for matrices $A+B$, there is no avoided crossing and the pdf is finite at $\lambda = 0$.
5. Now let $C$ be a matrix sampled from ensemble 2. Calculate the spectral density of the sum, $A + C$. Make sure this is consistent with what you argued above about the behavior at $\lambda = 0$.
  
  Solution
  
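  Write $A = z_1 \sigma_z$ and $C = z_2 \sigma_x$, with $z_1$ and $z_2$ independent standard normal variables, so that $$A + C = \begin{pmatrix} z_1 & z_2 \\ z_2 & -z_1 \end{pmatrix}$$ This has the same form as the matrix from the avoided-crossing problem, so its eigenvalues are $\pm\sqrt{z_1^2 + z_2^2}$. The radius $r = \sqrt{z_1^2 + z_2^2}$ of a two-dimensional standard Gaussian vector follows the Rayleigh distribution, $p(r) = r e^{-r^2/2}$ for $r \geq 0$. Since the eigenvalues are $\pm r$, the spectral density (normalized to integrate to $2$, as before) is $$\rho_{A+C}(\lambda) = |\lambda| e^{-\lambda^2/2}$$ This vanishes linearly as $\lambda$ approaches zero, consistent with the level repulsion argued in the previous part, whereas $\rho_{A+B}(0) = 1/\sqrt{\pi}$ is finite.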
  
  Notice that the answers you got in the previous two parts were different, even though the underlying matrices that were being added had the same spectral density and independent entries.
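  The difference between the two sums is easy to observe by sampling. The sketch below draws matrices from the two ensembles defined in this problem, diagonalizes $A+B$ and $A+C$, and compares how often an eigenvalue lands close to zero. The sample size, seed, window width, and the helper name `sample` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sigma_z = np.array([[1.0, 0.0], [0.0, -1.0]])
sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])

def sample(ensemble_matrix):
    # Each sample is a standard Gaussian scalar times the fixed 2x2 matrix.
    z = rng.normal(size=(n, 1, 1))
    return z * ensemble_matrix

A, B = sample(sigma_z), sample(sigma_z)  # both from ensemble 1
C = sample(sigma_x)                      # from ensemble 2

eigs_AB = np.linalg.eigvalsh(A + B).ravel()
eigs_AC = np.linalg.eigvalsh(A + C).ravel()

# Fraction of eigenvalues inside a small window around zero.
window = 0.1
frac_AB = np.mean(np.abs(eigs_AB) < window)
frac_AC = np.mean(np.abs(eigs_AC) < window)
print(f"|lambda| < {window}: A+B {frac_AB:.4f}, A+C {frac_AC:.4f}")
```

  The fraction for $A+C$ should come out roughly an order of magnitude smaller than for $A+B$, reflecting the level repulsion introduced by the off-diagonal entries.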
Using free probability theory.

From the last problem, you learned that if you’re given two different random matrix ensembles, and you know the spectral density of the eigenvalues of each one, that might not be enough to determine the eigenvalue distribution of the sum (or product) of the two random matrices, even if all of the entries of the two matrices are mutually independent! As we mentioned in the last problem, the (stronger) condition that we are after is known as free independence. In general, proving that two matrix ensembles are “free” (freely independent) is quite tough, so we will not do that here. Instead, we will look at the tools we use to do calculations assuming we have random matrix ensembles which are freely independent.

Specifically, we will show that the sum of two freely independent random matrices, each of whose spectral density is given by a semicircle, is also described by the semicircle distribution.
1. Recall that the spectral density of the Gaussian orthogonal ensemble (in the large $N$ limit) is given by the semicircle law:
  \[\rho_{sc}(x) = \frac{1}{\pi}\sqrt{2-x^2}\]
  (sometimes you see this with a $4$ or $8$ in the square root and a different factor accompanying $\pi$ in the denominator. This is just a matter of choosing which Gaussian ensemble—orthogonal, unitary, or symplectic—to use, and doesn’t really matter for this problem)
  
  In a previous problem set, you calculated the Stieltjes transform associated with the spectral density for the Gaussian unitary ensemble. Recall that the Stieltjes transform, $G(z)$, is defined via the relation
  \[G(z) = \int_\mathbb{R}~dt \frac{\rho(t)}{z - t}\]
  (In the previous problem set, this was called $s_{\mu_N}(z)$. In literature you often see the $G(z)$ notation, since the Stieltjes transform is also known as the resolvent or Green’s function.)
  
  In the last problem set, you should have found that under the Stieltjes transform,
  \[\frac{1}{2\pi}\sqrt{4-x^2} \mapsto \frac{z - \sqrt{z^2 - 4}}{2}\]
  Use the above fact to calculate the Stieltjes transform of the GOE semicircle given at the beginning of this problem (part (a)). This is the first step to calculating the spectral density of the sum.
  
  Solution
  
  Define $$\begin{eqnarray} f(x) &=& \frac{1}{2\pi}\sqrt{4 - x^2} \\ g(x) &=& \frac{1}{\pi}\sqrt{2 - x^2} \end{eqnarray}$$ Notice that \begin{equation} g(x) = \sqrt{2} f(x\sqrt{2}). \end{equation} If we define $G_g(z)$ and $G_f(z)$ as the Green's functions corresponding to $g(x)$ and $f(x)$, respectively, then we can get a relation between the two: \begin{eqnarray} G_g(z) &=& \int~dt~\frac{g(t)}{z - t} \\ &=& \sqrt{2}\int~dt~\frac{f(t\sqrt{2})}{z - t} \\ &=& \sqrt{2}\int~\frac{dy}{\sqrt{2}} \frac{f(y)}{z - y/\sqrt{2}} \\ &=& \sqrt{2}\int~dy~\frac{f(y)}{z\sqrt{2} - y} \\ &=& \sqrt{2}G_f(z\sqrt{2}). \end{eqnarray} Since we have previously calculated that \begin{equation} G_f(z) = \frac{z - \sqrt{z^2 - 4}}{2}, \end{equation} this immediately gives us that \begin{equation} G_g(z) = z - \sqrt{z^2 - 2}. \end{equation}
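  The closed form $G_g(z) = z - \sqrt{z^2 - 2}$ can be checked against a direct numerical evaluation of the defining integral. The sketch below does this for real $z > \sqrt{2}$, outside the support of the semicircle, where this branch of the square root applies. It uses the substitution $t = \sqrt{2}\sin\theta$ and a midpoint rule; the function names, grid size, and test points are illustrative assumptions.

```python
import numpy as np

def stieltjes_semicircle_numeric(z, n=20000):
    # With t = sqrt(2) * sin(theta), rho(t) dt becomes (2/pi) * cos(theta)^2 dtheta.
    h = np.pi / n
    theta = -np.pi / 2 + (np.arange(n) + 0.5) * h  # midpoints of n subintervals
    integrand = (2 / np.pi) * np.cos(theta)**2 / (z - np.sqrt(2) * np.sin(theta))
    return np.sum(integrand) * h

def stieltjes_semicircle_closed_form(z):
    # G_g(z) = z - sqrt(z^2 - 2), valid here for real z > sqrt(2).
    return z - np.sqrt(z**2 - 2)

for z in [1.5, 2.0, 5.0]:
    print(z, stieltjes_semicircle_numeric(z), stieltjes_semicircle_closed_form(z))
```

  For large $z$ both expressions approach $1/z$, as expected for the Stieltjes transform of a normalized density.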
2. We have calculated the Stieltjes transform or Green’s function of the semicircle. Now we proceed to calculate the so-called Blue’s function, which is just defined as the functional inverse of the Green’s function. That is, the Green’s function $G(z)$ and the Blue’s function $B(z)$ satisfy
  \[G(B(z)) = B(G(z)) = z\]
  Calculate the Blue’s function corresponding to the semicircle Green’s function you derived above.
  
  Solution
  
  The inverse function is defined by the relation \begin{equation} z = B - \sqrt{B^2 - 2}. \end{equation} Then \begin{eqnarray} \sqrt{B^2 - 2} &=& B - z \\ B^2 - 2 &=& B^2 - 2Bz + z^2 \\ 2Bz &=& z^2 + 2 \\ B &=& \frac{z}{2} + \frac{1}{z} \end{eqnarray}
3. You should have noticed that the Blue’s function you calculated had a singularity at the origin, that is, a term given by $1/z$. The $R$-transform is defined as the Blue’s function minus that singularity; that is,
  \[R(z) = B(z) - \frac{1}{z}\]
  What is the $R$-transform of the GOE semicircle?
  
  Solution
  
  Since \begin{equation} B = \frac{z}{2} + \frac{1}{z}, \end{equation} we can immediately write \begin{equation} R(z) = \frac{z}{2} \end{equation}
4. Finally we come to the law of addition of freely independent random matrices: If we are given freely independent random matrices $X$ and $Y$, whose $R$-transforms are $R_X(z)$ and $R_Y(z)$, respectively, then the $R$-transform of the sum (or more precisely, the $R$-transform of the spectral density of the sum $X + Y$) is simply given by $R_X(z) + R_Y(z)$.
  
  Assume that two standard GOE matrices, say $H_1$ and $H_2$, are freely independent. What is the $R$-transform of the spectral density of the sum $H_+ = H_1 + H_2$?
  
  Solution
  
  $$R_{H_+}(z) = R_{H_1}(z) + R_{H_2}(z) = \frac{z}{2} + \frac{z}{2} = z$$
5. Using the results above, argue that the sum of two freely-independent ensembles described by the semicircular law is also described by the semicircular law.
  
  Solution
  
  The $R$-transform of the sum ($z$) has the same functional form as the individual $R$-transforms ($z/2$), so it seems plausible that this means that when we invert it, we get a semicircle. Let's make sure of this fact. We should first figure out how the scaling of the $R$-transform affects the scaling of the matrix it is describing. We can guess that this amounts to a simple scaling of the matrix itself. Under the scaling of a general matrix $H \mapsto cH$, the eigenvalue distribution goes from $\rho(\lambda) \mapsto \rho(\lambda/c) /c$. Then, by the same logic we used in part (a) of this problem, the Green's function goes $G(z) \mapsto G(z/c)/c$ (in part (a), $c$ was $1/\sqrt{2}$). To figure out the change in the Blue's function, we can write, taking $c = p$: \begin{eqnarray} G_{pH}(B_{pH}(z)) = \frac{1}{p} G_{H}(B_{pH}(z)/p) &=& z \\ G_{H}(B_{pH}(z)/p) &=& pz \\ B_{pH}(z) &=& p B_H(pz) \end{eqnarray} where the last line follows from applying $B_H$ to both sides. And finally we can get the scaling of the $R$-transform: \begin{eqnarray} R_{pH}(z) &=& pB_H(pz) - \frac{1}{z} \\ &=& p\left( R_H(pz) + \frac{1}{pz} \right) - \frac{1}{z} \\ &=& p R_H(pz) \end{eqnarray} With this result, we know that if we multiply the GOE matrix by $\sqrt{2}$, the $R$-transform goes from $z/2$ to $z$. This means that the spectral density of a sum of two GOE matrices is still semicircular, just with a $\sqrt{2}$ scaling. Another way of saying this is that the semicircular law is stable under free addition.
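  The conclusion can be illustrated with a finite-size simulation. The sketch below samples two independent symmetric Gaussian matrices, scaled so that the spectrum of each approaches $\frac{1}{\pi}\sqrt{2 - x^2}$, and compares the spectrum of their sum with that of a single matrix. It relies on the standard result that independent GOE matrices become freely independent as $N \to \infty$, which this problem takes as an assumption; the matrix size, seed, and helper name `sample_goe` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400

def sample_goe(N):
    # Symmetric Gaussian matrix with off-diagonal variance 1 / (2N), so that the
    # spectrum approaches a semicircle of radius sqrt(2) as N grows.
    G = rng.normal(size=(N, N))
    return (G + G.T) / (2 * np.sqrt(N))

H1, H2 = sample_goe(N), sample_goe(N)
eigs_single = np.linalg.eigvalsh(H1)
eigs_sum = np.linalg.eigvalsh(H1 + H2)

# A semicircle of radius R has second moment R^2 / 4.
print("single: second moment", np.mean(eigs_single**2), "largest", eigs_single.max())
print("sum:    second moment", np.mean(eigs_sum**2), "largest", eigs_sum.max())
```

  For a single matrix the second moment should be close to $1/2$ and the largest eigenvalue close to $\sqrt{2}$; for the sum they should be close to $1$ and $2$, matching a semicircle stretched by $\sqrt{2}$.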

5 Calculations in Resurrecting the Sigmoid.

Motivation: We are ready to actually perform the calculations from the paper using RMT and building off of the signal propagation concepts from section two. Using this analysis, we will be able to predict under what conditions dynamical isometry is achievable. This principle is the one that guarantees that inputs and gradients neither vanish nor explode as they pass through the net.

Objectives:

Be able to use the $S$-transform to calculate $\sigma_{JJ^T}^2$ and $\lambda_\text{max}$ for Gaussian nets.
For Gaussian-initialized neural networks, explain why dynamical isometry is unattainable.
Be able to use the $S$-transform to calculate $\sigma_{JJ^T}^2$ and $\lambda_\text{max}$ for orthogonal nets.
Explain why orthogonal-initialized neural networks can attain dynamical isometry when used with a sigmoidal activation function.
Understand how to choose initialization parameters of an orthogonal, sigmoidal net of a given depth to ensure dynamical isometry.

Topics:

Jacobian spectra of neural networks with Gaussian- and orthogonal-initialized random weight matrices.
Decomposing neural network Jacobians via weight matrices and diagonal “nonlinearity” matrices.

Required Reading:

Resurrecting the Sigmoid, sections 2.2 and 2.5.

Questions:

The full problem set, from which the below problems are taken, is here.

Setting up the calculations.

In this problem set, we perform the main calculations from the Resurrecting the Sigmoid paper. The ultimate aim is to look for conditions under which we can achieve dynamical isometry, the condition that all of the singular values of the network’s Jacobian have magnitude $1$. Thus, the problems in this set are all aimed at calculating the eigenvalue spectral density $\rho_{JJ^T}(\lambda)$ of nets’ Jacobians for specific choices of nonlinearities and weight-matrix initializations. We accomplish this by using the rule we learned from free probability: $S$-transforms of freely-independent matrix ensembles multiply under matrix multiplication. Following this logic, we will calculate $S$-transforms for the matrices $WW^T$ and $D^2$, combine these results to arrive at $S_{JJ^T}$, and from that calculate $\rho_{JJ^T}(\lambda)$. In this problem set, as in the paper, we do not prove that the matrices are freely independent, but instead take that as an assumption.

Recall that our neural network is defined by the relations:
\[\begin{aligned} h^l &= W^l x^{l-1} + b^l \\ x^l &= \phi(h^l) \end{aligned}\]
where the input is denoted $h^0$ and the output is given by $x^L$.
1. What is the Jacobian $J$ of the input-output relation of this network?
  
  Hint
  See eq. 2 of the paper.
  
  Solution
  
  Using the chain rule gives: $$ J = \prod_{l=1}^L D^l W^l$$ where $D^l$ is a matrix of pointwise derivatives of the nonlinearity $\phi$ at layer $l$: \begin{equation} (D^l)_{ij} = \frac{dx^l_j}{dh^l_i} = \phi ' (h^l_i)\delta_{ij}. \end{equation}
2. As the paper discusses, we are interested in the spectrum of singular values of $J$, but all of the tools we have developed so far deal with the eigenvalue spectrum.
  
  In terms of the singular values of $J$, what are the eigenvalues of $JJ^T$?
  
  Solution
  
  The definition of dynamical isometry, the condition we're after, is that the magnitude of the singular values of $J$ should concentrate around 1.
  
  What is the dynamical isometry condition in terms of the eigenvalues of $JJ^T$?
  
  Solution
  
  The singular values of a matrix $A$ are the square roots of the eigenvalues of $AA^\dagger$ (which is $AA^T$ for real $A$), so the eigenvalues of $JJ^T$ are the squared singular values of $J$. Quick proof: By SVD, $A=U \Sigma V^\dagger$, where $V^\dagger$ is the conjugate transpose of $V$, and so $AA^\dagger = (U\Sigma V^\dagger)(U\Sigma V^\dagger)^\dagger = U \Sigma \Sigma^T U^\dagger = U \Sigma^2 U^\dagger$ where $\Sigma^2$ is the diagonal matrix of squared singular values. This is exactly a spectral decomposition of $AA^\dagger$, so the diagonal of $\Sigma^2$ contains the eigenvalues of $AA^\dagger$. Thus the squared singular values of $A$ equal the eigenvalues of $AA^\dagger$. $\square$
  
  So, the dynamical isometry condition is that the spectrum of eigenvalues of $JJ^T$ concentrates around unity.
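The two parts above can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes a small tanh network with hypothetical sizes `N` and `L`, treats the input as `x0`, builds the Jacobian as the product of $D^l W^l$ factors (layer $L$ leftmost), and compares it against finite differences. It then compares the squared singular values of $J$ with the eigenvalues of $JJ^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 5, 3  # hypothetical width and depth, chosen only for illustration
Ws = [rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N)) for _ in range(L)]
bs = [rng.normal(0.0, 0.1, size=N) for _ in range(L)]

def forward(x0):
    """Return the output x^L and the chain-rule Jacobian dx^L/dx^0."""
    x, J = x0, np.eye(N)
    for W, b in zip(Ws, bs):
        h = W @ x + b
        x = np.tanh(h)
        D = np.diag(1.0 - np.tanh(h) ** 2)  # phi'(h) on the diagonal
        J = D @ W @ J                       # accumulate D^l W^l from the left
    return x, J

x0 = rng.normal(size=N)
_, J = forward(x0)

# Central finite differences as an independent check of the chain rule.
eps = 1e-6
J_fd = np.column_stack([
    (forward(x0 + eps * e)[0] - forward(x0 - eps * e)[0]) / (2 * eps)
    for e in np.eye(N)
])

sing = np.linalg.svd(J, compute_uv=False)   # singular values, descending
eig = np.linalg.eigvalsh(J @ J.T)[::-1]     # eigenvalues, reordered to descending

print(np.max(np.abs(J - J_fd)))
print(np.max(np.abs(sing ** 2 - eig)))
```

Both printed differences should be tiny, which is the numerical counterpart of the chain-rule formula and of the statement that the eigenvalues of $JJ^T$ are the squared singular values of $J$.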
Now that we’re focused on $JJ^T$ instead of $J$, read the following section reproduced from the main paper, about the $S$-transform of $JJ^T$’s spectral density:

$S_{JJ^T} = \prod_{l=1}^L S_{W_lW_l^T} S_{D_l^2} = S_{WW^T}^L S_{D^2}^L$ where we have used the identical distribution of the weights to define $S_{WW^T} = S_{W_l W_l^T}$ for all $l$, and we have also used the fact the pre-activations are distributed independently of depth as $h_l \sim \mathcal{N}(0,q^*)$, which implies that $S_{D_l^2} = S_{D^2}$ for all $l$. Eqn.(12) provides a method to compute the spectrum $\rho_{JJ^T} (\lambda)$. Starting from $\rho_{W^T W} (\lambda)$ and $\rho_{D^2}$, we compute their respective $S$-transforms through the sequence of equations eqns. (7), (9), and (10), take the product in eqn. (12), and then reverse the sequence of steps to go from $S_{JJ^T}$ to $\rho_{JJ^T} (\lambda)$ through the inverses of eqns. (10), (9), and (8). Thus we must calculate the $S$-transforms of $WW^T$ and $D^2$, which we attack next for specific nonlinearities and weight ensembles in the following sections. In principle, this procedure can be carried out numerically for an arbitrary choice of nonlinearity, but we postpone this investigation to future work.

Prove the equation at the top of the box.

Hint
This is done in the first appendix of the paper. Note that you should assume free independence of the $D$'s and $W$'s.

The upshot of this problem is that we need to calculate the quantities $S_{WW^T}$ and $S_{D^2}$ for whatever nonlinearities and weight initialization schemes we’re interested in.

Solution

$JJ^T =\left( \prod_{l=1}^L D^l W^l\right) \left(\prod_{l=1}^L D^l W^l\right)^T = \left(D_L W_L \ldots D_1 W_1\right) \left(D_L W_L \ldots D_1 W_1\right)^T$ by expanding the product. So $$ S_{JJ^T} = S_{\left( D_L W_L \ldots D_1 W_1\right) \left( D_L W_L \ldots D_1 W_1\right)^T} $$ Since the $S$-transform is defined in terms of moments of the eigenvalue distribution, it is invariant to cyclic permutations (since the trace, which defines moments, is invariant to cyclic permutations). So, we can re-order matrices in the product, yielding: $$ S_{JJ^T} = S_{(W_L^T D_L^T D_L W_L) (D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T} $$ Then, assuming free independence, the $S$-transforms multiply: $$ S_{JJ^T} = S_{(W_L^T D_L^T D_L W_L)} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$ Again using invariance to cyclic permutations: $$ S_{JJ^T} = S_{(D_L^T D_L W_L W_L^T)} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$ And again assuming free independence: $$ S_{JJ^T} = S_{D_L^T D_L} S_{W_L W_L^T} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$ Since $D_L$ is diagonal, $D_L^T D_L = D_L^2$ and $$ S_{JJ^T} = S_{D_L^2} S_{W_L W_L^T} S_{(D_{L-1} W_{L-1} \ldots D_1 W_1)(D_{L-1} W_{L-1} \ldots D_1 W_1)^T}.$$ Continuing this procedure we get $$ S_{JJ^T} = \prod_{l=1}^L S_{D_l^2} S_{W_l W_l^T} $$ Since the weight matrices $W^l$ for each layer are identically distributed, their $S$-transforms are equal, so we can drop the subscript and write: $$ S_{JJ^T} = S_{W W^T}^L \prod_{l=1}^L S_{D_l^2}$$ Finally, using the fact that the $D^l$ matrices are identically distributed gives the desired expression: $$ S_{JJ^T} = \left(S_{W W^T}\right)^L \left(S_{D^2}\right)^L$$
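The cyclic-permutation step in this solution is easy to sanity-check numerically. The sketch below is only an illustration under assumed values: `D`, `W`, and `rest` are small hypothetical matrices, with `rest` standing in for the remaining factors $D_{L-1} W_{L-1} \ldots D_1 W_1$. It confirms that the normalized trace moments of $JJ^T$ match those of the reordered product $(W_L^T D_L^T D_L W_L)(\ldots)(\ldots)^T$. It does not test free independence, which the solution takes as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
D = np.diag([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])    # a 0/1 diagonal, as for ReLU
W = rng.normal(size=(N, N)) / np.sqrt(N)
rest = rng.normal(size=(N, N)) / np.sqrt(N)    # stand-in for the lower layers

J = D @ W @ rest
original = J @ J.T                                 # J J^T
reordered = (W.T @ D.T @ D @ W) @ (rest @ rest.T)  # cyclically permuted product

def moments(X, kmax=4):
    """Normalized trace moments tr(X^k) / N for k = 1..kmax."""
    return np.array([
        np.trace(np.linalg.matrix_power(X, k)) / N for k in range(1, kmax + 1)
    ])

print(moments(original))
print(moments(reordered))
```

The two printed arrays should agree up to floating-point error, even though `reordered` is not a symmetric matrix.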
$S_{D^2}$ for ReLU and hard-tanh networks

In this problem, we turn to networks with nonlinearities. We look at two nonlinearities here, the ReLU function and a piecewise approximation to the sigmoid known as the hard-tanh. These functions are defined as follows:

$f_{\mathrm{ReLU}}(x) = \begin{cases} 0 & x\leq 0 \\ x & x\geq 0 \end{cases}$ $f_{\mathrm{HardTanh}}(x) = \begin{cases} -1 & x\leq -1 \\ x & -1\leq x\leq 1 \\ 1 & x\geq 1 \end{cases}$

We want the spectral density, $\rho_{JJ^T}(\lambda)$, of $JJ^T$, where $J$ is the Jacobian. We will find this by first calculating its $S$-transform, $S_{JJ^T}$. As discussed in the introduction, this involves two separate steps: finding $S_{D^2}$ and finding $S_{WW^T}$. Note that finding $S_{D^2}$’s closed form relies primarily on choice of nonlinearity, and finding $S_{WW^T}$’s closed form relies only on choice of weight initialization (and not on choice of nonlinearity). In this problem, we focus on the nonlinearities ($S_{D^2}$); the next problems focus on the weight initializations ($S_{WW^T}$), and how to combine these to get the $S$ transform of the Jacobian.
1. The probability density function of the $D$ matrix depends on the distributions of inputs to the nonlinearity. To calculate this, we will make a couple of simplifying assumptions. The first assumption is that we initialize the network at a critical point (defined in problem set 2).
  
  If we are interested in finding conditions for achieving dynamical isometry, why is it a good assumption that the network is initialized at criticality?
  
  Solution
  
  The criticality condition, $\chi = 1$, implies that the mean squared singular value of $J$, or equivalently that the mean eigenvalue of $JJ^T$, is unity. Dynamical isometry means that the entire spectrum of squared singular values of $J$ is concentrated around unity. So criticality is a prerequisite for dynamical isometry.
2. The second assumption we make in calculating the distribution of inputs to the nonlinearity is that we have settled to a stationary point of the length map (the variance map). Reread section 2.2 of Resurrecting the Sigmoid, and argue why this is also a good assumption.
  Solution
  
  As described in both Exponential Expressivity in Deep Neural Networks Through Transient Chaos and in section 2.2 of Resurrecting the Sigmoid, the empirical distribution of network pre-activations approximates a $0$-mean, $q^l$-variance Gaussian distribution in the large-width limit. The length map describing the evolution of $q^l$ has a fixed point, and the papers show empirically that $q^l$ converges to it rapidly. Because of this rapid convergence, it is natural to assume that only a few initial layers are not characterized by this variance, and that we can neglect them in computing the spectrum of the network's Jacobian. Conveniently, assuming we are at a fixed point makes $D^2$ independent of $l$, greatly simplifying our analysis.
3. To find the critical points of both the ReLU and hard-tanh networks, recall from problem set 2 that criticality was defined by the condition $\chi = 1$, where $\chi$ is defined in eqn. (5) of the main paper. As in the paper, define $p(q^*)$ as the probability, given the variance $q^*$, that a given neuron in a layer is in its linear (i.e. not constant) regime. Show that $\chi = \sigma_w^2 p(q^*)$.
  
  Hint
  Plug the nonlinearity into the equation for $\chi$ and reduce.
  
  Solution
  
  $$\chi = \sigma_w^2 \int \mathcal{D}h ~\left[\phi'(\sqrt{q^*}h)\right]^2$$ where $\mathcal{D}h$ is the standard Gaussian measure. For both ReLU and hard-tanh, $\phi'$ only takes values in $\{0,1\}$: it equals $1$ in the linear regime and $0$ in the constant regime. The integrand $\left[\phi'\right]^2$ is therefore the indicator function of the linear regime, and the Gaussian integral is the probability that $\phi' = 1$. By definition this probability is $p(q^*)$, so $\chi = \sigma_w^2 p(q^*)$.
4. In terms of $p(q^*)$, what is the spectral density $\rho_{D^2}(z)$ (for both ReLU and hard-tanh networks) of the eigenvalues of $D^2$?
  Solution
  
  $$\rho_{D^2}(z) = (1-p(q^*))\,\delta(z) + p(q^*)\,\delta(z-1)$$ This is a Bernoulli distribution with parameter equal to the probability of being in the linear regime. Both ReLU and hard-tanh are piecewise linear with slope either $0$ or $1$, so each diagonal entry of $D^2$ is either $0$ or $1$. The Dirac deltas let us express this discrete distribution (in this case with two values, $0$ and $1$) as a density.
5. Following equations 7-10 in the main paper, derive the Stieltjes transform $G_{D^2}(z)$, the moment-generating function $M_{D^2}(z)$, and the $S$-transform $S_{D^2}(z)$ in terms of $p(q^*)$. Note: This should be the same for both ReLU and hard-tanh networks.
  Solution
  
  Recall that: $$ \rho_{D^2} (z) = (1-p(q^*)) \delta (z) + p(q^*) \delta(z-1)$$ Then recall the definition $$ G_{D^2} (z) = \int_\mathbb{R} \frac{\rho_{D^2} (t) dt}{z-t} = \frac{1-p(q^*)}{z} + \frac{p(q^*)}{z-1}$$ where each delta function contributes its weight divided by $z - t$, evaluated at its location $t$. Then $$ \begin{aligned} M_{D^2}(z) &= z G_{D^2}(z) - 1 \\ &= z \left(\frac{1-p(q^*)}{z} + \frac{p(q^*)}{z-1}\right) - 1 \\ &= -p(q^*) + \frac{z p(q^*)}{z-1} \\ &= p(q^*) \left(\frac{z}{z-1} - 1\right) \\ &= \frac{p(q^*)}{z-1} \end{aligned} $$ Next use the definition \begin{equation*} S_{D ^2} (z) = \frac{1+z}{z M_{D^2}^{-1} (z)}. \end{equation*} The inverse $M_{D^2}^{-1}(z)$ is $\frac{p(q^*)}{z} + 1$, found by solving $w = \frac{p(q^*)}{z-1}$ for $z$. Thus: $$ S_{D^2}(z) = \frac{1+z}{z \left(\frac{p(q^*)}{z} + 1\right)} = \frac{z+1}{z+ p(q^*)}$$
6. Now that we’ve calculated the transforms we wanted in terms of $p(q^*)$, let us see what the critical point (which determines $q^*$ and $p(q^*)$) looks like for our two nonlinearity options. For ReLU networks, what is $p(q^*)$? Show that this implies that the only critical point for ReLU networks is $(\sigma_w, \sigma_b) = (\sqrt{2},0).$
  Solution
  
  For ReLUs, the nonlinearity is half in the positive linear regime and half at $0$. Assuming $0$-mean symmetric activation distributions, the probability of being in the linear regime is $p(q^*) = \frac{1}{2}$. Using the above result that $ \chi = \sigma_w^2 p(q^*) $ together with the criticality condition $\chi = 1$ immediately tells us that $ \sigma_w^2 = 2 $. Using equation (4) in the Resurrecting the Sigmoid paper, $$q^* = \sigma_w^2 \int \mathcal{D} h ~\phi(\sqrt{q^*}h)^2 + \sigma_b^2$$ and using the fact that $\phi$ is a ReLU, we can write $$q^* = q^* \sigma_w^2 \int_{h>0} \mathcal{D} h~ h^2 + \sigma_b^2.$$ Since the integrand is an even function, it can be evaluated easily: $$q^* = \frac{1}{2} q^* \sigma_w^2 \int \mathcal{D} h~ h^2 + \sigma_b^2.$$ The integral now is the variance of $h$, which is unity by construction, so we simply get $$q^* = \frac{1}{2} q^* \sigma_w^2 + \sigma_b^2.$$ Plugging in $\sigma_w^2=2$ gives $q^* = q^* + \sigma_b^2$, meaning $\sigma_b^2 = 0$.
7. For hard-tanh networks, the behavior is a bit more complex, but we can calculate it numerically. As we saw in problem set 2, for the smooth tanh network there is a 1D curve in the $(\sigma_w, \sigma_b)$ plane which satisfies criticality. The same is true for the hard tanh network, as we’ll now see. We are interested in three quantities, all of which are functions of $\sigma_w$ and $\sigma_b$: $q^*$, $p(q^*)$, and $\chi$. We’ve already seen (in part (c) above) that if we know $\sigma_w$ and $p(q^*)$, we can easily determine $\chi$. It turns out that there is also a simple relation between $q^*$ and $p(q^*)$. Show that for the hard tanh network, $p(q^*) = \mathrm{erf}(1/\sqrt{2q^*})$.
  Solution
  
  For hard-tanh, $p(q^*)$ is the probability that a normally distributed pre-activation takes a value in hard-tanh's linear regime (recall this is between $-1$ and $1$). The pre-activation is a zero-mean Gaussian with variance $q^*$, so $$ p(q^*) = \int_{-1}^{1} \frac{1}{\sqrt{2\pi q^*}} e^{-h^2/(2q^*)} dh. $$ The integral of a Gaussian is given by the error function, defined as $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt$. Substituting $t=h/\sqrt{2q^*}$ and using the symmetry of the integrand gives $$ p(q^*) = \frac{2}{\sqrt{\pi}} \int_0^{1/\sqrt{2q^*}} e^{-t^2} dt = \mathrm{erf}\left(1/\sqrt{2q^*}\right). $$
  
  Now all that’s left is to determine $q^*$ as a function of $\sigma_w$ and $\sigma_b$, and then we can get both $q^*$ and $p(q^*)$. Remember that in problem set 2, you derived the relation
  
  $q^* = \sigma_w^2 \int~ \mathcal{D}h~ \phi(\sqrt{q^*}h)^2 + \sigma_b^2$ Use this relation to get an implicit expression for $q^*$ in terms of $\sigma_w$ and $\sigma_b$.
  
  Solution
  
  $$ q^* = \sigma_w^2 \int~ \mathcal{D}h~ \phi(\sqrt{q^*}h)^2 + \sigma_b^2 $$ The hard-tanh nonlinearity squares to $q^* h^2$ when $|\sqrt{q^*} h|\leq 1$, and otherwise squares to unity. Writing the integrand as $1$ everywhere plus a correction of $q^*h^2 - 1$ on the linear regime, we can immediately write $$ q^* = \sigma_w^2 \left[ 1 + \int_{-1/\sqrt{q^*}}^{+1/\sqrt{q^*}} \frac{q^*h^2 - 1}{\sqrt{2\pi}} e^{-h^2/2} dh \right] + \sigma_b^2 $$
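The implicit relation for $q^*$ can be solved by simply iterating the length map. The sketch below is one way to do this, under stated assumptions: the function names are ours, the values of $\sigma_w$ and $\sigma_b$ are arbitrary examples rather than a critical point, and the Gaussian integral is evaluated in closed form using $\int_{-a}^{a} h^2 \, \mathcal{D}h = \mathrm{erf}(a/\sqrt{2}) - 2a\,e^{-a^2/2}/\sqrt{2\pi}$ with $a = 1/\sqrt{q}$.

```python
import math

def hardtanh_p(q):
    """Probability that a N(0, q) pre-activation lies in the linear regime [-1, 1]."""
    return math.erf(1.0 / math.sqrt(2.0 * q))

def hardtanh_length_map(q, sigma_w, sigma_b):
    """One step of q -> sigma_w^2 * E[phi(sqrt(q) h)^2] + sigma_b^2 for hard-tanh."""
    a = 1.0 / math.sqrt(q)                         # |h| <= a is the linear regime
    p = math.erf(a / math.sqrt(2.0))               # equals hardtanh_p(q)
    gauss_a = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    second_moment_inside = p - 2.0 * a * gauss_a   # integral of h^2 Dh over |h| <= a
    return sigma_w ** 2 * (q * second_moment_inside + (1.0 - p)) + sigma_b ** 2

def fixed_point(sigma_w, sigma_b, q0=1.0, iters=500):
    """Iterate the length map from q0; a plain fixed-point iteration."""
    q = q0
    for _ in range(iters):
        q = hardtanh_length_map(q, sigma_w, sigma_b)
    return q

sigma_w, sigma_b = 1.5, 0.3          # example values, not a critical point
q_star = fixed_point(sigma_w, sigma_b)
p_star = hardtanh_p(q_star)
chi = sigma_w ** 2 * p_star          # chi = sigma_w^2 p(q^*) from the earlier part
print(q_star, p_star, chi)
```

Scanning $\sigma_w$ for fixed $\sigma_b$ and looking for `chi` equal to $1$ would trace out the critical curve in the $(\sigma_w, \sigma_b)$ plane.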
Can Gaussian initialization achieve dynamical isometry?

In this problem, we will consider weights with a Gaussian initialization, and use the results from the previous problems to investigate whether dynamical isometry can be achieved for such nets over our two main activation functions of interest (ReLU and hard-tanh).
1. As we’ve seen in the decomposition from the previous problems, the $S$-transform of $\mathbf{J} \mathbf{J}^T$ depends on the $S$-transform of $D^2$, which was computed above, and that of $WW^T$, which is a Wishart random matrix, i.e. the product of a random Gaussian matrix with its own transpose.
  
  Prove that $S_{WW^T}(z) = \frac{1}{\sigma_w^2 \cdot (z + 1)}$, using the following connection between the moments of a Wishart matrix and the Catalan numbers: $m_k = \frac{\sigma_w^{2k}}{k + 1} {2k \choose k}$ where $m_k$ is the $k^\text{th}$ moment of $WW^T$.
  
  Solution
  
  Given the moments, we can easily form the moment-generating function $$ M_{WW^T}(z) := \sum_{k = 1}^\infty \frac{m_k}{z^k} = \sum_{k = 1}^\infty \left( \frac{\sigma_w^2}{z} \right)^k \frac{1}{k + 1} {2k \choose k} = \sum_{k = 1}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_k $$ where $C_k$ is the $k^\text{th}$ Catalan number. So, we can now exploit the defining recurrence relation for the Catalan numbers, that $C_k = \sum_{j = 0}^{k - 1} C_j C_{k - j - 1}$ (if you think of the $k^\text{th}$ Catalan number as the number of ways to balance $2k$ parentheses, this recurrence is pretty intuitive). To start off, this recurrence starts with $C_0$, though our MGF does not, and this might make the calculation more difficult; let's temporarily work with $$ f(z) := \sum_{k = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_k = 1 + M_{WW^T}(z) $$ Next, the recurrence is in a sum of products of Catalan numbers; specifically, products whose indices have a constant sum. Seeing as $f(z)$ is basically an infinitely long polynomial in $\sigma_w^2/z$, and polynomial multiplication also involves such product sums, a good first attempt to apply this recurrence is to square our function.
  Indeed, we have: $$ f(z)^2 = \sum_{k = 0}^\infty \sum_{j = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^{k + j} C_k C_j $$ which after collecting like terms, is $$ f(z)^2 = \sum_{k = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^k \sum_{j = 0}^{k} C_j C_{k - j} = \sum_{k = 0}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_{k + 1} $$ Thus, $$ \frac{\sigma_w^2}{z} f(z)^2 = \frac{\sigma_w^2}{z} \left( M_{WW^T}(z) + 1 \right)^2 = \sum_{k = 1}^\infty \left( \frac{\sigma_w^2}{z} \right)^k C_k = M_{WW^T}(z) $$ Solving the quadratic equation, and taking the root that vanishes as $z \to \infty$, yields $$ M_{WW^T}(z) = \frac{z}{2 \sigma_w^2} - 1 - \frac{z}{2 \sigma_w^2} \sqrt{1 - \frac{4 \sigma_w^2}{z}} $$ Since $M_{WW^T}$ satisfies a quadratic equation, inverting it is easy enough: solving $M = \frac{\sigma_w^2}{z}(M + 1)^2$ for $z$ and relabeling the variable, we are left with $$ M_{WW^T}^{-1}(z) = \sigma_w^2 \frac{(z + 1)^2}{z} $$ and so, from the definition of the $S$-transform, $$ S_{WW^T}(z) = \frac{1 + z}{z M_{WW^T}^{-1}(z)} = \left( \sigma_w^2 \cdot (z + 1) \right)^{-1} $$
2. We now have enough pieces to begin attacking the calculation of the Jacobian singular value distribution - recall that due to the decomposition
  \[S_{JJ^T} = (S_{WW^T})^L \cdot (S_{D^2})^L\]
  once we’ve calculated the $S$-transforms for $D^2$ and $WW^T$, we can easily obtain the $S$-transform of $\mathbf{J} \mathbf{J}^T$.
  
  Using your solution to the previous part and the calculation of $S_{D^2}$ from the earlier problems, show that
  \[S_{JJ^T} = \sigma_w^{-2L} \cdot (z + p(q^*))^{-L} .\]
  Solution
  
  We calculated $S_{D^2}$ in part (e) of problem 3, showing that $$S_{D^2}(z) = \frac{z+1}{z+ p(q^*)}$$ And from the previous part we know that $$S_{WW^T}(z) = \frac{1}{\sigma_w^2(z+1)},$$ so combining these gives $$ S_{JJ^T} = (S_{WW^T})^L (S_{D^2})^L = \left( \sigma_w^{-2} (1 + z)^{-1} \right)^L \left( \frac{1 + z}{z + p(q^*)} \right)^L = \sigma_w^{-2L} (z + p(q^*))^{-L} $$
3. From the $S$-transform, one route to getting information about the spectrum of $JJ^T$ is to compute the spectral density $\rho_{JJ^T}(\lambda)$. While that calculation is too involved, we can get the answer to the question of achieving dynamical isometry by a slightly more indirect route.
  
  Use the $S$-transform you calculated above to calculate $M_{JJ^T}^{-1}$ (the inverse of the moment-generating function for $\mathbf{J} \mathbf{J}^T$).
  
  Hint
  To compute the inverse MGF, recall the definition of the $S$-transform given in the paper (section 2.3, eqn. 10).
  
  Solution
  
  The $S$-transform is defined so that $S_{JJ^T} = \frac{1 + z}{z M^{-1}_{JJ^T}(z)}$, so $$ M^{-1}_{JJ^T}(z) = \frac{1 + z}{z S_{JJ^T}(z)} = \frac{1 + z}{z} \left(z + p(q^*)\right)^L \sigma_w^{2L} $$
4. We can now compute the variance of the $JJ^T$ eigenvalue distribution, $\sigma_{JJ^T}^2$. You should have calculated above that
  \[M_{JJ^T}^{-1}(z) = \frac{1 + z}{z} \cdot (z + p(q^*))^L \cdot \sigma_w^{2L}\]
  Use the definition
  \[M_{JJ^T}(z) = \sum_{k = 1}^\infty \frac{m_k}{z^k}\]
  and the expression for the functional inverse of $M_{JJ^T}$ to show that the first two moments are
  
  $m_1 = \sigma_w^{2L} p(q^*)^L$ $m_2 = m_1^2 \cdot \frac{L + p(q^*)}{p(q^*)}$
  
  Hint
  Use the Lagrange inversion theorem (eqn. 18 in the paper) to obtain a power series for the inverse MGF and equate corresponding coefficients with our calculated expressions.
  
  Solution
  
  Note that we have a formula for $M^{-1}(z)$ (suppressing the $JJ^T$ subscript for clarity), but the moments are defined in terms of $M(z)$. In the paper, the Lagrange inversion theorem is used to express the constant and $1/z$ coefficients of $M^{-1}(z)$ in terms of $m_1$ and $m_2$. Here is a slightly hand-wavy proof of that result (the rigorous proof turns out to be quite difficult):
  Assume that $M^{-1}(z)$ can be written as a Taylor series with an additional $1/z$ term (this assumption is one of the weaknesses of this proof). So $$ M^{-1}(z) = \frac{a}{z} + b + cz + dz^2 + \cdots $$ Since we know that $$ M(z) = \frac{m_1}{z} + \frac{m_2}{z^2} + \cdots, $$ we can write $$ z = \frac{m_1}{M^{-1}(z)} + \frac{m_2}{M^{-1}(z)^2} + \cdots $$ Plugging in our ansatz above gives $$ z = \frac{m_1}{\left( \frac{a}{z} + b + cz + dz^2 + \cdots \right)} + \frac{m_2}{\left( \frac{a}{z} + b + cz + dz^2 + \cdots \right)^2} + \cdots $$ We'll expand the RHS of the above equation assuming $z$ to be small, and then equate coefficients of the RHS and LHS. Specifically, we will expand the RHS to second order in $z$. Factoring $a/z$ out of each denominator gives $$ \begin{aligned} z &= \frac{m_1 z}{a} \bigg( 1 + (b/a)z + (c/a)z^2 + \cdots \bigg)^{-1} + \frac{m_2 z^2}{a^2} \bigg( 1 + (b/a)z + (c/a)z^2 + \cdots \bigg)^{-2} + \cdots \\ &= \frac{m_1}{a} z - \frac{m_1 b}{a^2} z^2 + \frac{m_2}{a^2} z^2 + O(z^3) \end{aligned} $$ Since the coefficient of $z$ above has to be unity, and the coefficient of $z^2$ has to be zero, this implies that $a = m_1$ and $b = m_2/m_1$. This implies that our sought-after expression for $M^{-1}(z)$ is $$M^{-1}(z) = \frac{m_1}{z} + \frac{m_2}{m_1} + \cdots$$ With this expression in hand, we can directly extract the constant and $1/z$ coefficients of the function $M^{-1}(z)$: Given the result of our earlier calculation that $$ M_{JJ^T}^{-1}(z) = \left(1+\frac{1}{z}\right) \cdot (z + p(q^*))^L \cdot \sigma_w^{2L}, $$ we see that the only place to get a $1/z$ term here is from the constant term when the $(z+p(q^*))^L$ is expanded. This constant term will simply be $p(q^*)^L$, so the $1/z$ term here, which is $m_1$, is $$ m_1 = \sigma_w^{2L} p(q^*)^L $$ The constant term comes from two places. One, the $p(q^*)^L$ multiplies the $1$ in the first factor; two, the term $Lzp(q^*)^{L-1}$, coming from the binomial expansion, multiplies the $1/z$ in the first factor.
  So this means that the constant coefficient, $m_2/m_1$, is given by $$ \frac{m_2}{m_1} = \sigma_w^{2L} p(q^*)^L + \sigma_w^{2L} L p(q^*)^{L-1}. $$ Recognizing the first term in the RHS sum as $m_1$, we can factor to get $$ \frac{m_2}{m_1} = m_1 \left( 1 + \frac{L}{p(q^*)} \right), $$ or, $$ m_2 = m_1^2 \left( 1 + \frac{L}{p(q^*)} \right), $$ as desired.
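The two moment formulas can be compared against a direct simulation. The following is a rough sketch under explicit assumptions, not the paper's experiment: the helper `jacobian_moments` is ours, the weights are i.i.d. Gaussian with variance $\sigma_w^2/N$, and each $D^l$ is modeled as a random $0/1$ diagonal with exactly $pN$ ones drawn independently of the weights (a stand-in for the free-independence assumption, rather than the derivative of an actual nonlinearity). Agreement is only expected up to finite-$N$ corrections.

```python
import numpy as np

def jacobian_moments(N, L, sigma_w, p, rng):
    """Return (m1, m2) of J J^T for J = D_L W_L ... D_1 W_1."""
    J = np.eye(N)
    n_active = int(round(p * N))
    for _ in range(L):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        d = np.zeros(N)
        d[rng.permutation(N)[:n_active]] = 1.0   # exactly p*N units in the linear regime
        J = (d[:, None] * W) @ J                 # left-multiply by D W
    JJt = J @ J.T
    m1 = np.trace(JJt) / N
    m2 = np.sum(JJt * JJt) / N                   # tr((JJ^T)^2)/N, since JJ^T is symmetric
    return m1, m2

rng = np.random.default_rng(0)
N, L, sigma_w, p = 400, 4, np.sqrt(2.0), 0.5     # ReLU-like critical values
samples = np.array([jacobian_moments(N, L, sigma_w, p, rng) for _ in range(4)])
m1_emp, m2_emp = samples.mean(axis=0)

m1_theory = sigma_w ** (2 * L) * p ** L
m2_theory = m1_theory ** 2 * (1.0 + L / p)
print(m1_emp, m1_theory)
print(m2_emp, m2_theory)
```

With these values $m_1 = 1$ while $m_2 = 1 + L/p$ grows linearly with depth, so the variance $m_2 - m_1^2$ of the spectrum cannot stay small for deep Gaussian-initialized networks. Setting `L = 1` and `p = 1` reduces the check to the Wishart moments $m_1 = \sigma_w^2$ and $m_2 = 2\sigma_w^4$.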

6 Experimental Results & Future Work.

Motivation: Before wrapping up, we will programmatically validate the theoretical results we derived above. You can find starter code in an IPython notebook here.

Objectives:

Experimentally confirm the linear dependence of the singular value spectrum of a neural net’s Jacobian under various random initializations.
Experimentally confirm the positive impact of dynamical isometry at initialization on the trainability of a neural net.

Follow-up reading: Here are a couple papers you might enjoy, which build upon the results of this paper:

Stein Variational Gradient Descent

2020-03-02T10:00:00+00:00

[Editor’s Note: This class was a part of the 2019 DFL Jane Street Fellowship.]

This guide is thanks to many different people, all of whom took their time to give feedback, write reviews, and provide their own insights to the curriculum.

Special thanks to Cinjon Resnick, who was incredibly helpful throughout the iterations of the class, curriculum, and final notes. A special thanks as well to Professor Qiang Liu, who took the time to help shape the curriculum.

Thank you to Calvin Woo, Sanyam Kapoor, Thomas Pinder, Swapneel Mehta, and Avital Oliver for useful contributions to this guide, as well as countless insights during our discussions.

A special thanks to the many outside guests who offered to provide their time, including Dilin Wang, Tongzheng Ren, and Haoran Tang.

Finally, thank you to all my fellow students who attended the recitations and provided valuable feedback.

Concepts used in SVGD.

Why

Stein’s Method is a powerful statistical method, one that is at the disposal (and the focus) of many statisticians today. Recently, Stein’s Method has made its way into machine learning and has already proved to be a fruitful research area. Stein’s Method has deep connections to many machine learning problems of interest, and by the end of this guide, you should be able to understand the relevant mathematics behind this powerful tool.

1 Basics Behind Kernelized Stein Discrepancy

Motivation: Before jumping into all the math and methodology, we have to be able to understand the basics of what’s going on. Most importantly, we will review the basics of measure theory and reproducing kernel Hilbert spaces. Measure theory allows us to understand the notion of discrepancy measures between distributions, which we will use later on to quantify the difference between two arbitrary distributions of interest. Our other topic, Reproducing Kernel Hilbert Spaces (RKHS), will serve as the connection between measure theory and a practical machine learning algorithm. With RKHS, we will be able to define and optimize intractable measures which previously were only useful for theoretical analysis or a restrictive class of functions. These two together set the foundation for defining a tractable Kernelized Stein Discrepancy, which serves as the driving factor behind Stein Variational Gradient Descent.

Topics:

Measure Theory
Kernels
Reproducing Kernel Hilbert Space
Machine Learning Basics

Notes: In this class, we went over the basic mathematical concepts we will need throughout the rest of the curriculum. See here for the notes in Colab or here for the PDF.

Required Reading:

Optional Reading:

Questions:

“However, Cauchy sequences are not the same as convergent sequences”, but a property of Cauchy sequences is that they are bounded. What’s the difference?
Solution

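Convergent sequences have a limit in the space. Cauchy sequences are only required to have terms that eventually become arbitrarily close to one another, and in an incomplete space they need not converge. Every Cauchy sequence is bounded, though. But what exactly does bounded mean? Here's a proof that shows that they are bounded, which might shed some light on the definition itself: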

a. There exists $N$ such that $|a_n - a_m| < 1 \quad \forall m, n \geq N$ (Property of Cauchy Sequence iterates getting closer)
b. $\implies \forall n \geq N, |a_n - a_N| < 1$
c. $a_n \in (a_N - 1, a_N + 1) \quad \forall n \geq N$ (so the tail $n \geq N$ is bounded).
d. The set of terms with $n < N$ is finite (since $N$ is finite), so it is also bounded.
Therefore the Cauchy sequence $\{ a_n \}$ is bounded $\square$
“The open interval (0, 1) is not complete whereas the closed interval [0, 1] is complete.” Why? Can we use this example to get an intuitive definition of complete?
Solution

Intuitively, a space is complete if there are no "points missing" from it (inside or at the boundary). For instance, the sequence $1/n$ (for $n \geq 2$) lies in $(0, 1)$ and is Cauchy, but its limit $0$ is missing from $(0, 1)$; the closed interval $[0, 1]$ contains that limit. Similarly, the set of rational numbers is not complete, because e.g. $\sqrt{2}$ is "missing" from it, even though one can construct a Cauchy sequence of rational numbers that converges to it. More information can be found at Wikipedia: Complete Metric Space.
Explain the difference between a Banach and Hilbert Space. Is every Hilbert space a Banach space?
Solution

A Banach space is a vector space in which each vector has a non-negative length, or norm, and in which every Cauchy sequence converges to a point of the space; it is also known as a complete normed linear space. A Hilbert space is a Banach space whose norm is defined by an inner product. So every Hilbert space is a Banach space, but not every Banach space is a Hilbert space.
In Machine Learning, kernels can be thought of as a “dot product” (a kind of similarity score) in high-dimensional space. Why would this be useful? Given a feature map, do we always have a corresponding kernel? Given any kernel, can we always explicitly write out the elements of the corresponding feature map?
Solution

Kernels (and the corresponding kernel trick) allow us to compute similarities in high-dimensional space without explicitly writing out the high-dimensional features and computing their dot product. Any feature map $\phi$ into an inner-product space defines a kernel via $k(x, y) = \langle \phi(x), \phi(y) \rangle$. The converse is more delicate: a function only qualifies as a kernel if it has certain properties (symmetry and positive semi-definiteness), and even when it does, it may be the case that we can never write out (explicitly) the corresponding feature map as a finite vector. A good example of this is the popular exponential kernel, whose feature map is infinite-dimensional.
Assume that we just need the log-likelihood in many machine learning tasks so that we can compute $KL(q||p)$, and iteratively fit our model $p$ to the underlying, generating data distribution $q$. Why is this already too large of an assumption (“We assume that we have the ability to calculate the log-likelihood under the model that we specify”)?
Solution

The dreaded normalization constant! Most models we see will give an unnormalized likelihood, and the normalization constant (which we will see in a few weeks, often denoted as $Z$) is intractable to compute. We need the normalization constant to turn an unnormalized function into a proper probability density function.
What is the use of Monte-Carlo methods in machine learning?
Solution

They are a way to estimate quantities in the presence of complex, many-random-variable situations. They do so by repeatedly generating (via simulation) instances from which they estimate the quantities.
Explain the reproducing property in your own words.
Solution

Sanyam Kapoor's answer from our class was: "Every feature map is a linear combination of the full Hilbert space weighted by the kernel evaluations."
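To make the kernel question above concrete, here is a small sketch of the kernel trick for one kernel where the feature map can be written out. The choice of the degree-2 polynomial kernel $k(x, y) = (x \cdot y)^2$ and the function names are our own illustration, not taken from the class notes. Its feature map lists all pairwise products $x_i x_j$, so the explicit inner product and the kernel evaluation agree.

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel, evaluated without building any features."""
    return float(np.dot(x, y)) ** 2

def poly2_features(x):
    """Explicit feature map: all pairwise products x_i * x_j, flattened."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

k_implicit = poly2_kernel(x, y)
k_explicit = float(np.dot(poly2_features(x), poly2_features(y)))
print(k_implicit, k_explicit)
```

The kernel evaluation touches only the $3$ input coordinates, while the explicit route builds $9$ features. For kernels with an infinite-dimensional feature map, only the implicit route is available.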

2 Stein’s Method

Motivation: Most of the theory we will see in this curriculum builds off the general theoretical framework of Stein’s Method, a tool to obtain bounds on distances between distributions. In Machine Learning (as we shall later see), distances between distributions can be used to quantify how well (or poorly) a model approximates a certain distribution of interest. We shall start from Stein’s Identity and Operator, while explaining their theoretical significance and working through some proofs to get an understanding of some terms (Stein’s Method, Stein’s Discrepancy) we’ll see in the coming weeks. Lastly, we will discuss why Stein’s Method has historically been a theoretical tool, and hint at how ideas from Week 1 (particularly RKHS) can be used in combination with Stein’s Method to build the tractable discrepancy measure at the center of Week 3’s discussion.

Topics:

Stein’s Method
The Stein Operator
Stein Equation
Stein’s Identity

Notes: In this class, we discussed the theoretical concepts behind Stein’s method, and discussed different ways to interpret the core ideas. See here for the notes in Colab or here for the PDF.

Required Reading:

Optional Reading:

Questions:

Prove Stein’s Identity for a standard Gaussian random variable $Z$.
Solution

Recall that Stein's Identity tells us that for a unit-normal random variable $Z$ (i.e. $Z \sim \mathcal{N}(0, 1)$): $$ \mathbf{E}f'(Z) = \mathbf{E}Zf(Z)$$ for all absolutely continuous functions $f$ with $ \mathbf{E}|f'(Z)| < \infty $. To start, we state, without proof, that the density function of the unit normal Gaussian: $$ p(z) = \frac{1}{\sqrt{2\pi}}e^{\frac{-z^2}{2}} $$ satisfies $ p'(z) = -zp(z) $, and therefore $p(z) = \int_z^\infty yp(y)dy = -\int_{-\infty}^z yp(y)dy$. For standard normal $Z$, we can break the left hand side of the original identity into two integrals: $$\mathbf{E}f'(Z) = \int_0^\infty f'(z)p(z)dz + \int_{-\infty}^0 f'(z)p(z)dz $$ For each integral on the right-hand side, we substitute this expression for $p(z)$ and use Fubini's Theorem to swap the order of integration. For the first integral: $$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty f'(z) \int_z^\infty yp(y)dydz $$ $$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty \int_z^\infty f'(z)yp(y)dydz $$ $$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty \int_0^y f'(z)yp(y)dzdy $$ Leading us to our final integral: $$ \int_0^\infty f'(z)p(z)dz = \int_0^\infty [f(y) - f(0)] yp(y)dy $$ The second integral evaluates in the same way to $ \int_{-\infty}^0 [f(y) - f(0)] yp(y)dy $. When we combine each individual result, we get: $$ \mathbf{E}f'(Z) = \mathbf{E}Z[f(Z) - f(0)] = \mathbf{E}Zf(Z)$$ where the last step uses $\mathbf{E}Z = 0$. This proves the forward direction.
Explain why Stein’s Identity is useful.
Solution

Stein's Identity holds in the converse as well; if the identity holds for a random variable, which we call $W$, we can conclude that $W$ is also normal. Moreover, if the two quantities in Stein's Identity are approximately equal, then Stein's Identity also lets us conclude that $W$ is approximately normal. Stein's Identity and Method are used to quantify this "approximately" term, which we briefly discuss below. Probability metrics (between two random variables $X$ and $Y$) take the general form of: $$d(X, Y) = \sup_{h \in \mathcal{H}} | \mathbf{E}h(X) - \mathbf{E}h(Y) |$$ for some class of functions $ \mathcal{H} $. We normally want to bound the distances between the corresponding distribution functions $P$ and $Q$, but that choice is less important for this brief discussion. When we choose different classes of functions, we can recover various distances that we often use (in machine learning) to compare probability distributions, such as the Kolmogorov or Wasserstein distance. We get to the Stein Discrepancy by measuring the distance from $W$ to our standard normal $Z$ via: $$ \mathbf{E}h(W) - \mathcal{N}h $$ where $\mathcal{N}h = \mathbf{E}h(Z)$ for $h \in \mathcal{H}$. Stein's Identity tells us that the discrepancy can also be measured by: $$ \mathbf{E}[f'(W) - Wf(W)]$$ which, when we evaluate at $w$, gives us the Stein Equation: $$ f'(w) - wf(w) = h(w) - \mathcal{N}h $$ Since we're trying to bound $\mathbf{E}h(W) - \mathcal{N}h$, we can now instead bound the LHS, which turns out to be a lot easier once we account for all of the boundary conditions.
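A quick Monte Carlo experiment illustrates both directions of this discussion. The sketch below is our own illustration with an assumed test function $f = \sin$ and an assumed comparison distribution: it estimates $\mathbf{E}[f'(W) - Wf(W)]$ for standard normal samples, where the identity says the answer is $0$, and for uniform samples on $[-\sqrt{3}, \sqrt{3}]$, which have mean $0$ and variance $1$ but are not normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
f, f_prime = np.sin, np.cos   # an arbitrary smooth test function and its derivative

def stein_gap(samples):
    """Monte Carlo estimate of E[f'(W) - W f(W)] from samples of W."""
    return float(np.mean(f_prime(samples) - samples * f(samples)))

gap_normal = stein_gap(rng.standard_normal(n))
# Uniform on [-sqrt(3), sqrt(3)]: same mean and variance as Z, but not Gaussian.
gap_uniform = stein_gap(rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n))

print(gap_normal)
print(gap_uniform)
```

For the uniform samples the gap can be computed by hand as $\cos(\sqrt{3})$, which is clearly nonzero, so this single test function already detects that $W$ is not normal even though its first two moments match.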

3 Kernelized Stein Discrepancy

Motivation: The main theoretical meat comes from a single 2016 paper titled Kernelized Stein Discrepancy (KSD). KSD takes the powerful Stein’s Identity, and uses RKHS theory to define a tractable discrepancy between a ground truth distribution and samples from an arbitrary one. Most importantly, KSD defines a discrepancy function that does not involve calculating the normalizing constant, allowing it to be much more widely applicable in practical tasks. We will discuss the difference between likelihood-free and likelihood-based methods in machine learning, how this normalization constant proves to be problematic in machine learning, and how KSD allows us to sidestep this issue with a new, tractable discrepancy. KSD will serve as the launch pad for the algorithm at the focus of this curriculum, Stein Variational Gradient Descent.

Topics:

A Stein Discrepancy
Goodness of Fit
Tractable Optimization of the Stein Discrepancy

Notes: In this class, we worked through the Kernelized Stein Discrepancy paper, focusing on the optimization and use cases of such a method. See here for the notes in Colab or here for the PDF.

Required Reading:

Optional Reading:

Although we focus on the work leading up to Stein Variational Gradient Descent, this week’s optional reading provides historical context on how Stein’s Method was introduced into the context of machine learning.

The first reference, from Gorham and Mackey, introduced the notion of a Stein Discrepancy. Kernelized Stein Discrepancy, the paper of focus for this week, built upon that idea with kernels, enabling the use of kernel functions in the Stein Discrepancy. The latter two references are also works that independently developed kernel-based Stein Discrepancies.

Questions:

What determines the choice of kernel in KSD?

Solution

Since KSD requires an RKHS for optimization, the kernel must be positive definite. However, whenever given a positive definite kernel $K$, we can always build an associated RKHS as follows. If we take $H$ as the Hilbert space of functions $f: \mathcal{X} \rightarrow \mathbf{R}$ defined on some set $\mathcal{X}$ with some inner product $ \langle \cdot, \cdot \rangle_H $ defined on $H$, then we can define the evaluation functional $e_x: H \rightarrow \mathbf{R}$ as $f \rightarrow e_x(f) = f(x) $. Using the above definitions, our space $ H$ is an RKHS iff the evaluation functionals are continuous. As we saw in the notes, we call the given kernel $K$ a reproducing kernel if:
1. $K_x := K(x, \cdot) \in H \; \forall x \in \mathcal{X}$
2. $\langle f, K_x \rangle_H = f(x) \; \forall f \in H, \forall x \in \mathcal{X}$.
Thus, every reproducing kernel $ K$ induces a unique RKHS given the kernel is positive definite. Excitingly, in the context of machine learning, positive definite kernels themselves can be defined in terms of inner products. Therefore, we can generate arbitrary kernels and RKHS with some feature map $ \Phi: \mathcal{X} \rightarrow \mathcal{F}$ where feature space $ \mathcal{F}$ is a Hilbert space with some inner product $ \langle \cdot, \cdot \rangle $.
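To make the last point concrete, here is a minimal sketch (not taken from the notes) of a kernel built from a feature map. The map `feature_map`, the sample points and the coefficients are all hypothetical choices for illustration. The quadratic form $\sum_{i,j} c_i c_j K(x_i, x_j)$ equals $\| \sum_i c_i \Phi(x_i) \|^2$, which is why such a kernel is automatically positive (semi-)definite.

```python
def feature_map(x):
    """Hypothetical feature map Phi: R -> R^3, chosen only for illustration."""
    return (1.0, x, x * x)


def kernel(x, y):
    """K(x, y) = <Phi(x), Phi(y)>, an inner product in feature space."""
    return sum(a * b for a, b in zip(feature_map(x), feature_map(y)))


points = [-1.5, -0.3, 0.4, 2.0]   # arbitrary sample points
coeffs = [0.7, -1.2, 0.5, 2.0]    # arbitrary coefficients c_i

# Quadratic form sum_ij c_i c_j K(x_i, x_j).
quad_form = sum(
    ci * cj * kernel(xi, xj)
    for ci, xi in zip(coeffs, points)
    for cj, xj in zip(coeffs, points)
)

# The same number computed as || sum_i c_i Phi(x_i) ||^2, which cannot be negative.
combo = [sum(c * feature_map(x)[k] for c, x in zip(coeffs, points)) for k in range(3)]
squared_norm = sum(v * v for v in combo)

print(quad_form, squared_norm)
```

Any other choice of points and coefficients gives a non-negative value for the same reason.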

4 Stein Variational Gradient Descent

Motivation: Stein Variational Gradient Descent (SVGD) is a popular, non-parametric Bayesian Inference algorithm that’s been applied to Variational Inference, Reinforcement Learning, GANs, and much more. This week, we study the algorithm in its entirety, building off of last week’s work on KSD, and seeing how viewing KSD from a KL-Divergence-minimization lens induces a powerful, practical algorithm. We discuss the benefits of SVGD over other similar approximators, and look at a practical implementation of the algorithm.

Topics:

Stein Variational Gradient Descent
Implementing the Algorithm

Notes: In this class, we go over the core paper, Stein Variational Gradient Descent. At the end of the notes, we provide link to implementations in a variety of different languages. See here for the notes in Colab or here for the PDF.

Required Reading:

Optional Reading:

Questions:

Compare and contrast the method shown here and MCMC. What are some advantages MCMC still has over SVGD?
Solution

Below are some ideas we discussed in our class.
1. SVGD requires a compact subspace $ \mathcal{X} $, and as noted here in Chen '19, requires the number of particles to be fixed a priori.
2. SVGD has a lot less theoretical understanding compared to MCMC (which is potentially due to the recency of the result). SVGD has had analysis done in the infinite-particle regime, but minimal work done in finite-particle scenarios (an example of such work can be found here). A concern of theoretical analysis is the complexity of analyzing the interacting particle updates, so the works covered here either view it from a dynamical systems / differential equation perspective (which concerns the smooth transformation of density), or discuss the properties of the final particles, regardless of how they were algorithmically attained.
3. SVGD still seems to collapse in high-dimensional spaces, leading to exciting new research in why this occurs and ideas on how to get around it.
Prove that the discrepancy in Equation 3 of the Stein Variational Gradient Descent Paper only equals 0 when $p$ and $q$ are equal.
Solution

Recall the operator definition of Stein's Identity: $$ \mathbf{E}_p[\mathcal{A}_pf(x)] = 0$$ where the Stein operator is $$\mathcal{A}_pf(x) = s_p(x)f(x) + \nabla_x f(x)$$ and the score function $ s_p(x) $ is just $ \nabla_x \log p(x) $. If $ p \neq q $, we consider $ \mathbf{E}_q[\mathcal{A}_pf(x)] $ for some choice of function $ f $. Since $ \mathbf{E}_q[\mathcal{A}_qf(x)] = 0 $ by Stein's Identity, we can write this as: $$\mathbf{E}_q[\mathcal{A}_pf(x)] - \mathbf{E}_q[\mathcal{A}_qf(x)]$$ The $ \nabla_x f(x) $ terms cancel, and we are left with $$\mathbf{E}_q[(s_p(x) - s_q(x))f(x)]$$ This means that unless $ s_p(x) = s_q(x) \; \forall x \in \mathcal{X} $ (which, for normalized densities, is equivalent to $p = q$), we can always find some function $f$ for which the above quantity is nonzero.
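As a quick numerical sanity check of this argument (a sketch under our own assumptions, not part of the original notes), the snippet below takes $p = \mathcal{N}(0,1)$, $q = \mathcal{N}(\mu,1)$ and the test function $f(x) = \sin x$, all chosen purely for illustration. It evaluates $\mathbf{E}_q[\mathcal{A}_pf(x)]$ with a simple trapezoidal rule: the value should vanish for $\mu = 0$ and equal $\mathbf{E}_q[(s_p - s_q)f] = -\sin(\mu)e^{-1/2}$ for $\mu = 1$.

```python
import math


def normal_pdf(x, mu=0.0):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def expect(g, mu, lo=-12.0, hi=12.0, n=20000):
    """Trapezoidal approximation of E_q[g(X)] for q = N(mu, 1)."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) * normal_pdf(lo, mu) + g(hi) * normal_pdf(hi, mu))
    for i in range(1, n):
        x = lo + i * h
        total += g(x) * normal_pdf(x, mu)
    return total * h


def stein_op(x):
    """A_p f(x) = s_p(x) f(x) + f'(x) for p = N(0, 1) and f = sin."""
    score_p = -x  # d/dx log p(x) for the standard normal
    return score_p * math.sin(x) + math.cos(x)


same = expect(stein_op, mu=0.0)     # q = p: Stein's Identity gives 0
shifted = expect(stein_op, mu=1.0)  # q != p: nonzero for this f
print(same, shifted)
```

Here $s_p(x) - s_q(x) = -\mu$ is constant, so a single well-chosen $f$ is enough to separate the two distributions.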
Implement SVGD in your favorite language (see the notes for links to different implementations). Then, let’s take a look at the role of the kernel in SVGD:
- Remove the repulsive kernel term and observe how particles collapse to modes.
- Remove the kernel’s contribution in the first term.
What happens?
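A minimal one-dimensional sketch of the first ablation, in pure Python, is given below. It is not one of the implementations linked in the notes. The target $\mathcal{N}(2, 1)$, the fixed bandwidth `h`, the step size and the particle initialisation are all assumptions made for illustration. The `repulsive` flag switches off the $\nabla_{x_j} k(x_j, x_i)$ term so that the two behaviours can be compared.

```python
import math


def rbf(x, y, h=1.0):
    """RBF kernel with a fixed (assumed) bandwidth h."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))


def score(x, mean=2.0):
    """Gradient of log p for the toy target p = N(mean, 1)."""
    return mean - x


def svgd_step(particles, step=0.1, h=1.0, repulsive=True):
    """One SVGD update: x_i <- x_i + step * mean_j [k * score + grad k]."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            phi += k * score(xj)                # driving term: kernel-weighted score
            if repulsive:
                phi += k * (xi - xj) / (h * h)  # repulsive term: d/dxj k(xj, xi)
        updated.append(xi + step * phi / n)
    return updated


def run_svgd(repulsive, iters=500, n=10):
    particles = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]  # deterministic start
    for _ in range(iters):
        particles = svgd_step(particles, repulsive=repulsive)
    return particles


def mean(xs):
    return sum(xs) / len(xs)


def spread(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


with_repulsion = run_svgd(repulsive=True)
without_repulsion = run_svgd(repulsive=False)
print(mean(with_repulsion), spread(with_repulsion), spread(without_repulsion))
```

Without the repulsive term the particles drift towards the mode and their spread shrinks, while the full update keeps them spread out around the target mean. The second ablation (removing the kernel from the first term) is left as stated in the exercise.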

5 SVGD as Gradient Flow

Motivation: SVGD as Gradient Flow is one of the first papers that analyzes the dynamics and theoretical properties of SVGD. This paper covers an incredible amount of seemingly-disparate topics, connecting them in a succinct explanation. Due to the relative difficulty of the material, especially the necessary background, the attached notes are self-contained and should be read alongside the paper.

Topics:

Large Sample Regime of SVGD
Continuous Time Analysis of SVGD
Optimal Transport, Wasserstein Distances, and Differential Geometry
SVGD as a Gradient Flow

Notes: In this class, we try to understand the geometric implications of SVGD. The notes are structured relatively differently - with the amount of background needed, relevant material is introduced in-line. As a result, the ideal way to understand this week requires reading the notes alongside the paper, using the background sections to understand the concepts and their connections within the paper. See here for the notes in PDF.

Required Reading:

Stein Variational Gradient Descent as Gradient Flow.

6 Stein in Reinforcement Learning

Motivation: One of the most exciting use cases of SVGD is in reinforcement learning, due to its connection to maximum entropy reinforcement learning. This week, we study two key techniques in reinforcement learning that use SVGD as the underlying mechanism. In reinforcement learning, the target distribution is not known, so we derive gradient updates to our parameters using policy gradients. As we derive the gradient estimators in the maximum-entropy framework of reinforcement learning, we will start to see what benefits SVGD-based methods have. In particular, we will focus on the explore-exploit tradeoff, as well as normalization constants for intractable distributions, and see how SVGD helps us get around complicated problems regarding both.

Topics:

Reinforcement Learning
Explore vs. Exploit
Maximum Entropy Reinforcement Learning

Notes: In this class, we look at the application area of reinforcement learning, and see how the diversity induced by SVGD (and its connection to maximum entropy reinforcement learning) generates strongly-exploring policies. See here for the notes in Colab or here for the PDF.

Required Reading:

Optional Reading:

Questions:

What are some of the issues with using the RBF kernel when comparing RL policies? Is parameter space appropriate for comparing policies?
Solution

While it works in practice, the networks used for particles in the original SVPG paper were reasonably small. With larger numbers of parameters (e.g., those needed when working with image-based observations), parameter-based discrepancies start to make even less sense. This is one of two core ideas that drove the formulation of the Self-Imitating Diverse Policies paper, seen as Resource 4 in Optional Reading.
In SVPG, the introduction of a prior (and priors in RL) is one active area of research. To incorporate priors in this framework, what “space” does the prior need to be over?

Solution

SVPG incorporates a prior over $q $, which is actually a prior over the distribution of particle parameters $\theta$. Since this space is uninterpretable, the prior term is set to be a constant, generating an "improper" prior that, in most use cases, can be dropped from the optimization. Even if you were to use an old set of particles as a prior, the term is basically unusable, because in order to estimate the density of $q$, you'd need to fit high-dimensional (over $ \mathbf{R}^{|\theta|} $) kernel density estimators. In addition, the number of particles used is usually much smaller than the number of parameters each has, making the density estimation an ill-posed problem.
With the code implementation linked in the notes (or your own), ablate on the architecture of each SVPG particle. What types of behavioral differences do you see in the different policies as you increase or decrease the network size? Try adding a second layer instead; for example, how does a 2-layer, 200 neuron-per-layer network compare to a single-layer, 400 neuron particle?

Neural ODEs

2019-09-23T10:00:00+00:00

This guide would not have been possible without the help and feedback from many people.

Special thanks to Prof. Joan Bruna and his class at NYU, Mathematics of Deep Learning, and to Cinjon Resnick, who introduced me to DFL and helped complete this guide.

Thank you to Avital Oliver, Matt Johnson, Dougal MacClaurin, David Duvenaud, and Ricky Chen for useful contributions to this guide.

Thank you to Tinghao Li, Chandra Prakash Konkimalla, Manikanta Srikar Yellapragada, Shan-Conrad Wolf, Deshana Desai, Yi Tang, Zhonghui Hu for helping me prepare the notes.

Finally, thank you to all my fellow students who attended the recitations and provided valuable feedback.

Concepts used in Neural ODEs. Click to navigate.

Why

Neural ODEs are neural network models which generalize standard layer-to-layer propagation to continuous-depth models. Starting from the observation that the forward propagation in neural networks is equivalent to one step of the discretisation of an ODE, we can construct and efficiently train models via ODEs. On top of providing a novel family of architectures, notably for invertible density models and continuous time series, neural ODEs also provide a memory efficiency gain in supervised learning tasks.

In this curriculum, we will go through all the background topics necessary to understand these models. At the end, you should be able to implement neural ODEs and apply them to different tasks.

Common resources:

Süli & Mayers: An Introduction to Numerical Analysis.
Quarteroni et al.: Numerical Mathematics.

1 Numerical solution of ODEs - Part 1

Motivation: ODEs are used to mathematically model a number of natural processes and phenomena. The study of their numerical simulations is one of the main topics in numerical analysis and of fundamental importance in applied sciences. To understand Neural ODEs, we need to first understand how ODEs are solved with numerical techniques.

Topics:

Initial value problems.
One-step methods.
Consistency and convergence.

Notes: In this class, we touched upon one-step methods and their analysis. We also looked at some illustrative examples.

Required Reading:

Sections 12.1-4 from Süli & Mayers.
Sections 11.1-3 from Quarteroni et al.

Optional Reading:

Runge-Kutta methods: Section 12.5 from Süli & Mayers.
Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.2.

Questions:

Exercise 1 in Section 11.12 of Quarteroni et al.
Solution

The truncation error can be split as $$h\tau_{n+1} = y_{n+1} - y_n - h\Phi(t_n,y_n;h) = E_1 + E_2$$ where $$E_1 = \int_{t_n}^{t_{n+1}} f(s, y(s))\,ds - \frac{h}{2}\left( f(t_n,y_n) + f(t_{n+1},y_{n+1}) \right)$$ and $$E_2 = \frac{h}{2}\left( f(t_{n+1},y_{n+1}) - f(t_{n+1},y_n + hf(t_n,y_n)) \right)$$ We can bound $E_2$ as $$|E_2| = \frac{h}{2} \left| f(t_{n+1},y_{n+1}) - f(t_{n+1}, y_n + h f(t_n,y_n)) \right| \leq \frac{hL}{2}|y_{n+1}-y_{n} - hf(t_n,y_n)| = \frac{hL}{2}O(h^2) = O(h^3)$$ where $L$ is the Lipschitz constant of $f$. On the other hand, $E_1$ is bounded above by $O(h^3)$; see this link for a proof. It follows that $\tau_{n} = O(h^2)$.
Exercises 12.3,12.4, 12.7 in Section 12 of Süli & Mayers.
Solution to Exercise 12.3

Notice that we can write $$\left(y + \frac{q}{p}\right)'=p\left(y + \frac{q}{p}\right)$$ It follows that $y(t) = Ce^{pt} - q/p$ for some constant $C$. Imposing the initial condition $y(0)=1$, we get $C = 1 + q/p$ and hence $y(t)=e^{pt} + \frac{q}{p}(e^{pt}-1)$. In particular, we expand $y$ in its Taylor series: $$y(t) = 1 + \left(1 + \frac{q}{p}\right)\sum_{k=1}^\infty \frac{(pt)^k}{k!}$$ To conclude the exercise we only need to notice that $$y_n(t) = 1 + \left(1 + \frac{q}{p}\right)\sum_{k=1}^n \frac{(pt)^k}{k!}$$ satisfies Picard's iteration: $y_0 \equiv 1$, $y_{n+1}(t) = y_0 + \int_0^t (py_n(s) + q)\,ds$.
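The claim can be checked mechanically. The sketch below (our own illustration, with arbitrary values of $p$ and $q$) represents each Picard iterate as a list of exact polynomial coefficients and compares it against the truncated series $1 + (1 + q/p)\sum_{k=1}^n (pt)^k/k!$.

```python
from fractions import Fraction


def picard_step(coeffs, p, q):
    """One Picard iteration for y' = p*y + q, y(0) = 1.

    coeffs[k] is the coefficient of t^k in y_n; the result holds y_{n+1}.
    """
    integrand = [p * c for c in coeffs]  # p * y_n(s)
    integrand[0] += q                    # ... + q
    # Integrate term by term from 0 to t, then add y_0 = 1.
    return [Fraction(1)] + [c / (k + 1) for k, c in enumerate(integrand)]


def partial_sum(n, p, q):
    """Coefficients of 1 + (1 + q/p) * sum_{k=1}^n (p t)^k / k!."""
    coeffs = [Fraction(1)]
    factorial = 1
    for k in range(1, n + 1):
        factorial *= k
        coeffs.append((1 + q / p) * p ** k / factorial)
    return coeffs


p, q = Fraction(3), Fraction(-2)  # arbitrary illustrative values
y = [Fraction(1)]                 # y_0 = 1
matches = []
for n in range(1, 6):
    y = picard_step(y, p, q)
    matches.append(y == partial_sum(n, p, q))
print(matches)
```

Using `Fraction` keeps the arithmetic exact, so the comparison is an equality check rather than a floating-point tolerance.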

Solution to Exercise 12.4

Applying Euler's method with step-size $h$, we get $\hat{y}(0) = 0$, $\hat{y}(h) = \hat{y}(0) + h \hat{y}(0)^{1/5} = 0$, $\hat{y}(2h) = \hat{y}(h) + h \hat{y}(h)^{1/5} =0$. Iterating, we see that $\hat{y}(nh) = 0$ for all $n\geq 0$. On the other hand, the implicit Euler's method says that $$\hat{y}_{n+1} = \hat{y}_n + h \hat{y}_{n+1}^{1/5}$$ for $n \geq 0$ and $\hat{y}_0 = 0$. After substituting $\hat{y}_{n} = (C_nh)^{5/4}$ in the above relation, we only need to check that there exists a sequence $C_n$ satisfying the requirements.

Solution to Exercise 12.7

First, notice that $$e_{n+1} = y(x_{n+1}) - y_{n} - \frac{1}{2}h(f_{n+1} + f_n)= e_n - \frac{1}{2}h (f_{n+1}+f_n) + \int_{x_n}^{x_{n+1}} f(s,y(s))\,ds$$ and that the second component of the RHS is the same as $E_1$ in Exercise 1 above. Therefore the first bound follows. The last inequality is simply obtained by re-arranging the terms.
Consider the following method for solving $y' = f(y)$:
\[y_{n+1} = y_n + h(\theta f(y_n) + (1-\theta) f(y_{n+1}))\]
Assuming sufficient smoothness of $y$ and $f$, for what value of $0 \leq\theta\leq 1$ is the truncation error the smallest? What does this mean about the accuracy of the method?

Solution

By definition, it holds that $$h\tau_n = y_{n+1} - y_n - h (\theta f_n + (1-\theta) f_{n+1}) = y_{n+1} - y_n - h \theta y_n' - h(1-\theta) y_{n+1}'$$ Taylor-expanding, we get $$h\tau_n = y_{n} + hy_n' + h^2/2y_n'' + O(h^3) - y_n - h \theta y_n' - h(1-\theta) y_{n}' - h^2(1-\theta) y_{n}'' + O(h^3) = h^2(\theta - 1/2)y_n''+O(h^3)$$ It follows that the truncation error is the smallest for $\theta=1/2$. For $\theta = 1/2$, the method has order $2$, otherwise it has order $1$.
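The orders derived above can be observed numerically. The sketch below is an illustration under our own assumptions: it uses the linear test problem $y' = \lambda y$ with $\lambda = -1$, for which the implicit equation can be solved in closed form, and estimates the order from the errors at step sizes $h$ and $h/2$.

```python
import math


def theta_method_error(theta, h, lam=-1.0, T=1.0):
    """Error at time T of the theta-method applied to y' = lam*y, y(0) = 1."""
    steps = round(T / h)
    y = 1.0
    for _ in range(steps):
        # y_{n+1} = y_n + h*(theta*lam*y_n + (1-theta)*lam*y_{n+1}), solved for y_{n+1}.
        y = y * (1.0 + h * theta * lam) / (1.0 - h * (1.0 - theta) * lam)
    return abs(y - math.exp(lam * T))


def observed_order(theta, h=0.01):
    """Estimate the convergence order by halving the step size."""
    return math.log2(theta_method_error(theta, h) / theta_method_error(theta, h / 2))


for theta in (0.0, 0.5, 1.0):
    print(theta, observed_order(theta))
```

The estimate is close to $2$ for $\theta = 1/2$ and close to $1$ for $\theta = 0$ and $\theta = 1$, in line with the leading error term $h^2(\theta - 1/2)y_n''$.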
Colab notebook.
Solution

See this Colab for the solution.

2 Numerical solution of ODEs - Part 2

Motivation: In the previous class, we introduced some simple schemes to numerically solve ODEs. In order to understand which numerical scheme is most appropriate for a given problem, it is important to know and understand their different properties. For this reason, in this class, we go through some more involved schemes and analyze them with regard to convergence and stability.

Topics:

Runge-Kutta methods.
Multi-step methods.
Systems of ODEs and absolute stability.

Notes: In this class, we went through different ways to construct multi-step methods and their convergence analysis. We then looked into absolute stability regions for different methods.

Required Reading:

Runge-Kutta methods: Section 11.8 from Quarteroni et al. or Sections 12.{5,12} from Süli & Mayers.
Multi-step methods: Sections 12.6-9 from Süli & Mayers or Sections 11.5-6 from Quarteroni et al.
Systems of ODEs: Sections 12.10-11 from Süli & Mayers or Sections 11.9-10 from Quarteroni et al.

Optional Reading:

Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.1.
Predictor-corrector methods: Section 11.7 from Quarteroni et al.
Richardson extrapolation: Section 16.4 from Numerical Recipes.
Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equations.

Questions:

Exercises 12.11, 12.12, 12.19 in Section 12 of Süli & Mayers.
Solution to Exercise 12.11

By definition, the truncation error is given by $$h\tau_n = y_{n+3} + \alpha y_{n+2} -\alpha y_{n+1} - y_n -h\beta y_{n+2}' - h\beta y_{n+1}'$$ Taylor-expanding, we have that $$y_{n+3} = y_n + 3hy_n' + \frac{9}{2}h^2 y_n'' + \frac{9}{2}h^3 y_n''' + \frac{27}{8}h^4 y_n^{(4)} + O(h^5)$$ $$y_{n+2} = y_n + 2hy_n' + 2h^2 y_n'' + \frac{4}{3}h^3 y_n''' + \frac{2}{3}h^4 y_n^{(4)} + O(h^5)$$ $$y_{n+1} = y_n + hy_n' + \frac{1}{2}h^2 y_n'' + \frac{1}{6}h^3 y_n''' + \frac{1}{24}h^4y_n^{(4)} + O(h^5)$$ $$y_{n+2}' = y_n' + 2hy_n'' + 2h^2y_n''' + \frac{4}{3} h^3 y_{n}^{(4)} + O(h^4)$$ $$y_{n+1}' = y_n' + hy_n'' + \frac{1}{2}h^2y_n''' + \frac{1}{6}h^3 y_{n}^{(4)} + O(h^4)$$ Substituting these in the first equation and requiring the coefficients of $h^i$, $i = 0,1,2,3,4$, to vanish (the terms in $h^0$ cancel identically, and those in $h^1$ and $h^2$ give the same condition), we get the equations $$3 + \alpha - 2\beta = 0$$ $$27 + 7\alpha - 15\beta = 0$$ $$27 + 5\alpha - 12\beta = 0$$ Solving for these, we find $\alpha = 9$ and $\beta = 6$. The resulting method reads $$y_{n+3} + 9(y_{n+2} - y_{n+1}) - y_n = 6h(f_{n+2} + f_{n+1})$$ The characteristic polynomial is given by $$\rho(z) = z^3 +9z^2 - 9z -1 = (z-1)(z^2 + 10z + 1)$$ The root $z = -5 - 2\sqrt{6}$ of this polynomial satisfies $|z|>1$ and this implies that the method is not zero-stable.
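Both the order conditions and the zero-stability claim can be verified in a few lines. The sketch below (our own check, not from the book) computes the coefficient of $h^k y_n^{(k)}$ in $h\tau_n$ with exact rational arithmetic for $\alpha = 9$, $\beta = 6$, and then evaluates the roots of $\rho(z)$ using the factorisation $(z-1)(z^2+10z+1)$.

```python
import math
from fractions import Fraction

# Coefficients of y_n, y_{n+1}, y_{n+2}, y_{n+3} and of h*f_n, ..., h*f_{n+3}.
alpha = [-1, -9, 9, 1]
beta = [0, 6, 6, 0]


def error_coefficient(k):
    """Coefficient of h^k y^(k)(x_n) in the Taylor expansion of h * tau_n."""
    c = sum(Fraction(a * j ** k, math.factorial(k)) for j, a in enumerate(alpha))
    if k >= 1:
        c -= sum(
            Fraction(b * j ** (k - 1), math.factorial(k - 1))
            for j, b in enumerate(beta)
        )
    return c


coefficients = [error_coefficient(k) for k in range(6)]

# rho(z) = z^3 + 9 z^2 - 9 z - 1 = (z - 1)(z^2 + 10 z + 1); roots of the quadratic:
disc = math.sqrt(10 ** 2 - 4)
roots = [1.0, (-10 + disc) / 2, (-10 - disc) / 2]

print(coefficients)
print(roots)
```

The coefficients vanish up to $h^4$, and one root has modulus close to $9.9$, so perturbations are amplified by roughly that factor at every step.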

Solution to Exercise 12.12

By definition, the truncation error is given by $$h\tau_n = y_{n+1} + b y_{n-1} +a y_{n-2} -h y_{n}'$$ Taylor-expanding, we have that $$y_{n+1} = y_n + hy_n' + \frac{1}{2}h^2 y_n'' + O(h^3)$$ $$y_{n-1} = y_n - hy_n' + \frac{1}{2}h^2 y_n'' + O(h^3)$$ $$y_{n-2} = y_n - 2hy_n' + 2h^2 y_n'' + O(h^3)$$ Substituting these in the first equation and requiring the coefficients of $h^i$, $i = 0,1$, to vanish, we get $a=1$ and $b=-2$. In particular $$\tau_n = \frac{3}{2}h y_n'' + O(h^2)$$ and thus the method has order of accuracy $1$. The resulting method reads $$y_{n+1} -2 y_{n-1} + y_{n-2} = h f_{n}$$ The characteristic polynomial is given by $$\rho(z) = z^3 -2z +1 = (z-1)(z^2 + z - 1)$$ The root $z = -(1+\sqrt{5})/2$ of this polynomial satisfies $|z|>1$ and this implies that the method is not zero-stable.

Solution to Exercise 12.19

The first equation can be found by substituting $f(t,y) = \lambda y$ in equation (12.51) in the book and by solving for $k_1,k_2$ (it is a $2\times 2$ linear system). Substituting the values of $A$ and $b$ from the Butcher tableau in this formula and in the one right before equation (12.51) in the book, and simplifying, we get the formula for $R(\lambda h)$. Finally, $p$ and $q$ are given by $p,q=-3\pm i \sqrt{3}$. One can see that this implies $|R(z)|<1$ if $Re(z) <0$ and thus the method is A-stable.

3 ResNets

Motivation: The introduction of Residual Networks (ResNets) made it possible to train very deep networks. In this section, we study residual architectures and their properties. We then look into how ResNets approximate ODEs and how this interpretation can motivate neural net architectures and new training approaches. This is important in order to understand the basic models underlying Neural ODEs and gain some insights into their connection to numerical solutions of ODEs.

Topics:

ResNets.
ResNets and ODEs.

Notes: In this class, we defined and briefly discussed residual network architectures. We then looked at a stability notion for ResNets, derived from the connection with discretisation of ODEs, and at a simple way to make such architectures reversible.

Required Reading:

ResNets:
- ResNets.
- An Overview of ResNet and its Variants.
ResNets and ODEs:
- Sections 1-3 from Multi-level Residual Networks from Dynamical Systems View.
- Reversible Architectures for Arbitrarily Deep Residual Neural Networks.
- Invertible ResNets: The Reversible Residual Network: Backpropagation Without Storing Activations
- Stable Architectures for Deep Neural Networks.

Optional Reading:

The original ResNets paper: Deep Residual Learning for Image Recognition.
Another blog post on ResNets: Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification.

Questions:

Do you understand why adding ‘residual layers’ should not degrade the network performance?
Solution

Let $$x_k = x_{k-1} + f(W_k, x_{k-1})$$ be the output of the $k$-th layer of a residual net. Then, adding a residual layer consists of considering $$x_{k+1} = x_{k} + f(W_{k+1}, x_{k})$$ instead of $x_k$. For most common architectures, it holds that $f(W, x) \equiv 0$ for $W=0$. This is why adding a layer should not degrade the performance: any residual network with $k$ layers can also be written as a residual network with $k+1$ layers, by simply taking $W_{k+1}=0$.
How do the authors of (Multi-level Residual Networks from Dynamical Systems View) explain the phenomenon that residual networks still perform almost as well after a layer is removed?
Solution

Viewing each layer of the network as a time step of the forward Euler method, we have that $$x^{(n+1)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta)$$ where $x^{(n)}(x_i)$ is the output of the $n$-th layer of the network evaluated on the input point $x_i$. Then $$x^{(n+2)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta) + h F(x^{(n+1)}(x_i); \theta)$$ Therefore, removing layer $n+1$ consists of taking $$x^{(n+2)}(x_i) = x^{(n)}(x_i) + h F(x^{(n)}(x_i); \theta)$$ instead. As $h$ is small (and this is motivated by the experiments in Section 3.2), the removed term is small and so is the variation in the output layer. Nevertheless, it must be noticed that this analysis is only based on empirical evaluations.
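The argument can be illustrated with a toy scalar network. In the sketch below, the residual branch `residual_branch` (a plain `tanh`), the depth and the index of the removed layer are hypothetical choices, not taken from the paper. Since $|F| \leq 1$ and each remaining layer is $(1+h)$-Lipschitz, the change in the output is at most $h(1+h)^{N} \leq h e$.

```python
import math


def residual_branch(x):
    """Toy residual function F(x; theta); tanh is an illustrative choice with |F| <= 1."""
    return math.tanh(x)


def forward(x, num_layers, h, skip=None):
    """Scalar residual network x <- x + h * F(x); `skip` drops one layer."""
    for n in range(num_layers):
        if n == skip:
            continue
        x = x + h * residual_branch(x)
    return x


num_layers = 100
h = 1.0 / num_layers
full = forward(0.5, num_layers, h)
dropped = forward(0.5, num_layers, h, skip=40)
gap = abs(full - dropped)
print(full, dropped, gap)
```

Repeating the experiment with a larger `h` (a shallower network over the same time horizon) makes the gap grow accordingly.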
Implement your favourite ResNet variant.
Example

See this tutorial for an example of implementation of a ResNet.

4 Normalising Flows

Motivation: In this class, we take a little detour to learn about Normalising Flows. These are used for density estimation and generative modeling, and their implementation is motivated by a discretisation of an ODE. Understanding them at a basic level is necessary to understand continuous normalizing flows, a central application of neural ODEs.

Topics:

Normalising Flows.
End-to-end implementations with neural nets.

Notes: In this class, we defined normalising flows, starting from the non-parametric form and then deriving their algorithmic (and parametric) implementation. We concluded by discussing some architectures proposed in the literature and their trade-offs.

Required Reading:

DE: Density Estimation by Dual Ascent of the Log-likelihood (Skip Section 3).
A family of non-parametric density estimation algorithms.
A post on Normalising flow.

Optional Reading:

Questions:

In DE, what is the difference between $\rho_t$ and $\tilde{\rho}_t$, i.e. what do they represent?
Solution

The function $\tilde{\rho}_t$ is the density of the distribution of the random variable $\phi_t^{-1}(y)$ where $y\sim \mu$. The function $\rho_t$ is the density of the distribution of the random variable $\phi_t(x)$ where $x\sim \rho$.
What is the computational complexity of evaluating a determinant of an $N\times N$ matrix, and why is that relevant in this context?
Solution

In general, the cost of computing the determinant of an $N\times N$ matrix is $O(N^3)$. To compute densities transported by normalising flows, we need to compute the determinants of the Jacobians; therefore, an important feature of practical normalising flows is that the Jacobian structure must allow an efficient computation of its determinant. See this week's notes for more discussion on this.
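One structure that makes the determinant cheap is a triangular Jacobian, whose determinant is the product of its diagonal entries. The sketch below is a generic illustration with made-up matrix values, not a specific architecture from the notes. It compares the $O(N^3)$ elimination-based determinant with the $O(N)$ sum of log-diagonal entries.

```python
import math


def det_gaussian_elimination(matrix):
    """Determinant by elimination with partial pivoting: O(N^3) operations."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def logdet_triangular(matrix):
    """log|det| of a triangular Jacobian: sum of log|diagonal|, O(N) operations."""
    return sum(math.log(abs(matrix[i][i])) for i in range(len(matrix)))


# Lower-triangular Jacobian with illustrative values.
jacobian = [
    [2.0, 0.0, 0.0, 0.0],
    [0.3, 0.5, 0.0, 0.0],
    [-1.0, 0.7, 1.5, 0.0],
    [0.2, -0.4, 0.9, 3.0],
]

general = math.log(abs(det_gaussian_elimination(jacobian)))
cheap = logdet_triangular(jacobian)
print(general, cheap)
```

Both routes give the same log-determinant, but only the second scales to high-dimensional data.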

5 The Adjoint Method (and Auto-Diff)

Motivation: The adjoint method is a numerical method for efficiently computing the gradient of a function in numerical optimization problems. Understanding this method is essential to understand how to train ‘continuous depth’ nets. We also review the basics of Automatic Differentiation, which will help us understand the efficiency of the algorithm proposed in the NeuralODE paper.

Topics:

Adjoint Method.
Auto-Diff.

Notes: In this class, we discussed the adjoint method. We started from the case of linear systems and went through non-linear equations and recurrence relations. We concluded by discussing their application to ODE-constrained optimization problems, which is the case of interest for Neural ODEs.

Required Reading:

Section 8.7 from CSE: Computational Science and Engineering.
Sections 2 and 3 from Automatic Differentiation in Machine Learning: a Survey.

Optional Reading:

Prof. Steven G. Johnson’s notes on adjoint method.

Questions:

Exercises 1,2,3 from Section 8.7 of CSE.
Solution to Exercise 1

This follows immediately by noticing that the number of multiply-add operations of multiplying an $N\times M$ matrix with an $M\times P$ matrix is given by $O(NMP)$.

Solution to Exercise 2

Apply the chain rule. Since $\frac{\partial C}{\partial S} = 2S$ and $\frac{dT}{dS} = \frac{\partial T}{\partial S} + \frac{\partial T}{\partial C}\frac{\partial C}{\partial S}$, we get $\frac{d T}{d S} = 1 -2S$.

Solution to Exercise 3

This follows from Exercise 1 by seeing $u^T$ and $w^T$ as $1\times N$ matrices and $v$ as an $N\times 1$ matrix.
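The operation count from Exercise 1 is what makes the order of multiplication matter in the adjoint method. The sketch below uses small integer matrices invented for illustration (they are not the matrices from the CSE exercises). It counts multiply-adds for $(AB)v$ versus $A(Bv)$ and confirms that both orderings give the same result.

```python
def matmul(a, b):
    """Naive product of an N x M and an M x P matrix.

    Returns the result together with the number of multiply-adds, N*M*P.
    """
    n, m, p = len(a), len(b), len(b[0])
    out = [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)] for i in range(n)]
    return out, n * m * p


N = 50
A = [[(i + j) % 7 - 3 for j in range(N)] for i in range(N)]
B = [[(i * j) % 5 - 2 for j in range(N)] for i in range(N)]
v = [[(i % 3) - 1] for i in range(N)]  # N x 1 column vector

AB, cost_ab = matmul(A, B)
left, cost_left = matmul(AB, v)        # (A B) v: matrix-matrix product first
Bv, cost_bv = matmul(B, v)
right, cost_right = matmul(A, Bv)      # A (B v): only matrix-vector products

print(cost_ab + cost_left, cost_bv + cost_right)
```

With $N = 50$ the first ordering needs $N^3 + N^2$ multiply-adds and the second only $2N^2$.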
Consider the problem of optimizing a real-valued function $g$ over the solution of the ODE $y'(t) = A(p)y(t)$, $y(0) = b(p)$ at time $T>0$: $\min_p\, g(T) \doteq g(y(T; p))$. Find $\frac{dg(T)}{dp}$ by solving the ODE and by applying the chain rule. Check the correctness of equations (16-17) in CSE.
Solution

It holds that $$y(t) = e^{tA(p)}y(0)$$ Applying the chain rule, and assuming that $A(p)$ commutes with $\frac{\partial A}{\partial p}$ (as in the scalar case) so that $\frac{\partial}{\partial p}e^{TA(p)} = T\frac{\partial A}{\partial p}e^{TA(p)}$, we get $$\frac{dg}{dp} = \frac{dg}{dy}e^{TA(p)}\frac{db}{dp} + T\frac{dg}{dy}\frac{\partial A}{\partial p}e^{TA(p)}b(p)$$ On the other hand, the adjoint ODE reads $$\lambda'(t) = -A(p)^T\lambda(t)$$ with the final condition $\lambda(T) = \left(\frac{\partial g}{\partial y}\right)^T$, which gives $\lambda(t) = e^{A(p)^T(T-t)}\left(\frac{\partial g}{\partial y}\right)^T$. Equation (17) from CSE gives $$\frac{dg}{dp} = \left(e^{TA(p)^T}\left(\frac{\partial g}{\partial y}\right)^T\right)^T\frac{\partial b}{\partial p} + \int_0^T \frac{\partial g}{\partial y} e^{A(p)(T-t)}\frac{\partial A}{\partial p}e^{tA(p)}b(p)\,dt$$ which coincides with the above, since under the same commutativity assumption the integrand does not depend on $t$.
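The agreement between the two expressions can be checked numerically in the scalar case. In the sketch below, the functions $A(p) = -p$, $b(p) = 1 + p^2$ and $g(y) = y^2$ are arbitrary choices made for illustration. It compares the chain-rule gradient, the adjoint formula with the integral evaluated by the trapezoidal rule, and a central finite difference.

```python
import math

T = 1.5


def a(p):
    return -p            # A(p), scalar illustrative choice


def da(p):
    return -1.0          # dA/dp


def b(p):
    return 1.0 + p * p   # initial condition b(p)


def db(p):
    return 2.0 * p       # db/dp


def g(y):
    return y * y         # objective


def dg(y):
    return 2.0 * y       # dg/dy


def y_at(t, p):
    """Closed-form solution y(t) = exp(t A(p)) b(p)."""
    return math.exp(t * a(p)) * b(p)


def grad_chain_rule(p):
    y_final = y_at(T, p)
    return dg(y_final) * (math.exp(T * a(p)) * db(p) + T * da(p) * y_final)


def grad_adjoint(p, n=2000):
    y_final = y_at(T, p)

    def lam(t):
        # Solves lam' = -A lam backwards from lam(T) = dg/dy.
        return math.exp(a(p) * (T - t)) * dg(y_final)

    def integrand(t):
        return lam(t) * da(p) * y_at(t, p)

    h = T / n
    integral = 0.5 * (integrand(0.0) + integrand(T))
    integral += sum(integrand(i * h) for i in range(1, n))
    return lam(0.0) * db(p) + h * integral


def grad_finite_difference(p, eps=1e-6):
    return (g(y_at(T, p + eps)) - g(y_at(T, p - eps))) / (2.0 * eps)


p0 = 0.7
print(grad_chain_rule(p0), grad_adjoint(p0), grad_finite_difference(p0))
```

In a real application the closed-form `y_at` is not available, and both $y$ and $\lambda$ would come from a numerical ODE solver.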
Prove equations (14-15) in Section 8.7 of CSE.
Solution

By definition, it holds that $$\frac{dG}{dp} = \int_0^T\left(\frac{\partial g}{\partial p} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial p}\right)\,dt $$ On the other hand, since $\lambda(T) = 0$, it holds that $$\lambda(0)^T\frac{\partial u}{\partial p}(0) + \int_0^T\lambda^T \frac{\partial f}{\partial p}\,dt = \int_0^T \left( \lambda^T\frac{\partial f}{\partial p} -\frac{d}{dt}\left( \lambda^T \frac{\partial u}{\partial p}\right) \right)\,dt $$ Using equation (14) from CSE and the equality $\frac{d}{dt}\frac{\partial u}{\partial p} = \frac{\partial f}{\partial p} + \frac{\partial f}{\partial u}\frac{\partial u}{\partial p}$, we get $$\int_0^T \left( \lambda^T\frac{\partial f}{\partial p} -\frac{d}{dt}\left( \lambda^T \frac{\partial u}{\partial p}\right) \right)\,dt = \int_0^T \left( \lambda^T\frac{\partial f}{\partial p} + \lambda^T \frac{\partial f}{\partial u}\frac{\partial u}{\partial p} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial p} - \lambda^T \frac{\partial f}{\partial p} -\lambda^T \frac{\partial f}{\partial u}\frac{\partial u}{\partial p} \right)\,dt$$ which gives $$ \lambda(0)^T\frac{\partial u}{\partial p}(0) + \int_0^T \lambda^T\frac{\partial f}{\partial p}\,dt = \int_0^T \frac{\partial g}{\partial u}\frac{\partial u}{\partial p}\,dt $$ and thus completes the proof.

6 The Paper

Motivation: Let’s read the paper! Here is a summary of what’s going on to help with your understanding:

Any residual network can be seen as the Explicit Euler’s method discretisation of a certain ODE; given the network parameters, any numerical ODE solver can be used to evaluate the output layer. The application of the adjoint method makes it possible to efficiently back-propagate (and thus train) these models. The same idea can be used to train time-continuous normalising flows. In this case, moving to the continuous formulation allows us to avoid the computation of the determinant of the Jacobian, one of the major bottlenecks of normalising flows. Neural ODEs can also be used to model latent dynamics in time-series modeling, allowing us to easily tackle irregularly sampled data.
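The first point of this summary can be sketched in a few lines. The example below is a toy illustration, not the paper's implementation: the scalar dynamics `dynamics` and its parameters `w` and `b` are hypothetical, and the solvers `solve_euler` and `solve_rk4` are hand-written stand-ins for a generic ODE solver. With the parameters held fixed, the "output layer" $x(1)$ is approximated both by explicit Euler (which has the form of a stack of residual blocks) and by a fourth-order Runge-Kutta scheme.

```python
import math


def dynamics(x, t, w=1.3, b=-0.2):
    """Toy scalar 'layer' f(x, t; theta) = tanh(w*x + b); w and b are illustrative."""
    return math.tanh(w * x + b)


def solve_euler(f, x0, t0, t1, steps):
    """Explicit Euler: each step is x <- x + h * f(x, t), i.e. a residual block."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * f(x, t)
        t += h
    return x


def solve_rk4(f, x0, t0, t1, steps):
    """Classical fourth-order Runge-Kutta applied to the same dynamics."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        k1 = f(x, t)
        k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(x + h * k3, t + h)
        x = x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return x


reference = solve_rk4(dynamics, 0.5, 0.0, 1.0, steps=200)
euler_coarse = solve_euler(dynamics, 0.5, 0.0, 1.0, steps=10)     # a 10-layer "ResNet"
euler_fine = solve_euler(dynamics, 0.5, 0.0, 1.0, steps=1000)     # a 1000-layer "ResNet"
print(reference, euler_coarse, euler_fine)
```

Both solvers evaluate the same model; the choice of solver only trades accuracy against the number of function evaluations.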

Topics:

Normalising Flows.
End-to-end implementations with neural nets.

Notes: In this class, we defined Neural ODEs and derived the respective adjoint method, essential for their implementation. We then discussed continuous normalising flows and the computational advantages offered by Neural ODEs in this setting.

Required Reading:

Optional Reading:

A follow-up paper by the authors on scalable continuous normalizing flows: Free-form Continuous Dynamics for Scalable Reversible Generative Models.

Wasserstein GAN

2019-05-02T14:00:00+00:00

[Editor’s Note: We are especially proud of this one. James and his group went above and beyond the call of duty and made a guide from their class that we feel is especially superb for understanding their target paper. Moving forward, he has forced us to up our game because it will be hard to release a curriculum that is not as strong as this one. We highly recommend earnestly studying with this at hand.]

A number of people need to be thanked for their parts in making this happen. Thank you to Martin Arjovsky, Avital Oliver, Cinjon Resnick, Marco Cuturi, Kumar Krishna Agrawal, and Ishaan Gulrajani for contributing to this guide.

Of course, thank you to Sasha Naidoo, Egor Lakomkin, Taliesin Beynon, Sebastian Bodenstein, Julia Rozanova, Charline Le Lan, Paul Cresswell, Timothy Reeder, and Michał Królikowski for beta-testing the guide and giving invaluable feedback. A special thank you to Martin Arjovsky, Tim Salimans, and Ishaan Gulrajani for joining us for the weekly meetings.

Finally, thank you to Ulrich Paquet and Stephan Gouws for introducing many of us to Cinjon.

Concepts used in Wasserstein GAN. Click to navigate.

Why

The Wasserstein GAN (WGAN) is a GAN variant which uses the 1-Wasserstein distance, rather than the JS-Divergence, to measure the difference between the model and target distributions. This seemingly simple change has big consequences! Not only does WGAN train more easily (a common struggle with GANs) but it also achieves very impressive results, generating some stunning images. By studying the WGAN, and its variant the WGAN-GP, we can learn a lot about GANs and generative models in general. After completing this curriculum you should have an intuitive grasp of why the WGAN and WGAN-GP work so well, as well as a thorough understanding of the mathematical reasons for their success. You should be able to apply this knowledge to understanding cutting edge research into GANs and other generative models.

1 Basics of Probability & Information Theory

Motivation: To understand GAN training (and eventually WGAN & WGAN-GP) we need to first have some understanding of probability and information theory. In particular, we will focus on Maximum Likelihood Estimation and the KL-Divergence. This week we will make sure that we understand the basics so that we can build upon them in the following weeks.

Topics:

Probability Theory
Information Theory
Mean Squared Error (MSE)
Maximum Likelihood Estimation (MLE)

Required Reading:

Chs 3.1 - 3.5 of Deep Learning by Goodfellow et al. (the DL book)
- These chapters are here to introduce fundamental concepts such as random variables, probability distributions, marginal probability, and conditional probability. If you have the time, reading the whole of chapter 3 is highly recommended. A solid grasp of these concepts will be important foundations for what we will cover over the next 5 weeks.
Ch 3.13 of the DL book
- This chapter covers KL-Divergence & the idea of distances between probability distributions which will also be a key concept going forward.
Chs 5.1.4 and 5.5 of the DL book
- The aim of these chapters is to make sure that everyone understands maximum likelihood estimation (MLE) which is a fundamental concept in machine learning. It is used explicitly or implicitly in both supervised and unsupervised learning as well as in both discriminative and generative methods. In fact, many methods using gradient descent are doing approximate MLE. It is important to understand MLE as a fundamental concept, and its use in machine learning in practice. Note that, if you are not familiar with the notation used in these chapters, you might want to start at the beginning of the chapter. Also note that, if you are not familiar with the concept of estimators, you might want to read Ch 5.4. However, you can probably get by simply knowing that minimizing mean squared error (MSE) is a method for optimizing some approximation for a function we are trying to learn (an estimator).
The first 3 sections of GANs and Divergence Minimization (check out the rest after week 3)
- This blog gives a great description of the connections between the KL divergence and MLE. It also provides a nice teaser for what is to come in the following weeks, particularly with regards to the difficulties of training generative models.

Optional Reading:

Ch 2 from Information Theory, Inference & Learning Algorithms by David MacKay (MacKay’s book)
- This is worth reading if you feel like you didn’t quite grok the probability and information theory content in the DL book. MacKay provides a different perspective on these ideas which might help make things click. These concepts are going to be crucial going forward so it is definitely worth making sure you are comfortable with them.
Chs 1.6 and 10.1 of Pattern Recognition and Machine Learning by Christopher M. Bishop (PRML)
- Similarly, this is worth reading if you don’t feel comfortable with the KL-Divergence and want another perspective.
Aurélien Géron’s video A Short Introduction to Entropy, Cross-Entropy and KL-Divergence
- An introductory, but interesting video that describes the KL-Divergence.
Notes on MLE and MSE
- An alternative discussion on the links between MLE and MSE.
The first 37ish minutes of Arthur Gretton’s MLSS Africa talk on comparing probability distributions — video, slides
- An interesting take on comparing probability distributions. The first 37 minutes are fairly general and give some nice insights as well as some foreshadowing of what we will be covering in the following weeks. The rest of the talk is also very interesting and ends up covering another GAN called the MMD-GAN, but it isn’t all that relevant for us.
On Integral Probability Metrics, φ-Divergences and Binary Classification
- For those of you whose curiosity was piqued by Arthur’s talk, this paper goes into depth describing IPMs (such as MMD and the 1-Wasserstein distance) and comparing them to the φ-divergences (such as the KL-Divergence). This paper is fairly heavy mathematically so don’t be discouraged if you struggle to follow it.

Questions:

The questions this week are here to make sure that you can put all the theory you’ve been reading about to a little practice. For example, do you understand how to perform calculations on probabilities, or, what Bayes’ rule is and how to use it?

Examples/Exercises 2.3, 2.4, 2.5, 2.6, and 2.26 in MacKay’s book
- Bonus: 2.35, and 2.36
Solutions

Examples 2.3, 2.5, and 2.6 have their solutions directly following them.

Exercise 2.26 has a solution on page 44.

Exercise 2.35 has a solution on page 45.

Exercise 2.36: 1/2 and 2/3.

(Page numbers from Version 7.2 (fourth printing) March 28, 2005, of MacKay's book.)
Derive Bayes’ rule using the definition of conditional probability.
Solution

The definition of conditional probability tells us that $$p(y|x) = \frac{p(y,x)}{p(x)}$$ and that $$p(x|y) = \frac{p(y,x)}{p(y)}.$$ From this we can see that $p(y,x) = p(y|x)p(x) = p(x|y)p(y)$. Finally if we divide everything by $p(x)$ we get $$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$ which is Bayes' rule.
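To see the derivation above in action, here is a small numerical check. The joint distribution is made up purely for illustration, and the helper names are our own; the point is only that computing $p(y|x)$ directly from the definition and via Bayes' rule gives the same number.

```python
# Hypothetical joint distribution p(x, y) over x in {0, 1} and y in {0, 1}.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}


def marginal_x(x):
    # p(x) = sum over y of p(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    # p(y) = sum over x of p(x, y)
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_y_given_x(y, x):
    # Definition of conditional probability: p(y|x) = p(y, x) / p(x)
    return joint[(x, y)] / marginal_x(x)


def cond_x_given_y(x, y):
    # Definition of conditional probability: p(x|y) = p(y, x) / p(y)
    return joint[(x, y)] / marginal_y(y)


def bayes_y_given_x(y, x):
    # Bayes' rule: p(y|x) = p(x|y) p(y) / p(x)
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)


for x in (0, 1):
    for y in (0, 1):
        print(x, y, cond_y_given_x(y, x), bayes_y_given_x(y, x))
```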
Exercise 1.30 in PRML
Solution

Here is a solution.

The result should be $\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.
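If you want to sanity-check your derivation, one option is to compare the closed form against a brute-force numerical integral of $p(x)\log\frac{p(x)}{q(x)}$. The sketch below does this with a simple midpoint rule; the integration range, step count, and example parameters are arbitrary choices of ours, and we assume the closed form above is $D_{KL}(p||q)$ with $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$.

```python
import math


def gaussian_log_pdf(x, mu, sigma):
    # log N(x; mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def kl_closed_form(mu1, sigma1, mu2, sigma2):
    # The closed-form result quoted above.
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


def kl_numeric(mu1, sigma1, mu2, sigma2, lo=-20.0, hi=20.0, steps=100_000):
    # Midpoint-rule approximation of the integral of p(x) * (log p(x) - log q(x)).
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_p = gaussian_log_pdf(x, mu1, sigma1)
        log_q = gaussian_log_pdf(x, mu2, sigma2)
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total


# Arbitrary example parameters.
print(kl_closed_form(0.0, 1.0, 1.0, 2.0))
print(kl_numeric(0.0, 1.0, 1.0, 2.0))
```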
Prove that minimizing MSE is equivalent to maximizing likelihood (assuming Gaussian distributed data).
Solution

Mean squared error is defined as $$MSE = \frac{1}{N}\sum^N_{n=1}(\hat{y}_n - y_n)^2$$ where $N$ is the number of examples, $y_n$ are the true labels, and $\hat{y}_n$ are the predicted labels. Log-likelihood is defined as $LL = \log(p(\mathbf{y}|\mathbf{x}))$. Assuming that the examples are independent and identically distributed (i.i.d.) we get $$ LL = \log\prod_{n=1}^Np(y_n|x_n) = \sum_{n=1}^{N}\log p(y_n|x_n). $$ Now, substituting in the definition of the normal distribution $$ \mathcal{N}(y;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$ for $p(y_n|x_n)$ and simplifying the expression, we get $$ LL = \sum_{n=1}^{N} \left[-\frac{1}{2}\log(2\pi) - \log\sigma - \frac{(y_n - \mu_n)^2}{2\sigma^2}\right].$$ Finally, replacing $\mu_n$ with $\hat{y}_n$ (because we use the mean as our prediction), and noticing that maximizing the expression above depends only on the third term (because the others are constants for a fixed $\sigma$), we arrive at the conclusion that to maximize the log-likelihood we must minimize $$ \sum_{n=1}^{N}\frac{(y_n - \hat{y}_n)^2}{2\sigma^2} = \frac{N}{2\sigma^2}MSE, $$ which, since $\frac{N}{2\sigma^2}$ is a positive constant, is the same as minimizing the MSE.
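The equivalence can also be checked numerically. The sketch below uses a made-up one-parameter model $\hat{y} = wx$ and a grid of candidate slopes (both are assumptions for illustration only) and confirms that the slope with the lowest MSE is also the slope with the highest Gaussian log-likelihood.

```python
import math


def mse(y_true, y_pred):
    return sum((yh - y) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)


def gaussian_log_likelihood(y_true, y_pred, sigma=1.0):
    # Sum over examples of log N(y_n; y_hat_n, sigma^2), with sigma held fixed.
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(sigma) - (y - yh) ** 2 / (2 * sigma ** 2)
        for y, yh in zip(y_true, y_pred)
    )


# Hypothetical toy data, roughly y = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]

# Candidate slopes 1.00, 1.01, ..., 3.00 for the model y_hat = w * x.
slopes = [1.0 + 0.01 * k for k in range(201)]

best_by_mse = min(slopes, key=lambda w: mse(ys, [w * x for x in xs]))
best_by_ll = max(slopes, key=lambda w: gaussian_log_likelihood(ys, [w * x for x in xs]))

print(best_by_mse, best_by_ll)
```

Changing `sigma` rescales and shifts the log-likelihood but does not change which slope wins, which is the "constants can be ignored" step of the proof.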
Prove that maximizing likelihood is equivalent to minimizing KL-Divergence.
Solution

KL-Divergence is defined as $$ D_{KL}(p||q) = \sum_x p(x) \log\frac{p(x)}{q(x|\bar{\theta})}$$ where $p(x)$ is the true data distribution, $q(x|\bar{\theta})$ is our model distribution, and $\bar{\theta}$ are the parameters of our model. We can rewrite this as $$ D_{KL}(p||q) = \mathbb{E}_p[\log p(x)] - \mathbb{E}_p[\log q(x|\bar{\theta})]$$ where the notation $\mathbb{E}_p[f(x)]$ means that we are taking the expected value of $f(x)$ by sampling $x$ from $p(x)$. We notice that minimizing $D_{KL}(p||q)$ means maximizing $\mathbb{E}_p[\log q(x|\bar{\theta})]$ since the first term in the expression above is constant (we can't change the true data distribution). Now, to maximize the likelihood of our model, we need to maximize $$q(\bar{x}|\bar{\theta}) = \prod_{n=1}^Nq(x_n|\bar{\theta}).$$ Recall that taking a logarithm does not change the result of optimization which means that we can maximize $$\log q(\bar{x}|\bar{\theta}) = \sum_{n=1}^N\log q(x_n|\bar{\theta}).$$ If we divide this term by the constant factor $N$, we get the sample average $\frac{1}{N}\sum_{n=1}^N\log q(x_n|\bar{\theta})$, which is an empirical estimate of $\mathbb{E}_p[\log q(x|\bar{\theta})]$: the same term that we need to maximize in order to minimize the KLD.
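Here is a small numerical illustration of the argument above. The dataset, the choice of a Binomial(2, $\theta$) model family, and the grid over $\theta$ are all assumptions made for the sake of the example. Taking $p$ to be the empirical distribution of the data, the $\theta$ that maximizes the average log-likelihood is the same $\theta$ that minimizes $D_{KL}(p||q)$.

```python
import math

# Hypothetical dataset of counts in {0, 1, 2}.
data = [0, 1, 1, 2, 1, 0, 2, 1, 1, 2]
n = len(data)

# Empirical distribution p(x), standing in for the true data distribution.
p_hat = [data.count(x) / n for x in range(3)]


def model(theta):
    # q(x | theta) for a Binomial(2, theta) model, x = 0, 1, 2.
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]


def avg_log_likelihood(theta):
    q = model(theta)
    return sum(math.log(q[x]) for x in data) / n


def kl_divergence(theta):
    q = model(theta)
    return sum(p * math.log(p / q[x]) for x, p in enumerate(p_hat) if p > 0)


# Grid of candidate parameters 0.01, 0.02, ..., 0.99.
thetas = [k / 100 for k in range(1, 100)]

theta_mle = max(thetas, key=avg_log_likelihood)
theta_min_kl = min(thetas, key=kl_divergence)

print(theta_mle, theta_min_kl)
```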

Notes: Here is a link to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!

2 Generative Models

Motivation: This week we’ll take a look at generative models. We will aim to understand how they are similar and how they differ from the discriminative models covered last week. In particular, we want to understand the challenges that come with training generative models.

Topics:

Generative Models
Evaluation of Generative Models

Required Reading:

The “Overview”, “What are generative models?”, and “Differentiable inference” sections of the webpage for David Duvenaud’s course on Differentiable Inference and Generative Models.
- Here we want to get a sense of the big picture of what generative models are all about. There are also some fantastic resources here for further reading if you are interested.
A note on the evaluation of generative models
- This paper is the real meat of this week’s content. After reading this paper you should have a good idea of the challenges involved in evaluating (and therefore training) generative models. Understanding these issues will be important for appreciating what the WGAN is all about. Don’t worry too much if some sections don’t completely make sense yet - we’ll be returning to the key ideas in the coming weeks.

Optional Reading:

Ch 20 of the DL book, particularly:
- Differentiable Generator Networks (Ch 20.10.2)
  - Description of a broad class of generative models to which GANs belong which will help contextualize GANs when we look at them next week.
- Variational Autoencoders (Ch 20.10.3)
  - Description of another popular class of differentiable generative model which might be nice to contrast to GANs next week.
- Evaluating Generative Models (Ch 20.14)
  - Summary of techniques and challenges for evaluating generative models which might put Theis et al.’s paper into context.

Questions:

The first two questions are here to make sure that you understand what a generative model is and how it differs from a discriminative model. The last two questions are a good barometer for determining your understanding of the challenges involved in training generative models.

Fit a multivariate Gaussian distribution to the Fisher Iris dataset using maximum likelihood estimation (see Section 2.3.4 of PRML for help) then:
1. Determine the probability of seeing a flower with a sepal length of 7.9, a sepal width of 4.4, a petal length of 6.9, and a petal width of 2.5.
2. Determine the distribution of flowers with a sepal length of 6.3, a sepal width of 4.8, and a petal length of 6.0 (see section 2.3.2 of PRML for help).
3. Generate 20 flower measurements.
4. Generate 20 flower measurements with a sepal length of 6.3.
(congrats you’ve just trained and used a generative model)

Solution

Here is a Jupyter notebook with solutions. Open the notebook on your computer or Google Colab to render the characters properly.
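If the conditioning steps (parts 2 and 4) feel opaque, it may help to see them in two dimensions first. The sketch below is not the Iris solution: it fits a bivariate Gaussian by maximum likelihood to a tiny made-up dataset, then conditions on the first feature using the standard conditional-Gaussian formulas (the two-dimensional case of what PRML Section 2.3 covers). All data values and function names are our own.

```python
import math
import random

# Hypothetical two-feature measurements; NOT the real Iris data.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]
n = len(data)

# Maximum likelihood estimates: sample mean and covariance normalized by N (not N - 1).
mu1 = sum(a for a, _ in data) / n
mu2 = sum(b for _, b in data) / n
s11 = sum((a - mu1) ** 2 for a, _ in data) / n
s22 = sum((b - mu2) ** 2 for _, b in data) / n
s12 = sum((a - mu1) * (b - mu2) for a, b in data) / n


def condition_on_first(x1):
    """Mean and variance of p(x2 | x1) under the fitted bivariate Gaussian."""
    cond_mean = mu2 + s12 / s11 * (x1 - mu1)
    cond_var = s22 - s12 ** 2 / s11
    return cond_mean, cond_var


def sample_given_first(x1, count, seed=0):
    """Draw `count` values of x2 given that the first feature equals x1."""
    rng = random.Random(seed)
    mean, var = condition_on_first(x1)
    return [rng.gauss(mean, math.sqrt(var)) for _ in range(count)]


cond_mean, cond_var = condition_on_first(4.0)
samples = sample_given_first(4.0, 20)
print(cond_mean, cond_var)
```

The four-dimensional Iris case works the same way, with the scalar divisions replaced by matrix inverses of the relevant covariance blocks.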
Describe in your own words the difference between a generative and a discriminative model.
Solution

This is an open ended question but here are some of the differences:
- In the generative setting, we usually model $p(x)$, while in the discriminative setting we usually model $p(y|x)$.
- Generative models are usually non-deterministic, and we can sample from them, while discriminative models are often deterministic, and we can't necessarily sample from them.
- Discriminative models need labels while generative models typically do not.
- In generative modelling the goal is often to learn some latent variables that describe the data in a compact manner, this is not usually the case for discriminative models.
Theis et al. claim that “a model with zero KL divergence will produce perfect samples” — why is this the case?
Solution

As we showed last week, $D_{KL}(p||q) = 0$ if and only if $p(x)$, the true data distribution, and $q(x)$ the model distribution, are the same.

Therefore, if $D_{KL}(p||q) = 0$, samples from our model will be indistinguishable from the real data.
Explain why the high log-likelihood of a generative model might not correspond to realistic samples?
Solution

Theis et al. outlined two scenarios where this is the case:
- Low likelihood & good samples: our model can overfit to the training data and produce good samples, however, because the model has overfitted it will have a low likelihood for unseen test data.
- High likelihood & poor samples: here the issue is that high dimensional data will tend to have higher log-likelihoods than low dimensional data.

Notes: Here is a link to our notes for the lesson. We were fortunate enough to have Tim Salimans sit in on the session!

3 Generative Adversarial Networks

Motivation: Let’s read the original GAN paper. Our main goal this week is to understand how GANs solve some of the problems with training generative models, as well as some of the new issues that come with training GANs.

The second paper this week is actually optional but highly recommended — we think that it contains some interesting material and sets the stage for looking at WGAN in week 4, however, the core concepts will be repeated again. Depending on your interest you might want to spend more or less time on this paper (we recommend that most people don’t spend too much time).

Topics:

Generative Adversarial Networks
The Jensen-Shannon Divergence (JSD)
Why training GANs is hard

Required Reading:

Goodfellow’s GAN paper
- This is the paper that started it all and if we want to understand WGAN & WGAN-GP we’d better understand the original GAN.
Toward Principled Methods for Generative Adversarial Network Training
- This paper explores the difficulties in training GANs and is a precursor to the WGAN paper that we will look at next week. The paper is quite math heavy so unless math is your cup of tea you shouldn’t spend too much time trying to understand the details of the proofs, corollaries, and lemmas. The important things to understand here are: what is the problem, and how do the proposed solutions solve the problem. Focus on the introduction, the English descriptions of the theorems and the figures. Don’t spend too much time on this paper.

Optional Reading:

Goodfellow’s tutorial on GANs
- A more in-depth explanation of GANs from the man himself.
The GAN chapter in the DL book (20.10.4)
- A summary of what a GAN is and some of the issues involved in GAN training.
Coursera (Stanford) course on game theory videos: 1-05, 2-01, 2-02, and 3-04b
- This is really here just for people who are interested in the game theory ideas such as minmax.
Finish reading GANs and Divergence Minimization.
- Now that we know what a GAN is it will be worth it to go back and finish reading this blog. It should help to tie together many of the concepts we’ve covered so far. It also has some great resources for extra reading at the end.
Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory
- A short blog post which briefly summarises many of the topics we’ve covered so far.
How to Train your Generative Models? And why does Adversarial Training work so well? and An Alternative Update Rule for Generative Adversarial Networks
- Two great blog posts from Ferenc Huszár that discuss the challenges in training GANs as well as the differences between the JSD, KLD and reverse KLD.
Simple Python GAN example
- This example illustrates how simple GANs are to implement by doing it in 145 lines of Python using NumPy and a simple autograd library.

Questions:

The first three questions this week are here to make sure that you understand some of the most important points in the GAN paper. The last question is to make sure you understood the overall picture of what a GAN is, and to get your hands dirty with some of the practical difficulties of training GANs.

Prove that minimizing the optimal discriminator loss, with respect to the generator model parameters, is equivalent to minimizing the JSD.
- Hint, it may help to somehow introduce the distribution $p_m(x) = \frac{p_d(x) + p_g(x)}{2}$.
Solution

The loss we are minimizing is $$\mathbb{E}_{x \sim p_d(x)}[\log D^*(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D^*(G(z)))]$$ where $p_d(x)$ is the true data distribution, $p_z(z)$ is the noise distribution from which we draw samples to pass through our generator, $D$ and $G$ are the discriminator and generator, and $D^*$ is the optimal discriminator which has the form: $$ D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}.$$ Here $p_g(x)$ is the distribution of the data sampled from the generator. Substituting in $D^*(x)$ and $p_g(x)$, we can rewrite the loss as $$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_d(x) + p_g(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_d(x) + p_g(x)}]. $$ Now we can multiply the values inside the logs by $1 = \frac{0.5}{0.5}$ to get $$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{0.5 p_d(x)}{0.5(p_d(x) + p_g(x))}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{0.5 p_g(x)}{0.5(p_d(x) + p_g(x))}]. $$ Recalling that $\log(ab) = \log(a) + \log(b)$ and defining $p_m(x) = \frac{p_d(x) + p_g(x)}{2}$, we now get $$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_m(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_m(x)}] - 2\log2. $$ Using the definition of the KL-Divergence, this simplifies to $$ D_{KL}(p_d||p_m) + D_{KL}(p_g||p_m) - 2\log2. $$ Finally, using the definition of the JS-Divergence, $D_{JS}(p_d||p_g) = \frac{1}{2}D_{KL}(p_d||p_m) + \frac{1}{2}D_{KL}(p_g||p_m)$, the loss becomes $$ 2D_{JS}(p_d||p_g) - 2\log2. $$ For the purposes of minimization the constant factor of 2 and the $2\log2$ term can be ignored, so minimizing the loss is equivalent to minimizing $$ D_{JS}(p_d||p_g).$$
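The identity can be checked numerically for discrete distributions. In the sketch below the two distributions are arbitrary examples of ours, and the JSD is taken to be the average of the two KL terms against $p_m$. With the optimal discriminator plugged in, the loss comes out as $2D_{JS}(p_d||p_g) - 2\log 2$, so it differs from the JSD only by a positive scale and a constant shift.

```python
import math


def kl(p, q):
    # D_KL(p || q) for discrete distributions given as lists of probabilities.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    # Jensen-Shannon divergence: average KL to the mixture p_m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def optimal_discriminator_loss(p_d, p_g):
    # E_{x~p_d}[log D*(x)] + E_{x~p_g}[log(1 - D*(x))], with D* = p_d / (p_d + p_g).
    total = 0.0
    for pd, pg in zip(p_d, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total


# Arbitrary example distributions over three outcomes.
p_d = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]

loss = optimal_discriminator_loss(p_d, p_g)
print(loss, 2 * jsd(p_d, p_g) - 2 * math.log(2))
```

When the generator matches the data exactly, the JSD is zero and the loss bottoms out at $-2\log 2$.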
Explain why Goodfellow says that $D$ and $G$ are playing a two-player minimax game and derive the definition of the value function $V(G,D)$.
Solution

$G$ wants to maximize the probability that $D$ thinks the generated samples are real $\mathbb{E}_{z \sim p_z(z)}[D(G(z))]$. This is the same as minimizing the probability that $D$ thinks the generated samples are fake $\mathbb{E}_{z \sim p_z(z)}[1 - D(G(z))]$.

On the other hand, $D$ wants to maximise the probability that it assigns the labels correctly $\mathbb{E}_{x \sim p_d(x)}[D(x)] + \mathbb{E}_{z \sim p_z(z)}[1 - D(G(z))]$. Note that $D(x)$ should be 1 if $x$ is real, and 0 if $x$ is fake.

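We can take logs without changing the optimization, which gives the value function $$ V(G,D) = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], $$ and the two players are solving $\min_G\max_D V(G,D)$: $D$ maximizes $V$ while $G$ minimizes it.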
Why is it important to carefully tune the amount that the generator and discriminator are trained in the original GAN formulation?
- Hint, it has to do with the approximation for the JSD & the dimensionality of the data manifolds.
Solution

If we train the discriminator too much we get vanishing gradients. This is due to the fact that when the true data distribution and model distribution lie on low dimensional manifolds (or have disjoint support almost everywhere), the optimal discriminator will be perfect — i.e. the gradient will be zero almost everywhere. This is something that almost always happens.

On the other hand, if we train the discriminator too little, then the loss for the generator no longer approximates the JSD. This is because the approximation only holds if the discriminator is near the optimal $D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}$.
Implement a GAN and train it on Fashion MNIST.
- This notebook contains a skeleton with boilerplate code and hints.
- Try various settings of hyper-parameters, other than those suggested, and see if the model converges.
- Examine samples from various stages of the training. Rank them without looking at the corresponding loss and see if your ranking agrees with the loss.
Solution

Here is a GAN implementation using Keras.

Notes: Here is a link to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!

4 Wasserstein GAN

Motivation: Last week we saw how GANs solve some problems in training generative models but also that they bring in new problems. This week we’ll look at the Wasserstein GAN which goes a long way to solving these problems.

Topics:

Wasserstein Distance vs KLD/JSD
Wasserstein GAN

Required Reading:

The WGAN paper
- This should be pretty self-explanatory! We’re doing a DFL on Wasserstein GANs so we’d better read the paper! (This isn’t the end of the road, however, next week we’ll look at WGAN-GP.) The paper builds upon an intuitive idea: the family of Wasserstein distances is a nice distance between probability distributions, that is well grounded in theory. The authors propose to use the 1-Wasserstein distance to estimate generative models. More specifically, they propose to use the 1-Wasserstein distance in place of the JSD in a standard GAN, that is, to measure the difference between the true distribution and the model distribution of the data. They show that the 1-Wasserstein distance is an integral probability metric (IPM) with a meaningful set of constraints (1-Lipschitz functions), and can, therefore, be optimized by focusing on discriminators that are “well behaved” (meaning that their output does not change too much if you perturb the input, i.e. they are Lipschitz!).

Optional Reading:

Summary blog for the paper
- This is a brilliant blog post that summarises almost all of the key points we’ve covered over the last 4 weeks and puts them in the context of the WGAN paper. In particular, if any of the more theoretic aspects of the WGAN paper were a bit much for you then this post is worth reading.
Another good summary of the paper
Wasserstein / Earth Mover distance blog posts
Set of three lectures by Marco Cuturi on optimal transport (with accompanying slides)
- If you are interested in the history of optimal transport and would like to see where the KR duality comes from (that’s the crucial argument in the WGAN paper which connects the 1-Wasserstein distance to an IPM with a Lipschitz constraint), the Wasserstein distance, or if you feel like you need a different explanation of what the Wasserstein distance and the Kantorovich-Rubinstein duality are, then watching these lectures is recommended. There are some really cool applications of optimal transport here too, and a more exhaustive description of other families of Wasserstein distances (such as the quadratic one) and their dual formulation.
The first 15 or so minutes of this lecture on GANs by Sebastian Nowozin
- Great description of WGAN, including Lipschitz and KR duality. This lecture is actually part 2 of a series of 3 lectures from MLSS Africa. Watching the whole series is also highly recommended if you are interested in knowing more about the bigger picture for GANs (including other interesting developments and future work) and how WGAN relates to other GAN variants. However, to avoid spoilers for next week, you should wait to watch the rest of part 2.
Computational Optimal Transport by Peyré and Cuturi (Chapters 2 and 3 in particular)
- If you enjoyed Marco’s lectures above, or want a more thorough theoretical understanding of the Wasserstein distance, then this textbook is for you! However, please keep in mind that this textbook is somewhat mathematically involved, so if you don’t have a mathematics background you may struggle with it.

Questions:

The first two questions are here to highlight the key difference between the WGAN and the original GAN formulation. As before, the last question is to make sure you understood the overall picture of what a WGAN is and to get your hands dirty with how they differ from standard GANs in practice.

What happens to the KLD/JSD when the real data and the generator’s data lie on low dimensional manifolds?
Solution

The true distribution and model distribution tend to have different supports, which causes the KLD to blow up and the JSD to saturate at its maximum value of $\log 2$.
With this in mind, how does using the Wasserstein distance, rather than JSD, reduce the sensitivity to careful scheduling of the generator and discriminator?
Solution

The Wasserstein distance does not saturate or blow up for distributions with different supports. This means that we still get signals in these cases, which in turn means that we don’t have to worry about over-training the discriminator (or critic). In fact, we want to train it to optimality since it will give better signals.
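A tiny experiment makes the contrast concrete. In the sketch below (the grid size and the helper names are our own choices), $P$ is a point mass at position 0 and $Q$ is a point mass that we slide further and further away. The supports are disjoint for every shift, so the KLD is infinite and the JSD is stuck at $\log 2$ no matter how far apart the masses are, while the 1-Wasserstein distance grows with the shift and so still tells us how far we have to move.

```python
import math


def kl(p, q):
    # D_KL(p || q); infinite if q assigns zero mass where p does not.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def emd_1d(p, q):
    # On a unit-spaced 1D grid, the 1-Wasserstein distance is the sum of |CDF_p - CDF_q|.
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def point_mass(position, size=6):
    return [1.0 if i == position else 0.0 for i in range(size)]


p = point_mass(0)
for shift in range(1, 5):
    q = point_mass(shift)
    print(shift, kl(p, q), jsd(p, q), emd_1d(p, q))
```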
Let’s compare the 1-Wasserstein Distance (aka Earth Mover’s Distance, or EMD) with the KLD for a few simple discrete distributions. We want to build up an intuition for the differences between these two measures and why one might be better than another in certain scenarios. You might find it useful to use the SciPy implementations for 1-Wasserstein and KLD.
1. Let $P(x)$, $Q(x)$ and $R(x)$ be discrete distributions on $Z$ with:
  - $P(0) = 0.5$, $P(1) = 0.5$,
  - $Q(0) = 0.75$, $Q(1) = 0.25$, and
  - $R(0) = 0.25$ and $R(1) = 0.75$.
    Calculate both the KLD and EMD for the following pairs of distributions. You should notice that while Wasserstein is a proper distance metric, KLD is not ($D_{KL}(P||Q) \ne D_{KL}(Q||P)$).
    1. $P$ and $Q$
    2. $Q$ and $P$
    3. $P$ and $P$
    4. $P$ and $R$
    5. $Q$ and $R$
2. Let $P(x)$, $Q(x)$, $R(x)$, $S(x)$ be discrete distributions on $Z$ with:
  - $P(0) = 0.5$, $P(1) = 0.5$, $P(2) = 0$,
  - $Q(0) = 0.33$, $Q(1) = 0.33$, $Q(2) = 0.33$,
  - $R(0) = 0.5$, $R(1) = 0.5$, $R(2) = 0$, $R(3) = 0$, and
  - $S(0) = 0$, $S(1) = 0$, $S(2) = 0.5$, $S(3) = 0.5$.
    Calculate the KLD and EMD between the following pairs of distributions. You should notice that the EMD is well behaved for distributions with disjoint support while the KLD is not.
    1. $P$ and $Q$
    2. $Q$ and $P$
    3. $R$ and $S$
3. Let $P(x)$, $Q(x)$, $R(x)$, and $S(x)$ be discrete distributions on $Z$ with:
  - $P(0) = 0.25$, $P(1) = 0.75$, $P(2) = 0$,
  - $Q(0) = 0$, $Q(1) = 0.75$, $Q(2) = 0.25$,
  - $R(0) = 0$, $R(1) = 0.25$, $R(2) = 0.75$, and
  - $S(0) = 0$, $S(1) = 0$, $S(2) = 0.25$, $S(3) = 0.75$.
    Calculate the EMD between the following pairs of distributions. Here we just want to get more of a sense for the EMD.
    1. $P$ and $Q$
    2. $P$ and $R$
    3. $Q$ and $R$
    4. $P$ and $S$
    5. $R$ and $S$
Solution

Here is a Jupyter notebook with solutions.
Based on the GAN implementation from week 3, implement a WGAN for FashionMNIST.
- Try various settings of hyper-parameters. Does this model seem more resilient to the choice of hyper-parameters?
- Examine samples from various stages of the training. Rank them without looking at the corresponding loss and see if your ranking agrees with the loss.
Solution

Here is a WGAN implementation using Keras.

Notes: Here is a link to our notes for the lesson. We were fortunate enough to have Martin Arjovsky sit in on the session!

5 WGAN-GP

Motivation: Let’s read the WGAN-GP paper (Improved Training of Wasserstein GANs). As has been the trend over the last few weeks, we’ll see how this method solves a problem with the standard WGAN: weight clipping, as well as a potential problem in the standard GAN: overfitting.

Topics:

WGAN-GP
Weight clipping vs gradient penalties
Measuring GAN performance

Required Reading:

WGAN-GP paper
- This is our final required reading. The paper suggests improvements to the training of Wasserstein GANs with some great theoretical justifications and actual results.

Optional Reading:

On the Regularization of Wasserstein GANs
- This paper came out after the WGAN-GP paper but gives a thorough discussion of why the weight clipping in the original WGAN was an issue (see Appendix B). In addition, they propose other solutions for how to get around doing so and provide other interesting discussions of GANs and WGANs.
Wasserstein GAN & WGAN-GP blog post
- Another blog that summarises many of the key points we’ve covered and includes WGAN-GP.
GAN — How to measure GAN performance?
- A blog that discusses a number of approaches to measuring the performance of GANs, including the Inception score, which is useful to know about when reading the WGAN-GP paper.

Questions:

This week’s questions follow the same pattern as last week’s. How does the formulation of WGAN-GP differ from that of the original GAN or WGAN (and how is it similar)? What does this mean in practice?

Why does weight clipping lead to instability in the training of a WGAN & how does the gradient penalty avoid this problem?
Solution

The instability comes from the fact that if we choose the weight clipping hyper-parameter poorly we end up with either exploding or vanishing gradients. This is because weight clipping encourages the optimizer to push the absolute values of all of the weights very close to the clipping value. Figure 1b in the paper shows this happening. To explain this phenomenon, consider a simple logistic regression model. Here if any of the features are highly predictive of a particular class it will be assigned as positive a weight as possible, similarly, if a feature is not predictive of a particular class, it will be assigned as negative a weight as possible. Now depending on our choice of the weight clipping value, we either get exploding or vanishing gradients.
- Vanishing gradients: this is similar to the issue of vanishing gradients in a vanilla RNN, or a very deep feed-forward NN without residual connections. If we choose the weight clipping value to be too small, during back-propagation, the error signal going to each layer will be multiplied by small values before being propagated to the previous layer. This results in exponential decay in the error signal as it propagates farther backward.
- Exploding gradients: similarly, if we choose a weight clipping value that is too large, the error signals will get repeatedly multiplied by large numbers as they propagate backward, resulting in exponential growth.
This phenomenon is also related to the reason we use weight initialization schemes such as Xavier and He and also why batch normalization is important: both of these methods help to ensure that information is propagated through the network without decaying or exploding.
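The exponential behaviour is easy to see in a deliberately oversimplified setting. The sketch below assumes a chain of scalar linear layers whose weights have all been pushed to the clipping value; real critics have weight matrices and nonlinearities, so treat this purely as an illustration of the repeated-multiplication argument. The gradient of the output with respect to the input is then just the clipping value raised to the depth.

```python
def backprop_gain(clip_value, depth):
    """Gradient of the output w.r.t. the input for a chain of `depth` scalar
    linear layers whose weights all sit at the clipping value."""
    gain = 1.0
    for _ in range(depth):
        # Each layer multiplies the backward signal by its weight.
        gain *= clip_value
    return gain


# Hypothetical clipping values: too small, just right for this toy chain, too large.
for clip_value in (0.5, 1.0, 2.0):
    print(clip_value, [backprop_gain(clip_value, depth) for depth in (1, 5, 10, 20)])
```

Only a narrow range of clipping values keeps the signal at a usable scale across many layers, which is why the hyper-parameter is so delicate. The gradient penalty sidesteps this by constraining the norm of the critic's gradient with respect to its input directly, rather than constraining each weight.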
Explain how WGAN-GP addresses issues of overfitting in GANs.
Solution

Both WGAN-GP, and indeed the original weight-clipped WGAN, have the property that the discriminator/critic loss corresponds to the quality of the samples from the generator, which lets us use the loss to detect overfitting (we can compare the negative discriminator/critic loss for a validation set to that of the training set of real images; when the two diverge we have overfitted). The correspondence between the loss and the sample quality can be explained by a number of factors.
- With a WGAN we can train our discriminator to optimality. This means that if the critic is struggling to tell the difference between real and generated images we can conclude that the real and generated images are similar. In other words, the loss is meaningful.
- In addition, in a standard GAN where we cannot train the discriminator to optimality, our loss no longer approximates the JSD. We do not know what function our loss is actually approximating and as a result we cannot say (and in practice we do not see) that the loss is a meaningful measure of sample quality.
- Finally, there are arguments to be made that even if the loss for a standard GAN was approximating the JSD, the Wasserstein distance is a better distance measure for image distributions than the JSD.
Based on the WGAN implementation from week 4, implement an improved WGAN for MNIST.
- Compare the results, ease of hyper-parameter tuning, and correlation between loss and your subjective ranking of samples, with the previous two models.
- The Keras implementation of WGAN-GP can be tricky. If you are familiar with another framework like TensorFlow or PyTorch it might be easier to use that instead. If not, don’t be too hesitant to check the solution if you get stuck.
Solution

Here is a WGAN-GP implementation using Keras.

Notes: Here is a link to our notes for the lesson. We were fortunate enough to have Ishaan Gulrajani sit in on the session!

Announcing the 2019 DFL Fellows

2019-04-15T16:00:00+00:00

After we launched Depth First Learning last year, we wanted to keep the momentum and continue outputting high-quality study guides for machine learning. Subsequently, we launched the Depth First Learning Fellowship with funding provided by Jane Street.

We were blown away by the response. With over 100 applicants from 5 continents, we had a tremendously hard time selecting only four proposals. After speaking with many of the applicants, we could not be more thrilled with the groups we selected. See below for bios of the inaugural class, as well as the papers that their groups will be respectively learning.

What’s the process now, you ask? The fellows are hard at work constructing their curricula and will soon begin online classes. Participants will meet weekly to discuss and go beyond the material.

We are looking for participants for these groups.
If you’re interested, please let us know by filling out this form.

Steve Kroon - Stellenbosch (South Africa)

Target paper: “Variational Inference with Normalizing Flows”, by Rezende and Mohamed (ICML 2015)

Dr Steve Kroon obtained MCom (Computer Science) and PhD (Mathematical Statistics) degrees while studying at Stellenbosch University. He joined the Stellenbosch University Computer Science department in 2008. His PhD thesis considered aspects of statistical learning theory, and his subsequent research has focused on decision making in artificial intelligence, including machine learning, reinforcement learning, and search techniques. He has supervised and co-supervised 5 graduated and 3 current master’s students, and has published 3 journal articles and 8 peer-reviewed conference and conference workshop articles. He has served as a reviewer for the journals Algorithmica, the Journal of Universal Computer Science, and the South African Computer Journal, as well as on the programme committee for 2 conferences. He holds a Diploma in Actuarial Techniques, and is a member of the Centre for Artificial Intelligence Research, the Institute of Electrical and Electronics Engineers (IEEE) and the IEEE Computational Intelligence Society, the International Computer Games Association, the South African Statistical Association, and the South African Institute for Computer Scientists and Information Technologists.

Sandhya Prabhakaran - New York (USA)

Target paper: “Spherical CNNs” by Cohen, Geiger, Köhler and Welling (ICLR 2018)

Dr. Sandhya Prabhakaran is a Research Fellow at Memorial Sloan Kettering Cancer Centre, NYC. Before that she was a Research Scientist at Columbia University in the City of New York.

She received her Ph.D. from the Department of Mathematics and Computer Science, University of Basel (Switzerland) and her Masters in Intelligent Systems (Robotics) from the School of Informatics, University of Edinburgh (Scotland). Her research deals with developing statistical theory and inference models, particularly for problems in Cancer Biology and Computer Vision.

Prior to academics, she was an Assembler programmer working with the Mainframe Operating System (z/OS) at IBM Software Laboratories, Bangalore and has developed Mainframe applications at UST Global, Thiruvananthapuram.

She is an avid hiker and distance runner and has completed 4 out of the 6 World Marathon Majors.

Bhairav Mehta - Montreal (Canada)

Target paper: “Stein Variational Gradient Descent” by Liu and Wang (NIPS 2016)

After finishing my undergraduate studies at the University of Michigan, I migrated north to Montreal, where I’m now a graduate student at Mila. I work mostly on reinforcement learning and robotics, but continue to find that teaching is the most rewarding part of graduate (and undergraduate) studies. I’ve been serving as a tutor, TA, and now, GSI, for over a decade, and I’m incredibly excited by the opportunity to lead a DFL course online. In my free time, you can find me helping ducks waddle across the street at Duckietown, or building deep learning models for my nonprofit tackling core problems in K-12 education.

Vinay Ramasesh, Piyush Patil, and Riley Edmunds - Berkeley (USA)

Target paper: “Resurrecting the sigmoid in deep learning through dynamical isometry” by Pennington, Schoenholz and Ganguli (NIPS 2017)

Vinay: I am finishing up a Ph.D. in physics at UC Berkeley, where I have worked on building and testing small quantum processors made from superconducting circuits. At Berkeley, I work in the Quantum Nanoelectronics Lab under the guidance of Dr. Irfan Siddiqi. My experience with machine learning comes from Berkeley's machine learning student group, ML@B, which I joined in 2017. Previously, I studied physics and electrical engineering at MIT, working in the group of Dr. Martin Zwierlein to build up an experiment to cool, trap, and image strongly-interacting atomic gases.

Piyush: I graduated from UC Berkeley last May, where I studied electrical engineering and computer science and mathematics. While at Berkeley, I helped to get the university's student-run machine learning club, ML@B, up and running, serving as the vice president of projects during the last couple of years. I was involved with research in quantum machine learning, adversarial examples, and natural language understanding. After graduating, I joined Nuro, a robotics startup working to build autonomous vehicles. Outside of ML, I enjoy reading philosophy, going hiking and backpacking, and spending time with friends.

Riley: I'm currently finishing up my undergrad degree in computer science at UC Berkeley. I was one of the early members of ML@B, where as vice president of research, I helped club members form teams to work on ML research projects. At UC Berkeley, I've worked under Prof. Dawn Song, Alice Agogino and Stella Yu. With a couple friends, in February 2018 I co-founded an ML consulting company, Alinea AI. You can find more on my background at rileyedmunds.com. In my spare time, I enjoy traveling, playing spikeball, and discussing thought-provoking books.