InfoGAN
Thank you to Kumar Krishna Agrawal, Yasaman Bahri, Peter Chen, Nic Ford, Roy Frostig, Xinyang Geng, Rein Houthooft, Ben Poole, Colin Raffel and Supasorn Suwajanakorn for contributing to this guide.
Why
InfoGAN is an extension of GANs that learns to represent unlabeled data as codes, also known as representation learning. Compare this to vanilla GANs, which can only generate samples, or to VAEs, which learn both to infer codes and to generate samples. Representation learning is an important direction for unsupervised learning, and GANs are a flexible and powerful framework for approaching it. This makes InfoGAN an interesting stepping stone towards research in representation learning.
1 Information Theory
Motivation: Information theory formalizes the concept of the “amount of randomness” or “amount of information”. These concepts can be extended to relative quantities among random variables. This section leads up to Mutual Information (MI), a concept at the core of InfoGAN. MI extends entropy to pairs of random variables: it measures the information shared between the two, i.e., the difference between the information obtained by observing each variable separately and by observing them jointly.
Topics:
- Entropy
- Differential Entropy
- Conditional Entropy
- Jensen’s Inequality
- KL divergence
- Mutual Information
Required Reading:
- Chapter 1.6 from Pattern Recognition and Machine Learning / Bishop. (“PRML”)
- A good intuitive explanation of Entropy, from Quora.
Optional Reading:
- Notes on Kullback-Leibler Divergence and Likelihood Theory
- For more perspectives and deeper dependencies, see Metacademy:
- Visual Information Theory
Questions:
- From PRML: 1.31, 1.36, 1.37, 1.38, 1.39, 1.41.
Solution
PRML 1.31: Consider two variables \(x\) and \(y\) having joint distribution \(p(x,y)\). Show that the differential entropy of this pair of variables satisfies \(H(x,y) \le H(x) + H(y)\) with equality if, and only if, \(x\) and \(y\) are statistically independent.
If \(x\) and \(y\) are independent then the joint distribution factorizes:
\(p(x,y) = p(x)p(y)\)
In that case the conditional entropies reduce to the marginal entropies, \(H(x|y) = H(x)\) and \(H(y|x) = H(y)\), so the chain rule for entropy gives:
\(H(x,y) = H(x) + H(y|x) = H(y) + H(x|y) \to\)
\(H(x,y) = H(x) + H(y)\)
Therefore there is no mutual information \(I(x,y)\) when \(x\) and \(y\) are independent:
\(I(x,y) = H(x) + H(y) - H(x,y) = H(x,y) - H(x,y) = 0\)
If \(x\) and \(y\) are dependent, then conditioning strictly reduces entropy, \(H(y|x) < H(y)\) (a consequence of the non-negativity of the KL divergence, i.e., of Jensen's inequality), so:
\(H(x,y) = H(x) + H(y|x) < H(x) + H(y) \to\)
\(I(x,y) = H(x) + H(y) - H(x,y) > 0\)
The independent and dependent cases combine into the general form:
\(H(x,y) \le H(x) + H(y)\) with
\(I(x,y) = H(x) + H(y) - H(x,y) \ge 0\)
- How is Mutual Information similar to correlation? How are they different? Are they directly related under some conditions?
Solution
Start here.
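To complement the linked answer, here is a small numerical sketch (an illustrative example, not from the guide) of a pair of variables with zero correlation but positive mutual information: \(X\) uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\).

```python
# Illustrative example: X uniform on {-1, 0, 1}, Y = X^2. Names are ad hoc.
import numpy as np

# Joint distribution p(x, y); Y is a deterministic function of X.
p_xy = {(-1, 1): 1/3, (0, 0): 1/3, (1, 1): 1/3}

# Pearson correlation needs only first and second moments.
e_x = sum(p * x for (x, y), p in p_xy.items())        # 0
e_y = sum(p * y for (x, y), p in p_xy.items())        # 2/3
e_xy = sum(p * x * y for (x, y), p in p_xy.items())   # 0
print("covariance:", e_xy - e_x * e_y)                # 0.0 -> correlation is 0

# Mutual information: Y is determined by X, so H(Y|X) = 0 and I(X;Y) = H(Y).
p_y = {0: 1/3, 1: 2/3}
i_xy = -sum(p * np.log2(p) for p in p_y.values())
print("I(X;Y) in bits:", i_xy)                        # ~0.918 > 0
```

So MI picks up this non-linear, deterministic dependence even though the correlation is exactly zero; in the other direction, for jointly Gaussian variables MI is a monotone function of the correlation coefficient, \(I = -\frac{1}{2}\log(1-\rho^2)\).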
- In classification problems, minimizing cross-entropy loss is the same as minimizing the KL divergence of the predicted class distribution from the true class distribution. Why do we minimize the KL, rather than other measures, such as L2 distance?
Solution
In a classification problem, one natural measure of “goodness” is the likelihood, or marginal probability, of the observed values. By definition this is \(P(Y | X; \mathrm{params}) = \prod_i P(Y_i = y_i | X_i; \mathrm{params})\). It says that we want to maximize the probability of producing the “correct” class \(y_i\) only, and don’t really care to push down the probabilities of the incorrect classes the way an L2 loss would.
E.g., suppose the true label is \(y = [0, 1, 0]\) (a one-hot encoding of class 2 out of {1, 2, 3}), and the softmax of the final layer of the NN is \(y' = [0.2, 0.5, 0.3]\). One could use the L2 distance between these two distributions, but if instead we minimize the KL divergence \(KL(y \| y')\), which is equivalent to minimizing cross-entropy loss (the standard loss everyone uses to solve this problem), we would compute \(-(0 \cdot \log(0.2) + 1 \cdot \log(0.5) + 0 \cdot \log(0.3)) = -\log(0.5)\), which is exactly the negative log likelihood of the label being class 2 for this particular training example.
Here, choosing to minimize the KL means we are maximizing the data likelihood. It could also be reasonable to use L2, but then we would be maximizing the data likelihood plus an “unobserved anti-likelihood” (a made-up term), meaning we would also be pushing down the probabilities assigned to all the wrong labels.
Another reason L2 is less preferred might be that L2 involves looping over all class labels, whereas the KL (with one-hot labels) only needs to look at the correct class when computing the loss.
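A minimal sketch of the example above (same one-hot label and softmax output), showing that cross-entropy only depends on the probability assigned to the correct class while L2 penalizes every coordinate:

```python
# Same numbers as the worked example above; this is only a sketch of the comparison.
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])    # one-hot label, class 2
y_pred = np.array([0.2, 0.5, 0.3])    # softmax output

# Cross-entropy (= KL up to the constant H(y_true), which is 0 for one-hot labels):
# only the probability assigned to the correct class matters.
cross_entropy = -np.sum(y_true * np.log(y_pred))
print("cross-entropy:", cross_entropy)          # -log(0.5) ~ 0.693

# L2 penalizes every coordinate, i.e., it also pushes down the wrong-class probabilities.
l2 = np.sum((y_true - y_pred) ** 2)
print("L2:", l2)                                # 0.04 + 0.25 + 0.09 = 0.38
```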
2 Generative Adversarial Networks (GAN)
Motivation: GANs are a framework for constructing models that learn to sample from a probability distribution, given a finite sample from that distribution. More concretely, after training on a finite unlabeled dataset (say of images), a GAN can generate new images of the same “kind” that aren’t in the original training set.
Topics:
- JS (Jensen-Shannon) divergence
- How are GANs trained?
- Various possible GAN objectives. Why are they needed?
- With an optimal discriminator, GAN training minimizes the JS divergence between the data-generating distribution and the distribution of samples from the generator part of the GAN
Required Reading:
Optional Reading:
Questions:
- Prove that \(0 \leq JSD(P||Q) \leq 1\) bit for all P, Q. When are the bounds achieved?
Solution
Start here.
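As a sanity check rather than a proof, the two bounds can be verified numerically from the definition \(JSD(P\|Q) = \frac{1}{2}KL(P\|M) + \frac{1}{2}KL(Q\|M)\) with \(M = \frac{1}{2}(P+Q)\), using base-2 logarithms (helper names below are ad hoc):

```python
# Numerical check of the JSD bounds (in bits).
import numpy as np

def kl_bits(p, q):
    mask = p > 0                     # convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd_bits(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(jsd_bits(p, p))   # 0.0 -> lower bound, reached when P = Q
print(jsd_bits(p, q))   # 1.0 -> upper bound, reached when P and Q have disjoint supports
```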
- What are the bounds for KL divergence? When are those bounds achieved?
Solution
Start here.
The Kullback–Leibler divergence \(D_{KL}(P||Q)\) between \(P\) and \(Q\) is defined as:
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right)\)
The lower bound of 0 is reached when \(P(x) = Q(x)\) for all \(x\), because then \(\frac{P(x)}{Q(x)} = 1\):
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2(1) = \sum_{x}P(x) \cdot 0 = 0\)
(That 0 is indeed a lower bound, i.e., \(D_{KL} \ge 0\) in general, follows from Jensen's inequality.) There is no finite upper bound: \(D_{KL}(P||Q)\) diverges to \(\infty\) whenever \(Q(x_i) = 0\) for some \(x_i\) with \(P(x_i) > 0\), because the log-ratio \(\log_2\left(\frac{P(x_i)}{Q(x_i)}\right)\) blows up there:
\(P(x_i) \log_2\left(\frac{P(x_i)}{Q(x_i)}\right) = P(x_i) \log_2\left(\frac{P(x_i)}{0}\right) = \infty \to\)
\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right) = \infty\)
- In the paper, why do they say “In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, \(\log(1 - D(G(z)))\) saturates”?
- Implement a Colab that trains a GAN for MNIST. Try both the saturating and non-saturating generator loss.
Solution
An implementation can be found here.
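Separately from the linked implementation, here is a hedged sketch (PyTorch; the logit value is an arbitrary assumption) of why the saturating generator loss \(\log(1 - D(G(z)))\) gives a weak signal when the discriminator confidently rejects fakes, while the non-saturating \(-\log D(G(z))\) does not:

```python
# Sketch only: D(G(z)) = sigmoid(logit); inspect the gradient each generator loss
# sends back through the logit when the discriminator confidently rejects the fake.
import torch

logit = torch.tensor([-4.6], requires_grad=True)   # D(G(z)) = sigmoid(-4.6) ~ 0.01
d_fake = torch.sigmoid(logit)

saturating = torch.log(1.0 - d_fake)    # minimax objective: minimize log(1 - D(G(z)))
non_saturating = -torch.log(d_fake)     # heuristic objective: minimize -log D(G(z))

for name, loss in [("saturating", saturating), ("non-saturating", non_saturating)]:
    grad, = torch.autograd.grad(loss.sum(), logit, retain_graph=True)
    print(f"{name:>15}: d(loss)/d(logit) = {grad.item():+.4f}")
# saturating:     ~ -0.01 (nearly flat in this regime -> almost no signal for G)
# non-saturating: ~ -0.99 (a strong gradient even when D rejects fakes confidently)
```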
3 The Paper
Motivation: Let’s read the paper. Keep in mind that InfoGAN modifies the original GAN objective in this way:
- Split the incoming noise vector z into two parts - noise and code. The goal is to learn meaningful codes for the dataset.
- In addition to the discriminator, it adds another prediction head to the network that tries to predict the code from the generated sample. The loss is a combination of the normal GAN loss and the prediction loss.
- This new loss term can be interpreted as a lower bound on the mutual information between the generated samples and the code.
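Putting these pieces together, the combined objective has (roughly, in the paper's notation, with \(\lambda\) a weighting hyperparameter) the form

\[\min_{G,Q}\ \max_{D}\ V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda L_I(G, Q),\]

where \(V(D,G)\) is the ordinary GAN minimax value function and \(L_I(G,Q)\) is a variational lower bound on the mutual information \(I(c; G(z,c))\).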
Topics:
- The InfoGAN objective
- Why can’t we directly optimize for the mutual information \(I(c; G(z,c))\)?
- Variational Information Maximization
- Possible choices for classes of random variables for dimensions of the code c
Reproduce:
Required Reading:
Optional Reading:
Questions:
- How does one compute \(\log Q(c|x)\) in practice? How does this answer change based on the choice of the type of random variables in \(c\)?
Solution
What is \(\log Q(c|x)\) when c is a Gaussian centered at \(f_\theta(x)\)? What about when c is the output of a softmax?
See section 6 in the paper.
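For concreteness, here is a hedged sketch (PyTorch; variable names and sizes are assumptions, not the paper's code) of how \(\log Q(c|x)\) is typically computed for a categorical code (softmax head) and a continuous code (factored Gaussian head):

```python
# Sketch of log Q(c|x) for the two usual code types; all shapes are assumptions.
import torch
import torch.nn.functional as F

# Categorical code: Q(c|x) is a softmax over K categories.
logits = torch.randn(8, 10)              # Q head output for a batch of 8, K = 10
c_cat = torch.randint(0, 10, (8,))       # the sampled categorical codes
log_q_cat = -F.cross_entropy(logits, c_cat, reduction="none")  # log Q(c|x) per example

# Continuous code: Q(c|x) = N(c; mu(x), sigma(x)^2), factorized per code dimension.
mu = torch.randn(8, 2)                   # predicted means for 2 continuous code dims
log_sigma = torch.zeros(8, 2)            # predicted log-stds (often fixed, i.e. sigma = 1)
c_cont = torch.randn(8, 2)               # the sampled continuous codes
normal = torch.distributions.Normal(mu, log_sigma.exp())
log_q_cont = normal.log_prob(c_cont).sum(dim=1)   # sum over independent code dimensions

# The MI lower bound term is (up to the constant H(c)) the mean of these log-probs.
mi_lower_bound_term = (log_q_cat + log_q_cont).mean()
```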
- Which objective in the paper can actually be optimized with gradient-based algorithms? How? (An answer to this needs to refer to “the reparameterization trick”)
- Why is an auxiliary \(Q\) distribution necessary?
- Draw a neural network diagram for InfoGAN
Solution
There is a good diagram in this blog post
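As a rough textual stand-in for a diagram (layer sizes are assumptions for MNIST-scale inputs), the key structural point is that \(D\) and \(Q\) share a trunk and differ only in their final heads:

```python
# Sketch only: D and Q share a convolutional trunk; the generator (not shown)
# takes the concatenation [noise z, code c] as input.
import torch
import torch.nn as nn

class DQNetwork(nn.Module):
    def __init__(self, n_cat=10, n_cont=2):
        super().__init__()
        self.trunk = nn.Sequential(                     # shared by D and Q
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 1024), nn.LeakyReLU(0.1),
        )
        self.d_head = nn.Linear(1024, 1)                    # real/fake logit
        self.q_head = nn.Linear(1024, n_cat + 2 * n_cont)   # categorical logits + Gaussian mu, log-std

    def forward(self, x):
        h = self.trunk(x)
        return self.d_head(h), self.q_head(h)

d_logit, q_params = DQNetwork()(torch.randn(4, 1, 28, 28))  # e.g. a batch of MNIST-sized fakes
```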
- In the paper they say “However, in this paper we opt for simplicity by fixing the latent code distribution and we will treat \(H(c)\) as a constant.” What if you want to learn the latent code distribution (say, if you don’t know that the classes are balanced in the dataset)? Can you still optimize for this with gradient-based algorithms? Can you implement this on an intentionally class-imbalanced variant of MNIST?
Solution
You could imagine learning the parameters of the distribution of c, if you can get H(c) to be a differentiable function of those parameters.
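A minimal sketch, assuming a categorical code: parameterize \(p(c)\) by trainable logits so that \(H(c)\) becomes a differentiable function of those parameters (backpropagating through the sampling of \(c\) itself would additionally need something like a score-function or Gumbel-Softmax estimator, which is not shown here).

```python
# Sketch only: a trainable categorical prior over the code c.
import torch
import torch.nn.functional as F

code_logits = torch.randn(10, requires_grad=True)     # learnable parameters of p(c)
log_p = F.log_softmax(code_logits, dim=0)
entropy = -(log_p.exp() * log_p).sum()                 # H(c) in nats, differentiable

entropy.backward()                                     # gradients flow into code_logits
print(entropy.item(), code_logits.grad.norm().item())
# H(c) can now be included in the objective instead of being treated as a constant.
```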
- In the paper they say “the lower bound … is quickly maximized to … and maximal mutual information is achieved”. How do they know this is the maximal value?
- Open-ended question: Is InfoGAN guaranteed to find disentangled representations? How would you tell if a representation is disentangled?