# Assignment #2

Due: Sunday 12 March, 11:59 pm

In this assignment, we'll fit both generative and discriminative models to the MNIST dataset of handwritten numbers. Each datapoint in the MNIST dataset [http://yann.lecun.com/exdb/mnist/] is a 28×28 black-and-white image of a number in {0 … 9}, and a label indicating which number. MNIST is the 'fruit fly' of machine learning – a simple standard problem useful for comparing the properties of different algorithms. Some code for handling MNIST is on the course webpage. You can use whichever programming language you like, and libraries for loading and plotting data. You'll need to write your own initialization, fitting, and prediction code. You can use automatic differentiation in your code, but must still answer the gradient questions.

For this assignment, we'll binarize the dataset, converting the grey pixel values to either black or white (0 or 1), with > 0.5 as the cutoff. When comparing models, we'll need a training and test set. Use the first 10000 samples for training, and another 10000 for testing. Hint: also build a dataset of only 100 training samples to use when debugging, to make loading and training faster.

## Question 1 (Basic Naïve Bayes, 10 points)

In this question, we'll fit a naïve Bayes model to the MNIST digits using maximum likelihood. Naïve Bayes defines the joint probability of each datapoint $x$ and its class label $c$ as follows:

$$p(x, c \mid \theta, \pi) = p(c \mid \pi)\, p(x \mid c, \theta_c) = p(c \mid \pi) \prod_{d=1}^{784} p(x_d \mid c, \theta_{cd}) \tag{1}$$

where the vector $\pi$ provides the prior probabilities for the classes and the matrix $\theta$ contains a set of parameters. For binary data, the likelihood $p(x_d \mid c, \theta_{cd})$ can have a Bernoulli distribution:

$$p(x_d \mid c, \theta_{cd}) = \mathrm{Ber}(x_d \mid \theta_{cd}) = \theta_{cd}^{x_d} (1 - \theta_{cd})^{1 - x_d} \tag{2}$$

which is just a way of expressing that $p(x_d = 1 \mid c, \theta_{cd}) = \theta_{cd}$. For $p(c \mid \pi)$, we can use a categorical distribution:

$$p(c \mid \pi) = \mathrm{Cat}(c \mid \pi) = \prod_{i=0}^{9} \pi_i^{\,[i = c]} \tag{3}$$

which is a way of expressing that $p(c = t \mid \pi) = \pi_t$. Note that we need $\sum_{i=0}^{9} \pi_i = 1$.
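The binarization and train/test split described above can be sketched as follows (a minimal sketch: `load_mnist` is a hypothetical name standing in for whatever loader the course webpage provides):

```python
import numpy as np

def binarize(images, cutoff=0.5):
    """Convert grey pixel values in [0, 1] to binary 0/1 with a > 0.5 cutoff."""
    return (images > cutoff).astype(np.int8)

# Hypothetical usage, assuming load_mnist() returns an (N, 784) float array
# of grey values in [0, 1] and an (N,) integer array of labels:
# x_all, labels = load_mnist()
# x_bin = binarize(x_all)
# x_train, t_train = x_bin[:10000], labels[:10000]       # first 10000 for training
# x_test, t_test = x_bin[10000:20000], labels[10000:20000]  # next 10000 for testing
# x_debug, t_debug = x_bin[:100], labels[:100]           # tiny set for debugging
```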
(a) Write down the maximum a posteriori estimate for the class-conditional pixel means $\theta$, using a $\mathrm{Beta}(2, 2)$ prior on each $\theta_{cd}$. Hint: it has a simple form, and you can ignore normalizing constants.

(b) Fit $\theta$ to the training set. Plot $\theta$ as 10 separate greyscale images, one for each class.

(c) Derive the predictive log-likelihood $\log p(c \mid x, \theta, \pi)$. Again, you may ignore normalizing constants.

(d) Given parameters fit to the training set, and $\pi_c = \frac{1}{10}$ for all $c$, report the average predictive log-likelihood per datapoint on both the training and test set. That is, select for each datum one of the 10 possible predictive log-likelihoods and then average over the dataset's 10k cases. Also report the predictive accuracy on both the training and test set. The predictive accuracy is defined as the fraction of examples where the true class $t = \operatorname{argmax}_c p(c \mid x, \theta, \pi)$.

Remember to show your code for all questions in this assignment which involve coding.

## Question 2 (Advanced Naïve Bayes, 10 points)

One of the advantages of generative models is that they can handle missing data, or be used to answer different sorts of questions about the model.

(a) True or false: any two pixels $x_i$ and $x_j$ with $i \neq j$ are independent given $c$.

(b) True or false: any two pixels $x_i$ and $x_j$ with $i \neq j$ are independent when marginalizing over $c$.

(c) Using the parameters fit in Question 1, produce random data samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution $p(x \mid \pi, \theta)$.

(d) Derive $p(x_{\text{bottom}} \mid x_{\text{top}}, \theta, \pi)$, the joint distribution over the bottom half of an image given the top half, conditioned on your fit parameters. Hint: the relative class probabilities $p(c \mid x_{\text{top}}, \theta, \pi)$ will act as an information bottleneck.

(e) Derive $p(x_{i \in \text{bottom}} \mid x_{\text{top}}, \theta, \pi)$, the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your fit parameters.
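The MAP estimate from Question 1(a) and the class posterior $p(c \mid x, \theta, \pi)$, which is also the bottleneck quantity in the derivations above, can be sketched as follows (function names are my own; `x` is an (N, 784) binary array and `t` the labels):

```python
import numpy as np
from scipy.special import logsumexp

def fit_theta_map(x, t, n_classes=10):
    """MAP estimate of theta under a Beta(2, 2) prior on each theta_cd:
    theta_cd = (count of x_d = 1 in class c + 1) / (number of class-c examples + 2)."""
    theta = np.zeros((n_classes, x.shape[1]))
    for c in range(n_classes):
        xc = x[t == c]
        theta[c] = (xc.sum(axis=0) + 1.0) / (xc.shape[0] + 2.0)
    return theta

def predict_log_probs(x, theta, log_pi):
    """log p(c | x, theta, pi): the unnormalized log joint is
    sum_d [x_d log theta_cd + (1 - x_d) log(1 - theta_cd)] + log pi_c,
    normalized over classes with logsumexp for stability."""
    log_joint = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T + log_pi
    return log_joint - logsumexp(log_joint, axis=1, keepdims=True)
```

For the image-completion parts, the same `predict_log_probs` applied to only the top-half pixels (and the corresponding columns of `theta`) gives $p(c \mid x_{\text{top}}, \theta, \pi)$.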
(f) For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half.

## Question 3 (Logistic Regression, 10 points)

Now we'll fit a simple predictive model using gradient descent. Our model will be multiclass logistic regression:

$$p(c \mid x, w) = \frac{\exp(w_c^T x)}{\sum_{c'=0}^{9} \exp(w_{c'}^T x)} \tag{4}$$

You can ignore biases for this question.

(a) How many parameters does this model have?

(b) Since class probabilities must sum to one, this imposes constraints on the predictions of our model. How many effective degrees of freedom does the model have? That is, what is the smallest number of parameters we could use to write an equally expressive model with a different parametrization?

(c) Derive the gradient of the predictive log-likelihood w.r.t. $w$: $\nabla_w \log p(c \mid x, w)$.

(d) Code up the derivative, and a gradient-based optimizer of your choosing. Optimize $w$ to maximize the log-likelihood of the training set, and plot the resulting parameters using one image per class. Since this objective is concave, you can initialize with $w = 0$. Using minibatches is optional. Hint: use scipy.special.logsumexp or its equivalent to make your code more numerically stable.

(e) Given parameters fit to the training set, report both the average predictive log-likelihood per datapoint and the predictive accuracy, on both the training and test set. How do they compare to naïve Bayes?

## Question 4 (Unsupervised Learning, 10 points)

Another advantage of generative models is that they can be trained in an unsupervised or semi-supervised manner. In this question, we'll fit the naïve Bayes model without using labels. Since we don't observe labels, we now have a latent variable model. The probability of an image under this model is given by the marginal likelihood, summing over $c$:

$$p(x \mid \theta, \pi) = \sum_{c=1}^{k} p(x, c \mid \theta, \pi) = \sum_{c=1}^{k} p(c \mid \pi) \prod_{d=1}^{784} p(x_d \mid c, \theta_{cd}) = \sum_{c=1}^{k} \mathrm{Cat}(c \mid \pi) \prod_{d=1}^{784} \mathrm{Ber}(x_d \mid \theta_{cd}) \tag{5}$$

It turns out that this gives us a mixture model!
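Looking back at Question 3, the gradient from part (c) works out to $\nabla_{w_j} \log p(c \mid x, w) = ([j = c] - p(j \mid x, w))\, x$, and a basic gradient-ascent fit for part (d) can be sketched as follows (a sketch under my own naming: `W` is (10, 784), `X` is (N, 784), `t` the labels):

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax(logits):
    """Numerically stable log of the softmax in equation (4)."""
    return logits - logsumexp(logits, axis=1, keepdims=True)

def avg_log_likelihood(W, X, t):
    """Average log p(c_n | x_n, W) over the dataset."""
    return log_softmax(X @ W.T)[np.arange(len(t)), t].mean()

def gradient(W, X, t):
    """Batch-averaged gradient of the log-likelihood w.r.t. W (ascent direction):
    for each class j, (1[j = c] - p(j | x, W)) x."""
    p = np.exp(log_softmax(X @ W.T))   # (N, 10) predicted class probabilities
    onehot = np.eye(W.shape[0])[t]     # (N, 10) indicators of the true class
    return (onehot - p).T @ X / len(t)

# Plain full-batch gradient ascent (one possible optimizer choice);
# W = 0 is a valid initialization since the objective is concave:
# W = np.zeros((10, 784))
# for _ in range(n_steps):
#     W += learning_rate * gradient(W, X_train, t_train)
```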
This model is sometimes called a "mixture of Bernoullis", although it would be clearer to say "mixture of products of Bernoullis". Again, this is just the same naïve Bayes model as before, but where we haven't observed the class labels $c$. In fact, we are free to choose $K$, the number of mixture components.

(a) Given $K$, how many parameters does this model have?

(b) (This question has been changed; if you answered the first version, that's fine. Either is acceptable.) How many ways can we permute the parameters of the model ($\theta$ and $\pi$) and get the same marginal likelihood $p(x \mid \theta)$? Hint: switching any two classes won't change the predictions made by the model about $x_n$.

(c) Derive the gradient of the log marginal likelihood with respect to $\theta$: $\nabla_\theta \log p(x \mid \theta, \pi)$.

(d) For fixed $\pi_c = \frac{1}{K}$ for all $c$, and $K = 30$, fit $\theta$ on the training set using gradient-based optimization. Note: you can't initialize with $\theta = 0$ – you need to break symmetry somehow. Plot the learned $\theta$. How do these cluster means compare to the supervised model?

(e) For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half. Hint: you can reuse the formula for $p(x_{i \in \text{bottom}} \mid x_{\text{top}}, \theta, \pi)$ from before. How do these compare with the image completions from the supervised model?
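The objective for part (d) can be sketched as follows. This is one possible setup, not the required one: it reparametrizes $\theta = \sigma(A)$ so the pixel means stay in $(0, 1)$, uses the responsibilities $r_{nk} = p(c = k \mid x_n, \theta, \pi)$ that appear in the derivation for part (c), and breaks symmetry with a small random initialization:

```python
import numpy as np
from scipy.special import logsumexp, expit

def log_marginal_likelihood(A, X, K):
    """Average log p(x | theta, pi) from equation (5), with theta = sigmoid(A)
    and uniform pi_c = 1/K; logsumexp over components for stability."""
    theta = expit(A)                                                  # (K, D)
    log_px_c = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T    # (N, K)
    return logsumexp(log_px_c - np.log(K), axis=1).mean()

def gradient_A(A, X, K):
    """Gradient of the average log marginal likelihood w.r.t. A.
    Per datum, d/dA_kd = r_nk * (x_nd - theta_kd), where r_nk are the
    component responsibilities; summed over n and divided by N."""
    theta = expit(A)
    log_joint = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T - np.log(K)
    r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))  # (N, K)
    return (r.T @ X - r.sum(axis=0)[:, None] * theta) / len(X)

# Small random init breaks the symmetry between components (theta = 0,
# i.e. identical rows of A, would leave every component's gradient equal):
# A = 0.01 * np.random.default_rng(0).standard_normal((30, 784))
# for _ in range(n_steps):
#     A += learning_rate * gradient_A(A, X_train, 30)
```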