
I am struggling to understand how a sparse autoencoder can be thought of as "approximating maximum likelihood training of a generative model that has latent variables", from section 14.2.1 of Goodfellow's Deep Learning book (pg. 502). I understand the mechanics of sparse autoencoders, just not the connection to the theory. I have asked specific questions below in an attempt to keep this post concrete, but I might not be asking the right ones, as I am generally confused.

"Suppose we have a model with visible variables $x$ and latent variables $h$, with an explicit joint distribution $p_{model}(x,h)=p_{model}(h)p_{model}(x|h)$. We refer to $p_{model}(h)$ as the model’s prior distribution over the latent variables, representing the model’s beliefs prior to seeing $x$. This is different from the way we have previously used the word “prior,” to refer to the distribution $p(θ)$ encoding ourbeliefs about the model’s parameters before we have seen the training data."
=> (Q1) In this setup, is $h=\arg\max_h p_{model}(h | x)$? If not, what is the encoder trying to do? And the autoencoder?

Then it says:

"The log-likelihood can be decomposed as
$\log{p_{model}(x)}=\log{\sum\limits_{h}p_{model}(h,x)}$.     (14.3)"
=> (Q2) What log-likelihood is this referring to?
=> (Q3) What is the relevance of $\log{p_{model}(x)}$ to the paragraph above?

"We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value for $h$."
=> (Q4) Is this saying that $p_{model}(x) \simeq p_{model}(h,x)$, because $p_{model}(h | x)\simeq 1$? (i.e., the probability is sharply peaked at $h$)

"From this point of view, with this chosen $h$, we are maximizing
$\log{p_{model}(x,h)}=\log{p_{model}(h)}+\log{p_{model}(x|h)}$.     (14.4)"
=> (Q5) Why is $h$ fixed, and how does this describe what the autoencoder is doing?

hainabaraka

1 Answer

This section describes how a sparse autoencoder can be viewed as approximate maximum likelihood estimation (MLE) of a generative latent variable model (LVM) in which $h$ are the latent variables, rather than as a regularized MAP approximation to Bayesian inference, as in the preceding section on regularized autoencoders. To see which likelihood is actually being maximized here, the paragraph just after your quote is critical.

The log-likelihood can be decomposed as $$\log p_{model}(x) = \log \sum_h p_{model}(h, x)$$ We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value for $h$... From this point of view, with this chosen $h$, we are maximizing $$\log p_{model}(h, x) = \log p_{model}(h) + \log p_{model}(x|h)$$
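To spell the point-estimate step out (my notation; the symbol $h^*$ is not used in the book's text):

$$\log p_{model}(x) = \log \sum_h p_{model}(h, x) \approx \log p_{model}(h^*, x), \qquad h^* = \arg\max_h p_{model}(h | x),$$

which is a good approximation precisely when the posterior $p_{model}(h | x)$ puts nearly all of its mass on the single value $h^*$ (your Q4), and that $h^*$ is exactly what the encoder is trained to produce from $x$ (your Q1 and Q5).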

Therefore the log-likelihood being maximized here is $\log p_{model}(h, x) = \log p_{model}(h) + \log p_{model}(x|h)$, which, for the sparse autoencoder, is approximately equivalent to maximizing the usual data log-likelihood $\log p_{model}(x)$, since essentially only one value of $h$ contributes to the summation shown above.
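To make the correspondence concrete, here is a minimal PyTorch sketch (my own illustrative code, not from the book). It assumes a factorial Laplace prior on $h$, whose negative log gives the usual $L^1$ sparsity penalty, and a fixed-variance Gaussian $p_{model}(x|h)$, whose negative log gives the squared reconstruction error; names such as `SparseAutoencoder` and `neg_log_joint` are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sparse autoencoder, written so the loss is readable as
# -log p_model(x, h) = -log p_model(x | h) - log p_model(h) (up to constants):
#   * p_model(x | h): fixed-variance Gaussian with mean decoder(h) -> squared error
#   * p_model(h):     factorial Laplace prior                      -> L1 sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=256):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = F.relu(self.encoder(x))   # the encoder's point estimate of h for this x
        x_hat = self.decoder(h)       # mean of the Gaussian p_model(x | h)
        return h, x_hat

def neg_log_joint(x, x_hat, h, lam=1e-3):
    recon = F.mse_loss(x_hat, x, reduction="sum")  # ~ -log p_model(x | h)  (Gaussian)
    sparsity = lam * h.abs().sum()                 # ~ -log p_model(h)      (Laplace prior)
    return recon + sparsity                        # ~ -log p_model(x, h)

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)               # stand-in batch; shape (batch, n_in)

h, x_hat = model(x)
loss = neg_log_joint(x, x_hat, h)     # minimizing this maximizes log p_model(x, h)
opt.zero_grad()
loss.backward()
opt.step()
```

Minimizing this over the parameters, with $h$ given by the encoder's point estimate, is (up to additive constants and scaling) the same as maximizing $\log p_{model}(h) + \log p_{model}(x|h)$, i.e., the familiar "reconstruction error plus sparsity penalty" training of a sparse autoencoder.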

Here on SE we focus on one question per post; if you have other related questions, you're encouraged to write a separate post for each.

cinch