I am struggling to understand how a sparse autoencoder can be thought of as "approximating maximum likelihood training of a generative model that has latent variables", from section 14.2.1 of Goodfellow's Deep Learning book (p. 502). I understand the mechanics of sparse autoencoders, just not the connection to the theory. I have tried to ask specific questions below to keep this post concrete, but I might not be asking the right ones, as I am generally confused.
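For reference, here is the mechanism I have in mind, written as a minimal NumPy sketch (the sizes, the ReLU/linear choices, and the L1 penalty are my own illustrative picks, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse autoencoder: deterministic encoder f, decoder g,
# trained to minimize  L(x, g(f(x))) + Omega(h)  (reconstruction + sparsity).
n_visible, n_hidden = 8, 16
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_visible))
W_dec = rng.normal(scale=0.1, size=(n_visible, n_hidden))
lam = 0.1  # weight on the sparsity penalty

def encode(x):
    """h = f(x): the code the encoder assigns to input x."""
    return np.maximum(0.0, W_enc @ x)   # ReLU zeroes out some code units

def decode(h):
    """x_hat = g(h): reconstruction of x from the code h."""
    return W_dec @ h

def objective(x):
    h = encode(x)
    x_hat = decode(h)
    reconstruction = np.sum((x - x_hat) ** 2)   # L(x, g(f(x)))
    sparsity = lam * np.sum(np.abs(h))          # Omega(h), here an L1 penalty
    return reconstruction + sparsity

x = rng.normal(size=n_visible)
print(objective(x))   # the per-example quantity gradient descent would minimize
```

What I cannot see is how minimizing this objective relates to the generative-model story below.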
"Suppose we have a model with visible variables $x$ and latent variables $h$, with an explicit joint distribution $p_{model}(x,h)=p_{model}(h)p_{model}(x|h)$. We refer to $p_{model}(h)$ as the model’s prior distribution over the latent variables, representing the model’s beliefs prior to seeing $x$. This is different from the way we have previously used the word “prior,” to refer to the distribution $p(θ)$ encoding ourbeliefs about the model’s parameters before we have seen the training data."
=> (Q1) In this setup, is $h=\arg\max_h p_{model}(h | x)$? If not, what is the encoder trying to do? And the autoencoder as a whole?
Then it says:
"The log-likelihood can be decomposed as
$\log{p_{model}(x)}=\log{\sum\limits_{h}p_{model}(h,x)}$. (14.3)"
=> (Q2) What log-likelihood is this referring to?
=> (Q3) What is the relevance of $\log{p_{model}(x)}$ to the paragraph above?
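To be explicit about how I currently read (14.3), for a discrete $h$ (my own expansion, so please correct it if it is wrong):

$$p_{model}(x)=\sum_{h}p_{model}(h,x)=\sum_{h}p_{model}(h)\,p_{model}(x|h), \qquad \log{p_{model}(x)}=\log{\sum_{h}p_{model}(h)\,p_{model}(x|h)}.$$

So I take the log-likelihood in question to be that of an observed training example $x$ under the generative model, with the latent $h$ marginalized out.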
"We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value for $h$."
=> (Q4) Is this saying that $p_{model}(x) \simeq p_{model}(h,x)$, because $p_{model}(h | x)\simeq 1$? (i.e., the probability is sharply peaked at that $h$)
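Writing out my reading of that sentence (my own notation):

$$\log{p_{model}(x)}=\log{\sum_{h}p_{model}(h,x)}\approx\log{p_{model}(h^{*},x)}, \qquad h^{*}=\arg\max_{h}p_{model}(h|x),$$

i.e., when the posterior over $h$ is sharply peaked, the sum is dominated by its largest term.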
"From this point of view, with this chosen $h$, we are maximizing
$\log{p_{model}(x,h)}=\log{p_{model}(h)}+\log{p_{model}(x|h)}$. (14.4)"
=> (Q5) Why is $h$ fixed here, and how does maximizing this quantity describe what the autoencoder is doing?
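If it helps to pin down where I am going wrong, here is the correspondence I suspect is intended (my own guess): the decoder defines $p_{model}(x|h)$, so $-\log{p_{model}(x|h)}$ plays the role of the reconstruction error, and the prior supplies the sparsity penalty. For instance, with a factorial Laplace prior $p_{model}(h_i)=\frac{\lambda}{2}e^{-\lambda|h_i|}$,

$$-\log{p_{model}(h)}=\sum_{i}\left(\lambda|h_i|-\log{\frac{\lambda}{2}}\right)=\lambda\lVert h\rVert_1+\text{const},$$

so maximizing (14.4) looks like minimizing reconstruction error plus an L1 sparsity penalty on $h$. Is that the intended link?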