
In The" wake-sleep" algorithm for unsupervised neural networks (GE Hinton, P Dayan, BJ Frey, RM Neal - Science, 1995), the authors gave this interpretation of Helmholtz free energy:

$$\begin{aligned} C(d) &=\sum_{\alpha} Q(\alpha \mid d) C(\alpha, d) \\ &-\left[-\sum_{\alpha} Q(\alpha \mid d) \log Q(\alpha \mid d)\right]. \end{aligned}$$

The setup is this: a sender needs to transmit $d$ to a receiver using as few bits as possible. The two share a stochastic "encoder" function $f$ that takes $d$ and encodes it stochastically into a representation $\alpha = f(d)$, with $Q(\alpha \mid d)$ being the probability that the encoder returns $\alpha$.

$C(\alpha, d)$ is the cost (in bits) of transmitting both $\alpha$ and $d$, and $C(d)$ is an upper bound on the cost of transmitting $d$ alone.
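If I am reading the formula right, it can be restated compactly (my paraphrase, not the paper's notation) as

$$C(d) = \mathbb{E}_{Q(\alpha \mid d)}\left[C(\alpha, d)\right] - H\left[Q(\cdot \mid d)\right],$$

i.e. the expected cost of transmitting the pair $(\alpha, d)$ minus the entropy of the encoder's output, which I understand as the bits the receiver can recover from the randomness in the choice of $\alpha$.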

The second term on the right is the entropy of the distribution that the recognition weights assign to the various alternative representations. If, for example, there are two alternative representations, each of which costs 4 bits, the combined cost is only 3 bits provided we use the two alternatives with equal probability.
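A quick numerical check of that two-alternative example (my own sketch, with costs and the entropy both measured in bits):

```python
import numpy as np

# Two alternative representations alpha, each with joint cost C(alpha, d) = 4 bits,
# used with equal probability Q(alpha | d) = 1/2.
C = np.array([4.0, 4.0])   # C(alpha, d) in bits
Q = np.array([0.5, 0.5])   # Q(alpha | d)

expected_cost = np.sum(Q * C)            # sum_alpha Q(alpha|d) C(alpha, d) = 4.0 bits
entropy = -np.sum(Q * np.log2(Q))        # entropy of Q in bits             = 1.0 bit
combined_cost = expected_cost - entropy  # C(d)                             = 3.0 bits

print(expected_cost, entropy, combined_cost)  # 4.0 1.0 3.0
```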

Then they claimed this is exactly analogous to the physical meaning of Helmholtz free energy:

It is precisely analogous to the way in which the energies of the alternative states of a physical system are combined to yield the Helmholtz free energy of the system. As in physics, C(d) is minimized when the probabilities of the alternatives are exponentially related to their costs by the Boltzmann distribution (at a temperature of 1).
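For reference, the physics statement I take them to be alluding to (my own restatement, not a quote from the paper): for a system with states $s$, energies $E(s)$, and an arbitrary distribution $p(s)$ over states at temperature $T$, the variational Helmholtz free energy is

$$F[p] = \sum_{s} p(s)\, E(s) - T \left[ -\sum_{s} p(s) \log p(s) \right] = \langle E \rangle - T S,$$

and $F[p]$ is minimized over $p$ exactly when $p(s) \propto e^{-E(s)/T}$, the Boltzmann distribution. With $E(s) \to C(\alpha, d)$, $p \to Q(\cdot \mid d)$, and $T = 1$, this has the same form as $C(d)$ above.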

How is it analogous? The expression above does look formally similar, but the Helmholtz free energy I know from classical thermodynamics is the maximum amount of mechanical work a system can perform at constant temperature, and that definition has no obvious information-theoretic meaning, which the paper seems to claim it does.
