
Jaynes demonstrated the effectiveness of Bayesian inference in deriving the different statistical ensembles. His idea was that the distribution which maximizes the information entropy of the system, subject to certain physical constraints, leads to the different ensembles of statistical mechanics.

Method 1:

The entropy is defined as, $$ S = - \sum_{i=1}^N p_i \ln p_i $$ for a sample space $\{x_i\}$ with $i = 1, 2, \dots, N$ and respective probabilities $p_i = p(x_i)$.

Subject to the normalization constraint $\sum_i p(x_i) = 1$, most textbooks evaluate the extremum of the function $$ \tilde S = -\sum_{i=1}^N p_i \ln p_i - \lambda \left( \sum_{i=1}^N p_i - 1\right) $$ by setting the partial derivative with respect to $p_j$ to zero, $$ \frac{\partial \tilde S}{\partial p_j} = - \sum_{i=1}^N \delta_{ij}\left(\ln p_i + 1 \right) - \lambda = -\ln p_j - 1 - \lambda = 0 $$ Since this gives $p_j = e^{-(1+\lambda)}$, independent of $j$, the normalization constraint fixes $\lambda$ and leads to the distribution of maximum entropy $p^{me}_j$ $$ p^{me}_j = \frac{1}{N} $$ While this method provides the correct result, is it mathematically consistent?
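As a sanity check, the constrained maximum can also be verified numerically. The following is a minimal sketch, assuming SciPy is available and using an arbitrary sample-space size $N = 5$; it minimizes $-S$ under the normalization constraint and should recover the uniform distribution:

```python
import numpy as np
from scipy.optimize import minimize

# size of the sample space; arbitrary choice for illustration
N = 5

def neg_entropy(p):
    # negative of S = -sum_i p_i ln p_i, so that minimizing maximizes S
    return np.sum(p * np.log(p))

# normalization constraint: sum_i p_i - 1 = 0
constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
# keep p_i strictly positive so that the logarithm is defined
bounds = [(1e-9, 1.0)] * N

# arbitrary normalized starting point
p0 = np.random.dirichlet(np.ones(N))

result = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)

print(result.x)             # expected: approximately [1/N, ..., 1/N]
print(np.full(N, 1.0 / N))  # the analytic maximum-entropy distribution
```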

Method 2:

Wouldn't it be more accurate to write down the entropy as a functional of the distribution $\{p_i\}$? The entropy functional can be defined as,

$$ S\left[\mathbf{p}(x) \right] = - \sum_{i=1}^N p_i \ln p_i $$ where $\mathbf{p}(x)$ denotes the set of probability values $\{p_i\}$. This functional maps a distribution to a real number. The maximum entropy problem then becomes that of finding the distribution $\mathbf{p}$ for which the entropy is maximal. For this purpose, we have to consider the idea of a functional derivative, which can be defined as,

$$ \sum_i \frac{\delta \tilde S\left[\mathbf{p}(x)\right]}{\delta p_i}\, q_i := \frac{d \tilde S\left[\mathbf{p}(x)+ h \mathbf{q}(x) \right]}{dh}\Bigg|_{h=0} $$ where $\mathbf{q}(x) = \{q_i\}$ is another arbitrary distribution (the direction of the variation). In this method, we can generalize to the case where the sample space is continuous by replacing the summation with an integral over an appropriate measure, as discussed in this question.
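For concreteness, carrying out this differentiation for the constrained functional $\tilde S$ (a short worked step under the definitions above) gives $$ \frac{d \tilde S\left[\mathbf{p} + h\mathbf{q}\right]}{dh}\Bigg|_{h=0} = -\sum_{i=1}^N q_i \left(\ln p_i + 1 + \lambda\right), $$ and demanding that this vanish for an arbitrary direction $\mathbf{q}$ forces $\ln p_i + 1 + \lambda = 0$ for every $i$, i.e. the same uniform solution $p^{me}_i = 1/N$ as in Method 1.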

Question: Does Method 1 have any obvious fallacy due to treating the entropy as a function rather than as a functional? It looks like Method 1 optimizes the derivatives with respect to each $p_j$ (the individual components of $\mathbf{p}$) independently; does this have any consequence for the interpretation?

It is also important to note that all of these derivations only use first derivatives, which merely locate an extremum of the entropy. To establish that this extremum is a maximum, we should also consider the second derivative, where there might be sharp differences between Methods 1 and 2.
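For what it is worth, in the discrete case the second-derivative check is straightforward: $$ \frac{\partial^2 \tilde S}{\partial p_j \partial p_k} = -\frac{\delta_{jk}}{p_j}, $$ so the Hessian is negative definite whenever all $p_j > 0$, and the extremum found above is indeed a maximum; the second variation in Method 2 gives the same quadratic form, $-\sum_i q_i^2/p_i$.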

user35952
  • 3,134

2 Answers


Methods 1 and 2 are not fundamentally different. Quite often, maximum entropy methods are introduced with reference to discrete probability spaces to leave aside some technical complications connected with continuous probability distributions. However, the treatment of the discrete case can easily be extended to the continuous case.

In both cases, the constraint of normalization can be satisfied with the technique of Lagrange multipliers. Again, there is no fundamental difference between dealing with a function of many variables and dealing with a functional.
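As an illustration of the continuous extension (a sketch along the same lines, not a full treatment): for a density $p(x)$ on a domain of finite measure $V$, one extremizes $$ \tilde S[p] = -\int p(x) \ln p(x)\, dx - \lambda \left( \int p(x)\, dx - 1 \right), $$ and the stationarity condition $-\ln p(x) - 1 - \lambda = 0$ for all $x$ again yields a constant density $p(x) = 1/V$; adding, say, a mean-energy constraint with a second multiplier yields the exponential (canonical) distribution in the same way.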

Summarizing,

  1. Method 1 doesn't have any fallacy. It is just a restriction to discrete probability spaces, and it is easily extended to the continuous case;
  2. The Lagrange multiplier method is not equivalent to independently optimizing each $p_i$: the equation for the Lagrange multiplier establishes a link between them. Therefore, there are no consequences for the interpretation.

First, note that the method of Lagrange multipliers is a mathematically rigorous way of finding extrema under constraints. Note also that the probabilities must be constrained by normalization regardless of whether we treat the entropy as a functional or not.

IMHO, treating entropy as a functional (Method 2) largely misses the point of Jaynes' derivation: in equilibrium statistical mechanics all the states are considered equally probable. This assumption is usually justified by the claim that, if we wait for a long enough time, the phase space trajectory will pass through all the points in phase space satisfying the given constraints (energy, number of particles, etc.). Under the ergodicity assumption, the time averaging is then recast as averaging over a large ensemble of systems, which are represented as points in phase space. The volume occupied by these points (i.e. the probability density) then remains unchanged according to the Liouville theorem.

Jaynes tries to replace this traditional reasoning with a statistical argument based on our ignorance about the actual state of the system, which is characterized by the information entropy. In statistical terms it is somewhat close to the frequentist vs. Bayesian debate, but the mathematical results are the same: we need to establish the equality of the probabilities, after which the usual machinery of statistical physics applies.

In particular, thermodynamic transformations do not change the relative probabilities of different states; rather, they change the number of states, which is tied to the probability of each state via the normalization. The thermodynamic/statistical-physics entropy then characterizes the number of states, $\Omega$, or the probability of a single state, $1/\Omega$, and how the number of states changes... but it is trivial in the sense of information theory (I remind the reader that I speak here of the equilibrium entropy).
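For concreteness, if all $\Omega$ accessible states are equally probable, $p_i = 1/\Omega$, the information entropy reduces to $$ S = -\sum_{i=1}^{\Omega} \frac{1}{\Omega} \ln \frac{1}{\Omega} = \ln \Omega, $$ which is Boltzmann's expression up to the factor $k_B$.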

Remark
Note that Jaynes distinguished at least six different kinds of entropy.

Roger V.
  • 68,984