Jaynes demonstrated the effectiveness of Bayesian inference in deriving the different statistical ensembles. His idea was that the distribution which maximizes the information entropy of the system, subject to certain physical constraints, leads to the different ensembles of statistical mechanics.
Method 1:
The entropy is defined as, $$ S = - \sum_{i=1}^N p_i \ln p_i $$ for a sample space $\{x_i\}$ with $i = 1, 2, \ldots, N$, with respective probabilities $p_i = p(x_i)$.
Subject to the normalization constraint $\sum_i p(x_i) = 1$, most textbooks extremize the function $$ \tilde S = -\sum_{i=1}^N p_i \ln p_i - \lambda \left( \sum_{i=1}^N p_i - 1\right) $$ by setting the partial derivative with respect to $p_j$ to zero, $$ \frac{\partial \tilde S}{\partial p_j} = - \sum_{i=1}^N \left(\ln p_i + 1\right)\delta_{ij} - \lambda = -\left(\ln p_j + 1 + \lambda\right) = 0 $$ so that $p_j = e^{-(1+\lambda)}$ is independent of $j$, and the normalization then fixes the maximum-entropy distribution $$ p^{me}_j = \frac{1}{N} $$ While this method gives the correct result, is it mathematically consistent?
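As a sanity check on method 1 (not part of the derivation), here is a minimal numerical sketch: it minimizes the negative entropy under the normalization constraint using `scipy.optimize.minimize`, and the names `neg_entropy`, `N`, and `p0` are my own.

```python
import numpy as np
from scipy.optimize import minimize

# Numerically maximize S = -sum_i p_i ln p_i subject to sum_i p_i = 1
# by minimizing the negative entropy.  N is an arbitrary sample-space size.
N = 5

def neg_entropy(p):
    return np.sum(p * np.log(p))

constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
bounds = [(1e-9, 1.0)] * N             # keep p_i > 0 so the log is defined
p0 = np.random.dirichlet(np.ones(N))   # random normalized starting point

result = minimize(neg_entropy, p0, bounds=bounds, constraints=[constraint])
print(result.x)   # ~ [0.2, 0.2, 0.2, 0.2, 0.2], i.e. p_j = 1/N
```

The optimizer lands on the uniform distribution regardless of the starting point, consistent with $p^{me}_j = 1/N$.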
Method 2:
Wouldn't it be more accurate to write the entropy as a functional of the distribution $\mathbf{p}$? The entropy functional can be defined as,
$$ S\left[\mathbf{p}(x) \right] = - \sum_{i=1}^N p_i \ln p_i $$ where $\mathbf{p}(x)$ denotes the set of probability values $\{p_i\}$. This functional maps a distribution to a real number. The maximum-entropy problem then becomes finding the distribution $\mathbf{p}$ for which the entropy is maximal. For this purpose, we need the notion of a functional derivative, which can be defined as,
$$ \sum_{i=1}^N \frac{\delta \tilde S\left[\mathbf{p}\right]}{\delta p_i}\, q_i := \frac{d \tilde S\left[\mathbf{p}+ h \mathbf{q} \right]}{dh}\Bigg|_{h=0} $$ where $\mathbf{q}$ is another arbitrary distribution (a perturbation direction). This method generalizes to a continuous sample space by replacing the summation with an integral over an appropriate measure, as discussed in this question.
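To see that in the discrete case this functional derivative coincides with the componentwise partial derivative $\delta S/\delta p_i = -(\ln p_i + 1)$, here is a quick numerical check of the definition above (a sketch only; the names `S`, `p`, `q`, and `h` are mine):

```python
import numpy as np

def S(p):
    """Discrete Shannon entropy S[p] = -sum_i p_i ln p_i."""
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))   # some normalized distribution p
q = rng.normal(size=4)          # arbitrary perturbation direction

# Gateaux derivative: d/dh S[p + h q] at h = 0, via a small finite difference
h = 1e-6
gateaux = (S(p + h * q) - S(p - h * q)) / (2 * h)

# Componentwise partial derivatives dS/dp_i = -(ln p_i + 1), contracted with q
grad = -(np.log(p) + 1.0)
print(gateaux, grad @ q)        # the two numbers agree to high precision
```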
Question: Does method 1 have some obvious fallacy because it treats the entropy as a function rather than a functional? It looks like method 1 optimizes with respect to each component $p_j$ of $\mathbf{p}$ independently; does this have any consequences for the interpretation?
It is also important to note that all of these derivations only look at first derivatives, which merely locate an extremum of the entropy. To establish a maximum, we should also examine the second derivative, where there might be sharp differences between methods 1 and 2.
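For instance, in the discrete case the Hessian of the entropy is $\partial^2 S/\partial p_i \partial p_j = -\delta_{ij}/p_j$, which is negative definite, and its restriction to the constraint surface at $p_j = 1/N$ can be checked numerically (a sketch; the names `H`, `P`, and `H_tangent` are my own):

```python
import numpy as np

N = 5
p = np.full(N, 1.0 / N)               # the extremum p_j = 1/N

# Hessian of S: d^2 S / dp_i dp_j = -delta_ij / p_j  (diagonal, negative definite)
H = np.diag(-1.0 / p)

# Admissible variations on the constraint surface sum_i p_i = 1 satisfy
# sum_i v_i = 0.  Project onto that tangent space and check the projected
# Hessian is negative definite there, i.e. the extremum is indeed a maximum.
ones = np.ones((N, 1))
P = np.eye(N) - ones @ ones.T / N      # projector onto {v : sum_i v_i = 0}
H_tangent = P @ H @ P

print(np.linalg.eigvalsh(H_tangent))   # one zero (the constrained direction),
                                       # the remaining N-1 eigenvalues negative
```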