
I'm watching Leonard Susskind's statistical mechanics lectures (lecture 3 on YouTube). We have an exact formula for the number of microstates $\Omega$: $$ \Omega = \binom{N}{n_1}\binom{N - n_1}{n_2}\binom{N - n_1 - n_2}{n_3}\cdots = \frac{N!}{\prod\limits_{i} n_i!} $$

We are interested in the log.

$$ \ln(\Omega) = \ln(N!) - \sum_i\ln(n_i!) $$

We use Stirling's approximation $ \ln(N!) \approx N\ln(N) - N + 1$ further shortened to $\ln(N!) \approx N\ln(N) - N $.

And, after some algebra (spelled out below), we end up with exactly $$ \ln(\Omega) \approx -N \sum_i p_i \ln(p_i) = NS $$

where $p_i = \dfrac{n_i}{N} $ is the probability of state $i$. This is the formula for entropy $S$, scaled by the size of the system $N$.
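For completeness, here is the elided algebra, using $\sum_i n_i = N$ and substituting $n_i = N p_i$:

$$
\begin{aligned}
\ln(\Omega) &\approx \big(N\ln N - N\big) - \sum_i \big(n_i \ln n_i - n_i\big) \\
&= N\ln N - \sum_i n_i \ln n_i \\
&= N\ln N - \sum_i N p_i \big(\ln N + \ln p_i\big) \\
&= -N \sum_i p_i \ln(p_i).
\end{aligned}
$$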

So we start with an exact formula for $\Omega$, make an approximation, however good for large $N$, and end up with an exact formula for entropy. How is this possible?

All I can think of is that either

  1. As $ N \rightarrow \infty $ the Stirling approximation becomes exact, so we end up with an exact formula only in the limiting case of large $N$. This is all well and good for the idea of a probability $p_i$, but my understanding of every presentation I've seen of the formula $ S = -\sum_i p_i \ln(p_i) $ is that it is exact even for moderate $N$. That would imply the formula is incorrect for small $N$. Furthermore, I believe (correct me if I'm wrong) that although the relative error of the Stirling approximation decreases with $N$, the absolute error still grows (see the numerical check after this list).
  2. One of these formulae is wrong. I don't believe either is, since both are derived from first principles.
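As a quick numerical check of the error claim in point 1 (a sketch in Python; `math.lgamma(N + 1)` gives $\ln(N!)$ essentially exactly):

```python
import math

# Compare ln(N!) with the crude Stirling form N ln N - N.
for N in (10, 100, 1000, 10**4, 10**5):
    exact = math.lgamma(N + 1)          # ln(N!) to floating-point precision
    stirling = N * math.log(N) - N
    abs_err = exact - stirling          # grows roughly like (1/2) ln(2 pi N)
    rel_err = abs_err / exact           # shrinks toward zero
    print(f"N={N:>7}  abs err = {abs_err:7.3f}  rel err = {rel_err:.2e}")
```

The absolute error indeed keeps growing (it approaches $\tfrac{1}{2}\ln(2\pi N)$), while the relative error shrinks, consistent with point 1.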

6 Answers


This is really about the difference between the Boltzmann entropy and the discrete Gibbs-Shannon entropy (I will write just Gibbs entropy from now on), which is not only conceptual, but also numerical. We can show why the Gibbs entropy is given by what looks like an approximate formula for the Boltzmann entropy.

Notice there are four concepts of entropy at play, all meaning different things.

  1. The general Boltzmann entropy formula for a situation where there is a discrete number $\Omega$ of equally weighted microstates (definition by Planck): $$ S_B = k_B \ln \Omega.\tag{1} $$

  2. The value of $S_B$ for a system of $N$ distinguishable entities in $K$ compartments, where we know the occupation number $n_i$ of every compartment; this array is what we call the "distribution" $\{n_i\}$, and all microstates that realize such a distribution are assumed equally weighted. Under these assumptions, it can be shown that

$$ S_{B}^{(N~in~K)} = k_B \ln \frac{N!}{n_1!\,n_2!\cdots n_K!}.\tag{2} $$ This is a functional of the distribution $\{n_i\}$.

  3. An approximate value of (2) (the higher the $N$, the better the approximation): $$ S_{B,approx}^{(N~in~K)} = -k_B\sum_{i=1}^K n_i \ln \frac{n_i}{N} ,\tag{3.1} $$ The same approximate value, but using the occupation fractions $w_i = \frac{n_i}{N}$:

$$ S_{B,approx}^{(N~in~K)} = -Nk_B\sum_{i=1}^K w_i \ln w_i.\tag{3.2} $$

At this point, entropy (2) is just a mathematical concept relating to the distribution $\{n_i\}$ of $N$ entities in $K$ compartments; no physical interpretation of the entities has been assumed, and no connection to any physical system or to its thermodynamic entropy has been shown.

We want to make a connection to the Gibbs entropy formula (below). However, notice the factor $N$, which does not appear in the usual formula for the Gibbs entropy. It appears here because $i$ does not run over all microstates of the system of $N$ entities, but only over microstates of a single entity. Thus although we are really calculating the Boltzmann entropy of an "ensemble" of $N$ entities, we are using "inappropriate" states $i$ of a single entity, instead of the states of the ensemble (thus $i$ runs from $1$ to $K$, instead of $1$ to $K^N$). Hence the factor $N$ in the formula above.

  4. The Gibbs entropy of a probability distribution $p_i$; notice the lack of the factor $N$: $$ S_G = -k_B\sum_{i=1}^K p_i \ln p_i.\tag{4} $$ This is better called information entropy or the Shannon entropy, but in physics the term Gibbs entropy has stuck and is heavily overloaded. Sometimes "Gibbs entropy" is used to refer to the value of this information entropy for the "correct" probability distribution $p_i^*$.

Again, this is just a mathematical concept, until we say which system is being described and what the $p_i$'s are probabilities of.

The Gibbs entropy, when taken as a functional of the probabilities of microstates of a single physical system, is an abstract quantity characterizing that probability distribution; the purpose of this quantity in physics is that the probability distribution $p_i^*$ which maximizes it (all while obeying the constraints implied by the macrostate $\mathbf X$) is the correct distribution for that macrostate. The constraint can be e.g. equal energy of all microstates and then the result is equal probabilities (in the microcanonical approach); or the constraint is fixed average energy and the result is probabilities decaying exponentially with energy (in the canonical approach); there are other approaches.
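A sketch of that constrained maximization with Lagrange multipliers (dropping $k_B$; $\alpha$ enforces normalization and $\beta$ a fixed mean energy $\sum_i p_i E_i = U$):

$$
\frac{\partial}{\partial p_i}\left[-\sum_j p_j \ln p_j - \alpha\Big(\sum_j p_j - 1\Big) - \beta\Big(\sum_j p_j E_j - U\Big)\right]
= -\ln p_i - 1 - \alpha - \beta E_i = 0 ,
$$

so $p_i \propto e^{-\beta E_i}$ (the canonical case); with the normalization constraint alone ($\beta = 0$) this gives equal probabilities, the microcanonical case.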

Also, it turns out the value of the Gibbs entropy, for the correct probabilities of microstates implied by the macrostate, is the statistical physics estimate for thermodynamic entropy of the single system in that macrostate (up to an additive constant).

Thus the approximate result (3.2) of the OP calculation based on Boltzmann's entropy formula does not really give the correct Gibbs entropy of a single system described by probabilities $p_i$, but it gives the correct Gibbs entropy of a super-system that consists of $N$ independent such systems. However, this calculation can be interpreted in a more abstract way that makes it useful for motivating the Gibbs entropy: those $N$ entities in (2) are really $N$ imaginary copies of a physical system in the same macrostate, but in possibly different microstates, which we are considering to find the "correct" probability distribution for the single system. The point of the calculation is to use an ensemble of very many copies ($N$) of the system, apply the statistical argument to it, and derive the rule that the correct occupation fractions maximize (3.2). Then, this result motivates the definition of the Gibbs entropy functional (4) and the maximum information entropy principle.

Here is a short attempt at such an argument.

Consider many copies ($N$) of the macroscopic system, all in the same macrostate $\mathbf X$, but possibly in different microstates $i$, all of which are considered compatible with $\mathbf X$ and equally weighted. If $N$ were very small, our ensemble would not sample all the possible microstates well and could not tell us which are most likely. Thus we should consider $N$ to be high enough that each microstate is occupied by very many individual systems.

We are interested in the "correct" occupation fractions $w_i^*$ for all microstates of the single system. In other words, we seek the correct distribution of the $N$ systems among the microstates. The statistical argument is that in the limit $N\to \infty$, "correct" can be given the meaning "the most probable", or "realizable by the greatest number of microstates of the whole ensemble". Thus we are after the distribution $w_i^*$ implying the highest value of the Boltzmann entropy of the ensemble (2).

Maximizing the formula (2), and even (3.2), exactly for finite $N$ is hard, since the $n_i$'s are restricted to integers and the $w_i$'s to rationals; the allowed values and the maximizing distribution depend somewhat, in a way that is not easy to express, on $N$. However, the higher the $N$, the smaller this dependence, and we expect that in the limit $N\to \infty$, both the exact and the approximate formula have the same maximizing occupation fractions $w_i^*$.

The limiting maximizing distribution thus does not depend on the value of $N$, and so to find it, we can instead maximize directly the modified expression which we get from (3.2) by dividing by $N$ and by replacing $w_i$'s by $p_i$'s. Thus we can instead maximize the Gibbs formula (4) to find the limiting maximizing probabilities $p_i$.

The expression (3.2) gives a value that is $N$ times higher than the Gibbs entropy of the system. This is logical, as in (2) we really have the Boltzmann entropy of $N$ copies of the actual system.

So, contrary to what you may have thought, the calculation did not derive the exact formula for the Gibbs entropy (4); it only produced (3.2), the approximate value of the Boltzmann entropy of $N$ copies. However, (3.2) and its maximization are a motivation to define the Gibbs entropy functional exactly by (4).

The Boltzmann entropy (1) and the Gibbs entropy (4) are different mathematical concepts: one is related to the finite multiplicity of a distribution of a finite number of things, the other is a functional of real-valued probabilities. The Gibbs entropy can be regarded as something like a "continuous extrapolation" of the concept of Boltzmann entropy (2) per single entity: we divide by the number of entities $N$ to keep the thing finite and free of the auxiliary parameter $N$, and take the limit $N\to\infty$. Also, we replace the ratios $w_i$ by real-valued probabilities $p_i$, which can then assume any value in the continuous interval $[0,1]$.

> we start with an exact formula for $\Omega$, make an approximation, however good for large $N$, and end up with an exact formula for entropy. How is this possible?

We rather ended up with an approximate formula for the Boltzmann entropy of $N$ copies of the system, which turns out to be very similar to the Gibbs entropy of those $N$ copies.

The Boltzmann entropy per one entity ((2) divided by $N$) is given approximately by the Gibbs entropy of a single entity (4). These two numbers agree very well for very high $N$, but of course, we should not expect that when $N$ is low. They are exactly the same only in the limit $N\to \infty$ and $w_i\to p_i$.
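A quick numerical illustration of this last point, in units of $k_B$ (a sketch in Python; `math.lgamma` supplies the exact $\ln n!$, and the occupation fractions are held fixed at an arbitrary example $w = (0.5, 0.3, 0.2)$):

```python
import math

def boltzmann_per_entity(ns):
    """Formula (2) divided by N:  (1/N) * ln( N! / prod(n_i!) ),  with k_B = 1."""
    N = sum(ns)
    return (math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in ns)) / N

def gibbs(ws):
    """Formula (4) with p_i = w_i:  -sum_i w_i ln w_i,  with k_B = 1."""
    return -sum(w * math.log(w) for w in ws if w > 0)

ws = (0.5, 0.3, 0.2)                       # fixed occupation fractions
for N in (10, 100, 1000, 100000):
    ns = [round(w * N) for w in ws]        # occupation numbers realizing these fractions
    print(f"N={N:>6}   (2)/N = {boltzmann_per_entity(ns):.5f}   (4) = {gibbs(ws):.5f}")
```

For $N = 10$ the two values differ noticeably; by $N = 10^5$ the difference is down to roughly $10^{-4}$, in line with the $N \to \infty$ statement above.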


> How does the Stirling approximation give an exact formula for entropy?

It doesn't.

As discussed in many many posts on this website, the word "entropy" has different meanings in different contexts. (Some unfortunately only slightly different.)

hft

Summary

Since this turned into a huge answer, here's the bottom line up front: your starting point, $\Omega=\frac{N!}{n_1! n_2! \cdots n_k!}$, was already an approximation to the true number of states. To get the exact $\Omega$, you should sum over the possible configurations of occupation numbers, $\Omega = \sum_{n_1, n_2, \cdots, n_k}\frac{N!}{n_1! n_2!\cdots n_k!}$. In the limit of large $N$, that sum becomes dominated by the configuration with all $n_i/N=1/k$, in which case it is a good approximation to take just the log of the one term that you did, apply Stirling's formula, and use $n_i/N = p_i$.
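A tiny brute-force check of that "sum over configurations" claim for small numbers (a sketch; by the multinomial theorem the sum equals $k^N$):

```python
import math
from itertools import product

def exact_omega(N, k):
    """Sum of multinomial coefficients over all occupation numbers summing to N."""
    total = 0
    for ns in product(range(N + 1), repeat=k):
        if sum(ns) == N:
            coeff = math.factorial(N)
            for n in ns:
                coeff //= math.factorial(n)   # exact integer division
            total += coeff
    return total

for N, k in [(4, 2), (5, 3), (6, 4)]:
    print(N, k, exact_omega(N, k), k**N)      # the last two columns agree
```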

Setup

This answer is going to start off saying a bunch of things that you already know; that's not because I don't think you know them, but so that I can phrase them in my own language. The issue is a little subtle, so I want to lay out my perspective carefully. I am going to use units where $k_B=1$.

I haven't watched the lecture but here is what I am assuming you are describing. You have $N$ distinguishable particles, and there are $k$ states, each of which has the same energy (so you can say we are working in the microcanonical ensemble, where the energy of the system is fixed). Then we need to count how many ways we can distribute the $N$ particles among the $k$ states. Given the occupation numbers $\{n_1, n_2, \cdots, n_k\}$ (meaning $n_i$ particles in state $i$, with $1\leq i \leq k$), we can count the number of ways $\Omega$ to arrange the particles consistent with those occupation numbers. That leads to the result you wrote down: $$ \Omega = \frac{N!}{\prod_{i=1}^k n_i!} $$

As others have said, there are multiple definitions of entropy, which can make things complicated. I am pretty opinionated (possibly overly so) and my point of view is that the best way to understand entropy is through the Shannon (or Gibbs) entropy. This says, given a discrete set of $k$ states labeled by $i$, and a probability distribution over those states $p_i$, then the entropy is $$ S = - \sum_{i=1}^k p_i \log p_i $$ (As an aside, defining entropy for a continuous set of states is complicated). The normal "Boltzmann entropy" $S = \log \Omega$ is a special case that applies in the situation where each state has uniform probability. Then $p_i=1/k$, and $$ S = - \sum_i \frac{1}{k} \log \frac{1}{k} = \log k $$ where in this case $k$ is the number of microstates.

Exact calculation of entropy

Now we start to make our way back to your problem, but I'm going to approach it a different way, first. We will build to an exact calculation of the entropy in steps.

First, note that a set of occupation numbers $\{n_1, n_2, \cdots, n_k\}$, with $n_1+n_2+\cdots+n_k=N$, can be thought of as a random draw from a probability distribution over $k$ states. In other words, we can assign probabilities $\{p_1, p_2, \cdots, p_k\}$ (with $\sum_{i=1}^k p_i = 1$) to the $k$ states. Then if we randomly assign $N$ particles to the $k$ states, we will get a histogram $\{n_1, n_2, \cdots, n_k\}$. It is not the case in general that $n_i/N=p_i$. For example, if $k=2$, then you can imagine you assigned each particle to one of the two states by flipping a coin. If $N=100$, you don't expect to get exactly 50 heads and 50 tails; you might get 54 heads and 46 tails, for instance. But as $N$ gets larger, the ratio $n_i/N$ will approach $p_i$ by the law of large numbers. Where we are headed, intuitively, is that when you used Stirling's approximation you implicitly used the law of large numbers in this sense.
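A small simulation of that coin-flip picture (a sketch; the counts vary from run to run, which is the point):

```python
import random

p = 0.5  # probability that a given particle lands in state 1 ("heads")
for N in (100, 10_000, 1_000_000):
    n1 = sum(random.random() < p for _ in range(N))  # occupation number of state 1
    print(f"N = {N:>9}   n1/N = {n1 / N:.4f}   (p = {p})")
```

For small $N$ the fraction $n_1/N$ wanders visibly away from $p$; as $N$ grows it locks onto $p$, which is exactly the law-of-large-numbers step hidden inside the Stirling manipulation.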

Second, in thermodynamics we are interested in the probability distribution that maximizes the entropy. Here I'm going to introduce the special case $k=2$, so that we can do a few analytical calculations easily.

Now let's start with $N=1$ particle. Suppose the probability of state $1$ is $p$, so the probability of state $2$ is $1-p$. Then the entropy is $$ S(p) = -p \log p - (1-p) \log(1-p) $$ You can find the maximum of $S(p)$ by solving $dS/dp=0$; it is an easy calculation and is solved for $p=1/2$, meaning the two microstates are equally likely. (With $k$ states we would find $p=1/k$). Plugging $p=1/2$ into $S(p)$, we find the entropy is $S=\log 2$, or $S=\log k$ for $k$ states, consistent with the Boltzmann entropy. So we see the Boltzmann entropy applies after we have maximized the entropy over a set of probability distributions.
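Spelling out that short maximization:

$$
\frac{dS}{dp} = -\log p - 1 + \log(1-p) + 1 = \log\frac{1-p}{p} = 0
\quad\Longrightarrow\quad p = \frac{1}{2},
$$

and plugging back in gives $S(1/2) = \log 2$.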

Third, you might have noticed there's a factor of $N$ between $S=\log k$ and the $S = N \log k$ that you wrote in your question. That's because so far we have only looked at $1$ particle. Generalizing to $N$ non-interacting distinguishable particles (for $k=2$ states) is straightforward. Particle $1$ has a probability $p_1$ to be in state $1$ and $1-p_1$ to be in state $2$; particle $2$ has probability $p_2$ to be in state $1$ and $1-p_2$ to be in state $2$, etc. Since the particles are independent, the $N$ particle probability distribution is just the product of the $N$ $1$ particle distributions. So the probability that particle $1$ is in state $i_1$ (which could be $1$ or $2$), particle $2$ is in state $i_2$, etc., is $$ p(i_1, i_2, \cdots, i_N) = \left(p_{1} \delta_{i_1, 1} + (1-p_1) \delta_{i_1, 2}\right) \left(p_{2} \delta_{i_2, 1} + (1-p_2) \delta_{i_2, 2}\right) \cdots \left(p_{N}\delta_{i_N, 1}+ (1-p_N) \delta_{i_N, 2}\right) $$ where $\delta_{a,b}=1$ if $a=b$ and $0$ otherwise. So the overall entropy as a function of $p_1, p_2, \cdots, p_N$ is $$ S(p_1, p_2, \cdots, p_N) = \sum_{i=1}^N \left(-p_i\log p_i - (1-p_i) \log(1-p_i)\right) $$ You can again maximize this as a function of $p_1, \cdots, p_N$, and find that the maximum entropy distribution has $p_1=p_2=\cdots=p_N=1/2$. In other words, each particle has a $1/2$ probability to be in either state $1$ or state $2$. However, the configuration of the whole system involves specifying the state of *each* particle, which means there are $2^N$ total states. Since each of these states is equally likely (you can either say that's an assumption of thermodynamics, or you can say that we argued this is the maximum entropy distribution), the probability of each state is $2^{-N}$. That means the overall entropy of the maximum entropy distribution is $$ S = N \log 2 $$ You can calculate this a number of ways. First, you can take the log of the number of microstates $2^N$. Second, you can evaluate $-\sum_i p_i \log p_i$ with $2^N$ states each with probability $2^{-N}$ (i.e., evaluating $S(1/2, 1/2, \cdots, 1/2)$). Third, you can use the fact that entropy is additive, meaning that if $p(x,y)=p(x)p(y)$, then $S[p(x,y)]=S[p(x)]+S[p(y)]$, so the entropy for $N$ non-interacting particles is just $N$ times the entropy of $1$ particle, or $N \log 2$.

The main reason I went through this in such explicit detail is to point out that there are states in the above distribution that don't have $n_i/N=p_i=1/2$ (where $n_i$ is the occupation number; remember, since $k=2$ we have $n_1$ and $n_2$ with $n_1+n_2=N$). For example, there is a non-zero probability that all $N$ particles are in state $1$ and zero are in state $2$. (In the notation above, this corresponds to $i_1=i_2=\cdots=i_N=1$.) The probability is very small for large $N$, specifically $2^{-N}$, but nevertheless those states exist in the ensemble. In the next section, we'll see that the core of your question is that the approximation you made was to ignore those states, and that at large $N$ those states can be ignored.

Approximate calculation of entropy at large $N$

So, now, finally, let's come back to the calculation you did for $\Omega$. Let's again fix $k=2$ so we can simplify the algebra, and suppose that we are looking at the maximum entropy distribution, with $p=1/2$. You argued that $$ \Omega = \frac{N!}{n_1! (N-n_1)!} = \binom{N}{n_1} $$ is the number of states. In fact, this is only the number of states for one particular choice of occupation numbers. In general, the real value of $\Omega$ involves summing over all the ways we can arrange the particles, $$ \Omega = \sum_{n_1=0}^N \binom{N}{n_1} = 2^N $$ where the second equality follows from the sum of binomial coefficients. Physically it's easy to understand this result: given that we have $N$ distinguishable particles and $2$ states, we can decide the configuration of each independently, so we get $2^N$ possibilities. Since we assume each configuration is equally likely, the entropy reduces to the Boltzmann entropy, and $$ S = \log \Omega = N \log 2 $$ as we expect. (And as we calculated multiple ways in the previous section.)

What happens at large $N$ is that the occupation fractions $n_i/N$ become sharply peaked around the probabilities $p=1/k$. So the sum over $n_1$ in $\Omega$ can be approximated by just the term with $n_1/N=n_2/N=1/2$: $$ S = \log \Omega = \log \sum_{n_1} \frac{N!}{n_1!(N-n_1)!} \approx \log \frac{N!}{((N/2)!)^2} = \log N! - 2 \log\left(\frac{N}{2}\right)! $$ Then we can use Stirling's approximation $\log N! \approx N \log N - N$ to evaluate the factorials at large $N$: \begin{eqnarray} S &\approx& N \log N - N - N \log \frac{N}{2} + N \\ &=& N \log 2 \end{eqnarray} Since all these approximations become better and better at large $N$, the exact calculation of the entropy of the maximum entropy distribution above agrees, in the large $N$ limit, with the approximate calculation using Stirling's approximation plus the limiting value $n_i/N=p$.
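As a numerical check of the "keep only the central term" step (a sketch; `math.comb` gives the exact binomial coefficient, and `math.log` accepts the resulting big integers):

```python
import math

# Compare log(sum of all binomial coefficients) = N log 2
# with the log of just the largest (central) term, C(N, N/2).
for N in (10, 100, 1000, 10000):
    full = N * math.log(2)
    central = math.log(math.comb(N, N // 2))
    print(f"N = {N:>6}   N log 2 = {full:10.2f}   log C(N, N/2) = {central:10.2f}"
          f"   ratio = {central / full:.4f}")
```

The absolute difference grows like $\frac{1}{2}\log(\pi N / 2)$, but relative to $N\log 2$ it vanishes, which is all that matters for the entropy per particle.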

Appendix: A different (but related) problem

As an aside not directly related to your question, one technical point I found surprising while writing this up is that there is a subtle distinction between the entropy of the binomial distribution and the entropy of the microcanonical ensemble for $k=2$ that we looked at above. The ultimate reason is that the binomial distribution is appropriate for identical bosons at zero temperature with a degenerate ground state, whereas above we were looking at distinguishable particles with a fixed energy for the total system.

The binomial distribution tells us the probability of getting $n_1$ successes, given $N$ trials and a success probability $p$. The distribution is $$ p(n_1) = \binom{N}{n_1} p^{n_1}(1-p)^{N-n_1} $$ and using Stirling's approximation, one can show that for large $N$ the entropy is approximately $$ S \approx \frac{1}{2} \log\left(2\pi e N p(1-p)\right) $$ The maximum entropy distribution has $p=1/2$, in which case $$ p(n_1) = \binom{N}{n_1} \left(\frac{1}{2}\right)^{n_1 + N - n_1} = \binom{N}{n_1} 2^{-N} $$ and then, using Stirling's approximation, the entropy is approximately $$ S \approx \frac{1}{2} \log\left(\frac{\pi e N}{2}\right) $$ This grows like $\log N$, not like $N$ as we saw above.
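A brute-force check of that asymptotic, assuming $p = 1/2$ (a sketch; the probabilities are computed exactly with `math.comb`):

```python
import math

def binomial_entropy(N, p=0.5):
    """Exact Shannon entropy -sum_n P(n) log P(n) of Binomial(N, p)."""
    S = 0.0
    for n in range(N + 1):
        P = math.comb(N, n) * p**n * (1 - p)**(N - n)
        if P > 0:
            S -= P * math.log(P)
    return S

for N in (10, 100, 1000):
    exact = binomial_entropy(N)
    approx = 0.5 * math.log(math.pi * math.e * N / 2)   # large-N formula above
    print(f"N = {N:>5}   exact = {exact:.4f}   approx = {approx:.4f}")
```

Even at these modest $N$ the agreement is close, and the growth is clearly logarithmic rather than linear in $N$.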

The issue is that the binomial distribution only counts the number of successes, and doesn't distinguish between trials. In other words, the binomial distribution tells you the probability of getting $1$ head out of $100$ trials, not the probability of the first trial being heads.

In the microcanonical ensemble used above, each of the $\Omega=2^N$ states has probability $2^{-N}$. We have $2^N$ states because we can assign each particle to be in state $1$ or $2$ independently. This counting is not the same for identical bosons.

Andrew

One way to interpret this is to say that the formula $S = \log(\Omega)$ only holds for the microcanonical ensemble.

There, one prepares the system by taking all the states in some small energy interval and assigning them the same a priori probability.

If the distribution is uniform, then $-\sum p_i \log(p_i) = - \sum \frac 1 \Omega \log\left(\frac 1 \Omega\right) = \log(\Omega)$ exactly, even for small $N$.

The connection to classical thermodynamic entropy only works for either a microcanonical ensemble of large systems, or for a system coupled to a heat bath, which again can be seen as a larger "total system" made up of the bath and the system of interest.

When explicitly doing the analysis of increasing system size for a toy system in the microcanonical ensemble (such as a "lattice gas") to reach the thermodynamic limit, you will note that the entropy is not even exactly extensive. That is, if you double the system size, the entropy doesn't double. The leading term $S^\text{ext}$ is extensive, but there are sub-leading corrections $S^\text{sub}$. However, for large systems those corrections are such that $S^\text{sub}/S^\text{ext} \to 0$ in the thermodynamic limit.
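A concrete sketch of that non-extensivity, assuming the simplest lattice-gas counting (a half-filled lattice: $M$ sites, $M/2$ particles, $\Omega = \binom{M}{M/2}$):

```python
import math

def S(M):
    """Microcanonical entropy (k_B = 1) of a half-filled lattice gas with M sites."""
    return math.log(math.comb(M, M // 2))

for M in (10, 100, 1000, 10000):
    print(f"M = {M:>6}   S(2M) / (2 S(M)) = {S(2 * M) / (2 * S(M)):.5f}")
```

Doubling the system does not exactly double the entropy, but the ratio tends to 1: the extensive part $M\ln 2$ dominates, and the sub-leading correction (of order $\ln M$ here) becomes negligible in the thermodynamic limit.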

And this is the power of thermodynamics: simple results emerge as you approach the thermodynamic limit, and simple results hold for the extensive parts of the thermodynamic variables. The exact finite-size results, on the other hand, are not very useful for large systems.

(Of course, ensembles of small systems coupled to heat baths can be handled precisely, and there the non-leading corrections are often relevant – but you need to do a precise statistical analysis and clearly state how the ensemble is prepared.)

Sebastian Riese

Existing answers include some good ones, but they are a bit long. Sometimes short is better.

Answer.

  1. In the microcanonical ensemble, the two formulae agree exactly: $$ -\sum_i p_i \ln p_i = -\ln p_1 = \ln \Omega $$ since $p_i = 1/\Omega$ for every $i$.

  2. In the canonical ensemble, the quantity $\Omega$ (for the system, as opposed to the system plus thermal reservoir) is not a single number but is distributed over a range of values, because the system can exchange heat with the reservoir. It is not true, for example, that $\Omega$ is simply $N!/\prod n_i!$, but the log of that value is a good approximation to the log of $\Omega$ at large $N$.

My own notes on this can be found here: https://users.physics.ox.ac.uk/~Steane/teaching/lecture_course.html under the link https://users.physics.ox.ac.uk/~Steane/teaching/entropy.pdf ("Introduction to statistical thermal physics; includes careful definition of entropy and derivation of basic results"). It is section 6 of the latter that handles the question asked about here. This reference might be regarded as self-advertising, but it is an attempt to be helpful. The most commonly used undergraduate textbooks either handle this issue incorrectly or are rather vague about what is being assumed.

Andrew Steane

What is important, in my opinion, is the justification for using the logarithm of the formula that was derived. After all, the method of Lagrange multipliers should, strictly speaking, be applied to $\Omega$ itself. And if the Stirling approximation is used directly there, it is $n! \approx \sqrt{2\pi n}\, n^n e^{-n}$; the square-root factor is necessary for a correct approximation of $n!$ as $n$ grows.

But taking $\log(n!) \approx \log(\sqrt{2\pi n}) + n\log(n) - n$, the ratio of the last two terms of the right-hand side alone (neglecting the square-root term) to the full right-hand side tends to unity as $n$ grows.

But why is it valid to apply the Lagrange multipliers to $\log (\Omega)$? Because the result is the same as for $\Omega$. This can be seen by comparing the maximization of $f(x,y) = xy$ (both positive) under the constraint $y = 1 - x$ with that of $\log(xy)$ under the same constraint: we get $x = y = \frac{1}{2}$ both ways.
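Written out for that toy example:

$$
\frac{d}{dx}\,\big[x(1-x)\big] = 1 - 2x = 0 \;\Rightarrow\; x = \tfrac{1}{2},
\qquad
\frac{d}{dx}\,\log\!\big[x(1-x)\big] = \frac{1}{x} - \frac{1}{1-x} = 0 \;\Rightarrow\; x = \tfrac{1}{2},
$$

so maximizing the function and maximizing its logarithm pick out the same point.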

Of course it is still an approximation, but for $n > 10^6$ or so it is fairly accurate.