Chris White's answer using an analogy to incoherent light pretty much answers the question; it's fundamentally a question of the statistics of how wave sources add.
Here's a slightly different but equivalent rephrasing of Chris White's answer using matrices:
Given $N$ wave sources, incoherent waves add "diagonally" ($I\propto N$), i.e., additively. Coherent waves add "off-diagonally" ($I\propto N^2$).
From reasonably far away, a choir of people can be approximated as a set of $N$ spherical wave sources located at points $\mathbf{q}_k$. At a point $\mathbf{p}$ (observer), the amplitude/pressure from the $k$th source becomes $A_k(\mathbf{p},t)=\frac{K_k}{|\mathbf{p}-\mathbf{q}_k|}V_k(t-|\mathbf{p}-\mathbf{q}_k|/c)$ where $K_k$ is some amplitude constant and where $V_k(t)$ is the time-dependent part. The total pressure at time $t$ becomes $\sum_{k=1}^NA_k(\mathbf{p},t)$.
Since acoustic intensity is (up to a constant factor, dropped here) the square of the modulus of the pressure, the expectation value of the intensity at $\mathbf{p}$ becomes
$$\langle I(\mathbf{p})\rangle=\left<\overline{\left(\sum_{j=1}^N\frac{K_jV_j\left(t-\frac{|\mathbf{p}-\mathbf{q}_j|}{c}\right)}{|\mathbf{p}-\mathbf{q}_j|}\right)}\left(\sum_{k=1}^N\frac{K_kV_k\left(t-\frac{|\mathbf{p}-\mathbf{q}_k|}{c}\right)}{|\mathbf{p}-\mathbf{q}_k|}\right)\right>$$
$$=\sum_{j=1}^N\sum_{k=1}^N\frac{\overline{K_j}K_k}{|\mathbf{p}-\mathbf{q}_j||\mathbf{p}-\mathbf{q}_k|}\left<\overline{V_j\left(t-\frac{|\mathbf{p}-\mathbf{q}_j|}{c}\right)}V_k\left(t-\frac{|\mathbf{p}-\mathbf{q}_k|}{c}\right)\right>$$
$$=\sum_{j=1}^N\sum_{k=1}^N\frac{\overline{K_j}K_k}{|\mathbf{p}-\mathbf{q}_j||\mathbf{p}-\mathbf{q}_k|}\left<\overline{V_j\left(t\right)}V_k\left(t+\frac{|\mathbf{p}-\mathbf{q}_j|-|\mathbf{p}-\mathbf{q}_k|}{c}\right)\right>$$
$$=\mathbf{a}^\dagger\boldsymbol{\Gamma}\mathbf{a}$$
where
$$\mathbf{a}_j=\frac{K_j}{|\mathbf{p}-\mathbf{q}_j|}$$
is a vector of amplitude coefficients and
$$\boldsymbol{\Gamma}_{jk}=\left<\overline{V_j\left(t\right)}V_k\left(t+\tau_{jk}\right)\right>$$
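is the correlation matrix element between the source signals $V_j$ and $V_k$ with time delay $\tau_{jk}=\frac{|\mathbf{p}-\mathbf{q}_j|-|\mathbf{p}-\mathbf{q}_k|}{c}$. (Shifting the time origin inside the average, as done in the third line above, assumes the signals are statistically stationary.)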
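To make the quadratic form concrete, here is a minimal Python sketch that evaluates $\mathbf{a}^\dagger\boldsymbol{\Gamma}\mathbf{a}$ for a given amplitude vector and coherence matrix. The function names (`quadratic_intensity`, `identity`, `all_ones`) are made up for illustration, and the matrices are plain nested lists rather than anything from a particular library.

```python
def quadratic_intensity(a, gamma):
    """Return a^dagger Gamma a for amplitude vector a and coherence matrix gamma."""
    n = len(a)
    total = 0j
    for j in range(n):
        for k in range(n):
            # conj(a_j) * Gamma_jk * a_k, summed over all pairs (j, k)
            total += a[j].conjugate() * gamma[j][k] * a[k]
    # For a Hermitian gamma the quadratic form is real; drop the zero imaginary part.
    return total.real


def identity(n):
    """Coherence matrix of mutually uncorrelated, normalized sources."""
    return [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]


def all_ones(n):
    """Coherence matrix of perfectly correlated, in-phase, normalized sources."""
    return [[1.0] * n for _ in range(n)]


if __name__ == "__main__":
    a = [1.0] * 4  # four equally loud sources (illustrative values)
    print(quadratic_intensity(a, identity(4)))  # diagonal terms only
    print(quadratic_intensity(a, all_ones(4)))  # diagonal plus off-diagonal terms
```

With four unit amplitudes, the identity matrix picks up only the four diagonal terms, while the all-ones matrix picks up all sixteen entries.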
Without loss of generality we can assume the $V_k$ are normalized, i.e., $\boldsymbol{\Gamma}_{kk}=1$, by absorbing any excess amplitude into the amplitude constant $K_k$. Then the off-diagonal elements satisfy $|\boldsymbol{\Gamma}_{jk}|\leq1$ by the Cauchy-Schwarz inequality, and $\boldsymbol{\Gamma}$ is called the normalized coherence matrix.
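For completeness, the bound on the off-diagonal elements can be spelled out with the Cauchy-Schwarz inequality applied to the averaging bracket. This is a sketch that assumes the signals are statistically stationary, so that $\langle|V_k(t+\tau_{jk})|^2\rangle=\langle|V_k(t)|^2\rangle=\boldsymbol{\Gamma}_{kk}$:

$$|\boldsymbol{\Gamma}_{jk}|^2=\left|\left<\overline{V_j(t)}V_k(t+\tau_{jk})\right>\right|^2\leq\left<|V_j(t)|^2\right>\left<|V_k(t+\tau_{jk})|^2\right>=\boldsymbol{\Gamma}_{jj}\boldsymbol{\Gamma}_{kk}=1.$$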
Note $\mathbf{a}^\dagger\boldsymbol{\Gamma}\mathbf{a}$ is a matrix quadratic form, and in the absence of correlation between the sources $A_k$ (in the time-averaged sense) the off-diagonal elements vanish and $\boldsymbol{\Gamma}_{jk}=\delta_{jk}$ (the identity matrix), giving
$$\langle I(\mathbf{p})\rangle=\mathbf{a}^\dagger\mathbf{a}=\sum_{k=1}^N\left|\frac{K_k}{|\mathbf{p}-\mathbf{q}_k|}\right|^2,$$
which is just the sum of the individual intensities of the voices. This is what is meant by "incoherent waves add diagonally"; their intensities just add, because only the on-diagonal elements of the normalized coherence matrix come into play, and $I\propto N$ for $N$ voices of comparable loudness.
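A quick numerical check of the incoherent case, as a rough sketch: model each source as a unit-amplitude phasor with an independent, uniformly random phase (a toy stand-in for uncorrelated voices, not something taken from the derivation above), and average $|\sum_k e^{i\phi_k}|^2$ over many draws. The function name and default parameters are made up for illustration.

```python
import cmath
import math
import random


def mean_incoherent_intensity(n_sources, n_trials=20000, seed=0):
    """Average |sum of unit phasors|^2 when every phase is drawn independently."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n_trials):
        # Each source contributes exp(i*phi) with its own random phase phi.
        field = sum(
            cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
            for _ in range(n_sources)
        )
        total += abs(field) ** 2
    return total / n_trials


if __name__ == "__main__":
    for n in (5, 20, 80):
        print(n, mean_incoherent_intensity(n))
```

Under these assumptions the averaged intensity should come out close to $N$ rather than $N^2$, because the cross terms $\langle e^{i(\phi_k-\phi_j)}\rangle$ average to zero for $j\neq k$.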
Meanwhile, if there are long-term correlations between the sources $A_k$, the off-diagonal elements will not be zero, and additional intensity will come from the $N^2-N$ off-diagonal entries of $\boldsymbol{\Gamma}$. In this case, the intensities add "off-diagonally", and in the best-case scenario they can add as $I\propto N^2$.
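As an illustration of how the two regimes connect, take the simplest toy case (an assumption for this example, not part of the derivation above): all sources have the same coefficient $\mathbf{a}_k=a$, and every pair has the same real coherence $\boldsymbol{\Gamma}_{jk}=\gamma$ for $j\neq k$, with $0\leq\gamma\leq1$. The $N$ diagonal entries each contribute $|a|^2$ and the $N^2-N$ off-diagonal entries each contribute $\gamma|a|^2$, so

$$\langle I(\mathbf{p})\rangle=\mathbf{a}^\dagger\boldsymbol{\Gamma}\mathbf{a}=|a|^2\left[N+\gamma\,(N^2-N)\right].$$

Setting $\gamma=0$ recovers the incoherent result $N|a|^2$, and $\gamma=1$ gives the fully coherent result $N^2|a|^2$.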