Nonclassicality in general
I should start by pointing out that there is no univocal notion of "(non)classicality". To name a few examples: in the context of quantum optics, one might call a state "nonclassical" if it cannot be written as a convex combination of coherent states. When entanglement is involved, one might call an entangled state "nonclassical". In the context of nonlocality, one might call "nonclassical" a probability distribution that cannot be reproduced by local hidden variable theories.
The type of (non)classicality discussed in this paper is thus a specific one, suited to the context at hand. While certainly in the spirit of nonclassicality as discussed in the context of Bell nonlocality, it is not exactly the same idea, I think. The point here seems to be to rule out the possibility that the observed correlations can be explained by Alice simply sending a classical message to Bob. More details below.
Very rough summary of the context
The setup consists of Alice preparing a state and Bob measuring it.
Alice chooses the value of some index $i\in[1,M]$, and sends a state $\rho_{i,\lambda}$ accordingly. Here $\lambda$ denotes a hidden variable; this is added to take into account the possibility of Alice and Bob being classically correlated beforehand.
Bob then performs a measurement described by a POVM $\{\Pi^j_{k,\lambda}\}_j$, for some $k\in[1,K]$ of his choosing, and gets an outcome $j\in[1,N]$.
Repeating the above procedure a number of times, they collect the associated statistics, in the form of the conditional probabilities $p_{j|i,k}$. These conditional probabilities are the main subject of the discussion.
They call these correlations "classical" if, roughly speaking, they can be written as $p_{j|i,k}=\operatorname{Tr}[\rho_i\Pi^j_k]$ for some "classical set of states" $\{\rho_i\}_i$.
So... why commuting states?
I think the point of this definition is that if the states commute, then there is no need for Alice to actually send quantum states to Bob. She might as well just send a classical message. Indeed, the authors point out after Lemma 1 that:
Lemma 1 thus proves that every set of probabilities obtained with commuting states can be performed with classical states and classical POVMs. This clearly implies that commuting states may be equally regarded as classical variables, and commuting-element measurements as read-out of classical variables.
A possible argument to see why this is the case, loosely inspired by the proof of Lemma 1 (though in fairness I haven't really read the paper or the lemma in depth), is the following.
If the states are pairwise commuting, $[\rho_i,\rho_j]=0$, then they can be simultaneously diagonalised:
$$\rho_i = \sum_\ell \lambda_{i\ell}|\ell\rangle\!\langle \ell|,$$
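for some eigenvalues $\lambda_{i\ell}\ge0$ with $\sum_\ell\lambda_{i\ell}=1$ (not to be confused with the hidden variable $\lambda$ above) and an orthonormal basis $\{|\ell\rangle\}_\ell$. We then have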
$$p_{j|i,k}=\operatorname{Tr}[\rho_i \Pi^j_k]
= \sum_\ell \lambda_{i\ell} \langle \ell|\Pi^j_k|\ell\rangle.$$
What this is telling you is that, to reproduce the correlation, it is enough for Alice to send a classical message, and more specifically the value of $i$, to Bob's device. Bob's device will then sample one of the (fixed beforehand) states $|\ell\rangle$, with probabilities $\lambda_{i\ell}$, and measure it with the POVM $\{\Pi^j_k\}_j$, obtaining an outcome $j$.
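As a sanity check, here is a minimal Python sketch of the identity above. All numbers are made up for illustration and are not taken from the paper: two qubit states diagonal in the basis $\{|0\rangle,|1\rangle\}$, and two two-outcome POVMs (a Z and an X measurement). The names `lam`, `povms`, `born` and `classical_mixture` are my own.

```python
# Illustrative numbers only (not from the paper).
# lam[i][l] = eigenvalue lambda_{i,l} of rho_i on the common eigenvector |l>.
lam = [[0.9, 0.1],
       [0.3, 0.7]]

# povms[k][j] = POVM element Pi^j_k as a real 2x2 matrix.
povms = [
    [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]],    # k=0: Z measurement
    [[[0.5, 0.5], [0.5, 0.5]], [[0.5, -0.5], [-0.5, 0.5]]],  # k=1: X measurement
]

def diag_state(eigs):
    """Density matrix that is diagonal in the basis {|l>}."""
    d = len(eigs)
    return [[eigs[a] if a == b else 0.0 for b in range(d)] for a in range(d)]

def trace_product(A, B):
    """Tr[A B] for square matrices given as nested lists."""
    d = len(A)
    return sum(A[a][b] * B[b][a] for a in range(d) for b in range(d))

def born(i, k, j):
    """p_{j|i,k} = Tr[rho_i Pi^j_k], computed from the full matrices."""
    return trace_product(diag_state(lam[i]), povms[k][j])

def classical_mixture(i, k, j):
    """sum_l lambda_{i,l} <l|Pi^j_k|l>, using only diagonal entries of Pi^j_k."""
    return sum(lam[i][l] * povms[k][j][l][l] for l in range(len(lam[i])))

if __name__ == "__main__":
    for i in range(2):
        for k in range(2):
            for j in range(2):
                print(i, k, j, born(i, k, j), classical_mixture(i, k, j))
```

Only the diagonal entries $\langle\ell|\Pi^j_k|\ell\rangle$ enter the result, which is what makes a purely classical strategy possible.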
For example, Alice presses the button "$i=1$". Her box sends a classical message saying "1" to Bob's box. Bob's box is then programmed to pick one of the states $|\ell\rangle$. More specifically, it will pick the state $|\ell\rangle$ with probability $\lambda_{1\ell}$ (again, these probabilities will have been chosen beforehand, when the boxes were calibrated). Say the box picks the value "$\ell=3$". Now, if Bob presses the button "$k=2$", his box will perform the measurement corresponding to the POVM $\{\Pi_2^j\}_j$ on the state $|3\rangle$. The result will be some value of $j\in[1,N]$. Say $j=36\le N$.
Repeating this circus an "infinite" number of times, Alice and Bob will collect the results and find the correct values of the conditional probabilities $p_{j|i,k}$.
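The procedure just described can be mimicked with a small Monte Carlo sketch. This is only an illustration under my own assumptions: the numbers are invented, the names `lam`, `q`, `bob_box`, `estimate` and `exact` are hypothetical, and the measurement of $|\ell\rangle$ is replaced by directly sampling $j$ from the Born probabilities $\langle\ell|\Pi^j_k|\ell\rangle$, which Bob's box can store in advance.

```python
import random

# Illustrative numbers only (not from the paper).
# lam[i][l] = probability that Bob's box picks |l> after receiving message i.
lam = [[0.9, 0.1],
       [0.3, 0.7]]

# q[k][l][j] = <l|Pi^j_k|l>: outcome probabilities when |l> is measured with POVM k.
q = [
    [[1.0, 0.0], [0.0, 1.0]],  # k=0: Z measurement
    [[0.5, 0.5], [0.5, 0.5]],  # k=1: X measurement
]

def bob_box(i, k, rng):
    """One round: Bob's box receives the classical message i and setting k."""
    # Pick the label l with probability lambda_{i,l}.
    l = rng.choices(range(len(lam[i])), weights=lam[i])[0]
    # Sample the outcome j with probability <l|Pi^j_k|l>.
    return rng.choices(range(len(q[k][l])), weights=q[k][l])[0]

def estimate(i, k, j, shots, rng):
    """Empirical frequency of outcome j over many rounds."""
    hits = sum(1 for _ in range(shots) if bob_box(i, k, rng) == j)
    return hits / shots

def exact(i, k, j):
    """sum_l lambda_{i,l} <l|Pi^j_k|l>, the value the frequencies should approach."""
    return sum(lam[i][l] * q[k][l][j] for l in range(len(lam[i])))

if __name__ == "__main__":
    rng = random.Random(0)
    for i in range(2):
        for k in range(2):
            print(i, k, estimate(i, k, 0, 20000, rng), exact(i, k, 0))
```

With a finite number of rounds the frequencies only approximate $p_{j|i,k}$, with statistical fluctuations shrinking as the number of rounds grows.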
When this type of thing is possible, we say that the correlations $p_{j|i,k}$ are "classical", in the sense that no actual quantum states needed to be exchanged to obtain them.