I'm trying to understand how the likelihood function for Gaussian discriminant analysis is derived. I am self-studying Murphy's Probabilistic Machine Learning, and in it he states the likelihood function as follows:
$$p(D|\theta) = \prod_{i=1:n} Cat(y^i|\pi) \prod_{c=1:C} N(x^i | \mu_c, \Sigma_c)^{I(y^i=c)}$$
where $N(x^i | \mu_c, \Sigma_c)$ is the multivariate normal density with mean vector $\mu_c$, the mean of the features over examples whose class label is $c$, and covariance matrix $\Sigma_c$, the covariance between the features for class $c$.
Based on my understanding, the way I have been deriving the likelihood is by assuming the following:
We are given the dataset $D = \{(x^1, y^1), (x^2, y^2), \ldots, (x^n, y^n)\}$ drawn from an unknown distribution $p(X, Y)$. Because all pairs are sampled i.i.d., we obtain:
$$p(D|\theta) = p((x^1, y^1), (x^2, y^2), \ldots, (x^n, y^n) | \theta) = \prod_{i=1:n} p(x^i, y^i | \theta)$$
Hence, applying the product rule $p(x, y|\theta) = p(x|y, \theta)\, p(y|\theta)$ to each factor:
$$p(D|\theta) = \prod_{i=1:n} p(x^i | y^i = c, \theta)p(y^i=c|\theta)$$
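So, comparing this with Murphy's expression, the step I would need (and the one I want to verify) is that the class-conditional factor can be rewritten as a product over all $C$ classes with an indicator in the exponent:
$$p(x^i | y^i, \theta) = \prod_{c=1:C} N(x^i | \mu_c, \Sigma_c)^{I(y^i=c)}$$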
Now, I can see that the prior is just a categorical distribution given by $Cat(y|\pi)$, so I can see how the $Cat$ factor enters the likelihood, since $y^i$ falls into one of $C$ categories. But what I'm not able to understand is why the equation takes a product over all classes $c = 1{:}C$ for the class-conditional density.
Is it assuming that, for the $i$th sample, the label can only take on one of $C$ values, so the distribution $p(x|y,\theta)$ is written as the product of all $C$ class-conditional densities, each raised to the power of an indicator, and hence only one factor is active while the rest reduce to $1$?
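To make my reading concrete, here is a quick numerical sanity check I put together (a minimal sketch with made-up parameters, using scipy's `multivariate_normal`; the names are my own, not from the book). It compares the indicator-product form against simply evaluating the density of the sample's own class:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters: C = 2 classes, 2 features (purely illustrative)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

x = np.array([0.5, -0.2])  # a single sample x^i
y = 0                      # its class label y^i (0-indexed here)

# Murphy's form: product over all classes, each density raised to I(y^i = c)
indicator_product = np.prod([
    multivariate_normal.pdf(x, mean=mus[c], cov=Sigmas[c]) ** (1 if y == c else 0)
    for c in range(2)
])

# My reading: only the density of the sample's own class survives
own_class_density = multivariate_normal.pdf(x, mean=mus[y], cov=Sigmas[y])

print(np.isclose(indicator_product, own_class_density))  # prints True
```

The two agree, which is what my reading would predict, but I would like to confirm that this is the intended interpretation.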
Thanks