I'm trying to understand how the likelihood function for Gaussian discriminant analysis is derived. I am self-studying Murphy's Probabilistic Machine Learning, and in it he states the likelihood function as follows:
$$p(D|\theta) = \prod_{i=1:n} Cat(y^i|\pi) \prod_{c=1:C} N(x^i | \mu_c, \Sigma_c)^{I(y^i=c)}$$
where $N(x^i | \mu_c, \Sigma_c)$ is the multivariate normal density with mean vector $\mu_c$, the mean of the features over examples whose class label is $c$, and covariance matrix $\Sigma_c$, the covariance between the features for class $c$.
Based on my understanding, the way I have been deriving the likelihood is by assuming the following:
We are given the dataset $D = \{(x^1, y^1), (x^2, y^2), \ldots, (x^n, y^n)\}$ drawn from an unknown distribution $p(X, Y)$. Because all pairs are sampled i.i.d., we obtain:
$$p(D|\theta) = p((x^1, y^1), (x^2, y^2), \ldots, (x^n, y^n) | \theta) = \prod_{i=1:n} p(x^i, y^i | \theta)$$
Hence, applying the product rule $p(x, y|\theta) = p(x|y, \theta)\, p(y|\theta)$ to each factor:
$$p(D|\theta) = \prod_{i=1:n} p(x^i | y^i = c, \theta)p(y^i=c|\theta)$$
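So, comparing this with Murphy's expression, the step I would need (and the one I want to verify) is that the class-conditional factor can be rewritten as a product over all $C$ classes with an indicator in the exponent:
$$p(x^i | y^i, \theta) = \prod_{c=1:C} N(x^i | \mu_c, \Sigma_c)^{I(y^i=c)}$$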
Now, I can see that the prior is just a categorical distribution given by $Cat(y|\pi)$, so I can see how the $Cat$ factor enters the likelihood, since $y^i$ falls into one of $C$ categories. But what I'm not able to understand is why the equation takes a product over all classes $c = 1{:}C$ for the class-conditional density.
Is it assuming that, for the $i$th sample, the label can only take on one of $C$ values, so the distribution $p(x|y,\theta)$ is written as the product of all $C$ class-conditional densities, each raised to the power of an indicator, and hence only one factor is active while the rest reduce to $1$?
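To make my reading concrete, here is a quick numerical sanity check I put together (a minimal sketch with made-up parameters, using scipy's `multivariate_normal`; the names are my own, not from the book). It compares the indicator-product form against simply evaluating the density of the sample's own class:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters: C = 2 classes, 2 features (purely illustrative)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

x = np.array([0.5, -0.2])  # a single sample x^i
y = 0                      # its class label y^i (0-indexed here)

# Murphy's form: product over all classes, each density raised to I(y^i = c)
indicator_product = np.prod([
    multivariate_normal.pdf(x, mean=mus[c], cov=Sigmas[c]) ** (1 if y == c else 0)
    for c in range(2)
])

# My reading: only the density of the sample's own class survives
own_class_density = multivariate_normal.pdf(x, mean=mus[y], cov=Sigmas[y])

print(np.isclose(indicator_product, own_class_density))  # prints True
```

The two agree, which is what my reading would predict, but I would like to confirm that this is the intended interpretation.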
Thanks