
Why is the cross-entropy formulation different from binary cross-entropy?

Cross-entropy loss:

$$ H_p(q) = -\sum_{i=1}^{C} [q_i \log(p_i)] $$

Binary cross-entropy:

$$ -\sum_i [k_i \log(p_i)+(1-k_i) \log(1-p_i)] $$ where $k_i \in \{0,1\}$.

"Where did the Binary Cross-Entropy Loss Function come from?" Relation between Binary Cross Entropy and Binomial Distribution

2 Answers


There's no essential difference. Information-theoretically speaking, any loss consisting of a negative log-likelihood (NLL) is a cross entropy between the empirical distribution defined by the training set and the probability distribution defined by a specific model. Even the usual mean squared error (MSE) can be shown to be the cross entropy between the empirical distribution and a Gaussian model distribution.
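To make the MSE remark concrete, here is a minimal NumPy sketch (with made-up targets and predictions, not from the answer) checking that the Gaussian NLL with a fixed unit variance differs from half the sum of squared errors only by an additive constant, so minimizing one minimizes the other:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=10)   # targets (made-up data)
y_pred = rng.normal(size=10)   # model predictions (made-up)

# Negative log-likelihood of the targets under N(y_pred, 1)
nll = -np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y_true - y_pred) ** 2)

# Half the sum of squared errors plus the variance-independent constant
half_sse = 0.5 * np.sum((y_true - y_pred) ** 2)
const = 0.5 * len(y_true) * np.log(2 * np.pi)

print(np.isclose(nll, half_sse + const))  # True
```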

For discrete probability distributions $p$ and $q$ with the same support $\mathcal{X}$, this means $H(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)$. (Eq. 1)
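As a quick illustration of Eq. 1 (toy numbers of my own): when the empirical distribution $p$ is one-hot, the sum collapses to the negative log-probability that the model $q$ assigns to the observed class, i.e. the per-example NLL.

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # empirical (one-hot) distribution: class 1 observed
q = np.array([0.2, 0.7, 0.1])   # model's predicted distribution (made-up)

H_pq = -np.sum(p * np.log(q))           # Eq. 1
print(np.isclose(H_pq, -np.log(q[1])))  # True: only the observed class contributes
```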

Therefore binary cross entropy is clearly just a special case of the above equation for discrete probability distributions: your $k_i$ is the ground-truth label for the $i$-th example (say $0$ for the negative class and $1$ for the positive class), acting as the empirical distribution of the training set, and your $p_i$ is the probability the model assigns to the positive class for the $i$-th example. We can now derive the binary cross-entropy loss from the maximum-likelihood principle and the Binomial distribution to show their relation.

Since each example of the Binomial data-generating process is independent, the likelihood function can be expressed as the product of the probabilities of the individual examples, $L = \prod_{i=1}^N p_i^{k_i}(1-p_i)^{1-k_i}$. To simplify computation and prevent numerical underflow, it is common to work with the logarithm of the likelihood, the log-likelihood $\log L = \sum_{i=1}^N [k_i \log(p_i) + (1-k_i)\log(1-p_i)]$. Finally, maximizing the log-likelihood for a fixed $N$ is equivalent to minimizing the negative log-likelihood (NLL), which yields your binary cross-entropy loss function $-\sum_i [k_i \log(p_i)+(1-k_i) \log(1-p_i)]$, a measure of the discrepancy between the predicted probabilities and the true labels.
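A small sketch of this derivation with hypothetical labels and predicted probabilities: the product-form likelihood, its logarithm, and the negative log-likelihood, which coincides with the binary cross-entropy sum above.

```python
import numpy as np

k = np.array([1, 0, 1, 1, 0])             # ground-truth labels (made-up)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(class = 1) (made-up)

likelihood = np.prod(p**k * (1 - p)**(1 - k))   # L = prod_i p_i^{k_i} (1 - p_i)^{1 - k_i}
log_likelihood = np.sum(k * np.log(p) + (1 - k) * np.log(1 - p))
bce = -np.sum(k * np.log(p) + (1 - k) * np.log(1 - p))   # binary cross-entropy, summed over examples

print(np.isclose(np.log(likelihood), log_likelihood))  # True
print(np.isclose(bce, -log_likelihood))                 # True
```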

cinch

The cross entropy loss with 2 classes is equivalent to the binary cross entropy loss:

$$ H_p(q) = - \sum_{i=1}^2 [q_i \log p_i] = -[q_1 \log p_1 + q_2 \log p_2] = -[q_1 \log p_1 + (1 - q_1) \log (1 - p_1)] $$

The last equality follows from the fact that $q$ and $p$ are discrete probability distributions with support $\{1, 2\}$, so $q_2 = 1 - q_1$ and $p_2 = 1 - p_1$.

The summation in your binary cross-entropy loss is there because you are summing over the entire batch of data points, while the cross-entropy formula above is the loss on a single data point with prediction $p \in [0,1]^2$ and ground truth $q \in \{0,1\}^2$. If you have soft ground truth, for example with label smoothing, then in general $q \in [0,1]^2$.
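A quick numerical check of this equivalence on a single data point (the target and prediction values are made up, with a soft target to match the label-smoothing remark):

```python
import numpy as np

q1, p1 = 0.9, 0.8   # soft target (e.g. label smoothing) and predicted P(class 1), both made-up

q = np.array([q1, 1 - q1])   # ground-truth distribution over the two classes
p = np.array([p1, 1 - p1])   # predicted distribution over the two classes

ce_two_class = -np.sum(q * np.log(p))                      # H_p(q) with 2 classes
bce_term = -(q1 * np.log(p1) + (1 - q1) * np.log(1 - p1))  # binary cross-entropy term

print(np.isclose(ce_two_class, bce_term))  # True
```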

c p