Weighted Focal Loss is defined like so
$FL(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t)$
Whereas weighted Cross Entropy Loss is defined like so
$CE(p_t) = -\alpha_t \log(p_t)$
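(To make sure I'm reading the formulas right, here is a minimal NumPy sketch of the two losses for the binary case. This is just my own illustration, not code from the paper; `alpha_t` and `gamma` are the usual hyperparameters.)

```python
import numpy as np

def weighted_ce(p_t, alpha_t=0.25):
    """Weighted cross entropy: -alpha_t * log(p_t)."""
    return -alpha_t * np.log(p_t)

def weighted_focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Weighted focal loss: the same CE term, scaled by the
    modulating factor (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * weighted_ce(p_t, alpha_t)
```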
Some blog posts try to explain the core difference, but I still fail to understand why one would select one over the other.
Compiling some of those blog posts, the explanation boils down to this:
Another way, apart from Focal Loss, to deal with class imbalance is to introduce weights. Give high weights to the rare class and small weights to the dominating or common class. These weights are referred to as $\alpha$
Adding these weights does help with class imbalance; however, the focal loss paper reports:
The large class imbalance encountered during the training of dense detectors overwhelms the cross-entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While $\alpha$ balances the importance of positive/negative examples, it does not differentiate between easy/hard examples.
Even when we add $\alpha$, while it does assign different weights to different classes and thereby balances the importance of positive/negative examples, in most cases this alone is not enough. What we also want is to reduce the loss of easily classified examples, because otherwise those easily classified examples would dominate our training.
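To put toy numbers on that claim (my own illustration, with $\gamma = 2$ and $\alpha_t = 1$ so that only the modulating factor differs):

```python
import numpy as np

gamma, alpha_t = 2.0, 1.0              # alpha_t = 1 isolates the effect of gamma
for p_t in (0.95, 0.3):                # an "easy" and a "hard" example
    ce = -alpha_t * np.log(p_t)
    fl = (1.0 - p_t) ** gamma * ce
    print(f"p_t={p_t}: CE={ce:.4f}, FL={fl:.6f}, FL/CE={(1.0 - p_t) ** gamma:.4f}")

# p_t=0.95: CE ~ 0.0513, FL ~ 0.000128  -> easy example down-weighted ~400x
# p_t=0.30: CE ~ 1.2040, FL ~ 0.590     -> hard example down-weighted only ~2x
```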
My question
I fail to see why the added multiplicative factor in focal loss accounts for "how easily samples are classified" any more than what is already captured by the very essence of CrossEntropyLoss.
It seems that with FocalLoss, the (cross-entropy) loss is simply pushed further down in the "certain" region, so that the loss there becomes insignificant - but the loss is already small there with plain CrossEntropy. I simply fail to see the fundamental problem that FocalLoss can solve and CrossEntropyLoss can't.
Their qualitative shapes are quite similar, so I don't see what problem one can solve that the other can't.
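For concreteness, this is roughly how I am comparing the two curves (a quick sketch of my own, with $\gamma = 2$ as in the paper and $\alpha_t = 1$):

```python
import numpy as np
import matplotlib.pyplot as plt

p_t = np.linspace(0.01, 0.99, 200)
ce = -np.log(p_t)                    # cross-entropy, alpha_t = 1
fl = (1.0 - p_t) ** 2 * ce           # focal loss, gamma = 2

plt.plot(p_t, ce, label="CE")
plt.plot(p_t, fl, label="FL, gamma=2")
plt.xlabel("p_t")
plt.ylabel("loss")
plt.legend()
plt.show()
```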