I'm coding an FNN in Rust using the nalgebra crate. I coded the backpropagation based on this article from Brilliant (the link points directly to the section with the formulas I used).
The issue
My network tends to over-predict one class: after a few iterations it assigns nearly every sample to the same class while the loss keeps growing. See the logs from some iterations:
Pred: [0.4957141833444244, 0.5042858166555757] Exp: [1.0, 0.0]
Pred: [0.1159099292782023, 0.8840900707217978] Exp: [0.0, 1.0]
Pred: [0.49812391110550713, 0.5018760888944928] Exp: [0.0, 1.0]
Pred: [0.785823559592914, 0.21417644040708617] Exp: [1.0, 0.0]
(0) Loss: 1.7553771084566945 accuracy: 75%
Pred: [0.921104553204082, 0.07889544679591799] Exp: [1.0, 0.0]
Pred: [0.43669346100767914, 0.5633065389923209] Exp: [0.0, 1.0]
Pred: [0.8390020708741205, 0.16099792912587946] Exp: [0.0, 1.0]
Pred: [0.9358457093202687, 0.0641542906797313] Exp: [1.0, 0.0]
(1) Loss: 2.5487814857735516 accuracy: 75%
Pred: [0.9938891101436632, 0.0061108898563368074] Exp: [1.0, 0.0]
Pred: [0.8117580010095468, 0.18824199899045324] Exp: [0.0, 1.0]
Pred: [0.9637103688436267, 0.03628963115637338] Exp: [0.0, 1.0]
Pred: [0.9836993037515837, 0.01630069624841633] Exp: [1.0, 0.0]
(2) Loss: 5.008814788847665 accuracy: 50%
Pred: [0.9995224418411457, 0.0004775581588542201] Exp: [1.0, 0.0]
Pred: [0.9640882324687131, 0.03591176753128681] Exp: [0.0, 1.0]
Pred: [0.9940401356203591, 0.005959864379640862] Exp: [0.0, 1.0]
Pred: [0.9967080323177023, 0.0032919676822977125] Exp: [1.0, 0.0]
(3) Loss: 8.453172874424054 accuracy: 50%
Pred: [0.9999695301335619, 3.0469866438138036e-5] Exp: [1.0, 0.0]
Pred: [0.9955011694391566, 0.004498830560843456] Exp: [0.0, 1.0]
Pred: [0.9993378187078414, 0.0006621812921585512] Exp: [0.0, 1.0]
Pred: [0.9995329830642827, 0.00046701693571723986] Exp: [1.0, 0.0]
(4) Loss: 12.724406571994546 accuracy: 50%
Pred: [0.9999986403638063, 1.3596361937582465e-6] Exp: [1.0, 0.0]
Pred: [0.9996152538707804, 0.0003847461292195687] Exp: [0.0, 1.0]
Pred: [0.9999496312441125, 5.036875588738215e-5] Exp: [0.0, 1.0]
Pred: [0.9999544427172501, 4.5557282750020274e-5] Exp: [1.0, 0.0]
(5) Loss: 17.759113261749448 accuracy: 50%
Pred: [0.9999999567632185, 4.323678145397258e-8] Exp: [1.0, 0.0]
Pred: [0.9999768122330864, 2.3187766913582424e-5] Exp: [0.0, 1.0]
Pred: [0.9999973081629956, 2.6918370044323987e-6] Exp: [0.0, 1.0]
Pred: [0.9999969459673318, 3.054032668285032e-6] Exp: [1.0, 0.0]
(6) Loss: 23.497175499984603 accuracy: 50%
This also happens with more classes and other datasets (like the MNIST handwritten digits).
My implementation
Based on the Brilliant article I wrote a matrix-based implementation of backpropagation, i.e. all calculations are done matrix-wise. I don't see any problems with my implementation. I also implemented RMSProp, which seems to be fine, since the same issue happens when I remove it and train the network with plain SGD.
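To make it concrete, here is a simplified sketch of the matrix-wise update I mean, for a single hidden layer (this is illustrative, not the exact code from the repo; `w1`, `b1`, etc. are made-up names):

```rust
use nalgebra::{DMatrix, DVector};

fn sigmoid(z: &DVector<f64>) -> DVector<f64> {
    z.map(|v| 1.0 / (1.0 + (-v).exp()))
}

fn softmax(z: &DVector<f64>) -> DVector<f64> {
    let max = z.max(); // subtract the max for numerical stability
    let exp = z.map(|v| (v - max).exp());
    let sum = exp.sum();
    exp / sum
}

// One SGD step for input -> hidden (sigmoid) -> output (softmax).
fn train_step(
    w1: &mut DMatrix<f64>, b1: &mut DVector<f64>,
    w2: &mut DMatrix<f64>, b2: &mut DVector<f64>,
    x: &DVector<f64>, y: &DVector<f64>, lr: f64,
) {
    // Forward pass.
    let z1 = &*w1 * x + &*b1;
    let a1 = sigmoid(&z1);
    let z2 = &*w2 * &a1 + &*b2;
    let a2 = softmax(&z2);

    // Backward pass. With softmax + cross-entropy the output delta
    // reduces to (prediction - target); no explicit softmax Jacobian.
    let d2 = &a2 - y;
    let grad_w2 = &d2 * a1.transpose();
    // sigmoid'(z1) expressed through the activation: a1 * (1 - a1).
    let d1 = (w2.transpose() * &d2)
        .component_mul(&a1.component_mul(&a1.map(|v| 1.0 - v)));
    let grad_w1 = &d1 * x.transpose();

    // Descend along the negative gradient.
    *w2 -= lr * grad_w2;
    *b2 -= lr * &d2;
    *w1 -= lr * grad_w1;
    *b1 -= lr * &d1;
}
```

And the RMSProp variant keeps a leaky average of the squared gradients per parameter; roughly (again a sketch, `decay` and `eps` stand for my hyperparameters):

```rust
use nalgebra::DMatrix;

// RMSProp step for one weight matrix. `cache` persists across steps.
fn rmsprop_step(
    w: &mut DMatrix<f64>,
    cache: &mut DMatrix<f64>,
    grad: &DMatrix<f64>,
    lr: f64,
    decay: f64, // e.g. 0.9
    eps: f64,   // e.g. 1e-8, avoids division by zero
) {
    // cache = decay * cache + (1 - decay) * grad^2 (elementwise).
    *cache = decay * &*cache + (1.0 - decay) * grad.component_mul(grad);
    // Per-element adaptive step: grad / (sqrt(cache) + eps).
    *w -= lr * grad.zip_map(&*cache, |g, c| g / (c.sqrt() + eps));
}
```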
If you want to take a look at the whole project, I created a branch (a copy of the current main branch) that I'll never delete, for the sake of this question.
What I tried
- Checked every activation and loss function I have coded (see my functions module code). I'm using sigmoid for the hidden layer and softmax for the output layer, but even if I use ReLU for the hidden layer the issue is the same.
- Applied the math by hand; it seems to check out (though I'm not an expert at math at all). A numerical check would be more reliable; see the finite-difference sketch after this list.
- Removed RMSProp and used plain SGD: same issue, the network doesn't learn.
- Used the ReLU activation function for the hidden layer.
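Since checking the math by hand is error-prone, the next sanity check I have in mind is a numerical gradient check: compare what my backprop produces against a central finite difference of the loss. This is only a sketch; `loss` and `analytic_grad` are placeholders, not names from my repo:

```rust
use nalgebra::DMatrix;

/// Compare an analytic gradient against a central finite difference.
/// Returns the largest absolute discrepancy over all entries of `w`.
fn gradient_check(
    w: &DMatrix<f64>,
    analytic_grad: &DMatrix<f64>,
    loss: impl Fn(&DMatrix<f64>) -> f64,
) -> f64 {
    let h = 1e-5;
    let mut worst: f64 = 0.0;
    for i in 0..w.nrows() {
        for j in 0..w.ncols() {
            let mut plus = w.clone();
            let mut minus = w.clone();
            plus[(i, j)] += h;
            minus[(i, j)] -= h;
            // Central difference: (L(w+h) - L(w-h)) / (2h).
            let numeric = (loss(&plus) - loss(&minus)) / (2.0 * h);
            worst = worst.max((numeric - analytic_grad[(i, j)]).abs());
        }
    }
    worst // should be tiny (around 1e-7 or less) if backprop is right
}
```

A large discrepancy would point at the backprop math itself; a small one would shift suspicion to the update step (e.g. a sign or learning-rate issue).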
Question
- In theory, what could cause this?