The short answer is that the most likely cause is that your train/test dataset does not represent reality well.
As you are likely aware, when you train a network the goal is for it to generalize, so that data it has never seen is classified well. To check this, we (as you did) split the available training data and hold out a test set, which is used to evaluate how well the model has generalized.
If you were seeing this behavior during training, I would suspect that all of the samples had inadvertently been assigned the same label. I would encourage you to double-check that the confusion matrix you posted is indeed from the test set and not the training set; if it is from the training set, verify your labels before training.
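For concreteness, here is a minimal sketch of both checks, assuming a scikit-learn-style workflow (the toy data, the `LogisticRegression` stand-in for your network, and all parameter values are illustrative, not your actual setup):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Toy data standing in for your features X and labels y.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Check 1: if every training label is the same, the model can only
# ever learn to predict that one class.
print(Counter(y_train))

# Check 2: compare the confusion matrix on the training set with the
# one on the test set to see where the degenerate predictions appear.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_train, model.predict(X_train)))
print(confusion_matrix(y_test, model.predict(X_test)))
```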
Assuming that the confusion matrix is from the test set, I would suggest that this more likely indicates that your training/testing set does not represent the problem space well. Consider this illustration:
Imagine that you wish to create a model that approximates the sine function. If you train it on 10,000 $(x,y)$ pairs with $x$ in the range $[0, 2]$ and hold out a random sample of those pairs as a test set, the model will almost certainly perform well and evaluate with low loss on that test set (provided you train for a reasonable number of epochs).
However, even with 10,000 or 100,000 samples in that range, if you now ask the trained model to produce values of sine for $x$ in the range $[3,5]$, it is likely to be nowhere near correct. This is because the training and testing data say nothing about how the function behaves outside that narrow range.
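This is easy to reproduce; the sketch below uses scikit-learn's `MLPRegressor` as a stand-in for the network (the layer sizes, sample counts, and iteration limit are all arbitrary choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Train only on x in [0, 2].
x_train = rng.uniform(0, 2, 10_000).reshape(-1, 1)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     random_state=0)
model.fit(x_train, np.sin(x_train).ravel())

# Inside the training range the approximation is close...
x_in = np.linspace(0, 2, 5).reshape(-1, 1)
print(np.c_[np.sin(x_in), model.predict(x_in)])   # columns: true, predicted

# ...outside it, the model has no information and extrapolates poorly.
x_out = np.linspace(3, 5, 5).reshape(-1, 1)
print(np.c_[np.sin(x_out), model.predict(x_out)])
```

Inside $[0, 2]$ the two printed columns agree closely; on $[3, 5]$ the predictions typically drift far from the true values.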
Of course, there are many problems that an ANN can model very well from a narrow set of data, but in those cases the data outside of the training range behaves in much the same way as it does within it. To be clear, this issue isn't specific to sine; it applies to problems in general. It is very important to verify that your training/testing data covers the problem domain well.
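One crude but useful first check is to measure how much of the data the model will actually see falls outside the per-feature ranges observed during training; the `coverage_report` helper below is purely illustrative:

```python
import numpy as np

def coverage_report(train, incoming):
    """Fraction of incoming values per feature that fall outside the
    min/max range observed in the training data."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    return ((incoming < lo) | (incoming > hi)).mean(axis=0)

# Mirroring the sine example: trained on [0, 2], queried on [3, 5].
rng = np.random.default_rng(0)
train = rng.uniform(0, 2, (10_000, 1))
incoming = rng.uniform(3, 5, (1_000, 1))
print(coverage_report(train, incoming))   # [1.] -- fully out of range
```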