Can I ignore duplicate records in a dataset for training?

Question

I have a dataset in which most records and their corresponding labels are identical; only the timestamp differs from one record to the next. If I ignore the duplicate records when training an algorithm such as DQN, is this a correct approach?

score 1 · Answer 1 · answered Aug 08 '22 at 14:59

This depends on:

Whether the timestamp carries additional information, i.e. whether the temporal dimension is relevant.
Whether you can accept a shift in the distribution of the dataset, which is what removing samples causes. For example, if you have 2 possible states, with 90 copies of one and 10 copies of the other, removing all duplicates means the model never sees the 90/10 ratio in the data and instead sees a 1/1 ratio. This can bias your model towards the underrepresented class, reducing performance.
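The distribution shift in the 90/10 example can be checked with a few lines of Python. This is only a minimal sketch on made-up toy data: the `(state, label)` tuples, the names `s1`/`s2`, and the helper `label_ratio` are assumptions for illustration, and the timestamps are assumed to have been dropped already.

```python
from collections import Counter


def label_ratio(records):
    """Return each label's share of the records as a dict of fractions."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy data: 90 copies of one (state, label) pair, 10 of another.
records = [("s1", 0)] * 90 + [("s2", 1)] * 10

# Drop exact duplicates, keeping the first occurrence of each record.
deduplicated = list(dict.fromkeys(records))

original_ratio = label_ratio(records)
dedup_ratio = label_ratio(deduplicated)

print(original_ratio)  # {0: 0.9, 1: 0.1}
print(dedup_ratio)     # {0: 0.5, 1: 0.5}
```

After deduplication only two records remain, one per state, so the model would be trained on a 1/1 ratio instead of the original 90/10.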
