I have a dataset in which most of the records and their corresponding labels are identical; only the timestamp differs from one record to the next. If I ignore the duplicate records when training an algorithm such as DQN, is this a correct approach?
1 Answer
This depends on:
- Whether the timestamp carries additional information, i.e. whether the temporal dimension is relevant to your problem.
- Whether you can accept the shift in distribution that removing samples causes. For example, if you have 2 possible states, with 90 copies of one and 10 copies of the other, removing all duplicates means the model never sees the 90/10 ratio in the data and instead sees a 1/1 ratio. This can bias your model towards the underrepresented class and reduce performance.
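To make the 90/10 example concrete, here is a minimal sketch in plain Python. The toy records, the state names `"A"` and `"B"`, and both helper functions are made up for illustration; they are not taken from your data set or from any DQN library.

```python
from collections import Counter

# Hypothetical toy data: (timestamp, state, label). Only the timestamp varies.
records = [(t, "A", 0) for t in range(90)] + [(t, "B", 1) for t in range(90, 100)]


def label_ratio(rows):
    """Return the fraction of rows carrying each (state, label) pair."""
    counts = Counter((state, label) for _, state, label in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def drop_duplicates_ignoring_timestamp(rows):
    """Keep the first row for each (state, label) pair; the timestamp is not part of the key."""
    seen = set()
    kept = []
    for row in rows:
        key = row[1:]  # (state, label)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


before = label_ratio(records)
after = label_ratio(drop_duplicates_ignoring_timestamp(records))

print("before:", before)  # 90/10 split between the two states
print("after: ", after)   # 1/1 split once duplicates are dropped
```

If you do decide to deduplicate, one option is to keep the duplicate counts and use them as sample weights, so the original ratio is still available to the model.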
Kroshtan