I have a dataset in which most records and their corresponding labels are identical; only the timestamp differs from one record to the next. If I ignore the duplicate records when training an algorithm such as DQN, is this a correct approach?

Saurav Maheshkar
Zahra

1 Answer

This depends on two things:

  1. Whether the timestamp carries additional information, i.e. whether the temporal dimension is relevant to the task.
  2. Whether you can accept the shift in the data distribution that removing samples causes. For example, if there are 2 possible states, with 90 copies of one and 10 copies of the other, removing all duplicates means the model never sees the 90/10 ratio in the data and sees a 1/1 ratio instead. This can bias your model towards the underrepresented class and reduce performance.
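
To make the second point concrete, here is a minimal sketch of the 90/10 example. The record layout `(state, label, timestamp)` and the helper names `label_ratio` and `drop_duplicates_ignoring_time` are made up for illustration; your data will look different.

```python
from collections import Counter

# Hypothetical toy data: 90 copies of one (state, label) pair and 10 of
# another. Only the timestamp (third field) differs between copies.
records = [("s0", 0, t) for t in range(90)] + [("s1", 1, t) for t in range(90, 100)]


def label_ratio(rows):
    """Return the fraction of rows carrying each label."""
    counts = Counter(label for _, label, _ in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def drop_duplicates_ignoring_time(rows):
    """Keep the first row for each (state, label) pair, ignoring timestamps."""
    seen = set()
    kept = []
    for state, label, ts in rows:
        key = (state, label)
        if key not in seen:
            seen.add(key)
            kept.append((state, label, ts))
    return kept


before = label_ratio(records)
after = label_ratio(drop_duplicates_ignoring_time(records))
print(before)  # {0: 0.9, 1: 0.1}
print(after)   # {0: 0.5, 1: 0.5}
```

After deduplication only two records remain, one per state, so the 90/10 ratio present in the original data is no longer visible to the model.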
Kroshtan