
I'm working on a dataset that contains several cumulative variables: values that always increase and depend on their previous values (such as a vehicle's odometer reading). My aim is to train ML or DL models for supervised regression and classification tasks using the variables in my dataset.

I wonder whether it makes sense to create differential variables from these cumulative variables to train ML or DL models, for example by calculating the difference between consecutive records to obtain a rate of change (and even dividing by the time interval between samples to obtain time derivatives). Could this approach provide any advantages for machine learning models? Is it standard practice?
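
For concreteness, this is a minimal sketch of the feature engineering I have in mind, assuming a pandas DataFrame with hypothetical `timestamp` and `odometer_km` columns (both names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 09:30", "2024-01-01 11:00"]
    ),
    "odometer_km": [10_000.0, 10_120.0, 10_180.0],
})

# Difference between consecutive records (rate of change per record).
df["delta_km"] = df["odometer_km"].diff()

# Time interval between consecutive samples, in hours.
df["delta_h"] = df["timestamp"].diff().dt.total_seconds() / 3600.0

# Time derivative: average speed over each interval (km/h).
# The first row has no predecessor, so these features are NaN there.
df["speed_kmh"] = df["delta_km"] / df["delta_h"]

print(df)
```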

user386164

1 Answer


In most cases, classification at the final decision point is attempting to linearly separate classes. Complex data is rarely linearly separable, so it is common to transform the data until it is.

Otherwise, the training algorithm must discover any important transformations on its own. If expert knowledge suggests that a particular transformation matters, it is often good to apply it in pre-processing so the model does not have to recreate it.

However, a transformation can also hide information or make the data harder to separate linearly. So the honest answer is that pre-processing is sometimes trial and error.

There is a clear pro to your specific example: unless a datapoint contains both the current and previous odometer readings, speed can never be derived, and any prediction that depends on speed cannot be made. The same can be said of any point-to-point derivation.
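
To illustrate, here is a minimal sketch (with hypothetical column names) that uses a lag feature to put both readings into the same row, so a model that sees one row at a time can derive speed:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:30"]),
    "odometer_km": [10_000.0, 10_120.0],
})

# shift() copies the previous record's value into the current row, so a
# row-wise model has both readings available in one datapoint.
df["odometer_prev_km"] = df["odometer_km"].shift(1)
df["hours_elapsed"] = df["timestamp"].diff().dt.total_seconds() / 3600.0

# Speed is now derivable from a single row (NaN for the first record,
# which has no predecessor).
df["speed_kmh"] = (df["odometer_km"] - df["odometer_prev_km"]) / df["hours_elapsed"]
print(df)
```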

A similar argument can be made in regression, where the goal is to find a simple polynomial for the best generalization. It may take some transformations to reach a representation in which the polynomial is simple.
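
As a toy illustration (the signal is made up): a cumulative variable that grows like t³ needs a cubic fit, while its rate of change is already captured by a simpler quadratic:

```python
import numpy as np

t = np.linspace(0.0, 10.0, 100)
cumulative = t**3                   # a cumulative signal growing like t^3
rate = np.gradient(cumulative, t)   # numerical derivative, roughly 3*t^2

# The raw signal needs a degree-3 polynomial for a good fit; its rate of
# change is well described by degree 2, i.e. a simpler representation.
print(np.polyfit(t, cumulative, 3))  # ~[1, 0, 0, 0]
print(np.polyfit(t, rate, 2))        # ~[3, 0, 0]
```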

foreverska