
Let's say I've got a training set of 1 million records, from which I pull batches of 100 to train a basic regression model using gradient descent with MSE as the loss function. Assume test and cross-validation samples have already been withheld, so all 1 million records are available for training.

Consider the following cases (sketched in code after the list):

  • Run 2 epochs (I'm guessing this one is potentially bad, as it's basically 2 separate training sets)
    • In the first epoch, train over records 1-500K
    • In the second epoch, train over records 500K-1M
  • Run 4 epochs
    • In the first and third epochs, train over records 1-500K
    • In the second and fourth epochs, train over records 500K-1M
  • Run X epochs, where each epoch trains on a random 250K samples drawn from the training set
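
To make these concrete, here's a minimal sketch of all three schemes (on a small synthetic dataset rather than the real 1M records; the model, dimensions, and learning rate are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 1M-record set (smaller here so it runs quickly):
# a linear-regression model trained with minibatch gradient descent on MSE.
N, BATCH = 10_000, 100
X = rng.normal(size=(N, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=N)

def train(index_stream, lr=0.01):
    """Run minibatch gradient descent over a stream of index arrays (one per epoch)."""
    w = np.zeros(X.shape[1])
    for idx in index_stream:
        for start in range(0, len(idx), BATCH):
            b = idx[start:start + BATCH]
            err = X[b] @ w - y[b]                  # residuals for this batch
            w -= lr * 2.0 * X[b].T @ err / len(b)  # gradient of the batch MSE
    return w

half1, half2 = np.arange(N // 2), np.arange(N // 2, N)

# Case 1: two "epochs", each covering a disjoint half of the data.
w1 = train([half1, half2])

# Case 2: four epochs alternating between the two halves.
w2 = train([half1, half2, half1, half2])

# Case 3: X epochs, each a fresh random 25% subsample (without replacement).
w3 = train([rng.choice(N, size=N // 4, replace=False) for _ in range(4)])

print(w1, w2, w3)
```

Note that case 3 samples without replacement within an epoch but independently across epochs; reshuffling the full training set each epoch is the more common variant.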

Should every epoch use the exact same samples? Is there any benefit/negative to doing so? My intuition is that any deviation in samples changes the 'topography' of the surface you're descending, but I'm not sure whether it matters so long as the samples come from the same population.
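
To state that intuition precisely (writing $f(x;\theta)$ for a generic model), each step descends the MSE surface of the current batch $B$, not of the full training set:

$$L_B(\theta) \;=\; \frac{1}{|B|} \sum_{i \in B} \bigl(y_i - f(x_i;\theta)\bigr)^2$$

So different batches really do give different surfaces; the question is whether that matters when all batches are drawn from the same population, in which case each $L_B$ has the same expectation.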

This relates to an SO question: https://stackoverflow.com/questions/39001104/in-keras-if-samples-per-epoch-is-less-than-the-end-of-the-generator-when-it

Ray

1 Answer


Your goal in regression should be to obtain the coefficients that give the best-fit model without over-fitting. The more data you have in the training set, the better your regression will be, so you want to train on as much data as possible; but you also want some data held out to validate that your model is not over-fit. This is where you should split your data into, say, an 80/20 training and validation set. If data is scarce, or you want that 20% to contribute to the model as well, you could do a 5-fold cross-validation.
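
For instance, a minimal sketch of both routes using scikit-learn (the dataset and model here are synthetic placeholders; the same pattern applies unchanged to your 1M records):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Route 1: 80/20 train/validation split.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("hold-out R^2:", model.score(X_val, y_val))

# Route 2: 5-fold cross-validation, so every record contributes
# to both fitting and validation across the folds.
scores = []
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = LinearRegression().fit(X[tr_idx], y[tr_idx])
    scores.append(m.score(X[val_idx], y[val_idx]))
print("5-fold mean R^2:", np.mean(scores))
```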

In the spirit of research, perhaps you should try both of these routes and report your findings.

Snives