Here's a BackProp Algo definition from here:

Initially all the edge weights are randomly assigned. For every input in the training dataset, the ANN is activated and its output is observed. This output is compared with the desired output that we already know, and the error is “propagated” back to the previous layer. This error is noted and the weights are “adjusted” accordingly. This process is repeated until the output error is below a predetermined threshold.

Something I'm not understanding here: If inputs are fed one by one, and the weights are adjusted for each input, won't the NN essentially be trained for the last input?

Please clarify. Thank you.

1 Answer


The concern you've raised touches on the training method known as online learning or stochastic gradient descent (SGD), where weights are updated after each individual training example. This is a common misunderstanding: even though weights are updated after each individual example, the process iterates over the entire dataset many times, and each full pass is called an epoch. So the network does not end up fitted only to the last input. There's also a simple example in the cited reference.

As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.
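
To make this concrete, here is a minimal sketch of that loop: a made-up toy dataset and a single linear neuron (not the network or code from the question), with per-example updates repeated over several shuffled passes.

```python
import numpy as np

# Hypothetical illustration: train one linear neuron with per-example
# (online/SGD) updates, sweeping over the whole dataset for several epochs.
rng = np.random.default_rng(0)

# Toy dataset: y = 2*x1 - 3*x2 + 1 plus a little noise (made up for this sketch)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.1 * rng.normal(size=100)

w = rng.normal(size=2)   # weights start out randomly assigned, as in the quoted definition
b = 0.0
lr = 0.01                # fixed learning rate (an adaptive schedule also works)

for epoch in range(50):                      # several passes (epochs) over the training set
    order = rng.permutation(len(X))          # shuffle each pass to prevent cycles
    for i in order:
        pred = X[i] @ w + b                  # forward pass for ONE example
        err = pred - y[i]                    # compare with the desired output
        w -= lr * err * X[i]                 # adjust weights using this example only
        b -= lr * err
    # after many epochs the weights reflect ALL examples, not just the last one

print(w, b)   # approaches [2, -3] and 1
```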

The network is exposed to each training example many times over multiple epochs. Each example contributes to the weight updates incrementally, so the model learns from the entire dataset, whose average loss is exactly what the usual empirical risk minimization (ERM) framework aims to minimize. The updates in SGD are noisier than those of batch gradient descent over the entire dataset, but much cheaper per step, and the noise can even help the model escape poor local minima and find a better solution.
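
In standard notation (not from the original post): ERM minimizes the average loss over all $n$ training examples, and each SGD step uses a single example's gradient as a cheap, noisy estimate of the full batch gradient:

$$\hat{R}(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_w(x_i), y_i\big), \qquad
\underbrace{w \leftarrow w - \eta\,\nabla_w \hat{R}(w)}_{\text{batch GD}}, \qquad
\underbrace{w \leftarrow w - \eta\,\nabla_w \ell\big(f_w(x_i), y_i\big)}_{\text{SGD}}.$$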
