After reading through this thread and some other resources online, I still do not understand the role of epochs in training a neural network. I understand that one epoch is one iteration through the entire data set. But I don't understand what happens when we finish one epoch and start the second, i.e., what is happening such that the error is reduced after each epoch?
Assume we have an outer loop over epochs and an inner loop for the gradient descent iterations. What happens, at least to my understanding, is the following:
For the first epoch, we take some random initial parameters and perform, say, 1000 gradient descent steps until we have found a local minimum where the loss is minimised. So we have landed in a local minimum, but it could be a bad one and we want to find a better one, preferably even the global minimum. However, this is what I was thinking about:
The only way we can escape a bad local minimum is to start a new gradient descent search with different initial parameters, but it seems like starting a new epoch is not just taking different parameters and repeating the process. According to the loss plotted over epochs, the loss decreases after each epoch, so it seems like the parameters are not chosen randomly; instead, it looks like some knowledge from the previous epoch is used. At first, I assumed that a new epoch just means starting a new search over the loss landscape, this time with another set of random initial values, such that we end up in a different local minimum. However, it is not guaranteed that this local minimum is a better one, so we could also land in a worse one.
But if I look at graphs where the loss is plotted over epochs, it almost always decreases quite smoothly with each epoch, indicating that every epoch does indeed lead to a better minimum. This is where I get confused. How is this possible? What is happening at the start of a new epoch? In order to decrease the error, we must have knowledge from the previous epoch, which means that we do not simply start with a different random set of initial parameters. So, how does a new epoch contribute to a better result? What is happening such that the loss keeps decreasing without ever rising due to worse local minima?
------------ EDIT ------------
I want to add this simple example: consider a simple linear network, no activation function. We have 2 inputs, each multiplied by a weight to generate a scalar output, i.e. the first step would look like:
x1[0]*w1 + x2[1]*w2 = out
where x1[0] contains the initial value and x2 is a different, known value. We want to find the weights w1 and w2 which minimise the loss between the output and the ground truth. In this case, I will work with one batch only, which contains the 2 inputs. The goal is to approximate a function linearly, given a known initial value. I would then run the training loop as follows:
...
net = Net(2,1)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(),lr=0.01)
for m in range(1,11):
    inputs = torch.Tensor([x[m-1], x[m]])
    for iteration in range(5000):
        net.zero_grad()
        out = net(inputs)
        loss = criterion(out, truth)
        loss.backward()
        optimizer.step()
This is just a snapshot of the loop, but basically the idea here is to find the appropriate weights for each point m of the function I want to approximate, by repeating the gradient descent search until we have found the local minimum, and then to proceed to the next point m+1 and so on. Using this example, how and why would a third loop over epochs decrease the error?
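To make this concrete, here is a minimal sketch of the three-loop version I have in mind, written in plain Python instead of PyTorch so that the gradient step is explicit. The data (`x[m] = 0.1*m`, targets generated from the weights 0.5 and 1.5), the zero initial weights, the smaller step counts, and the helper names `mse` and `train` are assumptions I made up for illustration; they are not part of my actual code. Note that in this sketch the weights are created once, before the epoch loop, and are never reset, so each epoch starts from wherever the previous one ended.

```python
def mse(points, w1, w2):
    """Mean squared error of out = x_prev*w1 + x_curr*w2 over all points."""
    return sum((xp * w1 + xc * w2 - t) ** 2 for xp, xc, t in points) / len(points)


def train(epochs=20, steps_per_point=50, lr=0.01):
    # Hypothetical data: x[m] = 0.1*m, with targets generated by known
    # weights (0.5, 1.5) so that a perfect linear fit exists.
    x = [0.1 * m for m in range(11)]
    points = [(x[m - 1], x[m], 0.5 * x[m - 1] + 1.5 * x[m]) for m in range(1, 11)]

    # Weights are created ONCE, before the epoch loop, and never reset.
    w1, w2 = 0.0, 0.0
    history = [mse(points, w1, w2)]  # error before any training

    for epoch in range(epochs):                 # outer loop: epochs
        for xp, xc, target in points:           # middle loop: points m
            for _ in range(steps_per_point):    # inner loop: gradient steps
                err = xp * w1 + xc * w2 - target
                # Gradient of err**2 with respect to (w1, w2) is 2*err*(xp, xc).
                w1 -= lr * 2 * err * xp
                w2 -= lr * 2 * err * xc
        history.append(mse(points, w1, w2))     # error after this epoch
    return (w1, w2), history


if __name__ == "__main__":
    (w1, w2), history = train()
    print("final weights:", w1, w2)
    for epoch, value in enumerate(history):
        print(epoch, value)
```

Calling `train()` returns the final weights together with the mean squared error over all points before training and after each epoch, which is the kind of per-epoch curve I am asking about.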
 
    