Deep networks notoriously take a long time to train.
What is the most time-consuming aspect of training them? Is it the matrix multiplications? Is it the forward pass? Is it some component of the backward pass?
Check out Figure 6 in this paper: PyTorch Distributed: Experiences on Accelerating Data Parallel Training
It breaks down the latency of the forward pass, the backward pass, the communication step, and the optimization step for running both ResNet50 and BERT on NVIDIA Tesla V100 GPUs.
By measuring the pixels in the figure, I estimated the times for the forward, backward, and optimization steps as percentages of their combined total. (I ignored the communication step, because the figure includes it only to show how long an unoptimized communication step would take in data-parallel training.) Here are the estimates I got:
So the backward pass takes about 3x as long as the forward pass, and the optimization step is relatively fast.
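If you would rather measure this split on your own model than read it off a figure, a small timer is enough. The following is a minimal sketch, not a method from the paper: `PhaseTimer` and its `sync` argument are hypothetical names introduced here, and the `time.sleep` calls are stand-ins for the real forward and backward passes. On a GPU the work is typically launched asynchronously, so the clock should only be read after synchronizing (in PyTorch, for example, by passing `torch.cuda.synchronize` as `sync`).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase (hypothetical helper)."""

    def __init__(self, sync=None):
        # sync: optional callable invoked before every clock read, e.g. a
        # device synchronization function when work runs asynchronously.
        self.sync = sync
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        if self.sync:
            self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync:
                self.sync()
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


timer = PhaseTimer()
for _ in range(3):                      # three pretend training iterations
    with timer.phase("forward"):
        time.sleep(0.001)               # stand-in for the forward pass
    with timer.phase("backward"):
        time.sleep(0.003)               # stand-in for the backward pass
print(timer.shares())
```

In a real training loop you would wrap the forward call, the backward call, and the optimizer step in one `phase` block each, and discard the first few iterations as warm-up.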
 
The output of a layer can be computed once the output of the previous layer is known. Within a layer, the GPU can parallelize this computation across units and across the minibatch by expressing it as one large matrix operation, but the computation has to proceed sequentially from layer to layer (earlier layers to later layers). As for layer type, convolutions and especially fully connected layers can result in large matrix computations.
The gradient of the loss with respect to a layer's input (and its parameters) can only be computed given the gradient with respect to the layer's output (which is the input gradient of the subsequent layer) and the input to the layer (the output of the previous layer). This can again be parallelized within a layer and across the minibatch, but it is sequential from later layers to earlier layers. Moreover, since the backward pass relies on the outputs of the forward pass, all intermediate layer outputs have to be cached for the backward pass, which results in high (GPU) memory usage.
So these two steps make a single training iteration slow and, depending on your network, heavy on GPU memory. It is worth reading up on the backpropagation algorithm, which explains essentially all of this.
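The two passes described above can be sketched in a few lines of NumPy. This is a toy illustration under assumptions not in the answer: the layers are bias-free fully connected layers with `tanh` activations, the loss is simply the sum of the outputs, and the names `forward` and `backward` are made up for this example. It shows the sequential loop over layers, the matrix products over the whole minibatch, and the cached activations that the backward pass needs.

```python
import numpy as np


def forward(x, weights):
    """Run the layers in order and keep every activation for the backward pass."""
    activations = [x]
    for W in weights:  # sequential: layer i needs the output of layer i-1
        # One matrix product handles the whole minibatch at once.
        activations.append(np.tanh(activations[-1] @ W))
    return activations


def backward(activations, grad_output, weights):
    """Walk the layers in reverse order, reusing the cached activations."""
    grads = [None] * len(weights)
    g = grad_output  # gradient of the loss w.r.t. the last layer's output
    for i in reversed(range(len(weights))):
        g = g * (1.0 - activations[i + 1] ** 2)  # back through tanh (cached output)
        grads[i] = activations[i].T @ g          # gradient w.r.t. W_i (cached input)
        g = g @ weights[i].T                     # gradient w.r.t. the layer input
    return grads, g


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))  # minibatch of 8 examples with 4 features
weights = [rng.standard_normal((4, 16)) * 0.1,
           rng.standard_normal((16, 3)) * 0.1]

activations = forward(x, weights)
# With loss = sum of outputs, the gradient w.r.t. the output is all ones.
grads, grad_x = backward(activations, np.ones_like(activations[-1]), weights)
```

Note that `activations` holds one array per layer for the entire minibatch until the backward pass has consumed it, which is where the memory cost comes from.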
Moreover, training a network from scratch generally takes many iterations. Especially in the earlier layers, the parameter updates are based on gradients that have been propagated back through many other layers, which can make the updates noisy, so they do not always push the network parameters directly in the right direction. In contrast, fine-tuning a pre-trained network on a new task can often be done with far fewer training iterations.
 
There is no single hard, slow step in training neural networks. The forward pass involves a large number of matrix multiplications, and so does the backward pass. Even though there are highly optimized libraries for matrix multiplication, neural networks operate on very high-dimensional tensors in both the forward and backward passes, which makes training expensive. However, for large networks the backward pass would be far slower, or even intractable, without backpropagation, since computing the derivatives is time-consuming.
Refer to the training results in DeepBench for exact numbers: https://github.com/baidu-research/DeepBench#types-of-operations
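To make the point about matrix multiplications in both passes concrete, here is a rough operation count for a single fully connected layer. It is a back-of-the-envelope sketch under stated assumptions: the function name `dense_layer_flops` is made up for this example, each multiply-add is counted as two floating-point operations, and biases, activation functions, and memory traffic are ignored, so measured timings can differ from these ratios.

```python
def dense_layer_flops(batch, n_in, n_out):
    """Approximate FLOPs for one fully connected layer Y = X @ W (bias ignored)."""
    matmul = 2 * batch * n_in * n_out  # one multiply and one add per term
    return {
        "forward": matmul,             # Y  = X @ W
        "backward": 2 * matmul,        # dX = dY @ W.T  and  dW = X.T @ dY
    }


counts = dense_layer_flops(batch=32, n_in=1024, n_out=1024)
print(counts["backward"] / counts["forward"])
```

Under these assumptions the backward pass of such a layer needs two matrix products of the same size as the single product in the forward pass, so it does about twice the arithmetic.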
