I am training an RNN in PyTorch to produce captions for images. It's a pretty standard architecture: a pre-trained InceptionV3 extracts features from the image, the recurrent module processes the words seen so far, and its output is merged with the image features and fed through a linear layer to produce a probability distribution over the vocabulary for the next word. A network trained this way can auto-regressively predict a whole caption for an image (it's seeded with an artificial <START> token so that there is a preceding sequence to start from).
To train the network, every caption is decomposed like this in the training dataset:
X, y
<START>, a
<START> a, man
<START> a man, is
<START> a man is, sitting
<START> a man is sitting, on
<START> a man is sitting on, a
<START> a man is sitting on a, chair
<START> a man is sitting on a chair, <END>
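For reference, the decomposition above can be sketched in a few lines of plain Python. This is only an illustration: the function name `decompose_caption` and whitespace tokenization are assumptions made here, not part of my actual pipeline.

```python
def decompose_caption(caption, start_token="<START>", end_token="<END>"):
    """Split one caption into (input prefix, next word) training pairs."""
    # Whitespace tokenization is an assumption made for this sketch.
    tokens = [start_token] + caption.split() + [end_token]
    # For every position i >= 1, the input X is everything before position i
    # and the target y is the token at position i.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = decompose_caption("a man is sitting on a chair")
for prefix, target in pairs:
    print(" ".join(prefix) + ", " + target)
```

A caption of n words yields n + 1 examples whose input lengths run from 1 to n + 1, which is where the length variation comes from.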
The result is a lot of variation in input sequence length across the training data, which is a problem because batches would need a lot of padding. I am aware that there are techniques for dealing with this (pack_padded_sequence in PyTorch), but I came up with another idea.
I am modifying the Sampler object that provides indices to the DataLoader so that it orders all of the example indices by input sequence length, groups them into batch-size chunks, and shuffles the chunks before feeding them to the DataLoader. As a result, the sequences within an individual batch are almost always the same length, not because of padding but because the examples were ordered by length before being grouped. Some batches will straddle the border between two length groups, so some padding will be necessary, but it will be minimal (for example, one zero on each of the shorter sequences in a batch with lengths [5, 5, 5, 6, 6, 6]) and the number of such batches is negligible (~0.002% of the batches in my case).
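A minimal sketch of the idea, written in plain Python so it runs without PyTorch. The class name `LengthBucketBatchSampler`, the helper `padding_needed`, and the toy `lengths` list are all hypothetical names invented for this illustration, and my real Sampler differs in detail. As far as I know, `DataLoader` accepts any iterable of index lists through its `batch_sampler` argument, so an object like this could be passed there.

```python
import random


class LengthBucketBatchSampler:
    """Yield lists of example indices whose inputs have (nearly) equal length."""

    def __init__(self, lengths, batch_size, seed=None):
        self.lengths = list(lengths)      # input sequence length per example
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        # Order all example indices by input sequence length.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        # Cut the ordered indices into consecutive batch-size chunks.
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        # Shuffle the chunks so batch lengths vary over the course of an epoch.
        self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size


def padding_needed(batch, lengths):
    """Total number of pad tokens needed to bring a batch to its longest length."""
    longest = max(lengths[i] for i in batch)
    return sum(longest - lengths[i] for i in batch)


# Toy data: six examples of length 3, three of length 5, three of length 6.
lengths = [5, 6, 5, 6, 5, 6, 3, 3, 3, 3, 3, 3]
sampler = LengthBucketBatchSampler(lengths, batch_size=6, seed=0)
batches = list(sampler)
padding = [padding_needed(b, lengths) for b in batches]
```

One property of this sketch worth noting: Python's sort is stable, so examples of equal length always land in the same chunk, and batch membership is identical in every epoch. Only the order of the batches changes.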
My question is: does this method have a drawback that I am not aware of? I find it simpler than building completely random batches and then padding and packing them. However, I can't tell whether it would introduce some kind of bias into the model. My intuition tells me that this should not be a problem, since the sequence lengths of the batches would vary greatly across training (because I am shuffling the index chunks), but I wanted to get a second opinion. I guess the question boils down to: is sequence length variation within a batch important for proper training? Or, for this specific case: is it important that the net processes examples decomposed from a single caption within the same batch, or close to each other in time?
I'd be happy to train the network with both approaches and compare the results myself, but that is quite time-consuming and I'd prefer to avoid doing something stupid if I can.