I am working on an implementation of the OpenPose algorithm for real-time human body pose estimation.
According to the official paper, OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, the network estimates $S$ fields (body part confidence maps) and $L$ fields (part affinity fields). These have the same size as the input image and, according to the paper, are output at a given step of the forward pass (after a given number of $S$ and $L$ stages). However, before entering these stages the image is passed through the initial layers of the VGG-19 model, so the spatial resolution is reduced and the features that finally reach the $S$ and $L$ stages have a different dimensionality.
The whole network is convolutional; there are no FC layers at all. The VGG-19 part is the only one that contains max-pooling layers, so it alone reduces the spatial resolution (and enlarges the receptive fields).
My point is: after running the stages, I get tensors of shape [batch_size, filter_number, 28, 28]. The issue is that the paper does not state how to decode this information into the $L$ and $S$ maps of size $224 \times 224$.
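For context, this is the quick sanity check I ran to confirm the $28 \times 28$ size (this is my own assumption that the paper's "first 10 layers of VGG-19" correspond to `features[:23]` in torchvision's layout, i.e. up to conv4_2 with three max-pool layers in between, giving stride 8):

```python
import torch
import torchvision

# First 10 conv layers of VGG-19 (three max-pool layers -> total stride 8).
backbone = torchvision.models.vgg19().features[:23]

x = torch.randn(1, 3, 224, 224)   # dummy input image
with torch.no_grad():
    f = backbone(x)
print(f.shape)                    # torch.Size([1, 512, 28, 28])
```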
Following the traditional approach and decoding the final tensors with a fully connected layer, say $15000 \rightarrow (224 \cdot 224 \cdot \text{number of body parts}) + (224 \cdot 224 \cdot \text{number of limbs} \cdot 2)$, is out of the question for any domestic computer: the weight matrix alone is enormous, and I presume I would need at least 128 GB of RAM, which I do not have.
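Here is my back-of-the-envelope estimate for that FC layer (assuming the COCO configuration of 18 body parts + background = 19 heatmaps and 19 limbs, i.e. 38 PAF channels; those counts are my assumption):

```python
# Weight count of a single FC layer decoding the flattened stage output
# into full-resolution S and L maps.
in_features  = 19 * 28 * 28                      # ~15k flattened stage output
out_features = 224 * 224 * 19 + 224 * 224 * 38   # S maps + L maps = 2,860,032
params = in_features * out_features              # entries of the weight matrix
print(params)                    # ~42.6 billion weights
print(params * 4 / 1024**3)      # ~159 GB in float32, before gradients/optimizer state
```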
Another solution is to remove the max-pooling layers from the VGG-19 part. The map size is then preserved at $224$ instead of $28$, but the huge number of computations and intermediate activations that must be stored also leads to memory errors.
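A rough comparison of the activation memory for a single 512-channel feature map in float32, with and without the pooling, shows why (the layer count and batch-size scaling below are my own rough assumptions):

```python
# One 512-channel feature map per image, float32, pooled vs. unpooled spatial size.
bytes_pooled   = 512 * 28 * 28 * 4
bytes_unpooled = 512 * 224 * 224 * 4
print(bytes_pooled / 1024**2)    # ~1.5 MB per image per layer
print(bytes_unpooled / 1024**2)  # ~98 MB per image per layer
# Multiplied by ~10 backbone layers, the batch size, and the activations kept
# for backprop, the unpooled variant quickly exhausts GPU memory.
```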
So the problem is: how can I get to a final output of $224 \times 224$, without FC layers, from a tensor of shape [batch_size, bodyparts, 28, 28]?
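Just to make the target shape transformation concrete, here is a minimal sketch using bilinear upsampling (only to illustrate the shapes I need to reach, not necessarily what the paper intends):

```python
import torch
import torch.nn.functional as F

stage_out = torch.randn(1, 19, 28, 28)            # [batch_size, bodyparts, 28, 28]
maps = F.interpolate(stage_out, size=(224, 224),
                     mode="bilinear", align_corners=False)
print(maps.shape)                                 # torch.Size([1, 19, 224, 224])
```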
This does not seem to have an easy answer. I will check a TensorFlow implementation I have seen around to see how it solves the problem.
Any other ideas are greatly welcome.