How can I classify a given sequence of images (video) as either moving or staying still from the perspective of the person inside the car?
Below are example sequences of 12 images each, shown as animations.
- Moving, from the point of view of the person inside the car.
- Staying still, from the point of view of the person inside the car.
Methods I tried to achieve this:
- A simple CNN (with 2D convolutions) with the 12 greyscale images stacked along the channel dimension (like DeepMind's DQN). The input shape is (batch_size, 200, 200, 12).
- A CNN with 3D convolutions. The input shape is (batch_size, 12, 200, 200, 1).
- A CNN+LSTM (time-distributed 2D convolutions followed by an LSTM). The input shape is (batch_size, 12, 200, 200, 1).
- The late fusion method: two frames that are some time steps apart are taken from the sequence and passed separately through two CNNs with shared weights, and the outputs are concatenated in a dense layer, as described in this paper. This is like the CNN+LSTM without the LSTM part. The input shape is (batch_size, 2, 200, 200, 1), where the two images are the first and last frames of the sequence.
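To make the input shapes above concrete, here is a minimal NumPy sketch of how one batch of clips could be arranged for each method. The helper name `make_inputs` and the random data are placeholders for illustration only, and the sketch assumes the clips start as an array of shape (batch_size, 12, 200, 200); it is not my actual data pipeline.

```python
import numpy as np


def make_inputs(frames):
    """Arrange greyscale clips of shape (batch, 12, 200, 200) for each method.

    Hypothetical helper for illustration; returns a dict of NumPy arrays.
    """
    # 2D CNN: move the time axis last so the 12 frames act as channels.
    stacked_2d = np.transpose(frames, (0, 2, 3, 1))  # (batch, 200, 200, 12)

    # 3D CNN and CNN+LSTM: keep the time axis and add a single channel axis.
    sequence = frames[..., np.newaxis]  # (batch, 12, 200, 200, 1)

    # Late fusion: keep only the first and last frame of each clip.
    pair = sequence[:, [0, -1]]  # (batch, 2, 200, 200, 1)

    return {"cnn2d": stacked_2d, "cnn3d": sequence, "cnn_lstm": sequence, "late_fusion": pair}


# Random stand-in data: a batch of 8 clips, 12 frames each, 200x200 pixels.
rng = np.random.default_rng(0)
clips = rng.random((8, 12, 200, 200), dtype=np.float32)

inputs = make_inputs(clips)
for name, array in inputs.items():
    print(name, array.shape)
```

The 3D CNN and the CNN+LSTM receive the same array; only the layers applied to it differ.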
All the methods I tried failed to achieve my objective. I tried tuning various hyperparameters, like the learning rate, the number of filters in CNN layers, etc., but nothing worked.
All the methods used a batch_size of 8 (due to memory constraints), and all images were converted to greyscale. I used ReLU activations and a softmax in the last layer. No pooling layers were used.
Any help on why my methods are failing, or any pointers to related work, would be appreciated.


 
     
    