I am trying to train a multi-label video classification model. My dataset consists of a single video sampled at 1 fps, for a total of 12k frames and 21 classes; multiple classes can be present in a single frame. I added a simple classification head to my Video Swin Transformer backbone as follows:
import torch.nn as nn

# NUM_CLASSES = 21
model.head = nn.Sequential(
    nn.LayerNorm(model.num_features),
    nn.Dropout(p=0.5),
    nn.Linear(model.num_features, 512),
    nn.GELU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, NUM_CLASSES),  # one logit per class for multi-label output
)
I sample clips from the video so that a new clip starts each time the set of active classes changes, which means the clip length varies. The labels look like this:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')
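Roughly, my clip sampling works like this (a simplified sketch; split_into_clips and frame_labels are placeholder names for my actual preprocessing, where frame_labels is the (num_frames, 21) multi-hot matrix):

import torch

def split_into_clips(frame_labels: torch.Tensor):
    """Split frame indices into clips: a new clip starts whenever the
    multi-hot label vector changes (simplified sketch)."""
    clips = []
    start = 0
    for t in range(1, frame_labels.shape[0]):
        if not torch.equal(frame_labels[t], frame_labels[t - 1]):
            clips.append((start, t))  # frames [start, t) share one label vector
            start = t
    clips.append((start, frame_labels.shape[0]))
    return clips  # list of (start, end) ranges, each with a constant label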
Then I simply train with BCE loss on the logits:
import torch.nn.functional as F

outputs = model(inputs)  # (1, NUM_CLASSES) logits for the clip
loss = F.binary_cross_entropy_with_logits(outputs, labels)
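For completeness, the rest of the per-clip update is just the standard PyTorch step (optimizer here is a placeholder for whatever optimizer is used; this is only a sketch):

# inputs has shape roughly (1, C, T, H, W) with a different T per clip,
# so each update is done with batch size 1.
optimizer.zero_grad()
loss.backward()
optimizer.step()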
Lastly, after training is complete, I simply apply a sigmoid to the model's output logits and use a different threshold for each class.
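Concretely, the post-processing looks roughly like this (thresholds is an illustrative length-21 tensor of per-class cut-offs, not my exact values):

probs = torch.sigmoid(logits)          # logits: (N, NUM_CLASSES) raw model outputs
preds = (probs > thresholds).float()   # multi-hot predictions via per-class thresholds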
However, this approach comes with limitations:
- I can only use a batch size of 1, since the T (number of frames) dimension differs for each clip.
- The model can't learn from the transitions between clips.
- In a real-world scenario I want to feed in a whole video and run inference on evenly sampled clips, which means there may not be a single "true" label for an input clip (see the sketch below).
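To illustrate that last point, this is roughly the inference I have in mind (evenly_sampled_clips, clip_len and stride are illustrative, not my actual code):

def evenly_sampled_clips(video, clip_len=32, stride=16):
    """video: (C, T, H, W) tensor; yields (1, C, clip_len, H, W) clips."""
    T = video.shape[1]
    for start in range(0, max(T - clip_len, 0) + 1, stride):
        yield video[:, start:start + clip_len].unsqueeze(0)

# Each such clip can span a label transition, so a single multi-hot target
# per clip is not well defined, which is exactly the problem.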
For these reasons, I want to write a custom loss function that can somehow handle the transitions between labels, but I am stuck.
I could not find any source on this specific problem. How could I modify my code?