I am trying to train a multi-label video classification model. My dataset consists of a single video sampled at 1 fps, for a total of 12k frames and 21 classes; multiple classes can be present in a single frame. I added a simple classification head to my Video Swin Transformer backbone as follows:
import torch.nn as nn

# NUM_CLASSES = 21
model.head = nn.Sequential(
    nn.LayerNorm(model.num_features),
    nn.Dropout(p=0.5),
    nn.Linear(model.num_features, 512),
    nn.GELU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, NUM_CLASSES),  # one logit per class for multi-label output
)
I sample clips from the video so that a new clip starts each time the set of active classes changes, which means the clip length varies. The labels look like this:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')
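Roughly, my clip sampling works like this (a simplified sketch; split_into_clips and frame_labels are placeholder names for my actual preprocessing, where frame_labels is the (num_frames, 21) multi-hot matrix):

import torch

def split_into_clips(frame_labels: torch.Tensor):
    """Split frame indices into clips: a new clip starts whenever the
    multi-hot label vector changes (simplified sketch)."""
    clips = []
    start = 0
    for t in range(1, frame_labels.shape[0]):
        if not torch.equal(frame_labels[t], frame_labels[t - 1]):
            clips.append((start, t))  # frames [start, t) share one label vector
            start = t
    clips.append((start, frame_labels.shape[0]))
    return clips  # list of (start, end) ranges, each with a constant label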
Then I simply train with BCE loss on the logits:
import torch.nn.functional as F

outputs = model(inputs)  # (1, NUM_CLASSES) logits for the clip
loss = F.binary_cross_entropy_with_logits(outputs, labels)
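For completeness, the rest of the per-clip update is just the standard PyTorch step (optimizer here is a placeholder for whatever optimizer is used; this is only a sketch):

# inputs has shape roughly (1, C, T, H, W) with a different T per clip,
# so each update is done with batch size 1.
optimizer.zero_grad()
loss.backward()
optimizer.step()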
Lastly, after training is complete, I simply apply a sigmoid to the model's output logits and use a different threshold for each class.
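Concretely, the post-processing looks roughly like this (thresholds is an illustrative length-21 tensor of per-class cut-offs, not my exact values):

probs = torch.sigmoid(logits)          # logits: (N, NUM_CLASSES) raw model outputs
preds = (probs > thresholds).float()   # multi-hot predictions via per-class thresholds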
However, this approach comes with limitations:
- I can only use a batch size of 1, since the T (number of frames) dimension differs for each clip.
- The model can't learn from the transitions between clips.
- In a real-world scenario I want to feed in a whole video and run inference on evenly sampled clips, which means there may not be a single "true" label for an input clip (see the sketch below).
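To illustrate that last point, this is roughly the inference I have in mind (evenly_sampled_clips, clip_len and stride are illustrative, not my actual code):

def evenly_sampled_clips(video, clip_len=32, stride=16):
    """video: (C, T, H, W) tensor; yields (1, C, clip_len, H, W) clips."""
    T = video.shape[1]
    for start in range(0, max(T - clip_len, 0) + 1, stride):
        yield video[:, start:start + clip_len].unsqueeze(0)

# Each such clip can span a label transition, so a single multi-hot target
# per clip is not well defined, which is exactly the problem.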
For these reasons, I want to write a custom loss function that can somehow handle the transitions between labels, but I am stuck.
I could not find any source on this specific problem. How could I modify my code?