I'm on a robotics team and we've been tasked with writing a program to differentiate between a live fish and a dead one. We've been given ~15 minutes of training footage and it's absolutely terrible. It's low quality, hard to label (even for humans), and only about 20 frames a second.
I have tried everything I can think of: YOLO, 3D convolutions (to take movement over time into account), residual networks with anywhere from 1 to 10 layers, and more. I have concluded that the problem is the data, which is just terrible.
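To make the 3D-convolution setup concrete: that kind of model takes short stacks of consecutive frames rather than single images. Below is a minimal sketch of the clip-building step, assuming NumPy and frames already decoded into an array. The names `make_clips`, `clip_len`, and `stride`, the values chosen for them, and the 32x32 frame size are hypothetical and only for illustration; this is not necessarily how the actual pipeline was built.

```python
import numpy as np

def make_clips(frames, clip_len=16, stride=8):
    """Slice a (T, H, W, C) frame array into overlapping clips.

    Returns an array of shape (N, C, clip_len, H, W), the channels-first
    layout that 3D convolution layers such as PyTorch's Conv3d expect.
    clip_len and stride are illustrative defaults (hypothetical values).
    """
    frames = np.asarray(frames)
    total = frames.shape[0]
    if total < clip_len:
        raise ValueError("not enough frames for a single clip")
    # Start index of each clip; consecutive clips overlap when stride < clip_len.
    starts = range(0, total - clip_len + 1, stride)
    clips = np.stack([frames[s:s + clip_len] for s in starts])  # (N, clip_len, H, W, C)
    return clips.transpose(0, 4, 1, 2, 3)                       # (N, C, clip_len, H, W)

# Two seconds of synthetic 20 fps footage at 32x32 RGB (placeholder data).
video = np.zeros((40, 32, 32, 3), dtype=np.uint8)
clips = make_clips(video)
print(clips.shape)  # (4, 3, 16, 32, 32)
```

At roughly 20 frames a second, a 16-frame clip covers a bit under one second of motion, so the clip length directly limits how much movement the network can see at once.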
Is there anything I can do to fix this? I know of data augmentation and have used it, but it doesn't make the data any more useful; it just creates more terrible data. I also feel like using machine learning to clean the data wouldn't be helpful. There are studies (I can't remember their names) showing that adding one white pixel to an image can completely confuse an object classifier, so I assume that using a machine learning model to alter an image would confuse a network as well. Is this an accurate assumption?
Either way: is there any way to improve the data I've been given? Or is there another way to approach this problem?
 
    