Suppose we have $1000$ products that we want to detect. For each of these products, we have $500$ training images/annotations. Thus we have $500,000$ training images/associated annotations. If we want to train a good object detection algorithm to recognize these objects (e.g. YOLO) would it be better to have multiple detection models? In other words, should we have 10 different YOLO models where each YOLO model is responsible for detecting 100 products? Or is it good enough to have one YOLO model that can detect all 1000 products? Which would be better in terms of mAP/recall/precision?
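1 Answer
This is known as decomposing a multi-class classifier. Your proposed method is closely related to one-vs-all; the difference is that each of your models handles a group of 100 classes rather than a single class.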
One vs. all provides a way to leverage binary classification. Given a classification problem with $N$ possible solutions, a one-vs.-all solution consists of $N$ separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question.
Source: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/one-vs-all.
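To make the quoted description concrete, here is a minimal one-vs-all sketch in plain Python. It is only an illustration on made-up 2-D toy data, using logistic regression as the binary classifier; the names `train_binary`, `train_one_vs_all`, `predict`, `POINTS` and `LABELS` are hypothetical, and nothing here is specific to YOLO or to the linked course.

```python
import math

# Toy data (an assumption for illustration): three well-separated 2-D clusters.
POINTS = [(0, 0), (1, 0), (0, 1),   # class 0
          (5, 0), (6, 0), (5, 1),   # class 1
          (0, 5), (1, 5), (0, 6)]   # class 2
LABELS = [0, 0, 0, 1, 1, 1, 2, 2, 2]


def train_binary(points, targets, steps=5000, lr=0.1):
    """Fit one binary logistic-regression classifier with batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x, y), t in zip(points, targets):
            z = w[0] * x + w[1] * y + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "this class"
            err = p - t
            grad_w[0] += err * x
            grad_w[1] += err * y
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def train_one_vs_all(points, labels, num_classes):
    """Train N binary classifiers; classifier c answers 'is this class c or not?'."""
    models = []
    for c in range(num_classes):
        targets = [1.0 if label == c else 0.0 for label in labels]
        models.append(train_binary(points, targets))
    return models


def predict(models, point):
    """Score the point with every binary classifier and return the best class index."""
    x, y = point
    scores = [w[0] * x + w[1] * y + b for w, b in models]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    models = train_one_vs_all(POINTS, LABELS, num_classes=3)
    print([predict(models, p) for p in POINTS])
```

Note that prediction has to evaluate all $N$ classifiers for every input, which is where the extra inference cost of this kind of decomposition comes from.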
According to this article, the author ran experiments with SVMs on 8 different benchmark problems. The results show that this method is sometimes as good as the alternatives, but usually not the best, and it is never substantially better than any other method. The article also states that the best method is usually problem dependent.
Also, this method will slow down inference considerably and use a substantial amount of GPU memory, since every image has to pass through all of the models. According to the source, it does not improve accuracy much, so your best bet for higher accuracy is probably a different model architecture, for example Faster R-CNN with FPN (listed as "FPN FRCN" in the YOLOv3 paper), which that paper reports as having the best accuracy but slow inference. YOLOv3 is designed for fast inference, to provide a real-time object detection system, so if accuracy matters more than speed you should probably use another architecture instead.
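For a rough picture of what the multi-model setup from the question involves at inference time, here is a sketch under stated assumptions. The detectors are stand-in callables, not a real YOLO API, and the names `detect_with_ensemble`, `Detection` and `classes_per_model` are hypothetical. Each model is assumed to return boxes with class ids local to its own group, which are then shifted into one global id space.

```python
from typing import Callable, List, Tuple

# One detection: (x1, y1, x2, y2, confidence, class_id)
Detection = Tuple[float, float, float, float, float, int]
# A detector is any callable that maps an image to a list of detections.
Detector = Callable[[object], List[Detection]]


def detect_with_ensemble(image: object,
                         detectors: List[Detector],
                         classes_per_model: int) -> List[Detection]:
    """Run every detector on the image and merge results into global class ids."""
    merged: List[Detection] = []
    for model_index, detector in enumerate(detectors):
        # Model k is assumed to own global classes [k * C, (k + 1) * C).
        offset = model_index * classes_per_model
        for x1, y1, x2, y2, conf, local_id in detector(image):
            merged.append((x1, y1, x2, y2, conf, offset + local_id))
    # Highest-confidence detections first.
    merged.sort(key=lambda d: d[4], reverse=True)
    return merged


if __name__ == "__main__":
    # Two fake detectors standing in for two trained models.
    fake_a = lambda img: [(0.0, 0.0, 10.0, 10.0, 0.6, 3)]
    fake_b = lambda img: [(5.0, 5.0, 20.0, 20.0, 0.9, 3)]
    print(detect_with_ensemble("image", [fake_a, fake_b], classes_per_model=100))
```

The loop makes the cost visible: one forward pass per model for every image, and all of the models' weights have to be kept in memory. The sketch also does nothing to reconcile overlapping boxes that come from different models, which a real system would need to handle.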