To assess our deep learning models (CNNs), we have labeled a large benchmark dataset. It was labeled by specialists, so it is close to ideal. I know, of course, that we should not train new models on the benchmark dataset, since that would be "cheating ourselves". My question: can we train a small "secret model" on the benchmark alone, and use it only to speed up the labeling of other "real" datasets, which, unlike the benchmark, WILL be used to train our models? Does such an approach sound legitimate?
2 Answers
Neural networks are function approximators...
What you are describing is this: given a data distribution $D$, you train a model $f$ on it, and then you use $f$ to label a new dataset $D'$.
Now you want to train a new model $g$ on $D'$. However, $D'$ is labeled by $f$, so $g$ will approximate the function learned by $f$, which is supposedly a good function for $D$. Thus $g$ will be close to $f$, which is pretty much to say that $g$ is trained on $D$.
Rephrasing with all the definitions in place:
- $S$ is the specialist
- $D$ the small dataset (benchmark)
- $f$ is a neural network trained on $D$
- $D'$ is a second dataset labeled by $f$
- $g$ is a neural network trained on $D'$
The labels of $D'$ are point-wise evaluations of $f$, so when you ask $g$ to fit them, you get a new function ($g$) that approximates $f$.
Now, depending on how closely $D'$ is related to $D$, you will have more or less "leakage" from $f$ to $g$.
Conclusion: this method is effective only if $D'$ is similar to $D$; but if that is the case, there will be a lot of leakage from $f$ to $g$, and your final function $g$ will, in effect, have been trained on $D$.
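To make the argument concrete, here is a minimal sketch of the pipeline it describes, using scikit-learn stand-ins for the CNNs. All dataset sizes, model choices, and names are illustrative assumptions, not the asker's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# One pool of data, so that D' genuinely resembles D (the high-leakage
# case the answer warns about).
X_all, y_all = make_classification(n_samples=5500, n_features=20,
                                   random_state=0)
X_bench, y_bench = X_all[:500], y_all[:500]  # D: specialist-labeled benchmark
X_real = X_all[500:]                         # D': unlabeled "real" data

# f: the "secret model", trained only on the benchmark.
f = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
f.fit(X_bench, y_bench)

# The labels of D' are point-wise evaluations of f.
y_pseudo = f.predict(X_real)

# g: the production model, trained on f's pseudo-labels.
g = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=1)
g.fit(X_real, y_pseudo)

# g ends up approximating f, not the specialist S: agreement is near 100%.
agreement = (g.predict(X_bench) == f.predict(X_bench)).mean()
print(f"g/f agreement on the benchmark: {agreement:.1%}")
```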
I'm not sure I see the point. The secret model will invariably have some loss, and that loss is baked into the larger dataset as the secret model labels it. Whatever model is trained on that larger dataset will learn the labels as given, errors included. If we assume the final model also has its own loss, it will likely learn a lossy version of the secret model's already lossy output. In theory the final model could end up more accurate than the secret model, but more than likely it will be less.
If I were forced to steelman this idea: given the model's known confusions, one could use the secret model to label the larger dataset and focus the specialists on correcting the common confusion points, which should be faster than labeling the whole dataset by hand (see the sketch below). Without that step, though, it is an invariably lossy model imitating an invariably lossy model.
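As a rough sketch of that triage step, reusing `f` and `X_real` from the earlier snippet; the 0.8 confidence threshold is an arbitrary assumption that would in practice be tuned against the known confusions:

```python
# Route only the secret model's least confident predictions to the
# specialists; keep its labels everywhere else.
confidence = f.predict_proba(X_real).max(axis=1)  # confidence of f's label

needs_review = confidence < 0.8  # likely confusion points -> specialists
print(f"{int(needs_review.sum())} of {len(X_real)} samples routed to "
      f"specialists; the rest keep the secret model's labels")
```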