
I was reading a paper that states the following:

We also use the T-Test to test the significance of GMAN in 1 hour ahead prediction compared to Graph WaveNet. The p-value is less than 0.01, which demonstrates that GMAN statistically outperforms Graph WaveNet.

What does "Model A statistically outperforms B" mean in this context? And how should the p-value threshold be selected?

nbro
razvanc92

1 Answer


Most model fitting is stochastic, so you get slightly different parameters, and therefore slightly different performance, every time you train, and you usually can't say that one algorithm will always give you a better-performing model.

However, since you can retrain many times to get a distribution of performance scores for each algorithm, you can use a statistical test like the t-test to support the claim that "algorithm A usually produces a better model than algorithm B," which is what they mean by "statistically outperforms."
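
For concreteness, here is a minimal sketch (not from the paper) of how such a comparison could be run in Python with SciPy, assuming you have retrained each model several times and recorded a test metric such as MAE for each run; the arrays below are hypothetical numbers, invented purely for illustration.

    import numpy as np
    from scipy import stats

    # Hypothetical 1-hour-ahead MAE from 10 independent training runs of each model.
    # These values are made up for illustration only.
    mae_gman = np.array([3.40, 3.38, 3.42, 3.37, 3.41, 3.39, 3.43, 3.36, 3.40, 3.38])
    mae_graph_wavenet = np.array([3.53, 3.55, 3.51, 3.56, 3.54, 3.52, 3.57, 3.55, 3.53, 3.54])

    # Welch's two-sample t-test, which does not assume equal variances between the groups.
    t_stat, p_value = stats.ttest_ind(mae_gman, mae_graph_wavenet, equal_var=False)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")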

The p-value threshold (the significance level, often written α) is usually set by consensus in the field. The higher the p-value, the less confidence you have that there is a real difference between the distributions being compared. A threshold of 0.1 might be normal in a field where data is very expensive to collect (like risky, long-term studies of humans), but in machine learning it's usually easy enough to retrain a model that 0.01 is common, and a result that clears it gives much higher confidence. To learn more about selecting and interpreting these thresholds, I recommend Wikipedia's page on statistical significance.
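
Continuing the hypothetical sketch above, the threshold is just the cutoff the computed p-value is compared against, and a smaller threshold demands stronger evidence before the difference is declared significant:

    # Decision rule: call the difference statistically significant only if the
    # p-value falls below the chosen significance level.
    for alpha in (0.1, 0.05, 0.01):
        verdict = "significant" if p_value < alpha else "not significant"
        print(f"alpha = {alpha}: difference is {verdict}")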

alltom