
It has been asked here if we should repeat lengthy experiments.

Let's say I can repeat them; how should I present the results? For instance, if I am measuring the accuracy of a model on test data over some training epochs, and I repeat this training several times, I will get different values of test accuracy. I can average them to take all the experiments into account. Can I then calculate a sort of confidence interval to say that the accuracy will most likely lie within some interval? Does this make sense? If it does, what formula should I use?

It says here that we can use $\hat{x} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}}$, but I don't quite understand the theory behind it.


1 Answer


It says here that we can use $\hat{x} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}}$, but I don't quite understand the theory behind it.

Under a Gaussian (normal) assumption, $1.96$ is the critical value of the standard normal distribution: about $95\%$ of its probability mass lies within $\pm 1.96$ standard deviations of the mean. Multiplying it by the standard error $\frac{\hat{\sigma}}{\sqrt{n}}$ (the sample standard deviation $\hat{\sigma}$ divided by $\sqrt{n}$) gives the half-width of a $95\%$ confidence interval for the unknown true test score $x$. That is, if you repeated the whole set of experiments many times, roughly $95\%$ of the intervals $[\hat{x} - 1.96\frac{\hat{\sigma}}{\sqrt{n}},\; \hat{x} + 1.96\frac{\hat{\sigma}}{\sqrt{n}}]$ constructed from those repetitions would contain the true value $x$.
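To make the formula concrete, here is a minimal sketch computing that interval from a list of repeated test accuracies. The accuracy values and the number of runs are made up for illustration; only the standard library is used.

```python
import math

# Hypothetical test accuracies from n = 10 repeated training runs.
accuracies = [0.912, 0.907, 0.921, 0.915, 0.903, 0.918, 0.910, 0.925, 0.908, 0.914]

n = len(accuracies)
mean = sum(accuracies) / n

# Sample standard deviation with Bessel's correction (divide by n - 1).
std = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / (n - 1))

# 95% confidence interval half-width: 1.96 * sigma_hat / sqrt(n).
half_width = 1.96 * std / math.sqrt(n)

print(f"accuracy = {mean:.4f} ± {half_width:.4f} (95% CI)")
```

Note that with only a handful of runs, a Student's $t$ critical value (larger than $1.96$ for small $n$) would be the more careful choice than the normal approximation.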

I guess this makes sense for cross-validation with $k \geq 10$ folds, although this issue baffles me too; in my experience, practitioners either report $\text{mean}(x) \pm \text{std}(x)$ or just leave the details out.