Questions
I am training deep neural networks and I have heard that changing the system configuration (such as CUDA, cuDNN, hardware, or even the OS version) can sometimes lead to different training results, even when using the same dataset, model architecture, and hyperparameters.
I understand that floating-point arithmetic on GPUs is not always deterministic, but I would like to know more about how changes in system libraries (e.g., upgrading from CUDA 11 to 12, or switching from cuDNN 8.0 to 8.9) can impact training reproducibility.
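To be concrete about what I mean, here is a tiny CPU-only sketch (my own illustration, plain Python, no GPU involved) of floating-point non-associativity: summing the same numbers in a different order already changes the result, which is the kind of effect a library update can produce if a kernel's reduction order changes.

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

# Same values, different summation order.
s_forward = sum(xs)
s_reverse = sum(reversed(xs))

print(s_forward == s_reverse)      # often False
print(abs(s_forward - s_reverse))  # tiny but nonzero difference
```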
My questions are:
- How exactly do changes in CUDA, cuDNN, or the OS kernel affect deep learning training?
- Are there any empirical studies or papers that analyze and demonstrate the effect of these changes on model convergence and final accuracy?
- What are the best practices to ensure reproducibility across different system configurations, other than fixing the random seed (see the sketch after this list)?
- If I want to obtain different results while training a model, can I achieve this using Docker containers? What should I change in the Dockerfile?
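For reference, by "fixing the random seed" I mean something along the lines of this sketch (assuming PyTorch; other frameworks expose equivalent switches):

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 42) -> None:
    # Seed every RNG the training loop may touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Ask cuDNN for deterministic kernels and disable its autotuner,
    # which may otherwise pick different algorithms from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Error out on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)

    # Required by cuBLAS for deterministic GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


make_deterministic(42)
```

As far as I understand, these settings make runs repeatable on one fixed software stack, but they do not promise bit-identical results across different CUDA/cuDNN versions, which is exactly the gap I am asking about.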
Any insights, references, or explanations would be greatly appreciated!
Experiments
I have tried to train the same model on two different configurations. To speed up the experiments, I used Docker containers.
In particular, on a machine equipped with a GTX 1080 Ti running Ubuntu 24.04 LTS, I built two containers with the following specs:
- Ubuntu 18 | CUDA 12.0.1 | cuDNN 8.8.0.121-1+cuda12.0 | NVIDIA driver 560.35.03
- Ubuntu 22 | CUDA 12.5.82 | cuDNN 9.2.1.18-1 | NVIDIA driver 560.35.03
Unfortunately, with containers the kernel (and, I assume, the NVIDIA driver as well) is shared with the host OS. Nevertheless, I expected to obtain different results when training the same model with the same hyperparameters, but this did not happen.
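For reference, this is a simplified sketch (not the actual model I trained, and assuming PyTorch is installed in both images) of the kind of check I can run inside each container: it prints the versions the framework actually sees and performs the same tiny forward/backward twice, so the printed loss can be compared across the two containers.

```python
import torch

# Versions as seen from inside the container.
print("torch        :", torch.__version__)
print("CUDA (torch) :", torch.version.cuda)
print("cuDNN        :", torch.backends.cudnn.version())
print("GPU          :", torch.cuda.get_device_name(0))
print("compute cap  :", torch.cuda.get_device_capability(0))


def one_step(seed: int) -> float:
    """Run a single forward/backward on a tiny conv net and return the loss."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.ReLU(),
        torch.nn.Flatten(),
        torch.nn.Linear(16 * 32 * 32, 10),
    ).cuda()
    x = torch.randn(8, 3, 32, 32, device="cuda")
    y = torch.randint(0, 10, (8,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return loss.item()


# Two runs with the same seed on the same stack are expected to match;
# comparing the printed value across the two containers shows whether the
# different CUDA/cuDNN versions actually change the arithmetic.
print(one_step(0), one_step(0))
```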