I have Ubuntu 22.04.4 LTS running on an HPE DL380 Gen10 server with an NVIDIA A100. The GPU (with NVIDIA driver v515) was working fine until two weeks ago. Last week, I found that the
nvidia-smi
command complained that
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Usually, reinstalling the NVIDIA drivers works, but this time the (re)installation failed with the following errors in /var/log/nvidia-installer.log:
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_lock'
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_unlock'
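To compare failures across driver versions, the offending symbols could be pulled out of the installer log. This is only a rough sketch: gpl_symbols is a helper name made up here, and the log path is the one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the unique GPL-only symbols named in a log file.
gpl_symbols() {
    grep -o "GPL-only symbol *'[^']*'" "$1" \
        | sed "s/.*'\([^']*\)'\$/\1/" \
        | sort -u
}

log=/var/log/nvidia-installer.log
# Only read the log if it exists and is readable on this machine.
if [ -r "$log" ]; then
    gpl_symbols "$log"
fi
```

If the same symbols show up for every driver version, that would suggest the kernel, rather than any particular driver release, is what changed.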
I tried installing NVIDIA drivers v525 and v535, which resulted in the same error. It is worth mentioning that I got the NVIDIA .run files (drivers) directly from NVIDIA's website. Later, I discovered that getting the NVIDIA drivers using the "ubuntu-drivers" tool is preferred. Forums suggest avoiding drivers downloaded directly from NVIDIA's website and recommend getting them from signed sources, as this page describes: https://ubuntu.com/server/docs/nvidia-drivers-installation.
I selected NVIDIA driver v535 for two reasons:
- the official page from HPE says v535 is supported (https://support.hpe.com/connect/s/softwaredetails?language=en_US&collectionId=MTX-bebe602cd6364ad0&softwareId=MTX_a36eb26d486a4be188dcd00d9a&tab=Installation+Instructions).
- running "sudo ubuntu-drivers list --gpgpu" lists v535 as the latest driver version.
Some posts suggested that upgrading the kernel solves this problem (https://forums.debian.net/viewtopic.php?t=158200). So, I upgraded the kernel from v5.15 to v6.7 using the following commands.
sudo apt install -t bookworm-backports linux-image-amd64
sudo apt install -t bookworm-backports linux-headers-amd64
Backports were added to the apt sources list (/etc/apt/sources.list) to complete this kernel update. Upgrading the kernel to v6.7 resulted in missing kernel header errors while installing the NVIDIA drivers. A similar issue has been reported on the Debian forum in this post (https://forums.debian.net/viewtopic.php?t=157737). That post only talks about kernel v6.5, so to reduce the number of variables, I downgraded my kernel from v6.7 to v6.5 using this post (Downgrade kernel for ubuntu 22.04 LTS). Now, I have kernel v6.5 up and running with its header files.
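A quick way to confirm that the headers match the kernel that is actually running might look like the sketch below. It assumes the header package is named linux-headers- followed by the output of uname -r; headers_pkg_for is a helper name made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: header package name for a given kernel release string.
# Assumes the usual "linux-headers-<release>" naming.
headers_pkg_for() {
    printf 'linux-headers-%s\n' "$1"
}

running="$(uname -r)"
pkg="$(headers_pkg_for "$running")"
echo "Running kernel: $running"

# dpkg-query reports "install ok installed" for a fully installed package
# and exits non-zero if the package is unknown.
if command -v dpkg-query >/dev/null 2>&1; then
    if dpkg-query -W -f='${Status}\n' "$pkg" 2>/dev/null | grep -q 'install ok installed'; then
        echo "$pkg is installed"
    else
        echo "$pkg is NOT installed"
    fi
fi
```

If the package is reported as not installed, the driver build would have nothing to compile against for the running kernel, even if headers for another kernel version are present.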
After this, I installed the NVIDIA drivers with the "ubuntu-drivers" tool using the following commands.
sudo ubuntu-drivers install --gpgpu nvidia:535-server
sudo apt install nvidia-utils-535-server
This installation did not error out. Running
sudo ubuntu-drivers install --gpgpu nvidia:535-server
again returns
All the available drivers are already installed.
However,
nvidia-smi
still complains
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
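Since the packages install cleanly but nvidia-smi cannot reach the driver, one thing that could be checked is whether an nvidia kernel module exists for the running kernel and whether it is loaded. The following is a minimal sketch under that assumption; module_loaded is a helper name made up here, and the checks only report state, they do not change anything.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if module $1 is listed in a
# /proc/modules-style file ($2, defaulting to /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

kernel="$(uname -r)"
echo "Running kernel: $kernel"

# Is there an nvidia module installed for the kernel that is actually running?
if command -v modinfo >/dev/null 2>&1; then
    if modinfo -k "$kernel" nvidia >/dev/null 2>&1; then
        echo "nvidia module found for $kernel (version $(modinfo -k "$kernel" -F version nvidia))"
    else
        echo "no nvidia module found for $kernel"
    fi
fi

# Is the module loaded right now?
if [ -r /proc/modules ]; then
    if module_loaded nvidia; then
        echo "nvidia module is loaded"
    else
        echo "nvidia module is not loaded"
    fi
fi
```

A "no nvidia module found" result would point at the module never having been built or shipped for this particular kernel, which is a different problem from the driver version itself.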
I wonder what NVIDIA driver version is expected.