I have Ubuntu 22.04.4 LTS running on an HPE DL380 Gen10 server with an NVIDIA A100. The GPU (with NVIDIA driver v515) was working fine until two weeks ago. Last week, I found that the

nvidia-smi 

command complained that

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Usually, reinstalling NVIDIA drivers works, but this time the (re)installation errored out with the following errors in /var/log/nvidia-installer.log

ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_lock'
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_unlock'
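To see at a glance which symbols modpost rejected, the log can be filtered with something like the sketch below. The function name gpl_only_symbols is made up for illustration, and it assumes the log lines look like the two errors quoted above.

```shell
# Hypothetical helper (bash): list the GPL-only symbols that modpost rejected.
# Assumes log lines of the form:
#   ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_lock'
gpl_only_symbols() {
    local log="${1:-/var/log/nvidia-installer.log}"
    # grep -o prints each match on its own line, even if several share one log line;
    # sed then keeps only the text between the single quotes.
    grep -o "GPL-only symbol *'[^']*'" "$log" \
        | sed "s/.*'\([^']*\)'/\1/" \
        | LC_ALL=C sort -u
}
```

Calling gpl_only_symbols with no argument reads /var/log/nvidia-installer.log; a different log path can be passed as the first argument.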

I tried installing NVIDIA drivers v525 and v535, which resulted in the same error. It is worth mentioning that I had been getting the NVIDIA .run files (drivers) directly from NVIDIA's website. Later, I discovered that getting the NVIDIA drivers using the "ubuntu-drivers" tool is preferred. Forums suggest avoiding drivers downloaded directly from NVIDIA's website and recommend getting them from signed sources, as this page describes: https://ubuntu.com/server/docs/nvidia-drivers-installation.

I selected NVIDIA driver v535 for two reasons:

  1. the official page from HPE says v535 is supported (https://support.hpe.com/connect/s/softwaredetails?language=en_US&collectionId=MTX-bebe602cd6364ad0&softwareId=MTX_a36eb26d486a4be188dcd00d9a&tab=Installation+Instructions).
  2. running "sudo ubuntu-drivers list --gpgpu" lists v535 as the latest driver version.

Some posts suggested that upgrading the kernel solves this problem (https://forums.debian.net/viewtopic.php?t=158200). So, I upgraded the kernel from v5.15 to v6.7 using the following commands.

sudo apt install -t bookworm-backports linux-image-amd64
sudo apt install -t bookworm-backports linux-headers-amd64

Backports were added to the apt sources list (/etc/apt/sources.list) to complete this kernel update. Upgrading the kernel to v6.7 resulted in missing kernel header errors while installing the NVIDIA drivers. A similar issue has been reported on the Debian forum (https://forums.debian.net/viewtopic.php?t=157737). That post only discusses kernel v6.5, so to reduce the number of variables, I downgraded my kernel from v6.7 to v6.5 following this post (Downgrade kernel for ubuntu 22.04 LTS). Now I have kernel v6.5 up and running with its header files.
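A quick way to double-check that headers really are present for a given kernel might look like the sketch below. The function name kernel_headers_present and its optional second argument (an alternate filesystem root, useful for trying it on a scratch directory) are made up for illustration. It relies on /lib/modules/<release>/build normally pointing at the installed headers tree.

```shell
# Hypothetical helper (bash): succeed only if build headers exist for a kernel release.
# $1: kernel release (defaults to the running kernel, as reported by uname -r)
# $2: optional filesystem root, only there to make the check easy to try elsewhere
kernel_headers_present() {
    local release="${1:-$(uname -r)}"
    local root="${2:-}"
    # A dangling "build" symlink (headers package missing) makes -e fail.
    [ -e "${root}/lib/modules/${release}/build/Makefile" ]
}
```

For example, running kernel_headers_present && echo "headers found" || echo "headers missing" checks the currently booted kernel.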

After this, I installed the NVIDIA drivers with the "ubuntu-drivers" tool using the following commands.

sudo ubuntu-drivers install --gpgpu nvidia:535-server
sudo apt install nvidia-utils-535-server

This installation did not error out. Running

sudo ubuntu-drivers install --gpgpu nvidia:535-server

again returns

All the available drivers are already installed.

However,

nvidia-smi 

still complains

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I wonder what NVIDIA driver version is expected.
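Since the message is about communicating with the driver, one thing worth checking is whether the nvidia kernel module is loaded at all for the running kernel. The following is a minimal sketch; the function name nvidia_module_loaded is made up for illustration, and the file argument exists only so the check can be pointed at a saved copy of /proc/modules.

```shell
# Hypothetical helper (bash): succeed if the "nvidia" kernel module is loaded.
# $1: file in /proc/modules format (defaults to /proc/modules itself)
nvidia_module_loaded() {
    local modules_file="${1:-/proc/modules}"
    # Each line starts with the module name followed by a space, so the
    # trailing space keeps nvidia_uvm, nvidia_drm, etc. from matching alone.
    grep -q '^nvidia ' "$modules_file"
}
```

If this fails, "dkms status" and "modinfo nvidia" may help show whether a module was ever built for the running kernel.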

1 Answer

Thank you @rob grune and @ubfan1. I purged all NVIDIA drivers and tried reinstalling. However, it seems running

sudo ubuntu-drivers install --gpgpu nvidia:535-server

is not the best command to install NVIDIA drivers.

Additionally, it turns out I introduced the following issues while trying to get the NVIDIA drivers working:

  1. I had accidentally installed an unsigned version of Linux kernel v6.5, which resulted in a lot of warnings.
  2. The headers for this kernel were still missing even though I thought I had installed them.
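For anyone wanting to check for the first issue, the sketch below lists installed kernel image packages that look unsigned. The function name unsigned_kernel_images is made up for illustration, and it assumes unsigned images carry "unsigned" in the package name (as in linux-image-unsigned-...), which may not hold for kernels installed by other means.

```shell
# Hypothetical helper (bash): print installed kernel image packages that look unsigned.
# Reads "<status> <package>" lines on stdin, for example from:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' 'linux-image*' | unsigned_kernel_images
unsigned_kernel_images() {
    # Keep only fully installed packages (status "ii") whose name contains "unsigned".
    awk '$1 == "ii" && $2 ~ /^linux-image-.*unsigned/ { print $2 }'
}
```

An empty result suggests that no unsigned kernel image package is currently installed.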

I rolled back my Linux kernel to v5.15.112, which solved both of the above issues. I then installed the NVIDIA drivers using the following command, letting my server decide which NVIDIA version it wants.

sudo ubuntu-drivers autoinstall

The autoinstall selected NVIDIA v535 and this seems to work for now.
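To confirm which driver version the kernel actually loaded, the version file exposed by the driver can be read. This is a minimal sketch; the function name loaded_driver_version is made up for illustration, and it assumes the file has a line starting with "NVRM version" that contains the version number.

```shell
# Hypothetical helper (bash): print the version of the loaded NVIDIA kernel module.
# $1: version file (defaults to /proc/driver/nvidia/version)
loaded_driver_version() {
    local version_file="${1:-/proc/driver/nvidia/version}"
    # Take the NVRM line, then the first dotted number on it (e.g. 535.x.y).
    grep -m1 '^NVRM version' "$version_file" \
        | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' \
        | head -n1
}
```

The result can be compared against the output of "nvidia-smi --query-gpu=driver_version --format=csv,noheader"; if the file does not exist, the module is not loaded.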