I'm running into an odd issue. I've been struggling to get GPU passthrough to a Windows 11 VM working properly, and I've finally found something that works, but it's not as ideal as I'd hoped. Essentially, if I add my PCI IDs to /etc/modprobe.d/vfio.conf:
options vfio-pci ids=10de:2684,10de:22ba
vfio-pci binds on startup and GPU passthrough works great. But if I then try to reattach the GPU to the nvidia driver, I can't seem to use it with PyTorch (although nvidia-smi works fine).
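(For reference, when I say it binds on startup, I'm going by checks roughly along these lines after a reboot with vfio.conf in place; the 01:00.0 address is obviously specific to my system:)
# confirm vfio-pci claimed the card at boot
sudo dmesg | grep -i vfio
lspci -nnk -s 01:00.0 | grep -i "driver in use"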
If I remove that vfio.conf file and reboot, the GPU is bound to nvidia and torch works great, but when I unbind it from nvidia and bind it to vfio-pci, launching the VM gives me Error Code 43 on the NVIDIA driver in the guest and the following errors in the libvirt logs:
2024-04-09T15:38:49.796258Z qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.5,addr=0x0: Failed to mmap 0000:01:00.0 BAR 1. Performance may be slow
2024-04-09T15:39:07.971124Z qemu-system-x86_64: vfio_region_write(0000:01:00.0:region1+0x8c, 0x1,4) failed: Cannot allocate memory
It's really odd: from all my inspections the GPU appears properly isolated, but it seems I can't pass it through to the VM without explicitly binding it to vfio-pci via /etc/modprobe.d/vfio.conf, and when I do that I can't seem to properly bind it back to nvidia afterwards. Once again, everything looks fine when I rebind it to nvidia, but torch can't detect the GPU anymore. Any ideas?
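(By "inspections" I mean things roughly along these lines: checking which driver owns each function and what, if anything, has claimed the card's memory regions. The grep terms are just the usual suspects; a leftover BOOTFB/efifb claim would be one plausible explanation for the BAR 1 mmap failure, which is the kind of thing I'm trying to rule out:)
# which driver currently owns each function
lspci -nnk -s 01:00.0
lspci -nnk -s 01:00.1
# what has claimed the GPU's memory regions
sudo grep -iE 'bootfb|efifb|vfio|nvidia' /proc/iomem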
My workaround works OK for now, but it requires a reboot whenever I want to launch the VM. Ideally I'd like to be able to bind/unbind my NVIDIA GPU on demand when switching between using it on the host and in the Windows 11 VM. My bind-to-VFIO script:
#!/bin/bash
set -x
# Stop display manager
systemctl stop display-manager
# Unbind VTconsoles: might not be needed
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
# Unload NVIDIA kernel modules
modprobe -r nvidia_drm
modprobe -r nvidia_modeset
modprobe -r nvidia_uvm
modprobe -r nvidia
# Detach GPU devices from host
# Use your GPU and HDMI Audio PCI host device
sudo virsh nodedev-detach pci_0000_01_00_0
sudo virsh nodedev-detach pci_0000_01_00_1
# Load vfio module
modprobe vfio-pci
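(As I understand it, virsh nodedev-detach with a VFIO-enabled libvirt does roughly the equivalent of this manual sysfs switch under the hood; I'm sketching it here only in case the distinction matters:)
# manual equivalent of nodedev-detach for 01:00.0 (repeat for 01:00.1)
echo 0000:01:00.0 > /sys/bus/pci/drivers/nvidia/unbind            # release from nvidia, if still bound
echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
echo 0000:01:00.0 > /sys/bus/pci/drivers_probe                    # let the kernel bind it to vfio-pci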
If I run
lspci -nnk -d 10de:2684
lspci -nnk -d 10de:22ba
both devices look properly bound to vfio-pci:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:40e5]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:40e5]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
If I instead reboot with vfio.conf applied and inspect things, the output looks exactly the same, yet oddly passthrough does work when I launch my Windows 11 VM:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:40e5]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:40e5]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
But if I then unbind from vfio-pci and bind back to nvidia:
#!/bin/bash
set -x
# Attach GPU devices to host
# Use your GPU and HDMI Audio PCI host device
sudo virsh nodedev-reattach pci_0000_01_00_0
sudo virsh nodedev-reattach pci_0000_01_00_1
# Unload vfio module
modprobe -r vfio-pci
# Stop race condition
sleep 2
# Load NVIDIA kernel modules
modprobe nvidia
modprobe nvidia_modeset
modprobe nvidia_uvm
modprobe nvidia_drm
# Bind VTconsoles: might not be needed
echo 1 > /sys/class/vtconsole/vtcon0/bind
echo 1 > /sys/class/vtconsole/vtcon1/bind
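(When I say everything looks fine after rebinding, this is roughly what I'm going by, in addition to the nvidia-smi output below:)
# confirm the nvidia modules are back and the card is bound to nvidia again
lsmod | grep '^nvidia'
lspci -nnk -s 01:00.0 | grep -i 'driver in use'
# kernel messages from the module reload (NVRM lines, errors, etc.)
sudo dmesg | tail -n 30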
nvidia-smi works fine:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 0% 49C P0 67W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
But when I run something in Docker that uses PyTorch, I get:
RuntimeError: Torch is not able to use GPU
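(The Docker side is nothing exotic; the failing check boils down to something like this, with the image tag here purely illustrative rather than my actual workload:)
# minimal in-container repro (image tag is illustrative)
docker run --rm --gpus all pytorch/pytorch:latest \
    python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"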
Even worse, when I then try to rebind to vfio-pci, it behaves as if vfio.conf had never been applied, and I get the same errors when launching the Windows 11 VM:
2024-04-09T16:04:45.089687Z qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.5,addr=0x0: Failed to mmap 0000:01:00.0 BAR 1. Performance may be slow
2024-04-09T16:04:55.682373Z qemu-system-x86_64: vfio_region_write(0000:01:00.0:region1+0x8c, 0x1,4) failed: Cannot allocate memory
It feels pretty clear to me that something is still holding on to the nvidia driver somehow, even though the device reports vfio-pci as the kernel driver in use and lsof /dev/nvidia0 returns nothing. Any ideas? I'm going a bit crazy here!
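(For completeness, this is roughly how I checked that nothing is still using the nvidia devices; the lsof mentioned above is part of it:)
# look for anything still holding the nvidia device nodes or modules
sudo lsof /dev/nvidia* 2>/dev/null
sudo fuser -v /dev/nvidia* 2>/dev/null
lsmod | grep nvidia     # a nonzero "Used by" count would point at a leftover user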