
I have been experiencing crashes while gaming but I cannot tell which of Ubuntu, Hardware (GPU/CPU), Steam, or the game is the problem.

Steps to replicate the problem

This last crash was on a fresh boot with only Steam and the game running.

When the game freezes, I alt-tab to Steam and hit the stop button. I get the warning about losing progress, accept it and then nothing happens.

I open a terminal and run top. I see the PID for the game and run sudo kill -9 <pid>, but nothing changes. I used screen capture to show what was happening (the game running, Steam, and the terminal window).

This full-on zombie process seems unkillable.
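
A process that ignores SIGKILL is usually either a true zombie (state Z, already dead and waiting for its parent to reap it) or stuck in uninterruptible sleep (state D, typically blocked inside a driver or I/O call). The sketch below is one way to tell the two apart. The helper name describe_state is made up for illustration, and the script assumes a Linux /proc filesystem.

```shell
#!/bin/sh
# Explain the first letter of a process state (hypothetical helper).
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep: SIGKILL only takes effect once the kernel call returns" ;;
        Z*) echo "zombie: already dead, waiting for its parent to reap it" ;;
        T*|t*) echo "stopped or being traced" ;;
        R*) echo "running or runnable" ;;
        S*|I*) echo "sleeping or idle" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

# Read the state letter straight from /proc (Linux only).
# Defaults to this shell's own PID; pass the PID shown by top as the first argument.
pid="${1:-$$}"
if [ -r "/proc/$pid/status" ]; then
    state=$(awk '/^State:/ { print $2 }' "/proc/$pid/status")
    echo "PID $pid: $state ($(describe_state "$state"))"
fi

# To see which kernel function a D-state process is waiting in:
#   ps -o pid,stat,wchan:32,comm -p <pid>
```

If the stuck game shows D rather than Z, the process is waiting on the kernel (often a driver), which would point away from Steam or the game itself. That is a reading of the symptoms, not something confirmed here.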

I log out. I log back in. The system hangs with a black screen and a white mouse pointer.

At this point, I'm forced to restart.

When the system came back up and I logged in, the screenshot had not been saved.

System specs

This is a new build: a Gigabyte board with 128GB RAM, an i9-14900Fx32 CPU, and a Radeon RX 7900 XTX GPU. Firmware version F9; Ubuntu 24.04.1. I saved up for over a year to buy this thing and now I have the weirdest bug, which I am seeking help diagnosing.

I chose the i9-14900Fx32 specifically because it was not known to have instability issues. I have not overclocked anything.

Additional info

When this crash happens it takes Firefox with it (Chrome is fine), with the same zombie-process behaviour, and System Monitor says it is ready but will neither present a GUI nor close the phantom window (also a zombie).

Update

So after messing about with amdgpu in the hopes of fixing things, I made it all much worse and spent a day with a system that booted to a black screen.

When I was finally back in, the system froze just after I opened Firefox, restored my tabs, and started a YouTube video. I rebooted. I did not get the same problem from Chrome. I have not been able to repeat the Firefox one either, as that is what I am using to post this update.

After much Googling, I think I un-foobared the thing, but I'm getting some very contradictory feedback from the terminal:

lordmatt@vision:/var/lib/dpkg$ sudo dpkg -P amdgpu && sudo dpkg -P amdgps-dkms
dpkg: warning: ignoring request to remove amdgpu which isn't installed
dpkg: warning: ignoring request to remove amdgps-dkms which isn't installed
lordmatt@vision:/var/lib/dpkg$ sudo dpkg --configure -a
Setting up amdgpu-dkms (1:6.7.0.60103-1787201.22.04) ...
debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process: Resource temporarily unavailable
dpkg: error processing package amdgpu-dkms (--configure):
 installed amdgpu-dkms package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 amdgpu-dkms
lordmatt@vision:/var/lib/dpkg$ 

I have Schrödinger's amdgps-dkms: it is both not installed and installed (pending post-install script) at the same time.
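
One detail in the transcript above: the two dpkg -P commands name amdgpu and amdgps-dkms, while dpkg --configure -a reports on amdgpu-dkms, so dpkg may simply be describing different package names. A sketch for listing what dpkg actually records for anything matching amdgpu follows. The helper decode_status is hypothetical; the two-letter codes follow the dpkg -l convention (desired action, then current state).

```shell
#!/bin/sh
# Decode the two-letter dpkg status abbreviation (hypothetical helper).
decode_status() {
    case "$1" in
        ii) echo "installed" ;;
        iU) echo "unpacked, not yet configured" ;;
        iF) echo "half-configured (post-installation script failed)" ;;
        iH) echo "half-installed" ;;
        rc) echo "removed, config files remain" ;;
        un) echo "not installed" ;;
        *)  echo "other ($1)" ;;
    esac
}

# List every package whose name contains "amdgpu" with its decoded status.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' '*amdgpu*' 2>/dev/null |
    while read -r abbrev pkg; do
        echo "$pkg: $(decode_status "$abbrev")"
    done
fi

# To see which process holds the debconf lock (needs root):
#   sudo fuser -v /var/cache/debconf/config.dat
```

A package shown as half-configured would match the "post-installation script subprocess returned error" message, and the fuser line is one way to find whatever is holding config.dat.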

Update 2

After a lot of faff and crashes, I undid whatever amdgpu nonsense I had set in motion. Now I just need some way to profile the initial problem or find a fix in some way. AMDGPU is not, it seems, my answer.

Update 3

So the original symptoms (system freezes) still exist. I can't tell if the GPU is the culprit. Freezes have happened when running only Firefox.

Update 4

Running amdgpu-install --uninstall freed up some room but was not a fix. Crashes are more frequent. Sound keeps playing, though, even when the screen is frozen and no keyboard or mouse input is getting through (not even Caps Lock or Num Lock).
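
Since audio keeps playing while the display and input freeze, the kernel is probably still running and may have logged something before the reboot. Below is a minimal sketch for pulling likely GPU or hung-task messages out of the previous boot's kernel log, assuming the journal keeps logs across reboots. The function name filter_freeze_hints and its pattern list are illustrative guesses, not an exhaustive set.

```shell
#!/bin/sh
# Keep only kernel log lines that commonly accompany GPU or hung-task problems.
# The pattern list is an assumption; extend it as needed.
filter_freeze_hints() {
    grep -iE 'amdgpu|drm|gpu reset|ring .* timeout|blocked for more than|hung_task'
}

# Usage on a real system (previous boot, kernel messages only):
#   journalctl -k -b -1 --no-pager | filter_freeze_hints
```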

Update 5 - running some commands as requested

free --mega -h

               total        used        free      shared  buff/cache   available
Mem:            132G        5.4G        125G        143M        2.9G        127G
Swap:           8.6G          0B        8.6G

cat /proc/sys/vm/swappiness

That returns 60. Is that good?

amdgpu

I tried to switch to amdgpu as I suspected that might be the problem. Oh, boy. The process hung; I watched Netflix; I went to bed; I got up and nothing had changed. It would get stuck setting up with the kernel. Most of my updates were me clawing back from that.

Here is a screeny of that last build step just going hell-for-leather and getting nowhere.

[screenshot]

lsmod | grep amdgpu

Nothing.
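
An empty lsmod result only says the module is not loaded. A sketch for confirming which kernel driver, if any, is bound to the card, assuming lspci from pciutils is installed, follows. The helper driver_in_use is hypothetical and simply parses lspci -k output; 03:00.0 is the bus ID reported by hwinfo below.

```shell
#!/bin/sh
# Print the driver named on "Kernel driver in use:" lines, or "none".
driver_in_use() {
    awk -F': ' '/Kernel driver in use:/ { print $2; found = 1 }
                END { if (!found) print "none" }'
}

# Usage on a real system:
#   lspci -k -s 03:00.0 | driver_in_use
```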

Update 4 is where I got off that train (and got some system stability back). Along the way, I poked at a bunch of BIOS settings and learned a few hard lessons about not touching things I don't understand.

sudo hwinfo --gfxcard

sudo: hwinfo: command not found

I did a quick apt install and:

07: PCI 300.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.386]
  Unique ID: svHJ.+CDZH_5IkG4
  Parent ID: B35A.Sa24RQSJfUB
  SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0
  SysFS BusID: 0000:03:00.0
  Hardware Class: graphics card
  Model: "ATI VGA compatible controller"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x744c 
  SubVendor: pci 0x1eae "XFX Limited"
  SubDevice: pci 0x7901 
  Revision: 0xc8
  Memory Range: 0x40000000-0x4fffffff (ro,non-prefetchable)
  Memory Range: 0x50000000-0x501fffff (ro,non-prefetchable)
  I/O Ports: 0x5000-0x5fff (rw)
  Memory Range: 0x50c00000-0x50cfffff (rw,non-prefetchable)
  Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
  IRQ: 11 (no events)
  Module Alias: "pci:v00001002d0000744Csv00001EAEsd00007901bc03sc00i00"
  Driver Info #0:
    Driver Status: amdgpu is not active
    Driver Activation Cmd: "modprobe amdgpu"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #12 (PCI bridge)

Primary display adapter: #7

sudo kill and sudo killall

I threw both of them at the zombies and they just ignored me. System Monitor also zombied during these times. It had "technically" started but there was no GUI element. One time, it was running and I tried to go from the graphs to the process list. It was not happening. System Monitor has stopped responding. Click the option to end it. Get the message again like I had not done anything.

As I say, unkillable zombies. I've never seen anything quite like it.

Here's a capture of process 12388 refusing to be killed. The "Stop" button on Steam had a similar failure to make anything happen. I rebooted.

[screenshot]

linux-crashdump

I also installed linux-crashdump at some point so there may be some super verbose files around I could go dig up.
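
For reference, linux-crashdump pulls in kdump-tools, which by default writes kernel dumps under /var/crash. Here is a small sketch for listing what is there, newest first. The helper list_crash_files is hypothetical, and the directory is a parameter in case the location was changed.

```shell
#!/bin/sh
# List entries in a crash directory, newest first (default: /var/crash).
list_crash_files() {
    dir="${1:-/var/crash}"
    [ -d "$dir" ] || { echo "no such directory: $dir"; return 1; }
    ls -1t "$dir"
}

# Usage on a real system:
#   list_crash_files
#   kdump-config show    # reports whether kdump is armed and where dumps go
```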

Update 6

Following advice from comments

  • I ran sudo apt install libgl1-mesa-dri mesa-opencl-icd mesa-va-drivers mesa-vdpau-drivers mesa-vulkan-drivers
  • sudo systemctl stop gdm
  • sudo modprobe -r radeon
  • sudo modprobe amdgpu
  • sudo systemctl start gdm

Then sudo hwinfo --gfxcard gave me:

07: PCI 300.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.386]
  Unique ID: svHJ.+CDZH_5IkG4
  Parent ID: B35A.Sa24RQSJfUB
  SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0
  SysFS BusID: 0000:03:00.0
  Hardware Class: graphics card
  Model: "ATI VGA compatible controller"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x744c 
  SubVendor: pci 0x1eae "XFX Limited"
  SubDevice: pci 0x7901 
  Revision: 0xc8
  Driver: "amdgpu"
  Driver Modules: "amdgpu"
  Memory Range: 0x4800000000-0x4fffffffff (ro,non-prefetchable)
  Memory Range: 0x4400000000-0x44001fffff (ro,non-prefetchable)
  I/O Ports: 0x5000-0x5fff (rw)
  Memory Range: 0x50c00000-0x50cfffff (rw,non-prefetchable)
  Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
  IRQ: 205 (13478 events)
  Module Alias: "pci:v00001002d0000744Csv00001EAEsd00007901bc03sc00i00"
  Driver Info #0:
    Driver Status: amdgpu is active
    Driver Activation Cmd: "modprobe amdgpu"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #12 (PCI bridge)

Primary display adapter: #7

Success?

After all that help, I was able to run a game that previously crashed within minutes.

I'm getting far fewer crashes and they no longer zombie the system.

1 Answer

sudo hwinfo --gfxcard output shows amdgpu is not active (Update 5 above):

Driver Info #0:
  Driver Status: amdgpu is not active
  Driver Activation Cmd: "modprobe amdgpu"

But you can use the default amdgpu kernel driver without installing the dkms version (it's built in to the regular kernel).


Before you begin, run the following commands to install the related mesa drivers, and also the amdgpu Xorg driver if you want to run X11 instead of Wayland:

sudo apt update
sudo apt install libgl1-mesa-dri mesa-opencl-icd mesa-va-drivers mesa-vdpau-drivers mesa-vulkan-drivers

and optionally, if you want to use X11 instead of Wayland:

sudo apt install xserver-xorg-video-amdgpu

Next, run the following command to enable the amdgpu kernel driver when you boot:

echo amdgpu | sudo tee -a /etc/modules
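
Because tee -a appends unconditionally, running the command above twice leaves two amdgpu lines in /etc/modules. That is harmless, but a guarded variant can be sketched as follows. The helper add_module_once is hypothetical and takes the file path as a parameter so it can be tried on a scratch file first.

```shell
#!/bin/sh
# Append a module name to a modules file only if that exact line is absent.
add_module_once() {
    module="$1"
    file="$2"
    if ! grep -qxF "$module" "$file" 2>/dev/null; then
        echo "$module" >> "$file"
    fi
}

# Equivalent one-liner on a real system:
#   grep -qxF amdgpu /etc/modules || echo amdgpu | sudo tee -a /etc/modules
```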

Finally, reboot and then run the following command to check the driver status:

sudo hwinfo --gfxcard | grep "Driver Status"

It should show the following:

  Driver Status: amdgpu is active

UPDATE 1:

Okay, since amdgpu is blacklisted in /etc/modprobe.d, you can either delete the offending file (if blacklist amdgpu is the only entry in the file),

or you can edit the file and comment the line out so it reads #blacklist amdgpu instead of blacklist amdgpu

or you can use the following command to list any files that contain blacklist amdgpu:

grep -l "blacklist amdgpu" /etc/modprobe.d/*

and if any files are listed, then this command should automatically edit those files:

sudo sed -i 's/blacklist amdgpu/#&/' $(grep -l "blacklist amdgpu" /etc/modprobe.d/*)
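
Note that if grep -l finds nothing, the $(...) expands to nothing and sed -i exits with a "no input files" error instead of editing anything. Below is a slightly more defensive sketch, which skips lines that are already commented out and takes the directory as a parameter so it can be tried on a copy first. The name unblacklist_amdgpu is hypothetical, and sed -i -E assumes GNU sed, as shipped with Ubuntu.

```shell
#!/bin/sh
# Comment out active "blacklist amdgpu" lines in every .conf file in a directory.
unblacklist_amdgpu() {
    dir="${1:-/etc/modprobe.d}"
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        if grep -qE '^[[:space:]]*blacklist[[:space:]]+amdgpu' "$f"; then
            sed -i -E 's/^([[:space:]]*)(blacklist[[:space:]]+amdgpu)/\1#\2/' "$f"
            echo "edited: $f"
        fi
    done
}

# On a real system this must run as root, since /etc/modprobe.d is root-owned.
```

If the blacklist file was also copied into the initramfs, running sudo update-initramfs -u before rebooting may be needed as well. That is an assumption about the setup, not something the question confirms.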

Reboot to apply the changes.

mchid