I have been experiencing crashes while gaming, but I cannot tell which of Ubuntu, the hardware (GPU/CPU), Steam, or the game is the problem.
Steps to replicate the problem
This last crash was on a fresh boot with only Steam and the game running.
When the game freezes, I alt-tab to Steam and hit the stop button. I get the warning about losing progress, accept it and then nothing happens.
I open a terminal and run top. I see the PID for the game and use sudo kill -9 <pid>, but nothing changes. I used screen capture to show what was happening (the game running, Steam, and the terminal window).
This full-on zombie process seems unkillable.
I log out. I log back in. The system hangs with a black screen and a white mouse pointer.
At this point, I'm forced to restart.
When the system came back up and I logged in, the screenshot had not been saved.
System specs
This is a new build: a Gigabyte board with 128GB RAM, an i9-14900Fx32 CPU, and a Radeon RX 7900 XTX GPU. Firmware version F9; Ubuntu 24.04.1. I saved up for over a year to buy this thing and now I have the weirdest bug, which I need help diagnosing.
I chose the i9-14900Fx32 specifically because it was not known to have instability issues. I have not overclocked anything.
Additional info
When this crash happens it takes Firefox with it (Chrome is fine), with the same zombie-process behaviour. System Monitor will say it is ready but will neither present a GUI nor close the phantom window (also a zombie).
Update
So after messing about with amdgpu in the hopes of fixing things, I made it all much worse and spent a day with a system that booted to a black screen.
When I was finally back in, the system froze just after I opened Firefox, restored my tabs, and started a YouTube video. I rebooted. I did not get the same problem from Chrome. I have not been able to repeat the Firefox one either, as that is what I am using to post this update.
After much Googling, I think I un-foobared the thing, but I'm getting some very contradictory feedback from the terminal:
lordmatt@vision:/var/lib/dpkg$ sudo dpkg -P amdgpu && sudo dpkg -P amdgps-dkms
dpkg: warning: ignoring request to remove amdgpu which isn't installed
dpkg: warning: ignoring request to remove amdgps-dkms which isn't installed
lordmatt@vision:/var/lib/dpkg$ sudo dpkg --configure -a
Setting up amdgpu-dkms (1:6.7.0.60103-1787201.22.04) ...
debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process: Resource temporarily unavailable
dpkg: error processing package amdgpu-dkms (--configure):
installed amdgpu-dkms package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
amdgpu-dkms
lordmatt@vision:/var/lib/dpkg$
I have Schrödinger's amdgps-dkms: it is both not installed and installed (pending post-install script) at the same time.
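One way to see what dpkg itself has recorded is to ask for the two-letter status code of each package. The sketch below is only an illustration: `explain_dpkg_status` and `list_amdgpu_packages` are hypothetical helper names, not part of dpkg, and they decode the abbreviations that `dpkg-query` prints for the `${db:Status-Abbrev}` field. It may also matter that the purge command in the transcript names `amdgps-dkms` while the configure step names `amdgpu-dkms`, so the two messages could be about different package names.

```shell
#!/bin/sh
# Hypothetical helper: decode a dpkg status abbreviation such as "ii" or "iF".
# First letter = desired state, second letter = current state.
explain_dpkg_status() {
    abbrev=$1
    want=$(printf '%s' "$abbrev" | cut -c1)
    have=$(printf '%s' "$abbrev" | cut -c2)
    case $want in
        u) want_txt=unknown ;;
        i) want_txt=install ;;
        h) want_txt=hold ;;
        r) want_txt=remove ;;
        p) want_txt=purge ;;
        *) want_txt='?' ;;
    esac
    case $have in
        n) have_txt=not-installed ;;
        c) have_txt=config-files ;;
        H) have_txt=half-installed ;;
        U) have_txt=unpacked ;;
        F) have_txt=half-configured ;;
        W) have_txt=triggers-awaiting ;;
        t) have_txt=triggers-pending ;;
        i) have_txt=installed ;;
        *) have_txt='?' ;;
    esac
    printf 'desired=%s current=%s\n' "$want_txt" "$have_txt"
}

# List every package whose name starts with "amdgpu", with its decoded status.
list_amdgpu_packages() {
    dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' 'amdgpu*' 2>/dev/null |
    while read -r abbrev pkg; do
        printf '%s: %s\n' "$pkg" "$(explain_dpkg_status "$abbrev")"
    done
}
```

Calling `list_amdgpu_packages` should print one line per matching package. For the debconf lock message, `sudo fuser -v /var/cache/debconf/config.dat` should show which process is holding the file, assuming the psmisc package is installed.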
Update 2
After a lot of faff and crashes, I undid whatever amdgpu nonsense I had set in motion. Now I just need some way to profile the initial problem or find a fix in some way. AMDGPU is not, it seems, my answer.
Update 3
So the original symptoms (system freezes) still exist. I can't tell if the GPU is the culprit. Freeze-ups have happened when running only Firefox.
Update 4
Running amdgpu-install --uninstall freed up some room but was not a fix. Crashes are more frequent. Sound keeps playing, though, even when the screen is frozen and no keyboard or mouse input is getting through (not even Caps Lock or Num Lock).
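Since sound keeps playing during a freeze, the kernel may still be running, in which case it may have logged something before the forced restart. A minimal sketch for pulling GPU-related kernel messages from the previous boot follows. The helper names are hypothetical, the keyword pattern is a guess rather than a complete list, and it assumes the systemd journal is persistent (that is, /var/log/journal exists).

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that mention the GPU driver
# or a hung task. The keyword list is an assumption, not exhaustive.
filter_gpu_lines() {
    grep -iE 'amdgpu|drm|hung task|blocked for more than'
}

# Print matching kernel messages from the boot before the current one.
# Assumes a persistent journal; otherwise "-b -1" has nothing to show.
previous_boot_gpu_log() {
    journalctl -k -b -1 --no-pager | filter_gpu_lines
}
```

After the next freeze and reboot, running `previous_boot_gpu_log` (with sudo if the user is not in the adm or systemd-journal group) may show whether the kernel noticed a problem with the graphics driver before the restart.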
Update 5 - running some commands as requested
free --mega -h
               total        used        free      shared  buff/cache   available
Mem:            132G        5.4G        125G        143M        2.9G        127G
Swap:           8.6G          0B        8.6G
cat /proc/sys/vm/swappiness
That returns 60. Is that good?
amdgpu
I tried to switch to amdgpu as I suspected that might be the problem. Oh, boy. The process hung; I watch Netflix; I go to bed; I get up and nothing has changed. It would get stuck setting up with the kernel. Most of my updates were me clawing back from that.
Here is a screeny of that last build step just going hell-for-leather and getting nowhere.
lsmod | grep amdgpu
Nothing.
Update 4 is where I got off that train (and got some system stability back). Along the way, I poked at a bunch of BIOS settings and learned a few hard lessons about not touching things I don't understand.
sudo hwinfo --gfxcard
sudo: hwinfo: command not found
I did a quick apt install and:
07: PCI 300.0: 0300 VGA compatible controller (VGA)
[Created at pci.386]
Unique ID: svHJ.+CDZH_5IkG4
Parent ID: B35A.Sa24RQSJfUB
SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0
SysFS BusID: 0000:03:00.0
Hardware Class: graphics card
Model: "ATI VGA compatible controller"
Vendor: pci 0x1002 "ATI Technologies Inc"
Device: pci 0x744c
SubVendor: pci 0x1eae "XFX Limited"
SubDevice: pci 0x7901
Revision: 0xc8
Memory Range: 0x40000000-0x4fffffff (ro,non-prefetchable)
Memory Range: 0x50000000-0x501fffff (ro,non-prefetchable)
I/O Ports: 0x5000-0x5fff (rw)
Memory Range: 0x50c00000-0x50cfffff (rw,non-prefetchable)
Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
IRQ: 11 (no events)
Module Alias: "pci:v00001002d0000744Csv00001EAEsd00007901bc03sc00i00"
Driver Info #0:
Driver Status: amdgpu is not active
Driver Activation Cmd: "modprobe amdgpu"
Config Status: cfg=new, avail=yes, need=no, active=unknown
Attached to: #12 (PCI bridge)
Primary display adapter: #7
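As a cross-check on the "Driver Status" line, the kernel exposes the bound driver for each PCI device as a symlink in sysfs. The sketch below reads that symlink. `bound_driver` is a hypothetical helper name, and the device path in the comment is taken from the SysFS BusID in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel driver bound to a PCI device,
# or "none" if no driver is bound. $1 is the device's sysfs directory.
bound_driver() {
    dev_dir=$1
    if [ -L "$dev_dir/driver" ]; then
        basename "$(readlink "$dev_dir/driver")"
    else
        echo none
    fi
}

# Example for the card in the output above (SysFS BusID 0000:03:00.0):
# bound_driver /sys/bus/pci/devices/0000:03:00.0
```

`lspci -k -s 03:00.0` should report the same information on its "Kernel driver in use" line, if one is bound.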
sudo kill and sudo killall
I threw both of them at the zombies and they just ignored me. System Monitor also zombied during these times. It had "technically" started but there was no GUI element. One time, it was running and I tried to go from the graphs to the process list. It was not happening. System Monitor has stopped responding. Click the option to end it. Get the message again like I had not done anything.
As I say, unkillable zombies. I've never seen anything quite like it.
Here's a capture of process 12388 refusing to be killed. The "Stop" button on Steam had a similar failure to make anything happen. I rebooted.
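It may be worth checking what state these processes are really in. A process that ignores kill -9 is often in uninterruptible sleep (state D, typically waiting on a driver or I/O) rather than being a true zombie (state Z), though that is only a guess for this case. The sketch below uses hypothetical helper names; the STAT letters are the ones documented for ps.

```shell
#!/bin/sh
# Hypothetical helper: explain the first letter of a ps STAT value.
describe_state() {
    case $1 in
        D*) echo 'uninterruptible sleep: kill -9 waits until the kernel call returns' ;;
        Z*) echo 'zombie: already dead, waiting for its parent to reap it' ;;
        T*) echo 'stopped' ;;
        R*) echo 'running or runnable' ;;
        S*) echo 'interruptible sleep' ;;
        *)  echo 'other' ;;
    esac
}

# Show PID, state, kernel wait channel and command name, then explain the state.
inspect_pid() {
    pid=$1
    ps -o pid,stat,wchan:32,comm -p "$pid" || return 1
    state=$(ps -o stat= -p "$pid" | tr -d ' ')
    describe_state "$state"
}
```

For the process in the capture, that would be `inspect_pid 12388` while it is still stuck. A D state with a wait channel that points into the graphics driver would suggest the process is blocked inside the kernel, not ignoring the signal.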
linux-crashdump
I also installed linux-crashdump at some point so there may be some super verbose files around I could go dig up.
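If any dumps were written, they would normally be under /var/crash on Ubuntu. A small sketch for listing them, newest first, follows. `list_crash_files` is a hypothetical helper name, and the default path is an assumption about this system.

```shell
#!/bin/sh
# Hypothetical helper: list crash artifacts in a directory, newest first.
# /var/crash is the usual location on Ubuntu; pass another path to override.
list_crash_files() {
    crash_dir=${1:-/var/crash}
    [ -d "$crash_dir" ] || { echo "no such directory: $crash_dir" >&2; return 1; }
    ls -1t "$crash_dir"
}
```

`kdump-config show` should report whether the crash kernel is actually loaded. A freeze that ends in a manual power cycle may leave nothing here, since a dump is only written when the kernel itself panics.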
Update 6
Following advice from comments
I ran:
- sudo apt install libgl1-mesa-dri mesa-opencl-icd mesa-va-drivers mesa-vdpau-drivers mesa-vulkan-drivers
- sudo systemctl stop gdm
- sudo modprobe -r radeon
- sudo modprobe amdgpu
- sudo systemctl start gdm
Then sudo hwinfo --gfxcard gave me:
07: PCI 300.0: 0300 VGA compatible controller (VGA)
[Created at pci.386]
Unique ID: svHJ.+CDZH_5IkG4
Parent ID: B35A.Sa24RQSJfUB
SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0
SysFS BusID: 0000:03:00.0
Hardware Class: graphics card
Model: "ATI VGA compatible controller"
Vendor: pci 0x1002 "ATI Technologies Inc"
Device: pci 0x744c
SubVendor: pci 0x1eae "XFX Limited"
SubDevice: pci 0x7901
Revision: 0xc8
Driver: "amdgpu"
Driver Modules: "amdgpu"
Memory Range: 0x4800000000-0x4fffffffff (ro,non-prefetchable)
Memory Range: 0x4400000000-0x44001fffff (ro,non-prefetchable)
I/O Ports: 0x5000-0x5fff (rw)
Memory Range: 0x50c00000-0x50cfffff (rw,non-prefetchable)
Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
IRQ: 205 (13478 events)
Module Alias: "pci:v00001002d0000744Csv00001EAEsd00007901bc03sc00i00"
Driver Info #0:
Driver Status: amdgpu is active
Driver Activation Cmd: "modprobe amdgpu"
Config Status: cfg=new, avail=yes, need=no, active=unknown
Attached to: #12 (PCI bridge)
Primary display adapter: #7
Success?
After all that help, I was able to run a game that previously crashed within minutes.
I'm getting far fewer crashes and they no longer zombie the system.

