How to troubleshoot total system hang

Question

I have a new System76 Lemur Pro laptop with Ubuntu 20.04. I really want to love it, but I'm finding that it's completely and totally locking up several times a week, which kind of puts a damper on my feelings. I'm in contact with System76 support, but I'm also trying to do some troubleshooting of my own. I'm fairly new to Linux and am hoping to learn not just how to fix my machine, but also general troubleshooting steps that would be useful in the future.

The system: System76 Lemur Pro, i7, 40 GB RAM, single SSD. Ubuntu 20.04. All updates installed. The only peripherals are a USB hub with a mouse and keyboard plugged in, and an external monitor connected through a USB-C to DisplayPort adapter. Nothing exotic.

The crash: Several times a week, I'll return to my laptop (usually in the morning after it sits idle all night) to find that it's totally unresponsive to mouse/keyboard. Using ALT+F_ to try to switch to a terminal does not do anything. ALT + PRTSCR + REISUB does not do anything. Hitting the power button does not do anything. Trying to turn on the internal LCD does not do anything. Only holding the power button down and hard-resetting the machine allows me to recover. It has happened only once while I was actively using the machine: the GNOME desktop stayed visible, the mouse and keyboard locked up, and about a quarter of a second of the song I was listening to got stuck in a loop. Again, nothing but a hard reset worked to recover.

What I've tried:

Stress testing CPU. I monitored CPU temps while running a stress test for several minutes. Temps never exceeded upper 80s, and the CPU fan kicked in to keep it under control. This seems safe, given that the hot/critical temps were listed as 100.
Running memtester. Looped through 5 times, everything passed.
Installing any updates recommended by Ubuntu.
Looking at system logs (/var/log/syslog). These logs simply go blank when the system hangs and stay blank until I hard reset it. Nothing immediately before the crash looks terribly interesting.
Disabling sleep. Was already disabled, but thought I'd mention it.

At this point, I'm not quite sure what my next steps should be. Are there other logs I can look at? Other diagnostics I can run? Should I assume it's a peripheral and disconnect keyboard/mouse/monitor/hub one at a time to try to isolate? Seems unlikely to be a common peripheral, but who knows.

Edit: as requested, here are logs from /var/log/kern.log right before one of the crashes. They include a lot of messages about CPU throttling. However, such messages also occur regularly when the computer is stable...

Oct 22 07:52:00 system76-pc kernel: [44320.095989] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 7775)
Oct 22 07:52:00 system76-pc kernel: [44320.095990] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 4669)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 719)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.095994] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.096970] mce: CPU2: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU0: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU5: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096973] mce: CPU3: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU6: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU7: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096975] mce: CPU4: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096976] mce: CPU1: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU6: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU7: Package temperature/speed normal

score 0 · Answer 1 · answered Oct 24 '20 at 15:57

This is a partial answer, based on current information, including from the comments.

From the log files, there are indications that high CPU temperatures are involved, such that the system keeps hitting its throttling temperature limit. However, CPU stress tests indicate no problem.

As a test, find the system operating point where CPU thermal problems are not possible and run that way for long enough to determine the effect on system stability. The cost of this test will be performance. Later on, a proper thermal daemon (thermald, tlp, ...) should be investigated as a way to recover maximum performance.

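The default CPU frequency scaling driver for the i7-10510U is intel_pstate, and this answer is written for that driver. It reports itself as intel_pstate in active mode and as intel_cpufreq in passive mode (as on my example computer); the settings used here exist in both modes. Check via: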

doug@s15:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu4/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu5/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu6/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu7/cpufreq/scaling_driver:intel_cpufreq
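
If you would rather have the mode spelled out than read the raw driver name, here is a minimal sketch. The helper name pstate_mode and the DRIVER_FILE override are my own inventions for illustration; only the sysfs path comes from the command above.

```shell
#!/bin/sh
# Map the scaling_driver name to the intel_pstate operating mode.
# pstate_mode is a hypothetical helper, not part of any package.
pstate_mode() {
    case "$1" in
        intel_pstate)  echo "intel_pstate (active mode)" ;;
        intel_cpufreq) echo "intel_pstate (passive mode)" ;;
        *)             echo "other driver: $1" ;;
    esac
}

# Standard cpufreq sysfs location for cpu0; set DRIVER_FILE to point elsewhere.
DRIVER_FILE="${DRIVER_FILE:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver}"
if [ -r "$DRIVER_FILE" ]; then
    pstate_mode "$(cat "$DRIVER_FILE")"
else
    echo "no cpufreq driver information at $DRIVER_FILE"
fi
```

If it prints "other driver", the max_perf_pct steps in the rest of this answer probably do not apply as written.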

The mprime (prime95) high-heat torture test is used as the CPU stress test because it consumes more energy than any other CPU stress test I have tried. To protect my example computer, which has no thermal daemon running, the desired operating point of about 80 degrees will be approached from the low side. First, note the current maximum CPU frequency percentage, and the minimum as well (yours will be different):

doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
42

It might not be 100% if some thermal daemon is already limiting things. Anyway, I will start at 50%:

doug@s15:~$ echo 50 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
50

Then gradually raise the maximum CPU frequency percent, say in 10 percent increments, and find the operating point for about 80 degrees processor package temperature:

doug@s15:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 6
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.25    1754    725     25      3.81    0.12
0.02    1600    288     26      3.70    0.12
0.06    1600    360     26      3.70    0.12
38.82   1899    7740    39      16.28   0.12
100.00  1900    17594   41      36.20   0.12   <<< mprime torture test started
100.00  1900    17541   42      36.44   0.12
100.00  1900    17552   43      36.39   0.12
100.00  1900    17517   44      36.25   0.12
100.00  1927    17474   48      36.95   0.12
100.00  2300    17389   49      46.51   0.12
100.00  2300    17367   50      46.60   0.12
100.00  2300    17362   52      46.69   0.12
100.00  2300    17438   53      46.77   0.12
100.00  2552    18440   56      54.18   0.12
100.00  2700    17672   58      58.48   0.12
100.00  2700    17590   58      58.59   0.12
100.00  2700    17710   61      58.74   0.12
100.00  2953    17780   66      67.91   0.12
100.00  3100    17876   68      73.38   0.12  <<<< First time at 80%, temp lags.
100.00  3100    17843   69      73.55   0.12
100.00  3100    17860   70      73.64   0.12
100.00  3100    18794   71      73.78   0.12
100.00  3231    17826   77      79.69   0.12
100.00  3500    18305   80      92.33   0.12
100.00  3500    17765   81      92.66   0.12
100.00  3457    17747   80      90.72   0.12
100.00  3300    17720   81      82.62   0.12
100.00  3300    17723   81      82.72   0.12
100.00  3300    17708   80      82.81   0.12
100.00  3300    17712   83      82.95   0.12  <<<< Oops, too high
100.00  3300    17788   82      83.08   0.12
100.00  3204    17882   81      79.25   0.12
100.00  3100    17778   80      74.78   0.12
100.00  3100    18571   81      74.83   0.12
100.00  3100    17806   80      74.85   0.12
100.00  3100    17787   80      74.89   0.12 <<<< 80 percent seems stable
100.00  3100    17772   81      74.84   0.12
100.00  3100    17824   81      74.85   0.12
100.00  3100    17777   80      74.89   0.12
100.00  3100    17799   81      74.95   0.12
100.00  3100    17867   81      74.77   0.12

So, for my system, limiting the CPU frequency to 80% of maximum keeps the processor away from any additional built-in thermal throttling. Run the system this way for a while.
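
The ramp from 50% upward was done by hand, but it could be scripted roughly as follows. This is only a sketch: the function names, the 50 to 100 range, the 10 percent step, and the 60 second settle time are all assumptions to adjust for your machine, and the stress test and turbostat still need to run in other terminals.

```shell
#!/bin/sh
# Sketch: raise max_perf_pct in steps, pausing so the temperature can settle.
# PCT_FILE is the intel_pstate sysfs file used above; override it to experiment safely.
PCT_FILE="${PCT_FILE:-/sys/devices/system/cpu/intel_pstate/max_perf_pct}"

# Print the percentages to try: start, start+step, ... up to and including end.
pct_steps() {
    start=$1; end=$2; step=$3
    pct=$start
    while [ "$pct" -le "$end" ]; do
        echo "$pct"
        pct=$((pct + step))
    done
}

# Write each step to PCT_FILE (needs root for the real file), waiting $1 seconds between steps.
ramp_max_perf() {
    settle=$1
    for pct in $(pct_steps 50 100 10); do
        echo "$pct" > "$PCT_FILE" || return 1
        echo "max_perf_pct=$pct, waiting ${settle}s; watch PkgTmp in turbostat"
        sleep "$settle"
    done
}

# Only ramp when asked explicitly, for example: sudo sh ramp.sh run
if [ "${1:-}" = "run" ]; then
    ramp_max_perf 60
fi
```

Interrupt it with Ctrl-C once PkgTmp reaches about 80 degrees, then write the last comfortable value back with the echo ... | sudo tee command shown earlier. Note that the script does not stop by itself; left alone it ends at 100%.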

score 0 · Accepted Answer · answered Nov 02 '20 at 13:01

This is a kernel bug associated with CPU power management. It's fixed in kernel 5.8, which ships with Ubuntu 20.10. I upgraded to 20.10, turned off all the workarounds, and the machine has been stable since.

If upgrading to 5.8/20.10 isn't something you want to do, you can also work around the bug by keeping your CPU from going into lower-power states (this will reduce battery life, obviously). Open /etc/default/grub and append intel_idle.max_cstate=1 to the value of GRUB_CMDLINE_LINUX_DEFAULT. Save, run sudo update-grub, then reboot. To undo the workaround, remove the parameter, run sudo update-grub again, and reboot.
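
For anyone who prefers not to edit the file by hand, the change can be sketched as a small shell function. add_kernel_param is a made-up helper name, and the sketch assumes the GRUB_CMDLINE_LINUX_DEFAULT value is double-quoted on a single line, as it is in a stock Ubuntu /etc/default/grub. Try it on a copy of the file first.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults file.
# add_kernel_param is a hypothetical helper; usage: add_kernel_param FILE PARAM
add_kernel_param() {
    file=$1; param=$2
    # Do nothing if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote; sed keeps a .bak copy of the file.
    sed -i.bak "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}
```

Called as root with /etc/default/grub and intel_idle.max_cstate=1 as the two arguments, it would be followed by sudo update-grub and a reboot, exactly as in the manual steps. After the reboot, cat /proc/cmdline should show the parameter if it took effect.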

It's possible a cstate value higher than 1 would still be a stable workaround, but I never experimented enough to verify.
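
If you do want to experiment, it helps to first see which idle states the kernel exposes and how often each has been entered. A minimal sketch using the standard cpuidle sysfs directory follows; list_idle_states and the CPUIDLE_DIR override are names I made up. I have not verified how particular max_cstate values map onto these states for this CPU, so treat the output as a starting point only.

```shell
#!/bin/sh
# Sketch: list the idle states the kernel exposes for one CPU, with entry counts.
CPUIDLE_DIR="${CPUIDLE_DIR:-/sys/devices/system/cpu/cpu0/cpuidle}"

# Print "stateN NAME usage=COUNT" for every state directory under $1.
list_idle_states() {
    dir=$1
    for state in "$dir"/state*; do
        # Skip the unexpanded glob or unreadable entries.
        [ -r "$state/name" ] || continue
        printf '%s %s usage=%s\n' "${state##*/}" "$(cat "$state/name")" "$(cat "$state/usage")"
    done
}

list_idle_states "$CPUIDLE_DIR"
```

Running it before and after applying the workaround gives a rough check: with the limit in place, the usage counts for the deeper states should stop increasing.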
