
I have a new System76 Lemur Pro laptop with Ubuntu 20.04. I really want to love it, but I'm finding that it's completely and totally locking up several times a week, which kind of puts a damper on my feelings. I'm in contact with System76 support, but I'm also trying to do some troubleshooting of my own. I'm fairly new to Linux and am hoping to learn not just how to fix my machine, but also general troubleshooting steps that would be useful in the future.

The system: System76 Lemur Pro, i7, 40 GB RAM, single SSD. Ubuntu 20.04. All updates installed. The only peripherals are a USB hub with a mouse and keyboard plugged in, and an external monitor connected through a USB-C to DisplayPort adapter. Nothing exotic.

The crash: Several times a week, I'll return to my laptop (usually in the morning after it has sat idle all night) to find that it's totally unresponsive to mouse and keyboard. Using ALT+F_ to try to switch to a terminal does not do anything. ALT + PRTSCR + REISUB does not do anything. Hitting the power button does not do anything. Trying to turn on the internal LCD does not do anything. Only holding the power button down and hard-resetting the machine allows me to recover. It has happened only once while I was actively using the machine: the GNOME desktop stayed visible, the mouse and keyboard locked up, and about a quarter of a second of the song I was listening to got stuck in a loop. Again, nothing but a hard reset worked to recover.

What I've tried:

  • Stress testing CPU. I monitored CPU temps while running a stress test for several minutes. Temps never exceeded the upper 80s (°C), and the CPU fan kicked in to keep them under control. This seems safe, given that the hot/critical temps were listed as 100 °C.
  • Running memtester. Looped through 5 times, everything passed.
  • Installing any updates recommended by Ubuntu.
  • Looking at system logs (/var/log/syslog). The log entries simply stop when the system hangs and do not resume until I hard reset it. Nothing immediately before the crash looks terribly interesting.
  • Disabling sleep. Was already disabled, but thought I'd mention it.

At this point, I'm not quite sure what my next steps should be. Are there other logs I can look at? Other diagnostics I can run? Should I assume it's a peripheral and disconnect keyboard/mouse/monitor/hub one at a time to try to isolate? Seems unlikely to be a common peripheral, but who knows.

Edit: as requested, here are the logs from /var/log/kern.log right before one of the crashes. They include a lot of messages about CPU throttling. However, such messages occur regularly when the computer is stable as well...

Oct 22 07:52:00 system76-pc kernel: [44320.095989] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 7775)
Oct 22 07:52:00 system76-pc kernel: [44320.095990] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 4669)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 719)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.095994] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.096970] mce: CPU2: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU0: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU5: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096973] mce: CPU3: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU6: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU7: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096975] mce: CPU4: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096976] mce: CPU1: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU6: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU7: Package temperature/speed normal
John Chrysostom

2 Answers


This is a partial answer, based on current information, including from the comments.

The log files indicate that high CPU temperatures are involved: the system keeps hitting its thermal throttling limit. However, the CPU stress tests showed no problem.
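To get a rough feel for how often throttling is being logged, the "temperature above threshold" messages can be tallied per CPU. The following is only a sketch: the function name throttle_summary is made up for illustration, and it assumes the log lines have the same shape as the kern.log excerpt in the question (ending in "(total events = N)").

```shell
# Hypothetical helper: summarise thermal throttle messages per CPU.
# Usage: throttle_summary [logfile]   (defaults to /var/log/kern.log)
# Assumes lines shaped like the kern.log excerpt in the question.
throttle_summary() {
    log="${1:-/var/log/kern.log}"
    awk '
    /temperature above threshold/ {
        cpu = ""
        # Find the "CPUn:" field wherever it sits on the line.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^CPU[0-9]+:$/) { cpu = $i; sub(/:$/, "", cpu); break }
        if (cpu == "") next
        lines[cpu]++                 # number of log lines seen for this CPU
        ev = $NF; sub(/\)$/, "", ev) # last field is "N)" from "(total events = N)"
        last[cpu] = ev               # most recent kernel-reported event counter
    }
    END {
        for (c in lines)
            printf "%s lines=%d last_total_events=%s\n", c, lines[c], last[c]
    }' "$log" | sort
}
```

Comparing the output for a stable day against a day with a lockup may show whether the throttling is actually any different before a hang.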

As a test, find the system operating point where CPU thermal problems are not possible and run that way for long enough to determine the effect on system stability. The cost of this test will be performance. Later on, a proper thermal daemon (thermald, tlp, ...) should be investigated as a way to recover maximum performance.

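The default CPU frequency scaling driver for the i7-10510U is intel_pstate, and this answer is written for that driver. Check via the command below; note that when intel_pstate runs in passive mode, as on my example computer, it reports itself as intel_cpufreq: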

doug@s15:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu4/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu5/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu6/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu7/cpufreq/scaling_driver:intel_cpufreq

The mprime (prime95) high-heat torture test is used as the CPU stress test because it consumes the most energy of any CPU stress test I have tried. To protect my example computer, which has no thermal daemon running, the desired operating point of about 80 degrees will be approached from the low side. First, note the current maximum and minimum CPU frequency percentages (yours will be different):

doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
42

It might not be 100% if some thermal daemon is already limiting things. Anyway, I will start at 50%:

doug@s15:~$ echo 50 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
50

Then gradually raise the maximum CPU frequency percent, say in 10 percent increments, and find the operating point for about 80 degrees processor package temperature:

doug@s15:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 6
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.25    1754    725     25      3.81    0.12
0.02    1600    288     26      3.70    0.12
0.06    1600    360     26      3.70    0.12
38.82   1899    7740    39      16.28   0.12
100.00  1900    17594   41      36.20   0.12    <<< mprime torture test started
100.00  1900    17541   42      36.44   0.12
100.00  1900    17552   43      36.39   0.12
100.00  1900    17517   44      36.25   0.12
100.00  1927    17474   48      36.95   0.12
100.00  2300    17389   49      46.51   0.12
100.00  2300    17367   50      46.60   0.12
100.00  2300    17362   52      46.69   0.12
100.00  2300    17438   53      46.77   0.12
100.00  2552    18440   56      54.18   0.12
100.00  2700    17672   58      58.48   0.12
100.00  2700    17590   58      58.59   0.12
100.00  2700    17710   61      58.74   0.12
100.00  2953    17780   66      67.91   0.12
100.00  3100    17876   68      73.38   0.12    <<<< First time at 80%, temp lags.
100.00  3100    17843   69      73.55   0.12
100.00  3100    17860   70      73.64   0.12
100.00  3100    18794   71      73.78   0.12
100.00  3231    17826   77      79.69   0.12
100.00  3500    18305   80      92.33   0.12
100.00  3500    17765   81      92.66   0.12
100.00  3457    17747   80      90.72   0.12
100.00  3300    17720   81      82.62   0.12
100.00  3300    17723   81      82.72   0.12
100.00  3300    17708   80      82.81   0.12
100.00  3300    17712   83      82.95   0.12    <<<< Opps too high
100.00  3300    17788   82      83.08   0.12
100.00  3204    17882   81      79.25   0.12
100.00  3100    17778   80      74.78   0.12
100.00  3100    18571   81      74.83   0.12
100.00  3100    17806   80      74.85   0.12
100.00  3100    17787   80      74.89   0.12    <<<< 80 percent seems stable
100.00  3100    17772   81      74.84   0.12
100.00  3100    17824   81      74.85   0.12
100.00  3100    17777   80      74.89   0.12
100.00  3100    17799   81      74.95   0.12
100.00  3100    17867   81      74.77   0.12
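When stepping through several settings, it can be easier to pull the peak package temperature out of saved turbostat output than to watch the numbers scroll by. This is a minimal sketch: peak_pkgtmp is a hypothetical helper name, and it assumes the output contains a header row with a PkgTmp column, as in the run above.

```shell
# Hypothetical helper: print the highest PkgTmp value found in turbostat
# output read from stdin. The column is located by name from the header row,
# so extra trailing annotations on a data line do not matter.
peak_pkgtmp() {
    awk '
    col == 0 {                       # still looking for the header row
        for (i = 1; i <= NF; i++) if ($i == "PkgTmp") col = i
        next
    }
    # Only consider purely numeric values (skips repeated headers, blank lines).
    $col ~ /^[0-9]+$/ && $col + 0 > max { max = $col + 0 }
    END { if (col) print max + 0; else exit 1 }
    '
}
```

For example, assuming the output was saved to a file called turbostat.log (a made-up name), `peak_pkgtmp < turbostat.log` prints the single highest package temperature in that file.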

So, for my system, limiting the CPU frequency to 80% of maximum keeps the processor away from any additional built-in thermal throttling. Run the system this way for a while and see whether the lockups still occur.
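If the limit is going to be changed repeatedly, a small wrapper with some sanity checking can help avoid typos. This is a sketch only: set_max_perf_pct and the PSTATE_DIR variable are names invented here, the function has to be run from a root shell on a real system, and it assumes the intel_pstate sysfs directory used in the commands above.

```shell
# Directory holding the intel_pstate controls (same path as used above).
# PSTATE_DIR is a made-up override so the function can be tried on a scratch directory.
PSTATE_DIR="${PSTATE_DIR:-/sys/devices/system/cpu/intel_pstate}"

# Hypothetical helper: validate a percentage and write it to max_perf_pct.
# Usage (as root): set_max_perf_pct 80
set_max_perf_pct() {
    pct="$1"
    case "$pct" in
        ''|*[!0-9]*)
            echo "usage: set_max_perf_pct <1-100>" >&2
            return 1 ;;
    esac
    if [ "$pct" -lt 1 ] || [ "$pct" -gt 100 ]; then
        echo "percent must be between 1 and 100" >&2
        return 1
    fi
    printf '%s\n' "$pct" > "$PSTATE_DIR/max_perf_pct"
}
```

Values written to sysfs do not survive a reboot, so the limit has to be reapplied after each boot for the duration of the test. Writing back the value noted earlier (100 in my case) restores the original behaviour.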

Doug Smythies

This is a kernel bug associated with CPU power management. It's fixed in kernel 5.8, which comes with Ubuntu 20.10. I upgraded to 20.10, turned off all the workarounds, and am running stable now.

If upgrading to 5.8/20.10 isn't something you want to do, you can also work around the bug by keeping your CPU from going into lower-power states (this will reduce battery life, obviously). Open /etc/default/grub and add intel_idle.max_cstate=1 inside the quoted value of GRUB_CMDLINE_LINUX_DEFAULT. Save, run sudo update-grub, then reboot. Reverse the process to undo the workaround.

It's possible a cstate value higher than 1 would still be a stable workaround, but I never experimented enough to verify.
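After rebooting, it's worth confirming that the parameter actually reached the kernel. Here's a rough sketch of one way to check; has_kernel_param is a name I made up, and the example GRUB line in the comment assumes the stock Ubuntu "quiet splash" default, which may not match your file.

```shell
# Example of the edited line in /etc/default/grub (assumes the stock
# "quiet splash" default; keep whatever options your file already has):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1"

# Hypothetical helper: succeed if the given parameter appears as a whole
# word on the kernel command line.
# Usage: has_kernel_param intel_idle.max_cstate=1 [cmdline_file]
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Split the command line into one parameter per line, then match exactly.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

Run `has_kernel_param intel_idle.max_cstate=1 && echo applied` after the reboot. If the intel_idle driver is in use, /sys/module/intel_idle/parameters/max_cstate should also show the value.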

John Chrysostom