I have a new System76 Lemur Pro laptop with Ubuntu 20.04. I really want to love it, but it locks up completely several times a week, which puts a damper on my feelings. I'm in contact with System76 support, but I'm also trying to do some troubleshooting of my own. I'm fairly new to Linux and am hoping to learn not just how to fix my machine, but also general troubleshooting steps that would be useful in the future.
The system: System76 Lemur Pro, i7, 40 GB RAM, single SSD. Ubuntu 20.04. All updates installed. The only peripherals are a USB hub with a mouse and keyboard plugged in, and an external monitor connected via a USB-C-to-DisplayPort adapter. Nothing exotic.
The crash: Several times a week, I'll return to my laptop (usually in the morning after it has sat idle all night) to find that it's totally unresponsive to mouse and keyboard. Using ALT+F_ to try to switch to a terminal does nothing. ALT + PRTSCR + REISUB does nothing. Pressing the power button does nothing. Trying to turn on the internal LCD does nothing. Only holding the power button down to hard-reset the machine lets me recover. It has happened only once while I was actively using the machine: the GNOME desktop stayed visible, the mouse and keyboard locked up, and about 1/4 of a second of the song I was listening to got stuck in a loop. Again, nothing but a hard reset worked.
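One thing worth ruling out for the REISUB attempt: the kernel only honors the SysRq keys that the bitmask in /proc/sys/kernel/sysrq allows. Below is a minimal sketch that decodes that bitmask, using the bit meanings from the kernel's sysrq documentation. The function names are my own, and the value 176 is only an example fallback, not something I have confirmed on this machine. Even with every key enabled, a true hardware lockup may still ignore them.

```python
# Bit meanings as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "console log level",
    4: "keyboard control (SAK, unraw)",
    8: "process debugging dumps",
    16: "sync",
    32: "remount read-only",
    64: "signal processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nice all real-time tasks",
}

def decode_sysrq(mask):
    """Return the SysRq function groups enabled by `mask`."""
    if mask == 0:
        return []                          # 0 disables SysRq entirely
    if mask == 1:
        return list(SYSRQ_BITS.values())   # 1 enables every function
    return [name for bit, name in SYSRQ_BITS.items() if mask & bit]

def reisub_allowed(mask):
    """Map each key of the REISUB sequence to whether `mask` permits it."""
    if mask == 1:
        return {key: True for key in "REISUB"}
    needed = {"R": 4, "E": 64, "I": 64, "S": 16, "U": 32, "B": 128}
    return {key: bool(mask & bit) for key, bit in needed.items()}

if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/sysrq") as f:
            mask = int(f.read().strip())
    except (OSError, ValueError):
        mask = 176  # example value only, used when the file is unreadable
    print(mask, decode_sysrq(mask))
    print(reisub_allowed(mask))
```

If some keys come back as not permitted, that alone would explain part of the sequence being ignored, independent of the hang itself.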
What I've tried:
- Stress testing the CPU. I monitored CPU temps while running a stress test for several minutes. Temps never exceeded the upper 80s °C, and the CPU fan kicked in to keep them under control. This seems safe, given that the hot/critical thresholds were listed as 100 °C.
- Running memtester. Looped through 5 times, everything passed.
- Installing any updates recommended by Ubuntu.
- Looking at system logs (/var/log/syslog). The entries simply stop when the system hangs and do not resume until I hard reset it. Nothing immediately before the crash looks terribly interesting.
- Disabling sleep. Was already disabled, but thought I'd mention it.
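To pin down when each hang began, the silent stretches in /var/log/syslog can be located automatically. This is a rough sketch, assuming the classic syslog timestamp prefix (for example "Oct 22 07:52:00"), which carries no year, so one has to be supplied. The function name and the 10-minute threshold are my own choices, and an idle machine can also produce quiet periods, so each gap is only a candidate.

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where consecutive timestamped
    entries are more than `min_gap` apart."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters hold the 'Mon DD HH:MM:SS' prefix.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous > min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps

if __name__ == "__main__":
    sample = [
        "Oct 22 01:00:00 system76-pc example: first entry",
        "Oct 22 01:05:00 system76-pc example: last entry before silence",
        "Oct 22 07:52:00 system76-pc example: first entry after reset",
    ]
    for last_seen, next_seen in find_log_gaps(sample, 2020):
        print(f"silent from {last_seen} to {next_seen}")
```

For the real log, the same function could be fed `open("/var/log/syslog")`; the sample lines above are made up for illustration and are not from my machine.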
At this point, I'm not quite sure what my next steps should be. Are there other logs I can look at? Other diagnostics I can run? Should I assume it's a peripheral and disconnect keyboard/mouse/monitor/hub one at a time to try to isolate? Seems unlikely to be a common peripheral, but who knows.
Edit: as requested, here are the logs from /var/log/kern.log right before one of the crashes. They include a lot of messages about CPU throttling. However, such messages also occur regularly when the computer is stable...
Oct 22 07:52:00 system76-pc kernel: [44320.095989] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 7775)
Oct 22 07:52:00 system76-pc kernel: [44320.095990] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 4669)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 719)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.095994] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.096970] mce: CPU2: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU0: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU5: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096973] mce: CPU3: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU6: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU7: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096975] mce: CPU4: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096976] mce: CPU1: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU6: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU7: Package temperature/speed normal
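Since these throttling messages also appear when the machine is stable, it may help to compare the counters between a boot that ended in a hang and one that did not. Here is a small sketch that pulls the "total events" counter per CPU out of lines like the ones above. The regular expression is written against this excerpt only, and the function name is hypothetical; other kernel versions may word the message differently.

```python
import re
from collections import defaultdict

# Matches e.g. "mce: CPU4: Package temperature above threshold,
# cpu clock throttled (total events = 7775)"
THROTTLE_RE = re.compile(
    r"mce: CPU(\d+): (Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)

def throttle_totals(lines):
    """Return {(cpu, scope): highest 'total events' counter seen}."""
    totals = defaultdict(int)
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            cpu = int(match.group(1))
            scope = match.group(2)
            count = int(match.group(3))
            totals[(cpu, scope)] = max(totals[(cpu, scope)], count)
    return dict(totals)

if __name__ == "__main__":
    excerpt = [
        "Oct 22 07:52:00 system76-pc kernel: [44320.095989] mce: CPU4: "
        "Package temperature above threshold, cpu clock throttled (total events = 7775)",
        "Oct 22 07:52:00 system76-pc kernel: [44320.096970] mce: CPU2: "
        "Package temperature/speed normal",
    ]
    for (cpu, scope), count in sorted(throttle_totals(excerpt).items()):
        print(f"CPU{cpu} {scope}: {count} throttle events")
```

Reading the real file would look like `throttle_totals(open("/var/log/kern.log"))`, which may need elevated permissions depending on how the log is set up.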