Asked: Thu, February 3, 2022, 6:03:47 / answers: 1 / hits: 2665

tl;dr: I have a machine with Ubuntu Server which I want to run 24/7, but it tends to shut down every day. What can I look into? I have checked a few things (see below), but I have not been able to fix it and I am running out of ideas :)


Context


The Problem


I have a new machine (Lenovo P340) and I want to run it uninterruptedly. This is a Desktop, which is permanently connected to Power. More accurately, to a UPS, although power outages are not a problem where I live.


The machine runs Ubuntu Server with Docker (mostly web services in containers), but after a few hours it tends to "shut down" by itself. I would love to know how to fix this.


What I mean by Shutdown


When I say shutdown, I mean that I cannot interact with the machine: none of the docker apps run, and I cannot ssh into the machine. If I connect a keyboard and press Enter, or try to switch virtual consoles with Ctrl+Alt+F1/F3, nothing happens. I do not think it shuts down completely; it may be entering a power-saving mode, since the light on the desktop stays on.


This happens while the docker apps are running, and sometimes even while I am connected via ssh or samba (just connected, without any actual input from me).


I have never seen it happening while I am actually active on the machine (e.g. while I am on ssh running commands or executing things). This is what makes me think that this might be related to power management. However, it could just be that it didn't happen at the same time. The shutdown happens around once per day and at different times. It can happen in the morning, afternoon or evening.


The only way I've found to "get out" of it is to hold the power button until the machine turns off and then switch it on again. One soft press (as in, to perhaps "unlock" it) didn't seem to work, but I can't be sure without a proper way to interact with the machine.


Troubleshooting


I have been reading, testing and collecting data. Putting some of the things I've tried below.


Hypothesis: System not up to date


I currently have Ubuntu 20.04.1 LTS (GNU/Linux 5.6.0-1042-oem x86_64) and I periodically run updates on the machine. I've done it as part of this troubleshooting.


Hypothesis: Memory is overloaded and system shuts down


One of the most common reasons I've found is that perhaps the memory or CPU are overloaded. The machine is new. The CPU is an Intel Core i9-10900 2.8G 10C vPro with 64GB of RAM while I am just running a few (~10) containers.


I have also been taking snapshots of memory usage every 15 minutes and storing them. This is an example of top -b -o %MEM -n 1 > top.txt just before "stopping".


top - 06:30:01 up 1 day, 11:10,  0 users,  load average: 0.05, 0.03, 0.02
Tasks: 511 total, 1 running, 508 sleeping, 0 stopped, 2 zombie
%Cpu(s): 0.6 us, 0.3 sy, 0.0 ni, 99.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64033.5 total, 43098.9 free, 7544.9 used, 13389.7 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 56011.2 avail Mem

Memory usage is about 8GB out of 64GB. The only thing that catches my attention is the 2 zombie tasks; otherwise no process is using much memory or CPU.
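
As a rough cross-check of that figure, the used-memory share can be pulled straight out of a saved top snapshot. This is only a sketch: the function name mem_used_pct is made up for illustration, and it assumes the summary line has the "MiB Mem : ... total, ... used," layout shown above.

```shell
# Hypothetical helper: print used memory as a percentage of total,
# reading a `top -b -n 1` snapshot on stdin.
mem_used_pct() {
  awk '/^MiB Mem/ {
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^total/) total = $(i - 1)   # number just before "total,"
      if ($i ~ /^used/)  used  = $(i - 1)   # number just before "used,"
    }
    if (total > 0) printf "%.1f\n", used / total * 100
  }'
}
```

With the snapshot above, mem_used_pct < top.txt works out to 7544.9 / 64033.5, i.e. roughly 12% used. The zombie tasks can be listed separately with ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'; zombies hold no memory themselves, so they are unlikely to matter here.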


Hypothesis: UPS monitor is telling the machine to shut down


The server is connected to a UPS. I thought this could be the cause, as the UPS monitor sometimes failed to connect (although that usually didn't switch the machine off). However, after completely disabling the upsmon configuration, the UPS is no longer monitored, I see no logs from it, and the problem still happens.


Aside note: In terms of power, the machine is still connected to the UPS, just not monitoring its status and I live in an area where power cuts are uncommon.


Hypothesis: Machine is going to sleep if not being used


I am taking snapshots every 30 minutes. The snapshots are run to keep the machine awake. They are run by Jenkins, which connects to the server over ssh and runs the two commands that capture logs and memory usage. I would expect this to count as interaction. For one full day I tried running them every 5 minutes, and that day no "shutdowns" happened. I am not sure whether that was a coincidence or an effect of the more frequent runs, so I am testing again to see the results.
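
One way to test this hypothesis directly is to look for suspend traces in the journal of the boot that ended in a "shutdown". The sketch below is an assumption-laden filter, not a definitive check: sleep_markers is a made-up name, and the three patterns are messages that the kernel and systemd typically log around a suspend, which may differ between versions.

```shell
# Hypothetical helper: keep only journal lines that usually accompany a
# suspend, reading journal text on stdin.
sleep_markers() {
  grep -E 'PM: suspend entry|Reached target Sleep|systemd-sleep'
}
```

Typical use would be sudo journalctl -b -1 | sleep_markers. Empty output suggests the machine never started an orderly suspend during that boot. If such lines do show up, sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target is the usual way to rule sleep out on a server.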


Hypothesis: GUI has power management and I should remove it


I had a GUI installed, but I have already uninstalled it on the advice of @guiverc in the comments.


This used to be the sessions I had:


nito-server:~$ ls /usr/share/xsessions/
gnome-xorg.desktop gnome.desktop ubuntu.desktop

I have followed this tutorial, this answer and this other answer to remove both gnome and ubuntu desktop.


After running this, now there are no more GUIs shown in:


nito-server:~$ ls /usr/share/xsessions/

Despite this, the system still shuts down periodically.


Hypothesis: Disabling power interface via GRUB would solve it


I researched a bit on this forum and I often saw changes to GRUB to configure Power Interface. I have tried different variants and progressively increasing it:



  • GRUB_CMDLINE_LINUX_DEFAULT="text"

  • GRUB_CMDLINE_LINUX_DEFAULT="text acpi=force"

  • GRUB_CMDLINE_LINUX_DEFAULT="text nomodeset acpi=force"

  • GRUB_CMDLINE_LINUX_DEFAULT="text nomodeset pci=noaer"

  • GRUB_CMDLINE_LINUX_DEFAULT="text nomodeset acpi=force pci=noaer"
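
None of these settings take effect until the GRUB configuration is regenerated (sudo update-grub on Ubuntu) and the machine is rebooted. A small sketch for confirming that a parameter really reached the running kernel follows; has_kernel_param is a name invented here, and the optional second argument exists only so the function can be tried against a file other than /proc/cmdline.

```shell
# Hypothetical helper: succeed if the exact parameter $1 appears on the
# kernel command line (read from /proc/cmdline unless $2 names another file).
has_kernel_param() {
  tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example, has_kernel_param pci=noaer && echo "active" prints "active" only when that option is on the command line of the kernel that is currently running.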


I checked the shutdown logs running sudo journalctl -b -1 -e


Jan 18 05:33:55 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:36:26 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:36:26 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:36:26 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:36:26 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:36:39 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:36:39 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:36:39 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:36:39 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:37:39 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:37:39 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:37:39 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:37:39 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:39:05 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:39:05 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:39:05 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:39:05 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:43:31 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:43:31 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:43:31 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:43:31 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:48:59 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:48:59 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:48:59 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:48:59 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:49:07 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:49:07 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:49:07 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:49:07 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:50:02 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:50:02 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:50:02 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:50:02 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr
Jan 18 05:50:32 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:50:32 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:50:32 nito-server kernel: nvme 0000:04:00.0: AER: device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:50:32 nito-server kernel: nvme 0000:04:00.0: AER: [ 0] RxErr

After checking this, the last configuration is currently set to: GRUB_CMDLINE_LINUX_DEFAULT="text pcie_aspm=off".


This seems to make the problem happen less often, but it still occurs.
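
To see whether a kernel parameter change actually reduces the AER noise, it may help to count the corrected-error events per boot instead of eyeballing the log. A minimal sketch, assuming the message format shown above; aer_counts is a hypothetical name.

```shell
# Hypothetical helper: count "AER: Corrected error received" events per
# reporting device, reading journal text on stdin.
aer_counts() {
  awk '/AER: Corrected error received:/ { n[$NF]++ }
       END { for (dev in n) print dev, n[dev] }'
}
```

Running sudo journalctl -k -b -1 | aer_counts before and after a change gives comparable numbers. In the excerpt above every event points at the same NVMe device (0000:04:00.0) behind root port 0000:00:1b.4. These are corrected errors, so by themselves they do not prove the drive is what stops the machine.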


Hypothesis: System (BIOS) has a power savings mode


I've checked the BIOS. There's an Enhanced Power Saving Mode. However, (1) this is about entering power saving mode when it's already Off, not about turning off; and (2) it's Disabled. In Power, most features were about automatic or controlled power on. Nothing else related.


Hypothesis: A cron job drives the machine to shut down


There are no cron jobs configured on the machine directly. The only crons come from Jenkins which is configured inside a docker container.


crontab -l shows no crontab for nito


ll /etc/cron.hourly/ shows:


total 20
drwxr-xr-x 2 root root 4096 Aug 1 00:28 ./
drwxr-xr-x 134 root root 12288 Jan 29 16:37 ../
-rw-r--r-- 1 root root 102 Feb 14 2020 .placeholder
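
The same check can be repeated for the other cron directories (/etc/cron.daily, /etc/cron.weekly, /etc/cron.monthly) with a small helper. This is a sketch only; cron_scripts is a made-up name, and it looks purely at file type and the owner's execute bit.

```shell
# Hypothetical helper: list executable regular files directly inside a
# cron directory, skipping the .placeholder file that ships with the package.
cron_scripts() {
  find "$1" -maxdepth 1 -type f -perm -100 ! -name '.placeholder'
}
```

For instance, cron_scripts /etc/cron.hourly prints nothing for the listing above. run-parts --test /etc/cron.hourly reports what cron itself would execute, and systemctl list-timers --all covers systemd timers, which crontab -l does not show.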

CURRENT STATUS AND LOGS


After all of the above, the machine stabilized for a while, but shutdowns still happen every 48-72h. These are the last journal logs (sudo journalctl -b -1 -e):


Jan 22 07:03:00 nito-server sshd[74442]: pam_unix(sshd:session): session opened for user nito by (uid=0)
Jan 22 07:03:00 nito-server systemd[1]: Created slice User Slice of UID 1000.
Jan 22 07:03:00 nito-server systemd[1]: Starting User Runtime Directory /run/user/1000...
Jan 22 07:03:00 nito-server systemd-logind[1131]: New session 61 of user nito.
Jan 22 07:03:00 nito-server systemd[1]: Finished User Runtime Directory /run/user/1000.
Jan 22 07:03:00 nito-server systemd[1]: Starting User Manager for UID 1000...
Jan 22 07:03:00 nito-server systemd[74473]: pam_unix(systemd-user:session): session opened for user nito by (uid=0)
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Paths.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Timers.
Jan 22 07:03:00 nito-server systemd[74473]: Starting D-Bus User Message Bus Socket.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG network certificate management daemon.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent and passphrase cache.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on debconf communication socket.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on Sound System.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on REST API socket for snapd user session agent.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on D-Bus User Message Bus Socket.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Sockets.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Basic System.
Jan 22 07:03:00 nito-server systemd[1]: Started User Manager for UID 1000.
Jan 22 07:03:00 nito-server systemd[74473]: Starting Sound Service...
Jan 22 07:03:00 nito-server systemd[1]: Started Session 61 of user nito.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server systemd[74473]: Started D-Bus User Message Bus.
Jan 22 07:03:00 nito-server dbus-daemon[74579]: [session uid=1000 pid=74579] AppArmor D-Bus mediation is enabled
Jan 22 07:03:00 nito-server systemd[74473]: Started Sound Service.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Main User Target.
Jan 22 07:03:00 nito-server systemd[74473]: Startup finished in 122ms.
Jan 22 07:03:00 nito-server bluetoothd[1105]: Endpoint registered: sender=:1.477 path=/MediaEndpoint/A2DPSink/sbc
Jan 22 07:03:00 nito-server bluetoothd[1105]: Endpoint registered: sender=:1.477 path=/MediaEndpoint/A2DPSource/sbc
Jan 22 07:03:07 nito-server sshd[74442]: pam_unix(sshd:session): session closed for user nito
Jan 22 07:03:07 nito-server systemd[1]: session-61.scope: Succeeded.
Jan 22 07:03:07 nito-server systemd-logind[1131]: Session 61 logged out. Waiting for processes to exit.
Jan 22 07:03:07 nito-server systemd-logind[1131]: Removed session 61.
Jan 22 07:03:07 nito-server bluetoothd[1105]: Endpoint unregistered: sender=:1.477 path=/MediaEndpoint/A2DPSink/sbc
Jan 22 07:03:07 nito-server bluetoothd[1105]: Endpoint unregistered: sender=:1.477 path=/MediaEndpoint/A2DPSource/sbc
Jan 22 07:03:07 nito-server systemd[74473]: pulseaudio.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: Stopping User Manager for UID 1000...
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Main User Target.
Jan 22 07:03:17 nito-server systemd[74473]: Stopping D-Bus User Message Bus...
Jan 22 07:03:17 nito-server systemd[74473]: dbus.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped D-Bus User Message Bus.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Basic System.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Paths.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Sockets.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Timers.
Jan 22 07:03:17 nito-server systemd[74473]: dbus.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed D-Bus User Message Bus Socket.
Jan 22 07:03:17 nito-server systemd[74473]: dirmngr.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG network certificate management daemon.
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent-browser.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent-extra.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent-ssh.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent and passphrase cache.
Jan 22 07:03:17 nito-server systemd[74473]: pk-debconf-helper.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed debconf communication socket.
Jan 22 07:03:17 nito-server systemd[74473]: pulseaudio.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed Sound System.
Jan 22 07:03:17 nito-server systemd[74473]: snapd.session-agent.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed REST API socket for snapd user session agent.
Jan 22 07:03:17 nito-server systemd[74473]: Reached target Shutdown.
Jan 22 07:03:17 nito-server systemd[74473]: systemd-exit.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Finished Exit the Session.
Jan 22 07:03:17 nito-server systemd[74473]: Reached target Exit the Session.
Jan 22 07:03:17 nito-server systemd[1]: user@1000.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: Stopped User Manager for UID 1000.
Jan 22 07:03:17 nito-server systemd[1]: Stopping User Runtime Directory /run/user/1000...
Jan 22 07:03:17 nito-server systemd[1]: run-user-1000.mount: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: user-runtime-dir@1000.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: Stopped User Runtime Directory /run/user/1000.
Jan 22 07:03:17 nito-server systemd[1]: Removed slice User Slice of UID 1000.
Jan 22 07:09:52 nito-server wpa_supplicant[1135]: wlo1: WPA: Group rekeying completed with 76:ac:b9:30:c7:b5 [GTK=CCMP]
Jan 22 07:17:01 nito-server CRON[75952]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 22 07:17:01 nito-server CRON[75953]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 22 07:17:01 nito-server CRON[75952]: pam_unix(cron:session): session closed for user root
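
The excerpt ends with a routine cron.hourly run and no shutdown sequence after it, which would fit a freeze or sudden stop better than an orderly shutdown, though that is an interpretation rather than something the log proves. To compare several incidents, the time of the last journal entry of a boot can be extracted like this; last_log_time is a hypothetical name and the sketch assumes journalctl's default short output format.

```shell
# Hypothetical helper: print the timestamp of the last line of a journal
# dump in the default short format ("Mon DD HH:MM:SS host ...").
last_log_time() {
  tail -n 1 | awk '{ print $1, $2, $3 }'
}
```

For example, sudo journalctl -b -1 | last_log_time gives the moment the previous boot stopped logging, and journalctl --list-boots shows which boot offsets are available.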

Running sudo cat /var/log/syslog | grep -iE "panic|error|hang" (the -E is needed for | to act as "or"):


Jan 29 00:00:08 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 00:00:08 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 00:00:08 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 00:00:12 nito-server systemd-resolved[1129]: message repeated 47 times: [ Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.]
Jan 29 03:01:30 nito-server networkd-dispatcher[1163]: ERROR:Unknown interface index 50 seen even after reload
Jan 29 03:01:30 nito-server networkd-dispatcher[1163]: ERROR:Unknown interface index 50 seen even after reload
Jan 29 03:01:30 nito-server kernel: [22595.674380] IPv6: ADDRCONF(NETDEV_CHANGE): vethc7eb143: link becomes ready
Jan 29 03:33:00 nito-server boltd[117319]: power: state changed: supported/on
Jan 29 03:33:05 nito-server boltd[117319]: power: state changed: supported/wait
Jan 29 03:33:05 nito-server boltd[117319]: power: state changed: supported/on
Jan 29 03:33:25 nito-server boltd[117319]: power: state changed: supported/wait
Jan 29 03:33:45 nito-server boltd[117319]: power: state changed: supported/off
Jan 29 08:31:20 nito-server boltd[117319]: power: state changed: supported/on
Jan 29 08:31:40 nito-server boltd[117319]: power: state changed: supported/wait
Jan 29 08:32:00 nito-server boltd[117319]: power: state changed: supported/off
Jan 29 08:45:13 nito-server NetworkManager[1154]: <info> [1611881113.6500] dhcp4 (wlo1): state changed bound -> extended
Jan 29 14:12:25 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 14:12:25 nito-server systemd-resolved[1129]: message repeated 2 times: [ Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.]
Jan 29 16:36:01 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 16:36:01 nito-server systemd-resolved[1129]: message repeated 2 times: [ Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.]
Jan 29 16:36:25 nito-server NetworkManager[1154]: <info> [1611909385.6366] manager: kernel firmware directory '/lib/firmware' changed
Jan 29 16:36:29 nito-server NetworkManager[1154]: <info> [1611909389.8075] manager: kernel firmware directory '/lib/firmware' changed
Jan 29 16:37:04 nito-server ntpd[351772]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Jan 29 16:37:04 nito-server ntpd[351772]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized

Note: the power state change shown there is probably a manual one, as there were no automatic shutdowns on Jan 29.
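
Many of the lines in that output (boltd, NetworkManager, the veth link message) appear only because "hang" also matches inside "changed" and "NETDEV_CHANGE". A slightly stricter filter, sketched below under the assumption that only words starting with "hang" are of interest, cuts that noise; suspicious_lines is a made-up name.

```shell
# Hypothetical helper: like the grep above, but "hang" only counts at the
# start of a word, so "changed" and "NETDEV_CHANGE" no longer match.
suspicious_lines() {
  grep -iE 'panic|error|(^|[^[:alpha:]])hang'
}
```

Usage would be sudo cat /var/log/syslog | suspicious_lines. The "error" alternative is left loose on purpose so that strings like TIME_ERROR are still caught.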


Final Decision


Hi everyone,


Thanks for your tips and hypotheses. After implementing all the changes and running all the recommended commands, the machine still reboots. The good news is that it now reboots every 72h instead of every 24h.


I have decided to reinstall Ubuntu Server and see if that solves the problem. Thanks!



Answers

While the root cause is unclear, uninstalling the UI and changing the power management settings moved the interval between reboots from ~24h to ~72h.


Additionally, after reinstalling Ubuntu Server, this doesn't seem to be an issue anymore.


Answered Saturday, February 5, 2022