I know that this should be fixable, because after the fresh install, the whole boot process, from GRUB to the final desktop, took maybe 5-10 seconds (which is reasonable, given that this is an extremely fast desktop PC: AMD Ryzen 3900X, 3600 MHz DDR4 RAM, Samsung 970 EVO Plus M.2 NVMe SSD).
This is with (K)Ubuntu 20.04.
After making a couple of changes, the startup time suddenly became much longer (two minutes or so).
How can I debug systematically why that is?
I read about bootchart, which seems to have been superseded by systemd-bootchart. I installed it, but I'm not sure whether I need to set anything up for it to run.
Then I saw here that there is systemd-analyze, which already does something similar by default. I created this plot:
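For reference, these are the sorts of systemd-analyze invocations involved (guarded so they are no-ops on a machine not booted with systemd):

```shell
# These only make sense on a machine booted with systemd, so guard accordingly.
if [ -d /run/systemd/system ]; then
    # Overall boot time, split into firmware/loader/kernel/userspace:
    systemd-analyze time

    # Per-unit startup cost, slowest first:
    systemd-analyze blame | head -n 20

    # The SVG timeline shown above:
    systemd-analyze plot > boot.svg
fi
```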
However, I'm not exactly sure how to read it. Are all these units run sequentially or in parallel? I assume in parallel. I see that some of the services (apparmor etc.) are started only much later. But why? How can I see why they start so much later? I assume they depend on something, but on what? I don't see anything in this plot that becomes ready at that point.
The services are also drawn in light red and darker red. What do these colors mean?
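To dig into why a particular unit starts late, I assume one can inspect its ordering and dependency information directly; a sketch, using apparmor.service from the plot as the example (any unit name would work):

```shell
if [ -d /run/systemd/system ]; then
    # Ordering constraints and pulled-in dependencies of the unit:
    systemctl show -p After -p Wants -p Requires apparmor.service

    # The slowest dependency chain leading up to this specific unit:
    systemd-analyze critical-chain apparmor.service

    # Full recursive dependency tree:
    systemctl list-dependencies apparmor.service
fi
```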
The output of systemd-analyze critical-chain gives me:
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
graphical.target @1min 37.068s
└─multi-user.target @1min 37.068s
└─kerneloops.service @1min 37.058s +10ms
└─network-online.target @1min 37.057s
└─NetworkManager-wait-online.service @1min 30.725s +6.331s
└─NetworkManager.service @1min 30.625s +99ms
└─dbus.service @1min 30.623s
└─basic.target @1min 30.606s
└─sockets.target @1min 30.606s
└─snapd.socket @1min 30.605s +359us
└─sysinit.target @1min 30.603s
└─systemd-timesyncd.service @1min 30.418s +185ms
└─systemd-tmpfiles-setup.service @1min 30.396s +18ms
└─systemd-journal-flush.service @493ms +49ms
└─systemd-journald.service @173ms +319ms
└─systemd-journald.socket @170ms
└─-.mount @168ms
└─system.slice @168ms
└─-.slice @168ms
There is a big jump between systemd-tmpfiles-setup.service @1min 30.396s and systemd-journal-flush.service @493ms. But why?
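Presumably the gap is explained by ordering dependencies rather than by the units themselves; something like this should show what the later unit was waiting for:

```shell
if [ -d /run/systemd/system ]; then
    # What must be ordered before systemd-tmpfiles-setup.service?
    systemctl show -p After systemd-tmpfiles-setup.service

    # While the boot hangs (e.g. from another TTY), list the jobs that are
    # still pending -- the "waiting" entries are what everything is stuck on:
    systemctl list-jobs
fi
```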
I assume this is related to an older (non-system-relevant) hard disk that I installed. I actually want it to be in standby most of the time, as I will use it only rarely, if at all (maybe once per week or less). I already configured it in /etc/fstab with 0 2 (after the options). Maybe I should use 0 0 instead? Or maybe I should add noauto to the options so it is not automounted? (Edit: I added noauto, and it is no longer mounted automatically. However, this does not seem to have any effect on boot time; the critical-chain output is still exactly the same.)
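For reference, this is the shape of the fstab entry with noauto added (the UUID and mount point here are placeholders, not my actual values):

```
# Hypothetical entry for the rarely-used data disk (UUID is a placeholder).
# "noauto" keeps it out of the boot-time mount set; mount it manually when needed.
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/data  ext4  defaults,noauto  0 2
```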
Edit: I just checked the output of journalctl --no-pager, which seems to be a bit more helpful in this case:
...
Mai 03 13:03:33 az-Desktop2020 systemd[1]: Finished udev Wait for Complete Device Initialization.
Mai 03 13:03:33 az-Desktop2020 systemd[1]: Starting Install ZFS kernel module...
Mai 03 13:03:33 az-Desktop2020 systemd[1]: Finished Install ZFS kernel module.
Mai 03 13:03:33 az-Desktop2020 systemd[1]: Starting Import ZFS pools by cache file...
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Finished Import ZFS pools by cache file.
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Reached target ZFS pool import target.
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Starting Mount ZFS filesystems...
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Starting Wait for ZFS Volume (zvol) links in /dev...
Mai 03 13:03:34 az-Desktop2020 zvol_wait[1418]: No zvols found, nothing to do.
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Finished Wait for ZFS Volume (zvol) links in /dev.
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Reached target ZFS volumes are ready.
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Finished Mount ZFS filesystems.
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Starting Load/Save Random Seed...
Mai 03 13:03:34 az-Desktop2020 systemd[1]: Finished Load/Save Random Seed.
Mai 03 13:03:35 az-Desktop2020 systemd[1]: systemd-rfkill.service: Succeeded.
Mai 03 13:04:00 az-Desktop2020 systemd[1]: systemd-fsckd.service: Succeeded.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: dev-disk-by\x2duuid-88eaaaf4\x2d2ccf\x2d43c2\x2daa6e\x2dcaea7fa32aef.device: Job dev-disk-by\x2duuid-88eaaaf4\x2d2ccf\x2d43c2\x2daa6e\x2dcaea7fa32aef.device/start timed out.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: Timed out waiting for device /dev/disk/by-uuid/88eaaaf4-2ccf-43c2-aa6e-caea7fa32aef.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: Dependency failed for File System Check on /dev/disk/by-uuid/88eaaaf4-2ccf-43c2-aa6e-caea7fa32aef.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: Dependency failed for /mnt/ssd1.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: Dependency failed for Local File Systems.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: mnt-ssd1.mount: Job mnt-ssd1.mount/start failed with result 'dependency'.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: systemd-fsck@dev-disk-by\x2duuid-88eaaaf4\x2d2ccf\x2d43c2\x2daa6e\x2dcaea7fa32aef.service: Job systemd-fsck@dev-disk-by\x2duuid-88eaaaf4\x2d2ccf\x2d43c2\x2daa6e\x2dcaea7fa32aef.service/start failed with result 'dependency'.
Mai 03 13:05:00 az-Desktop2020 systemd[1]: dev-disk-by\x2duuid-88eaaaf4\x2d2ccf\x2d43c2\x2daa6e\x2dcaea7fa32aef.device: Job dev-disk-by\x2duuid-88eaaaf4\x2d2ccf\x2d43c2\x2daa6e\x2dcaea7fa32aef.device/start failed with result 'timeout'.
...
There is already a jump to systemd-fsckd.service, which took a long time (25 seconds).
But even longer is the jump to dev-disk-by\x2duuid-88ea....device: Job dev-disk-by....device/start timed out. I guess this is one of the problems.
Edit: Yes, that dev-disk-by ... device/start timed out was the problem. It was another entry in my /etc/fstab for a disk that was not even installed in the system. After adding noauto to that entry, it worked.
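An alternative to noauto, if the disk should still be mounted automatically whenever it is present, seems to be nofail combined with a short device timeout (sketch; UUID and mount point are placeholders):

```
# "nofail": the boot continues even if the device is missing.
# "x-systemd.device-timeout=10s": wait at most 10 s instead of the 90 s default.
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/data  ext4  defaults,nofail,x-systemd.device-timeout=10s  0 2
```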
I wonder, however, whether it is really the best strategy for Ubuntu/systemd to just idle there waiting for some timeout (1:30 min or so?) and then fail. Couldn't it be more clever?
(I will leave this question open anyway, as there are still lots of open questions along the way that I'm curious about, e.g. how to systematically debug such a problem, why systemd-analyze was not so helpful here, and how to read its output better.)
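For the "how to debug systematically" part, what worked best here was correlating journal timestamps with the critical chain; a sketch of the relevant invocations (times are the ones from my log above):

```shell
if [ -d /run/systemd/system ]; then
    # Current boot only, warnings and errors only:
    journalctl -b -p warning --no-pager

    # Narrow to the suspicious time window seen in the log:
    journalctl -b --no-pager --since "13:03:30" --until "13:05:10"

    # Cross-check against the chain that delayed graphical.target:
    systemd-analyze critical-chain
fi
```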