Systemd Service Hang

Why and when do systemd services hang indefinitely

Finally, TIL, what can all be the reason for systemd services to hang indefinitely. The internet is flooded with numerous reports on this topic but no clear answers. So no more uselessly marked workarounds like: systemctl daemon-reload and systemctl-daemon-reexec for this scenario.

The scene would be something along the lines of:

rrs         6467  0.0  0.0  23088 15852 pts/1    Ss   12:53   0:00  |   |   \_ /bin/bash
rrs        11512  0.0  0.0  14876  4608 pts/1    S+   13:18   0:00  |   |   |   \_ systemctl restart snapper-timeline.timer
rrs        11513  0.0  0.0  14984  3076 pts/1    S+   13:18   0:00  |   |   |       \_ /bin/systemd-tty-ask-password-agent --watch
rrs        11514  0.0  0.0 234756  6752 pts/1    Sl+  13:18   0:00  |   |   |       \_ /usr/bin/pkttyagent --notify-fd 5 --fallback

The snapper-timeline service is important to me and it not running for months is a complete failure. Disappointingly, commands like systemctl --failed do not report of this oddity. The overall system status is reported to be fine, which is completely incorrect.

Thankfully, a kind soul’s comment gave the hint. The problem is that you could be having certain services in Activating status, which thus blocks all other services; quietly. So much for the unnecessary fun.

Looking further, in my case, it was:

rrs@priyasi:~$ systemctl list-jobs 
JOB  UNIT                           TYPE  STATE  
81   timers.target                  start waiting
85   man-db.timer                   start waiting
88   fstrim.timer                   start waiting
3832 snapper-timeline.service       start waiting
83   snapper-timeline.timer         start waiting
39   systemd-time-wait-sync.service start running
87   logrotate.timer                start waiting
84   debspawn-clear-caches.timer    start waiting
89   plocate-updatedb.timer         start waiting
91   dpkg-db-backup.timer           start waiting
93   e2scrub_all.timer              start waiting
40   time-sync.target               start waiting
86   apt-listbugs.timer             start waiting

13 jobs listed.
13:12 ♒ ॐ ♅ ♄ ⛢     ☺ 😄    

That was it. I knew the systemd-timesyncd service, in the past, had given me enough headaches. And so was it this time, just quietly doing it all again.

rrs@priyasi:~$ systemctl status systemd-time-wait-sync.service
● systemd-time-wait-sync.service - Wait Until Kernel Time Synchronized
     Loaded: loaded (/lib/systemd/system/systemd-time-wait-sync.service; enabled; vendor preset>
     Active: activating (start) since Fri 2022-04-22 13:14:25 IST; 1min 38s ago
       Docs: man:systemd-time-wait-sync.service(8)
   Main PID: 11090 (systemd-time-wa)
      Tasks: 1 (limit: 37051)
     Memory: 836.0K
        CPU: 7ms
     CGroup: /system.slice/systemd-time-wait-sync.service
             └─11090 /lib/systemd/systemd-time-wait-sync

Apr 22 13:14:25 priyasi systemd[1]: Starting Wait Until Kernel Time Synchronized...
Apr 22 13:14:25 priyasi systemd-time-wait-sync[11090]: adjtime state 5 status 40 time Fri 2022->
13:16 ♒ ॐ ♅ ♄ ⛢      ☹ 😟=> 3  

Dear LazyWeb, anybody knows of why the systemd-time-wait-sync service would hang indefinitely? I’ve had identical setups on many machines, in the same network, where others don’t exhibit this problem.

rrs@priyasi:~$ systemctl cat systemd-time-wait-sync.service

...snipped...

[Service]
Type=oneshot
ExecStart=/lib/systemd/systemd-time-wait-sync
TimeoutStartSec=infinity
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target

The TimeoutStartSec=infinity is definitely an attribute that shouldn’t be shipped in any system services. There are use cases for it but that should be left for local admins to explicitly decide. Hanging for infinity is not a desired behavior for a system service.

In figuring all this out, today I learnt the handy systemctl list-jobs command, which will give the list of active running/blocked/waiting jobs.

Update: 2024-08-15

This week I finally found the cause of the issue. I have a bunch of bridge interfaces defined on my machine. And all of them, most of their time, would be DOWN


@ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: withnet: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether XXXXXXXXXXXXXXXXX brd ff:ff:ff:ff:ff:ff
3: nonet: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether XXXXXXXXXXXXXXXXX brd ff:ff:ff:ff:ff:ff
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether XXXXXXXXXXXXXXXXX brd ff:ff:ff:ff:ff:ff
6: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether XXXXXXXXXXXXXXXXX brd ff:ff:ff:ff:ff:ff
    inet 10.42.1.66/24 brd 10.42.1.255 scope global dynamic noprefixroute wlan0
       valid_lft 5325sec preferred_lft 5325sec
    inet6 fe80::9bc0:e362:7c9d:be7c/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
7: email-laptop: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1280 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet XXXXXXXXXXXXXXXXX scope global noprefixroute email-laptop
       valid_lft forever preferred_lft forever
    inet6 XXXXXXXXXXXXXXXXX scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 XXXXXXXXXXXXXXXXXXXXXXXX/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
⛢ 13:55:21 rrs@priyasi ~/pu/researchut-hugo   researchut|+1…2

And as it stands, that is the cause of the problem.

systemd-timesyncd would invoke an attempt to sync time for all the interfaces defined under systemd-networkd. And given the setup, that is bound to fail timeout; which is why I had added the override against:

TimeoutStartSec=infinity

So, apparently, for every interface in networkd, it’d attempt the ntp sync. It reports along the lines of:

Aug 15 13:24:06 priyasi systemd-timesyncd[1987]: Network configuration changed, trying to establish connection.

So I learnt this week, of the attribute to handle this:

@ cat sysbr0.network
[Match]
Name=withnet

[Network]
DHCPServer=yes
IPv4Forwarding=yes
IPv6Forwarding=yes
IPMasquerade=both
Address=192.168.1.1/24
LLMNR=yes

[Link]
RequiredForOnline=no

#[DHCPServer]
#DNS=192.168.1.1

With that, NTP time synchronization is back to what it should be. Precision synchronization of the system time.

ॐ 14:11:35 rrs@priyasi /etc/systemd/network  
@ timedatectl show
LocalRTC=no
CanNTP=yes
NTP=yes
NTPSynchronized=yes
TimeUSec=Thu 2024-08-15 14:12:45 IST
RTCTimeUSec=Thu 2024-08-15 14:12:45 IST

See also