For the past 3 months, I have been struggling with an intermittent issue on my home server where DNS resolution drops for a brief period (10-60 seconds) for no apparent reason. Pinging by hostname fails with "ping: signal.org: Temporary failure in name resolution", and any service that attempts a DNS lookup fails almost instantly. Nothing from systemd-resolved or dnsmasq appears in /var/log/syslog when these outages happen, but other services do report problems. For example:
ddclient[573749]: message repeated 14 times: [ WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org']
dockerd[1811]: time="2021-04-29T13:50:19.080258289-05:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
rsyslogd: DNS error: Can't resolve "<local_domain>" [v8.2001.0]
whoopsie[1816]: [17:38:15] Sent; server replied with: Couldn't resolve host name
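Because the outages are short and neither systemd-resolved nor dnsmasq logs anything, it may help to record the exact start and end of each outage and compare those timestamps with the entries above and with Docker activity. The following is a minimal Python sketch that polls the system resolver through socket.getaddrinfo; the probe name signal.org, the one-second interval, and the function names are arbitrary, hypothetical choices, not part of any existing tool.

```python
import socket
import time
from datetime import datetime


def resolves(hostname: str) -> bool:
    """Return True if the system resolver (getaddrinfo) can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for errors such as "Temporary failure in name resolution".
        return False


def sample_forever(hostname: str, interval: float):
    """Yield (timestamp, ok) pairs indefinitely, one lookup per interval.

    A failing lookup can block for the resolver's timeout, so samples may be
    spaced further apart than `interval` during an outage.
    """
    while True:
        yield datetime.now(), resolves(hostname)
        time.sleep(interval)


def outage_windows(samples):
    """Yield (start, end) for each run of failed samples.

    start is the first failed sample; end is the first successful sample
    after it. A window is only yielded once resolution recovers.
    """
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            yield start, ts
            start = None


def main(hostname: str = "signal.org", interval: float = 1.0) -> None:
    """Run until interrupted, printing one line per outage."""
    for start, end in outage_windows(sample_forever(hostname, interval)):
        seconds = (end - start).total_seconds()
        print(f"{start:%Y-%m-%d %H:%M:%S} -> {end:%H:%M:%S} ({seconds:.0f}s) "
              f"could not resolve {hostname}")
```

Calling main() from a Python shell and leaving it running prints one line per outage, timestamped to the second so it can be lined up against /var/log/syslog. Going through getaddrinfo exercises the same glibc resolver path that ping uses.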
Current setup: Ubuntu 20.04.2 with Netplan set to a static IP; dnsmasq is the DNS server, with dns-forward-max=1024; systemd-resolved is disabled and stopped. The server is a Ryzen 3950X with 64GB of RAM, and the OS is installed on an NVMe drive. It runs many web-app-type services, but the noisiest for DNS requests is easily matrix-synapse.
Things I have tried:
· I have restarted the systemd-resolved service hundreds of times, disabled the service a dozen times, turned the stub resolver off and on, and deleted and re-created the symlink.
· I set a static IP with netplan and experimented with /etc/NetworkManager/NetworkManager.conf.
· I installed pihole and unbound via apt for the server itself. (pihole is currently uninstalled, and unbound is running, but nothing uses it for resolution.)
· I installed dnsmasq and completely disabled systemd-resolved.
· I've disabled IPv6 completely on the server.
· I've set * soft nofile 1048576 and * hard nofile 1048576 in /etc/security/limits.conf, and /proc/sys/fs/file-max shows 9223372036854775807.
I suspect Docker is the issue, but I have no idea how to verify this. I currently have 38 Docker containers running, and when I run sudo lsof -i :53 while the issue is happening, I see:
thomcat@servername:~$ sudo lsof -i :53
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dockerd 1623 root 217u IPv4 1577888 0t0 UDP localhost:46003->localhost:domain
dockerd 1623 root 226u IPv4 1605902 0t0 UDP localhost:50192->localhost:domain
dockerd 1623 root 227u IPv4 1610070 0t0 UDP localhost:52637->localhost:domain
dockerd 1623 root 228u IPv4 1605907 0t0 UDP localhost:55021->localhost:domain
dockerd 1623 root 229u IPv4 1618981 0t0 UDP localhost:57618->localhost:domain
dockerd 1623 root 230u IPv4 1610081 0t0 UDP localhost:35776->localhost:domain
dockerd 1623 root 231u IPv4 1610086 0t0 UDP localhost:60635->localhost:domain
dockerd 1623 root 232u IPv4 1589998 0t0 UDP localhost:43036->localhost:domain
dockerd 1623 root 234u IPv4 1602056 0t0 UDP localhost:58408->localhost:domain
dockerd 1623 root 235u IPv4 1614011 0t0 UDP localhost:43421->localhost:domain
dockerd 1623 root 236u IPv4 1589999 0t0 UDP localhost:60957->localhost:domain
dockerd 1623 root 237u IPv4 1597695 0t0 UDP localhost:53026->localhost:domain
dockerd 1623 root 242u IPv4 1590000 0t0 UDP localhost:41842->localhost:domain
dockerd 1623 root 244u IPv4 1597696 0t0 UDP localhost:49179->localhost:domain
dockerd 1623 root 246u IPv4 1572736 0t0 UDP localhost:46471->localhost:domain
dockerd 1623 root 266u IPv4 1616008 0t0 UDP localhost:35262->localhost:domain
dockerd 1623 root 267u IPv4 1616009 0t0 UDP localhost:54501->localhost:domain
dockerd 1623 root 268u IPv4 1579887 0t0 UDP localhost:33130->localhost:domain
dockerd 1623 root 269u IPv4 1579888 0t0 UDP localhost:33491->localhost:domain
dockerd 1623 root 270u IPv4 1613280 0t0 UDP localhost:49504->localhost:domain
dockerd 1623 root 273u IPv4 1579890 0t0 UDP localhost:43801->localhost:domain
dockerd 1623 root 278u IPv4 1613283 0t0 UDP localhost:44804->localhost:domain
dockerd 1623 root 279u IPv4 1568692 0t0 UDP localhost:39425->localhost:domain
dockerd 1623 root 293u IPv4 1577890 0t0 UDP localhost:52194->localhost:domain
dockerd 1623 root 296u IPv4 1605903 0t0 UDP localhost:50866->localhost:domain
dockerd 1623 root 319u IPv4 1605904 0t0 UDP localhost:58574->localhost:domain
dockerd 1623 root 341u IPv4 1605910 0t0 UDP localhost:37123->localhost:domain
dockerd 1623 root 342u IPv4 1610067 0t0 UDP localhost:48734->localhost:domain
dockerd 1623 root 343u IPv4 1610069 0t0 UDP localhost:35580->localhost:domain
dockerd 1623 root 344u IPv4 1605905 0t0 UDP localhost:45133->localhost:domain
dockerd 1623 root 345u IPv4 1618982 0t0 UDP localhost:53052->localhost:domain
dockerd 1623 root 346u IPv4 1589996 0t0 UDP localhost:56714->localhost:domain
dockerd 1623 root 347u IPv4 1614009 0t0 UDP localhost:37216->localhost:domain
dockerd 1623 root 348u IPv4 1589997 0t0 UDP localhost:38032->localhost:domain
dockerd 1623 root 349u IPv4 1618984 0t0 UDP localhost:53714->localhost:domain
dockerd 1623 root 350u IPv4 1610084 0t0 UDP localhost:42922->localhost:domain
dockerd 1623 root 351u IPv4 1577893 0t0 UDP localhost:32865->localhost:domain
dockerd 1623 root 352u IPv4 1608975 0t0 UDP localhost:58307->localhost:domain
dockerd 1623 root 353u IPv4 1597699 0t0 UDP localhost:33564->localhost:domain
dockerd 1623 root 354u IPv4 1608977 0t0 UDP localhost:58235->localhost:domain
dockerd 1623 root 355u IPv4 1577896 0t0 UDP localhost:46068->localhost:domain
dockerd 1623 root 356u IPv4 1597702 0t0 UDP localhost:32827->localhost:domain
systemd-r 106795 systemd-resolve 12u IPv4 980615 0t0 UDP localhost:domain
systemd-r 106795 systemd-resolve 13u IPv4 980616 0t0 TCP localhost:domain (LISTEN)
http 165553 _apt 3u IPv4 1611999 0t0 UDP localhost:54478->localhost:domain
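One way to quantify this listing is to count open client sockets to port 53 per process; here dockerd holds 42 UDP sockets to localhost:domain while the outage is in progress. Taking the same count periodically, both during and between outages, would show whether it climbs before resolution drops. The following Python sketch parses saved lsof -i :53 output; it assumes lsof's default column layout (COMMAND first, the connection as local->remote), and the function name is hypothetical.

```python
from collections import Counter


def dns_sockets_by_command(lsof_output: str) -> Counter:
    """Count sockets connected to a remote port 53 ('domain'), per COMMAND.

    Listening sockets (no '->' in NAME) are ignored.
    """
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "COMMAND":
            continue  # blank line or header row
        for field in fields:
            if "->" in field:
                remote = field.split("->", 1)[1]
                # lsof prints 'domain' by default, or '53' when run with -P.
                if remote.endswith((":domain", ":53")):
                    counts[fields[0]] += 1
                break
    return counts
```

Fed the listing above, it returns 42 for dockerd and 1 for http, and skips the two systemd-r listening sockets.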
More things to note:
· The upstream DNS server is a Raspberry Pi 3 B+ running pihole. Nothing else on my network has these DNS resolution problems, so the problem is not with the pihole.
· ssh sessions to the server do not drop while the issue is happening.
· Pinging external IPs works just fine while the issue is happening.
I've been pulling my hair out trying to figure this out. If anyone has any ideas, I would be glad to hear them.