Troubleshooting Longhorn and DNS Networking

Lately I’ve been trying out Longhorn as the persistent storage engine for stateful Kubernetes workloads. Overall I am really impressed by Longhorn: installing it on a cluster is a breeze and creating replicated volumes works great.

After a routine reboot of one worker node (which had been properly drained and cordoned beforehand), any pod with an attached Longhorn volume would refuse to start.

tl;dr

  • Pod longhorn-csi-plugin-* failed to start due to connection timeouts to longhorn-backend.
  • Checking DNS on the worker node revealed a network issue that broke DNS resolution and inter-node communication.
  • Restarting the flannel DaemonSet fixed networking on the node and resolved the issue.

Troubleshooting

Inspecting the events of one of the affected pods showed these errors:

kubectl describe pod -n default pod-a-79d65ffcd4-zp9lm
...
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    13m                  default-scheduler  Successfully assigned default/pod-a-79d65ffcd4-zp9lm to worker-1
  Warning  FailedMount  6m41s                kubelet            Unable to attach or mount volumes: unmounted volumes=[pod-a-data], unattached volumes=[default-token-jqfxl pod-a-data]: timed out waiting for the condition
  Warning  FailedMount  2m36s (x6 over 12m)  kubelet            MountVolume.MountDevice failed for volume "pvc-e8d3b090-41ad-469e-b208-609e693c0fcf" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name driver.longhorn.io not found in the list of registered CSI drivers
  Warning  FailedMount  2m5s (x4 over 11m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[pod-a-data], unattached volumes=[pod-a-data default-token-jqfxl]: timed out waiting for the condition
...

There was a good clue in there: driver name driver.longhorn.io not found in the list of registered CSI drivers.
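As an aside, one way to see which CSI drivers the kubelet on a given node has actually registered is to look at that node’s CSINode object; a jsonpath query like the one below is just one way to pull out the driver names:

## List the CSI drivers registered on worker-1
kubectl get csinode worker-1 -o jsonpath='{.spec.drivers[*].name}'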

The Longhorn CSI drivers are managed by specific workloads that come bundled with the Longhorn deployment. The logical next step was to check the status of all the pods in the longhorn-system namespace:

kubectl get pods -n longhorn-system
NAME                                      READY   STATUS            RESTARTS    AGE     IP              NODE
...
instance-manager-r-bc118f38               1/1     Running           0           13m     10.244.3.130    worker-1
longhorn-csi-plugin-6468k                 2/2     Running           0           7d9h    10.244.2.14     worker-2
longhorn-csi-plugin-nwsz8                 1/2     CrashLoopBackOff  11          7h52m   10.244.3.126    worker-1
longhorn-csi-plugin-vblqk                 2/2     Running           0           7d9h    10.244.5.18     worker-3
longhorn-driver-deployer-f6d46898b-mbpl9  1/1     Running           0           14m     10.244.2.83     worker-2
...

Pod longhorn-csi-plugin-nwsz8 was not looking too healthy in its CrashLoopBackOff state. This pod was also deployed to node worker-1, which would explain the driver.longhorn.io not found error. Further inspection of the container logs revealed:

kubectl logs -n longhorn-system longhorn-csi-plugin-nwsz8 node-driver-registrar
I0202 01:21:06.106826   41062 main.go:110] Version: v1.2.0-lh1-0-g03d523f9
I0202 01:21:06.106872   41062 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0202 01:21:06.106880   41062 connection.go:151] Connecting to unix:///csi/csi.sock
W0202 01:21:16.107162   41062 connection.go:170] Still connecting to unix:///csi/csi.sock
W0202 01:21:26.107140   41062 connection.go:170] Still connecting to unix:///csi/csi.sock

The node-driver-registrar container was running but was unable to connect to the CSI socket on worker-1. Inspecting the log of the second container:

kubectl logs -f -n longhorn-system longhorn-csi-plugin-nwsz8 longhorn-csi-plugin
time="2021-02-02T08:15:04Z" level=info msg="CSI Driver: driver.longhorn.io csiVersion: 0.3.0, manager URL http://longhorn-backend:9500/v1"
time="2021-02-02T08:15:14Z" level=fatal msg="Error starting CSI manager: Failed to initialize Longhorn API client: Get \"http://longhorn-backend:9500/v1\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Aha! A Client.Timeout when the longhorn-csi-plugin container tried to connect to the manager via the longhorn-backend:9500 service.
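A quick sanity check at this point is to make sure the service itself exists and has endpoints behind it (i.e. that the longhorn-manager pods are up); something along these lines:

## Check that the service exists and has ready endpoints
kubectl get svc,endpoints -n longhorn-system longhorn-backend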

Hmm, usually when I see a timeout I think firewall or DNS. As there had not been any recent changes to the firewall rules, DNS seemed the more likely suspect.

First I had a look in the kube-system namespace to confirm that the kube-dns service was present and that the backing CoreDNS pods were running. Everything looked fine there.
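For reference, those checks boiled down to something like this (the k8s-app=kube-dns label is what kubeadm-style deployments put on the CoreDNS pods; it may differ on other distributions):

## The kube-dns ClusterIP service in kube-system
kubectl get svc -n kube-system kube-dns

## The CoreDNS pods backing it
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide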

The next step was to verify whether node worker-1 could resolve the longhorn-backend service in the longhorn-system namespace by performing a DNS lookup directly from the node:

ssh worker-1

## Use the clusterIP of kube-dns service
## in kube-system namespace as the resolver

dig @10.96.0.10 longhorn-backend.longhorn-system.svc.cluster.local

; <<>> DiG 9.11.25-RedHat-9.11.25-2.fc33 <<>> @10.96.0.10 longhorn-backend.longhorn-system.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

Uh-oh, worker-1 couldn’t even connect to the kube-dns service. Now a networking issue seemed more likely.

Earlier, when I looked at the CoreDNS pods in kube-system, I had also seen that all pods for the flannel CNI were running fine.
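That check amounted to something like this (the app=flannel label matches the stock kube-flannel manifest; adjust it if flannel was deployed differently):

## Check the flannel pods across all nodes
kubectl get pods -n kube-system -l app=flannel -o wide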

But just to be on the safe side I restarted the kube-flannel-ds DaemonSet in kube-system:

kubectl rollout restart daemonset -n kube-system kube-flannel-ds
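To be sure the restart had completed on every node before re-testing, the rollout can be followed:

## Wait for the restarted DaemonSet to become ready on all nodes
kubectl rollout status daemonset -n kube-system kube-flannel-ds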

Repeating the DNS lookup on worker-1:

dig @10.96.0.10 longhorn-backend.longhorn-system.svc.cluster.local

...
;; ANSWER SECTION:
longhorn-backend.longhorn-system.svc.cluster.local. 30 IN A 10.96.5.142
...

Great! worker-1 was now able to connect to the DNS service and resolve longhorn-backend.longhorn-system.svc.cluster.local to an actual IP. This seemed to do the trick: pod longhorn-csi-plugin-nwsz8 started running properly and pod-a-79d65ffcd4-zp9lm could successfully attach its Longhorn volume.
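To double-check rather than just wait for the pods to recover, the CSI plugin pods can be listed again (assuming the app=longhorn-csi-plugin label that the Longhorn-managed DaemonSet uses):

## Confirm the CSI plugin pod on worker-1 is back to 2/2 Running
kubectl get pods -n longhorn-system -l app=longhorn-csi-plugin -o wide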

[Image: meme of Yoda saying “it’s always DNS”]

Ok, to be fair it was the flannel CNI… but still: it’s always DNS.

What was the actual root cause?

My suspicion is that flannel failed to renew its subnet lease on the worker node. Unfortunately the flannel logs were purged after subsequent reboots, so I can’t confirm this.
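If it happens again, a quick way to spot a stale or mismatched lease would be to compare the podCIDR Kubernetes assigned to the node with the subnet flannel actually wrote out on the node (flannel keeps its view in /run/flannel/subnet.env):

## PodCIDR assigned to the node by Kubernetes
kubectl get node worker-1 -o jsonpath='{.spec.podCIDR}'

## Subnet flannel is actually using on the node
ssh worker-1 cat /run/flannel/subnet.env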

As this k8s cluster is more of a build-up/tear-down kind of deal than something mission-critical, I don’t have the strongest motivation to try to replicate the issue right now and perform a thorough root cause analysis. But who knows, if the same issue starts happening frequently I might dedicate some time to it and update this post.