Kubernetes iSCSI Target Gotchas
When the kubelet tells you that your iSCSI volumes failed to mount because it has requested a session and found that a session already exists, what do you do?
May 21, 2018
kubernetes iscsi rancher

This is one of those things that has plagued me for months. I stumbled into it again today, and I was fortunate enough to solve it. I say fortunate because that’s what ultimately happened.
A few weeks back I switched to running Plex Media Server under Kubernetes. My history with Plex is long and sordid, and since I also switched to an Apple TV for the Media Player portion, I repurposed the Mac Mini that was the home theater to be another Kubernetes node in my Rancher cluster.
I know that I want iSCSI for container services where anything like a database is happening, and Plex is no different. I attempted to configure the container to use an iSCSI volume off of my Synology NAS, and it refused, spitting back the following cryptic error:
failed to attach disk: Error: iscsiadm: default: 1 session requested, but 1 already present.
Further troubleshooting appeared to show the kubelet doing discovery, logging into the target (visible in dmesg as the disk attaching), and then trying to log in again, which returned the above error.
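For context, here is a minimal sketch of the kind of iSCSI PersistentVolume involved. The names, size, and IQN are hypothetical, but note that targetPortal uses a hostname, which is how the failing volumes were defined:

```yaml
# Hypothetical iSCSI PersistentVolume, roughly the shape of the failing Plex volumes.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-config
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  iscsi:
    # Target portal given as a hostname rather than an IP address.
    targetPortal: "nas.example.com:3260"
    iqn: "iqn.2000-01.com.synology:nas.plex-config"
    lun: 0
    fsType: ext4
    readOnly: false
```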
In April I pegged this as a Rancher problem and opened an issue about it. They were neck deep in migrating from v1 to v2, so it went nowhere.
Yesterday I created an internal MQTT broker for my Home Assistant installation and bridged it to the external broker that I run out in EC2 for OwnTracks data. I knew that I wanted it to use iSCSI for the database volume, so I buckled myself in and deployed Rancher 2.0 here in the house. The iSCSI mount worked great, and today I started migrating other services over.
This brings us to Plex, which bombed, throwing the same error. Ugh. I once again scoured the Internet, finding various ideas that have been thrown around over the last five years. None of them directly solved the issue. I spent the afternoon testing different hypotheses:
- iSCSI volumes in Kubernetes won’t mount if the container is running with hostNetwork: true (see the sketch after this list)
- A volume larger than X takes too long to mount because of some unknown fdisk-like scan
- Only the first IQN from a target will mount successfully
- The consumer of the mount has to understand that an exit code of 15 is a warning that says the session is already attached and move past it
- I am behind in my monthly quota of goat sacrifices
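For what it’s worth, the hostNetwork hypothesis refers to a pod-level setting like the one below. This is a hypothetical Deployment fragment for illustration, not my actual manifest:

```yaml
# Hypothetical Deployment fragment: a pod on the host network consuming an iSCSI-backed PVC,
# which is what made me suspect a conflict with the kubelet's own iSCSI login.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plex
spec:
  selector:
    matchLabels:
      app: plex
  template:
    metadata:
      labels:
        app: plex
    spec:
      hostNetwork: true
      containers:
        - name: plex
          image: plexinc/pms-docker
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: plex-config
```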
By the end of the day, I was ready to throw in the towel. I deleted the Plex configuration and restarted the MQTT service. It started fine.
Wait a minute. It started fine. It has always started fine, even when others didn’t. What made it different?
I pulled up the YAML for the deployment, PVs, and PVCs and compared them line by line. In the end I discovered one thing that was different:
The MQTT PV used an IP for the target. The others used a hostname.
I quickly released the goat that I had captured and rebuilt the Plex PVs with an IP for the target, and sure enough…they started.
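For the record, the entire fix was swapping the hostname in targetPortal for the NAS’s IP address. Against the hypothetical PV sketched above, the changed stanza looks something like this (the IP is made up):

```yaml
  iscsi:
    # Target portal given as an IP address instead of a hostname -- this mounted cleanly.
    targetPortal: "192.168.1.50:3260"
    iqn: "iqn.2000-01.com.synology:nas.plex-config"
    lun: 0
    fsType: ext4
    readOnly: false
```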