I'm running a home Kubernetes cluster on Talos Linux where some applications depend on SQLite databases stored on an iSCSI target from my TrueNAS server. I manually configure Persistent Volumes (PV) and Persistent Volume Claims (PVC) for these workloads and don't use CSI drivers. Occasionally, I need to restart my TrueNAS server for maintenance, causing the iSCSI target to be unavailable for about 5 to 30 minutes.
During this downtime, my pods fail their liveness/readiness probes, and while Kubernetes attempts to restart them once the iSCSI server is back online, I still encounter I/O errors. It seems Kubernetes reuses the old iSCSI connection, leading to failures. The only way to resolve this issue is to delete the pod manually, which then allows everything to function normally again.
How do you all manage iSCSI target disconnects that last for a significant period?
3 Answers
Unfortunately, there’s no perfect solution to this problem. Once the underlying storage is unavailable for a while, the volume mount goes stale and can't be recovered in place. You either need to scale down before taking the storage offline, or make sure your pods get restarted automatically once the storage reconnects.
Have you considered just scaling down your workloads in Kubernetes before you do maintenance on the TrueNAS server? If it's planned maintenance, that might prevent some of the issues you're facing.
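If you do go this route, the scale-down and scale-up around the maintenance window could be scripted. Here is a minimal Python sketch, assuming the workloads are Deployments and that `kubectl` is on the PATH and already pointed at the cluster. The namespace `apps` and the Deployment names `app-a` and `app-b` in the usage note are hypothetical placeholders.

```python
import subprocess


def scale_command(namespace: str, deployment: str, replicas: int) -> list:
    """Build the kubectl invocation that sets a Deployment's replica count."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "--namespace", namespace,
    ]


def scale_all(namespace, deployments, replicas, dry_run=False):
    """Scale every listed Deployment; with dry_run=True only return the commands."""
    commands = [scale_command(namespace, name, replicas) for name in deployments]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if kubectl exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

You would call `scale_all("apps", ["app-a", "app-b"], 0)` before taking TrueNAS down and `scale_all("apps", ["app-a", "app-b"], 1)` once the iSCSI target is back, assuming each workload normally runs a single replica. Because scaling back up creates fresh pods, the volumes should get attached and mounted again rather than reusing the old session.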
I actually use a liveness script that checks whether the mounted volumes have become stale. If the script fails, the pod gets terminated, which hopefully helps re-establish the connection once the external storage returns.
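A check like that could look roughly like the sketch below. It is not the exact script described above, just one way to do it in Python, assuming the container image ships a Python interpreter and the volume is mounted read-write. The default path `/data` is a hypothetical mount point. The probe writes a small file and calls `fsync`, which forces I/O to the block device, so a dead iSCSI session should surface as an error instead of being hidden by the page cache.

```python
import os
import tempfile


def volume_is_healthy(mount_path: str) -> bool:
    """Return True if a small write + fsync + read round-trips on mount_path."""
    try:
        fd, probe_path = tempfile.mkstemp(prefix=".probe-", dir=mount_path)
        try:
            os.write(fd, b"ok")
            os.fsync(fd)  # push the write down to the underlying device
        finally:
            os.close(fd)
        with open(probe_path, "rb") as f:
            data = f.read()
        os.unlink(probe_path)  # clean up so probe files do not accumulate
        return data == b"ok"
    except OSError:
        # EIO, EROFS, ENOENT and similar all count as an unhealthy volume.
        return False


def main(argv) -> int:
    """Exit-code style result: 0 when healthy, 1 when stale or missing."""
    path = argv[0] if argv else "/data"  # hypothetical default mount point
    return 0 if volume_is_healthy(path) else 1
```

To use it as an exec probe, save it as a script and end it with `raise SystemExit(main(sys.argv[1:]))` (plus `import sys`). A stale mount can also hang instead of returning an error, so the probe's `timeoutSeconds` should be set low enough that a hung check counts as a failure. Keep in mind that a failed liveness probe makes the kubelet restart the container inside the same pod; if the stale mount survives that restart, as described in the question, the pod itself still has to be deleted or recreated.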
