I've been experiencing a frustrating issue with BIND9 and zone transfers for quite some time now. Here's a brief outline of my setup: I have a master DNS server that only responds to queries from a private network and my slave servers, which are hosted on VPSs located in Germany and Bulgaria. The master server maintains several public zones, with the slave servers designated as primary and secondary at the registrar level.
Here's where the problem lies: On my primary slave server (vps03), I update the zones for Let's Encrypt DNS verification using `nsupdate`; these updates go through without issues, and the master successfully processes them. However, during the zone transfer to the slaves, about 1 in 3 or 4 transfers fails with the error "failed while receiving responses: end of file" on the slave side. The master logs show the transfers are successful and error-free. If I manually initiate the retransfer on the slave with `rndc -s 127.0.0.1 retransfer `, it works but may require 2 or 3 attempts.
This inconsistency makes automated renewal of Let's Encrypt certificates quite challenging. I've been troubleshooting this for approximately 5 months without success. I've tried a variety of fixes: allowing larger packets, using different ports for transfers, extending timeout periods, and even toggling between transfer formats. I'm running:
- Master: BIND 9.18.39-0ubuntu0.22.04.2-Ubuntu (Extended Support Version)
- First Slave: BIND 9.18.39-0ubuntu0.24.04.3-Ubuntu (Extended Support Version)
- Second Slave: BIND 9.18.39-0ubuntu0.24.04.3-Ubuntu (Extended Support Version)
The slaves were recently upgraded to Ubuntu 24.04, but they previously operated on 22.04 with the same version as the master. Even after moving the primary slave to a fresh installation, the issue persisted. Any assistance in resolving this would be greatly appreciated. Let me know if you need additional information!
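Since the master logs a clean transfer while the slave reports a failure, one way to confirm whether a slave actually caught up is to compare SOA serials on both sides. The following is only a rough sketch, assuming `dig` is installed and that it runs from a host the master is willing to answer (such as one of the slaves). The function names, addresses, and zone name are hypothetical placeholders, not part of the setup described above.

```python
import subprocess


def parse_soa_serial(dig_output: str) -> int:
    """Extract the serial from `dig +short <zone> SOA` output.

    The short form is one line with seven fields:
    mname rname serial refresh retry expire minimum
    """
    fields = dig_output.split()
    if len(fields) < 7:
        raise ValueError(f"unexpected SOA answer: {dig_output!r}")
    return int(fields[2])


def query_serial(server: str, zone: str, timeout: float = 10.0) -> int:
    """Ask one server for the zone's SOA over TCP and return its serial."""
    result = subprocess.run(
        ["dig", "+short", "+tcp", f"@{server}", zone, "SOA"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if dig itself reports a failure
    )
    return parse_soa_serial(result.stdout)


def slave_is_current(master: str, slave: str, zone: str) -> bool:
    """True when the slave serves the same serial as the master."""
    return query_serial(slave, zone) == query_serial(master, zone)
```

A renewal script could call something like `slave_is_current("192.0.2.1", "127.0.0.1", "example.com")` after each `nsupdate` and only proceed to the Let's Encrypt validation step once it returns `True`, re-running the retransfer otherwise.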
2 Answers
It sounds like there might be something in the network path that's causing issues during transfers. Have you considered the possibility of your ISP or VPS provider filtering DNS traffic? I encountered a similar situation before. Keep in mind that AXFR already runs over TCP, so "end of file" on the slave suggests something is closing that TCP connection before the transfer completes; NOTIFY messages and SOA refresh queries, on the other hand, normally go over UDP, so it's worth checking whether those are being filtered as well. Also, double-check the firewall rules on both of your VPS systems to make sure nothing is rate-limiting or blocking connections under load. It could be useful to check the kernel logs for messages indicating packet drops.
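To get a rough idea of whether the path between a slave and the master drops or resets TCP connections, a small probe like the one below could be run from the slave. This is only a sketch under assumptions: the function name and the address in the usage note are hypothetical, and it tests connection setup only, not a full zone transfer, so a clean result does not rule out a mid-transfer cutoff.

```python
import socket
import time


def probe_tcp(host, port=53, attempts=20, timeout=5.0, pause=0.0,
              connect=socket.create_connection):
    """Open and close `attempts` TCP connections; return failure messages.

    `connect` is injectable so the loop can be exercised without a network.
    """
    failures = []
    for i in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
        except OSError as exc:  # covers timeouts, resets, and refusals
            failures.append(f"attempt {i + 1}: {exc}")
        if pause:
            time.sleep(pause)
    return failures
```

For example, `probe_tcp("192.0.2.1", attempts=50, pause=0.2)` returning a non-empty list would point toward the firewall or provider rather than BIND itself.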
I'd suggest focusing on potential network problems instead of just looking at your BIND configuration. Sometimes, the bottlenecks can occur in the communication between the master and slave servers, not BIND itself. If you've already changed one of your slave servers to a different provider and the issue remains, you might want to dig deeper into the network configurations or seek out external factors that could be affecting stability.
