We had an interesting Kerberos troubleshooting session today. Someone had set up a round-robin DNS solution in which one host name (let's call it server-x) had a bunch of A records pointing to different IP addresses.
For illustration purposes, let's say it looks like this:
server-x.domain.com 15 IN A 192.168.1.1
server-x.domain.com 15 IN A 192.168.1.2
server-x.domain.com 15 IN A 192.168.1.3
server-x.domain.com 15 IN A 192.168.1.4
server1.domain.com 300 IN A 192.168.1.1
server2.domain.com 300 IN A 192.168.1.2
server3.domain.com 300 IN A 192.168.1.3
server4.domain.com 300 IN A 192.168.1.4
1.1.168.192.in-addr.arpa 300 IN PTR server1.domain.com
2.1.168.192.in-addr.arpa 300 IN PTR server2.domain.com
3.1.168.192.in-addr.arpa 300 IN PTR server3.domain.com
4.1.168.192.in-addr.arpa 300 IN PTR server4.domain.com
When attempting to ssh to server-x, it would sometimes work but sometimes return an error saying that it failed to initialize the GSS context. We finally dug in and found the following.
With a completely clean cache (i.e. TGT only), whenever the failure occurred we could tell that server1 was being contacted, but the cache contained a service ticket for server2. It turned out that ssh would do its own resolution, separate from GSSAPI's canonicalisation, so with round-robin the two lookups could land on different hosts. The workaround we found was to wrap the call in a script that first resolves the name and then passes the result to ssh. This way ssh and GSSAPI both work from the same, already resolved name.
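The wrapper idea can be sketched roughly like this. This is not the script we actually used, just a minimal Python version of the same approach. The function names are made up for illustration, and it assumes every pool member has a PTR record pointing at its real host name, as in the records above.

```python
import socket
import subprocess


def pick_canonical_host(alias,
                        resolve=socket.gethostbyname,
                        reverse=socket.gethostbyaddr):
    """Resolve the round-robin alias once and return the PTR name of that address.

    `resolve` and `reverse` are injectable (hypothetical parameters) so the
    logic can be exercised without touching real DNS.
    """
    ip = resolve(alias)                          # single A lookup, picks one pool member
    canonical, _aliases, _addresses = reverse(ip)  # PTR lookup, e.g. server2.domain.com
    return canonical


def build_ssh_command(alias, ssh_options=(), **lookups):
    """Build an ssh command line that targets the already resolved host name."""
    host = pick_canonical_host(alias, **lookups)
    return ["ssh", *ssh_options, host]


def main(argv):
    """argv[0] is the round-robin alias, anything after it is passed to ssh as options."""
    return subprocess.call(build_ssh_command(argv[0], argv[1:]))
```

Calling `main(["server-x.domain.com"])` would then run ssh against whichever real host the single lookup returned, so the connection and the service ticket refer to the same machine.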
A bit later my colleague discovered an ssh option called GSSAPITrustDns, which makes sure that the name is resolved only once by ssh and is then passed to GSSAPI, preventing the double resolution.
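To see why resolving only once matters, here is a toy Python model of the double resolution. It is only an illustration under simplifying assumptions: the resolver hands out addresses in strict rotation with no caching, the data mirrors the illustrative records above, and all function names are invented for this sketch.

```python
from itertools import cycle

# Toy copy of the illustrative zone data above.
POOL = ["192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.4"]
PTR = {
    "192.168.1.1": "server1.domain.com",
    "192.168.1.2": "server2.domain.com",
    "192.168.1.3": "server3.domain.com",
    "192.168.1.4": "server4.domain.com",
}


def make_round_robin(pool):
    """Return a lookup function that hands out the pool addresses in rotation."""
    rotation = cycle(pool)
    return lambda name: next(rotation)


def connect_double_lookup(lookup, alias):
    """Failing case: ssh and GSSAPI each resolve the alias independently."""
    connected_to = PTR[lookup(alias)]          # ssh decides where to connect
    ticket_for = "host/" + PTR[lookup(alias)]  # GSSAPI canonicalises on its own
    return connected_to, ticket_for


def connect_single_lookup(lookup, alias):
    """Fixed case: one lookup, and its result is reused for the ticket."""
    host = PTR[lookup(alias)]
    return host, "host/" + host


lookup = make_round_robin(POOL)
print(connect_double_lookup(lookup, "server-x.domain.com"))
print(connect_single_lookup(lookup, "server-x.domain.com"))
```

In the double lookup case the connection goes to one host while the ticket is requested for another, which is the mismatch we saw in the cache. With a single lookup the two always agree.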
The longer answer is that if you must use Kerberos behind a load balancer, do not use round-robin DNS. In fact, round-robin is a pretty bad load balancer for just about anything, Kerberized or not.