Wednesday, December 19, 2012

Interesting Kerberos and ssh troubleshooting

We had an interesting Kerberos troubleshooting session today. Someone had set up a round-robin DNS solution in which one host name, let's call it server-x, had a bunch of A records pointing to different IP addresses.

For illustration purposes, let's say it looks like this:
server-x.domain.com 15 IN A 192.168.1.1
server-x.domain.com 15 IN A 192.168.1.2
server-x.domain.com 15 IN A 192.168.1.3 
server-x.domain.com 15 IN A 192.168.1.4

server1.domain.com 300 IN A  192.168.1.1
server2.domain.com 300 IN A  192.168.1.2
server3.domain.com 300 IN A  192.168.1.3 
server4.domain.com 300 IN A  192.168.1.4

1.1.168.192.in-addr.arpa 300 IN PTR server1.domain.com
2.1.168.192.in-addr.arpa 300 IN PTR server2.domain.com
3.1.168.192.in-addr.arpa 300 IN PTR server3.domain.com
4.1.168.192.in-addr.arpa 300 IN PTR server4.domain.com

When attempting to ssh to server-x, it would sometimes work and sometimes return an error saying that it failed to initialize the GSS context. We finally dug in and found the following.

With a completely clean cache (i.e. TGT only), when the failure occurred we could tell that server1 was being contacted but the cache contained a service ticket for server2. It turned out that ssh does its own resolution, separate from GSSAPI's canonicalisation, so with round-robin records the two lookups can land on different hosts. The workaround we found was to wrap the call in a script that first resolves the name and then passes the result to ssh. This way both ssh and GSSAPI skip the round-robin resolution step.
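The wrapper idea can be sketched roughly like this in Python. This is not the script we actually used, just a minimal illustration under a few assumptions: the function names are made up, only IPv4 is handled, and the name handed to ssh is whatever the PTR record of the chosen address says (server1 through server4 in the example above).

```python
#!/usr/bin/env python3
"""Resolve a round-robin name once, then hand the canonical name to ssh."""
import os
import socket
import sys


def pick_backend(name, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return the canonical host name behind one address of `name`.

    The round-robin name is looked up exactly once. The reverse (PTR)
    lookup then turns the chosen address into a name that has a single
    A record, so later lookups by ssh and GSSAPI agree with each other.
    """
    address = resolve(name)                          # e.g. "192.168.1.2"
    canonical, _aliases, _addresses = reverse(address)
    return canonical                                 # e.g. "server2.domain.com"


def build_ssh_command(name, extra_args=(), **lookups):
    """Build the ssh argument list, with any options placed before the host."""
    return ["ssh", *extra_args, pick_backend(name, **lookups)]


# Called without arguments, the script does nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    command = build_ssh_command(sys.argv[1], sys.argv[2:])
    os.execvp(command[0], command)                   # replace this process with ssh
```

Saved under a hypothetical name such as ssh-once, it would be called as `ssh-once server-x.domain.com`, with any further ssh options following the host name. The lookup functions are parameters only so that the logic can be tried out without touching real DNS.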

A bit later my colleague discovered an ssh option called GSSAPITrustDns, which makes sure that the name is resolved only once by ssh and then passed to GSSAPI, preventing the double resolution.
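If ssh is already being driven from a script, the option can be passed on the command line with -o. A small Python sketch, assuming an ssh build that knows GSSAPITrustDns (as far as I know it comes from the GSSAPI patches that some distributions ship, not from stock OpenSSH); the function name is made up for the example.

```python
import subprocess


def ssh_trusting_dns(host, *remote_command):
    """Build an ssh command that lets ssh canonicalize the name for GSSAPI."""
    return [
        "ssh",
        "-o", "GSSAPIAuthentication=yes",
        "-o", "GSSAPITrustDns=yes",   # assumes a GSSAPI-patched ssh client
        host,
        *remote_command,
    ]


def run(host, *remote_command):
    """Run the command and return ssh's exit status."""
    return subprocess.call(ssh_trusting_dns(host, *remote_command))
```

For example, `run("server-x.domain.com", "uptime")` would run uptime on whichever backend ssh resolved. The same setting can also live in the ssh client configuration instead of the command line.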

The longer answer is that if you must use Kerberos behind a load balancer, do not use round-robin DNS. In fact, round-robin is a pretty bad load balancer for just about anything, kerberized or not.