Thursday, January 31, 2013

BOOTP, DHCP options 66 and 67, and PXE boot

If you read most PXE documentation you will see that it specifies option 66 for the TFTP server and option 67 for the boot image. However, from my experiments with FreeRADIUS DHCP, which does nothing more than raw DHCP decoding and encoding, I discovered that most of the docs were apparently written after testing with Microsoft DHCP, which seems to conveniently translate option 66 into the old BOOTP sname field and option 67 into the old BOOTP file (boot file name) field.  So do not be surprised if you set 66 and 67 and that NIC still does not boot.
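To make the distinction concrete, here is a rough sketch in Python (made-up values, not a working DHCP server) showing where the two live in the packet: sname and file are fixed-width fields inside the BOOTP header itself, while 66 and 67 are options appended after the DHCP magic cookie. A server that populates only one of the two places will fail to boot clients that read the other.

    import struct

    def build_bootreply(tftp_server, boot_file):
        # Fixed-size BOOTP header (RFC 951): op, htype, hlen, hops, xid,
        # secs, flags, ciaddr, yiaddr, siaddr, giaddr, chaddr, sname, file.
        header = struct.pack(
            "!BBBBIHHIIII16s64s128s",
            2,                      # op = BOOTREPLY
            1, 6, 0,                # ethernet, hlen 6, hops 0
            0x12345678,             # xid (would echo the client's)
            0, 0x8000,              # secs, broadcast flag
            0, 0, 0, 0,             # ciaddr, yiaddr, siaddr, giaddr
            b"\x00" * 16,           # chaddr
            tftp_server.encode(),   # legacy BOOTP 'sname' header field
            boot_file.encode(),     # legacy BOOTP 'file' header field
        )
        # The same information again, as DHCP options 66/67 after the cookie.
        opts = b"\x63\x82\x53\x63"  # DHCP magic cookie
        opts += bytes([66, len(tftp_server)]) + tftp_server.encode()
        opts += bytes([67, len(boot_file)]) + boot_file.encode()
        opts += b"\xff"             # end option
        return header + opts

    pkt = build_bootreply("192.168.0.5", "pxelinux.0")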

Tuesday, January 1, 2013

The trouble with MySQL autoincrement

Recently at work I wrote a webapp using django. The point of this app was to look at some data feeds and create or update entries; an entry is believed to already exist if it can be found by content and category.  If the entry already exists we just have to update its last_seen field with a NOW() timestamp.  Each entry had an id field which was auto-incremented on each insert (handled by MySQL).

1. Using ORM exclusively
At first I relied on django to handle all the refresh details and provide the web UI for querying via the admin interface.  In addition a view was created where users could POST a JSON query and get a response.  The django ORM provides an API call, get_or_create (https://docs.djangoproject.com/en/dev/ref/models/querysets/), which I was going to use in order to know whether I already had something with content and category X, and then either update last_seen or create a new entry.  However this turned out to be too slow, since for each record (and there turned out to be about 4 million of them) a SELECT had to be executed first.
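For illustration, the per-record flow looked roughly like this (the model and field names here are reconstructions from the description above, not the actual code):

    from django.utils import timezone
    from myapp.models import Entry  # hypothetical app/model

    for item in feed_items:  # feed_items: the parsed feed, ~4 million records
        entry, created = Entry.objects.get_or_create(
            content=item["content"],
            category=item["category"],
        )
        if not created:
            # Already known: just refresh the timestamp.
            entry.last_seen = timezone.now()
            entry.save()

The catch is that each get_or_create call issues its own SELECT (plus an INSERT when the row is new), so this is at least one database round trip for every one of the 4 million records.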

2. Attempt to avoid network latency
I decided to use MySQL's upsert feature, INSERT ... ON DUPLICATE KEY UPDATE (http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html), in order to let the database handle deduplication.  Also, since at this point I was no longer using django for refreshes, I decided to use MySQLdb's batch operation (executemany, see http://mysql-python.sourceforge.net/MySQLdb.html).  This looked a little better: in batches of 5000, a load only took two hours (against a locally running MySQL instance it was only 15 minutes, sigh, but I work with what I got), still maintaining approximately 4 million records on each load.
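The batched version was along these lines (table and column names again made up, and it assumes a unique key on (content, category) so that ON DUPLICATE KEY has something to fire on):

    import MySQLdb

    conn = MySQLdb.connect(host="dbhost", user="app", passwd="secret", db="reputation")
    cur = conn.cursor()
    sql = ("INSERT INTO entries (content, category, last_seen) "
           "VALUES (%s, %s, NOW()) "
           "ON DUPLICATE KEY UPDATE last_seen = NOW()")

    batch = []
    for item in feed_items:
        batch.append((item["content"], item["category"]))
        if len(batch) == 5000:          # the batch size that shows up again below
            cur.executemany(sql, batch)
            conn.commit()
            batch = []
    if batch:
        cur.executemany(sql, batch)
        conn.commit()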

Disaster strikes
Everyone is happy now, though it could be better.  But two months later I notice that the db instance that hosts other databases for me is misbehaving.  After some investigation I noticed calls to my reputation database were hanging and I had about 20 or so stalled update jobs.  I killed the jobs and had a look; a local DBA advised me that I was bumping up against my auto increment limit.

i.e. he ran this against my table:

SELECT Auto_increment FROM information_schema.tables WHERE table_name='the_table_you_want';

Further, he suggested that we change the table's id field to BIGINT to quickly solve the problem.  I looked at my code and saw that I do not do any deletes and not too many adds, so I was puzzled as to how I was able to hit the upper limit of INT, but decided to revisit this problem if it happened again.

It happened again (or is about to)
The same DBA called me two days later and informed me that while I had about a week to fix it, I was once again approaching the upper limit of the auto increment, in leaps of 5000 (where did I see this number before?).  Again I told him that I do no deletes and very few inserts, as the entities on the blacklist change rarely; for the most part I just set the NOW() timestamp in the last_seen field.

Root cause
As it turned out, my batch operation first attempts an INSERT, which reserves (increments) the auto increment counter, and only then falls back to the UPDATE on a duplicate key; the counter is never decremented.  So every batch of 5000 rows advanced the counter by 5000 whether or not any new rows were actually inserted.  The DBA assured me that this was documented behavior (nonetheless lame, as I assured him).
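If you want to see the burn for yourself, a throwaway table will do; this is a sketch of that documented behavior, not my actual schema:

    import MySQLdb  # demo against a scratch database

    conn = MySQLdb.connect(host="localhost", user="test", passwd="test", db="scratch")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, k VARCHAR(10) UNIQUE)")
    cur.execute("INSERT INTO t (k) VALUES ('a')")  # row gets id 1
    # Duplicate key: no new row, but an auto increment value is reserved anyway.
    cur.execute("INSERT INTO t (k) VALUES ('a') ON DUPLICATE KEY UPDATE k = k")
    cur.execute("INSERT INTO t (k) VALUES ('b')")  # row gets id 3; id 2 was burned
    conn.commit()
    cur.execute("SELECT k, id FROM t ORDER BY id")
    print(cur.fetchall())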

Solution
At this point the only thing you can really do is generate a unique random ID such as a UUID4 and use that as the primary key, as opposed to letting the database auto increment a number.  This still allows me to deduplicate on a specific key and batch my operations.
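A sketch of what that looks like on the django side (names reconstructed as before; django had no UUIDField back then, so a CharField stands in):

    import uuid
    from django.db import models

    def make_uuid():
        return uuid.uuid4().hex  # 32 random hex chars

    class Entry(models.Model):
        id = models.CharField(max_length=32, primary_key=True,
                              default=make_uuid, editable=False)
        content = models.CharField(max_length=255)
        category = models.CharField(max_length=64)
        last_seen = models.DateTimeField()

        class Meta:
            unique_together = ("content", "category")  # the dedup key

The upsert still keys on the unique (content, category) index; the primary key is just no longer a counter that can run out.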

UPDATE 2013-01-05
The unique index somehow interfered with the primary key after it was switched to char, so I dropped the unique index, dumped the UUID, and generated the pk by concatenating the two attributes that needed to be unique and taking a sha1sum.  That also made updates a lot faster.
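In other words something like this hypothetical helper (the separator byte keeps ('ab', 'c') from colliding with ('a', 'bc')):

    import hashlib

    def make_pk(content, category):
        # Deterministic pk: the same (content, category) pair always maps to
        # the same row, so the upsert can key directly on the primary key and
        # the separate unique index is no longer needed.
        return hashlib.sha1((content + "\x00" + category).encode("utf-8")).hexdigest()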
Live and learn, as the Russian saying goes.

Wednesday, December 19, 2012

An interesting kerberos and ssh troubleshoot

There was an interesting kerberos troubleshoot today: someone set up a round-robin DNS solution where a host, let's call it server-x, had a bunch of A records pointing to different IP addresses.

For illustration purposes let's say it looks like this:
server-x.domain.com 15 IN A 192.168.1.1
server-x.domain.com 15 IN A 192.168.1.2
server-x.domain.com 15 IN A 192.168.1.3 
server-x.domain.com 15 IN A 192.168.1.4

server1.domain.com 300 IN A  192.168.1.1
server2.domain.com 300 IN A  192.168.1.2
server3.domain.com 300 IN A  192.168.1.3 
server4.domain.com 300 IN A  192.168.1.4

1.1.168.192 300 IN PTR server1.domain.com 
2.1.168.192 300 IN PTR server2.domain.com
3.1.168.192 300 IN PTR server3.domain.com
4.1.168.192 300 IN PTR server4.domain.com

When attempting to ssh to server-x it would sometimes work, but sometimes return an error that it failed to initialize the GSS context.  We finally dug in and found the following:

With a completely clean cache (i.e. TGT only), when the failure occurred we could tell that server1 was being contacted but the cache contained a service ticket for server2.  It turned out that ssh does its own name resolution separate from GSSAPI's canonicalisation, so the two could land on different hosts behind the round-robin name.  The workaround we found was to wrap the call in a script that first resolves the name and then passes the result to ssh.  This way both ssh and GSSAPI skip the resolution step.
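The wrapper was essentially this (a sketch, assuming the forward and reverse records from the illustration above):

    #!/usr/bin/env python
    # Resolve the round-robin name once, then hand ssh the stable per-host
    # name, so ssh and GSSAPI end up agreeing on which principal to ask for.
    import socket
    import subprocess
    import sys

    target = sys.argv[1]                     # e.g. server-x.domain.com
    addr = socket.gethostbyname(target)      # pick one of the A records
    host = socket.gethostbyaddr(addr)[0]     # PTR: server1.domain.com, ...
    subprocess.call(["ssh", host] + sys.argv[2:])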

A bit later my colleague discovered an option in ssh called GSSAPITrustDns (e.g. GSSAPITrustDns yes in ~/.ssh/config), which makes sure that the name is resolved only once by ssh and then passed on to gssapi, preventing the double resolution.

The longer answer is that if you must use kerberos behind a load balancer, do not use round-robin DNS.  In fact, round-robin is a pretty bad load balancer for just about anything, kerberized or not.

Saturday, November 10, 2012

A little on verilog

So I am trying to learn verilog. I have a counter,
reg counter;
And I am expecting it to reach 15 at some point, but it never does; it just alternates between 1 and 0.  When I declare counter as an integer, everything works.  Well, reg by default is just 1 bit wide, so the arithmetic wraps: 1+0 = 1 and 1+1 = 0.  Declaring it as reg [31:0] counter fixed the problem.

Friday, August 31, 2012

error:0B084009:x509 certificate routines:X509_load_cert_crl_file:PEM lib

Started getting this error today after renewing the fake CA cert I'm using to test out an app.  No, it wasn't the newlines; the problem was a missing '-' character at the end of the last line (a truncated -----END CERTIFICATE----- marker).

This is what happens when you just copy and paste the cert.
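A quick sanity check for the next time a cert gets pasted around, nothing fancier than looking for the trailer:

    # Verify the PEM trailer survived the copy/paste.
    pem = open("ca.pem").read()
    if not pem.rstrip().endswith("-----END CERTIFICATE-----"):
        raise SystemExit("truncated PEM trailer")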

Sunday, August 12, 2012

Borland C++

Moving can be fun: you find all these forgotten gems.  One last look before I throw these out.

Saturday, June 2, 2012