Recently at work I wrote a webapp using django, the point of this app was to look at some data feeds and create or update entries an entry is believed to already exist if it can be found by content and category. If the entry already exists we just have to update its last_seen field with a NOW() timestamp Each entry had an id field which was auto incremented on each entry (handled by mySQL).
1. Using ORM exclusively
At first I relied on django to handle all the refresh details and provide the web UI for querying via the admin interface. In addition a view was created where users could POST a JSON query and get a response. The django ORM provides an API call get_or_create which I was going to use to this in order to know if I already have something with content and category X and just update last_seen or if I needed to create a new one (https://docs.djangoproject.com/en/dev/ref/models/querysets/). However this turned out to be too slow since for each record (and there turned out to be about 4 million of them) I first had to execute a django get/SQL select first.
2. Attempt to avoid network latency
I decided to use mySQL's upsert feature (http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html) in order to let the database handle deduplication. Also since at this point I was no longer using django for refreshes I decided to use mySQL batch (executemany see http://mysql-python.sourceforge.net/MySQLdb.html) operation. This looked a little better, in batches of 5000 loads only took two hours (locally running mySQL instance was only 15 minutes, sigh, but I work with what I got) still maintaining approximately 4 million records on each load.
Disaster strikes
Everyone is happy now though could be better. But two months later I notice that this db instance that hosts other databases for me is misbehaving. After some investigation I noticed calls to my reputation database are hanging and I have about 20 or so stalled update jobs. Killed the jobs and had a look, a local DBA advised me that I am bumping up against my auto increment limit
i.e. he ran this on my table SELECT Auto_increment FROM information_schema.tables WHERE table_name='the_table_you_want';
Further he suggested that we update the table id field to bigint to quickly solve the problem. I looked at my code and saw that I do not do any deletes and not too many adds I was puzzled as to how I was able to hit the upper limit of int, but decided to revisit this problem if it happened again.
It happened again (or is about to)
The same DBA called me two days later and informed me that while I have about a week to fix this I am once again approaching my upper limit of auto increment in leaps of 5000 (where did I see this number before). Again I told him that I do no deletes and very few inserts as the entities on the black list chnage rarely so for the most part I set the NOW() timestamp in the last_seen field
Root cause
As it turned out because my batch operation first attempts an INSERT incermenting the autoincrement pointer and then UPDATES on duplicate, however it never decrements the pointer. The DBA assured me that this was duplicated behavior (non-the-less lame as I assured him).
Solution
At his point the only thing you can really do is attempt to generate a unique random ID such as UUID4 and use that as opposed to letting the database auto increment a number. This still allows me to deduplicate on a specific key and batch my operations.
UPDATE 2013-01-05
The unique index somehow interfered with primary key after it was switched to char, so I dumped the unique index and dumped uuid and generated pk by concatenating the two attributes that needed to be unique and generating a sha1sum. That also made updates a lot faster.
век живи, век учись
No comments:
Post a Comment