Friday, February 6, 2015

Dynamically changing database routing for Django database access

Django's db routing lets you define read and write databases; however, sometimes that is not enough.

Why just read and write are not good enough
-- Multi-op functions rely on both sources being the same and reflecting updates
-- However, with transaction isolation that is not possible
-- Need a context-based db as opposed to a simple per-read/per-write choice, that is, some other data source that is neither the read nor the write database

This helps
http://python.dzone.com/articles/django-switching-databases

Recently, a project I completed a while ago needed a small modification: each app server would get a read-only replica of the master database, so that if we lost connectivity to the read/write master we would be unable to manipulate the data, but could at least continue to serve data already contained in the database.

Luckily the project was in Django, and one of the things Django provides is the ability to use multiple databases.  Django also provides the concept of a database router that allows you to direct reads and writes to different databases.  That should be enough, right?  Wrong.  Imagine a function that first reads an object from a database, modifies it, and writes it back; Django does not like that, because the read database the object came from is different from the write database.  So what needs to be done is to permit relations between objects and databases, or, since the tables (and hence the objects they contain) are identical, simply permit relations between objects:
allow_relation(obj1, obj2, **hints)
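
For reference, here is a minimal sketch of such a router; the 'default' and 'replica' aliases, the module path, and the class name are my placeholders, while the method names come from Django's router API.  A context-based switch like the one in the dzone article above would have db_for_read consult, say, a thread-local flag instead of returning a constant.

# settings.py (assumed database aliases)
# DATABASES = {'default': {...}, 'replica': {...}}
# DATABASE_ROUTERS = ['myapp.routers.ReplicaRouter']

class ReplicaRouter(object):
    # Send reads to the local read-only replica and writes to the master.
    def db_for_read(self, model, **hints):
        return 'replica'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        # the schemas are identical, so allow relations between objects
        # regardless of which database they came from
        return True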

Tuesday, July 22, 2014

Django QuerySets are lazy

https://docs.djangoproject.com/en/dev/topics/db/queries/#querysets-are-lazy
I've known this for quite some time, but never had the chance to really use it until recently, when I was going over some old code.  Someone had asked me to implement a search feature where objects are searched by some of their attributes.  This was back when I was just starting out in Python, so the search was implemented poorly: a lot of filtering was done in the app itself, filter building was really awkward, and QuerySets were being evaluated for intermediate results.  The logic basically consisted of a loop that would take a filter condition, create a query set, then combine the resulting query sets together and return them.

This worked OK until recently, when someone pointed out that the proper way would be to apply an AND condition across all the filters, or at least provide an option to do so.

Knowing that QuerySets are lazy I knew it was safe to come up with something like this

reduce(lambda acc, x: acc.filter(criteria=x), xs, myobj.objects.all())

This would repeatedly apply the filter and also allow me to construct filters on the fly.

Oh, but you might say, what if you are not searching by a field called criteria?  Well, you can actually define both the field and the value you are searching for; however, rather than me copying and pasting someone else's work, just read this for further explanation: http://www.nomadjourney.com/2009/04/dynamic-django-queries-with-kwargs/
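
To make that concrete, here is a minimal sketch combining reduce with dynamic keyword filters; myobj is the same placeholder model as above, and the lookup names are made up for illustration.

from functools import reduce  # reduce is a builtin in Python 2; the import keeps this working on Python 3

def build_queryset(model, criteria):
    # criteria is an iterable of (field_lookup, value) pairs; each pair
    # becomes a successive .filter() call.  Nothing hits the database here;
    # the query only runs when the returned QuerySet is evaluated.
    return reduce(
        lambda qs, pair: qs.filter(**{pair[0]: pair[1]}),
        criteria,
        model.objects.all(),
    )

# hypothetical usage:
# qs = build_queryset(myobj, [('name__icontains', 'foo'), ('created__year', 2014)])
# list(qs)  # one SQL query with all the filters ANDed together is issued here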

My whole point is that you can apply successive filters without any time penalty.  Not a single db query will be sent until you evaluate the query set.

Friday, February 28, 2014

on logging

I like to log.  I log in as much detail as I can, my applications produce massive logs daily, and still I am not happy and keep tweaking my logs -- add another variable to the output here, this needs to be logged as well, etc.

If you have a distributed application with function calls going up and down the stack and forwarding to other components on the network, you need to have a unifying identifier; otherwise correlating and interpreting logs will be very difficult.

How to tie logs together
How to pick an identifier and set it.  Let's say you have some application, be it a web app or some network app, and you want to be able to trace function calls and logs and connect them together.  At the first entry point pick an identifier; if no meaningful identifier that pertains to the application exists (such as a DHCP transaction ID), just pick a random one.  I find that a uuid works really well in this case.  At this point you have established an identifier, but how will you let everyone know what it is?  In synchronous frameworks such as Django you may be able to just use some global variable which everyone can refer to.  However, in asynchronous frameworks such as Twisted, you have no choice but to pass the id around from function to function.  So just go ahead and make sure all your functions take a req_id parameter.  It also helps to have a uniform logging function to wrap around your log facility, something similar to

def log(logger, req_id, status, msg, extra=None):
    ...

extra is for things that did not fit into any other category: a variable or a dict that I make up on the spot when I want to print multiple things.  Call str on it or use pprint to format it.

What else to put there
In addition to printing that, it is a good idea to also let the person reading the logs know where the log entry came from: module, function and line number.  Python's logging formatter can be configured to include all of that; however, since the calls go through a wrapper, the formatter will report the function and line number as being the log function itself, so I usually get the caller's location with inspect instead:

import inspect
caller = inspect.stack()[1]                # frame record of the function that called us
func_name, line_no = caller[3], caller[2]  # caller's function name and line number
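
Putting the two together, here is a sketch of what such a wrapper might roughly look like; the field layout and names are my own assumptions, not a fixed format.

import inspect
import logging
import pprint

def log(logger, req_id, status, msg, extra=None, level=logging.INFO):
    # Uniform wrapper: every line carries the request id plus the caller's
    # module, function and line number (pulled with inspect, as above).
    frame = inspect.stack()[1]             # frame record of whoever called log()
    module = inspect.getmodule(frame[0])
    where = "%s:%s:%s" % (module.__name__ if module else "?", frame[3], frame[2])
    line = "req_id=%s status=%s at=%s msg=%s" % (req_id, status, where, msg)
    if extra is not None:
        line += " extra=%s" % pprint.pformat(extra)
    logger.log(level, line)

# hypothetical usage:
# logger = logging.getLogger("dhcp")
# log(logger, req_id, "OFFER_SENT", "sent offer", extra={"ip": "192.168.1.23"})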

Where to print
EVERYWHERE!  I usually print when entering a function and when exiting: when entering I print the interesting parameters, and when exiting I print what it is returning.  Every time a function does something, I log: find an available IP in the subnet, log; ping that IP to see if it is free, log; create a record with the IP, log; creation passed or failed, log.  For any other action, such as database calls, always log the results.  One day you will want to know why your app did something and you will wish you had those logs.
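
One way to cut down on the enter/exit boilerplate is a small decorator along these lines; this is just a sketch, and the logger and the req_id keyword argument are assumptions.

import functools
import logging

def log_calls(logger):
    # Log arguments on entry and the return value on exit; the req_id is
    # pulled from the keyword arguments if the caller passes one.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            req_id = kwargs.get('req_id', '-')
            logger.info("req_id=%s enter %s args=%r kwargs=%r",
                        req_id, func.__name__, args, kwargs)
            result = func(*args, **kwargs)
            logger.info("req_id=%s exit %s returned=%r",
                        req_id, func.__name__, result)
            return result
        return wrapper
    return decorator

# hypothetical usage:
# @log_calls(logging.getLogger("dhcp"))
# def find_free_ip(subnet, req_id=None):
#     ...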

Log confidentiality
Logs usually go on local disk and then may get moved to some cheap storage, and most logs are not treated as confidential.  If your application accepts passwords and you must log user input for certain things, I suggest replacing the password with **** or ####; by no means do a 1:1 character replacement, as you do not want to give a hint as to the length.  This way your logs do not have to contain confidential information.
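
A quick sketch of that idea; the list of key names to mask is an assumption, adjust it to whatever your application actually accepts.

SENSITIVE_KEYS = ('password', 'passwd', 'secret', 'token')

def scrub(params):
    # Return a copy of params with sensitive values masked.  The mask is a
    # fixed width regardless of the original value, so the log gives no hint
    # about the password's length.
    return dict(
        (key, '####' if key.lower() in SENSITIVE_KEYS else value)
        for key, value in params.items()
    )

# hypothetical usage:
# log(logger, req_id, "LOGIN_ATTEMPT", "user input", extra=scrub(form_data))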

Logs without context are meaningless
Logs need to provide context, such as:
- I am in the process of finding an IP for aa:bb:cc:dd:ee:ff.
- Sent an offer with IP 192.168.1.23 to aa:bb:cc:dd:ee:ff.
- Received a NAK from aa:bb:cc:dd:ee:ff.
- about to write /etc/cron.d/job with artifact ID crown.fabdeed12da




Friday, November 8, 2013

How to randomize a list

I think the random module already does this, but it was a fun challenge to do it recursively, so we decided to see how we could kill this horse.

import random
def randomize_list(lst):
    if lst == []:
        return []
    # this case is not strictly required (randrange(0, 1) is fine),
    # but it lets us return the last element directly
    if len(lst) == 1:
        # really it would be enough to just pop the last
        # element and not have to cons with an []
        return [ lst.pop() ]
    else:
        return [ lst.pop(random.randrange(0, len(lst))) ] + randomize_list(lst)

YES!

>>> random_list.randomize_list([1,2,3,4,5,6,7,8,9,10])
[8, 3, 2, 4, 9, 7, 5, 1, 6, 10]
>>> random_list.randomize_list([1,2,3,4,5,6,7,8,9,10])
[8, 5, 10, 2, 4, 1, 9, 6, 3, 7]
>>> random_list.randomize_list([1,2,3,4,5,6,7,8,9,10])
[3, 5, 2, 8, 7, 10, 6, 9, 4, 1]

At that point much geekery ensued over what would be a more functional way of doing this.  For example, lst.pop creates a side effect, so what we really want to do is perhaps something like

index = random.randrange(0, len(lst))
return [ lst[index] ] + randomize_list( lst[:index] + lst[index+1:] )

And then someone came along and said just use random.shuffle
>>> a = range(1,20)
>>> a
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> random.shuffle(a)
>>> a
[11, 9, 7, 19, 13, 18, 16, 3, 6, 2, 14, 8, 5, 4, 10, 17, 1, 12, 15]

>>>

While this is fine, I am not a particular fan because it changes the list in place, one of the things I find lacking in Python.
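
As a side note, if you want the standard library without the in-place mutation, random.sample can hand back a new list:

import random

a = range(1, 20)
shuffled = random.sample(a, len(a))  # new list in random order; a is left untouched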


Thursday, June 20, 2013

python types as tuples and reduce

I find it interesting how things can be expressed in terms of tuples in python
>>> list(( dict( ( ("a",1), ("b", "abc"), ("c", list( (1,2,3) ) ) ) ), ))
[{'a': 1, 'c': [1, 2, 3], 'b': 'abc'}]

reminds me of cons a little

so now if I break this further I can get:

>>> ( list, ( dict, ( list, ( ("a",1), ("b", "abc"), ("c", list, (1,2,3)) ) ) ) )
(<type 'list'>, (<type 'dict'>, (<type 'list'>, (('a', 1), ('b', 'abc'), ('c', <type 'list'>, (1, 2, 3))))))

and then rebuild this with possibly some sort of a reduce, such as

>>> reduce(lambda acc, x: acc +[x], (1,2,3), [])
[1, 2, 3]
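
As a toy illustration of that rebuilding idea, here is a small recursive builder over a simplified (constructor, items) spec of my own; it is not the exact tuple shape printed above, just the same flavor.

def build(spec):
    # spec is either a scalar, or a 2-tuple (list, items) / (dict, items),
    # where items holds nested specs (for list) or (key, spec) pairs (for dict)
    if isinstance(spec, tuple) and len(spec) == 2 and spec[0] in (list, dict):
        ctor, items = spec
        if ctor is dict:
            return dict((key, build(value)) for key, value in items)
        return [build(item) for item in items]
    return spec

# build((list, ((dict, (("a", 1), ("b", "abc"), ("c", (list, (1, 2, 3))))),)))
# gives back [{'a': 1, 'b': 'abc', 'c': [1, 2, 3]}]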


Sunday, March 24, 2013

Delimiter-less network protocol with JSON

Recently, an "opportunity" came up to optimize a certain network application.  This service allows users to send queries in JSON format and receive a list of results in reply, also as JSON; no results just yields a [].  This worked fine for a while, until the number of queries grew and reports started coming in that it was taking a really long time (over an hour) to get work done.  I first optimized some DB queries such that not all searches are done upfront; only if an interesting match is found is an additional search sent, banking on the fact that most queries will be misses.  This considerably sped up processing, since all it does now is a lookup from a single table on indexed data with no joins.  The question is, can this be made even faster?

The current implementation is an nginx web server sitting on top of uWSGI, sitting on top of Django, sitting on top of a MySQL database.  I wondered if there might be any way I could optimize this stack.  The obvious choice is perhaps to get rid of Django; however, with uWSGI spawning ten workers and each worker being good for five thousand connections, there is no per-request forking overhead to remove, and it does not appear that uWSGI even bothers to reread my Python code from file on every request, so again not an issue.  I am left with HTTP and MySQL over a network connection.  The obvious thing to do is perhaps move MySQL closer to the service, cache results until the next DB reload, or maybe even get rid of MySQL and do something very simple.  The problem is there are a few other components of this application that rely on MySQL + Django.  So the next thought was: even if this accomplishes little, wouldn't it be fun to take the web server and HTTP out of the equation and implement something a little more lightweight?

About seven years ago a buddy of mine and I attempted to write a chat application in Java.  The client and server were both in Java and the protocol used a typical TLV scheme, so decoding such a beast was very simple: message boundaries were very clearly demarcated and taking messages apart was easy.  In this case I was hoping to not use any delimiters aside from the JSON itself, which already provides enough structure.  In addition I wanted to give the user the option to keep a connection open, send multiple JSON messages, and get multiple replies.  This is slightly more complicated than an HTTP POST, since you are not guaranteed that all the data will arrive in the same packet or that the entire message will be reassembled by a higher-level protocol.  Since I also wanted the option to keep a connection open and send multiple requests, the ending of one JSON structure and the beginning of the next one may arrive in the same packet, further complicating the handling.

It seems like a recursive descent parser will not work here, so I decided to write my own streaming JSON parser (think SAX vs DOM) so that I can scan the input character by character and know when a full message has been received.  Luckily, as mentioned before, JSON is self-delimiting, so I did not have to work too hard; my parser does not have to be very smart, it just needs to know when to hand the resulting string to Python's json.loads.

So I came up with the following constraints:
  • messages can either start with [ or {
  • messages cannot start with a scalar value such as an alphanumeric character
  • messages cannot start with ] or }
  • It should parse escaped characters, but I can worry about that later
In addition a simple design:
  • incoming data is scanned character by character and written to a string buffer (StringIO in Python); this is done because strings are immutable and creating a new string through concatenation for every character is costly
  • a stack is kept onto which the characters [, ", { are pushed; when a closing ], ", } arrives and it is indeed the closing character for the one on top of the stack, we pop
  • we need a control flag to know when we are inside a string, because characters like [ { ] } inside a string should be treated as nothing but part of the string
  • If a parsing error is encountered the existing accumulator is dumped along with the stack and parsing proceeds from where it left off
I picked Twisted Python as my framework because I have always wanted to play with it more and had yet to write a project from scratch using it.  Supposedly event loops are much faster than having multiple workers, and also I just wanted to do something new.


Packet processor:
The first thing I wrote was the parser, which I called JSONAccumulator.  Overall it is not very interesting, except for how the data is returned back to the caller.  At first I wrote it to return results, but then realized that I should not return from my data processor until the entire data packet is processed; otherwise the processor drops the remaining data on the floor when it returns to signal completion or an error.

Generator to the rescue:
Luckily Python has a great feature that allows a function to hand control back to the caller along with some data.  And so I rewrote the processor as a generator.

So now when the processor wants to tell the caller something, it simply yields a result.  The result is now yielded when either an error occurs, JSON assembly is complete or the data packet has been exhausted.

Here is the full code of the packet processor:
1:    def process_data(self, data):  
2:      for char in data:  
3:        self.char_buffer.write(char)  
4:        self.data_ptr += 1  
5:    
6:        # JSON structures should only start with [ or {  
7:        if len(self.ctrl_stack) == 0 and char != '[' and char != '{':  
8:          self.status.current_status = self.status.ERROR  
9:          self.status.error_str = "Must start with { or ["  
10:          self.reinit()  
11:          # return here do not place more char in the buffer  
12:          yield {'STATUS': self.status.current_status, 'ERROR_STR': self.status.error_str}  
13:            
14:        elif char == '"':  
15:          if self.in_string == True and self.ctrl_stack[-1] == '"':  
16:            # we've matched a string  
17:            self.ctrl_stack.pop()  
18:            self.in_string = False  
19:          else:  
20:            # We're in a string  
21:            self.ctrl_stack.append(char)  
22:            self.in_string = True  
23:              
24:        elif self.in_string == False and (char == '[' or char == '{'):  
25:          self.ctrl_stack.append(char)  
26:        elif self.in_string == False and (char == ']' or char == '}'):  
27:          # pop only if this closes the character on top of the stack  
28:          if len(self.ctrl_stack) and self.ctrl_stack[-1] == self.closing_char[char]:  
29:            self.ctrl_stack.pop()  
30:          else:  
31:            # some sort of parsing error  
32:            self.status.current_status = self.status.ERROR  
33:            self.status.error_str = "Unmatched closing control '{char}'".format(char=char)  
34:            self.reinit()  
35:            # return here do not place more char in the buffer  
36:            yield {'STATUS': self.status.current_status, 'ERROR_STR': self.status.error_str}  
37:        if len(self.ctrl_stack) == 0:  
38:          # we've assembled a json structure  
39:          self.status.current_status = self.status.COMPLETE  
40:          self.status.error_str = ""  
41:          assembled_json = self.char_buffer.getvalue()  
42:          self.reinit()  
43:          yield {'STATUS': self.status.current_status,  
44:              'ERROR_STR': self.status.error_str,  
45:              'JSON': assembled_json}  
46:    
47:      self.status.current_status = self.status.MORE_DATA  
48:      self.status.error_str = ""  
49:      yield {'STATUS': self.status.current_status,  
50:          'ERROR_STR': self.status.error_str}  
51:      return  

This is rather straightforward, and only a few things warrant attention here.  The reinit() function is called to reset state, so it is called in JSONAccumulator's constructor, when JSON assembly is complete, and when an error is encountered.

Reinit code:
def reinit(self):
    self.char_buffer = StringIO.StringIO()
    self.ctrl_stack = []
    self.in_string = False
    self.data_ptr = -1

A few more things: ctrl_stack is the stack where opening characters are placed, and char_buffer is the string buffer where characters are written and then extracted from when the JSON assembly is completed.

There are a few lines worth mentioning:
Line 7 is a check to make sure that a structure opens with { or [
Line 14 handles " since the closing and opening characters for a string are the same
Line 24 handles opening characters and pushes them onto the stack
Line 26 checks if a closing character matches an opening character and pops the matching opening character off the stack
Line 37 JSON assembly completeness check

The twisted part:
This being quite a simple protocol, it turned out there was not a whole lot to do as far as Twisted is concerned.

from twisted.internet import protocol, reactor

class JSONProtocol(protocol.Protocol):
    def __init__(self):
        self.json_accumulator = JSONAccumulator()

    def dataReceived(self, data):
        char_processor = self.json_accumulator.process_data(data)
        try:
            for yvalue in char_processor:
                if yvalue['STATUS'] == JSONAccumulator.status.ERROR:
                    self.transport.write(str(yvalue) + "\n")
                if yvalue['STATUS'] == JSONAccumulator.status.COMPLETE:
                    self.transport.write(str(yvalue) + "\n")
        except StopIteration:
            pass

class JSONFactory(protocol.ServerFactory):
    protocol = JSONProtocol


def main():
    reactor.listenTCP(8000, JSONFactory())
    reactor.run()


if __name__ == '__main__':
    main()

Sample output:

levb@levb-desktop:~/DEV$ echo -n '["a", "b", "c"]{"a":1}' | nc -q 2 localhost 8000
{'STATUS': 'COMPLETE', 'JSON': '["a", "b", "c"]', 'ERROR_STR': ''}
{'STATUS': 'COMPLETE', 'JSON': '{"a":1}', 'ERROR_STR': ''}


levb@levb-desktop:~/DEV$ echo -n '"c"]{"a":1}' | nc -q 2 localhost 8000
{'STATUS': 'ERROR', 'ERROR_STR': 'Must start with { or ['}
{'STATUS': 'COMPLETE', 'JSON': '', 'ERROR_STR': ''}
{'STATUS': 'ERROR', 'ERROR_STR': 'Must start with { or ['}
{'STATUS': 'COMPLETE', 'JSON': '', 'ERROR_STR': ''}
{'STATUS': 'ERROR', 'ERROR_STR': 'Must start with { or ['}
{'STATUS': 'COMPLETE', 'JSON': '', 'ERROR_STR': ''}
{'STATUS': 'ERROR', 'ERROR_STR': 'Must start with { or ['}
{'STATUS': 'COMPLETE', 'JSON': '', 'ERROR_STR': ''}
{'STATUS': 'COMPLETE', 'JSON': '{"a":1}', 'ERROR_STR': ''}


What else?
Right now there is no actual work being done; I still have to teach this code to query a database, but even though I got distracted from my main purpose, this turned out to be an interesting project.  I am sure at some point it might also be fun to add some sort of authentication and privacy on top of this channel, perhaps TLS or GSSAPI.  Also, I am sure this code can be cleaned up and shortened a bit.

Code:
Full code is available at https://github.com/pu239ppy/json-stream-parser

Saturday, February 2, 2013

Haskell: Sieve of Eratosthenes

Recently I needed to compute prime numbers from 2 up to x; the Sieve of Eratosthenes seemed like a simple and straightforward (albeit not too fast) way to get this done.


eliminator :: Int -> [Int] -> [Int]
eliminator _ [] = []
eliminator start list = filter (\x -> x `rem` start /= 0) list

sieve :: [Int] -> [Int]
sieve [] = []
sieve (x:xs) = x : (sieve $ eliminator x xs)

*Main> sieve [2..100]
[2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97]

There is one known limitation -- sieve assumes that the first number in the sequence is a prime.