Monday, January 30, 2012

Shingles


It is easy to create a naive Data Leakage Protection (DLP) product that looks for exact data or pattern matches. It is a lot more difficult to spot similarity between documents, for example to report that a document is x% similar to a reference document. This article on near-duplicates and shingling looks interesting, and the approach seems easy to implement: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
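
To get a feel for the idea, here is a minimal sketch of word-level shingling with a Jaccard overlap score. It is not the full method from the article: it compares the raw shingle sets directly and leaves out the hashing and sketching steps used to scale to large collections. The function names (shingles, jaccard, similarity_percent), the tokenisation rule, and the default k=4 are my own assumptions for illustration.

```python
import re


def shingles(text, k=4):
    """Return the set of k-word shingles (as tuples) found in text."""
    # Assumption: lowercase the text and treat runs of word characters as tokens.
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: use the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        # Assumption: two empty documents count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def similarity_percent(doc, reference, k=4):
    """Rough 'doc is x% similar to reference' score based on shared shingles."""
    return 100.0 * jaccard(shingles(doc, k), shingles(reference, k))


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    candidate = "the quick brown fox leaps over the lazy dog"
    print(similarity_percent(candidate, reference, k=2))
```

A smaller k makes the score more forgiving of small edits, while a larger k makes it stricter. For a DLP check, a document would probably be flagged when its score against any protected reference goes above some threshold, and that threshold would need tuning on real data.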