http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html
http://infolab.stanford.edu/~ullman/mmds/bookL.pdf (Chapter 3, around page 83)
Doc --> shingles --> hashed shingles --> MinHash signature (much smaller signature matrix; similarity becomes an estimate, so some accuracy may be lost) --> LSH (avoids computing the similarity for every pair; possible false positives and false negatives)
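A minimal Python sketch of the first four stages of the pipeline above (doc, shingles, hashed shingles, MinHash signature). The shingle length k=9, the use of zlib.crc32 as the 4-byte shingle hash, the random linear hash family modulo a Mersenne prime, and all function names are assumptions for illustration, not taken from either reference. It assumes each text has at least k characters.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus (assumed choice)

def shingles(text, k=9):
    """Return the set of k-character shingles of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hash_shingles(shingle_set):
    """Map each shingle to a 32-bit (4-byte) integer."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingle_set}

def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(hashed, params):
    """For each hash function, keep the minimum value over all hashed shingles."""
    return [min((a * x + b) % PRIME for x in hashed) for a, b in params]

def estimate_similarity(sig1, sig2):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def jaccard(s1, s2):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(s1 & s2) / len(s1 | s2)

if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumps over the lazy cat"
    params = make_hash_params(250)  # 250 hash functions, as in the notes
    set_a, set_b = hash_shingles(shingles(doc_a)), hash_shingles(shingles(doc_b))
    sig_a, sig_b = minhash_signature(set_a, params), minhash_signature(set_b, params)
    print("exact Jaccard:   ", jaccard(set_a, set_b))
    print("MinHash estimate:", estimate_similarity(sig_a, sig_b))
```

The estimate only approximates the exact Jaccard value; a longer signature should tighten it at the cost of more bytes per document.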
50,000 bytes --> 400,000 bytes --> 200,000 bytes --> 1,000 bytes (a signature of 250 minhash values, 250 x 4 bytes) --> plus the LSH hash tables: the signature is split into b bands of r rows each, and each band is hashed to a bucket
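The banding step can be sketched as follows, assuming each signature is a list of integers of length b x r. Two documents become a candidate pair if they land in the same bucket in at least one band. The function names and the example split b=50, r=5 (which multiplies out to 250) are assumptions for illustration; here the tuple of r values in a band is used directly as the bucket key.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: dict doc_id -> list of ints of length bands * rows.
    Returns the set of (doc_id, doc_id) pairs sharing a bucket in some band."""
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Documents whose r values in this band are identical share a bucket.
            buckets[tuple(sig[start:start + rows])].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

def candidate_probability(s, bands, rows):
    """Probability that two documents with Jaccard similarity s become a
    candidate pair: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** rows) ** bands

if __name__ == "__main__":
    sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
    print(lsh_candidate_pairs(sigs, bands=2, rows=2))
    for s in (0.2, 0.5, 0.8):
        print(s, candidate_probability(s, bands=50, rows=5))
```

Only candidate pairs need a full similarity check afterwards. A similar pair that never shares a bucket is a false negative, and a dissimilar pair that does share one is a false positive.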