Tuesday, October 27, 2015

Good articles for shingle, minhash and LSH

http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html

http://infolab.stanford.edu/~ullman/mmds/bookL.pdf   (Chapter 3, round page 83)

Doc --> Shingle --> hash --> minhash (may lose accuracy by using a much smaller signature matrix) -->  LSH (not compute the similarity for every pair; possible false positive or false negative)

50,000 bytes --> 400,000 bytes --> 200,000 bytes --> 1,000 bytes (based on 250 signatures) -->  additional b (band) x r (row) hash matrix

Friday, September 25, 2015

How to override cluster jar files

If you need to use your own version of a jar file in your map reduce job, you can do so through following steps:

1. Bundle the jar file into your map reduce jar or pass the jar file through distributed cache.

2. From your map reduce client, update mapred-site.xml file to put following section into it:

mapreduce.job.user.classpath.first=true

(of course, you need to call conf.addResource() to add mapred-site.xml on your client code)

3. Restart your client.  

Saturday, July 25, 2015

Dynamically add jar files to classpath

    /**
     * Adds jar files to classpath.
     * @param file the folder for jar files.
     * @param classLoader the URLClassLoader
     * @throws IOException
     */
public void addCustomJars(File file, URLClassLoader classLoader) {
File[] jarFiles = file.listFiles();
if (jarFiles != null) {
if (logger.isDebugEnabled()) {
      URL[] urls = classLoader.getURLs();
      for (URL url : urls) {
      logger.debug("URL before custom jars are added:" + url.toString());
      }
}

   Class<?> sysclass = URLClassLoader.class;
    Method method = null;
try {
method = sysclass.getDeclaredMethod("addURL",parameters);
} catch (NoSuchMethodException e1) {
logger.error("Unable to find addURL method", e1);
return;
} catch (SecurityException e1) {
logger.error("Unable to get addURL method", e1);
return;
  }
    method.setAccessible(true);

    // add each jar file under such folder
for (File jarFile : jarFiles) {
if (jarFile.isFile() && jarFile.getName().endsWith("jar")) {
try {
             method.invoke(classLoader,new Object[]{ jarFile.toURI().toURL() });
} catch (Exception e) {
logger.error("Failed to add classpath for " + jarFile.getName(), e);;
}
}
}

if (logger.isDebugEnabled()) {
      URL[] urls = classLoader.getURLs();
      for (URL url : urls) {
        logger.debug("URL after custom jars are added:" + url.toString());
      }
}
}

In your main:

URLClassLoader classLoader = (URLClassLoader)Thread.currentThread().getContextClassLoader();
addCustomJars(new File("path_to_jar_files"), classLoader);


Wednesday, June 10, 2015

Skip Hadoop mapper when condition is satisfied

If conf is an instance of JobConf:

conf.setMapRunnerClass(MyMapRunner.class);

A sample of your MyMapRunner, 

public class FindAndExitMapRunner<K1, V1, K2, V2> extends MapRunner<K1, V1, K2, V2> {
private Mapper<K1, V1, K2, V2> mapper;
private boolean incrProcCount;
public static final String FIND_INDICATE = "Exit after finding at least one data";
public static final String REACH_MAX = "Exit after reaching max lines";
 
@SuppressWarnings("unchecked")
public void configure(JobConf job) {
this.mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
//increment processed counter only if skipping feature is enabled
this.incrProcCount = SkipBadRecords.getMapperMaxSkipRecords(job)>0 && 
SkipBadRecords.getAutoIncrMapperProcCount(job);
}
@Override
public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
Reporter reporter) throws IOException {
   try {
    // allocate key & value instances that are re-used for all entries
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
    // map pair to output
    mapper.map(key, value, output, reporter);
    if(incrProcCount) {
    reporter.incrCounter(SkipBadRecords.COUNTER_GROUP, 
    SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
    }
    }
    } catch (IOException e) {
    // break if exception is thrown
    if (!FIND_INDICATE.equalsIgnoreCase(e.getMessage()) && 
    !REACH_MAX.equalsIgnoreCase(e.getMessage())) {
    // re-throw except if it is not what we expect.
    throw e;
    }
} finally {
    mapper.close();
    }
}

protected Mapper<K1, V1, K2, V2> getMapper() {
return mapper;
}
}

Then, you throw exception in your mapper when your condition is satisfied:

throw new Exception(MyMapRunner.FIND_INDICATE);


Tuesday, February 3, 2015

Awesome solutions for some super good problems in Cryptograph

http://webstersprodigy.net/tag/aes/

Just some quotations from above website:

One of the biggest reasons I think the class was so good was its focus on offense. I don’t really understand how defensive security people can try to defend stuff without understanding offense… yet the crypto classes I’d taken before tried to do exactly that. How was I supposed to understand why things needed to be done a certain way if I don’t know how it can break? Crypto books have been the same way – every crypto book I’ve read before (e.g. Bruce Schneier books) don’t seem to give much page space to offense. Dan brings the attacker’s perspective into every lecture, and I have a much better understanding of practical cryptography because of it.

Week 4 – CBC with IV

Problem:
An attacker intercepts the following ciphertext (hex encoded):
1
20814804c1767293b99f1d9cab3bc3e7 ac1e37bfb15599e5f40eef805488281d
He knows that the plaintext is the ASCII encoding of the message “Pay Bob 100$” (excluding the quotes). He also knows that the cipher used is CBC encryption with a random IV using AES as the underlying block cipher. Show that the attacker can change the ciphertext so that it will decrypt to “Pay Bob 500$”. What is the resulting ciphertext (hex encoded)? This shows that CBC provides no integrity.
Solution:
This is insecure because the first message block is xored with the random IV
20814804c1767293b99f1d9cab3bc3e7 ac1e37bfb15599e5f40eef805488281d
P a y B o b 1 0 0 $
9th char
0xb9 decrypts to 1
0xb9 xor ascii (1 xor 5)
0xb9 xor 0x31 xor 0x35
= 0xbd
20814804c1767293bd9f1d9cab3bc3e7 ac1e37bfb15599e5f40eef805488281d

Monday, December 15, 2014

Simple Java Book

http://www.programcreek.com/wp-content/uploads/2013/01/SimpleJava1.pdf

Saturday, November 8, 2014

IndexMinPQ

Index minimal priority queue combine the beauty of priority queue and functionality from array.

It is a priority queue sorted by Key (Key extends Comparable<Key>):

int delMin();  <-- delete the one which has lowest priority

It supports indexed insert, update and delete:

void insert(int i, Key key);
void increaseKey(int i, Key key);
void decreaseKey(int i, Key key);
void delete(int);  <-- delete the one on position i regardless its priority

All operations are log V or less.

It is used by:

1. Eager Prim MST (minimal spanning tree)

2. Dijkstra's shortest path algorithm

Invariables to keep in mind when dealing with IndexMinPQ:

1. Priority is stored in Key[]
2. Vertex is stored in pq[] -- this is the real priority queue
3. Using qp[] to reverse lookup from vertex id to its entry in pq[], i.e. pq[qp[i]] == i.

We can use 'delete' to see their relationships:

    /**
     * Remove vertex i (stored in pq[]) and its priority (stored in keys[]).
     * @param i the vertex id
     */
    public void delete(int i) {
        if (i < 0 || i >= NMAX) throw new IndexOutOfBoundsException();
        if (!contains(i)) throw new NoSuchElementException("index is not in the priority queue");
        int index = qp[i];
        exch(index, N--);
        swim(index);
        sink(index);
        keys[i] = null;
        qp[i] = -1;
    }

Both sink and swim are modified to use priority values stored in Key[] to change the order of priority queue:

private void swim(int k) {
    while (k > 1 && greater(k/2, k) {
        exch(k, k/2);
        k = k/2;
    }
}
private boolean greater(int i, int j) {
        return keys[pq[i]].compareTo(keys[pq[j]]) > 0;
}