http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html
http://infolab.stanford.edu/~ullman/mmds/bookL.pdf (Chapter 3, round page 83)
Doc --> Shingle --> hash --> minhash (may lose accuracy by using a much smaller signature matrix) --> LSH (not compute the similarity for every pair; possible false positive or false negative)
50,000 bytes --> 400,000 bytes --> 200,000 bytes --> 1,000 bytes (based on 250 signatures) --> additional b (band) x r (row) hash matrix
Your project can be on or off. Your project's priority can be changed. Your job can be changed. But the technology is always heading north!
Tuesday, October 27, 2015
Friday, September 25, 2015
How to override cluster jar files
If you need to use your own version of a jar file in your map reduce job, you can do so through following steps:
1. Bundle the jar file into your map reduce jar or pass the jar file through distributed cache.
2. From your map reduce client, update mapred-site.xml file to put following section into it:
mapreduce.job.user.classpath.first=true
(of course, you need to call conf.addResource() to add mapred-site.xml on your client code)
3. Restart your client.
1. Bundle the jar file into your map reduce jar or pass the jar file through distributed cache.
2. From your map reduce client, update mapred-site.xml file to put following section into it:
mapreduce.job.user.classpath.first=true
(of course, you need to call conf.addResource() to add mapred-site.xml on your client code)
3. Restart your client.
Saturday, July 25, 2015
Dynamically add jar files to classpath
/**
* Adds jar files to classpath.
* @param file the folder for jar files.
* @param classLoader the URLClassLoader
* @throws IOException
*/
public void addCustomJars(File file, URLClassLoader classLoader) {
File[] jarFiles = file.listFiles();
if (jarFiles != null) {
if (logger.isDebugEnabled()) {
URL[] urls = classLoader.getURLs();
for (URL url : urls) {
logger.debug("URL before custom jars are added:" + url.toString());
}
}
Class<?> sysclass = URLClassLoader.class;
Method method = null;
try {
method = sysclass.getDeclaredMethod("addURL",parameters);
} catch (NoSuchMethodException e1) {
logger.error("Unable to find addURL method", e1);
return;
} catch (SecurityException e1) {
logger.error("Unable to get addURL method", e1);
return;
}
method.setAccessible(true);
// add each jar file under such folder
for (File jarFile : jarFiles) {
if (jarFile.isFile() && jarFile.getName().endsWith("jar")) {
try {
method.invoke(classLoader,new Object[]{ jarFile.toURI().toURL() });
} catch (Exception e) {
logger.error("Failed to add classpath for " + jarFile.getName(), e);;
}
}
}
if (logger.isDebugEnabled()) {
URL[] urls = classLoader.getURLs();
for (URL url : urls) {
logger.debug("URL after custom jars are added:" + url.toString());
}
}
}
In your main:
URLClassLoader classLoader = (URLClassLoader)Thread.currentThread().getContextClassLoader();
addCustomJars(new File("path_to_jar_files"), classLoader);
* Adds jar files to classpath.
* @param file the folder for jar files.
* @param classLoader the URLClassLoader
* @throws IOException
*/
public void addCustomJars(File file, URLClassLoader classLoader) {
File[] jarFiles = file.listFiles();
if (jarFiles != null) {
if (logger.isDebugEnabled()) {
URL[] urls = classLoader.getURLs();
for (URL url : urls) {
logger.debug("URL before custom jars are added:" + url.toString());
}
}
Class<?> sysclass = URLClassLoader.class;
Method method = null;
try {
method = sysclass.getDeclaredMethod("addURL",parameters);
} catch (NoSuchMethodException e1) {
logger.error("Unable to find addURL method", e1);
return;
} catch (SecurityException e1) {
logger.error("Unable to get addURL method", e1);
return;
}
method.setAccessible(true);
// add each jar file under such folder
for (File jarFile : jarFiles) {
if (jarFile.isFile() && jarFile.getName().endsWith("jar")) {
try {
method.invoke(classLoader,new Object[]{ jarFile.toURI().toURL() });
} catch (Exception e) {
logger.error("Failed to add classpath for " + jarFile.getName(), e);;
}
}
}
if (logger.isDebugEnabled()) {
URL[] urls = classLoader.getURLs();
for (URL url : urls) {
logger.debug("URL after custom jars are added:" + url.toString());
}
}
}
In your main:
URLClassLoader classLoader = (URLClassLoader)Thread.currentThread().getContextClassLoader();
addCustomJars(new File("path_to_jar_files"), classLoader);
Wednesday, June 10, 2015
Skip Hadoop mapper when condition is satisfied
If conf is an instance of JobConf:
conf.setMapRunnerClass(MyMapRunner.class);
conf.setMapRunnerClass(MyMapRunner.class);
A sample of your MyMapRunner,
public class FindAndExitMapRunner<K1, V1, K2, V2> extends MapRunner<K1, V1, K2, V2> {
private Mapper<K1, V1, K2, V2> mapper;
private boolean incrProcCount;
public static final String FIND_INDICATE = "Exit after finding at least one data";
public static final String REACH_MAX = "Exit after reaching max lines";
@SuppressWarnings("unchecked")
public void configure(JobConf job) {
this.mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
//increment processed counter only if skipping feature is enabled
this.incrProcCount = SkipBadRecords.getMapperMaxSkipRecords(job)>0 &&
SkipBadRecords.getAutoIncrMapperProcCount(job);
}
@Override
public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
Reporter reporter) throws IOException {
try {
// allocate key & value instances that are re-used for all entries
K1 key = input.createKey();
V1 value = input.createValue();
while (input.next(key, value)) {
// map pair to output
mapper.map(key, value, output, reporter);
if(incrProcCount) {
reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
}
}
} catch (IOException e) {
// break if exception is thrown
if (!FIND_INDICATE.equalsIgnoreCase(e.getMessage()) &&
!REACH_MAX.equalsIgnoreCase(e.getMessage())) {
// re-throw except if it is not what we expect.
throw e;
}
} finally {
mapper.close();
}
}
protected Mapper<K1, V1, K2, V2> getMapper() {
return mapper;
}
}
Then, you throw exception in your mapper when your condition is satisfied:
throw new Exception(MyMapRunner.FIND_INDICATE);
Tuesday, February 3, 2015
Awesome solutions for some super good problems in Cryptograph
http://webstersprodigy.net/tag/aes/
Just some quotations from above website:
One of the biggest reasons I think the class was so good was its focus on offense. I don’t really understand how defensive security people can try to defend stuff without understanding offense… yet the crypto classes I’d taken before tried to do exactly that. How was I supposed to understand why things needed to be done a certain way if I don’t know how it can break? Crypto books have been the same way – every crypto book I’ve read before (e.g. Bruce Schneier books) don’t seem to give much page space to offense. Dan brings the attacker’s perspective into every lecture, and I have a much better understanding of practical cryptography because of it.
Just some quotations from above website:
One of the biggest reasons I think the class was so good was its focus on offense. I don’t really understand how defensive security people can try to defend stuff without understanding offense… yet the crypto classes I’d taken before tried to do exactly that. How was I supposed to understand why things needed to be done a certain way if I don’t know how it can break? Crypto books have been the same way – every crypto book I’ve read before (e.g. Bruce Schneier books) don’t seem to give much page space to offense. Dan brings the attacker’s perspective into every lecture, and I have a much better understanding of practical cryptography because of it.
Week 4 – CBC with IV
Problem:
An attacker intercepts the following ciphertext (hex encoded):
1
| 20814804c1767293b99f1d9cab3bc3e7 ac1e37bfb15599e5f40eef805488281d |
He knows that the plaintext is the ASCII encoding of the message “Pay Bob 100$” (excluding the quotes). He also knows that the cipher used is CBC encryption with a random IV using AES as the underlying block cipher. Show that the attacker can change the ciphertext so that it will decrypt to “Pay Bob 500$”. What is the resulting ciphertext (hex encoded)? This shows that CBC provides no integrity.
Solution:
This is insecure because the first message block is xored with the random IV
20814804c1767293b99f1d9cab3bc3e7 ac1e37bfb15599e5f40eef805488281d
P a y B o b 1 0 0 $
P a y B o b 1 0 0 $
9th char
0xb9 decrypts to 1
0xb9 xor ascii (1 xor 5)
0xb9 xor 0x31 xor 0x35
= 0xbd
0xb9 decrypts to 1
0xb9 xor ascii (1 xor 5)
0xb9 xor 0x31 xor 0x35
= 0xbd
20814804c1767293bd9f1d9cab3bc3e7 ac1e37bfb15599e5f40eef805488281d
Monday, December 15, 2014
Simple Java Book
http://www.programcreek.com/wp-content/uploads/2013/01/SimpleJava1.pdf
Saturday, November 8, 2014
IndexMinPQ
Index minimal priority queue combine the beauty of priority queue and functionality from array.
It is a priority queue sorted by Key (Key extends Comparable<Key>):
int delMin(); <-- delete the one which has lowest priority
It supports indexed insert, update and delete:
void insert(int i, Key key);
void increaseKey(int i, Key key);
void decreaseKey(int i, Key key);
void delete(int); <-- delete the one on position i regardless its priority
All operations are log V or less.
It is used by:
1. Eager Prim MST (minimal spanning tree)
2. Dijkstra's shortest path algorithm
Invariables to keep in mind when dealing with IndexMinPQ:
1. Priority is stored in Key[]
2. Vertex is stored in pq[] -- this is the real priority queue
3. Using qp[] to reverse lookup from vertex id to its entry in pq[], i.e. pq[qp[i]] == i.
We can use 'delete' to see their relationships:
/**
* Remove vertex i (stored in pq[]) and its priority (stored in keys[]).
* @param i the vertex id
*/
public void delete(int i) {
if (i < 0 || i >= NMAX) throw new IndexOutOfBoundsException();
if (!contains(i)) throw new NoSuchElementException("index is not in the priority queue");
int index = qp[i];
exch(index, N--);
swim(index);
sink(index);
keys[i] = null;
qp[i] = -1;
}
Both sink and swim are modified to use priority values stored in Key[] to change the order of priority queue:
private void swim(int k) {
while (k > 1 && greater(k/2, k) {
exch(k, k/2);
k = k/2;
}
}
private boolean greater(int i, int j) {
return keys[pq[i]].compareTo(keys[pq[j]]) > 0;
}
It is a priority queue sorted by Key (Key extends Comparable<Key>):
int delMin(); <-- delete the one which has lowest priority
It supports indexed insert, update and delete:
void insert(int i, Key key);
void increaseKey(int i, Key key);
void decreaseKey(int i, Key key);
void delete(int); <-- delete the one on position i regardless its priority
All operations are log V or less.
It is used by:
1. Eager Prim MST (minimal spanning tree)
2. Dijkstra's shortest path algorithm
Invariables to keep in mind when dealing with IndexMinPQ:
1. Priority is stored in Key[]
2. Vertex is stored in pq[] -- this is the real priority queue
3. Using qp[] to reverse lookup from vertex id to its entry in pq[], i.e. pq[qp[i]] == i.
We can use 'delete' to see their relationships:
/**
* Remove vertex i (stored in pq[]) and its priority (stored in keys[]).
* @param i the vertex id
*/
public void delete(int i) {
if (i < 0 || i >= NMAX) throw new IndexOutOfBoundsException();
if (!contains(i)) throw new NoSuchElementException("index is not in the priority queue");
int index = qp[i];
exch(index, N--);
swim(index);
sink(index);
keys[i] = null;
qp[i] = -1;
}
Both sink and swim are modified to use priority values stored in Key[] to change the order of priority queue:
private void swim(int k) {
while (k > 1 && greater(k/2, k) {
exch(k, k/2);
k = k/2;
}
}
private boolean greater(int i, int j) {
return keys[pq[i]].compareTo(keys[pq[j]]) > 0;
}
Subscribe to:
Posts (Atom)