Saturday, September 27, 2014

Magic of the array

1. Union find:

    public void union(int p, int q) {
        int rootP = find(p);
        int rootQ = find(q);
        if (rootP == rootQ) return;

        // make smaller root point to larger one
        if   (sz[rootP] < sz[rootQ]) { id[rootP] = rootQ; sz[rootQ] += sz[rootP]; }
        else                         { id[rootQ] = rootP; sz[rootP] += sz[rootQ]; }
        count--;
    }
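
The union method relies on a find helper that chases parent links up to the root. A minimal sketch, assuming id[] stores each site's parent as in weighted quick-union:

    // follow parent links until reaching a root (a site that points to itself)
    public int find(int p) {
        while (p != id[p])
            p = id[p];
        return p;
    }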

2. Priority queue implemented by binary heap:

    public Key delMin() {
        if (isEmpty()) throw new NoSuchElementException("Priority queue underflow");
        exch(1, N);
        Key min = pq[N--];
        sink(1);
        pq[N+1] = null;         // avoid loitering and help with garbage collection
        if ((N > 0) && (N == (pq.length - 1) / 4)) resize(pq.length / 2);
        assert isMinHeap();
        return min;
    }
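
delMin restores heap order with the sink helper; a minimal sketch for a min-heap stored in pq[1..N], assuming greater(i, j) compares pq[i] with pq[j] and exch swaps them:

    // push pq[k] down until neither child is smaller than it
    private void sink(int k) {
        while (2*k <= N) {
            int j = 2*k;
            if (j < N && greater(j, j+1)) j++;   // pick the smaller child
            if (!greater(k, j)) break;
            exch(k, j);
            k = j;
        }
    }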

3. Heap sort:

    public static void sort(Comparable[] pq) {
        int N = pq.length;
        for (int k = N/2; k >= 1; k--)
            sink(pq, k, N);
        while (N > 1) {
            exch(pq, 1, N--);
            sink(pq, 1, N);
        }
    }
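
Note that sort indexes the heap from 1 while Java arrays start at 0. One way to reconcile the two, as in Sedgewick's algs4 Heap.java, is to shift indices inside the static helpers; a sketch:

    // sink pq[k] within the first n items, using 1-based heap indices
    private static void sink(Comparable[] pq, int k, int n) {
        while (2*k <= n) {
            int j = 2*k;
            if (j < n && less(pq, j, j+1)) j++;   // pick the larger child (max-heap)
            if (!less(pq, k, j)) break;
            exch(pq, k, j);
            k = j;
        }
    }

    // less and exch translate 1-based heap indices to 0-based array indices
    private static boolean less(Comparable[] pq, int i, int j) {
        return pq[i-1].compareTo(pq[j-1]) < 0;
    }

    private static void exch(Object[] pq, int i, int j) {
        Object swap = pq[i-1];
        pq[i-1] = pq[j-1];
        pq[j-1] = swap;
    }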

4. Randomized queue:

    public Item dequeue() {
        if (size <= 0) {
            throw new NoSuchElementException("Tried to dequeue an empty queue.");
        }

        int index = StdRandom.uniform(size);
        Item item = items[index];
        items[index] = items[--size];   // fill the gap with the last item
        items[size] = null;             // avoid loitering

        if (size > ORIG_CAPACITY && size < capacity / 4) {
            resize(capacity / 2);
        }

        return item;
    }
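
dequeue calls a resize helper that is not shown above; a minimal sketch, assuming items, size and capacity are instance fields (all three names are taken from the snippet):

    // copy the live items into a new backing array of the given capacity
    private void resize(int newCapacity) {
        Item[] copy = (Item[]) new Object[newCapacity];
        for (int i = 0; i < size; i++)
            copy[i] = items[i];
        items = copy;
        capacity = newCapacity;
    }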

5. 8-puzzle:

The n-puzzle is a classic problem for modelling algorithms that use heuristics. Commonly used heuristics for this problem include counting the number of misplaced tiles (the Hamming distance) and summing the taxicab (or Manhattan) distances between each block and its position in the goal configuration.

    // return sum of Manhattan distances between blocks and goal
    public int manhattan() {
        int manhattan = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int current = blocks[i][j];
                if (current == 0) continue;  // skip the blank square
                if (current != (i * N + j + 1)) {
                    manhattan += Math.abs((current - 1) / N - i) + Math.abs((current - 1) % N - j);
                }
            }
        }
        return manhattan;
    }
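
The other heuristic mentioned above, the number of misplaced tiles (the Hamming distance), can be sketched in the same style, assuming the same blocks array and board size N:

    // return the number of blocks out of place, skipping the blank square
    public int hamming() {
        int hamming = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (blocks[i][j] != 0 && blocks[i][j] != i * N + j + 1)
                    hamming++;
        return hamming;
    }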

Sunday, August 31, 2014

Write ORC file

First of all, there is no way to write an ORC file with hive-exec-0.12 or earlier, since OrcStruct is package-private in those versions.

For hive-exec-0.13 or later:

import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

Configuration conf = new Configuration();
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\core-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\hdfs-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\mapred-site.xml"));

ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class,
        ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

Writer writer = OrcFile.createWriter(new Path("/user/output_orcwriter"),
        OrcFile.writerOptions(conf).inspector(inspector).stripeSize(1048576 / 2).bufferSize(1048577)
                .version(OrcFile.Version.V_0_12));

// Even though OrcStruct is public, its constructor and setFieldValue method are not.
// Use reflection to reach them.
Class<?> c = Class.forName("org.apache.hadoop.hive.ql.io.orc.OrcStruct");
Constructor<?> ctor = c.getDeclaredConstructor(int.class);
ctor.setAccessible(true);
Method setFieldValue = c.getDeclaredMethod("setFieldValue", int.class, Object.class);
setFieldValue.setAccessible(true);

// write 5 rows of 8 string fields each
for (int j = 0; j < 5; j++) {
    OrcStruct orcRow = (OrcStruct) ctor.newInstance(8);
    for (int i = 0; i < 8; i++)
        setFieldValue.invoke(orcRow, i, "AAA" + j + "BBB" + i);
    writer.addRow(orcRow);
}
writer.close();
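
A rough way to verify the output is to read the rows back. This sketch assumes the hive-exec-0.13 reader API (OrcFile.createReader(fs, path) and Reader.rows(null)); check the exact signatures against your version:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;

// read the rows back and print them, reusing conf from above
FileSystem fs = FileSystem.get(conf);
Reader reader = OrcFile.createReader(fs, new Path("/user/output_orcwriter"));
RecordReader rows = reader.rows(null);   // null include array = read all columns
Object row = null;
while (rows.hasNext()) {
    row = rows.next(row);
    System.out.println(row);             // OrcStruct prints its fields
}
rows.close();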

Thursday, August 28, 2014

Access Hive collection type data in map reduce for ORC file

// getFieldValue is private at this moment, so look it up once via reflection
// (the same trick used for setFieldValue in the writer post above).
private static Method getFieldValue;
static {
    try {
        getFieldValue = OrcStruct.class.getDeclaredMethod("getFieldValue", int.class);
        getFieldValue.setAccessible(true);
    } catch (NoSuchMethodException e) {
        throw new RuntimeException(e);
    }
}

public void map(Object key, Writable value,
        Context context) throws IOException, InterruptedException {

    OrcStruct orcStruct = (OrcStruct) value;
    int numberOfFields = orcStruct.getNumFields();

    for (int i = 0; i < numberOfFields; i++) {

        Object field = null;
        try {
            field = getFieldValue.invoke(orcStruct, i);
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (field == null) continue;

        // process the Hive collection types: array, map or struct
        if (field instanceof List) {
            List list = (List) field;
            for (int j = 0; j < list.size(); j++)
                System.out.println(list.get(j));
        }
        else if (field instanceof Map) {
            Map map = (Map) field;
            for (Iterator entries = map.entrySet().iterator(); entries.hasNext();) {
                Map.Entry entry = (Map.Entry) entries.next();
                System.out.println("key=" + entry.getKey() + ",value=" + entry.getValue());
            }
        }
        else if (field instanceof OrcStruct) {
            OrcStruct struct = (OrcStruct) field;
            int numberOfField = struct.getNumFields();
            for (int j = 0; j < numberOfField; j++) {
                try {
                    System.out.println("field" + j + "=" + getFieldValue.invoke(struct, j));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        else {
            System.out.println("Unknown type for field " + field);
        }
    }
}
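
For the mapper to receive OrcStruct values at all, the job must use the ORC input format for the new mapreduce API. A minimal driver sketch, assuming Hadoop 2 with hive-exec-0.13 on the classpath; MyOrcJob, MyOrcMapper and the paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "read-orc");
job.setJarByClass(MyOrcJob.class);                 // placeholder driver class
job.setMapperClass(MyOrcMapper.class);             // the mapper shown above
job.setInputFormatClass(OrcNewInputFormat.class);  // keys NullWritable, values OrcStruct
job.setNumReduceTasks(0);                          // map-only job
FileInputFormat.addInputPath(job, new Path("/user/orc_input"));         // placeholder
FileOutputFormat.setOutputPath(job, new Path("/user/orc_read_output")); // placeholder
System.exit(job.waitForCompletion(true) ? 0 : 1);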


Friday, August 15, 2014

Troubleshooting version mismatch issue of map reduce job

When writing a map reduce job, or any code that accesses the Hadoop file system, you have to include the Hadoop jar files for your Java code to compile. If the version of those jar files is not compatible with the version running on the Hadoop cluster, you will get a 500 error when running your code.

The 500 error code does not really tell you what the problem is, which makes troubleshooting a little bit hard. To identify this kind of problem, check the logs on the cluster.

For example, look into hadoop-hdfs-namenode-*.log and try to find an error like the following:

2014-08-14 14:21:42,386 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 192.168.72.1:15444 got version 7 expected version 4

Basically, this is telling us that the cluster expects version 4, but the client is using version 7.

With this information, lower the version of your Hadoop jar files to match the cluster. After doing so, my Hadoop code works perfectly fine with the cluster.

Monday, July 21, 2014

Assigning a static IP to a VMware Workstation VM

http://bytealmanac.wordpress.com/2012/07/02/assigning-a-static-ip-to-a-vmware-workstation-vm/

Assumption: the VM is running a DHCP client and is assigned a dynamic IP by the DHCP service running on the host. My host machine runs Windows 7 Ultimate x64 with VMware Workstation 8.
Open C:\ProgramData\VMware\vmnetdhcp.conf as Administrator. This file follows the syntax of dhcpd.conf. Add the following lines under the correct section (for me it was a NAT-based network, VMnet8), changing the host name, MAC, and IP appropriately. The MAC can be found in the VM's properties.
host ubuntu {
    hardware ethernet 00:0C:29:16:2A:D6;
    fixed-address 192.168.84.132;
}
Restart the VMware DHCP service. Use the following commands from an elevated prompt:
net stop vmnetdhcp
net start vmnetdhcp
On the VM, acquire a new lease using the commands below (if the VM runs Linux):

ifconfig eth0 down
ifconfig eth0 up

Thursday, July 17, 2014

Cloudera CDH5 source code download

https://repository.cloudera.com/artifactory/public/org/apache/hadoop/hadoop-core/

Wednesday, July 2, 2014

zookeeper-env.sh issue when setting up HBase

Followed the CDH 4.2.2 installation guide to set up HBase.

Ran "service zookeeper-server start"

No error from the command line. But zookeeper.log says "nohup: failed to run command ‘java’: No such file or directory".

Interesting! JAVA_HOME looks right when checked with "echo $JAVA_HOME".

After about an hour of troubleshooting and script checking, it turned out that:

zookeeper-env.sh is needed under the /etc/zookeeper/conf directory (as for other Hadoop components), but for some reason the ZooKeeper installer did not create this file by default.

I had to manually create the file and put the following line into it:

export JAVA_HOME=/opt/jdk1.6.0_45/

After that, starting ZooKeeper works. I can see it from 'jps':

[root@centos conf]# jps
2732 TaskTracker
4964 Jps
4776 QuorumPeerMain
3133 NameNode
2548 JobTracker
2922 DataNode