Sunday, August 31, 2014

Write ORC file

First of all, there is no way to write an ORC file with hive-exec-0.12 or earlier versions, since OrcStruct is package-private in those versions.

For hive-exec-0.13 or later:

import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

// Load the cluster configuration.
Configuration conf = new Configuration();
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\core-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\hdfs-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\mapred-site.xml"));

// The inspector tells the writer how to interpret each row passed to addRow.
// Note: this reflects String.class while the rows below are OrcStruct instances;
// if the writer rejects the rows, use an inspector that matches your row type
// (see the POJO variant after this snippet).
ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class,
        ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

Writer writer = OrcFile.createWriter(new Path("/user/output_orcwriter"),
        OrcFile.writerOptions(conf).inspector(inspector).stripeSize(1048576 / 2).bufferSize(1048577)
                .version(OrcFile.Version.V_0_12));

// Even though OrcStruct is public, its constructor and setFieldValue method are not,
// so both have to be opened up with reflection.
Class<?> c = Class.forName("org.apache.hadoop.hive.ql.io.orc.OrcStruct");
Constructor<?> ctor = c.getDeclaredConstructor(int.class);
ctor.setAccessible(true);
Method setFieldValue = c.getDeclaredMethod("setFieldValue", int.class, Object.class);
setFieldValue.setAccessible(true);

// Write 5 rows of 8 string fields each.
for (int j = 0; j < 5; j++) {
    OrcStruct orcRow = (OrcStruct) ctor.newInstance(8);
    for (int i = 0; i < 8; i++)
        setFieldValue.invoke(orcRow, i, "AAA" + j + "BBB" + i);
    writer.addRow(orcRow);
}
writer.close();
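
If you would rather avoid the reflection hack, the reflection inspector can instead be pointed at your own row class, and plain Java objects passed to addRow; this is the same pattern Hive's own ORC unit tests use. A minimal sketch under that assumption (OrcPojoWriter and MyRow are made-up names; the writer options from the snippet above apply unchanged):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class OrcPojoWriter {

    // Hypothetical row class; each field becomes a column of the ORC struct.
    static class MyRow {
        String name;
        int age;
        MyRow(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Inspect MyRow instead of OrcStruct, so no private API is touched.
        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
                MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

        Writer writer = OrcFile.createWriter(new Path("/user/output_orcwriter_pojo"),
                OrcFile.writerOptions(conf).inspector(inspector));
        for (int j = 0; j < 5; j++)
            writer.addRow(new MyRow("AAA" + j, j));
        writer.close();
    }
}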

Thursday, August 28, 2014

Access Hive collection type data in map reduce for ORC files

import java.io.IOException;
import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper skeleton added for completeness; key/value and output types
// depend on your job setup.
public static class OrcFieldDumpMapper extends Mapper<Object, Writable, NullWritable, NullWritable> {

    // getFieldValue is private at this moment. Use reflection to access it.
    private Method getFieldValue;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            getFieldValue = OrcStruct.class.getDeclaredMethod("getFieldValue", int.class);
            getFieldValue.setAccessible(true);
        } catch (NoSuchMethodException e) {
            throw new IOException(e);
        }
    }

    @Override
    public void map(Object key, Writable value, Context context)
            throws IOException, InterruptedException {

        OrcStruct orcStruct = (OrcStruct) value;
        int numberOfFields = orcStruct.getNumFields();

        for (int i = 0; i < numberOfFields; i++) {

            Object field = null;
            try {
                field = getFieldValue.invoke(orcStruct, i);
            } catch (Exception e) {
                e.printStackTrace();
            }
            if (field == null) continue;

            // process Hive collection types: array, map or struct
            if (field instanceof List) {
                List<?> list = (List<?>) field;
                for (int j = 0; j < list.size(); j++)
                    System.out.println(list.get(j));
            }
            else if (field instanceof Map) {
                for (Map.Entry<?, ?> entry : ((Map<?, ?>) field).entrySet())
                    System.out.println("key=" + entry.getKey() + ",value=" + entry.getValue());
            }
            else if (field instanceof OrcStruct) {
                // a nested struct: dump each of its fields, again via reflection
                OrcStruct struct = (OrcStruct) field;
                int numberOfField = struct.getNumFields();
                for (int j = 0; j < numberOfField; j++) {
                    try {
                        System.out.println("field" + j + "=" + getFieldValue.invoke(struct, j));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
            else {
                System.out.println("Unknown type for field " + field);
            }
        }
    }
}
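
For completeness, here is roughly how such a mapper gets wired up. hive-exec-0.13 added org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat for the mapreduce API, which hands each row to the mapper as an OrcStruct value. A minimal driver sketch (OrcFieldDumpDriver and the input path are made up; it assumes the mapper above is visible as OrcFieldDumpMapper):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class OrcFieldDumpDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "orc field dump");
        job.setJarByClass(OrcFieldDumpDriver.class);

        // OrcNewInputFormat (mapreduce API, hive-exec-0.13+) delivers
        // NullWritable keys and OrcStruct values to the mapper.
        job.setInputFormatClass(OrcNewInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/orc_input"));

        job.setMapperClass(OrcFieldDumpMapper.class);  // the mapper above
        job.setNumReduceTasks(0);                      // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}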


Friday, August 15, 2014

Troubleshooting a version mismatch issue in a map reduce job

When writing a map reduce job, or any code that accesses the Hadoop file system, you have to include the Hadoop jar files for your Java code to compile. If the version of those jar files is not compatible with the jars on the Hadoop cluster, you will get a 500 error when running your Java code.

The 500 error code does not really tell you anything about what the problem is, which makes troubleshooting a little bit hard. To identify this kind of problem, we have to go to the cluster and check the logs.

For example, look into hadoop-hdfs-namenode-*.log and try to find the following error:

2014-08-14 14:21:42,386 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 192.168.72.1:15444 got version 7 expected version 4

Basically, this is telling us that the cluster expects RPC version 4, but the client is using version 7.

With this kind of information in hand, lower the version of the Hadoop jar files on the client side. Now my Hadoop code works perfectly fine with the cluster.
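
A quick way to see which version the client side is actually using: the Hadoop jars on the classpath expose their own version through org.apache.hadoop.util.VersionInfo. A minimal check (the class name ClientVersionCheck is made up) whose output you can compare against `hadoop version` on the cluster:

import org.apache.hadoop.util.VersionInfo;

public class ClientVersionCheck {
    public static void main(String[] args) {
        // Version of the Hadoop jars on the client classpath;
        // compare with the output of `hadoop version` on the cluster.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Built from revision: " + VersionInfo.getRevision());
    }
}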