Sunday, August 31, 2014

Write ORC file

First of all, ORC files cannot be written with hive-exec 0.12 or earlier, because OrcStruct is package-private in those versions.

For hive-exec-0.13 or later:

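import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

// Load the cluster configuration files.
Configuration conf = new Configuration();
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\core-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\hdfs-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\mapred-site.xml"));

ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class,
        ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

Writer writer = OrcFile.createWriter(new Path("/user/output_orcwriter"),
        OrcFile.writerOptions(conf)
                .inspector(inspector)
                .stripeSize(1048576 / 2)
                .bufferSize(1048577)
                .version(OrcFile.Version.V_0_12));

// Even though OrcStruct is public, its constructor and setFieldValue method are not,
// so both have to be made accessible through reflection.
Class<?> c = Class.forName("org.apache.hadoop.hive.ql.io.orc.OrcStruct");
Constructor<?> ctor = c.getDeclaredConstructor(int.class);
ctor.setAccessible(true);
Method setFieldValue = c.getDeclaredMethod("setFieldValue", int.class, Object.class);
setFieldValue.setAccessible(true);

// Write 5 rows of 8 string fields each.
for (int j = 0; j < 5; j++) {
    OrcStruct orcRow = (OrcStruct) ctor.newInstance(8);
    for (int i = 0; i < 8; i++) {
        setFieldValue.invoke(orcRow, i, "AAA" + j + "BBB" + i);
    }
    writer.addRow(orcRow);
}
writer.close();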
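
The reflection trick used above can be tried in isolation, without Hive on the classpath. The following is only a minimal sketch: Row is a hypothetical stand-in for OrcStruct (a public class whose constructor and setFieldValue method are not public), and ReflectionAccessSketch, buildRow, getFieldValue and getNumFields are names invented for this illustration, not part of any Hive API.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

public class ReflectionAccessSketch {

    // Hypothetical stand-in for OrcStruct: the class itself is public,
    // but its constructor and setter are not.
    public static final class Row {
        private final Object[] fields;

        private Row(int children) {
            this.fields = new Object[children];
        }

        private void setFieldValue(int index, Object value) {
            fields[index] = value;
        }

        public Object getFieldValue(int index) {
            return fields[index];
        }

        public int getNumFields() {
            return fields.length;
        }
    }

    // Builds one row through reflection, the same way the ORC example fills an OrcStruct.
    public static Row buildRow(int numFields, int rowIndex) throws Exception {
        Constructor<Row> ctor = Row.class.getDeclaredConstructor(int.class);
        ctor.setAccessible(true);
        Method setFieldValue =
                Row.class.getDeclaredMethod("setFieldValue", int.class, Object.class);
        setFieldValue.setAccessible(true);

        Row row = ctor.newInstance(numFields);
        for (int i = 0; i < numFields; i++) {
            setFieldValue.invoke(row, i, "AAA" + rowIndex + "BBB" + i);
        }
        return row;
    }

    public static void main(String[] args) throws Exception {
        Row row = buildRow(8, 0);
        System.out.println(row.getFieldValue(7)); // prints AAA0BBB7
    }
}
```

Looking up the constructor and method once and reusing them for every row, as the ORC example also does, avoids repeating the reflective lookup inside the loop.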

1 comment:

  1. Hi, I just developed a library that serializes Java objects to ORC files, driven by annotations. It is Apache 2.0 licensed and available from github.com/eclecticlogic/eclectic-orc
