Your project can be on or off. Your project's priority can change. Your job can change. But the technology always heads north!
Monday, December 15, 2014
Simple Java Book
http://www.programcreek.com/wp-content/uploads/2013/01/SimpleJava1.pdf
Saturday, November 8, 2014
IndexMinPQ
An indexed minimum priority queue combines the elegance of a priority queue with the direct access of an array.
It is a priority queue sorted by Key (Key extends Comparable<Key>):
int delMin(); <-- delete the entry with the lowest priority and return its index
It also supports indexed insert, update, and delete:
void insert(int i, Key key);
void increaseKey(int i, Key key);
void decreaseKey(int i, Key key);
void delete(int i); <-- delete the entry at index i regardless of its priority
All operations take O(log V) time or less.
It is used by:
1. The eager version of Prim's MST (minimum spanning tree) algorithm
2. Dijkstra's shortest path algorithm
Invariants to keep in mind when working with IndexMinPQ:
1. Priorities are stored in keys[]
2. Vertex indices are stored in pq[] -- this is the actual heap
3. qp[] is the reverse lookup from a vertex id to its position in pq[], i.e. pq[qp[i]] == i.
The 'delete' method shows how the three arrays work together:
/**
* Remove vertex i (stored in pq[]) and its priority (stored in keys[]).
* @param i the vertex id
*/
public void delete(int i) {
    if (i < 0 || i >= NMAX) throw new IndexOutOfBoundsException();
    if (!contains(i)) throw new NoSuchElementException("index is not in the priority queue");
    int index = qp[i];
    exch(index, N--);
    swim(index);
    sink(index);
    keys[i] = null;
    qp[i] = -1;
}
Both sink and swim are modified to compare the priority values stored in keys[] when reordering the heap:
private void swim(int k) {
    while (k > 1 && greater(k/2, k)) {
        exch(k, k/2);
        k = k/2;
    }
}
private boolean greater(int i, int j) {
    return keys[pq[i]].compareTo(keys[pq[j]]) > 0;
}
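For completeness, here is a sketch of the matching sink method, written in the same style; it assumes the same pq[], qp[], keys[] fields and an exch helper that keeps qp[] in sync:
private void sink(int k) {
    while (2*k <= N) {
        int j = 2*k;                        // left child
        if (j < N && greater(j, j+1)) j++;  // pick the smaller child
        if (!greater(k, j)) break;          // heap order restored
        exch(k, j);
        k = j;
    }
}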
Monday, October 20, 2014
Yet the best explanation for Kerberos
http://windowsitpro.com/security/kerberos-active-directory
Saturday, October 18, 2014
Find a duplicate in an array
- Given an array of N elements in which each element is an integer between 1 and N-1, write an algorithm to find any duplicate. Your algorithm should run in linear time, use O(n) extra space, and may not modify the original array.
- Given an array of N elements in which each element is an integer between 1 and N-1 with one element duplicated, write an algorithm to find such duplicate. Your algorithm should run in linear time, use O(1) extra space, and may not modify the original array.
- Given an array of N elements in which each element is an integer between 1 and N-1, write an algorithm to find any duplicate. Your algorithm should run in linear time and use O(1) extra space, and may modify the original array.
- Given an array of N elements in which each element is an integer between 1 and N-1, write an algorithm to find any duplicate. Your algorithm should run in linear time, use O(1) extra space, and may not modify the original array.
More info: http://aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html
Solutions:
Solution 1 (O(n) extra space):
let s ← array [1..n]
initialize s to all zeroes
for 1 <= i <= n:
    if s[A[i]] > 0: return A[i]
    set s[A[i]] ← 1
Solution 2 (one duplicate, O(1) extra space, using the sum):
let s ← 0
for 1 <= i <= n: s ← s + A[i]
return s - n*(n-1)/2
The solution above for question #2 is not the best, since the sum may overflow an integer. We can do better with XOR:
temp = 0;
for (i = 0; i < n; i++)
    temp = temp ^ A[i] ^ i;
return temp;
Solution 3 (O(1) extra space, may modify the array):
for 1 <= i <= n:
    while A[i] ≠ i:
        if A[A[i]] = A[i]: return A[i]
        swap(A[A[i]], A[i])
Solution 4 (cycle detection: O(1) extra space, array not modified):
let i ← n, j ← n
do: i ← A[i], j ← A[A[j]]; until i = j
set j ← n
do: i ← A[i], j ← A[j]; until i = j
return i
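As a concrete sketch, here is the cycle detection solution (#4) in Java, assuming a 0-indexed array a of length n whose values are all between 1 and n-1; index 0 is never pointed to, so it plays the role of index n in the pseudocode above:
// Treat the array as a function i -> a[i]; the entry point of the cycle is the duplicated value.
// O(n) time, O(1) space, array not modified.
public static int findDuplicate(int[] a) {
    int slow = 0, fast = 0;
    do {                       // phase 1: find a meeting point inside the cycle
        slow = a[slow];
        fast = a[a[fast]];
    } while (slow != fast);
    slow = 0;                  // phase 2: restart one pointer, advance both one step at a time
    while (slow != fast) {
        slow = a[slow];
        fast = a[fast];
    }
    return slow;               // the cycle entry, i.e. the duplicated value
}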
Wednesday, October 15, 2014
Add Kerberos to Hadoop
Environment: CentOS 6.5; HW 1.3;
Install Kerberos:
yum install krb5-server krb5-libs krb5-auth-dialog krb5-workstation
Initialize the KDC:
kdb5_util create -s
Edit the Access Control List (/var/kerberos/krb5kdc/kadm5.acl in RHEL or CentOS and /var/lib/kerberos/krb5kdc/kadm5.acl in SLES ) to define the principals that have admin (modifying) access to the database. A simple example would be a single entry:
*/admin@EXAMPLE.COM *
Create the first admin user principal (as the root user):
[root@centos johnz]# kadmin.local -q "addprinc adminuser/admin"
Authenticating as principal root/admin@EXAMPLE.COM with password.
WARNING: no policy specified for adminuser/admin@EXAMPLE.COM; defaulting to no policy
Enter password for principal "adminuser/admin@EXAMPLE.COM":
Re-enter password for principal "adminuser/admin@EXAMPLE.COM":
Principal "adminuser/admin@EXAMPLE.COM" created.
Start Kerberos:
[root@centos johnz]# service krb5kdc start
Starting Kerberos 5 KDC: [ OK ]
[root@centos johnz]# service kadmin start
Starting Kerberos 5 Admin Server: [ OK ]
Create a directory to store keytabs:
[root@centos johnz]# mkdir -p /etc/security/keytabs/
[root@centos johnz]# chown root:hadoop /etc/security/keytabs/
[root@centos johnz]# chmod 750 /etc/security/keytabs/
Add service principals for Hadoop:
[root@centos johnz]# kadmin -p adminuser/admin
Authenticating as principal adminuser/admin with password.
Password for adminuser/admin@EXAMPLE.COM:
kadmin: addprinc -randkey nn/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey dn/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey HTTP/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey jt/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey tt/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey hbase/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey zookeeper/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey hcat/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey oozie/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey hdfs/centos.hw13@EXAMPLE.COM
kadmin: addprinc -randkey hive/centos.hw13@EXAMPLE.COM
Export these principals to keytab files:
kadmin: xst -k /etc/security/keytabs/spnego.service.keytab HTTP/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/nn.service.keytab nn/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/dn.service.keytab dn/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/jt.service.keytab jt/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/tt.service.keytab tt/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/hive.service.keytab hive/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/oozie.service.keytab oozie/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/hbase.service.keytab hbase/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/zk.service.keytab zookeeper/centos.hw13@EXAMPLE.COM
kadmin: xst -k /etc/security/keytabs/hdfs.headless.keytab hdfs/centos.hw13@EXAMPLE.COM
Configure Hadoop:
1. Add the following to core-site.xml:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.security.auth_to_local</name>
<value>RULE:[2:$1@$0](jt@.*EXAMPLE.COM)s/.*/mapred/
RULE:[2:$1@$0](tt@.*EXAMPLE.COM)s/.*/mapred/
RULE:[2:$1@$0](nn@.*EXAMPLE.COM)s/.*/hdfs/
RULE:[2:$1@$0](dn@.*EXAMPLE.COM)s/.*/hdfs/
RULE:[2:$1@$0](hbase@.*EXAMPLE.COM)s/.*/hbase/
RULE:[2:$1@$0](oozie@.*EXAMPLE.COM)s/.*/oozie/
DEFAULT</value>
</property>
2. Add the following to hdfs-site.xml:
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/centos.hw13@EXAMPLE.COM</value>
<description> Kerberos principal name for the
NameNode </description>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.principal</name>
<value>nn/centos.hw13@EXAMPLE.COM</value>
<description>Kerberos principal name for the secondary NameNode.
</description>
</property>
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/centos.hw13@EXAMPLE.COM</value>
<description> The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
The HTTP Kerberos principal MUST start with 'HTTP/' per Kerberos HTTP SPNEGO specification.
</description>
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@EXAMPLE.COM</value>
<description>The Kerberos principal that the DataNode runs as. "_HOST" is replaced by the real host name.
</description>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
<description>The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP
endpoint.
</description>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
<description>Combined keytab file containing the NameNode service and host principals.
</description>
</property>
<property>
<name>dfs.secondary.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
<description>Combined keytab file containing the NameNode service and host principals.
</description>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
<description>The filename of the keytab file for the DataNode.
</description>
</property>
3. Add the following to mapred-site.xml:
<property>
<name>mapreduce.jobtracker.kerberos.principal</name>
<value>jt/centos.hw13@EXAMPLE.COM</value>
<description>Kerberos principal name for the JobTracker </description>
</property>
<property>
<name>mapreduce.tasktracker.kerberos.principal</name>
<value>tt/centos.hw13@EXAMPLE.COM</value>
<description>Kerberos principal name for the TaskTracker. "_HOST" is replaced by the host name of the TaskTracker.
</description>
</property>
<property>
<name>mapreduce.jobtracker.keytab.file</name>
<value>/etc/security/keytabs/jt.service.keytab</value>
<description>The keytab for the JobTracker principal.
</description>
</property>
<property>
<name>mapreduce.tasktracker.keytab.file</name>
<value>/etc/security/keytabs/tt.service.keytab</value>
<description>The filename of the keytab for the TaskTracker</description>
</property>
<property>
<name>mapreduce.jobhistory.kerberos.principal</name>
<!--cluster variant -->
<value>jt/centos.hw13@EXAMPLE.COM</value>
<description> Kerberos principal name for JobHistory. This must map to the same user as the JobTracker user (mapred).
</description>
</property>
<property>
<name>mapreduce.jobhistory.keytab.file</name>
<!--cluster variant -->
<value>/etc/security/keytabs/jt.service.keytab</value>
<description>The keytab for the JobHistory principal.
</description>
</property>
4. Important: you have to replace the existing local_policy.jar and US_export_policy.jar files (in the $JAVA_HOME/jre/lib/security/ directory) with the JCE unlimited strength policy files from the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html
5. Very important: set the following before starting the DataNode:
export HADOOP_SECURE_DN_USER=hdfs
6. Very important: check the permissions of the keytab files. They should be:
[root@centos ~]# ls -l /etc/security/keytabs/
total 40
-rw-------. 1 hdfs hadoop 400 Oct 13 14:56 dn.service.keytab
-rw-------. 1 hdfs hadoop 412 Oct 13 14:59 hdfs.headless.keytab
-rw-------. 1 mapred hadoop 400 Oct 13 14:56 jt.service.keytab
-rw-------. 1 hdfs hadoop 400 Oct 13 14:56 nn.service.keytab
-rw-------. 1 mapred hadoop 400 Oct 13 14:56 tt.service.keytab
7. Important: all services have to be started as the root user.
[root@centos security]# /usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode
[root@centos security]# /usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode
[root@centos security]# /usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start secondarynamenode
[root@centos security]# /usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start jobtracker
[root@centos security]# /usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start tasktracker
Access Hadoop:
[root@centos ~]# kinit -k -t /etc/security/keytabs/hdfs.headless.keytab -p hdfs/centos.hw13@EXAMPLE.COM
[root@centos ~]# hadoop fs -ls /user/hive
Found 1 items
drwxr-xr-x - hdfs supergroup 0 2014-10-03 17:13 /user/hive/warehouse
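As a side note, client code can also authenticate from a keytab programmatically instead of running kinit first. A minimal sketch, assuming the cluster configuration files (core-site.xml, hdfs-site.xml) are on the classpath and using the headless keytab created above; the class name is my own:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // log in from the headless keytab created earlier instead of running kinit
        UserGroupInformation.loginUserFromKeytab(
                "hdfs/centos.hw13@EXAMPLE.COM",
                "/etc/security/keytabs/hdfs.headless.keytab");
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/user/hive"))) {
            System.out.println(status.getPath());
        }
    }
}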
Sunday, October 5, 2014
Insert a node into a red-black tree: the LLRB algorithm does the best job!
private Node put(Node h, Key key, Value val) {
    if (h == null) return new Node(key, val, RED, 1);
    int cmp = key.compareTo(h.key);
    if (cmp < 0) h.left = put(h.left, key, val);
    else if (cmp > 0) h.right = put(h.right, key, val);
    else h.val = val;
    // fix up any right-leaning links
    if (isRed(h.right) && !isRed(h.left)) h = rotateLeft(h);
    if (isRed(h.left) && isRed(h.left.left)) h = rotateRight(h);
    if (isRed(h.left) && isRed(h.right)) flipColors(h);
    h.N = size(h.left) + size(h.right) + 1;
    return h;
}
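The put method above relies on three helpers. For reference, here are sketches of the standard LLRB versions, assuming Node has left, right, color and N fields and RED/BLACK boolean constants:
private Node rotateLeft(Node h) {
    Node x = h.right;        // a right-leaning red link becomes left-leaning
    h.right = x.left;
    x.left = h;
    x.color = h.color;
    h.color = RED;
    x.N = h.N;
    h.N = size(h.left) + size(h.right) + 1;
    return x;
}

private Node rotateRight(Node h) {
    Node x = h.left;         // used when two red links in a row lean left
    h.left = x.right;
    x.right = h;
    x.color = h.color;
    h.color = RED;
    x.N = h.N;
    h.N = size(h.left) + size(h.right) + 1;
    return x;
}

private void flipColors(Node h) {
    h.color = RED;           // split a temporary 4-node: push the red link up
    h.left.color = BLACK;
    h.right.color = BLACK;
}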
Saturday, September 27, 2014
Magic of the array
1. Union find:
public void union(int p, int q) {
    int rootP = find(p);
    int rootQ = find(q);
    if (rootP == rootQ) return;
    // make smaller root point to larger one
    if (sz[rootP] < sz[rootQ]) { id[rootP] = rootQ; sz[rootQ] += sz[rootP]; }
    else { id[rootQ] = rootP; sz[rootP] += sz[rootQ]; }
    count--;
}
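The union method above depends on find. A sketch of find over the same id[] array, with an optional one line of path compression to keep the trees nearly flat:
public int find(int p) {
    while (p != id[p]) {
        id[p] = id[id[p]];   // path compression by halving: point p to its grandparent
        p = id[p];
    }
    return p;
}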
2. Priority queue implemented by binary heap:
public Key delMin() {
    if (isEmpty()) throw new NoSuchElementException("Priority queue underflow");
    exch(1, N);
    Key min = pq[N--];
    sink(1);
    pq[N+1] = null; // avoid loitering and help with garbage collection
    if ((N > 0) && (N == (pq.length - 1) / 4)) resize(pq.length / 2);
    assert isMinHeap();
    return min;
}
3. Heap sort:
public static void sort(Comparable[] pq) {
    int N = pq.length;
    for (int k = N/2; k >= 1; k--)
        sink(pq, k, N);
    while (N > 1) {
        exch(pq, 1, N--);
        sink(pq, 1, N);
    }
}
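The sort above works with 1-based heap positions over a 0-based array. One possible set of helpers (less and exch are my own wrappers that subtract 1 when touching the array):
private static void sink(Comparable[] pq, int k, int N) {
    while (2*k <= N) {
        int j = 2*k;
        if (j < N && less(pq, j, j+1)) j++;
        if (!less(pq, k, j)) break;
        exch(pq, k, j);
        k = j;
    }
}

private static boolean less(Comparable[] pq, int i, int j) {
    return pq[i-1].compareTo(pq[j-1]) < 0;   // shift 1-based heap position to 0-based array index
}

private static void exch(Object[] pq, int i, int j) {
    Object tmp = pq[i-1]; pq[i-1] = pq[j-1]; pq[j-1] = tmp;
}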
4. Randomized queue:
public Item dequeue() {
    if (size <= 0) {
        throw new NoSuchElementException("Tried to dequeue an empty queue.");
    }
    int index = StdRandom.uniform(size);
    Item item = items[index];
    items[index] = items[--size];
    items[size] = null; // avoid loitering: null out the slot the last item was moved from
    if (size > ORIG_CAPACITY && size < capacity / 4) {
        resize(capacity / 2);
    }
    return item;
}
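The dequeue above assumes a resize helper that copies the live items into a new backing array. A minimal sketch using the same items, size and capacity fields:
@SuppressWarnings("unchecked")
private void resize(int newCapacity) {
    Item[] copy = (Item[]) new Object[newCapacity]; // generic array creation workaround
    for (int i = 0; i < size; i++) {
        copy[i] = items[i];
    }
    items = copy;
    capacity = newCapacity;
}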
5. 8-puzzle:
The n-puzzle is a classical problem for modelling algorithms involving heuristics. Commonly used heuristics for this problem include counting the number of misplaced tiles and finding the sum of the taxicab (or Manhattan) distances between each block and its position in the goal configuration.
// return sum of Manhattan distances between blocks and goal
public int manhattan() {
    int manhattan = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int current = blocks[i][j];
            if (current == 0) continue; // skip blank square
            if (current != (i * N + j + 1)) {
                manhattan += Math.abs((current - 1) / N - i) + Math.abs((current - 1) % N - j);
            }
        }
    }
    return manhattan;
}
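The other heuristic mentioned above, the number of misplaced tiles (Hamming distance), can be sketched in the same style:
// return number of blocks out of place (the blank square is not counted)
public int hamming() {
    int hamming = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int current = blocks[i][j];
            if (current == 0) continue;              // skip blank square
            if (current != (i * N + j + 1)) hamming++;
        }
    }
    return hamming;
}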
Sunday, August 31, 2014
Write ORC file
First of all, there is no way to write an ORC file with hive-exec-0.12 or earlier, since OrcStruct is package-private in those versions.
For hive-exec-0.13 or later:
Configuration conf = new Configuration();
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\core-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\hdfs-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf_cdh5\\mapred-site.xml"));
ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class,
ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
Writer writer = OrcFile.createWriter(new Path("/user/output_orcwriter"),
OrcFile.writerOptions(conf).inspector(inspector).stripeSize(1048576/2).bufferSize(1048577)
.version(OrcFile.Version.V_0_12));
// even though OrcStruct is public, its constructor and setFieldValue method are not
Class<?> c = Class.forName("org.apache.hadoop.hive.ql.io.orc.OrcStruct");
Constructor<?> ctor = c.getDeclaredConstructor(int.class);
ctor.setAccessible(true);
Method setFieldValue = c.getDeclaredMethod("setFieldValue", int.class, Object.class);
setFieldValue.setAccessible(true);
for (int j=0; j<5; j++) {
OrcStruct orcRow = (OrcStruct) ctor.newInstance(8);
for (int i=0; i<8; i++)
setFieldValue.invoke(orcRow, i, "AAA"+j+"BBB"+i);
writer.addRow(orcRow);
}
writer.close();
Thursday, August 28, 2014
Access Hive collection type data in map reduce for ORC files
public void map(Object key, Writable value,
Context context) throws IOException, InterruptedException {
OrcStruct orcStruct = (OrcStruct)value;
int numberOfFields = orcStruct.getNumFields();
for (int i=0; i<numberOfFields; i++) {
// getFieldValue is private at this moment. Use reflection to access it.
Object field = null;
try {
field = getFieldValue.invoke(orcStruct, i);
} catch (Exception e) {
e.printStackTrace();
}
if (field==null) continue;
// process Hive collection type array, struct or map
if (field instanceof List) {
List list = (List)field;
for (int j=0; j<list.size(); j++)
System.out.println(list.get(j));
}
else if (field instanceof Map) {
Map map = (Map)field;
for (Iterator entries = map.entrySet().iterator(); entries.hasNext();) {
Map.Entry entry = (Entry) entries.next();
System.out.println("key="+entry.getKey()+",value="+entry.getValue());
}
}
else if (field instanceof OrcStruct) {
OrcStruct struct = (OrcStruct)field;
int numberOfField = struct.getNumFields();
for (int j=0; j<numberOfField; j++) {
try {
System.out.println("field"+j+"="+getFieldValue.invoke(struct, j));
} catch (Exception e) {
e.printStackTrace();
}
}
}
else {
System.out.println("Unknown type for field"+ field);
}
}
}
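The getFieldValue handle used above is not shown; since OrcStruct.getFieldValue(int) is not public in this hive-exec version, it can be obtained once via reflection, for example in the mapper's setup method. A sketch (the field name and error handling are my own):
private Method getFieldValue;  // java.lang.reflect.Method

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    try {
        // getFieldValue(int) is not public, so look it up reflectively once per mapper
        getFieldValue = OrcStruct.class.getDeclaredMethod("getFieldValue", int.class);
        getFieldValue.setAccessible(true);
    } catch (NoSuchMethodException e) {
        throw new IOException("OrcStruct.getFieldValue(int) not found", e);
    }
}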
Friday, August 15, 2014
Troubleshooting version mismatch issue of map reduce job
When writing a map reduce job or code that accesses the Hadoop file system, you have to include Hadoop jar files for your Java code to compile. If the version of those jar files is not compatible with the jars used by the Hadoop cluster, you will get a 500 error when running your code.
The 500 error code does not really tell you what the problem is, which makes troubleshooting a bit hard. To identify this kind of problem, we have to check the logs on the cluster.
For example, look into hadoop-hdfs-namenode-*.log and try to find the following error:
2014-08-14 14:21:42,386 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 192.168.72.1:15444 got version 7 expected version 4
Basically, this is telling us that the cluster expects version 4, but the client is using version 7.
With this information, I lowered the version of the Hadoop jar files, and now my Hadoop code works perfectly fine with the cluster.
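A quick sanity check on the client side is to print which Hadoop release the client jars were built from (the "version 7 vs 4" numbers in the log are RPC protocol versions, not release numbers, but a mismatched release is usually the culprit). A small sketch; the class name is my own:
import org.apache.hadoop.util.VersionInfo;

public class ClientVersionCheck {
    public static void main(String[] args) {
        // prints the Hadoop release the client-side jars were built from
        System.out.println("Client Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Built from revision: " + VersionInfo.getRevision());
    }
}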
Monday, July 21, 2014
Assigning a static IP to a VMware Workstation VM
http://bytealmanac.wordpress.com/2012/07/02/assigning-a-static-ip-to-a-vmware-workstation-vm/
Assumption: the VM is running a DHCP client and is assigned a dynamic IP by the DHCP service running on the host. My host machine runs Windows 7 Ultimate x64 with VMware Workstation 8.
Open C:\ProgramData\VMware\vmnetdhcp.conf as Administrator. This file follows the syntax of dhcpd.conf. Add the following lines (change the host name, MAC, and IP appropriately) under the correct section (for me it was the NAT based network, VMnet8). The MAC can be found from the VM’s properties.
host ubuntu {
hardware ethernet 00:0C:29:16:2A:D6;
fixed-address 192.168.84.132;
}
Restart the VMware DHCP service. Use the following commands from an elevated prompt:
net stop vmnetdhcp
net start vmnetdhcp
On the VM, acquire a new lease using the commands below (if the VM runs Linux):
ifconfig eth0 down
ifconfig eth0 up
Thursday, July 17, 2014
Cloudera CDH5 source code download
https://repository.cloudera.com/artifactory/public/org/apache/hadoop/hadoop-core/
Wednesday, July 2, 2014
zookeeper-env.sh issue when setting up HBase
I followed the CDH 4.2.2 installation guide to set up HBase.
I ran "service zookeeper-server start".
There was no error from the command line, but zookeeper.log said "nohup: failed to run command ‘java’: No such file or directory".
Interesting! JAVA_HOME was correct, as verified with "echo $JAVA_HOME".
After about an hour of troubleshooting and script checking, it turned out that:
zookeeper-env.sh is needed under the /etc/zookeeper/conf directory (as for other Hadoop components), but for some reason the ZooKeeper installer does not create this file by default.
I had to manually create the file and put the following line into it:
export JAVA_HOME=/opt/jdk1.6.0_45/
After that, starting ZooKeeper works. I can see it in 'jps':
[root@centos conf]# jps
2732 TaskTracker
4964 Jps
4776 QuorumPeerMain
3133 NameNode
2548 JobTracker
2922 DataNode
Ran "service zookeeper-server start"
No error from command line. But zookeeper.log says "nohup: failed to run command ‘java’: No such file or directory".
Interesting! Java home is right by doing "echo $JAVA_HOME".
After about one hour troubleshooting and script checking, it turned out:
zookeeper-env.sh is needed under /etc/zookeeper/conf directory (as other Hadoop component). But for somehow, zookeeper installer did not have such file by default.
I have to manually create such file and put following line into it:
export JAVA_HOME=/opt/jdk1.6.0_45/
After that, starting zookeeper works. I can see it from 'jps':
[root@centos conf]# jps
2732 TaskTracker
4964 Jps
4776 QuorumPeerMain
3133 NameNode
2548 JobTracker
2922 DataNode
Wednesday, June 25, 2014
Eclipse debug step into, step over, etc. disappeared. How to bring them back?
http://stackoverflow.com/questions/12912896/eclipse-buttons-like-step-in-step-out-resume-etc-not-working
Monday, June 9, 2014
Write RC file
private static void testWrite() throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("C:\\etc\\Hadoop\\conf\\core-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf\\hdfs-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf\\mapred-site.xml"));
FileSystem fs = null;
try {
fs = FileSystem.get(conf);
} catch (IOException e1) {
e1.printStackTrace();
}
// has to set column number manually
conf.setInt(RCFile.COLUMN_NUMBER_CONF_STR, 4);
RCFile.Writer rcWriter = new RCFile.Writer(fs, conf, new Path("/user/abc/output_rcwriter/output1"));
String[] values =
{"111222333,1200,999999.99,abc@yahoo.com",
"1112226666,1201,999999.99,abcdefg@yahoo.com"};
for (String value : values) {
String[] columns = value.split(",");
if (columns.length>0) {
BytesRefArrayWritable outputRow = new BytesRefArrayWritable(columns.length);
for (int i=0; i<columns.length; i++) {
BytesRefWritable column = new BytesRefWritable(columns[i].getBytes("UTF-8"));
outputRow.set(i, column);
}
rcWriter.append(outputRow);
}
}
rcWriter.close();
}
Friday, May 30, 2014
Eclipse: search files outside workspace
I used to use Visual Studio and IntelliJ IDEA for coding. When I switched to Eclipse, I found that search is very workspace oriented: I could not search for something outside of my project.
Finally, I found this useful link that solves the problem:
http://eclipse.dzone.com/articles/5-best-eclipse-plugins-system
Friday, May 23, 2014
Row wise read vs column wise read for RCFile
Row Wise Read:
private static void readRowWise(RCFile.Reader rcReader) {
int rowcounter = 0;
Text len = rcReader.getMetadata().get(new Text("hive.io.rcfile.column.number"));
int numberOfColumns = Integer.valueOf(len.toString());
try {
while (rcReader.next(new LongWritable(rowcounter))) {
BytesRefArrayWritable cols = new BytesRefArrayWritable();
/**
 * Have to call 'resetValid' for every row to allocate the number of columns for that row.
 * This looks ugly, but it is the way to make row wise reading work.
 */
cols.resetValid(numberOfColumns);
/**
* The name of getCurrentRow is kind of misleading. It actually reads all rows in the current row group,
* column by column (due to the columnar nature of RCFile), and stores them internally, so the next call to getCurrentRow
* actually returns the same data buffer. By default, it sets the 'valid' variable to the number of columns, so only the columns
* of the first row can be obtained by calling cols.get(i).
*
* Once the first row is read, a call to 'resetValid' allows us to read the next row. The value passed to 'resetValid'
* has to be the number of columns, so that all columns of the next row can be read.
*/
rcReader.getCurrentRow(cols);
int size = cols.size(); // this actually returns the number of columns in the current row.
for (int i= 0; i<size; i++) {
BytesRefWritable currentColumn = cols.get(i);
byte[] currentColumnBytes = currentColumn.getBytesCopy(); // get current column data for the current row
Text text = new Text(currentColumnBytes);
System.out.println("columnText="+text.toString());
}
rowcounter++;
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Column Wise Read:
private static void readColumnWise(RCFile.Reader rcReader) {
Text len = rcReader.getMetadata().get(new Text("hive.io.rcfile.column.number"));
int numberOfColumns = Integer.valueOf(len.toString());
String[][] firstNRows = null;
int numberOfRowsNeeded = 10; // only looking at first 10 rows
try {
// go through each row group
while (rcReader.nextColumnsBatch()) {
// go through each column in current row group
for (int i=0; i<numberOfColumns; i++) {
BytesRefArrayWritable columnData = rcReader.getColumn(i, null);
if (firstNRows==null)
firstNRows = new String[Math.min(numberOfRowsNeeded,columnData.size())][numberOfColumns];
// for a given column, go through each row in current row group
for (int j=0; j<columnData.size() && j<numberOfRowsNeeded; j++) {
BytesRefWritable cellData = columnData.get(j);
byte[] currentCell = Arrays.copyOfRange(cellData.getData(), cellData.getStart(), cellData.getStart()+cellData.getLength());
Text currentCellStr = new Text(currentCell);
System.out.println("columnText="+currentCellStr);
firstNRows[j][i] = currentCellStr.toString();
}
}
}
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
// print the collected data row by row, even though it was read column by column
for (int i=0; i<numberOfRowsNeeded; i++) {
for (int j=0; j<numberOfColumns; j++) {
if (j>0) System.out.print(",");
System.out.print(firstNRows[i][j]);
}
System.out.println();
}
}
A Test Driver:
private static void testDirectRead(boolean rowWise) {
Configuration conf = new Configuration();
conf.addResource(new Path("C:\\etc\\Hadoop\\conf\\core-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf\\hdfs-site.xml"));
conf.addResource(new Path("C:\\etc\\Hadoop\\conf\\mapred-site.xml"));
FileSystem fs = null;
try {
fs = FileSystem.get(conf);
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
RCFile.Reader rcReader = null;
try {
rcReader = new RCFile.Reader(fs, new Path("/user/hive/warehouse/rc_userdatatest2/000000_0"), conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
if (rowWise)
readRowWise(rcReader);
else
readColumnWise(rcReader);
rcReader.close();
}