Wednesday, January 3, 2018

Workaround to the S3 colon issue on Hadoop MapReduce

When an S3 file name or path contains a colon (:), you will get the following exception when you submit your MapReduce job:

2018-01-03/00:43:52.632/UTC ERROR [pool-6-thread-1] com.mycompany.hadoop.util.WFNew$4:run [WF-ERR]: Task2 cannot be run (because: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.fs.Path.<init>(Path.java:93)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1730)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.globStatus(EmrFileSystem.java:373)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:631)
        at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:603)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at com.mycompany.hadoop.util.WFNew.SubmitJob(WFNew.java:603)
        at com.mycompany.hadoop.util.WFNew.access$500(WFNew.java:37)
        at com.mycompany.hadoop.util.WFNew$3.run(WFNew.java:428)
        at java.lang.Thread.run(Thread.java:745)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:202)
        ... 29 more

Basically, Hadoop does not allow a colon in a path or file name, but S3 does.

There are multiple discussions about this, but no fix from the Hadoop community.  If you have to support colons in file paths or names, you need your own workaround.

One of the discussions is:

https://stackoverflow.com/questions/34093098/load-a-amazon-s3-file-which-has-colons-within-the-filename-through-pyspark

// Register the custom FileSystem implementation for the s3 scheme (client side)
final Configuration hadoopConf = sparkContext.hadoopConfiguration();
hadoopConf.set("fs." + CustomS3FileSystem.SCHEMA + ".impl",
    CustomS3FileSystem.class.getName());

public class CustomS3FileSystem extends NativeS3FileSystem {

  public static final String SCHEMA = "s3";

  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    // Skip glob expansion (which rejects ':' in names); just list the path and filter.
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[] {});
  }
}

This does not work for me for multiple reasons:

1. In my case, we are using EMR, so EmrFileSystem is the FileSystem we need to use as the base class (a sketch follows this list).
2. We have to ship CustomS3FileSystem.class to the cluster with the other Java classes in our MapReduce jar.  Using CustomS3FileSystem.class only in the client code will not work.
3. We are unable to set the customized FileSystem through Java code.  But we can set it through the core-site.xml file by changing the default FileSystem class (com.amazon.ws.emr.hadoop.fs.EmrFileSystem) to our customized FileSystem (CustomS3FileSystem in this example).
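
For reference, here is a minimal sketch of what this looks like when EmrFileSystem is the base class, using the same list-and-filter trick as the StackOverflow answer. The package name is just a placeholder; only the overridden globStatus signature comes from Hadoop's FileSystem API.

package com.mycompany.hadoop.fs;  // placeholder package; use your own

import com.amazon.ws.emr.hadoop.fs.EmrFileSystem;
import com.google.common.collect.Lists;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class CustomS3FileSystem extends EmrFileSystem {

  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    // Bypass glob expansion (which rejects ':' in names): list the path and
    // apply the filter ourselves.
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[0]);
  }
}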

After that, we can execute MapReduce jobs against files that have a colon in the file name or path.

The properties to change in core-site.xml are fs.s3.impl and fs.s3n.impl.  They are shown below with their default values; replace each value with the fully qualified class name of the customized FileSystem:

<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
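
For example, after the change each entry points at the customized class instead (the package prefix below is just a placeholder for wherever CustomS3FileSystem lives in your jar):

<property>
  <name>fs.s3.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>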


All the imports we used are:

import com.amazon.ws.emr.hadoop.fs.EmrFileSystem;
import com.google.common.collect.Lists;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

This is just a workaround; as the article above points out, it will not allow you to specify wildcards in the URL.

Tuesday, November 7, 2017

Command to use jconsole to monitor a Java program remotely

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=18745 -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=192.168.6.81 -Xrunjdwp:transport=dt_socket,address=8002,server=y,suspend=n -jar "xxxxAgent-jetty.jar"

Then, from the remote machine, run jconsole and enter the above IP and port into the jconsole GUI (or pass them on the command line, e.g. jconsole 192.168.6.81:18745).

Tuesday, August 1, 2017

log4j trick when creating ConsoleAppender from Java code

Right way to do it:

/**
 * Dynamically initialize a log4j console appender so we can log info to stdout
 * with the specified level and pattern layout.
 */
public static void initStdoutLogging(Level logLevel, String patternLayout) {
  LogManager.resetConfiguration();
  ConsoleAppender appender = new ConsoleAppender(new PatternLayout(patternLayout));
  appender.setThreshold(logLevel);
  LogManager.getRootLogger().addAppender(appender);
}

initStdoutLogging(Level.INFO, PatternLayout.TTCC_CONVERSION_PATTERN);

When enabling logging for Spark, we need to call the above initStdoutLogging twice: once for the driver and once for the executors.  Otherwise, you will only see logs from the driver.
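
A minimal sketch of what that looks like in a plain Java Spark job (the app name, input path, and processing logic are illustrative only):

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.PatternLayout;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class StdoutLoggingExample {

  public static void main(String[] args) {
    // Driver-side appender
    initStdoutLogging(Level.INFO, PatternLayout.TTCC_CONVERSION_PATTERN);

    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("stdout-logging"));
    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input");

    lines.foreachPartition(it -> {
      // Executor-side appender: each executor JVM needs its own
      initStdoutLogging(Level.INFO, PatternLayout.TTCC_CONVERSION_PATTERN);
      while (it.hasNext()) {
        LogManager.getRootLogger().info("processing: " + it.next());
      }
    });

    sc.stop();
  }

  // Same helper as above
  public static void initStdoutLogging(Level logLevel, String patternLayout) {
    LogManager.resetConfiguration();
    ConsoleAppender appender = new ConsoleAppender(new PatternLayout(patternLayout));
    appender.setThreshold(logLevel);
    LogManager.getRootLogger().addAppender(appender);
  }
}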

To read the logs from the executors, you will have to use something like:

sudo -u hdfs yarn logs -applicationId application_1501527690446_0066


Analysis:

If I initialize it like this:
ConsoleAppender ca = new ConsoleAppender();
ca.setLayout(new PatternLayout(PatternLayout.TTCC_CONVERSION_PATTERN));
it gives an error and breaks the logging.
Error output:
log4j:ERROR No output stream or file set for the appender named [null].
If I initialize it like this, it works fine:
ConsoleAppender ca = new ConsoleAppender(new PatternLayout(PatternLayout.TTCC_CONVERSION_PATTERN));

The reason:
If you look at the source for ConsoleAppender:
  public ConsoleAppender(Layout layout) {
    this(layout, SYSTEM_OUT);
  }

  public ConsoleAppender(Layout layout, String target) {
    setLayout(layout);
    setTarget(target);
    activateOptions();
  }
You can see that ConsoleAppender(Layout) passes SYSTEM_OUT as the target, and also that it calls activateOptions after setting the layout and target.
If you use setLayout yourself, then you'll also need to explicitly set the target and call activateOptions.
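
So the no-arg constructor route also works if you do those two extra steps yourself; a minimal sketch of the equivalent fix:

ConsoleAppender ca = new ConsoleAppender();
ca.setLayout(new PatternLayout(PatternLayout.TTCC_CONVERSION_PATTERN));
ca.setTarget(ConsoleAppender.SYSTEM_OUT);  // "System.out", the same default the one-arg constructor uses
ca.activateOptions();                      // without this you get the "No output stream or file set" error above
LogManager.getRootLogger().addAppender(ca);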

Monday, June 12, 2017

JavaOptions to pass to the Spark driver, and the Snappy native library to pass to a Java program

(On HDP, the stack is installed under /usr/hdp; the hdp.version value below corresponds to the version directory there.)

JavaOptions to pass to the Spark driver.  Here is a Hortonworks (HDP) example:

JavaOptions=-Dhdp.version=2.4.2.0-258 -Dspark.driver.extraJavaOptions=-Dhdp.version=2.4.2.0-258

Snappy native library to pass to a Java program.  Here is an example for CDH 5:

1. Copy /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native/* to the machine running your Java program and put the files in any directory, such as /opt/native.
2. Run your Java program with the following Java option:
JavaOptions=-Djava.library.path=/opt/native
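
To verify that the native libraries are actually picked up, a quick sanity check using Hadoop's NativeCodeLoader (the class name NativeCheck and the classpath are ours to choose):

import org.apache.hadoop.util.NativeCodeLoader;

// Run with: java -Djava.library.path=/opt/native -cp <hadoop jars and this class> NativeCheck
public class NativeCheck {
  public static void main(String[] args) {
    boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
    System.out.println("hadoop native loaded: " + loaded);
    if (loaded) {
      System.out.println("snappy supported: " + NativeCodeLoader.buildSupportsSnappy());
    }
  }
}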

Wednesday, May 10, 2017

Search for anything in Bugzilla bugs given a range of bug numbers

[root@centos my-data-set]# for ((i=13420;i<13500;i++)); do echo $i >> /tmp/search_result.txt; curl http://192.168.5.105/bugzilla/show_bug.cgi?id=$i | grep "passed to DE constructor may be null" >> /tmp/search_result.txt; done;

Tuesday, May 2, 2017

A simple way to clone 10,000 files to a cluster


# Generate 10,000 files from one seed and put them into 100 subdirectories

[hdfs@cdh1 tmp]$ for((i=1;i<=100;i++));do mkdir -p 10000files/F$i; for((j=1;j<=100;j++));do echo $i-$j; cp 1000line.csv 10000files/F$i/$i-$j.csv;done;done;

# Move them to the cluster, one subdirectory at a time.

[hdfs@cdh1 tmp]$ for((i=1;i<=100;i++));do echo $i; hadoop fs -mkdir /JohnZ/10000files/F$i; hadoop fs -copyFromLocal 10000files/F$i/* /JohnZ/10000files/F$i/.;done;

Tuesday, April 25, 2017

Prepare a Python machine learning environment on CentOS 6.6 to train data

# Major points: 1. You have to use Python 2.7, not 2.6.  CentOS 6.6 uses Python 2.6 for the OS itself, so upgrading the system Python to 2.7 is not an option; install Python 2.7 in addition to 2.6.  2. Use setuptools to install pip, and use pip to install the rest.

# Download dependency files
yum groupinstall "Development tools"
yum -y install gcc gcc-c++ numpy python-devel scipy
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel

# Compile and install Python 2.7.13
wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
tar xzf Python-2.7.13.tgz
cd Python-2.7.13
./configure
# make altinstall is used to prevent replacing the default python binary file /usr/bin/python.
make altinstall

# Download setuptools using wget:
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
# Extract the files from the archive:
tar -xvf setuptools-1.4.2.tar.gz
# Enter the extracted directory:
cd setuptools-1.4.2

# Install setuptools using the Python we've installed (2.7.13)
# python2.7 setup.py install
/opt/python-2.7.13/Python-2.7.13/python ./setup.py install

# install pip
curl https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py | python2.7 -

or (the following works for me)

[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python ./setuptools/setuptools-1.4.2/easy_install.py pip

# install numpy
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install numpy

# Install SciPy
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install scipy

# Install Scikit
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install scikit-learn

# Install nltk
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install nltk

# Download nltk data (will be stored under /root/nltk_data)
[root@centos SVM]# /opt/python-2.7.13/Python-2.7.13/python -m nltk.downloader all