2018-01-03/00:43:52.632/UTC ERROR [pool-6-thread-1] com.dataguise.hadoop.util.WFNew$4:run [WF-ERR]: Task2 cannot be run (because: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1730)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.globStatus(EmrFileSystem.java:373)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:631)
at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:603)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at com.mycompany.hadoop.util.WFNew.SubmitJob(WFNew.java:603)
at com.mycompany.hadoop.util.WFNew.access$500(WFNew.java:37)
at com.mycompany.hadoop.util.WFNew$3.run(WFNew.java:428)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 29 more
Basically, Hadoop does not allow a colon in a path or file name, but S3 does.
There have been multiple discussions about this, but no fix from the Hadoop community. If you have to support colons in file paths or names, you have to have your own solution.
One of those discussions, with a suggested workaround, is:
https://stackoverflow.com/questions/34093098/load-a-amazon-s3-file-which-has-colons-within-the-filename-through-pyspark
final Configuration hadoopConf = sparkContext.hadoopConfiguration();
hadoopConf.set("fs." + CustomS3FileSystem.SCHEMA + ".impl",
    CustomS3FileSystem.class.getName());

public class CustomS3FileSystem extends NativeS3FileSystem {

  public static final String SCHEMA = "s3";

  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[] {});
  }
}
This does not work for me for multiple reasons:
1. In my case we are using EMR, so EmrFileSystem is the FileSystem we need to use as the base class (a sketch of the EMR-based class follows after this list).
2. We have to ship CustomS3FileSystem.class to the cluster along with our other Java classes in our MapReduce jar. Having CustomS3FileSystem.class only in the client code will not work.
3. We are unable to set the customized FileSystem through Java code. But we can set it through the core-site.xml file, by changing the default FileSystem class (com.amazon.ws.emr.hadoop.fs.EmrFileSystem) to our customized FileSystem (CustomS3FileSystem in this example).
After that, we can execute MapReduce jobs for files which have a colon in the file name or path.
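Here is a minimal sketch of what the EMR-based class can look like. The package name is hypothetical, and the body mirrors the Stack Overflow class, just with EmrFileSystem as the base (this assumes EmrFileSystem can be extended and exposes the standard FileSystem listStatus/globStatus methods):

package com.mycompany.hadoop.fs; // hypothetical package; this class must be shipped inside the MapReduce jar

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

import com.amazon.ws.emr.hadoop.fs.EmrFileSystem;
import com.google.common.collect.Lists;

/**
 * Same idea as the Stack Overflow class, but extending EmrFileSystem so it
 * works on EMR. The default globStatus() goes through Globber, which builds
 * child Paths from relative name components and throws URISyntaxException on
 * names containing a colon. Here we treat the input path as a literal path
 * and list it directly, so no glob expansion is performed.
 *
 * Note: FileInputFormat calls this two-argument globStatus; depending on the
 * Hadoop version you may also want to override globStatus(Path).
 */
public class CustomS3FileSystem extends EmrFileSystem {

  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    // List the path instead of globbing it, then apply the filter ourselves.
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter == null || filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[0]);
  }
}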
The properties to change in core-site.xml (shown below with their default EmrFileSystem values) are:
<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
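For example, assuming the custom class lives in the hypothetical package com.mycompany.hadoop.fs used in the sketch above, the changed values would look like:

<property>
  <name>fs.s3.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>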
All the imports we used are:
import com.amazon.ws.emr.hadoop.fs.EmrFileSystem;
import com.google.common.collect.Lists;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
This is just a workaround; as the above article pointed out, it will not allow you to specify wildcards in the URL.
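In practice that means job inputs must be given as literal paths (typically the parent directory) rather than glob patterns. A minimal sketch of the job setup, with a hypothetical bucket and path:

// conf picks up the fs.s3.impl / fs.s3n.impl overrides from core-site.xml
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "scan-colon-files");

// Point the job at the directory itself; the overridden globStatus() simply
// lists it. A pattern such as s3://my-bucket/logs/*.log would NOT be expanded.
FileInputFormat.addInputPath(job, new Path("s3://my-bucket/logs/"));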