Friday, March 30, 2018

Different locations of MapReduce logs


HDP:

sudo -u hdfs hadoop fs -ls /app-logs/hdfs/logs

(7 days)

CDH5:

sudo -u hdfs hadoop fs -ls /tmp/logs/hdfs/logs/

(7 days)

EMR (AWS):

sudo -u hdfs hadoop fs -ls /var/logs/hdfs/logs/

(? days)

The default value for "yarn.log-aggregation.retain-seconds" is 7 days (604800 seconds).
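
To change the retention period, set the property in yarn-site.xml; a minimal sketch with the 7-day value spelled out:

  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>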

If log aggregation is not enabled, the logs stay on the local file system of each node. For example, MapR puts logs in the following place:

/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs/
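
Once aggregation is enabled, the easiest way to read the logs of a finished job on any of these distributions is the yarn logs command (the application ID below is just an example):

sudo -u hdfs yarn logs -applicationId application_1501527690446_0066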

Thursday, March 29, 2018

SPARK_HOME, etc.

export SPARK_HOME=/usr/hdp/2.6.3.0-235/spark2




On HDP, Spark on YARN also needs hdp.version defined (otherwise the ${hdp.version} placeholders in the YARN classpath fail to resolve):

JavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.driver.extraJavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.3.0-235
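
The same values can be passed straight to spark-submit; a sketch (your-app.jar is a placeholder):

spark-submit --master yarn \
  --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.6.3.0-235" \
  --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.3.0-235" \
  your-app.jar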


More info:
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java





Thursday, March 22, 2018

Additional steps for Ranger/Kerberos enabled Hadoop

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_security/content/ch06s01s01s01.html


Add values for the following properties in the "Custom kms-site" section. These properties allow the specified system users (hive, oozie, the user we are running as, and others) to proxy on behalf of other users when communicating with Ranger KMS. This lets individual services (such as Hive) use their own keytabs while retaining the ability to access Ranger KMS as the end user (so the access policies associated with the end user still apply). An example follows the list.
  • hadoop.kms.proxyuser.{hadoop-user}.users
  • hadoop.kms.proxyuser.{hadoop-user}.groups
  • hadoop.kms.proxyuser.{hadoop-user}.hosts
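
For example, for the hive service user (the wildcard values are a permissive sketch; restrict them for production):

  <property>
    <name>hadoop.kms.proxyuser.hive.users</name>
    <value>*</value>
  </property>

  <property>
    <name>hadoop.kms.proxyuser.hive.groups</name>
    <value>*</value>
  </property>

  <property>
    <name>hadoop.kms.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>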

Friday, February 2, 2018

Increase the MapReduce timeout and JVM heap size, or set a different log level for a MapReduce job

Update mapred-site.xml on the client machine only:

For increasing heap size:

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx3600m</value>
  </property>

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>

Make sure the value of "mapreduce.map.memory.mb" is greater than the value of "-Xmx", leaving headroom for non-heap memory; otherwise YARN may kill the container for exceeding its limit.

For no timeout:

  <property>
    <name>mapreduce.task.timeout</name>
    <value>0</value>
  </property>

The default is 10 minutes (600000 ms); a value of 0 disables the timeout.

Add these to mapred-site.xml to lower the log level from INFO to WARN:

  <property>
    <name>mapreduce.map.log.level</name>
    <value>WARN</value>
  </property>

  <property>
    <name>mapreduce.reduce.log.level</name>
    <value>WARN</value>
  </property>

Of course, we can also do this through Java code. For example, use the following to raise the ApplicationMaster's memory:

// Ask YARN for a 2 GB container for the MapReduce ApplicationMaster,
// and set the AM's JVM heap accordingly.
conf.setInt("yarn.app.mapreduce.am.resource.mb", 2048);
String opt = "-Xmx2048m";
conf.set("yarn.app.mapreduce.am.command-opts", opt);

Wednesday, January 3, 2018

Workaround to the S3 colon issue on Hadoop MapReduce

When an S3 file name or path contains a colon (:), you will get the following exception when you submit your MapReduce job:

2018-01-03/00:43:52.632/UTC ERROR [pool-6-thread-1] com.dataguise.hadoop.util.WFNew$4:run [WF-ERR]: Task2 cannot be run (because: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.fs.Path.<init>(Path.java:93)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1730)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.globStatus(EmrFileSystem.java:373)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:631)
        at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:603)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at com.mycompany.hadoop.util.WFNew.SubmitJob(WFNew.java:603)
        at com.mycompany.hadoop.util.WFNew.access$500(WFNew.java:37)
        at com.mycompany.hadoop.util.WFNew$3.run(WFNew.java:428)
        at java.lang.Thread.run(Thread.java:745)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:202)
        ... 29 more

Basically, Hadoop does not allow a colon in a path or file name, but S3 does.

There have been multiple discussions about this, but no fix from the Hadoop community. If you have to support colons in file paths or names, you need your own solution.

One such discussion is:

https://stackoverflow.com/questions/34093098/load-a-amazon-s3-file-which-has-colons-within-the-filename-through-pyspark

final Configuration hadoopConf = sparkContext.hadoopConfiguration();
hadoopConf.set("fs." + CustomS3FileSystem.SCHEMA + ".impl",
    CustomS3FileSystem.class.getName());

public class CustomS3FileSystem extends NativeS3FileSystem {

  public static final String SCHEMA = "s3";

  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[] {});
  }
}

This did not work for me, for multiple reasons:

1. In our case we are using EMR, so EmrFileSystem is the FileSystem we need to extend as the base class.
2. We have to ship CustomS3FileSystem.class to the cluster along with the other Java classes in our MapReduce jar. Referencing CustomS3FileSystem.class only in the client code will not work.
3. We were unable to set the customized FileSystem through Java code, but we can set it through the core-site.xml file by changing the default FileSystem class (com.amazon.ws.emr.hadoop.fs.EmrFileSystem) to our customized one (CustomS3FileSystem in this example).

After that, we can run MapReduce jobs against files that have a colon in the file name or path. A sketch of the EMR variant follows.
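
A minimal sketch of the EMR-based class (the package and class name are ours; the imports are listed at the end of this post):

public class CustomS3FileSystem extends EmrFileSystem {

  // Same trick as above: skip Hadoop's Globber (which rejects ':' in names)
  // and apply the filter over a plain listing instead.
  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[] {});
  }
}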

The relevant properties in core-site.xml are shown below with their default EMR values; the change is to replace each value with our custom class:

<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
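
After the change, both properties point at the custom class (the package com.mycompany.hadoop.fs is hypothetical; use the package where your class actually lives):

<property>
  <name>fs.s3.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>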


All the imports we used are:

import com.amazon.ws.emr.hadoop.fs.EmrFileSystem;
import com.google.common.collect.Lists;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

This is just a workaround; as the article above points out, it does not allow you to specify wildcards in the URL.

Tuesday, November 7, 2017

Command to use jconsole to monitor a Java program remotely

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=18745 -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=192.168.6.81 -Xrunjdwp:transport=dt_socket,address=8002,server=y,suspend=n -jar "xxxxAgent-jetty.jar"

Then, from the remote machine, run jconsole and enter the above IP and port (192.168.6.81:18745) in the jconsole GUI. (The -Xrunjdwp option in the command is the JDWP debug agent for remote debugging; it is not needed for jconsole itself.)
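
jconsole can also take the connection directly on the command line:

jconsole 192.168.6.81:18745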

Tuesday, August 1, 2017

log4j trick when creating ConsoleAppender from Java code

The right way to do it:

/**
 * Dynamically initialize a log4j console appender so we can log info to stdout
 * with the specified level and pattern layout.
 */
public static void initStdoutLogging(Level logLevel, String patternLayout) {
  LogManager.resetConfiguration();
  ConsoleAppender appender = new ConsoleAppender(new PatternLayout(patternLayout));
  appender.setThreshold(logLevel);
  LogManager.getRootLogger().addAppender(appender);
}

initStdoutLogging(Level.INFO, PatternLayout.TTCC_CONVERSION_PATTERN);

When enabling logging for Spark, we need to call the above initStdoutLogging twice: once for the driver and once for the executors. Otherwise, you will only see logs from the driver.

To read logs from the executors, you will have to use something like:

sudo -u hdfs yarn logs -applicationId application_1501527690446_0066


Analysis:

If I initialize it like this:

ConsoleAppender ca = new ConsoleAppender();
ca.setLayout(new PatternLayout(PatternLayout.TTCC_CONVERSION_PATTERN));

it gives an error and breaks the logging. Error output:

log4j:ERROR No output stream or file set for the appender named [null].

If I initialize it like this, it works fine:

ConsoleAppender ca = new ConsoleAppender(new PatternLayout(PatternLayout.TTCC_CONVERSION_PATTERN));

The reason:
If you look at the source for ConsoleAppender:
  public ConsoleAppender(Layout layout) {
    this(layout, SYSTEM_OUT);
  }

  public ConsoleAppender(Layout layout, String target) {
    setLayout(layout);
    setTarget(target);
    activateOptions();
  }
You can see that ConsoleAppender(Layout) passes SYSTEM_OUT as the target, and that it calls activateOptions after setting the layout and target.
If you use setLayout yourself, you'll also need to explicitly set the target and call activateOptions, as shown below.
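
So the no-arg constructor also works if you finish the setup yourself (log4j 1.x API):

ConsoleAppender ca = new ConsoleAppender();
ca.setLayout(new PatternLayout(PatternLayout.TTCC_CONVERSION_PATTERN));
ca.setTarget(ConsoleAppender.SYSTEM_OUT);
ca.activateOptions();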