Monday, April 30, 2018

HCatalog: Find Hive table name from file path

I believe I have talked about this logic multiple times, but I still get asked about it, so I am documenting it here; next time I am asked, I only need to send a link to this page.

In /etc/hive/conf/hive-site.xml, look for the following property; it tells you the root location of all Hive tables:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>

Then the path of a file belonging to a Hive table should look like:

hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/.../file_000000

Do a substring search for "/user/hive/warehouse"; the table name is the path component immediately after it (after stripping the leading "/").

The above algorithm only works if the table uses the default location and the default database.

If a table is under a non-default database, the full path will look something like:

hdfs://cdh1.host:8020/user/hive/warehouse/database_name.db/table_name/2018-01-01/.../file_000000
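
A minimal sketch of this parsing (the class and method names are just for illustration; it handles both the default-database and the database_name.db cases):

public class WarehousePathParser {

  // Returns {databaseName, tableName}, or null if the path is not under the warehouse dir.
  public static String[] dbAndTable(String fullPath, String warehouseDir) {
    int idx = fullPath.indexOf(warehouseDir);
    if (idx < 0) {
      return null;   // not a managed-table path; see the external table case below
    }
    String rest = fullPath.substring(idx + warehouseDir.length() + 1);   // also skip the '/'
    String[] parts = rest.split("/");
    if (parts[0].endsWith(".db")) {   // .../warehouse/database_name.db/table_name/...
      return new String[] { parts[0].substring(0, parts[0].length() - ".db".length()), parts[1] };
    }
    return new String[] { "default", parts[0] };   // .../warehouse/table_name/...
  }

  public static void main(String[] args) {
    String[] dt = dbAndTable(
        "hdfs://cdh1.host:8020/user/hive/warehouse/database_name.db/table_name/2018-01-01/file_000000",
        "/user/hive/warehouse");
    System.out.println(dt[0] + "." + dt[1]);   // prints database_name.table_name
  }
}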

If a user specifies LOCATION when defining a Hive table (as in the following example), we have to use HiveMetaStoreClient to get the table name.  This is the external table case.

CREATE EXTERNAL TABLE weatherext ( wban INT, date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weatherext';
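
To cover tables with a custom LOCATION, a sketch using HiveMetaStoreClient might look like this (assuming hive-site.xml with hive.metastore.uris is on the classpath so the client can reach the metastore); it builds the location-to-table-name map discussed below:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class TableLocationLoader {

  // Build a map from table location (as stored in the metastore) to "db.table".
  public static Map<String, String> loadLocations() throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    Map<String, String> locationToTable = new HashMap<String, String>();
    try {
      for (String db : client.getAllDatabases()) {
        for (String tableName : client.getAllTables(db)) {
          Table table = client.getTable(db, tableName);
          // For external tables this is whatever LOCATION was given in the DDL.
          locationToTable.put(table.getSd().getLocation(), db + "." + tableName);
        }
      }
    } finally {
      client.close();
    }
    return locationToTable;
  }
}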

By the way, to find the file path for a Hive table, issue the following command from the Hive command line:

DESCRIBE FORMATTED table_name;

A trie is the right data structure to hold the locations, so we can do a prefix match on the location given the full path of an HDFS file.  We can also build a location-to-table-name hash map, so once we find the matching location for a full path, we can look up the table name.

Caching all locations in a trie can cause high memory usage if you have millions of Hive tables.  The trie also needs to be updated whenever a new external table is created.
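
Not a real trie, but a minimal sketch of the same longest-prefix idea: trim the file path one component at a time and look each prefix up in the location-to-table map built above (this assumes the stored locations and the file paths use the same scheme and authority, e.g. both start with hdfs://cdh1.host:8020):

import java.util.Map;

public class LocationMatcher {

  public static String tableForFile(String filePath, Map<String, String> locationToTable) {
    String path = filePath;
    while (path.contains("/")) {
      String table = locationToTable.get(path);
      if (table != null) {
        return table;
      }
      path = path.substring(0, path.lastIndexOf('/'));   // drop the last path component
    }
    return null;   // no registered table location is a prefix of this file path
  }
}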

Friday, March 30, 2018

Different locations of MapReduce logs


HDP:

sudo -u hdfs hadoop fs -ls /app-logs/hdfs/logs

(7 days)

CDH5:

sudo -u hdfs hadoop fs -ls /tmp/logs/hdfs/logs/

(7 days)

EMR (AWS):

sudo -u hdfs hadoop fs -ls /var/logs/hdfs/logs/

(? days)

The default value of "yarn.log-aggregation.retain-seconds" is 7 days (604800 seconds).
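
If you need to change the retention, the corresponding yarn-site.xml entries look something like this (a sketch; 604800 seconds is 7 days):

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>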

If log aggregation is not enabled, the logs stay on the local file system.  For example, MapR puts the logs in the following location:

/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs/

Thursday, March 29, 2018

SPARK_HOME, etc.

export SPARK_HOME=/usr/hdp/2.6.3.0-235/spark2

On HDP, the hdp.version system property also needs to be passed to the Spark driver and the YARN application master:

JavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.driver.extraJavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.3.0-235
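
For example, these can be passed on the spark-submit command line (a sketch; the application class and jar are placeholders):

spark-submit --master yarn \
  --conf spark.driver.extraJavaOptions=-Dhdp.version=2.6.3.0-235 \
  --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.3.0-235 \
  --class com.mycompany.MyApp my-app.jar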


More info:
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java

Thursday, March 22, 2018

Additional steps for Ranger/Kerberos enabled Hadoop

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_security/content/ch06s01s01s01.html


Add values for the following properties in the "Custom kms-site" section. These properties allow the specified system users (hive, oozie, the user we are running as, and others) to proxy on behalf of other users when communicating with Ranger KMS. This lets individual services (such as Hive) use their own keytabs while still accessing Ranger KMS as the end user (so the access policies associated with the end user apply).
  • hadoop.kms.proxyuser.{hadoop-user}.users
  • hadoop.kms.proxyuser.{hadoop-user}.groups
  • hadoop.kms.proxyuser.{hadoop-user}.hosts
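
For example, for the hive user the three entries might look like this (a sketch; "*" can be narrowed down to specific users, groups, and hosts):

  <property>
    <name>hadoop.kms.proxyuser.hive.users</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.kms.proxyuser.hive.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.kms.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>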

Friday, February 2, 2018

Increase the MapReduce timeout value and JVM heap size, or set a different log level for MapReduce jobs

Update mapred-site.xml on the client machine only.

To increase the heap size:

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx3600m</value>
  </property>

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>

Make sure the value of "mapreduce.map.memory.mb" is greater than the "-Xmx" value.

To disable the timeout:

  <property>
    <name>mapreduce.task.timeout</name>
    <value>0</value>
  </property>

The default is 10 minutes (600000 milliseconds).

Add these to mapred-site.xml to lower the log level from INFO to WARN:

  <property>
    <name>mapreduce.map.log.level</name>
    <value>WARN</value>
  </property>

  <property>
    <name>mapreduce.reduce.log.level</name>
    <value>WARN</value>
  </property>

Of course, we can also do this through Java code.  For example, the following code sets the MapReduce application master's container size and JVM heap:

conf.setInt("yarn.app.mapreduce.am.resource.mb", 2048);
String opt = "-Xmx2048m";
conf.set("yarn.app.mapreduce.am.command-opts", opt);

Wednesday, January 3, 2018

Workaround to the S3 colon issue on Hadoop MapReduce

When an S3 file name or path contains a colon (:), you will get the following exception when you submit your MapReduce job:

2018-01-03/00:43:52.632/UTC ERROR [pool-6-thread-1] com.dataguise.hadoop.util.WFNew$4:run [WF-ERR]: Task2 cannot be run (because: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.fs.Path.<init>(Path.java:93)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1730)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.globStatus(EmrFileSystem.java:373)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:631)
        at com.mycompany.hadoop.util.WFNew$4.run(WFNew.java:603)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at com.mycompany.hadoop.util.WFNew.SubmitJob(WFNew.java:603)
        at com.mycompany.hadoop.util.WFNew.access$500(WFNew.java:37)
        at com.mycompany.hadoop.util.WFNew$3.run(WFNew.java:428)
        at java.lang.Thread.run(Thread.java:745)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: colon:test.log
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:202)
        ... 29 more

Basically, Hadoop does not allow a colon in a path or file name, but S3 does.

There are multiple discussions about this, but no fix from the Hadoop community.  If you have to support colons in file paths or names, you need your own solution.

One of the discussions is:

https://stackoverflow.com/questions/34093098/load-a-amazon-s3-file-which-has-colons-within-the-filename-through-pyspark

final Configuration hadoopConf = sparkContext.hadoopConfiguration();
hadoopConf.set("fs." + CustomS3FileSystem.SCHEMA + ".impl",
    CustomS3FileSystem.class.getName());

public class CustomS3FileSystem extends NativeS3FileSystem {

  public static final String SCHEMA = "s3";

  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[] {});
  }
}

This does not work for me for multiple reasons:

1. In my case, we are using EMR, so EmrFileSystem is the FileSystem we need to use as the base class.
2. We have to ship CustomS3FileSystem.class to the cluster with the other Java classes in our MapReduce jar.  Just using CustomS3FileSystem.class in the client code will not work.
3. We were unable to set the customized FileSystem through Java code, but we can set it through the core-site.xml file by changing the default FileSystem class (com.amazon.ws.emr.hadoop.fs.EmrFileSystem) to our customized FileSystem (CustomS3FileSystem in this example).

After that, we can run MapReduce jobs on files that have a colon in the file name or path.

The properties to change in core-site.xml (shown here with their default EMR values) are:

<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
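
After the jar containing the customized FileSystem is deployed to the cluster, these values are changed to point at it, for example (the package name here is just a placeholder):

<property>
  <name>fs.s3.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.mycompany.hadoop.fs.CustomS3FileSystem</value>
</property>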


All the imports we used are:

import com.amazon.ws.emr.hadoop.fs.EmrFileSystem;
import com.google.common.collect.Lists;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
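
Putting it together, a minimal sketch of what the EMR-based variant might look like (only the base class differs from the example above; the SCHEMA constant is dropped because the scheme is wired up in core-site.xml instead):

public class CustomS3FileSystem extends EmrFileSystem {

  // Skip glob expansion (which rejects ':' in names) and filter a plain listing instead.
  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[0]);
  }
}
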
As the article above points out, this is just a workaround: it will not allow you to specify wildcards in the URL.

Tuesday, November 7, 2017

Command to use jconsole to monitor java program remotely

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=18745 -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=192.168.6.81 -Xrunjdwp:transport=dt_socket,address=8002,server=y,suspend=n -jar "xxxxAgent-jetty.jar"

Then, from a remote machine, run jconsole and enter the above IP and port in the jconsole GUI.
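
For example, with the IP and port above:

jconsole 192.168.6.81:18745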