JavaOptions to pass to the Spark driver. Here is a Hortonworks example (HDP is installed under /usr/hdp):
JavaOptions=-Dhdp.version=2.4.2.0-258
-Dspark.driver.extraJavaOptions=-Dhdp.version=2.4.2.0-258
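The same fix can also be sketched as entries in spark-defaults.conf instead of command-line JavaOptions (the version string 2.4.2.0-258 is from this example; substitute your own HDP version — the AM line is an addition that is commonly needed in yarn-client mode):

```
spark.driver.extraJavaOptions   -Dhdp.version=2.4.2.0-258
spark.yarn.am.extraJavaOptions  -Dhdp.version=2.4.2.0-258
```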
Snappy native library to pass to a Java program. Here is a CDH 5 example:
1. Copy /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native/* to the machine running your Java program and put the files into any directory, such as /opt/native
2. Run your Java program with the following Java options:
JavaOptions=-Djava.library.path=/opt/native
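Putting the two steps together, a launch sketch (NATIVE_DIR matches the example above; app.jar is a placeholder for your own program):

```shell
# Assumed layout: native libraries already copied to /opt/native (step 1).
NATIVE_DIR=/opt/native
JAVA_OPTS="-Djava.library.path=$NATIVE_DIR"
echo "$JAVA_OPTS"
# Actual launch (app.jar is a placeholder for your program):
# java $JAVA_OPTS -jar app.jar
```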
Your project can be switched on or off. Your project's priority can change. Your job can change. But the technology always heads north!
Monday, June 12, 2017
Wednesday, May 10, 2017
Search for a string in each Bugzilla bug over a range of bug numbers
[root@centos my-data-set]# for ((i=13420;i<13500;i++)); do \
    echo $i >> /tmp/search_result.txt; \
    curl http://192.168.5.105/bugzilla/show_bug.cgi?id=$i | grep "passed to DE constructor may be null" >> /tmp/search_result.txt; \
done
Tuesday, May 2, 2017
A simple way to clone 10,000 files to a cluster
# Generate 10,000 files from one seed and put them into 100 subdirectories
[hdfs@cdh1 tmp]$ for((i=1;i<=100;i++)); do \
    mkdir -p 10000files/F$i; \
    for((j=1;j<=100;j++)); do echo $i-$j; cp 1000line.csv 10000files/F$i/$i-$j.csv; done; \
done
# Move them to the cluster, one subdirectory at a time.
[hdfs@cdh1 tmp]$ for((i=1;i<=100;i++)); do \
    echo $i; \
    hadoop fs -mkdir /JohnZ/10000files/F$i; \
    hadoop fs -copyFromLocal 10000files/F$i/* /JohnZ/10000files/F$i/.; \
done
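Before the HDFS copy, the generation loop can be sanity-checked locally with a scaled-down version (4×4 files in a temp directory; the seed file here is a stand-in for 1000line.csv):

```shell
# Scaled-down sketch of the generation loop above: 4 subdirectories x 4 files.
DIR=$(mktemp -d)
echo "a,b,c" > "$DIR/seed.csv"            # stand-in for 1000line.csv
for ((i=1;i<=4;i++)); do
  mkdir -p "$DIR/files/F$i"
  for ((j=1;j<=4;j++)); do
    cp "$DIR/seed.csv" "$DIR/files/F$i/$i-$j.csv"
  done
done
# Verify the expected file count before running hadoop fs -copyFromLocal
find "$DIR/files" -name '*.csv' | wc -l    # prints 16
```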
Tuesday, April 25, 2017
Prepare a Python machine learning environment on CentOS 6.6 to train data
# Major points: 1. You have to use Python 2.7, not 2.6. CentOS 6.6 itself depends on Python 2.6, so upgrading the system Python in place is not an option; install Python 2.7 alongside 2.6 instead. 2. Use setuptools to install pip, and use pip to install the rest.
# Download dependency files
yum groupinstall "Development tools"
yum -y install gcc gcc-c++ numpy python-devel scipy
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel
# Compile and install Python 2.7.13
wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
tar xzf Python-2.7.13.tgz
cd Python-2.7.13
./configure
# make altinstall is used to prevent replacing the default python binary file /usr/bin/python.
make altinstall
# Download setuptools using wget:
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
# Extract the files from the archive:
tar -xvf setuptools-1.4.2.tar.gz
# Enter the extracted directory:
cd setuptools-1.4.2
# Install setuptools using the Python we just installed (2.7.13)
# python2.7 setup.py install
/opt/python-2.7.13/Python-2.7.13/python ./setup.py install
# install pip
curl https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py | python2.7 -
# or (the following works for me):
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python ./setuptools/setuptools-1.4.2/easy_install.py pip
# install numpy
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install numpy
# Install SciPy
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install scipy
# Install Scikit
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install scikit-learn
# Install nltk
[root@centos python-2.7.13]# /opt/python-2.7.13/Python-2.7.13/python -m pip install nltk
# Download nltk data (will be stored under /root/nltk_data)
[root@centos SVM]# /opt/python-2.7.13/Python-2.7.13/python -m nltk.downloader all
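A quick sanity check that the interpreter and the installed packages are wired up. The install prefix is the one used in this walkthrough; the fallback to the PATH python is an addition so the check still runs on other machines:

```shell
PY=/opt/python-2.7.13/Python-2.7.13/python   # path used in this walkthrough
# Fall back to whatever python is on PATH if that build is not present
[ -x "$PY" ] || PY=$(command -v python3 || command -v python)
"$PY" -c 'import sys; print(sys.version.split()[0])'
# For each ML package, print its version (or note that it is missing):
for pkg in numpy scipy sklearn nltk; do
  "$PY" -c "import $pkg; print($pkg.__version__)" 2>/dev/null || echo "$pkg missing"
done
```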
Tuesday, March 21, 2017
MapReduce logs have weird behavior on HDP 2.3 with Tez
When launching MapReduce jobs on Tez, we do not see our own logs in the HDP UI: clicking 'History' shows nothing, although the Hadoop system logs are visible. But the following command retrieves the MapReduce logs on stdout:
sudo -u hdfs yarn logs -applicationId application_1490133530166_0002
But the format is modified:
2017-03-21 15:04:24,708 [ERROR] [TezChild] |common.FindAndExitMapRunner|: caught throwable when run mapper:
java.lang.UnsupportedOperationException: Input only available on map
The ‘source’ is changed to ‘TezChild’, and the package name is truncated to its last component, so the Java class name is no longer fully qualified. In this example, “com.xxx.hadoop.common.FindAndExitMapRunner” is changed to “common.FindAndExitMapRunner”.
Compared with a normal log (i.e. without Tez), here is what we should get from a MapReduce log (note the full package and class name):
2017-03-20 14:29:54,778 INFO [main] com.xxx.hadoop.common.ColumnMap: Columnar Mapper
Important! Important! Important! ---->
To review the logs, you have to run the command as the exact user who launched the application: sudo -u hdfs
Otherwise, you will see the following error:
"Log aggregation has not completed or is not enabled."
Monday, March 6, 2017
Additional jar files when running Spark under Hadoop YARN mode (CDH 5.10.0 with Scala 2.10 and Spark 1.6.0)
lrwxrwxrwx 1 root root 91 Feb 23 15:02 spark-core_2.10-1.6.0-cdh5.10.0.jar -> /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/jars/spark-core_2.10-1.6.0-cdh5.10.0.jar
lrwxrwxrwx 1 root root 80 Feb 23 15:15 scala-library-2.10.6.jar -> /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/jars/scala-library-2.10.6.jar
lrwxrwxrwx 1 root root 37 Mar 6 14:06 commons-lang3-3.3.2.jar -> ../../../jars/commons-lang3-3.3.2.jar
-rw-r--r-- 1 root root 185676 Mar 6 14:11 typesafe-config-2.10.1.jar
lrwxrwxrwx 1 root root 55 Mar 6 14:26 akka-actor_2.10-2.2.3-shaded-protobuf.jar -> ../../../jars/akka-actor_2.10-2.2.3-shaded-protobuf.jar
lrwxrwxrwx 1 root root 56 Mar 6 14:28 akka-remote_2.10-2.2.3-shaded-protobuf.jar -> ../../../jars/akka-remote_2.10-2.2.3-shaded-protobuf.jar
lrwxrwxrwx 1 root root 55 Mar 6 14:29 akka-slf4j_2.10-2.2.3-shaded-protobuf.jar -> ../../../jars/akka-slf4j_2.10-2.2.3-shaded-protobuf.jar
lrwxrwxrwx 1 root root 70 Mar 6 14:40 spark-assembly-1.6.0-cdh5.10.0-hadoop2.6.0-cdh5.10.0.jar -> ../../../jars/spark-assembly-1.6.0-cdh5.10.0-hadoop2.6.0-cdh5.10.0.jar
[root@john2 lib]# pwd
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop-yarn/lib
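The listing above is just symlinks into the parcel's jars directory, so missing entries can be recreated with ln -s. A self-contained sketch, using a temp directory as a stand-in for /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41:

```shell
BASE=$(mktemp -d)                             # stand-in for the parcel root
mkdir -p "$BASE/jars" "$BASE/lib/hadoop-yarn/lib"
touch "$BASE/jars/scala-library-2.10.6.jar"   # pretend jar for the sketch
cd "$BASE/lib/hadoop-yarn/lib"
# Relative link, matching the ../../../jars/... form in the listing above
ln -s ../../../jars/scala-library-2.10.6.jar scala-library-2.10.6.jar
ls -l scala-library-2.10.6.jar
```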
Thursday, February 16, 2017
My co-worker Shawn's blog on Hadoop machine setup and installation
http://wp.huangshiyang.com/env-set-up-centos