Sunday, September 9, 2018

Python




Python:

Python GIL (Global Interpreter Lock): CPython and PyPy have a GIL.
Jython and IronPython have no GIL.

Python IDE: PyCharm

SJCMACJ15JHTD8:~ jzeng$ python
Python 2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.python_implementation()
'CPython'

On Dec 12, I found ‘python’ no longer works and I have to explicitly use ‘python3.7’:

SJCMACJ15JHTD8:prometheus jzeng$ python
SJCMACJ15JHTD8:prometheus jzeng$ python3.7
Python 3.7.1 (default, Nov  6 2018, 18:45:35)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

‘pip’ has to be run this way:

SJCMACJ15JHTD8:prometheus jzeng$ python3.7 -m pip install prometheus_client

If I run it as ‘pip install prometheus_client’, I get the following error:

SJCMACJ15JHTD8:prometheus jzeng$ pip install prometheus_client
/usr/local/bin/pip: line 3: __requires__: command not found
/usr/local/bin/pip: line 4: import: command not found
/usr/local/bin/pip: line 5: import: command not found
from: can't read /var/mail/pkg_resources


Run Python built-in HTTP server on default port 8000:

(venv) SJCMACJ15JHTD8:myproject jzeng$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
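In Python 3, SimpleHTTPServer was merged into http.server, so the CLI equivalent is `python3 -m http.server 8000`. A minimal programmatic sketch (binding port 0 here so the OS picks any free port):

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Port 0 asks the OS for any free port; use 8000 to match the CLI default.
server = HTTPServer(("0.0.0.0", 0), SimpleHTTPRequestHandler)
print("Serving HTTP on port %d ..." % server.server_address[1])
# server.serve_forever()  # uncomment to actually serve requests
server.server_close()
```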

Find package full path:

>>> import threading
>>> print(threading)
<module 'threading' from '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.pyc'>

Run memory profiler to show memory is only increased, and never decreased:

(venv) SJCMACJ15JHTD8:myproject jzeng$ easy_install -U memory_profiler

(venv) SJCMACJ15JHTD8:myproject jzeng$ python -m memory_profiler memory-profile-me.py
Filename: memory-profile-me.py

Line #    Mem usage    Increment   Line Contents
================================================
     4    9.672 MiB    9.672 MiB   @profile
     5                             def function():
     6   48.367 MiB   38.695 MiB       x = list(range(1000000)) #allocate a big list
     7  137.395 MiB   89.027 MiB       y = copy.deepcopy(x)
     8  137.395 MiB    0.000 MiB       del x
     9  137.395 MiB    0.000 MiB       return y


Python memory model: Arena -> Pool -> Block.   There is a problem reclaiming memory, as shown in the above example.

List comprehension:
List comprehensions provide a concise way to create lists.
Every time a loop is run to massage the contents of a sequence, try to replace it with a list comprehension.
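For example, a loop that massages a sequence can usually be collapsed into a single expression:

```python
# Loop version: massage the contents of a sequence.
squares = []
for x in range(10):
    if x % 2 == 0:
        squares.append(x * x)

# Equivalent list comprehension: more concise, same result.
squares_lc = [x * x for x in range(10) if x % 2 == 0]
assert squares == squares_lc == [0, 4, 16, 36, 64]
```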
Yield and Generator (3 ways to use ‘yield’)
#1 allows you to pause a function and return an intermediate result
Python provides a shortcut to write simple generators over a sequence. A syntax similar to list comprehensions can be used to replace yield. Parentheses are used instead of brackets:
   >>> iter = (x**2 for x in range(10) if x % 2 == 0)
   >>> for el in iter:
   ...     print el
   ...
0
4
16
36
64
These kinds of expressions are called generator expressions or genexp
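A plain generator function does the same pausing with an explicit yield; a minimal sketch:

```python
def even_squares(n):
    """Yield the square of each even number below n."""
    for x in range(n):
        if x % 2 == 0:
            yield x ** 2  # execution suspends here until the next value is requested

gen = even_squares(10)
assert list(gen) == [0, 4, 16, 36, 64]
```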
#2 As of Python version 2.5, the yield statement is now allowed in the try clause of a try ... finally construct. If the generator is not resumed before it is finalized (by reaching a zero reference count or by being garbage collected), the generator-iterator's close()method will be called, allowing any pending finally clauses to execute.

#3 pytest supports execution of fixture specific finalization code when the fixture goes out of scope. By using a yield statement instead of return, all the code after the yield statement serves as the teardown code:
# content of conftest.py
 
import smtplib
import pytest
 
@pytest.fixture(scope="module")
def smtp_connection():
    smtp_connection = smtplib.SMTP("smtp.gmail.com", 587, timeout=5)
    yield smtp_connection  # provide the fixture value
    print("teardown smtp")
    smtp_connection.close()

The print and smtp_connection.close() statements will execute when the last test in the module has finished execution, regardless of the exception status of the tests.


Closure:

a closure is an instance of a function, a value, whose non-local variables have been bound either to values or to storage locations (depending on the language).

def f(x):
    def g(y):
        return x + y
    return g  # Return a closure.

def h(x):
    return lambda y: x + y  # Return a closure.

# Assigning specific closures to variables.
a = f(1)
b = h(1)

# Using the closures stored in variables.
assert a(5) == 6
assert b(5) == 6

# Using closures without binding them to variables first.
assert f(1)(5) == 6  # f(1) is the closure.
assert h(1)(5) == 6  # h(1) is the closure.


Decorator:  Use @symbol (this is called “pie” syntax) to decorate a function.

@my_decorator
def just_some_func():
    print("testing func")

One good example of a class decorator is line 1215 in PanWFAppBase (panBaseAPP.py) for @output(fmt=’xml’)
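my_decorator is not defined above; a minimal sketch of what such a decorator could look like (the wrapper behavior is my assumption, not from the source):

```python
import functools

def my_decorator(func):
    # Hypothetical decorator: wraps func and prints around the call.
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print("before %s" % func.__name__)
        result = func(*args, **kwargs)
        print("after %s" % func.__name__)
        return result
    return wrapper

@my_decorator
def just_some_func():
    print("testing func")

just_some_func()
assert just_some_func.__name__ == "just_some_func"  # thanks to functools.wraps
```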

Property & Descriptor:

class Movie(object):
    @property
    def budget(self):
        return self._budget

    @budget.setter
    def budget(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._budget = value
Python automatically calls the getter whenever anybody tries to access budget.  Likewise, Python automatically calls the budget.setter whenever it encounters code like m.budget = value.
Descriptor lets you customize what should be done when you refer to an attribute on an object.
Descriptor can define the __get__, __set__, or __delete__ method.
A good article about Descriptor: http://nbviewer.jupyter.org/urls/gist.github.com/ChrisBeaumont/5758381/raw/descriptor_writeup.ipynb

A descriptor that implements __get__ and __set__ is called a data descriptor.
A descriptor that just implements __get__ is called a non-data descriptor.
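A minimal sketch of a data descriptor (the class names here are illustrative, not from the source):

```python
class Positive(object):
    """Data descriptor: implements both __get__ and __set__."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        return obj.__dict__[self.name]

    def __set__(self, obj, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        obj.__dict__[self.name] = value

class Movie(object):
    budget = Positive("budget")  # attribute access is routed through the descriptor

m = Movie()
m.budget = 100
assert m.budget == 100
try:
    m.budget = -1
except ValueError:
    pass  # __set__ rejected the negative value
```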

With statement: providing a simple way to call some code before and after a block of code
>>> with file('/etc/hosts') as hosts:
...     for line in hosts:
...         print line

Another example:

# content of test_yield2.py
 
import smtplib
import pytest
 
@pytest.fixture(scope="module")
def smtp_connection():
    with smtplib.SMTP("smtp.gmail.com", 587, timeout=5) as smtp_connection:
        yield smtp_connection  # provide the fixture value

The smtp_connection will be closed after the tests finish execution, because the smtp_connection object automatically closes when the with statement ends.


context management protocol

Context managers allow you to allocate and release resources precisely when you want to. The most widely used example of context managers is the with statement. Suppose you have two related operations which you’d like to execute as a pair, with a block of code in between. Context managers allow you to do specifically that.

At the very least a context manager has an __enter__ and __exit__ method defined. Let’s make our own file-opening Context Manager and learn the basics.
class File(object):
    def __init__(self, file_name, method):
        self.file_obj = open(file_name, method)
    def __enter__(self):
        return self.file_obj
    def __exit__(self, type, value, traceback):
        self.file_obj.close()

Just by defining __enter__ and __exit__ methods we can use our new class in a with statement. Let’s try:
with File('demo.txt', 'w') as opened_file:
    opened_file.write('Hola!')

Our __exit__ method accepts three arguments. They are required by every __exit__ method that is part of a Context Manager class. Let’s talk about what happens under the hood.
1. The with statement stores the __exit__ method of the File class.
2. It calls the __enter__ method of the File class.
3. The __enter__ method opens the file and returns it.
4. The opened file handle is passed to opened_file.
5. We write to the file using .write().
6. The with statement calls the stored __exit__ method.
7. The __exit__ method closes the file.
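The same File behavior can also be sketched with contextlib.contextmanager, where the code before the yield plays the role of __enter__ and the finally block plays __exit__ (the open_file name is mine, not from the source):

```python
from contextlib import contextmanager
import os
import tempfile

@contextmanager
def open_file(file_name, method):
    file_obj = open(file_name, method)
    try:
        yield file_obj    # acts as __enter__'s return value
    finally:
        file_obj.close()  # acts as __exit__; runs even on exceptions

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open_file(path, "w") as opened_file:
    opened_file.write("Hola!")
assert opened_file.closed
```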
Another example is the Timer class in Prometheus client:


class Timer(object):

    def __init__(self, callback):
        self._callback = callback

    def _new_timer(self):
        return self.__class__(self._callback)

    def __enter__(self):
        self._start = default_timer()

    def __exit__(self, typ, value, traceback):
        # Time can go backwards.
        duration = max(default_timer() - self._start, 0)
        self._callback(duration)

    def __call__(self, f):
        def wrapped(func, *args, **kwargs):
            # Obtaining new instance of timer every time
            # ensures thread safety and reentrancy.
            with self._new_timer():
                return func(*args, **kwargs)

        return decorate(f, wrapped)


MRO (Method Resolution Order) is based on C3 which builds the linearization of a class.
In Python, the base classes are not implicitly called in __init__, and so it is up to the developer to call them.
>>> print file.__mro__
(<type 'file'>, <type 'object'>)   # new-style class
>>> import inspect
>>> print inspect.__mro__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute '__mro__'   # modules (and old-style classes) have no __mro__
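A small diamond-inheritance sketch showing the C3 linearization and the explicit base-class calls:

```python
class A(object):
    def __init__(self):
        self.order = ["A"]

class B(A):
    def __init__(self):
        super(B, self).__init__()  # base __init__ must be called explicitly
        self.order.append("B")

class C(A):
    def __init__(self):
        super(C, self).__init__()
        self.order.append("C")

class D(B, C):
    def __init__(self):
        super(D, self).__init__()
        self.order.append("D")

# C3 linearization: D -> B -> C -> A -> object
assert [k.__name__ for k in D.__mro__] == ["D", "B", "C", "A", "object"]
assert D().order == ["A", "C", "B", "D"]  # cooperative super() follows the MRO
```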



Thursday, June 28, 2018

Inotify

Available starting from CDH 5.4 and Hadoop 2.6.

Only a superuser can use it.

No filtering.

org.apache.hadoop.hdfs.inotify.Event.*
org.apache.hadoop.hdfs.DFSClient

Monday, April 30, 2018

HCatalog: Find Hive table name from file path

I believe I have talked about this logic multiple times, but I am still being asked, so I decided to document it; next time I only need to send a link to this page.

From /etc/hive/conf/hive-site.xml, look for the following property to find the root location for all hive tables:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>

Then, your hive file should look like:

hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/.../file_000000

Do a substring search for "/user/hive/warehouse" and the table name is right after it (of course, we need to strip the leading /).

The above algorithm only works if a user uses the default location and default database.

If a table is under non-default database, the full path will be something like:

hdfs://cdh1.host:8020/user/hive/warehouse/database_name.db/table_name/2018-01-01/.../file_000000
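The substring logic for the two default-location cases above can be sketched as a small helper (the function name is mine, not from the source):

```python
def table_from_path(path, warehouse="/user/hive/warehouse"):
    """Hypothetical helper: extract db.table from a default-location HDFS path."""
    idx = path.find(warehouse)
    if idx < 0:
        return None  # custom 'Location'; need HiveMetaStoreClient instead
    parts = path[idx + len(warehouse):].strip("/").split("/")
    if parts[0].endswith(".db"):               # non-default database
        return parts[0][:-3] + "." + parts[1]
    return "default." + parts[0]               # default database

assert table_from_path(
    "hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/file_000000"
) == "default.table_name"
assert table_from_path(
    "hdfs://cdh1.host:8020/user/hive/warehouse/sales.db/orders/2018-01-01/file_000000"
) == "sales.orders"
```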

If a user specifies the 'Location' when defining Hive table (as in the following example), we have to use HiveMetaStoreClient to get the table name.  This is the external table case.

CREATE EXTERNAL TABLE weatherext ( wban INT, date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weatherext';

BTW, the way to get the file path for a Hive table is to issue the following command from the hive command line:

DESCRIBE FORMATTED table_name

A trie is the right data structure to hold locations, so we can do a prefix match on the location given the full path of an HDFS file.  We can build a location-to-table-name hash map, so we can get the table name once we get the location from the full path.

Caching all locations in a trie can cause high memory usage if you have millions of Hive tables.  Also, the trie needs to be updated when a new external table is created.
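A path-component trie for this lookup could be sketched like this (the function names and dict-of-dicts layout are illustrative assumptions):

```python
def trie_insert(trie, location, table):
    """Store a table location, one path component per trie level."""
    node = trie
    for part in location.strip("/").split("/"):
        node = node.setdefault(part, {})
    node["__table__"] = table

def trie_lookup(trie, path):
    """Longest-prefix match: walk the file path, remember the last table seen."""
    node, table = trie, None
    for part in path.strip("/").split("/"):
        if "__table__" in node:
            table = node["__table__"]
        if part not in node:
            return table
        node = node[part]
    return node.get("__table__", table)

trie = {}
trie_insert(trie, "/hive/data/weatherext", "weatherext")
assert trie_lookup(trie, "/hive/data/weatherext/2018-01-01/file_000000") == "weatherext"
assert trie_lookup(trie, "/other/path/file") is None
```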

Friday, March 30, 2018

Different locations of map reduce logs


HDP:

sudo -u hdfs hadoop fs -ls /app-logs/hdfs/logs

(7 days)

CDH5:

sudo -u hdfs hadoop fs -ls /tmp/logs/hdfs/logs/

(7 days)

EMR (AWS):

sudo -u hdfs hadoop fs -ls /var/logs/hdfs/logs/

(? days)

Default value for "yarn.log-aggregation.retain-seconds" is 7 days. 

If log aggregation is not enabled, the logs are in the local file system.  For example, MapR puts logs in the following place:

/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs/

Thursday, March 29, 2018

SPARK_HOME, etc.

export SPARK_HOME=/usr/hdp/2.6.3.0-235/spark2




JavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.driver.extraJavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.3.0-235


More info:
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java





Thursday, March 22, 2018

Additional steps for Ranger/Kerberos enabled Hadoop

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_security/content/ch06s01s01s01.html


Add values for the following properties in the "Custom kms-site" section. These properties allow the specified system users (hive, oozie, the user we are using, and others) to proxy on behalf of other users when communicating with Ranger KMS. This helps individual services (such as Hive) use their own keytabs, but retain the ability to access Ranger KMS as the end user (so access policies associated with the end user apply).
  • hadoop.kms.proxyuser.{hadoop-user}.users
  • hadoop.kms.proxyuser.{hadoop-user}.groups
  • hadoop.kms.proxyuser.{hadoop-user}.hosts