Sunday, September 9, 2018

Python




Python:

Python GIL (Global Interpreter Lock): CPython and PyPy have a GIL.
Jython and IronPython have no GIL.

Python IDE: PyCharm

SJCMACJ15JHTD8:~ jzeng$ python
Python 2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.python_implementation()
'CPython'

On Dec 12, I found ‘python’ no longer works and I have to explicitly use ‘python3.7’:

SJCMACJ15JHTD8:prometheus jzeng$ python
SJCMACJ15JHTD8:prometheus jzeng$ python3.7
Python 3.7.1 (default, Nov  6 2018, 18:45:35)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

‘pip’ has to be run this way:

SJCMACJ15JHTD8:prometheus jzeng$ python3.7 -m pip install prometheus_client

If I run it as ‘pip install prometheus_client’, I get the following error:

SJCMACJ15JHTD8:prometheus jzeng$ pip install prometheus_client
/usr/local/bin/pip: line 3: __requires__: command not found
/usr/local/bin/pip: line 4: import: command not found
/usr/local/bin/pip: line 5: import: command not found
from: can't read /var/mail/pkg_resources


Run Python built-in HTTP server on default port 8000:

(venv) SJCMACJ15JHTD8:myproject jzeng$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
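In Python 3, SimpleHTTPServer was merged into http.server, so the CLI equivalent is `python3 -m http.server 8000`. A minimal programmatic sketch (binding port 0 here so the OS picks any free port):

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Port 0 asks the OS for any free port; use 8000 to match the CLI default.
server = HTTPServer(("0.0.0.0", 0), SimpleHTTPRequestHandler)
print("Serving HTTP on port %d ..." % server.server_address[1])
# server.serve_forever()  # uncomment to actually serve requests
server.server_close()
```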

Find package full path:

>>> import threading
>>> print(threading)
<module 'threading' from '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.pyc'>

Run memory profiler to show memory is only increased, and never decreased:

(venv) SJCMACJ15JHTD8:myproject jzeng$ easy_install -U memory_profiler

(venv) SJCMACJ15JHTD8:myproject jzeng$ python -m memory_profiler memory-profile-me.py
Filename: memory-profile-me.py

Line #    Mem usage    Increment   Line Contents
================================================
     4    9.672 MiB    9.672 MiB   @profile
     5                             def function():
     6   48.367 MiB   38.695 MiB       x = list(range(1000000)) #allocate a big list
     7  137.395 MiB   89.027 MiB       y = copy.deepcopy(x)
     8  137.395 MiB    0.000 MiB       del x
     9  137.395 MiB    0.000 MiB       return y


Python memory model: Arena -> Pool -> Block.   There is a problem reclaiming memory, as shown in the above example.

List comprehension:
List comprehensions provide a concise way to create lists.
Every time a loop is run to massage the contents of a sequence, try to replace it with a list comprehension.
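For example, a loop that massages a sequence can usually be collapsed into a single expression:

```python
# Loop version: massage the contents of a sequence.
squares = []
for x in range(10):
    if x % 2 == 0:
        squares.append(x * x)

# Equivalent list comprehension: more concise, same result.
squares_lc = [x * x for x in range(10) if x % 2 == 0]
assert squares == squares_lc == [0, 4, 16, 36, 64]
```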
Yield and Generator (3 ways to use ‘yield’)
#1 allows you to pause a function and return an intermediate result
Python provides a shortcut to write simple generators over a sequence. A syntax similar to list comprehensions can be used to replace yield. Parentheses are used instead of brackets:
   >>> iter = (x**2 for x in range(10) if x % 2 == 0)
   >>> for el in iter:
   ...     print el
   ...
0
4
16
36
64
These kinds of expressions are called generator expressions or genexp
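A plain generator function does the same pausing with an explicit yield; a minimal sketch:

```python
def even_squares(n):
    """Yield the square of each even number below n."""
    for x in range(n):
        if x % 2 == 0:
            yield x ** 2  # execution suspends here until the next value is requested

gen = even_squares(10)
assert list(gen) == [0, 4, 16, 36, 64]
```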
#2 As of Python version 2.5, the yield statement is now allowed in the try clause of a try ... finally construct. If the generator is not resumed before it is finalized (by reaching a zero reference count or by being garbage collected), the generator-iterator's close()method will be called, allowing any pending finally clauses to execute.

#3 pytest supports execution of fixture specific finalization code when the fixture goes out of scope. By using a yield statement instead of return, all the code after the yield statement serves as the teardown code:
# content of conftest.py
 
import smtplib
import pytest
 
@pytest.fixture(scope="module")
def smtp_connection():
    smtp_connection = smtplib.SMTP("smtp.gmail.com", 587, timeout=5)
    yield smtp_connection  # provide the fixture value
    print("teardown smtp")
    smtp_connection.close()

The print and smtp_connection.close() statements will execute when the last test in the module has finished execution, regardless of the exception status of the tests.


Closure:

a closure is an instance of a function, a value, whose non-local variables have been bound either to values or to storage locations (depending on the language).

def f(x):
    def g(y):
        return x + y
    return g  # Return a closure.

def h(x):
    return lambda y: x + y  # Return a closure.

# Assigning specific closures to variables.
a = f(1)
b = h(1)

# Using the closures stored in variables.
assert a(5) == 6
assert b(5) == 6

# Using closures without binding them to variables first.
assert f(1)(5) == 6  # f(1) is the closure.
assert h(1)(5) == 6  # h(1) is the closure.


Decorator:  Use @symbol (this is called “pie” syntax) to decorate a function.

@my_decorator
def just_some_func():
    print("testing func")

One good example of a class decorator is line 1215 in PanWFAppBase (panBaseAPP.py) for @output(fmt=’xml’)
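my_decorator is not defined above; a minimal sketch of what such a decorator could look like (the wrapper behavior is my assumption, not from the source):

```python
import functools

def my_decorator(func):
    # Hypothetical decorator: wraps func and prints around the call.
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print("before %s" % func.__name__)
        result = func(*args, **kwargs)
        print("after %s" % func.__name__)
        return result
    return wrapper

@my_decorator
def just_some_func():
    print("testing func")

just_some_func()
assert just_some_func.__name__ == "just_some_func"  # thanks to functools.wraps
```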

Property & Descriptor:

class Movie(object):
    @property
    def budget(self):
        return self._budget

    @budget.setter
    def budget(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._budget = value
Python automatically calls the getter whenever anybody tries to access budget.  Likewise, Python automatically calls the budget.setter whenever it encounters code like m.budget = value.
Descriptor lets you customize what should be done when you refer to an attribute on an object.
Descriptor can define the __get__, __set__, or __delete__ method.
A good article about Descriptor: http://nbviewer.jupyter.org/urls/gist.github.com/ChrisBeaumont/5758381/raw/descriptor_writeup.ipynb

A descriptor that implements __get__ and __set__ is called a data descriptor.
A descriptor that just implements __get__ is called a non-data descriptor.
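A minimal sketch of a data descriptor (the class names here are illustrative, not from the source):

```python
class Positive(object):
    """Data descriptor: implements both __get__ and __set__."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        return obj.__dict__[self.name]

    def __set__(self, obj, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        obj.__dict__[self.name] = value

class Movie(object):
    budget = Positive("budget")  # attribute access is routed through the descriptor

m = Movie()
m.budget = 100
assert m.budget == 100
try:
    m.budget = -1
except ValueError:
    pass  # __set__ rejected the negative value
```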

With statement: providing a simple way to call some code before and after a block of code
>>> with file('/etc/hosts') as hosts:
...     for line in hosts:
...         print line

Another example:

# content of test_yield2.py
 
import smtplib
import pytest
 
@pytest.fixture(scope="module")
def smtp_connection():
    with smtplib.SMTP("smtp.gmail.com", 587, timeout=5) as smtp_connection:
        yield smtp_connection  # provide the fixture value

The smtp_connection will be closed after the tests finish execution, because the smtp_connection object automatically closes when the with statement ends.


context management protocol

Context managers allow you to allocate and release resources precisely when you want to. The most widely used example of context managers is the with statement. Suppose you have two related operations which you’d like to execute as a pair, with a block of code in between. Context managers allow you to do specifically that.

At the very least a context manager has an __enter__ and __exit__ method defined. Let’s make our own file-opening Context Manager and learn the basics.
class File(object):
    def __init__(self, file_name, method):
        self.file_obj = open(file_name, method)
    def __enter__(self):
        return self.file_obj
    def __exit__(self, type, value, traceback):
        self.file_obj.close()

Just by defining __enter__ and __exit__ methods we can use our new class in a with statement. Let’s try:
with File('demo.txt', 'w') as opened_file:
    opened_file.write('Hola!')

Our __exit__ method accepts three arguments. They are required by every __exit__ method that is part of a Context Manager class. Let’s talk about what happens under the hood.
1. The with statement stores the __exit__ method of the File class.
2. It calls the __enter__ method of the File class.
3. The __enter__ method opens the file and returns it.
4. The opened file handle is passed to opened_file.
5. We write to the file using .write().
6. The with statement calls the stored __exit__ method.
7. The __exit__ method closes the file.
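The same File behavior can also be sketched with contextlib.contextmanager, where the code before the yield plays the role of __enter__ and the finally block plays __exit__ (the open_file name is mine, not from the source):

```python
from contextlib import contextmanager
import os
import tempfile

@contextmanager
def open_file(file_name, method):
    file_obj = open(file_name, method)
    try:
        yield file_obj    # acts as __enter__'s return value
    finally:
        file_obj.close()  # acts as __exit__; runs even on exceptions

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open_file(path, "w") as opened_file:
    opened_file.write("Hola!")
assert opened_file.closed
```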
Another example is the Timer class in Prometheus client:


class Timer(object):

    def __init__(self, callback):
        self._callback = callback

    def _new_timer(self):
        return self.__class__(self._callback)

    def __enter__(self):
        self._start = default_timer()

    def __exit__(self, typ, value, traceback):
        # Time can go backwards.
        duration = max(default_timer() - self._start, 0)
        self._callback(duration)

    def __call__(self, f):
        def wrapped(func, *args, **kwargs):
            # Obtaining new instance of timer every time
            # ensures thread safety and reentrancy.
            with self._new_timer():
                return func(*args, **kwargs)

        return decorate(f, wrapped)


MRO (Method Resolution Order) is based on C3 which builds the linearization of a class.
In Python, the base classes are not implicitly called in __init__, and so it is up to the developer to call them.
>>> print file.__mro__
(<type 'file'>, <type 'object'>)   # new-style class
>>> import inspect
>>> print inspect.__mro__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute '__mro__'   # modules (and old-style classes) have no __mro__
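A small diamond-inheritance sketch showing the C3 linearization and the explicit base-class calls:

```python
class A(object):
    def __init__(self):
        self.order = ["A"]

class B(A):
    def __init__(self):
        super(B, self).__init__()  # base __init__ must be called explicitly
        self.order.append("B")

class C(A):
    def __init__(self):
        super(C, self).__init__()
        self.order.append("C")

class D(B, C):
    def __init__(self):
        super(D, self).__init__()
        self.order.append("D")

# C3 linearization: D -> B -> C -> A -> object
assert [k.__name__ for k in D.__mro__] == ["D", "B", "C", "A", "object"]
assert D().order == ["A", "C", "B", "D"]  # cooperative super() follows the MRO
```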



Thursday, June 28, 2018

Inotify

Available starting from CDH 5.4 and Hadoop 2.6.

Only a superuser can use it.

No filtering.

org.apache.hadoop.hdfs.inotify.Event.*
org.apache.hadoop.hdfs.DFSClient

Monday, April 30, 2018

HCatalog: Find Hive table name from file path

I believe I have talked about this logic multiple times, but I am still being asked, so I decided to document it; next time I only need to send a link to this page.

From /etc/hive/conf/hive-site.xml, look for the following property to find the root location for all hive tables:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>

Then, your hive file should look like:

hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/.../file_000000

Do a substring search for "/user/hive/warehouse" and the table name is right after it (of course, we need to strip the leading /).

The above algorithm only works if a user uses the default location and default database.

If a table is under non-default database, the full path will be something like:

hdfs://cdh1.host:8020/user/hive/warehouse/database_name.db/table_name/2018-01-01/.../file_000000
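The substring logic for the two default-location cases above can be sketched as a small helper (the function name is mine, not from the source):

```python
def table_from_path(path, warehouse="/user/hive/warehouse"):
    """Hypothetical helper: extract db.table from a default-location HDFS path."""
    idx = path.find(warehouse)
    if idx < 0:
        return None  # custom 'Location'; need HiveMetaStoreClient instead
    parts = path[idx + len(warehouse):].strip("/").split("/")
    if parts[0].endswith(".db"):               # non-default database
        return parts[0][:-3] + "." + parts[1]
    return "default." + parts[0]               # default database

assert table_from_path(
    "hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/file_000000"
) == "default.table_name"
assert table_from_path(
    "hdfs://cdh1.host:8020/user/hive/warehouse/sales.db/orders/2018-01-01/file_000000"
) == "sales.orders"
```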

If a user specifies the 'Location' when defining Hive table (as in the following example), we have to use HiveMetaStoreClient to get the table name.  This is the external table case.

CREATE EXTERNAL TABLE weatherext ( wban INT, date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weatherext';

BTW, the way to get the file path for a Hive table is to issue the following command from the hive command line:

DESCRIBE FORMATTED table_name

A trie is the right data structure to hold locations, so we can do a prefix match on the location given the full path of an HDFS file.  We can build a location-to-table-name hash map, so we can get the table name once we get the location from the full path.

Caching all locations in a trie can cause high memory usage if you have millions of Hive tables.  Also, the trie needs to be updated when a new external table is created.
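A path-component trie for this lookup could be sketched like this (the function names and dict-of-dicts layout are illustrative assumptions):

```python
def trie_insert(trie, location, table):
    """Store a table location, one path component per trie level."""
    node = trie
    for part in location.strip("/").split("/"):
        node = node.setdefault(part, {})
    node["__table__"] = table

def trie_lookup(trie, path):
    """Longest-prefix match: walk the file path, remember the last table seen."""
    node, table = trie, None
    for part in path.strip("/").split("/"):
        if "__table__" in node:
            table = node["__table__"]
        if part not in node:
            return table
        node = node[part]
    return node.get("__table__", table)

trie = {}
trie_insert(trie, "/hive/data/weatherext", "weatherext")
assert trie_lookup(trie, "/hive/data/weatherext/2018-01-01/file_000000") == "weatherext"
assert trie_lookup(trie, "/other/path/file") is None
```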

Friday, March 30, 2018

Different locations of map reduce logs


HDP:

sudo -u hdfs hadoop fs -ls /app-logs/hdfs/logs

(7 days)

CDH5:

sudo -u hdfs hadoop fs -ls /tmp/logs/hdfs/logs/

(7 days)

EMR (AWS):

sudo -u hdfs hadoop fs -ls /var/logs/hdfs/logs/

(? days)

Default value for "yarn.log-aggregation.retain-seconds" is 7 days. 

If log aggregation is not enabled, the logs are in the local file system.  For example, MapR puts logs in the following place:

/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs/

Thursday, March 29, 2018

SPARK_HOME, etc.

export SPARK_HOME=/usr/hdp/2.6.3.0-235/spark2




JavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.driver.extraJavaOptions=-Dhdp.version=2.6.3.0-235 -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.3.0-235


More info:
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java





Thursday, March 22, 2018

Additional steps for Ranger/Kerberos enabled Hadoop

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_security/content/ch06s01s01s01.html


Add values for the following properties in the "Custom kms-site" section. These properties allow the specified system users (hive, oozie, the user we are using, and others) to proxy on behalf of other users when communicating with Ranger KMS. This helps individual services (such as Hive) use their own keytabs, but retain the ability to access Ranger KMS as the end user (so access policies associated with the end user apply).
  • hadoop.kms.proxyuser.{hadoop-user}.users
  • hadoop.kms.proxyuser.{hadoop-user}.groups
  • hadoop.kms.proxyuser.{hadoop-user}.hosts