Monday, April 30, 2018

HCatalog: Find Hive table name from file path

I believe I have talked about this logic multiple times, but I still get asked about it, so I decided to document it here. Next time someone asks, I only need to send a link to this page.

In /etc/hive/conf/hive-site.xml, look for the following property; it tells you the root location of all Hive tables:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
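
For reference, the same property can also be read programmatically. Below is a minimal sketch in Java, assuming hive-site.xml is on the classpath so HiveConf can pick it up (the class name is just for illustration):

import org.apache.hadoop.hive.conf.HiveConf;

public class WarehouseDir {
    public static void main(String[] args) {
        // HiveConf loads hive-site.xml from the classpath, so this returns the
        // configured warehouse root, falling back to the usual default value.
        HiveConf conf = new HiveConf();
        String warehouseDir = conf.get("hive.metastore.warehouse.dir", "/user/hive/warehouse");
        System.out.println(warehouseDir);
    }
}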

Then the full path of a Hive data file will look like this:

hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/.../file_000000

Do a substring search for "/user/hive/warehouse"; the table name is the path segment immediately after it (you just need to strip the leading /).

The above algorithm only works if the table uses the default location and lives in the default database.

If a table is in a non-default database, the full path will look like this:

hdfs://cdh1.host:8020/user/hive/warehouse/database_name.db/table_name/2018-01-01/.../file_000000
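
To make the parsing concrete, here is a minimal sketch in plain Java (no Hive dependency) that handles both the default-database and the database_name.db layouts; the tableFromPath helper and the sample paths are just illustrations:

import java.net.URI;

public class TableNameFromPath {

    // Returns "database.table" for a file under the warehouse root, or null otherwise.
    static String tableFromPath(String fullPath, String warehouseDir) {
        // Strip the scheme and authority (hdfs://cdh1.host:8020) so only the path is left.
        String path = URI.create(fullPath).getPath();
        if (!path.startsWith(warehouseDir + "/")) {
            return null;  // not under the default warehouse location
        }
        String[] segments = path.substring(warehouseDir.length() + 1).split("/");
        if (segments[0].endsWith(".db") && segments.length > 1) {
            // Non-default database: .../warehouse/database_name.db/table_name/...
            String db = segments[0].substring(0, segments[0].length() - 3);
            return db + "." + segments[1];
        }
        // Default database: .../warehouse/table_name/...
        return "default." + segments[0];
    }

    public static void main(String[] args) {
        String warehouse = "/user/hive/warehouse";
        System.out.println(tableFromPath(
                "hdfs://cdh1.host:8020/user/hive/warehouse/table_name/2018-01-01/file_000000", warehouse));
        System.out.println(tableFromPath(
                "hdfs://cdh1.host:8020/user/hive/warehouse/database_name.db/table_name/2018-01-01/file_000000", warehouse));
    }
}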

If a user specifies the LOCATION clause when defining a Hive table (as in the following example), we have to use HiveMetaStoreClient to get the table name. This is the external table case.

CREATE EXTERNAL TABLE weatherext (wban INT, date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weatherext';
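
A minimal sketch of that with HiveMetaStoreClient is below. It assumes hive-site.xml (with hive.metastore.uris) is on the classpath; it walks every database and table and records each table's Location, whether the table is managed or external (the class name and the map are just for illustration):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class LocationToTableMap {
    public static void main(String[] args) throws Exception {
        // HiveConf picks up hive-site.xml (and hive.metastore.uris) from the classpath.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

        // Map from table location (e.g. hdfs://.../hive/data/weatherext) to "db.table".
        Map<String, String> locationToTable = new HashMap<>();
        for (String db : client.getAllDatabases()) {
            for (String tableName : client.getAllTables(db)) {
                Table table = client.getTable(db, tableName);
                String location = table.getSd().getLocation();
                if (location != null) {
                    locationToTable.put(location, db + "." + tableName);
                }
            }
        }
        client.close();
        System.out.println(locationToTable);
    }
}

The Location recorded here is the same value that DESCRIBE FORMATTED reports.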

BTW, the way to find the file path for a Hive table is to issue the following command from the Hive command line:

DESCRIBE FORMATTED table_name;

A trie is the right data structure to hold the locations, so we can do a prefix match of the known locations against the full path of an HDFS file. Combined with a location-to-table-name hash map (or by storing the table name directly in the trie node), we can get the table name once we find the matching location.
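
Here is a minimal sketch of that idea. I key the trie by path segment instead of by character (my choice; a character trie would work just as well), store the table name directly on the node that ends a location, and assume the scheme and authority have already been stripped from both the locations and the file path:

import java.util.HashMap;
import java.util.Map;

public class LocationTrie {

    // Each node is keyed by one path segment; tableName is non-null where a table location ends.
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String tableName;
    }

    private final Node root = new Node();

    // Register a table location, e.g. insert("/user/hive/warehouse/db.db/t", "db.t").
    void insert(String location, String tableName) {
        Node node = root;
        for (String segment : location.split("/")) {
            if (segment.isEmpty()) continue;
            node = node.children.computeIfAbsent(segment, k -> new Node());
        }
        node.tableName = tableName;
    }

    // Walk the file path segment by segment and return the deepest table location prefixing it.
    String lookup(String filePath) {
        Node node = root;
        String match = null;
        for (String segment : filePath.split("/")) {
            if (segment.isEmpty()) continue;
            node = node.children.get(segment);
            if (node == null) break;
            if (node.tableName != null) match = node.tableName;
        }
        return match;
    }

    public static void main(String[] args) {
        LocationTrie trie = new LocationTrie();
        trie.insert("/user/hive/warehouse/database_name.db/table_name", "database_name.table_name");
        trie.insert("/hive/data/weatherext", "default.weatherext");
        System.out.println(trie.lookup("/user/hive/warehouse/database_name.db/table_name/2018-01-01/file_000000"));
        System.out.println(trie.lookup("/hive/data/weatherext/file_000000"));
    }
}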

Caching all locations in a trie can cause high memory usage if you have millions of Hive tables. The trie also needs to be updated whenever a new table (in particular, a new external table) is created.