Here we see Hive external tables on JSON data files in action and being reversed engineered from Hive into ODI. Carrying on from my earlier post on Hive reverse engineering here we will tackle external tables with primitive data types. Just like Oracle external tables for accessing data on the filesystem, Hive external tables can access big data files in HDFS. This example uses a JSON SerDe for Hive which was downloaded from here. The external tables in Hive are quite nice! You just specify the HDFS directory and all files in that directory are projected through the external table definition, also you can specify sophisticated SerDe classes that comply with the Hive SerDe interfaces - these classes can reverse engineer complex types also which make them very useful.
Before launching ODI or the agent for this I added the SerDe JAR file to this environment variable used by ODI, here is what I did in my system;
- export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:/home/oracle/json/hive-json-serde-0.2.jar
I have 2 JSON files with data in my HDFS under /user/oracle/json I will specify this directory in the external table definition, along with the JSON SerDe. Within my Hive shell I can define the external table, remembering to add the JSON SerDe file (using the ADD JAR command) just before I create the external table.
- ADD JAR /home/oracle/json/hive-json-serde-0.2.jar;
- CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
- field1 string, field2 int, field3 string, field4 double
- )
- ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
- LOCATION '/user/oracle/json';
The data in the first JSON file looks like the following, this was just an example from the google code snippets;
- {"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
- {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
- {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
- {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
I have defined these fields in the external table. Back in ODI, in my model below I am using the customized reverse with the RKM Hive mentioned above, also for this reverse I just want one table imported, so I defined the mask just to be my_table. Clicking the 'Reverse Engineer' button on the top left of the panel starts the reverse engineering.
If there are errors, then check the ODI Operator, that's where you will see anything that goes wrong. The table in ODI has just some rudimentary information about the number of columns, all of the external table definition for this is defined in Hive, this is the same as when we reverse engineer an Oracle external table. Once the table is captured in ODI I can view the data;
The custom Hive RKM I used is here - this has some changes to the one shipped with ODI to include external table reverse (I included EXTERNAL_TABLE in the getTables API call), plus support for custom SerDe JARs loaded via the environment variable ODI_HIVE_SESSION_JARS (just like the ODI IKMs load the JARs, the RKM also needs to).
Anyway, that's a quick peek at external tables in Hive and reverse engineering them into ODI. Once the table is captured in ODI you can use it in mappings just like other tables!