The Hive external table let's us do lots of cool stuff including processing data from NoSQL. We have seen how custom SerDes are used, Hive storage handlers also provide some cool capabilities. Using the Hive storage handler defined here, an external table can be defined to project data from a NoSQL key-value store. The external table can then be used as a source in ODI, very simple.
The illustration on github has the following data stored in a Oracle NoSQL Database (the key is the lastname/firstname, the value is name-value pair):
- /Smith/Bob/-/birthdate: 05/02/1975
- /Smith/Bob/-/phonenumber: 1111-1111
- /Smith/Bob/-/userid: 1
- /Smith/Patricia/-/birthdate: 10/25/1967
- /Smith/Patricia/-/phonenumber: 2222-2222
- /Smith/Patricia/-/userid: 2
- /Wong/Bill/-/birthdate: 03/10/1982
- /Wong/Bill/-/phonenumber: 3333-3333
- /Wong/Bill/-/userid: 3
Using the Hive external table and the custom storage handler for a key value store, we define a mask to project the data through the external table.
- ADD JAR /home/oracle/kv/HiveKVStorageHandler.jar;
- CREATE EXTERNAL TABLE MY_KV_TABLE (lastname string, firstname string, birthdate string, phonenumber string, userid string)
- STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'
- WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "lastname,firstname", "kv.minor.keys.mapping" = "birthdate,phonenumber,userID")
- TBLPROPERTIES ("kv.host.port" = "localhost:5000", "kv.name" = "kvstore");
There are a few interesting properties here;
- we specify the keyvalue store using TBLPROPERTIES, identify the host/port and the keystore name (kvstore).
- the SerDe properties contains the mapping of the keys to column names, you will get a row for each value of birthdate, phonenumber, userID
Fairly straightforward. We can then reverse engineer this into ODI, using the same mechanism as I described in previous posts here setting the ODI_HIVE_SESSION_JARS and so forth. The data projected looks like this;
- hive> SELECT * FROM MY_KV_TABLE;
- OK
- Smith Patricia 10/25/1967 NULLNULL
- SmithPatriciaNULL2222-2222 NULL
- SmithPatriciaNULLNULL2
- SmithBob05/02/1975NULLNULL
- SmithBobNULL1111-1111NULL
- SmithBobNULLNULL1
- WongBill03/10/1982NULLNULL
- WongBillNULL3333-3333NULL
- WongBillNULLNULL3
In ODI by defining the Hive collect_set function as an aggregation function, we can then aggregate the data and pivot the data to get it as a row;
- SELECT lastname, firstname, collect_set(birthdate)[0], collect_set(phonenumber)[0], collect_set(userid)[0]
- FROM MY_KV_TABLE
- GROUP BY lastname, firstname;
So another interesting illustration of external tables in Hive and what they can provide.