
Three Little Hive UDFs: Part 2


Introduction

In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF.  Today we'll move on to the UDTF, which generates multiple rows for each row processed.  This UDF built its house from sticks: it's slightly more complicated than the basic UDF and gives us an opportunity to explore how Hive functions manage type checking.

We'll step through some of the more interesting pieces, but as before, the full source is available on GitHub.

Extending GenericUDTF

 Our UDTF is going to produce pairwise combinations of elements in a comma-separated string.  So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:

  • Apples, Bananas
  • Apples, Carrots
  • Bananas, Carrots

As with the UDF, the first few lines are a simple class extension with an annotation so that Hive can describe what the function does.

@Description(name = "pairwise", value = "_FUNC_(doc) - emits pairwise combinations of an input array")
public class PairwiseUDTF extends GenericUDTF {

  private PrimitiveObjectInspector stringOI = null;

We also declare a PrimitiveObjectInspector field, which we'll use to ensure that the input is a string.  With that in place, we need to override methods for initialization, row processing, and cleanup.
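The snippets below omit the imports.  Everything comes from the standard Java and Hive libraries; assuming a stock Hive dependency on the classpath, the top of the file would look roughly like this:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;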

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {

    if (args.length != 1) {
      throw new UDFArgumentException("pairwise() takes exactly one argument");
    }

    // Reject anything that is not a primitive string column.
    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
        || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
           PrimitiveObjectInspector.PrimitiveCategory.STRING) {
      throw new UDFArgumentException("pairwise() takes a string as a parameter");
    }

    stringOI = (PrimitiveObjectInspector) args[0];

This UDTF is going to return an array of structs, so the initialize method needs to return a StructObjectInspector object.  Note that the arguments to initialize come in as an array of ObjectInspector objects rather than concrete values.  This allows us to handle arguments in a "normal" fashion but with the benefit of methods to broadly inspect type.  We only allow a single argument -- the string column to be processed -- so we check the length of the array and validate that the sole element is both a primitive and a string.

The second half of the initialize method is more interesting: 

    List<String> fieldNames = new ArrayList<String>(2);
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
    fieldNames.add("memberA");
    fieldNames.add("memberB");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

Here we set up information about what the UDTF returns.  We need this in place before we start processing rows; otherwise Hive can't correctly build execution plans before submitting jobs to MapReduce.  Each struct we return will hold two strings, which means we need ObjectInspector objects for the values as well as names for the fields.  We create two lists: one of strings for the field names, the other of ObjectInspector objects for the field values.  We pack them manually and then use a factory to get the StructObjectInspector which defines the actual return value.
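As a side note, the field names we chose here become the default column names of the UDTF's output if you don't alias them; unaliased, pairwise(basket) should come back with this schema (both fields Java strings, per the inspectors above):

memberA    string
memberB    string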

Now we're ready to actually do some processing, so we override the process method.

  @Override
  public void process(Object[] record) throws HiveException {
    final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

    if (document == null) {
      return;
    }

    String[] members = document.split(",");
    java.util.Arrays.sort(members);
    // Walk the sorted array and emit each distinct pair once (j starts at i + 1).
    for (int i = 0; i < members.length - 1; i++)
      for (int j = i + 1; j < members.length; j++)
        if (!members[i].equals(members[j]))
          forward(new Object[] {members[i], members[j]});
  }

This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop.  There are, though, some interesting things to note.  First, to actually get a string object to operate on, we have to use an ObjectInspector and some typecasting.  This allows us to bail out early if the column value is null.  Once we have the string, splitting, sorting, and looping is textbook stuff.  
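If you want to sanity-check the pair logic outside of Hive, here's a minimal standalone harness (my own sketch, not part of the original post) that runs the same split/sort/loop against the example string from the introduction, printing instead of calling forward:

public class PairwiseCheck {
  public static void main(String[] argv) {
    String[] members = "Apples,Bananas,Carrots".split(",");
    java.util.Arrays.sort(members);
    // Same nested loop as process(), with System.out standing in for forward().
    for (int i = 0; i < members.length - 1; i++)
      for (int j = i + 1; j < members.length; j++)
        if (!members[i].equals(members[j]))
          System.out.println(members[i] + ", " + members[j]);
  }
}

Running it prints the three pairs listed earlier: Apples, Bananas; Apples, Carrots; Bananas, Carrots.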

The last notable piece is that the process method does not return anything.  Instead, we call forward to emit our newly created structs.  For those used to database internals, this follows the producer-consumer model found in most RDBMSs.  For those used to MapReduce semantics, it's the equivalent of calling write on the Context object.

  @Override
  public void close() throws HiveException {
    // do nothing
  }
}

If there were any cleanup to do, we'd take care of it here.  But this is simple emission, so our override doesn't need to do anything.
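One thing worth knowing: forward can still be called from close, so a UDTF that buffers state across rows would flush the stragglers here.  A hypothetical sketch (pendingRow is an invented Object[] field, not part of this UDTF):

  @Override
  public void close() throws HiveException {
    // Flush any row buffered during process() that hasn't been emitted yet.
    if (pendingRow != null) {
      forward(pendingRow);
      pendingRow = null;
    }
  }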

Using the UDTF

Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it to a temporary function.  However, mixing the results of a UDTF with other columns from the base table requires that we use a LATERAL VIEW.

-- Add the jar
ADD JAR /mnt/shared/market_basket_example/pairwise.jar;

-- Create a temporary function
CREATE TEMPORARY FUNCTION pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';

-- View the pairwise expansion output
SELECT m1, m2, COUNT(*) FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1, m2
GROUP BY m1, m2;
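Two quick asides on usage.  First, DESCRIBE FUNCTION pairwise; will echo the _FUNC_ string from the @Description annotation at the top of the class.  Second, a UDTF can be called on its own, with nothing else in the select list, which makes for a quick smoke test before involving LATERAL VIEW:

SELECT pairwise(basket) AS (m1, m2) FROM market_basket;

Hive restricts this form -- no other expressions may appear alongside the UDTF call -- which is exactly why the grouped query above needs the LATERAL VIEW.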

