
Three Little Hive UDFs: Part 2


Introduction

In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF.  Today we'll move on to the UDTF, which generates multiple rows for each row processed.  This UDF built its house from sticks: it's slightly more complicated than the basic UDF and gives us an opportunity to explore how Hive functions manage type checking.

We'll step through some of the more interesting pieces, but as before, the full source is available on GitHub.

Extending GenericUDTF

 Our UDTF is going to produce pairwise combinations of elements in a comma-separated string.  So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:

  • Apples, Bananas
  • Apples, Carrots
  • Bananas, Carrots

As with the UDF, the first few lines are a simple class extension with an annotation so that Hive can describe what the function does.

@Description(name = "pairwise", value = "_FUNC_(doc) - emits pairwise combinations of an input array")
public class PairwiseUDTF extends GenericUDTF {

  private PrimitiveObjectInspector stringOI = null;

We also declare a PrimitiveObjectInspector field, which we'll use to ensure that the input is a string.  With that in place, we need to override methods for initialization, row processing, and cleanup.
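The snippets below omit the imports.  Everything comes from the standard Java and Hive libraries; assuming a stock Hive dependency on the classpath, the top of the file would look roughly like this:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;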

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {

    if (args.length != 1) {
      throw new UDFArgumentException("pairwise() takes exactly one argument");
    }

    // Reject anything that is not a primitive string column.
    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
        || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
           PrimitiveObjectInspector.PrimitiveCategory.STRING) {
      throw new UDFArgumentException("pairwise() takes a string as a parameter");
    }

    stringOI = (PrimitiveObjectInspector) args[0];

This UDTF is going to return an array of structs, so the initialize method needs to return a StructObjectInspector object.  Note that the arguments to initialize come in as an array of ObjectInspector objects rather than concrete values.  This allows us to handle arguments in a "normal" fashion but with the benefit of methods to broadly inspect type.  We only allow a single argument -- the string column to be processed -- so we check the length of the array and validate that the sole element is both a primitive and a string.

The second half of the initialize method is more interesting: 

    List<String> fieldNames = new ArrayList<String>(2);
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
    fieldNames.add("memberA");
    fieldNames.add("memberB");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

Here we set up information about what the UDTF returns.  We need this in place before we start processing rows; otherwise Hive can't correctly build execution plans before submitting jobs to MapReduce.  Each struct we return will hold two strings, which means we need ObjectInspector objects for the values as well as names for the fields.  We create two lists: one of strings for the field names, the other of ObjectInspector objects for the field values.  We pack them manually and then use a factory to get the StructObjectInspector which defines the actual return value.
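As a side note, the field names we chose here become the default column names of the UDTF's output if you don't alias them; unaliased, pairwise(basket) should come back with this schema (both fields Java strings, per the inspectors above):

memberA    string
memberB    string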

Now we're ready to actually do some processing, so we override the process method.

  @Override
  public void process(Object[] record) throws HiveException {
    final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

    if (document == null) {
      return;
    }

    String[] members = document.split(",");
    java.util.Arrays.sort(members);
    // Walk the sorted array and emit each distinct pair once (j starts at i + 1).
    for (int i = 0; i < members.length - 1; i++)
      for (int j = i + 1; j < members.length; j++)
        if (!members[i].equals(members[j]))
          forward(new Object[] {members[i], members[j]});
  }

This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop.  There are, though, some interesting things to note.  First, to actually get a string object to operate on, we have to use an ObjectInspector and some typecasting.  This allows us to bail out early if the column value is null.  Once we have the string, splitting, sorting, and looping is textbook stuff.  
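If you want to sanity-check the pair logic outside of Hive, here's a minimal standalone harness (my own sketch, not part of the original post) that runs the same split/sort/loop against the example string from the introduction, printing instead of calling forward:

public class PairwiseCheck {
  public static void main(String[] argv) {
    String[] members = "Apples,Bananas,Carrots".split(",");
    java.util.Arrays.sort(members);
    // Same nested loop as process(), with System.out standing in for forward().
    for (int i = 0; i < members.length - 1; i++)
      for (int j = i + 1; j < members.length; j++)
        if (!members[i].equals(members[j]))
          System.out.println(members[i] + ", " + members[j]);
  }
}

Running it prints the three pairs listed earlier: Apples, Bananas; Apples, Carrots; Bananas, Carrots.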

The last notable piece is that the process method does not return anything.  Instead, we call forward to emit our newly created structs.  For those used to database internals, this follows the producer-consumer model found in most RDBMSs.  For those used to MapReduce semantics, it's the equivalent of calling write on the Context object.

  @Override
  public void close() throws HiveException {
    // do nothing
  }
}

If there were any cleanup to do, we'd take care of it here.  But this is simple emission, so our override doesn't need to do anything.
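One thing worth knowing: forward can still be called from close, so a UDTF that buffers state across rows would flush the stragglers here.  A hypothetical sketch (pendingRow is an invented Object[] field, not part of this UDTF):

  @Override
  public void close() throws HiveException {
    // Flush any row buffered during process() that hasn't been emitted yet.
    if (pendingRow != null) {
      forward(pendingRow);
      pendingRow = null;
    }
  }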

Using the UDTF

Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it to a temporary function.  However, mixing the results of a UDTF with other columns from the base table requires that we use a LATERAL VIEW.

-- Add the jar
ADD JAR /mnt/shared/market_basket_example/pairwise.jar;

-- Create a temporary function
CREATE TEMPORARY FUNCTION pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';

-- View the pairwise expansion output
SELECT m1, m2, COUNT(*) FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1, m2
GROUP BY m1, m2;
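Two quick asides on usage.  First, DESCRIBE FUNCTION pairwise; will echo the _FUNC_ string from the @Description annotation at the top of the class.  Second, a UDTF can be called on its own, with nothing else in the select list, which makes for a quick smoke test before involving LATERAL VIEW:

SELECT pairwise(basket) AS (m1, m2) FROM market_basket;

Hive restricts this form -- no other expressions may appear alongside the UDTF call -- which is exactly why the grouped query above needs the LATERAL VIEW.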

