Introduction
In our ongoing series of posts explaining the ins and outs of Hive User Defined Functions, we're starting with the simplest case. Of the three little UDFs, today's entry builds the straw house: simple, easy to put together, but limited in applicability. We'll walk through the important parts of the code, but you can grab the whole source from GitHub here.
Extending UDF
The first few lines of interest are very straightforward:
@Description(name = "moving_avg",
             value = "_FUNC_(x, n) - Returns the moving mean of a set of numbers over a window of n observations")
@UDFType(deterministic = false, stateful = true)
public class UDFSimpleMovingAverage extends UDF {
We're extending the UDF class with some decoration. The decoration is important for usability and functionality. The Description decorator lets us give Hive some information to show users about how to use our UDF and what its method signature will be. The UDFType decoration tells Hive what sort of behavior to expect from our function.
A deterministic UDF will always return the same output given a particular input. A square-root-computing UDF will always return the same square root for 4, so we can say it is deterministic; a call to get the system time would not be. The stateful annotation of the UDFType decoration is relatively new to Hive (e.g., CDH4 and above). The stateful directive allows Hive to keep some static variables available across rows. The simplest example of this is a "row-sequence," which maintains a static counter that increments with each row processed.
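To make the stateful idea concrete, here is a minimal sketch of such a row-sequence UDF. The class and field names are illustrative rather than part of today's function, and it assumes the same Hive UDF API we use below:

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

@Description(name = "row_sequence", value = "_FUNC_() - Returns a sequential row number")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF {
  // Because the UDF is stateful, Hive keeps this counter alive
  // across all the rows a single task processes.
  private LongWritable result = new LongWritable(0);

  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}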
Since square-root and row-counting aren't terribly interesting, we'll use the stateful annotation to build a simple moving average function. We'll return to the notion of a moving average later when we build a UDAF, so as to compare the two approaches.
private DoubleWritable result = new DoubleWritable();
private static ArrayDeque<Double> window;
int windowSize;

public UDFSimpleMovingAverage() {
  result.set(0);
}
The above code is basic initialization. We make a double in which to hold the result, but it needs to be of class DoubleWritable so that MapReduce can properly serialize the data. We use a deque to hold our sliding window, and we need to keep track of the window's size. Finally, we implement a constructor for the UDF class.
public DoubleWritable evaluate(DoubleWritable v, IntWritable n) {
  double sum = 0.0;
  double moving_average;
  double residual;

  if (window == null) {
    window = new ArrayDeque<Double>();
  }
Here's the meat of the class: the evaluate method. This method will be called on each row by the map tasks. For any given row, we can't say whether or not our sliding window exists, so we initialize it if it's null.
  // slide the window
  if (window.size() == n.get())
    window.pop();
  window.addLast(new Double(v.get()));

  // compute the average
  for (Iterator<Double> i = window.iterator(); i.hasNext();)
    sum += i.next().doubleValue();
Here we deal with the deque and compute the sum of the window's elements. Deques are essentially double-ended queues, so they make excellent sliding windows. If the window is full, we pop the oldest element and add the current value.
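To see the sliding behavior in isolation, here is a small, self-contained Java snippet (outside Hive, with made-up values) that runs a three-element window over a stream of numbers in the same way:

import java.util.ArrayDeque;

public class WindowDemo {
  public static void main(String[] args) {
    ArrayDeque<Double> window = new ArrayDeque<Double>();
    int windowSize = 3;
    double[] stream = {1.0, 2.0, 3.0, 4.0, 5.0};

    for (double v : stream) {
      if (window.size() == windowSize)
        window.pop();        // drop the oldest value from the front
      window.addLast(v);     // append the newest value at the back

      double sum = 0.0;
      for (double d : window)
        sum += d;
      System.out.println(window + " -> " + (sum / window.size()));
    }
  }
}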
  moving_average = sum / window.size();
  result.set(moving_average);

  return result;
}
Computing the moving average without weighting is simply a matter of dividing the sum of our window by its size. We then set that value in our Writable variable and return it. The value is then emitted as part of the map task executing the UDF.
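If you want to sanity-check the arithmetic before deploying the JAR, a small local driver along these lines should do it, assuming hive-exec and hadoop-common are on the classpath and that the DoubleWritable import matches the one used in the UDF itself (the sample values are made up):

import org.apache.hadoop.hive.serde2.io.DoubleWritable;  // use the same DoubleWritable as the UDF
import org.apache.hadoop.io.IntWritable;

public class MovingAverageDriver {
  public static void main(String[] args) {
    UDFSimpleMovingAverage udf = new UDFSimpleMovingAverage();
    IntWritable windowSize = new IntWritable(3);
    double[] values = {10.0, 20.0, 30.0, 40.0};  // made-up sample data

    for (double v : values) {
      DoubleWritable avg = udf.evaluate(new DoubleWritable(v), windowSize);
      System.out.println("value = " + v + ", moving_avg = " + avg.get());
    }
    // Expected: 10.0, 15.0, 20.0, then 30.0 once the window is full.
  }
}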
Going Further
The stateful annotation made it simple for us to compute a moving average, since we could keep the deque static. But how would we compute a moving average if there were no notion of state between Hadoop tasks? At the end of the series we'll examine a UDAF that does this, but the algorithm ends up being quite different. In the meantime, I challenge the reader to think about what sort of approach is needed to compute the window.