In the first two parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution, focusing on the functions ore.doEval / rqEval and ore.tableApply / rqTableEval. In this blog post, we’ll cover the next in our theme and variation series involving ore.groupApply and the corresponding definitions required for SQL execution. The “group apply” function is one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on different partitions of data. This functionality is essential to enable the building of potentially 10s or 100s of thousands of predictive models, e.g., one per customer, and for taking advantage of high-performance computing hardware like Exadata.
Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically partitioning and passing data to parallel executing R engines. It ensures that all R function executions for all partitions complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result.
The variation on embedded R execution for ore.groupApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also an INDEX argument that specifies the name of a column by which the rows will be partitioned for processing by a user-defined R function.
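To make the partitioning semantics concrete, here is a minimal local sketch in base R. It is only an analogy: it uses split and lapply on the built-in mtcars data set rather than ore.groupApply, it runs in a single client-side R session, and the helper name local_group_apply is hypothetical.

```r
# Local, single-process analogy of "group apply" (not the ORE implementation).
# local_group_apply is a hypothetical helper name used only for illustration.
local_group_apply <- function(data, index, fun, ...) {
  # Partition the rows by the values of the index column.
  partitions <- split(data, data[[index]])
  # Invoke the same function once per partition; each call receives a data.frame.
  lapply(partitions, fun, ...)
}

# One linear model per cylinder count in the built-in mtcars data.
models <- local_group_apply(mtcars, "cyl", function(dat) {
  dat$cyl <- NULL  # the partitioning column is constant within a partition
  lm(mpg ~ wt, data = dat)
})

names(models)  # one entry per distinct value of the index column
```

The result is a named list with one element per partition, which mirrors the shape of the list that ore.groupApply returns.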
Let’s look at an example. We’re going to use the C50 package to build a C5.0 decision tree model on the churn data set from C50. The goal is to build one churn model on the data for each state.
library(C50)
data(churn)
ore.create(churnTrain, "CHURN_TRAIN")
modList <- ore.groupApply(
  CHURN_TRAIN,
  INDEX = CHURN_TRAIN$state,
  function(dat) {
    library(C50)
    dat$state <- NULL
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    dat$international_plan <- as.factor(dat$international_plan)
    dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
    C5.0(churn ~ ., data = dat, rules = TRUE)
  })
mod.MA <- ore.pull(modList$MA)
summary(mod.MA)
A few points to highlight:
• As noted in Part 2 of this series, to use the CRAN package C50 on the client, we first load the library, and then the churn data set.
• Since the data is a data.frame, we’ll create a table in the database with this data. Notice that if you compare the results of str(churnTrain) with str(CHURN_TRAIN), you will see that the factor columns have been retained. This becomes relevant later.
• The function ore.groupApply will return a list of models stored as ore.object instances. The first argument is the ore.frame CHURN_TRAIN and the second argument indicates to partition the data on column state such that the user-defined function is invoked on each partition of the data.
• The next argument specifies the function, which could alternatively have been the function name if the FUN.NAME argument were used and the function saved explicitly in the R script repository. The function’s first argument (whatever its name) will receive one partition of data, e.g., all data associated with a single state.
• Regarding the user-defined function body, we explicitly load the package we’re using, C50, so the function body has access to it. Recall that this function will execute at the database server in a separate R engine from the client.
• Since we don’t need to know which state we’re working with and we don’t want this included in the model, we delete the column from the data.frame.
• Although the ore.frame defines these columns as factors, they appear as character vectors when they are loaded into the user-defined embedded R function. As a result, we need to convert them back to factors explicitly.
• The model is built and returned from the function.
• The result from ore.groupApply is a list containing the results from the execution of the user-defined function on each partition of the data. In this case, it will be one C5.0 model per state.
• To view the model, we first use ore.pull to retrieve it from the database and then invoke summary on it. The class of mod.MA is “C5.0”.
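The factor conversion in the function body becomes repetitive when many columns are involved. As a sketch, a small helper can convert every character column of the incoming data.frame in one step. The name chars_to_factors is hypothetical, and the example runs on a toy local data.frame rather than on data delivered by ORE.

```r
# Hypothetical helper: convert all character columns of a data.frame to factors.
chars_to_factors <- function(dat) {
  is_chr <- vapply(dat, is.character, logical(1))
  dat[is_chr] <- lapply(dat[is_chr], as.factor)
  dat
}

# Toy stand-in for one partition whose factor columns arrived as character.
part <- data.frame(
  churn = c("yes", "no", "no"),
  international_plan = c("no", "no", "yes"),
  total_day_minutes = c(265.1, 161.6, 243.4),
  stringsAsFactors = FALSE
)

fixed <- chars_to_factors(part)
sapply(fixed, class)
```

Keep in mind that a factor rebuilt inside one partition only has the levels that occur in that partition, which may matter if models from different partitions are later compared.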
SQL API
We can invoke the function through the SQL API by storing the function in the R script repository. Previously we showed doing this using the SQL API; however, we can also do this using the R API. Here, we also modify the function to store the resulting models in an ORE datastore, one per state name:
ore.scriptCreate("myC5.0Function",
  function(dat, datastorePrefix) {
    library(C50)
    datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
    dat$state <- NULL
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    dat$international_plan <- as.factor(dat$international_plan)
    dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
    mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
    ore.save(mod, datastoreName)
    TRUE
  })
Just for comparison, we could invoke this from the R API as follows:
res <- ore.groupApply( CHURN_TRAIN, INDEX=CHURN_TRAIN$state,
FUN.NAME="myC5.0Function",
datastorePrefix="myC5.0model", ore.connect=TRUE)
res
res <- ore.pull(res)
all(as.logical(res) == TRUE)
Since we’re using a datastore, we need to connect to the database setting ore.connect to TRUE. We also pass the datastorePrefix. The result res is an ore.list of logical. To test if all are TRUE, we first pull the result and use the R all function.
Back to the SQL API…Now that we can refer to the function in the SQL API, we invoke the function that places one model per datastore, each with the given prefix and state.
select *
from table(churnGroupEval(
  cursor(select * from CHURN_TRAIN),
  cursor(select 1 as "ore.connect", 'myC5.0model2' as "datastorePrefix" from dual),
  'XML', 'state', 'myC5.0Function'));
There’s one thing missing, however. We don’t have the function churnGroupEval. There is no generic “rqGroupEval” in the API – we need to define our own table function that matches the data provided. Due to this and the parallel nature of the implementation, we need to create a PL/SQL FUNCTION and supporting PACKAGE:
CREATE OR REPLACE PACKAGE churnPkg AS
  TYPE cur IS REF CURSOR RETURN CHURN_TRAIN%ROWTYPE;
END churnPkg;
/
CREATE OR REPLACE FUNCTION churnGroupEval(
  inp_cur churnPkg.cur,
  par_cur SYS_REFCURSOR,
  out_qry VARCHAR2,
  grp_col VARCHAR2,
  exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("state"))
CLUSTER inp_cur BY ("state")
USING rqGroupEvalImpl;
/
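The parts that need to change to create this function for a different data set are the package name (churnPkg), the table named in the %ROWTYPE declaration (CHURN_TRAIN), the function name (churnGroupEval), and the partitioning column ("state") in the PARTITION and CLUSTER clauses. There are other variants, but this will get you quite far.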
To validate that our datastores were created, we invoke ore.datastore(). This returns the datastores present and we will see 51 such entries – one for each state and the District of Columbia.
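The count of 51 follows from the naming scheme in the function: one datastore per distinct state value, named with the prefix, an underscore, and the state. As a local sketch, the expected names can be generated with R's built-in state.abb vector. This assumes the state column holds the two-letter abbreviations plus DC, as the modList$MA reference earlier suggests.

```r
# Expected datastore names under the prefix_state naming scheme (local sketch).
# state.abb is a built-in vector of the 50 state abbreviations; DC is added by hand.
prefix <- "myC5.0model"
states <- sort(c(state.abb, "DC"))
expected <- paste(prefix, states, sep = "_")

length(expected)
head(expected, 3)
```

Comparing this vector against the datastore names reported by ore.datastore() is one way to spot a partition whose model was not saved.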
Parallelism
Above, we mentioned that “group apply” supports data parallelism. By default, parallelism is turned off. To enable parallelism, the parallel argument to ore.groupApply needs to be set to TRUE.
ore.groupApply( CHURN_TRAIN, INDEX=CHURN_TRAIN$state,
FUN.NAME="myC5.0Function",
datastorePrefix="myC5.0model",
ore.connect=TRUE,
parallel=TRUE)
In the case of the SQL API, a parallel hint can be provided in the input cursor query. In the query below, the hint requests a degree of parallelism of up to 4.
select *
from table(churnGroupEval(
  cursor(select /*+ parallel(t,4) */ * from CHURN_TRAIN t),
  cursor(select 1 as "ore.connect", 'myC5.0model2' as "datastorePrefix" from dual),
  'XML', 'state', 'myC5.0Function'));
Map Reduce
The “group apply” functionality can be thought of in terms of the map-reduce paradigm where the mapper performs the partitioning by outputting the INDEX value as key and the data.frame as value. Then, each reducer receives the rows associated with one key. In our example above, INDEX was the column state and so each reducer would receive rows associated with a single state.
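The analogy can be sketched in base R. This is a toy single-process illustration, not how the database distributes work. The mapper and reducer names are hypothetical, and the data is a small made-up data.frame.

```r
# Toy map-reduce rendering of "group apply" (illustrative names, local data).
calls <- data.frame(
  state   = c("MA", "NH", "MA", "CA", "NH", "MA"),
  minutes = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Map: emit the INDEX value as the key; the row id stands in for the row as value.
mapper <- function(row_id) list(key = calls$state[row_id], value = row_id)
emitted <- lapply(seq_len(nrow(calls)), mapper)

# Shuffle: collect the values (row ids) that share a key.
keys    <- vapply(emitted, function(kv) kv$key, character(1))
values  <- vapply(emitted, function(kv) kv$value, integer(1))
grouped <- split(values, keys)

# Reduce: each reducer sees only the rows for one key.
reducer <- function(rows) mean(calls$minutes[rows])
result  <- lapply(grouped, reducer)
```

Here result holds one value per state, just as the churn example produces one model per state.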
Memory and performance considerations
While the data is partitioned by the INDEX column, it is still possible that a given partition is quite large, such that either the partition of data will not fit in the R engine memory or the user-defined embedded R function will not be able to execute to completion. The usual remedial measures can be taken regarding setting memory limits – as noted in Part 2.
If the partitions are not balanced, you would have to configure the system’s memory for the largest partition. This will also have implications for performance, obviously, since smaller partitions of data will likely complete faster than larger ones.
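A quick way to see whether partitions are balanced is to count rows per INDEX value before running the group apply. The sketch below does this on a made-up local data.frame. The threshold of twice the median partition size is an arbitrary assumption for illustration, not an ORE recommendation.

```r
# Count rows per partition key and flag unusually large partitions (local sketch).
dat <- data.frame(
  state = rep(c("CA", "MA", "NH", "VT"), times = c(900, 120, 100, 80)),
  stringsAsFactors = FALSE
)

# Named integer vector of partition sizes, largest first.
sizes <- vapply(split(dat$state, dat$state), length, integer(1))
sizes <- sort(sizes, decreasing = TRUE)
sizes

# Arbitrary illustrative threshold: more than twice the median partition size.
oversized <- names(sizes)[sizes > 2 * median(sizes)]
oversized
```

Partitions flagged this way are the ones to size memory for, and the ones most likely to dominate the overall run time.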
The blog post Managing Memory Limits and Configuring Exadata for Embedded R Execution discusses how to instrument your code to understand the memory usage of your R function. This is done in the context of ore.indexApply (to be discussed later in this blog series), but the approach is analogous for “group apply.”