I was working with some data which was stored in an Oracle database on a SPARC T4 server. I thought that the data had a quadratic component, and I wanted to analyze it using SQL Developer and Oracle Data Mining, a component of the Oracle Advanced Analytics Option. When I reviewed the initial analysis, I wasn't getting the results I had expected, and the fit of the model wasn't very good. I decided to feed some simple, synthetic quadratic data into Oracle Data Miner to ensure that I was using the tool properly.
Oracle R Enterprise was used as the tool to create and view the synthetic data.
From an R session that has the Oracle R Enterprise package installed, it is easy to access an Oracle Database:
require(ORE)
ore.connect("SCOTT", "orcl", "sparc-T4", "TIGER", 1521)
The following R function, quad1(), is used to calculate "y=ax^2 + bx + c",
where:
- the data frame that is passed in has a column of x values.
- a is in coefficients[feature, 1]
- b is in coefficients[feature, 2]
- c is in coefficients[feature, 3]
The function simply calculates points along a parabola and is more complicated than it needs to be. I will leave it in this general form so that I can later extend it to work with more interesting functions, such as a parabolic surface.
quad1 <- function(df, coefficients) {
    feature <- 1
    coefficients[feature, 1] * df[, feature] * df[, feature] +
        coefficients[feature, 2] * df[, feature] +
        coefficients[feature, 3]
}
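As a quick sanity check (my own aside, not part of the original workflow), quad1() can be evaluated with known coefficients; with a = 1, b = 0, c = 0 it should reduce to y = x^2:

# With a = 1, b = 0, c = 0, quad1() should return x squared.
testCoefficients <- matrix(c(1, 0, 0), nrow = 1)
quad1(data.frame(x1 = c(-2, 0, 3)), testCoefficients)  # expect 4 0 9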
The following function, genData(), creates random "x" data points and uses func() to calculate the y values that correspond to the random x values.
genData <- function(nObservations, func, coefficients, nFeatures, scale) {
    dframe <- data.frame(x1 = rep(1, nObservations))
    for (feature in seq(nFeatures)) {
        name <- paste("x", feature, sep = "")
        dframe[name] <- runif(nObservations, -scale[feature], scale[feature])
    }
    dframe["y"] <- func(dframe, coefficients)
    return(dframe)
}
The following function, quadGraph(), is used for graphing. The points in dframe are displayed in a scatter plot. The coefficients for the known synthetic data are passed in, and the corresponding true line is sketched in blue. (Obviously, if you aren't working with synthetic data, it is unlikely that you will know the "true" coefficients.) The R model that is the best estimate of the data based on regression is also passed in, and its line is sketched in red.
quadGraph <- function(dframe, coefficients = NULL, model = NULL, ...) {
    with(dframe, plot(x1, y))
    title(main = "Quadratic Fit")
    legend("topright", inset = 0.05, c("True", "Model"),
           lwd = c(2.5, 2.5), col = c("blue", "red"))
    xRange <- range(dframe[, "x1"])
    smoothX <- seq(xRange[1], xRange[2], length.out = 50)
    trueY <- quad1(data.frame(smoothX), coefficients)
    lines(smoothX, trueY, col = "blue")
    new <- data.frame(x1 = smoothX)
    y_estimated <- predict(model, new)
    lines(smoothX, y_estimated, col = "red")
}
Here are the settings that will be used.
nFeatures <- 1       # one feature can sketch a line, 2 a surface, ...
nObservations <- 20  # how many rows of data to create for modeling
degree <- 2          # 2 is quadratic, 3 is cubic, etc.
set.seed(2)          # I'll get the same coefficients every time I run
coefficients <- matrix(rnorm(nFeatures * (degree + 1)), nFeatures, degree + 1)
scale <- (10^rpois(nFeatures, 2)) * rnorm(nFeatures, 3)
Here, synthetic data is created that matches the quadratic function and the random coefficients.
modelData <- genData(nObservations, quad1, coefficients, nFeatures, scale)
We can make this exercise at least slightly more realistic by adding some irreducible error (noise) for the regression algorithm to deal with.
yRange <- range(modelData[, "y"])
yJitter <- (yRange[2] - yRange[1]) / 10
modelData["y"] <- modelData["y"] + rnorm(nObservations, 0, yJitter)
Great. At this point I have good quadratic synthetic data that can be analyzed. Feed the synthetic data into the Oracle Database.
oreDF <- ore.push(modelData)
tableName <- paste("QuadraticSample_", nObservations, "_", nFeatures, sep = "")
ore.drop(table = tableName)
ore.create(oreDF, table = tableName)
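If you want to confirm that the table actually landed in the database before leaving R, a one-liner will do; this sketch assumes the ORE function ore.exists() is available in your release:

# Confirm the new table is visible from the ORE session.
ore.exists(tableName)  # expect TRUE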
The Oracle R Enterprise function to fit the linear model works as expected.
m <- ore.lm(y ~ x1 + I(x1 * x1), data = oreDF)
summary(m)
## 
## Call:
## ore.lm(formula = y ~ x1 + I(x1 * x1), data = oreDF)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.149 -0.911 -0.156  0.888  1.894 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.3264     0.4308    3.08   0.0068 ** 
## x1           -0.0640     0.1354   -0.47   0.6428    
## I(x1 * x1)   -0.8392     0.0662  -12.68  4.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.28 on 17 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.901 
## F-statistic: 87.7 on 2 and 17 DF, p-value: 1.1e-09
coefficients
##         [,1]   [,2]  [,3]
## [1,] -0.8969 0.1848 1.588
Notice that the "true" coefficients that were used to create the synthetic data are close to the values estimated by the regression. For example, the true "a" is stored in coefficients[1,1] = -0.8969 and is close to the model's I(x1 * x1) coefficient of -0.8392. Not bad, given that the model was created from only 20 data points.
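A compact way to see the comparison side by side is below; this is a sketch that assumes coef() extracts fitted coefficients from an ore.lm model just as it does from an ordinary lm model:

# Compare the true coefficients with the fitted ones, reordered so
# both rows read (a, b, c) for y = a*x^2 + b*x + c.
fitted <- coef(m)
rbind(true  = coefficients[1, ],
      model = fitted[c("I(x1 * x1)", "x1", "(Intercept)")])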
quadGraph(modelData, coefficients, m)
The 20 data points, which were calculated from the "true" equation but with irreducible error (noise) added, are shown in the graph. The model, estimated by ore.lm() from the 20 noisy data points, is close to the truth.
At this point, my job is either complete or ready to start, depending on your perspective. I'm happy that ore.lm() does a nice job of fitting, so maybe I'm done. But if you remember that my initial goal was to validate that SQL Developer and Oracle Data Miner work with quadratic data, my job has just begun. Now that I have known-good quadratic synthetic data in the database, I'll use SQL Developer and Oracle Data Mining to verify that everything is fine.
One more step in R. Create a second Oracle Database table that will be used to test the regression model.
testData <- genData(nObservations, quad1, coefficients, nFeatures, scale)
oreTestData <- ore.push(testData)
tableName <- paste("QuadraticTest_", nObservations, "_", nFeatures, sep = "")
ore.drop(table = tableName)
ore.create(oreTestData, table = tableName)
Here is the SQL Developer workflow that will be used. The synthetic data is in the Oracle Database table "QuadraticSample_20_1". The "Regress Build" node will run linear regression on the synthetic data. The test data, which was generated using R in the previous step, is stored in an Oracle Database table named "QuadraticTest_20_1". The Apply node will take the regression model that has been created and apply it to the "x1" values from the test data, storing the predicted y values in an Oracle Database table named "QUADTESTRESULTS".
So how did it work? A PhD in statistics would quickly tell you, "not well", and might look at you like you're an idiot if you don't know that a Model F Value Statistic of 3.25 isn't good. My more pedestrian approach is to plot the results of applying the model to the test data.
Pull the test result data into R for viewing:
ore.sync()
ore.attach()
testResults <- ore.pull(QUADTESTRESULTS)
## Warning: ORE object has no unique key - using random order
colnames(testResults)[1] <- "y"
with(testResults, plot(x1, y))
title(main = "Results of Applying Model to Test Data")
Hmm, that doesn't look parabolic to me:
Now that I'm quite sure that SQL Developer and Oracle Data Mining aren't giving the expected fit, check through the advanced settings:
There it is!!
Set the feature generation to use quadratic candidates, re-run the model, and bring the new results back into R:
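This follows the same pattern as before; the sketch below assumes the re-run workflow writes its output back into the same QUADTESTRESULTS table (substitute the name if your Apply node targets a different one):

# Pull the re-scored results and plot them again.
ore.sync()
ore.attach()
newResults <- ore.pull(QUADTESTRESULTS)
colnames(newResults)[1] <- "y"
with(newResults, plot(x1, y))
title(main = "Results After Quadratic Feature Generation")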
And your statistician friends will be happy because the new model has a Model F Value Statistic of 124. Exciting, right?
Now, off to work on parabolic surfaces...