Oracle OSCH: A “World Hello” Example
![World Hello]()
In this post we will walk through Alice in Wonderland's looking glass and do a “Hello World” example for Oracle SQL Connector for HDFS (i.e. OSCH). The title above, “World Hello”, is a play on words meant to drive home the relationship between the two loading models: OLH and OSCH. Both can be used to load an Oracle table, but while OLH runs on Hadoop and uses Hadoop’s MapReduce engine to write data into Oracle tables, OSCH uses the SQL engine running in Oracle to read data living in HDFS files. OLH pushes data to Oracle on demand from Hadoop. OSCH pulls data from HDFS on demand from Oracle.
Below we will first review the OSCH execution model. We will then discuss configuration. OSCH has a few more moving parts to worry about than OLH, which invariably creates hiccups, but if you follow these instructions, in the following order, those should be minimized.
- Perform one-time configuration steps
- Create an Oracle external table that works against local test data
- Load the same local test data into an HDFS directory
- Recreate the external table to reference a Hadoop cluster
- Use the OSCH External Table publishing tool to point the external table to the test data location in HDFS
The OSCH Execution Model
OSCH was explained in the first lesson of the tutorial, but since we are revisiting it in depth, let’s review how it works.
OSCH is simply the plumbing that lets Oracle external tables access HDFS content. Oracle external tables are a well-established mechanism for reading content that is not populated or managed by Oracle. For conventional Oracle external tables, the content lives as files visible to the OS where the Oracle system is running. These would be either local files or shared network files (e.g. NFS). When you create an Oracle external table you point it to a set of files that constitute data that can be rendered as SQL tables. Oracle external table definitions call these “location” files.
Before OSCH was invented, external tables introduced an option called the PREPROCESSOR directive. Originally it was an option that allowed a user to preprocess a single location file before the content was streamed into Oracle. For instance, if your content consisted of zip files, the PREPROCESSOR option could specify that “unzip –p” is to be called for each location file, which would unzip the file before passing the unzipped content to Oracle. The output of an executable specified in the PREPROCESSOR directive always goes to stdout (hence the “-p” option for the unzip call). A PREPROCESSOR executable is a black box to Oracle. All Oracle knows is that when it launches the executable and feeds it a location file path as an argument, the executable will feed back a stream of bits that represents the data of an external table.
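To make the conventional (pre-OSCH) case concrete, here is a minimal sketch of such a definition. The directory objects data_dir and exec_dir and the wrapper script unzip_wrapper.sh are hypothetical stand-ins (PREPROCESSOR names a single executable, so in practice you wrap “unzip -p” in a small script that writes to stdout):
-- Hypothetical conventional external table over zipped local files.
-- unzip_wrapper.sh would run something like: unzip -p "$1"
CREATE TABLE sales_ext
(
  "SALE_ID" VARCHAR2(4000),
  "AMOUNT"  VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'unzip_wrapper.sh'
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('sales.csv.zip')
)
REJECT LIMIT UNLIMITED;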
OSCH repurposed the PREPROCESSOR directive to provide access to HDFS. Instead of calling something like “unzip” it calls an OSCH tool that streams HDFS file content from Hadoop. The files it reads from HDFS are specified as OSCH metadata living in the external table’s local “location” files. (These metadata files are created using OSCH’s publishing tool.) In other words, for OSCH, location files do not contain HDFS content, but contain references to HDFS files living in a Hadoop cluster. The OSCH-supplied preprocessor expects to find OSCH metadata in each of these files.
All this is encapsulated within the Oracle external table definition. The preprocessor logic gets invoked every time one issues a SELECT statement in SQL against the external table. At run time, the OSCH preprocessor is invoked, which opens a “location” file containing metadata. It parses the metadata and then generates a list of files in HDFS it will open, one at a time, and read, piping the content into Oracle. (The metadata also includes optional CODEC directives, so if the HDFS content needs to be decompressed before being fed to Oracle, the OSCH preprocessor can handle it.)
BTW, if you just got nervous about the performance implications of the “one at a time” phrase above, don’t be. This model is massively scalable.
One-Time Configuration Steps
![Checklist]()
Understand the Requirements for Installing and Configuring OSCH
The things you will need for installing and configuring OSCH
include:
- Access to the system where Oracle is running and to the OS
account where Oracle is running (typically the Unix account "oracle”)
- Access to SQL*Plus and permission to connect as DBA
- Ability to create an Oracle user (e.g. "oschuser") with enough permission to create
an external table and directory objects
- The OSCH kit
- The Hadoop client kit for the Hadoop cluster you want to access
- The Hadoop client configuration for HDFS access
- Permission to read, write, and delete files in HDFS as OS user "oracle" (i.e. "oracle" is a Hadoop user)
The formal documentation for installing OSCH is here. Below I outline the process that has worked best for me.
Install the Bits
Log on to the system where Oracle is running as “oracle”. Carve out an independent directory structure (e.g. /home/oracle/osch) outside of the ORACLE_HOME directory structure. Install the OSCH kit (called “orahdfs-2.2.0”) and the Hadoop client kit (“hadoop-2.0.0”). I typically make these peers. Both kits need to be unzipped. Hadoop client kits typically require some building to create a few native libraries, usually related to CODECs. You will also unzip the Hadoop configuration files (“hadoop-conf”). Finally you want to create a default directory for location files that will be referenced by external tables. This is the “exttab” directory below. This directory needs read and write privileges set for “oracle”. At this point you should have a directory structure that looks something like this:
/home/oracle/osch/orahdfs-2.2.0
/home/oracle/osch/hadoop-2.0.0
/home/oracle/osch/hadoop-conf
/home/oracle/osch/exttab
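For reference, a sketch of the shell steps that produce this layout, run as “oracle” on the Oracle system (the archive names and their location in /tmp are assumptions; use whatever you actually downloaded):
# Run as "oracle"; archive names below are assumptions
mkdir -p /home/oracle/osch
cd /home/oracle/osch
unzip /tmp/orahdfs-2.2.0.zip          # the OSCH kit
tar xzf /tmp/hadoop-2.0.0.tar.gz      # the Hadoop client kit
unzip /tmp/hadoop-conf.zip            # the cluster's client configuration files
mkdir exttab                          # default directory for location files; "oracle" must be able to read and write here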
Configure HDFS
Follow the standard Hadoop client instructions that allow you to access the Hadoop cluster via HDFS from the Oracle system, logged in as “oracle”. Typically this means calling hadoop pointing to the hadoop-conf files you copied over.
With Hadoop you will want to be able to create, read, and write files under the HDFS /user/oracle directory. For the moment carve out an area where we will put test data to read from HDFS using OSCH.
hadoop --config /home/oracle/osch/hadoop-conf fs -mkdir /user/oracle/osch/exttab
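A quick sanity check that HDFS access works from this account is to list the directory you just created:
hadoop --config /home/oracle/osch/hadoop-conf fs -ls /user/oracle/osch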
Configure OSCH
In the OSCH kit you will need to configure the preprocessor that is used to access the Hadoop cluster and read HDFS files. It lives in the OSCH kit under the bin directory and is called hdfs_stream. This is a bash script which invokes an OSCH executable under the covers. You need to edit the script and provide a definition for OSCH_HOME. You will also need to export modified PATH and JAVA_LIBRARY_PATH definitions to pick up the Hadoop client binaries.
e.g.
OSCH_HOME=/home/oracle/osch/orahdfs-2.2.0
export PATH=/home/oracle/osch/hadoop-2.0.0/bin:/usr/bin:/bin
export JAVA_LIBRARY_PATH=/home/oracle/osch/hadoop-2.0.0/lib/native
Optionally hdfs_stream allows you to specify where external table log files go. By default they go into the log directory living in the OSCH installation (e.g. /home/oracle/osch/orahdfs-2.2.0/log).
When you’ve completed this step, interactively invoke hdfs_stream with a single bogus argument, “foo”, again on the Oracle system logged in as “oracle”.
e.g.
./hdfs_stream foo
OSCH: Error reading or parsing location file foo
This might seem lame, but it is a good sanity check that ensures Oracle can execute the script while processing an external table. If you get a Java stack trace rather than the above error message, the paths you defined in hdfs_stream are probably broken and need to be fixed.
Configure Oracle for OSCH
In this step you first need to connect to Oracle as SYSDBA and create an Oracle DIRECTORY object that points to the file location where hdfs_stream lives. You create one of these to be shared by any Oracle users running OSCH against a particular Hadoop cluster.
SQLPLUS> CREATE DIRECTORY osch_bin_path AS '/home/oracle/osch/orahdfs-2.2.0/bin';
Assuming you’ve created a vanilla Oracle user (e.g. "oschuser") which will own the external table, you want to grant execute privileges
on the osch_bin_path directory.
SQLPLUS> GRANT EXECUTE ON DIRECTORY osch_bin_path TO oschuser;
Now reconnect to Oracle as “oschuser” and create an
additional directory to point to the directory where location files live.
SQLPLUS> CREATE DIRECTORY exttab_default_directory AS '/home/oracle/osch/exttab';
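If you want to verify the plumbing before moving on, a quick query against the standard ALL_DIRECTORIES dictionary view (connected as "oschuser") should show both directory objects and their paths:
SQLPLUS> SELECT directory_name, directory_path FROM all_directories;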
At this point you have configured OSCH to run against a Hadoop cluster. Now you can move on to creating external tables that map to content living in HDFS.
Create an Oracle External Table that works against Local Test Data
You want to create an external table definition that mirrors the table you want to load (i.e. reflecting the same column names and data types).
Even the simplest local external table definitions take
some time to get right, and 99% of the external table verbiage needed to get it
working against HDFS is identical to getting it to work against local files, so it makes sense to get a vanilla local external table working before trying it against HDFS.
What you want to do is take a small, representative set of the sample data that you want to access in HDFS and localize it as a single file local to the Oracle system and to the “oracle” user. Call it testdata.txt and put it in the /home/oracle/osch/exttab directory, which is our directory for location files. I would recommend starting with a simple CSV text file.
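For example, a made-up testdata.txt with seven comma-separated fields per line (matching the seven-column template we will generate below) might look like this:
1,hello,world,2013-09-04,3.14,foo,bar
2,goodbye,moon,2013-09-05,2.72,baz,qux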
To make things easier we will use the OSCH External Table tool to create an external table definition that you can use as a template to tweak to conform to your data. This tool can be
run from any system that can connect to the Oracle database, but in this
case we are going to stay put and run it locally where Oracle is
running as the OS "oracle" user.
The tool requires two environment variables to run: JAVA_HOME, and CLASSPATH, which needs to reference the tool's jar files:
export JAVA_HOME=/usr/lib/jdk
export CLASSPATH=/home/oracle/osch/orahdfs-2.2.0/jlib/*
For our running example it would look like
this:
/home/oracle/osch/hadoop-2.0.0/bin/hadoop jar
/home/oracle/osch/orahdfs-2.2.0/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@localhost/dbm
-D oracle.hadoop.connection.user=oschuser
-D oracle.hadoop.exttab.tableName=helloworld_exttab
-D oracle.hadoop.exttab.dataPaths=/user/oracle/osch/exttab
-D oracle.hadoop.exttab.defaultDirectory=exttab_default_directory
-D oracle.hadoop.exttab.locationFileCount=1
-D oracle.hadoop.exttab.columnCount=7
-createTable --noexecute
Let’s decompose this command.
The following invokes the OSCH External Table tool by pointing to the
OSCH jar file (“orahdfs.jar”):
/home/oracle/osch/hadoop-2.0.0/bin/hadoop jar
/home/oracle/osch/orahdfs-2.2.0/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable
These two lines connect to the Oracle database service ("dbm") as Oracle
user “oschuser”:
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@localhost/dbm
-D oracle.hadoop.connection.user=oschuser
This identifies the name of the external table we want to create:
-D oracle.hadoop.exttab.tableName=helloworld_exttab
This tells the tool the directory in HDFS where data lives:
-D oracle.hadoop.exttab.dataPaths=/user/oracle/osch/exttab
This indicates where the location files will live (using the name of the Oracle directory created above that maps to "/home/oracle/osch/exttab"):
-D oracle.hadoop.exttab.defaultDirectory=exttab_default_directory
This indicates how many location files we generate. For now, since we are only loading one HDFS file, we need only one location file to reference it, so we feed it a value of 1:
-D oracle.hadoop.exttab.locationFileCount=1
This indicates how many columns are in the table:
-D oracle.hadoop.exttab.columnCount=7
Finally we tell the tool to just pretend to create an external table. This will generate an external table definition and output it to the console:
-createTable --noexecute
The generated console output should look something like this:
Oracle SQL Connector for HDFS Release 2.2.0 - Production
Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.
The create table command was not executed.
The following table would be created.
CREATE TABLE "OSCHUSER"."HELLOWORLD_EXTTAB"
(
"C1" VARCHAR2(4000),
"C2" VARCHAR2(4000),
"C3" VARCHAR2(4000),
"C4" VARCHAR2(4000),
"C5" VARCHAR2(4000),
"C6" VARCHAR2(4000),
"C7" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY "EXTTAB_DEFAULT_DIRECTORY"
ACCESS PARAMETERS
(
RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
STRING SIZES ARE IN CHARACTERS
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'2C'
MISSING FIELD VALUES ARE NULL
(
"C1" CHAR(4000),
"C2" CHAR(4000),
"C3" CHAR(4000),
"C4" CHAR(4000),
"C5" CHAR(4000),
"C6" CHAR(4000),
"C7" CHAR(4000)
)
)
LOCATION
(
'osch-20130904094340-966-1'
)
) PARALLEL REJECT LIMIT UNLIMITED;
Cut and paste the console output to an editor (or cut and paste the text above), temporarily remove the PREPROCESSOR directive, and rename the location file (i.e. "osch-20130904094340-966-1") to "testdata.txt" (the name of your data file). You then want to twiddle with the external table verbiage and change the dummy column names (e.g. "C1"), data types (e.g. "VARCHAR2(4000)"), and field definitions (e.g. "CHAR(4000)") to reflect the table you want to load. (The details for creating Oracle external tables are explained here.) Note that the rest of the verbiage (e.g. "RECORDS DELIMITED BY") is there to support standard CSV text files, so if the data in your test file is correctly formed as CSV input, this stuff should be left as is.
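If you stick with the made-up seven-column CSV above, the edit is minimal; a sketch of the locally testable version is simply the generated definition with the PREPROCESSOR line removed and the location file renamed:
-- Sketch: generated definition minus PREPROCESSOR, location renamed to testdata.txt
CREATE TABLE "OSCHUSER"."HELLOWORLD_EXTTAB"
(
  "C1" VARCHAR2(4000), "C2" VARCHAR2(4000), "C3" VARCHAR2(4000),
  "C4" VARCHAR2(4000), "C5" VARCHAR2(4000), "C6" VARCHAR2(4000),
  "C7" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "EXTTAB_DEFAULT_DIRECTORY"
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY 0X'0A'
    CHARACTERSET AL32UTF8
    STRING SIZES ARE IN CHARACTERS
    FIELDS TERMINATED BY 0X'2C'
    MISSING FIELD VALUES ARE NULL
    (
      "C1" CHAR(4000), "C2" CHAR(4000), "C3" CHAR(4000),
      "C4" CHAR(4000), "C5" CHAR(4000), "C6" CHAR(4000),
      "C7" CHAR(4000)
    )
  )
  LOCATION
  (
    'testdata.txt'
  )
) PARALLEL REJECT LIMIT UNLIMITED;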
When you think your external table definition is correct, create the table in Oracle and try accessing the data from SQL:
SQLPLUS> SELECT * FROM helloworld_exttab;
After you’ve got a SQL SELECT statement working, it's time to load the same data file into HDFS and recreate the external table for remote access.
Load an HDFS Directory with the Local Test Data File
Using the hadoop client on your Oracle system, upload the test file you got working into the HDFS data directory you created earlier.
hadoop fs -put /home/oracle/osch/exttab/testdata.txt /user/oracle/osch/exttab
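To confirm the upload landed where the publishing tool will look for it:
hadoop --config /home/oracle/osch/hadoop-conf fs -ls /user/oracle/osch/exttab
hadoop --config /home/oracle/osch/hadoop-conf fs -cat /user/oracle/osch/exttab/testdata.txt | head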
Recreate the External Table Using the PREPROCESSOR Directive
Now drop the local external table and recreate it using the identical syntax that worked above, but putting back the PREPROCESSOR directive:
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
This will redirect processing to HDFS files living in your Hadoop cluster. Don’t try doing a SELECT statement yet. The last step is to recreate the location files so they point to content living in HDFS.
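In SQL*Plus this amounts to dropping the table and re-issuing the CREATE TABLE statement you tested locally, with the PREPROCESSOR line inserted back where the generated definition had it (between STRING SIZES ARE IN CHARACTERS and FIELDS TERMINATED BY):
SQLPLUS> DROP TABLE helloworld_exttab;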
![Big Switch]()
Using the OSCH Publishing Tool to point to test data living in HDFS
By adding the PREPROCESSOR directive, you now have an external table that is bound to data living in a Hadoop cluster. You now want to point the external table to data living somewhere in HDFS. For our case that is the data living in the HDFS directory we created and populated above: “/user/oracle/osch/exttab”.
First delete the local data file, testdata.txt, living under /home/oracle/osch/exttab. That way we know that if the external table works, it's not fooling us by simply accessing local data.
Then rerun the External Table tool with the "publish" command:
/home/oracle/osch/hadoop-2.0.0/bin/hadoop jar
/home/oracle/osch/orahdfs-2.2.0/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@localhost/dbm
-D oracle.hadoop.connection.user=oschuser
-D oracle.hadoop.exttab.tableName=helloworld_exttab
-D oracle.hadoop.exttab.dataPaths=/user/oracle/osch/exttab
-D oracle.hadoop.exttab.locationFileCount=1
-publish
This time the tool executes the "publish" command, which connects to the Oracle database, prompts for "oschuser"'s password, reads the files living in HDFS under “/user/oracle/osch/exttab”, and creates one location file that references our singleton data file "testdata.txt" that we moved into HDFS.
If you look at your local directory, “/home/oracle/osch/exttab”, you will see that it has been populated with a machine-generated file (e.g. “osch-20130821102309-6509-1”) which contains XML verbiage referring to testdata.txt in HDFS.
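A quick way to see this on the Oracle system as “oracle”:
ls -l /home/oracle/osch/exttab
cat /home/oracle/osch/exttab/osch-*    # the XML metadata referencing testdata.txt in HDFS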
Test an Oracle External Table that works against HDFS Data
Now connect to Oracle as “oschuser” and issue the same SQL query you did when the data was local. You should get the same results you did earlier (though the order of the rows might be different).
SQLPLUS> SELECT * FROM helloworld_exttab;
At this point you have SQL access to content living in HDFS. To use it to load an Oracle table (e.g. "helloworld") you need to use either an INSERT statement:
SQLPLUS> INSERT INTO helloworld SELECT * FROM helloworld_exttab;
or a CREATE TABLE statement:
SQLPLUS> CREATE TABLE helloworld AS SELECT * FROM helloworld_exttab;
What Did We Just Do?
![Abby Normal]()
Aside from doing one-time initialization steps, what we did was create an external table and test it locally to see if it would work with a particular data file format. Then we recreated the external table definition, adding the PREPROCESSOR directive to point to HDFS content living in a Hadoop cluster. We then used the OSCH External Table tool to point the external table to a directory in HDFS with data files having the same format.
The bindings here are simple to understand:
- The PREPROCESSOR directive references
hdfs_stream which binds external tables to a particular Hadoop cluster
- The External Table publishing tool binds an
external table to a set of data files living in that cluster
If you want to access multiple Hadoop clusters, you simply need to create a copy of “hdfs_stream” giving it a new name (e.g. "hdfs_stream_2”), configure it to work against the other cluster, and use the PREPROCESSOR directive to call “hdfs_stream_2”.
If you want two external tables to point to two different data sources of the same format, then create a new external table with the same attributes, and use the OSCH External Table tool to point it to another directory in HDFS.
One question that frequently comes up has to do with using OSCH for SQL access. Specifically, since external tables map HDFS data, are they useful for doing general purpose Oracle SQL queries against HDFS data, not just for loading an Oracle table?
If the data set is very large and you intend to run multiple SQL queries, then you want to load it into an Oracle table and run your queries against that. The reason has to do with the “black box” design of external tables. The storage is not controlled by Oracle, so there are no indexes and no internal structures that Oracle would need to make SQL access efficient. A SELECT statement against any external table is a full table scan, something the Oracle SQL optimizer tries to avoid because it is resource expensive.
One last point: always use the external table definition to do the conversion of text to Oracle native datatypes (e.g. NUMBER, INTEGER, TIMESTAMP, DATE). Do not rely on CAST and other functions (e.g. TO_DATE) in SQL. The data type conversion code in external tables is much more efficient.
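As an illustration (the table, columns, and date mask here are hypothetical, not part of the Hello World example), declare the real datatypes in the column list and put the matching conversions in the field list so the ORACLE_LOADER access driver does the work:
-- Sketch: native datatypes declared in the external table, not CAST in SQL
CREATE TABLE orders_exttab
(
  "ORDER_ID"   NUMBER,
  "AMOUNT"     NUMBER,
  "ORDER_DATE" DATE
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "EXTTAB_DEFAULT_DIRECTORY"
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY 0X'0A'
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
    FIELDS TERMINATED BY 0X'2C'
    MISSING FIELD VALUES ARE NULL
    (
      "ORDER_ID"   CHAR(20),
      "AMOUNT"     CHAR(40),
      "ORDER_DATE" CHAR(10) DATE_FORMAT DATE MASK "YYYY-MM-DD"
    )
  )
  LOCATION
  (
    'osch-generated-location-file'   -- placeholder; the publish step supplies the real name
  )
) PARALLEL REJECT LIMIT UNLIMITED;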
Next Steps
This post was about getting a toy example working with a single data file.
The next post will focus on how to tune OSCH for large data sets living in HDFS and exploit Oracle's parallel query infrastructure for high-performance loads. We will also discuss the pros and cons of using OSCH versus OLH.