Entity Extraction using Oracle Text was introduced with the 11.2.0.2 release of the Oracle Database. It's not a particularly well-known feature, so I thought I'd make some notes on how to use it.
Please note that Entity Extraction uses some third party technology "under the covers" which is deliberately disabled in 11.2.0.3. If you run these examples in 11.2.0.3 (or 11.2.0.4 when it's out) you will receive a "feature not generally available" message. The functionality can be re-enabled, but only by contacting Oracle Support so they can run some checks on whether your applications will be affected by the replacement of this third party technology in the next major release.
Entity extraction is the capability to identify, and extract, named entities within a text.
Entities are mainly nouns and noun phrases. They include names, places, times, coded strings (such as phone numbers and zip codes), percentages, monetary amounts and many others.
The ctx_entity package implements entity extraction by means of a built-in dictionary and set of rules for English text. The capabilities can be extended for English, or for other languages, by means of user-provided add-on dictionaries and rule sets.
Let's look at a simple example.
Assume we have a clob containing the text:
New York, United States of America <p>
The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.
We will use ctx_entity.extract to find all the entities in the CLOB value. (For now, we won't worry about how the text got into the clob or how we provide the output clob - that's all in the examples if you want to look).
Entity extraction requires a new type of policy - an "extract policy" - which allows you to specify options. For now, we will create a default policy, thus:
ctx_entity.create_extract_policy( 'mypolicy' );
We can then call extract to do the work. It needs four arguments: the policy name, the document to process, the language, and the output clob (which must have been initialized, for example by calling dbms_lob.createtemporary).
ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob );
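To make that call concrete, here is a minimal sketch of a complete anonymous PL/SQL block. It assumes the default policy 'mypolicy' has already been created and that you are running it from SQL*Plus. The single newline after the <p> in the sample text is an assumption, and the variable names simply follow the snippets in this post.

```sql
set serveroutput on

declare
  -- The input document; a string literal converts implicitly to a CLOB.
  -- The chr(10) line break after <p> is an assumption about the sample text.
  mydoc   clob := 'New York, United States of America <p>' || chr(10) ||
                  'The Dow Jones Industrial Average climbed by 5% yesterday ' ||
                  'on news of a new software release from database giant ' ||
                  'Oracle Corporation.';
  outclob clob;
begin
  -- The output CLOB must be initialized before it is passed to extract
  dbms_lob.createtemporary(outclob, true);

  ctx_entity.extract('mypolicy', mydoc, 'ENGLISH', outclob);

  -- Print the first 4000 characters of the XML result
  dbms_output.put_line(dbms_lob.substr(outclob, 4000, 1));

  dbms_lob.freetemporary(outclob);
end;
/
```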
outclob contains the XML identifying extracted entities. If we display the contents (preferably by selecting it as an xmltype so it gets formatted nicely) we will see:
<entities>
<entity id="0" offset="0" length="8" source="SuppliedDictionary">
<text>New York</text>
<type>city</type>
</entity>
<entity id="1" offset="150" length="18" source="SuppliedRule">
<text>Oracle Corporation</text>
<type>company</type>
</entity>
<entity id="2" offset="10" length="24" source="SuppliedDictionary">
<text>United States of America</text>
<type>country</type>
</entity>
<entity id="3" offset="83" length="2" source="SuppliedRule">
<text>5%</text>
<type>percent</type>
</entity>
<entity id="4" offset="113" length="8" source="SuppliedDictionary">
<text>software</text>
<type>product</type>
</entity>
<entity id="5" offset="0" length="8" source="SuppliedDictionary">
<text>New York</text>
<type>state</type>
</entity>
</entities>
That's fine if we're going to process it with an XML-aware program. However, if we want it in a more "SQL friendly" view, we can use XML DB functions to convert it:
select xtab.offset, xtab.text, xtab.type, xtab.source
from xmltable( '/entities/entity'
       PASSING xmltype(outclob)
       COLUMNS
         offset number       PATH '@offset',
         lngth  number       PATH '@length',
         text   varchar2(50) PATH 'text/text()',
         type   varchar2(50) PATH 'type/text()',
         source varchar2(50) PATH '@source'
     ) as xtab order by offset;
which produces as output:
    OFFSET TEXT                      TYPE                 SOURCE
---------- ------------------------- -------------------- --------------------
         0 New York                  city                 SuppliedDictionary
         0 New York                  state                SuppliedDictionary
        10 United States of America  country              SuppliedDictionary
        83 5%                        percent              SuppliedRule
       113 software                  product              SuppliedDictionary
       150 Oracle Corporation        company              SuppliedRule
Hopefully that's fairly self-explanatory. Each entity found is listed with its offset (starting point within the text), text, entity type, and source (how it was found). The entity "New York" is identified as both a city and a state, so it appears twice.
If we don't want all the different types of entity, we can select which types to fetch. We do this by adding a fifth argument to the "extract" procedure, with a comma-separated list of entity types. For example:
ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' );
That would give us the XML:
<entities>
<entity id="0" offset="0" length="8" source="SuppliedDictionary">
<text>New York</text>
<type>city</type>
</entity>
<entity id="2" offset="10" length="24" source="SuppliedDictionary">
<text>United States of America</text>
<type>country</type>
</entity>
</entities>
Next let's look at creating a new entity type using a user-defined rule. Rules are defined using a regular-expression-based syntax.
The rule is added to an extraction policy, and will then apply whenever that policy is used.
We will create a rule to identify increases, for example in a stock index.
There are many ways to express an increase - we're hoping to match any of the following expressions:
climbed by 5%
increased by over 30 percent
jumped 5.5%
So we'll create a regular expression which matches any of those, and create a new type of entity. The names of user-defined entity types must start with the letter "x", so we'll call our entity type "xPositiveGain":
ctx_entity.add_extract_rule( 'mypolicy', 1,
'<rule>' ||
'<expression>' ||
'((climbed|gained|jumped|increasing|increased|rallied)' ||
'( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))' ||
'</expression>' ||
'<type refid="1">xPositiveGain</type>' ||
'</rule>');
Note the "refid" attribute in there. It tells the extractor which part of the regular expression to report as the entity, by referencing a pair of parentheses within it. In our case we want the entire expression, which is enclosed by the outermost (and first-opening) pair of parentheses, so we use refid="1".
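To illustrate how refid selects a group, here is a hypothetical second rule. The rule id 2 and the entity type xGainVerb are invented for this sketch and are not part of the original example. It reuses the same expression but points refid at the second opening parenthesis, so only the verb should be reported as the entity, assuming groups are numbered by their opening parenthesis from left to right. Treat it as an illustration of the numbering rather than a rule you would want in practice.

```sql
begin
  -- Hypothetical rule: same expression as rule 1, different refid
  ctx_entity.add_extract_rule( 'mypolicy', 2,
    '<rule>' ||
      '<expression>' ||
        -- group 1 = the whole match, group 2 = the verb alone
        '((climbed|gained|jumped|increasing|increased|rallied)' ||
        '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))' ||
      '</expression>' ||
      '<type refid="2">xGainVerb</type>' ||
    '</rule>');
end;
/
```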
This time, it is necessary to compile the policy before we can use it:
ctx_entity.compile('mypolicy');
Then we can use it as before:
ctx_entity.extract('mypolicy', mydoc, null, myresults);
The (abbreviated) output of that would be:
<entities>
...
<entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
<text>climbed by over 5%</text>
<type>xPositiveGain</type>
</entity>
</entities>
Finally, we're going to add another user-defined entity type, this time using a dictionary. We want to recognize "Dow Jones Industrial Average" as an entity of type xIndex. While we're at it we'll add "S&P 500" as well. To do that, we create an XML file containing the following:
<dictionary>
<entities>
<entity>
<value>dow jones industrial average</value>
<type>xIndex</type>
</entity>
<entity>
<value>S&amp;P 500</value>
<type>xIndex</type>
</entity>
</entities>
</dictionary>
Case is not significant in this file, but note how the "&" in "S&P" must be specified as the XML entity "&amp;" - otherwise the XML would not be valid.
This XML file is loaded into the system using the CTXLOAD utility. If the file was called "dict.load", we would use the command:
ctxload -user username/password -extract -name mypolicy -file dict.load
Again we need to compile the policy using ctx_entity.compile. Then when we run ctx_entity.extract we will see in the output:
<entities>
...
<entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
<text>climbed by over 5%</text>
<type>xPositiveGain</type>
</entity>
<entity id="7" offset="43" length="28" source="UserDictionary">
<text>Dow Jones Industrial Average</text>
<type>xIndex</type>
</entity>
</entities>
Here's some example code so you can try it yourself. It uses SQL*Plus bind variables, which don't work too well in SQL Developer - if you really don't like using SQL*Plus, you'll need to convert it - perhaps put the whole thing into a PL/SQL procedure and use PL/SQL variables instead of SQL*Plus variables.
Download link (right-click and "Save As" on most systems): entity_extraction.sql
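As a starting point for the PL/SQL conversion suggested above, here is a minimal sketch that replaces the SQL*Plus bind variables with PL/SQL variables and loops over the XMLTABLE query directly. It is not the contents of entity_extraction.sql. It assumes 'mypolicy' exists (and has been compiled if you added rules or a dictionary), uses a shortened sample text, and renames the XMLTABLE columns with an ent_ prefix (a choice made here) to avoid clashing with PL/SQL keywords such as TYPE.

```sql
set serveroutput on

declare
  mydoc   clob := 'The Dow Jones Industrial Average climbed by 5% yesterday.';
  outclob clob;
begin
  dbms_lob.createtemporary(outclob, true);
  ctx_entity.extract('mypolicy', mydoc, 'ENGLISH', outclob);

  -- PL/SQL variables such as outclob can be referenced directly in the query
  for ent in (
    select xtab.ent_offset, xtab.ent_length, xtab.ent_text,
           xtab.ent_type, xtab.ent_source
    from   xmltable( '/entities/entity'
             passing xmltype(outclob)
             columns
               ent_offset number       path '@offset',
               ent_length number       path '@length',
               ent_text   varchar2(50) path 'text/text()',
               ent_type   varchar2(50) path 'type/text()',
               ent_source varchar2(50) path '@source'
           ) xtab
    order by xtab.ent_offset
  ) loop
    -- One line per entity: offset, length, text, [type], source
    dbms_output.put_line(
      ent.ent_offset || ' ' || ent.ent_length || ' ' || ent.ent_text ||
      ' [' || ent.ent_type || '] ' || ent.ent_source);
  end loop;

  dbms_lob.freetemporary(outclob);
end;
/
```

Wrapping the same body in a stored procedure that takes the document as a parameter would be a natural next step.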