SQL Server vs Oracle: Semantic Search

By: Pablo Echeverria   |   Updated: 2022-06-21   |   Comments (1)   |   Related: > Full Text Search


Problem

As a follow-up to SQL Server vs Oracle: Data Loading, we want to perform sentiment analysis, but before doing that it's important to understand the data. In the article Comparing SQL Server Full Text Search and Oracle Text Search we were able to search for common words, misspellings, and stemming (the root and derivatives of a word). Now it's time to let the database tell us automatically which common phrases it can identify and how often they repeat (not identical, but very similar). How is this performed in SQL Server and in Oracle, and what are the differences?

Solution

SQL Server and Oracle Semantic Search allow you to perform the following natural language understanding and processing tasks: deriving user intents and acting when they match a query, sentiment analysis against a pre-defined dictionary, text mining, document ranking, key tag extraction, and related content discovery.

They also allow you to perform statistical semantic search (needed because a word can have multiple meanings, or its meaning changes with the context or the surrounding words): return a scored table of semantic key tags, return a scored table of documents similar to a specified one, and return the key tags that make two documents semantically similar.
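In SQL Server, these three capabilities map directly to the table-valued functions demonstrated later in this tip; as a quick preview (MyTbl, the Text column, and document IDs 39 and 56 are the sample objects used throughout this article):

-- scored semantic key phrases across all documents
SELECT document_key, keyphrase, score FROM SEMANTICKEYPHRASETABLE(MyTbl, *);
-- documents scored by similarity to document 39
SELECT matched_document_key, score FROM SEMANTICSIMILARITYTABLE(MyTbl, *, 39);
-- key phrases that make documents 39 and 56 similar
SELECT keyphrase, score FROM SEMANTICSIMILARITYDETAILSTABLE(MyTbl, Text, 39, Text, 56);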

Remember we're analyzing tweets: by getting key tags you can see which tweets are the most influential over others, and this can also help us identify and remove bot/botnet/netcenter tweets that can skew perception.

Note: SQL Server Semantic Search only indexes single words, not multi-word phrases (N-grams), and inflectional forms and synonyms are indexed separately.

In the next sections you will see how this is enabled for SQL Server and Oracle, and the differences.

Oracle

To install Oracle Semantic Search, you must execute the following steps:

Step 1 - If your database is new, you need to increase JAVA_POOL_SIZE from the default 4 MB to 256 MB (Semantic Search uses Java classes), and, if you have not set SGA_TARGET, you need to increase SHARED_POOL_SIZE from the default 172 MB to 512 MB (it uses cursors, stored procedures, control structures, and parallel execution). This is done with the PowerShell commands below, and afterwards you need to bounce the instance with a "shutdown immediate" and a "startup"; in the end, memory usage will increase to about 600 MB:

Clear-Content C:\app\Oracle19c\database\INITORCL.ORA
$content=@"
db_name=ORCL
java_pool_size=256M
shared_pool_size=512M
"@
Add-Content C:\app\Oracle19c\database\INITORCL.ORA $content
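After the restart, you can confirm the new sizes took effect; the query below against the standard V$SGA_DYNAMIC_COMPONENTS view is a quick check I'm adding for convenience:

SELECT COMPONENT, CURRENT_SIZE/1024/1024 AS MB
  FROM V$SGA_DYNAMIC_COMPONENTS
 WHERE COMPONENT IN ('java pool', 'shared pool');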

Step 2 - If your database is new, you need to increase the datafiles for SYSTEM/SYSAUX/SYS_UNDOTS, because several objects will be created and this prevents slowness and constant file-increase operations. This is done with the SQL commands below, connected as SYS:

ALTER DATABASE DATAFILE 1 RESIZE 800M;
ALTER DATABASE DATAFILE 2 RESIZE 300M;
ALTER DATABASE DATAFILE 3 RESIZE 500M;
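If you're unsure which file number corresponds to which tablespace, you can check first with the standard dictionary view below:

SELECT FILE_ID, TABLESPACE_NAME, BYTES/1024/1024 AS MB
  FROM DBA_DATA_FILES
 ORDER BY FILE_ID;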

Step 3 - Verify there are no invalid installed components; if there are, you need to reinstall them by running the appropriate script. This is verified by running the SQL command below connected as SYS and confirmed with the output as shown; you need to run it every time you install a new component:

SET LINESIZE 300
COL COMP_ID FOR A7
COL COMP_NAME FOR A40
COL VERSION FOR A10
SELECT COMP_ID, COMP_NAME, VERSION, STATUS FROM DBA_REGISTRY;

COMP_ID COMP_NAME                                VERSION    STATUS
------- ---------------------------------------- ---------- -----------
CATALOG Oracle Database Catalog Views            19.0.0.0.0 VALID
CATPROC Oracle Database Packages and Types       19.0.0.0.0 LOADED
RAC     Oracle Real Application Clusters         19.0.0.0.0 OPTION OFF
JAVAVM  JServer JAVA Virtual Machine             19.0.0.0.0 VALID
XDB     Oracle XML Database                      19.0.0.0.0 VALID

Step 4 - Verify there are no invalid objects; if there are, change directory with "cd C:\app\Oracle19c" and run the recompile script connected as SYS with "@?\rdbms\admin\utlrp.sql". This is verified by running the SQL command below connected as SYS and confirmed with the output as shown; you need to run it every time you install a new component:

COL OWNER FOR A20
COL OBJECT_TYPE FOR A15
COL OBJECT_NAME FOR A30
SET PAGESIZE 500
SELECT OWNER, OBJECT_TYPE, OBJECT_NAME FROM DBA_OBJECTS WHERE STATUS <> 'VALID' ORDER BY OWNER, OBJECT_TYPE;

no rows selected

Step 5 - Install the Java Virtual Machine (JVM) component with the SQL commands below (it runs faster if you bounce the database before running it); otherwise you can't run SPARQL queries:

spool C:\setup\initjvm.out
@?\javavm\install\initjvm.sql
spool off

Once run, confirm there were no errors with the PowerShell command below; if there were errors you need to resolve them and reinstall the component:

Get-Content "C:\setup\initjvm.out" | Select-String "(ORA-)|(ERR)" | Select LineNumber, Line | Format-Table

There will be a new component installed as follows:

COMP_ID COMP_NAME                                VERSION    STATUS
------- ---------------------------------------- ---------- -----------
JAVAVM  JServer JAVA Virtual Machine             19.0.0.0.0 VALID

You also need to verify the Java option is enabled with the SQL command and output below:

SELECT * FROM V$OPTION WHERE PARAMETER='Java';

PARAMETER  VALUE          CON_ID
---------- ---------- ----------
Java       TRUE                0

And verify it works with the SQL commands and output below:

SELECT DBMS_JAVA.LONGNAME('TEST') FROM DUAL;

DBMS_JAVA.LONGNAME('TEST')
--------------------------
TEST

Step 6 - Even if Oracle XML Database (XDB) is installed, you need to install the XML Developer's Kit (XDK) component with the SQL commands below (it runs faster if you bounce the database before running it); otherwise the ORA-29549 error is returned:

spool C:\setup\xdk.out
@?\xdk\admin\initxml.sql
spool off

Once run, confirm there were no errors in the output file. There will be a new component installed as follows:

COMP_ID COMP_NAME                                VERSION    STATUS
------- ---------------------------------------- ---------- -----------
XML     Oracle XDK                               19.0.0.0.0 VALID

Step 7 - Install Spatial and Graph (SDO) with the SQL commands below (it runs faster if you bounce the database before running it):

spool C:\setup\mdinst.out
@?\md\admin\mdinst.sql
spool off

Once run, confirm there were no errors in the output file. In my case, the following 4 files (used in external table imports) contained invalid characters and returned ORA-29913 and ORA-30653, so I had to copy them to my host machine, open them in Notepad++, make a trivial change (add a white space and delete it), save the files, and copy them back to their original location in the Docker container:

C:\app\Oracle19c\md\admin\sdo_coord_ref_sys.txt
C:\app\Oracle19c\md\admin\sdo_coord_ops.txt
C:\app\Oracle19c\md\admin\sdo_coord_op_param_vals.txt
C:\app\Oracle19c\md\admin\cs_srs.txt

There will be a new component installed as follows:

COMP_ID COMP_NAME                                VERSION    STATUS
------- ---------------------------------------- ---------- ------------
SDO     Spatial                                  19.0.0.0.0 VALID

And you can verify the installed SDO components with the SQL commands and output below:

COL NAMESPACE FOR A10
COL ATTRIBUTE FOR A11
COL VALUE FOR A10
COL DESCRIPTION FOR A40
SELECT NAMESPACE, ATTRIBUTE, VALUE, DESCRIPTION FROM MDSYS.RDF_PARAMETER;

NAMESPACE  ATTRIBUTE   VALUE      DESCRIPTION
---------- ----------- ---------- ----------------------------------------
COMPONENT  RDFCTX      INSTALLED  Semantic (Text) Search component
COMPONENT  RDFOLS      INSTALLED  RDF Optional component for OLS support
MDSYS      SEM_VERSION 19.1.0.0.0 VALID

Test Oracle Semantic Search

Once Semantic Search is installed, you can perform a test with the steps below. Part of the instructions were taken from the Oracle document "How To Use The GATE Extractor To Create SemContext Indexes (Doc ID 1550829.1)", which is outdated:

Step 1 - In your host machine, install General Architecture for Text Engineering (GATE), an open-source natural language processor and information extractor. It can be downloaded from here; the installer is named gate-developer-9.0.1-installer.exe with size 57.9 MB. Note two folders will be created: "C:\Program Files (x86)\GATE_Developer_9.0.1" and "C:\Users\pabechevb\.m2\repository\uk\ac\gate\plugins\annie\9.1". Once installed, open the newly installed application "GATE 9.0.1", click on File > Manage CREOLE Plugins, select ANNIE (9.1), click on the button "Extract Plugin Resources", and select the C:\temp\ folder; there will be a file named "ANNIE_with_defaults.gapp", which is used when launching the GateListener described later.

Step 2 - Create a listener that will receive requests from the database and reply with the GATE response. The sample can be downloaded from here with filename gatelistener.zip and size 3.18 KB. However, it is for version 5 and won't work with version 9; instead use the attached file named GateListener.java, which I modified to make it work with version 9 and placed at C:\temp to compile it.

First you need to compile the file with the CMD commands below. Note it requires a JDK for the compiler, so in my case I'm using the executable from the Oracle installer; also note you need to reference the GATE jar file:

cmd
cd C:\temp
"C:\temp\WINDOWS.X64_193000_db_home\jdk\bin\javac.exe" -classpath "C:\Program Files (x86)\GATE_Developer_9.0.1\lib\gate-core-9.0.1.jar" GateListener.java

This compiles the class in GateListener.java and outputs a file named GateListener.class. Then you can start the listener with the CMD commands below; note I have a JRE installed locally, so this is the executable I'm using, and Windows Firewall will ask you to open the port, which I left as 12000:

cmd
cd C:\temp
mkdir ANNIE
copy ANNIE_with_defaults.gapp .\ANNIE
"C:\Program Files (x86)\Java\jre1.8.0_333\bin\java.exe" -classpath .;"C:\Program Files (x86)\GATE_Developer_9.0.1\bin\*";"C:\Program Files (x86)\GATE_Developer_9.0.1\lib\*" GateListener 12000

Once the listener is running, you can test that it is reachable from the Docker container with the PowerShell command below, replacing the IP with your host computer's IP; note this will crash the program and you will need to restart the listener:

Test-NetConnection -ComputerName 192.168.0.4 -InformationLevel "Detailed" -Port 12000

Step 3 - Connect to the database with the SYSTEM user (SYS doesn't work because it has too many privileges, MDSYS is configured to not allow connections, and a normal user lacks several needed privileges) and configure the GATE host and port with the SQL commands below, replacing the IP with your host machine's IP. Note this is done once per database; if it's not configured, you'll get the ORA-13199 error later:

begin
  sem_rdfctx.set_extractor_param (
    param_key   => 'GATE_NLP_HOST',
    param_value => '192.168.0.4',
    param_desc  => 'Host for GATE NLP Listener');
  sem_rdfctx.set_extractor_param (
    param_key   => 'GATE_NLP_PORT',
    param_value => '12000',
    param_desc  => 'Port for Gate NLP Listener');
end;
/

Step 4 - Grant the SYSTEM user access to connect to external hosts with the SQL commands below; otherwise you'll get the ORA-24247 error later:

begin
  dbms_network_acl_admin.create_acl (
    acl => 'SEMINDEX',
    description => 'Allow query SPARQL endpoints',
    principal => 'SYSTEM',
    is_grant => true,
    privilege => 'connect');
  dbms_network_acl_admin.assign_acl (
    acl => 'SEMINDEX',
    host => '*');
end;
/

You can verify it is created with the SQL commands below:

COL PRINCIPAL FOR A10
COL HOST FOR A12
COL LOWER_PORT FOR 99999
COL UPPER_PORT FOR 99999
COL ACL FOR A30
COL PRIVILEGE FOR A9
COL PRIVILEGE_STATUS FOR A16
SET LINESIZE 300
SET PAGESIZE 500
SELECT PRINCIPAL, HOST, LOWER_PORT, UPPER_PORT, ACL, 'CONNECT' AS PRIVILEGE, 
    DECODE(DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE_ACLID(aclid, PRINCIPAL, 'connect'), 1,'GRANTED', 0,'DENIED', NULL) PRIVILEGE_STATUS
FROM DBA_NETWORK_ACLS
    JOIN DBA_NETWORK_ACL_PRIVILEGES USING (ACL, ACLID)  
UNION ALL
SELECT PRINCIPAL, HOST, NULL lower_port, NULL upper_port, acl, 'resolve' AS PRIVILEGE, 
    DECODE(DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE_ACLID(aclid, PRINCIPAL, 'resolve'), 1,'GRANTED', 0,'DENIED', NULL) PRIVILEGE_STATUS
FROM DBA_NETWORK_ACLS
    JOIN DBA_NETWORK_ACL_PRIVILEGES USING (ACL, ACLID);

And you can test with the SQL commands below (remember to restart the listener after it crashes):

DECLARE
  l_conn Sys.UTL_TCP.connection;
BEGIN
  l_conn := utl_tcp.open_connection(remote_host =>'192.168.0.4',remote_port => 12000);
  utl_tcp.close_connection(l_conn);
END;
/

Step 5 - Create a semantic network with the SQL command below, specifying the tablespace in which to create it:

EXECUTE SEM_APIS.CREATE_SEM_NETWORK(TABLESPACE_NAME => 'SYSTEM');
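If you ever need to redo this step, the network can be dropped first with the command below; the CASCADE parameter of SEM_APIS.DROP_SEM_NETWORK removes any models and indexes that depend on it:

EXECUTE SEM_APIS.DROP_SEM_NETWORK(CASCADE => TRUE);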

Then create a Semantic Text policy that uses the GATE extractor with the SQL commands below; the extractor can be tuned and extended with additional ontologies, but this is the simplest form:

begin
  sem_rdfctx.create_policy (policy_name => 'SEM_EXTR',
                            extractor   => mdsys.gatenlp_extractor());
end;
/
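To confirm the policy exists, you should be able to query the semantic indexing policies view as below; I'm quoting RDFCTX_POLICIES from memory of the semantic indexing documentation, so treat the view name as an assumption:

SELECT * FROM RDFCTX_POLICIES;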

Step 6 - After the data has been imported as shown in Comparing SQL Server Full Text Search and Oracle Text Search, create the semantic index with the SQL command below; you will notice activity in the listener:

CREATE INDEX RDFINDEX ON MYDB.MYTBL(TEXT) INDEXTYPE IS MDSYS.SemContext PARAMETERS('SEM_EXTR');

Verify there were 0 errors during the index creation with the SQL command below:

select count(*) from mdsys.rdfctx_index_exceptions;
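You can also confirm the domain index is usable with the standard dictionary query below (the DOMIDX columns are populated for domain indexes such as this one):

SELECT INDEX_NAME, STATUS, DOMIDX_STATUS, DOMIDX_OPSTATUS
  FROM DBA_INDEXES
 WHERE INDEX_NAME = 'RDFINDEX';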

Now you need the name of the semantic model that was created, obtained with the SQL query below and its output as shown:

 SELECT owner, model_id, model_name, table_name, column_name
  FROM MDSYS.SEM_MODEL$
 WHERE model_name like 'RDFCTX%';

OWNER        MODEL_ID MODEL_NAME      TABLE_NAME COLUMN_NAM
---------- ---------- --------------- ---------- ----------
SYSTEM              1 RDFCTX_MOD_1

And you can start running SPARQL queries, replacing the model's name as returned from the previous query, to get every relation created:

SELECT s, p, o
  FROM TABLE(SEM_MATCH(
  '(?s ?p ?o)',
  SEM_Models('RDFCTX_MOD_1'),
  null,
  null,
  null));

And you can query which records match any of the relations with the SEM_CONTAINS query below:

SELECT TEXT FROM MYDB.MYTBL WHERE SEM_CONTAINS(TEXT, 'SELECT ?s ?p ?o WHERE {?s ?p ?o}', 1) = 1;

There are additional operators, like SEM_CONTAINS_SELECT, which returns additional information about each document, and SEM_CONTAINS_COUNT, which counts matching subgraphs for the specified pattern. You can also add annotations to the documents using the SEM_RDFCTX.MAINTAIN_TRIPLES procedure. And Oracle can extract a plain-text version of formatted documents like Word, RTF, and PDF using filters and CTX_DOC.POLICY_FILTER, which requires Oracle Text installed.
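As an illustration of the first operator, SEM_CONTAINS_SELECT is ancillary to SEM_CONTAINS and is tied to it by the numeric label (1 below), returning the variable bindings for each matching document as XML. The sketch below follows the documented pattern, but I haven't run this exact statement against the sample table:

SELECT TEXT, SEM_CONTAINS_SELECT(1) AS BINDINGS
  FROM MYDB.MYTBL
 WHERE SEM_CONTAINS(TEXT, 'SELECT ?s ?p ?o WHERE {?s ?p ?o}', 1) = 1;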

The official documentation can be found in this link, this link and this link.

SQL Server

In the article Comparing SQL Server Full Text Search and Oracle Text Search you learned how to install Full Text Search, which is a prerequisite for Semantic Search and allows you to classify and compare documents no matter their type. To install semantic search, you must execute the following additional steps:

Step 1 - Find the SemanticLanguageDatabase in the SQL Server installation media with the command and output below:

Get-ChildItem -Path C:\setup -Filter SemanticLanguageDatabase.msi -Recurse -ErrorAction SilentlyContinue -Force

    Directory: C:\setup\1033_ENU_LP\x64\Setup
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        9/24/2019  10:53 PM      185946112 SemanticLanguageDatabase.msi

    Directory: C:\setup\x64\Setup
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        9/24/2019  11:33 PM      185946112 SemanticLanguageDatabase.msi

Step 2 - Run the installer, which decompresses the database files, with the command below; there is no output:

& C:\setup\x64\Setup\SemanticLanguageDatabase.msi

Step 3 - Search for the decompressed database files with the command and output below:

Get-ChildItem -Path C:\ -Filter semanticsdb.mdf -Recurse -ErrorAction SilentlyContinue -Force

    Directory: C:\Program Files\Microsoft Semantic Language Database
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        9/24/2019   2:21 PM      226426880 semanticsDB.mdf

Step 4 - Attach the database to the server instance with the command and output below:

CREATE DATABASE semanticsdb
  ON ( FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb.mdf' )
  LOG ON ( FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb_log.ldf' )
  FOR ATTACH;
GO

Converting database 'semanticsdb' from version 855 to the current version 904.
Database 'semanticsdb' running the upgrade step from version 855 to version 856.
…
Database 'semanticsdb' running the upgrade step from version 903 to version 904.

Step 5 - Register the semantic statistics database with the command below, there is no output:

EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';
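If you later need to replace this database (for example with a newer version), there is a matching unregister procedure; to my knowledge it takes no parameters, because only one semantic statistics database can be registered per instance:

EXEC sp_fulltext_semantic_unregister_language_statistics_db;
GO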

Step 6 - Verify the semantic database is registered and check its version with the command and output below:

SELECT * FROM sys.fulltext_semantic_language_statistics_database;
GO

database_id register_date           registered_by version
----------- ----------------------- ------------- ---------------
          6 2022-04-08 13:47:43.470             1 11.0.1153.1.3

Step 7 - Verify the installed document parsers with the command below:

SELECT * FROM sys.fulltext_document_types;

Step 8 - By default, Semantic Search doesn't have the Office 2007-2010 parsers installed; you need to download them (the file is named "FilterPack64bit.exe" with size 3.87 MB and version 2010) and install them with the commands below. There is no output, and the instructions are described in this link:

cmd
c:\setup\FilterPack64bit.exe
exit

Step 9 - You also need to install the PDF parser; download it (the file is named "PDFFilter64Setup.msi" with size 19.6 MB and version 11.0.1.36) and install it with the command below. There is no output:

& c:\setup\PDFFilter64Setup.msi

Step 10 - Once the new filters are installed, you need to load them into Semantic Search with the commands below, there is no output:

EXEC sp_fulltext_service 'update_languages';
GO
EXEC sp_fulltext_service 'load_os_resources', 1;
GO
EXEC sp_fulltext_service 'restart_all_fdhosts';
GO

Then you can query the installed document parsers again as before and see their location and version number.
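For example, to confirm just the newly added formats, you can filter the same catalog view:

SELECT document_type, path, version, manufacturer
  FROM sys.fulltext_document_types
 WHERE document_type IN ('.docx', '.xlsx', '.pptx', '.pdf');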

Now you can alter the existing full-text index, but in my case I'm going to drop and recreate it with the commands below; note the only difference from a plain full-text index is "STATISTICAL_SEMANTICS":

USE MyDb
GO
DROP FULLTEXT INDEX ON [dbo].[MyTbl];
GO
CREATE FULLTEXT INDEX ON dbo.MyTbl(Text STATISTICAL_SEMANTICS) KEY INDEX ID_TWEET ON TWEETS WITH CHANGE_TRACKING AUTO;
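Because CHANGE_TRACKING AUTO populates the index in the background, you can check that population finished before querying the semantic functions. The sketch below uses standard metadata functions, with TWEETS being the full-text catalog named in the CREATE statement above (0 means idle):

SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.MyTbl'), 'TableFullTextPopulateStatus') AS TablePopulateStatus,
       FULLTEXTCATALOGPROPERTY('TWEETS', 'PopulateStatus') AS CatalogPopulateStatus;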

Test SQL Server Semantic Search

You can find key phrases in all tweets with the query below; they are called phrases, but in reality they are single words. It is ordered by score, so the most important words appear at the top:

SELECT column_id, document_key, keyphrase, score
  FROM SEMANTICKEYPHRASETABLE(MyTbl, *)
 ORDER BY score DESC;
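You can also restrict the function to a single document by passing its key as the third argument; 39 is one of the document IDs used later in this tip:

SELECT keyphrase, score
  FROM SEMANTICKEYPHRASETABLE(MyTbl, Text, 39)
 ORDER BY score DESC;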

You can find the most common key phrases in all tweets with the query below; it is ordered by count, so the most common words appear at the top:

SELECT keyphrase, COUNT(1)
  FROM SEMANTICKEYPHRASETABLE(MyTbl, *)
 GROUP BY keyphrase
 ORDER BY 2 DESC;

You can find similar or related tweets with the query below. Because the lookup is done per document ID, we need to store intermediate results, and this query may not work with large datasets, so you may need to run it in batches:

CREATE TABLE #Info ([SourceDocumentKey] INT, [SourceColumnId] INT, [MatchedColumnId] INT, [MatchedDocumentKey] INT, [Score] REAL)
DECLARE @cmd VARCHAR(MAX)
SET @cmd = ''
SELECT @cmd = @cmd + 'INSERT #Info EXEC(''
SELECT '+CAST([ID] AS VARCHAR(10))+', * FROM SEMANTICSIMILARITYTABLE(MyTbl, *, '+CAST([ID] AS VARCHAR(10))+')'');'
  FROM [MyTbl];
EXEC (@cmd)
SELECT s.Text, m.Text, Score
  FROM #Info [i]
 INNER JOIN MyTbl s on s.ID = i.SourceDocumentKey
 INNER JOIN MyTbl m on m.ID = i.MatchedDocumentKey
 WHERE Score > 0.5
 ORDER BY Score DESC;
DROP TABLE #Info

In my case there were no results because I'm selecting tweets that are at least 50% similar and there were none. This tells me in this dataset there are no tweets influencing others and they're not generated by bots/botnets/netcenters.

You can find the key phrases that make two documents similar or related (meaning their words repeat often in both) with the query below; MyTbl is the loaded table, Text is the column for both comparisons, and 39 and 56 are the document IDs in the database:

SELECT keyphrase, score FROM SEMANTICSIMILARITYDETAILSTABLE(MyTbl, Text, 39, Text, 56);

The official documentation can be found here: Semantic search (contains several subtopics), Semantic search DDL, functions, stored procedures and views, Exploring Semantic Search Key Term Relevance.

Conclusion

You can see that installing Semantic Search is complex in both Oracle and SQL Server. In Oracle it doesn't depend on Oracle Text, but having both installed allows you to perform additional actions, and SPARQL adds another level of difficulty. In SQL Server the queries remain plain T-SQL and are easy to perform, but the semantic database and Office extractors are a little out of date, so you need to test them first.

Next Steps





About the author
Pablo Echeverria has been a talented database administrator and C#.Net software developer since 2006. Pablo wrote the book "Hands-on data virtualization with Polybase". He is also talented at tuning long-running queries in Oracle and SQL Server, reducing execution times to milliseconds and resource usage by up to 10%. He loves learning and connecting new technologies, providing expert-level insight, and is proficient with scripting languages like PowerShell and bash. You can find several Oracle-related tips in his LinkedIn profile.

This author pledges the content of this article is based on professional experience and not AI generated.



Article Last Updated: 2022-06-21

Comments For This Article




Thursday, June 23, 2022 - 2:48:29 PM - Pablo Echeverria Back To Top (90190)
Below is the code for GateListener.java:

/**
 * The GateListener class is a sample listener implementation for the GATE
 * engine that interacts with the Semantic indexing component in Oracle
 * Database. The sample code uses the ANNIE extraction system; it was
 * originally tested with GATE Release 5.0 and JDK 1.5 and has been
 * modified here to work with GATE 9.0.1. The code may be modified to use
 * custom gate plugins for domain-specific information extraction. This file
 * must be compiled with a classpath that includes the jar files from the
 * bin and lib directories of a GATE installation.
 *
 * See Oracle Database Semantic Technologies Developer's Guide for
 * details about semantic indexing for documents.
 */

import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.Document;
import gate.DocumentContent;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.corpora.DocumentContentImpl;
import gate.corpora.DocumentXmlUtils;
import gate.creole.ANNIEConstants;
import gate.creole.ExecutionException;
import gate.creole.ResourceInstantiationException;
import gate.creole.SerialAnalyserController;
import gate.util.GateException;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import java.io.File;
import gate.util.persistence.PersistenceManager;
import gate.creole.ConditionalSerialAnalyserController;

public class GateListener implements Runnable
{
  private ConditionalSerialAnalyserController annieController = null;
  private ServerSocket serverSocket = null;
  private Thread sockThread = null;
  private boolean reqEndThread = false;
  private static final int READ_CHUNK_SIZE = 16*1024;

  /**
   * GateListener constructor.
   * @param port - port at which the Gate listener is initialized
   * @throws GateException
   */
  public GateListener(int port)
    throws GateException, IOException
  {
    Gate.init();

    // create a serial analyser controller to run ANNIE with
    /*annieController = (SerialAnalyserController) Factory.createResource(
        "gate.creole.SerialAnalyserController", Factory.newFeatureMap(),
        Factory.newFeatureMap(), "ANNIE_" + Gate.genSym());

    // load each PR as defined in ANNIEConstants
    for(int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
      FeatureMap params = Factory.newFeatureMap(); // use default parameters
      System.out.println("Annie constant:"+ANNIEConstants.PR_NAMES[i]);
      System.out.println(params.toString());
      ProcessingResource pr = (ProcessingResource)
          Factory.createResource(ANNIEConstants.PR_NAMES[i], params);

      // add the PR to the pipeline controller
      annieController.add(pr);
    } // for each ANNIE PR*/

    annieController = (ConditionalSerialAnalyserController)
        PersistenceManager.loadObjectFromFile(new File(new File(
        Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),
        ANNIEConstants.DEFAULT_FILE));

    // start the listener
    try {
      serverSocket = new ServerSocket(port);
    }
    catch (IOException e)
    {
      System.err.println("Could not read from the connection "+e.getMessage());
      System.exit(1);
    }

    // start the thread that handles extraction requests
    sockThread = new Thread(this);
    sockThread.start();
    System.out.println("Started the Listener on port "+port);
  }

  /**
   * To build document content over a stream.
   * @param inStream - input stream. The input stream has the length of the
   * document followed by the content of the document.
   * @return
   * @throws IOException
   */
  private DocumentContentImpl getDocumentStreamContent(InputStream inStream)
    throws IOException
  {
    int readLength = 0;
    int bufLength = 0;
    String sdocLength;
    int docLength;
    char[] readBuffer = new char[READ_CHUNK_SIZE];

    BufferedReader uReader = null;
    StringBuffer buf = new StringBuffer();

    uReader = new BufferedReader(new InputStreamReader(inStream));

    // First field is the length of the doc to be read
    sdocLength = uReader.readLine();

    if (sdocLength == null)
    {
      return new DocumentContentImpl(buf.toString());
    }
    try {
      docLength = Integer.parseInt(sdocLength);
    } catch (NumberFormatException nfe)
    {
      throw new RuntimeException("invalid document format");
    }
    readLength = 0;
    while (readLength < docLength)
    {
      bufLength = uReader.read(readBuffer, 0, READ_CHUNK_SIZE);
      if (bufLength == -1) break;
      buf.append(readBuffer, 0, bufLength);
      readLength = readLength+bufLength;
    }

    return new DocumentContentImpl(buf.toString());
  }

  /**
   * Write to the output stream.
   * @param out - output stream for the write operation
   * @param control - control information (-1 if error; else actual length of content)
   * @param content - content to be written to the stream
   */
  private void writeToStream(PrintWriter out, int control, String content)
  {
    out.println(control);
    out.print(content);
  }

  /**
   * Listener logic for accepting documents over a socket connection,
   * processing them, and writing the extracted information back to the
   * socket.
   */
  public void run()
  {
    while (!reqEndThread && serverSocket.isBound())
    {
      try {
        Socket client = serverSocket.accept();
        DocumentContentImpl docContent;
        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
        Corpus corpus = (Corpus) Factory.createResource("gate.corpora.CorpusImpl");
        Document doc = (Document)Factory.createResource("gate.corpora.DocumentImpl");

        // after accepting a socket connection from the database, start reading
        // the document.
        try {
          docContent = getDocumentStreamContent(client.getInputStream());
        } catch (Exception genExcep)
        {
          writeToStream(out, -1, "Error reading document "+genExcep.getMessage());
          client.close();
          continue; // start accepting new connections.
        }

        doc.setContent(docContent);
        corpus.add(doc);

        annieController.setCorpus(corpus);
        annieController.execute();

        // get the annotation set for the document
        AnnotationSet annotSet = doc.getAnnotations();
        HashSet<String> annotTypesRequired = new HashSet<String>();
        annotTypesRequired.add("Person");
        annotTypesRequired.add("Location");
        annotTypesRequired.add("FirstPerson");
        annotTypesRequired.add("Date");
        annotTypesRequired.add("Organization");
        annotTypesRequired.add("Lookup");
        AnnotationSet reqAnnots = annotSet.get(annotTypesRequired);
        String content = docContent.toString();
        Iterator iter = reqAnnots.iterator();

        // For each annotation, add a feature map that includes the
        // value that led to the annotation. This way the XML representation
        // of the annotation set is complete.
        while (iter.hasNext())
        {
          Annotation currAnnot = (Annotation)iter.next();
          FeatureMap featureMap = currAnnot.getFeatures();
          int stpos = (int)currAnnot.getStartNode().getOffset().longValue();
          int edpos = (int)currAnnot.getEndNode().getOffset().longValue();
          String entity = content.substring(stpos, edpos);
          featureMap.put("entityValue", entity);
        }

        // generate the XML document with all annotations
        StringBuffer extrInfo = new StringBuffer();
        DocumentXmlUtils.annotationSetToXml(reqAnnots, extrInfo);
        // write the XML document to the output stream
        writeToStream(out, extrInfo.length(), extrInfo.toString());

        out.close();
        doc.cleanup();
        corpus.cleanup();
        client.close();
      } catch (IOException ioe)
      {
        reqEndThread = true;
        System.err.println("FATAL ERROR : "+ioe.getMessage());
      } catch (ResourceInstantiationException rie)
      {
        reqEndThread = true;
        System.err.println("FATAL ERROR : "+rie.getMessage());
      } catch (ExecutionException execExcep)
      {
        reqEndThread = true;
        System.err.println("FATAL ERROR : "+execExcep.getMessage());
      }
    }
  }

  public static void main(String[] args)
    throws GateException, IOException
  {
    int port = 0;
    if (args.length != 1)
    {
      System.err.println("Usage : GateListener [port-number]");
      System.exit(1);
    }

    try
    {
      port = Integer.parseInt(args[0]);
      if (port < 0)
        throw new NumberFormatException("negative number");
    }
    catch (NumberFormatException nfe)
    {
      System.err.println("Invalid port number");
      System.exit(1);
    }

    new GateListener(port);
  }
}













