Comparing SQL Server Full Text Search and Oracle Text Search

By:   |   Updated: 2022-05-20   |   Comments   |   Related: More > Full Text Search


Problem

As a follow up to "SQL Server vs Oracle: Data Loading", we want to perform sentiment analysis, but before doing that it's important to understand the data: common words, misspellings, stemming (root and derivatives of a word), etc. How is this performed in SQL Server, in Oracle, and what are the differences?

Solution

SQL Server Full Text Search and Oracle Text allows you to perform fast text lookup and fuzzy string matching in any character-type column without scanning the whole table on unstructured and semi-structured information, be it speech, documents, XML, bioinformatics, chemical or multimedia. You can query a BLOB data type where LIKE cannot be used, query suffixes where a LIKE performs a full table scan, and you can query over a linked server or database link. It allows you to specify the language you want (if supported) unlike the SOUNDEX and DIFFERENCE functions only supported in English, and it will handle appropriately languages that read from right to left and compound words.

Once created, it will allow you to perform the following types of searches:

  • Search one or more words or phrases, a prefix, or a suffix. When the prefix is a phrase, each token is considered.
  • Search inflectional forms of a word: "drive" will find "drives", "drove", "driving", "driven", etc. depending on the language specified both at the index creation time and at the query execution time.
  • Search synonyms of a word: "car" will find "truck", but this is maintained manually, you can add your own custom business synonyms.
  • Search one or more words or phrases close to other words or phrases, indicating the maximum number of non-search terms that separate the first and last search terms, and if the words must appear in order.
  • Rank results, with specifying different weights for words or phrases.
  • Search within the properties of a document.
  • Statistical semantic (since a word can have multiple meanings or it changes within the context), but only single words are indexed, not multi-word phrases (N-grams), and inflectional forms and synonyms are indexed separately.
  • To know what the output of a full text query will be without executing it.

Here are several considerations when creating a full text and/or semantic search index:

  • This is a feature you choose to be installed separately from the database/engine.
  • You need to account for enough memory when it is crawling the content and when executing a full text query.
  • If you need to search multiple conditions, combining them in a single full text query outperforms multiple statements due to it uses special reverse index properties. You can use AND, OR and NOT within the same full text search predicate and combine different full text operators. For best performance, instead of including multiple words with OR, use the thesaurus specifying them as synonyms.
  • Noise words are removed, but you can adjust them in the default stoplist or create your own stoplist and specify it when indexing and querying.
  • You can adjust synonyms and replacements in the thesaurus file for the language used.
  • For tables with high modify rates it is recommended to schedule manual updates instead of automatic.
  • It suffers from parameter sniffing, and parameter sniffing solutions also apply with full text queries.
  • If you need to handle multiple languages, don't mix them in the same column, or store the content as XML, HTML or a binary type which allow you to specify the language in different parts of the content.

Personally, I've used it in a document processing system which I developed, allowing the users to easily find the required template just by typing a few words contained in the template. In the next sections you will see how this is enabled for SQL Server and Oracle, and the differences.

Oracle Full Text Search

In the article "SQL Server vs Oracle: Data Loading" you learned how to generate data and insert into an Oracle database, but it didn't mention if your text contains commas the text is separated into fields, and because there was only one field in the table the results were trimmed at the first comma. To resolve that, the command to insert data needs to be as follows, adding TERMINATED_BY=EOF to not split fields on a comma and leaving the other parameters as described in the mentioned article:

sqlldr MYDB/MyPwd TABLE=MYTBL DATA='c:\setup\input.txt' DIRECT=TRUE MULTITHREADING=TRUE PARALLEL=TRUE TERMINATED_BY=EOF

The number of inserted rows will be the same as earlier, in my case it was 72. Once the data is loaded, you need to install and configure Oracle Text as follows (I used this guide as reference):

Step 1 - If your language won't be English, search for your language in $ORACLE_HOME/ctx/admin/defaults, for example to search for Spanish you can run the command below:

Get-ChildItem "C:\app\Oracle19c\ctx\admin\defaults" | Select-String "SPANISH" | Select Filename, LineNumber, Line, Path | Format-Table

Step 2 - Connect AS SYSDBA and run the install script $ORACLE_HOME/ctx/admin/catctx, replacing CTXSysPwd with the password you want to assign to a new schema named CTXSYS, replacing SYSAUX with the tablespace where you will create the new objects, replacing SYSTEM with your TEMP TABLESPACE (if you didn't specify it when creating the database it will be SYSTEM), and specifying NOLOCK to unlock the schema; once it completes verify there were no errors, in my case the command is as follows:

SQL> @C:\app\Oracle19c\ctx\admin\catctx.sql CTXSysPwd SYSAUX SYSTEM NOLOCK;

Step 3 - Change schema to CTXSYS (in earlier versions you had to login as this user) and run the default preferences script for your language (the one you found in step 1), for English the script is $ORACLE_HOME/ctx/admin/defaults/drdefus.sql as follows, then verify there were no errors:

SQL> ALTER SESSION SET CURRENT_SCHEMA=CTXSYS; 
SQL> @C:\app\Oracle19c\ctx\admin\defaults\drdefus.sql; 

Step 4 - As SYS verify Oracle Text was installed successfully with the commands and output shown below:

SQL> ALTER SESSION SET CURRENT_SCHEMA=SYS; 
SQL> SELECT COMP_NAME, STATUS, SUBSTR(VERSION,1,10) AS VERSION FROM DBA_REGISTRY WHERE COMP_ID = 'CONTEXT'; 
COMP_NAME    STATUS      VERSION 
------------ ----------- ---------- 
Oracle Text  VALID       19.0.0.0.0 
SQL> SELECT VER_DICT, VER_CODE FROM CTXSYS.CTX_VERSION; 
VER_DICT    VER_CODE 
----------- ---------- 
19.0.0.0.0  19.0.0.0.0 
SQL> SELECT CTXSYS.DRI_VERSION FROM DUAL; 
DRI_VERSION 
----------- 
19.0.0.0.0 
SQL> SELECT COUNT(1) FROM DBA_OBJECTS WHERE OWNER='CTXSYS'; 
  COUNT(1) 
---------- 
       398 
SQL> SELECT OBJECT_TYPE, COUNT(1) FROM DBA_OBJECTS WHERE OWNER='CTXSYS' GROUP BY OBJECT_TYPE; 
OBJECT_TYPE               COUNT(1) 
----------------------- ---------- 
INDEX                           68 
PACKAGE BODY                    67 
TYPE BODY                        6 
INDEXTYPE                        5 
PACKAGE                         78 
PROCEDURE                        2 
LIBRARY                          1 
FUNCTION                         2 
TYPE                            22 
OPERATOR                         6 
LOB                              4 
SEQUENCE                         3 
VIEW                            81 
TABLE                           53 
SQL> SELECT OBJECT_NAME, OBJECT_TYPE, STATUS FROM DBA_OBJECTS WHERE OWNER='CTXSYS' AND STATUS != 'VALID' ORDER BY OBJECT_NAME; 
no rows selected 

Step 5 - Grant Oracle Text permissions to the schema owner of the table loaded earlier with the commands below, replacing MYDB with your schema owner:

GRANT CTXAPP TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_CLS TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_DDL TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_DOC TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_OUTPUT TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_QUERY TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_REPORT TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_THES TO MYDB; 
GRANT EXECUTE ON CTXSYS.CTX_ULEXER TO MYDB; 

Step 6 - Connect with your schema owner and create the index with the commands below, replacing IDX_TWEETS with your index name, MYTBL with the table loaded earlier, and TEXT with the column name of the table:

SQL> connect MYDB/MyPwd 
SQL> CREATE INDEX IDX_TWEETS ON MYTBL(TEXT) INDEXTYPE IS CTXSYS.CONTEXT; 

Note there are other index types and you can specify additional properties when creating, but this is the simplest form.

Perform Test Searches in Oracle

Once Oracle Text is installed and configured, you can perform text searches as follow. To see how many records contain the exact word you extracted data for, see the command and output below, by default the word is case insensitive:

SQL> SELECT COUNT(1) FROM MYTBL WHERE CONTAINS(TEXT, 'sql') > 0; 
  COUNT(1) 
---------- 
        69 

To see how many records sound like the word you extracted data for (soundex), see the command and output below:

SQL> SELECT COUNT(1) FROM MYTBL WHERE CONTAINS(TEXT, '!sql') > 0; 
  COUNT(1) 
---------- 
        72 

To highlight (html bold) the words with a common root (stemming) run the commands and see its output below, in my case because I didn't create a primary key, I had to use ROWID but CTX_DOC.MARKUP accepts the index ID:

EXEC CTX_DOC.SET_KEY_TYPE('ROWID'); 
SET SERVEROUTPUT ON; 
DECLARE 
  mklob CLOB; 
  amt NUMBER := 4000; 
  line VARCHAR2(4000); 
BEGIN 
  FOR I IN (SELECT ROWID FROM MYTBL WHERE CONTAINS(TEXT, '$give') > 0) LOOP 
    CTX_DOC.MARKUP('IDX_TWEETS', I.ROWID, '$give', mklob, tagset => 'HTML_DEFAULT'); 
    DBMS_LOB.READ(mklob, amt, 1, line); 
    DBMS_OUTPUT.PUT_LINE(line); 
  END LOOP I; 
  DBMS_LOB.FREETEMPORARY(mklob); 
END; 
/ 
Sadly, a significant portion of dirty money has been <B>given</B> to political parties, such as 
X <B>gave</B> up, is on the horizon. He?s wholly owned by. 
WEAKNESS???When bounties on the heads of troops ignored it? <B>gave</B> classified? https://t 
Used to be the indefensible Capital that <B>gives</B> it's name? http:// 

Get Performance Stats in Oracle

You can get performance stats when using CONTAINS with the commands and output below:

EXPLAIN PLAN FOR SELECT TEXT FROM MYTBL WHERE CONTAINS(TEXT, '%BBC%') = 0; 
SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY()); 
------------------------------------------------------------------------------------------ 
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     | 
------------------------------------------------------------------------------------------ 
|   0 | SELECT STATEMENT            |            |     1 |   121 |     1   (0)| 00:00:01 | 
|   1 |  TABLE ACCESS BY INDEX ROWID| MYTBL      |     1 |   121 |     1   (0)| 00:00:01 | 
|*  2 |   DOMAIN INDEX              | IDX_TWEETS |       |       |     1   (0)| 00:00:01 | 
------------------------------------------------------------------------------------------ 
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - access("CTXSYS"."CONTAINS"("TEXT",'%BBC%')=0) 

And then compare it against the performance of using LIKE with the commands and output below, note LIKE always performs a full table scan under these conditions:

EXPLAIN PLAN FOR SELECT TEXT FROM MYDB.MYTBL WHERE TEXT LIKE '%BBC%'; 
SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY()); 
--------------------------------------------------------------------------- 
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     | 
--------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT  |       |     4 |   484 |     3   (0)| 00:00:01 | 
|*  1 |  TABLE ACCESS FULL| MYTBL |     4 |   484 |     3   (0)| 00:00:01 | 
--------------------------------------------------------------------------- 
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   1 - filter("TEXT" LIKE '%BBC%' AND "TEXT" IS NOT NULL) 

You can also get the text index explained with the commands and output below, this can help you improve your search queries:

CREATE TABLE CTX_EXPLAIN( 
  EXPLAIN_ID VARCHAR2(30), 
  ID NUMBER, 
  PARENT_ID NUMBER, 
  OPERATION VARCHAR2(30), 
  OPTIONS VARCHAR2(30), 
  OBJECT_NAME VARCHAR2(64), 
  POSITION NUMBER, 
  CARDINALITY NUMBER); 
EXEC CTX_QUERY.EXPLAIN( - 
  index_name => 'IDX_TWEETS', - 
  text_query => '%BBC%', - 
  explain_table => 'CTX_EXPLAIN', - 
  sharelevel => 0); 
SELECT EXPLAIN_ID, ID, PARENT_ID, OPERATION, OPTIONS, OBJECT_NAME, POSITION FROM CTX_EXPLAIN ORDER BY ID; 
EXPLAIN_ID         ID  PARENT_ID OPERATION       OPTIONS    OBJECT_NAME       POSITION 
---------- ---------- ---------- --------------- ---------- --------------- ---------- 
                    1          0 WORD                       BBCNEWS                  1 

Additional notes:

  • CTXSYS.CONTEXT is not the only index type, there are other types for other purposes. For this index type, you need to synchronize the index every time you insert/update/delete records.
  • You can tune everything in the word extraction process to accommodate your needs.
  • You can add sub-indexes to improve the execution plan when searching/ordering using other columns, they can be any data type.
  • You can find Oracle Text Application Developer Guide in this link.
  • You can find Oracle Text Reference in this link.

SQL Server Full Text Search

In the article "SQL Server vs Oracle: Data Loading" you learned how to generate data and insert into a SQL Server database, but Full Text requires a unique key defined. If your data has been already loaded, you need to add a column which will serve as the unique key with the command below, it is recommended to be integer so SQL Server doesn't maintain a separate key mapping table:

ALTER TABLE MyDb.dbo.MyTbl ADD ID INT IDENTITY(1,1) NOT NULL;

Then you need to add a unique key with the command below, note if you need additional performance then create it as clustered:

CREATE UNIQUE INDEX ID_TWEET ON MyDb.dbo.MyTbl(ID);

Once that is done, you need to install and configure Full Text as follows:

Step 1 - Add the feature to SQL Server: this is accomplished by running the SQL Server installer as described in "Install SQL Server and Oracle using PowerShell and Windows Containers" adding the FullText feature; if it's already installed, this will only add the missing feature and everything else will be the same, if you remove a feature from the list then it will be deinstalled; if it succeeds you will see near the end "Setup result: 0":

& "c:\setup\setup.exe" /Q /Action=Install /IAcceptSQLServerLicenseTerms /IndicateProgress /Features=SQLEngine,Conn,FullText /InstanceName=MSSQLSERVER /TcpEnabled=1 /SecurityMode=SQL /SaPwd=@Sq1T3st /SqlSysAdminAccounts="ContainerAdministrator"

Step 2 -Verify there is now a service named MSSQLDFLauncher (but this name changes if it's a named instance):

PS C:\> get-service -name *sql* 
Status   Name               DisplayName 
------   ----               ----------- 
Running  MSSQLFDLauncher    SQL Full-text Filter Daemon Launche... 

Step 3 - Create a fulltext catalog in your database with the command below, changing TWEETS with the name of the catalog:

CREATE FULLTEXT CATALOG TWEETS;

Step 4 - Create a fulltext index in your table with the command below, changing MyTbl with your table name, Text with your column name, ID_TWEET with the index name you added earlier, TWEETS with the catalog name, and AUTO if you want the index population to be manual:

CREATE FULLTEXT INDEX ON dbo.MyTbl(Text) KEY INDEX ID_TWEET ON TWEETS WITH CHANGE_TRACKING AUTO;

You can also indicate the file type through its file extension in the same table, so you can have multiple document types together.

Perform Test Searches in SQL Server

Once Full Text is installed and configured, you can perform text searches as follow. To see how many records contain the exact word you extracted data for, see the command and output below, by default the word is case insensitive:

1> SELECT COUNT(1) FROM dbo.MyTbl WHERE CONTAINS(Text, 'sql'); 
2> GO 
----------- 
         64 

To see how many records match the meaning but not the exact wording, see the command and output below:

1> SELECT COUNT(1) FROM dbo.MyTbl WHERE FREETEXT(Text, 'sql'); 
2> GO 
----------- 
         68 

To see the words with a common root (stemming), see the command and output below:

1> SELECT Text FROM dbo.MyTbl WHERE FREETEXT(Text, 'FORMSOF(INFLECTIONAL, "give")'); 
2> GO 
Sadly, a significant portion of dirty money has been given to political parties, such as 
X gave up, is on the horizon. He?s wholly owned by. 
WEAKNESS???When bounties on the heads of troops ignored it? gave classified? https://t 
Used to be the indefensible Capital that gives it's name? http:// 

Get Performance Stats in SQL Server

You can get the performance of using CONTAINS in SQL Server Management Studio:

CONTAINS execution plan

And then compare it against the performance of using LIKE, note LIKE always performs a full table scan under these conditions:

LIKE execution plan

You can also test the full text with the command below, where 1033 is the locale ID for English, the third parameter is the default stoplist ID, and the last parameter is the accent sensitivity (1 for sensitive); from the result, the expansion_type indicates it is a single word:

parse results

Additional notes:

  • There is no way to perform a SOUNDEX within the Full Text index, but you can get the words in the index, run SOUNDEX on them, and then join the result back with the original table.
  • There is no way to highlight the words that match the query.
  • You can tune everything in the word extraction process to accommodate your needs.
  • Named pipes needs to be enabled with the default value and must be started only by the SQL Server service. The service start type is Manual.
  • You can only define one full text index per table.
  • You need to update statistics on the unique index before a full text index population to generate good partitions.
  • When upgrading your database, consider a full rebuild since newer versions introduce new word breakers and stemmers, and more advanced linguistic behavior.
  • By default, noise words are not removed from full text queries and throw a warning if you include them, but this can be changed with sp_configure 'transform noise words', 1.
  • You can adjust synonyms and replacements in the thesaurus file for the language used, but it is stored in cache, so you need to reload it using sp_fulltext_load_thesaurus_file; for example, use 1033 to reload changes made in Tsenu.xml used for English. Note the thesaurus rules are not recursive, you must avoid special characters, and you need to save the file as Unicode specifying Byte Order Marks.
  • Changes made with WRITETEXT and UPDATETEXT are not picked up by full text auto population, and you need to enable trace flag 7646 to alleviate blocking between DML and queries. For better performance, it is recommended to populate using incremental mode which requires a timestamp column and a secondary index on it.
  • It uses an algorithm to treat 50-word documents the same as 100-word documents, so if a word repeats the same number of times, a small document will have a higher rank than larger documents.
  • You can view the available languages in sys.fulltext_languages. You can load additional OS languages with EXEC sp_fulltext_service @action='load_os_resources', @value=1.
  • If you have plenty of memory and very complex full text queries, enable trace flag 7662 to disable slower processing mode when more than 20 MB of memory is required (fallback mechanism).
  • The full text indexes are not stored and updated individually nor per-table; instead, the full text catalog is treated as a whole. Check this link for full text catalog recommendations and a script to distribute full text indexes across multiple catalogs.
  • If the user queries are not returning data, you may need to implement a fix to repopulate full text indexes as described in this link.
  • You can find the Full Text documentation in this link.

Conclusion

You can see SQL Server Full Text and Oracle Text are very similar in how they work, how they must be configured, and what their results are; also, both must be either installed or created using a script. In Oracle there are more usage features and it need less administration. In SQL Server it is managed per database, so it doesn't need to be configured per user, but has more caveats.

Next Steps



Get Started Now - Click here to get your free 14 day trial of SolarWinds Database Insights




get scripts

next tip button



About the author
MSSQLTips author Pablo Echeverria Pablo Echeverria worked for more than 10 years as a software programmer and analyst. In 2016, I switched jobs to a DBA position, where I have implemented new processes, created better monitoring tools and grown my data scientist skills.

View all my tips


Article Last Updated: 2022-05-20

Comments For This Article





download














get free sql tips
agree to terms