SQL Server Semantic Search to Find Text in External Files
By: Daniel Calbimonte | Updated: 2012-10-04 | Comments (3) | Related: More > Full Text Search
Sometimes it is necessary to search for specific content inside documents stored in a SQL Server database. Is it possible to do this in SQL Server? Can I run T-SQL queries and find content inside Microsoft Word files? Yes, now with SQL Server 2012 you can do a semantic search. You can create a T-SQL query and look for information inside a Word document or other documents. In this tip, we will look at how this works.
SQL Server 2012 introduces a new search feature beyond the relational platform called semantic search. This feature extends the semantic full-text search capabilities to external files.
The Semantic Search engine can be installed when you do the initial installation of SQL Server or it can be added later.
Here is a screenshot of the option when you do a normal installation:
If you did not install this when you did the initial installation you can run the SQL Server installer for the SemanticLanguageDatabase as shown below. This can be found in the SQL Server setup folder.
Verify Features Are Installed
To verify if you have Full Text and Semantic Search installed you can run this query:
The following query s used to verify that the semantic database is installed and registered:
Create Sample Database
To use this new feature we are going to use the FileTable feature in SQL Server 2012. First I am going to create a folder where we will store our database files. In this example I have created a folder called "C:\FileTable".
After this folder is created, open SQL Server Management Studio and create a new database called "FileSearchTest". This is created just like any other database except we are going to also enable FILESTREAM for this database as well.
The script below creates the database:
CREATE DATABASE FileSearchTest ON PRIMARY ( NAME = N'FileSearchTest', FILENAME = N'C:\FileTable\FileSearchTest.mdf' ), FILEGROUP FilestreamFG CONTAINS FILESTREAM ( NAME = MyFileStreamData, FILENAME= 'C:\FileTable\Data' ) LOG ON ( NAME = N'FileSearchTest_Log', FILENAME = N'C:\FileTable\FileSearchTest_log.ldf' ) WITH FILESTREAM ( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileTable' )
Create Sample Table
Create a new table as a FileTable inside the database.
USE FileSearchTest go CREATE TABLE DocumentSemantics AS FileTable WITH ( FileTable_Directory = 'DocumentSemantics', FileTable_Collate_Filename = database_default ); GO
We just created a FileTable! This is a special table that is used to store non-relational data like Word files, text files, ppt files, etc... The below image shows this new feature in SSMS.
There is also a feature in SQL Server 2012 to explore the FileTable directory from within SSMS. In SSMS right click on the new FileTable you can see the "Explore FileTable Directory" option. This will show you the files that exist in this folder.
You can now copy files into the folder "C:\FileTable\Data", using Windows Explorer and then if you run a SELECT against the DocumentSemantics FileTable you can see a list of these files.
You can just run this command to see all of the details:
You can also delete files from the folder using T-SQL as shown below:
DELETE FROM dbo.DocumentSemantics
WHERE dbo.DocumentSemantics.file_type <> 'docx'
Searching Content Within Files
In this example we are going to search for text and find the files with that content:
WHERE FREETEXT (file_stream, 'After climbing a very high mountain, we discovered that there are many other mountains to climb')
The query will show the name of the documents that contain the following text: "After climbing a very high mountain, we discovered that there are many other mountains to climb".
The results displayed are the following:
If we open these documents in WordPad we can see this text as outlined below.
Finding Documents By Filename
Another option we have is to find filenames that include some text. In this example we are going to search for the word "system".
These are the files in folder "C:\FileTable\Data" that include the word "system":
We can run this T-SQL script to find the first 5 files that contain the word "system" in the filename:
USE FileSearchTest SELECT TOP (5) D.name, D.cached_file_size, KEYP_TBL.score FROM dbo.DocumentSemantics D INNER JOIN SEMANTICKEYPHRASETABLE ( dbo.DocumentSemantics, (name, file_stream) ) AS KEYP_TBL ON D.path_locator = KEYP_TBL.document_key WHERE KEYP_TBL.keyphrase ='system' ORDER BY KEYP_TBL.score DESC;
The results displayed by the query are as follows:
Semantic Search combined with FileTables is a powerful feature that will let us handle external files in SQL Server easily and we can query the content easily using T-SQL.
For more information, you can read these artilces:
- Semantickeyphrasetable (Transact-SQL)
- Find Key Phrases in Documents with Semantic Search
- Semantic Search
Last Updated: 2012-10-04
About the author
View all my tips