Learn more about SQL Server tools

   
   















































Remove HTML tags from strings using the SQL Server CLR

MSSQLTips author Diana Moldovan By:   |   Read Comments (2)   |   Related Tips: 1 | 2 | More > Functions - User Defined UDF
Problem

There are situations when you may need to remove HTML tags from your character string data. As an example, consider having to submit a product data feed to a search engine like Google. The detailed product description is mandatory in this case.  It is recommended that you remove all special characters and HTML formatting. This task can be handled in TSQL code, however in this case I have the opportunity to use .NET and the power of the regular expressions to manage the string.  In this tip, I'll build a CLR function which cleans up a string of HTML tags and special characters.  I'll use Visual Studio 2010 with C# as the programming language.  Check out this tip for my solution.

Solution

In one of my previous tips I've detailed the steps you need to build and deploy a CLR user defined function from Visual Studio 2010. Below are the steps that need to be followed. 

First, make sure that the CLR integration is enabled. This can be accomplished by using the server facets in SQL Server 2008, the Surface Area Configuration tool in SQL Server 2005 or run the following code:

sp_configure 'clr enabled', 1  
GO  
RECONFIGURE  
GO

Next, follow these steps:

  • Open Visual Studio 2010
  • Click on "New Project"
  • Choose the Database ---> SQL Server ---> Visual C# SQL CLR Database Project template.
  • Make sure that the project targets .NET 2 / .NET 3 / .NET 3.5.
  • Set up a connection to your database, test the connection and click OK
  • Right click on the project and add a user defined function as explained in the next section

Creating the user defined function in the SQL Server CLR

As you can see, before applying the regex pattern, which finds the HTML tags, I strip out the control characters from the final result. Afterwards, I find the HTML tags using the regex pattern and I replace them with an empty string.

Here is the code of the user defined function:

    using System;
    using System.Data.Sql;
    using System.Data.SqlTypes;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTML(SqlString s){
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            StringBuilder tmpS = new StringBuilder(s1.Length);
            //striping out the "control characters"
            if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
            for (int i = 1; i <= s1.Length - 1; i++)
            {
                if (Char.IsControl(s1[i]))
                {
                    if (s1[i - 1] != ' ') tmpS.Append(' ');
                }
                else
                {
                    tmpS.Append(s1[i]);
                }
            }
            string result = tmpS.ToString();
            //finding the HTML tags and replacing them with an empty string
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            return rgx.Replace(result, String.Empty);
        }
    }
    


Comparing the T-SQL and C# CLR code

This task can be carried out using TSQL - for example I could use this piece of simple and elegant code provided by Pinal Dave. It uses the STUFF TSQL string function to replace the HTML tags with empty strings. Should I use CLR or TSQL? Which is the right choice in this case?

To answer this question, I've done a little bit of testing. I used a ~61,000 records ProductDescription table, consisting of an integer ProductID column (clustered primary key) and a Description varchar(max) column which contains HTML markup. I used a slightly different .NET code, which deals only with the HTML replacement as shown below:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTMLOnlyString(SqlString s)
        {
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            return rgx.Replace(s1, String.Empty);
        }
    }
    

In my test cases, I worked on a virtual machine with Windows Server Standard 2003 SP2, 2GB of RAM on SQL Server 2008 Developer Edition with SP2. This is a development machine with no other code running on this machine during the test.

I ran two pieces of code in separate SSMS windows, one using Pinal's function and the other using the CLR function as shown below:

    --window 1 - CLR function
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO
    SELECT ProductDescriptionID, dbo.CleanHTMLOnlyString([Description])
    FROM ProductDescription
    --window 2 - Pinal's TSQL function
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO
    SELECT ProductDescriptionID, dbo.udf_StripHTML([Description])
    FROM ProductDescription
    

Based on the testing environment listed above, using the T-SQL code the CPU time and the elapsed time are greater. There is an average CPU time of ~9 seconds with the T-SQL code vs. ~4 seconds with the CLR code. In addition, there is an average elapsed time of ~10 seconds with the T-SQL code vs. ~5 seconds with the CLR Code. When I ran this same code in a similar hardware environment, but with SQL Server 2005 SP4 Developer Edition, differences are greater.  The results for CPU time were ~31 seconds for T-SQL code vs. ~4 seconds for CLR code as well as elapsed time of ~33 seconds for T-SQL code vs. ~5 seconds for CLR code.

But what is the price for this? A little bit of research led me to this article about the SQL CLR memory usage. Checking out the sys.dm_os_memory_clerks DMV I could see that the MEMORYCLERK_SQLCLR reserves and commits memory. I've run the query recommended by Steven Hemingray on my SQL Server 2008 development virtual machine and I obtained 8548 KB.  Depending on the operating system, how busy the environment is and how the CLR code is written, you may experience memory pressure. Searching more, I've found 2 extreme situations here and here.  So these are something to consider as well.

Next Steps
  • Explore here other situations when CLR might be a good choice. But...
  • ...do not forget to follow this good advice
  • If you decide to use CLR code, test thoroughly that code on your environment.


Last Update: 6/23/2011


About the author
MSSQLTips author Diana Moldovan
Diana Moldovan is a DBA and data centric applications developer with 6 years of experience covering SQL 2000, SQL 2005 and SQL 2008.

View all my tips


print tip Print  
Become a paid author





join MSSQLTips for free SQL Server tips     



Learn more about SQL Server tools
Post a comment or let the author know this tip helped you.

       All comments are reviewed, so stay on subject or we may delete your comment.

*Name   *Email Notify for updates



       Note: your email address is not published. Required fields are marked with an asterisk (*)


Get free SQL tips:

*Enter Code refresh code     



Tuesday, July 05, 2011 - 7:57:43 AM - Diana Moldovan Read The Tip

Jonathan,

 

The only thing I advocate here is caution when using CLR. I provide 2 examples, one with TSQL and another with CLR. Which is "the best" depends on your environment and workload.

Could you point me to additional resources which illustrate your opinion?


Monday, July 04, 2011 - 4:29:27 AM - Jonathan Choy Read The Tip

Please, if you're going to advise using .Net code for removing HTML tags to sanitize your strings, use a robust and well-designed example.

 

This is not such an example, as it advocates using regular expressions to parse HTML. (Recognition to remove is a subset of parsing, and if improperly done will fail to have the desired results).




 
Sponsor Information