Removing HTML tags from text data using PowerShell

By:   |   Comments (1)   |   Related: > PowerShell


Problem

In this tip I'll address the same problem as a previous tip using CLR, but to get the job done I'll use PowerShell. PowerShell will let me "clean" the string data and export the result to a .csv file in a single step. The tip I've mentioned before covers only the data "cleaning".

Solution

I'm working on a virtual machine with SQL 2008 Develoer edition SP2 installed locally with only a default instance. I'm using ~61.000 records [Test].[dbo].[ProductDescription] table, consisting of an integer ProductID column (clustered primary key) and a Description varchar(max) column which contains HTML markup.  My goal is to remove the HTML tags from the varchar(max) column and to export the "clean" data to a .csv file.

You can use sqlps.exe to complete your routine PowerShell tasks. However, I'm still a beginner and I need a more "visual" environment. Therefore I'm using for example PowerGui, which offers "goodies" such as intellisense and syntax highlighting. Others may prefer the ISE ("integrated scripting environment") offered by Microsoft or simply use the "regular" PowerShell console. In any of these cases you'll need to set up PowerShell for working with SQL Server.

I've added the 2 snap-ins (SqlServerCmdletSnapin100 and SqlServerProviderSnapin100) to my profile, so that they load every time I start PowerShell.

Add-PSSnapin SqlServerCmdletSnapin100
Add-PSSnapin SqlServerProviderSnapin100 

The SQL Server PowerShell provider enables you to "see" the SQL objects (servers, databases, tables, and so on...) in a hierarchy similar to the file system under a root drive named "SQLSERVER". The "SqlServerCmdletSnapin" includes a set of specific commandlets - I'll use "Invoke-Sqlcmd" later. Sqlps.exe offers these features "out of the box".

Since I can "see" the SQL Server objects in a "file system way", I can easily navigate to the objects of the default instance, in my case to the database "Test"

Set-Location SQLSERVER:\SQL\MyServer\DEFAULT\DATABASES\Test

I also need a .csv file where I'll export the data. I'll assign the file path to a string variable:

[string] $OutFile = "C:\MyLocation\OutFile.csv"

Make sure that your account can write to the location you specify.

To pull out the data from the [Test].[dbo].[ProductDescription] table, I'll use the Invoke-Sqlcmd commandlet.

Invoke-Sqlcmd  -Query "SELECT ProductDescriptionID, [Description] FROM ProductDescription"

In this case (I do not specify any user name and password) the connection to the SQL Server will be made using my Windows account.

The context in which I'm running the query has already been established by the "Set-Location" from above. In fact, when running Invoke-Sqlcmd, the PowerShell console will display a warning message: "WARNING: Using provider context. Server = MyServer, Database = Test". Don't worry, nothing is going wrong. This is rather an informational message, meant to show you the context you're working in. Thank you Chad Miller for the explanation. I'll use the "SuppressProviderContextWarning" parameter to get rid of the message.

Invoke-Sqlcmd will run the SELECT statement in the PowerShell environment. Its "output" consists of .NET framework objects of type System.Data.DataRow. The properties and methods of these objects can be seen by piping the Invoke-Sqlcmd result to Get-Member. Since now I'm interested only in the properties, I'll select only the property members:

Set-Location SQLSERVER:\SQL\MyServer\DEFAULT\DATABASES\Test
Invoke-Sqlcmd -Query "SELECT ProductDescriptionID, [Description] FROM ProductDescription `
WHERE ProductDescriptionID = 3"`
|Get-Member -MemberType Properties

I've used a "WHERE" condition in the query, so that the result set is limited and the script is not slow. Here is the result:

Get Member

Remember that part of the final goal is to export the data to a .csv file. To do this typically I use the Export-Csv commandlet. Export-Csv will create a .csv file with rows consisting of a comma separated list of values of the object properties - in our case the ProductDescriptionID and the Description. But the "raw" Description is not what I want to export. This time I need to remove the HTML tags from the product description. Therefore I cannot use Export-Csv.

Instead I'll remove the HTML tags from every "Description" property of each object returned by Invoke-Sqlcmd and I'll build a .csv file from the ProductDescriptionID and the "cleaned" description. Here is the code, all put together:

[string] $OutFile = "C:\MyLocation\OutFile.csv"
Set-Content  $OutFile "`"ProductDescriptionID`",`"Description`""
Set-Location SQLSERVER:\SQL\MyServer\DEFAULT\DATABASES\Test
Invoke-Sqlcmd -Query "SELECT ProductDescriptionID, [Description] FROM ProductDescription"`
|ForEach-Object{[string]::Join(",", `
    ([string]::Concat("`"", $_.ProductDescriptionID, "`""), `
 [string]::Concat("`"",$_.Description -replace "<[^>]*?>|<[^>]*>", "" ,"`"")))}`
|Add-Content $OutFile

I'm using:

  • Set-content for the file header. Set-Content will replace any old text that might exist in the target file with the text specified.
  • ForEach-Object, which enables me to loop through the Invoke-Sqlcmd output and to change the Description property for every object.
  • the Concat method of the string class. This method converts its arguments to string and concatenates those strings.
  • the Join method of the string class
  • the replace operator
  • Add-content to append the result to the target file.

Here is another variant of this script using a StringBuilder to append content and outputs the whole result to the target file. It is faster, but requires more memory.

[string] $OutFile = "C:\MyLocation\OutFile.csv"
[string] $query = "SELECT ProductDescriptionID, [Description] FROM ProductDescription"
$s = New-Object System.Text.StringBuilder
Set-Content  $OutFile "`"ProductDescriptionID`",`"Description`""
Set-Location SQLSERVER:\SQL\MyServer\DEFAULT\DATABASES\Test
Invoke-Sqlcmd -SuppressProviderContextWarning -Query $query|ForEach-Object{
    $s.Append([string]::Join(","`
    , ([string]::Concat("`"", $_.ProductDescriptionID, "`"")`
    , [string]::Concat("`"",$_.Description`
    -replace "<[^>]*?>|<[^>]*>", "" ,"`"")))); $s.Append("`r`n")
}|Out-Null
$s.ToString()|Add-Content $OutFile

I've also used "Out-Null" to avoid a lengthy console output.

Next Steps
  • I tried to "keep it simple" here, so there is no error trapping in this example. Consider adding error handling to your PowerShell scripts.
  • Check out more PowerShell stuff related to SQL Server here
  • If you're new to PowerShell, there are wonderful and absolutely free resources you should not miss: The Scripting Guy's learning center
  • You can schedule this script if you need to export the data periodically.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Diana Moldovan Diana Moldovan is a DBA and data centric applications developer with 6 years of experience covering SQL 2000, SQL 2005 and SQL 2008.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Monday, July 24, 2017 - 6:30:06 AM - breccs Back To Top (63159)

Hi, How would you do the same and output the result to a .html file? 

 















get free sql tips
agree to terms