Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 
The Trade-off Between SQL Server Security and Performance - Free Webinar
 

Deleting Historical Data from a Large Highly Concurrent SQL Server Database Table


By:   |   Last Updated: 2015-03-30   |   Comments (19)   |   Related Tips: 1 | 2 | More > Maintenance

Problem

I have a large table with millions of historical records. I was asked to remove millions of old records from this table, but this table is accessed by our system 24x7, so there isn't a good time to do massive deletes without impacting the system. In this tip I will show you an unconventional way of doing massive deletes.

Solution

We as database specialists know that deleting or updating records from a table are operations that cause the most blocking. When performing a DELETE, the Database Engine needs to hold a lock on the rows being deleted in order to maintain data consistency. But when you delete a considerable amount of data the row locking escalates into a page or even a table lock which causes blocking issues.

When I was asked to remove historical data from a 120 GB unpartitioned table for an online payment service I tried different approaches, all with the same basic idea of limiting blocking issues. In this tip, we will look at different options for doing this and my final solution that I used.

Test Environment

For the purpose of this tip I used the PurchaseOrderDetail table of the AdventureWorks database and inserted a lot of data.

First we create a database.

USE [master]
GO

CREATE DATABASE [TestDB]
 CONTAINMENT = NONE
 ON  PRIMARY 
( NAME = N'TestDB_file1', 
   FILENAME = N'E:\MSSQL\TestDB_1.mdf',
   SIZE = 128MB , 
   MAXSIZE = UNLIMITED, 
   FILEGROWTH = 64MB) 
 LOG ON 
( NAME = N'TestDB_log_file1',
    FILENAME = N'E:\MSSQL\TestDB_1.ldf',
    SIZE = 64MB,
    MAXSIZE = 2048GB,
    FILEGROWTH = 32MB)
GO

The next script will create our sample table.

USE TestDB
GO

CREATE TABLE[PurchaseOrderDetail]
(
[PurchaseOrderID] [int] NOT NULL,
[PurchaseOrderDetailID] [int] NOT NULL IDENTITY(1, 1),
[DueDate] [datetime] NOT NULL,
[OrderQty] [smallint] NOT NULL,
[ProductID] [int] NOT NULL,
[UnitPrice] [money] NOT NULL,
[LineTotal] AS (isnull([OrderQty]*[UnitPrice],(0.00))),
[ReceivedQty] [decimal] (8, 2) NOT NULL,
[RejectedQty] [decimal] (8, 2) NOT NULL,
[StockedQty] AS (isnull([ReceivedQty]-[RejectedQty],(0.00))),
[ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_PurchaseOrderDetail_ModifiedDate] DEFAULT (getdate())
) ON [PRIMARY]
GO
ALTER TABLE [PurchaseOrderDetail] ADD CONSTRAINT [CK_PurchaseOrderDetail_OrderQty] CHECK (([OrderQty]>(0)))
GO
ALTER TABLE [PurchaseOrderDetail] ADD CONSTRAINT [CK_PurchaseOrderDetail_ReceivedQty] CHECK (([ReceivedQty]>=(0.00)))
GO
ALTER TABLE [PurchaseOrderDetail] ADD CONSTRAINT [CK_PurchaseOrderDetail_RejectedQty] CHECK (([RejectedQty]>=(0.00)))
GO
ALTER TABLE [PurchaseOrderDetail] ADD CONSTRAINT [CK_PurchaseOrderDetail_UnitPrice] CHECK (([UnitPrice]>=(0.00)))
GO
ALTER TABLE [PurchaseOrderDetail] ADD CONSTRAINT [PK_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID] PRIMARY KEY CLUSTERED  ([PurchaseOrderID], [PurchaseOrderDetailID]) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_PurchaseOrderDetail_ProductID] ON [PurchaseOrderDetail] ([ProductID]) ON [PRIMARY]
GO

Now we can fill our test table with data. By using CROSS JOIN we can quickly create a table with millions of records.

USE TestDB;
GO

INSERT INTO dbo.PurchaseOrderDetail
        ( PurchaseOrderID ,
          DueDate ,
          OrderQty ,
          ProductID ,
          UnitPrice ,
          ReceivedQty ,
          RejectedQty ,
          ModifiedDate
        )

SELECT PurchaseOrderID ,
       DATEADD( YEAR, 6,DueDate) ,
       OrderQty ,
       ProductID ,
       UnitPrice ,
       ReceivedQty ,
       RejectedQty ,       
       ModifiedDate 
    FROM AdventureWorks2012.Purchasing.PurchaseOrderDetail a
    CROSS JOIN sys.objects b
    CROSS JOIN  sys.objects c

Simple DELETE

Suppose we need to delete rows from PurchaseOrderDetail for any record that has a DueDate prior to January 1, 2014. At first you may be tempted to execute the following statement. Since there is so much data to delete, this will cause a lot of blocking and also take a long time to complete. The other issue is that this will be done in one giant operation and will cause the transaction log to get very large.

DELETE FROM dbo.PurchaseOrderDetail
WHERE DueDate < '20140101'

By running the above statement, you will create a lot of locking as you can see from the Performance Dashboard report image below.

Performance Dashboard Waiting Tasks Screen Capture.

Options to Delete the Data

Using TOP Clause

Another approach is to use a TOP clause with a DELETE statement to limit the number of rows deleted as shown below. The problem with this approach is that there is no an index on the DueDate which will cause SQL Server to scan the table to find the data. This will take the transaction longer to complete and could cause blocking issues. We can modify the TOP value from 1 to N for our needs.

USE TestDB
GO

DELETE TOP ( 5000 )
FROM    PurchaseOrderDetail
WHERE   DueDate < '20140101' 

A variant of the above method is to DELETE the records based on the primary key as shown below. The primary key is based on PurchaseOrderID and PurchaseOrderDetailID and this query is deleting based on the primary key.

USE TestDB
GO

DELETE  FROM PurchaseOrderDetail
WHERE   PurchaseOrderDetailID IN ( SELECT TOP 5000
                                            PurchaseOrderDetailID
                                   FROM     PurchaseOrderDetail
                                   WHERE    DueDate < '20140101'
                                   ORDER BY DueDate ASC )
AND PurchaseOrderID   IN ( SELECT TOP 5000
                                            PurchaseOrderID
                                   FROM     PurchaseOrderDetail
                                   WHERE    DueDate < '20140101'
                                   ORDER BY DueDate ASC );

In the previous two scripts we are deleting 5000 rows at a time, but in some cases that wonít avoid blocking issues.

Using ROWCOUNT property

Setting the ROWCOUNT property to a value other than 0 will cause the query to stop executing when the rows affected reach its value. I must warn you that in a future release of SQL Server this property will not affect the delete statement. You can read more about this in Books Online: SET ROWCOUNT (Transact-SQL).

USE TestDB
GO

SET ROWCOUNT 5000;
DELETE FROM PurchaseOrderDetail
WHERE DueDate < '20140101' 

Options to Delete all of the Data

At this point we have seen different options to delete small amounts of data at a time, but we need to figure out a way to accomplish this task without the need to be in front of the computer the entire time.

Using a Cursor

Donít even think about using a cursor. If you perform the delete inside a cursor your transaction log could grow disproportionally, even if your database is using the simple recovery model. The reason behind this behavior is that SQL Server needs to perform a checkpoint in order to truncate the log, but the checkpoint process cannot be done in the middle of a transaction. And guess what; the affected rows will be locked until the cursor completes. Even if the database recovery model is set to SIMPLE or you do frequent transaction log backups the transaction log will continue to grow until the entire operation is complete.

Using a While Loop

This may seem to be the way to go, but it has the same drawbacks of a cursor in regards to blocking and transaction log grow.

Using GO with a count

You can use the GO batch terminator with a value to have the statement run over and over again as shown below, but you will need to know how many times the batch needs to execute to delete the data. Also, since there is not an index on DueDate this can cause blocking issues while this runs. In this example it is deleting 5000 rows of data and will run 10000 times.

USE TestDB
GO

DELETE TOP (5000) 
FROM PurchaseOrderDetail
WHERE DueDate < '20140101' 
GO 10000

Deleting Row by Row with the Primary Key

The best way to delete the historical data without blocking issues is to delete row by row using the Primary Key. In this solution, I am generating DELETE statements for each row to delete based on the Primary Key and then executing these statements using SQLCMD.

Generating the DELETE Statements

To create the script all we need to do is build a select query with the DELETE statement enclosed as text and concatenate it with the primary key of the rows we want to delete casted as text. Take a look at the next script. If you have created dynamic SQL before, it will look familiar.

USE TestDB 
GO
SET NOCOUNT ON

SELECT   'DELETE FROM PurchaseOrderDetail WHERE PurchaseOrderID = '
        + CAST(PurchaseOrderID AS VARCHAR(50))
        + ' AND PurchaseOrderDetailID = '
        + CAST(PurchaseOrderDetailID AS VARCHAR(50)) + ';' + CHAR(13)
        + CHAR(10) + 'GO'
FROM    PurchaseOrderDetail
WHERE   DueDate < '20140101' 

The previous query contains a few things that will allow us to execute all the generated delete statements directly using the SQLCMD utility. The semicolon (;) is used as the statement terminator, it marks the end of the statement. The CHAR(13) + CHAR(10) will write the output to a new line. And finally the GO is used as the batch terminator. This will have SQLCMD send one DELETE statement at a time and therefore will minimize the locking. The next image explains the parts of the previous query.

Delete Statements Generator Explained.

In order to generate a text file with the DELETE statements we need to select the option in Management Studio to save the query results to file as shown below. You will be prompted for a file name and a location to save the results.

Press This Button to Save the Results to a File.

Executing the File using SQLCMD

This is the final step; we are going to execute our previously generated text file using the SQLCMD utility. You can use the following command to do so.

sqlcmd -S [Instance Name] -E -d [Database] -i [Script]

Sqlcmd Command Line.

After you hit enter you will see something similar to the next image in the Command Prompt window.

Sqlcmd Delete Script Execution.
Minimizing Transaction Log Growth

To minimize the growth of the transaction log you can set your database recovery model to SIMPLE, this way the transaction log will not grow out of control when the statements are running, because the log truncates on every checkpoint.  The other option is to run frequent transaction log backups while this process is running to minimize the growth of the transaction log.

Next Steps


Last Updated: 2015-03-30


next webcast button


next tip button



About the author
MSSQLTips author Daniel Farina Daniel Farina was born in Buenos Aires, Argentina. Self-educated, since childhood he showed a passion for learning.

View all my tips





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Monday, August 21, 2017 - 6:08:56 PM - Kyle Back To Top

 Your very close to having something special here.  If we could just find a way to work with the alias for repeating tables.

 


Tuesday, April 28, 2015 - 11:54:17 AM - Jim Evans Back To Top

Thanks Daniel.  For a massive table and a goal of keeping a small percentage of the data, say 10%.  What do you think of creating a new table with the same structure.  Moving over the 10% for data to be retained. Then rename the original table to _Old and rename the new table to the original table name.  (May have to sync any new records if missed).  At this point you can truncate and delete the original massive table. 


Sunday, April 19, 2015 - 2:31:28 PM - Harry Back To Top

Hi, Daniel.

I found your post useful, and it raised a question that might be a common scenario.  What if a column was an incrementing identity integer and you wanted to renumber the data of the records you did not delete starting at 1 for this field ?  What would you do and what scripts would be needed ?

Thank you in advance.

Harry


Thursday, April 09, 2015 - 4:42:10 AM - DEEPAK Back To Top

Nice post 


Tuesday, April 07, 2015 - 4:53:40 PM - Dan Wheeler Back To Top

Thanks for this tip Daniel. I really like this approach and I am running through the sample exercise and I see that is working very well!


Tuesday, April 07, 2015 - 2:04:45 PM - Chuck Back To Top

I would also suggest enabling row versioning AKA snapshot isolation.

https://technet.microsoft.com/en-us/library/ms177404%28v=sql.105%29.aspx

IMO snapshot isolation is one of those features that should be a no-brainer. Just turn it on and watch 99% of your blocking issues disappear.


Wednesday, April 01, 2015 - 1:03:11 PM - Jared Karney Back To Top

You say that a cursor and a WHILE loop will grow the log because a checkpoint cannot r, but that is not true. A while loop will do each iteration in its own transaction. The stated reason is "the checkpoint process cannot be done in the middle of a transaction." Though that is true, it assumes that all iterations of the cursor or a WHILE loop execute in the context of a single transaction, which it does not. I purposefully batch my deletes from large tables to avoid transaction log growth. I have found a WHILE loop to be a great solution for large deletes.


Tuesday, March 31, 2015 - 11:58:46 PM - Daniel Farina Back To Top

Hi Everybody! Now I will try to answer your questions

@Nataraja Sidgal: At First, It will depend on your network speed and the distributor throughput. To the distributor it is the same if you delete 5000 rows one at a time or at once because delete statements are fully logged. I wish I had properly answered your question, if not just let me know.
Thank you very much for reading and commenting!

@fernando: [English] BCP does not support saving column headers. Looking at your code I can see two possible ways. The first way is to have a file with the name of the columns and then merge both files. If you go for this method, and assuming you have a file named ColumnNames.txt with the columns definition,  you only need to add the following line after EXEC master..xp_cmdshell @bcpCommand ..  EXEC master..xp_cmdshell 'copy ColumnNames.txt + C:\PRUEBA1.txt C:\ResultFile.txt'. The other option is to create a SSIS package (you can create one automatically with import/export wizard) and then execute something like follows: EXEC master..xp_cmdshell 'dtexec.ext /f SamplePackage.dtsx'
[Español] BCP no soporta guardar encabezados. Pero viendo tu código, lo que podés hacer es: Tener un archivo con los nombres de columna y hacer un merge con el resutado del BCP. Asumiendo que tu archivo con la definición de las columnas se llama ColumnNames.txt, Sólo tendrías que agregar la siguiente linea luego de EXEC master..xp_cmdshell @bcpCommand ..  EXEC master..xp_cmdshell 'copy ColumnNames.txt + C:\PRUEBA1.txt C:\ResultFile.txt'. Sino lo que podés hacer es crear un paquete SSIS (Podés usar el import/export wizzard) y después ejecutas algo así: EXEC master..xp_cmdshell 'dtexec.ext /f SamplePackage.dtsx'

Thank you All!


Tuesday, March 31, 2015 - 1:53:21 PM - shashi Back To Top

Is there a solution for Nataraja Sidgal's question


Tuesday, March 31, 2015 - 9:21:45 AM - John Back To Top

Nice insights, thanks!

 


Monday, March 30, 2015 - 9:28:04 PM - Bill Back To Top

One additional point about the TOP approach: Adding an index upon DueDate is an appropriate solution, if it avoids lock escalations and if it does not significantly impact other DML upon PurchaseOrderDetail.

And one additional point about singleton deletes: Consider http://blogs.msdn.com/b/sqlsakthi/archive/2011/04/17/what-is-writelog-waittype-and-how-to-troubleshoot-and-fix-this-wait-in-sql-server.aspx


Monday, March 30, 2015 - 8:33:16 PM - Nataraja Sidgal Back To Top

Here you are deleting one record at a time via SQL command which is good for a single table.  If you have multiple tables that are all in a relationship, we could do the same by deleting from the child table and then the parent table one record at a time.

However if all these tables are replicated (publisher) and all the tables are heavily used 24/7, what will be the impact of deleting the records one at a time what will be the impact to replication performance and overall sql performance since these tables are transactional tables for the business.  Note that a record deleted in the publisher may/may not be deleted in the subscriber, but the distribution database will still have to read all the transactions.


Monday, March 30, 2015 - 8:26:09 PM - fernando Back To Top

Tengo una pregunta como puedo exportar los encabezados de mi tabla, ya que cuando lo ejecuto solamente me exporta los registros y necesito los encabezados de dicha tabla espero puedan ayudarme. saludos y excelente pagina


DECLARE @FileName NVARCHAR(50),@bcpCommand NVARCHAR(255)

SET @FileName = 'C:\PRUEBA1.txt'

SET @bcpCommand = 'bcp "pao.dbo.vent" out '

SET @bcpCommand = @bcpCommand +' '+ @FileName + '-SGAVE697L497A017 -Usa -Pfercho16! -T  -c -t"|" -C ACP -F1 '

--SET @bcpCommand = @bcpCommand +' '+ @FileName + ' -C ACP -t"|" -c -t -r '

 

EXEC master..xp_cmdshell @bcpCommand

 

Monday, March 30, 2015 - 8:18:16 PM - Daniel Farina Back To Top

Hi guys! Thank you for reading!!! Now I will try to answer your questions.
@NikTek: I think that it won't be possible to execute it inside an SP, because the delete statements will run in the one single transaction. Maybe if some day SQL Server supports autonomous transactions it could be done.
But, you can use a SSIS package: one way would be to generate a text file with all the delete statements and then execute an OS task that calls sqlcmd. Another SSIS option would be an Execute SQL task and then do a for each loop that execute the delete statements.

@Abhi: I leave the script running for the weekend and on Monday when I arrived to the office it was done. One of my goals was to find a way that doesn't need to be monitored as @RichH said.

@AZJim: I believe that it has to do with the fact that  SQL Server doesnt support autonomous transactions.

Thank you!!!


Monday, March 30, 2015 - 5:38:08 PM - Darrell Back To Top

Great info Daniel, I would also recommend caution if your database is participating in an Availability Group


Monday, March 30, 2015 - 4:38:21 PM - AZJim Back To Top

Daniel ... thank you.  I have always wondered about the recommendation to not use a CURSOR for deletes/updates.  I have been an exclusively SQL Server DBA now for over three years.  Before I did mainframe DB2 DBA work and CURSOR processing was the norm.  The difference was when declaring the cursor, DB2 had an option of WITH HOLD.  So if you had to run a commit or checkpoint, you would automatically retain your position in the table.  Should there be a failure after the first commit/checkpoint and before completion, there was another facility to allow for a restart to your last commit/checkpoint.  So it really was the best of all worlds -- and I am surprised that Microsoft hasn't tried to incorporate this capability into SQL Server.  It would go a long way in providing higher concurrency for shops requiring continuous operations like you were in.  But your approach seems simple and workable.  Thanks again.


Monday, March 30, 2015 - 2:59:42 PM - NikTek Back To Top

Hi Daniel,

 

Thank you for providing excellent article.

Can I convert the code to a SP and schedule as sql job or it should only be used with SQLCMD?

Thank you

Nik


Monday, March 30, 2015 - 12:01:26 PM - RichH Back To Top

@Abhi: you're are correct, the script will delete rows one-by-one.  And that's the author's intent.  When deleting historical data, is the primary concern to get it done quickly,  or to get it done with minimal impact on the system?   I think, in this case, the point wasn't to see how fast we might delete the old data, but rather to do so with minimal interruption to normal operations. 

If a data maintenance script runs for two weeks and nobody but the DBAs know its running, does it matter that it took so long?


Monday, March 30, 2015 - 8:50:11 AM - Abhi Back To Top

Thanks Daniel for the tip. but quick question is how much time this above script will take to delete millions of data.

if not wrong records will be deleted one by one.


Learn more about SQL Server tools