Delete SQL Performance for SQL Server, MySQL and PostgreSQL

Problem

Not long ago, I was working on a SQL Server database that was growing quickly and was using a lot of disk space. After an initial check, I noticed that the amount of records in the largest table was not that many (only about 150,000) and the size of the database was a little over 100GB. The large table contained only 2 columns, an ID and a VARBINARY field that was quite wide (in terms of the amount of data stored in each row), hence the size. Eventually, I had to come up with a strategy to purge old information in chunks, because a single delete would have caused blocking and other potential issues.

I was thinking to myself, what if I run into a similar situation with other RDBMS (MySQL or PostgreSQL)? Would I see similar results if I had to delete lots of rows from tables with very wide rows? I got curious enough to give it a try and will share my results. I hope you find this as interesting as I did while doing the tests.

Solution

In this tip I will be presenting the results I compiled after performing records deletions within each respective RDBMS. The tests will be done against a set of 10,000 records in a table that only has an id column and a wide string (1,000,000 characters to be precise).

Initial considerations

As you probably already have guessed by the title of this article, I’m performing the tests against a SQL Server, MySQL and PostgreSQL database instance. To keep things as fair as possible, I’m performing each test in a Linux virtual machine that has the exact same specs; for the data type of the wide column, I tried to choose something as homologous as possible between them. One last detail for the RDBMS setup is that I tested them right out of the box, without any best practices, tuning, or specific parameter settings to possibly help improve performance.

VM Specs

  • OS: Ubuntu Server 20.04 LTS
  • CPU: Intel Core i7 2.7GHz (3820QM)
  • RAM: 8GB DDR3 1600MHz
  • Volume Size: 25GB

Information related to the RDBMSs and their respective runs

I used the latest available version of each product to keep things as fair as possible.

The scenario for the test goes like this:

1 – A simple test database is created in each DB instance and a table containing the wide column, with a structure similar to this:

CREATE TABLE test(
    **id INT NOT NULL PRIMARY KEY AUTOINCREMENT,
    ***wide VARCHAR(MAX) NOT NULL
)

The syntax for the auto increment id varies within each RDBMS, but it is just to give you an idea.

The data type of the wide column is not exactly the same across all RDBMSs, but I tried to choose one for each that allows me to store the exact same data in each of them, to keep things as similar as possible.

2 – A round of 10,000 records is inserted and the following rounds of deletes are performed:

  • Time to delete 1,000 rows, as well as the size of the respective transaction log file after its completion.
  • Time to delete 2,000 rows, as well as the size of the respective transaction log file after its completion.
  • Time to delete 3,000 rows, as well as the size of the respective transaction log file after its completion.
  • Time to delete 4,000 rows, as well as the size of the respective transaction log file after its completion.
  • Time to delete 5,000 rows, as well as the size of the respective transaction log file after its completion.
  • Time to delete 10,000 rows, as well as the size of the respective transaction log file after its completion.

3 – I’m inserting a string 1,000,000 characters long, in the wide column of each row (the exact same across all of the rows), which I’m attaching as a sample in case you want to try it in your system.

4 – For the tests in each RDBMs, I made sure that only that RDBMS was running in the VM while the other two were completely shut down.

Insertion Round

These metrics were measured on a completely empty instance for each case.

RDBMSVersionWide Column DataTypeTime to insert 10,000 recordsSize of data file after the insertion
SQL Server2019 CU6VARCHAR(MAX)18min 25sec9,864MB
MySQL8.0.21LONGTEXT13min 15sec9251.52MB
PostgreSQL12.2VARCHAR(1000000)4min 20sec147MB

Deletion Rounds

After each deletion round, the log file is cleaned/wiped before the next round takes place.  For each example, there’s a chart for comparison to get a better sense of the results. Note, I multiplied the time results by 10 just for visualization purposes.

1,000 Records Delete Test

RDBMSTime to delete recordsSize of log file after the deletion
SQL Server44sec72MB
MySQL13.8sec954MB
PostgreSQL0.095sec16MB
comparison chart

2,000 Records Delete Test

RDBMSTime to delete recordsSize of log file after the deletion
SQL Server1min 2sec136MB
MySQL29.24sec1.9GB
PostgreSQL0.179sec16MB
comparison chart

3,000 Records Delete Test

RDBMSTime to delete recordsSize of log file after the deletion
SQL Server50sec200MB
MySQL2min 32sec2.8GB
PostgreSQL0.264sec16MB
comparison chart

4,000 Records Delete Test

RDBMSTime to delete recordsSize of log file after the deletion
SQL Server1min 36sec264MB
MySQL3min 43sec3.7GB
PostgreSQL0.134sec16MB
comparison chart

5,000 Records Delete Test

RDBMSTime to delete recordsSize of log file after the deletion
SQL Server2min 25sec355MB
MySQL5min 44sec4.6GB
PostgreSQL0.314sec16MB
comparison chart

10,000 Records Delete Test

RDBMSTime to delete recordsSize of log file after the deletion
SQL Server3min 53sec648MB
MySQL10min 51sec12GB
PostgreSQL0.639sec16MB
comparison chart

Interesting Facts

  • Without a doubt, PostgreSQL is the clear winner of this experiment. Out of the 3 RDBMSs, it was the one that made both the inserts and deletes in the least amount of time, while keeping the transaction log much smaller. It’s very likely that the transaction log management of the pg_wal files and checkpoint mechanism are very well implemented, or there’s something else behind the curtain that I’m just not aware of (since I’m not a PostgreSQL expert as of right now). If you have the explanation, feel free to share.
  • Compared to SQL Server, MySQL has an advantage within the first 2,000 records, in terms of the time to perform the deletes. However, at every stage of the process it takes a big hit on the transaction log size which goes to the roof at the end.
  • It would be interesting to perform the exact same experiment after applying best practices and properly tuning each RDBMS to see if better numbers are achieved. Although, I guess it would be a bit pointless for PostgreSQL as I don’t think I would get better numbers than the ones I already got.
  • In the world of RDBMSs, I have always heard that PostgreSQL is way better than MySQL. I have never used PostgreSQL in a highly transactional production environment, but after seeing the results from this experiment, I guess I can find some truth to that.
  • I know for a fact that managing delete operations like these in SQL Server, by chunks, plays a whole lot better in terms of the transaction log usage. Also, locks on the table are way more relaxed and it allows you to interact with the table even if the deletes are taking place, something that doesn’t happen on larger deletes.
  • If you have to perform something similar in a real production system and you are working with MySQL, I guess a well thought out strategy has to be applied. Of course, after conducting several test runs to see the results first.
  • It would have been interesting to throw more RDBMSs in the mix like Oracle, DB2, or jump into the NoSQL space and toss MongoDB, Cassandra, Redis, etc.

Next Steps

3 Comments

  1. Thanks for the testing. I only know about SQL Server and so I have to ask…
    What recovery model was SQL Server in and do the other two database engines have a similar setting and what were those settings?

  2. Postgresql is my favourite RDBMS but you need to count the VACCUM time PostgreSQL will take to actually remove the tuples from disk. Thanks for sharing.

Leave a Reply

Your email address will not be published. Required fields are marked *