Things To Know as a New DBA to Avoid Mistakes

Problem

As we gain experience in the technology we use, we inevitably learn things along the way that could have been useful long before we got that far. I certainly learned a lot of things the hard way, as I’m sure many of us do, and in my case that has definitely led to me implementing some sub-par solutions along the way.

Solution

In this tip, I wanted to point out a few of the concepts that I wish I had spent more time researching earlier on, and provide my reasoning and some links to further reading.

1. TESTING YOUR BACKUPS IS NOT ENOUGH

You may have heard this before, but a very appropriate quote I hear repeated often is, “You should have a restore plan, not a backup plan.” So true, and I’ve seen so many customers get bitten in scenarios where they are validating that their backup jobs are succeeding, and that is it.

There is not much point in having a .bak file on hand if you have no idea whether it can even be restored, or how long that restore will take on a secondary or disaster recovery server. There are various ways to automate this process, but a good start would be to have periodic disaster recovery drills, where you pretend that the production server is offline, and then do whatever you need to do to stand that system up elsewhere (stopping at the re-point your apps step).

At my last job, I went so far as to actually use a log shipping secondary as a production reporting server. Part of the motive was to reduce read contention on the primary OLTP instance, but a much more important side effect was that this would actively be testing the restore of every single backup, and making production services rely on those restores succeeding. (In the event of failures, we could either revert to the previous day, or just point the reporting application to the OLTP server temporarily.)

Some further reading on this topic:

2. INDEXING IS A BALANCE

Often I find that, over time, people will create an index for every problematic query they ever troubleshoot. I’ve been guilty of that, but have seen some extreme examples. My personal record has been a customer’s table with 215 non-clustered indexes – pretty impressive when there were only around 70 columns. While this is a quick and easy way to solve a short-term problem with, but happens because it is a quick way to solve a performance problem with an individual, isolated query.

The problem is that every index takes space on disk and in memory, and adds overhead to the write portion of the workload, since every index needs to be maintained along with the underlying table data. This means that for every insert, update, and delete, all of the indexes have to be modified as well. A system filled with many, often redundant indexes is going to have performance issues even with balanced workloads, and will be more prone to blocking and deadlocking in high-write scenarios.

There are dynamic management views and other indicators for discovering indexes that are missing and would help performance, but there are also ways to find redundant and unused indexes that are more trouble than they’re worth.

Some further reading on this topic:

3. DATES CAN BE A PAIN

I grew up in Canada, so I had a leg up on the disaster that is the MM/DD/YYYY format for expressing dates. But little did I know how many other technical issues and gotchas I would come across when dealing with date/time data in SQL Server in my early career. Some of the things that took my younger, more stubborn self to get straight:

BETWEEN IS AMBIGUOUS (AT BEST) – I used to write queries that paraphrased “give me all the data between the beginning and end of February.” February is a thorny example because of leap years, but even that aside, what do we mean by the “end” of the month? If the data is represented with datetime, it’s the 28^th or 29^th at 23:59:59.997. If it’s smalldatetime, it’s 23:59:00. Datetime2? As late as 23:59:59.9999999. And date is just the 28^th or 29^th at midnight. It’s very difficult to accurately determine the end of a period when the underlying data type can change; EOMONTH() doesn’t help, because it returns the last day of the month at 12:00 AM. I learned after some time that it’s much easier to determine the beginning of the next period than the end of the current one. So instead of:

WHERE column BETWEEN '20190201' AND '2019022[8|9] 23:59[:59.99[7|99999]]';

You say:

WHERE column >= '20190201' AND column < '20190301';

Again, if you know the beginning of the range, the beginning of the next range is very easy to find, and doesn’t have any of the issues involved with being sure you’ve correctly calculated the end of the range.

SHORTHAND IS ALSO AMBIGUOUS – Early in my career, I took a lot of pride in reducing character count. I loved that I could say:

EXEC sp_configure 'show adv';

Instead of:

EXEC sp_configure 'show advanced settings';

I mean, look at all those characters I saved, and how much time I got back as a result! I think back on that sarcastically whenever I see people type DATEPART(DD, <col>) or DATEDIFF(YYYY, <col1>, <col2>). In the first case, they typed two cryptic characters to avoid typing three (DAY), and in the second case, they typed four characters that aren’t a word to avoid typing four characters that are (YEAR). Making code self-documenting is a huge plus, especially when the cost is negligible or zero. But in this case establishing the better practice of typing the whole word instead of the shorthand can help avoid issues in the future that can be hard to test against. For example, let’s say you want to write code that returns the week number or the year, and you think you can use shorthand. You don’t need to look it up, because what could be simpler, right?

DECLARE @d date = '20190204';
SELECT DATEPART(W, @d), DATEPART(Y, @d);

Most people would probably expect those to return 6 and 2019, but instead they return 2 and 35. Why? Because W is shorthand for DAYOFWEEK, not WEEK, and Y is shorthand for DAYOFYEAR, not YEAR. If you ran your tests against W based on GETDATE(), and you happened to run it on February 8^th instead of the 4^th, your tests would have passed (at least under the default SET DATEFIRST 7).

There are some other issues I’ve come across with dates and times, including difficulties with time zones, generating date ranges on the fly, memorizing convert style numbers, removing time from dates, and the horror that is Daylight Savings Time. For further reading, see:

SQL Server DateTime Best Practices
Creating a date dimension or calendar table in SQL Server
Build a cheat sheet for SQL Server date and time formats
Handle conversion between time zones in SQL Server Part 1 | Part 2 | Part 3

4. GAPS ARE UNIMPORTANT

Early in my career, I spent more time than I care to admit coming up with obscure ways to use all of the available numbers in an IDENTITY column. A row gets deleted? A transaction gets rolled back? Let’s put a serializable lock on the table, find the lowest unused number using a disastrous self-join, and insert that row using SET IDENTITY_INSERT ON.

I try not to imagine now how much more I could have accomplished in place of (a) engineering those solutions in the first place, and (b) dealing with persistent concurrency issues afterward. It’s more a business problem than a technical one, as the requirement is usually driven by the business. But often it’s because it’s what they think they want, not what they actually need. I know that there are some scenarios where gaps do matter (e.g. invoices), but those edge cases probably shouldn’t be implemented with an IDENTITY column anyway. In most cases, the value produced by an IDENTITY column should be a meaningless, surrogate value that has zero meaning to end users, and therefore gaps should also become meaningless. There was a bug early in SQL Server 2012 that made these gaps an issue (speeding up exhaustion of values), but other than that scenario, you really shouldn’t be concerned about missing numbers.

In any case, some further reading about IDENTITY gaps:

5. SQL SERVER IS OPTIMIZED FOR SETS

As a developer quite green to the database world, I remember trying many times to emulate a for loop in T-SQL (and usually a nested for loop, to boot). The line of thinking we adopt from procedural code is something like, “I need to loop through these things, and perform an action against each thing, one at a time.”

Of course, you can accomplish row-by-row processing in SQL Server, using WHILE or DECLARE CURSOR (and let me be clear, there really is no material difference here; a while loop is just a cursor that doesn’t say so, though you have some easy behavioral and performance flexibility with explicit cursors). In most cases, row-by-row processing will throttle performance. There are some exceptions, like running totals before SQL Server 2012, but as a general rule, your line of thinking should be more like, “I need to perform an action against all of these things, at the same time.”

Further reading on set-based operations:

Next Steps

Those are just a few examples from my own experience. If you’re new to SQL Server, there are many tips, tutorials, and webcasts right here on this site that can help you avoid some future “I wish I knew back then” moments.

Aaron Bertrand

Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He also blogs at sqlblog.org.

MSSQLTips Awards: Author of the Year – 2016, 2023 | Leadership (200+ tips) – 2022

Five Things I Wish I Knew When I Started My SQL Server Career

1. TESTING YOUR BACKUPS IS NOT ENOUGH

2. INDEXING IS A BALANCE

3. DATES CAN BE A PAIN

4. GAPS ARE UNIMPORTANT

5. SQL SERVER IS OPTIMIZED FOR SETS

Leave a ReplyCancel Reply