By: Aaron Bertrand | Last Updated: 2010-09-29 | Comments (1) | Database Design
A short while ago, I discussed the importance of consistent naming conventions (Use consistent SQL Server Naming Conventions). Another similar issue I see creep into environments is the haphazard use of data types. Often it seems abundantly clear that different tables were designed by separate people - to the point that the same data could be represented in two tables with distinct data types. In some cases it even seems random... from one table to the next, you might not know what data type you'll get.
Some teams can keep very good documentation and communicate well to keep their data type usage consistent. Others use a data dictionary, extended properties, or 3rd party tools to make their database schemas self-documenting. Still others use User-Defined Types or Alias Types to prevent other team members from inadvertently going astray. (For what it's worth, I don't recommend this latter route solely for solving this problem - I talked about this in a blog post last year.)
Being aware of this problem is half the battle; how you solve it will depend on the size of your team, the complexity of your schema, and your willingness to right the ship. Documentation is certainly not the most glamorous part of our jobs, and it can be difficult to justify interfering with development time to write about what you've developed. But it should be easy, by way of an example, to show how a more consistent schema can help development in the long run.
I was recently working on a project where some of the data modeling was done in Access, and then some in Entity Framework. By the time I got to reviewing the schema, we had things like:
dbo.Fleets [Description] VARCHAR(32) dbo.FleetVehicles [Description] VARCHAR(50) dbo.FleetActions [Description] NVARCHAR(255) dbo.VehicleSummary [Description] SYSNAME
Were these all the same "Description"? No, they belonged to different tables and represented different entities in the model. Still, when developers are working on CRUD procedures, or writing front end validation scripts, or building forms, it can be very beneficial to have a consistent implementation so that they aren't constantly checking how or why the rules for one entity's Description value are different from another's. This may have been an intentional choice, but if so, this is just begging for some proper documentation.
Here is one quick way to find these cases where the data type for otherwise similar columns are different, using the catalog views sys.columns and sys.types. And yes, this query intentionally includes views, because you'll want to highlight places where implicit or explicit conversion might produce different output than the base column. However it does not deal with columns that have different names (e.g. Description in the table and FleetDescription in a view), or if a column uses an alias type.
(You may also be interested in identifying cases where the nullability, default values, collation, or participation in foreign keys or other constraints are inconsistent between similar columns. For this example, I'll restrict the query to data type and max_length / precision / scale.)
;WITH mismatch AS ( SELECT [Object] = QUOTENAME(SCHEMA_NAME(o.[schema_id])) + '.' + QUOTENAME(o.name), Object_Type = o.type_desc, Column_Name = QUOTENAME(c.name), [Precision] = c.[precision], Scale = c.scale, Max_Length = CASE WHEN t.name LIKE 'n%char' AND c.max_length <> -1 THEN CONVERT(VARCHAR(12), c.max_length/2) WHEN c.max_length = -1 THEN 'MAX' ELSE CONVERT(VARCHAR(12), c.max_length) END, Data_Type = t.name, rn = DENSE_RANK() OVER ( PARTITION BY c.name ORDER BY t.name, c.max_length, c.scale, c.[precision] ) FROM sys.columns AS c INNER JOIN sys.objects AS o ON c.[object_id] = o.[object_id] LEFT OUTER JOIN sys.types AS t ON c.system_type_id = t.system_type_id AND c.system_type_id = t.user_type_id WHERE o.is_ms_shipped = 0 ) SELECT [Object], Object_Type, Column_Name, Data_Type, Max_Length, [Precision], Scale FROM mismatch AS o WHERE EXISTS ( SELECT 1 FROM mismatch WHERE Column_Name = o.Column_Name AND rn > 1 ) ORDER BY Column_Name, Max_Length, [Object];
When you find discrepancies in the output, you may want to formulate a plan for correcting them. For some data type decisions, the plan may be simple; for others, further thought may need to go into the process. Here are a few pointers to help you decide which data type to use when there are multiple options.
CHAR / NCHAR / VARCHAR / NVARCHAR
- There has been a lot of banter about using fixed-width columns (CHAR / NCHAR) vs. variable-width columns (VARCHAR / NVARCHAR) to store strings. Conventional wisdom dictates that you use CHAR or NCHAR when your data is always of fixed width (e.g. VIN numbers, which are currently 17 characters), or if the data is variably but the max is less than, say, 10 characters. There is no rule here, of course; it is just an arbitrary cut-off point where the difference between storage / seek costs and row overhead / fragmentation costs tip the other way.
As for whether to use CHAR or VARCHAR vs. NCHAR or NVARCHAR, well that depends on whether your data needs to support Unicode data (or might ever need to do so in the future). Note that in SQL Server 2008 and below, you will pay a 100% penalty in storage costs (and likewise I/O costs) when you use NVARCHAR; even if you only ever store ASCII data, each character actually requires 2 bytes of storage. So people have often shied away from this datatype unless they absolutely knew they would need to support Unicode characters. In SQL Server 2008 R2, at least in Enterprise and Data Center Editions, there is much more weight to the argument of always using NVARCHAR if there is even the slightest chance you will need it - since the Data Compression feature has been enhanced to compress the data in these columns if it is only ASCII data. (See this SQL Server Storage Engine blog post for some background information on Unicode Compression.)
(Since TEXT / NTEXT are deprecated, I shouldn't have to remind you that if you are still using these types, you should be phasing them out of your systems, in favor of VARCHAR(MAX) / NVARCHAR(MAX).)
As for proper lengths for certain types of data, this again goes back to your domain and business requirements. As an example, we always use VARCHAR(320) for e-mail address because, as per the standard, it is localpart (64 ASCII) + @ + domain (255 ASCII). This may change in the future as the standards evolve, but for now this is what we use and we haven't run into any problems yet. Some other ways we define data types for different string data:
|User-supplied proper names (FirstName, LastName)||NVARCHAR||64|
|System-supplied lookup data (DocumentTypeName)||VARCHAR||256|
|File-specific attributes (Extension)||CHAR||5|
By no means am I suggesting that these are the data types you should use for these scenarios; just giving you a glimpse at how we define our data types for our needs.
DATETIME / SMALLDATETIME / DATE / TIME
- I often see auditing columns (e.g. ModifiedDate) or columns that store schedule dates, and these columns use DATETIME when only to-the-minute or even to-the-day precision is necessary. This is using 8 bytes when 4 (or less) could work. Before SQL Server 2008, you basically had three choices: use DATETIME, SMALLDATETIME, or store the data using a sub-optimal data type (like VARCHAR, or the INTs you see in msdb's SQL Agent tables). For auditing tables, I tend to use SMALLDATETIME. I will typically need to know roughly what time a change happened, but I can't see ever needing to know to the nearest second. For a lot of things (like CreatedDate) just the date is sufficient, so in SQL Server 2008 and above, I will use DATE.
Use TIME to store points in time, but not durations. For example I have already seen people using TIME to store 3:10:00 to mean 3 hours, 10 minutes, 24 seconds, rather than 3:10 AM... this breaks down quickly when you have a duration longer than 23:59:59.9999999. For durations either use two TIME or SMALL/DATETIME columns to store the begin and end intervals, or store the duration as a numeric value representing units of time (such as seconds or minutes. Note that TIME and DATETIME2 take precision parameters, so you can dictate how may decimal places you can store after the seconds. Again be sensible about what you choose here, and note that UIs such as the Table Designer in SSMS will default to giving you the most precision (and therefore will utilize the most space even when you don't need it).
For some more reading on this, please see Tibor Karaszi's excellent article, The ultimate guide to the datetime datatypes, and the very comprehensive Books Online topics, Using Date and Time Data and Date and Time Functions.
INT / BIGINT / SMALLINT / TINYINT
Most commonly, I see two inappropriate uses of the INT data type:
- Using INT for the ID of even the smallest, system-defined lookup value - for example, EmployeeTypeID is an INT even though there are currently only three employee types and you can't envision there ever being more than five.
- Using INT for the identifier on a table that will grow very large (e.g. a log or transaction table), and then later realizing that maybe it will need to be a BIGINT (or not realizing this until new log entries generate overflow errors).
In general, our decisions are based on the following rules (and we always take care to account for gaps, particularly when values are supplied externally or via an IDENTITY property):
- If you know your domain will require at most 255 values, use TINYINT.
- If you need more than 255 values but less than, say, 30,000, use SMALLINT.
- If you need more than 30,000 values but less than 2 billion, use INT.
- If there is any chance you will exceed 2 billion values, just start off with BIGINT now.
Like some of the others cases, this can quickly become subjective; you can make a case for always using INTs for all identifiers, or using different data types for different types of identifiers to optimize storage / performance. Either case may work better for you, as long as you understand the trade-offs.
NUMERIC / DECIMAL / REAL / FLOAT / MONEY / SMALLMONEY
- In our systems, we use only DECIMAL (which is functionally equivalent to NUMERIC). I prefer the word DECIMAL over NUMERIC only because it is a more accurate description of the data ("numeric" is too vague - an INT is also numeric).
We don't use REAL or FLOAT because we do not model any data based on approximate representation. Now there are certainly places where using these types are valid, but all too often I see them used when a DECIMAL would have been better - and plenty of head scratching and hair pulling when they perform calculations and end up with far more decimal places than they expected.
I have yet to find an advantage of using the MONEY / SMALLMONEY types. They are less flexible than NUMERIC / DECIMAL, and all they seem to buy you, other than precision problems when using the values in certain calculations, is the ability to format numbers with thousand separators. Even with regional / language settings, you still have to put the currency symbol in there yourself. In my opinion, this is all the job of the front end and not the database anyway. Just store the raw numbers in the database using NUMERIC or DECIMAL, and put the pretty commas in later.
TIMESTAMP / ROWVERSION
- Ah, TIMESTAMP. What an unfortunate naming blunder you turned into. Sadly, most people who come across this data type assume that it has something to do with date or time. I can't count how many times I've had to inform people that the TIMESTAMP column in their table won't tell them when the row was last modified. This confusion, and the fact that the name conflicts with the SQL standard, has led to its deprecation (you can read about this in the rowversion Books Online topic). In our system, we always use the ROWVERSION data type, even though Management Studio and sys.types will still tell us that we used TIMESTAMP, and SQL Server Native Client does not know what ROWVERSION means. You can vote to fix these bugs by the way:
BIT / CHAR
- Another debate I see a lot is whether to use a BIT column to represent true/false or yes/no, or to use the text equivalents in a CHAR(1) (namely: T/F, Y/N) or a CHAR(3 or 5) ("True"/"False" or "Yes"/"No"). My personal preference is to use a BIT column. It is very natural to use something like WHERE Active = 1 as opposed to WHERE Active = 'T' or WHERE Active = 'True'; in addition, the 1 and 0 are universal, whereas a French-speaking developer will probably instinctively use V (for "Vrai") instead of T.
One of the reasons people have argued against BIT in the past is because Enterprise Manager and Management Studio make it difficult to index a BIT column. While you can index a BIT column, it is highly unlikely to be useful; rather, you should look into a new SQL Server 2008 feature, filtered indexes. A filtered index will allow you to create an index with a WHERE clause, so that if you have a high percentage of rows where Active = 0, but you often query for rows where Active = 1, it will use a very small index to locate your rows, instead of performing a table scan.
As an aside, for gender, we use a CHAR(1) and use M/F and indicate U for unknown. There is some advice out there to use the numeric-based standards for this (0,1,2,9) but we find this more cumbersome than it is worth. Specifically because we are not a health care provider and do not need to differentiate between "unknown because we don't know" and "unknown because they won't tell us."
I'll repeat once again that many of these specific points about data types are subjective and based primarily on my own opinions and experience. Getting back to the main point: while you may argue back and forth about whether a FirstName should be 32 or 64 characters, or whether a ModifiedDate auditing column should include seconds, the key takeaway here is that these columns should be represented by the exact same data type in all of the tables they appear. When a FirstName can be 50 characters in one table and only 32 in another, you are bound to run into some problems down the road, so try and fix those issues ahead of the curve.
- Review your current data type standards.
- Identify cases where you are using different data types for similar columns.
- Consider implementing a data dictionary and/or full schema documentation.
- Browse the following tips and other resources:
- Create a SQL Server Data Dictionary in Seconds using Extended Properties
- Building a SQL Server Data Dictionary with the Extended Properties and Third Party Tools
- Unicode Compression in SQL Server 2008 R2
- SQL Server 2008 Date and Time Data Types
- SQL Server Filtered Indexes - What They Are, How to Use and Performance Advantages
Last Updated: 2010-09-29
About the author
View all my tips