By: K. Brian Kelley | Updated: 2011-06-28 | Comments (2) | Related: 1 | 2 | More > Database Design
I've been asked to be the data modeler for a new application and I've not done this before. I understand the idea of getting the entities and how they connect together from customer interviews, but how do I apply the normalization rules to my data model? Also, why are they important? Check out this tip to learn more.
When we apply normalization rules to our database models, we're looking to preserve data integrity. Normalization does this in two ways:
- It reduces the duplication of data in a data model. Imagine if we have the same bit of information stored in two places in our data model and we only change the data in one place. We now have a data integrity issue.
- It seeks to ensure that data in a table is properly related to other data in that table. For instance, information in the Orders table should only be based on orders. We shouldn't have an employee's salary information in the Orders table. Having the data in the right place means we're more likely to update it properly.
There are several normal forms, but for this tip we'll focus on first normal form. Subsequent tips will expand with the other typically used normal forms. First normal form has two requirements:
- Data is stored in an atomic manner.
- One or more columns in the table uniquely identify any particular row in the table for any given row.
The idea here is that data is stored in a way where there aren't groups of data in a column (for instance, multiple items in an order in a single column) or there aren't multiple columns of the same thing in a table. let's take a look at two examples:
Groups of Data in a Single Column
Take the following example of code:
-- First example where a column contains more than 1 value CREATE TABLE dbo.NotAtomic_Ver1 ( OrderDate SMALLDATETIME NOT NULL, CustomerName VARCHAR(100) NOT NULL, OrderItems VARCHAR(255) NOT NULL, OrderTotal DECIMAL(8, 2) NOT NULL ); GO INSERT INTO dbo.NotAtomic_Ver1 (OrderDate, CustomerName, OrderItems, OrderTotal) VALUES (GETDATE(), 'John Doe', '5 T-Shirts XL, 4 Khaki Pants XL', 130.00); INSERT INTO dbo.NotAtomic_Ver1 (OrderDate, CustomerName, OrderItems, OrderTotal) VALUES (GETDATE(), 'John Doe', '3 T-Shirts M, 2 Walking Shorts L', 50.00); GO SELECT * FROM dbo.NotAtomic_Ver1; GO
Here we've created a table for Orders and one of the columns contains everything that was part of that order. Visually, we'd see it like so:
One of the problems here is what happens if we have to change the order? What if the person only wanted a single t-shirt? Or what if we wanted to search for every purchase of medium-sized t-shirts? Both types of operations are more difficult than they should be because of the way the data is stored. We're not just contending with t-shirts, but also pants or shorts. This is an example where we see groups of data in a single column. If this is the way our data is modeled, it's not in first normal form.
Multiple Columns of the Same Thing
Perhaps we've learned our lesson and we don't want groups of data in a single column. Therefore, we break the order items up into multiple columns, say 2, because we've never had an order consisting of more than 2 different items. The code that would create such a construct is this:
-- A second attempt where there are multiple columns of the same thing CREATE TABLE dbo.NotAtomic_Ver2 ( OrderDate SMALLDATETIME NOT NULL, CustomerName VARCHAR(100) NOT NULL, OrderItem1 VARCHAR(255) NOT NULL, OrderItem2 VARCHAR(255) NOT NULL, OrderTotal DECIMAL(8, 2) NOT NULL ); GO INSERT INTO dbo.NotAtomic_Ver2 (OrderDate, CustomerName, OrderItem1, OrderItem2, OrderTotal) VALUES (GETDATE(), 'John Doe', '5 T-Shirts XL', '4 Khaki Pants XL', 130.00); INSERT INTO dbo.NotAtomic_Ver2 (OrderDate, CustomerName, OrderItem1, OrderItem2, OrderTotal) VALUES (GETDATE(), 'John Doe', '3 T-Shirts M', '2 Walking Shorts L', 50.00); GO SELECT * FROM dbo.NotAtomic_Ver2; GO
Visually, if we were to look at it, here's what we would have:
Hopefully you are already spotting some issues. For instance, what happens if I want 3 items for a particular order? Now I'm stuck, because the table only supports 2. Or what if I wanted to search for all khaki pants that were ordered? I have to search both sets of columns to get this information. If I expanded the table to have even more columns to support the ability to have more items on a particular order, then I would have to increase all my searches accordingly. It's a mess!
Uniquely Identifying a Particular Row
Another issue with either table is how do you identify a particular order? For instance, for the two rows of data we have, I might be able to identify a particular order by looking at the time the order came in and who the order was for, but as can be seen here, the orders look like they came in at the same time.
So I can't exactly identify which order is which. We could go and include every column, but while that happens on occasion, it's not how things tend to work in the real world. Typically we'd have an OrderID or Order Number that separately identifies each order. That can be built into the data model.
We know we need to identify each row uniquely. If we have a column for Order ID, that would take care of that issue. In cases where there isn't something actual that we can use (imagine if the store didn't actually track by an Order ID or Order Number), and we have to make something up to be used, we call this a surrogate key. If we have one or more columns that can be used to uniquely identify a row, we call that set of 1 or more columns a candidate key. When we actually select either a surrogate key or a candidate key to be used to uniquely identify the row, that becomes known as the primary key for the table.
Putting It All Together
The first thing we have to do is break up the groups of data or get rid of the multiple columns consisting of the same type of data. The solution here is to create a separate table to track the order items. That means we'll pull the items out of the main order table. Now, if we can't identify orders uniquely, we're hurting here, because tying that second table back to the order table is crucial. And that means we have to tie order items back to their specific orders. So we'll also need to introduce a surrogate key, which we'll call OrderID. Putting all that together, we'd get the following bit of code:
-- Tables in First Normal Form -- Normally we'd use constraints to further enforce data integrity -- But that's beyond the scope of this article -- Table for orders CREATE TABLE dbo.Order_1NF ( OrderID INT NOT NULL, OrderDate SMALLDATETIME NOT NULL, CustomerName VARCHAR(100) NOT NULL, OrderTotal DECIMAL(8, 2) NOT NULL ); GO -- Table for the items of those orders CREATE TABLE dbo.OrderItem_1NF ( OrderID INT NOT NULL, ItemQuantity INT NOT NULL, ItemName VARCHAR(255) NOT NULL ); GO INSERT INTO dbo.Order_1NF (OrderID, OrderDate, CustomerName, OrderTotal) VALUES (1, GETDATE(), 'John Doe', 130.00); INSERT INTO dbo.OrderItem_1NF (OrderID, ItemQuantity, ItemName) VALUES (1, 5, 'T-Shirt XL'); INSERT INTO dbo.OrderItem_1NF (OrderID, ItemQuantity, ItemName) VALUES (1, 4, 'Khaki Pants XL'); INSERT INTO dbo.Order_1NF (OrderID, OrderDate, CustomerName, OrderTotal) VALUES (2, GETDATE(), 'John Doe', 50.00); INSERT INTO dbo.OrderItem_1NF (OrderID, ItemQuantity, ItemName) VALUES (2, 3, 'T-Shirt M'); INSERT INTO dbo.OrderItem_1NF (OrderID, ItemQuantity, ItemName) VALUES (2, 2, 'Walking Shorts L'); GO SELECT * FROM dbo.Order_1NF; GO SELECT * FROM dbo.OrderItem_1NF; GO
It is certainly a longer bit of code, but when you run the SELECT statements, you'll see how everything is cleared up. We've solved the issue of identifying the rows uniquely for orders. A candidate key for order items is the combination of OrderID and ItemName. So both tables meet the requirement of being able to identify rows uniquely. Also, we've eliminated groups of data or columns having the same type of information in them. Therefore, this data model is in first normal form.
Is our job done? No, not yet. There are still some things that we'd want to do to make things better. For instance, note that the customer name is stored in the table for orders. What happens if the customer changes his or her legal name? Then we need to make sure we update every order the customer has ever placed. It would be better to get that information into a separate table and only be required to make a single update. These are the types of considerations that will bring us to a discussion of the other normal forms in later tips.
- As you begin to design your databases, consider the points made in this tip.
- Stay tuned for additional normalization tips.
- Check out these related tips:
- Database Design Tips
Last Updated: 2011-06-28
About the author
View all my tips