By: K. Brian Kelley | Last Updated: 2018-08-16 | Comments (1) | Database Design
I’m working to improve my SQL Server database development skills, specifically in the area of data modeling. I’ve seen some information on data normalization and the forms, however, I’d like to understand them better and why they are used.
Data normalization is where we seek to organize the data in such a way as to reduce duplication of data and prevent certain anomalies that can occur from inserts, updates, and deletes. Typically, we try to get data to at least what is 3rd normal form. However, before data can be considered to be organized into 3rd normal form, it must first meet 1st and 2nd normal form. In this tip we’ll take a look specifically at 1st normal form.
Let’s start with a snapshot of student data. Note that this isn’t the whole table. You can see this because we have a column labeled C1Name and another column with the title C2Name. As you might guess, there will be other columns for times and locations as well as other sets of columns to reflect additional classes. However, this snapshot gives us an idea of the issues we face with a raw set of data:
Let’s focus in on Major specifically. Note that we see two forms of Mechanical Engineering listed, one with the full name and one where Mechanical is abbreviated. This is a situation we’re trying to avoid, as the major name isn’t consistent between the two different students.
Basically, when we have repeating data entries, we have the potential for an issue. With the majors, we see the repeat within a particular column. But it could also be across columns, like with respect to classes:
Note that we see several classes spanned across different sets of columns. We are also concerned about this situation because we’ve introduced the possibility of data inconsistency because of the way the data is “structured.” Given that we now understand the issues these types of data structures cause, let’s talk about the types of anomalies we’re solving by performing data normalization.
Imagine that we’re adding a new student. However, we don’t have all of the student’s information. Let’s say that major is required. We can’t add the student. Or perhaps a class is being offered for the first time. If this “table” is all we have, there’s a problem. Until a student signs up for the class, how do we know it exists? This is an insert anomaly. Until we have all the required information, we can’t insert into it into our data structure.
The second situation is when we’re updating data. Let’s say that the History 201 meeting location has changed. It’s entirely possible the change is only captured on some of the rows, especially since we’re also repeating the classes as separate columns. As a result, one student’s record may show History 201 still meeting at Green 221. However, it should have been updated to show that History 201 is now meeting in the new location. Because there are multiple updates which have to be performed, it’s easy to miss one or more. This is an update anomaly.
Finally, we have delete anomalies. Imagine that French 301 wasn’t able to be offered this semester. We’ll remove it from the students that were signed up for that class, but just as the case with a class that has just been added, how do we know that the class even exists? It isn’t scheduled for this semester, but if we were to look at the data, we’d have no knowledge that it is even a potential offering.
First Steps to Preventing These Anomalies – First Normal Form
That leads us to data normalization. We want to prevent these anomalies. Here’s what First Normal Form requires:
- Each column should be atomic (no repeating values).
- There shouldn’t be repeating columns, either.
- Each set of related data should be reflected in its own table.
- Each set should allow us to unique identify each record. Either we identify or assign this unique identity, which we call the primary key.
Let’s take a step back and see where we have issues:
- The major column has multiple values for some students.
- Classes are represented by repeating columns.
- We should actually have Majors, Minors, and Classes in their own data tables. One could argue that we could break down Classes even further, but for now, we’ll stop at the concept of Classes, because further normalization will cause us to look at how we should break things down further.
- We don’t have anything in the current data which helps us uniquely identify each student. After all, we could have 2 John Does. So, we’ll need something else. A lot of school's handle this with a student ID number.
This should get us to something like:
At this point we’re not showing any relationships between the tables. Therefore, we don’t have any join tables for a 1-to-many or many-to-many relationship. Case in point: we know there’s a case where a student has more than 1 major. That’s where we’d need to create a join table that associates the primary key from the Student table with the primary key from the Major table. When you’re concerned with data modeling, those tables are important. However, we aren’t required to do so to put the data into First Normal Form.
As far as Primary Keys are concerned, we’re making the assumption that ClassName, ClassTimeSlot, and ClassLocation represent a unique combination for which we can identify the individual Class record. After all, we’d never expect two classes with the same name to meet in the same location at the exact same time. This is a bit unwieldy. This is a reason why you’ll see a column created to handle easy lookups and identification of records. Typically, this is some sort of integer datatype. Since we’re focusing on data normalization here, we’ll not get into when this should be done and stick with the “natural” columns (meaning they are already part of the data) to define the primary key.
What Have We Accomplished by Going to First Normal Form?
First Normal Form helps us with some of the anomalies, but not all of them. Let’s look at them each in turn.
Insert Anomalies: We’ve solved this in most of the cases, but the Class table is still going to give us issues. After all, the case of a class that isn’t being offered this term won’t show up in that table. There’s no way to insert the existence of such a class. We’ll address this further as we look at Second and Third Normal Form in a later tip.
Update Anomalies: This one is solved. Since we only have atomic values, we only have to make one update to correct any bit of information.
Delete Anomalies: Unfortunately, we suffer the same problem as with insert anomalies. If I were to delete a class offering, and there are no other scheduled meetings for that class, then we no longer see the class in the data. Again, this will be solved as we further normalize.
- Read a summary of logical database design.
- Walk through a database redesign to bring about a solid data model.
Last Updated: 2018-08-16
About the author
View all my tips