Monday, March 13, 2017

What Is Normalization, Part I

Why Normalization?

What is normalization? Normalization is a methodology for removing redundant data from a database WITHOUT losing information.

There are many ways to represent data. Some of the most common are spreadsheets, flat files, and relational databases. Each of these has its own advantages and disadvantages.

For instance, in a spreadsheet, it's acceptable to represent the data like this:

Figure 1

This is perfectly readable to a human. It is obvious that the row beneath Gina Fawn also belongs to her. Unfortunately, this is impossible for a database to interpret. The database must have all the information it needs on a single row. It cannot refer to another row to get the information it needs.

One way to correct this would be to fill in the missing information.

Figure 2

Unfortunately, this requires storing a lot of redundant data. What's the big deal? It's only a couple of fields, right? But that's only in the example shown. What if we were storing all of the demographic data (name, address, phone, city, state, etc.) for a lot of people? This would waste a lot of storage capacity.

But wasted storage is not the worst problem. What if the SSN of Gina Fawn's first record was changed to 215-87-7854? Perhaps this was through operator error or maybe a programmatic update. It doesn't matter how; the data has been changed. Now, which SSN is really Gina's? The database has no way of knowing. Worse still, the SSN matches Tony Jones's. So, does that SSN represent Gina or Tony? Again, there is no way to know.

This same problem holds true for all the fields which hold redundant data. This is called a Data Anomaly error. Once you start having data anomalies, you cannot trust the integrity of your database.
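To make the anomaly concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, dates, salaries, and Gina's original SSN are all made up for illustration (only Tony's SSN comes from the example above), so treat it as one possible reconstruction of Figure 2 rather than the actual data.

```python
import sqlite3

# Hypothetical flat table in the spirit of Figure 2: every row repeats the
# person's name and SSN. All values except Tony's SSN are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salary_history ("
    "row_id INTEGER PRIMARY KEY, name TEXT, ssn TEXT, "
    "effective_date TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO salary_history (name, ssn, effective_date, salary) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Gina Fawn", "000-00-0001", "2015-01-01", 50000),
        ("Gina Fawn", "000-00-0001", "2016-01-01", 55000),
        ("Tony Jones", "215-87-7854", "2015-06-01", 48000),
    ],
)

# The bad update: only Gina's first row is changed.
conn.execute("UPDATE salary_history SET ssn = '215-87-7854' WHERE row_id = 1")


def ssns_for(name):
    """Return every distinct SSN stored against a name."""
    rows = conn.execute(
        "SELECT DISTINCT ssn FROM salary_history WHERE name = ? ORDER BY ssn",
        (name,),
    )
    return [row[0] for row in rows]


def names_for(ssn):
    """Return every distinct name stored against an SSN."""
    rows = conn.execute(
        "SELECT DISTINCT name FROM salary_history WHERE ssn = ? ORDER BY name",
        (ssn,),
    )
    return [row[0] for row in rows]


gina_ssns = ssns_for("Gina Fawn")
owners = names_for("215-87-7854")
print(gina_ssns)
print(owners)
```

After the update, Gina has two different SSNs on file, and the SSN 215-87-7854 belongs to two different people. Nothing in the table says which row is right.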

One way to solve this problem is with Repeated Columns. This is a common solution in spreadsheets. With repeated columns, the non-redundant information (like row two in Figure 1) is stored as additional columns on the same row. We get something like this:

Figure 3

Now we don't have problems with redundancy, but we have new ones. First of all, we have to decide how many repeated columns to create. In Figure 3, I only show one salary increase for Gina and Tony, but is that reasonable? What if Gina has five wage increases and Tony has seven? Is seven sets of columns enough? Do I cap it at the largest record? Or do I add more columns to accommodate growth? If so, how many?

Secondly, such a table structure requires a lot of manual modification and becomes untenable when you have a lot of data. Perhaps instead of just the date and salary, we are also storing the job description, pay grade, status, and so forth. The structure would become so large and unruly that it would be impossible to maintain.
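Here is a rough sketch of what that maintenance burden looks like, again with Python's sqlite3 module. The employee table, its date_N/salary_N column naming, the record_raise helper, and every value are assumptions of mine for illustration; Figure 3 may be laid out differently.

```python
import sqlite3

# Hypothetical repeated-columns table: two date/salary slots per person.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (name TEXT, ssn TEXT, "
    "date_1 TEXT, salary_1 INTEGER, date_2 TEXT, salary_2 INTEGER)"
)
conn.execute(
    "INSERT INTO employee VALUES "
    "('Gina Fawn', '000-00-0001', '2015-01-01', 50000, '2016-01-01', 55000)"
)


def column_names():
    """List the table's current columns (name is field 1 of table_info)."""
    return [row[1] for row in conn.execute("PRAGMA table_info(employee)")]


def record_raise(name, date, salary):
    """Store a raise in the first empty slot, widening the table if needed."""
    slots = sum(1 for col in column_names() if col.startswith("salary_"))
    for i in range(1, slots + 1):
        (current,) = conn.execute(
            f"SELECT salary_{i} FROM employee WHERE name = ?", (name,)
        ).fetchone()
        if current is None:
            break  # slot i is free
    else:
        # Every slot is full: the table itself has to change shape.
        i = slots + 1
        conn.execute(f"ALTER TABLE employee ADD COLUMN date_{i} TEXT")
        conn.execute(f"ALTER TABLE employee ADD COLUMN salary_{i} INTEGER")
    conn.execute(
        f"UPDATE employee SET date_{i} = ?, salary_{i} = ? WHERE name = ?",
        (date, salary, name),
    )


columns_before = len(column_names())
record_raise("Gina Fawn", "2017-01-01", 60000)
columns_after = len(column_names())
print(columns_before, columns_after)
```

One more raise for one person forces a schema change that every other row now carries as empty columns. Add job description, pay grade, and status to each slot and the table grows by that many columns every time.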

Fortunately, Normalization comes to the rescue. In What Is Normalization, Part II, I'll explain how. Thank Codd.

