Friday, September 29, 2017

Normalizing City, State, and Zip


Recently, I ran into a question on the internet about normalization that I thought was worth repeating.

THE QUESTION:

 I'm toying with the idea of starting a new project, so I'm in brainstorming mode for table design. I'll be recording customer information in this application. Typical stuff: First and Last Names, Company, Street, Apt, City State and Zip, Phone number(s) and extensions, E-mail.

How do you guys recommend setting up the tables for City State and Zip? I was thinking that I would have:

TBL_General_State
PKStateID
StateAbbr (Limited to 2 letters)
StateName


TBL_General_City
PKCityID
FKStateID (Lookup to TBL__State)
CityName


TBL_General_Zip
PKZipID
FKCityID (Lookup to TBL__City)
ZipCode


My customer information then would record only the zip code (PKZipID). And I could then use queries for the state, city, and zip information for forms, reports, etc.

Or is this beyond overkill?

ANSWER:

 By strict normalization theory, having City, State, and Zip in the same table violates the 3rd Normal Form because there are functional dependencies between those fields. However, functional dependencies are not all the same. There are strong dependencies and weak dependencies.

A strong dependency is one in which the value of a dependent field MUST be changed if another field is changed. For instance, suppose I have Quantity, Price, and ExtendedPrice, where ExtendedPrice is a calculation of the other two. If I change either Quantity or Price, the ExtendedPrice MUST be changed.

A weak dependency is one in which the value of a dependent field MAY be changed if another field is changed. City, State, and Zip are examples of weak dependencies. If I change a person's city, I may not have to change their state. They may have moved within the same state. Likewise, if I change the state, I may not have to change the city. There is, after all, a Grand Rapids, Michigan and Grand Rapids, Minnesota. The relationship between city and zip is even more complicated.

Now, it is possible to represent these fields in a fully normalized fashion, but I contend that it is more trouble for very little gain. There are two main reasons for normalizing data: minimizing redundant data and maximizing data integrity. Both of these can be achieved by using lookup tables for City and State without trying to represent the relationship between the two. A zip code could be mis-typed, of course, but it could also be mis-selected from a list, so to my mind there's no real reason to have a lookup table for Zip.
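As a rough illustration of that layout, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (State, City, Customer, and so on) and the sample rows are my own assumptions for the example, not part of the original question. City and State are each validated against their own lookup table, but nothing ties a city to a state, and Zip is plain text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: two independent lookup tables, no City-to-State relationship.
conn.executescript("""
    CREATE TABLE State (
        StateAbbr TEXT PRIMARY KEY CHECK (length(StateAbbr) = 2),
        StateName TEXT NOT NULL
    );
    CREATE TABLE City (
        CityName TEXT PRIMARY KEY
    );
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        LastName   TEXT NOT NULL,
        City       TEXT NOT NULL REFERENCES City(CityName),
        State      TEXT NOT NULL REFERENCES State(StateAbbr),
        Zip        TEXT NOT NULL  -- free text, no lookup table
    );
""")

conn.executemany("INSERT INTO State VALUES (?, ?)",
                 [("MI", "Michigan"), ("MN", "Minnesota")])
conn.execute("INSERT INTO City VALUES ('Grand Rapids')")

def state_is_accepted(state):
    """Try to add a Grand Rapids customer in the given state (sample values only)."""
    try:
        conn.execute(
            "INSERT INTO Customer (LastName, City, State, Zip) "
            "VALUES ('Sample', 'Grand Rapids', ?, '00000')",
            (state,),
        )
        return True
    except sqlite3.IntegrityError:
        return False

print(state_is_accepted("MI"))  # a known state is accepted
print(state_is_accepted("MN"))  # the same city in a different state is also accepted
print(state_is_accepted("MM"))  # a mistyped state is rejected by the lookup
```

The one City row serves both Michigan and Minnesota customers, which is the point: the lookups catch misspellings without modeling which city belongs to which state.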

If you did normalize these fields, you could have a selection process that would present all possible combinations of values once you selected the City. For instance, if you had a combo box for City, you could have cascading combo boxes to select only the appropriate States and Zip codes. But it would be just as easy to select the wrong value from this list as it would be from independent lookup tables. And, of course, you'd have to create and maintain these relationships.

Therefore, normalizing City, State, and Zip adds a complication to your data model for very little gain, and in my opinion, is a good example of when to denormalize.


Wednesday, September 20, 2017

The Normal Forms: In a Nutshell


In this series, I have tried to explain in non-mathematical terms what the first three Normal Forms mean and how they determine database design.

This is not the most useful method of learning normalization. In fact, many expert developers never learn the formal definition of the normal forms. If you haven't already, I suggest you read the following series:

Normalization
ER Diagramming

However, I think it is useful to know what the Normal Forms are. Sometimes when you get stuck in a design, you can go back to the definitions to get yourself out of trouble.

So, in summary:

First Normal Form (1NF) says that each record must be unique, that is, it has a primary key. There are some additional restrictions on how such uniqueness is maintained such as not allowing positional referencing and no repeated columns.

Second Normal Form (2NF) says that each field in the record must depend on the whole primary key, not just a part of it.

Third Normal Form (3NF) says that no field may depend on any field other than the primary key.

William Kent, author of A Simple Guide to Five Normal Forms in Relational Database Theory, once abbreviated the first three normal forms like this:

"The Key, the whole Key, and nothing but the Key, so help me Codd."

Wednesday, September 13, 2017

The Normal Forms: Third Normal Form

Last time, in The Normal Forms: Second Normal Form, I discussed how to remove redundant data by identifying fields which are not functionally dependent on the entire primary key. Figure 1 shows the results.

Figure 1: Order table Decomposed into Orders and Order Details


This corrected some of the anomalies in my data. However, data anomalies are still possible under 2NF. To prevent these anomalies, I need an additional rule: Third Normal Form (3NF).

Definition

A table is said to be in Third Normal Form (3NF) if:

  1. It is in Second Normal Form and
  2. All non-key fields are mutually independent, that is, all fields are functionally dependent ONLY on the primary key field(s).

Implications

3NF corrects two main sources of data anomalies: 1) redundant data and 2) calculated values.

Redundant Data

Although I removed some of the redundant data when I split the Order table into Orders and OrderDetails, there is still some redundancy left, namely ProductNum and Item. Both of these fields are dependent on the entire primary key, so they comply with 2NF. However, the ProductNum and Item fields are mutually dependent, that is, they depend upon each other. The product number determines the item description and the item description determines the product number.

Just as we saw in 2NF, redundancy can lead to inconsistent data being entered into the database or correct information being changed after the fact. Figure 2 shows some data anomalies possible under 2NF as a result of redundant data.

Figure 2: 2NF Data Anomalies Due to Redundant Data


Product A7S has two different items associated with it: either a wrench or a nail. Which is it?

Also, two product numbers (B7G and B7H) are associated with an Item called "saw". Is this the same saw or not?
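Anomalies like these can be found mechanically by checking whether one field really determines the other. Below is a minimal sketch in plain Python. The `violations` helper is a hypothetical name of my own, and the sample rows contain only the values mentioned above, not the full contents of Figure 2.

```python
from collections import defaultdict

def violations(rows, determinant, dependent):
    """Return each determinant value that maps to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

# Illustrative rows only: the product numbers and items named in the text.
order_details = [
    {"ProductNum": "A7S", "Item": "wrench"},
    {"ProductNum": "A7S", "Item": "nail"},
    {"ProductNum": "B7G", "Item": "saw"},
    {"ProductNum": "B7H", "Item": "saw"},
]

# Does ProductNum determine Item? A7S says no.
print(violations(order_details, "ProductNum", "Item"))  # {'A7S': ['nail', 'wrench']}

# Does Item determine ProductNum? "saw" says no.
print(violations(order_details, "Item", "ProductNum"))  # {'saw': ['B7G', 'B7H']}
```

A check like this only reports the conflicts. It cannot say which value is the right one, which is why the redundancy should be designed out rather than cleaned up afterward.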

Calculated Values

Mutual dependency is also an issue with storing calculated values. Suppose I had a Quantity and a Price field and I decided to calculate the ExtendedPrice by multiplying them and then store the result. Storing a calculated value like this is a common database error made by novices.

The problem is one of dependency. The Extended Price calculation depends on the Quantity and Price fields for its value. 3NF says that no field should depend on any field except those making up the primary key.

If I store that calculation and later go back and change one of the fields it depends on (either the Quantity or the Price), the stored value will be incorrect. Figure 3 shows some calculated values anomalies.

Figure 3: Anomalies with Calculated Values


First of all, if the user is manually calculating and typing in the value of the Extended Price, the value could be anything, even a calculation from a different row. So let's assume I have an automated process, a formula in a form which calculates the value.

The problem is that you must depend on programming to maintain your data integrity, not the database itself. If the integrity is maintained at the database level, it cannot be subverted.

In the case of the table above, the first anomalous record was caused by changing the Quantity from 1 to 2 after the fact. But because I didn't have a process to re-calculate the value if Quantity changed, the Extended Price is now wrong.

In the second case, an Update Query was used to raise the price of nails from $0.09 to $0.10. Unfortunately, the query did not include a new calculation, so all Extended Price calculations for nails are now wrong.

Implementation

The problem of calculated values is a simple one to solve. Don't. As a general rule, I just don't store calculations. There are minor exceptions, but in most cases, I'll be safe by just leaving them out. When I need these values, I'll calculate them as output in either a query, form, or report.
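To make the "calculate it as output" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column layout and the sample price are assumptions of mine, since the figures are not reproduced here. ExtendedPrice is never stored, so it cannot go stale when Quantity changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout: Quantity and Price are stored, ExtendedPrice is not.
conn.execute(
    "CREATE TABLE OrderDetails ("
    "OrderNum INTEGER, ProductNum TEXT, Quantity INTEGER, Price REAL)"
)
conn.execute("INSERT INTO OrderDetails VALUES (1, 'A7S', 1, 2.50)")  # sample values

def extended_prices():
    """Compute ExtendedPrice at query time instead of reading a stored copy."""
    return conn.execute(
        "SELECT OrderNum, ProductNum, Quantity * Price AS ExtendedPrice "
        "FROM OrderDetails"
    ).fetchall()

before = extended_prices()

# Change the Quantity after the fact, as in the first anomaly.
conn.execute("UPDATE OrderDetails SET Quantity = 2 WHERE OrderNum = 1")
after = extended_prices()

print(before)  # [(1, 'A7S', 2.5)]
print(after)   # [(1, 'A7S', 5.0)]
```

No recalculation step is needed after the update, because the only copies of the data are the two fields the calculation depends on.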

As with 2NF, the solution to redundant data is to remove it to a separate table, leaving one field to join back to the original. In this case, the ProductNum, Item, and Price fields will go into the Products table. I'll leave ProductNum in the Order Detail table to maintain the relationship. Figure 4 is the result.

Figure 4: Decomposing Order Details to Remove Redundant Data


So now I've removed as much redundant data as possible. There's still a little left. There always will be in order to maintain the relationships between tables. But none of the redundancy will result in data anomalies, so I can say with confidence that my tables are now normalized to Third Normal Form. Figure 5 shows the final design.
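The decomposition can be sketched with Python's built-in sqlite3 module. This is an approximation under stated assumptions, not the exact design in the figures: the DetailID key, the column types, and the sample prices are hypothetical. With ProductNum as the primary key of Products, the database itself refuses a second description for product A7S.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
    CREATE TABLE Products (
        ProductNum TEXT PRIMARY KEY,   -- one row, and so one Item, per product number
        Item       TEXT NOT NULL,
        Price      REAL NOT NULL
    );
    CREATE TABLE OrderDetails (
        DetailID   INTEGER PRIMARY KEY,  -- hypothetical surrogate key for this sketch
        OrderNum   INTEGER NOT NULL,
        ProductNum TEXT NOT NULL REFERENCES Products(ProductNum),
        Quantity   INTEGER NOT NULL
    );
""")

conn.execute("INSERT INTO Products VALUES ('A7S', 'wrench', 2.50)")  # sample price
conn.execute(
    "INSERT INTO OrderDetails (OrderNum, ProductNum, Quantity) VALUES (1, 'A7S', 2)"
)

# Trying to give A7S a second description now fails at the database level.
try:
    conn.execute("INSERT INTO Products VALUES ('A7S', 'nail', 0.10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Item and ExtendedPrice are recovered with a join when they are needed.
rows = conn.execute("""
    SELECT d.OrderNum, p.Item, d.Quantity * p.Price AS ExtendedPrice
    FROM OrderDetails AS d
    JOIN Products AS p ON p.ProductNum = d.ProductNum
""").fetchall()

print(duplicate_rejected)  # True
print(rows)                # [(1, 'wrench', 5.0)]
```

The primary key alone does not stop two product numbers from sharing the description "saw". If that should be forbidden, a UNIQUE constraint on Item would be one way to express it.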

Figure 5: Final Design


In my next and final post, The Normal Forms: In A Nutshell, I'll wrap it all up.
