Friday, September 29, 2017
Normalizing City, State, and Zip
Recently, I ran into a question on the internet about normalization that I thought would be good to repeat.
THE QUESTION:
I'm toying with the idea of starting a new project, so I'm in brainstorming mode for table design. I'll be recording customer information in this application. Typical stuff: First and Last Names, Company, Street, Apt, City State and Zip, Phone numbers(s) and extensions, E-mail.
How do you guys recommend setting up the tables for City State and Zip? I was thinking that I would have:
TBL_General_State
PKStateID
StateAbbr (Limited to 2 letters)
StateName
TBL_General_City
PKCityID
FKStateID (Lookup to TBL__State)
CityName
TBL_General_Zip
PKZipID
FKCityID (Lookup to TBL__City
ZipCode
My customer information then would record only the zip code (PKZipID). And I could then use queries for the state, city, and zip information for forms, reports, etc.
Or is this beyond overkill?
ANSWER:
By strict normalization theory, having City, State, and Zip in the same table violates the 3rd Normal Form because there are functional dependencies between those fields. However, functional dependencies are not all the same. There are strong dependencies and weak dependencies.
A strong dependency is one in which the value of a dependent field MUST be changed if another field is changed. For instance, suppose I have Quantity, Price, and ExtendedPrice, where ExtendedPrice is a calculation of the other two. If I change either Quantity or Price, the ExtendedPrice MUST be changed.
A weak dependency is one in which the value of a dependent field MAY be changed if another field is changed. City, State, and Zip are examples of weak dependencies. If I change a person's city, I may not have to change their state. They may have moved within the same state. Likewise, if I change the state, I may not have to change the city. There is, after all, a Grand Rapids, Michigan and Grand Rapids, Minnesota. The relationship between city and zip is even more complicated.
Now, it is possible to represent these fields in a fully normalized fashion, but I contend that it is more trouble for very little gain. There are two main reasons for normalizing data: minimize redundant data and maximize data integrity. Both of these can be achieved by using lookup tables for City and State without trying to represent the relationship between the two. A zip code could be mis-typed, of course, but it could also be mis-selected from a list, so to my mind there's no real reason to have a lookup table.
If you did normalize these fields, you could have a selection process that would present all possible combinations of values if you selected the City. For instance, if you had a combo box for City, you could have cascading combo boxes to select only the appropriate States and Zip codes. But it would be just as easy to mis-select the right value from this list as it would be to mis-select from independent lookup tables. And, of course, you'd have to create and maintain these relationships.
Therefore, normalizing City, State, and Zip adds a complication to your data model for very little gain, and in my opinion, is a good example of when to denormalize.
.
Wednesday, September 20, 2017
The Normal Forms: In a Nutshell
In this series, I have tried to explain in non-mathematical terms what the first three Normal Forms mean and how they determine database design.
This is not the most useful method of learning normalization. In fact, many expert developers never learn the formal definition of the normal forms. If you haven't already, I suggest you read the following series:
Normalization
ER Diagramming
However, I think it is useful to know what the Normal Forms are. Sometimes when you get stuck in a design, you can go back to the definitions to get yourself out of trouble.
So, in summary:
First Normal Form (1NF) says that each record must be unique, that is, it has a primary key. There are some additional restrictions on how such uniqueness is maintained such as not allowing positional referencing and no repeated columns.
Second Normal Form (2NF) says that each field in the record must depend on the whole primary key, not just a part of it.
Third Normal Form (3NF) says that no field must depend on any other field except the primary key.
William Kent, author of A Simple Guide to Five Normal Forms in Relational Database Theory, once abbreviated the first three normal forms like this:
"The Key, the whole Key, and nothing but the Key, so help me Codd."
.
Wednesday, September 13, 2017
The Normal Forms: Third Normal Form
Figure 1: Order table Decomposed into Orders and Order Details

This corrected some data anomaly errors in my data, however, data anomalies are still possible under 2NF. To prevent these anomalies, I need an additional rule: Third Normal Form (3NF).
Definition
A table is said to be in Third Normal Form (3NF) if:
- It is in Second Normal Form and
- If all non-key fields are mutually independent, that is, all fields are functionally dependant ONLY on the primary key field(s).
Implications
There are two main sources of data anomalies that 3NF corrects are 1) Redundant Data and 2) Calculated fields.
Redundant Data
Although I removed some of the redundant data when I split the Order table into Orders and OrderDetails, there is still some redundancy left, namely ProductNum and Item. Both of these fields are dependant on the entire primary key, so they comply with 2NF. However The ProductNum and Item fields are mutually dependant, that is, they depend upon each other. The product number determines the item description and the item description determines the product number.
Just as we saw in 2NF, redundancy can lead to inconsistent data being entered into the database or correct information being changed after the fact. Figure 2 shows some data anomalies possible under 2NF as a result of redundant data.
Figure 2: 2NF Data Anomalies Due to Redundant Data

Product A7S has two different items associated with it: either a wrench or a nail. Which is it?
Also, two product numbers (B7G and B7H) are associated with an Item called "saw". Is this the same saw or not?
Calculated Values
Mutual dependency is also an issue with storing calculated values. Suppose I had a Quantity and Price field and I decided to calculate the ExtendedPrice by multiplying the them. This is a common database error made by novices.
The problem is one of dependency. The Extended Price calculation depends on the Quantity and Price fields for its value. 3NF says that no field should depend on any field except those making up the primary key.
If I store that calculation and later go back and change one of the dependant fields (either the Quantity or the Price), my calculation will be incorrect. Figure 3 shows some calculated values anomalies.
Figure 3: Anomalies with Calculated Values

First of all, if the user is manually calculating and typing in the value of the Extended Price, the value could be anything, even a calculation from a different row. So let's assume I have an automated process, a formula in a form which calculates the value.
The problem is that you must depend on programming to maintain your data integrity, not the database itself. If the integrity is maintained at the database level, it cannot be subverted.
In the case of the table above, the first anomalous record was caused by changing the Quantity from 1 to 2 after the fact. But because I didn't have a process to re-calculate the value if Quantity changed, the Extended Price is now wrong.
In the second case, an Update Query was used to raise the price of nails was raised from $0.09 to $0.10. Unfortunately, the query did not include a new calculation, so all Extended Price calculations for nails are now wrong.
Implementation
The problem of calculated values is a simple one to solve. Don't. As a general rule, I just don't store calculations. There are minor exceptions, but in most cases, I'll be safe by just leaving them out. When I need these values, I'll calculate them as output in either a query, form, or report.
As with 2NF, the solution to redundant data is to remove it to a separate table, leaving one field to join back to the original. In this case, the ProductNum, Item, and Price fields will go into the Products table. I'll leave ProductNum in the Order Detail table to maintain the relationship. Figure 4 is the result.
Figure 4: Decomposing Order Details to Remove Redundant Data

So now I've removed as much redundant data as possible. There's still a little left. There always will be in order to maintain the relationships between tables. But none of the redundancy will result in data anomalies, so I can say with confidence that my tables are now normalized to Third Normal Form. Figure 5 shows the final design.
Figure 5: Final Design

In my next and final post: The Normal Forms: In A Nutshell, I'll wrap it all up.
.
Monday, August 28, 2017
The Normal Forms: Second Normal Form
Figure 1: Order Table - 1NF

The problem here is the danger of introducing errors, called data anomalies, into the table. Data anomalies can be introduced by operator error or through programming. Once you have a single data anomaly in your table, all of your data is suspect, so the remaining normal forms work to remove such data anomalies. Figure 2 shows the same table with data anomalies present.
Figure 2: Order Table with Data Anomalies Present

As you can see, Order 112 has two different customer numbers (444 and 445), which is correct? It is impossible to tell. In addition, both product numbers B7G and B7H are identified as a 'saw'. Are these the same product with different product numbers or different products with the same description? Again, I can't know based on the data in the database.
The root cause of these data anomalies is redundant data, that is, data that is repeated in multiple rows. So we need to minimize this redundant data as much as possible.
Now wait a second! Didn't I just say in the last post that I HAD to repeat the values? Yes I did. But that was to comply with 1NF, which is not the end of the story.
Definition
So let's look at the definition of Second Normal Form (2NF). A table is said to be in 2NF if:
- It is in 1NF.
- Every field is functionally dependant on the entire primary key, that is, it depends on the entire primary key for its value.
Functional Dependency
Before I can continue, I have to talk a bit about functional dependencies, because all of the remaining normal forms rely on this concept. Functional dependency speaks to the relationship that fields in a table have to each other. It is perhaps best explained by example.
Suppose there is an Employee table, and I am an entity in that table. There is a row that represents all the information about Roger Carlson with Social Security Number (SSN) acting as the primary key. Since all the fields in my row are information about me, and I am represented by the SSN, we can say that each field depends on SSN for its value. Another way to say it is that SSN implies the value of each of the other fields in my record.
If a different row is selected, with a different SSN, the values of all the other fields will change to represent that entity.
Implications
Second Normal Form says that all of the fields in a table must depend on the ENTIRE primary key. When there is a single primary key (like SSN), it is pretty simple. Each field must be a fact about the record. But when there is a compound primary key, it's possible that some fields may depend on just part of the primary key.
Going back to our Order Table example, Figure 3 shows these partial dependencies.
Figure 3: 1NF Orders Table showing dependencies

In order to uniquely identify the record, the primary key of this table is a combination of OrderNum and ProductNum (or Item, but a number is a better choice).
2NF says that each field must depend on the ENTIRE primary key. This is true for some fields: Quantity depends on both the OrderNum and ProductNum, so does Item. However, some fields do not.
Order 112 will be for customer 444 regardless of the product selected. The order date does not change when the product number changes either. These fields depend ONLY on the OrderNum field.
Since some fields do not depend on the entire primary key, it is not in Second Normal Form. So what do I do about it?
Implementation
The solution is to remove those records, which do not depend on the entire primary key, to a separate table where they do. In the process, I remove the redundant or repeated data so there is just a single record for each. Figure 4 shows the process of decomposing the table into two tables.
Figure 4: 1NF Orders table Decomposed to 2NF

This corrects the data anomaly with the two different customers for the same order. However, I still have the problem of the product number and the item description. It's still possible for the same product to have different descriptions or different items sharing the same ProductNum, as Figure 5 illustrates.
Figure 5: Remaining data anomalies.

Product A7S is either a wrench or a nail, and a saw is either product number B7G or B7H.
To correct these problems, I need to add yet another normal form: Third Normal Form. I'll talk about that next.
.
Wednesday, August 16, 2017
The Normal Forms: First Normal Form (1NF)
The First Normal Form, or 1NF, is the very lowest, basic arrangement of fields in a table. If your table is not in 1NF, then it isn't really a table. Sadly, many novice databases are not even in 1NF.
Definition
A table is said to be in First Normal Form if:
1) there is no row or column order
2) each row (record) is unique
3) each row by column value (field) contains exactly one value
4) there are no repeating columns
What does this mean?
First of all, the operation of the table will be unaffected by the order the rows are in or the order the fields are within the row. It means that each record must be complete unto itself without referencing another row positionally, for example, the row above. Likewise the position of the fields is irrelevant.
Since each record is unique, it means there are no duplicate records. This uniqueness is defined by a field or combination of fields whose value will never be duplicated. This is called the primary key. In order to assure uniqueness, no part of a primary key may be NULL.
Because a field must have a single value, it cannot contain a list or compound value. One over looked consequence of this rule is that each field MUST have at least one value. If the value of the field is not known, it is said to be NULL. (There is some debate over whether NULL is actually a value. I maintain it is, but the discussion is largely semantic.)
Lastly, there are no repeating columns. Repeating columns are columns that store essentially the same information. They may be columns like Product1, Product2, Product3; or multiple Yes/No columns that represent the same information like each product having its own column (Saw, Hammer, Nails).
Implications
Let's take a look at how these rules are implemented and what they mean for table design.
Suppose I want a simple Order table with OrderNum, CustomerNum, OrderDate, Quantity, Item, and ProductNum. Although the definition of 1NF is fairly simple, it precludes a wide range of data arrangements. Let's take a look at some of these arrangements.
Figure 1 shows one way such data can be arranged.
Figure 1: Records with Missing Values

To make each record unique, the primary key would have to be OrderNum and Item. However, since no part of the primary key may be Null, this arrangement won't work. All the values of the primary key must be filled in.
But even more than this, the record is not "complete" unto itself. That is, it refers to other records for information. It's not that the values of OrderNum, CustomerNum, or OrderDate are unknown and therefore NULL. I do know the value, but I'm attempting to represent that data positionally. This, of course, violates the first rule (order is irrelevant) and rule 3 (each field must have a value).
This arrangement is common in spreadsheets and reports, but it is not sufficient for storing data.
Figure 2 shows another way the data can be arranged.
Figure 2: Information Stored In Lists

This violates rule 3. Each field must hold one and only one piece of information and not a list. It would be a nightmare to do anything with the date in the Item field other than simply display it because the database management system is designed to treat fields as indivisible.
While Figure 2 is an extreme example that mixes multiple fields in addition to multiple field values, believe it or not, I have also seen database designed like Figure 3:
Figure 3: Data stored in multiple lists

While this is better than Figure 2 (at least it does not mix fields), it is still not atomic and you'd have difficultly associating a quantity with a particular product.
Compound Values:
1NF also precludes compound values, things like full names in a single field or multi-part identification numbers.
Full Names
Why not store a full name? Roger J. Carlson is certainly pertinent information about me. However, it is not indivisible. It is made up of a first name, middle initial, and last name. Because I may want to access just pieces of it (using the first name in the salutation of a letter or sorting by last name), the information should be stored in separate fields.
Multi-part Numbers
Often, a database requirement is to have an identification number that is composed of different, meaningful parts. A serial number may have a four-digit product code, followed by the manufacture date (8 digits), and ended with the facility ID. It might look like this COMP02222008BMH. While this may be a useful arrangement for humans, it is useless in a database. Each value should be stored in a separate field. When the serial number is needed, it can be concatenated easily enough in a query, form, or report.
Figure 4 shows data that is stored in repeated columns.
Figure 4: Data Stored in Repeated Columns

This arrangement is common for people who use spreadsheets a lot. In fact, this is so common it is called "committing spreadsheet". The problem, in addition to having multiple columns, is that in order to associate a quantity with a product, you would have to do it positionally, breaking rule 1.
Lastly, another version of the Repeated Columns error is multiple Yes/No columns. Figure 5 illustrates that.
Figure 5: Data Stored in Yes/No Columns

At first blush, this does not seem to have the same problem, but all I've done is replace generic field names (Product1, Product2, etc) with specific ones (wench, saw, etc). It would be extremely easy to check a second field in any row and they you would have no idea which was correct.
Implementation
As we've seen, First Normal Form precludes a lot of possible data arrangements. So what's left? There's really only one possibility left. Figure 6 shows it.
Figure 6: 1NF Correct with Repeated Rows

Each row has a unique identifier and there are no duplicates. Each field contains a single value. The position of the row and field is irrelevant, and lastly there are no repeating columns.
It's perfect. Right? Well, no. While this table does conform to 1NF, it is still has some problems; problems that 1NF is not equipped to handle. For those, I need to look at the Second Normal Form (2NF), which is what I'll do next time.
.
Tuesday, August 1, 2017
The Normal Forms: Introduction
In What Is Normalization: Parts I, II, III, IV and V, I discussed the decomposition method of normalization, where you put all your fields into a single table and break the them down into smaller, normalized tables.
In Entity-Relationship Diagramming: Parts I, II, III, and IV, I discussed an alternate method which works from the bottom up. It takes the individual pieces of information (Attributes) and group them into logical groupings (Entities).
However, in neither case did I formally define or explain the Normal Forms. And that's for good reason. I find that only after people get a working understanding of normalization do they really understand the Normal Forms and what they imply. Therefore I usually leave them until last. If you haven't read the above mentioned serie, it would be worth your while to do so.
Normalization was first developed by E. F. Codd, the father of Relational Database theory. He created a series of "Normal Forms", which mathematically defined the rules for normalization. Each normal form is "stronger" than the previous, that is, they build upon the previous normal forms. Second Normal Form (2NF) encompasses all the rules of First Normal Form (1NF) plus adding its own rules. Third Normal Form encompasses all of 1NF and 2NF, and so on.
Figure 1: Each Normal Form Encompasses the Previous

In order, the normal forms are: First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (BCNF), Fifth Normal Form (5NF), and Domain Key Normal Form (DKNF). BCNF comes between 3NF and 4NF because it was developed later, but because of its "strength" belonged between 3NF and 4NF.
Since each normal form encompasses all previous forms, in theory, the higher the normal form, the "better" the database.
In practice, however, normalizing to the first three normals form will avoid the vast majority of database design problems. So it is generally agreed that to be properly normalized, most databases must be in 3NF.
Beyond 3NF, the normal forms become increasingly specialized. Boyce-Codd Normal form and Fourth Normal Form were created to handle special situations. Fifth Normal Form and Domain-Key Normal Form are largely of theoretical intererst and little used in practical design.
So what I'm going to do for this series is limit myself to the first three normal forms, giving their definitions, implications for data, and how to implement them.
In my next post, I'll start with the First Normal Form.
Friday, June 23, 2017
Entity-Relationship Diagramming: Part IV
(I haven't discussed the Normal Forms. Perhaps I will in a later post.)
In Part III, I left off with the completed Entity-Relationship Diagram shown in Figure 1.
Figure 1: Employee Database E-R Diagram

When I do E-R diagramming, I like to ignore the attributes, concentrating instead on just the entities and their relationships. But at some point, obviously, the attributes have to come back into it. I like to save that to the end.
There are some good E-R diagramming tools available out there, like Visio and ERWin, but for small projects, I usually just implement the tables directly in Access and create the Relationships there. The Access Relationship Window works pretty well as a diagramming tool, and you've got a completed database at the end.
Implementing an E-R Diagram
So far, I've talked about Entities and Attributes. I do that deliberately to keep myself from thinking about implementation issues while I'm designing the data model. But at the implementation phase, entities become tables and attributes become fields.
The first thing I do is add an Autonumber, Primary Key field (Surrogate Key), to each table, naming it after the table with a suffix of "ID". Thus the tblEmployee table will have a field EmployeeID. Then I create a Unique Index on the Natural Key I identified during E-R diagramming process.
At this point, my tables look like Figure 2.
Figure 2: Tables with Surrogate Keys Added

Now it's time to look at my relationships. Relationships are created on fields holding related information, Primary Key to Foreign Key. In a One-to-Many (1:M) relationship, the primary key of the table on the "One" side is added to the table on the "Many" side table and becomes the foreign key.
Since there is a 1:M relationship between tblEmployee and tblEmployeeCertification, I'll put EmployeeID in EmployeeCertification and it will become the foreign key for tblEmployee.
I'll do the same with every relationship. The resultant tables look like Figure 3.
Figure 3: Tables with Foreign Keys Added

Lastly, I need to create the actual relationships. I do this by clicking and dragging the primary key of one table into its corresponding foreign key in the other. For instance, I'll click on EmployeeID in tblEmployee and drag it to EmployeeID in tblJobHistory.
When I do, I will get an "Edit Relationships" dialog box shown in Figure 4.
Figure 4: Edit Relationships Dialog Box

The Referential Integrity box will not be checked by default. However, you MUST check it in order to create a true relationship. Without it, Access will create a line between the tables, but it will NOT be a relationship.
When you click the Referential Integrity box, Access should correctly identify the relationship as a One-to-Many relationship. If it does not, if it says Indeterminate, you've done something wrong.
Clicking OK results in the relationship being created. Figure 5 shows that.
Figure 5: Relationship Between tblEmployee and tblJobHistory. 
Now, I'll just do that for the rest of the relationships, and I'm done.
Figure 6: Completed Data Model 
The process I've outlined here is really just the bare bones. If you're interested in seeing examples in more detail, you can find some in the Tutorials: Database Design section of my website.
.