Alessandro D. Gagliardi
Those who cannot remember the past are condemned to repeat it.
The Life of Reason, by George Santayana
The objectives of normalization were stated as follows by Codd:
- To free the collection of relations from undesirable insertion, update and deletion dependencies;
- To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
- To make the relational model more informative to users;
- To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34
A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.
Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:
| Customer ID | First Name | Surname | Telephone Number |
|---|---|---|---|
| 123 | Robert | Ingram | 555-861-2025 |
| 456 | Jane | Wright | 555-403-1659 |
| 789 | Maria | Fernandez | 555-808-9633 |
The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:
| Customer ID | First Name | Surname | Telephone Number |
|---|---|---|---|
| 123 | Robert | Ingram | 555-861-2025 |
| 456 | Jane | Wright | 555-403-1659 |
| 555-776-4100 | |||
| 789 | Maria | Fernandez | 555-808-9633 |
Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.
A design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.
| Customer ID | First Name | Surname |
|---|---|---|
| 123 | Robert | Ingram |
| 456 | Jane | Wright |
| 789 | Maria | Fernandez |
| Customer ID | Telephone Number |
|---|---|
| 123 | 555-861-2025 |
| 456 | 555-403-1659 |
| 456 | 555-776-4100 |
| 789 | 555-808-9633 |
Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer.
A table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key.
| Employee | Skill | Current Work Location |
|---|---|---|
| Brown | Light Cleaning | 73 Industrial Way |
| Brown | Typing | 73 Industrial Way |
| Harrison | Light Cleaning | 73 Industrial Way |
| Jones | Shorthand | 114 Main Street |
| Jones | Typing | 114 Main Street |
| Jones | Whittling | 114 Main Street |
Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.
The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"
| Employee | Skill | Current Work Location |
|---|---|---|
| Brown | Light Cleaning | 73 Industrial Way |
| Brown | Typing | 73 Industrial Way |
| Harrison | Light Cleaning | 73 Industrial Way |
| Jones | Shorthand | 414 Brannon Street |
| Jones | Typing | 414 Brannon Street |
| Jones | Whittling | 114 Main Street |
A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:
| Employee | Current Work Location |
|---|---|
| Brown | 73 Industrial Way |
| Harrison | 73 Industrial Way |
| Jones | 114 Main Street |
| Employee | Skill |
|---|---|
| Brown | Light Cleaning |
| Brown | Typing |
| Harrison | Light Cleaning |
| Jones | Shorthand |
| Jones | Typing |
| Jones | Whittling |
Neither of these tables can suffer from update anomalies.
Not all 2NF tables are free from update anomalies, however. This brings us to...
3NF was originally defined by E.F. Codd in 1971
A table is in 3NF if and only if it is in 2NF and every non-prime attribute of the table is non-transitively (i.e. directly) dependent on every superkey of that table.
Or...
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."
| Tournament | Year | Winner | Winner Date of Birth |
|---|---|---|---|
| Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
| Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
| Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
| Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.
The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.
In order to express the same facts without violating 3NF, it is necessary to split the table into two:
| Tournament | Year | Winner |
|---|---|---|
| Indiana Invitational | 1998 | Al Fredrickson |
| Cleveland Open | 1999 | Bob Albertson |
| Des Moines Masters | 1999 | Al Fredrickson |
| Indiana Invitational | 1999 | Chip Masterson |
| Player | Date of Birth |
|---|---|
| Chip Masterson | 14 March 1977 |
| Al Fredrickson | 21 July 1975 |
| Bob Albertson | 28 September 1968 |
Update anomalies cannot occur in these tables, which are both in 3NF.
I believe firmly that anything less than a fully normalized design is strongly contraindicated ... [Y]ou should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements.
Date, C.J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 152
Consider these two representations of our players table:
| Player (PK) | Date of Birth |
|---|---|
| Chip Masterson | 14 March 1977 |
| Al Fredrickson | 21 July 1975 |
| Bob Albertson | 28 September 1968 |
| id (PK) | Player | Date of Birth |
|---|---|---|
| 1 | Chip Masterson | 14 March 1977 |
| 2 | Al Fredrickson | 21 July 1975 |
| 3 | Bob Albertson | 28 September 1968 |
Which is better? Why?
| id (PK) | Player | Date of Birth |
|---|---|---|
| 1 | Chip Masterson | 14 March 1977 |
| 2 | Al Fredrickson | 21 July 1975 |
| 3 | Bob Albertson | 28 September 1968 |
| id (PK) | Tournament | Year | Winner_id (FK) |
|---|---|---|---|
| 1 | Indiana Invitational | 1998 | 2 |
| 2 | Cleveland Open | 1999 | 3 |
| 3 | Des Moines Masters | 1999 | 2 |
| 4 | Indiana Invitational | 1999 | 1 |

DS_Lab03-RDBMS