Data Science

SQL and Relational Algebra

Alessandro D. Gagliardi

Last Week: Python

Questions?

Agenda

  1. Database Evolution
  2. Normalization
  3. Natural and Artificial Keys
  4. Star Schema
  5. Lab

Lab

Pair Programming

1. Start with a reasonably well-defined task.

2. Agree on one tiny goal at a time.

3. Rely on your partner, support your partner.

4. Talk a lot!

5. Sync up frequently.

6. Take a moment to celebrate as you complete tasks and overcome problems.

7. Switch roles often—at least every half hour class(?)

Questions?

Relational Databases

Those who cannot remember the past are condemned to repeat it.

The Life of Reason, by George Santayana

DB Evolution

  • 1960s
    • Hierarchical data structure (IBM IMS)
    • Network data structure (CODASYL)
  • 1970s
    • Relational data model
      • A Relational Model of Data for Large Shared Data Banks – E. F. Codd [1970]
    • System R (IBM), Ingres (Berkeley)

DB Evolution

  • 1980s
    • Commercialization of RDBMS (relational database management systems)
      • Oracle, Sybase, IBM DB2, Informix
    • SQL (Structured Query Language)
    • ACID (Atomic, Consistent, Isolated, Durable)
  • 1990s
    • PC RDBMS
      • Paradox, Microsoft SQL Server & Access
    • Larger DBs, driven by internet
    • Consolidation among commercial DB vendors

DB Evolution

  • 2000s
    • Commercialization of Open Source RDBMS
      • MySQL, Postgres
    • Evolving requirements expose RDBMS limitations
      • Storing complex and dynamic objects
      • Processing increasing data volumes
      • Analyzing massive amounts of data

Keys

  • A primary key is a unique identifier of a record in a table
  • A foreign key (when enforced) indicates a dependancy between two tables
    • often (though not necessarily) the primary key of the other table
  • Keys can be composite.

Normalization

The objectives of normalization were stated as follows by Codd:

  1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
  2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
  3. To make the relational model more informative to users;
  4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34

First normal form (1NF)

A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.

What?

The following scenario illustrates how a database design might violate first normal form.

Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

Customer

Customer ID First Name Surname Telephone Number
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659
789 Maria Fernandez 555-808-9633

The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

Customer

Customer ID First Name Surname Telephone Number
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659
555-776-4100
789 Maria Fernandez 555-808-9633

Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.

A design that complies with 1NF:

A design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.

Customer Name

Customer ID First Name Surname
123 Robert Ingram
456 Jane Wright
789 Maria Fernandez

Customer Telephone Number

Customer ID Telephone Number
123 555-861-2025
456 555-403-1659
456 555-776-4100
789 555-808-9633

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer.

Second normal form (2NF)

A table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key.

What?

Consider a table describing employees' skills:

Employees' Skills

Employee Skill Current Work Location
Brown Light Cleaning 73 Industrial Way
Brown Typing 73 Industrial Way
Harrison Light Cleaning 73 Industrial Way
Jones Shorthand 114 Main Street
Jones Typing 114 Main Street
Jones Whittling 114 Main Street

Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"

Employees' Skills

Employee Skill Current Work Location
Brown Light Cleaning 73 Industrial Way
Brown Typing 73 Industrial Way
Harrison Light Cleaning 73 Industrial Way
Jones Shorthand 414 Brannon Street
Jones Typing 414 Brannon Street
Jones Whittling 114 Main Street

A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:

Employees

Employee Current Work Location
Brown 73 Industrial Way
Harrison 73 Industrial Way
Jones 114 Main Street

Employees' Skills

Employee Skill
Brown Light Cleaning
Brown Typing
Harrison Light Cleaning
Jones Shorthand
Jones Typing
Jones Whittling

Neither of these tables can suffer from update anomalies.

Not all 2NF tables are free from update anomalies, however. This brings us to...

Third normal form (3NF)

3NF was originally defined by E.F. Codd in 1971

A table is in 3NF if and only if it is in 2NF and every non-prime attribute of the table is non-transitively (i.e. directly) dependent on every superkey of that table.

Or...
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."

  • Requiring existence of "the key" ensures that the table is in 1NF
  • Requiring that non-key attributes be dependent on "the whole key" ensures 2NF
  • Requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF
An example of a 2NF table that fails to meet the requirements of 3NF is:

Tournament Winners

Tournament Year Winner Winner Date of Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

Tournament Winners

Tournament Year Winner
Indiana Invitational 1998 Al Fredrickson
Cleveland Open 1999 Bob Albertson
Des Moines Masters 1999 Al Fredrickson
Indiana Invitational 1999 Chip Masterson

Player Dates of Birth

Player Date of Birth
Chip Masterson 14 March 1977
Al Fredrickson 21 July 1975
Bob Albertson 28 September 1968

Update anomalies cannot occur in these tables, which are both in 3NF.

I believe firmly that anything less than a fully normalized design is strongly contraindicated ... [Y]ou should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements.

Date, C.J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 152

Natural vs. Artificial Keys

Consider these two representations of our players table:

Player (PK) Date of Birth
Chip Masterson 14 March 1977
Al Fredrickson 21 July 1975
Bob Albertson 28 September 1968
id (PK) Player Date of Birth
1 Chip Masterson 14 March 1977
2 Al Fredrickson 21 July 1975
3 Bob Albertson 28 September 1968

Which is better? Why?

Player Dates of Birth

id (PK) Player Date of Birth
1 Chip Masterson 14 March 1977
2 Al Fredrickson 21 July 1975
3 Bob Albertson 28 September 1968

Tournament Winners

id (PK) Tournament Year Winner_id (FK)
1 Indiana Invitational 1998 2
2 Cleveland Open 1999 3
3 Des Moines Masters 1999 2
4 Indiana Invitational 1999 1

Star Schema

Example

Open DS_Lab03-RDBMS