Normalization is a design approach that minimizes data redundancy and optimizes data structures by systematically and properly placing data elements into the appropriate groupings. Normalization is the process of efficiently organizing data in a database and is the process of decomposing large tables into smaller tables.
There are two goals of the normalization process:
Eliminating redundant data (for example, storing the same data in more than one table).
Ensuring data dependencies make sense (only storing related data in a table).
Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored and avoid problems with inserting, updating, or deleting data.
A series of guidelines were developed by the database community for ensuring that the databases are normalized. Those guidelines are represented by different normal forms.
Types of Normal forms are as follows:
First Normal Form or 1NF
Second normal Form or 2NF
Third Normal Form or 3NF
Fourth Normal Form or 4NF
Fifth Normal Form or 5NF
In practical applications, in general, you use only 1NF, 2NF, and 3NF.
The term atomic derives from atom, the smallest indivisible particle that can exist on its own.
First normal form eliminates repeating groups and non-atomic data from an entity.
To normalize a data model into 1NF, eliminate repeating groups into individual entities. In other words, do not use multiple attributes in a single entity to store similar data. Consider the sample data shown in table for a STUDENT information system for a college or university.
This data contains several violations of 1NF. First, you are tracking courses that really represent a repeating group for STUDENTs. So, the course information should be moved into separate entities. Furthermore, you need to specify identifiers for both entities. The identifier is the primary key for the entity
A second violation of 1NF is the non-atomic data shown in the StudentName attribute. A student name can be broken down into pieces: first name, middle initial and last name. It is not indivisible, and therefore violates first normal form.
StudentID | StudentName | MajorID | StudentMajor | CourseNum | CourseName | CourseCompDate |
---|---|---|---|---|---|---|
2907 | Smith, Jacob R | MAT | Mathematics | MAT0011 MAT0027 EGL0010 | Discrete Math Calculus I English Classics I | 8/1/2002 4/30/2002 12/30/2001 |
4019 | Patterson, Jane K | PHI | Philosophy | PHI0010 CS00100 | Intro to Philosophy Programming Languages | 2002-04-30 2002-04-30 |
5145 | Neeld, Norris B | EGL | English Literature | SOC0102 | Ascent of Man | 8/1/2002 |
6132 | Morrison, Xavier Q | MUS | Music | MUS0002 SOC0102 | Origin of Jazz Ascent of Man | 2002-04-30 2002-08-01 |
7810 | Brown, Richard E | CS | Computer Science | |||
8966 | Juarez, Samantha | EGL | English Literature | EGL0010 EGL0101 | English Classics I Shakespeare II | 2001-12-30 2002-08-01 |
StudentID | LastName | FirstName | MiddleInit | MajorID | StudentMajor |
---|---|---|---|---|---|
2907 | Smith | Jacob | R | MAT | Mathematics |
4019 | Patterson | Jane | K | PHI | Philosophy |
5145 | Neeld | Norris | B | EGL | English Literature |
6132 | Morrison | Xavier | Q | MUS | Music |
7810 | Brown | Richard | E | CS | Computer Science |
8966 | Juarez | Samantha | EGL | English Literature |
StudentID | CourseNum | CourseName | CourseCompDate |
---|---|---|---|
2907 | MAT0011 | Discrete Math | 8/1/2002 |
2907 | MAT0027 | Calculus I | 4/30/2002 |
2907 | EGL0010 | English Classics I | 12/30/2001 |
4019 | PHI0010 | Intro to Philosophy | 4/30/2002 |
4019 | CS00100 | Programming Languages | 4/30/2002 |
5145 | SOC0102 | Ascent of Man | 8/1/2002 |
6132 | MUS0002 | Origin of Jazz | 4/30/2002 |
6132 | SOC0102 | Ascent of Man | 8/1/2002 |
8966 | EGL0010 | English Classics I | 12/30/2001 |
8966 | EGL0101 | Shakespeare II | 8/1/2002 |
Second normal form (2NF) ensures that all the attributes of each entity are dependent on the primary key.
Notice that certain courses repeat in the COURSE entity, namely "English Classics I" and "Ascent of Man." This situation indicates a violation of 2NF. To correct the problem, we need to identify the attributes that do not depend on the entire key and remove them. The removed attributes, along with the portion of the primary key on which they depend, are placed in a new entity, ENROLLMENT. The entire primary key of the original entity remains with the original entity.
Another benefit of the normalization process is that you will frequently encounter new attributes that need to be specified for the new entities that are created. For example, perhaps the new COURSE entity causes us to remember that each course is assigned a number of credits that count toward graduation. No changes were required for the STUDENT entity:
2907 | MAT0011 | 2002-08-01 |
---|---|---|
2907 | MAT0027 | 2002-04-30 |
2907 | EGL0010 | 2001-12-30 |
4019 | PHI0010 | 2002-04-30 |
4019 | CS00100 | 2002-04-30 |
5145 | SOC0102 | 2002-08-01 |
6132 | MUS0002 | 2002-04-30 |
6132 | SOC0102 | 2002-08-01 |
8966 | EGL0010 | 2001-12-30 |
8966 | EGL0101 | 2002-08-01 |
CourseNum | CourseName | Credits |
---|---|---|
MAT0011 | Discrete Math | 3 |
MAT0027 | Calculus I | 4 |
EGL0010 | English Classics I | 3 |
PHI0010 | Intro to Philosophy | 3 |
CS00100 | Programming Languages | 3 |
SOC0102 | Ascent of Man | 3 |
MUS0002 | Origin of Jazz | 3 |
A rule of thumb for identifying 3NF violations is to look for groups of attributes whose values can apply to more than a single entity occurrence. When you discover such attributes, move them to a separate entity.
It is time to review our STUDENT information again, this time looking for 3NF violations. Examine the STUDENT data in closely. Notice that students can have the same major and, as such, certain major information can be repeated, specifically two students in our small sample are English Literature majors. To correct the problem, we need to remove major attributes that transitively depend on the key and create a new entity for them.
StudentID | LastName | FirstName | MiddleInit | MajorID |
---|---|---|---|---|
2907 | Smith | Jacob | R | MAT |
4019 | Patterson | Jane | K | PHI |
5145 | Neeld | Norris | B | EGL |
6132 | Morrison | Xavier | Q | MUS |
7810 | Brown | Richard | E | CS |
8966 | Juarez | Samantha | EGL |
MajorID | StudentMajor |
---|---|
MAT | Mathematics |
PHI | Philosophy |
EGL | English Literature |
MUS | Music |
CS | Computer Science |
To be complete, a diagram should be developed for the 3NF data model we just created for the STUDENT data. Figure shows such a data model. Notice that we have not filled in the optionality of the relationships. We could do this based on the sample data we used, but we really need to ask more questions before we can answer questions such as Does a every student have to have a major? The current data shows this to be the case, but in reality; you know that most freshmen, and even upperclassmen, may attend college without having a formally declared major.
Normalization does not stop with 3NF. Additional normal forms have been identified and documented. However, normalization past 3NF does not occur often in normal practice. The following are additional normal forms. Just for your information we’ve kept this.
Boyce Codd normal form (BCNF) is a further refinement of 3NF. Indeed, in his later writings Codd refers to BCNF as 3NF. A row is in Boyce Codd normal form if and only if every determinant is a candidate key. Most entities in 3NF are already in BCNF.
Fourth normal form (4NF) states that no entity can have more than a single one-to-many relationship if the one-to-many attributes are independent of each other. An entity is in 4NF if and only if it is in 3NF and has no multiple sets of multivalued dependencies.
Fifth normal form (5NF) specifies that every join dependency for the entity must be a consequence of its candidate keys.
If you have any doubts or queries related to this chapter, get them clarified from our Mainframe experts on ibmmainframer Community!