首页 » Big Data & AI » Data Analytics » 正文

5 Rules of Data Normalization

There are several levels of normalization. A brief description is provided in below:

  1. Eliminate Repeating Groups – Make a separate table for each set of related attributes, and give each table a primary key.
  2. Eliminate Redundant Data – If an attribute depends on only part of a multi-valued key, remove it to a separate table.
  3. Eliminate Columns Not Dependent On Key – If attributes do not contribute to a description of the key, remove them to a separate table.
  4. Isolate Independent Multiple Relationships – No table may contain two or more 1:n or n:m relationships that are not directly related.
  5. Isolate Semantically Related Multiple Relationships – There may be practical constrains on information that justify separating logically related many-to-many relationships.

1. Eliminate Repeating Groups

In the original member list, each member name is followed by any databases that the member has experience with. Some might know many, and others might not know any. To answer the question, “Who knows DB2?” we need to perform an awkward scan of the list looking for references to DB2. This is inefficient and an extremely untidy way to store information.

Moving the known databases into a seperate table helps a lot. Separating the repeating groups of databases from the member information results in first normal form. The MemberID in the database table matches the primary key in the member table, providing a foreign key for relating the two tables with a join operation. Now we can answer the question by looking in the database table for “DB2” and getting the list of members.

2. Eliminate Redundant Data

In the Database Table, the primary key is made up of the MemberID and the DatabaseID. This makes sense for the “Where Learned” and “Skill Level” attributes, since they will be different for every member/database combination. But the database name depends only on the DatabaseID. The same database name will appear redundantly every time its associated ID appears in the Database Table.

Suppose you want to reclassify a database – give it a different DatabaseID. The change has to be made for every member that lists that database! If you miss some, you’ll have several members with the same database under different IDs. This is an update anomaly.

Or suppose the last member listing a particular database leaves the group. His records will be removed from the system, and the database will not be stored anywhere! This is a delete anomaly. To avoid these problems, we need second normal form.

To achieve this, separate the attributes depending on both parts of the key from those depending only on the DatabaseID. This results in two tables: “Database” which gives the name for each DatabaseID, and “MemberDatabase” which lists the databases for each member.

Now we can reclassify a database in a single operation: look up the DatabaseID in the “Database” table and change its name. The result will instantly be available throughout the application.

3. Eliminate Columns Not Dependent On Key

The Member table satisfies first normal form – it contains no repeating groups. It satisfies second normal form – since it doesn’t have a multivalued key. But the key is MemberID, and the company name and location describe only a company, not a member. To achieve third normal form, they must be moved into a separate table. Since they describe a company, CompanyCode becomes the key of the new “Company” table.

The motivation for this is the same for second normal form: we want to avoid update and delete anomalies. For example, suppose no members from the IBM were currently stored in the database. With the previous design, there would be no record of its existence, even though 20 past members were from IBM!

4. Isolate Independent Multiple Relationships

This applies only to designs that include one-to-many and many-to-many relationships. An example of one-to-many is that one company can employ many members. An example of a many-to-many relationship is that a member can know many databases and several members might know the same database.

Suppose we want to add a new attribute to the MemberDatabase table called “Attire”. This way we can look for members that not only know DB2, but also typically wear a suit and tie. Fourth normal form dictates against this (using the MemberDatabase table, not wearing suits and ties). The two attributes do not share a meaningful relationship. A member may know a certain database, and he/she might just wear a wet suit. This doesn’t mean he/she can do both at the same time (unless you have a water-proof computer terminal). How will you represent this if you store both attributes in the same table?

5. Isolate Semantically Related Multiple Relationships

Usually, related attributes belong together. For example, if we really wanted to record which databases each member works on wearing which kinds of clothes, we would want to keep the attire attribute in the MemberDatabase table. But there are times when special characteristics of the data make it more efficient to separate even logically related attributes.

Imagine that our system will record which jobs are available in each company, and which schools typically supply candidates to those companies. This suggests a CompanySchoolJob table which satisfies fourth normal form. As long as any company can use any candidates from any school, this works fine.

Now suppose a law is passed to prevent exclusive arangements: a company accepting candidates must accept them from all schools it deals with. In other words, if IBM is hiring DBAs and wants to maintain a relationship with MIT, it must accept DBAs from MIT.

The need for fifth normal form becomes clear when we consider inserts and deletes. Suppose a company decides to create 3 new jobs types:  HTML DBA, Java Programmer and Underwater DB2 DBA. Suppose further that it already deals with three schools that can supply candidates for those positions. This will require nine new rows in the database, one for each school/job combination.

Breaking up the table reduces the number of inserts to six. Here are the tables neccessary for fifth normal form, shown with the six newly inserted rows in bold type. If an application involves significant update activity, fifth normal form can mean important savings. Note that these combination tables develop naturally out of entity-relationship analysis.