Normalization : 1NF, 2NF, 3NF, BCNF, 4NF, and 5NF Normalization: Normalization refers to rigorous standards for good design, designed formally, and methods for testing a DB’s design as given below.To design a good relational database schema, by which an existing schema is modified to bring its component tables into compliance (fulfillment) with a series of progressive normal forms as below. below.
Normalization is a process of looking at the table(relational schema) in the RDBMS to test whether they pass a series of test. The tests are used to avoid update anomalies, redundancy, and ambiguity from the data base schema. If the table fails (bad relation) in a test, the solution is decomposing the attributes of bad relation (table) into smaller tables. The goal of database normalization is to ensure that every non-key column in every table is directly dependent on the key, the whole key and nothing but the key and with this goal come benefits in the form of reduced reduce d redunda redundancies ncies,, fewer anomalies, anomalies, and impr improved oved effic efficiencie ienciess (Whi (While le normal normalizati ization on tends to incre increase ase the duplication of data, it does not introduce redundancy, which is unnecessary duplication.).
Normalization up to BC normal form (BCNF) is based on functional dependencies and key constraints.
About the Key: Column(s) C is primary key for table T if: Property 1: All columns in T are functionally dependent on C Property 2: No sub-collection of columns in C (assuming C is a collection of columns and not
just a single column) has Property 1 Superkey:
Any combination of attributes that uniquely identifies each tuple in a relation is . Candidate Keys:
Any minimal combination of attributes that uniquely identifies each tuple in a relation – you can’t remove any attribute and still have unique identification. Alternate Keys / Secondary Keys:
Candidate
keys
not
chosen
as
primary
keys
are
called
alternate
keys.
Primary key: The candidate key chosen to be the unique identifier for a relation is called Primary key. Prime Attribute:
An attribute that is a member of a candidate key is called Prime Attribute. Non-Prime Attribute:
An attribute that is not a member of a candidate key is called Prime Attribute.
1. First Normal Form : The first normal form (or 1NF) requires that the values in each column of a table are atomic . By atomic we mean that there are no multiple values in a column.
Example: 1NF isn’t very interesting – it is a stepping stone to others. For instance, if we have (s#, name, city), where each supplier may have several branch locations in different cities, then the relation has a domain that allows sets of cities for values, thus they aren’t atomic.
Not in 1NF
Normalized to 1NF. The solution is to decompose the table into 2 tables: (s#, name) and (s#, location).
2. Second Normal Form: A relation is in 2NF if and only if it is in 1NF and every nonprime attribute is fully functionally dependent on the primary (Composite) key. Example 1:
Where, the Underline represents the Primary Key attributes and the arrows represent the Functional Dependencies of the relation R1.
The primary key is (s#, p#), and FDs are city tax
s# tax,city s#,p# qty, Where tax is determined by the city, the tax and city are determined by s# and qty is determined by both the
s#,p# . Note that this structure will have update anomalies as below.
Anomalies because of Partial Dependancy:
Insert. You cannot enter the existence of a new supplier and city unless that supplier is shipping a part. This
is because of the integrity rule, that all fields in the key must have values. p# is part of the key. Delete. If a supplier has only one shipment, and it gets deleted, you also delete all knowledge of that
supplier, such as the city. Update. Because of the redundancy, if the city of a supplier moves, then you must either find all
occurrences of the supplier and change the city or change one occurrence, and have an inconsistent DB.
The solution is to divide this into two tables, where the key of the new table will be the one that was independently determining the values of some attributes. So now we have
The functional dependency diagram shows that each of them now contains attributes that are fully dependent on the primary key of each relation. Insert – can now insert the existence of a supplier into R2a,
without a shipment. Delete – can now delete a shipment from R2b without losing information about the supplier. Update – can now update the city in only one place.
3. Third Normal Form :
A relation is in 3NF if and only if it is in 2NF and every nonprime attribute is non-transitively dependent on the primary key.
Third Normal Form (3NF) requires that all columns depend directly on the primary key. Tables violate the Third Normal Form when one column depends on another column, which in turn depends on
the primary key (a transitive dependency). One way to identify transitive dependencies is to look at your table and see if any columns would require updating if another column in the table was updated. If such a column exists, it probably violates 3NF. A transitive dependence is when
r.a
r.b and r.b r.c hold. Therefore, the transitive dependency
r.a
r.c also holds. This can be seen in the functional dependency for R2a.
Tax rate is dependent on city. City is dependent on s#. Therefore, tax rate is dependent on s# through city.
This shows by the fact that there are arrows that originate from places other than the key. This also gives anomalies. Anomalies because of transitive dependency: Insert – cannot enter that a city has a tax rate unless we have a supplier there. Again, this is because we
cannot have a null primary key. Delete – if there is only one supplier in a city, when we delete the supplier, we delete the tax information for
that city. Update – if we change the tax rate for a city, we must either 1. find all suppliers in that city and change the
status for it or 2. change only one and have an inconsistent DB.
The solution is to break the relation into two relations. The point here is to get rid of the extra arrows, and make simple functional dependencies. So the two new relations are
Now the functional dependency diagrams are simple, there are no transitive dependencies, all attributes are fully dependent on the key, and they are in 3NF.
4. Boyce-Codd Normal Form (BCNF) :
A relation is in BCNF if and only if every determinant is a candidate key. A determinant is any attribute on which another attribute is functionally dependent.
This Normal Form was developed to deal with relations where the relation has multiple candidate keys, and the candidate keys are composite, and the candidate keys are overlapped. When a relation has
more than one candidate key, anomalies may occur even though the relation is in 3NF. BCNF is based on the concept of a determinant . A determinant is any attribute (simple or composite)
on which some other attribute is fully functionally dependent . A relation is in BCNF is, and only if, every determinant is a candidate key.
Example:
Where, sname is also unique. The candidate keys are (s#, p#) and (sname, p#). s# determines sname, and sname determines s#, so they are both determinants. But they are not candidate keys – they are part of different candidate keys. update anomaly because of multiple Determinant:
If you update the sname in one tuple, you must either update it in all tuples with the same s#, or be inconsistent. Ditto with p#. The solution is to make two projections of R5:
Now, each determinant in each relation is also a candidate key. You can update sname or s# in one place (taking into account the issues of foreign keys). Complexity of more tables:
A complete normalization of tables is desirable, but you may find that in practice that full normalization can introduce complexity to your design and application. More tables often lead more JOIN operations , and in most database management systems (DBMSs) such JOIN operations can be costly,
leading to decreased performance. The key lies in finding a balance where the first three normal forms are generally met without creating an exceedingly complicated schema.
5. Multivalued Dependencies – 4NF:
A problem with multi-valued dependencies (MVD) occurs when you are trying to express two independent 1:N relations, or multi-valued attributes, in the same relation. For example, in your initial design process, you may have seen something like:
Example:
Where, the manager is associated (multi-determines) a set of phone numbers, and also a set of employees, but the phone numbers and the employees have nothing to do with each other. Of course, you can’t have a relation that looks like the ones above – it is excluded by 1NF. You are trying to express the idea that the manager is associated with a set of employees, and a set of phone numbers, but that the employees and the phone numbers are independent of one another. So, you might design a relation that looks like:
But that implies a relationship (connection) between 999-1212 and George. To avoid that appearance, you would have to store all combinations of phone# and employee. Two 1:N relations (or multivalued attributes), A:B, A:C, where B and C are independent of each other.
A->>B
A determines a set of (Multiple) values B
A->>C
A determines a set of (Multiple) values C
The only time a multi-valued dependency (mvd) is a problem is when you have more than one mvd, and the B and C values are independent. A trivial mvd is one where:
1. The B attribute(s) are a subset of the A attributes. That is, if you made them distinct from each other, there would no longer be an mvd. E.g., ABC->>B. 2. The union of A and B make up the entire relation – there are no other attributes in the table. Otherwise, you have a nontrivial mvd, and these are the potential problems. There are lots of redundancies allow room for anomalies. Note that the relations with non-trivial mvd’s tend to be all-key relations, where the key is the entire relation. The cure: 4NF
A relation is in 4NF if for every nontrivial mvd A->>B, C is a superkey (any combination of attributes) that is a unique ID in the relation (non-minimal) for the relation. The manager table used as an example above is not in 4NF. mgr, phone# is a nontrivial mvd because phone# is not a subset of mgr, and there is also employee. Similarly, mgr, employee is a nontrivial mvd. As usual, the way to get a relation into 4NF is to decompose it , to get the mvds into separate relations :
Which are now trivial mvds, making up the entire table? This decomposition will have the lossless join property. The Overall Idea
Remember that the goal here is to get a good design. Starting from an ER diagram is one way, although you still mgiht want to check normalization of tables. But starting with a bunch of tables and then normalizing them (or starting with one enormous table) is another approach. We have been talking about normalization as something that you do regarding just one table in the database. It is also important to look at your DB design in terms of how the tables relate to each other, and how you can combine them. Merely having a bunch of tables in 3NF or BCNF is not enough. Some definitions:
In a database design, we have a decomposition D of the universal relation R . This is the way that all of the attributes have been decomposed into tables. There is a set of functional dependencies F that hold over the attributes of R; this depends on the semantics of the DB and how things work in the world it models.
1. Dependency Preserving Decomposition:
In decomposition, it is possible to lose a functional dependency – this is undesirable, so a good decomposition will preserve dependencies. There are two ways of storing functional dependencies: they can be in the same table, or they can be inferred from different tables. 2. Lossless (Additive) Joins:
Another important feature of a good decomposition is that it gives lossless joins. This is the problem of spurious tuples. The term “lossless” refers to the problem of losing information – the way that you lose information here is by getting noise (spurious tuples) into your table. Properties of lossless join decomposition:
1. For 2-relation DB schemas: the attributes in both relations must functionally determine either those attributes that appear in only the first relation, or those that appear in only the second relation. 2. Once you have established decomposition with the lossless join property, you can further decompose one of its tables without losing the property. So, to decompose and maintain lossless joins: For each table in the DB that isn’t in BCNF, find the functional dependency that is in violation (that is, contains a determinant that is not part of a candidate key), and break the relation into two. One relation contains the A and B (AB) attributes from the functional dependency A ->>BC. The other contains the rest of the attributes (i.e A C).
You can’t always perform the “ideal” decomposition that is in BCNF and preserves dependencies. You may only be able to get to 3NF. You must then decide whether to leave it there, and build in protection for update anomalies, or to decompose even further, with the resulting loss of performance. In terms of design, remember that it isn’t a good idea to design a table that will get too many nulls. It is better to break it up into another table. However, this could also result in the problem of dangling tuples . The representation of a “thing” is broken up into 2 tables. To get the full information on the “thing”, you join the tables together. However, if some tuples have either null value on a join attribute, or don’t appear at all in one table, they won’t appear in the result, unless you know in advance that you should do an outer join.
6. Fifth Normal Form (5NF) and Join Dependencies : 5NF, also known as Project-join normal form (PJ/NF) is based on the idea of a lossless JOIN or
the lack of a join-projection anomaly. This problem occurs when you have an n-way relationship, where n>2. A quick check for 5NF is to see if the table is in 3NF and all the candidate keys are single columns. Join dependency
A join dependency (JD), denoted by JD( R1, R2, ..., Rn), specified on relation schema R, specifies a constraint on the states r of R. The constraint states that every legal state r of R should have a non-additive join decomposition into R1, R2, ..., Rn; that is, for every such r we have
Note: an MVD is a special case of a JD where n = 2. •
Lossless-join property refers to when we decompose a relation into two relations - we can rejoin
the resulting relations to produce the original relation. •
Consider a table of supply with no MVD is in 4NF but not in 5NF if it has JD(R1,R2,R3).
•
The SUPPLY relation with the join dependency is decomposed into three relations R1,R2, and $3
that are each in 5NF. •
Applying a natural join to any two of these relations produces spurious tuples, but applying a natural
join to all three together does not.
Basic Definitions in Normalization What are update Anomalies?
The Problems resulting from data redundancy in an un-normalized database table are collectively known as update anomalies . So any database insertion, deletion or modification that leaves the database in an inconsistent state is said to have caused an update anomaly. They are classified as Insertion anomalies
Deletion anomalies
Modification anomalies
Tbl_Staff_Branch
Tbl_Staff
Tbl_Branch
•
Insertion anomalies: To insert the details of a new member of staff located at branch B1 into
the Tbl_Staff_Branch Table shown above, we must enter the correct details of branch number B1 so that the branch details are consistent with the values for branch B1 in other rows. To insert the details of a new branch that currently has no members of staff into the Tbl_Staff_Branch table, it is necessary to enter nu lls for the staff details which is not allowed
as staffID is the primary key. But if you normalize Tbl_Staff_Branch, which is in Second Normal Form (2NF) to Third Normal Dorm (3NF), you end up with Tbl_Staff and Tbl_Branch and you shouldn't have the problems mentioned above.
•
Deletion anomalies: If we delete a row from the Tbl_Staff_Branch table that represents the last
member of staff located at that branch, (for e.g. row with Branch numbers B",B3 or B4) the detals about that branch are also lost from the Database.
•
Modification anomalies: Should we need to change the address of a perticular branch in
the Tbl_Staff_Branch table, we must update the rows of all staff located at that branch. If this modification is not carried out on all the relevent rows, the database will become inconsistent. What’s a spurious tuple?
A spurious tuple is, basically, a record in a database that gets created when two tables are joined badly. In database-ese, spurious tuples are created when two tables are joined on attributes that are neither primary keys nor foreign keys. What is Functional Dependency? what are the different types of Functional Dependencies?
Functional Dependency is a constraint between two sets of attributes from the database. Functional Dependency describes the relationship between attributes (columns) in a table. They are
fundamental to the process of Normalization. For example, if A and B are attributes of a table, B is functionally dependent on A, if each value of A is associated with exactly one value of B (so, you can say, 'A functionally determines B').
Functional dependency between A and B •
Attribute or group of attributes on the left hand side of the arrow of a functional dependency is referred to as 'determinant'.
•
Attribute or group of attributes on the right hand side of the arrow of a functional dependency is referred to as 'dependent'.
Simple example would be StaffID functionally determines Position in the above tables.
Functional Dependency can be classified as follows:
•
Full Functional dependency Indicates that if A and B are attributes(columns)of a table, B is fully
functionally dependent on A if B is functionally dependent on A ,but not on any proper subset of A. E.g. StaffID---->BranchID •
Partial Functional Dependency Indicates that if A and B are attributes of a table , B is partially
dependent on A if there is some attribute that can be removed from A and yet the dependency still holds. Say for Ex, consider the following functional dependency that exists in the Tbl_Staff table: StaffID,Name -------> BranchID BranchID is functionally dependent on a subset of A (StaffID,Name), namely StaffID. •
Transitive Functional Dependency: A condition where A , B and C are attributes of a table such
that if A is functionally dependent on B and B is functionally dependent on C then C is Transitively dependent on A via B.
Say for Ex, consider the following functional dependencies that exists in the Tbl_Staff_Branch table: StaffID---->Name,Sex,Position,Sal,BranchID,Br_Address BranchID----->Br_Address So, StaffID attribute functionally determines Br_Address via BranchID attribute.
What is closure of a set of FDs?
If F is a set of FDs on a relation schema R , then F+, the closure of F, is the smallest set of FDs such that
F+
F and no FD can be derived from F by using the inference axioms that are not contained in F+ . if R is
not specified, it is assumed to contain all the attributes that appear in F.