UCLouvain / ICTEAM, 2014-2015 Bernard Lambeau
[email protected]
LINGI2172 – Databases Logical Database Design Part I –Normal Forms & Normalization Originally taken, then slightly adapted, from “CS252 Fundamentals of Relational Databases” by Hugh Darwen at University of Warwick www.dcs.warwick.ac.uk/~hugh
Textbook / Reading “An Introduction to...”, H. Darwen • Chapter 7 & 8 • Tutorial D
“Database Systems”, Elmasri & Navathe • Chapter 14 & 15 • Slightly different point of view
What is logical design about? • Candidate database design(s)
– How good is it? (if only one) – What is the “best” one? (if more than one)
• Mostly about requirements
– Don’t try to find an answer without knowing your requirements in the first place
• Yet a few very common, “reusable” requirements – Avoid redundancy
• Leads to update anomalies, difficulties and/or data inconsistency
– Orthogonality
• Prefer a database design that helps tackling new requirements easily
– Keep it simple, yet not too simple
• Prefer a simple database design over a complex one
Best Design Quote Ever There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
Tony Hoare, The Emperor’s Old Clothes, Turing Award Lecture, 1980
A “Reducible” Relation WIFE_OF_HENRY_VIII Wife# 1 2
FirstName Catherine Anne
LastName of Aragon Boleyn
Fate divorced beheaded
3 4 5 6
Jane Anne Catherine Catherine
Seymour of Cleves Howard Parr
died divorced beheaded survived
(Note the underscoring of the primary key attribute.)
“Decomposing” H8’s Wives W_LN_F
W_FN Wife# 1 2
FirstName Catherine Anne
Wife# 1 2
LastName of Aragon Boleyn
Fate divorced beheaded
3 4 5 6
Jane Anne Catherine Catherine
3 4 5 6
Seymour of Cleves Howard Parr
died divorced beheaded survived
Join Dependency The join dependency (JD) that holds in WIFE_OF_HENRY_VIII and allows us to decompose in W_FN and W_LN_F is written like this: * { { Wife#, FirstName }, { Wife#, LastName, Fate } } • The star indicates a JD. • The operands of a JD are written inside braces. • Each operand is a set of attributes, hence the inner braces. If a given JD holds in relvar r, then at all times r = the join of the projections indicated by the operands of the JD.
A Join Dependency That Does Not Hold Although W_FN is irreducible, we can of course take several projections of it, the following two in particular: W_FN { Wife# }
W_FN { FirstName }
Wife# 1 2
FirstName Catherine Anne
3 4 5 6
Jane But the JOIN of these two does not yield W_FN, so the JD *{ { Wife# }, { FirstName} } does not hold in W_FN.
Decomposition of W_LN_F W_LN_F can be further decomposed: W_LN
W_F
Wife# 1 2
LastName of Aragon Boleyn
Wife# 1 2
Fate divorced beheaded
3 4 5 6
Seymour of Cleves Howard Parr
3 4 5 6
died divorced beheaded survived
3-Way Join Dependency So the following JD holds in W_LN_F: * { { Wife#, LastName }, { Wife#, Fate } }
and we can conclude that the following 3-way JD holds in WIFE_OF_HENRY_VIII: * { { Wife#, FirstName }, { Wife#, LastName }, { Wife#, Fate } }
i.e., WIFE_OF_HENRY_VIII =
WIFE_OF_HENRY_VIII { Wife#, FirstName } JOIN
WIFE_OF_HENRY_VIII { Wife#, LastName } JOIN
WIFE_OF_HENRY_VIII { Wife#, Fate }
Design Comparison Does the decomposed design have any advantages? The single relvar design carries an implicit constraint to the effect that every wife has a wife number, a first name, a last name and a fate. This constraint is not implicit in the decomposed design. In fact, to enforce it, each of W_FN, W_LN and W_F needs a foreign key referencing each of the other two. But then the first attempt to insert a tuple into any of them must fail (unless multiple assignment is available).
Conclusion In the example at hand, the single relvar design is preferred, so long as the constraint implied by it truly reflects the real world. (For example, if it turns out that in fact not every wife has a last name, then we should separate out W_LN.) But the example at hand is rather special: • Each operand of the 3-way JD that holds in
WIFE_OF_HENRY_VIII includes a key of that relvar • and it is the same key in each case, viz. {Wife#} We need to look at some not-so-special examples.
A Not-So-Special JD ENROLMENT StudentId S1 S1
Name Anne Anne
CourseId C1 C2
S2 S3 S4
Boris Cindy Devinder
C1 C3 C1
Recall that we decided to “split” (i.e., decompose) this one, as follows …
Splitting ENROLMENT IS_CALLED StudentId Name S1 Anne S2 Boris S3 Cindy S4 Devinder
IS_ENROLLED_ON StudentId S1 S1
CourseId C1 C2
S2 S3 S4
C1 C3 C1
Notice the JD: *{ { StudentId, Name }, { StudentId, CourseId } } that holds in ENROLMENT.
Functional Dependency Because {StudentId} is a key of the projection ENROLMENT{StudentId, Name}, we say that the following functional dependency (FD) holds in ENROLMENT: { StudentId } → { Name } • The arrow, pronounced “determines”, indicates an FD. • Each operand is a set of attributes (hence the braces). Name is a function of StudentId.
For each StudentId there is exactly one Name.
Anatomy of an FD →
A
B
The determinant “determines”
The dependant set
Reminder: The determinant is a set of attributes, and so is the dependant set.
P.S. “dependant” is not a misspelling! It’s the noun, not the adjective.
FDs That Do Not Hold in ENROLMENT { { { { { {
Name } → { StudentId } Name } → { CourseId } CourseId } → { StudentId } CourseId, Name } → { StudentId } StudentId } → { CourseId } StudentId, Name } → { CourseId }
also: { x } → { StudentId }, because x is not an attribute of ENROLMENT.
Theorems About FDs Assume A → B holds in r, e.g. {StudentId} → {Name, Gender}. Then:
Left-Augmentation: If A’ is a superset of A, then A’ → B holds in r. Right-reduction: If B’ is a subset of B, then A → B’ holds in r. Transitivity: If A → B and B → C hold in r, then A → C holds in r. In general: If A → B and C → D hold in r, then: AU(C–B)→BUD
holds in r.
Left-Irreducibility If A → B holds in r and no proper subset A’ of A is such that A’ → B holds in r, then A → B is a left-irreducible FD (in r). B is sometimes said to be fully dependent on A. Conversely, if there is such a subset A’, then A → B is a left-reducible FD (in r). B is therefore not fully dependent on A.
FDs and Keys If A → B is an FD in r and A U B constitutes the entire heading of r, then A is a superkey of r. If A → B is a left-irreducible FD in r and A U B constitutes the entire heading of r, then A is a key of r. (The longer term candidate key is often used instead of key, for historical reasons.)
Normal Forms Arising from the study of JDs in general and FDs in particular, various “normal forms” have been defined: • First Normal Form (1NF) • Second Normal Form (2NF) • Third Normal Form (3NF) • Boyce/Codd Normal Form (BCNF) • Fourth Normal Form (4NF) • Fifth Normal Form (5NF) • Sixth Normal Form (6NF)
Each of these is in a sense stricter than its immediate predecessor. • The ones shown in bold are particularly important. • The others were early attempts that proved inadequate.
Normalisation Normalisation is the act of decomposing a relvar that fails to satisfy a given normal form (e.g., BCNF) such that the result is an equivalent set of two or more “smaller” relvars that do satisfy that normal form. We decompose by taking the projections specified in a given join dependency (JD). In the case of ENROLMENT, the given JD is *{ { StudentId, Name }, {StudentId, CourseId } } determined by the FD { StudentId } → { Name }
Purposes of Normalisation A database all of whose relvars satisfy 6NF has the following possibly desirable properties:
• No redundancy • no recording of the same information more than once
• No “update anomalies” • to be explained later
• Full orthogonality • independent recording of the simplest facts But 5NF is usually sufficient (and 6NF is sometimes problematical with existing technology), as we shall see …
Normal Forms Normalization process
6NF
Join Dependencies *{ { … }, { … } }
5NF 4NF
BCNF 3NF
Some of them arising from…
Functional Deps: {…}→{…}
2NF 1NF Only one input relvar, no real “theory” for schema normalization
First Normal Form (1NF) A relation is in 1NF if attribute domains include only atomic (simple, indivisible) values and each tuple attribute value is a single value from the domain. The term (due to E.F. Codd) is not clearly defined, partly because it depends on an ill-defined concept of “atomicity”. We shall assume that every relation, and therefore every relvar, is in 1NF, under the assumption of a precise definition of data types (e.g. as in Tutorial D). Some authorities take it that a relation is in 1NF iff none of its attributes is relation-valued or tuple-valued. It is sometimes recommended to avoid use of such attributes (especially RVAs) in (base) database relvars.
2NF and 3NF These normal forms, originally defined by E.F. Codd, were really “mistakes”. You will find definitions in the textbooks but there is no need to learn them. The faults with Codd’s original definition of 3NF were reported to him by Raymond Boyce. Together they worked on an improved, simpler normal form, which became known as Boyce-Codd Normal Form (BCNF).
Boyce/Codd Normal Form (BCNF) BCNF is defined thus: Relvar R is in BCNF if and only if for every nontrivial FD
A → B satisfied by R, A is a superkey of R. More loosely, “every nontrivial determinant is a [candidate] key”.
BCNF addresses redundancy arising from JDs that are consequences of FDs. (Not all JDs are consequences of FDs. We will look at the others later.)
Splitting ENROLMENT ENROLMENT StudentId S1 S1
Name Anne Anne
CourseId C1 C2
S2 S3 S4
Boris Cindy Devinder
C1 C3 C1
KEY{ StudentId, CourseId } while { StudentId } → { Name } FD always holds on the relvar
Splitting ENROLMENT (bis) IS_CALLED StudentId Name S1 Anne S2 Boris S3 Cindy S4 Devinder S5 Boris
IS_ENROLLED_ON StudentId S1 S1
CourseId C1 C2
S2 S3 S4
C1 C3 C1
The attributes involved in the “offending” FD have been separated into IS_CALLED, and now we can add student S5!
Advantages of BCNF With reference to our ENROLMENT example, decomposed into the BCNF relvars IS_CALLED and IS_ENROLLED_ON: • Anne’s name is recorded twice in ENROLMENT, but only
once in IS_CALLED. In ENROLMENT it might appear under
different spellings (Anne, Ann), unless the FD
{ StudentId } → { Name} is declared as a constraint. Redundancy is the problem here. • With ENROLMENT, a student’s name cannot be recorded
unless that student is enrolled on some course, and an
anonymous student cannot be enrolled on any course. Lack of orthogonality is the problem here.
Another Kind of Offending FD TUTORS_ON StudentId S1 S1
TutorId T1 T2
TutorName Hugh Mary
CourseId C1 C2
S2 S3 S4
T3 T4 T1
Lisa Fred Hugh
C1 C3 C1
Assume the FD { TutorId } → { TutorName } holds.
Splitting TUTORS_ON TUTOR_NAME TutorId T1 T2 T3 T4 T5
TutorName Hugh Mary Lisa Fred Zack
TUTORS_ON_BCNF StudentId S1 S1
TutorId T1 T2
CourseId C1 C2
S2 S3 S4
T3 T4 T1
C1 C3 C1
Now we can put Zack, who isn’t assigned to anybody yet, into the database. Note the FK required for TUTORS_ON_BCNF.
Dependency Preservation SCOR StudentId S1 S1
CourseId C1 C2
Organiser Owen Olga
Room 13 24
S2
C1
Owen
13
Assume FDs:
{ CourseId } → { Organiser } { Organiser } → { Room }
{ Room } → { Organiser }
Which one do we address first?
Try 1: {CourseId}→{Organiser} SCR
CO
StudentId S1 S1
CourseId C1 C2
Room 13 24
S2
C1
13
“Loses” { Room } → { Organiser }
and {Organiser } → { Room }
CourseId Organiser C1 Owen C2
Olga
Try 2: {Room}→{Organiser} SCR
OR
StudentId S1 S1
CourseId C1 C2
Room 13 24
S2
C1
13
“Loses”
{ CourseId } → { Organiser }
Organiser Room Owen 13 Olga
24
Try 3: {Organiser}→{Room} SCO
OR
StudentId S1 S1
CourseId C1 C2
Organiser Owen Olga
S2
C1
Owen
Organiser Room Owen 13 Olga
Preserves all three FDs! (But we must still decompose SCO, of course)
24
An FD That Cannot Be Preserved TUTOR_FOR StudentId S1 S1
TutorId T1 T2
CourseId C1 C2
S2 S3 S4
T3 T4 T1
C1 C3 C1
Now assume the FD { TutorId } → { CourseId } holds. This is a third kind of offending FD.
Splitting TUTOR_FOR TUTORS
TEACHES
StudentId S1 S1
TutorId T1 T2
TutorId T1 T2
CourseId C1 C2
S2 S3 S4
T3 T4 T1
T3 T4
C1 C3
Note the keys.
Have we “lost” the FD { StudentId, CourseId } → { TutorId } ? And the FK referencing IS_ENROLLED_ON?
Example of FD violation TUTORS
TEACHES
StudentId S1 S1
TutorId T1 T2
TutorId T1 T2
CourseId C1 C2
S2 S3 S4 S1
T3 T4 T1 T5
T3 T4 T5 T5
C1 C3 C1 C4
Now, S1 is taught by both T1 and T5 in course C1 (if you do the JOIN), violating our original key.
Reinstating The Lost FD Need to add the following constraint: CONSTRAINT KEY_OF_TUTORS_JOIN_TEACHES
IS_EMPTY ( ( TUTORS JOIN TEACHES )
GROUP { ALL BUT StudentId, CourseId } AS G
WHERE COUNT ( G ) > 1 ) ;
or equivalently: CONSTRAINT KEY_OF_TUTORS_JOIN_TEACHES
WITH TUTORS JOIN TEACHES AS TJT :
COUNT ( TJT ) = COUNT ( TJT { StudentId, CourseId } ) ;
And The Lost Foreign Key The “lost” foreign key is easier: CONSTRAINT FK_FOR_TUTORS_JOIN_TEACHES
IS_EMPTY ( ( TUTORS JOIN TEACHES )
NOT MATCHING
IS_ENROLLED_ON ) ;
FD-driven Logical Design Given a relvar R with heading H and some FDs, decompose R into relvars R1 ... Rn such that ▪ H = H1 ∪ ... ∪ Hn
(attribute preservation)
▪ Ri is in a particular NF, for every i ▪ Every FD appears in a Ri ▪ R = JOIN{ R1 , ... Rn }
(normalized)
(dependency preservation) (lossless decomposition)
Generally on the universal relvar R0, an hypothetical relvar containing all attributes of the Universe of Discourse
Bad news arising • In the general case, you can’t have all three of the following – Guaranteed nonlossy design, – Guaranteed dependency preservation, and – All relvars in BCNF • Nonlossy design being considered a must, one has to make a choice between – Sacrificing BCNF, or – Sacrificing some functional dependencies • i.e. no longer guaranteed without explicit constraints
Interesting cases of Sets of FDs • Irreductible cover S if and only if – Every FD is right-irreductible – Every FD is left-irreductible – No proper subset of S is a cover for S+ Where S+ is the closure of S, i.e. all FDs that are implied by S (using reflexivity & unification theorems)
• Minimal cover – Take an irreductible cover S – If A -> B and A -> C then use A -> B ∪ C instead
Heath’s Theorem • Let A, B and C be subsets of the heading of a relvar R such that every attribute appears in at least one subset And suppose FD A → B holds in R • Then, the JD*{ A∪B, A∪C } holds in R • Hence, the usual lossless decomposition into – R1 = R{A∪B} and R2 = R{A∪C} – Especially useful when • A, B and C are disjoint and, • A is not already a key for R and, • B is non-empty
Normalization to 3NF Input:
a relvar R with keys K1 … Kn a minimal cover S of FDs that hold in R Output: a set D of database relvars Ri in 3NF Post: Attribute preservation Lossless decomposition Dependency preservation, but no BCNF D = { } foreach FD A → B in S D ← RELVAR INIT R{A∪B} KEY A if no relvar in D includes a key Ki for R D ← RELVAR INIT R{Ki} KEY Ki (i arbitrarily return D
chosen)
Normalization to BCNF Input:
a relvar R with keys K1 … Kn a minimal cover S of FDs that hold in R Output: a set D of database relvars Ri in BCNF Post: Attribute preservation Lossless decomposition No dependency preservation D = NormalizeTo3NF(R, S) while one Ri in D not in BCNF let A → B be an offending FD in Ri D = D \ { Ri } D ← RELVAR INIT R{A∪B} KEY A D ← RELVAR INIT R{Hi-B} KEY ??? return D
Example TUTOR_FOR StudentId S1 S1
TutorId T1 T2
CourseId C1 C2
S2 S3 S4
T3 T4 T1
C1 C3 C1
With the following FDs { TutorId } → { CourseId } { StudentId, CourseId } → { TutorId }
Algorithm Execution Example 3NF with dependency preservation D = {} D ←TEACHES = {TutorId, CourseId} D ←TUTOR_FOR = {StudentId, CourseId, TutorId}
BCNF without dependency preservation TUTOR_FOR not in BCNF: { TutorId } → { CourseId } D = D - { TUTOR_FOR } D ← TEACHES = { TutorId, CourseId } D ← TAUGHT_BY = { StudentId, TutorId }
Which one is best? 3NF TEACHES = {TutorId, CourseId} TUTOR_FOR = {StudentId, CourseId, TutorId} CONSTRAINT { TutorId } → { CourseId } ON TUTOR_FOR
BCNF TEACHES = { TutorId, CourseId } TAUGHT_BY = { StudentId, TutorId } CONSTRAINT { StudentId, CourseId } → { TutorId }
ON JOIN { TEACHES, HAS_TUTOR }
Higher Normal Forms Normalization process
6NF
Join Dependencies *{ { … }, { … } }
5NF 4NF
BCNF 3NF
Some of them arising from…
Functional Deps: {…}→{…}
2NF 1NF Only one input relvar, no real “theory” for schema normalization
To sum up: Boyce-Codd Normal Form Given a relvar R and some FDs “every nontrivial determinant is a [candidate] key”. BCNF addresses redundancy arising from JDs that are consequences of FDs.
BCNF is not enough TBC1 Teacher T1 T1
Book Database Systems Database in Depth
CourseId C1 C1
T1 T1 T2
Database Systems Database in Depth Database in Depth
C2 C2 C2
Assume the JD *{ { Teacher, Book }, { Book, CourseId } } holds.
Normalising TBC1 TB
BC
Teacher Book T1 Database Systems T1 Database in Depth T2
Database in Depth
Book CourseId Database Systems C1 Database in Depth C1 Database Systems Database in Depth
C2 C2
We have lost the constraint implied by the JD, but does a teacher really have to teach a course just because he or she uses a book that is used on that course?
Fifth Normal Form (5NF) 5NF caters for all harmful JDs. Relvar R is in 5NF iff every nontrivial JD that holds in R is implied by the keys of R. (Fagin’s definition, 1979) Apart from a few weird exceptions, a JD is “implied by the keys” if every projection is a superkey. (Date’s definition – but see the Notes for this slide) To explain “nontrivial”: A JD is trivial if and only if one of its operands is the entire heading of R (because every such JD is clearly satisfied by R).
A JD of Degree > 2 TBC2 Teacher T1 T1
Book Database Systems Database in Depth
CourseId C1 C1
T1 T1 T2
Database Systems Database in Depth Database in Depth
C2 C2 C2
Now assume the JD *{ { Teacher, Book }, { Book, CourseId }, { Teacher, CourseId } } holds.
Normalising TBC2 TB
BC
Teacher Book T1 Database Systems T1 Database in Depth T2 TC
Database in Depth Teacher T1 T1
CourseId C1 C2
T2
C2
Book CourseId Database Systems C1 Database in Depth C1 Database Systems Database in Depth
C2 C2
(and we’ve “lost” the constraint again)
Sixth Normal Form (6NF) 6NF subsumes 5NF and is the strictest NF: Relvar R is in 6NF if and only if every JD that holds in R is trivial. 6NF provides maximal orthogonality, as already noted, but is not normally advised. It addresses additional anomalies that can arise with temporal data (beyond the scope of this course—and, what’s more, the definition of join dependency has to be revised).
Wives of Henry VIII in 6NF W_FN
W_LN
Wife# 1 2
FirstName Catherine Anne
3 4 5 6
Jane Anne Catherine Catherine
Not a good idea!
W_F
Wife# LastName 1 of Aragon 2 Boleyn 3 4 5 6
Seymour of Cleves Howard Parr
Wife# Fate 1 divorced 2 beheaded 3 4 5 6
died divorced beheaded survived