LINGI2172 – Databases Logical Database Design Part I –Normal Forms & Normalization UCLouvain / ICTEAM, 2014-2015 Bernard Lambeau bernard.lambeau@uclou...

15 downloads 15 Views 333KB Size

UCLouvain / ICTEAM, 2014-2015 Bernard Lambeau [email protected]

LINGI2172 – Databases Logical Database Design Part I –Normal Forms & Normalization Originally taken, then slightly adapted, from “CS252 Fundamentals of Relational Databases” by Hugh Darwen at University of Warwick www.dcs.warwick.ac.uk/~hugh

Textbook / Reading “An Introduction to...”, H. Darwen • Chapter 7 & 8 • Tutorial D

“Database Systems”, Elmasri & Navathe • Chapter 14 & 15 • Slightly different point of view

What is logical design about? • Candidate database design(s)

– How good is it? (if only one) – What is the “best” one? (if more than one)

• Mostly about requirements

– Don’t try to find an answer without knowing your requirements in the first place

• Yet a few very common, “reusable” requirements – Avoid redundancy

• Leads to update anomalies, difficulties and/or data inconsistency

– Orthogonality

• Prefer a database design that helps tackling new requirements easily

– Keep it simple, yet not too simple

• Prefer a simple database design over a complex one

Best Design Quote Ever There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

Tony Hoare, The Emperor’s Old Clothes, Turing Award Lecture, 1980

A “Reducible” Relation WIFE_OF_HENRY_VIII Wife# 1 2

FirstName Catherine Anne

LastName of Aragon Boleyn

Fate divorced beheaded

3 4 5 6

Jane Anne Catherine Catherine

Seymour of Cleves Howard Parr

died divorced beheaded survived

(Note the underscoring of the primary key attribute.)

“Decomposing” H8’s Wives W_LN_F

W_FN Wife# 1 2

FirstName Catherine Anne

Wife# 1 2

LastName of Aragon Boleyn

Fate divorced beheaded

3 4 5 6

Jane Anne Catherine Catherine

3 4 5 6

Seymour of Cleves Howard Parr

died divorced beheaded survived

Join Dependency The join dependency (JD) that holds in WIFE_OF_HENRY_VIII and allows us to decompose in W_FN and W_LN_F is written like this: * { { Wife#, FirstName }, { Wife#, LastName, Fate } } • The star indicates a JD. • The operands of a JD are written inside braces. • Each operand is a set of attributes, hence the inner braces. If a given JD holds in relvar r, then at all times r = the join of the projections indicated by the operands of the JD.

A Join Dependency That Does Not Hold Although W_FN is irreducible, we can of course take several projections of it, the following two in particular: W_FN { Wife# }

W_FN { FirstName }

Wife# 1 2

FirstName Catherine Anne

3 4 5 6

Jane But the JOIN of these two does not yield W_FN, so the JD *{ { Wife# }, { FirstName} } does not hold in W_FN.

Decomposition of W_LN_F W_LN_F can be further decomposed: W_LN

W_F

Wife# 1 2

LastName of Aragon Boleyn

Wife# 1 2

Fate divorced beheaded

3 4 5 6

Seymour of Cleves Howard Parr

3 4 5 6

died divorced beheaded survived

3-Way Join Dependency So the following JD holds in W_LN_F: * { { Wife#, LastName }, { Wife#, Fate } }

and we can conclude that the following 3-way JD holds in WIFE_OF_HENRY_VIII: * { { Wife#, FirstName }, { Wife#, LastName }, { Wife#, Fate } }

i.e., WIFE_OF_HENRY_VIII = WIFE_OF_HENRY_VIII { Wife#, FirstName } JOIN WIFE_OF_HENRY_VIII { Wife#, LastName } JOIN WIFE_OF_HENRY_VIII { Wife#, Fate }

Design Comparison Does the decomposed design have any advantages? The single relvar design carries an implicit constraint to the effect that every wife has a wife number, a first name, a last name and a fate. This constraint is not implicit in the decomposed design. In fact, to enforce it, each of W_FN, W_LN and W_F needs a foreign key referencing each of the other two. But then the first attempt to insert a tuple into any of them must fail (unless multiple assignment is available).

Conclusion In the example at hand, the single relvar design is preferred, so long as the constraint implied by it truly reflects the real world. (For example, if it turns out that in fact not every wife has a last name, then we should separate out W_LN.) But the example at hand is rather special: • Each operand of the 3-way JD that holds in WIFE_OF_HENRY_VIII includes a key of that relvar • and it is the same key in each case, viz. {Wife#} We need to look at some not-so-special examples.

A Not-So-Special JD ENROLMENT StudentId S1 S1

Name Anne Anne

CourseId C1 C2

S2 S3 S4

Boris Cindy Devinder

C1 C3 C1

Recall that we decided to “split” (i.e., decompose) this one, as follows …

Splitting ENROLMENT IS_CALLED StudentId Name S1 Anne S2 Boris S3 Cindy S4 Devinder

IS_ENROLLED_ON StudentId S1 S1

CourseId C1 C2

S2 S3 S4

C1 C3 C1

Notice the JD: *{ { StudentId, Name }, { StudentId, CourseId } } that holds in ENROLMENT.

Functional Dependency Because {StudentId} is a key of the projection ENROLMENT{StudentId, Name}, we say that the following functional dependency (FD) holds in ENROLMENT: { StudentId } → { Name } • The arrow, pronounced “determines”, indicates an FD. • Each operand is a set of attributes (hence the braces). Name is a function of StudentId. For each StudentId there is exactly one Name.

Anatomy of an FD →

A

B

The determinant “determines”

The dependant set

Reminder: The determinant is a set of attributes, and so is the dependant set.

P.S. “dependant” is not a misspelling! It’s the noun, not the adjective.

FDs That Do Not Hold in ENROLMENT { { { { { {

Name } → { StudentId } Name } → { CourseId } CourseId } → { StudentId } CourseId, Name } → { StudentId } StudentId } → { CourseId } StudentId, Name } → { CourseId }

also: { x } → { StudentId }, because x is not an attribute of ENROLMENT.

Theorems About FDs Assume A → B holds in r, e.g. {StudentId} → {Name, Gender}. Then:

Left-Augmentation: If A’ is a superset of A, then A’ → B holds in r. Right-reduction: If B’ is a subset of B, then A → B’ holds in r. Transitivity: If A → B and B → C hold in r, then A → C holds in r. In general: If A → B and C → D hold in r, then: AU(C–B)→BUD holds in r.

Left-Irreducibility If A → B holds in r and no proper subset A’ of A is such that A’ → B holds in r, then A → B is a left-irreducible FD (in r). B is sometimes said to be fully dependent on A. Conversely, if there is such a subset A’, then A → B is a left-reducible FD (in r). B is therefore not fully dependent on A.

FDs and Keys If A → B is an FD in r and A U B constitutes the entire heading of r, then A is a superkey of r. If A → B is a left-irreducible FD in r and A U B constitutes the entire heading of r, then A is a key of r. (The longer term candidate key is often used instead of key, for historical reasons.)

Normal Forms Arising from the study of JDs in general and FDs in particular, various “normal forms” have been defined: • First Normal Form (1NF) • Second Normal Form (2NF) • Third Normal Form (3NF) • Boyce/Codd Normal Form (BCNF) • Fourth Normal Form (4NF) • Fifth Normal Form (5NF) • Sixth Normal Form (6NF)

Each of these is in a sense stricter than its immediate predecessor. • The ones shown in bold are particularly important. • The others were early attempts that proved inadequate.

Normalisation Normalisation is the act of decomposing a relvar that fails to satisfy a given normal form (e.g., BCNF) such that the result is an equivalent set of two or more “smaller” relvars that do satisfy that normal form. We decompose by taking the projections specified in a given join dependency (JD). In the case of ENROLMENT, the given JD is *{ { StudentId, Name }, {StudentId, CourseId } } determined by the FD { StudentId } → { Name }

Purposes of Normalisation A database all of whose relvars satisfy 6NF has the following possibly desirable properties:

• No redundancy • no recording of the same information more than once

• No “update anomalies” • to be explained later

• Full orthogonality • independent recording of the simplest facts But 5NF is usually sufficient (and 6NF is sometimes problematical with existing technology), as we shall see …

Normal Forms Normalization process

6NF

Join Dependencies *{ { … }, { … } }

5NF 4NF

BCNF 3NF

Some of them arising from…

Functional Deps: {…}→{…}

2NF 1NF Only one input relvar, no real “theory” for schema normalization

First Normal Form (1NF) A relation is in 1NF if attribute domains include only atomic (simple, indivisible) values and each tuple attribute value is a single value from the domain. The term (due to E.F. Codd) is not clearly defined, partly because it depends on an ill-defined concept of “atomicity”. We shall assume that every relation, and therefore every relvar, is in 1NF, under the assumption of a precise definition of data types (e.g. as in Tutorial D). Some authorities take it that a relation is in 1NF iff none of its attributes is relation-valued or tuple-valued. It is sometimes recommended to avoid use of such attributes (especially RVAs) in (base) database relvars.

2NF and 3NF These normal forms, originally defined by E.F. Codd, were really “mistakes”. You will find definitions in the textbooks but there is no need to learn them. The faults with Codd’s original definition of 3NF were reported to him by Raymond Boyce. Together they worked on an improved, simpler normal form, which became known as Boyce-Codd Normal Form (BCNF).

Boyce/Codd Normal Form (BCNF) BCNF is defined thus: Relvar R is in BCNF if and only if for every nontrivial FD A → B satisfied by R, A is a superkey of R. More loosely, “every nontrivial determinant is a [candidate] key”.

BCNF addresses redundancy arising from JDs that are consequences of FDs. (Not all JDs are consequences of FDs. We will look at the others later.)

Splitting ENROLMENT ENROLMENT StudentId S1 S1

Name Anne Anne

CourseId C1 C2

S2 S3 S4

Boris Cindy Devinder

C1 C3 C1

KEY{ StudentId, CourseId } while { StudentId } → { Name } FD always holds on the relvar

Splitting ENROLMENT (bis) IS_CALLED StudentId Name S1 Anne S2 Boris S3 Cindy S4 Devinder S5 Boris

IS_ENROLLED_ON StudentId S1 S1

CourseId C1 C2

S2 S3 S4

C1 C3 C1

The attributes involved in the “offending” FD have been separated into IS_CALLED, and now we can add student S5!

Advantages of BCNF With reference to our ENROLMENT example, decomposed into the BCNF relvars IS_CALLED and IS_ENROLLED_ON: • Anne’s name is recorded twice in ENROLMENT, but only once in IS_CALLED. In ENROLMENT it might appear under different spellings (Anne, Ann), unless the FD { StudentId } → { Name} is declared as a constraint. Redundancy is the problem here. • With ENROLMENT, a student’s name cannot be recorded unless that student is enrolled on some course, and an anonymous student cannot be enrolled on any course. Lack of orthogonality is the problem here.

Another Kind of Offending FD TUTORS_ON StudentId S1 S1

TutorId T1 T2

TutorName Hugh Mary

CourseId C1 C2

S2 S3 S4

T3 T4 T1

Lisa Fred Hugh

C1 C3 C1

Assume the FD { TutorId } → { TutorName } holds.

Splitting TUTORS_ON TUTOR_NAME TutorId T1 T2 T3 T4 T5

TutorName Hugh Mary Lisa Fred Zack

TUTORS_ON_BCNF StudentId S1 S1

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

C1 C3 C1

Now we can put Zack, who isn’t assigned to anybody yet, into the database. Note the FK required for TUTORS_ON_BCNF.

Dependency Preservation SCOR StudentId S1 S1

CourseId C1 C2

Organiser Owen Olga

Room 13 24

S2

C1

Owen

13

Assume FDs:

{ CourseId } → { Organiser } { Organiser } → { Room } { Room } → { Organiser }

Which one do we address first?

Try 1: {CourseId}→{Organiser} SCR

CO

StudentId S1 S1

CourseId C1 C2

Room 13 24

S2

C1

13

“Loses” { Room } → { Organiser } and {Organiser } → { Room }

CourseId Organiser C1 Owen C2

Olga

Try 2: {Room}→{Organiser} SCR

OR

StudentId S1 S1

CourseId C1 C2

Room 13 24

S2

C1

13

“Loses”

{ CourseId } → { Organiser }

Organiser Room Owen 13 Olga

24

Try 3: {Organiser}→{Room} SCO

OR

StudentId S1 S1

CourseId C1 C2

Organiser Owen Olga

S2

C1

Owen

Organiser Room Owen 13 Olga

Preserves all three FDs! (But we must still decompose SCO, of course)

24

An FD That Cannot Be Preserved TUTOR_FOR StudentId S1 S1

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

C1 C3 C1

Now assume the FD { TutorId } → { CourseId } holds. This is a third kind of offending FD.

Splitting TUTOR_FOR TUTORS

TEACHES

StudentId S1 S1

TutorId T1 T2

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

T3 T4

C1 C3

Note the keys. Have we “lost” the FD { StudentId, CourseId } → { TutorId } ? And the FK referencing IS_ENROLLED_ON?

Example of FD violation TUTORS

TEACHES

StudentId S1 S1

TutorId T1 T2

TutorId T1 T2

CourseId C1 C2

S2 S3 S4 S1

T3 T4 T1 T5

T3 T4 T5 T5

C1 C3 C1 C4

Now, S1 is taught by both T1 and T5 in course C1 (if you do the JOIN), violating our original key.

Reinstating The Lost FD Need to add the following constraint: CONSTRAINT KEY_OF_TUTORS_JOIN_TEACHES IS_EMPTY ( ( TUTORS JOIN TEACHES ) GROUP { ALL BUT StudentId, CourseId } AS G WHERE COUNT ( G ) > 1 ) ;

or equivalently: CONSTRAINT KEY_OF_TUTORS_JOIN_TEACHES WITH TUTORS JOIN TEACHES AS TJT : COUNT ( TJT ) = COUNT ( TJT { StudentId, CourseId } ) ;

And The Lost Foreign Key The “lost” foreign key is easier: CONSTRAINT FK_FOR_TUTORS_JOIN_TEACHES IS_EMPTY ( ( TUTORS JOIN TEACHES ) NOT MATCHING IS_ENROLLED_ON ) ;

FD-driven Logical Design Given a relvar R with heading H and some FDs, decompose R into relvars R1 ... Rn such that ▪ H = H1 ∪ ... ∪ Hn

(attribute preservation)

▪ Ri is in a particular NF, for every i ▪ Every FD appears in a Ri ▪ R = JOIN{ R1 , ... Rn }

(normalized)

(dependency preservation) (lossless decomposition)

Generally on the universal relvar R0, an hypothetical relvar containing all attributes of the Universe of Discourse

Bad news arising • In the general case, you can’t have all three of the following – Guaranteed nonlossy design, – Guaranteed dependency preservation, and – All relvars in BCNF • Nonlossy design being considered a must, one has to make a choice between – Sacrificing BCNF, or – Sacrificing some functional dependencies • i.e. no longer guaranteed without explicit constraints

Interesting cases of Sets of FDs • Irreductible cover S if and only if – Every FD is right-irreductible – Every FD is left-irreductible – No proper subset of S is a cover for S+ Where S+ is the closure of S, i.e. all FDs that are implied by S (using reflexivity & unification theorems)

• Minimal cover – Take an irreductible cover S – If A -> B and A -> C then use A -> B ∪ C instead

Heath’s Theorem • Let A, B and C be subsets of the heading of a relvar R such that every attribute appears in at least one subset And suppose FD A → B holds in R • Then, the JD*{ A∪B, A∪C } holds in R • Hence, the usual lossless decomposition into – R1 = R{A∪B} and R2 = R{A∪C} – Especially useful when • A, B and C are disjoint and, • A is not already a key for R and, • B is non-empty

Normalization to 3NF Input:

a relvar R with keys K1 … Kn a minimal cover S of FDs that hold in R Output: a set D of database relvars Ri in 3NF Post: Attribute preservation Lossless decomposition Dependency preservation, but no BCNF D = { } foreach FD A → B in S D ← RELVAR INIT R{A∪B} KEY A if no relvar in D includes a key Ki for R D ← RELVAR INIT R{Ki} KEY Ki (i arbitrarily return D

chosen)

Normalization to BCNF Input:

a relvar R with keys K1 … Kn a minimal cover S of FDs that hold in R Output: a set D of database relvars Ri in BCNF Post: Attribute preservation Lossless decomposition No dependency preservation D = NormalizeTo3NF(R, S) while one Ri in D not in BCNF let A → B be an offending FD in Ri D = D \ { Ri } D ← RELVAR INIT R{A∪B} KEY A D ← RELVAR INIT R{Hi-B} KEY ??? return D

Example TUTOR_FOR StudentId S1 S1

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

C1 C3 C1

With the following FDs { TutorId } → { CourseId } { StudentId, CourseId } → { TutorId }

Algorithm Execution Example 3NF with dependency preservation D = {} D ←TEACHES = {TutorId, CourseId} D ←TUTOR_FOR = {StudentId, CourseId, TutorId}

BCNF without dependency preservation TUTOR_FOR not in BCNF: { TutorId } → { CourseId } D = D - { TUTOR_FOR } D ← TEACHES = { TutorId, CourseId } D ← TAUGHT_BY = { StudentId, TutorId }

Which one is best? 3NF TEACHES = {TutorId, CourseId} TUTOR_FOR = {StudentId, CourseId, TutorId} CONSTRAINT { TutorId } → { CourseId } ON TUTOR_FOR

BCNF TEACHES = { TutorId, CourseId } TAUGHT_BY = { StudentId, TutorId } CONSTRAINT { StudentId, CourseId } → { TutorId } ON JOIN { TEACHES, HAS_TUTOR }

Higher Normal Forms Normalization process

6NF

Join Dependencies *{ { … }, { … } }

5NF 4NF

BCNF 3NF

Some of them arising from…

Functional Deps: {…}→{…}

2NF 1NF Only one input relvar, no real “theory” for schema normalization

To sum up: Boyce-Codd Normal Form Given a relvar R and some FDs “every nontrivial determinant is a [candidate] key”. BCNF addresses redundancy arising from JDs that are consequences of FDs.

BCNF is not enough TBC1 Teacher T1 T1

Book Database Systems Database in Depth

CourseId C1 C1

T1 T1 T2

Database Systems Database in Depth Database in Depth

C2 C2 C2

Assume the JD *{ { Teacher, Book }, { Book, CourseId } } holds.

Normalising TBC1 TB

BC

Teacher Book T1 Database Systems T1 Database in Depth T2

Database in Depth

Book CourseId Database Systems C1 Database in Depth C1 Database Systems Database in Depth

C2 C2

We have lost the constraint implied by the JD, but does a teacher really have to teach a course just because he or she uses a book that is used on that course?

Fifth Normal Form (5NF) 5NF caters for all harmful JDs. Relvar R is in 5NF iff every nontrivial JD that holds in R is implied by the keys of R. (Fagin’s definition, 1979) Apart from a few weird exceptions, a JD is “implied by the keys” if every projection is a superkey. (Date’s definition – but see the Notes for this slide) To explain “nontrivial”: A JD is trivial if and only if one of its operands is the entire heading of R (because every such JD is clearly satisfied by R).

A JD of Degree > 2 TBC2 Teacher T1 T1

Book Database Systems Database in Depth

CourseId C1 C1

T1 T1 T2

Database Systems Database in Depth Database in Depth

C2 C2 C2

Now assume the JD *{ { Teacher, Book }, { Book, CourseId }, { Teacher, CourseId } } holds.

Normalising TBC2 TB

BC

Teacher Book T1 Database Systems T1 Database in Depth T2 TC

Database in Depth Teacher T1 T1

CourseId C1 C2

T2

C2

Book CourseId Database Systems C1 Database in Depth C1 Database Systems Database in Depth

C2 C2

(and we’ve “lost” the constraint again)

Sixth Normal Form (6NF) 6NF subsumes 5NF and is the strictest NF: Relvar R is in 6NF if and only if every JD that holds in R is trivial. 6NF provides maximal orthogonality, as already noted, but is not normally advised. It addresses additional anomalies that can arise with temporal data (beyond the scope of this course—and, what’s more, the definition of join dependency has to be revised).

Wives of Henry VIII in 6NF W_FN

W_LN

Wife# 1 2

FirstName Catherine Anne

3 4 5 6

Jane Anne Catherine Catherine

Not a good idea!

W_F

Wife# LastName 1 of Aragon 2 Boleyn 3 4 5 6

Seymour of Cleves Howard Parr

Wife# Fate 1 divorced 2 beheaded 3 4 5 6

died divorced beheaded survived

LINGI2172 – Databases Logical Database Design Part I –Normal Forms & Normalization Originally taken, then slightly adapted, from “CS252 Fundamentals of Relational Databases” by Hugh Darwen at University of Warwick www.dcs.warwick.ac.uk/~hugh

Textbook / Reading “An Introduction to...”, H. Darwen • Chapter 7 & 8 • Tutorial D

“Database Systems”, Elmasri & Navathe • Chapter 14 & 15 • Slightly different point of view

What is logical design about? • Candidate database design(s)

– How good is it? (if only one) – What is the “best” one? (if more than one)

• Mostly about requirements

– Don’t try to find an answer without knowing your requirements in the first place

• Yet a few very common, “reusable” requirements – Avoid redundancy

• Leads to update anomalies, difficulties and/or data inconsistency

– Orthogonality

• Prefer a database design that helps tackling new requirements easily

– Keep it simple, yet not too simple

• Prefer a simple database design over a complex one

Best Design Quote Ever There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

Tony Hoare, The Emperor’s Old Clothes, Turing Award Lecture, 1980

A “Reducible” Relation WIFE_OF_HENRY_VIII Wife# 1 2

FirstName Catherine Anne

LastName of Aragon Boleyn

Fate divorced beheaded

3 4 5 6

Jane Anne Catherine Catherine

Seymour of Cleves Howard Parr

died divorced beheaded survived

(Note the underscoring of the primary key attribute.)

“Decomposing” H8’s Wives W_LN_F

W_FN Wife# 1 2

FirstName Catherine Anne

Wife# 1 2

LastName of Aragon Boleyn

Fate divorced beheaded

3 4 5 6

Jane Anne Catherine Catherine

3 4 5 6

Seymour of Cleves Howard Parr

died divorced beheaded survived

Join Dependency The join dependency (JD) that holds in WIFE_OF_HENRY_VIII and allows us to decompose in W_FN and W_LN_F is written like this: * { { Wife#, FirstName }, { Wife#, LastName, Fate } } • The star indicates a JD. • The operands of a JD are written inside braces. • Each operand is a set of attributes, hence the inner braces. If a given JD holds in relvar r, then at all times r = the join of the projections indicated by the operands of the JD.

A Join Dependency That Does Not Hold Although W_FN is irreducible, we can of course take several projections of it, the following two in particular: W_FN { Wife# }

W_FN { FirstName }

Wife# 1 2

FirstName Catherine Anne

3 4 5 6

Jane But the JOIN of these two does not yield W_FN, so the JD *{ { Wife# }, { FirstName} } does not hold in W_FN.

Decomposition of W_LN_F W_LN_F can be further decomposed: W_LN

W_F

Wife# 1 2

LastName of Aragon Boleyn

Wife# 1 2

Fate divorced beheaded

3 4 5 6

Seymour of Cleves Howard Parr

3 4 5 6

died divorced beheaded survived

3-Way Join Dependency So the following JD holds in W_LN_F: * { { Wife#, LastName }, { Wife#, Fate } }

and we can conclude that the following 3-way JD holds in WIFE_OF_HENRY_VIII: * { { Wife#, FirstName }, { Wife#, LastName }, { Wife#, Fate } }

i.e., WIFE_OF_HENRY_VIII = WIFE_OF_HENRY_VIII { Wife#, FirstName } JOIN WIFE_OF_HENRY_VIII { Wife#, LastName } JOIN WIFE_OF_HENRY_VIII { Wife#, Fate }

Design Comparison Does the decomposed design have any advantages? The single relvar design carries an implicit constraint to the effect that every wife has a wife number, a first name, a last name and a fate. This constraint is not implicit in the decomposed design. In fact, to enforce it, each of W_FN, W_LN and W_F needs a foreign key referencing each of the other two. But then the first attempt to insert a tuple into any of them must fail (unless multiple assignment is available).

Conclusion In the example at hand, the single relvar design is preferred, so long as the constraint implied by it truly reflects the real world. (For example, if it turns out that in fact not every wife has a last name, then we should separate out W_LN.) But the example at hand is rather special: • Each operand of the 3-way JD that holds in WIFE_OF_HENRY_VIII includes a key of that relvar • and it is the same key in each case, viz. {Wife#} We need to look at some not-so-special examples.

A Not-So-Special JD ENROLMENT StudentId S1 S1

Name Anne Anne

CourseId C1 C2

S2 S3 S4

Boris Cindy Devinder

C1 C3 C1

Recall that we decided to “split” (i.e., decompose) this one, as follows …

Splitting ENROLMENT IS_CALLED StudentId Name S1 Anne S2 Boris S3 Cindy S4 Devinder

IS_ENROLLED_ON StudentId S1 S1

CourseId C1 C2

S2 S3 S4

C1 C3 C1

Notice the JD: *{ { StudentId, Name }, { StudentId, CourseId } } that holds in ENROLMENT.

Functional Dependency Because {StudentId} is a key of the projection ENROLMENT{StudentId, Name}, we say that the following functional dependency (FD) holds in ENROLMENT: { StudentId } → { Name } • The arrow, pronounced “determines”, indicates an FD. • Each operand is a set of attributes (hence the braces). Name is a function of StudentId. For each StudentId there is exactly one Name.

Anatomy of an FD →

A

B

The determinant “determines”

The dependant set

Reminder: The determinant is a set of attributes, and so is the dependant set.

P.S. “dependant” is not a misspelling! It’s the noun, not the adjective.

FDs That Do Not Hold in ENROLMENT { { { { { {

Name } → { StudentId } Name } → { CourseId } CourseId } → { StudentId } CourseId, Name } → { StudentId } StudentId } → { CourseId } StudentId, Name } → { CourseId }

also: { x } → { StudentId }, because x is not an attribute of ENROLMENT.

Theorems About FDs Assume A → B holds in r, e.g. {StudentId} → {Name, Gender}. Then:

Left-Augmentation: If A’ is a superset of A, then A’ → B holds in r. Right-reduction: If B’ is a subset of B, then A → B’ holds in r. Transitivity: If A → B and B → C hold in r, then A → C holds in r. In general: If A → B and C → D hold in r, then: AU(C–B)→BUD holds in r.

Left-Irreducibility If A → B holds in r and no proper subset A’ of A is such that A’ → B holds in r, then A → B is a left-irreducible FD (in r). B is sometimes said to be fully dependent on A. Conversely, if there is such a subset A’, then A → B is a left-reducible FD (in r). B is therefore not fully dependent on A.

FDs and Keys If A → B is an FD in r and A U B constitutes the entire heading of r, then A is a superkey of r. If A → B is a left-irreducible FD in r and A U B constitutes the entire heading of r, then A is a key of r. (The longer term candidate key is often used instead of key, for historical reasons.)

Normal Forms Arising from the study of JDs in general and FDs in particular, various “normal forms” have been defined: • First Normal Form (1NF) • Second Normal Form (2NF) • Third Normal Form (3NF) • Boyce/Codd Normal Form (BCNF) • Fourth Normal Form (4NF) • Fifth Normal Form (5NF) • Sixth Normal Form (6NF)

Each of these is in a sense stricter than its immediate predecessor. • The ones shown in bold are particularly important. • The others were early attempts that proved inadequate.

Normalisation Normalisation is the act of decomposing a relvar that fails to satisfy a given normal form (e.g., BCNF) such that the result is an equivalent set of two or more “smaller” relvars that do satisfy that normal form. We decompose by taking the projections specified in a given join dependency (JD). In the case of ENROLMENT, the given JD is *{ { StudentId, Name }, {StudentId, CourseId } } determined by the FD { StudentId } → { Name }

Purposes of Normalisation A database all of whose relvars satisfy 6NF has the following possibly desirable properties:

• No redundancy • no recording of the same information more than once

• No “update anomalies” • to be explained later

• Full orthogonality • independent recording of the simplest facts But 5NF is usually sufficient (and 6NF is sometimes problematical with existing technology), as we shall see …

Normal Forms Normalization process

6NF

Join Dependencies *{ { … }, { … } }

5NF 4NF

BCNF 3NF

Some of them arising from…

Functional Deps: {…}→{…}

2NF 1NF Only one input relvar, no real “theory” for schema normalization

First Normal Form (1NF) A relation is in 1NF if attribute domains include only atomic (simple, indivisible) values and each tuple attribute value is a single value from the domain. The term (due to E.F. Codd) is not clearly defined, partly because it depends on an ill-defined concept of “atomicity”. We shall assume that every relation, and therefore every relvar, is in 1NF, under the assumption of a precise definition of data types (e.g. as in Tutorial D). Some authorities take it that a relation is in 1NF iff none of its attributes is relation-valued or tuple-valued. It is sometimes recommended to avoid use of such attributes (especially RVAs) in (base) database relvars.

2NF and 3NF These normal forms, originally defined by E.F. Codd, were really “mistakes”. You will find definitions in the textbooks but there is no need to learn them. The faults with Codd’s original definition of 3NF were reported to him by Raymond Boyce. Together they worked on an improved, simpler normal form, which became known as Boyce-Codd Normal Form (BCNF).

Boyce/Codd Normal Form (BCNF) BCNF is defined thus: Relvar R is in BCNF if and only if for every nontrivial FD A → B satisfied by R, A is a superkey of R. More loosely, “every nontrivial determinant is a [candidate] key”.

BCNF addresses redundancy arising from JDs that are consequences of FDs. (Not all JDs are consequences of FDs. We will look at the others later.)

Splitting ENROLMENT ENROLMENT StudentId S1 S1

Name Anne Anne

CourseId C1 C2

S2 S3 S4

Boris Cindy Devinder

C1 C3 C1

KEY{ StudentId, CourseId } while { StudentId } → { Name } FD always holds on the relvar

Splitting ENROLMENT (bis) IS_CALLED StudentId Name S1 Anne S2 Boris S3 Cindy S4 Devinder S5 Boris

IS_ENROLLED_ON StudentId S1 S1

CourseId C1 C2

S2 S3 S4

C1 C3 C1

The attributes involved in the “offending” FD have been separated into IS_CALLED, and now we can add student S5!

Advantages of BCNF With reference to our ENROLMENT example, decomposed into the BCNF relvars IS_CALLED and IS_ENROLLED_ON: • Anne’s name is recorded twice in ENROLMENT, but only once in IS_CALLED. In ENROLMENT it might appear under different spellings (Anne, Ann), unless the FD { StudentId } → { Name} is declared as a constraint. Redundancy is the problem here. • With ENROLMENT, a student’s name cannot be recorded unless that student is enrolled on some course, and an anonymous student cannot be enrolled on any course. Lack of orthogonality is the problem here.

Another Kind of Offending FD TUTORS_ON StudentId S1 S1

TutorId T1 T2

TutorName Hugh Mary

CourseId C1 C2

S2 S3 S4

T3 T4 T1

Lisa Fred Hugh

C1 C3 C1

Assume the FD { TutorId } → { TutorName } holds.

Splitting TUTORS_ON TUTOR_NAME TutorId T1 T2 T3 T4 T5

TutorName Hugh Mary Lisa Fred Zack

TUTORS_ON_BCNF StudentId S1 S1

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

C1 C3 C1

Now we can put Zack, who isn’t assigned to anybody yet, into the database. Note the FK required for TUTORS_ON_BCNF.

Dependency Preservation SCOR StudentId S1 S1

CourseId C1 C2

Organiser Owen Olga

Room 13 24

S2

C1

Owen

13

Assume FDs:

{ CourseId } → { Organiser } { Organiser } → { Room } { Room } → { Organiser }

Which one do we address first?

Try 1: {CourseId}→{Organiser} SCR

CO

StudentId S1 S1

CourseId C1 C2

Room 13 24

S2

C1

13

“Loses” { Room } → { Organiser } and {Organiser } → { Room }

CourseId Organiser C1 Owen C2

Olga

Try 2: {Room}→{Organiser} SCR

OR

StudentId S1 S1

CourseId C1 C2

Room 13 24

S2

C1

13

“Loses”

{ CourseId } → { Organiser }

Organiser Room Owen 13 Olga

24

Try 3: {Organiser}→{Room} SCO

OR

StudentId S1 S1

CourseId C1 C2

Organiser Owen Olga

S2

C1

Owen

Organiser Room Owen 13 Olga

Preserves all three FDs! (But we must still decompose SCO, of course)

24

An FD That Cannot Be Preserved TUTOR_FOR StudentId S1 S1

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

C1 C3 C1

Now assume the FD { TutorId } → { CourseId } holds. This is a third kind of offending FD.

Splitting TUTOR_FOR TUTORS

TEACHES

StudentId S1 S1

TutorId T1 T2

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

T3 T4

C1 C3

Note the keys. Have we “lost” the FD { StudentId, CourseId } → { TutorId } ? And the FK referencing IS_ENROLLED_ON?

Example of FD violation TUTORS

TEACHES

StudentId S1 S1

TutorId T1 T2

TutorId T1 T2

CourseId C1 C2

S2 S3 S4 S1

T3 T4 T1 T5

T3 T4 T5 T5

C1 C3 C1 C4

Now, S1 is taught by both T1 and T5 in course C1 (if you do the JOIN), violating our original key.

Reinstating The Lost FD Need to add the following constraint: CONSTRAINT KEY_OF_TUTORS_JOIN_TEACHES IS_EMPTY ( ( TUTORS JOIN TEACHES ) GROUP { ALL BUT StudentId, CourseId } AS G WHERE COUNT ( G ) > 1 ) ;

or equivalently: CONSTRAINT KEY_OF_TUTORS_JOIN_TEACHES WITH TUTORS JOIN TEACHES AS TJT : COUNT ( TJT ) = COUNT ( TJT { StudentId, CourseId } ) ;

And The Lost Foreign Key The “lost” foreign key is easier: CONSTRAINT FK_FOR_TUTORS_JOIN_TEACHES IS_EMPTY ( ( TUTORS JOIN TEACHES ) NOT MATCHING IS_ENROLLED_ON ) ;

FD-driven Logical Design Given a relvar R with heading H and some FDs, decompose R into relvars R1 ... Rn such that ▪ H = H1 ∪ ... ∪ Hn

(attribute preservation)

▪ Ri is in a particular NF, for every i ▪ Every FD appears in a Ri ▪ R = JOIN{ R1 , ... Rn }

(normalized)

(dependency preservation) (lossless decomposition)

Generally on the universal relvar R0, an hypothetical relvar containing all attributes of the Universe of Discourse

Bad news arising • In the general case, you can’t have all three of the following – Guaranteed nonlossy design, – Guaranteed dependency preservation, and – All relvars in BCNF • Nonlossy design being considered a must, one has to make a choice between – Sacrificing BCNF, or – Sacrificing some functional dependencies • i.e. no longer guaranteed without explicit constraints

Interesting cases of Sets of FDs • Irreductible cover S if and only if – Every FD is right-irreductible – Every FD is left-irreductible – No proper subset of S is a cover for S+ Where S+ is the closure of S, i.e. all FDs that are implied by S (using reflexivity & unification theorems)

• Minimal cover – Take an irreductible cover S – If A -> B and A -> C then use A -> B ∪ C instead

Heath’s Theorem • Let A, B and C be subsets of the heading of a relvar R such that every attribute appears in at least one subset And suppose FD A → B holds in R • Then, the JD*{ A∪B, A∪C } holds in R • Hence, the usual lossless decomposition into – R1 = R{A∪B} and R2 = R{A∪C} – Especially useful when • A, B and C are disjoint and, • A is not already a key for R and, • B is non-empty

Normalization to 3NF Input:

a relvar R with keys K1 … Kn a minimal cover S of FDs that hold in R Output: a set D of database relvars Ri in 3NF Post: Attribute preservation Lossless decomposition Dependency preservation, but no BCNF D = { } foreach FD A → B in S D ← RELVAR INIT R{A∪B} KEY A if no relvar in D includes a key Ki for R D ← RELVAR INIT R{Ki} KEY Ki (i arbitrarily return D

chosen)

Normalization to BCNF Input:

a relvar R with keys K1 … Kn a minimal cover S of FDs that hold in R Output: a set D of database relvars Ri in BCNF Post: Attribute preservation Lossless decomposition No dependency preservation D = NormalizeTo3NF(R, S) while one Ri in D not in BCNF let A → B be an offending FD in Ri D = D \ { Ri } D ← RELVAR INIT R{A∪B} KEY A D ← RELVAR INIT R{Hi-B} KEY ??? return D

Example TUTOR_FOR StudentId S1 S1

TutorId T1 T2

CourseId C1 C2

S2 S3 S4

T3 T4 T1

C1 C3 C1

With the following FDs { TutorId } → { CourseId } { StudentId, CourseId } → { TutorId }

Algorithm Execution Example 3NF with dependency preservation D = {} D ←TEACHES = {TutorId, CourseId} D ←TUTOR_FOR = {StudentId, CourseId, TutorId}

BCNF without dependency preservation TUTOR_FOR not in BCNF: { TutorId } → { CourseId } D = D - { TUTOR_FOR } D ← TEACHES = { TutorId, CourseId } D ← TAUGHT_BY = { StudentId, TutorId }

Which one is best? 3NF TEACHES = {TutorId, CourseId} TUTOR_FOR = {StudentId, CourseId, TutorId} CONSTRAINT { TutorId } → { CourseId } ON TUTOR_FOR

BCNF TEACHES = { TutorId, CourseId } TAUGHT_BY = { StudentId, TutorId } CONSTRAINT { StudentId, CourseId } → { TutorId } ON JOIN { TEACHES, HAS_TUTOR }

Higher Normal Forms Normalization process

6NF

Join Dependencies *{ { … }, { … } }

5NF 4NF

BCNF 3NF

Some of them arising from…

Functional Deps: {…}→{…}

2NF 1NF Only one input relvar, no real “theory” for schema normalization

To sum up: Boyce-Codd Normal Form Given a relvar R and some FDs “every nontrivial determinant is a [candidate] key”. BCNF addresses redundancy arising from JDs that are consequences of FDs.

BCNF is not enough TBC1 Teacher T1 T1

Book Database Systems Database in Depth

CourseId C1 C1

T1 T1 T2

Database Systems Database in Depth Database in Depth

C2 C2 C2

Assume the JD *{ { Teacher, Book }, { Book, CourseId } } holds.

Normalising TBC1 TB

BC

Teacher Book T1 Database Systems T1 Database in Depth T2

Database in Depth

Book CourseId Database Systems C1 Database in Depth C1 Database Systems Database in Depth

C2 C2

We have lost the constraint implied by the JD, but does a teacher really have to teach a course just because he or she uses a book that is used on that course?

Fifth Normal Form (5NF) 5NF caters for all harmful JDs. Relvar R is in 5NF iff every nontrivial JD that holds in R is implied by the keys of R. (Fagin’s definition, 1979) Apart from a few weird exceptions, a JD is “implied by the keys” if every projection is a superkey. (Date’s definition – but see the Notes for this slide) To explain “nontrivial”: A JD is trivial if and only if one of its operands is the entire heading of R (because every such JD is clearly satisfied by R).

A JD of Degree > 2 TBC2 Teacher T1 T1

Book Database Systems Database in Depth

CourseId C1 C1

T1 T1 T2

Database Systems Database in Depth Database in Depth

C2 C2 C2

Now assume the JD *{ { Teacher, Book }, { Book, CourseId }, { Teacher, CourseId } } holds.

Normalising TBC2 TB

BC

Teacher Book T1 Database Systems T1 Database in Depth T2 TC

Database in Depth Teacher T1 T1

CourseId C1 C2

T2

C2

Book CourseId Database Systems C1 Database in Depth C1 Database Systems Database in Depth

C2 C2

(and we’ve “lost” the constraint again)

Sixth Normal Form (6NF) 6NF subsumes 5NF and is the strictest NF: Relvar R is in 6NF if and only if every JD that holds in R is trivial. 6NF provides maximal orthogonality, as already noted, but is not normally advised. It addresses additional anomalies that can arise with temporal data (beyond the scope of this course—and, what’s more, the definition of join dependency has to be revised).

Wives of Henry VIII in 6NF W_FN

W_LN

Wife# 1 2

FirstName Catherine Anne

3 4 5 6

Jane Anne Catherine Catherine

Not a good idea!

W_F

Wife# LastName 1 of Aragon 2 Boleyn 3 4 5 6

Seymour of Cleves Howard Parr

Wife# Fate 1 divorced 2 beheaded 3 4 5 6

died divorced beheaded survived