Cluster analysis Marco Saerens (UCL), with Christine Decaestecker (ULB) 2 Table of contents 3 Table of contents General introduction Data pre-processi...

6 downloads 25 Views 18MB Size

Cluster analysis Marco Saerens (UCL), with Christine Decaestecker (ULB)

Table of contents

2

Table of contents General introduction Data pre-processing Some clustering algorithms Cluster validity Cluster interpretation Clustering in practice Softwares, further readings 3

Slides references Many slides and figures have been adapted from the slides associated to the following books: – Alpaydin (2004), Introduction to machine learning. The MIT Press. – Duda, Hart & Stork (2000), Pattern classification, John Wiley & Sons. – Han & Kamber (2006), Data mining: concepts and techniques, 2nd ed. Morgan Kaufmann. – Tan, Steinbach & Kumar (2006), Introduction to data mining. Addison Wesley.

As well as from Wikipedia, the free online encyclopedia I really thank these authors for their contribution to machine learning and data mining teaching

4

General introduction

5

General introduction What is clustering ? Clustering versus vector quantization Research fields where clustering is studied An ill-posed problem ? Hierarchical versus fixed number of clusters Notions of proximity Proximity-based versus feature-based clustering Joint clustering of individuals and features Importance of validation Importance of interpretation Some notations

6

What is clustering ? Also called – Typology analysis (biology) – Segmentation (marketing) – Unsupervised learning (AI) – Automatic classification (data analysis)

Here are some illustrations from – Tan, Steinbach & Kumar, Introduction to data mining. Addison Wesley. 7

What is clustering, in brief ? Given a set of data points, each having a set of features, and a similarity measure among them, find clusters such that – Data points in one cluster are more similar to one another – Data points in separate clusters are less similar to one another

Similarity measures: – Euclidean distance if attributes are continuous – Other Problem-specific Measures

8

Application 1: market segmentation Goal: subdivide a market into distinct subsets of customers – where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix

Approach: – Collect different attributes/features of customers based on their geographical and lifestyle related information – Find clusters of similar customers – Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from 9 different clusters

Application 2: document clustering Goal: To find groups of documents that are similar to each other based on the important terms appearing in them Approach: – To identify frequently occurring terms in each document – Form a similarity measure based on the frequencies of different terms – Use it to cluster 10

What is clustering ? Objective: On the basis of – A similarity matrix D between individuals, or – A data matrix X describing the individuals through • Features or variables

Group – The individuals into a set of natural disjoint clusters – With high homogeneity within the clusters – And high heterogeneity between the clusters 11

What is clustering ? When using features, one speaks about – Feature-based or data-based clustering

When using similarities, one speaks about – Similarity-based clustering

12

What is clustering ? This means that we have to define – The features on which we will base our clustering – The concept of similarity – The concept of homogeneity

As well as – How to use these concepts in order to group the individuals = clustering algorithms 13

What is clustering ? Here is an ideal case… ****** ****** *

* *** ** ****** * *

Intracluster Intraclusterdistances distances are minimized are minimized Intercluster Interclusterdistances distances are aremaximized maximized

With well-separated clusters Clusters memberships may be crisp or fuzzy

14

Clustering versus vector quantization Clustering is related to the field of vector quantization (VQ) Objective: To partition a set of individuals In order to code these individuals In some optimal way – VQ techniques are related to clustering – But we do not necessarily want to identify wellseparated clusters 15

Clustering versus vector quantization Example of vector quantization

16

Research field where clustering is studied Clustering is studied in a vide range of fields: – Pattern recognition – Knowledge discovery in databases and data mining – Machine learning – Applied statistics 17

Research field where clustering is studied It is still a very active research field Some recent application areas – Bioinformatics – Computational linguistics and natural language processing

18

An ill-posed problem ? In fact,clustering is often an ill-posed problem Data containing well-separated groups are quite rare There is a problem for defining the « natural » number of clusters

19

An ill-posed problem ? Consider the following data:

20

An ill-posed problem ? Consider the following data:

How many clusters?

Six Clusters

Two Clusters

Four Clusters 21

Hierarchical versus fixed number of clusters Two types of clustering algorithms Hierarchical algorithms – Provide a whole family of solutions – With varying number of clusters – Visual drawing: dendrogram

« Fixed number of clusters » algorithms – The number of clusters is provided in advance

Both techniques can be combined

22

Notions of proximity The notion of proximity is central to clustering – Dissimilarities – Similarities

It will condition the shape of the clusters you will find – Euclidean distance tends to find spherical clusters 23

Notions of proximity Proximities can be computed from – Features measured on individuals – Links betweenindividuals, without any notion of feature – Direct comparisons between individuals – Etc…

24

Notions of proximity When using features measured on individuals Notions of proximity will be different for – Categorical variables – Numerical variables – For mixed features, one has to merge the proximities – In order to obtain a global proximity score between the individuals 25

Notions of proximity When using directly proximities between individuals – One has to use clustering algorithms based on proximity matrices – Not always readily available in computer softwares (except hierarchical clustering)

26

Proximity-based versus feature-based clustering Most feature-based clustering algorithms work on numerical variables only Three alternatives: – Transform categorical features into numerical features and use a standard numerical-feature based clustering – Compute similarities and then use a clustering based on the similarity matrix – Use a clustering algorithm that handles both types of variables 27

Joint clustering of individuals and features One can also try to jointly cluster both – Features and – Individuals

Into mutually disjoint clusters of individuals and features – Depends on the data

28

Importance of validation Validation is quite – Important and – Time-consuming

Some questions that should be adressed: – Are there really clusters? – However, even if there are no really wellseparated clusters, is the clustering still useful? – Are these clusters robust? – How do we compare two clusterings? 29

Importance of interpretation Ones we have obtained clusters, We have to provide an interpretation of these clusters – In order to gain insights about the clustering – In order to communicate the results – Profiling and visualization of the clusters

Time-consuming but very fulfilling work 30

Some notations The data matrix will be X – Each row represents an individual – There are N (sometimes denotes as n) individuals in total – Each column represents a feature measured on the individual

An element i, j of X will be denoted as xij The column vector of p features Xj measured on individual i (realizations) will be xi The random vector containing the realizations of the random variables Xj (features) will be x Cluster i will be denoted by its label, Gi or Ci (ωi in31 supervised classification)

Some notations The distance matrix between individual i and individual j will be denoted as D The similarity matrix between individual i and individual j will be denoted as S

32

Conclusion Clustering – With fuzzy vs crisp membership – Feature-based vs similarity-based – Hierarchical vs fixed number of clusters

Importance of – Choice of proximities – Pre-processing of the data – Validation of the clustering – Interpretation of the clustering 33

Data pre-processing

34

Data pre-processing Exploratory analysis Definition of proximities (similarities and dissimilarities) Processing of numerical variables Processing of categorical variables Feature combination for mixed data (numerical + categorical features) Missing values 35

Exploratory analysis First step: gain insight about your data – Univariate distributions of all features – Correlations between features

0.00 0.00

Density

0.00

0.00 0

50000

100000 150000

36

Exploratory analysis Eliminate highly redundant (correlated) features = feature extraction

Re-code very rare modalities (or levels) of categorical variables Visualize the data

37

Exploratory analysis Perform a PCA (on correlation matrix) in order to visualize the data – For numerical features

Perform a multiple correspondence analysis – For categorical features

Perform a multidimensional scaling – If the data are in the form of a proximity matrix 38

Definition of proximities Some clustering algorithms perform directly on the data matrix (feature-based) Other clustering algorithms work on proximities: – A distance matrix or – A similarity matrix

39

Definition of proximities Many different dissimilarity measures have been defined Usually, we use distances which have the following properties

40

Definition of proximities We can also compute similarities Usually, we use similarity measures, which have the following properties

41

Definition of proximities There is a dual relation between – Dissimilarity measures and – Similarity measures

If dij is a distance between i and j, then sij = (dmax – dij) is a similarity measure

If s is a similarity measure, then dij2 = sii– 2 sij + sjj is a dissimilarity measure

Etc… 42

Processing of numerical variables: feature transformation Standardize the numerical features – To be independent of the unit measure – To be able to compare the variables

However, standardization can also have a negative effect 43

Processing of numerical variables: feature transformation Eventually, perform a Box-Cox transformation

44

Processing of numerical variables: feature transformation Transforms the numerical feature into – A new variable – That is « most Gaussian » in some sense

0

50000

100000 150000

190000 220000

250000

280000

45

Processing of numerical variables: feature transformation Transformation is chosen among the family

And λ is chosen in order to maximize – The likelihood of x(λ) computed on the data set – With the assumption that x(λ) is Gaussian 46

Processing of numerical variables: feature transformation Here is an example

50000

SSE

0

100000 150000

190000 220000

5e+11 4e+11 3e+11 2e+11 1e+11 0 -2.0-1.5 -1.0-0.5 .0

.5

250000

280000

1.0 1.5 2.0

Lambda

47

Processing of numerical variables: feature transformation The decision to transform the variable depends on – Its interpretation – Its distribution – The clustering algorithm you are using

48

Processing of numerical variables: feature reduction Feature reduction In some cases, it is useful to perform a feature reduction – Large number of numerical features – Features having a few integer values

Perform a PCA on the numerical features and keep a given number of factors 49

Processing of numerical variables: feature reduction The PCA projets the data into a subspace – That keeps as much variance as possible – Among all the possible linear subspace projections

Visualize the ‘scree plot’, which is a plot of the eigenvalues λi against i Or keep a pre-defined % of the total variance 50

Processing of numerical variables: feature reduction Here is an example of a scree plot – Each eigenvalue corresponds to a part of the variance corresponding to the factor

51

Processing of numerical variables: computation of proximities Here are some popular distances – The Euclidean distance

– The L1 distance

52

Processing of numerical variables: computation of proximities The Canberra metric

Gower’s metric

53

Processing of numerical variables: computation of proximities – Mahalanobis distance

where Σ is the variance-covariance matrix

54

Processing of numerical variables: computation of proximities Here are some popular similarity measures – The inner product or cosine similarity

55

Processing of numerical variables: computation of proximities – The Tanimoto similarity

56

Processing of categorical variables: feature reduction Categorical data are ‘dangerous’ (they can condition the clustering) – For instance we could obtain a clustering only representing the sex of the individuals – If the variable has many levels (modalities), it is coded by indicator variables – As a result, it has an important weight in the Euclidean distance 57

Processing of categorical variables: feature reduction Feature reduction In some cases, it is useful to perform a feature reduction – We want to transform the categorical features into numerical features

Perform a multiple correspondence analysis on the categorical features and keep a given number of factors 58

Processing of categorical variables: feature reduction This is an equivalent of principal components analysis for categorical features It provides a set of factors which account for as much chi-square distance as possible It provides numerical features 59

Processing of categorical variables: feature reduction As for PCA, we can compute a ‘scree plot’ Or keep a predefined % of variability

60

Processing of categorical variables: computation of proximities Feature can be – Binary – More than two values

61

Processing of categorical variables: computation of proximities Many different similarity measures are available We code the categorical features by indicator variables (vectors) – Having a 1 in position i when the individual has modality i for this feature, and 0 otherwise – For instance [0, 1, 0, 0]T if the individual i possesses the second modality of the feature

62

Processing of categorical variables: computation of proximities Suppose a, b, c, and d are defined by

– Counting the number of 1’s and 0’s in common

63

Processing of categorical variables: computation of proximities

64

Processing of categorical variables: computation of proximities Each of these similarities can be used in order – To compare two individuals – Based on the categorical variables

65

Processing of categorical variables: Feature combination for mixed data Feature combination for mixed data – The data contains numerical + categorical features

Three standard ways of dealing with mixed data: – Converting categorical data into numerical data – Using clustering techniques dealing with both types of features – Computing mixed proximities combining categorical and numerical data 66

Processing of categorical variables: Feature combination for mixed data Converting categorical data into numerical data: – By using correspondence analysis – If there is only one categorical variable, it performs optimal scaling – Keep only the first factors

67

Processing of categorical variables: Feature combination for mixed data Using clustering techniques dealing with both types of features – Some clustering techniques use mixtures of generative models

For instance – Gaussian for numerical variables – Poisson for integer variables – Multinomial for categorical variables 68

Processing of categorical variables: Feature combination for mixed data Thus, this model integrates mixed variables in a natural way This mixture model is estimated with E-M algorithm One example of such a model is AutoClass It can deal with different types of variables as well as missing values 69

Processing of categorical variables: Feature combination for mixed data Computing mixed proximities combining categorical and numerical data One such solution was proposed by Gower (1971) It allows to combine categorical and numerical variables in order to obtain a similarity or a distance 70

Processing of categorical variables: Feature combination for mixed data Define similarity

– Where sij is a similarity measure between individuals i and j – And where sijk is the similarity between individuals i and j based on feature k 71

Processing of categorical variables: Feature combination for mixed data For binary features – sijk = 1 if there is a positive match between i and j on binary feature k – sijk = 0 otherwise

For categorial features – sijk = 1 if individuals i and j fall in the same category – sijk = 0 otherwise 72

Processing of categorical variables: Feature combination for mixed data For numerical features

– One has to be careful with outliers, however

The weights are defined as

73

Processing of categorical variables: Feature combination for mixed data A meaningful distance can be defined as

Weights wk are provided, weighting each feature – So that you can give more importance to an important feature 74

Missing values If there are missing values, – It is not a good idea to perform imputation – Imputation introduces artificial similarities between individuals – It is better to ignore the missing variables

75

Missing values For instance, when computing the Euclidean distance

– The sum is performed on the non-missing features only – So that p will be the number of non-missing features

76

Some clustering algorithms

77

Some clustering algorithms The standard k-means Fuzzy clustering Mixture of distributions The kernel trick for clustering Kohonen self-organizing maps Hierarchical clustering Minimum spanning tree clustering Graph partitioning – Only basic principles will be explained !! 78

The standard k-means Maybe the most popular clustering algorithm Feature-based but can be adapted to distance-based Crisp membership, but there exists extensions to fuzzy memberships – Fuzzy clustering

The number of groups is provided in advance Iterative method

79

The standard k-means Basic principle : reallocation iterative method based on mobile prototypes (for instance, the ‘centers of gravity’) k-means method : k fixed = number of groups 1. Initialization Choose k initial prototypes (centers of gravity) among the individuals that have to be clustered

2. Reallocation Allocate each individual to the cluster from which it is closest

3. Recentering Recompute the prototypes of the clusters (for instance the center of gravity)

4. Repeat steps 2) and 3) until stabilisation

80

The standard k-means Here is an example initialization

x

x

x

x

x x

x x

c1

first iteration

c2

x • x

x

x

x

•x

x

x

x x

g1

second iteration…

g2

x

•

x

x x

x

x

x g1

x

• x

g2

81

The standard k-means Justification of the method – Inertia within group Gk (k: 1 → K): Ik =

2 d ∑ (x i ,g k )

i∈Gk

K

– Total inertia within groups: IW = ∑ Ik k=1 K 2 I = G d – Total inertia between groups: B ∑ k (g k ,g) k =1

€ with g = global center of gravity, and Gk = n(k) =

number of individuals in group Gk 82

The standard k-means Thus, Intra-cluster distances are minimized

Inter-cluster distances are maximized

83

The standard k-means n

One can show that

Itot = ∑ d 2 (x i,g) i =1

I tot =

IW + I B

Thus, we have g3 g1

g xi

g

g2

within + between 84

The standard k-means A good partition of the clusters ⇔ IW minimum (homogeneous, compact clusters) A good partition of the clusters ⇔ IB maximum (well-separated clusters) Since IW + IB = Itot – Decreasing IW ⇔ Increasing IB – The algorithm decreases IW at each iteration

85

The standard k-means However ! IW always decreases when K increases ⇒ Does not allow to find the ideal number of clusters !

But allows to compare two partitions in K clusters : – The best one ⇔ IW minimum (or IB maximum)

Notice that the problem of finding the best partition (having the lowest IW) is NP-complete – The k-means only finds a local minimum of IW – One has to restart the algorithm many times with different initial values and keep the best one

86

The standard k-means Properties of the k-means algorithm : – At each reallocation-recentering step 2) + 3), one can show that IW decreases – ⇒ stabilisation when IW does not decrease any more – The largest decreases occur in the first steps – ⇒ The partition converges relatively rapidly ( +/- 1020 iterations of 2) + 3))

Avantages of this method : – Quick and easy to implement – Works with large data sets 87

The standard k-means: convergence of the method The within-class inertia is defined as – xi is the vector of features for individual i

88

The standard k-means: convergence of the method The total within group inertia is

First step: – Compute the prototype g(k) such that IW is minimized, while maintaining the allocations of the individuals fixed:

89

The standard k-means: convergence of the method We easily obtain:

And thus

So that we find that the prototypes must be the centers of gravity of each group ! 90

The standard k-means: convergence of the method Second step: – Reallocate the individuals xi such that IW is minimized, while keeping the prototypes fixed – If pi(k) is the probability of affecting an individual xi to the cluster having index k (= allocation to the clusters), we have

– The optimal allocation rule is to affect each individual to the nearest prototype 91

The standard k-means: convergence of the method Therefore, step 2) and step 3) each decrease the within group inertia Since this inertia cannot be negative, the decrease must tend to zero And the procedure converges

92

The standard k-means Notice that, instead of using the Euclidean distance, one could use the L1 distance – This works better; the clustering is more stable – The clustering is less sensitive to outliers – Instead of computing the centroid, on has to compute the medoid – Less sensible to the ‘empty cluster’ problem

93

Importance of choosing initial centroids Consider the following data Iteration 6 3 2.5 2 1.5

y

1 0.5 0

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

x

94

Importance of choosing initial centroids Iteration 1

y

Iteration 2 3

3

2.5

2.5

2.5

2

2

2

1.5

1.5

1.5

y

1

y

1

1

0.5

0.5

0.5

0

0

0

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

-2

-1.5

-1

-0.5

x

0

0.5

1

1.5

2

-2

Iteration 5 3

2.5

2.5

2.5

2

2

2

1.5

1.5

1.5

y

1

y

1 0.5

0.5

0

0

0

-0.5

0

x

0.5

1

1.5

2

-2

-1.5

-1

-0.5

0

x

0

0.5

1

1.5

2

1

1.5

2

1

0.5

-1

-0.5

Iteration 6

3

-1.5

-1

x

3

-2

-1.5

x

Iteration 4

y

Iteration 3

3

0.5

1

1.5

2

-2

-1.5

-1

-0.5

0

0.5

x

95

Importance of choosing initial centroids Iteration 1

y

3

2.5

2.5

2

2

1.5

1.5

y

1

1

0.5

0.5

0

0

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

x

Iteration 3

y

Iteration 2

3

-2

-1.5

-1

-0.5

0

0.5

Iteration 4 3

3

2.5

2.5

2.5

2

2

2

1.5

1.5

1.5

y

y

1 0.5

0.5

0

0

0

-1.5

-1

-0.5

0

x

0.5

1

1.5

2

-2

-1.5

-1

-0.5

0

x

0.5

2

1

0.5

-2

1.5

Iteration 5

3

1

1

x

1

1.5

2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

x

96

Limitations of k-means: Differing sizes

Original Points

K-means (3 Clusters) 97

Limitations of k-means: Differing density

Original Points

K-means (3 Clusters)

98

Limitations of k-means: Non-globular shapes

Original Points

K-means (2 Clusters) 99

Limitations of k-means: Number of clusters Within-cluster inertia can be used for comparing clustering solutions – Unfortunately, it always decreases 10 9

6

8 4

7

2

6

0

E 5 S S 4

-2

3 2

-4

1 -6

0 5

10

15

2

5

10

15

20

25

30

K

100

Limitations of k-means: Number of clusters 1 2

6

4

3

5

7

101

The standard k-means Some drawbacks of this method : 1. We have to fix K in advance 2. The result depends on the initial centers of gravity => Does not provide the optimal solution (the partition in K groups for which IW is minimum) ⇒ Provides a local minimum depending on the initial centroids

3. The often is an ‘empty cluster’ problem => Some of the clusters become empty (outliers, etc) 102

The standard k-means Solutions → Still an active research area For the dependence on the initial centroid choice – Simple approach: repeat many times the clustering and choose the partition having IW minimum. – Try to find a consensus by repeating many times the clustering ("formes fortes"): 1. Retry the clustering many times with different initial centroids and different bootstrap samples 2. Try to find a high consensus (individuals that are always in the same group) 3. Retry a k-means with, as initial centroid, the ‘consensus groups’

103

The standard k-means For the value of K (number of clusters) that has to be provided ISODATA method (and other variants) Includes ‘merge group’ and ‘divide group’ operations in the ‘k-means’ algorithm – Fusion of 2 groups if their distance is low (i.e. if the distance between centroids < some treshold) – Division of a group into 2 sub-groups if the inertia of the group (Ik) is too high (> some treshold)

=> parameters (tresholds) to be fixed !!! 104

The standard k-means Use of different criteria that allow to compare various solutions having different number of clusters – For instance, the Condorcet criterion – A wide range of criterion has been proposed (see later)

There is however no universal solution to this problem – Always try to visualize the data – For instance by using a discriminant analysis

105

The standard k-means The k-means algorithm can easily be adapted in order to work on – A distance matrix – A similarity matrix

106

The standard k-means For each cluster Cl, we define a prototype – Which is an individual belonging to Cl – The index of this individual will be pl

The distance between an individual having index k and cluster Cl will be – The distance to the prototype pl

107

The standard k-means The within-group inertia of cluster l will be

The total within-group inertia is

108

The standard k-means Step 1: initialize the prototypes Step 2: reallocate the individuals, while maintaining the prototypes fixed – Each individual k is affected to the nearest cluster having index l

109

The standard k-means Step 3: Each prototype is recomputed – The individual having the smallest distance to the whole cluster is chosen as prototype, while maintaining the allocations to the clusters

Steps 2 and 3 minimize IW at each iteration 110

The standard k-means K-means can also be adapted in order to take the extension of the clusters into account – Compute a Mahalanobis distance for each cluster k

111

The standard k-means The variance-covariance matrices Wk have to be recomputed at each iteration – Just as the centers of gravity matrix W1

G1

matrix W2

G2

112

Bisecting k-means Bisecting K-means algorithm – Variant of K-means that can produce a partitional or a hierarchical clustering

113

Fuzzy k-means The memberships to the clusters, pi(k), are now fuzzy – They reflect the degree of membership to the clusters = Degree of membership of individual i to cluster k

The aim is to minimize the within-class inertia 114

Fuzzy k-means Subject to the constraints

That is, we fix a given entropy level 115

Fuzzy k-means We thus introduce the Lagrange function

We compute the membership values by taking the derivative in terms of the pi(k)

116

Fuzzy k-means This leads to And therefore Which gives

117

Fuzzy k-means For computing the prototypes, we use

Which provides

Therefore: 118

Fuzzy k-means We therefore iterate

119

Mixture of distributions Clustering based on a priori distributional assumptions Each cluster membership is represented as a probability distribution – It represents the probability that the feature vector x is generated in this cluster – It contains a set of parameters that have to be estimated – The probability distribution can be Poisson, Gaussian, Bernoulli, etc… 120

Mixture of distributions Feature-based but can be adapted to distance-based Probabilistic cluster memberships The number of groups must be provided in advance Iterative method Only a local maximum of the likelihood is computed 121

Mixture of distributions Here is an example of a Gaussian mixture of three Gaussians – The objective is to fit these Gaussians on the data set

122

Mixture of distributions We assume that there are K clusters – Each having a particular distribution – Which defines the probability of observing vector x in this cluster – Here is an example of a Gaussian density for cluster k

123

Mixture of distributions The mixture of distribution, for all classes, can be written

– Each distribution is weighted by the probability P(Ck) of belonging to the corresponding cluster 124

Mixture of distributions For estimating the parameters, , one relies on the likelihood of the data sample

– The objective is to maximize this likelihood – We assume that the samples have been generated independently 125

Mixture of distributions Here are some examples of mixtures – From Duda, Hart & Stork (2001)

126

Mixture of distributions The cluster membership for an individual having feature vector x is provided by the posterior probabilities

127

Mixture of distributions Maximizing the likelihood provides us a set of nonlinear equations – We differentiate the likelihood in terms of the estimated parameters

– And in the Gaussian mixture case: 128

Mixture of distributions Thus, we compute

And we easily obtain

129

Mixture of distributions The parameters are estimated by iterating the resulting equations We use a three-steps procedure – Just as we did for the k-means

Step 1: Initialize the parameters Step 2: Recompute the membership values Step 3: Reestimate the parameter values 130

Mixture of distributions For the Gaussian mixture case: 1. Initialize 2. Recompute the membership values – While maintaining the parameters fixed

131

Mixture of distributions 3. Reestimate the parameters – While maintaining the membership values fixed

132

Mixture of distributions Other distributions can be used as well – Poisson for count variables – Multinomial for categorical variables – Etc…

This basic model has been extended in many ways – Other optimization algorithms can be used (EM and extensions) – We can deal properly with missing values – Bayesian extensions have been proposed 133 (AutoClass)

The kernel trick for clustering Many clustering algorithms have been « kernelized » This allows to cluster objects based on a similarity matrix – Which usually is a kernel matrix K – Containing inner products [K]ij = kij = (xi)Txj

134

Kernel k-means We want to minimize the total within-class inertia

We introduce the « kernel trick », where X is the data matrix 135

Kernel k-means We easily obtain

136

Kernel k-means The k-means iteratively minimizes J by iteratively – Assigning cluster labels, li, to nodes – Recomputing the cluster prototypes γk

The cluster assignment that minimizes J is

137

Kernel k-means The cluster prototypes γk that minimize J are solutions of

– where ki is the ith column of K and nk is the number of nodes assigned to cluster Ck 138

Kernel k-means Since, in both sides of the equation, we have a linear combination of the columns of K, one solution is

139

Kernel k-means Thus, we simply iterate – For all observations i:

– For all clusters k:

140

Kohonen self-organizing map This clustering algorithm has been developped in the framework of artificial neural networks – It is a kind of iterative k-means – With spatial constraints that provide a graphical representation of the clusters

It is called self-organizing map (SOM) in the field 141

Kohonen self-organizing map Principle: Two-levels artificial neural – One input layer and one output layer – The output layer is organized as a two-dimensional "topologic map" – It reflects the "natural relationships" between the input data – The topologic map is structured by a so-called unsupervised learning algorithm – In such a way that similar feature vectors activate spatially close output units – Inspired by biological processes such as speech or visual perception 142

Kohonen self-organizing map Computes a nonlinear representation of the data On a two-dimensional map By creation a topological structure preserving the neighborhood carte topologique

•••

entrées

143

Kohonen self-organizing map A set of prototypes is defined – Corresponding to a grid of output units – Each prototype corresponds to a unit of the grid

144

Kohonen self-organizing map Here is a summary of the basic SOM algorithm If x is a feature vector and pi is a prototype corresponding to the position i on the grid The SOM algorithm proceeds in four steps

145

Kohonen self-organizing map Step 1: the prototypes are initialized

x1

x2

xp

146

Kohonen self-organizing map Step 2: We identify the « winning unit » p* such that x − p* = min x − pi i

€

– Where, usually the Euclidean distance is used x1

x2

This winning unit determines a neighbourhood V*

xp

147

Kohonen self-organizing map Step 3: Compute the neighbourhood V*(t) around the winning unit p*on the grid – Which decreases as time goes on – The iteration step is denoted by t p*

V*(t1)

V*(t2) 148

Kohonen self-organizing map Step 4: Adapt the prototypes pi belonging to the neigbourhood of p* on the grid pi (t + 1) = pi (t) + η(t)[ x − pi (t)]

€

∀pi ∈ V * (t)

– with 0 < η(t) < 1, which decreases progressively during the time (t) – The prototypes in the neighbourhood of x are moving towards x 149

Kohonen self-organizing map The units (prototype) on the grid are representative of some "cluster" The algorithm preserves proximities: prototypes which are close together in the feature space will be close together on the grid of the topological map This auto-organisation is due to the update rule which link the prototypes that are close together 150

Kohonen self-organizing map Here is an example – Without structure in the data – The IRIS dataset (3 groups)

151

Kohonen self-organizing map If there is no neighbourhood (only one unit/prototype is update) ⇒the algorithms is a simple clustering algorithm minimizing the within-class inertia (an iterative variant of k-means). ⇒In this case, the convergence can be proved for some family of decreasing rules of η(t)

Some conditions for convergence in the 2D case can be derived 152

Hierarchical clustering Two types: Bottom-up methods – Agglomerative hierarchical clustering

Top-down methods – Divisive methods

153

Hierarchical clustering Agglomerative methods merge clusters at each step

154

Hierarchical clustering Divisive methods are partitioning the data in two parts at each step .. .. .. . .. ..

.. . ..

... .

155

Hierarchical clustering In practice, most hierarchical clustering methods are agglomerative We must define an agglomeration rule Principle : one element = one individual to be clustered or one group of individuals already defining a cluster We use a four-steps procedure 156

Hierarchical clustering Step 1: initialty, N individuals to be clustered We have a distance matrix (or a similarity matrix) between the individuals 2 by 2, DNxN 0 D=

0

dii' 0

di' i

0

0

157

Hierarchical clustering Step 2: We find the two closest elements (dii' minimum) → merged into a new element

There are n – 2 elements remaining + one new element (a cluster)

158

Hierarchical clustering Step 3: We compute the distance (or the similarity) between the new element and all the other elements We adapt the distance matrix D, which has one element less than before the merge

159

Hierarchical clustering Step 4: We repeat steps 2) and 3) until the aggregation is complete: – Only one element containing all the individuals

160

Hierarchical clustering This means that we have to define a distance (or similarity) between clusters – A merge of two elements reduces the number of groups by one 2

1 3

5 4

Step:

1

2

3

4

5

161

Hierarchical clustering This is represented by a dendrogram – The y axis represents the cumulative distance between the two merged elements distance

P1

(9)

P2 (8) (7) (6) (1)

(3)

(4)

(2)

(5)

Initial elements = individuals to cluster

Leafs of the tree = groups

P3 P4 P5 Pk = partition of k classes 162 (nested sequence: Pn → P1)

Hierarchical clustering The dendrogram represents nested clusters 5

4

1

2 5

0.25 0.2

2

0.15

3

6 1

4

3

Nested Clusters

0.1 0.05 0

3

6

4

1

Dendrogram

2

5

163

Hierarchical clustering Aggregation criteria – Single linkage:

(1)

(2)

(i )

d(G1 ,G2 ) = min d( x ,x ), x ∈ Gi

– Complete linkage: d(G1 ,G 2 ) = max d( x(1),x (2) )

1 – Average linkage: d(G1 ,G 2 ) = G1 .G2

∑

d(x(1),x (2) )

x (1) ∈G1 x ( 2) ∈G2 164

Hierarchical clustering Ward’s criterion – Computes the increase in within-group inertia when merging two elements – Starts with a zero inertia (each individual is a group)

When Pk + 1 → Pk, after merging 2 elements of Pk + 1: – IB decreases and IW increases (Itot remains constant)

165

Hierarchical clustering ⇒ merge the 2 elements A and B of Pk + 1 such that: IW (Pk ) − IW (Pk +1 ) minimum IB (Pk ) − IB (Pk +1 ) minimum A .B 2 . d (gA ,g B ) A+B – If we use the Euclidean distance

Which is equivalent to

min

166

Hierarchical clustering 4

1 2

1

3

€ 4

1 2

5

d(2,4)

5

d(1,5)

3

€5

1 (d13 + d14 + d15 + d 23 + d 24 + d 25 ) 6

gA gB

gAB

€

IAB − (IA + IB )

167

Hierarchical clustering Ward’s method We now prove A.B 2 ΔIB = . dist (gA ,g B ) A+B

Which is very easy and quick to compute Ward’s criterion is very popular for hierarchical clustering 168

Hierarchical clustering Ward’s method We first prove a useful relationship about the inertia from a point z – g is the center of gravity of the cloud of feature vectors xi weighted by n(i) – nx is the number of different xi – Each xi is weighted by the number of individuals that it represents: n(i) 169

Hierarchical clustering Ward’s method

170

Hierarchical clustering Ward’s method The inertia with respect to a point z is equivalent to – The inertia with respect to the centroid g – Plus the inertia of g with respect to z, by weighting the term by the number of individuals it represents

171

Hierarchical clustering Ward’s method Thus we have:

Suppose two groups A and B are merged: 172

Hierarchical clustering Ward’s method The common centroid is

Let us compute the between-group inertia before (k + 1 groups) and after (k groups) the fusion

173

Hierarchical clustering Ward’s method After the fusion of A and B (k groups), we have a large group of (nA + nB) data having center of gravity gAB Before the fusion (k + 1 groups), we have

174

Hierarchical clustering Ward’s method Where we applied the decomposition of inertia with respect to a vector z – This vector is z = g in this case – There are only two groups (A and B), each having nA and nB observations – Notice that the center of gravity of the two groups is gAB and not g in this case !

We thus decompose the inertia as the sum of the inertia with respect to the centroid and the inertia with respect to z = g 175

Hierarchical clustering Ward’s method Thus we applied

In which we replace nx = 2; x1 = gA; x2 = gB; n(1) = nA; n(2) = nB; z = g; g = gAB (centroid of the merged data); N = nA + nB 176

Hierarchical clustering Ward’s method The difference in inertia is

177

Hierarchical clustering Ward’s method The single linkage method is sensible to the chaining effect

The most used criteria are – the Ward criterion – The average linkage aggregation rule 178

Hierarchical clustering If we want a partition, we cut the tree at some level – After the agregations at low levels (very small clusters) – Before the aggregation at higher level (large clusters) – It is often quite subjective

179

Hierarchical +k-means clustering K-means and hierarchical clustering could be combined in a useful way 1. First perform a k-means for vector quantization 2. Then, perform a hierarchical clustering 3. Decide about the cut level to obtain a partition 4. Re-compute a k-means on the partition defined by the hierarchical clustering •

In order to optimize the within-group inertia 180

Hierarchical +k-means clustering

1) Initial partition - k-means

.

3) 3 classes kept

2) Hierarchical clustering

4) k-means

181

Minimum spanning tree clustering Very simple clustering method Suppose we have a distance matrix D between individuals Step 1: Construct a graph – With individuals as nodes – Similarities as weights on the edges

182

Minimum spanning tree clustering Step 2: From this graph, construct the minimum spanning tree – Which corresponds to the tree • Spanning all the nodes • Having minimum sum of edges

– There are efficient algorithms for computing the minimum spanning tree

Step 3: Remove the edges – Which are ‘inconsistents’ – Having large distances compared to others 183

Minimum spanning tree clustering Example from Dubes & Jain (1988)

184

Minimum spanning tree clustering Simple exploratory technique Very fast to compute Interesting also for visualizing the data – Visualizing a spanning tree in 2D is easy

185

Graph partitioning Some clustering algorithms are applied to graphs Suppose we have a graph structure

186

Graph partitioning Suppose we have a similarity matrix S between individuals Construct a graph – With individuals as nodes – Similarities as weights on the edges

187

Graph partitioning The elements aij of the adjacency matrix A of a weighted, undirected, graph are defined as

where A is symmetric

The wij ≥ 0 represents a similarity or strength of relationship between node i and node j

188

Graph partitioning We define a similarity index between partition C1 and partition C2 as

Define h1 as an indicator vector containing 1 if the node belongs to C1 and 0 if it belongs to C2 Define h2 as the equivalent vector for C2 Define e as a column vector made of 1s 189

Graph partitioning We obtain

190

Graph partitioning And thus

Now,

Thus [D]ij = 0 for i ≠ j and 191

Graph partitioning We finally obtain

By relaxing the fact that h1 and h2 contain binary values, – This leads to the following optimization problem 192

Graph partitioning This leads to an eigenvector problem – Compute the smallest non-trivial (L is not of full rank) eigenvector of – which is called the Fiedler vector

193

Graph partitioning This approach is very popular for the moment ! For instance: find dense groups in a social network graph This is called ‘spectral clustering’

194

Graph partitioning: one example Visualization of a network of criminals

195

Conclusion A myriad of clustering techniques Which can be combined There are many other clustering techniques Still an active research area – For instance, consensus clustering and graph clustering are quite popular

However, very few techniques are implemented in standard statistical softwares 196

Cluster validity

197

Cluster validity Visualizing the clustering solution Association of the cluster solution with initial features Comparing clustering solutions Testing hypotheses Robustness estimation Consensus clustering About the number of clusters

198

Visualizing the clustering solution Always visually inspect the cluster solution!! For instance run a discriminant analysis – For viewing the data in a 2D or a 3D space – Which best discriminates between the clusters – As dependent variable, take the cluster labels – Eventually plot one cluster against the others

Also, perform a PCA and visualize the data in 2D or 3D 199

Visualizing the clustering solution Here is the result of a k-means visualized by a dicriminant analysis – On the IRIS dataset 18 17 16

14

1

Sepal length Sepal width Petal width

3

Petal length

13

2

Canonical2

15

12 11 10 9 -4

-2

0

2

4

6

8

10

Canonical1

200

Visualizing the clustering solution Order the similarity matrix with respect to cluster labels and inspect visually 1

1 0.9 0.8

s t n i o P

0.7 0.6

y

0.5 0.4 0.3 0.2 0.1 0

10

0.9

20

0.8

30

0.7

40

0.6

50

0.5

60

0.4

70

0.3

80

0.2

90

0.1

100 0

0.2

0.4

0.6

x

0.8

1

20

40

60

80

0 100Similarity

Points

201

Association of the cluster solution with initial features In order to be sure that some feature does not dominate the solution – Compute the association between the clustering solution and initial features – In order to avoid clusters that are completely dominated by only one unimportant variable – Eventually re-weight the features in the clustering 202

Comparing clustering solutions In order to validate clustering solutions We should be able to compare clustering solutions – Even if the number of clusters found is different!

Two standard techniques: – Huber’s Γ statistics – The RAND index R – Many others not detailled here

203

Comparing clustering solutions Huber’s Γ statistics Define two indicator variables – y1(i,j) and y2(i,j)

y1(i,j)=1 if individuals i and j are clustered together in clustering solution 1 – Otherwise y1(i,j)=0

y2(i,j)=1 if individuals i and j are clustered together in clustering solution 2 – Otherwise y2(i,j)=0

204

Comparing clustering solutions Huber’s Γ statistics can be written

– Where N(N – 1)/2 is the total number of terms (pairs) – If i and j are never clustered together, Γ = 0 – It is a kind of covariance value between the two clustering solutions – It can be used for comparing clusterings and proximities as well 205

Comparing clustering solutions However, it is better to use the normalized version – For which we standardize the variables y1(i,j) and y2(i,j)

– Then, –1 ≤ Γ’ ≤ 1 206

Comparing clustering solutions With

207

Comparing clustering solutions The RAND index R – It is a degree of agreement between two clustering solutions

Compute the number of pairs of individuals (i,j) that are – Clustered together in both clustering solutions – Not clustered together in both clustering solutions

Divide by the total number of pairs N(N – 1)/2

208

Comparing clustering solutions In other words, it computes the proportion of pairs for which both clusterings (1 and 2) agree – Either the pair belong to the same group in both clusterings – Either the pair does not belong to the same group in both clusterings

209

Comparing clustering solutions The RAND index can be computed by

Very useful for comparing clustering solutions Or comparing a clustering against a predefined solution 210

Comparing clustering solutions For each of these statistics t, there exists a corresponding ‘corrected’ statistics t which is a normalized version of the original one:

– The expectation is taken under the null hypothesis – The range is between 0 and 1 – The statistics is corrected with respect to matches occuring ‘by chance’ 211

Comparing clustering solutions Methods for comparing dendrograms are also available – They compare hierarchical clustering solutions

212

Testing hypotheses Several tests of hypothesis are available Suppose you measure some statistics t – That is related to the quality of the clustering – For instance, a measure of compactness of the clusters

213

Testing hypotheses You also want to test some null hypothesis H0 – For instance the individuals are distributed at random – So that all sets of location in some region of the feature space are equally likely – Which means that there is no structure in the data

Suppose also that the distribution of t is known under the null hypothesis H0 214

Testing hypotheses For a sample, you can – Compute t from that sample – test if the value of t is very unlikely or not – If it is very unlikely, you reject the hypothesis H0 (of no structure)

215

Testing hypotheses In ours case, tests of hypotheses can be performed by Monte-Carlo simulation techniques For instance, let us assume that the null hypothesis corresponds to – A uniform distribution of data – Without any structure

Assume we are interested in the ratio between between-cluster inertia on within-cluster inertia: Here is a simulation allowing to test this hypothesis

216

Testing hypotheses 1. 2. 3. 4. 5.

6.

Run a clustering algorithm on a data set, retrieve the cluster solution and compute IB/ IW Generate a new data set in the area of interest following a uniform distribution Run the same clustering algorithm on the new data set Compute the resulting IB/ IW Run 2) to 4) several times and draw a histogram representing the distribution of IB/ IW under the null hypothesis Does the value of IB/ IW computed on the original data set differ significantly from the 217 distribution of IB/ IW under the null hypothesis ?

Testing hypotheses There are many such tests – Which, for instance, assume Gaussian clusters

However, testing against the absence of structure is not always of interest – Indeed, even if there is no structure, clustering may be useful – Think about vector quantization 218

Robustness estimation One could be interested in testing the robustness of the solution ‘Split-sample’ methods are useful in this context – Randomly divide the data into two subsets – Perform an analysis separately – Compare the solutions

Here is a useful method in this context 219

Robustness estimation 1. 2. 3.

4. 5.

Divide the sample in two parts Perform independently a clustering on the two samples Classify the individuals of the other sample according to the classification rules (centroids) obtained on the first sample Compare the two solutions using, i.e. Huber’s statistics Run steps 1) to 4) many times and compute the average of Huber’s statistics 220

Robustness estimation If there really is a structure in the data – The algorithm should be able to retrieve this structure almost all the time

The same method is used in order to estimate the number of groups – Pick up the number of groups for which the solution is most stable 221

Consensus clustering The idea here is to find robust solutions by – Running many times the clustering – Finding a consensus between the different solutions – Thus seeking for a stable solution

222

Consensus clustering One could be interested in ‘consensus’ clustering 1. Take a bootstrap sample of the data set 2. Run a clustering algorithm with different seeds and label all the individuals 3. Run 1) and 2) many times 4. For each pair (i,j), compute the proportion of times the two individuals were grouped together 5. This provides a new similarity matrix 6. Find the labelling that was closest to this similarity matrix among the clusterings realized in 1) and 2) 223

About the number of clusters What is the natural number of clusters in the data? – No universal answer to this question – Usually, one has an approximate a priori idea

Three heuristic answers – Plot of the compactness criterion in terms of number of clusters • Plot the clustering criterion (between-cluster inertia) against number of clusters • A large drop is generally taken as suggestive of a particular number of groups

– Examine the dendrogram (hierachical clustering) – Perform a robustness analysis and pick up the most stable solution

224

About the number of clusters However, a number of other, more formal, techniques have been suggested – These are mainly statistics computed from the sum of squared distances (inertia) – Various clustering, with increasing number of groups, are performed, and this statistics is computed – The number for which the statistics is a minimum is retained – Some of these techniques are available in the 225 R software

About the number of clusters There is, however, no miraculous solution to this problem Always visualize the data – this may be a good way to have an idea of the quality of the solution

226

Cluster interpretation

227

Cluster interpretation Visual representation Cluster profiling Minimum spanning tree representation Discriminant analysis

228

Visual representation Ones we have obtained clusters, We have to provide an interpretation and a representation of these clusters – In order to gain insights about the clustering – In order to communicate the results – Profiling and visualization of the clusters

Time-consuming but very fulfilling work 229

Visual representation Perform a discriminant analysis – And visualize the clusters – By coloring them 18 17 16

14

1

Sepal length Sepal width Petal width

3

Petal length

13

2

Canonical2

15

12 11 10 9 -4

-2

0

2

4

Canonical1

6

8

10

230

Visual representation Eventually, if there are many clusters – Eliminate individuals that are far from the centroid – Keep only the individuals that are in the kernel of the clusters – Then perform a discriminant analysis – Usually, you obtain a clear representation 231

Visual representation Eventually plot the cluster density on a Belgian map – Urban clusters will appear

232

Cluster profiling Focus on one important variable For each cluster, compute – The odds of this important variables against the population mean – For instance mean revenue within the cluster divided by the mean revenue on the whole population

Create a bar plot representing odds in function of cluster (x axis), for this important variable 233

Cluster profiling Some MOSAIC examples

234

Cluster profiling Focus on one cluster For each important variable, compute – The odds of this important variable against the population mean

Create a bar plot representing odds in function of important variables (x axis), for this cluster 235

Cluster profiling Example MOSAIC – ‘Classe privilégiée’

236

Cluster profiling Example MOSAIC – ‘Classe defavorisée urbaine’

237

Cluster3

Cluster1 Petal width

Petal width

Sepal width

Sepal length

Petal width

Petal length

Sepal width

Sepal length

Petal length

Cluster2

Petal length

Sepal width

Sepal length

Petal width

Petal length

Sepal width

Sepal length

Cluster profiling Example of profiling (IRIS dataset)

Cluster Means

238

Minimum spanning tree representation Compute the distances between the centroids of the clusters Determine the minimum spanning tree Plot the minimum spanning tree

239

Minimum spanning tree representation An example from the MOSAIC segmentation

240

Discriminant analysis In order to give an interpretation to the cluster – Perform a discriminant analysis – Or a decision tree – Of one cluster against the others – Detect the most relevant features • Allowing to contrast/characterize this cluster against the others 241

Discriminant analysis Here is an example of a decision tree applied to a k-means clustering of the IRIS data All Rows Count 150

G^2 Level 328.75054 1 2 3

Petal length<3.0 Count 50

G^2 Level 9.8039113 1 2 3

Prob 0.3667 0.3067 0.3267

Petal length>=3.0 Prob 0.0000 0.0200 0.9800

Count 100

G^2 Level 137.62776 1 2 3

Sepal length<6.3 Count 49

G^2 Level 43.614415 1 2 3

Prob 0.5500 0.4500 0.0000

Sepal length>=6.3 Prob 0.1633 0.8367 0.0000

Count 51

G^2 Level 28.041985 1 2 3

Prob 0.9216 0.0784 0.0000

242

Softwares and further readings

243

Softwares and further readings Softwares – The R software – S-Plus software – SAS

244

Softwares and further readings Some multivariate data analysis books

245

Softwares and further readings Some clustering books

246

Softwares and further readings Some data analysis books (French school)

247

Table of contents

2

Table of contents General introduction Data pre-processing Some clustering algorithms Cluster validity Cluster interpretation Clustering in practice Softwares, further readings 3

Slides references Many slides and figures have been adapted from the slides associated to the following books: – Alpaydin (2004), Introduction to machine learning. The MIT Press. – Duda, Hart & Stork (2000), Pattern classification, John Wiley & Sons. – Han & Kamber (2006), Data mining: concepts and techniques, 2nd ed. Morgan Kaufmann. – Tan, Steinbach & Kumar (2006), Introduction to data mining. Addison Wesley.

As well as from Wikipedia, the free online encyclopedia I really thank these authors for their contribution to machine learning and data mining teaching

4

General introduction

5

General introduction What is clustering ? Clustering versus vector quantization Research fields where clustering is studied An ill-posed problem ? Hierarchical versus fixed number of clusters Notions of proximity Proximity-based versus feature-based clustering Joint clustering of individuals and features Importance of validation Importance of interpretation Some notations

6

What is clustering ? Also called – Typology analysis (biology) – Segmentation (marketing) – Unsupervised learning (AI) – Automatic classification (data analysis)

Here are some illustrations from – Tan, Steinbach & Kumar, Introduction to data mining. Addison Wesley. 7

What is clustering, in brief ? Given a set of data points, each having a set of features, and a similarity measure among them, find clusters such that – Data points in one cluster are more similar to one another – Data points in separate clusters are less similar to one another

Similarity measures: – Euclidean distance if attributes are continuous – Other Problem-specific Measures

8

Application 1: market segmentation Goal: subdivide a market into distinct subsets of customers – where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix

Approach: – Collect different attributes/features of customers based on their geographical and lifestyle related information – Find clusters of similar customers – Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from 9 different clusters

Application 2: document clustering Goal: To find groups of documents that are similar to each other based on the important terms appearing in them Approach: – To identify frequently occurring terms in each document – Form a similarity measure based on the frequencies of different terms – Use it to cluster 10

What is clustering ? Objective: On the basis of – A similarity matrix D between individuals, or – A data matrix X describing the individuals through • Features or variables

Group – The individuals into a set of natural disjoint clusters – With high homogeneity within the clusters – And high heterogeneity between the clusters 11

What is clustering ? When using features, one speaks about – Feature-based or data-based clustering

When using similarities, one speaks about – Similarity-based clustering

12

What is clustering ? This means that we have to define – The features on which we will base our clustering – The concept of similarity – The concept of homogeneity

As well as – How to use these concepts in order to group the individuals = clustering algorithms 13

What is clustering ? Here is an ideal case… ****** ****** *

* *** ** ****** * *

Intracluster Intraclusterdistances distances are minimized are minimized Intercluster Interclusterdistances distances are aremaximized maximized

With well-separated clusters Clusters memberships may be crisp or fuzzy

14

Clustering versus vector quantization Clustering is related to the field of vector quantization (VQ) Objective: To partition a set of individuals In order to code these individuals In some optimal way – VQ techniques are related to clustering – But we do not necessarily want to identify wellseparated clusters 15

Clustering versus vector quantization Example of vector quantization

16

Research field where clustering is studied Clustering is studied in a vide range of fields: – Pattern recognition – Knowledge discovery in databases and data mining – Machine learning – Applied statistics 17

Research field where clustering is studied It is still a very active research field Some recent application areas – Bioinformatics – Computational linguistics and natural language processing

18

An ill-posed problem ? In fact,clustering is often an ill-posed problem Data containing well-separated groups are quite rare There is a problem for defining the « natural » number of clusters

19

An ill-posed problem ? Consider the following data:

20

An ill-posed problem ? Consider the following data:

How many clusters?

Six Clusters

Two Clusters

Four Clusters 21

Hierarchical versus fixed number of clusters Two types of clustering algorithms Hierarchical algorithms – Provide a whole family of solutions – With varying number of clusters – Visual drawing: dendrogram

« Fixed number of clusters » algorithms – The number of clusters is provided in advance

Both techniques can be combined

22

Notions of proximity The notion of proximity is central to clustering – Dissimilarities – Similarities

It will condition the shape of the clusters you will find – Euclidean distance tends to find spherical clusters 23

Notions of proximity Proximities can be computed from – Features measured on individuals – Links betweenindividuals, without any notion of feature – Direct comparisons between individuals – Etc…

24

Notions of proximity When using features measured on individuals Notions of proximity will be different for – Categorical variables – Numerical variables – For mixed features, one has to merge the proximities – In order to obtain a global proximity score between the individuals 25

Notions of proximity When using directly proximities between individuals – One has to use clustering algorithms based on proximity matrices – Not always readily available in computer softwares (except hierarchical clustering)

26

Proximity-based versus feature-based clustering Most feature-based clustering algorithms work on numerical variables only Three alternatives: – Transform categorical features into numerical features and use a standard numerical-feature based clustering – Compute similarities and then use a clustering based on the similarity matrix – Use a clustering algorithm that handles both types of variables 27

Joint clustering of individuals and features One can also try to jointly cluster both – Features and – Individuals

Into mutually disjoint clusters of individuals and features – Depends on the data

28

Importance of validation Validation is quite – Important and – Time-consuming

Some questions that should be adressed: – Are there really clusters? – However, even if there are no really wellseparated clusters, is the clustering still useful? – Are these clusters robust? – How do we compare two clusterings? 29

Importance of interpretation Ones we have obtained clusters, We have to provide an interpretation of these clusters – In order to gain insights about the clustering – In order to communicate the results – Profiling and visualization of the clusters

Time-consuming but very fulfilling work 30

Some notations The data matrix will be X – Each row represents an individual – There are N (sometimes denotes as n) individuals in total – Each column represents a feature measured on the individual

An element i, j of X will be denoted as xij The column vector of p features Xj measured on individual i (realizations) will be xi The random vector containing the realizations of the random variables Xj (features) will be x Cluster i will be denoted by its label, Gi or Ci (ωi in31 supervised classification)

Some notations The distance matrix between individual i and individual j will be denoted as D The similarity matrix between individual i and individual j will be denoted as S

32

Conclusion Clustering – With fuzzy vs crisp membership – Feature-based vs similarity-based – Hierarchical vs fixed number of clusters

Importance of – Choice of proximities – Pre-processing of the data – Validation of the clustering – Interpretation of the clustering 33

Data pre-processing

34

Data pre-processing Exploratory analysis Definition of proximities (similarities and dissimilarities) Processing of numerical variables Processing of categorical variables Feature combination for mixed data (numerical + categorical features) Missing values 35

Exploratory analysis First step: gain insight about your data – Univariate distributions of all features – Correlations between features

0.00 0.00

Density

0.00

0.00 0

50000

100000 150000

36

Exploratory analysis Eliminate highly redundant (correlated) features = feature extraction

Re-code very rare modalities (or levels) of categorical variables Visualize the data

37

Exploratory analysis Perform a PCA (on correlation matrix) in order to visualize the data – For numerical features

Perform a multiple correspondence analysis – For categorical features

Perform a multidimensional scaling – If the data are in the form of a proximity matrix 38

Definition of proximities Some clustering algorithms perform directly on the data matrix (feature-based) Other clustering algorithms work on proximities: – A distance matrix or – A similarity matrix

39

Definition of proximities Many different dissimilarity measures have been defined Usually, we use distances which have the following properties

40

Definition of proximities We can also compute similarities Usually, we use similarity measures, which have the following properties

41

Definition of proximities There is a dual relation between – Dissimilarity measures and – Similarity measures

If dij is a distance between i and j, then sij = (dmax – dij) is a similarity measure

If s is a similarity measure, then dij2 = sii– 2 sij + sjj is a dissimilarity measure

Etc… 42

Processing of numerical variables: feature transformation Standardize the numerical features – To be independent of the unit measure – To be able to compare the variables

However, standardization can also have a negative effect 43

Processing of numerical variables: feature transformation Eventually, perform a Box-Cox transformation

44

Processing of numerical variables: feature transformation Transforms the numerical feature into – A new variable – That is « most Gaussian » in some sense

0

50000

100000 150000

190000 220000

250000

280000

45

Processing of numerical variables: feature transformation Transformation is chosen among the family

And λ is chosen in order to maximize – The likelihood of x(λ) computed on the data set – With the assumption that x(λ) is Gaussian 46

Processing of numerical variables: feature transformation Here is an example

50000

SSE

0

100000 150000

190000 220000

5e+11 4e+11 3e+11 2e+11 1e+11 0 -2.0-1.5 -1.0-0.5 .0

.5

250000

280000

1.0 1.5 2.0

Lambda

47

Processing of numerical variables: feature transformation The decision to transform the variable depends on – Its interpretation – Its distribution – The clustering algorithm you are using

48

Processing of numerical variables: feature reduction Feature reduction In some cases, it is useful to perform a feature reduction – Large number of numerical features – Features having a few integer values

Perform a PCA on the numerical features and keep a given number of factors 49

Processing of numerical variables: feature reduction The PCA projets the data into a subspace – That keeps as much variance as possible – Among all the possible linear subspace projections

Visualize the ‘scree plot’, which is a plot of the eigenvalues λi against i Or keep a pre-defined % of the total variance 50

Processing of numerical variables: feature reduction Here is an example of a scree plot – Each eigenvalue corresponds to a part of the variance corresponding to the factor

51

Processing of numerical variables: computation of proximities Here are some popular distances – The Euclidean distance

– The L1 distance

52

Processing of numerical variables: computation of proximities The Canberra metric

Gower’s metric

53

Processing of numerical variables: computation of proximities – Mahalanobis distance

where Σ is the variance-covariance matrix

54

Processing of numerical variables: computation of proximities Here are some popular similarity measures – The inner product or cosine similarity

55

Processing of numerical variables: computation of proximities – The Tanimoto similarity

56

Processing of categorical variables: feature reduction Categorical data are ‘dangerous’ (they can condition the clustering) – For instance we could obtain a clustering only representing the sex of the individuals – If the variable has many levels (modalities), it is coded by indicator variables – As a result, it has an important weight in the Euclidean distance 57

Processing of categorical variables: feature reduction Feature reduction In some cases, it is useful to perform a feature reduction – We want to transform the categorical features into numerical features

Perform a multiple correspondence analysis on the categorical features and keep a given number of factors 58

Processing of categorical variables: feature reduction This is an equivalent of principal components analysis for categorical features It provides a set of factors which account for as much chi-square distance as possible It provides numerical features 59

Processing of categorical variables: feature reduction As for PCA, we can compute a ‘scree plot’ Or keep a predefined % of variability

60

Processing of categorical variables: computation of proximities Feature can be – Binary – More than two values

61

Processing of categorical variables: computation of proximities Many different similarity measures are available We code the categorical features by indicator variables (vectors) – Having a 1 in position i when the individual has modality i for this feature, and 0 otherwise – For instance [0, 1, 0, 0]T if the individual i possesses the second modality of the feature

62

Processing of categorical variables: computation of proximities Suppose a, b, c, and d are defined by

– Counting the number of 1’s and 0’s in common

63

Processing of categorical variables: computation of proximities

64

Processing of categorical variables: computation of proximities Each of these similarities can be used in order – To compare two individuals – Based on the categorical variables

65

Processing of categorical variables: Feature combination for mixed data Feature combination for mixed data – The data contains numerical + categorical features

Three standard ways of dealing with mixed data: – Converting categorical data into numerical data – Using clustering techniques dealing with both types of features – Computing mixed proximities combining categorical and numerical data 66

Processing of categorical variables: Feature combination for mixed data Converting categorical data into numerical data: – By using correspondence analysis – If there is only one categorical variable, it performs optimal scaling – Keep only the first factors

67

Processing of categorical variables: Feature combination for mixed data Using clustering techniques dealing with both types of features – Some clustering techniques use mixtures of generative models

For instance – Gaussian for numerical variables – Poisson for integer variables – Multinomial for categorical variables 68

Processing of categorical variables: Feature combination for mixed data Thus, this model integrates mixed variables in a natural way This mixture model is estimated with E-M algorithm One example of such a model is AutoClass It can deal with different types of variables as well as missing values 69

Processing of categorical variables: Feature combination for mixed data Computing mixed proximities combining categorical and numerical data One such solution was proposed by Gower (1971) It allows to combine categorical and numerical variables in order to obtain a similarity or a distance 70

Processing of categorical variables: Feature combination for mixed data Define similarity

– Where sij is a similarity measure between individuals i and j – And where sijk is the similarity between individuals i and j based on feature k 71

Processing of categorical variables: Feature combination for mixed data For binary features – sijk = 1 if there is a positive match between i and j on binary feature k – sijk = 0 otherwise

For categorial features – sijk = 1 if individuals i and j fall in the same category – sijk = 0 otherwise 72

Processing of categorical variables: Feature combination for mixed data For numerical features

– One has to be careful with outliers, however

The weights are defined as

73

Processing of categorical variables: Feature combination for mixed data A meaningful distance can be defined as

Weights wk are provided, weighting each feature – So that you can give more importance to an important feature 74

Missing values If there are missing values, – It is not a good idea to perform imputation – Imputation introduces artificial similarities between individuals – It is better to ignore the missing variables

75

Missing values For instance, when computing the Euclidean distance

– The sum is performed on the non-missing features only – So that p will be the number of non-missing features

76

Some clustering algorithms

77

Some clustering algorithms The standard k-means Fuzzy clustering Mixture of distributions The kernel trick for clustering Kohonen self-organizing maps Hierarchical clustering Minimum spanning tree clustering Graph partitioning – Only basic principles will be explained !! 78

The standard k-means Maybe the most popular clustering algorithm Feature-based but can be adapted to distance-based Crisp membership, but there exists extensions to fuzzy memberships – Fuzzy clustering

The number of groups is provided in advance Iterative method

79

The standard k-means Basic principle : reallocation iterative method based on mobile prototypes (for instance, the ‘centers of gravity’) k-means method : k fixed = number of groups 1. Initialization Choose k initial prototypes (centers of gravity) among the individuals that have to be clustered

2. Reallocation Allocate each individual to the cluster from which it is closest

3. Recentering Recompute the prototypes of the clusters (for instance the center of gravity)

4. Repeat steps 2) and 3) until stabilisation

80

The standard k-means Here is an example initialization

x

x

x

x

x x

x x

c1

first iteration

c2

x • x

x

x

x

•x

x

x

x x

g1

second iteration…

g2

x

•

x

x x

x

x

x g1

x

• x

g2

81

The standard k-means Justification of the method – Inertia within group Gk (k: 1 → K): Ik =

2 d ∑ (x i ,g k )

i∈Gk

K

– Total inertia within groups: IW = ∑ Ik k=1 K 2 I = G d – Total inertia between groups: B ∑ k (g k ,g) k =1

€ with g = global center of gravity, and Gk = n(k) =

number of individuals in group Gk 82

The standard k-means Thus, Intra-cluster distances are minimized

Inter-cluster distances are maximized

83

The standard k-means n

One can show that

Itot = ∑ d 2 (x i,g) i =1

I tot =

IW + I B

Thus, we have g3 g1

g xi

g

g2

within + between 84

The standard k-means A good partition of the clusters ⇔ IW minimum (homogeneous, compact clusters) A good partition of the clusters ⇔ IB maximum (well-separated clusters) Since IW + IB = Itot – Decreasing IW ⇔ Increasing IB – The algorithm decreases IW at each iteration

85

The standard k-means However ! IW always decreases when K increases ⇒ Does not allow to find the ideal number of clusters !

But allows to compare two partitions in K clusters : – The best one ⇔ IW minimum (or IB maximum)

Notice that the problem of finding the best partition (having the lowest IW) is NP-complete – The k-means only finds a local minimum of IW – One has to restart the algorithm many times with different initial values and keep the best one

86

The standard k-means Properties of the k-means algorithm : – At each reallocation-recentering step 2) + 3), one can show that IW decreases – ⇒ stabilisation when IW does not decrease any more – The largest decreases occur in the first steps – ⇒ The partition converges relatively rapidly ( +/- 1020 iterations of 2) + 3))

Avantages of this method : – Quick and easy to implement – Works with large data sets 87

The standard k-means: convergence of the method The within-class inertia is defined as – xi is the vector of features for individual i

88

The standard k-means: convergence of the method The total within group inertia is

First step: – Compute the prototype g(k) such that IW is minimized, while maintaining the allocations of the individuals fixed:

89

The standard k-means: convergence of the method We easily obtain:

And thus

So that we find that the prototypes must be the centers of gravity of each group ! 90

The standard k-means: convergence of the method Second step: – Reallocate the individuals xi such that IW is minimized, while keeping the prototypes fixed – If pi(k) is the probability of affecting an individual xi to the cluster having index k (= allocation to the clusters), we have

– The optimal allocation rule is to affect each individual to the nearest prototype 91

The standard k-means: convergence of the method Therefore, step 2) and step 3) each decrease the within group inertia Since this inertia cannot be negative, the decrease must tend to zero And the procedure converges

92

The standard k-means Notice that, instead of using the Euclidean distance, one could use the L1 distance – This works better; the clustering is more stable – The clustering is less sensitive to outliers – Instead of computing the centroid, on has to compute the medoid – Less sensible to the ‘empty cluster’ problem

93

Importance of choosing initial centroids Consider the following data Iteration 6 3 2.5 2 1.5

y

1 0.5 0

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

x

94

Importance of choosing initial centroids Iteration 1

y

Iteration 2 3

3

2.5

2.5

2.5

2

2

2

1.5

1.5

1.5

y

1

y

1

1

0.5

0.5

0.5

0

0

0

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

-2

-1.5

-1

-0.5

x

0

0.5

1

1.5

2

-2

Iteration 5 3

2.5

2.5

2.5

2

2

2

1.5

1.5

1.5

y

1

y

1 0.5

0.5

0

0

0

-0.5

0

x

0.5

1

1.5

2

-2

-1.5

-1

-0.5

0

x

0

0.5

1

1.5

2

1

1.5

2

1

0.5

-1

-0.5

Iteration 6

3

-1.5

-1

x

3

-2

-1.5

x

Iteration 4

y

Iteration 3

3

0.5

1

1.5

2

-2

-1.5

-1

-0.5

0

0.5

x

95

Importance of choosing initial centroids Iteration 1

y

3

2.5

2.5

2

2

1.5

1.5

y

1

1

0.5

0.5

0

0

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

x

Iteration 3

y

Iteration 2

3

-2

-1.5

-1

-0.5

0

0.5

Iteration 4 3

3

2.5

2.5

2.5

2

2

2

1.5

1.5

1.5

y

y

1 0.5

0.5

0

0

0

-1.5

-1

-0.5

0

x

0.5

1

1.5

2

-2

-1.5

-1

-0.5

0

x

0.5

2

1

0.5

-2

1.5

Iteration 5

3

1

1

x

1

1.5

2

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

x

96

Limitations of k-means: Differing sizes

Original Points

K-means (3 Clusters) 97

Limitations of k-means: Differing density

Original Points

K-means (3 Clusters)

98

Limitations of k-means: Non-globular shapes

Original Points

K-means (2 Clusters) 99

Limitations of k-means: Number of clusters Within-cluster inertia can be used for comparing clustering solutions – Unfortunately, it always decreases 10 9

6

8 4

7

2

6

0

E 5 S S 4

-2

3 2

-4

1 -6

0 5

10

15

2

5

10

15

20

25

30

K

100

Limitations of k-means: Number of clusters 1 2

6

4

3

5

7

101

The standard k-means Some drawbacks of this method : 1. We have to fix K in advance 2. The result depends on the initial centers of gravity => Does not provide the optimal solution (the partition in K groups for which IW is minimum) ⇒ Provides a local minimum depending on the initial centroids

3. The often is an ‘empty cluster’ problem => Some of the clusters become empty (outliers, etc) 102

The standard k-means Solutions → Still an active research area For the dependence on the initial centroid choice – Simple approach: repeat many times the clustering and choose the partition having IW minimum. – Try to find a consensus by repeating many times the clustering ("formes fortes"): 1. Retry the clustering many times with different initial centroids and different bootstrap samples 2. Try to find a high consensus (individuals that are always in the same group) 3. Retry a k-means with, as initial centroid, the ‘consensus groups’

103

The standard k-means For the value of K (number of clusters) that has to be provided ISODATA method (and other variants) Includes ‘merge group’ and ‘divide group’ operations in the ‘k-means’ algorithm – Fusion of 2 groups if their distance is low (i.e. if the distance between centroids < some treshold) – Division of a group into 2 sub-groups if the inertia of the group (Ik) is too high (> some treshold)

=> parameters (tresholds) to be fixed !!! 104

The standard k-means Use of different criteria that allow to compare various solutions having different number of clusters – For instance, the Condorcet criterion – A wide range of criterion has been proposed (see later)

There is however no universal solution to this problem – Always try to visualize the data – For instance by using a discriminant analysis

105

The standard k-means The k-means algorithm can easily be adapted in order to work on – A distance matrix – A similarity matrix

106

The standard k-means For each cluster Cl, we define a prototype – Which is an individual belonging to Cl – The index of this individual will be pl

The distance between an individual having index k and cluster Cl will be – The distance to the prototype pl

107

The standard k-means The within-group inertia of cluster l will be

The total within-group inertia is

108

The standard k-means Step 1: initialize the prototypes Step 2: reallocate the individuals, while maintaining the prototypes fixed – Each individual k is affected to the nearest cluster having index l

109

The standard k-means Step 3: Each prototype is recomputed – The individual having the smallest distance to the whole cluster is chosen as prototype, while maintaining the allocations to the clusters

Steps 2 and 3 minimize IW at each iteration 110

The standard k-means K-means can also be adapted in order to take the extension of the clusters into account – Compute a Mahalanobis distance for each cluster k

111

The standard k-means The variance-covariance matrices Wk have to be recomputed at each iteration – Just as the centers of gravity matrix W1

G1

matrix W2

G2

112

Bisecting k-means Bisecting K-means algorithm – Variant of K-means that can produce a partitional or a hierarchical clustering

113

Fuzzy k-means The memberships to the clusters, pi(k), are now fuzzy – They reflect the degree of membership to the clusters = Degree of membership of individual i to cluster k

The aim is to minimize the within-class inertia 114

Fuzzy k-means Subject to the constraints

That is, we fix a given entropy level 115

Fuzzy k-means We thus introduce the Lagrange function

We compute the membership values by taking the derivative in terms of the pi(k)

116

Fuzzy k-means This leads to And therefore Which gives

117

Fuzzy k-means For computing the prototypes, we use

Which provides

Therefore: 118

Fuzzy k-means We therefore iterate

119

Mixture of distributions Clustering based on a priori distributional assumptions Each cluster membership is represented as a probability distribution – It represents the probability that the feature vector x is generated in this cluster – It contains a set of parameters that have to be estimated – The probability distribution can be Poisson, Gaussian, Bernoulli, etc… 120

Mixture of distributions Feature-based but can be adapted to distance-based Probabilistic cluster memberships The number of groups must be provided in advance Iterative method Only a local maximum of the likelihood is computed 121

Mixture of distributions Here is an example of a Gaussian mixture of three Gaussians – The objective is to fit these Gaussians on the data set

122

Mixture of distributions We assume that there are K clusters – Each having a particular distribution – Which defines the probability of observing vector x in this cluster – Here is an example of a Gaussian density for cluster k

123

Mixture of distributions The mixture of distribution, for all classes, can be written

– Each distribution is weighted by the probability P(Ck) of belonging to the corresponding cluster 124

Mixture of distributions For estimating the parameters, , one relies on the likelihood of the data sample

– The objective is to maximize this likelihood – We assume that the samples have been generated independently 125

Mixture of distributions Here are some examples of mixtures – From Duda, Hart & Stork (2001)

126

Mixture of distributions The cluster membership for an individual having feature vector x is provided by the posterior probabilities

127

Mixture of distributions Maximizing the likelihood provides us a set of nonlinear equations – We differentiate the likelihood in terms of the estimated parameters

– And in the Gaussian mixture case: 128

Mixture of distributions Thus, we compute

And we easily obtain

129

Mixture of distributions The parameters are estimated by iterating the resulting equations We use a three-steps procedure – Just as we did for the k-means

Step 1: Initialize the parameters Step 2: Recompute the membership values Step 3: Reestimate the parameter values 130

Mixture of distributions For the Gaussian mixture case: 1. Initialize 2. Recompute the membership values – While maintaining the parameters fixed

131

Mixture of distributions 3. Reestimate the parameters – While maintaining the membership values fixed

132

Mixture of distributions Other distributions can be used as well – Poisson for count variables – Multinomial for categorical variables – Etc…

This basic model has been extended in many ways – Other optimization algorithms can be used (EM and extensions) – We can deal properly with missing values – Bayesian extensions have been proposed 133 (AutoClass)

The kernel trick for clustering Many clustering algorithms have been « kernelized » This allows to cluster objects based on a similarity matrix – Which usually is a kernel matrix K – Containing inner products [K]ij = kij = (xi)Txj

134

Kernel k-means We want to minimize the total within-class inertia

We introduce the « kernel trick », where X is the data matrix 135

Kernel k-means We easily obtain

136

Kernel k-means The k-means iteratively minimizes J by iteratively – Assigning cluster labels, li, to nodes – Recomputing the cluster prototypes γk

The cluster assignment that minimizes J is

137

Kernel k-means The cluster prototypes γk that minimize J are solutions of

– where ki is the ith column of K and nk is the number of nodes assigned to cluster Ck 138

Kernel k-means Since, in both sides of the equation, we have a linear combination of the columns of K, one solution is

139

Kernel k-means Thus, we simply iterate – For all observations i:

– For all clusters k:

140

Kohonen self-organizing map This clustering algorithm has been developped in the framework of artificial neural networks – It is a kind of iterative k-means – With spatial constraints that provide a graphical representation of the clusters

It is called self-organizing map (SOM) in the field 141

Kohonen self-organizing map Principle: Two-levels artificial neural – One input layer and one output layer – The output layer is organized as a two-dimensional "topologic map" – It reflects the "natural relationships" between the input data – The topologic map is structured by a so-called unsupervised learning algorithm – In such a way that similar feature vectors activate spatially close output units – Inspired by biological processes such as speech or visual perception 142

Kohonen self-organizing map Computes a nonlinear representation of the data On a two-dimensional map By creation a topological structure preserving the neighborhood carte topologique

•••

entrées

143

Kohonen self-organizing map A set of prototypes is defined – Corresponding to a grid of output units – Each prototype corresponds to a unit of the grid

144

Kohonen self-organizing map Here is a summary of the basic SOM algorithm If x is a feature vector and pi is a prototype corresponding to the position i on the grid The SOM algorithm proceeds in four steps

145

Kohonen self-organizing map Step 1: the prototypes are initialized

x1

x2

xp

146

Kohonen self-organizing map Step 2: We identify the « winning unit » p* such that x − p* = min x − pi i

€

– Where, usually the Euclidean distance is used x1

x2

This winning unit determines a neighbourhood V*

xp

147

Kohonen self-organizing map Step 3: Compute the neighbourhood V*(t) around the winning unit p*on the grid – Which decreases as time goes on – The iteration step is denoted by t p*

V*(t1)

V*(t2) 148

Kohonen self-organizing map Step 4: Adapt the prototypes pi belonging to the neigbourhood of p* on the grid pi (t + 1) = pi (t) + η(t)[ x − pi (t)]

€

∀pi ∈ V * (t)

– with 0 < η(t) < 1, which decreases progressively during the time (t) – The prototypes in the neighbourhood of x are moving towards x 149

Kohonen self-organizing map The units (prototype) on the grid are representative of some "cluster" The algorithm preserves proximities: prototypes which are close together in the feature space will be close together on the grid of the topological map This auto-organisation is due to the update rule which link the prototypes that are close together 150

Kohonen self-organizing map Here is an example – Without structure in the data – The IRIS dataset (3 groups)

151

Kohonen self-organizing map If there is no neighbourhood (only one unit/prototype is update) ⇒the algorithms is a simple clustering algorithm minimizing the within-class inertia (an iterative variant of k-means). ⇒In this case, the convergence can be proved for some family of decreasing rules of η(t)

Some conditions for convergence in the 2D case can be derived 152

Hierarchical clustering Two types: Bottom-up methods – Agglomerative hierarchical clustering

Top-down methods – Divisive methods

153

Hierarchical clustering Agglomerative methods merge clusters at each step

154

Hierarchical clustering Divisive methods are partitioning the data in two parts at each step .. .. .. . .. ..

.. . ..

... .

155

Hierarchical clustering In practice, most hierarchical clustering methods are agglomerative We must define an agglomeration rule Principle : one element = one individual to be clustered or one group of individuals already defining a cluster We use a four-steps procedure 156

Hierarchical clustering Step 1: initialty, N individuals to be clustered We have a distance matrix (or a similarity matrix) between the individuals 2 by 2, DNxN 0 D=

0

dii' 0

di' i

0

0

157

Hierarchical clustering Step 2: We find the two closest elements (dii' minimum) → merged into a new element

There are n – 2 elements remaining + one new element (a cluster)

158

Hierarchical clustering Step 3: We compute the distance (or the similarity) between the new element and all the other elements We adapt the distance matrix D, which has one element less than before the merge

159

Hierarchical clustering Step 4: We repeat steps 2) and 3) until the aggregation is complete: – Only one element containing all the individuals

160

Hierarchical clustering This means that we have to define a distance (or similarity) between clusters – A merge of two elements reduces the number of groups by one 2

1 3

5 4

Step:

1

2

3

4

5

161

Hierarchical clustering This is represented by a dendrogram – The y axis represents the cumulative distance between the two merged elements distance

P1

(9)

P2 (8) (7) (6) (1)

(3)

(4)

(2)

(5)

Initial elements = individuals to cluster

Leafs of the tree = groups

P3 P4 P5 Pk = partition of k classes 162 (nested sequence: Pn → P1)

Hierarchical clustering The dendrogram represents nested clusters 5

4

1

2 5

0.25 0.2

2

0.15

3

6 1

4

3

Nested Clusters

0.1 0.05 0

3

6

4

1

Dendrogram

2

5

163

Hierarchical clustering Aggregation criteria – Single linkage:

(1)

(2)

(i )

d(G1 ,G2 ) = min d( x ,x ), x ∈ Gi

– Complete linkage: d(G1 ,G 2 ) = max d( x(1),x (2) )

1 – Average linkage: d(G1 ,G 2 ) = G1 .G2

∑

d(x(1),x (2) )

x (1) ∈G1 x ( 2) ∈G2 164

Hierarchical clustering Ward’s criterion – Computes the increase in within-group inertia when merging two elements – Starts with a zero inertia (each individual is a group)

When Pk + 1 → Pk, after merging 2 elements of Pk + 1: – IB decreases and IW increases (Itot remains constant)

165

Hierarchical clustering ⇒ merge the 2 elements A and B of Pk + 1 such that: IW (Pk ) − IW (Pk +1 ) minimum IB (Pk ) − IB (Pk +1 ) minimum A .B 2 . d (gA ,g B ) A+B – If we use the Euclidean distance

Which is equivalent to

min

166

Hierarchical clustering 4

1 2

1

3

€ 4

1 2

5

d(2,4)

5

d(1,5)

3

€5

1 (d13 + d14 + d15 + d 23 + d 24 + d 25 ) 6

gA gB

gAB

€

IAB − (IA + IB )

167

Hierarchical clustering Ward’s method We now prove A.B 2 ΔIB = . dist (gA ,g B ) A+B

Which is very easy and quick to compute Ward’s criterion is very popular for hierarchical clustering 168

Hierarchical clustering Ward’s method We first prove a useful relationship about the inertia from a point z – g is the center of gravity of the cloud of feature vectors xi weighted by n(i) – nx is the number of different xi – Each xi is weighted by the number of individuals that it represents: n(i) 169

Hierarchical clustering Ward’s method

170

Hierarchical clustering Ward’s method The inertia with respect to a point z is equivalent to – The inertia with respect to the centroid g – Plus the inertia of g with respect to z, by weighting the term by the number of individuals it represents

171

Hierarchical clustering Ward’s method Thus we have:

Suppose two groups A and B are merged: 172

Hierarchical clustering Ward’s method The common centroid is

Let us compute the between-group inertia before (k + 1 groups) and after (k groups) the fusion

173

Hierarchical clustering Ward’s method After the fusion of A and B (k groups), we have a large group of (nA + nB) data having center of gravity gAB Before the fusion (k + 1 groups), we have

174

Hierarchical clustering Ward’s method Where we applied the decomposition of inertia with respect to a vector z – This vector is z = g in this case – There are only two groups (A and B), each having nA and nB observations – Notice that the center of gravity of the two groups is gAB and not g in this case !

We thus decompose the inertia as the sum of the inertia with respect to the centroid and the inertia with respect to z = g 175

Hierarchical clustering Ward’s method Thus we applied

In which we replace nx = 2; x1 = gA; x2 = gB; n(1) = nA; n(2) = nB; z = g; g = gAB (centroid of the merged data); N = nA + nB 176

Hierarchical clustering Ward’s method The difference in inertia is

177

Hierarchical clustering Ward’s method The single linkage method is sensible to the chaining effect

The most used criteria are – the Ward criterion – The average linkage aggregation rule 178

Hierarchical clustering If we want a partition, we cut the tree at some level – After the agregations at low levels (very small clusters) – Before the aggregation at higher level (large clusters) – It is often quite subjective

179

Hierarchical +k-means clustering K-means and hierarchical clustering could be combined in a useful way 1. First perform a k-means for vector quantization 2. Then, perform a hierarchical clustering 3. Decide about the cut level to obtain a partition 4. Re-compute a k-means on the partition defined by the hierarchical clustering •

In order to optimize the within-group inertia 180

Hierarchical +k-means clustering

1) Initial partition - k-means

.

3) 3 classes kept

2) Hierarchical clustering

4) k-means

181

Minimum spanning tree clustering Very simple clustering method Suppose we have a distance matrix D between individuals Step 1: Construct a graph – With individuals as nodes – Similarities as weights on the edges

182

Minimum spanning tree clustering Step 2: From this graph, construct the minimum spanning tree – Which corresponds to the tree • Spanning all the nodes • Having minimum sum of edges

– There are efficient algorithms for computing the minimum spanning tree

Step 3: Remove the edges – Which are ‘inconsistents’ – Having large distances compared to others 183

Minimum spanning tree clustering Example from Dubes & Jain (1988)

184

Minimum spanning tree clustering Simple exploratory technique Very fast to compute Interesting also for visualizing the data – Visualizing a spanning tree in 2D is easy

185

Graph partitioning Some clustering algorithms are applied to graphs Suppose we have a graph structure

186

Graph partitioning Suppose we have a similarity matrix S between individuals Construct a graph – With individuals as nodes – Similarities as weights on the edges

187

Graph partitioning The elements aij of the adjacency matrix A of a weighted, undirected, graph are defined as

where A is symmetric

The wij ≥ 0 represents a similarity or strength of relationship between node i and node j

188

Graph partitioning We define a similarity index between partition C1 and partition C2 as

Define h1 as an indicator vector containing 1 if the node belongs to C1 and 0 if it belongs to C2 Define h2 as the equivalent vector for C2 Define e as a column vector made of 1s 189

Graph partitioning We obtain

190

Graph partitioning And thus

Now,

Thus [D]ij = 0 for i ≠ j and 191

Graph partitioning We finally obtain

By relaxing the fact that h1 and h2 contain binary values, – This leads to the following optimization problem 192

Graph partitioning This leads to an eigenvector problem – Compute the smallest non-trivial (L is not of full rank) eigenvector of – which is called the Fiedler vector

193

Graph partitioning This approach is very popular for the moment ! For instance: find dense groups in a social network graph This is called ‘spectral clustering’

194

Graph partitioning: one example Visualization of a network of criminals

195

Conclusion A myriad of clustering techniques Which can be combined There are many other clustering techniques Still an active research area – For instance, consensus clustering and graph clustering are quite popular

However, very few techniques are implemented in standard statistical softwares 196

Cluster validity

197

Cluster validity Visualizing the clustering solution Association of the cluster solution with initial features Comparing clustering solutions Testing hypotheses Robustness estimation Consensus clustering About the number of clusters

198

Visualizing the clustering solution Always visually inspect the cluster solution!! For instance run a discriminant analysis – For viewing the data in a 2D or a 3D space – Which best discriminates between the clusters – As dependent variable, take the cluster labels – Eventually plot one cluster against the others

Also, perform a PCA and visualize the data in 2D or 3D 199

Visualizing the clustering solution Here is the result of a k-means visualized by a dicriminant analysis – On the IRIS dataset 18 17 16

14

1

Sepal length Sepal width Petal width

3

Petal length

13

2

Canonical2

15

12 11 10 9 -4

-2

0

2

4

6

8

10

Canonical1

200

Visualizing the clustering solution Order the similarity matrix with respect to cluster labels and inspect visually 1

1 0.9 0.8

s t n i o P

0.7 0.6

y

0.5 0.4 0.3 0.2 0.1 0

10

0.9

20

0.8

30

0.7

40

0.6

50

0.5

60

0.4

70

0.3

80

0.2

90

0.1

100 0

0.2

0.4

0.6

x

0.8

1

20

40

60

80

0 100Similarity

Points

201

Association of the cluster solution with initial features In order to be sure that some feature does not dominate the solution – Compute the association between the clustering solution and initial features – In order to avoid clusters that are completely dominated by only one unimportant variable – Eventually re-weight the features in the clustering 202

Comparing clustering solutions In order to validate clustering solutions We should be able to compare clustering solutions – Even if the number of clusters found is different!

Two standard techniques: – Huber’s Γ statistics – The RAND index R – Many others not detailled here

203

Comparing clustering solutions Huber’s Γ statistics Define two indicator variables – y1(i,j) and y2(i,j)

y1(i,j)=1 if individuals i and j are clustered together in clustering solution 1 – Otherwise y1(i,j)=0

y2(i,j)=1 if individuals i and j are clustered together in clustering solution 2 – Otherwise y2(i,j)=0

204

Comparing clustering solutions Huber’s Γ statistics can be written

– Where N(N – 1)/2 is the total number of terms (pairs) – If i and j are never clustered together, Γ = 0 – It is a kind of covariance value between the two clustering solutions – It can be used for comparing clusterings and proximities as well 205

Comparing clustering solutions However, it is better to use the normalized version – For which we standardize the variables y1(i,j) and y2(i,j)

– Then, –1 ≤ Γ’ ≤ 1 206

Comparing clustering solutions With

207

Comparing clustering solutions The RAND index R – It is a degree of agreement between two clustering solutions

Compute the number of pairs of individuals (i,j) that are – Clustered together in both clustering solutions – Not clustered together in both clustering solutions

Divide by the total number of pairs N(N – 1)/2

208

Comparing clustering solutions In other words, it computes the proportion of pairs for which both clusterings (1 and 2) agree – Either the pair belong to the same group in both clusterings – Either the pair does not belong to the same group in both clusterings

209

Comparing clustering solutions The RAND index can be computed by

Very useful for comparing clustering solutions Or comparing a clustering against a predefined solution 210

Comparing clustering solutions For each of these statistics t, there exists a corresponding ‘corrected’ statistics t which is a normalized version of the original one:

– The expectation is taken under the null hypothesis – The range is between 0 and 1 – The statistics is corrected with respect to matches occuring ‘by chance’ 211

Comparing clustering solutions Methods for comparing dendrograms are also available – They compare hierarchical clustering solutions

212

Testing hypotheses Several tests of hypothesis are available Suppose you measure some statistics t – That is related to the quality of the clustering – For instance, a measure of compactness of the clusters

213

Testing hypotheses You also want to test some null hypothesis H0 – For instance the individuals are distributed at random – So that all sets of location in some region of the feature space are equally likely – Which means that there is no structure in the data

Suppose also that the distribution of t is known under the null hypothesis H0 214

Testing hypotheses For a sample, you can – Compute t from that sample – test if the value of t is very unlikely or not – If it is very unlikely, you reject the hypothesis H0 (of no structure)

215

Testing hypotheses In ours case, tests of hypotheses can be performed by Monte-Carlo simulation techniques For instance, let us assume that the null hypothesis corresponds to – A uniform distribution of data – Without any structure

Assume we are interested in the ratio between between-cluster inertia on within-cluster inertia: Here is a simulation allowing to test this hypothesis

216

Testing hypotheses 1. 2. 3. 4. 5.

6.

Run a clustering algorithm on a data set, retrieve the cluster solution and compute IB/ IW Generate a new data set in the area of interest following a uniform distribution Run the same clustering algorithm on the new data set Compute the resulting IB/ IW Run 2) to 4) several times and draw a histogram representing the distribution of IB/ IW under the null hypothesis Does the value of IB/ IW computed on the original data set differ significantly from the 217 distribution of IB/ IW under the null hypothesis ?

Testing hypotheses There are many such tests – Which, for instance, assume Gaussian clusters

However, testing against the absence of structure is not always of interest – Indeed, even if there is no structure, clustering may be useful – Think about vector quantization 218

Robustness estimation One could be interested in testing the robustness of the solution ‘Split-sample’ methods are useful in this context – Randomly divide the data into two subsets – Perform an analysis separately – Compare the solutions

Here is a useful method in this context 219

Robustness estimation 1. 2. 3.

4. 5.

Divide the sample in two parts Perform independently a clustering on the two samples Classify the individuals of the other sample according to the classification rules (centroids) obtained on the first sample Compare the two solutions using, i.e. Huber’s statistics Run steps 1) to 4) many times and compute the average of Huber’s statistics 220

Robustness estimation If there really is a structure in the data – The algorithm should be able to retrieve this structure almost all the time

The same method is used in order to estimate the number of groups – Pick up the number of groups for which the solution is most stable 221

Consensus clustering The idea here is to find robust solutions by – Running many times the clustering – Finding a consensus between the different solutions – Thus seeking for a stable solution

222

Consensus clustering One could be interested in ‘consensus’ clustering 1. Take a bootstrap sample of the data set 2. Run a clustering algorithm with different seeds and label all the individuals 3. Run 1) and 2) many times 4. For each pair (i,j), compute the proportion of times the two individuals were grouped together 5. This provides a new similarity matrix 6. Find the labelling that was closest to this similarity matrix among the clusterings realized in 1) and 2) 223

About the number of clusters What is the natural number of clusters in the data? – No universal answer to this question – Usually, one has an approximate a priori idea

Three heuristic answers – Plot of the compactness criterion in terms of number of clusters • Plot the clustering criterion (between-cluster inertia) against number of clusters • A large drop is generally taken as suggestive of a particular number of groups

– Examine the dendrogram (hierachical clustering) – Perform a robustness analysis and pick up the most stable solution

224

About the number of clusters However, a number of other, more formal, techniques have been suggested – These are mainly statistics computed from the sum of squared distances (inertia) – Various clustering, with increasing number of groups, are performed, and this statistics is computed – The number for which the statistics is a minimum is retained – Some of these techniques are available in the 225 R software

About the number of clusters There is, however, no miraculous solution to this problem Always visualize the data – this may be a good way to have an idea of the quality of the solution

226

Cluster interpretation

227

Cluster interpretation Visual representation Cluster profiling Minimum spanning tree representation Discriminant analysis

228

Visual representation Ones we have obtained clusters, We have to provide an interpretation and a representation of these clusters – In order to gain insights about the clustering – In order to communicate the results – Profiling and visualization of the clusters

Time-consuming but very fulfilling work 229

Visual representation Perform a discriminant analysis – And visualize the clusters – By coloring them 18 17 16

14

1

Sepal length Sepal width Petal width

3

Petal length

13

2

Canonical2

15

12 11 10 9 -4

-2

0

2

4

Canonical1

6

8

10

230

Visual representation Eventually, if there are many clusters – Eliminate individuals that are far from the centroid – Keep only the individuals that are in the kernel of the clusters – Then perform a discriminant analysis – Usually, you obtain a clear representation 231

Visual representation Eventually plot the cluster density on a Belgian map – Urban clusters will appear

232

Cluster profiling Focus on one important variable For each cluster, compute – The odds of this important variables against the population mean – For instance mean revenue within the cluster divided by the mean revenue on the whole population

Create a bar plot representing odds in function of cluster (x axis), for this important variable 233

Cluster profiling Some MOSAIC examples

234

Cluster profiling Focus on one cluster For each important variable, compute – The odds of this important variable against the population mean

Create a bar plot representing odds in function of important variables (x axis), for this cluster 235

Cluster profiling Example MOSAIC – ‘Classe privilégiée’

236

Cluster profiling Example MOSAIC – ‘Classe defavorisée urbaine’

237

Cluster3

Cluster1 Petal width

Petal width

Sepal width

Sepal length

Petal width

Petal length

Sepal width

Sepal length

Petal length

Cluster2

Petal length

Sepal width

Sepal length

Petal width

Petal length

Sepal width

Sepal length

Cluster profiling Example of profiling (IRIS dataset)

Cluster Means

238

Minimum spanning tree representation Compute the distances between the centroids of the clusters Determine the minimum spanning tree Plot the minimum spanning tree

239

Minimum spanning tree representation An example from the MOSAIC segmentation

240

Discriminant analysis In order to give an interpretation to the cluster – Perform a discriminant analysis – Or a decision tree – Of one cluster against the others – Detect the most relevant features • Allowing to contrast/characterize this cluster against the others 241

Discriminant analysis Here is an example of a decision tree applied to a k-means clustering of the IRIS data All Rows Count 150

G^2 Level 328.75054 1 2 3

Petal length<3.0 Count 50

G^2 Level 9.8039113 1 2 3

Prob 0.3667 0.3067 0.3267

Petal length>=3.0 Prob 0.0000 0.0200 0.9800

Count 100

G^2 Level 137.62776 1 2 3

Sepal length<6.3 Count 49

G^2 Level 43.614415 1 2 3

Prob 0.5500 0.4500 0.0000

Sepal length>=6.3 Prob 0.1633 0.8367 0.0000

Count 51

G^2 Level 28.041985 1 2 3

Prob 0.9216 0.0784 0.0000

242

Softwares and further readings

243

Softwares and further readings Softwares – The R software – S-Plus software – SAS

244

Softwares and further readings Some multivariate data analysis books

245

Softwares and further readings Some clustering books

246

Softwares and further readings Some data analysis books (French school)

247