Cluster analysis Marco Saerens (UCL), with Christine Decaestecker (ULB)
Table of contents
– General introduction
– Data pre-processing
– Some clustering algorithms
– Cluster validity
– Cluster interpretation
– Clustering in practice
– Software and further readings
Slides references Many slides and figures have been adapted from the slides associated with the following books: – Alpaydin (2004), Introduction to machine learning. The MIT Press. – Duda, Hart & Stork (2000), Pattern classification, John Wiley & Sons. – Han & Kamber (2006), Data mining: concepts and techniques, 2nd ed. Morgan Kaufmann. – Tan, Steinbach & Kumar (2006), Introduction to data mining. Addison Wesley.
As well as from Wikipedia, the free online encyclopedia. I sincerely thank these authors for their contribution to the teaching of machine learning and data mining
General introduction
General introduction
– What is clustering ?
– Clustering versus vector quantization
– Research fields where clustering is studied
– An ill-posed problem ?
– Hierarchical versus fixed number of clusters
– Notions of proximity
– Proximity-based versus feature-based clustering
– Joint clustering of individuals and features
– Importance of validation
– Importance of interpretation
– Some notations
What is clustering ? Also called – Typology analysis (biology) – Segmentation (marketing) – Unsupervised learning (AI) – Automatic classification (data analysis)
Here are some illustrations from – Tan, Steinbach & Kumar, Introduction to data mining. Addison Wesley.
What is clustering, in brief ? Given a set of data points, each having a set of features, and a similarity measure among them, find clusters such that – Data points in one cluster are more similar to one another – Data points in separate clusters are less similar to one another
Similarity measures: – Euclidean distance if attributes are continuous – Other problem-specific measures
Application 1: market segmentation Goal: subdivide a market into distinct subsets of customers – where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix
Approach: – Collect different attributes/features of customers based on their geographical and lifestyle-related information – Find clusters of similar customers – Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters
Application 2: document clustering Goal: To find groups of documents that are similar to each other based on the important terms appearing in them Approach: – Identify frequently occurring terms in each document – Form a similarity measure based on the frequencies of the different terms – Use it to cluster the documents
What is clustering ? Objective: On the basis of – A similarity matrix D between individuals, or – A data matrix X describing the individuals through • Features or variables
Group – The individuals into a set of natural disjoint clusters – With high homogeneity within the clusters – And high heterogeneity between the clusters
What is clustering ? When using features, one speaks about – Feature-based or data-based clustering
When using similarities, one speaks about – Similarity-based clustering
What is clustering ? This means that we have to define – The features on which we will base our clustering – The concept of similarity – The concept of homogeneity
As well as – How to use these concepts in order to group the individuals = clustering algorithms
What is clustering ? Here is an ideal case, with well-separated clusters (Figure: intra-cluster distances are minimized, inter-cluster distances are maximized) Cluster memberships may be crisp or fuzzy
Clustering versus vector quantization Clustering is related to the field of vector quantization (VQ) Objective: to partition a set of individuals in order to code these individuals in some optimal way – VQ techniques are related to clustering – But we do not necessarily want to identify well-separated clusters
Clustering versus vector quantization Example of vector quantization
Research fields where clustering is studied Clustering is studied in a wide range of fields: – Pattern recognition – Knowledge discovery in databases and data mining – Machine learning – Applied statistics
Research fields where clustering is studied It is still a very active research field Some recent application areas – Bioinformatics – Computational linguistics and natural language processing
An ill-posed problem ? In fact, clustering is often an ill-posed problem Data containing well-separated groups are quite rare Defining the « natural » number of clusters is itself a problem
An ill-posed problem ? Consider the following data:
An ill-posed problem ? Consider the following data: How many clusters? (Figure: the same data set interpreted as two, four, or six clusters)
Hierarchical versus fixed number of clusters Two types of clustering algorithms Hierarchical algorithms – Provide a whole family of solutions – With varying number of clusters – Visual drawing: dendrogram
« Fixed number of clusters » algorithms – The number of clusters is provided in advance
Both techniques can be combined
Notions of proximity The notion of proximity is central to clustering – Dissimilarities – Similarities
It conditions the shape of the clusters you will find – For instance, the Euclidean distance tends to find spherical clusters
Notions of proximity Proximities can be computed from – Features measured on individuals – Links between individuals, without any notion of feature – Direct comparisons between individuals – Etc…
Notions of proximity When using features measured on individuals Notions of proximity will be different for – Categorical variables – Numerical variables – For mixed features, one has to merge the proximities – In order to obtain a global proximity score between the individuals
Notions of proximity When using proximities between individuals directly – One has to use clustering algorithms based on proximity matrices – Not always readily available in software packages (except hierarchical clustering)
Proximity-based versus feature-based clustering Most feature-based clustering algorithms work on numerical variables only Three alternatives: – Transform categorical features into numerical features and use a standard numerical-feature-based clustering – Compute similarities and then use a clustering based on the similarity matrix – Use a clustering algorithm that handles both types of variables
Joint clustering of individuals and features One can also try to jointly cluster both – Features and – Individuals
Into mutually disjoint clusters of individuals and features – Depends on the data
Importance of validation Validation is quite – Important and – Time-consuming
Some questions that should be addressed: – Are there really clusters? – Even if there are no really well-separated clusters, is the clustering still useful? – Are these clusters robust? – How do we compare two clusterings?
Importance of interpretation Once we have obtained clusters, we have to provide an interpretation of these clusters – In order to gain insights about the clustering – In order to communicate the results – Profiling and visualization of the clusters
Time-consuming but very fulfilling work
Some notations The data matrix will be X – Each row represents an individual – There are N (sometimes denoted as n) individuals in total – Each column represents a feature measured on the individuals
An element i, j of X will be denoted as xij The column vector of p features Xj measured on individual i (realizations) will be xi The random vector containing the random variables Xj (features) will be x Cluster i will be denoted by its label, Gi or Ci (ωi in supervised classification)
Some notations The matrix of distances between pairs of individuals i and j will be denoted as D The matrix of similarities between pairs of individuals i and j will be denoted as S
Conclusion Clustering – With fuzzy vs crisp membership – Feature-based vs similarity-based – Hierarchical vs fixed number of clusters
Importance of – Choice of proximities – Pre-processing of the data – Validation of the clustering – Interpretation of the clustering
Data pre-processing
Data pre-processing
– Exploratory analysis
– Definition of proximities (similarities and dissimilarities)
– Processing of numerical variables
– Processing of categorical variables
– Feature combination for mixed data (numerical + categorical features)
– Missing values
Exploratory analysis First step: gain insight about your data – Univariate distributions of all features – Correlations between features
(Figure: example of a univariate density estimate for one feature)
Exploratory analysis Eliminate highly redundant (correlated) features = feature extraction
Re-code very rare modalities (or levels) of categorical variables Visualize the data
Exploratory analysis Perform a PCA (on correlation matrix) in order to visualize the data – For numerical features
Perform a multiple correspondence analysis – For categorical features
Perform a multidimensional scaling – If the data are in the form of a proximity matrix
Definition of proximities Some clustering algorithms operate directly on the data matrix (feature-based) Other clustering algorithms work on proximities: – A distance matrix or – A similarity matrix
Definition of proximities Many different dissimilarity measures have been defined Usually, we use distances, which have the following properties – $d_{ij} \geq 0$ and $d_{ii} = 0$ – $d_{ij} = d_{ji}$ (symmetry) – $d_{ij} \leq d_{ik} + d_{kj}$ (triangle inequality)
Definition of proximities We can also compute similarities Usually, we use similarity measures, which have the following properties – $s_{ij} = s_{ji}$ (symmetry) – $s_{ij} \leq s_{ii}$ (self-similarity is maximal) – often $s_{ij} \geq 0$
Definition of proximities There is a dual relation between – Dissimilarity measures and – Similarity measures
If dij is a distance between i and j, then sij = (dmax – dij) is a similarity measure
If $s$ is a similarity measure, then $d_{ij}^2 = s_{ii} - 2 s_{ij} + s_{jj}$ is a dissimilarity measure Etc…
Processing of numerical variables: feature transformation Standardize the numerical features – To be independent of the measurement units – To be able to compare the variables
However, standardization can also have a negative effect
Processing of numerical variables: feature transformation Optionally, perform a Box-Cox transformation
Processing of numerical variables: feature transformation Transforms the numerical feature into – A new variable – That is « most Gaussian » in some sense
(Figure: distribution of the feature before and after the Box-Cox transformation)
Processing of numerical variables: feature transformation The transformation is chosen among the family $x^{(\lambda)} = \frac{x^{\lambda} - 1}{\lambda}$ for $\lambda \neq 0$, and $x^{(\lambda)} = \log(x)$ for $\lambda = 0$
And λ is chosen in order to maximize – The likelihood of x(λ) computed on the data set – Under the assumption that x(λ) is Gaussian
Processing of numerical variables: feature transformation Here is an example (Figure: distribution of the feature and profile of the criterion (SSE) plotted against λ (lambda); the value of λ optimizing the criterion is retained)
Processing of numerical variables: feature transformation The decision to transform the variable depends on – Its interpretation – Its distribution – The clustering algorithm you are using
Processing of numerical variables: feature reduction Feature reduction In some cases, it is useful to perform a feature reduction – Large number of numerical features – Features taking only a few integer values
Perform a PCA on the numerical features and keep a given number of factors
Processing of numerical variables: feature reduction The PCA projects the data into a subspace – That keeps as much variance as possible – Among all the possible linear subspace projections
Visualize the ‘scree plot’, which is a plot of the eigenvalues λi against i Or keep a pre-defined % of the total variance
Processing of numerical variables: feature reduction Here is an example of a scree plot – Each eigenvalue corresponds to a part of the variance corresponding to the factor
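A minimal sketch of this feature-reduction step with scikit-learn, assuming numerical data in a matrix X (the random matrix below is only a placeholder); standardizing first makes the PCA operate on the correlation matrix, and keeping roughly 80% of the variance is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(200, 10))   # placeholder data matrix

Z = StandardScaler().fit_transform(X)   # so that the PCA works on the correlation matrix
pca = PCA().fit(Z)

eigenvalues = pca.explained_variance_                     # values shown in the scree plot
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_factors = int(np.searchsorted(cumulative, 0.80)) + 1    # keep ~80% of the variance
scores = PCA(n_components=n_factors).fit_transform(Z)     # reduced numerical features
```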
Processing of numerical variables: computation of proximities Here are some popular distances – The Euclidean distance: $d_{ij} = \sqrt{\sum_k (x_{ik} - x_{jk})^2}$
– The L1 (Manhattan) distance: $d_{ij} = \sum_k |x_{ik} - x_{jk}|$
Processing of numerical variables: computation of proximities The Canberra metric: $d_{ij} = \sum_k \frac{|x_{ik} - x_{jk}|}{|x_{ik}| + |x_{jk}|}$
Gower’s metric (see the section on mixed data below)
Processing of numerical variables: computation of proximities – The Mahalanobis distance: $d_{ij}^2 = (x_i - x_j)^T \Sigma^{-1} (x_i - x_j)$
where Σ is the variance-covariance matrix
Processing of numerical variables: computation of proximities Here are some popular similarity measures – The inner product $s_{ij} = x_i^T x_j$ and the cosine similarity $s_{ij} = \frac{x_i^T x_j}{\|x_i\|\,\|x_j\|}$
Processing of numerical variables: computation of proximities – The Tanimoto similarity: $s_{ij} = \frac{x_i^T x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i^T x_j}$
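A short sketch of how these proximities can be computed in practice with scipy.spatial.distance; the two vectors and the data matrix used to estimate the covariance are hypothetical.

```python
import numpy as np
from scipy.spatial import distance

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([2.0, 0.0, 4.0])

d_euclidean = distance.euclidean(xi, xj)    # sqrt of the sum of squared differences
d_manhattan = distance.cityblock(xi, xj)    # L1 distance
d_canberra = distance.canberra(xi, xj)      # sum |xik - xjk| / (|xik| + |xjk|)
s_cosine = 1.0 - distance.cosine(xi, xj)    # cosine similarity

# The Mahalanobis distance needs the inverse covariance matrix of the data set.
X = np.random.default_rng(2).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
d_mahalanobis = distance.mahalanobis(xi, xj, VI)
```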
Processing of categorical variables: feature reduction Categorical data are ‘dangerous’ (they can condition the clustering) – For instance, we could obtain a clustering only representing the sex of the individuals – If the variable has many levels (modalities), it is coded by indicator variables – As a result, it has a large weight in the Euclidean distance
Processing of categorical variables: feature reduction Feature reduction In some cases, it is useful to perform a feature reduction – We want to transform the categorical features into numerical features
Perform a multiple correspondence analysis on the categorical features and keep a given number of factors
Processing of categorical variables: feature reduction This is the equivalent of principal components analysis for categorical features It provides a set of factors which account for as much chi-square distance as possible It provides numerical features
Processing of categorical variables: feature reduction As for PCA, we can compute a ‘scree plot’ Or keep a predefined % of variability
Processing of categorical variables: computation of proximities Features can be – Binary – Multi-valued (more than two modalities)
Processing of categorical variables: computation of proximities Many different similarity measures are available We code the categorical features by indicator variables (vectors) – Having a 1 in position i when the individual has modality i for this feature, and 0 otherwise – For instance [0, 1, 0, 0]T if the individual i possesses the second modality of the feature
Processing of categorical variables: computation of proximities Suppose a, b, c, and d are defined by – Counting the number of 1’s and 0’s in common – a = number of features where both individuals have a 1, b = number where i has 1 and j has 0, c = number where i has 0 and j has 1, d = number where both have 0
Processing of categorical variables: computation of proximities Common similarity measures based on these counts include the simple matching coefficient, $s_{ij} = \frac{a + d}{a + b + c + d}$, and the Jaccard coefficient, $s_{ij} = \frac{a}{a + b + c}$
Processing of categorical variables: computation of proximities Each of these similarities can be used in order – To compare two individuals – Based on the categorical variables
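A minimal sketch of computing the counts a, b, c, d from two binary indicator vectors and deriving the simple matching and Jaccard coefficients; the vectors are hypothetical.

```python
import numpy as np

def matching_counts(u, v):
    """Counts of matches/mismatches between two binary indicator vectors."""
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    a = int(np.sum(u & v))      # both 1
    b = int(np.sum(u & ~v))     # 1 in i, 0 in j
    c = int(np.sum(~u & v))     # 0 in i, 1 in j
    d = int(np.sum(~u & ~v))    # both 0
    return a, b, c, d

a, b, c, d = matching_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
simple_matching = (a + d) / (a + b + c + d)   # joint absences count as agreement
jaccard = a / (a + b + c)                     # joint absences are ignored
```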
Processing of categorical variables: Feature combination for mixed data Feature combination for mixed data – The data contains numerical + categorical features
Three standard ways of dealing with mixed data: – Converting categorical data into numerical data – Using clustering techniques dealing with both types of features – Computing mixed proximities combining categorical and numerical data
Processing of categorical variables: Feature combination for mixed data Converting categorical data into numerical data: – By using correspondence analysis – If there is only one categorical variable, it performs optimal scaling – Keep only the first factors
Processing of categorical variables: Feature combination for mixed data Using clustering techniques dealing with both types of features – Some clustering techniques use mixtures of generative models
For instance – Gaussian for numerical variables – Poisson for integer variables – Multinomial for categorical variables
Processing of categorical variables: Feature combination for mixed data Thus, this model integrates mixed variables in a natural way This mixture model is estimated with the EM algorithm One example of such a model is AutoClass It can deal with different types of variables as well as with missing values
Processing of categorical variables: Feature combination for mixed data Computing mixed proximities combining categorical and numerical data One such solution was proposed by Gower (1971) It allows one to combine categorical and numerical variables in order to obtain a similarity or a distance
Processing of categorical variables: Feature combination for mixed data Define the similarity as $s_{ij} = \frac{\sum_{k} \delta_{ijk}\, s_{ijk}}{\sum_{k} \delta_{ijk}}$
– Where sij is a similarity measure between individuals i and j – And where sijk is the similarity between individuals i and j based on feature k – The indicator δijk equals 1 when feature k can be compared for i and j, and 0 otherwise
Processing of categorical variables: Feature combination for mixed data For binary features – sijk = 1 if there is a positive match between i and j on binary feature k – sijk = 0 otherwise
For categorical features – sijk = 1 if individuals i and j fall in the same category – sijk = 0 otherwise
Processing of categorical variables: Feature combination for mixed data For numerical features $s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}$, where $R_k$ is the range of feature k
– One has to be careful with outliers, however
The weights δijk are set to 0 when the comparison on feature k is not possible (e.g. a missing value), and to 1 otherwise
Processing of categorical variables: Feature combination for mixed data A meaningful distance can then be defined as $d_{ij} = \sqrt{1 - s_{ij}}$
Weights wk may also be provided, weighting each feature in the sums – So that you can give more importance to an important feature
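A sketch of a Gower-type similarity under the assumptions made above (range-normalized similarities for numerical features, exact matching for categorical ones, features with a missing value skipped); it is an illustration, not a faithful reimplementation of Gower (1971).

```python
import numpy as np
import pandas as pd

def gower_similarity(xi, xj, is_numeric, ranges):
    """Gower-type similarity between two records with mixed feature types.

    xi, xj     : sequences of feature values (may contain missing values)
    is_numeric : boolean mask, True for numerical features
    ranges     : range R_k of each numerical feature over the whole data set
    """
    num = den = 0.0
    for k in range(len(xi)):
        if pd.isna(xi[k]) or pd.isna(xj[k]):
            continue                          # delta_ijk = 0: feature not comparable
        if is_numeric[k]:
            s = 1.0 - abs(xi[k] - xj[k]) / ranges[k]
        else:
            s = 1.0 if xi[k] == xj[k] else 0.0
        num += s
        den += 1.0
    return num / den if den > 0 else np.nan

s = gower_similarity([1.70, "blue", 35], [1.82, "brown", 40],
                     is_numeric=[True, False, True], ranges=[0.5, None, 60])
```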
Missing values If there are missing values, – It is not a good idea to perform imputation – Imputation introduces artificial similarities between individuals – It is better to ignore the missing variables
Missing values For instance, when computing the Euclidean distance, one can use $d_{ij}^2 = \frac{1}{p} \sum_{k:\, x_{ik},\, x_{jk}\ \text{observed}} (x_{ik} - x_{jk})^2$
– The sum is performed over the non-missing features only – So that p is the number of features observed for both individuals
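A small sketch of this idea; whether the sum is rescaled by the number p of usable features (as done here) or left as a plain sum is a choice that should match the convention adopted above.

```python
import numpy as np

def euclidean_with_missing(xi, xj):
    """Euclidean-type distance ignoring features missing in either vector."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    mask = ~(np.isnan(xi) | np.isnan(xj))    # features observed for both individuals
    p = mask.sum()
    if p == 0:
        return np.nan
    diff = xi[mask] - xj[mask]
    return np.sqrt(np.sum(diff ** 2) / p)    # average over the p usable features

d = euclidean_with_missing([1.0, np.nan, 3.0], [2.0, 5.0, np.nan])
```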
Some clustering algorithms
Some clustering algorithms
– The standard k-means
– Fuzzy clustering
– Mixture of distributions
– The kernel trick for clustering
– Kohonen self-organizing maps
– Hierarchical clustering
– Minimum spanning tree clustering
– Graph partitioning
Only basic principles will be explained !!
The standard k-means Maybe the most popular clustering algorithm Feature-based but can be adapted to distance-based Crisp membership, but there exist extensions to fuzzy memberships – Fuzzy clustering
The number of groups is provided in advance Iterative method
The standard k-means Basic principle: a reallocation iterative method based on mobile prototypes (for instance, the ‘centers of gravity’) k-means method: k fixed = number of groups 1. Initialization Choose k initial prototypes (centers of gravity) among the individuals that have to be clustered
2. Reallocation Allocate each individual to the cluster whose prototype is closest
3. Recentering Recompute the prototypes of the clusters (for instance the center of gravity)
4. Repeat steps 2) and 3) until stabilisation
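A minimal NumPy sketch of the four steps above (initialization, reallocation, recentering, repeat until stabilization); it ignores refinements such as the handling of empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means with prototypes = centers of gravity."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: choose k individuals as initial prototypes.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Reallocation: assign each individual to the nearest prototype.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Recentering: prototypes become the centers of gravity of the clusters.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat steps 2 and 3 until stabilization.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```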
The standard k-means Here is an example (Figure: initialization with two prototypes c1 and c2 chosen among the points, followed by a first and a second reallocation-recentering iteration in which the centers of gravity g1 and g2 move towards the centers of the two groups)
The standard k-means Justification of the method – Inertia within group $G_k$ ($k = 1, \dots, K$): $I_k = \sum_{i \in G_k} d^2(x_i, g_k)$ – Total inertia within groups: $I_W = \sum_{k=1}^{K} I_k$ – Total inertia between groups: $I_B = \sum_{k=1}^{K} |G_k|\, d^2(g_k, g)$
with g = global center of gravity, and $|G_k| = n(k)$ = number of individuals in group $G_k$
The standard k-means Thus, Intra-cluster distances are minimized
Inter-cluster distances are maximized
The standard k-means One can show that the total inertia $I_{tot} = \sum_{i=1}^{n} d^2(x_i, g)$ satisfies $I_{tot} = I_W + I_B$ (within + between) (Figure: an observation xi, the global center of gravity g and the group centers g1, g2, g3)
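A quick numerical check of this decomposition for an arbitrary partition (random placeholder data and labels).

```python
import numpy as np

def inertia_decomposition(X, labels):
    g = X.mean(axis=0)                           # global center of gravity
    I_tot = np.sum((X - g) ** 2)
    I_W = I_B = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        gk = Xk.mean(axis=0)                     # center of gravity of group k
        I_W += np.sum((Xk - gk) ** 2)            # within-group contribution
        I_B += len(Xk) * np.sum((gk - g) ** 2)   # between-group contribution
    return I_tot, I_W, I_B

X = np.random.default_rng(3).normal(size=(50, 2))
labels = np.random.default_rng(4).integers(0, 3, size=50)
I_tot, I_W, I_B = inertia_decomposition(X, labels)
assert np.isclose(I_tot, I_W + I_B)              # within + between = total
```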
The standard k-means A good partition of the clusters ⇔ IW minimum (homogeneous, compact clusters) A good partition of the clusters ⇔ IB maximum (well-separated clusters) Since IW + IB = Itot – Decreasing IW ⇔ Increasing IB – The algorithm decreases IW at each iteration
The standard k-means However! IW always decreases when K increases ⇒ this does not allow us to find the ideal number of clusters!
But it allows us to compare two partitions in K clusters: – The best one ⇔ IW minimum (or IB maximum)
Notice that the problem of finding the best partition (having the lowest IW) is NP-hard – The k-means only finds a local minimum of IW – One has to restart the algorithm many times with different initial values and keep the best solution
The standard k-means Properties of the k-means algorithm: – At each reallocation-recentering step 2) + 3), one can show that IW decreases – ⇒ stabilisation when IW does not decrease any more – The largest decreases occur in the first steps – ⇒ The partition converges relatively rapidly (roughly 10–20 iterations of 2) + 3))
Advantages of this method: – Quick and easy to implement – Works with large data sets
The standard k-means: convergence of the method The total within-group inertia is defined as $I_W = \sum_{k=1}^{K} \sum_{i \in G_k} \| x_i - g_k \|^2$ – xi is the vector of features for individual i
The standard k-means: convergence of the method First step: – Compute the prototypes g(k) such that IW is minimized, while maintaining the allocations of the individuals fixed
The standard k-means: convergence of the method We easily obtain $\frac{\partial I_W}{\partial g_k} = -2 \sum_{i \in G_k} (x_i - g_k) = 0$
And thus $g_k = \frac{1}{n_k} \sum_{i \in G_k} x_i$
So that we find that the prototypes must be the centers of gravity of each group!
The standard k-means: convergence of the method Second step: – Reallocate the individuals xi such that IW is minimized, while keeping the prototypes fixed – If pi(k) is the probability of assigning individual xi to the cluster with index k (= allocation to the clusters), we have
– The optimal allocation rule is to assign each individual to the nearest prototype
The standard k-means: convergence of the method Therefore, step 2) and step 3) each decrease the within group inertia Since this inertia cannot be negative, the decrease must tend to zero And the procedure converges
The standard k-means Notice that, instead of using the Euclidean distance, one could use the L1 distance – This often works better; the clustering is more stable – The clustering is less sensitive to outliers – Instead of computing the centroid, one has to compute the medoid – Less sensitive to the ‘empty cluster’ problem
Importance of choosing initial centroids Consider the following data: (Figure: a two-dimensional data set with three natural groups)
Importance of choosing initial centroids (Figure: successive k-means iterations 1–6 for one choice of initial centroids)
Importance of choosing initial centroids (Figure: successive k-means iterations 1–5 for another choice of initial centroids, leading to a different final partition)
Limitations of k-means: Differing sizes (Figure: original points vs. the k-means solution with 3 clusters)
Limitations of k-means: Differing density (Figure: original points vs. the k-means solution with 3 clusters)
Limitations of k-means: Non-globular shapes (Figure: original points vs. the k-means solution with 2 clusters)
Limitations of k-means: Number of clusters Within-cluster inertia can be used for comparing clustering solutions – Unfortunately, it always decreases as the number of clusters increases (Figure: within-cluster SSE plotted against the number of clusters K)
Limitations of k-means: Number of clusters (Figure: the same data set with the clusters found, numbered 1–7)
The standard k-means Some drawbacks of this method: 1. We have to fix K in advance 2. The result depends on the initial centers of gravity ⇒ Does not provide the optimal solution (the partition in K groups for which IW is minimum) ⇒ Provides a local minimum depending on the initial centroids
3. There often is an ‘empty cluster’ problem ⇒ Some of the clusters become empty (outliers, etc.)
The standard k-means Solutions → Still an active research area For the dependence on the initial centroid choice – Simple approach: repeat the clustering many times and choose the partition having the minimum IW (see the sketch below) – Try to find a consensus by repeating the clustering many times ("formes fortes"): 1. Rerun the clustering many times with different initial centroids and different bootstrap samples 2. Try to find a high consensus (individuals that are always in the same group) 3. Rerun a k-means with the ‘consensus groups’ as initial centroids
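The simple "restart and keep the best" approach is exactly what the n_init parameter of scikit-learn's KMeans does; a brief sketch on placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(300, 4))   # placeholder data

# Restart k-means 50 times with different random initial centroids and
# keep the partition with the lowest within-group inertia.
best = KMeans(n_clusters=3, n_init=50, random_state=0).fit(X)
labels = best.labels_
print(best.inertia_)      # I_W of the best of the 50 runs
```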
The standard k-means For the value of K (number of clusters) that has to be provided ISODATA method (and other variants) Includes ‘merge group’ and ‘divide group’ operations in the ‘k-means’ algorithm – Fusion of 2 groups if their distance is low (i.e. if the distance between centroids < some threshold) – Division of a group into 2 sub-groups if the inertia of the group (Ik) is too high (> some threshold)
⇒ parameters (thresholds) to be fixed !!!
The standard k-means Use of different criteria that allow us to compare various solutions having different numbers of clusters – For instance, the Condorcet criterion – A wide range of criteria has been proposed (see later)
There is however no universal solution to this problem – Always try to visualize the data – For instance by using a discriminant analysis
The standard k-means The k-means algorithm can easily be adapted in order to work on – A distance matrix – A similarity matrix
The standard k-means For each cluster Cl, we define a prototype – Which is an individual belonging to Cl – The index of this individual will be pl
The distance between an individual having index k and cluster Cl will be – The distance to the prototype pl
The standard k-means The within-group inertia of cluster l will be $I_l = \sum_{k \in C_l} d^2(k, p_l)$
The total within-group inertia is $I_W = \sum_{l=1}^{K} I_l$
The standard k-means Step 1: initialize the prototypes Step 2: reallocate the individuals, while maintaining the prototypes fixed – Each individual k is affected to the nearest cluster having index l
The standard k-means Step 3: Each prototype is recomputed – The individual having the smallest distance to the whole cluster is chosen as prototype, while maintaining the allocations to the clusters
Steps 2 and 3 minimize IW at each iteration
The standard k-means K-means can also be adapted in order to take the spread of the clusters into account – Compute a Mahalanobis distance for each cluster k: $d_k^2(x_i, g_k) = (x_i - g_k)^T W_k^{-1} (x_i - g_k)$
The standard k-means The variance-covariance matrices Wk have to be recomputed at each iteration – Just as the centers of gravity (Figure: two clusters G1 and G2 with their covariance matrices W1 and W2)
Bisecting k-means Bisecting K-means algorithm – Variant of K-means that can produce a partitional or a hierarchical clustering
Fuzzy k-means The memberships to the clusters, pi(k), are now fuzzy – They reflect the degree of membership to the clusters = degree of membership of individual i to cluster k
The aim is to minimize the within-class inertia $J = \sum_{i=1}^{N} \sum_{k=1}^{K} p_i(k)\, d^2(x_i, g_k)$
Fuzzy k-means Subject to the constraints $\sum_{k=1}^{K} p_i(k) = 1$ for each individual i, with $p_i(k) \geq 0$, and to a constraint on the entropy $-\sum_{i,k} p_i(k) \log p_i(k)$
That is, we fix a given entropy level
Fuzzy k-means We thus introduce the Lagrange function combining the inertia and the constraints
We compute the membership values by taking the derivative with respect to the pi(k)
Fuzzy k-means This leads to membership values of softmax form, $p_i(k) \propto \exp(-\beta\, d^2(x_i, g_k))$, normalized so that they sum to one over the clusters
Fuzzy k-means For computing the prototypes, we set the derivative of the Lagrange function with respect to gk to zero
Which provides $g_k = \frac{\sum_{i=1}^{N} p_i(k)\, x_i}{\sum_{i=1}^{N} p_i(k)}$
Therefore, the prototypes are membership-weighted centers of gravity
Fuzzy k-means We therefore iterate the two update steps (memberships, then prototypes) until convergence
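Assuming the entropy-regularized formulation sketched above (with β the multiplier associated with the entropy level), the two iterated updates take the standard closed forms below; this is a sketch of the usual result, not a transcription of the original derivation.

```latex
p_i(k) \;=\; \frac{\exp\!\big(-\beta\, d^2(\mathbf{x}_i, \mathbf{g}_k)\big)}
                  {\sum_{l=1}^{K} \exp\!\big(-\beta\, d^2(\mathbf{x}_i, \mathbf{g}_l)\big)},
\qquad
\mathbf{g}_k \;=\; \frac{\sum_{i=1}^{N} p_i(k)\, \mathbf{x}_i}{\sum_{i=1}^{N} p_i(k)}
```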
Mixture of distributions Clustering based on a priori distributional assumptions Each cluster is represented as a probability distribution – It represents the probability that the feature vector x is generated in this cluster – It contains a set of parameters that have to be estimated – The probability distribution can be Poisson, Gaussian, Bernoulli, etc…
Mixture of distributions Feature-based but can be adapted to distance-based Probabilistic cluster memberships The number of groups must be provided in advance Iterative method Only a local maximum of the likelihood is computed
Mixture of distributions Here is an example of a mixture of three Gaussians – The objective is to fit these Gaussians to the data set
Mixture of distributions We assume that there are K clusters – Each having a particular distribution – Which defines the probability of observing vector x in this cluster – Here is an example of a Gaussian density for cluster k: $p(\mathbf{x} \mid C_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left[ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right]$
Mixture of distributions The mixture of distributions, over all classes, can be written $p(\mathbf{x}) = \sum_{k=1}^{K} P(C_k)\, p(\mathbf{x} \mid C_k)$
– Each distribution is weighted by the probability P(Ck) of belonging to the corresponding cluster
Mixture of distributions For estimating the parameters, one relies on the likelihood of the data sample $L = \prod_{i=1}^{N} p(\mathbf{x}_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} P(C_k)\, p(\mathbf{x}_i \mid C_k)$
– The objective is to maximize this likelihood – We assume that the samples have been generated independently
Mixture of distributions Here are some examples of mixtures – From Duda, Hart & Stork (2001)
Mixture of distributions The cluster membership for an individual having feature vector x is provided by the posterior probabilities $P(C_k \mid \mathbf{x}) = \frac{P(C_k)\, p(\mathbf{x} \mid C_k)}{\sum_{l=1}^{K} P(C_l)\, p(\mathbf{x} \mid C_l)}$
Mixture of distributions Maximizing the likelihood provides us with a set of nonlinear equations – We differentiate the likelihood with respect to the estimated parameters – And in the Gaussian mixture case, with respect to the means, covariance matrices and priors
Mixture of distributions Thus, we compute the derivatives of the log-likelihood, set them to zero And we easily obtain the update equations iterated below
Mixture of distributions The parameters are estimated by iterating the resulting equations We use a three-step procedure – Just as we did for the k-means
Step 1: Initialize the parameters Step 2: Recompute the membership values Step 3: Re-estimate the parameter values
Mixture of distributions For the Gaussian mixture case: 1. Initialize the parameters 2. Recompute the membership values – While maintaining the parameters fixed: $\hat{P}(C_k \mid \mathbf{x}_i) = \frac{\hat{P}(C_k)\, p(\mathbf{x}_i \mid C_k)}{\sum_{l} \hat{P}(C_l)\, p(\mathbf{x}_i \mid C_l)}$
Mixture of distributions 3. Re-estimate the parameters – While maintaining the membership values fixed: $\hat{P}(C_k) = \frac{1}{N} \sum_{i=1}^{N} \hat{P}(C_k \mid \mathbf{x}_i)$, $\hat{\boldsymbol{\mu}}_k = \frac{\sum_i \hat{P}(C_k \mid \mathbf{x}_i)\, \mathbf{x}_i}{\sum_i \hat{P}(C_k \mid \mathbf{x}_i)}$, $\hat{\Sigma}_k = \frac{\sum_i \hat{P}(C_k \mid \mathbf{x}_i)\, (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T}{\sum_i \hat{P}(C_k \mid \mathbf{x}_i)}$
Mixture of distributions Other distributions can be used as well – Poisson for count variables – Multinomial for categorical variables – Etc…
This basic model has been extended in many ways – Other optimization algorithms can be used (EM and extensions) – We can deal properly with missing values – Bayesian extensions have been proposed (AutoClass)
The kernel trick for clustering Many clustering algorithms have been « kernelized » This makes it possible to cluster objects based on a similarity matrix – Which usually is a kernel matrix K – Containing inner products [K]ij = kij = (xi)Txj
Kernel k-means We want to minimize the total within-class inertia $J = \sum_{k=1}^{K} \sum_{i \in C_k} \| \phi(\mathbf{x}_i) - \boldsymbol{\gamma}_k \|^2$ in the feature space induced by the mapping φ
We introduce the « kernel trick », where X is the data matrix: only inner products between mapped observations are needed, and they are provided by the kernel matrix K
Kernel k-means We easily obtain an expression of J involving only the entries of the kernel matrix K
Kernel k-means The k-means iteratively minimizes J by – Assigning cluster labels, li, to nodes – Recomputing the cluster prototypes γk
The cluster assignment that minimizes J is $l_i = \arg\min_k \| \phi(\mathbf{x}_i) - \boldsymbol{\gamma}_k \|^2$, which can be computed from K only
Kernel k-means The cluster prototypes γk that minimize J are solutions of an equation in which both sides are linear combinations of the columns of K
– where ki is the ith column of K and nk is the number of nodes assigned to cluster Ck
Kernel k-means Since, on both sides of the equation, we have a linear combination of the columns of K, one solution is to take γk as the average of the columns ki over the nodes assigned to cluster Ck
Kernel k-means Thus, we simply iterate – For all observations i: reassign i to the cluster whose prototype is nearest, with the squared distance computed from the entries of K only – For all clusters k: recompute the prototype γk from the columns of K corresponding to the members of Ck
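A minimal sketch of these iterations, taking the prototype of each cluster as the mean of the mapped observations of that cluster (one admissible solution of the prototype equation), so that everything is expressed with entries of K only.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel k-means operating on the kernel (similarity) matrix K only."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, k, size=n)   # random start
    for _ in range(n_iter):
        d2 = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            nc = max(len(idx), 1)
            # ||phi(x_i) - prototype_c||^2 =
            #   K_ii - (2/n_c) sum_{j in c} K_ij + (1/n_c^2) sum_{j,l in c} K_jl
            d2[:, c] = (np.diag(K)
                        - 2.0 * K[:, idx].sum(axis=1) / nc
                        + K[np.ix_(idx, idx)].sum() / nc ** 2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```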
Kohonen self-organizing map This clustering algorithm has been developed in the framework of artificial neural networks – It is a kind of iterative k-means – With spatial constraints that provide a graphical representation of the clusters
It is called a self-organizing map (SOM) in that field
Kohonen self-organizing map Principle: a two-level artificial neural network – One input layer and one output layer – The output layer is organized as a two-dimensional "topological map" – It reflects the "natural relationships" between the input data – The topological map is structured by a so-called unsupervised learning algorithm – In such a way that similar feature vectors activate spatially close output units – Inspired by biological processes such as speech or visual perception
Kohonen self-organizing map Computes a nonlinear representation of the data On a two-dimensional map By creating a topological structure preserving the neighborhood (Figure: input units ("entrées") fully connected to a two-dimensional topological map ("carte topologique"))
Kohonen self-organizing map A set of prototypes is defined – Corresponding to a grid of output units – Each prototype corresponds to a unit of the grid
Kohonen self-organizing map Here is a summary of the basic SOM algorithm If x is a feature vector and pi is a prototype corresponding to the position i on the grid The SOM algorithm proceeds in four steps
Kohonen self-organizing map Step 1: the prototypes are initialized (Figure: input features x1, x2, …, xp connected to the grid of output units)
Kohonen self-organizing map Step 2: We identify the « winning unit » p* such that $\| \mathbf{x} - \mathbf{p}^* \| = \min_i \| \mathbf{x} - \mathbf{p}_i \|$ – Where, usually, the Euclidean distance is used This winning unit determines a neighbourhood V*
Kohonen self-organizing map Step 3: Compute the neighbourhood V*(t) around the winning unit p* on the grid – Which decreases as time goes on – The iteration step is denoted by t (Figure: shrinking neighbourhoods V*(t1) ⊃ V*(t2) around p*)
Kohonen self-organizing map Step 4: Adapt the prototypes pi belonging to the neighbourhood of p* on the grid: $\mathbf{p}_i(t+1) = \mathbf{p}_i(t) + \eta(t)\,[\mathbf{x} - \mathbf{p}_i(t)], \quad \forall\, \mathbf{p}_i \in V^*(t)$
– with 0 < η(t) < 1, which decreases progressively over time t – The prototypes in the neighbourhood of the winning unit move towards x
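A compact sketch of the four steps with a rectangular grid, a linearly decreasing learning rate and a shrinking square neighbourhood; the schedules are simple illustrative choices, not the ones recommended in the SOM literature.

```python
import numpy as np

def som_train(X, grid_shape=(10, 10), n_iter=2000, seed=0):
    """Minimal SOM training loop on a 2-D grid of prototypes."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    prototypes = rng.normal(size=(rows * cols, X.shape[1]))            # step 1
    for t in range(n_iter):
        eta = 0.5 * (1.0 - t / n_iter)                                 # decreasing rate
        radius = max(1.0, (max(rows, cols) / 2.0) * (1.0 - t / n_iter))
        x = X[rng.integers(len(X))]
        winner = np.argmin(((prototypes - x) ** 2).sum(axis=1))        # step 2
        neigh = np.abs(coords - coords[winner]).sum(axis=1) <= radius  # step 3: V*(t)
        prototypes[neigh] += eta * (x - prototypes[neigh])             # step 4
    return prototypes, coords
```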
Kohonen self-organizing map The units (prototypes) on the grid are representative of some "cluster" The algorithm preserves proximities: prototypes which are close together in the feature space will be close together on the grid of the topological map This self-organization is due to the update rule, which links the prototypes that are close together
Kohonen self-organizing map Here are some examples – Data without structure – The IRIS dataset (3 groups)
Kohonen self-organizing map If there is no neighbourhood (only one unit/prototype is updated) ⇒ the algorithm is a simple clustering algorithm minimizing the within-class inertia (an iterative variant of k-means) ⇒ In this case, the convergence can be proved for some families of decreasing rules for η(t)
Some conditions for convergence in the 2D case can be derived
Hierarchical clustering Two types: Bottom-up methods – Agglomerative hierarchical clustering
Top-down methods – Divisive methods
Hierarchical clustering Agglomerative methods merge clusters at each step
Hierarchical clustering Divisive methods split the data into two parts at each step
Hierarchical clustering In practice, most hierarchical clustering methods are agglomerative We must define an agglomeration rule Principle: one element = one individual to be clustered, or one group of individuals already defining a cluster We use a four-step procedure
Hierarchical clustering Step 1: initially, there are N individuals to be clustered We have a distance matrix (or a similarity matrix) between pairs of individuals, D (N × N), with zeros on the diagonal and generic off-diagonal element dii'
Hierarchical clustering Step 2: We find the two closest elements (dii' minimum) → they are merged into a new element
There are N – 2 old elements remaining, plus one new element (a cluster)
Hierarchical clustering Step 3: We compute the distance (or the similarity) between the new element and all the other elements We adapt the distance matrix D, which has one element less than before the merge
Hierarchical clustering Step 4: We repeat steps 2) and 3) until the aggregation is complete: – Only one element containing all the individuals
Hierarchical clustering This means that we have to define a distance (or similarity) between clusters – A merge of two elements reduces the number of groups by one (Figure: five elements successively merged in steps 1–5)
Hierarchical clustering This is represented by a dendrogram – The y axis represents the (cumulative) distance between the two merged elements (Figure: the leaves of the tree are the initial elements = individuals to cluster; the internal nodes are groups; cutting the tree at a given level yields a partition Pk into k classes, giving a nested sequence Pn → … → P1)
Hierarchical clustering The dendrogram represents nested clusters (Figure: six points grouped into nested clusters in the plane, and the corresponding dendrogram)
Hierarchical clustering Aggregation criteria – Single linkage: $d(G_1, G_2) = \min_{x^{(1)} \in G_1,\, x^{(2)} \in G_2} d(x^{(1)}, x^{(2)})$ – Complete linkage: $d(G_1, G_2) = \max_{x^{(1)} \in G_1,\, x^{(2)} \in G_2} d(x^{(1)}, x^{(2)})$ – Average linkage: $d(G_1, G_2) = \frac{1}{|G_1|\,|G_2|} \sum_{x^{(1)} \in G_1} \sum_{x^{(2)} \in G_2} d(x^{(1)}, x^{(2)})$
Hierarchical clustering Ward’s criterion – Computes the increase in within-group inertia when merging two elements – Starts with a zero inertia (each individual is a group)
When Pk + 1 → Pk, after merging 2 elements of Pk + 1: – IB decreases and IW increases (Itot remains constant)
Hierarchical clustering ⇒ merge the two elements A and B of Pk+1 such that the increase in within-group inertia, IW(Pk) − IW(Pk+1), is minimum, or equivalently the decrease in between-group inertia, IB(Pk+1) − IB(Pk), is minimum
– If we use the Euclidean distance, this is equivalent to minimizing $\frac{n_A\, n_B}{n_A + n_B}\, d^2(g_A, g_B)$
Hierarchical clustering (Figure: illustration of the criteria on two groups of points; single linkage = d(2,4), complete linkage = d(1,5), average linkage = (1/6)(d13 + d14 + d15 + d23 + d24 + d25), Ward's criterion = IAB − (IA + IB), based on the centroids gA, gB and gAB)
Hierarchical clustering Ward’s method We now prove $\Delta I_B = \frac{n_A\, n_B}{n_A + n_B}\, d^2(g_A, g_B)$
Which is very easy and quick to compute Ward’s criterion is very popular for hierarchical clustering
Hierarchical clustering Ward’s method We first prove a useful relationship about the inertia with respect to a point z – g is the center of gravity of the cloud of feature vectors xi weighted by n(i) – nx is the number of different xi – Each xi is weighted by the number of individuals that it represents: n(i)
Hierarchical clustering Ward’s method $I_z = \sum_{i=1}^{n_x} n(i)\, d^2(x_i, z) = \sum_{i=1}^{n_x} n(i)\, d^2(x_i, g) + N\, d^2(g, z)$, with $N = \sum_i n(i)$
Hierarchical clustering Ward’s method The inertia with respect to a point z is thus equal to – The inertia with respect to the centroid g – Plus the inertia of g with respect to z, weighted by the number of individuals it represents
Hierarchical clustering Ward’s method Thus we have the decomposition above
Suppose two groups A and B are merged:
Hierarchical clustering Ward’s method The common centroid is $g_{AB} = \frac{n_A\, g_A + n_B\, g_B}{n_A + n_B}$
Let us compute the between-group inertia before (k + 1 groups) and after (k groups) the fusion
Hierarchical clustering Ward’s method After the fusion of A and B (k groups), we have a large group of (nA + nB) data points having center of gravity gAB Before the fusion (k + 1 groups), we have the two separate contributions of A and B to the between-group inertia
Hierarchical clustering Ward’s method Where we applied the decomposition of inertia with respect to a vector z – This vector is z = g in this case – There are only two groups (A and B), each having nA and nB observations – Notice that the center of gravity of the two groups is gAB and not g in this case!
We thus decompose the inertia as the sum of the inertia with respect to the centroid and the inertia with respect to z = g
Hierarchical clustering Ward’s method Thus we applied the previous decomposition
In which we replace nx = 2; x1 = gA; x2 = gB; n(1) = nA; n(2) = nB; z = g; g = gAB (centroid of the merged data); N = nA + nB
Hierarchical clustering Ward’s method The difference in inertia is $\Delta I_B = I_B(P_{k+1}) - I_B(P_k) = n_A\, d^2(g_A, g_{AB}) + n_B\, d^2(g_B, g_{AB}) = \frac{n_A\, n_B}{n_A + n_B}\, d^2(g_A, g_B)$
Hierarchical clustering Ward’s method The single linkage method is sensitive to the chaining effect
The most used criteria are – Ward’s criterion – The average linkage aggregation rule
Hierarchical clustering If we want a partition, we cut the tree at some level – After the aggregations at low levels (very small clusters) – Before the aggregations at higher levels (large clusters) – It is often quite subjective
Hierarchical + k-means clustering K-means and hierarchical clustering can be combined in a useful way 1. First perform a k-means for vector quantization 2. Then, perform a hierarchical clustering 3. Decide about the cut level to obtain a partition 4. Re-compute a k-means on the partition defined by the hierarchical clustering – In order to optimize the within-group inertia
Hierarchical + k-means clustering (Figure: 1) initial partition by k-means; 2) hierarchical clustering of the prototypes; 3) three classes kept; 4) final k-means on the resulting partition)
Minimum spanning tree clustering Very simple clustering method Suppose we have a distance matrix D between individuals Step 1: Construct a graph – With individuals as nodes – Distances as weights on the edges
Minimum spanning tree clustering Step 2: From this graph, construct the minimum spanning tree – Which corresponds to the tree • Spanning all the nodes • Having minimum sum of edge weights
– There are efficient algorithms for computing the minimum spanning tree
Step 3: Remove the edges – Which are ‘inconsistent’ – Having large distances compared to the others
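A small sketch of the three steps with SciPy; the "mean + 2 standard deviations" rule used to declare an edge inconsistent is just one possible heuristic.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(8).normal(size=(40, 2))   # placeholder data
D = squareform(pdist(X))                             # step 1: full distance matrix

mst = minimum_spanning_tree(D).toarray()             # step 2: minimum spanning tree

edges = mst[mst > 0]                                 # step 3: remove long edges
threshold = edges.mean() + 2.0 * edges.std()
mst[mst > threshold] = 0.0

# The remaining connected components are the clusters.
n_clusters, labels = connected_components(mst, directed=False)
```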
Minimum spanning tree clustering Example from Dubes & Jain (1988)
Minimum spanning tree clustering Simple exploratory technique Very fast to compute Interesting also for visualizing the data – Visualizing a spanning tree in 2D is easy
Graph partitioning Some clustering algorithms are applied to graphs Suppose we have a graph structure
Graph partitioning Suppose we have a similarity matrix S between individuals Construct a graph – With individuals as nodes – Similarities as weights on the edges
Graph partitioning The elements aij of the adjacency matrix A of a weighted, undirected graph are defined as aij = wij if nodes i and j are connected, and aij = 0 otherwise
where A is symmetric
The wij ≥ 0 represent a similarity or strength of relationship between node i and node j
Graph partitioning We define a similarity index between the two clusters C1 and C2 of a partition as the total weight of the edges joining them (the cut), $J = \sum_{i \in C_1} \sum_{j \in C_2} a_{ij}$
Define h1 as an indicator vector containing 1 if the node belongs to C1 and 0 if it belongs to C2 Define h2 as the equivalent vector for C2 Define e as a column vector made of 1s
Graph partitioning We obtain $J = \mathbf{h}_1^T \mathbf{A} \mathbf{h}_2$, with $\mathbf{h}_2 = \mathbf{e} - \mathbf{h}_1$
Graph partitioning And thus $J = \mathbf{h}_1^T \mathbf{A} (\mathbf{e} - \mathbf{h}_1) = \mathbf{h}_1^T \mathbf{A} \mathbf{e} - \mathbf{h}_1^T \mathbf{A} \mathbf{h}_1$
Now, define the diagonal degree matrix D with $[\mathbf{D}]_{ii} = \sum_j a_{ij}$
Thus [D]ij = 0 for i ≠ j, and since h1 is a binary vector, $\mathbf{h}_1^T \mathbf{A} \mathbf{e} = \mathbf{h}_1^T \mathbf{D} \mathbf{h}_1$
Graph partitioning We finally obtain $J = \mathbf{h}_1^T (\mathbf{D} - \mathbf{A}) \mathbf{h}_1 = \mathbf{h}_1^T \mathbf{L} \mathbf{h}_1$, where L = D − A is the Laplacian matrix of the graph
By relaxing the fact that h1 and h2 contain binary values – This leads to the following optimization problem: minimize $\mathbf{h}^T \mathbf{L} \mathbf{h}$ subject to a normalization constraint on h
Graph partitioning This leads to an eigenvector problem – Compute the eigenvector of L associated with the smallest non-trivial eigenvalue (L is not of full rank) – which is called the Fiedler vector
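A minimal sketch of the resulting bisection: build the Laplacian, take the eigenvector of its second-smallest eigenvalue (assuming a connected graph), and threshold it to recover the two groups; the small adjacency matrix is purely illustrative.

```python
import numpy as np

def fiedler_partition(A):
    """Bisect a graph with symmetric adjacency matrix A using the Fiedler vector."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # Laplacian matrix
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                     # smallest non-trivial eigenvector
    return (fiedler > 0).astype(int)            # threshold the relaxed indicator

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
print(fiedler_partition(A))                     # two groups of nodes
```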
Graph partitioning This approach is currently very popular! For instance: find dense groups in a social network graph This is called ‘spectral clustering’
Graph partitioning: one example Visualization of a network of criminals
Conclusion A myriad of clustering techniques Which can be combined There are many other clustering techniques Still an active research area – For instance, consensus clustering and graph clustering are quite popular
However, very few techniques are implemented in standard statistical software packages
Cluster validity
Cluster validity
– Visualizing the clustering solution
– Association of the cluster solution with initial features
– Comparing clustering solutions
– Testing hypotheses
– Robustness estimation
– Consensus clustering
– About the number of clusters
Visualizing the clustering solution Always visually inspect the cluster solution!! For instance, run a discriminant analysis – For viewing the data in a 2D or a 3D space – Which best discriminates between the clusters – As dependent variable, take the cluster labels – Possibly plot one cluster against the others
Also, perform a PCA and visualize the data in 2D or 3D
Visualizing the clustering solution Here is the result of a k-means visualized by a discriminant analysis – On the IRIS dataset (Figure: the three clusters plotted on the two canonical discriminant axes Canonical1 and Canonical2, with the features Sepal length, Sepal width, Petal length and Petal width shown as directions)
Visualizing the clustering solution Order the similarity matrix with respect to the cluster labels and inspect it visually (Figure: the data points and the reordered similarity matrix, which shows a block-diagonal structure when the clusters are well separated)
Association of the cluster solution with initial features In order to be sure that some feature does not dominate the solution – Compute the association between the clustering solution and the initial features – In order to avoid clusters that are completely dominated by only one unimportant variable – If necessary, re-weight the features in the clustering
Comparing clustering solutions In order to validate clustering solutions We should be able to compare clustering solutions – Even if the number of clusters found is different!
Two standard techniques: – Hubert’s Γ statistic – The RAND index R – Many others not detailed here
Comparing clustering solutions Hubert’s Γ statistic Define two indicator variables – y1(i,j) and y2(i,j)
y1(i,j) = 1 if individuals i and j are clustered together in clustering solution 1 – Otherwise y1(i,j) = 0
y2(i,j) = 1 if individuals i and j are clustered together in clustering solution 2 – Otherwise y2(i,j) = 0
Comparing clustering solutions Hubert’s Γ statistic can be written $\Gamma = \frac{1}{M} \sum_{i < j} y_1(i,j)\, y_2(i,j)$, with $M = N(N-1)/2$
– Where N(N – 1)/2 is the total number of terms (pairs) – If i and j are never clustered together, Γ = 0 – It is a kind of covariance value between the two clustering solutions – It can be used for comparing clusterings and proximities as well
Comparing clustering solutions However, it is better to use the normalized version – For which we standardize the variables y1(i,j) and y2(i,j): $\Gamma' = \frac{1}{M} \sum_{i < j} \frac{(y_1(i,j) - \bar{y}_1)(y_2(i,j) - \bar{y}_2)}{s_{y_1}\, s_{y_2}}$, where the means and standard deviations are computed over all pairs
– Then, –1 ≤ Γ’ ≤ 1
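A direct (and deliberately naive, O(N²)) sketch of the raw and normalized statistics; the normalized version is simply the correlation between the two pair-indicator variables.

```python
import numpy as np
from itertools import combinations

def hubert_gamma(labels1, labels2, normalized=True):
    """Hubert's Gamma between two clusterings of the same N individuals."""
    pairs = list(combinations(range(len(labels1)), 2))
    y1 = np.array([float(labels1[i] == labels1[j]) for i, j in pairs])
    y2 = np.array([float(labels2[i] == labels2[j]) for i, j in pairs])
    if not normalized:
        return np.mean(y1 * y2)            # raw Gamma
    return np.corrcoef(y1, y2)[0, 1]       # normalized Gamma', in [-1, 1]

print(hubert_gamma([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 0, 2]))
```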
Comparing clustering solutions The RAND index R – It is a degree of agreement between two clustering solutions
Compute the number of pairs of individuals (i,j) that are – Clustered together in both clustering solutions – Not clustered together in both clustering solutions
Divide by the total number of pairs N(N – 1)/2
Comparing clustering solutions In other words, it computes the proportion of pairs on which the two clusterings (1 and 2) agree – Either the pair belongs to the same group in both clusterings – Or the pair does not belong to the same group in either clustering
Comparing clustering solutions The RAND index can be computed as $R = \frac{\text{number of agreeing pairs}}{N(N-1)/2}$
Very useful for comparing clustering solutions Or for comparing a clustering against a predefined solution
Comparing clustering solutions For each of these statistics t, there exists a corresponding ‘corrected’ statistic, which is a normalized version of the original one: $t' = \frac{t - E[t]}{\max(t) - E[t]}$
– The expectation is taken under the null hypothesis – The range is between 0 and 1 – The statistic is corrected with respect to matches occurring ‘by chance’
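Both the plain and the corrected-for-chance versions of the Rand index are available in scikit-learn; a short sketch (the label vectors are hypothetical):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]     # clustering solution 1
labels_b = [1, 1, 0, 0, 0, 2]     # clustering solution 2 (label names may differ)

print(rand_score(labels_a, labels_b))            # proportion of agreeing pairs
print(adjusted_rand_score(labels_a, labels_b))   # corrected for chance agreement
```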
Comparing clustering solutions Methods for comparing dendrograms are also available – They compare hierarchical clustering solutions
Testing hypotheses Several tests of hypothesis are available Suppose you measure some statistic t – That is related to the quality of the clustering – For instance, a measure of compactness of the clusters
Testing hypotheses You also want to test some null hypothesis H0 – For instance, the individuals are distributed at random – So that all sets of locations in some region of the feature space are equally likely – Which means that there is no structure in the data
Suppose also that the distribution of t is known under the null hypothesis H0
Testing hypotheses For a sample, you can – Compute t from that sample – Test whether the value of t is very unlikely or not – If it is very unlikely, you reject the hypothesis H0 (of no structure)
Testing hypotheses In our case, tests of hypotheses can be performed by Monte-Carlo simulation techniques For instance, let us assume that the null hypothesis corresponds to – A uniform distribution of the data – Without any structure
Assume we are interested in the ratio of the between-cluster inertia to the within-cluster inertia: here is a simulation allowing us to test this hypothesis
Testing hypotheses
1. Run a clustering algorithm on a data set, retrieve the cluster solution and compute IB/IW
2. Generate a new data set in the area of interest following a uniform distribution
3. Run the same clustering algorithm on the new data set
4. Compute the resulting IB/IW
5. Run 2) to 4) several times and draw a histogram representing the distribution of IB/IW under the null hypothesis
6. Does the value of IB/IW computed on the original data set differ significantly from the distribution of IB/IW under the null hypothesis? (see the sketch below)
Testing hypotheses There are many such tests – Which, for instance, assume Gaussian clusters
However, testing against the absence of structure is not always of interest – Indeed, even if there is no structure, clustering may be useful – Think about vector quantization
Robustness estimation One could be interested in testing the robustness of the solution ‘Split-sample’ methods are useful in this context – Randomly divide the data into two subsets – Perform an analysis separately on each – Compare the solutions
Here is a useful method in this context
Robustness estimation
1. Divide the sample in two parts
2. Perform a clustering independently on the two samples
3. Classify the individuals of the other sample according to the classification rules (centroids) obtained on the first sample
4. Compare the two solutions using, e.g., Hubert’s statistic
5. Run steps 1) to 4) many times and compute the average of Hubert’s statistic
Robustness estimation If there really is a structure in the data – The algorithm should be able to retrieve this structure almost all the time
The same method is used in order to estimate the number of groups – Pick the number of groups for which the solution is most stable
Consensus clustering The idea here is to find robust solutions by – Running the clustering many times – Finding a consensus between the different solutions – Thus seeking a stable solution
Consensus clustering One could be interested in ‘consensus’ clustering (see the sketch below)
1. Take a bootstrap sample of the data set
2. Run a clustering algorithm with different seeds and label all the individuals
3. Run 1) and 2) many times
4. For each pair (i,j), compute the proportion of times the two individuals were grouped together
5. This provides a new similarity matrix
6. Find the labelling that was closest to this similarity matrix among the clusterings realized in 1) and 2)
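A simplified sketch of steps 2–5: the clustering is rerun with different seeds on the full data set (bootstrap resampling, as in step 1, could be added but complicates the pair bookkeeping) and the co-clustering proportions form the new similarity matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_runs=50):
    """Proportion of runs in which each pair of individuals is clustered together."""
    n = len(X)
    together = np.zeros((n, n))
    for r in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1, random_state=r).fit_predict(X)
        # Indicator matrix: entry (i, j) is 1 if i and j share a label in this run.
        together += (labels[:, None] == labels[None, :]).astype(float)
    return together / n_runs          # new similarity matrix with values in [0, 1]
```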
About the number of clusters What is the natural number of clusters in the data? – No universal answer to this question – Usually, one has an approximate a priori idea
Three heuristic answers – Plot of the compactness criterion in terms of the number of clusters • Plot the clustering criterion (between-cluster inertia) against the number of clusters • A large drop is generally taken as suggestive of a particular number of groups
– Examine the dendrogram (hierarchical clustering) – Perform a robustness analysis and pick the most stable solution
About the number of clusters However, a number of other, more formal, techniques have been suggested – These are mainly statistics computed from the sum of squared distances (inertia) – Various clusterings, with increasing numbers of groups, are performed, and the statistic is computed – The number for which the statistic is a minimum is retained – Some of these techniques are available in the R software
About the number of clusters There is, however, no miraculous solution to this problem Always visualize the data – this may be a good way to have an idea of the quality of the solution
Cluster interpretation
Cluster interpretation
– Visual representation
– Cluster profiling
– Minimum spanning tree representation
– Discriminant analysis
Visual representation Once we have obtained clusters, we have to provide an interpretation and a representation of these clusters – In order to gain insights about the clustering – In order to communicate the results – Profiling and visualization of the clusters
Time-consuming but very fulfilling work
Visual representation Perform a discriminant analysis – And visualize the clusters – By coloring them (Figure: the three IRIS clusters plotted on the two canonical discriminant axes)
Visual representation Optionally, if there are many clusters – Eliminate individuals that are far from the centroid – Keep only the individuals that are in the core of the clusters – Then perform a discriminant analysis – Usually, you obtain a clear representation
Visual representation Possibly plot the cluster density on a map of Belgium – Urban clusters will appear
Cluster profiling Focus on one important variable For each cluster, compute – The odds of this important variable against the population mean – For instance, the mean revenue within the cluster divided by the mean revenue over the whole population
Create a bar plot representing the odds as a function of the cluster (x axis), for this important variable
Cluster profiling Some MOSAIC examples
Cluster profiling Focus on one cluster For each important variable, compute – The odds of this important variable against the population mean
Create a bar plot representing the odds as a function of the important variables (x axis), for this cluster
Cluster profiling Example MOSAIC – ‘Classe privilégiée’
Cluster profiling Example MOSAIC – ‘Classe defavorisée urbaine’
Cluster profiling Example of profiling (IRIS dataset) (Figure: bar plots of the cluster means of Sepal length, Sepal width, Petal length and Petal width for clusters 1, 2 and 3)
Minimum spanning tree representation Compute the distances between the centroids of the clusters Determine the minimum spanning tree Plot the minimum spanning tree
Minimum spanning tree representation An example from the MOSAIC segmentation
Discriminant analysis In order to give an interpretation to the clusters – Perform a discriminant analysis – Or a decision tree – Of one cluster against the others – Detect the most relevant features • Allowing one to contrast/characterize this cluster against the others
Discriminant analysis Here is an example of a decision tree applied to a k-means clustering of the IRIS data (Figure: the root node (150 observations) is first split on Petal length < 3.0, which isolates cluster 3 almost perfectly (50 observations); the remaining 100 observations are then split on Sepal length < 6.3, which mostly separates cluster 2 (49 observations) from cluster 1 (51 observations))
Software and further readings
Software and further readings Software – The R software – S-Plus – SAS
Software and further readings Some multivariate data analysis books
Software and further readings Some clustering books
Software and further readings Some data analysis books (French school)