Web mining
Marco Saerens (UCL), with Christine Decaestecker (ULB)
Document retrieval: A short overview of some old and recent techniques
Marco Saerens (UCL)
Contents
! General introduction
! Information retrieval: Basic standard techniques (content-based methods)
– Documents pre-processing
– Vector-space model
– Probabilistic model
– Assessment of performances
! Information retrieval: More recent techniques
– Exploiting links between documents (web link analysis)
• The PageRank algorithm
• The HITS algorithm
– Exploiting the relational structure in order to improve retrieval
General introduction
Introduction
! We have a collection of documents (mainly text or HTML-based)
! We have a set of users
– A user wants to retrieve the documents related to a given concept
! The user consequently submits a query expressed through words or terms
! An information retrieval system returns the documents most related to this concept
Introduction
! One major problem:
– We want to express a concept
– With words
– There is no one-to-one mapping between words and concepts (e.g. the French word "marché")
– Several words can express the same concept (synonymy) and one word can cover several concepts (polysemy)
Documents pre-processing
Documents pre-processing
! R. Baeza-Yates & B. Ribeiro-Neto (1999)
– Modern Information Retrieval
– Addison Wesley
Documents pre-processing
! C. Manning, P. Raghavan & H. Schütze (2008)
– Introduction to Information Retrieval
– Cambridge University Press
Documents pre-processing
! We have a collection of documents
! Here are the standard pre-processing steps
– Extract text and structure (e.g. from Microsoft Word, PDF, or LaTeX to XML)
– Remove stop words (e.g. remove "the", "at", "all", etc.)
– Named entity recognition (e.g. find proper names)
– Stemming (e.g. extract "process" from "processing")
Documents pre-processing
! It is a tedious job
– The pipeline is: documents → extract text and structure → remove stop words → named entity recognition → stemming
– But some tools are readily available
• The Galilei project developed at the ULB
• NLTK, written in Python (a small usage sketch follows)
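To make the pipeline concrete, here is a minimal sketch using NLTK; the input sentence is an illustrative assumption, and running it requires the nltk package plus its "punkt" and "stopwords" data:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    text = "The system is processing all the documents."        # toy input

    tokens = word_tokenize(text.lower())                        # split into word tokens
    stop = set(stopwords.words('english'))                      # standard stop-word list
    tokens = [t for t in tokens if t.isalpha() and t not in stop]   # remove stop words
    stemmer = PorterStemmer()                                   # Porter's rule-based stemmer
    print([stemmer.stem(t) for t in tokens])                    # e.g. "processing" -> "process"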
Documents pre-processing
! Stemming aims to extract the « root » of the words
! Stemming can be based on
– A dictionary (for instance Mmorph, developed at the University of Geneva)
– A set of rules developed by linguists (like Porter's stemming algorithm for English)
Documents pre-processing
! Example of stemming rules in French (the rule table shown on the original slide is not reproduced here); a stemmer applying such rules is sketched below
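Since the original rule table is lost, here is a sketch with NLTK's French Snowball stemmer, which applies suffix-stripping rules of exactly this kind (the example words are arbitrary):

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("french")
    for word in ["marché", "marchés", "continuellement"]:
        print(word, "->", stemmer.stem(word))   # strips French suffixes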
Basic Methods: The vector space model
The vector space method
! M. W. Berry & M. Browne (1999)
– Understanding Search Engines
– SIAM
The vector space model: Introduction
! In its basic form, each document is represented by a vector
– A query is also represented by a vector
– A user profile may be represented by a vector as well
! The coordinates of the vector correspond to words
– Each element of the vector represents the frequency of the word in the document or the query
– Documents thus live in the space of words
The vector space model: Basics
! Thus a document is represented by a vector
– Document j is characterized by dj = [f1j, f2j, …]T
– fij is the frequency of word wi in document j
– The total number of distinct words is nw
! The dimension of the vector is therefore nw
The vector space model: Basics
! Thus each document is represented by its vector of word frequencies
! This is called the « bag of words » representation in the word space
– The order of the words is not taken into account
– This vector is usually sparse
– This vector is very large
The vector space model: Basics
! The total number of documents is nd
! The term-document matrix is the nw × nd matrix D = [d1, d2, …, dnd] whose columns are the document vectors
The vector space model: Basics
! A query is also represented by a vector q
– Each element is 0 or 1 (presence or absence of a word)
– [q]i = 1 if word wi is present in the query, and 0 otherwise
The vector space model: Basics
! The purpose is of course to retrieve documents di based on a query q
! We have to define a notion of similarity between a query and a document
[Figure: the query vector q and a document vector d, with the angle between them.]
The vector space model: Basics
! The similarity between a query q and a document di can be defined as the cosine of the angle between these two vectors:
sim(q, di) = qT di / (||q|| ||di||)
– The Euclidean distance does not work well because queries contain far fewer words than documents
– This is called the cosine similarity
The vector space model: Basics
! The similarity between the query and all the documents can be computed at once from the term-document matrix: the vector of scores is DT q, followed by the normalization by the vector norms, as sketched below
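A minimal sketch with a toy term-document matrix, assuming the conventions above (rows = words, columns = documents; all values are made up):

    import numpy as np

    # D[i, j] = frequency of word i in document j (3 words, 4 documents, toy data)
    D = np.array([[2, 0, 1, 0],
                  [0, 1, 3, 0],
                  [1, 1, 0, 2]], dtype=float)
    q = np.array([1.0, 0.0, 1.0])   # binary query: words 1 and 3 are present

    # cosine similarity between q and every column of D, computed at once
    scores = (D.T @ q) / (np.linalg.norm(D, axis=0) * np.linalg.norm(q))
    print(np.argsort(-scores))      # documents ranked by decreasing similarity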
The vector space model: Refinements
! Two refinements of the basic model:
– Term weighting
– Latent semantic models
The vector space model: Term weighting
! We now introduce term weighting
– Of course, each word does not have the same « weight »
– We would like to take the "discriminative power" of every word into account
– For instance, if a word is present in every document, it is useless
– P(wi) is the a priori probability that word wi appears in a document
The vector space model: Term weighting
! The following quantity is often called the inverse document frequency (idf) associated to word wi: idfi = –log2[P(wi)]
– It is a measure of the general importance of the word (or term) wi
! P(wi) is estimated by the number of documents in which wi appears, divided by the total number of documents, so that idfi = log2(nd / dfi), where dfi is the number of documents containing wi
The vector space model: tf score
! Another quantity of interest is the term frequency, tfij
– It measures the importance of the term wi within the particular document dj, e.g. tfij = fij / (total number of words in dj)
– It is normalized to prevent a bias towards longer documents
The vector space model: tf.idf score
! The tf-idf score is simply the product of the tf and the idf scores, tf.idfij = tfij · idfi
– The tf-idf weighting scheme is often used in the vector space model together with the cosine similarity
– To determine, for instance, the similarity between two documents
! We simply replace the term-frequency elements of the term-document matrix by the tf-idf scores (see the sketch below)
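A sketch of the weighting itself on a toy count matrix; several tf and idf variants exist, this one follows the slides (tf normalized by document length, idf as a base-2 logarithm):

    import numpy as np

    D = np.array([[2, 0, 1, 0],
                  [0, 1, 3, 0],
                  [1, 1, 0, 2]], dtype=float)   # raw counts, words x documents
    n_d = D.shape[1]
    tf = D / D.sum(axis=0, keepdims=True)       # tf_ij, normalized by document length
    df = (D > 0).sum(axis=1)                    # number of documents containing each word
    idf = np.log2(n_d / df)                     # idf_i = -log2 P(w_i)
    tfidf = tf * idf[:, np.newaxis]             # tf.idf_ij = tf_ij * idf_i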
The vector space model: Term weighting
! We redefine the query vector q as [q]i = idfi if word wi is present in the query, and 0 otherwise
! Each word wi is thus weighted by the information provided by knowing the presence of the word
! We then compute the cosine between the query and the documents in order to rank them
The vector space model: Latent semantic models
! Latent semantic models
– These models try to capture some semantic information
– For instance, if we introduce a query with "newborn", it would be nice if documents containing "baby" but not "newborn" were also retrieved
– We say that words are semantically related when they are used in the same context
The vector space model: Latent semantic models
! This way, we can capture some « semantic similarity » between words
– In the present case, we will say that two words are semantically related when they often co-occur in the same documents
The vector space model: Latent semantic models
! One solution to this problem is to use "subspace projection methods" like
– Singular value decomposition (SVD) or
– Factor analysis
! The rank-m SVD of a matrix of rank n is the « best approximation » to this matrix having rank m < n
– In the present case, we use an SVD in order to reduce the rank of the term-document matrix
The vector space model: Latent semantic models
! This allows us to reduce the dimensionality of the space by clustering the words that are semantically "similar",
! that is, used in the same documents
! This allows us to build a kind of concept space
The vector space model: Latent semantic models
! Every matrix has a "singular value decomposition". If D is of full rank:
D = UΣVT, with UTU = I, VTV = I, and Σ = Diag(σ1, σ2, …, σn)
The vector space model: Latent semantic models
! For rank-deficient or rectangular matrices, the singular values are ordered as σ1 ≥ σ2 ≥ ··· ≥ σn ≥ 0, some of them possibly being zero
! If we want the best rank-m approximation to D, we keep only the m largest singular values and set σm+1 = σm+2 = ··· = 0
The vector space model: Latent semantic models
! So that we obtain D̃ = UmΣmVmT, where Um and Vm contain the first m columns of U and V, and Σm = Diag(σ1, …, σm)
The vector space model: Latent semantic models
! D̃ is the best rank-m approximation to D
– That is, there is no rank-m matrix closer to D in terms of the Frobenius norm:
||D − D̃||²F = Σi=1..nw Σj=1..nd (dij − d̃ij)²
The vector space model: Latent semantic models
! The queries are now addressed to D̃ instead of D (see the sketch below)
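A sketch of the whole latent semantic step with numpy (toy random data; the rank m is an arbitrary choice, which is precisely the difficulty noted in the conclusion below):

    import numpy as np

    D = np.random.rand(50, 20)            # toy term-document matrix (50 words, 20 documents)
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    m = 5
    D_m = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]   # best rank-m approximation of D

    q = np.random.rand(50)                # a query vector in the word space
    scores = (D_m.T @ q) / (np.linalg.norm(D_m, axis=0) * np.linalg.norm(q))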
The vector space model: Latent semantic models
! But how does it work?
[Figure: a cloud of points in the three-dimensional word space with axes w1, w2, w3.]
The vector space model: Conclusion
! The vector-space method relies on linear algebra concepts
! The SVD approach allows us to work in a latent space representing concepts
! The main problem: how many dimensions of the subspace do we keep?
! Moreover, it cannot be applied to very large data sets (millions of documents)
Basic Methods: Probabilistic methods
Probabilistic methods
! K. Sparck Jones & P. Willett (editors) (1997)
– Readings in Information Retrieval
– Morgan Kaufmann
! A collection of papers
Probabilistic methods: Introduction
! The probabilistic methods rely on statistical models
– Each user profile is represented by a statistical model
! A document can be relevant or not to a user
– R = 1 if it is relevant; R = 0 if it is not relevant
Probabilistic methods: Introduction
! Based on
– Relevance feedback from the user
– Or simply the ranking of a vector space model
! We can build a probabilistic model
– It will estimate the probability that a document is relevant
! We will introduce the binary independence retrieval model
Probabilistic methods: Introduction
! We introduced a query
– Based on a vector space model, we obtain a ranked list of results; the top-ranked documents are considered as relevant, the others as irrelevant
Probabilistic methods: Introduction
! Expanding the query based on
– the most relevant documents or
– relevance feedback from the user
! is called query expansion
Probabilistic methods: Basic model
! Each document di is represented by a binary vector
– [di]j = 1 if word wj is in document di (the word is present in the document)
– [di]j = 0 if word wj is not in document di
Probabilistic methods: Basic model
! Based on the ranking, some documents are considered as relevant (R = 1)
! And some documents are considered as not relevant (R = 0)
! To a given user uk
Probabilistic methods: Basic model
! We define P(d = x | R = 1, uk)
– as the probability of observing a document d = x given that this document is relevant for user uk
! We will see that it is easy to estimate these probabilities for the binary independence model
Probabilistic methods: Basic model
! However, during the document retrieval phase, we are mainly interested in P(R = 1 | d = x)
! The larger this value, the more likely the document x is relevant
! This probability has to be computed for each document in the database
Probabilistic methods: Basic model
! Now, instead of computing P(R = 1 | d = x),
! it is easier to compute the odds
λ(x) = P(R = 1 | d = x) / P(R = 0 | d = x)
Probabilistic methods: Basic model
! The odds λ(x) is a monotonically increasing function of P(R = 1 | d = x)
! It therefore provides the same ranking
! The larger this value λ, the more likely the document is relevant
Probabilistic methods: Basic model
! Remember Bayes' law:
P(R = 1 | d = x) = P(d = x | R = 1) P(R = 1) / P(d = x)
Probabilistic methods: Basic model
! We can easily compute λ by assuming conditional independence between the words (dn is element n of vector d):
λ(x) ∝ Πn P(dn = xn | R = 1) / P(dn = xn | R = 0)
Probabilistic methods: Basic model
! And finally λ is proportional to this product of per-word likelihood ratios
! This is really a naive Bayes classifier
! The P(dn = 1 | R = 1) and P(dn = 1 | R = 0) are easy to compute
– Likelihoods estimated by frequencies
Probabilistic methods: Basic model
! The P(dn = 1 | R = 1) and P(dn = 1 | R = 0) are easy to compute
– Likelihoods estimated by frequencies
! They are estimated by the proportion of documents containing the word wn among the relevant/irrelevant documents, as sketched below
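A sketch of these estimates and of the resulting log-odds ranking (toy data; Laplace smoothing is added here to avoid zero probabilities, a detail the slides do not discuss):

    import numpy as np

    # rows = documents, columns = words, binary entries; R = relevance labels
    docs = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
    R = np.array([1, 1, 0, 0])

    p_rel = (docs[R == 1].sum(axis=0) + 1) / ((R == 1).sum() + 2)  # P(w_n = 1 | R = 1)
    p_irr = (docs[R == 0].sum(axis=0) + 1) / ((R == 0).sum() + 2)  # P(w_n = 1 | R = 0)

    def log_odds(x):
        # log of the product of per-word likelihood ratios (naive Bayes)
        return np.sum(x * np.log(p_rel / p_irr)
                      + (1 - x) * np.log((1 - p_rel) / (1 - p_irr)))

    print(log_odds(np.array([1, 0, 1])))   # the larger, the more likely relevant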
Probabilistic methods: Conclusion
! The binary independence probabilistic retrieval model makes strong assumptions about the independence of word occurrences
! More sophisticated models are available
– For instance, Poisson models can be used in order to take the number of occurrences of words in the document into account
– We can also take second-order interactions between words (correlations) into account
Assessment of document retrieval systems
! In general, we compute two measures:
– The precision
– The recall
! As well as the F-measure
Assessment of document retrieval systems
! The precision measure estimates the percentage of relevant retrieved documents in the set of all retrieved documents
– Precision indicates to what extent the retrieved documents are indeed relevant
Assessment of document retrieval systems
! The recall measure estimates the percentage of relevant retrieved documents in the set of all relevant documents
– Recall indicates to what extent the relevant documents are indeed retrieved
Assessment of document retrieval systems
[Figure: Venn diagram showing the set of retrieved documents, the set of relevant documents, and their intersection (retrieved and relevant).]
Assessment of document retrieval systems
! There is a trade-off between precision and recall
! The F-measure, which takes both precision and recall into account, is
F = 2 · (precision · recall) / (precision + recall)
(a direct implementation follows)
Link analysis techniques
Marco Saerens (UCL)
Cutting a graph in small pieces and exploring it
Cutting a graph in small pieces and exploring it
! A nice reference
– Sedgewick
– Algorithms in Java
– Addison-Wesley
Cutting a graph in small pieces and exploring it
! Many software solutions are readily available for cutting a graph
– See www.insna.org/INSNA/soft_inf.html
– It provides a number of pointers to software packages
Cutting a graph in small pieces and exploring it
! JUNG
– The Java Universal Network/Graph Framework
– Open source and written in Java
– Mainly a toolbox of methods
– http://jung.sourceforge.net
Cutting a graph in small pieces and exploring it
! Pajek
– A software package for large network analysis
– There is a companion book
– Quite powerful
– Free software
Cutting a graph in small pieces and exploring it
! UCINET
– Social network analysis software
– Written by active researchers in the social network community
– Quite complete commercial software
– http://www.analytictech.com
Cutting a graph in small pieces and exploring it
! The first step in analysing a graph is often to look at its connected components
– In an undirected graph, a connected component is a maximal connected subgraph
– A strongly connected component is the analogous concept for directed graphs
Cutting a graph in small pieces and exploring it
! Here is a graph with two connected components
[Figure: an undirected graph made of two disjoint connected subgraphs.]
Cutting a graph in small pieces and exploring it
! There exist linear-time algorithms for solving this problem
– Using, for instance, depth-first or breadth-first search, as sketched below
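A minimal breadth-first sketch (the adjacency list is a toy example):

    from collections import deque

    def connected_components(adj):
        seen, components = set(), []
        for start in adj:
            if start in seen:
                continue
            comp, queue = [], deque([start])   # BFS from each unvisited node
            seen.add(start)
            while queue:
                u = queue.popleft()
                comp.append(u)
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        queue.append(v)
            components.append(comp)
        return components

    adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}   # two components
    print(connected_components(adj))                    # [[1, 2, 3], [4, 5]]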
Cutting a graph in small pieces and exploring it
! An articulation point or vertex-cut is a node (vertex) of a graph such that
– its removal causes an increase in the number of connected components
Cutting a graph in small pieces and exploring it
! Standard linear-time algorithms are available for this problem, using, for instance,
– depth-first or breadth-first search
Cutting a graph in small pieces and exploring it
! A bridge or edge-cut is an arc (edge) of a graph such that
– its removal causes an increase in the number of connected components
Cutting a graph in small pieces and exploring it
! Standard linear-time algorithms are available for this problem, using, for instance,
– depth-first or breadth-first search
Identifying central or prestigious nodes by link analysis
Some link analysis books
! P. Baldi, P. Frasconi & P. Smyth (2003)
– Modeling the Internet and the Web
– John Wiley & Sons
Some link analysis books
! S. Chakrabarti (2003)
– Mining the Web
– Morgan Kaufmann
Some link analysis books
! A. Langville & C. Meyer (2006)
– Google's PageRank and Beyond
– Princeton University Press
Some link analysis books
! B. Liu (2006)
– Web Data Mining
– Springer
Some link analysis books
! Z. Markov & D. Larose (2007)
– Data Mining the Web
– John Wiley & Sons
Identifying central or prestigious nodes by link analysis
! The PageRank algorithm
! The HITS algorithm
! The SALSA algorithm
The PageRank algorithm
The basic PageRank algorithm
! Introduced by Page, Brin, Motwani & Winograd in 1998
! Partly implemented in Google
! Corresponds to a measure of « prestige » in a directed graph
Web link analysis
! A set of techniques
– Applied to: hyperlinked document repositories
– Typically web pages
! Objective:
– To exploit the link structure of the documents
– In order to extract interesting information
– Viewing the document repository as a directed graph where
• Nodes are documents
• Edges are directed links between documents
– It does not exploit the content of the pages!!
Web link analysis
! Suppose we performed a search with a search engine
! Objective: to improve the (content-based) ranking of the search engine
– Based on the graph structure of the web hyperlinks
– PageRank is computed off-line
The basic PageRank algorithm
! To each web page i we associate a score xi
– The score of page i is proportional to the weighted average of the scores of the pages pointing to page i
The basic PageRank algorithm
! Let wij be the weight of the link connecting page i to page j
– Usually, it is simply 0 or 1
– Thus, wij = 1 if page i has a link to page j; wij = 0 otherwise
! Let W be the matrix made of the elements wij
– Notice that this matrix is not symmetric
– We suppose that the graph is strongly connected
The basic PageRank algorithm
! In other words,
xi = Σj (wji / wj•) xj
– where wj• = Σk wjk is the outdegree of page j
The basic PageRank algorithm
! In other words, a page with a high score is a page that is pointed to by
– many pages,
– each having a high score
! Thus a page is an important page if
– it is pointed to by many important pages
The basic PageRank algorithm
! These equations can be updated iteratively until convergence
! In order to obtain the scores xi,
– we normalize the vector x at each iteration
! The pages are then ranked according to their score
The basic PageRank algorithm
! This definition has a nice interpretation in terms of random surfing
! We define the probability of following the link from page j to page i as
pji = wji / wj•
The basic PageRank algorithm
! We can write the updating equation as
xi(k+1) = Σj pji xj(k)
! And thus we can define a random surfer following the links according to the transition probabilities
The basic PageRank algorithm
! This is the equation of a Markov model of a random surf through the web
! This is exactly the same equation as before
The basic PageRank algorithm
! If we denote element i, j of the transition probability matrix P as pij,
! we thus have pij = wij / wi•
! And the equation can be rewritten as xi(k+1) = Σj pji xj(k)
The basic PageRank algorithm
! In matrix form, if the vector x has elements xi,
x(k+1) = PT x(k)
! The stationary distribution is given by x(k+1) = x(k), and thus
x = PT x
The basic PageRank algorithm
! xi can then be viewed as the probability of being at page i
– The solution to these equations is the stationary distribution of the random surf
– Which is the probability of finding the surfer on page i in the long run
! The « most probable page » is the best ranked
The basic PageRank algorithm
! The PageRank scores can be obtained
– By computing the left eigenvector of the matrix P corresponding to eigenvalue 1
– Which is the right eigenvector of PT
– Where P is the transition probability matrix of the Markov process
– Containing the transition probabilities
! If the graph is undirected, the scores are simply proportional to the degrees of the nodes
The basic PageRank algorithm
! For that purpose, we can use the well-known power method
! The computation of the dominant right eigenvector x of a matrix M iterates
x ← Mx, followed by x ← x / ||x||
! And, in our case, M = PT
Adjustments to the basic model
! However, there is a problem with
– Dangling nodes
– That is, nodes without any outgoing link
! In this case, the P matrix is no longer stochastic
– Rows do not sum to one
! Moreover, the graph could have separate components
Adjustments to the basic model
! One potential solution is to allow the surfer to jump to any node of the graph
– With some non-zero probability (teleportation)
! Thus, G = αP + (1 − α) eeT/n
– where G is called the Google matrix, e is a column vector full of 1's, and 0 < α < 1
Adjustments to the basic model
! Moreover, the dangling nodes are made absorbing
– That is, if i is a dangling node, pii = 1 and pij = 0 for i ≠ j
! So that the row sum is equal to 1
! And since there is a teleportation probability, random walkers will be able to escape from these dangling nodes
Adjustments to the basic model
! In this case,
– The matrix is stochastic
– The matrix is irreducible (no separate component)
– The matrix is aperiodic
! Then, there is a unique eigenvector associated to eigenvalue 1
! However, G is no longer sparse!
Computing PageRank
! The problem is thus to find the left eigenvector of G
– corresponding to the eigenvalue 1
– instead of P
– with the normalization eT x = 1
! One can use the standard power method
Computing PageRank
! Fortunately, the power method results in sparse matrix multiplications only:
x(k+1) = αPT x(k) + (1 − α) e/n
where x is normalized after each iteration, as sketched below
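A sketch of the iteration on a toy graph; note that, for simplicity, the dangling node is given uniform outgoing probabilities here, instead of being made absorbing as on the previous slides:

    import numpy as np

    W = np.array([[0, 1, 1, 0],        # toy adjacency matrix, w_ij = 1 if i links to j
                  [0, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0]], dtype=float)   # node 4 is dangling
    n = W.shape[0]
    out = W.sum(axis=1)
    P = np.where(out[:, None] > 0, W / np.maximum(out, 1)[:, None], 1.0 / n)

    alpha, x = 0.85, np.ones(n) / n
    for _ in range(100):
        x = alpha * P.T @ x + (1 - alpha) / n   # the only matrix product involves P
    print(x / x.sum())                          # PageRank scores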
Personalization in PageRank
! How can we favour some pages in a natural way (advertising, etc.)?
! Rather than using eeT/n,
! use evT where
– v > 0 is a probability vector (vTe = 1)
– which is called the personalization vector
! It contains the a priori probability of jumping to any page
Personalization in PageRank
! The Google matrix thus becomes G = αP + (1 − α) evT
– where v is provided a priori
The PageRank problem as a sparse linear system
! Here is an alternative formulation of the PageRank problem
– As a sparse linear system
The PageRank problem as a sparse linear system
! In other words, the problem is to solve x = GT x with eT x = 1
! Which has been shown (Del Corso, Gulli & Romani, 2005; Langville & Meyer, 2006) to be equivalent to solving the sparse system
(I − αPT) x = (1 − α) v
The HITS algorithm
The HITS algorithm
! Introduced by Kleinberg in 1998/1999
Web link analysis
! Suppose we performed a search with a search engine
– Compute the neighborhood graph from the retrieved documents
– Associated to the particular query
! Objective: to improve the ranking provided by the search engine
– Based on the graph structure of the web hyperlinks
The HITS algorithm
! The model proposed by Kleinberg is based on two concepts
– Hub pages
– Authority pages
! These are two categories of web pages
! These two concepts are strongly connected
The HITS algorithm
! Example: suppose we introduced the query "Car constructors"
[Figure: a bipartite graph; hub pages of people interested in car constructors (Prost, Schumacher) point to authority pages of car constructors (Ferrari, Renault, Ford).]
The HITS algorithm
! Hubs
– Link heavily to authorities
– A good hub points to many good authorities
– Hubs have very few incoming links
The HITS algorithm
! Authorities
– Do not link to other authorities
– A good authority is pointed to by many good hubs
– The main authorities on a topic are often in competition with one another
The HITS algorithm
! The objective is to detect good hubs and good authorities
– from the results of the search engine
! We therefore assign two numbers to each returned page i:
– A hub score, xhi
– An authority score, xai
The HITS algorithm
! Let wij be the weight of the link connecting page i to page j
– Usually, it is simply 0 or 1
– Thus, wij = 1 if page i has a link to page j; wij = 0 otherwise
! Let W be the matrix made of the elements wij
– Notice that this matrix is not symmetric
– We suppose that the graph is strongly connected
The HITS algorithm
! A possible procedure for computing the hub/authority scores (Kleinberg)
– A page's authority score is proportional to the sum of the hub scores of the pages that link to it: xai ∝ Σj wji xhj
– A page's hub score is itself proportional to the sum of the authority scores of the pages that it links to: xhi ∝ Σj wij xaj
The HITS algorithm
! In matrix form, xa ∝ WT xh and xh ∝ W xa
! And thus, xa ∝ WT W xa and xh ∝ W WT xh
The HITS algorithm
! Kleinberg used this iterative procedure in order to estimate the scores
– with a normalization at each step
– This is equivalent to computing the dominant eigenvectors of the matrices W WT and WT W
– To obtain respectively the vector of hub scores and the vector of authority scores (see the sketch below)
Links with principal components analysis
! This is exactly uncentered principal components analysis (PCA)
– The proof is based on the dual view of PCA
! As for multidimensional scaling
– View the set of rows of W (authorities) as a cloud of points in the column space
– View the set of columns of W (hubs) as a cloud of points in the row space
Links with principal components analysis
! Let us consider a data matrix X
! The first PCA axis u1, on which the data will be projected, is given by the eigensystem XT X u1 = λ1 u1 (uncentered PCA)
! Thus, the first projection axis corresponds to the dominant eigenvector of XT X
Links with principal components analysis
! Then, the first coordinate of the projected data is given by Xu1
– Which corresponds to the data vectors projected on the first principal axis, u1
! These are the PCA scores for the first principal axis
Links with principal components analysis
! Here is a sketch of the proof that HITS is equivalent to uncentered PCA
! Let us consider the adjacency matrix W as a data matrix
! We simply substitute X by W for computing the first principal axis (PCA), u1
Links with principal components analysis
! We pre-multiply the eigensystem WT W u1 = λ1 u1 by W, which gives (W WT)(W u1) = λ1 (W u1)
! Wu1 is therefore an eigenvector of WWT and thus contains the hub scores
! Since Wu1 is also the projection of the data on the first principal axis (= PCA scores),
! the hub scores are equal, up to a proportionality factor, to the uncentered PCA scores computed from the data matrix W
Links with principal components analysis
! The same result holds for the authority scores
! We now consider the transposed adjacency matrix WT as a data matrix
! And proceed as before
! The authority scores are equal, up to a proportionality factor, to the uncentered PCA scores computed from the data matrix WT
Links with principal components analysis
! Thus, the situation is exactly the same as for multidimensional scaling
– The dominant eigenvector of WWT represents the projection of the row vectors on the first principal component (hub scores)
– The dominant eigenvector of WTW represents the projection of the column vectors on the first principal component (authority scores)
! This procedure is also related to both
– Correspondence analysis
– A random walk (Markov) model through the graph
HITS' relationships to bibliometrics
! The HITS algorithm also has strong connections to bibliometrics research:
– Cocitation
– Coreference
! Cocitation occurs when two documents are both cited by the same third document
! Coreference occurs when two documents both refer to the same third document
HITS' relationships to bibliometrics
[Figure: two documents both referring to a common document i (coreference), and two documents both cited by a common document i (cocitation).]
HITS' relationships to bibliometrics
! C. Ding (2002) showed that W WT = Dout + Cref and WT W = Din + Ccit
– where Din is a diagonal matrix containing the indegree of each node, Din = Diag(w•j)
– Dout is a diagonal matrix containing the outdegree of each node, Dout = Diag(wi•)
– Ccit and Cref are the cocitation and coreference matrices
HITS' relationships to bibliometrics
! Thus,
– The hub matrix W WT is closely related to the coreference matrix
– The authority matrix WT W is closely related to the cocitation matrix
The SALSA algorithm
The SALSA algorithm
! Introduced by Lempel & Moran in 2000
– « A Stochastic Approach to Link Structure Analysis »
– Combines ideas from PageRank and HITS
The SALSA algorithm
! From the neighborhood graph, compute two sets of nodes
– The hub nodes
– The authority nodes
– View this as a bipartite graph
[Figure: a bipartite graph with hub nodes h1, h2 on one side and authority nodes a1, a2, a3 on the other.]
The SALSA algorithm
! From this bipartite graph, compute a Markov chain with
– Ph = (Din)−1 WT: the probabilities of jumping from an authority node to a hub node
– Pa = (Dout)−1 W: the probabilities of jumping from a hub node to an authority node
! Thus:
xh(k+1) = (Ph)T xa(k)
xa(k+1) = (Pa)T xh(k)
where xh, xa are the probability distributions over hub and authority nodes
The SALSA algorithm
! The transition probability matrix of the Markov chain
– Restricted to the authority nodes is Ph Pa
– Restricted to the hub nodes is Pa Ph
The SALSA algorithm
! The steady-state probability distributions of the two restricted Markov chains are the hub scores and the authority scores
– When removing dangling nodes
– When computing the steady state for each connected component
Betweenness centrality
Betweenness centrality
! Suppose we have a graph
Betweenness centrality
! This quantity measures to what extent a node is an important intermediary
! in explaining the relationship between two other nodes
! One way to measure betweenness relies on Freeman's algorithm
Betweenness centrality
! For every pair of nodes (i, j):
1. Compute the shortest path between the two nodes i and j
2. Add one to the count of each node lying on this shortest path
! The betweenness associated to each node is the number of times this node lies on a shortest path (see the sketch below)
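In practice one rarely codes this loop by hand; for instance, networkx computes the same shortest-path-based measure (using Brandes' fast algorithm rather than the naive pairwise loop above):

    import networkx as nx

    G = nx.Graph([(1, 2), (2, 3), (3, 4), (2, 5), (5, 4)])
    print(nx.betweenness_centrality(G, normalized=False))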
Betweenness centrality
! Example:
[Figure: a graph in which node i lies on many shortest paths between the other nodes and will therefore have a high betweenness score.]
Betweenness centrality
! Many extensions of this concept exist
– For instance, shortest paths can be replaced by random walkers
Collaborative recommendation
Collaborative recommendation
! Suppose we have tables connected by relationships
– Example: a movie database
Collaborative recommendation
! Computing similarities between people and movies allows us to suggest movies to watch or not to watch = collaborative recommendation
! Based on historical data (purchases)
Collaborative recommendation
! The most popular algorithms are based on a bipartite graph:
– Individual-based (user-based) nearest-neighbour algorithms
! We thus have individuals and items (the movies)
– Each individual i is represented by a vector vi in the item space
– Containing, as elements, 1 if individual i purchased the item, and 0 otherwise
– We call this vector the individual's profile vector
Collaborative recommendation
! We also compute the individuals–items frequency matrix W
– Containing as element wij
– Which indicates whether item j was purchased by individual i (binary values)
– Or, alternatively, represents the number of times individual i purchased item j, wij = n(i, j)
Collaborative recommendation
! We first compute a similarity between two individuals i and j on the basis of their profile vectors
! There is a large choice of such similarities
– For instance, for binary values, suppose a, b, c, and d count the items according to who purchased them: a = purchased by both i and j, b = purchased by i only, c = purchased by j only, d = purchased by neither
Collaborative recommendation
! Counting the number of 1's and 0's in common,
– some popular choices are, for instance, the Jaccard index a/(a + b + c) and the simple matching coefficient (a + d)/(a + b + c + d)
Collaborative recommendation
! A popular choice is for instance sim(i, j) = a/(a + b + c)
! Or the cosine similarity
sim(i, j) = viT vj / (||vi|| ||vj||)
where vi is a binary vector containing the items individual i purchased (or not)
– An adjusted cosine similarity has also been developed for collaborative recommendation
Collaborative recommendation
! From this similarity measure between individuals, we compute the k nearest neighbours (KNN) of individual i
– These are the individuals having similar tastes
! From these nearest neighbours, we compute some « predicted values » for the items
– The first items proposed to individual i are the ones with the highest predicted value
Collaborative recommendation
! The predicted value of item j for individual i is computed as the number of times item j has been purchased by i's neighbours
– Weighted by the similarities between the individuals:
x̂ij = Σp∈KNN(i) sim(i, p) apj
where apj is an element of the individuals–items frequency matrix A and we keep only the k nearest neighbours (see the sketch below)
Latent class model
! We now introduce the well-known latent class model
– for collaborative filtering
! We have
– a random variable, x, representing individuals (m in total)
– and a random variable, y, representing items (n in total)
Latent class model
! We also assume that there exists a latent, unobserved, class z
– Representing classes of interest (l in total) of users
– And classes of items
! The variables x, y are assumed independent given z (conditionally independent)
– The number of latent classes, l, must be provided a priori
Latent class model
! In this case, we are interested in the posterior probability distribution on the items for individual i, P(y | x = i)
! From which the most probable items will be recommended to the user
Latent class model
! The joint distribution of users and items is
P(x = i, y = j) = Σk P(z = k) P(x = i | z = k) P(y = j | z = k)
where we used the conditional independence relation
Latent class model
! The parameters of the model are defined as
– P(z = k) (class prior probabilities)
– P(x = i | z = k) (within-class emission of users)
– P(y = j | z = k) (within-class emission of items)
Latent class model
! The hidden, latent, variables will be the classes, z = k
! The observations (N in total) are
– The wij = n(i, j): the number of times the transaction between individual x = i and item y = j appears in the database
– That is, the number of times item y = j was purchased by individual x = i
Latent class model
! The posterior probability of the items can be computed from the joint distribution, in terms of the parameters:
P(y = j | x = i) = P(x = i, y = j) / Σj′ P(x = i, y = j′)
Latent class model
! The variable z is considered as an unobserved, hidden, latent variable
! The parameters are estimated by maximum likelihood
– But the likelihood function is difficult to maximize
! An EM algorithm is used in order to estimate the parameters
– The expectation step estimates the value of z for each (user, item) pair (the class membership of the transaction)
– The maximization step provides re-estimation formulas for the parameters
Latent class model
! First, compute the complete log-likelihood of the data, l (assuming the latent classes are known)
! The EM algorithm iterates two steps until convergence
– The expectation step aims at computing the expectation of l given the current value of the parameters Θ and the observations: E[l | x, y; Θ]
Latent class model
– The maximization step aims at maximizing the expectation of the complete log-likelihood in terms of the parameters (assuming that the expectations of the hidden variables are fixed)
! Iterate expectation-maximization until convergence
! It can be shown that this procedure increases the likelihood function at each iteration step
Latent class model
! First, the expectation step
– Where the values of the parameters are kept fixed
– The hidden variables are re-estimated from the current values of the parameters and the observations
• By computing the expectation of the complete log-likelihood conditionally on the observations and the current value of the parameters
Latent class model
! The complete likelihood is:
– Assuming an i.i.d. sample and that the latent classes are known for each observation t (each transaction),
L = Πt=1..N Πk=1..l [P(xt, yt, zt = k)]^δt(k)
  = Πt=1..N Πk=1..l [P(xt | zt = k) P(yt | zt = k) P(zt = k)]^δt(k)
– where the indicator variable δt(k) = 1 if transaction t belongs to latent class k
– and δt(k) = 0 if transaction t does not belong to latent class k
Latent class model
! And the complete log-likelihood is
l = log(L) = Σt=1..N Σk=1..l δt(k) log(P(xt | zt = k) P(yt | zt = k) P(zt = k))
Latent class model
! Therefore, the expectation of the complete log-likelihood is
E[l | x, y; Θ] = Σt=1..N Σk=1..l E[δt(k) | xt, yt; Θ] log(P(xt | zt = k) P(yt | zt = k) P(zt = k))
             = Σi=1..m Σj=1..n Σk=1..l n(i, j) E[δ(k) | x = i, y = j; Θ] log(P(x = i | z = k) P(y = j | z = k) P(z = k))
Latent class model
! Assuming an i.i.d. sample, the conditional expectation of δt(k), given the observations and the current parameter vector Θ, is
E[δ(k) | x, y; Θ] = P(z = k | x, y)
                  = P(x, y | z = k) P(z = k) / P(x, y)
                  = P(x | z = k) P(y | z = k) P(z = k) / Σk′=1..l P(z = k′) P(x | z = k′) P(y | z = k′)
Latent class model
! And thus the expectation step computes, for each pair (i, j),
P̂(z = k | x = i, y = j) = P(x = i | z = k) P(y = j | z = k) P(z = k) / Σk′ P(z = k′) P(x = i | z = k′) P(y = j | z = k′)
! Recall that we defined n(i, j) as the number of times a transaction between individual x = i and item y = j appears in the database
Latent class model
! Thus, the expectation of the complete log-likelihood reads
E[l | x, y; Θ] = Σi=1..m Σj=1..n Σk=1..l n(i, j) P̂(z = k | i, j) [log(P(x = i | z = k)) + log(P(y = j | z = k)) + log(P(z = k))]
where x and y are vectors containing the observations {xt} and {yt} in chronological order
Latent class model
! Second, the maximization step
– Where the values of the hidden variables are kept fixed
– The parameters are re-estimated from the current values of the hidden variables and the observations
• By computing the maximum of the expectation of the complete log-likelihood in terms of the parameters
Latent class model
! It can further be shown that the maximization step leads to
P̂(z = k) = Σi=1..m Σj=1..n P̂(z = k | x = i, y = j) n(i, j) / Σi′=1..m Σj′=1..n n(i′, j′)
Latent class model
! Intuitively, this makes sense since
P(z = k) = Σi Σj P(z = k | x = i, y = j) P(x = i, y = j)
and P(x = i, y = j) can be estimated by n(i, j)/N
Latent class model
! Now, in the same spirit,
P̂(x = i | z = k) = Σj P̂(z = k | x = i, y = j) n(i, j) / Σi′ Σj P̂(z = k | x = i′, y = j) n(i′, j)
Latent class model
! And finally,
P̂(y = j | z = k) = Σi P̂(z = k | x = i, y = j) n(i, j) / Σi Σj′ P̂(z = k | x = i, y = j′) n(i, j′)
Latent class model
! These update formulas are iterated until convergence of the parameters
– The likelihood function converges to a local maximum
! The predictions are then provided by P(y = j | x = i), as computed before (see the sketch below)
The ItemRank model
! We now introduce an extension of PageRank that can be used for collaborative recommendation
! It is based on the « random walk with restart » idea
– And was developed by Pucci and Gori in the framework of collaborative recommendation
The ItemRank model
! It is called ItemRank
! Assume a graph has been defined
– Nodes are defined for users and items
– There is a link between an item and a user if the user bought the item
– Weights quantify the number of times the item has been bought
The ItemRank model
! Assume a random walker on the graph, updated as xi(k+1) = αPT xi(k) + (1 − α)vi
– Where vi contains 1/ni for each item that user i bought, and 0 everywhere else
– And where ni is the number of bought items
– Where P is the transition probability matrix derived from the graph
The ItemRank model
! In other words, the random walker has a probability (1 – α)vi of being teleported to a bought item node
! The steady-state solution will be the similarity score associated to each item, for user i:
xi = (1 − α)(I − αPT)−1 vi
Nonnegative matrix factorization
! Let W ≥ 0 be the users-items matrix
! The elements of the matrix are approximated by
– the inner products of two latent feature vectors, u and v
– ui is the latent feature vector associated to user i, characterizing his tastes
– vj is the latent feature vector associated to item j, characterizing its attributes
Nonnegative matrix factorization
– ui and vj are constrained to be nonnegative
– Otherwise we could use an SVD
– This way, the interpretation of the elements is easier
– As for the SVD, the dimensionality of the latent space is assumed to be low
Nonnegative matrix factorization
! In other words, we have
wij ≈ uiT vj, or, in matrix form, W ≈ UVT
– where both U and V contain nonnegative elements so that they can be easily interpreted
! This leads, e.g., to the following problem:
minimize over U, V: ||W − UVT||²F, subject to U ≥ O and V ≥ O
Nonnegative matrix factorization
! A popular algorithm for finding the values of U, V is alternating least squares, which alternates between
minimize over U: ||W − UVT||²F, subject to U ≥ O, and
minimize over V: ||W − UVT||²F, subject to V ≥ O
– There are standard algorithms for solving each subproblem, converging to a local minimum
Nonnegative matrix factorization
! An even more popular and very simple algorithm alternates between
– minimizing ||W − UVT||²F over U without constraints, and setting all the negative elements of U to zero
– minimizing ||W − UVT||²F over V without constraints, and setting all the negative elements of V to zero
! but it lacks a proof of convergence
! However, it seems to work well in practice (see the sketch below)
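A sketch of this solve-and-clip heuristic with numpy least squares (toy data; the rank and the iteration count are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.random((6, 4))             # nonnegative users-items matrix (toy data)
    r = 2                              # dimensionality of the latent space
    U, V = rng.random((6, r)), rng.random((4, r))

    for _ in range(200):
        U = np.linalg.lstsq(V, W.T, rcond=None)[0].T   # minimize ||W - U V^T|| over U
        U = np.maximum(U, 0)                           # clip negative elements to zero
        V = np.linalg.lstsq(U, W, rcond=None)[0].T     # minimize ||W - U V^T|| over V
        V = np.maximum(V, 0)

    print(np.linalg.norm(W - U @ V.T))                 # reconstruction error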
Performance evaluation
! We now turn to the important problem of the assessment of recommender systems
! For the movie database:
– We delete the links to some watched movies in the training set
– These deleted watched movies form a test set
! The algorithm has to predict these watched movies (test set)
– Indeed, the watched movies should be well ranked!
Performance evaluation
! Then, use the standard information retrieval performance indicators:
– Recall
– Precision
! Always compare to a ranking algorithm based on maximum frequency
Collaborative recommendation: some extensions
! More general similarity measures have been developed
– Computing similarities between elements of remote tables
! One can also develop item-based algorithms
– Which are based on the items' nearest neighbours
– Item-based nearest neighbours
Reputation model
Marco Saerens
A reputation model
! We now introduce a simple reputation model
! Suppose there are some providers (nx in total)
– Who have some internal, latent, quality of service, q
– Who send an item with quality x
A reputation model
! There are also customers (ny in total) receiving the item or service
– An individual can be both a provider and a customer
– Through the provider/customer transaction
– These customers will rate the received item
– And provide a rating, y, which is observed
– The observed ratings are standardized in order to have zero mean and unit variance
A reputation model
! The quality xki of the item i sent by provider k will be normally distributed
– And centered on the quality of the provider, qk: xki ~ N(qk, (σxk)²)
A reputation model
! Thus, each provider is characterized by two features:
– his internal quality score, q, and
– his stability in providing a constant quality, σx
A reputation model
! The consumer l who ordered the item i will rate this transaction based on the inspection of its quality, xki
! Each consumer is characterized by three different features:
– his reactivity with respect to the quality of the provided item, a,
– his bias, b, and
– his stability in providing constant ratings for a fixed observed quality, σy
A reputation model
! The rating ykli given by consumer l for transaction i with provider k will be given by a linear regression model:
ykli = al xki + bl + εli, with εli ~ N(0, (σyl)²)
A reputation model
! The quantity al qk + bl
! is often called the expectation of the consumer
A reputation model
! The rating provided by a customer
! is directly proportional to the difference between
– the observed quality of the item
– and his expectation about the item
[Figure: provider k, with parameters qk and σxk, sends an item of quality xki to consumer l, who produces the rating ykli according to his parameters al, bl, and σyl.]
A reputation model
! The complete likelihood can be written as the product, over all transactions, of the densities p(xki) p(ykli | xki)
– Since the xki are unobserved, we will use an EM-like algorithm
A reputation model
! The complete log-likelihood l is then the sum, over all transactions, of the corresponding log-densities
A reputation model
! The expectation step provides a new estimate of the unobserved variable xki:
x̂kli = q̂k + [âl (σ̂xk)² / (âl² (σ̂xk)² + (σ̂yl)²)] · (ykli − (âl q̂k + b̂l))
A reputation model
! We define {k → l} as the set of transactions between provider k and consumer l, and nkl as the number of such transactions
A reputation model
! In the maximization step, the parameters of the provider (qk and σxk) are updated
– As the model is over-parametrized, we constrain the qk to be standardized (we standardize them at each iteration)
A reputation model
! In the maximization step, the parameters of the consumer (al, bl and σyl) are updated as well
A reputation model
! Iterating these expectation and maximization steps until convergence provides
– an estimate of the quality of each provider
– as well as of the other parameters
A simplified reputation model
! A simplified model is obtained by setting al = 1 (no reactivity parameter) and
! assuming that the provider always provides the same quality level
! In other words, the model is simply
ykli = qk + bl + εli
A simplified reputation model
! In this case, the re-estimation expressions are simpler:
q̂k = [Σl=1..nc (1/σ̂l²) Σi∈{k→l} (ykli − b̂l)] / [Σl′=1..nc nkl′/σ̂l′²], for all k
b̂l = [Σk=1..np Σi∈{k→l} (ykli − q̂k)] / [Σk′=1..np nk′l], for all l, and then center the b̂l
σ̂l² = [Σk=1..np Σi∈{k→l} (ykli − q̂k − b̂l)²] / [Σk′=1..np nk′l], for all l
A simplified reputation model
! With nkl being the number of transactions between k and l: nkl = Σi∈{k→l} 1
– Notice that, here, we prefer to center the bias (remove its mean) instead of the quality level:
b̂l ← b̂l − (1/nc) Σl′=1..nc b̂l′
Thank you !!!