Decision making
Marco Saerens (UCL)
Classification decision rules when costs are associated with errors (classifier-based decisions)
Optimal decisions based on classifiers’ outputs
• We assume the classifier provides estimates of the a posteriori probabilities of the classes: P(ωk | x)
• The outputs of the classifier will be denoted as P(ωk | x) ≈ ŷk(x)
• Which decision should we take based on these outputs?
Optimal decisions based on classifiers’ outputs
(1) Decision rule with only the prior information
– Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2
(2) Use of the class-conditional information
– The sea bass/salmon example
– Assume P(x | ω1) and P(x | ω2) describe the difference in lightness between populations of sea bass and salmon
Optimal decisions based on classifiers’ outputs
• By Bayes’ rule, we have
P(ωj | x) = P(x | ωj) P(ωj) / P(x)
• Thus, Posterior = (Likelihood × Prior) / Evidence
• In the case of two categories:
P(x) = Σ_{j=1}^{2} P(x | ωj) P(ωj)
Optimal decisions based on classifiers’ outputs
• Suppose we observe a feature vector x
• For this x, the probability of error is
P(error | x) = P(ω2 | x) if we decide ω1
P(error | x) = P(ω1 | x) if we decide ω2
• Thus, for minimizing the probability of error:
Decide ω1 if P(ω1 | x) > P(ω2 | x)
Decide ω2 if P(ω2 | x) > P(ω1 | x)
Optimal decisions based on classifiers’ outputs
• Therefore, for minimizing the probability of error, we should use the Bayes decision rule
Decide ωk = argmax_ω [P(ω1 | x), P(ω2 | x)]
• The probability of error is then
P(error | x) = min[P(ω1 | x), P(ω2 | x)]
• Stated otherwise,
Decide ωk = argmax_ω [P(x | ω1) P(ω1), P(x | ω2) P(ω2)]
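As a minimal sketch, the minimum-error rule above can be written in a few lines; the likelihoods and priors are hypothetical numbers, not taken from the slides.

```python
# Minimum-error Bayes decision: pick the class maximizing P(x|w_k) P(w_k).
# Illustrative likelihoods and priors (hypothetical values).
likelihoods = [0.6, 0.2]      # P(x | w1), P(x | w2) for the observed x
priors = [0.4, 0.6]           # P(w1), P(w2)

# Unnormalized posteriors P(x|w_k) P(w_k); dividing by P(x) does not change the argmax.
scores = [l * p for l, p in zip(likelihoods, priors)]
decision = max(range(len(scores)), key=lambda k: scores[k])  # 0-based class index

# The probability of error for this x is the losing normalized posterior.
posteriors = [s / sum(scores) for s in scores]
p_error = min(posteriors)

print(decision)   # class w1 (index 0), since 0.6*0.4 = 0.24 > 0.2*0.6 = 0.12
print(round(p_error, 3))
```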
Optimal decisions based on classifiers’ outputs
• Here is an illustration from Duda, Hart & Stork
Optimal decisions based on classifiers’ outputs
• Here is an example for a 5-category problem
Optimal decisions based on classifiers’ outputs
• Generalization of the preceding ideas
– Use more than two classes
– Allow actions/decisions, not only deciding on the class
• Allowing actions other than classification mainly opens the possibility of rejection
– Introduce a cost function that is more general than the probability of error
• The cost function states how costly each action taken is
Optimal decisions based on classifiers’ outputs
• Let us now deal intuitively with costs
– Let cij ≥ 0 be the cost incurred when the decision-maker decides ωj while the true class is ωi
• This cost is defined for all i, j
Optimal decisions based on classifiers’ outputs
• Suppose we observe a feature vector x
• For this x, the expected cost is
l1(x) = c11 P(ω1 | x) + c21 P(ω2 | x) if we decide ω1
l2(x) = c12 P(ω1 | x) + c22 P(ω2 | x) if we decide ω2
• Thus, for minimizing the expected cost:
Decide ω1 if l1(x) < l2(x)
Decide ω2 if l2(x) < l1(x)
• Stated otherwise,
Decide ωk = argmin_ω [l1(x), l2(x)]
Optimal decisions based on classifiers’ outputs
• Here is a more formal proof for q categories
• The random variable indicating the decision of the decision-maker is d
– It takes its values on the classes ωi
– For instance, d = ωi means that the decision-maker classifies the individual in class ωi
• The decision-maker designs/controls P(d | x)
– He controls the probability of assigning the observation x to each category
– This probability distribution is chosen in order to minimize some cost
Optimal decisions based on classifiers’ outputs
• Let cij ≥ 0 be the cost associated with the fact that
– The decision-maker assigns the observation to category j (d = ωj)
– But in reality, this observation belongs to i (y = ωi)
• The expectation of this cost, E[cost], is the “risk”:
Optimal decisions based on classifiers’ outputs
• If
– cij = 1 when i ≠ j
– cij = 0 when i = j
• We count the classification errors
• And the risk is the error rate
• The objective is to minimize the risk
Optimal decisions based on classifiers’ outputs
• If, as usual, x is the vector of features, we obtain
Optimal decisions based on classifiers’ outputs
• Since, conditionally on the available features x, the decision cannot depend on the true event, which has not yet happened
Optimal decisions based on classifiers’ outputs
• Thus:
• Since all the terms are positive, this is equivalent to minimizing the conditional risk for every x:
Optimal decisions based on classifiers’ outputs
• The vector u(x) corresponds to the decision-making or choice vector: assign x to class ωj with likelihood uj(x)
• Subject to the constraint
Optimal decisions based on classifiers’ outputs
• For optimal decision-making, we choose u(x) by
• where u*(x) is the optimal choice vector
Optimal decisions based on classifiers’ outputs
• The conditional risk can be rewritten as
• And since P(ωi | x) ≈ ŷi(x)
Optimal decisions based on classifiers’ outputs
• Thus
• By defining the expected cost when choosing class j as:
Optimal decisions based on classifiers’ outputs
• The problem is
• with
• This problem can easily be solved
Optimal decisions based on classifiers’ outputs
• The optimal choice is: assign x to the class k such that
• Indeed, the solution can be written as
• Thus, we should choose the class k corresponding to the minimal estimated lj(x)
Optimal decisions based on classifiers’ outputs
• Notice that this decision rule is independent of the origin of the costs
– If you add the same constant to every cost cij, nothing changes
– Thus, if some costs are negative, just add a constant to each of them to make them positive
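A minimal sketch of the minimum-expected-cost rule, with hypothetical posteriors and a deliberately asymmetric cost matrix, shows how costs can overturn the plain argmax-posterior decision:

```python
# Minimum-expected-cost decision: choose j minimizing l_j(x) = sum_i c[i][j] * P(w_i | x).
# Hypothetical posteriors and cost matrix (c[i][j] = cost of deciding w_j when truth is w_i).
posteriors = [0.7, 0.3]
costs = [[0.0, 1.0],    # truth w1: deciding w1 costs 0, deciding w2 costs 1
         [5.0, 0.0]]    # truth w2: deciding w1 costs 5 (much worse)

q = len(posteriors)
l = [sum(costs[i][j] * posteriors[i] for i in range(q)) for j in range(q)]
decision = min(range(q), key=lambda j: l[j])

# l = [1.5, 0.7]: the asymmetric costs flip the naive argmax-posterior choice to w2,
# even though P(w1 | x) = 0.7 is larger.
print(l, decision)
```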
Optimal decisions based on classifiers’ outputs
• In the case of a two-class problem (binary classification), the decision rule becomes
Optimal decisions based on classifiers’ outputs
• If the information is provided in terms of benefit/reward:
Cost = –Reward
Cost′ = Cost + constant ≥ 0
Optimal decisions based on classifiers’ outputs
• We now introduce a reject option
– That is, we do not want to classify the individual if the risk is too high
– In this case, we simply reject it (we do not classify it)
• The decision is to accept a feature vector x and assign it to class ωk if
lk(x) = minj[lj(x)] ≤ θ
• And to reject feature vector x if
lk(x) = minj[lj(x)] > θ
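The accept/reject rule can be sketched as follows; the conditional risks and the threshold are hypothetical numbers:

```python
# Decision with a reject option: classify only if the minimal conditional risk is <= theta.
def decide_with_reject(l, theta):
    """l: list of conditional risks l_j(x); returns a class index, or None to reject."""
    k = min(range(len(l)), key=lambda j: l[j])
    return k if l[k] <= theta else None

print(decide_with_reject([0.1, 0.8], theta=0.3))  # 0: accept, assign to w1
print(decide_with_reject([0.5, 0.8], theta=0.3))  # None: minimal risk too high, reject
```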
Optimal decisions based on classifiers’ outputs
• This is equivalent to defining a reject region with a constant conditional risk l0(x) = θ
• So that the optimal decision rule with reject option is to assign feature vector x to class ωk if lk(x) < lj(x), for j ≠ k and j = 0, 1, …, q
• In other words, we obtain the usual rule k = argminj[lj(x)], now including the reject action j = 0
Multi-criteria decisions: A brief overview of the Promethee method (multi-criteria preference modelling)
Pascal Francq (ULB) & Marco Saerens (UCL)
The Promethee method
• Let A be a set of alternatives
• Each element a belonging to A receives an evaluation
– According to each criterion fj (k criteria in total)
– The evaluation of a being fj(a)
• As an illustrative example,
– the set of elements are cars
– and the criteria are price, fuel consumption, comfort and power
The Promethee method
• Here are the data for the alternative cars
• Which car will I choose according to my personal preferences?
The Promethee method
• Here is the evaluation table
– For each alternative
– For each criterion
The Promethee method
• The main goal is to rank the set of alternatives A according to the user’s preferences
– By taking all of the criteria into account
– By modelling how the user’s preferences vary as a function of the value of each criterion
– By weighting the different criteria
The Promethee method – user’s preferences
• To model the user’s preferences, we define preference functions Pj(a,b)
– which provide the degree of preference of a over b
– for a given criterion fj
• Usually, Pj(a,b) is a non-decreasing function of the deviation d = fj(a) – fj(b)
The Promethee method – user’s preferences
• The function Pj(a,b) is chosen to be normalized
• When d ≤ 0, a is not preferred to b
• When d > 0, a is preferred to b
The Promethee method – user’s preferences
• Thus, when d ≤ 0, Pj(a,b) = 0 but Pj(b,a) could be positive
• For convenience, we therefore define the symmetrized function Hj
The Promethee method – user’s preferences
• Here are some symmetrized preference functions
– The q and p thresholds are called the indifference and preference thresholds, respectively
The Promethee method
• A multi-criteria preference index π(a,b) of a over b is defined
– Taking all the criteria into account
– As well as the preferences
• A weight wj ≥ 0 is associated with each criterion by the user
The Promethee method
• Thus, π(a,b) expresses to what extent a is preferred to b
• π(a,b) is computed for all pairs of alternatives (a,b) in A × A
– We will aggregate all these multi-criteria preference indexes
– Into one global index for each alternative
The Promethee method
• We now have to rank the alternatives according to the multi-criteria preference indexes
– Each alternative faces the (n – 1) others
• We therefore define
– The positive outranking flow
– The negative outranking flow
• Which express to what extent each alternative outranks all the others
The Promethee method
• The positive outranking flow is
• The negative outranking flow is
The Promethee method
• Negative outranking flow Φ– (incoming preference flow)
• Positive outranking flow Φ+ (outgoing preference flow)
The Promethee method
• At this stage, there are many ways of defining a ranking based on these positive and negative outranking flows
• Many versions of the Promethee methodology exist
• Another methodology, developed before Promethee, is Electre
The Promethee method
• Promethee II defines the ranking as the ordering simply provided by the net outranking flow
• Thus, a is preferred to b if and only if
The Promethee method
• When Φ(a) = Φ(b), the two alternatives a and b are indifferent
– And we cannot conclude that a is preferred to b or vice versa
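The whole Promethee II pipeline (preference function with thresholds q and p, weighted index π, outranking flows, net flow Φ) can be sketched as follows. The alternatives, criterion values and weights are hypothetical, and a linear preference function is assumed:

```python
# Sketch of Promethee II with a linear preference function (indifference threshold q,
# preference threshold p). All data below are hypothetical.
def pref(d, q=0.05, p=0.20):
    # Linear preference on the deviation d = f_j(a) - f_j(b):
    # 0 below q, 1 above p, linear in between.
    if d <= q:
        return 0.0
    if d >= p:
        return 1.0
    return (d - q) / (p - q)

def promethee2(F, w, q=0.05, p=0.20):
    """F[a][j]: value of criterion j (rescaled to [0, 1], to maximize) for alternative a."""
    n, W = len(F), sum(w)
    # Multi-criteria preference index pi(a, b): weighted average of the P_j(a, b).
    pi = [[sum(wj * pref(F[a][j] - F[b][j], q, p) for j, wj in enumerate(w)) / W
           for b in range(n)] for a in range(n)]
    phi_plus = [sum(pi[a][b] for b in range(n)) / (n - 1) for a in range(n)]   # outgoing flow
    phi_minus = [sum(pi[b][a] for b in range(n)) / (n - 1) for a in range(n)]  # incoming flow
    return [pp - pm for pp, pm in zip(phi_plus, phi_minus)]                    # net flow Phi

F = [[0.9, 0.2], [0.5, 0.8], [0.1, 0.5]]  # 3 cars, 2 rescaled criteria (hypothetical)
phi = promethee2(F, w=[1.0, 1.0])
ranking = sorted(range(len(F)), key=lambda a: -phi[a])  # best alternative first
print(ranking)
```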
The Promethee method
• For the cars example, each criterion is rescaled between 0 and 1
• We then set
– The preference threshold to p = 0.20
– The indifference threshold to q = 0.05
The Promethee method
• Let us first compute the solution when all criteria have the same weight = 1
The Promethee method
• Let us now recompute the solution when all criteria have weight = 1, except power, which has weight = 4
Dynamic programming and edit-distance
Marco Saerens (UCL)
Dynamic programming
• Suppose we have a lattice with N levels, with a local cost attached to each transition (figure omitted)
Dynamic programming
• The problem is to reach level N
• From level 1
• With minimum cost = a shortest-path problem
Dynamic programming
• Some definitions
• The local cost associated with the decision to jump to state sk = j at level k
• Given that we were in state sk–1 = i at level k–1
Dynamic programming
• The total cost of a path is
• The optimal cost when starting from state s0 is
Dynamic programming
• The optimal cost, whatever the initial state, is
• The optimal cost when starting from some intermediate state is
Dynamic programming
• Here are the recurrence relations allowing us to obtain the optimal cost:
Dynamic programming
• Graphically: from state sk = i, one can jump to states sk+1 = i+1, sk+1 = i or sk+1 = i–1 (figure omitted)
Dynamic programming
• Example (figure with transition costs omitted)
Dynamic programming
• Proof:
Dynamic programming
• In a symmetric way:
Dynamic programming
• Now, if there are jumps bypassing some levels,
• where … is the set of states to which there is a direct jump from sk = i
Dynamic programming
• To find the optimal path, we need to
‒ keep track of the previous state in each node (a pointer from the previous state to the current state)
‒ and use backtracking from the last node in order to retrieve the optimal path
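The forward recurrence with back-pointers and the final backtracking step can be sketched on a small hypothetical two-state lattice:

```python
# Dynamic-programming shortest path through a layered lattice (hypothetical costs).
def shortest_path(costs, n_states):
    """costs[k][i][j]: cost of jumping from state i at one level to state j at the next."""
    V = [0] * n_states            # optimal cost-to-reach each state at the first level
    back = []                     # back-pointers, one list per transition
    for c in costs:               # forward recurrence over the levels
        V_next, ptr = [], []
        for j in range(n_states):
            i_best = min(range(n_states), key=lambda i: V[i] + c[i][j])
            V_next.append(V[i_best] + c[i_best][j])
            ptr.append(i_best)    # remember the best predecessor of state j
        V = V_next
        back.append(ptr)
    # Backtracking from the cheapest final state retrieves the optimal path.
    j = min(range(n_states), key=lambda s: V[s])
    total, path = V[j], [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return total, path[::-1]

costs = [[[1, 4], [2, 3]],        # level 1 -> 2 (hypothetical local costs)
         [[2, 1], [5, 0]]]        # level 2 -> 3
total, path = shortest_path(costs, n_states=2)
print(total, path)  # 2 [0, 0, 1]
```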
Application to edit-distance
• Computation of a distance between two strings
– It computes the minimal number of insertions, deletions and substitutions
– That is, the minimum number of edit operations
• For transforming one character string x into another character string y
Application to edit-distance
• Also called the “Levenshtein distance”
• We thus have two character strings
• xi being character i of string x
Application to edit-distance
• The length of the string x is denoted by |x|
• In general, we have
• The substring of x, beginning at character i and ending at j, is defined by
Application to edit-distance
• We will now read the characters of x one by one
• In order to construct the string y
• To this aim, we define three editing operations:
– Insertion of a character in y (without taking any character from x)
– Deletion of a character from x (without concatenating it to y)
– Substitution of a character of y by one of x
Application to edit-distance
• First convention:
– Means that we have read (extracted) the first i–1 characters of x
– Thus, x has been stripped of its first i–1 characters
– They have been taken in order to construct y
Application to edit-distance
• Second convention:
– Means that the first j characters of y have been transcribed
• We progressively read the first characters of x in order to build y
Application to edit-distance
• It corresponds to a process with levels (steps) and states
– We will now apply dynamic programming
– One state is characterized by a couple (i, j)
Application to edit-distance
• Here is the formal definition of the three operations
– For the first two operations, we jump from level k to level k+1
– For the last one, we directly jump to level k+2 (one level is skipped)
Application to edit-distance
• This situation can be represented in a two-dimensional table
– One level is represented by (i + j) = constant
– One state is represented by (i, j)
– One operation corresponds to a valid transition in this table
Application to edit-distance
• Example of a table computing the distance between “livre” and “lire”
Application to edit-distance
• Each level corresponds to a diagonal of the table:
– (i + j) = constant
Application to edit-distance
• A cost (or penalty) is associated with each operation (insertion, deletion, substitution); for instance
• with
Application to edit-distance
• The dynamic programming formula can be applied to this problem:
– Initialization:
– Then:
Application to edit-distance
• And finally:
• This value is the edit-distance or Levenshtein distance
Application to edit-distance
• One example:
• dist(lire, livre) = 1
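A standard dynamic-programming implementation of this distance (unit costs assumed for all three operations) reproduces the value above:

```python
# Levenshtein edit distance by dynamic programming, with unit insertion,
# deletion and substitution costs (matching identical characters is free).
def edit_distance(x, y):
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                           # delete the first i characters of x
    for j in range(1, n + 1):
        D[0][j] = j                           # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[m][n]

print(edit_distance("lire", "livre"))  # 1: a single insertion of 'v'
```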
Application to edit-distance
• In order to find the optimal path (optimal sequence of operations)
– A pointer to the previous state has to be maintained for each cell of the table
– The full path can be found by backtracking from the final state
• Numerous extensions and generalizations of this basic algorithm have been developed
Application to edit-distance
• Notice that the related quantity, the longest common subsequence, lcs(x, y), can be obtained by (no proof provided)
dist(x, y) = lcs(x, x) + lcs(y, y) − 2 lcs(x, y)
= |x| + |y| − 2 lcs(x, y)
• and thus
lcs(x, y) = (|x| + |y| − dist(x, y)) / 2
Application to edit-distance
• Thus, since
– dist(lire, livre) = 1
• We have
– lcs(lire, livre) = 0.5 (4 + 5 – 1) = 4
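The identity can be checked directly against a small LCS dynamic program. Note that it holds exactly for the edit distance in which a substitution counts as a deletion plus an insertion (cost 2); on this example no substitution is needed, so it agrees with the unit-cost distance as well:

```python
# Longest common subsequence length by dynamic programming, used to verify
# lcs(x, y) = (|x| + |y| - dist(x, y)) / 2 on the "lire"/"livre" example.
def lcs(x, y):
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1       # extend the common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

x, y = "lire", "livre"
dist = 1  # dist(lire, livre), as computed on the slides
print(lcs(x, y))                      # 4 ("lire" itself)
print((len(x) + len(y) - dist) // 2)  # 4, matching the formula
```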
Markov decision processes: A short, informal, introduction (dynamic decisions)
Introduction to Markov decision processes
• We define the “Markov decision process” problem
• Or the deterministic/stochastic “shortest path problem”
Markov decision processes
• We have a set of states, S = {1, 2, …, n}
– st = k means that the process, or system, is in state k at time t
• In each state s = k, we have a set of admissible control actions, or decisions, U(k)
– So that u(k) ∈ U(k) is a control action available at state k
– The action only depends on the current state
Markov decision processes
• When we choose action u(st) at state st,
– A bounded cost c(u(st) | st) < ∞ is incurred
– The system jumps to state st+1 = k′ with a probability distribution:
P(st+1 = k′ | st = k, u(k) = i) = p(k′ | k, i)
• which only depends on the current state k and not on the past states (Markov assumption)
Markov decision processes
• The probability distribution is independent of the past
• If the problem is deterministic, the probability distribution reduces to a Kronecker delta
• We suppose the network of states does not contain any negative cycle
Markov decision processes
• The policy π is defined as the set of actions π = {u(1), u(2), …}
• Which will be chosen as optimal in a certain sense (network figure omitted)
Markov decision processes
• The goal is to reach a destination or goal state, s = d
• From some initial state, s0 = k0
• While minimizing the total expected cost
• The expectation is taken over the random variables
Markov decision processes
• Once we have reached the goal state, we stay there forever with no cost: p(d | d, i) = 1 and c(d | d) = 0
– That is, d is an absorbing state
• Thus an optimal policy π* is given by
Markov decision processes
• In other words, we have to determine the best policy π that minimizes Vπ(k0)
– That is, the best action associated with each state
• We will show that the optimal expected cost can be computed thanks to the recurrence relations
– Value-iteration algorithm
Markov decision processes
• Thus, we route the agents as fast as possible through the network (figure omitted)
Markov decision processes
• Here is a heuristic derivation of the optimality conditions
• This policy will be called the optimal policy
Markov decision processes
• We try to particularize/factorize with respect to t = 0 by computing the conditional probability and using the Markov property
Markov decision processes
• Thus, we obtain:
– these are Bellman’s equations
• The optimal action in state k is the action that exactly verifies this equation
Markov decision processes
• Which yields the value-iteration algorithm
• After convergence, the best action is then provided by
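A sketch of the value-iteration recurrence on a hypothetical two-state problem with one absorbing goal state d (states, costs and transition probabilities are all made up for illustration):

```python
# Value-iteration sketch for a goal-directed Markov decision process.
# p[k][i]: dict mapping next state k' -> probability, for action i in state k;
# c[k][i]: cost of action i in state k; state "d" is the absorbing goal, V(d) = 0.
def value_iteration(states, actions, p, c, goal, n_iter=100):
    V = {k: 0.0 for k in states}
    for _ in range(n_iter):
        for k in states:
            if k == goal:
                continue      # V(goal) stays 0: absorbing, zero cost
            V[k] = min(c[k][i] + sum(pr * V[kp] for kp, pr in p[k][i].items())
                       for i in actions[k])
    # After convergence, the best action minimizes the same expression.
    policy = {k: min(actions[k],
                     key=lambda i: c[k][i] + sum(pr * V[kp] for kp, pr in p[k][i].items()))
              for k in states if k != goal}
    return V, policy

# Toy problem: from state 0, action "a" reaches the goal with prob 0.5 (cost 1),
# action "b" reaches it surely (cost 1.8).
states = [0, "d"]
actions = {0: ["a", "b"]}
p = {0: {"a": {"d": 0.5, 0: 0.5}, "b": {"d": 1.0}}}
c = {0: {"a": 1.0, "b": 1.8}}
V, policy = value_iteration(states, actions, p, c, goal="d")
print(round(V[0], 3), policy[0])  # "b" wins: 1.8 versus 1 + 0.5 * V(0) = 2.0 for "a"
```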
Markov decision processes
• But this assumes that we know both
– The transition probabilities
– The rewards and the following states
• We want to design an algorithm that does not need this prior knowledge
– One that is able to learn by doing
Markov decision processes
• We therefore introduce a new quantity, Q, called the Q-value
– Which is the expected cost when choosing action i in state k and then relying on policy π
Markov decision processes
• And of course,
• Thus
Markov decision processes
• The equivalent of the value-iteration algorithm is thus
– And involves the expectation of (c(i|k) + min Q-values) in state k
Markov decision processes
• The idea is now to use stochastic approximation
– In order to adjust the Q-values
– When trying an action i in state k
– Observing cost c(i|k)
– And going to state k′
Markov decision processes
• Let us briefly introduce stochastic approximation
• Suppose you have to compute the expectation of a random variable z
Markov decision processes
• Thus, you are seeking E[z]
– We now design a procedure that relies on empirical realizations of z
• Thus, instead of computing the mathematical expectation,
– let us compute the mean based on empirical realizations z(t)
– appearing sequentially at each time step t
Markov decision processes
• The mean at time t, m(t), can be computed by
m(t) = (1/t) Σ_{τ=1}^{t} z(τ)
= (1/t) [ z(t) + Σ_{τ=1}^{t−1} z(τ) ]
= z(t)/t + ((t−1)/t) (1/(t−1)) Σ_{τ=1}^{t−1} z(τ)
= z(t)/t + ((t−1)/t) m(t−1)
= m(t−1) + (1/t) (z(t) − m(t−1))
Markov decision processes
• Thus, we have the updating rule
m̂ ← m̂ + α(t) (z(t) − m̂)
• When α(t) = 1/t, we compute the exact mean
• Stochastic approximation considers such updating rules
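The incremental-mean derivation above, executed on a few hypothetical samples:

```python
# Incremental mean as a stochastic-approximation update:
# m <- m + alpha(t) * (z(t) - m), with alpha(t) = 1/t giving the exact running mean.
samples = [2.0, 4.0, 9.0, 1.0]
m = 0.0
for t, z in enumerate(samples, start=1):
    m += (1.0 / t) * (z - m)

print(m)  # 4.0, identical to sum(samples) / len(samples)
```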
Markov decision processes
• One can show that they converge to the true expectation, E[z]
– when some conditions on α(t) are satisfied,
– but also on the probability distribution
• In the case of reinforcement learning, there is one approximation for each state k
Markov decision processes
• Since
Q(k, i) = c(i|k) + Σ_{k′} p(k′ | k, i) min_{j∈U(k′)} Q(k′, j)
= Σ_{k′} p(k′ | k, i) [ c(i|k) + min_{j∈U(k′)} Q(k′, j) ]
= E[ c(i|k) + min_{j∈U(k′)} Q(k′, j) ]
• The updating rule for the Q-values becomes
Markov decision processes
• The updating parameter 0 < αk(t) < 1 should decrease gradually, according to the theory of stochastic approximation
– So that the expectation of the Q-values is approximated when the iteration step t in node k increases
• This is the basis of reinforcement learning algorithms
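One step of this Q-value updating rule, in its cost-minimization form and with hypothetical states and numbers:

```python
# One Q-learning step: Q(k, i) <- Q(k, i) + alpha * (c(i|k) + min_j Q(k', j) - Q(k, i)),
# where k' is the state actually observed after trying action i in state k.
def q_update(Q, k, i, cost, k_next, alpha):
    target = cost + min(Q[k_next].values())   # observed cost plus best continuation
    Q[k][i] += alpha * (target - Q[k][i])

# Hypothetical two-state problem with actions "a" and "b".
Q = {0: {"a": 0.0, "b": 0.0}, 1: {"a": 2.0, "b": 5.0}}
q_update(Q, k=0, i="a", cost=1.0, k_next=1, alpha=0.5)
print(Q[0]["a"])  # 0 + 0.5 * (1 + min(2, 5) - 0) = 1.5
```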
Markov decision processes
• There are many generalizations of this technique
– For instance, there is not always a final destination state
• In that case, we can minimize
– The average cost per time unit
– The total discounted cost
Markov decision processes
• Total discounted cost:
• Which leads to the value-iteration algorithm
Markov decision processes
• The average cost per time unit criterion:
• Which is somewhat more complex to analyze
A short introduction to Bayesian networks
Bayesian networks
• The conditional independence of a random variable xi with respect to xj, given y, is defined as
• And thus
Bayesian networks
• A Bayesian network is a
– Directed, acyclic graph
– Each node corresponds to a random variable (and not a state, as in Markov models)
• From which some events/outcomes can occur
– An arc represents a causal relationship between two random variables
• The knowledge of the first variable provides some insight into the outcome of the second variable
Bayesian networks
• Each node xi is conditionally independent of its ancestors, given its parents
– Where … represents the set of parents of xi
– Thus, once the value of the parents is known, a node is independent of the predecessor nodes
Bayesian networks
• A Bayesian network is thus a model of the relationships between random variables
– Assume there are n random variables, {x1, x2, …, xn}
– Assume also that these random variables have been sorted according to their dependencies (an acyclic digraph induces a partial order)
– So that no predecessor of xi appears on the right of xi
– All the predecessors of xi appear on its left
Bayesian networks
• We are seeking the joint distribution
• We will show that the joint probability distribution factorizes as
– Where, by convention, if xi has no parent,
Bayesian networks
• Once the joint distribution is known
• The probability of any event can be computed by
– Marginalization (elimination of variables)
– Bayes’ law (conditioning)
Bayesian networks
• Marginalization
• Conditioning
Bayesian networks
• For instance, the naive Bayes classifier has the following form: the class y is the parent of the features x1, x2, …, xp (figure omitted)
Bayesian networks
• And thus we deduce that
• And therefore the a posteriori probabilities are provided by
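A sketch of this naive Bayes posterior computation, with hypothetical priors and per-feature likelihoods:

```python
# Naive Bayes posterior following the factorization above:
# P(y | x1..xp) is proportional to P(y) * prod_i P(xi | y). All numbers are hypothetical.
priors = {"c1": 0.6, "c2": 0.4}
# P(xi = observed value | y): one factor per feature, per class.
likelihoods = {"c1": [0.9, 0.2], "c2": [0.5, 0.7]}

unnorm = {y: priors[y] * likelihoods[y][0] * likelihoods[y][1] for y in priors}
Z = sum(unnorm.values())                        # marginal P(x), by marginalization
posterior = {y: unnorm[y] / Z for y in unnorm}
print(round(posterior["c1"], 3))  # 0.108 / (0.108 + 0.14), roughly 0.435
```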
Bayesian networks
• Let us show that the joint probability can be factorized in terms of the conditional probabilities
Bayesian networks
• Thus, only the probabilities given the parents need to be known
– This is a huge saving in comparison with the knowledge of the full joint distribution
Bayesian networks
• Here are some software packages dedicated to Bayesian network modelling
– Hugin
– Netica
– Bayes Net Toolbox for use in Matlab (open source)
Bayesian networks
• Let us consider a concrete, illustrative, toy example
– Taken from the software vendor Netica
Bayesian networks
• The a priori probabilities of the events are:
– Also called prevalences in medicine
Bayesian networks
• After some time, the doctor revised his prevalence estimations
Bayesian networks
• The doctor has to diagnose a new case
– He adds knowledge about the patient:
– He suffers from dyspnea
Bayesian networks
• Some more knowledge:
– He recently travelled in Asia
– The probability of tuberculosis increases
Bayesian networks
• Some more knowledge:
– The patient is a smoker
Bayesian networks
• The doctor decides to perform an x-ray:
– The x-ray is normal
Bayesian networks
• If the x-ray were abnormal, we would have
Bayesian networks
• Here is a more complex example in ecological system modelling
Bayesian networks
• Sophisticated algorithms are available for updating the probabilities
– Such as belief propagation
Bayesian networks
• For decision-making, utility and decision nodes can be added as well
– A utility function quantifies the preferences of a user
– It reflects the usefulness or desirability of the outcomes
– And thus his level of satisfaction
– According to some decisions that have to be made
Bayesian networks
• It is assumed that our decision u can influence the outcome of a random variable y
– Once we have taken decision u = i, the probability of observing outcome y = k is P(y = k | u = i)
• The utility of an outcome y = k is defined as U(k)
Bayesian networks
• For instance, the utility of money is often modelled as
Bayesian networks
• The expected utility of a decision u = i is therefore
– Where ny is the number of values (outcomes) of y
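The expected-utility formula, sketched with hypothetical outcome probabilities, utilities and decision names:

```python
# Expected utility of a decision: EU(u = i) = sum_k P(y = k | u = i) * U(k).
def expected_utility(p_outcomes, utility):
    return sum(p * utility[k] for k, p in enumerate(p_outcomes))

U = [100.0, -20.0]                       # utilities of outcomes y = 0 and y = 1
p_given_decision = {"act": [0.7, 0.3],   # P(y | u = act), hypothetical
                    "wait": [0.4, 0.6]}  # P(y | u = wait), hypothetical
best = max(p_given_decision, key=lambda i: expected_utility(p_given_decision[i], U))
print(best, round(expected_utility(p_given_decision["act"], U), 6))  # "act" wins: 64 vs 28
```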
Bayesian networks
• Let us take a very simple illustrative example
– The purpose is to help us decide whether or not to take our umbrella
– Based on weather forecasting information
Bayesian networks
• The decision node is a rectangle
• While the utility node is a six-sided figure
– It is called a satisfaction node in Netica
Bayesian networks
• The weather can be No_Rain or Rain
– We have to fill in our utility values
– Depending on the weather and the decision
Bayesian networks
• Initially, the network looks like this
– According to the a priori probabilities
Bayesian networks
• Now, if we listen to the weather forecast,
– And the forecast is sunny,
– the values are updated as
Bayesian networks
• Here is a more complex example
Bayesian networks
• The car could be good with probability 0.8 and bad with probability 0.2
• If the car is good, the profit will be €60 while, if bad, –€100
Bayesian networks
• Before buying the car, he has the option of performing one or two tests
– The first test has a 90% chance of returning positive if the car is good and 40% if bad
Bayesian networks
• If the first test returns positive, the second test has
– an 88.89% chance of returning positive if the car is good and 33.33% if the car is bad
• If the first test returns negative, the second test has
– a 100% chance of returning positive if the car is good and 44.44% if the car is bad
Bayesian networks
• The user has to encode this information
Bayesian networks
• The first test costs €9 while the second test costs €4
– Thus, both tests cost €13
Bayesian networks
• The first question is of course “Do I perform the test or not?”
– Based on the a priori probabilities of finding a good or bad car
– The utility of each decision about the tests: None, First only, and Both tests
Bayesian networks
• The result is shown here
– the expected utility of “Do Tests?” is printed in the node
– The best choice is “None” (no test at all)
Bayesian networks
• If you decide not to make any test (largest utility),
– Expected utilities automatically appear in the “Buy It?” node
Bayesian networks
• Thus, the best choice is to buy the car, with an expected profit (utility) of €28
• If we nevertheless decide to perform the first test, we obtain
Bayesian networks
• If we now indicate that the result of the first test is positive, the best decision is to buy and we obtain a profit of €35
Bayesian networks
• But if the result of the first test is negative, the best decision is not to buy and we obtain a profit of –€9 (the cost of the test)
Bayesian networks
• The main disadvantages of Bayesian networks are that
– All the conditional probabilities have to be estimated
– The utilities have to be estimated as well
• There are methods for estimating utilities based on fair lotteries
A short introduction to two-player game theory (conflictual decisions)
Two-persons games
• Assume two decision-makers x, y, each controlling his strategy
– The first decision-maker controls m decisions or actions, dx = i
– The second decision-maker controls n decisions or actions, dy = j
• A strategy is a probability distribution on the set of decisions
Two-persons games
• The vectors containing the probabilities are denoted as u, v
– For decision-makers x, y
– Thus we have ui = P(dx = i), i = 1…m
– And vj = P(dy = j), j = 1…n
• These vectors containing the probabilities of actions are called the strategies associated with each decision-maker
Two-persons games
• There is a reward (or payoff) associated with each pair of actions and its consequence for both decision-makers
• It represents the utility associated with each outcome
– aij ≥ 0 is the reward of decision-maker x associated with dx = i and dy = j
– bij ≥ 0 is the reward of decision-maker y associated with dx = i and dy = j
Two-persons games
• Assuming independence of the two players,
– the total expected reward for player x is
– the total expected reward for player y is
Two-persons games
• In matrix form, we have
• where A and B, containing the aij and bij, are the reward or payoff matrices
Two-persons games: first scenario
• Assume now that decision-maker y follows a fixed, known strategy v0
• The expected reward for x is
– That has to be maximized with respect to u
– with
Two-persons games: first scenario
• The optimal strategy is thus to always choose an action corresponding
– to the largest element(s) of the column vector Av0
• That is, pick the actions k such that
Two-persons games: first scenario
• In other words, if there is only one maximum,
• If there are many maxima, any probability distribution on these maxima is valid
Two-persons games: first scenario
• Let us nevertheless apply the necessary Karush-Kuhn-Tucker conditions of optimality
– For a constrained optimization problem
– for all i, j
Two-persons games: first scenario
• The necessary optimality conditions are
• with, for all i, j
Two-persons games: first scenario
• Applying these conditions to the problem at hand provides
• with
• By pre-multiplying the first equation by (u*)T, we obtain that ν is the expected reward:
Two-persons games: first scenario
• And also, for all i
– The first equation states that ν is the maximum element of Av0
– The second equation states that all the components of u* are zero except those for which Av0 reaches its maximum value, ν
Two-persons games: first scenario
• where [Av0]i is element i of Av0:
• We thus obtain the expected result!
Two-persons games: second scenario
• Let us now consider that both decision-makers are controlling their actions
• They both want to maximize their expected rewards
– This is often called a bimatrix game
Two-persons games: second scenario
• Computing the Karush-Kuhn-Tucker conditions provides, in addition to
– (u*)Te = 1 and (v*)Te = 1
– u* > 0 and v* > 0
• For both maximization problems:
Two-persons games: second scenario
• And
• Which means that each decision-maker’s strategy is the best response to the opponent’s strategy
Two-persons games: second scenario
• Where [.]i denotes the ith row of the column vector
• These are the necessary optimality conditions
Two-persons games: second scenario
• These solutions are called equilibrium solutions or saddle points
• There are efficient algorithms that are able to enumerate all the solutions, based on these necessary conditions
– This is called the linear complementarity problem
– It can be shown that there is always at least one solution to the problem
Two-persons games: second scenario •
Consider now an example – Suppose that you (a student) take the subway each day – And you don t want to pay (c €) every traject…
•
So you decide – To buy a ticket (B) with probability u1 – Not to buy a ticket (–B) with probability u2
178
Two-persons games: second scenario •
The inspector from the transit authority – Controls (C) your ticket with probability v1 – Does not control (–C) it with probability v2
– If the inspector controls (C) your ticket and you did not buy one, you get a fine of f €
– We further assume that the cost of controlling someone is e €
179
Two-persons games: second scenario •
Your (student) reward matrix is A:

         C    –C
A =  B  –c    –c
    –B  –f     0

The controller's reward matrix is B:

          C    –C
B =  B  c – e   c
    –B  f – e   0

180
Two-persons games: second scenario •
Assume now that the inspector decides – To control with a probability 0.1 – And not to control with a probability 0.9
•
The column vector Av is thus
[–c, –0.1 f]T
•
The optimal strategy is thus to always buy a ticket if –c > –0.1 f, or c < 0.1 f
– Which looks quite natural
181
Two-persons games: second scenario •
We now turn to the general problem – Where each decision-maker seeks his optimal strategy – We have to write the Karush-Kuhn-Tucker conditions
Two-persons games: second scenario •
Let us now compute the optimal equilibrium solution – The Karush-Kuhn-Tucker conditions tell us that if the strategy puts a nonzero probability on both actions, – Both components of BTu must reach the maximum value, νy:
183
Two-persons games: second scenario •
Assuming that both u1 and u2 are nonzero, we thus have to solve
– Which is a simple linear system of equations that allows us to find both u and νy
184
Two-persons games: second scenario •
The solution is
185
Two-persons games: second scenario •
The expected reward of the student is
•
This means that there is no benefit at all in behaving illegally!! – The expected reward is –c, exactly as if you always bought a ticket
186
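As a check, the equilibrium of this inspection game can be computed numerically. This is only a sketch, with hypothetical values for c, f and e; it uses the indifference conditions derived above (each player's mixed strategy makes the opponent indifferent between his two actions):

```python
import numpy as np

c, f, e = 2.0, 60.0, 1.0   # hypothetical fare, fine, control cost (f > c, f > e)

# Student's reward matrix A and inspector's reward matrix B
# (rows: B, -B for the student; columns: C, -C for the inspector)
A = np.array([[-c, -c],
              [-f, 0.0]])
B = np.array([[c - e, c],
              [f - e, 0.0]])

# Inspector's strategy makes the student indifferent: -c = -f v1 => v1 = c/f
v = np.array([c / f, 1 - c / f])
# Student's strategy makes the inspector indifferent:
# (c - e) u1 + (f - e) u2 = c u1  =>  u2 = e/f
u = np.array([1 - e / f, e / f])

print(A @ v)      # both entries equal -c: the student cannot do better
print(B.T @ u)    # both entries equal: the inspector is indifferent too
print(u @ A @ v)  # the student's expected reward at equilibrium is -c
```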
Two-persons games: third scenario •
Let us now suppose that A + B = 0, i.e. B = –A
• This is called a zero-sum game
– Notice that nothing changes when a constant value is added to each entry of A and B
– So it is equivalent to say that A + B = constant
187
Two-persons games: third scenario •
In that case, – decision-maker x maximizes uTAv – while decision-maker y maximizes –uTAv – Or, in other words, minimizes uTAv
188
Two-persons games: third scenario •
•
We can of course use the Karush-Kuhn-Tucker conditions in order to find the solution •
One important observation in this case is that, at the optimum,
– Meaning that the value ν is the same for both decision-makers x and y
189
Two-persons games: third scenario •
But there is another view on the problem – that leads to a linear programming formulation of the problem
190
Two-persons games: third scenario •
The aim of the first decision-maker x is to maximize uTAv
•
Thus we want uTAv ≥ ν – With ν as large as possible – For any strategy v of the opponent y
191
Two-persons games: third scenario •
But uTAv = Σj vj [ATu]j ≥ minj [ATu]j
• And since we want this condition to be true whatever the strategy v of the opponent, we must have that
[ATu]j ≥ ν for all j
• or, in matrix form, ATu ≥ ν e
192
Two-persons games: third scenario •
Now, let us define the rescaled vector w = u/ν (assuming ν > 0, which can always be ensured by shifting A)
– We thus have ATw ≥ e
• Now, since eTu = 1
– We also have eTw = 1/ν
193
Two-persons games: third scenario •
Which means that maximizing ν is equivalent to minimizing eTw
• Subject to the constraints ATw ≥ e and w ≥ 0
194
Two-persons games: third scenario •
Which corresponds to a linear programming problem – It provides the best strategy for decision-maker x
•
Indeed, once this problem is solved, the optimal strategy u for decision-maker x is obtained by simple rescaling: u = w/(eTw)
195
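As an illustration, this linear program can be handed to an off-the-shelf LP solver. The sketch below assumes scipy is available and uses a small hypothetical reward matrix with positive entries (so that ν > 0 and the rescaling w = u/ν is valid):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical zero-sum reward matrix for decision-maker x
# (entries shifted to be positive, which does not change the strategies)
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# min e^T w  subject to  A^T w >= e,  w >= 0   (with w = u / nu)
n = A.shape[0]
res = linprog(c=np.ones(n), A_ub=-A.T, b_ub=-np.ones(n),
              bounds=[(0, None)] * n)

nu = 1.0 / res.x.sum()   # value of the game
u = nu * res.x           # optimal mixed strategy for x
print(u, nu)             # [0.5 0.5] and value 2.0
```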
Two-persons games: third scenario •
The second decision-maker y proceeds symmetrically: with the rescaled vector z = v/ν, he has to solve the linear programming problem of maximizing eTz
• Subject to Az ≤ e and z ≥ 0
196
Two-persons games: third scenario •
Let us take an example – We will analyse O'Neill's card game
•
Alice and Bob each have four cards: – Ace, King, Queen, and Joker
• •
They simultaneously show a card •
Here are the rules
197
Two-persons games: third scenario •
Alice wins if – Both show an ace or – There is a mismatch of picture cards
•
Bob wins if – Both show the same picture card or – If one shows an ace and the other doesn’t
198
Two-persons games: third scenario •
The reward matrices are

         A  K  Q  J              A  K  Q  J
A =  A [ 1  0  0  0 ]    B =  A [ 0  1  1  1 ]
     K [ 0  0  1  1 ]         K [ 1  1  0  0 ]
     Q [ 0  1  0  1 ]         Q [ 1  0  1  0 ]
     J [ 0  1  1  0 ]         J [ 1  0  0  1 ]

– So that the sum A + B is always 1
199
Two-persons games: third scenario •
Thus, if we are seeking a strategy that puts a nonzero probability on every action, – We have Av* = ν e
– and the constraint eTv* = 1
• •
We solve (v*/ν) = A–1e
And find the normalization factor ν by eTv* = 1
200
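This computation takes a few lines with numpy (a sketch using Alice's reward matrix from the previous slides; rows and columns are ordered Ace, King, Queen, Joker):

```python
import numpy as np

# Alice's reward matrix (rows and columns ordered A, K, Q, J)
A = np.array([[1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

x = np.linalg.solve(A, np.ones(4))  # x = v* / nu, from A v* = nu e
nu = 1.0 / x.sum()                  # normalization: e^T v* = 1
v = nu * x
print(np.round(v, 6), round(nu, 6))  # [0.4 0.2 0.2 0.2] 0.4
```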
Two-persons games: third scenario •
The solution is u* = v* = [0.4, 0.2, 0.2, 0.2]T
201
A short introduction to social choice (collective decisions)
202
Social choice basics •
Assume now that m raters provide – a complete ranking – of n > 2 alternatives
•
This corresponds to social choice problems – such as choosing a representative by voting
203
Social choice basics •
Let us first consider a simple example – We have three alternatives, A, B, C – And six groups of raters, with the shares given below
•
The preference orders are as follows (preferred on top)

22%  23%  15%  29%  07%  04%
 A    A    B    B    C    C
 B    C    A    C    B    A
 C    B    C    A    A    B

204
Social choice basics • •
We easily see that A is the winner by plurality vote, with a 45% score against 44% for B •
But if we consider two-person contests, B would have defeated – A by 51% to 49% (B > A) – C by 66% to 34% (B > C)
•
Who is the winner?
205
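The tallies above are easy to verify with a few lines of Python (a sketch; the weights and the six preference orders are those of the example):

```python
# Shares of raters and their preference orders (most preferred first),
# from the example above
profiles = [(0.22, "ABC"), (0.23, "ACB"), (0.15, "BAC"),
            (0.29, "BCA"), (0.07, "CBA"), (0.04, "CAB")]

def pref(x, y):
    """Fraction of raters preferring alternative x to alternative y."""
    return sum(w for w, order in profiles if order.index(x) < order.index(y))

print(round(pref("B", "A"), 2))  # 0.51: B defeats A head-to-head
print(round(pref("B", "C"), 2))  # 0.66: B defeats C head-to-head
```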
Social choice basics •
A social choice procedure is a function whose – input is a set of rankings (preference lists) of the alternatives and – output is a non-empty subset of the alternatives (called the winners or the social choice)
•
Let us consider some social choice procedures
206
Social choice basics •
Plurality voting – Just count the number of times each alternative is ranked first – Then, choose the set of alternatives most often ranked first – Only the alternative ranked first matters for each rater
207
Social choice basics •
The Borda count – For each alternative, sum up points corresponding to its position in each rater's ranking – It gets n – 1 points if rated first (on top) – It gets 0 points if rated last
• This procedure accounts for the ranking position
208
Social choice basics •
The Hare system – It successively deletes the least desirable alternatives
• We iterate the two following rules: – First rule: if an alternative is ranked at the top (ranked first) in more than half of the preference lists, it is the winner – Second rule: otherwise, delete the alternative that is on top of the fewest preference lists from all the preference lists
209
Social choice basics •
Sequential pairwise voting with a fixed agenda – First decide a sequential order of one-to-one competitions involving the alternatives – Thus, the number of competitions is the number of alternatives minus 1 – For each competition, only keep the alternative that is most often preferred by the raters • That is, the alternative that most often precedes the other in the rankings
210
Social choice basics •
Dictatorship – Select a rater a priori – The winner will be the alternative ranked first by this rater, whatever the other raters prefer
211
Social choice basics •
Let us take a second example – Here are the preference lists (or ratings) for seven raters and five alternatives

R1  R2  R3  R4  R5  R6  R7
a   a   a   c   c   b   e
b   d   d   b   d   c   c
c   b   b   d   b   d   d
d   e   e   e   a   a   b
e   c   c   a   e   e   a

212
Social choice basics • • •
Plurality voting: the winner is a •
Borda count: the winner is b •
Hare system: the winner is c – First delete d, since it is the alternative that is least often ranked first – It does not appear on top at all
213
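The plurality and Borda winners can be checked mechanically (a sketch; the seven preference lists are those of slide 212, most preferred first):

```python
from collections import Counter

ratings = [list("abcde"), list("adbec"), list("adbec"), list("cbdea"),
           list("cdbae"), list("bcdae"), list("ecdba")]

# Plurality: the alternative most often ranked first
plurality = Counter(r[0] for r in ratings).most_common(1)[0][0]

# Borda count: n - 1 points for the top position, 0 for the last
n = len(ratings[0])
borda = Counter()
for r in ratings:
    for pos, alt in enumerate(r):
        borda[alt] += n - 1 - pos

print(plurality)                   # a
print(borda.most_common(1)[0][0])  # b
```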
Social choice basics •
We obtain

R1  R2  R3  R4  R5  R6  R7
a   a   a   c   c   b   e
b   b   b   b   b   c   c
c   e   e   e   a   a   b
e   c   c   a   e   e   a

• We thus delete e, which appears on top only once
214
Social choice basics •
We further obtain

R1  R2  R3  R4  R5  R6  R7
a   a   a   c   c   b   c
b   b   b   b   b   c   b
c   c   c   a   a   a   a

• We thus delete b, which appears on top only once
215
Social choice basics •
Then,

R1  R2  R3  R4  R5  R6  R7
a   a   a   c   c   c   c
c   c   c   a   a   a   a

• Since c appears on top of more than one half of the preference lists, it is the winner
216
Social choice basics •
Sequential pairwise voting: the winner is d for the agenda (a, b, c, d, e)
1. a versus b: b > a
2. b versus c: b > c
3. b versus d: d > b
4. d versus e: d > e
217
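A minimal sketch of sequential pairwise voting on the same seven preference lists, following the agenda (a, b, c, d, e):

```python
# Preference lists of the seven raters (most preferred first), slide 212
ratings = [list("abcde"), list("adbec"), list("adbec"), list("cbdea"),
           list("cdbae"), list("bcdae"), list("ecdba")]

def beats(x, y):
    """True when a strict majority of raters prefers x to y."""
    wins = sum(r.index(x) < r.index(y) for r in ratings)
    return wins > len(ratings) - wins

# Fixed agenda: a meets b, the survivor meets c, and so on
champ = "a"
for challenger in "bcde":
    if beats(challenger, champ):
        champ = challenger
print(champ)  # d
```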
Social choice basics •
Dictatorship: let the dictator be rater R7 – Then the winner is e
• •
Each procedure provides a different winner! •
Let us now enumerate four desirable properties of social choice procedures
218
Social choice basics • •
The Pareto criterion A social choice procedure is Pareto compliant if – for every pair (x, y) of alternatives: – If everyone prefers x to y, then y is not a social choice
219
Social choice basics •
The "Condorcet winner" criterion – The Condorcet winner is the alternative that defeats every other alternative in one-to-one contests – Exactly as for sequential pairwise voting, except that we compute the results of all the contests – It is therefore preferred to any other alternative – Often, there is no Condorcet winner
220
Social choice basics •
A social choice procedure satisfies the Condorcet winner criterion provided that – If there is a Condorcet winner, – Then it is the only social choice winner
221
Social choice basics •
The monotonicity criterion – A social choice procedure is monotone provided that the following holds: – If x is a social choice, then – If every rater moves x up one spot – Then x should still be a social choice
222
Social choice basics •
The independence of irrelevant alternatives criterion – If the social choice set includes alternative x but not alternative y – And one or more raters change their preferences by upward or downward shifts of alternatives other than x and y – But no one changes his mind about whether x is preferred to y or y is preferred to x – Then the social choice set should still not include y
223
Social choice basics •
Let us now have a look at the social choice procedures

            Pareto   C.W.C.   Monotone   I.I.A.
Plurality    Yes      No       Yes        No
Borda        Yes      No       Yes        No
Hare         Yes      No       No         No
Seq pair     No       Yes      Yes        No
Dictator     Yes      No       Yes        Yes

224
Social choice basics •
In fact, it has been shown that there is no social choice procedure, for more than two alternatives, that satisfies both – The independence of irrelevant alternatives criterion and – The Condorcet winner criterion
• This is a variant of the well-known Arrow impossibility theorem
225
Social choice basics • •
It can be proved from the Condorcet voting paradox • Which shows that – Even if there is a procedure for selecting a single preferred alternative – It could be that 90% of the raters are unhappy – In the sense that there exists an alternative that 90% agree is preferable to the social choice
226
Social choice basics •
Consider this example:

R1  R2  R3
a   c   b
b   a   c
c   b   a

– If a is the social choice, then R2 and R3 agree that c is better
– If b is the social choice, then R1 and R2 agree that a is better
– … => 2/3 of the raters are unhappy!
227
Social choice basics •
Let us now introduce another point of view on the problem – We will now define a distance measure between rankings, possibly containing ties – It was introduced by Kemeny & Snell
228
Social choice basics •
Suppose R1 and R2 are rankings of the same alternatives – The distance is computed from the set of all pairs of alternatives {a1, a2}, comparing their ordering in R1 and in R2
•
Intuitively, this distance counts the minimum number of one-position switches needed to obtain ranking R2 from ranking R1
– For complete orders, it corresponds to the bubble sort distance in computer science
229
Social choice basics • •
For a pair of alternatives {a1, a2}, let us define δR1R2(a1, a2) as equal to
= 0 if R1 and R2 agree in their ordering of (a1, a2)
= +2 if R1 and R2 do not agree in their ordering of (a1, a2)
= +1 if one ranking puts a1 over a2 (or a2 over a1) and the other ranking has a1 and a2 tied
• Then, d(R1, R2) is the sum of δR1R2(ai, aj) over all pairs of alternatives {ai, aj}
230
Social choice basics •
In mathematical notation, d(R1, R2) = Σ{ai,aj} δR1R2(ai, aj)
• Consider the example (R2 ties a and c):

R1    R2
a     a–c
b     b
c

• In this example, d(R1, R2) = δR1R2(a, b) + δR1R2(a, c) + δR1R2(b, c) = 0 + 1 + 2 = 3
231
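The Kemeny & Snell distance is straightforward to implement. Below is a sketch in Python, encoding a ranking as a dict mapping each alternative to its rank (equal ranks encode ties), and reproducing the example d(R1, R2) = 3:

```python
from itertools import combinations

def ks_distance(r1, r2):
    """Kemeny & Snell distance between two rankings given as
    {alternative: rank} dicts; equal ranks encode ties."""
    d = 0
    for a, b in combinations(r1, 2):
        s1 = (r1[a] > r1[b]) - (r1[a] < r1[b])  # -1, 0 (tie) or +1 in R1
        s2 = (r2[a] > r2[b]) - (r2[a] < r2[b])
        if s1 != 0 and s2 != 0 and s1 != s2:
            d += 2   # the two rankings order the pair in opposite ways
        elif s1 != s2:
            d += 1   # one ranking ties the pair, the other does not
    return d

R1 = {"a": 1, "b": 2, "c": 3}   # a > b > c
R2 = {"a": 1, "c": 1, "b": 2}   # a and c tied on top, then b
print(ks_distance(R1, R2))      # 3
```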
Social choice basics • •
Kemeny & Snell's distance is closely related to Kendall's tau, which measures the correlation between two rankings:

τ(R1, R2) = 1 – 2 d(R1, R2)/dmax

with dmax = n(n – 1), which corresponds to the maximal distance between two rankings
232
Social choice basics •
It can be shown that the matrix made of the τ(Rk,Rl) is positive semi-definite – It is therefore a valid kernel matrix – And contains inner products
233
Social choice basics • •
If we have a set of rankings, one could claim that the most relevant ranking is the median or the mean of these rankings – Thus, the ranking that is most central is the social choice
• This is the Kemeny choice
234
Social choice basics •
The mean ranking is the ranking R* that minimizes Σk d²(R*, Rk)
• The median ranking is the ranking R* that minimizes Σk d(R*, Rk)
235
Social choice basics •
The main drawback of this method is that it very quickly becomes computationally intractable – Finding the median ranking is NP-hard
236
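For tiny problems, the median (Kemeny) ranking can still be found by brute force over all n! candidate rankings, which makes the combinatorial explosion explicit. A sketch on the Condorcet-paradox profile, for complete tie-free rankings only (ties could be handled as on slide 230):

```python
from itertools import combinations, permutations

def ks_distance(r1, r2):
    """Kemeny & Snell distance for complete, tie-free rankings
    given as ordered tuples (most preferred first)."""
    p1 = {a: i for i, a in enumerate(r1)}
    p2 = {a: i for i, a in enumerate(r2)}
    return sum(2 for a, b in combinations(r1, 2)
               if (p1[a] < p1[b]) != (p2[a] < p2[b]))

# The three cyclic rankings of the Condorcet voting paradox
ratings = [("a", "b", "c"), ("c", "a", "b"), ("b", "c", "a")]

# Brute force over all n! rankings: feasible only for very small n,
# since the exact problem is NP-hard
median = min(permutations("abc"),
             key=lambda r: sum(ks_distance(r, s) for s in ratings))
print(median)  # one of several tied medians (the cycle makes them tie)
```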
Thank you !