Università di Milano-Bicocca
Laurea Magistrale in Informatica
Corso di
APPRENDIMENTO E APPROSSIMAZIONE
Prof. Giancarlo Mauri
Lecture 5 - Statistical Learning
Outline
Bayes theorem
MAP, ML hypotheses
Minimum Description Length principle
Optimal Bayes classifier
Naive Bayes classifier
Expectation Maximization (EM) algorithm
Two roles for Bayesian learning
Provides practical learning algorithms
 Naive Bayes learning
 Bayesian belief network learning
 Combines prior knowledge (prior probabilities) with observed
data
 Requires prior probabilities
Provides useful conceptual framework
 “Gold standard” for evaluating other learning algorithms
 Additional insight into Occam’s razor
Bayesian learning
Advantages
 No hypothesis is eliminated, even if inconsistent with data
 Each hypothesis h is given a probability P(h) to be the correct
one
P(h) is incrementally modified after seeing an example
Hypotheses that make probabilistic predictions (e.g., for medical diagnosis) are allowed
Drawbacks
 Not easy to estimate prior probabilities
 Huge computational cost in the general case
Bayesian learning
View learning as updating of a probability distribution over the hypothesis space H
H is a (random) hypothesis variable, with values h1, h2, …
Start with the prior distribution P(H)
The jth observation gives the outcome dj of the random variable Dj; training data D = {d1, d2, …, dN}
Use Bayes theorem to compute the posterior probability of each hypothesis
Bayes theorem
Allows us to compute the probability of each hypothesis, given the training sample D and the prior probabilities:
P(h|D) = P(D|h) P(h) / P(D)
P(h) = prior probability of hypothesis h
P(D) = probability of training set D
P(h|D) = posterior probability of h given D
P(D|h) = likelihood of D given h
N.B.: P(h|D) increases with P(D|h) and P(h), and decreases as P(D) increases
Bayesian prediction
Prediction based on weighted average of likelihood wrt
hypotheses probabilities
P(X|D) = Σi P(X|D,hi) P(hi|D) = Σi P(X|hi) P(hi|D)
No need to pick one best-guess hypothesis!
Example
A patient takes a laboratory test for a tumour marker. We know that:
- the incidence of the disease over the whole population is 8 per thousand (prior probability)
- the test gives a positive result in 98% of the cases in which the disease is present (so 2% false negatives)
- it gives a negative result in 97% of the cases in which the disease is absent (so 3% false positives)
Example
We have the following prior and conditional probabilities:
P(c) = 0.008      P(¬c) = 0.992
P(+|c) = 0.98     P(-|c) = 0.02
P(+|¬c) = 0.03    P(-|¬c) = 0.97
If the test for our patient comes out positive (D = +), what are the posterior probabilities that he does or does not have cancer?
P(+|c) P(c) = 0.98 × 0.008 = 0.0078 ∝ P(c|+)
P(+|¬c) P(¬c) = 0.03 × 0.992 = 0.0298 ∝ P(¬c|+)
Example
Dividing by 0.0078 + 0.0298 to normalize to 1, we obtain:
P(c|+) = 0.21
P(¬c|+) = 0.79
This is a counterintuitive result, explained by the fact that the false positives, over a population that is overwhelmingly healthy, become very numerous
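A minimal Python sketch of this computation (the numbers are those given above; variable names are just illustrative):

# Posterior probability of cancer given a positive test, via Bayes' theorem.
p_c = 0.008               # prior P(c)
p_not_c = 1 - p_c         # P(not c)
p_pos_given_c = 0.98      # P(+|c), so 2% false negatives
p_pos_given_not_c = 0.03  # P(+|not c), the false-positive rate

# Unnormalized posteriors P(+|h) P(h)
joint_c = p_pos_given_c * p_c              # ~0.0078
joint_not_c = p_pos_given_not_c * p_not_c  # ~0.0298

# Normalize by P(+) = sum of the two joints
p_c_given_pos = joint_c / (joint_c + joint_not_c)
print(round(p_c_given_pos, 2))       # ~0.21
print(round(1 - p_c_given_pos, 2))   # ~0.79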
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?
Posterior probability of hypotheses
[Figures: posterior probability P(hi|d) of each hypothesis, and the prediction probability for the next candy, plotted against the number of samples in d (from 0 to 10)]
Posterior probability of hypotheses
In the limit, the correct hypothesis will dominate the prediction, independently of the prior distribution, provided that the correct hypothesis is not given zero prior probability
The Bayesian prediction is optimal, i.e. it will be correct more often than any other prediction
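A short Python sketch of the posterior update and Bayesian prediction for the candy-bag example; it assumes, as in the figures, that every observed candy is lime:

# Hypotheses h1..h5: prior probability and fraction of lime candies in the bag.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]

def posteriors(n_limes):
    """P(hi | d) after observing n_limes lime candies in a row."""
    joint = [p * (f ** n_limes) for p, f in zip(priors, lime_fraction)]
    total = sum(joint)
    return [j / total for j in joint]

for n in range(11):
    post = posteriors(n)
    # Bayesian prediction: P(next candy is lime | d) = sum_i P(lime|hi) P(hi|d)
    p_lime = sum(f * p for f, p in zip(lime_fraction, post))
    print(n, [round(p, 3) for p in post], round(p_lime, 3))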
MAP approximation
Summing over the hypothesis space is often intractable (e.g.,
18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose the most probable hypothesis given the training data
hMAP = arg max h∈H P(h|D)
     = arg max h∈H P(D|h) P(h) / P(D)
     = arg max h∈H P(D|h) P(h)
     = arg max h∈H (log P(D|h) + log P(h))
N.B. Log terms can be viewed as (negative of) bits to encode data given hypothesis + bits to encode hypothesis (basic idea of minimum description length (MDL) learning)
For deterministic hypotheses, P(D|h) is 1 if h is consistent with D, 0 otherwise
⇒ MAP = simplest consistent hypothesis
Example
[Figure: posterior probabilities of the hypotheses against the number of samples in d]
After three (lime) samples, h5 becomes the MAP hypothesis and is used for prediction with probability 1
Brute force MAP learner
1. For each h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = arg max h∈H P(h|D)
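A minimal sketch of this brute-force procedure, assuming the prior P(h) and the likelihood P(D|h) are available as Python functions (the function names are illustrative):

def brute_force_map(hypotheses, prior, likelihood, data):
    """Return (h_MAP, posterior distribution over the hypotheses)."""
    # Unnormalized posteriors P(D|h) P(h) for every h in H
    joint = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(joint.values())               # P(D)
    posterior = {h: j / p_data for h, j in joint.items()}
    h_map = max(posterior, key=posterior.get)  # arg max_h P(h|D)
    return h_map, posterior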
Relation to Concept Learning
Consider our usual concept learning task
 instance space X, hypothesis space H, training examples D
Consider the FIND-S learning algorithm (it outputs the most specific hypothesis from the version space VSH,D)
What would Bayes rule produce as the MAP hypothesis?
Does Find-S output a MAP hypothesis?
Relation to Concept Learning
Assume a fixed set of instances ‹x1,…,xm›
Assume D is the set of classifications D = ‹c(x1),…,c(xm)›
Let P(D|h) = 1 if h is consistent with D, 0 otherwise
Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H
Then
P(h|D) = 1/|VSH,D| if h is consistent with D, 0 otherwise
Evolution of posterior probabilities
ML approximation
Maximum likelihood (ML) hypothesis
If prior probabilities are uniform (P(hi) = P(hj) ∀ i,j), choose the h that maximizes the likelihood of D
hML = arg max h∈H P(D|h)
For large data sets, the prior becomes irrelevant
ML is the “standard” (non-Bayesian) statistical learning method
I.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity)
ML parameter learning
Bag from a new manufacturer; what is the fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and α = N−c limes
These are i.i.d. (independent, identically distributed) observations, so
P(D|hθ) = ∏j=1..N P(dj|hθ) = θ^c (1−θ)^α
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(D|hθ) = log P(D|hθ) = Σj=1..N log P(dj|hθ) = c log θ + α log(1−θ)
dL(D|hθ)/dθ = c/θ − α/(1−θ) = 0  ⇒  θ = c/(c+α) = c/N
Seems sensible, but causes problems with 0 counts!
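A small numerical sketch of the same estimate, comparing the closed form θ = c/N with a grid search over the log-likelihood (the counts are made up for illustration):

import math

c, N = 3, 10                 # 3 cherries out of 10 candies unwrapped
limes = N - c

def log_likelihood(theta):
    # L(D|h_theta) = c log(theta) + (N-c) log(1-theta)
    return c * math.log(theta) + limes * math.log(1 - theta)

# Closed form: theta_ML = c / N
print(c / N)                            # 0.3

# Numerical check on a grid of candidate theta values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=log_likelihood))    # ~0.3, the same value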
Learning a Real Valued Function
Consider any real-valued target function f
Training examples ‹xi, di›, where di is a noisy training value:
di = f(xi) + ei
ei is a random variable (noise) drawn independently for each xi according to a Gaussian distribution with mean 0
Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
hML = arg min h∈H Σi=1..m (di − h(xi))²
(maximizing the Gaussian log-likelihood, whose data-dependent part is Σi −(di − h(xi))²/2σ², is the same as minimizing the sum of squared errors)
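A short numerical illustration of this equivalence on made-up data, with a tiny hypothesis class (lines through the origin, the slope w being the only parameter): the slope that maximizes the Gaussian log-likelihood is also the one that minimizes the sum of squared errors.

import math

# Toy data: d_i = f(x_i) + e_i with Gaussian noise (values are illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ds = [1.1, 1.9, 3.2, 3.9]
sigma = 0.5

def sse(w):                     # sum of squared errors for h(x) = w*x
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_lik(w):                 # Gaussian log-likelihood of the same hypothesis
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in zip(xs, ds))

grid = [i / 1000 for i in range(500, 1500)]
print(min(grid, key=sse))       # slope minimizing the squared error
print(max(grid, key=log_lik))   # slope maximizing the log-likelihood: the same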
Learning a real valued function
Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
hMDL = arg min h∈H  LC1(h) + LC2(D|h)
where LC(x) is the description length of x under encoding C
Example: H = decision trees, D = training data labels
- LC1(h) is the number of bits to describe tree h
- LC2(D|h) is the number of bits to describe D given h
- Note LC2(D|h) = 0 if the examples are classified perfectly by h: we need only describe the exceptions
- Hence hMDL trades off tree size for training errors
Minimum Description Length Principle
hMAP = arg max h∈H P(D|h) P(h)
     = arg max h∈H (log2 P(D|h) + log2 P(h))
     = arg min h∈H (−log2 P(D|h) − log2 P(h))      (1)
Interesting fact from information theory:
The optimal (shortest coding length) code for an event with probability p is −log2 p bits.
So interpret (1):
−log2 P(h) is the length of h under the optimal code
−log2 P(D|h) is the length of D given h under the optimal code
Prefer the hypothesis that minimizes
length(h) + length(misclassifications)
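A tiny sketch of this reading of MAP as description-length minimization: score each hypothesis by −log2 P(h) − log2 P(D|h) and pick the smallest (the hypotheses and probabilities below are placeholders):

import math

def description_length(p_h, p_d_given_h):
    """Bits to encode h plus bits to encode D given h, under optimal codes."""
    return -math.log2(p_h) - math.log2(p_d_given_h)

# Placeholder hypotheses: (prior P(h), likelihood P(D|h))
candidates = {"h_small_tree": (0.30, 0.010),
              "h_big_tree":   (0.01, 0.200)}
h_mdl = min(candidates, key=lambda h: description_length(*candidates[h]))
print(h_mdl)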
Most Probable Classification of New Instances
So far we’ve sought the most probable hypothesis given the data D (i.e., hMAP)
Given new instance x, what is its most probable classification?
 hMAP(x) is not the most probable classification!
Consider three possible hypotheses:
P(h1|D) = 0.4,  P(h2|D) = 0.3,  P(h3|D) = 0.3
Given new instance x,
h1(x) = +, h2(x) = -, h3(x) = -
What’s the most probable classification of x?
Bayes Optimal Classifier
Bayes optimal classification:
arg max vj∈V  Σhi∈H P(vj|hi) P(hi|D)
Example:
P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(-|h2) = 1,  P(+|h2) = 0
P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0
Therefore
Σhi∈H P(+|hi) P(hi|D) = .4
Σhi∈H P(-|hi) P(hi|D) = .6
and
arg max vj∈{+,-}  Σhi∈H P(vj|hi) P(hi|D) = -
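A minimal sketch of the Bayes optimal classification rule applied to the numbers above:

# Posterior of each hypothesis and its class-conditional predictions P(v|h)
post = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_v_given_h = {"h1": {"+": 1.0, "-": 0.0},
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(values=("+", "-")):
    # arg max_v  sum_h P(v|h) P(h|D)
    score = {v: sum(p_v_given_h[h][v] * post[h] for h in post) for v in values}
    return max(score, key=score.get), score

print(bayes_optimal())   # picks '-', with scores .4 for '+' and .6 for '-'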
Gibbs Classifier
Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
E[errorGibbs] ≤ 2 E[errorBayesOptimal]
Suppose a correct, uniform prior distribution over H; then
- pick any hypothesis from VS, with uniform probability
- its expected error is no worse than twice that of the Bayes optimal classifier
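A minimal sketch of the Gibbs procedure, assuming the posterior P(h|D) is given and each hypothesis is a callable that classifies an instance:

import random

def gibbs_classify(x, hypotheses, posterior, rng=random):
    """Sample one hypothesis h ~ P(h|D) and classify x with it."""
    hs = list(hypotheses)
    weights = [posterior[h] for h in hs]
    h = rng.choices(hs, weights=weights, k=1)[0]  # random draw from P(h|D)
    return h(x)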
Naive Bayes Classifier
Along with decision trees, neural networks and nearest neighbour, one of the most practical learning methods.
When to use:
- a moderate or large training set is available
- the attributes that describe the instances are conditionally independent given the classification
Successful applications:
- diagnosis
- classifying text documents
Naive Bayes Classifier
Assume a target function f : X → V, where each instance x is described by attributes ‹a1, a2, …, an›.
The most probable value of f(x) is:
vMAP = arg max vj∈V P(vj|a1,…,an)
     = arg max vj∈V P(a1,…,an|vj) P(vj) / P(a1,…,an)
     = arg max vj∈V P(a1,…,an|vj) P(vj)
Naive Bayes assumption:
P(a1,…,an|vj) = ∏i P(ai|vj)
which gives the Naive Bayes classifier:
vNB = arg max vj∈V P(vj) ∏i P(ai|vj)
Naive Bayes Algorithm
Naive_Bayes_Learn (examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)
Classify_New_Instance (x)
  vNB = arg max vj∈V P̂(vj) ∏i P̂(ai|vj)
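A compact Python sketch of this learner for discrete attributes, using plain relative-frequency estimates (no smoothing; see the m-estimate two slides below):

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)      # (position, class) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, v)][a] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    def cond(i, a, v):                      # estimate of P(a_i = a | v)
        return attr_counts[(i, v)][a] / class_counts[v]
    return priors, cond

def naive_bayes_classify(x, priors, cond):
    def score(v):                           # P(v) * prod_i P(a_i | v)
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond(i, a, v)
        return p
    return max(priors, key=score)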
Naive Bayes: Example
Consider PlayTennis again and the new instance
‹Outlk = sun, Temp = cool, Humid = high, Wind = strong›
We want to compute:
vNB = arg max vj∈{y,n} P(vj) P(sun|vj) P(cool|vj) P(high|vj) P(strong|vj)
P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
vNB = n
Naive Bayes: Subtleties
1. The conditional independence assumption is often violated:
P(a1,…,an|vj) = ∏i P(ai|vj) does not hold in general
- …but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(vj|x) to be correct; we need only that
arg max vj∈V P̂(vj) ∏i P̂(ai|vj) = arg max vj∈V P(vj) P(a1,…,an|vj)
- See [Domingos & Pazzani, 1996] for an analysis
- Naive Bayes posteriors are often unrealistically close to 1 or 0
Naive Bayes: Subtleties
2. What if none of the training instances with target value vj have attribute value ai? Then
P̂(ai|vj) = 0, and so P̂(vj) ∏i P̂(ai|vj) = 0
Typical solution is a Bayesian estimate (m-estimate) for P(ai|vj):
P̂(ai|vj) ← (nc + m·p) / (n + m)
where:
- n is the number of training examples for which v = vj
- nc is the number of examples for which v = vj and a = ai
- p is a prior estimate for P(ai|vj)
- m is the weight given to the prior (i.e. the number of “virtual” examples)
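A one-function sketch of this estimate (argument names follow the definitions above):

def m_estimate(nc, n, p, m):
    """Bayesian (m-)estimate of P(ai | vj): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# e.g. no observed examples (nc = 0) with a uniform prior p = 1/3 over 3 values
print(m_estimate(nc=0, n=10, p=1/3, m=3))   # ~0.077 instead of 0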
Learning to Classify Text
Learn which news articles are of interest
Learn to classify web pages by topic
Naive Bayes is among the most effective algorithms
What attributes shall we use to represent text documents?
Learning to Classify Text
Target concept Interesting? : Document → {+, -}
1. Represent each document by a vector of words
   - one attribute per word position in the document
2. Learning: use training examples to estimate
   P(+), P(-), P(doc|+), P(doc|-)
Naive Bayes conditional independence assumption:
P(doc|vj) = ∏i=1..length(doc) P(ai = wk | vj)
where P(ai = wk | vj) is the probability that the word in position i is wk, given vj
One more assumption:
P(ai = wk | vj) = P(am = wk | vj), ∀ i, m
Learn_Naive_Bayes_Text (Examples, V)
1. Collect all words and other tokens that occur in Examples
   - Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
   For each target value vj in V do
   - Docsj ← subset of Examples for which the target value is vj
   - P(vj) ← |Docsj| / |Examples|
   - Textj ← a single document created by concatenating all members of Docsj
   - n ← total number of words in Textj (counting duplicate words multiple times)
   - For each word wk in Vocabulary
     * nk ← number of times word wk occurs in Textj
     * P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
Classify_Naive_Bayes_Text (Doc)
- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return vNB, where
  vNB = arg max vj∈V P(vj) ∏i∈positions P(ai|vj)
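A compact sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text; the classifier works with log-probabilities to avoid underflow on long documents, an implementation detail not in the pseudocode above:

import math
from collections import Counter

def learn_naive_bayes_text(examples, classes):
    """examples: list of (list_of_words, target_value) pairs."""
    vocabulary = {w for doc, _ in examples for w in doc}
    priors, word_probs = {}, {}
    for v in classes:
        docs_v = [doc for doc, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = [w for doc in docs_v for w in doc]   # concatenation of Docs_v
        counts = Counter(text_v)
        n = len(text_v)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    words = [w for w in doc if w in vocabulary]       # the "positions" step
    def log_score(v):
        return math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in words)
    return max(priors, key=log_score)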
Example: Twenty Newsgroups
Given 1000 training documents from each group, learn to classify a new document according to which newsgroup it came from:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns
sci.space, sci.crypt, sci.electronics, sci.med
Naive Bayes: 89% classification accuracy
Bayesian Networks
The Bayes optimal classifier makes no independence assumptions among the variables and is computationally expensive
The naive Bayes classifier is efficient thanks to the very restrictive assumption of conditional independence of all the attributes given the target value v
Bayesian networks describe conditional independence among subsets of variables
Bayesian Networks
DEF. Let X, Y, Z be discrete random variables.
X is conditionally independent of Y given Z if:
∀ xi, yj, zk   P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
Easily extended to sets of variables:
P(X1,…,Xl | Y1,…,Ym, Z1,…,Zn) = P(X1,…,Xl | Z1,…,Zn)
Bayesian Networks - Representation
Directed acyclic graph
- nodes represent random variables
- edges represent the conditional dependence structure: each variable is conditionally independent of its non-descendants, given its parents
A table of local conditional probabilities for each node, with the probability distribution of the associated variable given its immediate predecessors
One can compute:
P(y1,…, yn) = ∏i P(yi | Parents(Yi))
Bayesian Networks - Representation
[Figure: example network over the Boolean-valued variables Storm, Trip, Campfire, Lightning, Thunder, Fire]
Conditional probability table for Campfire, whose parents are Storm (S) and Trip (T):
            S,T    S,¬T   ¬S,T   ¬S,¬T
Campfire    0.4    0.1    0.8    0.2
¬Campfire   0.6    0.9    0.2    0.8
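A small Python sketch of the factored joint probability P(y1,…,yn) = ∏i P(yi|Parents(Yi)), using the Campfire table above; the priors for Storm and Trip are not given on the slide and are assumed here purely for illustration:

# CPTs: each entry maps an assignment of the parents to P(variable = True | parents)
parents = {"Storm": (), "Trip": (), "Campfire": ("Storm", "Trip")}
cpt = {
    "Storm":    {(): 0.3},          # assumed prior, not given on the slide
    "Trip":     {(): 0.2},          # assumed prior, not given on the slide
    "Campfire": {(True, True): 0.4, (True, False): 0.1,
                 (False, True): 0.8, (False, False): 0.2},
}

def joint(assignment):
    """P(y1,...,yn) = prod_i P(yi | Parents(Yi)) for a full True/False assignment."""
    p = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p_true = cpt[var][key]
        p *= p_true if assignment[var] else 1 - p_true
    return p

print(joint({"Storm": True, "Trip": False, "Campfire": True}))  # 0.3 * 0.8 * 0.1 = 0.024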