GENERAL AND COMPUTATIONAL LINGUISTICS
TEXT CLASSIFICATION
CLASSIFICATION
• A CLASSIFIER is a FUNCTION from the objects to be classified to labels
– Assigning a part of speech to words
– Assigning the label SPAM/NO SPAM to emails
– Positive / negative
• The various aspects of language interpretation that we saw in the first lecture (part-of-speech disambiguation, syntactic analysis, etc.) can all be viewed as classification problems
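A minimal sketch of the "classifier = function from objects to labels" idea, written as a hand-coded SPAM/NO SPAM rule (the keyword list is invented for illustration, not taken from the slides):

```python
# A hand-written (not learned) SPAM/NO SPAM classifier: a function from an
# email text (the object) to a label. The keyword list is purely illustrative.
SPAM_KEYWORDS = {"viagra", "lottery", "winner", "prize"}

def classify_email(text: str) -> str:
    words = set(text.lower().split())
    return "SPAM" if words & SPAM_KEYWORDS else "NO SPAM"

print(classify_email("You are the lottery winner!"))  # SPAM
print(classify_email("Meeting moved to 3 pm"))        # NO SPAM
```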
EXAMPLE: PART-OF-SPEECH DISAMBIGUATION AS CLASSIFICATION
• Part-of-speech disambiguation (POS tagging) can be viewed as a classifier that determines the most likely interpretation of each word (noun, verb, etc.)
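As a quick illustration of the same point (using NLTK's off-the-shelf English tagger, which is not discussed in these slides), the same word form receives different part-of-speech labels depending on its context:

```python
# POS tagging as classification: each token is mapped to a part-of-speech label.
# Uses NLTK's pretrained English tagger (Penn Treebank tagset), just for illustration.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# "refuse" and "permit" are tagged as verbs the first time and as nouns
# the second time, depending on context.
```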
SECOND EXAMPLE: WORD SENSE DISAMBIGUATION
SENSES OF “line”
• Product: “While he wouldn’t estimate the sale price, analysts have
estimated that it would exceed $1 billion. Kraft also told analysts it plans
to develop and test a line of refrigerated entrees and desserts, under the
Chillery brand name.”
• Formation: “C-LD-R L-V-S V-NNA reads a sign in Caldor’s book department.
The 1,000 or so people fighting for a place in line have no trouble filling in
the blanks.”
• Text: “Newspaper editor Francis P. Church became famous for an 1897 editorial, addressed to a child, that included the line ‘Yes, Virginia, there is a Santa Claus.’”
• Cord: “It is known as an aggressive, tenacious litigator. Richard D. Parsons,
a partner at Patterson, Belknap, Webb and Tyler, likes the experience of
opposing Sullivan & Cromwell to “having a thousand-pound tuna on the
line.”
• Division: “Today, it is more vital than ever. In 1983, the act was entrenched
in a new constitution, which established a tricameral parliament along
racial lines, with separate chambers for whites, coloreds and Asians but
none for blacks.”
• Phone: “On the tape recording of Mrs. Guba's call to the 911 emergency
line, played at the trial, the baby sitter is heard begging for an
ambulance.”
A GEOMETRIC VIEW OF CLASSIFICATION
[Figure: a geometric view of classification, with examples labelled SPAM and NON-SPAM]
EXAMPLE OF A CLASSIFIER: DECISION TREE
THE ROLE OF MACHINE LEARNING
• In modern computational linguistics these classifiers are not specified by hand: they are LEARNED AUTOMATICALLY from examples.
PROBABILISTIC CLASSIFICATION
• Each label typically comes with an associated PROBABILITY
• The classifier can be developed by hand or LEARNED from (large quantities of) EXAMPLES using MACHINE LEARNING methods
PROBABILISTIC POS TAGGERS
• A POS TAGGER is a classifier that receives as input information about the word (FEATURES)
– UNIGRAM PROBABILITY: P(N|salto), P(V|salto)
– AFFIXES (‘ing’, ‘ould’)
– N-GRAM PROBABILITIES: P(NN|un salto)
– ...
• It produces as output a probability
– P(N|UProb,AFF,Nprob) = ...
– P(V|UProb,AFF,Nprob) = ...
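A minimal sketch of how such FEATURES could be extracted for a single word; the feature names below are illustrative choices, not prescribed by the slides:

```python
# Features a probabilistic POS tagger could receive for the word at position i:
# the word itself (unigram), an affix, and the previous word (n-gram context).
def pos_features(sentence, i):
    word = sentence[i]
    return {
        "word": word,                                   # basis for P(N|salto), P(V|salto)
        "suffix(3)": word[-3:],                         # affix feature, e.g. 'ing', 'uld'
        "prev_word": sentence[i - 1] if i > 0 else "<START>",  # bigram context
    }

print(pos_features(["un", "salto"], 1))
# {'word': 'salto', 'suffix(3)': 'lto', 'prev_word': 'un'}
```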
TYPES OF CLASSIFIERS
• SUPERVISED
– Learn from labelled examples
– Model learning from a teacher
• UNSUPERVISED
– Discover the structure of the problem on their own
– Model the way language itself is learned
• SEMI-SUPERVISED
– Receive a few examples as input, then proceed by similarity
SUPERVISED CLASSIFICATION FOR POS TAGGING
• The learning algorithm receives as input a TRAINING corpus annotated with POS tags
– La/Art gatta/N fece/V un/Art salto/N ./.
– Giuseppe/PN e’/V matto/Adj ./.
• It extracts the features / computes the probabilities
• It builds a MODEL that can then be used to classify OTHER texts
TRAIN/TEST
LEARNING METHODS
• DECISION TREES
• NAÏVE BAYES
NAÏVE BAYES
• Bayesian methods: the classification decision is based on
– a PROBABILISTIC model
– that combines PRIOR and POSTERIOR information, as in Bayes’ rule
• NAÏVE BAYES methods: assumptions are made that greatly simplify the computation of the probabilities
Bayes’ Rule

P(C, X) = P(C \mid X)\, P(X) = P(X \mid C)\, P(C)

P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}
Bayes’ rule applied to text classification

P(Class|Properties) = P(Properties|Class) * P(Class) / P(Properties)
Maximum a posteriori Hypothesis
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)
Naive Bayes Classifiers
Task: classify a new instance based on a tuple of attribute values x_1, x_2, \ldots, x_n

c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
        = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
        = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)
Naïve Bayes Classifier: Assumptions
• P(cj)
– Can be estimated from the frequency of classes in
the training examples.
• P(x1,x2,…,xn|cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available.
• Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.
The Naïve Bayes Classifier
[Figure: naive Bayes model with class Flu and features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache]
• Conditional Independence Assumption: features are independent of each other given the class:

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)
Learning the Model
[Figure: naive Bayes model with class C and features X1, …, X6]
• Common practice: maximum likelihood
– simply use the frequencies in the data

\hat{P}(c_j) = \frac{N(C = c_j)}{N}
\qquad
\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}
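A small sketch of these maximum-likelihood estimates computed as relative frequencies, on an invented toy dataset (not from the slides):

```python
from collections import Counter, defaultdict

# Toy labelled data: (feature dict, class). Invented for illustration only.
data = [
    ({"runny_nose": True,  "fever": True},  "flu"),
    ({"runny_nose": True,  "fever": False}, "flu"),
    ({"runny_nose": False, "fever": False}, "no-flu"),
]

class_counts = Counter(label for _, label in data)       # N(C = c_j)
feature_counts = defaultdict(Counter)                    # N(X_i = x_i, C = c_j)
for features, label in data:
    for f, v in features.items():
        feature_counts[label][(f, v)] += 1

N = len(data)
P_c = {c: n / N for c, n in class_counts.items()}        # N(C = c_j) / N
P_x_given_c = {c: {fv: n / class_counts[c] for fv, n in counts.items()}
               for c, counts in feature_counts.items()}  # N(X_i=x_i, C=c_j) / N(C=c_j)

print(P_c["flu"])                                # 0.666...
print(P_x_given_c["flu"][("fever", True)])       # 0.5
```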
Problem with Max Likelihood
[Figure: the same Flu model with features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache]

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)

• What if we have seen no training cases where the patient had no flu but did have muscle aches?

\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!

c^{*} = \arg\max_{c} \hat{P}(c) \prod_i \hat{P}(x_i \mid c)
Smoothing to Avoid Overfitting
\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}
\qquad k = \text{number of possible values of } X_i

• Somewhat more subtle version:

\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k}, C = c_j) + m\, p_{i,k}}{N(C = c_j) + m}

where p_{i,k} is the overall fraction of the data in which X_i = x_{i,k} (a prior estimate) and m controls the extent of smoothing.
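A one-function sketch of the add-one (Laplace) version of the estimate above, where k is the number of possible values of X_i:

```python
# Add-one (Laplace) smoothing for P^(x_i | c_j): unseen events get a small,
# non-zero probability instead of 0.
def smoothed_prob(n_x_and_c, n_c, k):
    """(N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k)"""
    return (n_x_and_c + 1) / (n_c + k)

# The "no flu but muscle ache" case from the previous slide: the raw MLE is 0,
# the smoothed estimate is strictly positive (here with 10 no-flu cases, k = 2).
print(smoothed_prob(0, 10, 2))   # 0.0833...
```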
Using Naive Bayes Classifiers to
Classify Text: Basic method
• Attributes are text positions, values are words.
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
       = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{``our''} \mid c_j) \cdots P(x_n = \text{``text''} \mid c_j)

• The Naive Bayes assumption is clearly violated.
– Example?
• Still too many possibilities
• Assume that classification is independent of the positions of the words
– Use the same parameters for each position
CLASSIFICATION EXAMPLE: DOCUMENT CLASSIFICATION
(NLTK book, pp. 227-228)
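For reference, a condensed version of that NLTK book example (classifying movie reviews as positive or negative with a naive Bayes classifier); details such as the train/test split may differ slightly from the printed version:

```python
import random
import nltk
from nltk.corpus import movie_reviews

nltk.download("movie_reviews", quiet=True)

# One (word list, category) pair per review, shuffled.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Use the 2000 most frequent words as binary "contains(word)" features.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    return {f"contains({word})": (word in document_words) for word in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
```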
EVALUATION
• ACCURACY: percentage of correct answers
• For problems in which the class of interest is only a small fraction of the total: PRECISION and RECALL
PRECISION AND RECALL
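The standard definitions, in terms of true positives (TP), false positives (FP) and false negatives (FN) for the class of interest:

\mathrm{Precision} = \frac{TP}{TP + FP}
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

Precision measures how many of the items labelled with the class of interest really belong to it; recall measures how many of the items that belong to the class are actually found.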
CLASSIFICATION EXAMPLE: GENDER IDENTIFICATION
(NLTK book, pp. 222-227)
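For reference, a condensed version of that NLTK book example (predicting the gender of a first name from its last letter):

```python
import random
import nltk
from nltk.corpus import names

nltk.download("names", quiet=True)

def gender_features(word):
    return {"last_letter": word[-1]}     # a single, deliberately simple feature

labeled_names = ([(name, "male") for name in names.words("male.txt")] +
                 [(name, "female") for name in names.words("female.txt")])
random.shuffle(labeled_names)

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features("Neo")))     # typically 'male'
print(nltk.classify.accuracy(classifier, test_set))    # roughly 0.75-0.78
```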
LEARNING DECISION TREES
• Top-down: given a set of examples, find the property that splits them into more COHERENT subgroups
• Then proceed recursively
• Choice of the property: INFORMATION GAIN
Top-down DT induction
• Partition training examples into good “splits”,
based on values of a single “good” feature:
(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +
Top-down DT induction
keys?
– yes -> Drive: 1, 5
– no -> Walk: 2, 3, 4
Top-down DT induction
• Partition training examples into good “splits”,
based on values of a single “good” feature
(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +
• No acceptable classification: proceed recursively
Top-down DT induction
t?
– cold -> Walk: 2, 4
– hot -> Drive: 1, 5 / Walk: 3
Top-down DT induction
t?
– cold -> Walk: 2, 4
– hot -> day?
  – Sat -> Drive: 1
  – Tue -> Walk: 3
  – Wed -> Drive: 5
Top-down DT induction
t?
– cold -> Walk: 2, 4
– hot -> day?
  – Sat -> Drive: 1
  – Tue -> Walk: 3
  – Wed -> Drive: 5
  – Mon, Thu, Fri, Sun -> ? (default: Drive)
PROPERTY SELECTION
• The choice of the property used to split the current set into more coherent subsets is based on a DISORDER-REDUCTION criterion built on the notion of ENTROPY
ENTROPY
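The entropy of a set S with class proportions p_i, used in its binary form in the worked example on the next slides:

E(S) = -\sum_{i} p_i \log_2 p_i
\qquad\text{e.g., for 3 Walk and 2 Drive out of 5:}\qquad
E(S) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.97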
Entropy and Decision Trees
keys?
– yes -> Drive: 1, 5; E(S_keys) = 0
– no -> Walk: 2, 3, 4; E(S_no) = 0

E(S) = -0.6·lg(0.6) - 0.4·lg(0.4) = 0.97
Entropy and Decision Trees
t?
– cold -> Walk: 2, 4; E(S_cold) = 0
– hot -> Drive: 1, 5 / Walk: 3; E(S_hot) = -0.33·lg(0.33) - 0.66·lg(0.66) = 0.92

E(S) = -0.6·lg(0.6) - 0.4·lg(0.4) = 0.97
INFORMATION GAIN
Information gain
• For each feature f, compute the reduction in
entropy on the split:
Gain(S,f)=E(S)-∑(Entropy(Si)*|Si|/|S|)
f=keys? : Gain(S,f)=0.97
f=t?: Gain(S,f)=0.97-0*2/5-0.92*3/5=0.42
f=clothing?: Gain(S,f)= ?
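A small sketch that reproduces these numbers on the toy "walk or drive" examples above; the feature names are mine, the values come from the slides:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, feature):
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for value in {feats[feature] for feats, _ in examples}:
        subset = [label for feats, label in examples if feats[feature] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

examples = [
    ({"day": "Sat", "t": "hot",  "keys": True},  "drive"),
    ({"day": "Mon", "t": "cold", "keys": False}, "walk"),
    ({"day": "Tue", "t": "hot",  "keys": False}, "walk"),
    ({"day": "Tue", "t": "cold", "keys": False}, "walk"),
    ({"day": "Wed", "t": "hot",  "keys": True},  "drive"),
]

print(round(information_gain(examples, "keys"), 2))  # 0.97
print(round(information_gain(examples, "t"), 2))     # 0.42
```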
TEXT CATEGORIZATION WITH DT
• Build a separate decision tree for each
category
• Use WORD COUNTS as features
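A minimal sketch of this setup for a single binary category distinction; scikit-learn and the toy documents are illustrative choices of mine, the slides do not prescribe a toolkit:

```python
# One binary decision tree for ONE category (here: "earn" vs. not-"earn"),
# with raw word counts as features. Documents are invented, not real Reuters texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = [
    "profits and dividends rose this quarter",
    "the company reported higher quarterly earnings",
    "wheat and corn exports fell sharply",
    "crude oil prices climbed on supply fears",
]
is_earn = [1, 1, 0, 0]

vectorizer = CountVectorizer()            # bag-of-words counts
X = vectorizer.fit_transform(docs)
tree = DecisionTreeClassifier().fit(X, is_earn)

# Predicted label for a new document
print(tree.predict(vectorizer.transform(["dividends and earnings up again"])))
```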
Reuters Data Set
(21578 - ModApte split)
• 9603 training, 3299 test articles; ave. 200 words
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
Common categories
(#train, #test)
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
AN EXAMPLE OF REUTERS TEXT
[Figure: sample Reuters newswire article, from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
Decision Tree for Reuters classification
[Figure: decision tree for a Reuters category, from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
Information gain & text classification