THEORIES AND TECHNIQUES OF RECOGNITION
Machine learning
Sentiment analysis in SciKit-Learn
REMINDER 1:
SENTIMENT ANALYSIS
• (or opinion mining)
• Identifies the 'sentiment' that a text expresses
[Figure: Sentiment Analysis → Positive / Negative / Neutral]
SENTIMENT ANALYSIS AS
TEXT CLASSIFICATION
• Treat sentiment analysis as a type of classification
• Use corpora annotated for subjectivity and/or
sentiment
• Train machine learning algorithms:
– Naïve Bayes
– Decision trees
– SVM
– …
• Learn to automatically annotate new text
SENTIMENT ANALYSIS OF TWEETS
EASIER AND HARDER PROBLEMS
• Tweets from Twitter are probably the easiest
– short and thus usually straight to the point
• Reviews are next
– entities are given (almost) and there is little noise
• Discussions, comments, and blogs are hard.
– Multiple entities, comparisons, noise, sarcasm, etc.
REMINDER 2: NAÏVE BAYES
• Bayesian methods: the classification decision is based on
– a PROBABILISTIC model
– that combines A PRIORI and A POSTERIORI information, as in Bayes' rule (recalled below)
• NAÏVE BAYES methods: assumptions are made that greatly simplify the computation of the probabilities
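For reference, Bayes' rule in the notation used below, with C a class and F the observed features:

P(C | F) = P(C) · P(F | C) / P(F)

where P(C) is the prior, P(F | C) the likelihood of the features given the class, and P(F) the evidence.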
TEXT CLASSIFICATION USING
NAÏVE BAYES
• Attributes are text positions, values are words.
c_NB = argmax_{c_j ∈ C} P(c_j) · ∏_i P(x_i | c_j)
     = argmax_{c_j ∈ C} P(c_j) · P(x_1 = "our" | c_j) · … · P(x_n = "text" | c_j)
DATASET
• Several datasets of tweets annotated with sentiments
– E.g. SEMEVAL-2014
• In this study: Nick Sanders' dataset
– 5000 tweets in total
– Annotated with the classes positive / negative / neutral / irrelevant
– A script downloads the tweets starting from their IDs
Note, however, that we don't calculate any real probabilities any more. Instead, we are estimating which class is more likely given the evidence. This is another reason why Naive Bayes is so robust: it is not so much interested in the real probabilities, but only in the information about which class is more likely. In short, we can write it as follows:

c_best = argmax_{c ∈ C} P(c) · P(F_1 | c) · P(F_2 | c)

where F_1 and F_2 are the features and the evidence term P(F_1, F_2) has been dropped, since it is the same for every class.
Naïve Bayes for sentiment analysis:
an example
Here we are calculating the part after argmax for all classes of C ("pos" and "neg"
in our case) and returning the class that results in the highest value.
But for the following example, let us stick to real probabilities and do some calculations to see how Naive Bayes works. For the sake of simplicity, we will assume that Twitter allows only the two words mentioned earlier, awesome and crazy.
Suppose that our training set consists of 6 tweets, containing only the words 'awesome' and 'crazy', which have been classified as follows:

Tweet           Class
awesome         Positive
awesome         Positive
awesome crazy   Positive
crazy           Positive
crazy           Negative
crazy           Negative
Example, cont'd
• The priors P(C) are calculated as the number of tweets that have been labeled with class C divided by the total number of tweets. In our case we have six tweets in total, out of which four are positive and two negative, which results in the following priors:

P("pos") = 4/6 ≈ 0.67        P("neg") = 2/6 ≈ 0.33

• This means that, without knowing anything about the tweet itself, we would be wise to assume the tweet to be positive.
• The piece that is still missing is the calculation of P(F_1 | C) and P(F_2 | C), the probabilities of the two features conditioned on class C. These are calculated as the number of tweets in which we have seen the concrete feature, divided by the number of tweets that have been labeled with class C.
• Now we compute the likelihoods for 'awesome'. Let's say we want to know the probability of seeing awesome occurring once in a tweet, knowing that its class is "positive". Since out of the four positive tweets three contained the word awesome, we have the following:

P(awesome = 1 | "pos") = 3/4 = 0.75

The probability of not having awesome in a positive tweet is its inverse, as we have seen only tweets with the counts 0 or 1:

P(awesome = 0 | "pos") = 1/4 = 0.25
The example, cont'd
• The other likelihoods: similarly for the rest (omitting the case that a word is not occurring in a tweet):

P(crazy = 1 | "pos") = 2/4 = 0.5
P(awesome = 1 | "neg") = 0/2 = 0
P(crazy = 1 | "neg") = 2/2 = 1
Evidence
(= the denominator)
For the sake of completeness, we will also compute the evidence, so that we can see real probabilities in the following example tweets. For two concrete values of F_1 and F_2 we can calculate the evidence as follows:

P(F_1, F_2) = P(F_1, F_2 | "pos") · P("pos") + P(F_1, F_2 | "neg") · P("neg")

Plugging in the priors and likelihoods above gives the evidence for each combination of feature values.
The resulting classification of the tweets
Now we have all the data to classify new tweets. The only work left is to parse the tweet and extract its features; the class probabilities then follow from the priors, likelihoods, and evidence computed above.

Tweet            Classification
awesome          Positive
crazy            Negative
awesome crazy    Positive
awesome text     Undefined, because we have never seen these words in this tweet before

For the last tweet the word 'text' never occurred in the training set, so both class scores and the evidence are 0, and the posterior is the undefined ratio 0/0.
So far, so good. The classifications look reasonable, except for the last one, where the unseen word leaves the class undefined.
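To make the arithmetic concrete, here is a small self-contained sketch (my own addition, not from the slides or the book) that recomputes the priors, likelihoods, and posteriors for the toy training set above:

# Toy training set from the example: 4 positive and 2 negative tweets
tweets = [("awesome", "pos"), ("awesome", "pos"), ("awesome crazy", "pos"),
          ("crazy", "pos"), ("crazy", "neg"), ("crazy", "neg")]
classes = ["pos", "neg"]
vocab = ["awesome", "crazy"]

# Priors: fraction of the tweets carrying each class label
prior = {c: sum(1 for _, y in tweets if y == c) / len(tweets) for c in classes}

# Likelihood: fraction of the tweets of class c that contain word w
def likelihood(w, c):
    in_class = [t for t, y in tweets if y == c]
    return sum(1 for t in in_class if w in t.split()) / len(in_class)

# Posterior over the classes for a new tweet (only the two known words are used)
def posterior(tweet):
    words = tweet.split()
    score = {}
    for c in classes:
        p = prior[c]
        for w in vocab:
            p *= likelihood(w, c) if w in words else 1 - likelihood(w, c)
        score[c] = p
    evidence = sum(score.values())
    return {c: score[c] / evidence for c in classes} if evidence else None

print(prior)                  # {'pos': 0.67, 'neg': 0.33} approximately
print(posterior("awesome"))   # pos = 1.0, neg = 0.0
print(posterior("crazy"))     # pos = 0.2, neg = 0.8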
Two problems
• Zero probabilities
• Underflow
What to do about the zeros
• Even a very large corpus remains a very limited sample of language use, in which many words, even common ones, do not occur
• Solution: SMOOTHING – redistribute the probability mass so that all events are covered
• In particular, ADD-ONE or LAPLACE smoothing (see the formula below)
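For reference, the standard add-one (Laplace) estimate of a word likelihood (the textbook formulation, not taken from these slides), with count(w, c) the number of times w occurs in the tweets of class c, count(c) the total number of word occurrences in class c, and |V| the vocabulary size:

P(w | c) = (count(w, c) + 1) / (count(c) + |V|)

In scikit-learn's MultinomialNB this is controlled by the alpha parameter; alpha = 1 corresponds to add-one smoothing.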
Underflow
• The real probabilities have very small values
• And when there are many features, the product of their probabilities becomes even smaller
• The limits of NumPy floats are quickly reached
>>> import numpy as np
>>> np.set_printoptions(precision=20)  # print more digits (default is 8)
>>> np.array([2.48E-324])
array([ 4.94065645841246544177e-324])
>>> np.array([2.47E-324])
array([ 0.])
Underflow example
So, how probable is it that we will ever hit a number like 2.47E-324? Suppose we have 65 features, for each of which the conditional probability P(F|C) is below 0.0001; we then just have to multiply 65 of them together (meaning that we have 65 low-probability feature values) to hit an arithmetic underflow:

>>> x = 0.00001
>>> x**64   # still fine
1e-320
>>> x**65   # ouch
0.0

A float in Python is typically implemented using double precision (64 bit). Whether that is the case for your platform, you can check as follows:

>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308,
min=2.2250738585072014e-308, min_exp=-1021, ...)
Solution
• Replace the product of the probabilities with the SUM of their LOGARITHMS
Fortunately, there is a better way to take care of this, and it has to do with a nice relationship that we maybe still know from school:

log(x · y) = log(x) + log(y)

The logarithm curve never goes down when going from left to right; in short, applying the logarithm does not change which value is the highest. So, let us stick this into the formula we used earlier. Applying it to our two-feature case, and then rewriting it for an arbitrary number of features, as in the real-world data that we will see in practice, we get the following:

c_best = argmax_{c ∈ C} ( log P(c) + Σ_k log P(F_k | c) )

Since the probabilities are in the interval between 0 and 1, the log of the probabilities is in the interval between −∞ and 0. Don't get irritated by that: higher numbers are still a stronger indicator for the correct class; it is only that they are negative now.
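A small sketch (my own, not from the slides) contrasting the naive product with the log-sum version: the product of 65 small likelihoods underflows to 0.0, while the sum of their logarithms stays perfectly representable:

import math

probs = [1e-5] * 65                          # 65 features, each with likelihood 1e-5

product = 1.0
for p in probs:
    product *= p                             # underflows to 0.0

log_sum = sum(math.log(p) for p in probs)    # stays finite

print(product)   # 0.0
print(log_sum)   # about -748.3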
A Naïve Bayes sentiment analyzer in
SciKit-Learn
• The sklearn.naive_bayes library contains implementations of three Naïve Bayes classifiers:
– GaussianNB (when the features have a Gaussian distribution, for example heights, etc.)
– MultinomialNB (when the features are word occurrence counts)
– BernoulliNB (when the features are boolean)
• For sentiment analysis: MultinomialNB (a minimal example follows)
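As a minimal sketch (my own, not from the slides), MultinomialNB applied to the toy tweets of the earlier example; thanks to the default add-one smoothing (alpha=1.0), even a tweet containing an unseen word receives a defined class:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_tweets = ["awesome", "awesome", "awesome crazy", "crazy", "crazy", "crazy"]
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]

vect = CountVectorizer()             # word occurrence counts as features
X = vect.fit_transform(train_tweets)

clf = MultinomialNB()                # default alpha=1.0 -> add-one (Laplace) smoothing
clf.fit(X, labels)

print(clf.predict(vect.transform(["awesome text"])))   # defined, thanks to smoothing
print(clf.predict_proba(vect.transform(["crazy"])))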
Model creation
• The words of the tweets are used as features. They are extracted and weighted using the create_ngram_model function
– create_ngram_model uses the TfidfVectorizer class from scikit-learn's feature_extraction package to extract the terms from the tweets
• http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
• create_ngram_model uses MultinomialNB to create a classifier
– http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
• scikit-learn's Pipeline class is used to combine the feature extractor and the classifier into a single object (an estimator) that can be used to extract features from the data, fit a model, and use the model to classify
– http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
Extracting terms from the tweets &
classification
[Code figure: extracts features & weights them (TfidfVectorizer); Naïve Bayes classifier (MultinomialNB); combined in a Pipeline]
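The figure on this slide shows the book's code. Below is a sketch of the same idea, assuming the book's names (create_ngram_model, 'vect', 'clf') and illustrative parameter choices:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # Term extraction and TF-IDF weighting (unigrams up to trigrams)
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3), analyzer="word", binary=False)
    # Naive Bayes classifier for word-count / TF-IDF features
    clf = MultinomialNB()
    # Pipeline: vectorizer + classifier combined into a single estimator
    return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])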
Training and evaluation
• The train_model function
– Uses one of the methods in scikit-learn's cross_validation library, ShuffleSplit, to compute the folds to be used in cross validation
– At each cross-validation iteration it trains a model using the fit method, then evaluates the results using score
Creating a model
[Code figure: determines the indices in each fold; trains the model]
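The figure on this slide shows the book's train_model code. The following is a rough sketch under the same older scikit-learn cross_validation API; the exact details of the original may differ:

import numpy as np
from sklearn.cross_validation import ShuffleSplit   # old scikit-learn API, as used in the book

def train_model(clf_factory, X, Y):
    X, Y = np.asarray(X), np.asarray(Y)
    # 10 random train/test splits, with 30% of the tweets held out each time
    cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.3, random_state=0)

    scores = []
    for train_idx, test_idx in cv:
        clf = clf_factory()                                   # e.g. create_ngram_model()
        clf.fit(X[train_idx], Y[train_idx])                   # fit the pipeline on this fold
        scores.append(clf.score(X[test_idx], Y[test_idx]))    # accuracy on the held-out fold
    return scores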
Execution and results
Optimization
• The program above uses the default values of the TfidfVectorizer and MultinomialNB parameters
• Possible variants for TfidfVectorizer:
– Use unigrams, bigrams, trigrams (the ngram_range parameter)
– Remove stopwords (stop_words)
– Use a binary version of the counts (binary)
• Possible variants for MultinomialNB:
– Which type of smoothing to use (alpha)
Parameter exploration
• Searching for the best parameter values is one of the standard operations in machine learning
• Scikit-learn, like Weka and other similar packages, provides a class (GridSearchCV) that makes it possible to explore the results obtained with different values of the parameters
Optimization with GridSearchCV
Putting everything together, and using the F-measure (implemented as metrics.f1_score) as the evaluation metric, we get the following code. Note the syntax for specifying the parameter values to explore (vect__... for the vectorizer, clf__... for the classifier); clf__alpha controls the smoothing.

from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score

def grid_search_model(clf_factory, X, Y):
    cv = ShuffleSplit(
        n=len(X), n_iter=10, test_size=0.3, indices=True, random_state=0)

    param_grid = dict(vect__ngram_range=[(1, 1), (1, 2), (1, 3)],
                      vect__min_df=[1, 2],
                      vect__stop_words=[None, "english"],
                      vect__smooth_idf=[False, True],
                      vect__use_idf=[False, True],
                      vect__sublinear_tf=[False, True],
                      vect__binary=[False, True],
                      clf__alpha=[0, 0.01, 0.05, 0.1, 0.5, 1],
                      )

    grid_search = GridSearchCV(clf_factory(),
                               param_grid=param_grid,
                               cv=cv,
                               score_func=f1_score,
                               verbose=10)
    grid_search.fit(X, Y)

    return grid_search.best_estimator_
Cleaning the tweets: emoticons
Cleaning the tweets: abbreviations
Using the preprocessing with TfidfVectorizer
(a sketch of this kind of preprocessing follows)
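The slides show the book's preprocessing code; the sketch below only illustrates the idea, with invented (hypothetical) emoticon and abbreviation tables plugged into TfidfVectorizer through its preprocessor parameter:

import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical replacement tables (the real ones in the book are much larger)
emo_repl = {":)": " good ", ":(": " bad ", ";)": " good "}
re_repl = {r"\br\b": "are", r"\bu\b": "you", r"\bdont\b": "do not"}

def preprocessor(tweet):
    tweet = tweet.lower()
    for emoticon, replacement in emo_repl.items():
        tweet = tweet.replace(emoticon, replacement)   # normalize emoticons
    for pattern, replacement in re_repl.items():
        tweet = re.sub(pattern, replacement, tweet)    # expand abbreviations
    return tweet

vect = TfidfVectorizer(preprocessor=preprocessor, ngram_range=(1, 3))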
Other possible improvements
• Use the nltk POS tagger
• Use a sentiment lexicon, for example SentiWordNet (a small usage sketch follows)
– http://sentiwordnet.isti.cnr.it
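As an illustration only (not from the slides), SentiWordNet can be queried through nltk, assuming the wordnet and sentiwordnet corpora have been downloaded:

import nltk
from nltk.corpus import sentiwordnet as swn

# One-time downloads (assumption: network access is available)
nltk.download("wordnet")
nltk.download("sentiwordnet")

# Each synset carries a positivity, negativity, and objectivity score
for synset in swn.senti_synsets("awesome", "a"):    # "a" = adjectives only
    print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())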
Overall results
READINGS FOR THIS LECTURE
• Richert & Coelho, chapter 6
Download the slides