TEORIE E TECNICHE DEL RICONOSCIMENTO
Machine learning: Sentiment analysis in SciKit-Learn

REMINDER 1: SENTIMENT ANALYSIS
• Also called opinion mining
• Identifies the 'sentiment' that a text expresses: Positive / Negative / Neutral

SENTIMENT ANALYSIS AS TEXT CLASSIFICATION
• Treat sentiment analysis as a type of classification
• Use corpora annotated for subjectivity and/or sentiment
• Train machine learning algorithms:
  – Naïve Bayes
  – Decision trees
  – SVM
  – …
• Learn to automatically annotate new text

SENTIMENT ANALYSIS OF TWEETS: EASIER AND HARDER PROBLEMS
• Tweets from Twitter are probably the easiest: short, and thus usually straight to the point
• Reviews are next: the entities are (almost) given and there is little noise
• Discussions, comments, and blogs are hard: multiple entities, comparisons, noise, sarcasm, etc.

REMINDER 2: NAÏVE BAYES
• Bayesian methods: the classification decision is based on
  – a PROBABILISTIC model
  – that combines PRIOR and POSTERIOR information, as in Bayes' rule
• NAÏVE BAYES methods: assumptions are made that greatly simplify the computation of the probabilities

TEXT CLASSIFICATION USING NAÏVE BAYES
• Attributes are text positions, values are words:

    c_NB = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i | c_j)
         = argmax_{c_j ∈ C} P(c_j) · P(x_1 = "our" | c_j) · … · P(x_n = "text" | c_j)

DATASET
• Several datasets of tweets annotated with sentiment exist
  – E.g. SEMEVAL-2014
• In this study: Nick Sanders' dataset
  – 5000 tweets in total
  – Annotated with the classes positive / negative / neutral / irrelevant
  – A script downloads the tweets starting from their IDs

Naïve Bayes for sentiment analysis: an example
• Note that we do not calculate any real probabilities any more. Instead, we estimate which class is more likely given the evidence. This is another reason why Naïve Bayes is so robust: it is not so much interested in the real probabilities, but only in which class is more likely. In short, we can write it as follows:

    c_NB = argmax_{c ∈ C} P(c) ∏_i P(x_i | c)

  Here we calculate the part after argmax for all classes in C ("pos" and "neg" in our case) and return the class that results in the highest value.
• For the following example, however, let us stick to real probabilities and do some calculations to see how Naïve Bayes works. For the sake of simplicity, we assume that Twitter allows only the two words mentioned earlier, awesome and crazy, and that our training set consists of the following six tweets:

    Tweet            Class
    awesome          Positive
    awesome          Positive
    awesome crazy    Positive
    crazy            Positive
    crazy            Negative
    crazy            Negative

• We have six tweets in total, of which four are positive and two negative, which results in the following priors:

    P(C="pos") = 4/6 = 2/3        P(C="neg") = 2/6 = 1/3

• This means that, without knowing anything about the tweet itself, it would be wise to assume that the tweet is positive.
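A minimal sketch of the counting behind these priors (and behind the per-word likelihoods derived on the next slides). This is only an illustration of the hand calculation, not code from the book or the slides:

    # Derive the priors and per-class word likelihoods of the toy example
    # by direct counting over the six training tweets.
    from collections import Counter

    training = [("awesome", "pos"), ("awesome", "pos"), ("awesome crazy", "pos"),
                ("crazy", "pos"), ("crazy", "neg"), ("crazy", "neg")]

    class_counts = Counter(label for _, label in training)
    priors = {c: class_counts[c] / len(training) for c in class_counts}
    print(priors)                         # {'pos': 0.666..., 'neg': 0.333...}

    # P(word present | class) = number of tweets of that class containing the word,
    # divided by the number of tweets of that class
    def likelihood(word, c):
        in_class = [text for text, label in training if label == c]
        return sum(word in text.split() for text in in_class) / len(in_class)

    print(likelihood("awesome", "pos"))   # 0.75
    print(likelihood("crazy", "neg"))     # 1.0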
Example, cont'd
• The piece that is still missing is the calculation of P(F1 | C) and P(F2 | C), the probabilities of the two features conditioned on class C.
• These are calculated as the number of tweets in which we have seen the concrete feature, divided by the number of tweets labeled with that class.
• Say we want to know the probability of seeing awesome occur in a tweet, knowing that its class is "positive". Since three out of the four positive tweets contain the word awesome, and since we have only seen tweets with the counts 0 or 1, the probability of not having awesome in a positive tweet is simply its complement:

    P(F1 = 1 | C = "pos") = 3/4        P(F1 = 0 | C = "pos") = 1/4

The example, continued: the other likelihoods
• Similarly for the rest (omitting the case of a word not occurring in a tweet):

    P(F2 = 1 | C = "pos") = 2/4        P(F1 = 1 | C = "neg") = 0        P(F2 = 1 | C = "neg") = 2/2

Evidence (= the denominator)
• For the sake of completeness, we also compute the evidence, so that we can see real probabilities for the example tweets below. For two concrete values of F1 and F2, the evidence is calculated as follows:

    P(F1, F2) = P(F1, F2 | C = "pos") · P(C = "pos") + P(F1, F2 | C = "neg") · P(C = "neg")

• With the likelihoods above, this gives the evidence for every combination of feature values, and hence real posterior probabilities for the tweets below.

The resulting classification of the tweets
• Now we have all the data needed to classify new tweets. The only work left is to parse the tweet and extract its features.

    Tweet            Class probabilities                         Classification
    awesome          P(C="pos"|F) = 1,    P(C="neg"|F) = 0       Positive
    crazy            P(C="pos"|F) = 1/5,  P(C="neg"|F) = 4/5     Negative
    awesome crazy    P(C="pos"|F) = 1,    P(C="neg"|F) = 0       Positive
    awesome text     0/0 for both classes                        Undefined, because the word "text"
                                                                 never occurs in the training data

• So far, so good.
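The whole hand calculation, evidence included, can be scripted as well. The sketch below is only an illustration (not code from the book); the numbers are the priors and likelihoods computed above, and the last tweet reproduces the undefined 0/0 case discussed next:

    # Classify the example tweets with the hand-computed priors and likelihoods.
    # An unseen word has likelihood 0 for every class, so both the numerator and
    # the evidence become 0 and the posterior is undefined (0/0).
    priors = {"pos": 4 / 6, "neg": 2 / 6}
    likelihoods = {"pos": {"awesome": 3 / 4, "crazy": 2 / 4},
                   "neg": {"awesome": 0 / 2, "crazy": 2 / 2}}

    def classify(tweet):
        words = tweet.split()
        joint = {}
        for c in priors:
            p = priors[c]
            for w in set(likelihoods[c]) | set(words):
                p_w = likelihoods[c].get(w, 0.0)          # unseen word -> 0
                p *= p_w if w in words else (1 - p_w)     # word present / absent
            joint[c] = p
        evidence = sum(joint.values())
        if evidence == 0:
            return "undefined"                            # 0/0 for every class
        return max(joint, key=lambda c: joint[c] / evidence)

    for t in ["awesome", "crazy", "awesome crazy", "awesome text"]:
        print(t, "->", classify(t))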
Two problems
• The last tweet points at two problems with this naive calculation:
  – Zero probabilities
  – Underflow

What to do about the zeros
• Even a very large corpus is still a very limited sample of language use, in which many words, even common ones, never occur
• Solution: SMOOTHING, i.e. redistributing the probability mass so that every event gets some probability
• In particular, ADD-ONE or LAPLACE smoothing: every count is incremented by 1 (and the denominator is adjusted accordingly), so that no conditional probability is ever exactly 0

Underflow
• The real probabilities have very small values
• And when there are many features, the product of their probabilities becomes smaller still
• The limits of NumPy (double precision) are reached quickly:

    >>> import numpy as np
    >>> np.set_printoptions(precision=20)   # print more digits (the default is 8)
    >>> np.array([2.48E-324])
    array([ 4.94065645841246544177e-324])
    >>> np.array([2.47E-324])
    array([ 0.])

An example of underflow
• Suppose we have 65 features, for each of which P(F|C) < 0.0001
• How probable is it that we will ever hit a number like 2.47E-324? We just have to imagine likelihoods of about 0.00001 for the conditional probabilities and multiply 65 of them together (that is, 65 low-probability feature values):

    >>> x = 0.00001
    >>> x**64   # still fine
    1e-320
    >>> x**65   # ouch
    0.0

• A float in Python is typically implemented using double in C. Whether that is the case on your platform can be checked as follows:

    >>> import sys
    >>> sys.float_info
    sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308,
                   min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, ...)

Solution
• Replace the PRODUCT of the probabilities with the SUM of their LOGARITHMS
• Fortunately, there is a better way to take care of this, and it has to do with a relationship that we may still remember from school:

    log(x · y) = log(x) + log(y)

• Applying it to our case, and rewriting it for an arbitrary number of features:

    c_NB = argmax_{c ∈ C} [ log P(c) + Σ_i log P(x_i | c) ]

• A look at the graph of the logarithm shows that the curve never goes down when moving from left to right; in short, applying the logarithm does not change which value is the highest
• Since the probabilities are in the interval between 0 and 1, the logarithms of the probabilities are in the interval (-∞, 0]. Don't be irritated by that: higher numbers are still a stronger indicator for the correct class, it is only that they are negative now
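A small sketch (an illustration, not code from the book) making the point concrete: the product of 65 small likelihoods underflows, while the sum of their logarithms is perfectly representable:

    # The product of many small likelihoods underflows to 0.0,
    # while the sum of their logarithms stays well within range.
    import numpy as np

    likelihoods = np.full(65, 0.00001)     # 65 features, each with P(F|C) = 1e-5

    print(np.prod(likelihoods))            # 0.0 -> underflow
    print(np.sum(np.log(likelihoods)))     # about -748.3 -> no problem

    # The decision rule therefore becomes:
    #   c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]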
A Naïve Bayes sentiment analyzer in SciKit-Learn
• The sklearn.naive_bayes module contains implementations of three Naïve Bayes classifiers:
  – GaussianNB (when the features have a Gaussian distribution, for example heights)
  – MultinomialNB (when the features are word occurrence counts)
  – BernoulliNB (when the features are boolean)
• For sentiment analysis: MultinomialNB

Model creation
• The words of the tweets are used as features. They are extracted and weighted by the function create_ngram_model
  – create_ngram_model uses the TfidfVectorizer class from scikit-learn's feature_extraction package to extract the terms from the tweets
    • http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
  – create_ngram_model uses MultinomialNB to create the classifier
    • http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
  – scikit-learn's Pipeline class is used to combine the feature extractor and the classifier into a single object (an estimator) that can be used to extract features from the data, fit a model, and use the model to classify
    • http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Term extraction from the tweets & classification
• (code shown on the slide) TfidfVectorizer extracts and weights the features, MultinomialNB is the Naïve Bayes classifier, and a Pipeline ties the two together; a sketch of this code is given in the appendix at the end of this section

Training and evaluation
• The function train_model
  – Uses one of the methods of scikit-learn's cross_validation module, ShuffleSplit, to compute the folds to be used for cross-validation
  – At each cross-validation iteration it trains a model using the fit method, then evaluates the results using score

Creating a model
• (code shown on the slide) ShuffleSplit determines the indices of the data in each fold; fit trains the model (see the appendix sketch)

Execution and results
• (run and results shown on the slide)

Optimization
• The program above uses the default values of the TfidfVectorizer and MultinomialNB parameters
• Possible variants for TfidfVectorizer:
  – Use unigrams, bigrams, trigrams (ngram_range parameter)
  – Remove stopwords (stop_words)
  – Use a binary version of the counts (binary)
• Possible variants for MultinomialNB:
  – Which type of smoothing to use (alpha)

Exploring the parameters
• Searching for the best parameter values is one of the standard operations in machine learning
• Scikit-learn, like Weka and other similar packages, provides a class (GridSearchCV) that makes it possible to explore the results obtained with different parameter values

Optimization with GridSearchCV
• Note the syntax for specifying the parameter values: vect__… for the TfidfVectorizer parameters, clf__… for the MultinomialNB parameters
• The clf__alpha values explore different amounts of smoothing
• The F measure (implemented as metrics.f1_score) is used as the evaluation metric
• Putting everything together, we get the following code:

    from sklearn.cross_validation import ShuffleSplit   # older scikit-learn API, as used in the book
    from sklearn.grid_search import GridSearchCV
    from sklearn.metrics import f1_score

    def grid_search_model(clf_factory, X, Y):
        cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.3,
                          indices=True, random_state=0)
        param_grid = dict(vect__ngram_range=[(1, 1), (1, 2), (1, 3)],
                          vect__min_df=[1, 2],
                          vect__stop_words=[None, "english"],
                          vect__smooth_idf=[False, True],
                          vect__use_idf=[False, True],
                          vect__sublinear_tf=[False, True],
                          vect__binary=[False, True],
                          clf__alpha=[0, 0.01, 0.05, 0.1, 0.5, 1],
                          )
        grid_search = GridSearchCV(clf_factory(),
                                   param_grid=param_grid,
                                   cv=cv,
                                   score_func=f1_score,
                                   verbose=10)
        grid_search.fit(X, Y)
        return grid_search.best_estimator_

Cleaning up the tweets: emoticons
• (code shown on the slide)

Cleaning up the tweets: abbreviations
• (code shown on the slide)

Using the preprocessing with TfidfVectorizer
• (code shown on the slide; a sketch of this kind of preprocessing is given in the appendix below)

Other possible improvements
• Use the nltk POS tagger
• Use a sentiment lexicon, for example SentiWordNet
  – http://sentiwordnet.isti.cnr.it

Overall results
• (results table shown on the slide)

READINGS FOR THIS LECTURE
• Richert & Coelho, Building Machine Learning Systems with Python, Chapter 6
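Appendix: a sketch of create_ngram_model and train_model. The code on the slides above was shown as screenshots; the following is a minimal reconstruction based on their description, not the original code. It uses the current scikit-learn API (sklearn.model_selection) instead of the older cross_validation module, and the parameter choices (ngram_range, number of splits, test_size) are assumptions:

    # Reconstruction sketch: TF-IDF term extraction + MultinomialNB combined in a
    # Pipeline, evaluated with ShuffleSplit cross-validation.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import ShuffleSplit

    def create_ngram_model():
        # feature extraction: terms (unigrams to trigrams) weighted with TF-IDF
        tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3), analyzer="word")
        # Naive Bayes classifier for count/TF-IDF features
        clf = MultinomialNB()
        # combine extractor and classifier into a single estimator
        return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])

    def train_model(clf_factory, X, Y):
        # ShuffleSplit determines the train/test indices of each fold
        cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
        scores = []
        for train_idx, test_idx in cv.split(X):
            clf = clf_factory()
            clf.fit(X[train_idx], Y[train_idx])                    # train the model
            scores.append(clf.score(X[test_idx], Y[test_idx]))     # evaluate on the held-out fold
        return np.mean(scores), np.std(scores)

    # toy usage; the real X would be the annotated tweets
    X = np.asarray(["awesome", "awesome", "awesome crazy", "crazy", "crazy", "crazy"])
    Y = np.asarray(["positive", "positive", "positive", "positive", "negative", "negative"])
    print(train_model(create_ngram_model, X, Y))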
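Appendix: a sketch of the tweet cleaning. The emoticon and abbreviation tables below are invented examples (the originals were shown as screenshots); the point is that the cleaning function is plugged into TfidfVectorizer through its preprocessor parameter:

    # Hypothetical cleaning tables and a preprocessor hooked into TfidfVectorizer.
    import re
    from sklearn.feature_extraction.text import TfidfVectorizer

    emoticons = {":)": " good ", ":(": " bad ", ";)": " good "}                   # invented mapping
    abbreviations = {r"\bu\b": "you", r"\br\b": "are", r"\bdont\b": "do not"}     # invented mapping

    def preprocess_tweet(tweet):
        tweet = tweet.lower()
        for emoticon, replacement in emoticons.items():
            tweet = tweet.replace(emoticon, replacement)    # map emoticons to sentiment-bearing words
        for pattern, replacement in abbreviations.items():
            tweet = re.sub(pattern, replacement, tweet)     # expand common abbreviations
        return tweet

    vect = TfidfVectorizer(preprocessor=preprocess_tweet, ngram_range=(1, 3))
    print(preprocess_tweet("u r awesome :)"))               # -> "you are awesome  good "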