EVALITA 2007, Frascati, September 10th 2007

ENTITYPRO: EXPLOITING SVM FOR ITALIAN NAMED ENTITY RECOGNITION
Roberto Zanoli and Emanuele Pianta

TextPro
A suite of modular NLP tools developed at FBK-irst:
- TokenPro: tokenization
- MorphoPro: morphological analysis
- TagPro: Part-of-Speech tagging
- LemmaPro: lemmatization
- EntityPro: Named Entity recognition
- ChunkPro: phrase chunking
- SentencePro: sentence splitting
The architecture is designed to be efficient, scalable and robust. Cross-platform: Unix / Linux / Windows / MacOS X. Multi-lingual models. All modules are integrated and accessible through a unified command line interface.

EntityPro's Architecture
[Architecture diagram: training data and test data, TagPro, feature extraction (ortho, prefix, suffix, dictionary, collocation bigram), feature selection, controller, YamCha learning models and classification.]
We used YamCha, an SVM-based machine learning environment, to build EntityPro, a system exploiting a rich set of linguistic features, such as orthographic features, prefixes and suffixes, and occurrence in proper noun gazetteers.

YamCha
- Created as a generic, customizable, open source text chunker
- Can be adapted to many other tag-oriented NLP tasks
- Uses a state-of-the-art machine learning algorithm (SVM)
- Configurable: context (window size), parsing direction (forward/backward), and the algorithm for the multi-class problem (pairwise / one vs. rest)
- Practical chunking time (1 or 2 sec. per sentence)
- Available as a C/C++ library

Support Vector Machines
Support vector machines are based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995) from computational learning theory. They map input vectors to a higher-dimensional space, where a maximal separating hyperplane is constructed: two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes. (A standard formulation is given at the end of this document.)

YamCha: Setting Window Size
The default setting is "F:-2..2:0.. T:-2..-1". The window setting can be customized.

Training and Tuning Set
The Evalita development set was randomly split into two parts:
- training: 92,241 tokens
- tuning: 40,348 tokens

FEATURES (1/3)
For each running word:
- WORD: the word itself (both unchanged and lower-cased), e.g. Casa -> casa
- POS: the part of speech of the word (as produced by TagPro), e.g. Oggi -> SS (singular noun)
- AFFIX: prefixes/suffixes (1, 2, 3 or 4 characters at the start/end of the word), e.g. Oggi -> {o, og, ogg, oggi} and {i, gi, ggi, oggi}
- ORTHOgraphic information (e.g. capitalization, hyphenation), e.g. Oggi -> C (capitalized), oggi -> L (lower-cased)

FEATURES (2/3)
COLLOCation bigrams (36,000, from Italian newspapers, ranked by MI values), e.g.:
Token:  l'  avvocato  di  Rossi  Carlo  Taormina  ha  ...
COLLOC: O   O         O   O      B-COL  I-COL     O

FEATURES (3/3): GAZETTeers
- TOWNS: world (main), Italian (comuni) and Trentino (frazioni) towns (12,000, from various internet sites)
- STOCK-MARKET: Italian and American stock market organizations (5,000, from stock market sites)
- WIKI-GEO: Wikipedia geographical locations (3,200)
- PERSONS: person proper names and titles (154,000, from the Italian phone book and Wikipedia)
Each gazetteer contributes its own feature column; person-title trigger words such as "avvocato" are marked TRIG. A sketch of the lookup follows, and the table below shows the columns produced for a sample fragment.
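To make the gazetteer lookup concrete, here is a minimal sketch (in Python) of how a greedy longest-match over a gazetteer can produce one B-/I-/O feature column per token. The function name, the matching strategy, and the single-tag simplification are illustrative assumptions, not EntityPro's actual implementation (for instance, the real PERSONS column also distinguishes B-NAM, B-SUR and TRIG values).

def gazetteer_column(tokens, entries, tag):
    """Greedy longest-match lookup: one B-/I-/O feature value per token.

    entries: a set of tuples of lower-cased tokens, e.g. {("vicenza",)};
    tag:     the label used for matches, e.g. "GPE" -> B-GPE / I-GPE.
    """
    max_len = max((len(e) for e in entries), default=0)
    column = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest entries first, so "Alta Vallagarina" beats "Alta".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in entries:
                column[i] = "B-" + tag
                for j in range(i + 1, i + n):
                    column[j] = "I-" + tag
                i += n
                break
        else:
            # No entry starts at this token: leave "O" and move on.
            i += 1
    return column

tokens = "difeso dall' avvocato Mario De Murgo di Vicenza".split()
print(gazetteer_column(tokens, {("vicenza",)}, "GPE"))
# -> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-GPE']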
Example: gazetteer feature columns for the fragment "difeso dall' avvocato Mario De Murgo di Vicenza ...":
Token     TOWNS  STOCK-MARKET  WIKI-GEO  PERSONS
difeso    O      O             O         O
dall'     O      O             O         O
avvocato  O      O             O         TRIG
Mario     O      O             O         B-NAM
De        O      O             O         B-SUR
Murgo     O      O             O         I-SUR
di        O      O             O         O
Vicenza   GPE    O             O         O

An Example of Feature Extraction
Token:  difeso  dall'  avvocato  Mario  De     Murgo  ,
POS:    VSP     ES     SS        SPN    E      SPN    XPW
NE tag: O       O      O         B-PER  I-PER  I-PER  O

Full feature rows (word, lower-cased word, prefixes, suffixes, orthographic information, gazetteer columns TOWNS / STOCK-MARKET / WIKI-GEO / PERSONS, COLLOC, POS, NE tag):
difeso    difeso    d di dif dife    o so eso feso     L N  O O O O      O      VSP  O
dall'     dall'     d da dal dall    ' l' ll' all'     L A  O O O O      O      ES   O
avvocato  avvocato  a av avv avvo    o to ato cato     L N  O O O TRIG   O      SS   O
Mario     mario     m ma mar mari    o io rio ario     C N  O O O B-NAM  O      SPN  B-PER
De        de        d e _nil_ _nil_  e de _nil_ _nil_  C N  O O O B-SUR  B-COL  E    I-PER
Murgo     murgo     m mu mur murg    o go rgo urgo     C N  O O O I-SUR  I-COL  SPN  I-PER

Static vs Dynamic Features
STATIC FEATURES, extracted for the current, previous and following words: WORD, POS, AFFIX, ORTHO, COLLOC, GAZET.
DYNAMIC FEATURES, decided dynamically during tagging: the tags of the 3 tokens preceding the current token.

Finding the best features
Baseline: WORD (both unchanged and lower-cased), AFFIX, ORTHO; window size: STAT +2,-2, DYNAMIC -2.
                    Pr     Re     F1
baseline            75.28  68.74  71.86
+POS                +1.31  +2.78  +2.11
+GAZET              +6.09  +7.93  +7.09
+COLLOC             +0.37  +0.54  +0.46
+CLUSTER_5-class    -0.45  -0.04  -0.23
+POS+GAZET+COLLOC   +6.56  +9.14  +7.95

Finding the best window-size
Given the best set of features (F1 = 79.81), we tried to improve the F1 measure by changing the window size.
STAT   DYN  Pr     Re     F1
+2,-2  -2   81.84  77.88  79.81
+3,-3  -3   +1.03  -1.17  -0.14
+6,-6  -6   +0.01  -3.14  -1.67
+1,-1  -1   +1.87  +2.46  +2.18
+1,-1  -3   +2.21  +3.04  +2.64
+1,-1  ?    -7.70  -0.72  -4.19

Evaluating the best algorithm: PKI vs. PKE
YamCha uses two implementations of SVMs, PKI and PKE:
- both are faster than the original SVMs;
- PKI produces the same accuracy as the original SVMs;
- PKE approximates the original SVM: slightly less accurate but faster.
      Pr     Re     F1     tokens/sec
PKI   84.05  80.92  82.45  1400
PKE   83.22  80.16  81.66  4200

Feature Contribution to the best configuration
                                                  Pr     Re     F1
Best configuration                                84.05  80.92  82.45
no POS                                            +0.27  -0.71  -0.24
no GAZET                                          -8.25  -8.40  -8.33
no COLLOC                                         +0.01  -0.13  -0.06
no GAZET, no COLLOC (i.e. no external resources)  -8.26  -8.49  -8.38
no ORTHO                                          -0.96  -3.22  -2.14
no AFFIX                                          -1.30  -2.51  -1.93

The learning curve
[Figure: learning curve]

Test Results
Test-Set  Pr     Re     F1
All       83.41  80.91  82.14
GPE       84.80  86.30  85.54
LOC       77.78  68.85  73.04
ORG       68.84  60.26  64.27
PER       91.62  92.63  92.12
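To clarify how such figures are obtained, here is a minimal sketch (in Python) of span-level precision, recall and F1 over IOB-tagged tokens. All names are illustrative assumptions; this is a simplified scorer, not the official Evalita evaluation script.

def spans(tags):
    """Extract (start, end, type) entity spans from an IOB tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:                       # the current span (if any) ends here
            if start is not None:
                out.append((start, i, etype))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return out

def prf(gold, pred, etype=None):
    """Span-level precision/recall/F1, optionally restricted to one NE type."""
    g = {s for s in spans(gold) if etype in (None, s[2])}
    p = {s for s in spans(pred) if etype in (None, s[2])}
    tp = len(g & p)                          # spans correct in both boundary and type
    pr = tp / len(p) if p else 0.0
    re = tp / len(g) if g else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1

gold = ["O", "O", "O", "O", "B-ORG", "O", "B-GPE", "I-GPE"]
pred = ["O", "O", "O", "O", "O",     "O", "B-GPE", "I-GPE"]
print(prf(gold, pred))  # -> (1.0, 0.5, 0.666...): the GPE span is found, the ORG span is missed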
Conclusion (1/2)
- A statistical approach to Named Entity Recognition for Italian, based on YamCha/SVMs.
- The results confirm that SVMs can deal with a large number of features and that they perform at the state of the art.
- Among the features, GAZETteers seem to be the most important one: 31% error reduction.
- A large context (large window-size values, e.g. +6,-6) involves a significant decrease in recall (data sparseness): about 3 points.

Conclusion (2/2)
- The F1 values for both PER (92.12) and GPE (85.54) appear rather good, comparing well with those obtained at CoNLL-2003 for English.
- Recognition of LOCs (F1: 73.04) seems more problematic: we suspect that the number of LOCs in the training data is insufficient for the learning algorithm.
- ORGs appear to be highly ambiguous.

Examples
Token        Gold   Prediction
è            O      O
stato        O      O
denunciato   O      O
dai          O      O
carabinieri  B-ORG  O
di           O      O
Vigolo       B-GPE  B-GPE
Vattaro      I-GPE  I-GPE

In "è stato fermato dai carabinieri ed in seguito ad un controllo", every token is tagged O in both the gold standard and the prediction.

Examples 2
Token        Gold   Prediction
Fontana      B-PER  B-PER
(            O      O
Villazzano   B-ORG  B-GPE
)            O      O
,            O      O
Campo        B-PER  B-PER
(            O      O
Baone        B-ORG  B-GPE
)            O      O
,            O      O
Rao          B-PER  B-PER
(            O      O
Alta         B-ORG  B-ORG
Vallagarina  I-ORG  I-ORG
)            O      O
.            O      O

Token      Gold   Prediction
dovrà      O      O
dare       O      O
a          O      O
via        B-ORG  B-LOC
Segantini  I-ORG  I-LOC
un         O      O
ruolo      O      O
diverso    O      O

Appendix A
Test-Set (without external resources)
Test-Set  Pr     Re     F1
All       75.79  72.43  74.07
GPE       78.56  76.51  77.53
LOC       81.08  49.18  61.22
ORG       57.09  52.28  54.58
PER       85.71  85.50  85.60

EntityPro
EntityPro is a system for Named Entity Recognition (NER) that uses YamCha as its Support Vector Machine (SVM) learning environment. YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open source text chunker. EntityPro exploits a rich set of linguistic features, such as part of speech, orthographic features and proper name gazetteers. The system is part of TextPro, a suite of NLP tools developed at FBK-irst.
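For reference, the maximal-margin construction summarized on the Support Vector Machines slide can be written in its standard hard-margin form (a textbook formulation, not specific to EntityPro):

% Given training pairs (x_i, y_i) with y_i in {-1, +1}, the separating
% hyperplane w . x + b = 0 lies midway between the two parallel hyperplanes
% w . x + b = +1 and w . x + b = -1; their distance 2 / ||w|| is maximized by:
\begin{aligned}
  \min_{w,\,b} \quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
  \text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, \ell .
\end{aligned}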