APPLYING WALU TO ANNOTATE NAMED ENTITIES IN ITALIAN TEXTS
- shared task contribution -

MARC RÖSSLER*, ANDREAS WAGNER*, FELIX JUNGERMANN+, WOLFGANG HOEPPNER*
* University of Duisburg-Essen
+ University of Dortmund


Outline

• WALU - a tool for learning and annotating Named Entities
• The feature set used for the learning algorithm
• SVM-HMM - the learning algorithm
• Conclusion


WALU - WIKINGER Annotations- und Lernumgebung (WIKINGER annotation and learning environment)

eScience project WIKINGER - WIKI Next Generation Enhanced Repository
• Semantic wikis as knowledge platforms for scientific communities
• NLP techniques, in particular Information Extraction, are applied to a large amount of scientific literature to create an initial semantic network
• Tools for NER and Relation Discovery are required that are adaptive with respect to language and domain

NER for eScience - WALU provides
• a convenient GUI for an efficient knowledge exchange between developer and domain expert ("example-based communication")
• routines and data structures for the efficient setup and evaluation of recognition components
• visualizations for the "qualitative" evaluation and debugging


Features - data oriented

Knowledge-poor features
• Five-word window (word to examine +/- two tokens), with the following features per token:
    • Word surface features, indicating e.g. whether a word ends with a dot or starts with an uppercase letter
    • The string representation of the token
    • The substrings of length 3 representing the token (Frascati -> "fra", "ras", "sca", "cat", "ati")
• This data-oriented generalization facilitates the adaptivity of the approach


Features - WIKIPEDIA-based

WIKIPEDIA-category features

[Figure: linked text within articles points to WIKIPEDIA articles, which in turn belong to categories; example "Frascati"]

Linked text (within articles) / Articles:
    … Frascati (comune), Frascati (vino), … Rugby Frascati, Sede suburbicaria di Frascati, Fraschetta, Frascati, …
Categories:
    Comuni della regione Lazio stub
    Comuni della provincia di Roma
    Castelli Romani
    Vini DOC della provincia di Roma
    Vini DOC e DOCG prodotti con uva Trebbiano Toscano
    …

• "Frascati" would produce five category features
• Overall, approximately 8000 category features


SVM-HMM - the learning algorithm

• SVM-HMM is an implementation of structural SVMs for sequence tagging
• Given an observed input sequence x = (x_1 ... x_l) of feature vectors x_1 ... x_l, the model predicts a tag sequence y = (y_1 ... y_l) using a linear discriminant function
• It combines the SVM's ability to deal efficiently with large feature sets with the ability to model sequences
• http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html


Evaluation

Features                                          Prec     Rec      F-Meas
Wiki-Cat, POS, ReadAgain                          71.62%   72.94%   72.27
w/o Wiki-Cat                                      68.54%   69.97%   69.25
w/o ReadAgain                                     73.28%   70.62%   71.93
w/o POS                                           71.94%   72.23%   72.09
Basic-Features only (word, surface, substrings)   69.69%   65.74%   67.66

Evaluation of the feature sets


Evaluation

Category   Prec     Rec      F-Meas
overall    71.62%   72.94%   72.27
GPE        76.28%   80.62%   78.39
LOC        67.07%   45.08%   53.92
ORG        50.40%   49.39%   49.89
PER        82.58%   86.35%   84.42

Evaluation of the different categories


Conclusion

• The adaptivity of our approach to NER was demonstrated
• SVM-HMM is an algorithm suitable for NER
• WIKIPEDIA is a knowledge source with great potential for NER
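
Illustration: the knowledge-poor features

The knowledge-poor features listed earlier (token string, surface cues, and substrings of length 3 over a window of +/- two tokens) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the WALU implementation: the function names, the feature-key format, and the "<PAD>" symbol for positions outside the sentence are all hypothetical. Note that a plain sliding window over "Frascati" also yields "asc", which the example in the feature list does not mention.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical padding symbol for window positions outside the sentence


def char_trigrams(token: str) -> List[str]:
    """Return all lowercased substrings of length 3 of a token, left to right."""
    lowered = token.lower()
    return [lowered[i:i + 3] for i in range(len(lowered) - 2)]


def surface_features(token: str) -> Dict[str, bool]:
    """Two surface cues named in the slides: initial uppercase and trailing dot."""
    return {
        "starts_upper": token[:1].isupper(),
        "ends_with_dot": token.endswith("."),
    }


def token_features(tokens: List[str], index: int, window: int = 2) -> Dict[str, object]:
    """Collect features for tokens[index] and its neighbours within +/- window."""
    features: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        token = tokens[pos] if 0 <= pos < len(tokens) else PAD
        prefix = f"w[{offset:+d}]"  # assumed key format, e.g. "w[-1]" or "w[+0]"
        features[f"{prefix}.string"] = token
        if token == PAD:
            continue  # no surface or substring features for padding
        for name, value in surface_features(token).items():
            features[f"{prefix}.{name}"] = value
        for trigram in char_trigrams(token):
            features[f"{prefix}.tri={trigram}"] = True
    return features


if __name__ == "__main__":
    sentence = ["Vino", "di", "Frascati", "."]
    for key, value in sorted(token_features(sentence, 2).items()):
        print(key, value)
```

Keying every feature by its window offset keeps the same substring at different positions distinct, which is one plausible way to hand such a window to a sequence learner that expects one feature vector per token.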