APPLYING WALU TO ANNOTATE NAMED
ENTITIES IN ITALIAN TEXTS
- shared task contribution -
MARC RÖSSLER* ANDREAS WAGNER*
FELIX JUNGERMANN+ WOLFGANG HOEPPNER*
* University of Duisburg-Essen
+ University of Dortmund
Outline
• WALU – a tool for learning and annotating Named
Entities
• The feature set used for the learning algorithm
• SVM-HMM – the learning algorithm
• Conclusion
WALU - WIKINGER Annotations- und Lernumgebung (WIKINGER annotation and learning environment)
eScience project WIKINGER - WIKI Next Generation
Enhanced Repository
Semantic WIKIs as knowledge platforms for scientific
communities
NLP techniques, in particular Information Extraction, are applied
to a large body of scientific literature to create an initial
semantic network
Tools for NER and Relation Discovery are required that
adapt to new languages and domains
WALU - WIKINGER Annotations- und Lernumgebung
NER for eScience – WALU provides
a convenient GUI for an efficient knowledge exchange
between developer and domain expert ("example-based communication")
routines and data structures for the efficient setup and
evaluation of recognition components
visualizations for the "qualitative" evaluation and
debugging
Features – data-oriented
Knowledge-poor features
Five-word window (the word under examination +/- two tokens):
Word surface features, indicating e.g. whether a word
ends with a dot or starts with an uppercase letter
The string representation of the token
The substrings of length 3 (character trigrams) of the token
(Frascati -> "fra", "ras", "asc", "sca", "cat", "ati")
This data-oriented generalization facilitates the adaptivity of
the approach
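The knowledge-poor features listed above can be sketched roughly as follows. This is a minimal illustration, not the WALU implementation: the function names, the feature-name format, and the decision to lowercase tokens before extracting substrings are assumptions made for this example.

```python
def surface_features(token):
    """Word surface features: initial uppercase letter, trailing dot."""
    return {
        "starts_upper": token[:1].isupper(),
        "ends_with_dot": token.endswith("."),
    }


def char_trigrams(token):
    """All substrings of length 3 of the lowercased token."""
    lowered = token.lower()
    return [lowered[i:i + 3] for i in range(len(lowered) - 2)]


def window_features(tokens, index, size=2):
    """Sparse binary features for the token at `index` and +/- `size` neighbours."""
    features = {}
    for offset in range(-size, size + 1):
        pos = index + offset
        if pos < 0 or pos >= len(tokens):
            # Position falls outside the sentence.
            features[f"{offset}:pad"] = True
            continue
        tok = tokens[pos]
        features[f"{offset}:word={tok}"] = True
        for name, value in surface_features(tok).items():
            if value:
                features[f"{offset}:{name}"] = True
        for tri in char_trigrams(tok):
            features[f"{offset}:tri={tri}"] = True
    return features


example = window_features(["Vino", "di", "Frascati", "."], index=2)
```

Prefixing each feature with its window offset keeps the same trigram at different positions distinct, which is one common way to encode a token window for a linear model.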
Features – WIKIPEDIA-based
WIKIPEDIA-category features
Linked Text (within articles)
Articles
…
Frascati (comune)
Frascati (vino)
…
Rugby Frascati
Sede suburbicaria di Frascati
Fraschetta
Frascati
…
Categories
Comuni della regione Lazio stub
Comuni della provincia di Roma
Castelli Romani
Vini DOC della provincia di Roma
Vini DOC e DOCG prodotti con uva Trebbiano Toscano
…
"Frascati " would produce five category-features
Overall, approximately 8,000 category features
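The lookup from linked text to articles to categories might look like the sketch below. The two dictionaries are hypothetical stand-ins for tables extracted from a WIKIPEDIA dump, and the assignment of the five categories to the two articles is an assumption made for illustration; the slides list only the articles and categories themselves.

```python
# Hypothetical lookup tables; in practice these would be built from a
# WIKIPEDIA dump (link anchor texts, link targets, article categories).
LINK_TEXT_TO_ARTICLES = {
    "Frascati": ["Frascati (comune)", "Frascati (vino)"],
}

# Assumed article-to-category assignment, for illustration only.
ARTICLE_TO_CATEGORIES = {
    "Frascati (comune)": [
        "Comuni della regione Lazio stub",
        "Comuni della provincia di Roma",
        "Castelli Romani",
    ],
    "Frascati (vino)": [
        "Vini DOC della provincia di Roma",
        "Vini DOC e DOCG prodotti con uva Trebbiano Toscano",
    ],
}


def category_features(token):
    """One feature per category of any article the token is used to link to."""
    categories = set()
    for article in LINK_TEXT_TO_ARTICLES.get(token, []):
        categories.update(ARTICLE_TO_CATEGORIES.get(article, []))
    return sorted(f"wikicat={c}" for c in categories)


frascati_features = category_features("Frascati")
```

Tokens that never occur as link text simply receive no category features, so the lookup adds information without affecting the other feature groups.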
SVM-HMM – the learning algorithm
SVM-HMM is an implementation of structural SVMs for
sequence tagging
Given an observed input sequence x = (x1, ..., xl) of feature
vectors x1, ..., xl, the model predicts a tag sequence
y = (y1, ..., yl) using a linear discriminant function
Combines the SVM's ability to handle large feature sets
efficiently with the ability to model sequences
http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html
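To illustrate how a tag sequence is predicted from a linear discriminant function, here is a minimal Viterbi decoding sketch. It assumes a first-order model in which the score of a tag sequence is the sum of per-token emission scores and tag-to-tag transition scores; the weight dictionaries and tag names are hypothetical, and the sketch covers only prediction, not the structural SVM training performed by SVM-HMM.

```python
def viterbi(sequence, tags, emission_w, transition_w):
    """Return the highest-scoring tag sequence under a linear model.

    sequence:     list of sparse feature dicts {feature_name: value}
    emission_w:   {(tag, feature_name): weight}
    transition_w: {(previous_tag, tag): weight}
    """
    def emit(x, tag):
        # Linear emission score: dot product of features and tag weights.
        return sum(v * emission_w.get((tag, f), 0.0) for f, v in x.items())

    if not sequence:
        return []

    best = [{t: emit(sequence[0], t) for t in tags}]
    back = []
    for x in sequence[1:]:
        scores, pointers = {}, {}
        for t in tags:
            # Best predecessor for tag t at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transition_w.get((p, t), 0.0))
            scores[t] = best[-1][prev] + transition_w.get((prev, t), 0.0) + emit(x, t)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


toy_tags = ["O", "B-GPE"]
toy_emission = {("O", "word=a"): 1.0, ("B-GPE", "starts_upper"): 2.0}
toy_path = viterbi([{"word=a": 1}, {"starts_upper": 1}], toy_tags, toy_emission, {})
```

Because the score decomposes over adjacent tag pairs, the best sequence is found in time linear in the sequence length and quadratic in the number of tags.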
Evaluation

Features                                         Prec (%)   Rec (%)   F-Meas (%)
Wiki-Cat, POS, ReadAgain                         71.62      72.94     72.27
w/o Wiki-Cat                                     68.54      69.97     69.25
w/o ReadAgain                                    73.28      70.62     71.93
w/o POS                                          71.94      72.23     72.09
Basic features only (word, surface, substrings)  69.69      65.74     67.66

Evaluation of the feature sets
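The F-Meas column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A small sketch, assuming that definition (the function name is chosen for this example):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Full feature set (Wiki-Cat, POS, ReadAgain) from the table.
full_f = round(f_measure(71.62, 72.94), 2)
# Basic features only.
basic_f = round(f_measure(69.69, 65.74), 2)
```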
Evaluation

Category   Prec (%)   Rec (%)   F-Meas (%)
overall    71.62      72.94     72.27
GPE        76.28      80.62     78.39
LOC        67.07      45.08     53.92
ORG        50.40      49.39     49.89
PER        82.58      86.35     84.42

Evaluation of the different categories
Conclusion
The adaptivity of our approach to NER was demonstrated
SVM-HMM is an algorithm suitable for NER
WIKIPEDIA is a knowledge source with great potential for
NER