APPLYING WALU TO ANNOTATE NAMED
ENTITIES IN ITALIAN TEXTS
- shared task contribution -
MARC RÖSSLER* ANDREAS WAGNER*
FELIX JUNGERMANN+ WOLFGANG HOEPPNER*
* University of Duisburg-Essen
+University of Dortmund
Outline
• WALU – a tool for learning and annotating Named
Entities
• The feature set used for the learning algorithm
• SVM-HMM – the learning algorithm
• Conclusion
WALU - WIKINGER Annotations- und Lernumgebung (WIKINGER annotation and learning environment)

eScience project WIKINGER - WIKI Next Generation
Enhanced Repository

Semantic WIKIs as knowledge platforms for scientific
communities

NLP techniques, in particular Information Extraction, are applied
to a large body of scientific literature to create an initial
semantic network

Tools for NER and Relation Discovery are required that are
adaptive with respect to language and domain
WALU - WIKINGER Annotations- und Lernumgebung
NER for eScience – WALU provides

a convenient GUI for efficient knowledge exchange
between developer and domain expert ("example-based communication")

routines and data structures for the efficient setup and
evaluation of recognition components

visualizations for the "qualitative" evaluation and
debugging
Features – data oriented
Knowledge-poor features
Five-word window (the word under examination +/- two tokens):

Word surface features, indicating e.g. whether a word
ends with a dot or starts with an uppercase letter

The string representation of the token

The substrings of length 3 contained in the token
(Frascati -> "fra", "ras", "asc", "sca", "cat", "ati")
This data-oriented generalization makes the approach easier to
adapt to new languages and domains
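
The knowledge-poor features listed above can be sketched as follows. This is a minimal illustration, not the WALU implementation: the function names, the feature-string format, and the choice to emit all three feature types for every position in the window are assumptions made for this sketch.

```python
def surface_features(token):
    """Binary word-surface flags (the two examples named on the slide)."""
    return {
        "ends_with_dot": token.endswith("."),
        "starts_upper": token[:1].isupper(),
    }


def trigrams(token):
    """All lowercased substrings of length 3 contained in the token."""
    t = token.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]


def window_features(tokens, i, size=2):
    """Sparse features for tokens[i], drawn from a window of +/- `size` tokens.

    Each feature name is prefixed with its offset relative to position i.
    """
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # positions outside the sentence contribute nothing
        tok = tokens[j]
        feats[f"{offset}:word={tok}"] = True
        for name, value in surface_features(tok).items():
            if value:
                feats[f"{offset}:{name}"] = True
        for tri in trigrams(tok):
            feats[f"{offset}:tri={tri}"] = True
    return feats
```

For example, `window_features(["Vino", "di", "Frascati", "."], 2)` yields features such as `0:word=Frascati`, `0:starts_upper`, `0:tri=fra` and `1:ends_with_dot`.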
Features – WIKIPEDIA-based
WIKIPEDIA-category features
Linked Text (within articles)
Articles
…
Frascati (comune)
Frascati (vino)
…
Rugby Frascati
Sede suburbicaria di Frascati
Fraschetta
Frascati
…
Categories
Comuni della regione Lazio stub
Comuni della provincia di Roma
Castelli Romani
Vini DOC della provincia di Roma
Vini DOC e DOCG prodotti con uva Trebbiano Toscano
…

"Frascati" would produce five category features

Overall, approximately 8,000 category features
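
The lookup from linked text via articles to categories can be sketched as two dictionary lookups. This is a toy illustration only: the table names, the `wikicat=` prefix, and in particular the assignment of each category to one of the two articles are assumptions for this sketch; the real resource would be derived from WIKIPEDIA itself.

```python
# Hypothetical lookup tables built from the slide's example.
# Which category belongs to which article is ASSUMED here for illustration.
LINK_TEXT_TO_ARTICLES = {
    "Frascati": ["Frascati (comune)", "Frascati (vino)"],
}

ARTICLE_TO_CATEGORIES = {
    "Frascati (comune)": [
        "Comuni della regione Lazio stub",
        "Comuni della provincia di Roma",
        "Castelli Romani",
    ],
    "Frascati (vino)": [
        "Vini DOC della provincia di Roma",
        "Vini DOC e DOCG prodotti con uva Trebbiano Toscano",
    ],
}


def category_features(token):
    """Return one feature per category of every article the token links to."""
    feats = set()
    for article in LINK_TEXT_TO_ARTICLES.get(token, []):
        for category in ARTICLE_TO_CATEGORIES.get(article, []):
            feats.add("wikicat=" + category)
    return feats
```

With these toy tables, `category_features("Frascati")` returns five features, one per category, and a token that never occurs as link text returns an empty set.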
SVM-HMM – the learning algorithm

SVM-HMM is an implementation of structural SVMs for
sequence tagging

Given an observed input sequence x = (x_1 ... x_l) of feature
vectors, the model predicts a tag sequence
y = (y_1 ... y_l) using a linear discriminant function

Combines the SVM's ability to deal efficiently with
large feature sets with the ability to model sequences
http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html
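
Prediction with a linear discriminant function over tag sequences can be sketched with Viterbi decoding. The sketch assumes a first-order model in which the score of a tag sequence is the sum of per-token emission scores (feature vector times tag-specific weights) and tag-transition scores. The function name, the dictionary-based weight layout, and any weights passed in are hypothetical; this does not reproduce the SVM-HMM code, and training is not shown.

```python
def viterbi(xs, tags, emit_w, trans_w):
    """Return the tag sequence that maximises the summed linear scores.

    xs      : list of sparse feature vectors (dict: feature -> value)
    tags    : list of possible tags
    emit_w  : dict: (tag, feature) -> weight
    trans_w : dict: (previous tag, tag) -> weight
    """
    def emit(x, tag):
        return sum(v * emit_w.get((tag, f), 0.0) for f, v in x.items())

    if not xs:
        return []

    # best[t]: score of the best partial path ending in tag t
    best = {t: emit(xs[0], t) for t in tags}
    back = []  # one dict of back-pointers per position after the first
    for x in xs[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + trans_w.get((p, t), 0.0))
            new_best[t] = best[prev] + trans_w.get((prev, t), 0.0) + emit(x, t)
            pointers[t] = prev
        best = new_best
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
```

Because the score decomposes over adjacent tag pairs, the best sequence is found in time linear in the sequence length rather than by enumerating all tag sequences.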
Evaluation

Features                                         Prec     Rec      F-Meas
Wiki-Cat, POS, ReadAgain                         71.62%   72.94%   72.27
w/o Wiki-Cat                                     68.54%   69.97%   69.25
w/o ReadAgain                                    73.28%   70.62%   71.93
w/o POS                                          71.94%   72.23%   72.09
Basic features only (word, surface, substrings)  69.69%   65.74%   67.66

Evaluation of the feature sets
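
The reported F-measure values are consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. A minimal sketch for recomputing them (the function name is an assumption of this sketch):

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Full feature set (Wiki-Cat, POS, ReadAgain)
print(round(f_measure(71.62, 72.94), 2))  # 72.27
# Basic features only
print(round(f_measure(69.69, 65.74), 2))  # 67.66
```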
Evaluation

Category   Prec     Rec      F-Meas
overall    71.62%   72.94%   72.27
GPE        76.28%   80.62%   78.39
LOC        67.07%   45.08%   53.92
ORG        50.40%   49.39%   49.89
PER        82.58%   86.35%   84.42

Evaluation of the different categories
Conclusion

The adaptivity of our approach to NER was demonstrated

SVM-HMM is an algorithm suitable for NER

WIKIPEDIA is a knowledge source with great potential for
NER