EVALITA 2007
Frascati, September 10th 2007
ENTITYPRO
EXPLOITING SVM FOR ITALIAN
NAMED ENTITY RECOGNITION
Roberto Zanoli and Emanuele Pianta
TextPro
A suite of modular NLP tools developed at FBK-irst







TokenPro: tokenization
MorphoPro: morphological analysis
TagPro: Part-of-Speech tagging
LemmaPro: lemmatization
EntityPro: Named Entity recognition
ChunkPro: phrase chunking
SentencePro: sentence splitting
• Architecture designed to be efficient, scalable and robust
• Cross-platform: Unix / Linux / Windows / Mac OS X
• Multilingual models
• All modules integrated and accessible through a unified command-line interface
[Figure: architecture diagram. Components shown: training data, test data, TagPro, dictionary, feature extraction (ortho, prefix, suffix, dictionary, collocation bigram), feature selection, controller, and YamCha (learning, models, classification).]
We used YamCha, an SVM-based machine learning environment, to
build EntityPro, a system that exploits a rich set of linguistic features, such
as orthographic features, prefixes and suffixes, and occurrence
in proper-noun gazetteers.
EntityPro’s architecture
YamCha
• Created as a generic, customizable, open source text chunker
• Can be adapted to many other tag-oriented NLP tasks
• Uses a state-of-the-art machine learning algorithm (SVM)
• Can redefine:
  • context (window size)
  • parsing direction (forward/backward)
  • algorithm for the multi-class problem (pairwise / one vs. rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as a C/C++ library
Support Vector Machines
Support vector machines are based on the Structural Risk Minimization principle
(Vladimir N. Vapnik, 1995) from computational learning theory.
They map input vectors to a higher-dimensional space in which a maximum-margin
separating hyperplane is constructed. Two parallel hyperplanes are constructed,
one on each side of the hyperplane that separates the data; the separating
hyperplane is the one that maximizes the distance between these two parallel
hyperplanes.
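As a small worked illustration of the margin described above: if the separator is written as w·x + b = 0 and the two parallel hyperplanes as w·x + b = +1 and w·x + b = -1, the distance between them is 2/||w||, so maximizing the margin amounts to minimizing ||w||. The sketch below computes that width and the side of the separator a point falls on. The weight vector, bias and point are made-up values, and the function names are illustrative only.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def decision(w, b, x):
    """Signed score of x; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = (3.0, 4.0), -1.0                  # hypothetical linear separator, ||w|| = 5
print(margin_width(w))                   # 0.4
print(decision(w, b, (1.0, 1.0)) > 0)    # True: the point lies on the positive side
```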
YamCha:
Setting Window Size
Default setting is "F:-2..2:0.. T:-2..-1".
The window setting can be customized
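To make the notation concrete: in YamCha's feature syntax, `F:-2..2:0..` selects the static feature columns (from column 0 onward) of the tokens at relative positions -2 to +2, and `T:-2..-1` selects the tags already assigned to the two preceding tokens. The sketch below mimics that selection in plain Python. It illustrates the window semantics only and is not YamCha's code; the function name, the feature-string format and the padding symbol are assumptions.

```python
def window_features(rows, tags, i, static=(-2, 2), dynamic=(-2, -1), pad="_PAD_"):
    """Collect the features visible from token i.

    rows: one list of static feature columns per token.
    tags: tags already decided for tokens 0..i-1 (forward parsing).
    """
    feats = []
    for off in range(static[0], static[1] + 1):
        j = i + off
        cols = rows[j] if 0 <= j < len(rows) else [pad]
        for c, value in enumerate(cols):
            feats.append(f"F:{off}:{c}={value}")
    for off in range(dynamic[0], dynamic[1] + 1):
        j = i + off
        tag = tags[j] if 0 <= j < len(tags) else pad
        feats.append(f"T:{off}={tag}")
    return feats

# Two static columns per token (word, POS); values taken from the deck's example.
rows = [["avvocato", "SS"], ["Mario", "SPN"], ["De", "E"], ["Murgo", "SPN"]]
tags = ["O", "B-PER"]        # tags decided so far, for tokens 0 and 1
print(window_features(rows, tags, 2, static=(-1, 1), dynamic=(-2, -1)))
```

Widening `static` or `dynamic` adds more feature strings per token, which is the trade-off the window-size experiments later in the deck explore.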
Training and Tuning Set
Evalita development set randomly split into two parts:
• training: 92,241 tokens
• tuning: 40,348 tokens
FEATURES (1/3)
For each running word:
WORD: the word itself (both unchanged and lower-cased)
    e.g. Casa, casa
POS: the part of speech of the word (as produced by TagPro)
    e.g. Oggi: SS (singular noun)
AFFIX: prefixes/suffixes (1, 2, 3 or 4 characters at the start/end of the word)
    e.g. Oggi: {o, og, ogg, oggi; i, gi, ggi, oggi}
ORTHOgraphic information (e.g. capitalization, hyphenation)
    e.g. Oggi: C (capitalized); oggi: L (lower-cased)
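The WORD, AFFIX and ORTHO features above can be sketched in a few lines of Python. This is a minimal illustration, not EntityPro's code: the function names are hypothetical, the `_nil_` filler for words shorter than the affix length follows the feature rows shown later in the deck, and only the two orthographic codes given on the slide (C and L) are implemented, with every other shape mapped to a placeholder `O` as an assumption.

```python
NIL = "_nil_"   # filler used when the word is shorter than the affix length

def affixes(word, max_len=4):
    """Prefixes, then suffixes, of length 1..max_len of the lower-cased word."""
    w = word.lower()
    prefixes = [w[:n] if n <= len(w) else NIL for n in range(1, max_len + 1)]
    suffixes = [w[-n:] if n <= len(w) else NIL for n in range(1, max_len + 1)]
    return prefixes + suffixes

def ortho(word):
    """C = capitalized, L = lower-cased; anything else becomes 'O' (assumption)."""
    if word[:1].isupper():
        return "C"
    if word.islower():
        return "L"
    return "O"

def word_features(word):
    """WORD (unchanged and lower-cased), AFFIX and ORTHO columns for one token."""
    return [word, word.lower()] + affixes(word) + [ortho(word)]

print(word_features("Oggi"))
# ['Oggi', 'oggi', 'o', 'og', 'ogg', 'oggi', 'i', 'gi', 'ggi', 'oggi', 'C']
```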
FEATURES (2/3)
• COLLOCation bigrams (36,000, from Italian newspapers, ranked by mutual information (MI) values)
e.g.
l’          O
avvocato    O
di          O
Rossi       O
Carlo       B-COL
Taormina    I-COL
ha          O
…….
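The B-COL / I-COL marking in the example above can be reproduced with a simple lookup. The sketch below assumes that the collocation list is a set of lower-cased token pairs and that matches are taken left to right without overlap; both choices, the function name and the one-entry list are illustrative assumptions rather than a description of EntityPro's implementation.

```python
def tag_collocations(tokens, bigrams):
    """Mark tokens that form a listed bigram with B-COL / I-COL, others with O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens) - 1:
        if (tokens[i].lower(), tokens[i + 1].lower()) in bigrams:
            tags[i], tags[i + 1] = "B-COL", "I-COL"
            i += 2          # skip past the matched pair
        else:
            i += 1
    return tags

bigrams = {("carlo", "taormina")}      # hypothetical one-entry collocation list
tokens = ["l'", "avvocato", "di", "Rossi", "Carlo", "Taormina", "ha"]
print(tag_collocations(tokens, bigrams))
# ['O', 'O', 'O', 'O', 'B-COL', 'I-COL', 'O']
```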
FEATURES (3/3): GAZETTeers
• TOWNS: world (main), Italian (comuni) and Trentino (frazioni) towns (12,000, from various internet sites)
• STOCK-MARKET: Italian and American stock market organizations (5,000, from stock market sites)
• WIKI-GEO: Wikipedia geographical locations (3,200)
• PERSONS: person proper names or titles (154,000; Italian phone book, Wikipedia)
Token       TOWNS   STOCK-MARKET   WIKI-GEO   PERSONS
difeso      O       O              O          O
dall'       O       O              O          O
avvocato    O       O              O          TRIG
Mario       O       O              O          B-NAM
De          O       O              O          B-SUR
Murgo       O       O              O          I-SUR
di          O       O              O          O
Vicenza     GPE     O              O          O
……………..
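A gazetteer feature of this kind can be sketched as a greedy longest-match lookup, run once per gazetteer so that each list yields its own column. Everything below is a hedged illustration: the function name, the tiny dictionaries, the choice to store "De Murgo" as a two-token surname, and the convention that TRIG and GPE are emitted without B-/I- prefixes are assumptions made to reproduce the example above, not documented EntityPro behavior.

```python
def tag_with_gazetteer(tokens, entries, bare=("TRIG", "GPE"), max_len=3):
    """Greedy longest-match lookup of token sequences in one gazetteer.

    entries maps a tuple of lower-cased tokens to a label. Labels listed in
    `bare` are emitted as-is; all others get B-/I- prefixes (assumed convention).
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            label = entries.get(key)
            if label is not None:
                for k in range(n):
                    if label in bare:
                        tags[i + k] = label
                    else:
                        tags[i + k] = ("B-" if k == 0 else "I-") + label
                i += n
                break
        else:               # no entry starts at position i
            i += 1
    return tags

# Hypothetical gazetteer fragments, just large enough for the example sentence.
persons = {("avvocato",): "TRIG", ("mario",): "NAM", ("de", "murgo"): "SUR"}
towns = {("vicenza",): "GPE"}
tokens = ["difeso", "dall'", "avvocato", "Mario", "De", "Murgo", "di", "Vicenza"]
print(tag_with_gazetteer(tokens, persons))
print(tag_with_gazetteer(tokens, towns))
```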
An Example of
Feature Extraction
Token       POS    NE tag
difeso      VSP    O
dall'       ES     O
avvocato    SS     O
Mario       SPN    B-PER
De          E      I-PER
Murgo       SPN    I-PER
,           XPW    O
difeso difeso d di dif dife o so eso feso L N O O O O O VSP O
dall' dall' d da dal dall ' l' ll' all' L A O O O O O ES O
avvocato avvocato a av avv avvo o to ato cato L N O O O TRIG O SS O
Mario mario m ma mar mari o io rio ario C N O O O B-NAM O SPN B-PER
De de d e _nil_ _nil_ e de _nil_ _nil_ C N O O O B-SUR B-COL E I-PER
Murgo murgo m mu mur murg o go rgo urgo C N O O O I-SUR I-COL SPN I-PER
Static vs Dynamic Features
• STATIC FEATURES
  • extracted for the current, previous and following word
  • WORD, POS, AFFIX, ORTHO, COLLOC, GAZET
• DYNAMIC FEATURES
  • decided dynamically during tagging
  • tags of the 3 tokens preceding the current token
Finding the best features
Baseline:
WORD (both unchanged and lower-cased)
AFFIX
ORTHOgraphic
window-size: STAT: +2,-2 DYNAMIC: -2
                     Pr      Re      F1
baseline             75.28   68.74   71.86
+POS                 +1.31   +2.78   +2.11
+GAZET               +6.09   +7.93   +7.09
+COLLOC              +0.37   +0.54   +0.46
+CLUSTER_5-class     -0.45   -0.04   -0.23
+POS+GAZET+COLLOC    +6.56   +9.14   +7.95
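The F1 column is the harmonic mean of precision and recall, so the rows above can be checked with a short calculation. The helper below is a generic sketch (the function name is illustrative); it recomputes the baseline F1 and the F1 of the +POS+GAZET+COLLOC row after adding that row's deltas to the baseline precision and recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Baseline row: Pr = 75.28, Re = 68.74
print(round(f1(75.28, 68.74), 2))                   # 71.86

# +POS+GAZET+COLLOC row: baseline plus the reported deltas (+6.56, +9.14)
print(round(f1(75.28 + 6.56, 68.74 + 9.14), 2))     # 79.81
```

The second value equals the baseline F1 plus the reported +7.95, so the table is internally consistent.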
Finding the best window-size
Given the best set of features (F1 = 79.81),
we tried to improve the F1 measure by changing the window size.
STAT     DYN            Pr      Re      F1
+2,-2    -2             81.84   77.88   79.81
+3,-3    -3             +1.03   -1.17   -0.14
+6,-6    -6             +0.01   -3.14   -1.67
+1,-1    -1             +1.87   +2.46   +2.18
+1,-1    -3             +2.21   +3.04   +2.64
+1,-1    (not given)    -7.70   -0.72   -4.19
Evaluating the best algorithm
PKI vs. PKE
YamCha uses two implementations of SVMs: PKI and PKE.
• Both are faster than the original SVMs.
• PKI produces the same accuracy as the original SVMs.
• PKE approximates the original SVMs: slightly less accurate, but faster.

       Pr      Re      F1      tokens/sec
PKI    84.05   80.92   82.45   1400
PKE    83.22   80.16   81.66   4200
Feature Contribution to the best configuration
                                                    Pr      Re      F1
Best Configuration                                  84.05   80.92   82.45
no POS                                              +0.27   -0.71   -0.24
no GAZET                                            -8.25   -8.40   -8.33
no COLLOC                                           +0.01   -0.13   -0.06
no GAZET, no COLLOC (i.e. no external resources)    -8.26   -8.49   -8.38
no ORTHO                                            -0.96   -3.22   -2.14
no AFFIX                                            -1.30   -2.51   -1.93
The learning curve
Test Results
Test-Set   Pr      Re      F1
All        83.41   80.91   82.14
GPE        84.80   86.30   85.54
LOC        77.78   68.85   73.04
ORG        68.84   60.26   64.27
PER        91.62   92.63   92.12
Conclusion (1/2)
• A statistical approach to Named Entity Recognition for Italian, based on YamCha/SVMs.
• Results confirm that SVMs can deal with a large number of features and that they perform at the state of the art.
• Among the features, GAZETteers seem to be the most important: 31% error reduction.
• A large context (large window sizes, e.g. +6,-6) causes a significant decrease in recall, about 3 points (data sparseness).
Conclusion (2/2)
• F1 values for both PER (92.12) and GPE (85.54) appear rather good, comparing well with those obtained in CoNLL-2003 for English.
• Recognition of LOCs (F1: 73.04) seems more problematic: we suspect that the number of LOCs in the training data is insufficient for the learning algorithm.
• ORGs appear to be highly ambiguous.
Examples
Token         Gold    Prediction
è             O       O
stato         O       O
denunciato    O       O
dai           O       O
carabinieri   B-ORG   O
di            O       O
Vigolo        B-GPE   B-GPE
Vattaro       I-GPE   I-GPE

Token         Gold    Prediction
è             O       O
stato         O       O
fermato       O       O
dai           O       O
carabinieri   O       O
ed            O       O
in            O       O
seguito       O       O
ad            O       O
un            O       O
controllo     O       O
Examples 2
Token         Gold    Prediction
Fontana       B-PER   B-PER
(             O       O
Villazzano    B-ORG   B-GPE
)             O       O
,             O       O
Campo         B-PER   B-PER
(             O       O
Baone         B-ORG   B-GPE
)             O       O
,             O       O
Rao           B-PER   B-PER
(             O       O
Alta          B-ORG   B-ORG
Vallagarina   I-ORG   I-ORG
)             O       O
.             O       O

Token         Gold    Prediction
dovrà         O       O
dare          O       O
a             O       O
via           B-ORG   B-LOC
Segantini     I-ORG   I-LOC
un            O       O
ruolo         O       O
diverso       O       O
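To see how such gold/prediction pairs turn into precision and recall, the sketch below extracts entity spans from BIO tags and scores them. It assumes exact-match scoring (an entity counts as correct only when both its span and its type match), in the style of the CoNLL shared-task scorer; the deck itself does not spell out the scoring procedure, and the function names are illustrative. The data are the gold and predicted tags of the "Fontana ( Villazzano ) ..." example.

```python
def entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel closes the last span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans

def precision_recall(gold, pred):
    """Entity-level precision and recall under exact span-and-type matching."""
    g, p = entities(gold), entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return precision, recall

gold = "B-PER O B-ORG O O B-PER O B-ORG O O B-PER O B-ORG I-ORG O O".split()
pred = "B-PER O B-GPE O O B-PER O B-GPE O O B-PER O B-ORG I-ORG O O".split()
print(precision_recall(gold, pred))   # 4 of 6 entities match in each direction
```

Under this scoring, the two ORG entities predicted as GPE count both as missed gold entities and as spurious predictions.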
Appendix A
Test-Set (without external resources)
       Pr      Re      F1
All    75.79   72.43   74.07
GPE    78.56   76.51   77.53
LOC    81.08   49.18   61.22
ORG    57.09   52.28   54.58
PER    85.71   85.50   85.60
EntityPro
• EntityPro is a system for Named Entity Recognition (NER) built on YamCha, which provides the Support Vector Machine (SVM) implementation.
• YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open source text chunker.
• EntityPro can exploit a rich set of linguistic features, such as part of speech, orthographic features and proper-name gazetteers.
• The system is part of TextPro, a suite of NLP tools developed at FBK-irst.