UNIBA @ EVALITA 2009
Lexical Substitution Task
Pierpaolo Basile and Giovanni Semeraro
{[email protected] and [email protected]}
Department of Computer Science
University of Bari “Aldo Moro” (ITALY)
EVALITA 2009, Reggio Emilia (ITALY), 12 December 2009
P. Basile ([email protected])
UNIBA
1 / 14
Outline
1 Methods for Lexical Substitution Task
  WSD algorithm: JIGSAW
  Lexical Substitution exploiting a large corpus
2 Evaluation
3 Conclusions
Methods for Lexical Substitution Task
A knowledge-based WSD algorithm (JIGSAW) which exploits ItalWordNet [1] as knowledge-base
An unsupervised approach which relies on a large corpus in order to find the different contexts in which words are used
Two lexical resources of candidate synonyms:
  1 ItalWordNet
  2 Il dizionario dei sinonimi e contrari, De Mauro Paravia
[1] Roventini A. et al., ItalWordNet: a Large Semantic Database for Italian. In Proceedings of LREC 2000, Volume II pages 783-790, 2000.
WSD algorithm: JIGSAW
JIGSAW
Knowledge-based WSD algorithm
Three different strategies: one for nouns, one for verbs, and one for adjectives/adverbs
Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the PoS tag of the target word
Italian/English WSD algorithm
  Italian: EVALITA 2007 All-Words WSD Task [a]
  English: SemEval-1 - Evaluating WSD on cross-language information retrieval [b]
[a] P. Basile and G. Semeraro. JIGSAW: An algorithm for Word Sense Disambiguation. Intelligenza Artificiale, 4(2):53-54, 2007.
[b] P. Basile et al., JIGSAW algorithm for Word Sense Disambiguation. In SemEval-2007, pages 398-401. ACL press, 2007.
JIGSAWz (z = Zipf distribution)
Taking into account the synset rank distribution:
$$ f(k; N; s) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}} \qquad (1) $$
where:
N is the number of word meanings
k is the word meaning rank (we adopt the ItalWordNet synset rank)
s is the value of the exponent characterizing the distribution (approximated using Pearson's chi-square (χ²) test)
Compute the frequency of the word meaning rank in MultiSemCor [2]
[2] L. Bentivogli and E. Pianta. Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus. Natural Language Engineering, 11(03):247-261, 2005.
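Equation (1) and the estimation of the exponent s can be sketched in Python. This is only a minimal illustration, not the authors' implementation: the function names, the grid of candidate exponents, and the observed counts are hypothetical, and picking s by minimising Pearson's chi-square statistic over a grid is an assumption about how the approximation might be carried out.

```python
def zipf_pmf(k, n_senses, s):
    """Probability of the sense with rank k among n_senses, as in Eq. (1)."""
    norm = sum(1.0 / n ** s for n in range(1, n_senses + 1))
    return (1.0 / k ** s) / norm


def chi_square(observed, s):
    """Pearson's chi-square statistic between observed rank counts
    (observed[0] is the count for rank 1) and a Zipf model with exponent s."""
    total = sum(observed)
    n_senses = len(observed)
    stat = 0.0
    for k, obs in enumerate(observed, start=1):
        expected = total * zipf_pmf(k, n_senses, s)
        stat += (obs - expected) ** 2 / expected
    return stat


def fit_exponent(observed, candidates):
    """Return the candidate exponent with the smallest chi-square statistic
    (assumed fitting procedure, see the lead-in)."""
    return min(candidates, key=lambda s: chi_square(observed, s))


# Hypothetical rank counts, standing in for counts collected from a
# sense-annotated corpus such as MultiSemCor.
observed_counts = [600, 300, 200, 150]
best_s = fit_exponent(observed_counts, [0.5, 1.0, 1.5, 2.0])
```

With an exponent fitted this way, `zipf_pmf(k, N, best_s)` gives a prior for the sense of rank k that a disambiguation score could be weighted by.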
JIGSAW performance
JIGSAW at EVALITA 2007 WSD All-Words Task
system    P      R      A(%)   F
JIGSAW    0.598  0.567  94.7   0.582
JIGSAWz   0.639  0.606  94.7   0.622
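The slides do not define F. Assuming it is the usual harmonic mean of precision and recall, which is consistent with the reported values, the F column can be recomputed as a quick check. The function name below is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R as reported for EVALITA 2007.
f_jigsaw = round(f_measure(0.598, 0.567), 3)
f_jigsawz = round(f_measure(0.639, 0.606), 3)
```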
Lexical Substitution exploiting a large corpus
Lexical Substitution exploiting a large corpus 1/3
Idea: index a large corpus and then find phrases in which synonyms of the target word occur in the same context
ItWaC (Italian Web Corpus) [3]: a large corpus of about 1,900,000 documents built automatically from the Web
Ingredients:
  Apache Lucene to index and search ItWaC (with term positions)
  A lexical resource which provides a list of candidate synonyms:
    ItalWordNet
    Il dizionario dei sinonimi e contrari, De Mauro Paravia
[3] M. Baroni and A. Kilgarriff. Large linguistically-processed Web corpora for multiple languages. In EACL 2006, pages 87-90, 2006.
Lexical Substitution exploiting a large corpus 2/3
Strategy
1 Retrieve the list of possible synonyms CS
2 Rank the candidate synonyms list CS exploiting the corpus:
  for each synonym s_i ∈ CS
  1 search the corpus for phrases in which the synonym occurs in the same context (using 3-grams)
  2 IF no results are retrieved
    THEN the slop [a] is incremented by one until slop is equal to a specified value slop_max
    ELSE score s_i according to:
    $$ \mathrm{score}(s_i) = n_{doc} \cdot \frac{1}{slop} \cdot boost_{s_i} \qquad (2) $$
[a] The slop factor allows matching words that are within a specific distance of each other
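The ranking loop and Eq. (2) can be sketched in Python over a toy in-memory corpus. This is a simplified illustration under stated assumptions, not the Lucene-based system: all function names and the three documents are hypothetical, n_doc is assumed to be the number of documents matching at least one trigram query, and `slop` is modelled as the number of extra tokens tolerated inside a match, which only approximates Lucene's own slop semantics.

```python
def context_queries(left, right, candidate):
    """Build the three trigram queries around the candidate synonym."""
    l2, l1 = left
    r1, r2 = right
    return [(l2, l1, candidate), (l1, candidate, r1), (candidate, r1, r2)]


def matches(doc, phrase, slop):
    """True if the phrase words occur in order in doc with at most
    `slop` extra tokens inside the matched span (simplified slop)."""
    for start, token in enumerate(doc):
        if token != phrase[0]:
            continue
        limit = start + len(phrase) - 1 + slop  # last admissible position
        pos = start
        try:
            for word in phrase[1:]:
                pos = doc.index(word, pos + 1, limit + 1)
        except ValueError:
            continue
        return True
    return False


def score_candidate(corpus, left, right, candidate, boost, slop_max):
    """Increase slop from 1 until something is retrieved, then apply Eq. (2)."""
    queries = context_queries(left, right, candidate)
    for slop in range(1, slop_max + 1):
        n_doc = sum(1 for doc in corpus
                    if any(matches(doc, q, slop) for q in queries))
        if n_doc > 0:
            return n_doc * (1.0 / slop) * boost
    return 0.0  # nothing retrieved up to slop_max


def rank(corpus, left, right, boosts, slop_max=30):
    """Sort the candidates in `boosts` by decreasing score."""
    scores = {c: score_candidate(corpus, left, right, c, b, slop_max)
              for c, b in boosts.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy corpus (hypothetical documents, already tokenized).
CORPUS = [
    "lo spoglio totale dei voti".split(),
    "con lo spoglio totale dei tabulati".split(),
    "spoglio globale e definitivo dei voti".split(),
]
BOOSTS = {"totale": 1.0, "globale": 1.0, "esauriente": 1.0}
ranking, scores = rank(CORPUS, ("lo", "spoglio"), ("dei", "tabulati"), BOOSTS)
```

In this toy run "totale" is retrieved at slop 1 in two documents, while "globale" is only retrieved once slop reaches 2, so the 1/slop factor halves its score.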
Lexical Substitution exploiting a large corpus 3/3
Example
Solo oggi, con lo spoglio completo dei tabulati, se ne potrà sapere di più.
(English: "Only today, with the complete count of the records, will it be possible to know more." The target word is "completo".)
1 slop = 1
2 CS = {totale, globale, ultimato, esauriente, ...} is the list of candidate synonyms: for each s_i ∈ CS
  1 Phrase queries: "lo spoglio s_i", "spoglio s_i dei", "s_i dei tabulati"
  2 IF no results are retrieved
    THEN increment slop and jump to step 1
    ELSE compute the score of s_i
3 Sort CS
4 Choose the best synonyms
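The phrase queries for this example can be generated mechanically by substituting the candidate for the target word and taking every 3-gram that contains it. The sketch below is illustrative: the function name is hypothetical, and the tokenization (lowercasing, punctuation removed) is an assumption, since the slides do not describe the preprocessing.

```python
def phrase_queries(tokens, target_index, candidate, n=3):
    """Replace the target token with the candidate and return every
    n-gram that contains the candidate and fits inside the sentence."""
    replaced = tokens[:target_index] + [candidate] + tokens[target_index + 1:]
    queries = []
    for start in range(target_index - n + 1, target_index + 1):
        if start >= 0 and start + n <= len(replaced):
            queries.append(" ".join(replaced[start:start + n]))
    return queries


# Example sentence from the slide, tokenized under the assumptions above.
sentence = ("solo oggi con lo spoglio completo dei tabulati "
            "se ne potrà sapere di più").split()
target = sentence.index("completo")
queries = phrase_queries(sentence, target, "totale")
# queries == ["lo spoglio totale", "spoglio totale dei", "totale dei tabulati"]
```

Near a sentence boundary fewer than three queries are produced, because n-grams that would run past the first or last token are skipped.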
Evaluation
Dataset: 2,011 instances in XML format
System setup
  ItWaC indexed by Lucene: 8.6 GB of data
  slop_max = 30
  boost factor:
    ItalWordNet
      1: candidate synonyms provided by the synsets
      0.5: words in hypernym synsets of the candidate synonyms
    Dictionary
      1: candidate synonyms provided by the dictionary
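The boost settings above could be encoded as a small lookup that feeds the scoring formula. This is a hedged sketch: the function names and the example words are hypothetical (the words are not taken from ItalWordNet or the dictionary), and giving a word found both in a synset and in a hypernym synset the higher boost is an assumption.

```python
SLOP_MAX = 30  # maximum slop, as in the system setup


def italwordnet_boosts(synset_words, hypernym_words):
    """Boost 1.0 for synset members, 0.5 for words from hypernym synsets.
    A word found in both keeps the higher boost (assumption)."""
    boosts = {word: 0.5 for word in hypernym_words}
    boosts.update({word: 1.0 for word in synset_words})
    return boosts


def dictionary_boosts(dictionary_words):
    """Every dictionary synonym gets boost 1.0."""
    return {word: 1.0 for word in dictionary_words}


# Illustrative candidate lists only.
itwn = italwordnet_boosts(["totale", "globale"], ["intero", "totale"])
dico = dictionary_boosts(["totale", "esauriente"])
```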
Evaluation results - BEST
System [4]                  P      R      P-mode  R-mode
Corpus_dictionary (uniba2)  8.16   7.18   10.58   10.58
Corpus_ITWN (uniba1)        6.80   5.53    8.90    8.90
JIGSAWz (uniba3)            6.28   5.46    8.13    8.13
B1                          6.26   6.01   11.28   10.84
C3                          3.95   3.21    6.58    6.58
C2                          3.90   3.17    6.71    6.71
C1                          3.16   3.16    6.97    6.97
C4                          3.52   2.80    5.03    5.03
[4] Bx, Cx denote other participants
Evaluation results - Out Of Ten (OOT)
System [5]                  P      R      P-mode  R-mode
Corpus_dictionary (uniba2)  41.46  36.50  47.23   47.23
Corpus_ITWN (uniba1)        37.74  30.69  34.84   34.84
JIGSAWz (uniba3)            28.54  24.79  34.58   34.58
C1                          20.09  20.09  27.74   27.74
C3                          23.48  19.11  26.58   26.58
C2                          23.00  18.72  26.32   26.32
B1                          16.65  16.00  24.97   24.00
C4                          18.62  14.78  20.52   20.52
[5] Bx, Cx denote other participants
Conclusions
Two systems have been proposed
  1 JIGSAW: a knowledge-based WSD algorithm
  2 a method based on a large corpus (ItWaC)
Two lexical resources for candidate substitutions
  1 ItalWordNet
  2 Italian Dictionary: Il dizionario dei sinonimi e contrari, De Mauro Paravia
The method based on a large corpus outperforms JIGSAW
Contrary to past beliefs, the results obtained by JIGSAW are very encouraging
The dictionary combined with ItWaC achieves the best result in the task
That’s all folks!