UNIBA @ EVALITA 2009 Lexical Substitution Task

Pierpaolo Basile and Giovanni Semeraro
{[email protected] and [email protected]}
Department of Computer Science, University of Bari “Aldo Moro” (ITALY)
EVALITA 2009, Reggio Emilia (ITALY), 12 December 2009
P. Basile ([email protected]) UNIBA

Outline
1 Methods for Lexical Substitution Task
    WSD algorithm: JIGSAW
    Lexical Substitution exploiting a large corpus
2 Evaluation
3 Conclusions

Methods for Lexical Substitution Task
- A knowledge-based WSD algorithm (JIGSAW) which exploits ItalWordNet [1] as its knowledge base
- An unsupervised approach which relies on a large corpus in order to find the different contexts in which words are used
- Two lexical resources of candidate synonyms:
    1 ItalWordNet
    2 Il dizionario dei sinonimi e contrari, De Mauro Paravia
[1] Roventini A. et al., ItalWordNet: a Large Semantic Database for Italian. In Proceedings of LREC 2000, Volume II, pages 783-790, 2000.

JIGSAW
- Knowledge-based WSD algorithm
- Three different strategies for: nouns, verbs and adjectives/adverbs
- Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the PoS tag
- Italian/English WSD algorithm
    Italian: EVALITA 2007 All-Words WSD Task [a]
    English: SemEval-1 - Evaluating WSD on cross-language information retrieval [b]
[a] P. Basile and G. Semeraro. JIGSAW: An algorithm for Word Sense Disambiguation. Intelligenza Artificiale, 4(2):53-54, 2007.
[b] P. Basile et al., JIGSAW algorithm for Word Sense Disambiguation. In SemEval-2007, pages 398-401. ACL press, 2007.

JIGSAWz (Z = ZIPF distribution)
Taking into account the synset rank distribution:

\[ f(k; N; s) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}} \qquad (1) \]

where:
- N is the number of word meanings
- k is the word meaning rank (we adopt the ItalWordNet synset rank)
- s is the value of the exponent characterizing the distribution (approximated using Pearson's chi-square (χ²) test)
Compute the frequency of the word meaning rank in MultiSemCor [2]
[2] L. Bentivogli and E. Pianta. Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus. Natural Language Engineering, 11(03):247-261, 2005.

JIGSAW performance
JIGSAW at EVALITA 2007 WSD All-Words Task

system     P       R       A(%)    F
JIGSAW     0.598   0.567   94.7    0.582
JIGSAWz    0.639   0.606   94.7    0.622

Lexical Substitution exploiting a large corpus 1/3
Idea: index a large corpus and then try to find phrases in which synonyms of the target word occur in the same context
ItWaC (Italian Web Corpus) [3]: a large corpus of about 1,900,000 documents built automatically from the Web
Ingredients:
- Apache Lucene to index and search ItWaC (with term positions)
- Lexical resource which provides a list of candidate synonyms:
    ItalWordNet
    Il dizionario dei sinonimi e contrari, De Mauro Paravia
[3] M. Baroni and A. Kilgarriff. Large linguistically-processed Web corpora for multiple languages. In EACL 2006, pages 87-90, 2006.
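
Returning to the Zipf-based sense distribution of equation (1), a minimal Python sketch of the formula may help make it concrete. The function name `zipf_sense_prior` and the values N = 4 and s = 1.0 are illustrative assumptions; the slides do not report the estimated exponent, nor how JIGSAWz combines this value with the rest of the disambiguation score.

```python
def zipf_sense_prior(k: int, n_senses: int, s: float) -> float:
    """Equation (1): probability of the sense at rank k among n_senses senses.

    k        -- sense rank, 1-based (the slides use the ItalWordNet synset rank)
    n_senses -- N, the number of meanings of the word
    s        -- exponent of the Zipf distribution (value here is an assumption)
    """
    if not 1 <= k <= n_senses:
        raise ValueError("rank k must satisfy 1 <= k <= n_senses")
    # Denominator of equation (1): sum over all ranks n = 1..N of 1/n^s.
    normaliser = sum(1.0 / n ** s for n in range(1, n_senses + 1))
    return (1.0 / k ** s) / normaliser


# Illustrative word with four senses and s = 1.0 (hypothetical values).
priors = [zipf_sense_prior(k, 4, 1.0) for k in range(1, 5)]
for rank, prior in enumerate(priors, start=1):
    print(f"rank {rank}: {prior:.3f}")
```

By construction the values sum to one and decrease with the rank, so higher-ranked senses receive more probability mass.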

Lexical Substitution exploiting a large corpus 2/3
Strategy
1 Retrieve the list of possible synonyms CS
2 Rank the candidate synonym list CS by exploiting the corpus: for each synonym s_i ∈ CS
    1 search the corpus for phrases in which the synonym occurs in the same context (using 3-grams)
    2 IF no results are retrieved THEN the slop [a] is incremented by one until slop is equal to a specified value slop_max
      ELSE score s_i according to:

\[ \mathrm{score}(s_i) = n_{doc} \cdot \frac{1}{slop} \cdot boost_{s_i} \qquad (2) \]

[a] The slop factor allows matching words that are within a specific distance of each other

Lexical Substitution exploiting a large corpus 3/3
Example
Solo oggi, con lo spoglio completo dei tabulati, se ne potrà sapere di più.
(English gloss: "Only today, with the complete examination of the printouts, will it be possible to know more." The target word is "completo".)
1 slop = 1
2 CS = {totale, globale, ultimato, esauriente, ...} is the list of candidate synonyms: for each s_i ∈ CS
    1 Phrase queries: “lo spoglio s_i”, “spoglio s_i dei”, “s_i dei tabulati”
    2 IF no results are retrieved THEN increment slop and jump to step 1 ELSE compute the score of s_i
3 Sort CS
4 Choose the best synonyms

Evaluation
Dataset: 2,011 instances in XML format
System setup
- ItWaC indexed by Lucene: 8.6 GB of data
- slop_max = 30
- boost factor:
    ItalWordNet
        1: candidate synonyms provided by the synsets
        0.5: words in hypernym synsets of the candidate synonyms
    Dictionary
        1: candidate synonyms provided by the dictionary

Evaluation results - BEST

System [4]                    P      R      P-mode   R-mode
Corpus_dictionary (uniba2)    8.16   7.18   10.58    10.58
Corpus_ITWN (uniba1)          6.80   5.53    8.90     8.90
JIGSAWz (uniba3)              6.28   5.46    8.13     8.13
B1                            6.26   6.01   11.28    10.84
C3                            3.95   3.21    6.58     6.58
C2                            3.90   3.17    6.71     6.71
C1                            3.16   3.16    6.97     6.97
C4                            3.52   2.80    5.03     5.03

[4] Bx, Cx denote other participants
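
The corpus-based ranking strategy of equation (2) and the "lo spoglio completo dei tabulati" example can be sketched in Python as follows. This is a toy reconstruction under stated assumptions, not the authors' Lucene implementation: the in-memory `docs` list is invented data standing in for ItWaC, `matches_with_slop` is a simplified window test rather than Lucene's exact sloppy-phrase semantics, and reading n_doc as the number of documents matching at least one of the three 3-gram queries is an assumption, since the slides do not say how the queries are combined.

```python
from typing import Dict, List, Sequence


def matches_with_slop(tokens: Sequence[str], phrase: Sequence[str], slop: int) -> bool:
    """True if the phrase terms occur in order inside a window of
    len(phrase) + slop tokens (simplified stand-in for a sloppy phrase query)."""
    tokens = list(tokens)
    window = len(phrase) + slop
    for start, token in enumerate(tokens):
        if token != phrase[0]:
            continue
        pos, found = start, True
        for term in phrase[1:]:
            try:
                # Earliest occurrence of the next term, still inside the window.
                pos = tokens.index(term, pos + 1, start + window)
            except ValueError:
                found = False
                break
        if found:
            return True
    return False


def context_queries(left: Sequence[str], right: Sequence[str], candidate: str) -> List[List[str]]:
    """The three 3-grams around the target slot, as in the Example slide."""
    return [
        [left[0], left[1], candidate],
        [left[1], candidate, right[0]],
        [candidate, right[0], right[1]],
    ]


def score_candidate(docs, left, right, candidate, boost, slop_max=30) -> float:
    """Equation (2): n_doc * (1 / slop) * boost, raising slop until a hit or slop_max."""
    queries = context_queries(left, right, candidate)
    for slop in range(1, slop_max + 1):
        n_doc = sum(
            1 for doc in docs if any(matches_with_slop(doc, q, slop) for q in queries)
        )
        if n_doc > 0:
            return n_doc * (1.0 / slop) * boost
    return 0.0  # no context found within slop_max


def rank_candidates(docs, left, right, boosts: Dict[str, float], slop_max=30) -> List[str]:
    """Sort candidate synonyms by decreasing score."""
    scores = {c: score_candidate(docs, left, right, c, b, slop_max) for c, b in boosts.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Invented toy corpus (hypothetical sentences, not taken from ItWaC).
docs = [
    "lo spoglio totale dei tabulati".split(),
    "lo spoglio totale dei voti".split(),
    "lo spoglio quasi ultimato dei tabulati".split(),
]
boosts = {"totale": 1.0, "globale": 1.0, "ultimato": 1.0}
ranking = rank_candidates(docs, ("lo", "spoglio"), ("dei", "tabulati"), boosts)
print(ranking)
```

The 1/slop factor penalises candidates that only match when the context words are allowed to drift apart, while the boost lets the setup favour synset members (1) over hypernym words (0.5).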

Evaluation results - Out Of Ten (OOT)

System [5]                    P       R       P-mode   R-mode
Corpus_dictionary (uniba2)    41.46   36.50   47.23    47.23
Corpus_ITWN (uniba1)          37.74   30.69   34.84    34.84
JIGSAWz (uniba3)              28.54   24.79   34.58    34.58
C1                            20.09   20.09   27.74    27.74
C3                            23.48   19.11   26.58    26.58
C2                            23.00   18.72   26.32    26.32
B1                            16.65   16.00   24.97    24.00
C4                            18.62   14.78   20.52    20.52

[5] Bx, Cx denote other participants

Conclusions
- Two systems have been proposed
    1 JIGSAW: a knowledge-based WSD algorithm
    2 a method based on a large corpus (ItWaC)
- Two lexical resources for candidate substitutions
    1 ItalWordNet
    2 Italian Dictionary: Il dizionario dei sinonimi e contrari, De Mauro Paravia
- Method based on a large corpus outperforms JIGSAW
- In spite of past beliefs, results obtained by JIGSAW are very encouraging
- Dictionary combined with ItWaC achieves the best task result

That’s all folks!