Unsupervised acquisition of verb subcategorization frames from shallow-parsed corpora

Alessandro Lenci (Università di Pisa, Italy)
Barbara McGillivray (ILC-CNR / Università di Pisa, Italy)
Simonetta Montemagni (ILC-CNR, Italy)
Vito Pirrelli (ILC-CNR, Italy)

Outline
1. Subcategorization acquisition
2. MDL verb clustering

1. Subcategorization acquisition: summary
• Previous work
• Our acquisition process
• Evaluation of results

Previous work (1)
• Brent, 1991; Ushioda et al., 1993; Briscoe & Carroll, 1997; Korhonen, 2002
• These approaches presuppose a battery of predefined frames
• However, there are languages for which no such SCF repertoires are available

Previous work (2)
• Alternative: treat acquisition as a process of "SCF discovery" in corpora
• Basili et al., 1997; Zeman & Sarkar, 2000; Alonso et al., 2007; Bourigault & Frérot, 2005
• We present a variation of this "discovery approach" to SC acquisition for Italian verbs

Our SC extraction method
• Simply requires a "chunked" corpus and a limited number of search heuristics that do not rely on any previous knowledge about SCFs
  – languages other than English
  – a looser notion of SCF, including typical verb modifiers and strongly selected arguments

The acquisition process
0. Experimental setting
  – chunked PAROLE Corpus
    • Italian general corpus
    • 3 million word tokens
    • chunked with CHUG-IT
  – 47 communication verbs

The acquisition process (step 1)
1. Extraction of verb local contexts (SLCs) from chunked texts
• Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]
  'the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session'

The acquisition process (step 2)
2. Context carving: linguistically motivated criteria select only those chunks that are in the dependency scope of the verb v, so that noise is minimized
• Ex. (same sentence as in step 1): [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]
  'the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session'

The acquisition process (step 3)
3. Induction of potential subcategorization frames (PSFs)
  a. Assumption: all contextual chunks occurring immediately after the verb are very likely governed by it; these are the potentially subcategorized slots (PSSs)
  b. Frequency filter on PSSs
  c. An SLC is eligible as a PSF if its contextual chunks belong to the list of selected PSSs
  d. Frequency filter on PSFs

The acquisition process (step 3)
Verb: accettare 'accept'

Selected PSSs: [], [CHE_C], [I_C-di], [N_C]

SLC                          PSF        Rel. freq.
[]                           []         0.33
[CHE_C]                      [CHE_C]    0.05
[I_C-di]                     [I_C-di]   0.13
[N_C]                        [N_C]      0.45
[N_C][ADJ_C]                 [N_C]      0.45
[N_C][ADJPART_C]             [N_C]      0.45
[N_C][di_C]                  [N_C]      0.45
[N_C][NA_C]                  [N_C]      0.45
[N_C][P_C-a]                 [N_C]      0.45
[N_C][P_C-di]                [N_C]      0.45
[N_C][P_C-di][ADJ_C]         [N_C]      0.45
[N_C][P_C-di][ADJPART_C]     [N_C]      0.45

Evaluation of results - Italian
• Evaluation of our SCF induction method
  – extracted carved contexts: baseline (step 2)
  – induced subcat frames (step 3)
    o type precision: $P = \frac{\text{correctly acquired frames}}{\text{all acquired frames}}$
    o type recall: $R = \frac{\text{correctly acquired frames}}{\text{all frames in the gold standard}}$
    o F-measure: $F = \frac{2 \cdot P \cdot R}{P + R}$

Evaluation - Italian (2)
• Carried out against three gold standards and by manual inspection
  1. IGS1: a general purpose computational lexicon (SIMPLE-PAROLE-CLIPS lexicon)
  2. IGS2: Italian dictionary (Sabatini-Coletti 2006)
  3. IGS3: merging IGS1 and IGS2
  4. Manual evaluation

Evaluation - Italian (3)

               IGS1    IGS2    IGS3    4 (manual)
SCFs       P   42%     30%     52%     93%
           R    8%     84%     78%     NA
           F   13%     44%     62%     NA
baseline   P   23%     13%     27%     40%
           R   72%     68%     75%     NA
           F   35%     22%     38%     NA

Evaluation - English
• Four gold standards
  1. EGS1: general purpose computational lexicon (Valex5 Lexicon)
  2. EGS2: Longman Dictionary (2006)
  3. EGS3: biomedical English lexicon (SPECIALIST Lexicon)
  4. EGS4: merging EGS1, EGS2 and EGS3

Evaluation - English (2)

               EGS1+EGS2    EGS3    EGS4
SCFs       P   69%          52%     83%
           R   48%          54%     51%
           F   57%          53%     63%
baseline   P   28%          17%     33%
           R   52%          49%     53%
           F   36%          25%     41%

2. Verb clustering: summary
• The MDL Principle
• Verb clustering using MDL

Why verb clustering?
• Syntax-semantics lexical interface
• Starting from the extracted SCFs, we aim at inducing clusters of verbs that share similar semantic properties
• Each verb is represented as a vector whose dimensions report its statistical distribution over the automatically extracted SCFs
• The verb vectors are clustered using the Minimum Description Length (MDL) Principle

The MDL Principle
• From information theory (Rissanen 1989)
• Model description length: code length in bits for the encoding of the model itself (complexity of the model)
• Data description length: code length in bits for the encoding of the given data observed through the model (fit of the model to the data)
• MDL: "any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally"

$$M = \arg\min_{m} \big( L(m) + L(D \mid m) \big)$$

Verb clustering using MDL
(The class symbol was lost in extraction; $c_j$ is used here for the $j$-th class.)
1) Baseline model: each verb forms its own class
   $$M_0 : c_1 = \{v_1\}, \dots, c_r = \{v_r\}$$
2) Compare $M_0$ with any model
   $$M_1(h, k) : c_j = \{v_j\},\ j = 1, \dots, r,\ j \neq h, k;\quad c_{r+1} = \{v_h, v_k\}$$
3) Choose $(n_1, m_1)$ such that
   $$M_1(n_1, m_1) = \arg\max_{(h,k)} \big[ L(M_0) - L(M_1(h, k)) \big], \qquad L(M_0) - L(M_1(h, k)) > 0$$
4) Cluster together $(v_{n_1}, v_{m_1})$ into the class $c_{r+1}$

MDL-clustering
• 47 Italian communication verbs: 23 clustering steps
[Figure: dendrogram of the clustered verbs. Leaf labels: PROMETTERE, RISPONDERE, PARLARE, PROTESTARE, CHIEDERE, DIRE, ASSERIRE, MINACCIARE, COMANDARE, INSEGNARE, AMMONIRE, DICHIARARE, CONFESSARE, CHIARIRE, PROIBIRE, SUGGERIRE, COMUNICARE, ACCETTARE, PROPORRE, MOSTRARE, COMMENTARE, CHIAMARE, PREGARE, DISCUTERE, RIVELARE, RICHIAMARE, RIMPROVERARE, LEGGERE, SPIEGARE, REPLICARE, DESCRIVERE, RICHIEDERE, DENUNCIARE, OFFRIRE, RIMPIANGERE, ORDINARE]

Conclusions
• A preliminary qualitative analysis of the induced verb clusters shows encouraging results
• We plan to evaluate the coherence of the obtained lexico-semantic clusters and the coverage of the subcategorization behaviour of clustered verbs

MDL-clustering
• The verb classes are assigned a new cluster-based frame distribution

                            []     [che]   [I-di]   [N]    [P-a]   [perché]
chiarire 'clarify'          0.34   0.10    0        0.40   0       0.009
comunicare 'communicate'    0.24   0.15    0        0.31   0.08    0
proibire 'forbid'           0.21   0.03    0.03     0.51   0       0
suggerire 'suggest'         0.24   0.10    0.009    0.42   0.02    0.02
verb class (cluster)        0.25   0.10    0.008    0.41   0.02    0.02
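
The greedy merging loop from the "Verb clustering using MDL" slide can be sketched in Python as below. This is only an illustration under stated assumptions: the slides do not give the actual code-length formulas, so `description_length` is a hypothetical two-part code (a fixed number of bits per frame probability for each class, plus the negative log-likelihood of the frame counts under the class's pooled distribution). The verbs `v1`..`v4`, their counts, and the `param_bits` value are invented toy data, not results from the corpus.

```python
import math
from itertools import combinations


def description_length(clusters, counts, param_bits=8.0):
    """Toy two-part code length L(m) + L(D|m) in bits.

    ASSUMPTION: this encoding is hypothetical, not the authors' formula.
    Model part: param_bits per frame probability, per class.
    Data part: negative log2-likelihood of the frame counts under
    the pooled frame distribution of each class.
    """
    frames = sorted({f for c in counts.values() for f in c})
    model_bits = param_bits * len(frames) * len(clusters)
    data_bits = 0.0
    for cluster in clusters:
        pooled = {f: sum(counts[v].get(f, 0) for v in cluster) for f in frames}
        total = sum(pooled.values())
        for n in pooled.values():
            if n > 0:
                data_bits -= n * math.log2(n / total)
    return model_bits + data_bits


def mdl_cluster(counts, dl=description_length):
    """Greedy merging: start from singleton classes (M0), then repeatedly
    apply the pairwise merge with the largest positive gain
    L(current) - L(merged), stopping when no merge shortens the code."""
    clusters = [frozenset([v]) for v in sorted(counts)]
    current = dl(clusters, counts)
    while len(clusters) > 1:
        best_gain, best_model = 0.0, None
        for a, b in combinations(clusters, 2):
            candidate = [c for c in clusters if c not in (a, b)] + [a | b]
            gain = current - dl(candidate, counts)
            if gain > best_gain:
                best_gain, best_model = gain, candidate
        if best_model is None:  # no merge has a positive gain
            break
        clusters = best_model
        current -= best_gain
    return clusters


# Hypothetical verb-by-frame counts (invented for illustration only).
toy_counts = {
    "v1": {"[]": 34, "[N]": 40, "[che]": 10},
    "v2": {"[]": 30, "[N]": 42, "[che]": 10},
    "v3": {"[P-di]": 60, "[]": 30, "[P-a]": 10},
    "v4": {"[P-di]": 55, "[]": 35, "[P-a]": 10},
}
result = mdl_cluster(toy_counts)
print(sorted(sorted(c) for c in result))
```

On this toy input the verbs with similar frame profiles are merged pairwise, and the two resulting classes are kept apart because pooling them would cost more bits in the data part than it saves in the model part. Passing the code-length function as the `dl` parameter is a design choice of this sketch, so that a different encoding can be swapped in without touching the merge loop.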