Unsupervised acquisition of
verb subcategorization frames
from shallow-parsed corpora
Alessandro Lenci (Università di Pisa, Italy)
Barbara McGillivray (ILC-CNR / Università di Pisa, Italy)
Simonetta Montemagni (ILC-CNR, Italy)
Vito Pirrelli (ILC-CNR, Italy)
Outline
1. Subcategorization acquisition
2. MDL verb clustering
1. Subcategorization acquisition:
summary
• Previous work
• Our acquisition process
• Evaluation of results
Previous work (1)
• Brent, 1991; Ushioda et al., 1993; Briscoe &
Carroll, 1997; Korhonen, 2002
• These approaches presuppose a battery of
predefined frames
• there are languages for which no such SCF
repertoires are already available
Previous work (2)
• alternative: treat acquisition as an “SCF discovery” process in corpora
• Basili et al., 1997; Zeman & Sarkar, 2000;
Alonso et al., 2007; Bourigault & Frérot, 2005
• we present a variation of this “discovery
approach” to SC acquisition for Italian verbs
Our SC extraction method
• simply requires a “chunked” corpus and a
limited number of search heuristics that do
not rely on any previous knowledge about
SCFs
– languages other than English
– a looser notion of SCF including typical verb
modifiers and strongly selected arguments
The acquisition process
0. experimental setting
– chunked PAROLE Corpus
• Italian general corpus
• 3 million word tokens
• chunked with CHUG-IT
– 47 communication verbs
The acquisition process
(step 1)
1. extraction of verb local contexts (SLCs)
from chunked texts
• Ex.:
[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120]
[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta]
[N_C il massimo storico]
‘the yen closed in Tokyo at 120 after reaching an all-time high during the session’
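A minimal sketch of how the bracketed chunk notation in the example could be read and how the chunks following the finite verb could be collected. The function names `parse_chunks` and `local_context` are hypothetical, and treating everything after the first `FV_C` chunk as the local context is a simplifying assumption, not necessarily the procedure used in this work.

```python
import re
from typing import List, Tuple

Chunk = Tuple[str, str]  # (chunk label, chunk text)

# One bracketed chunk such as "[P_C a Tokio]": a label, whitespace, the chunk text.
CHUNK_RE = re.compile(r"\[(\S+)\s+([^\]]*)\]")

def parse_chunks(line: str) -> List[Chunk]:
    """Return (label, text) pairs for every bracketed chunk in `line`."""
    return [(label, " ".join(text.split())) for label, text in CHUNK_RE.findall(line)]

def local_context(chunks: List[Chunk], verb_label: str = "FV_C") -> List[Chunk]:
    """Return the chunks after the first finite-verb chunk (empty if there is none)."""
    for i, (label, _) in enumerate(chunks):
        if label == verb_label:
            return chunks[i + 1:]
    return []

sentence = (
    "[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] "
    "[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] "
    "[N_C il massimo storico]"
)
chunks = parse_chunks(sentence)
context = local_context(chunks)
print([label for label, _ in context])
```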
The acquisition process
(step 2)
2. Context carving: linguistically-motivated criteria select only those chunks that are in the dependency scope of the verb v → noise information is minimized
• Ex.:
[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120]
[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta]
[N_C il massimo storico]
‘the yen closed in Tokyo at 120 after reaching an all-time high during the session’
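The slides do not list the carving criteria themselves. The sketch below assumes a single, hypothetical criterion: the scope of the finite verb ends at the next verbal chunk. The `carve` function and the `VERBAL_LABELS` set are illustrative assumptions only.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (chunk label, chunk text)

# Hypothetical stop set: chunk labels assumed to open a new verbal domain.
VERBAL_LABELS = {"FV_C", "I_C"}

def carve(chunks: List[Chunk], verb_index: int) -> List[Chunk]:
    """Keep the chunks after position `verb_index` up to the next verbal chunk."""
    carved = []
    for label, text in chunks[verb_index + 1:]:
        if label in VERBAL_LABELS:
            break  # assumed boundary of the verb's dependency scope
        carved.append((label, text))
    return carved

chunks = [
    ("N_C", "lo yen"), ("FV_C", "ha chiuso"), ("P_C", "a Tokio"),
    ("P_C", "a 120"), ("I_C", "dopo aver toccato"), ("P_C", "nel corso"),
    ("P_C", "della seduta"), ("N_C", "il massimo storico"),
]
carved = carve(chunks, verb_index=1)
print(carved)
```

Under this assumption only the two prepositional chunks that directly follow "ha chiuso" are retained for the example sentence.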
The acquisition process
(step 3)
3. induction of potential subcategorization frames (PSF)
a. assumption: all contextual chunks occurring immediately after the verb are very likely governed by it → potentially subcategorized slots (PSS)
b. frequency filter on PSSs
c. an SLC is eligible as a PSF if its contextual chunks belong to the list of selected PSSs
d. frequency filter on PSFs
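Steps a to d can be sketched as follows. This is one possible reading, not the authors' implementation: each SLC is represented as a tuple of slot labels, the thresholds `pss_min` and `psf_min` are hypothetical, and an SLC is mapped to a PSF by dropping the chunks that are not selected PSSs. The counts in the example are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

SLC = Tuple[str, ...]  # slot labels of the carved chunks, in order

def induce_psfs(slcs: List[SLC], pss_min: float = 0.05,
                psf_min: float = 0.05) -> Dict[SLC, float]:
    """Map the observed SLCs of one verb to filtered PSFs with relative frequencies."""
    n = len(slcs)
    # (a) the chunk immediately after the verb is a candidate slot
    first = Counter(slc[0] for slc in slcs if slc)
    # (b) frequency filter on candidate slots
    pss = {slot for slot, c in first.items() if c / n >= pss_min}
    # (c) keep only the chunks of each SLC that are selected slots
    psf_counts = Counter(tuple(s for s in slc if s in pss) for slc in slcs)
    # (d) frequency filter on the resulting frames
    return {psf: c / n for psf, c in psf_counts.items() if c / n >= psf_min}

# Hypothetical SLC observations for a single verb.
slcs = (
    [()] * 6 + [("N_C",)] * 5 + [("N_C", "P_C-a")] * 3
    + [("N_C", "ADJ_C")] * 2 + [("I_C-di",)] * 3 + [("CHE_C",)] * 2
)
psfs = induce_psfs(slcs)
for frame, freq in sorted(psfs.items(), key=lambda kv: -kv[1]):
    print(frame, round(freq, 2))
```

In this toy run `P_C-a` and `ADJ_C` never occur immediately after the verb, so they are not selected as slots and the SLCs containing them collapse onto the frame `("N_C",)`.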
The acquisition process
(step 3)
Verb accettare ‘accept’

PSS: [], [CHE_C], [I_C-di], [N_C]

SLC                          PSF         Rel. freq.
[]                           []          0.33
[CHE_C]                      [CHE_C]     0.05
[I_C-di]                     [I_C-di]    0.13
[N_C]                        [N_C]       0.45
[N_C][ADJ_C]                 [N_C]       0.45
[N_C][ADJPART_C]             [N_C]       0.45
[N_C][di_C]                  [N_C]       0.45
[N_C][NA_C]                  [N_C]       0.45
[N_C][P_C-a]                 [N_C]       0.45
[N_C][P_C-di]                [N_C]       0.45
[N_C][P_C-di][ADJ_C]         [N_C]       0.45
[N_C][P_C-di][ADJPART_C]     [N_C]       0.45
Evaluation of results - Italian
• Evaluation of our SCF induction method
– extracted carved contexts: baseline (step 2)
– induced subcat frames (step 4)
o type precision: $P = \frac{\text{correctly acquired frames}}{\text{all acquired frames}}$
o type recall: $R = \frac{\text{correctly acquired frames}}{\text{all frames in the gold standard}}$
o F-measure: $F = \frac{2 \cdot P \cdot R}{P + R}$
Evaluation - Italian (2)
• carried out against three gold standards
1. IGS1: a general purpose computational
lexicon (SIMPLE-PAROLE-CLIPS lexicon)
2. IGS2: Italian dictionary (Sabatini-Coletti
2006)
3. IGS3: merging IGS1 and IGS2
4. Manual evaluation
Evaluation - Italian (3)
                IGS1    IGS2    IGS3    4 (manual)
SCFs       P    42%     30%     52%     93%
           R     8%     84%     78%     NA
           F    13%     44%     62%     NA
baseline   P    23%     13%     27%     40%
           R    72%     68%     75%     NA
           F    35%     22%     38%     NA
Evaluation - English
• four gold standards
1. EGS1: general purpose computational lexicon (Valex5 Lexicon)
2. EGS2: Longman Dictionary (2006)
3. EGS3: biomedical English lexicon (SPECIALIST Lexicon)
4. EGS4: merging EGS1, EGS2 and EGS3
Evaluation – English (2)
                EGS1+EGS2    EGS3    EGS4
SCFs       P    69%          52%     83%
           R    48%          54%     51%
           F    57%          53%     63%
baseline   P    28%          17%     33%
           R    52%          49%     53%
           F    36%          25%     41%
2. Verb clustering: summary
• The MDL Principle
• Verb clustering using MDL
Why verb clustering?
• syntax-semantics lexical interface
• starting from the SCFs extracted, we aim at inducing
clusters of verbs that share similar semantic properties
• each verb is represented as a vector whose
dimensions report its statistical distribution with the
automatically extracted SCFs
• a clustering of verb vectors is performed using the
Minimum Description Length Principle (MDL)
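As an illustration of the vector representation, the sketch below turns per-verb frame counts into relative-frequency vectors over a fixed list of frames. The function name `frame_vector` and all counts are hypothetical; the actual weighting used for the verb vectors may differ.

```python
from typing import Dict, List

def frame_vector(frame_counts: Dict[str, int], frames: List[str]) -> List[float]:
    """Relative frequency of each frame in `frames` for one verb."""
    total = sum(frame_counts.get(f, 0) for f in frames)
    if total == 0:
        return [0.0] * len(frames)  # verb never observed with these frames
    return [frame_counts.get(f, 0) / total for f in frames]

# Hypothetical counts, for illustration only.
FRAMES = ["[]", "[che]", "[N]", "[P-a]"]
counts = {
    "chiarire": {"[]": 4, "[che]": 1, "[N]": 5},
    "comunicare": {"[]": 3, "[che]": 2, "[N]": 4, "[P-a]": 1},
}
vectors = {verb: frame_vector(c, FRAMES) for verb, c in counts.items()}
print(vectors["chiarire"])
```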
The MDL Principle
• from information theory (Rissanen 1989)
• model description length: code length in bits for the encoding of the model itself → complexity of the model
• data description length: code length in bits for the encoding of the given data observed through the model → fit of the model to the data
• MDL: “any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”
$M = \arg\min_{m} \left( L_m + L(D \mid m) \right)$
Verb clustering using MDL
1) Baseline model: each verb belongs to its own class
$M_0 : \gamma_1 = \{v_1\}, \dots, \gamma_r = \{v_r\}$
2) Compare $M_0$ with any model
$M_1(h,k) : \gamma_j = \{v_j\},\ j = 1, \dots, r,\ j \neq h, k;\ \gamma_{r+1} = \{v_h, v_k\}$
3) Choose $(n_1, m_1)$ such that
$M_1(n_1, m_1) = \arg\max_{(h,k)\,:\,L(M_0) - L(M_1(h,k)) > 0} \left[ L(M_0) - L(M_1(h,k)) \right]$
4) Cluster together $(v_{n_1}, v_{m_1})$ into the class $\gamma_{r+1}$
MDL-clustering
• 47 Italian communication verbs: 23 clustering steps
[Figure: verb clustering. Verb labels: PROMETTERE, RISPONDERE, PARLARE, PROTESTARE, CHIEDERE, DIRE, ASSERIRE, MINACCIARE, COMANDARE, INSEGNARE, AMMONIRE, DICHIARARE, CONFESSARE, CHIARIRE, PROIBIRE, SUGGERIRE, COMUNICARE, ACCETTARE, PROPORRE, MOSTRARE, COMMENTARE, CHIAMARE, PREGARE, DISCUTERE, RIVELARE, RICHIAMARE, RIMPROVERARE, LEGGERE, SPIEGARE, REPLICARE, DESCRIVERE, RICHIEDERE, DENUNCIARE, OFFRIRE, RIMPIANGERE, ORDINARE]
Conclusions
• a preliminary qualitative analysis of induced verb
clusters shows encouraging results
• we expect to evaluate the coherence of the
obtained lexico-semantic clusters and the coverage
of the subcategorization behaviour of clustered
verbs
MDL-clustering
• The verb classes are assigned a new cluster-based frame distribution

                            []      [che]   [I-di]   [N]     [P-a]   [perché]
chiarire ‘clarify’          0.34    0.10    0        0.40    0       0.009
comunicare ‘communicate’    0.24    0.15    0        0.31    0.08    0
proibire ‘forbid’           0.21    0.03    0.03     0.51    0       0
suggerire ‘suggest’         0.24    0.10    0.009    0.42    0.02    0.02
verb class (cluster)        0.25    0.10    0.008    0.41    0.02    0.02