When GL meets the corpus
A data-driven investigation of semantic types and coercion phenomena

Elisabetta Jezek (1), Alessandro Lenci (2)
(1) University of Pavia - Department of Linguistics
(2) University of Pisa - Department of Linguistics

GL Workshop 2007, Paris, 10th May

Research goals and methodology

- Corpus evidence as the major device to explore the semantic type system
  - in the line of Corpus Pattern Analysis (CPA) as proposed by Hanks & Pustejovsky (2005), Pustejovsky et al. (2004)
  - here we focus on V-obj combinations
- Improve our understanding of
  - the structure of semantic types
  - the organisation of the type system
  - how types behave compositionally

Theoretical framework

- Generative Lexicon Theory (Pustejovsky 1995, 2006)
  - the semantic type system is integrated in a general theory of argument selection
- Generative devices in the lexicon
  - type system generativity (paradigmatic): Qualia Structure (QS) and “dot” operations create multidimensional lexical types
  - compositional generativity (syntagmatic): compositional operations (coercion, co-composition, etc.) change and create types in context

Type system in GL (Pustejovsky 2001, 2006)

- Natural types
  - only formal and constitutive information (organized taxonomically)
  - lion: animate, rock: concrete, water: liquid
- Artifactual types
  - natural types + Telic and/or Agentive quale
  - violinist: animate ⊗T play, beer: liquid ⊗T drink, knife: concrete ⊗T cut
- Complex types (dot types)
  - composition of types
  - libro “book”: physical • information
  - cf. inherent polysemy

Compositional operations on types (Pustejovsky 2006)

Key issue: how the type selected by the predicate matches the type of its arguments

- Pure Selection: the selecting type is directly satisfied by the argument type
- Accommodation: the selecting type is inherited by the argument type
- Type Coercion: the selecting type does not directly match the argument type
  - Exploitation: the selecting type corresponds to a portion of the QS of the argument.
    A subcomponent of the argument’s type is accessed to satisfy the predicate’s requirements
  - Introduction: the selecting type is richer than the type of its argument. The argument is wrapped with the type required by the predicate

A map of compositional operations on types

Domain-preserving operations

                         SELECTING TYPE
ARGUMENT TYPE            Simple (natural)   Unified (artifactual)   Dot (complex)
Simple (natural)         selection          introduction            introduction
Unified (artifactual)    exploitation       selection               introduction
Dot (complex)            exploitation       exploitation            selection

Other compositional operations

- non domain-preserving coercion, e.g. object to event
- co-composition
- …

Corpus evidence and GL type system

- The use of corpus analysis raises crucial issues concerning how to properly map the extracted patterns onto the GL architecture of the lexicon
- Observed evidence
  - a set of predicative pairs Σ = {σ1, …, σn} extracted from a corpus
  - σi = <read-book_obj>, σj = <eat-cake_obj>, etc.
- What can we infer from the extracted contexts in Σ about the semantic type system and the compositional rules?
  - what is the type of the argument?
  - what is the type selected by the predicate?
  - what is the operation that allowed the predicate and the argument to compose?

Corpus evidence and GL type system

- Distributional Hypothesis
  - lexical items belonging to the same type are expected to show similar syntagmatic distributions
  - differences in combinatorial distributions can be taken as indicators of differences in type
- Incremental data-driven type definition
  - top-down definition of a repository of “shallow types” acting as a priori constraints on the semantic type system, cf. Brandeis Shallow Ontology (Pustejovsky et al. 2004)
  - corpus-based definition of fine-grained types emerging as abstractions over the combinatorial patterns of lexical items

Brandeis Shallow Ontology (Pustejovsky et al.
2004)

Corpus evidence and GL compositional operations

- Given the GL architecture, we have to assume that each observed context pair σ has been generated by the combination of two different factors
  (1) the semantic types of the elements of σ
  (2) the semantic operations that drove the composition of σ
- If σ represents the observational datum, (1) and (2) are the two hidden parameters that we have to discover
- Key methodological consequence
  - Any attempt to get at a data-driven characterization of the semantic type system cannot dispense with a careful analysis of the compositional operations between types

Corpus processing and data extraction

- Pilot experiment performed on a 20 million word corpus of written Italian
  - subset of La Repubblica Corpus (Baroni et al. 2004)
- The corpus has been automatically processed with IDEAL+
  - dependency-based parser for Italian (Bartolini et al. 2004)
- 502,404 V-OBJ pairs (σ) have been automatically extracted with their frequency (F) in the corpus
  - Verb-LIBRO_OBJ (book): leggere (read), scrivere (write), presentare (present), .....

How we proceed

- The extracted patterns are used to build lexical sets (LS) (cf. Hanks & Pustejovsky 2005)
  - nominal LS: the set of the most “typical” nouns occurring as OBJ of a given V
  - verbal LS: the set of the most “typical” verbs with which a given N occurs as OBJ
- Typicality is measured by the log-likelihood (Dunning 1993) association score between V and N
- LSs are used to investigate two separate but related issues
  - what is the type of an argument?
  - what is the particular operation that allows an argument to compose semantically with a certain predicate?
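
The lexical-set construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the experiment: the function names (log_likelihood, association_scores, nominal_ls, verbal_ls) and the toy frequencies are hypothetical, and the score is the standard G² form of Dunning's log-likelihood ratio computed over a 2x2 contingency table for each V-OBJ pair.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """G^2 statistic (Dunning 1993) for a 2x2 table of co-occurrence counts.

    k11: V with N, k12: V with other nouns,
    k21: other verbs with N, k22: other verbs with other nouns.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def association_scores(pair_freq):
    """Map each (verb, noun) pair to its log-likelihood score."""
    verb_total, noun_total = Counter(), Counter()
    for (verb, noun), freq in pair_freq.items():
        verb_total[verb] += freq
        noun_total[noun] += freq
    total = sum(pair_freq.values())
    scores = {}
    for (verb, noun), freq in pair_freq.items():
        k12 = verb_total[verb] - freq
        k21 = noun_total[noun] - freq
        k22 = total - freq - k12 - k21
        scores[(verb, noun)] = log_likelihood(freq, k12, k21, k22)
    return scores


def nominal_ls(scores, verb, top=40):
    """Nouns occurring as OBJ of `verb`, ranked by association score."""
    ranked = sorted(((n, s) for (v, n), s in scores.items() if v == verb),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]


def verbal_ls(scores, noun, top=10):
    """Verbs taking `noun` as OBJ, ranked by association score."""
    ranked = sorted(((v, s) for (v, n), s in scores.items() if n == noun),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical toy frequencies, NOT taken from the La Repubblica corpus.
toy_pairs = {
    ("leggere", "libro"): 30, ("leggere", "lettera"): 10,
    ("scrivere", "libro"): 20, ("scrivere", "lettera"): 25,
    ("inviare", "lettera"): 40, ("inviare", "libro"): 1,
    ("mangiare", "torta"): 15,
}
scores = association_scores(toy_pairs)
print(nominal_ls(scores, "leggere"))
print(verbal_ls(scores, "lettera"))
```

Note that G² is two-sided: a pair that co-occurs less often than expected can also receive a high score. Whether such pairs were filtered out in the experiment is not stated here, so the sketch leaves them in.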
Investigating semantic types with LSs

- We choose a verb vi that typically selects for a target type τ
- We identify the nominal LS of vi
  - the set of Ns that co-occur with that verb in a certain argument position

[Diagram: type τ, Verb vi, Nominal LS of vi: n1, n2, n3, …]

Investigating semantic types with LSs
The case of “leggere” (read)

- leggere “read”
  - selective environment prima facie fairly well characterized in terms of its type
  - complex functional type selecting for a complex, dot-argument as its direct object

    λy : phys • info λx : e_N [leggere(x, y)]

- phys • info
  - concrete entities that have an informational content (e.g. book)

    [ phys • info
      QUALIA = [ TELIC    = read(x, y, e1)
                 AGENTIVE = write(z, y, e2) ] ]

Top 40 nouns in the LS of leggere

Noun                           LL value    Noun                           LL value
libro “book”                   225.44      frase “sentence”               28.75
giornale “newspaper”           174.98      sentenza “sentence (ruling)”   25.93
articolo “article”             133.28      motivazione “justification”    23.39
lettera “letter”               96.77       Freud                          19.96
romanzo “novel”                76.63       Financial Times                19.40
testo “text”                   58.34       omelia “sermon”                16.92
documento “document”           56.42       notizia “news”                 16.14
intervista “interview”         52.37       saggio “essay”                 16.04
comunicato “communiqué”        49.23       missiva “missive”              15.85
dichiarazione “statement”      48.07       telegramma “telegram”          14.97
pagina “page”                  47.76       poesia “poem”                  14.77
sceneggiatura “script”         44.17       verdetto “verdict”             14.62
riga “line”                    42.03       brano “passage”                14.62
discorso “speech”              41.07       nota “note”                    14.51
cartella “page”                40.64       opera “work”                   14.20
messaggio “message”            36.10       Rimbaud                        14.19
relazione “report”             35.14       sofisma “sophism”              14.19
passo “passage”                34.60       Tuttosport                     14.19
resoconto “report”             30.04       scritta “writing, notice”      11.75
parola “word”                  29.71       telex “telex”                  11.59

How to proceed

- From the fact that a N is included in the nominal lexical set of leggere, we cannot simply infer that its type is phys • info
  - leggere has the ability not only to combine by pure selection, but also to coerce the argument type, cf.
    leggere Rimbaud “read Rimbaud”
  - leggere can itself undergo co-compositions
- In order to find out what the type of a N is, we inspect the verbal LS of N, i.e. the verbs with which N most frequently co-occurs

[Diagram: Verb vi, Nominal LS of vi: n1, n2, n3, …; Verbal LS of n1: v1, v2, …; Verbal LS of n2: v3, v4, …]

Top 10 verbs in the LS of nouns selected by leggere

libro “book”                   LL value    articolo “article”             LL value
scrivere “write”               369.39      scrivere “write”               139.79
leggere “read”                 225.44      leggere “read”                 133.28
pubblicare “publish”           124.94      pubblicare “publish”           103.38
presentare “present”           66.11       inviare “send”                 79.18
sfogliare “turn pages”         45.98       ricevere “get”                 50.49
dedicare “dedicate”            37.42       abrogare “cancel”              46.73
riscrivere “rewrite”           25.56       applicare “enforce”            45.56
tradurre “translate”           19.82       dedicare “dedicate”            44.40
ristampare “reprint”           17.87       approvare “approve”            38.07
vendere “sell”                 17.12       bocciare “reject”              24.60

romanzo “novel”                LL value    testo “text”                   LL value
scrivere “write”               188.77      pubblicare “publish”           63.13
leggere “read”                 76.63       approvare “approve”            61.26
pubblicare “publish”           52.11       votare “vote”                  59.76
ristampare “reprint”           13.07       leggere “read”                 58.34
concepire “conceive”           11.61       modificare “modify”            58.01
intitolare “give a title”      10.26       scrivere “write”               55.01
pianificare “plan”             8.02        redigere “write”               48.06
filmare “film”                 6.79        emendare “amend”               30.39
comprare “buy”                 6.76        preparare “prepare”            25.37
finire “finish”                6.28        diffondere “circulate”         22.79

Top 10 verbs in the LS of nouns selected by leggere

lettera “letter”               LL value    messaggio “message”            LL value
inviare “send”                 922.22      inviare “send”                 515.77
scrivere “write”               812.93      lanciare “send”                208.60
ricevere “get”                 122.51      mandare “send”                 149.36
spedire “send”                 104.99      ricevere “get”                 70.27
leggere “read”                 96.77       consegnare “deliver”           68.27
mandare “send”                 94.39       trasmettere “transmit”         52.75
recapitare “deliver”           87.28       intercettare “intercept”       36.72
consegnare “deliver”           73.53       leggere “read”                 36.10
pubblicare “publish”           57.60       portare “bring”                24.64
firmare “sign”                 38.14       recapitare “deliver”           24.13

Generating types (1)
Specifying the phys • info type

- The type phys • info does not suffice to account for the whole syntagmatic distribution
- Differences in syntagmatic distribution can be accounted for in terms of QS specifications
- QS can be used to generate more fine-grained types

(1) libro “book”, articolo “article”, romanzo “novel” : phys • info
      ⊗Telic reading_events {read, reread, …}
      ⊗Agentive writing_events {write, rewrite, …}
      ⊗Agentive publishing_events {publish, print, …}

    lettera “letter”, messaggio “message” : phys • info
      ⊗Telic reading_events {read, reread, …}
      ⊗Telic transmission_events {send, circulate, deliver, …}
      ⊗Agentive writing_events {write, compile, …}
      ⊗Agentive publishing_events {publish, …}

    testo “text”, articolo “article” : phys • info
      ⊗Telic applying_events {apply, enforce, …}
      ⊗Agentive performative_events {approve, vote, …}

Top 10 verbs in the LS of nouns selected by leggere

giornale “newspaper”           LL value    intervista “interview”         LL value
leggere “read”                 174.98      rilasciare “give”              294.57
scrivere “write”               83.46       concedere “give”               119.14
stampare “print”               24.75       leggere “read”                 52.37
sfogliare “turn the pages”     20.19       dare “give”                    23.59
leggiucchiare “read”           14.64       mandare “send”                 16.51
posare “put down”              14.37       pubblicare “publish”           15.65
querelare “bring an action”    14.37       rileggere “reread”             15.20
rileggere “reread”             11.51       realizzare “make”              12.25
attaccare “attack”             10.32       raccogliere “collect”          10.58
obbligare “force”              9.85        registrare “record”            9.37

discorso “speech”              LL value    dichiarazione “declaration”    LL value
pronunciare “pronounce”        328.26      rilasciare “give”              913.25
riprendere “start again”       54.52       fare “make”                    84.04
fare “make”                    48.35       diffondere “make circulate”    63.68
tenere “give”                  46.97       leggere “read”                 48.07
leggere “read”                 41.07       presentare “present”           45.91
allargare “enlarge”            39.16       firmare “sign”                 45.78
riaprire “reopen”              26.70       sottoscrivere “endorse”        35.29
ascoltare “listen to”          24.11       smentire “deny”                31.34
rivolgere “address”            21.85       consegnare “deliver”           28.42
concludere “conclude”          16.65       interpretare “interpret”       26.14

Generating types (2)
Discovering new types

(2) intervista “interview”, discorso “speech” : event • info
      ⊗Agentive speech_events {pronounce, address, give a speech, …}
      ⊗Telic listening_events {listen, …}

(3) giornale “newspaper” : organization • (phys • info)
      ⊗Telic reading_events {read, ...}
      ⊗Agentive publishing_events {publish, print, …}
      ⊗Telic agentive_events {edit, attack, ...}

Conclusions so far …

- Variations in the verbal lexical sets can be an indicator of two main facts
  - difference in QS specification
  - difference in type
- Our assumptions about what the type of a N is are substantially confirmed by, and reflected in, its syntagmatic behavior

Predictions about compositional behavior of types (1)
Complex Nouns

- a Complex Noun will compose either by pure selection, with a dot-selecting predicate, or by exploitation, with a natural or artifactual selecting predicate

                      SELECTING TYPE (V)
TYPE SELECTED (N)     Natural         Artifactual     Complex
Natural               selection       introduction    introduction
Artifactual           exploitation    selection       introduction
Complex               exploitation    exploitation    selection

- To test this prediction against the corpus data we use the verbal LSs of the Ns that we assigned to the following types: phys • info, event • info and organization • (phys • info)

Type coercion: dot exploitation (1)

phys-selecting verbs with phys • info nouns
  libro: bruciare “burn”, portare “carry”
  articolo: firmare “sign”, spostare “move”
  romanzo: portare “carry”
  testo: perdere “lose”, firmare “sign”
  lettera: imbucare “post”, conservare “keep”, infilare “put”, distruggere “destroy”, raccogliere “pick up”, esibire “exhibit”, ritrovare “find again”, perdere “lose”
  messaggio: bruciare “burn”, firmare “sign”, portare “bring”, conservare “keep”, infilare “put”
  giornale: aprire “open”, posare “put down”, distribuire “distribute”, mostrare “show”, portare “bring”

Type coercion: dot exploitation (2)

info-selecting verbs with phys • info nouns
  libro: amare “love”, citare “quote”, studiare “study”
  articolo: approvare “approve”, bocciare “reject”, citare “quote”, votare “vote”, correggere “correct”, ignorare “ignore”, commentare “comment”, conoscere “know”
  romanzo: tradurre “translate”
  testo: approvare “approve”, votare “vote”, conoscere “know”, analizzare “analyze”, presentare “present”, discutere “discuss”, citare “quote”, difendere “defend”, spiegare “explain”, controllare “check”
  lettera: censurare “censor”, scorrere “skim”, riassumere “summarize”, interpretare “interpret”, esaminare “examine”, comprendere “understand”, spiegare “explain”, ricordare “remember”, vedere “see”
  messaggio: interpretare “interpret”, citare “quote”, analizzare “analyze”, capire “understand”, spiegare “explain”, decifrare “decipher”

Type coercion: dot exploitation (3)

info-selecting verbs with organization • (phys • info) nouns
  giornale: criticare “criticize”, censurare “censor”, commentare “comment”, smentire “deny”

info-selecting verbs with event • info nouns
  intervista: commentare “comment”, tradurre “translate”, citare
    “quote”, giudicare “judge”, valutare “evaluate”
  discorso: interpretare “interpret”, commentare “comment”, gradire “like”, contestare “question”, giudicare “judge”, ripensare “rethink”
  dichiarazione: smentire “deny”, interpretare “interpret”, travisare “misrepresent”, valutare “evaluate”, calibrare “calibrate”

Type coercion: dot exploitation (4)

event-selecting verbs with event • info nouns
  discorso: riprendere “start again with”, attendere “wait for”, concludere “conclude”, terminare “finish”, improvvisare “improvise”, interrompere “interrupt”, continuare “go on with”, troncare “cut”, avviare “start with”, completare “complete”, cominciare “begin”, iniziare “start”, finire “finish”, proseguire “go on with”, vedere “see”
  intervista: ultimare “finish”, iniziare “start”, interrompere “stop”, vedere “see”, bloccare “stop”, annunciare “announce”

organization-selecting verbs with organization • (phys • info) nouns
  giornale: attaccare “attack”, querelare “prosecute”, danneggiare “damage”, obbligare “force”, dirigere “direct”, costringere “force”, lasciare “leave”

Asymmetries in dot exploitations (1)

articolo, testo are more info than phys

  articolo
    phys: firmare “sign”, spostare “move”
    info: approvare “approve”, bocciare “reject”, citare “quote”, votare “vote”, correggere “correct”, ignorare “ignore”, commentare “comment”, conoscere “know”
  testo
    phys: perdere “lose”, firmare “sign”
    info: approvare “approve”, votare “vote”, conoscere “know”, analizzare “analyze”, presentare “present”, revisionare “amend”, discutere “discuss”, censurare “censor”, citare “quote”, decifrare “decipher”, difendere “defend”, spiegare “explain”, controllare “check”

Asymmetries in dot exploitations (2)

articolo, testo are less phys than libro and lettera

  libro
    phys: bruciare “burn”, mandare “send”, portare “carry”
    info: amare “love”, citare “quote”, studiare “study”
  lettera
    phys: imbucare “post”, conservare “keep”, infilare “put”, distruggere “destroy”, raccogliere “pick up”, esibire “exhibit”, ritrovare “find again”, perdere “lose”
    info: riassumere “summarize”, interpretare “interpret”, esaminare “examine”, comprendere “understand”, spiegare “explain”

Introduction (human)

  libro: accusare “accuse” ===> the PERSON who wrote the book
  testo: difendere “defend”
  lettera: condannare “condemn”

Introductions (phys)

  intervista: leggere “read”, mandare “send”, rileggere “reread”, pubblicare “publish”
  discorso: leggere “read”
  dichiarazione: consegnare “deliver”, leggere “read”, firmare “sign”

Domain-shifting introductions (event)

  libro: terminare “finish”, cominciare “start”
  romanzo: finire “finish”, cominciare “start”, aprire “open”
  articolo: concludere “conclude”, iniziare “start”, cominciare “begin”, terminare “finish”, chiudere “close”
  testo: completare “complete”, finire “finish”
  lettera: concludere “conclude”, terminare “finish”, interrompere “interrupt”, finire “finish”
  messaggio: concludere “conclude”, cominciare “start”, finire “finish”

Predictions about compositional behavior of types (2)
Dot-selecting predicate

- a dot-selecting predicate will compose either by pure selection, with a matching dot-argument, or by introduction, with natural and artifactual arguments

                      SELECTING TYPE (V)
TYPE SELECTED (N)     Natural         Artifactual     Complex
Natural               selection       introduction    introduction
Artifactual           exploitation    selection       introduction
Complex               exploitation    exploitation    selection

- To test this prediction against the corpus data we use the nominal LSs of leggere

Dot-selecting predicate: leggere

- selection: leggere un libro “book”, un articolo “article”, un romanzo “novel”, una lettera “letter”
- dot exploitation: leggere un giornale “newspaper”
- introduction
  - phys: leggere la trama “plot”, la musica “music”, un discorso “speech”
  - info: leggere la mano “hand”, una lapide “headstone”, un dispositivo “device”, un contatore “meter”
  - phys and info: leggere l’anima “soul”, gli umori “mood”

Accounting for different senses of leggere

  leggere: leggere una radiografia (an x-ray), il grafico (a graph), un sintomo (a symptom), una favola (a tale) …

Corpus evidence helps us to...

- confirm or falsify our assumptions about what the semantic type of a given N is
- refine the representation of QS
- empirically test our assumptions about compositional operations of coercion and co-composition

Concluding remarks

- Mutual feeding between corpus data and models of the lexicon
  - an architecture of the lexicon like GL can provide the interpretative key for various corpus data
  - corpus data can help to anchor the study of lexical dynamics and architecture on empirical evidence (possibly enriching the model)
- Future research
  - extend the analysis to other syntagmatic relations (e.g. subj, modifiers, etc.)
  - extend the analysis to other semantic types