Language Resources (lexicons, corpora, ontologies, …): Standards and Language Technologies
Nicoletta Calzolari
Istituto di Linguistica Computazionale - CNR - Pisa
With many others at ILC
N. Calzolari, Doctoral course, Pisa, May 2009

Old slide with Antonio Zampolli (’80s/early ’90s)
Why were such needed LRs still lacking after 30 years of R&D in the field?
1) Because the main trend until the mid-’80s was to privilege the processing of “critical” phenomena, studied by the dominant linguistic theories, rather than the deep analysis of the real uses of a language. As a result, CL was focusing on:
• few examples, often artificially built
• lexicons made of few entries (toy lexicons)
• grammars with poor coverage
2) Because large-scale LRs are costly & their production requires a big organizational effort
Why do we still lack them?

Historical notes
The beginnings…
• After many years of complete disregard, or even disdain and contempt, for LRs, due mainly to the prevalence and influence of the generativist school
• Work on Machine Readable Dictionaries: early interest, pioneering research
  • To make them machine-tractable
  • To extract info from them, with much less powerful tools than now
  • Precursor of the trend of automatic acquisition from corpora
• Acquilex (Pisa et al.)
• Work on/with the Longman dictionary (Las Cruces)
• NSF & EC International Cooperation grant, promoted by Wilks, Zampolli, Calzolari (Las Cruces & Pisa)
Don Walker & Antonio Zampolli

… back from the ’70s/’80s
Automatic acquisition of lexical information from MRDs
• Was at the centre of activities in the Pisa group, Amsler, Briscoe, Boguraev, Wilks’ group, IBM, then Japanese groups, …
• The trend was: “large-scale computational methods for the transformation of machine readable dictionaries (MRDs) into machine tractable dictionaries”
• It became evident that:
Part of the results of meaning extraction, e.g.
many meaning distinctions, which could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level, and had to be blurred into unique features and values.
Unfortunately, it is still difficult today to constrain word meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries

After that pioneering era, production & use of adequate LRs strongly increased
• The lexicon has become ever more relevant
• Both international and national authorities started investing in the field as never before, interested in technologies & systems which really work and are economically interesting
• The need for empirical methods, based on the analysis of large amounts of data, has been recognized
• LRs must be robust enough for analysing the concrete uses of a language, whether theoretically “interesting” or not
• Data-driven approaches

Since then …
LRs have acquired larger resonance in the last 2 decades, when many activities, in Europe and world-wide, have contributed to substantial advances in knowledge and capability of how to represent, create, acquire, access, exploit, harmonise, tune, maintain, distribute, etc. large lexical and textual repositories
In Europe an essential role was played by the EC, through initiatives that saw the participation of many EU groups, linked over the years by sharing common approaches and visions:
NERC, PAROLE, SIMPLE, EuroWordNet, EAGLES, ISLE, ELSNET, RELATOR, …

… back from the late ’80s
After acquisition from MRDs, automatic acquisition of info from texts:
This trend has become today a consolidated fact, and we have moved from focusing on acquisition of “linguistic information” (as at the beginning) to broader acquisition of “general knowledge”, with more data intensive, robust, reliable methods

We started building:
LRs as necessary infrastructure (Lexicons/Corpora) both for research & applications: LRs give NLP systems the knowledge needed for the various kinds of linguistic processing
Realising that most of the needed information
• escapes individual “introspection”
• can only be acquired by analysing large textual corpora attesting language use in different fields/communicative contexts
By-product?: Importance of statistical methods, BUT need for adequate models to handle actual usage of language
Lesson: Going from core sets to large coverage has implications not just in quantitative terms but, more interestingly, in terms of changes to the models and the strategies of processes

What have we (LT & LR) been assembling for many years?
• Lexicons & their Ontologies (Written, Spoken, ItalWordNets, PAROLE/SIMPLE, FrameNets, …)
• Annotated corpora/Treebanks
• Basic Tools
• Integrated Architecture for:
  • Annotation at various levels (from morph. to conceptual)
  • Acquisition/learning
  • Classification
  • Ontology creation
  • …
• Methodologies
• Know-how & expertise
• Infrastructural bodies (on which to build)
• Standards
… components of a very large infrastructure of LRs & LT

History: Some international LRs initiatives
ACQUILEX [since ’88], MULTILEX, ET-7, ET-10, TEI, NERC, RELATOR, ONOMASTICA, MULTEXT, COLSIT, LSGRAM, DELIS, EAGLES, PAROLE, SIMPLE, SPARKLE, ELSNET, EuroWordNet
Essential role of EC to start a basic Infrastructure
Established a model
EU at the forefront in the areas of LRs and standards in the ’90s
MATE, NITE, Cluster 488 (Italian), TAL (Italian), ISLE, ENABLER, INTERA, LIRICS, …, Senseval/Semeval, WRITE, Forum TAL (Italian), …, ISO, ELRA, LREC, LRE Journal, NEDO Language Grid, BootStrep, KYOTO, …

Today: a broad “potential” Infrastructure for LRs
Vitality & Success signs…
[Diagram labels: RELATOR, EAGLES/ISLE, ENABLER, ELSNET, TELRI, INTERA, LIRICS, …, ELRA, BLARK, Unified Lexicon (W/S), LREC, EU, LDC & others, ISO, COCOSDA/WRITE, US Cyberinfrastructure, Japan COE21, NEDO Language Grid, …, National, LRE journal, …, ERANET-LangNet, …, International, FLaReNet (ICT), CLARIN (ESFRI), …]

WordNets
Synsets linked by semantic relations
TOP Concepts: Object, Artifact, Building
Hyperonym: {edificio, ..}
{casa, abitazione, dimora} {home, domicile, ..} {house}
Role_location: {stare, abitare, ...}
Hyponym: {villetta} {catapecchia, bicocca, ..} {cottage} {bungalow}
Role_target_direction: {rincasare}
Role_patient: {affitto, locazione}
Mero_part: {vestibolo} {stanza}
Holo_part: {casale} {frazione} {caseggiato}

ItalWordNet Semantic Network [Italian module of EuroWordNet]
• ~55,000 lemmas organized in synonym groups (synsets), structured in hierarchies & linked by ~130,000 semantic relations:
  • ~55,000 hyperonymy/hyponymy relations
  • ~16,000 relations among different POS (role, cause, derivation, etc.)
  • ~2,000 part-whole relations
  • ~1,500 antonymy relations, … etc.
• Synsets linked to the InterLingual Index (ILI = Princeton WordNet)
• Through the ILI, link to all the European WordNets (de-facto standard) & to the common Top Ontology
• Possibility of plug-in with domain terminological lexicons (legal, maritime, … linguistic)
• Usable in IR, CLIR, IE, QA, ...

ItalWordNet: Clusters of “Base Concepts” = words classified according to Ontology Top Concepts
Lexicon or ontology ???
[Diagram: Top = features. 1stOrderEntity: Function, Composition, Origin, Form, etc. 2ndOrderEntity: SituationType, SituationComponent, etc. Feature values shown: Covering, Part, Group, Natural, Object, Static, Dynamic, Physical, Location, Experience, Mental, Living, Human. Base Concept clusters shown: skin body hair part bodycell covering muscle organ; church company institute organization party union; human adult adult female adult male child native offspring; direction distance spatial property spatial relation course path; change of position divide locomotion motion; feel desire disturbance emotion feeling humor pleasance]

EWN TopOntology (ItalWordNet)
1stOrderEntity
• Origin: Natural (Living: Plant, Human, Creature, Animal), Artifact
• Form: Substance (Solid, Liquid, Gas), Object
• Composition: Part, Group
• Function: Vehicle, Representation (MoneyRepresentation, LanguageRepresentation, ImageRepresentation), Software, Place, Occupation, Instrument, Garment, Furniture, Covering, Container, Comestible, Building
2ndOrderEntity
• SituationType: Dynamic (BoundedEvent, UnboundedEvent), Static (Property, Relation)
• SituationComponent: Cause (Agentive, Phenomenal, Stimulating), Communication, Condition, Existence, Experience, Location, Manner, Mental, Modal, Physical, Possession, Purpose, Quantity, Social, Time, Usage
3rdOrderEntity

EuroWordNet Multilingual Data Structure
[Diagram: the ILI at the centre, linked to the TOP ONTOLOGY (LIVING, ANIMAL, HUMAN) and to the language wordnets: Italian WN (cane), English WN (dog), Spanish WN (perro), Dutch WN (hond), French WN, German WN, Estonian WN, Czech WN, …]

Terminological Wordnets: e.g. Jur-WordNet
Jur-WordNet: Extension for the juridical domain of ItalWordNet (With ITTIG-CNR - Istituto di Teoria e Tecniche dell’Informazione Giuridica)
• Knowledge base for multilingual access to sources of legal information
• Source of metadata for semantic markup of legal texts
• To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.

Terminological Lexicon of Navigation
Nolo
Synsets: 1,614
Lemmas: 2,116
Senses: 2,232
Nouns: 1,621
Verbs: 205
Adjectives: 35
Proper Nouns: 236

SIMPLE Lexicon & Ontology: Multidimensional Type Hierarchy
http://www.ilc.cnr.it/clips/CLIPS_ENGLISH.htm
• Shared by 12 European languages
• Theoretical background: Generative Lexicon (Pustejovsky)
• 157 language-independent SIMPLE semantic types:
  • Based on hierarchical & non-hierarchical conceptual relations
  • Difference of internal complexity:
    • Simple types (one-dimensional), characterised in terms of hyperonymic relations
    • Unified types (multi-dimensional), only definable through the combination of: the relation to their supertype + the reference to orthogonal dimensions of meaning (through the Qualia Structure)

PAROLE-SIMPLE-CLIPS Lexicon: … harmonised model for 12 European languages

Overall Organization
[Diagram: a Type Ontology (150 types) with its Templates is instantiated in the individual lexicons (..., Greek lexicon, Danish lexicon, Catalan lexicon, Italian lexicon). Each SemU carries Qualia, Derivation, Polysemy, Event Type, … and a Pred. Layer (predicate, arguments, selection restrictions)]

Model Architecture
The first three levels: Information content
• Phonological Unit: stress position, vowel openness, cons. pronunciation
• Corresp. PhnU-MrphU
• Morphological Unit: PoS (& PoS subcategory), inflectional paradigm
• Corresp. MrphU-SynU
• Syntactic Unit: syntactic behaviour, given as a Frameset of Synt. Struct 1 and Synt. Struct 2, each with a. head properties and b. subcat. frame (position list; for each position: syntactic argument, position restr.)

The semantic level: Information types
Semantic Unit
• Ontological type
• Domain
• Event Type
• Semantic properties
• Extended Qualia Structure
• Synonymy
• Derivation
• Regular Polysemy alt.
• Predicative Representation: lexical predicate; arguments: sem. role, sem. restr.
• Link to syntactic unit

SEMANTIC ENTRY CONTENT
Aumento (Increase): L’aumento dei prezzi di un venti% (“the increase in prices by twenty percent”)
ONTOLOGICAL INFO.
• Semantic type: Cause_change_of_value
• Supertype: Cause_relational_change
• Eventype: transition
• Domain: general, economics
• Gloss: accrescimento in dimensione o quantità (growth in size or quantity)
EXTENDED QUALIA INFO.
• aumento Isa cambiamento
• aumento resulting_state maggiore
• Agentivecause: yes
• Direction: up
PREDICATIVE REPRESENTATION
• Morphological derivation: Eventverb aumentare
• Semantic predicate: PRED_aumentare; 3 arguments
• Type of link: event nominalization
• Arguments description: range, semantic role & selectional restriction:
  Arg0: Protoagent; Human / Institution
  Arg1: ProtoPatient; Entity
  Arg2: Quantifier; Amount

Semantic entry: USem3527vaporizzatore
• ontological type
  semantic type: Instrument
  unification_path: [Concrete_entity | ArtifactAgentive | Telic]
• free definition: apparecchio usato per vaporizzare (device used for spraying)
• example: un vaporizzatore per piante (a plant sprayer)
• event type: eventype: =====
• domain information: cleaning, gardening, cosmetics
• semantic relations
  synonymy: USem3527vaporizzatore, USem72288nebulizzatore
  instrumentverb: USem3527vaporizzatore, Usem5239vaporizzare
• qualia features: =====
• Extended Qualia Structure
  USem3527vaporizzatore isa Usem3479apparecchio
  USem3527vaporizzatore has_as_part Usem61633pulsante
  USem3527vaporizzatore created_by UsemD387fabbricare
  USem3527vaporizzatore used_for UsemD66019nebulizzare
• regular polysemy: =====
• predicative representation
  semantic predicate: PRED_vaporizzare-1
  type of link: instrument nominalization
  arguments description (range; semantic role; select. restrictions):
    arg0_vaporizzare_1; Protoagent; Human/Instrument
    arg1_vaporizzare_1; Protopatient; +liquid
    arg2_vaporizzare_1; Location; Concrete_entity
from Nilda Ruimy

Semantic entry: USem79678regulate
• ontological type
  semantic type: Cause_change_of_state
  supertype: Cause_relational_change
• free definition: regulation of a function or a physiological process
• example: IL2 negatively regulates IL7
• event type: eventype: transition
• domain information: biomedicine
• semantic relations
  synonymy: =====
  morpho. derivation: =====
• qualia features
  agentive_cause: yes
  resulting_state: yes
• Extended Qualia Structure
  formal: Usem79678regulate isa Usem64875process
  constitutive: =====
  agentive: =====
  telic: =====
• regular polysemy: =====
• predicative representation
  semantic predicate: PRED_regulate-1
  type of link: master
  arguments description (range; semantic role; select. restrictions):
    arg0_regulate_1; Protoagent; Natural_Substance
    arg1_regulate_1; Protopatient; Natural_Substance
from Nilda Ruimy

Semantic entry: UsemTH31676parotite
• ontological type
  semantic type: Disease
  unification_path: [Phenomenon | Agentive]
• free definition: Infiammazione delle ghiandole parotidi (inflammation of the parotid glands)
• example: il bambino ha una parotite (the child has mumps)
• event type: eventype: =====
• domain information: Ear-Nose-Throat
• semantic relations
  synonymy: USemTH31676parotite, USem79528orecchione
• qualia features
  agentive_cause: yes
• Extended Qualia Structure
  USemTH31676parotite isa USem3868malattia
  USemTH31676parotite affects USem1788ghiandola
  USemTH31676parotite causes Usem72131gonfiore
  USemTH31676parotite caused_by USem1971virus
  USemTH31676parotite typical_of USem3593bambino
• regular polysemy: =====
• predicative representation: =====
from Nilda Ruimy

Syntactic entry
Example: NF-AT positively regulates IL2, which negatively regulates IL7
SYNU_regulateV
• head properties: verb; auxiliary: have; passivization: +
• subcategorization frame, syntactic arguments:
  P0: subject, mandatory, NP
  P1: object, mandatory, NP
• link to Semantic Unit: USem79678regulate
from Nilda Ruimy

Syntax-semantics mapping (1)
[Diagram: a Syntactic Unit (Frameset of syntactic structure 1 and syntactic structure 2, each with a. head properties and b. subcat. frame; for each position: synt. restr.) is linked through Corresp. SynU-SemU (Corresp. Syntax-Semantics) to a Semantic Unit (ontological type, domain, event type, semant. class, semant. features, semant. relations such as derivation and synonymy, Extended Qualia Structure with formal, constitutive, agentive and telic roles, regular polysemy, predicative represent. with type of link, predicate, arguments and sem. restr.)]
from Nilda Ruimy
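
As a rough illustration of how the semantic and syntactic entries for "regulate" fit together, the sketch below stores them as plain Python records. The class and field names (SemU, SynU, Argument, Position, same_valency) are hypothetical and chosen only for this example; they are not the PAROLE/SIMPLE encoding format. The values are copied from the two slides on USem79678regulate and SYNU_regulateV.

```python
from dataclasses import dataclass, field


@dataclass
class Argument:
    name: str         # range, e.g. "arg0_regulate_1"
    role: str         # semantic role
    restriction: str  # selectional restriction


@dataclass
class Position:
    name: str         # e.g. "P0"
    function: str     # e.g. "subject"
    mandatory: bool
    category: str     # e.g. "NP"


@dataclass
class SemU:
    ident: str
    semantic_type: str
    supertype: str
    event_type: str
    domain: str
    predicate: str
    link_type: str
    arguments: list = field(default_factory=list)
    qualia: list = field(default_factory=list)  # (relation, target) pairs


@dataclass
class SynU:
    ident: str
    auxiliary: str
    passivization: bool
    positions: list
    semu_id: str      # link to the Semantic Unit


regulate_sem = SemU(
    ident="USem79678regulate",
    semantic_type="Cause_change_of_state",
    supertype="Cause_relational_change",
    event_type="transition",
    domain="biomedicine",
    predicate="PRED_regulate-1",
    link_type="master",
    arguments=[
        Argument("arg0_regulate_1", "Protoagent", "Natural_Substance"),
        Argument("arg1_regulate_1", "Protopatient", "Natural_Substance"),
    ],
    qualia=[("isa", "Usem64875process")],
)

regulate_syn = SynU(
    ident="SYNU_regulateV",
    auxiliary="have",
    passivization=True,
    positions=[
        Position("P0", "subject", True, "NP"),
        Position("P1", "object", True, "NP"),
    ],
    semu_id=regulate_sem.ident,
)


def same_valency(syn: SynU, sem: SemU) -> bool:
    """True when positions and arguments could be paired one to one."""
    return len(syn.positions) == len(sem.arguments)
```

Calling same_valency(regulate_syn, regulate_sem) is only a quick sanity check that a one-to-one pairing of positions and arguments is possible; it says nothing about which position realises which argument.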

Regulate: Syntax-Semantics mapping
SEMANTICS
• predicative representation
  semantic predicate: PRED_regulate-1
  type of link: master
  semantic arguments description (range; semantic role; select. restrictions):
    arg0_regulate_1; Protoagent; Natural_Substance
    arg1_regulate_1; Protopatient; Natural_Substance
SYNTAX
• subcategorization frame id: np-v-np
  syntactic arguments:
    P0: subject, mandatory, NP
    P1: object, mandatory, NP
synsem correspondence
<Correspondence id="ISObivalent" correspargposl="ARG0-P0 ARG1-P1 "> </Correspondence>
from Nilda Ruimy

SYNTAX-SEMANTIC MAPPING
SYNTACTIC LEVEL
SynU_aumentare_V ‘to increase’, Frameset:
• Transitive structure: P0 P1 P2
• Intransitive structure: P0 P1
SEMANTIC LEVEL
• SemU1_aumentare, Sem.Type: CAUSE_CHANGE_OF_VALUE
• SemU2_aumentare, Sem.Type: CHANGE_OF_VALUE
LINK PREDICATE-SEMANTIC UNIT
SEMANTIC PREDICATE PRED_aumentare_1
• ARG0: Agent; Entity
• ARG1: Patient; Entity
• ARG2: Undersc.; Amount
from N. Ruimy

SYNTAX-SEMANTIC MAPPING
SynU_aumentare_V, Frameset: Transitive structure (P0 P1 P2), Intransitive structure (P0 P1)
CORRESPONDENCE SYNTACTIC-SEMANTIC FRAME
• Transitive structure and SemU1_aumentare (CAUSE_CHANGE_OF_VALUE): isomorphic correspondence
  <Correspondence id="ISOtrivalent" correspargposl="ARG0-P0 ARG1-P1 ARG2-P2"> </Correspondence>
• Intransitive structure and SemU2_aumentare (CHANGE_OF_VALUE): non-isomorphic corresp.
  <Correspondence id="AUG2to3erg9" comment=" Augmented mapping from TWO Position description to THREE argument description. ARG0 not represented in syntax" correspargposl="ARG1-P0 ARG2-P1"> </Correspondence>
PRED_aumentare: ARG0: Agent; ARG1: Patient; ARG2: Undersc.
from N. Ruimy

Relations and Predicates
Pred_SELL <ARG0>, <ARG1>, <ARG2>, <ARG3>
• SemU Sell (V)
• SemU Seller (N): Is_the_agent_of
• SemU Sale (N): Event_noun
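
The Correspondence elements on the mapping slides can be read mechanically. The sketch below is one possible reading, assuming that correspargposl is a whitespace-separated list of pairs written ARGn-Pm; the function names parse_correspondence and unexpressed_arguments are hypothetical helpers for this illustration, not part of any PAROLE/SIMPLE tool. It recovers the argument-to-position map and lists the arguments with no syntactic position, which is what separates the isomorphic from the non-isomorphic case.

```python
import xml.etree.ElementTree as ET


def parse_correspondence(correspargposl: str) -> dict:
    """Turn 'ARG0-P0 ARG1-P1' into {'ARG0': 'P0', 'ARG1': 'P1'}."""
    mapping = {}
    for pair in correspargposl.split():  # split() also drops trailing spaces
        arg, pos = pair.split("-", 1)
        mapping[arg] = pos
    return mapping


def unexpressed_arguments(predicate_args: list, mapping: dict) -> list:
    """Arguments of the predicate that have no syntactic position."""
    return [arg for arg in predicate_args if arg not in mapping]


# Element copied from the "regulate" slide; the attribute is read with ElementTree.
element = ET.fromstring(
    '<Correspondence id="ISObivalent" correspargposl="ARG0-P0 ARG1-P1 "> </Correspondence>'
)
iso_bivalent = parse_correspondence(element.get("correspargposl"))

# Attribute values from the two "aumentare" correspondences.
iso_trivalent = parse_correspondence("ARG0-P0 ARG1-P1 ARG2-P2")
aug_2_to_3 = parse_correspondence("ARG1-P0 ARG2-P1")

PRED_AUMENTARE_ARGS = ["ARG0", "ARG1", "ARG2"]

# Isomorphic: every argument has a position.
print(unexpressed_arguments(PRED_AUMENTARE_ARGS, iso_trivalent))
# Non-isomorphic: ARG0 is not represented in syntax.
print(unexpressed_arguments(PRED_AUMENTARE_ARGS, aug_2_to_3))
```

For the intransitive structure only ARG0 is left without a position, which matches the comment attribute of the AUG2to3erg9 element.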

“Predicate - semantic unit(s)” link & Relations
PRED_ACCUSARE <ARG0>, <ARG1>, <ARG2>
• accusare (to accuse): master
• accusa (accusation): process nominalisation; Event_noun
• accusatore (accuser): agent nominalisation; Is_the_agent_of
• accusato (accused): patient nominalisation
from Nilda Ruimy

The SIMPLE ontology
• Simple Ontology: multidimensional type hierarchy based on both hierarchical and non-hierarchical conceptual relations
• In the SIMPLE ontology, types are not mere labels but the repository of a specific set of structured semantic information
from Nilda Ruimy

The SIMPLE ontology
[Diagram: top nodes shown: TOP, CONSTITUTIVE (with PART, GROUP, AMOUNT), AGENTIVE, TELIC, CAUSE, ENTITY. Under ENTITY: CONCRETE_ENTITY, PROPERTY, ABSTRACT_ENTITY, REPRESENTATION, EVENT. Types shown below these nodes: Location, Material, Artifact, Artifact Material, Furniture, Food, Clothing, Physical Object, Container, Organic Object, Artwork, Instrument, Money, Living Entity, Human, Substance, Animal, Vegetal Entity, Vehicle, Semiotic Artifact; Quality, Psych Property, Physical Property, Social Property; Domain, Time, Moral Standards, Cognitive Fact, Mvmt of Thought, Institution, Convention, Abstract Location; Language, Sign, Information, Number, Unit of measure, Metalanguage. Under EVENT: Phenomenon (Weather verbs, Disease, Stimuli), Aspectual, Cause Aspect., State (Exist, Rel. State), Act, Psychological_event, Change, Cause_change, with the subtypes Cognitive Event, Experience Event, Rel. Change, Change Possession, Change Location, Natural Transition, Acquire Knowledge, Cause Rel. Change, Cause Change Location, Cause Natural Transition, Creation, Give Knowledge, Move, Cause Act, Speech Act, Non Rel. Act, Relational Act]
from Nilda Ruimy

Ontology of Structured Semantic Types: a Template
Schema providing a set of structured information crucial to the definition of a semantic type
Interface between ontology & lexicon; guide for the lexicographer
SemU: Identifier of the Semantic Unit
Related SynU: Identifier of the Syntactic Unit the SemU is related to
IWN Base Concept: Number of the corresponding ItalWordNet base concept
Template_Type: [Container]
Unification_path: [Concrete_entity | ArtifactAgentive | Telic]
Domain: General
Semantic Class: Link to the LexiQuest (or any other ontology)
Gloss: Lexicographic gloss
Predicative Representation: Predicate associated to the SemU and its argument structure [container_pred (arg0)]
Arg. Selectional Restrictions: Selectional restrictions (Arg0-HeadQuantified-Substance)
Derivation: Derivational relations between SemUs
Qualia_Formal: isa (1, <container> or <hyperonym>)
Qualia_Agentive: created_by (1, <Usem>: [CREATION]) //definitorial//
Qualia_Constitutive: made_of (1, <Usem>) //optional//; has_as_part (1, <Usem>) //optional//; contains (1, <Usem>)
Qualia_Telic: used_for (1, <contain>) //definitorial//; used_for (1, <measure>) //optional//
Synonymy: Synonyms of the SemU //optional//
Regular Polysemy: [Amount] [Container]

Semantic type in the SIMPLE Ontology
Not just a label but rather a classificatory device consisting of a cluster of structured semantic information
Type assignment means endowing a word-sense with a structured set of semantic features and relations with a view to:
• distinguishing it from other senses of the same word
• expressing its similarity with other words
• expressing its relationships to other words
• drawing inferences from this information
Each semantic type is associated with a template, i.e.
a schematic structure that contains a cluster of type-defining properties and imposes constraints on lexical items for type membership
Templates: interface between Ontology and Lexicon
Template-driven encoding methodology ensures internal and cross-lexicon consistency
from Nilda Ruimy

Template for the sem. type ‘Instrument’
ontological information
  SemU: Identifier of a SemU
  SynU: Identifier of the SynU to which the SemU is linked
  BC Number: Number of the corresponding Base Concept in EuroWordNet
  Template_Type: Instrument
  Template_Supertype: Semantic type which dominates the type of the SemU in the type-hierarchy
  Unification_path: [Concrete_entity | ArtifactAgentive | Telic]
  Domain: Domain information
  Semantic Class: One of WordNet Classes
  Gloss: Lexicographic definition
  Event Type: Type of event (state, process, transition)
predicative representation
  Predicative Representation: Predicate associated with the SemU, and its argument structure
  Selectional Restr.: Selectional restrictions on the arguments
  Derivation: Derivational relations between SemUs
extended qualia structure
  Formal: Usem_1 isa Usem_2 [Artifact]
  Agentive: Usem_1 created_by Usem_2 [Creation]
  Constitutive: Usem_1 made_of Usem_2 [Substance] OPTIONAL; Usem_1 has_as_part Usem_2 [Artifact] OPTIONAL
  Telic: Usem_1 used_for Usem_2 [Event]
  Synonymy: Synonyms of the SemU
  Collocates: Collocate information
  Complex: Polysemous class of the SemU
from Nilda Ruimy

[Diagram: hierarchy of 100 relations. Top dominates Formal (Is_a, ...), Constitutive (Is_a_part_of, Property, Contains, ...), Agentive (Created_by, Agentive_cause, ...) and Telic (Indirect_telic, Purpose, Activity with Is_the_habit_of, Instrumental with Used_for and Used_as, ...)]
The targets of relations identify:
• prototypical semantic information associated with a SemU
• elements of dictionary definitions of SemUs
• typical corpus collocates of the SemU
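
To make the idea of template-driven encoding concrete, here is a minimal sketch of a membership check against the ‘Instrument’ template. It assumes that every qualia relation not marked OPTIONAL in the template is required; that reading, the dictionary layout and the name missing_relations are assumptions made for this illustration only. The check looks at relation names and does not verify the semantic type of the targets ([Artifact], [Creation], [Event]). The sample entry uses the relations of USem3527vaporizzatore.

```python
# relation -> (qualia role, required?)  "required" is an assumed reading of the template
INSTRUMENT_TEMPLATE = {
    "isa": ("Formal", True),
    "created_by": ("Agentive", True),
    "made_of": ("Constitutive", False),      # marked OPTIONAL in the template
    "has_as_part": ("Constitutive", False),  # marked OPTIONAL in the template
    "used_for": ("Telic", True),
}


def missing_relations(entry_relations: list, template: dict) -> list:
    """Return the required template relations that the entry does not encode."""
    present = {relation for relation, _target in entry_relations}
    return sorted(
        relation
        for relation, (_role, required) in template.items()
        if required and relation not in present
    )


# Extended Qualia Structure of USem3527vaporizzatore, as (relation, target) pairs.
vaporizzatore = [
    ("isa", "Usem3479apparecchio"),
    ("has_as_part", "Usem61633pulsante"),
    ("created_by", "UsemD387fabbricare"),
    ("used_for", "UsemD66019nebulizzare"),
]

# A deliberately incomplete copy, to show what the check reports.
incomplete = [pair for pair in vaporizzatore if pair[0] != "used_for"]

print(missing_relations(vaporizzatore, INSTRUMENT_TEMPLATE))
print(missing_relations(incomplete, INSTRUMENT_TEMPLATE))
```

A lexicographer's tool could run such a check at encoding time, which is one way the consistency claimed for template-driven encoding could be enforced in practice.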

Qualia Structure
• One of the four levels of semantic representation in the theory of Generative Lexicon
• Consists of four qualia roles encoding orthogonal dimensions of meaning:
  • formal role (general identification)
  • constitutive role (composition)
  • agentive role (origin)
  • telic role (function)

Extended Qualia Structure
Formal: isa, antonym_comp, antonym_grad, mult_opposition
Constitutive: made_of, is_a_follower_of, has_as_member, is_a_member_of, has_as_part, instrument, kinship, is_a_part_of, resulting_state, relates, uses, causes, concerns, affects, constitutive_activity, contains, has_as_colour, has_as_effect, has_as_property, measured_by, measures, produces, produced_by, property_of, quantifies, related_to, successor_of, precedes, typical_of, regulates, is_regulated_by, feeling, …..; is_in, lives_in, typical_location (LOCATION)
Agentive: result_of, agentive_prog, agentive_cause, agentive_experience, caused_by, source; created_by, derived_from (ARTIFACTUAL AGENTIVE)
Telic: used_for, used_as, used_by, used_against (INSTRUMENTAL); indirect_telic, purpose; is_the_activity_of, is_the_ability_of, is_the_habit_of, object_of_activity (TELIC ACTIVITY); DIRECT TELIC
Subgroup labels shown: AGENTIVE, CONSTITUTIVE, PROPERTY
Example pairs: disgusto, provare (disgust, feel); casa, costruire (house, build); mohair, capra (mohair, goat); proiettile, colpire (projectile, hit); bisturi, chirurgo (lancet, surgeon); medico, curare (doctor, cure); pane, farina (bread, flour); senatore, senato (senator, senate); manubrio, bicicletta (handlebar, bicycle)

“Extended” Qualia Structure
Formal: is_a, antonym_comp, antonym_grad, mult_opposition
Constitutive: made_of, is_a_follower_of, has_as_member, is_a_member_of, has_as_part, instrument, kinship, is_a_part_of, resulting_state, relates, uses, causes, concerns, affects, constitutive_activity, contains, has_as_colour, has_as_effect, has_as_property, measured_by, measures, produces, produced_by, property_of, quantifies, related_to, successor_of, precedes, typical_of, feeling; is_in, lives_in, typical_location (LOCATION)
Agentive: result_of, agentive_prog, agentive_cause, agentive_experience, caused_by, source; created_by, derived_from (ARTIFACTUAL AGENTIVE)
Telic: used_for, used_as, used_by, used_against (INSTRUMENTAL); indirect_telic, purpose (TELIC); is_the_activity_of, is_the_ability_of, is_the_habit_of, object_of_activity (ACTIVITY); DIRECT TELIC
Subgroup labels shown: AGENTIVE, CONSTITUTIVE, PROPERTY
NEW! regulates, is_regulated_by, …..
Example pairs: T-cell, Blood Stem Cell; Ribose, Nucleotide; Catalyze, Enzyme

Meaning dimensions expressed by Qualia relations
botte (barrel), traditional dictionary definition: recipiente di legno, fatto di doghe arcuate tenute unite da cerchi di ferro, che serve per la conservazione e il trasporto di liquidi, specialmente vino (wooden container, made of curved staves held together by iron hoops, used for storing and transporting liquids, especially wine)
• recipiente: Formal: isa
• di legno: Constitutive: made_of
• fatto: Agentive: created_by
• di doghe arcuate tenute unite da cerchi di ferro: Constitutive: made_of
• che serve per la conservazione e il trasporto: Telic: used_for
• di liquidi, specialmente vino: Constitutive: contains
from Nilda Ruimy

…by using Lexical Resources: Multidimensional Knowledge Bases
Ala (wing):
• SemU: 3232, Type: [Part], Parte di aeroplano (part of an aeroplane): part_of aeroplano; agentive fabbricare; used_for volare
• SemU: 3268, Type: [Part], Parte di edificio (part of a building): part_of edificio; agentive fabbricare
• SemU: D358, Type: [Body_part], Organo degli uccelli (organ of birds): part_of uccello; used_for volare
• SemU: 3467, Type: [Role], Ruolo nel gioco del calcio (role in football): isa giocatore; member_of squadra

Semantic Multidimensionality & NLP
NLP tasks (IE, WSD, NP Recognition, etc.)
need to access multidimensional aspects of word meaning: Extended Qualia Relations
• Is_a_part_of: la pagina del libro (the page of the book)
• Member_of: il difensore della Juventus (Juventus fullback)
• Telic: il suonatore di liuto (the lute player)
• Made_of: il tavolo di legno (the wooden table)

Disambiguation = Interpretation of conceptual relations in context
• duna di sabbia (sand dune): made_of ?
• bicchiere di birra (glass of beer): contains ?
• fetta di pane (slice of bread): is_a_part_of ?
[Diagram: ONTOLOGY nodes shown: SUBSTANCE, liquid, ARTIFACTUAL_DRINK, …]
from Nilda Ruimy

Domain - Semantic class
[Diagram: words of the cooking and eating domain linked by semantic class and qualia relations (TELIC: Used_for, Object_of_the_activity; AGENTIVE: Created_by). Words shown: zucchero, alloro, tartufo, mestolo, mangiare, cucinare, carne, pentola, cuocere, friggitrice, arrostire, bollire, tavola, forchetta, ristorante, lessare, stufare, posata, friggere, cuoco, rosolare, bollitore, grigliare, mela, carota, coniglio, arrosto, pesciera, … Classes shown: NATURAL_SUBSTANCE, +edible, FLAVOURING, VEGETAL_ENTITY, BUILDING, FURNITURE, FOOD, FRUIT, VEGETABLES, INSTRUMENT, SUBSTANCE_FOOD, CONTAINER, ARTIFACT_FOOD, PROFESSION]
from Nilda Ruimy

Noun Compounds/Complex Nominals
…are pervasive
There is a motivation in most N+N constructions: the context provides it
The FrameNet (SIMPLE) way:
• appeal to specific frame structures (qualia structures) associated with the head noun,
• determine from corpus attestations which frame elements (qualia) can get instantiated as a modifier
For the word “container”, complex nominals can specify:
• material (aluminium c., glass c., …)
• contents (food c., trash c., …)
• size (3 quart c., …)
• function (shipping c., storage c., …)
• ...

Noun Compounds/Complex Nominals & multidimensional semantic approaches
a. FrameNet “Container” Frame Structure: Frame Elements:
Material: aluminum container, glass c., metal c., tin c.
Contents: food container, beverage c., trash c., water c., milk c., fuel c.
Size: 3 quart container
Function: shipping container, storage c.
b. SIMPLE Qualia Relations of "container" as used in compounds:
Constitutive: made_of [MATERIAL] aluminum container, glass c., metal c., tin c.
Telic: contains [ENTITY] food container, beverage c., trash c., water c., milk c., fuel c.
Constitutive: size [QUANTITY] 3 quart container
Telic: is_used_for [EVENT] shipping container, storage c.

Complex Nominals
E.g. knife (coltello) triggers:
• a “cutting frame” (FrameNet)
• specific (SIMPLE) dimensions of meaning
SIMPLE Extended Qualia structure for the interpretation of the semantic relation betw. Ns (internal relational structure of MWE); PP disambig.
• butcher’s knife (coltello da macellaio): TELIC (used_by) Y [Human]; PPda
• plastic knife (coltello di plastica): CONST (made_of) X [Material]; PPdi
• table knife (coltello da tavola): TELIC (used_in) Z [Location]; PPda
• hunting knife (coltello da caccia): TELIC (used_in_activity) E [Activity]; PPda
• piatto di legno (wooden plate): CONST (made_of) X [Material]; PPdi
• piatto di pasta (plate of pasta): CONST (contains) X [Food]; PPdi

SIMPLE: possible extension
Deverbal nominalisation:
o noun murder (uccisione, delitto, omicidio; different sem. pref.): PPdi; PPda_parte_di, di
o verb murder (uccidere): subj:NP, obj:NP
PRED: MURDER (uccidere)
  ARG1: agent [Hum/Anim?]
  ARG2: patient [Hum/Anim?]
  MOD1: instr [Weapon]
  MOD2: means [Action]
  MOD3: ... […]
:instr: PPcon [Weapon] (knife m., con coltello)
:means: PPper [Action] (strangulation m., per strangolamento)
As if it were a Situation:
:loc: Ppploc|di [Location] (Kent State murders, nel ...)
:time: Ppptime|di [Time] (1983 murders, del 1983)

Ontologisation of SIMPLE
Automatically converting and enriching a computational lexicon into a formal Ontology for NLP semantic tasks
Potential of ontologies in NLP as:
• Backbone in LKBs
• Pivot in multilingual architectures (e.g.
KYOTO)
• Reasoning capabilities
Ontologisation of SIMPLE into OWL:
• Conversion of the SIMPLE ontology
• Bottom-up enrichment: promoting lexicon knowledge to the ontology level
• Language independent knowledge from Italian lexico-semantic information
from Antonio Toral

Named Entity Repository
Automatically build LRs from existing LRs and Web 2.0 semi-structured resources. Combine:
• Authoritative lexicographic experience → precision
• Collaborative “wisdom of the crowds” → recall
Case study: Multilingual NE repository from LRs (en WN, es WN, it SIMPLE) & Wikipedia
• NEs linked to three LRs and two ontologies (SUMO, SIMPLE)
• Interoperable resource: LMF compliant
• Applied to cross-lingual QA (validate answers): prec. +16.3%
from Antonio Toral

Use of SIMPLE Lexicon & Ontology for Time and Event detection/annotation
• Different PoS may realise an event: verbs, nouns, adjectives, prep. phrases
• The SIMPLE Lexicon helps in identifying & classifying Events (eventive nouns & adjectives) → in a 10K Words Annotation Experiment:
  • each event is associated with an Ontological Type
  • the Event-Type from the SIMPLE-Ontology can be used as default value to provide event composition, and consequently to instantiate a temporal representation for each Event
  • improvement both in identification & classification of Events by annotators: 81.17% accuracy (vs. 72.35%) and K-coefficient = 0.84 (vs. 0.7)
[Diagram: Morpho-Syntactic Analysis + SIMPLE Lexicon → Event Detection & Classification]
from Tommaso Caselli

Mapping SIMPLE Semantic Types to TimeML Classes
from Tommaso Caselli
GLML: Generative Lexicon Markup Language
with James Pustejovsky, Olga Batiukova, Anna Rumshisky, Marc Verhagen

Annotating texts with Argument Selection, Argument Coercion, & Qualia Roles
  The corpus brings reality to the model, provides statistical cues to improve language models
  Lexical semantic info, like type coercion/selection, required for applications such as WSD, categorisation, IR (query reformulation, filtering…), IE (coreference resolution, relation extraction…), entailment, ...
Predicate-Argument constructions
  Predicate Sense Disambiguation
  Argument selection: type selection/coercion
  Qualia role/relation selection
Modification constructions
  Noun Sense Disambiguation
  Qualia role/relation selection in Adjectival Modification
  Qualia role/relation selection in Nominal Modification
Complex Types
  Type selection in modification of Dot Objects
from Valeria Quochi

Using Existing Resources for Italian

SIMPLE Lexicon & Ontology / ItalWordNet
  Sense Disambiguation
  Type selection/coercion
  Type selection in Dot Objects
SIMPLE Extended Qualia Structure
  Selection of Qualia roles/relations, e.g.
    Constitutive Relations, e.g. Is_a_part_of, Is_a_member_of
    Telic Relations, e.g. Purpose, Object_of_the_activity
    Agentive Relations, e.g. Source, Result_of
from Valeria Quochi

Ontology & Lexicon

Today we can easily say that ontology learning, i.e.
the practical feasibility of supporting knowledge acquisition in a domain, depends on developing automatic methods for acquiring conceptual representations from natural language text
Semantic Web initiatives are also focussing on the building of ontological representations from texts, and in this respect show a large amount of conceptual overlap with the notion of a dynamic lexicon

Lexicon & Corpus
Based on various experiences, and as a work strategy for lexical/textual resources
  We should push towards innovative types of lexicons: a sort of ‘example-based living lexicons’ that share properties of both lexicons and corpora
  In such a lexicon redundancy is not a problem, but rather a benefit

BUT… Mismatch between LRs and LT

Often a gap between advancement in LRs and LT
  Either adequate LRs are missing …
  or there are no systems able to use “knowledge intensive” LRs effectively
Shortcomings:
  lack of usable implementations fully exploiting new types of LRs
  LR claims are not empirically evaluated
A parallel evolution of R&D for both LRs and LT is needed

Phenomena to be represented / What is missing??
from Ed Hovy

1. Bracketing / grouping of predications around entities (basic frame structure)   [done]
2. Concepts:   [done??]
     Choice of meaning/sense, with frames in some cases
     Definition and nature of concept repository / ontology
     Major high-level concept groupings and classes
3. Labels on (dependency) arcs (thematic roles, types of attributes, modifiers, etc.)   [done]
4. Coreference (explicit and indirect):
     intra-sentential
     intersentential and cross-documents
5. Information Structure and Discourse structure:
     theme-rheme and topic-focus
     salience
     coordination
     nonsemantic inter-clausal relations (RST’s interpersonal ones)
     etc.
   [done??]

Phenomena to be represented / What is missing??
6.
Pragmatics:
     Speech Acts
     Participants and audience modeling
     Modality:
       Epistemic modalities
       Deontic modalities
       Personal attitudes
     [done??]
     Deixis / reference to external world (or databases)
     Social register, genre, and style
     Time (Reichenbach)
     Space (OWL upper ontology of space, etc.)
     [done??]
     Cardinality
     Quantification
     Manner
     Degree and comparison
     Possession
     Existentials
     Copular constructions
     Conditionals
     Consequences and inference
     Co-text and intertextuality (including formatting and other media)
     Meaning of prosody and other speech-related effects
7. Polarity (including scoping)
8. Microtheories (many of them to be incorporated elsewhere)
from Ed Hovy
Towards a common encoding policy???

Lexicon and Corpus: a multi-faceted interaction

  LC   tagging
  CL   frequencies (of different linguistic “objects”)
  CL   proper nouns, acronyms, …
  LC   parsing, chunking, …
  CL   training of parsers
  CL   lexicon updating
  CL   “collocational” data (MWE, idioms, gram. patterns ...)
  CL   “nuances” of meanings & semantic clustering
  CL   acquisition of lexical (syntactic/semantic) knowledge
  LC   semantic tagging/word-sense disambiguation (e.g. in Senseval)
  CL   more semantic information on LE
  CL   corpus based computational lexicography
  CL   validation of lexical models
  CL   …
  LC   ...

BUT … Dynamic lexicons

Current computational lexicons (even WordNets) are static objects, still shaped on traditional dictionaries
Towards a flexible model of dynamic lexicon
  extending the expressiveness of a core static lexicon
  adapting to the requirements of language in use as attested in corpora
  with semantic clustering techniques, etc.
Convert the extreme flexibility & multidimensionality of meaning into large-scale and exploitable (VIRTUAL?) resources
  a “Lexicon & Corpus” together
  Sort of Example-based Lexicon
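One way to picture a “Lexicon & Corpus together” is an entry whose senses carry their attested corpus examples, so that the static core can grow as new usages are observed. The sketch below is only an illustration of that idea: the class names `Sense` and `LexicalEntry`, their fields, and the three Italian sentences are invented here and do not reproduce the SIMPLE data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    """A core (static) sense plus the corpus examples attached to it."""
    sense_id: str
    gloss: str
    examples: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    """Hypothetical example-based entry: static senses, growing evidence."""
    lemma: str
    senses: Dict[str, Sense] = field(default_factory=dict)

    def add_sense(self, sense_id: str, gloss: str) -> None:
        self.senses[sense_id] = Sense(sense_id, gloss)

    def attach_example(self, sense_id: str, sentence: str) -> None:
        # Redundancy is welcome: every attested usage is kept as evidence.
        self.senses[sense_id].examples.append(sentence)

    def example_counts(self) -> Dict[str, int]:
        return {sid: len(s.examples) for sid, s in self.senses.items()}


# Invented sentences for the verb comprendere ('to understand' / 'to include').
entry = LexicalEntry("comprendere")
entry.add_sense("understand", "to understand")
entry.add_sense("include", "to include")
entry.attach_example("understand", "Comprendo benissimo il problema.")
entry.attach_example("understand", "Non comprende che è tardi.")
entry.attach_example("include", "Il prezzo comprende la colazione.")
print(entry.example_counts())  # {'understand': 2, 'include': 1}
```

Keeping the examples inside the entry means that finer, collocational distinctions can later be derived from the stored sentences (for instance by clustering) without changing the core sense inventory.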
Verb/Arguments Interaction at the Lexical-Semantic Level

Verb meaning determines/selects the ‘sense’ of its subject and/or direct object
e.g. arrestare, both ‘to arrest’ & ‘to stop’, selects direct objects which have themselves, or receive from the verb, a negative connotation

  Dobj            Gloss                            Sem.type          Conn.Feat.
  ladro1          thief                            agent_temp_act    neg
  spacciatore1    drug dealer                      agent_temp_act    neg
  trafficante1    trafficker                       agent_temp_act    neg
  traffico2       traffic                          act               neg
  invasione1      invasion                         cause_act         neg
  massacro1       massacre                         cause_nat_trans   neg
  inflazione1     inflation                        event             neg
  pregiudicato1   person with a criminal record    human             neg
  balordo1        thug                             human             neg
  maniaco1        maniac                           human             neg
  strozzino1      loan shark                       agent_temp_act    neg

Complexity of Word Sense in context: many potential clues

A particular meaning (of a verb) may be selected by:
  A specific syntactic pattern
    comprendere + that-clause = ‘to understand’ [not = ‘to include’]
    aprire + PP introduced by a (preferably with “human” head) = ‘to be ready, open, well disposed towards someone’ (e.g. Cossiga apre a La Malfa)
  The semantic type of subjects, dir. objects, ind. objects
    human subject (if not collective type) always selects the meaning ‘to understand’ of the verb comprendere
  The domain of use
    perseguire un reato ‘to prosecute a crime’ (domain=law)
  A specific modifier
    perseguire penalmente ‘to prosecute at the penal level’, not ‘to pursue (a goal)’
    comprendere benissimo ‘to understand very well’, not ‘to include’
Two different senses of a lemma cannot be selected simultaneously in the same context
BUT…

Complexity of Word Sense identification

The problem: there are no sure tests; the available ones have only partial validity & are not completely discriminating
Moreover, it’s not easy to predict when to apply which test
Word Sense Disambiguation (WSD) in different contexts is better achieved using info types at different levels of linguistic description:
  morphosyntactic/syntactic/semantic/pragmatic…, even multilingual
BUT it is a priori unpredictable where the “clue” is

Complexity of Word Sense & use of Corpora

The availability of large quantities of semantically tagged corpora helps to
  analyse the impact of different “clues” to perform WSD in different contexts
  study the interaction of clues belonging to different levels of linguistic description, to improve WSD strategies
    not just statistics!!
Automatically acquire syntactic, semantic, collocational (lexical) ‘indicators’ which can help in the identification of a word-sense
  ‘List’ them in the lexicon??

BUT… Problem of regular polysemy

… and more
actual occurrence of “two senses” in the same context… e.g. both act & result (for deverbal nouns, etc.)
  In una comunicazione al Parlamento la Commissione ha illustrato le sue riflessioni su … (‘In a communication to Parliament the Commission set out its reflections on …’)
  Berlusconi dovrà scegliere se fare l’uomo di governo o mantenere il controllo delle sue tv (‘Berlusconi will have to choose between being a man of government and keeping control of his TV channels’)
Underspecified meanings?
  maybe subsuming more granular distinctions, to be used only when disambiguation is feasible/useful in a context
Theoretical language, “invented” by lexicographers/linguists who have/want to classify in disjoint classes, vs. actual usage
  a “continuum” resistant to clear-cut disjunctions
  by necessity ambiguous wrt imposed classifications

In a “Senseval” framework …

… what cannot be easily encoded e.g. at the Lexical-Semantic Level
When sense interpretation requires appeal to extra-linguistic knowledge (not to be captured at the lexical-semantic level of description)
When corpus annotation either diverges from the lexical resource or further specifies it
  words acquiring a specific sense, strictly dependent on the context
    la donna Pauline Collins, che ha già visto arrestare il marito dai tedeschi,… (‘the woman, Pauline Collins, who has already seen her husband arrested by the Germans,…’)
  variety of nuances of a verb, e.g. according to co-occurring dir.obj.
sem-type
  metaphors extended to an entire sentence
    l’auto verde arriva sul tavolo del governo (lit. the green car arrives on the table of the government)
  ...
Not all these “shifts of meanings” can/must be captured through lexical-semantic annotation

Wrt Senseval

jargon, neologisms, evaluative suffixation, ‘titles’, …
  vetturetta ‘small car’
  minitaxi
  fumantino ‘hot-tempered’ (adj., una persona fumantina)
  komeinista ‘Khomeinist’
  …
  Primula rossa (= mafia boss)
  Scarpa d'oro (= a good player)
  …
Not in any lexicon… a semantic type easier to assign than a word-sense in a lexicon

Compounds and idioms

  uscire di scena ‘to leave the scene’
  farla franca ‘to get away with it’
  fare fuoco ‘to open fire’
  andare in onda ‘to go on air’
  …
  fare [in tempo] ‘to make it in time’
  andare [a piedi] ‘to go on foot’
  essere [in testa] (= to be first)
  vincere [per un soffio] ‘to win by a whisker’
  partire [a razzo] ‘to take off like a rocket’
  Croce Rossa ‘Red Cross’
  Caschi Blu ‘Blue Helmets’
  conflitto a fuoco ‘shoot-out’
  atletica leggera ‘track and field’
  famiglia bene ‘well-to-do family’
  un bagno di folla ‘mingling with the crowd’
  …
Where is the boundary of the MWE?
  "andare_a_piedi" vs. andare (Pos V) a_piedi (Pos Adv.loc)?

Locutions and Figurative usages

  per carità ‘for goodness' sake’
  in questione ‘in question’
  per caso ‘by chance’
  in lizza ‘in the running’
  a volontà ‘as much as one likes’
  a buon mercato ‘cheap’
  …
  ci mancherebbe! ‘of course!’
  c'è mancato poco ‘it was a close call’
  …
  due lavoratori su tre sono a casa ‘two workers out of three are at home’ (= to be unemployed) [the collocation with ‘lavoratori’ disambiguates the expression]
  uomo [di polso] ‘strong-willed man’
  zona medaglia d'oro (= among the first)
  a cielo aperto ‘open-air’ (discarica a ..)
  la bella vita ‘the good life’ (fare …)
  …
If individual components are annotated, the semantic contribution of the MWE is lost
  acquistare un oggetto a buon (Pos A) mercato (Pos S) !!

Usual issues: “Is there a fixed set of senses?” or “Do senses exist as separate objects?”

Criteria for sense distinction are very application-dependent
  greater vs. lesser granularity depends on the task/domain/situation/etc., i.e. the communication purpose
  & there is no inherently “true” (upper or lower) limit to the granularity ...
A “checklist theory of meaning” is impossible: meaning as a “piece of information” with an autonomous status independent of its use
Computational resources should provide
  multi-dimensional information
  the highest expressiveness in terms of sense-discriminating power
  contextual information
Are we dealing with semantic annotation in the right way??

Divergences between Lexicon encoding & Corpus annotation

In the lexicon
  senses are “de-contextualized” (a necessity to capture generalizations)
  sense discrimination must be kept “under control”
  clustering (manually or automatically)
In the corpus sense annotation task
  contextualization plays a predominant role
  calls for a range of pragmatic issues
  corpus analysis per se would lead to excessive granularity of sense distinctions
Capture just the core basic distinctions in a core lexicon &
Acquire additional, more granular info (usu. of collocational nature) from corpora
  to be encoded within the broader senses, e.g. to help translation

Between LRs and Linguistics:

A consequence of the corpus-based approach is that it compels us to break hypotheses too easily taken for granted in mainstream linguistics
  In actual usage a characteristic of language is to display many properties which behave as a continuum, not as “yes/no” properties
  The same holds true for so-called “rules”: we find “tendencies” towards a rule more frequently than precise rules
  Many of the theoretical rules appear to be simplifications or idealisations, in fact dispelled by real usage
A number of dichotomies must then be reconciled
Lesson learned: [IN-]Adequacy of Lexical resources
  There is still a long way to go before we can recognise & integrate the many dimensions relevant to content interpretation
A number of “dichotomies” not as opposite views, but as complementary perspectives

Language as a continuum:

  rules                     vs.   tendencies
  absolute constraints      vs.   preferences
  discreteness              vs.   continuum/gradedness
  theoretical/potential     vs.   actual
  intuition/introspection   vs.   empirical evidence
  theory-driven             vs.   data-driven
  symbolic                  vs.   statistical

The right-hand side must be highlighted first; then the two must be combined
Choices on the syntagmatic axis are pervasive
Lexicon & Corpus must converge
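The contrast between “absolute constraints” and “preferences” can be made concrete with a small sense scorer in which each contextual clue contributes a weight rather than acting as a hard rule. This is a sketch under stated assumptions, not a method taken from the slides: the clue names, the weights in `CLUE_WEIGHTS`, and the function `rank_senses` are invented; only the comprendere clues (that-clause, human subject, the modifier benissimo) echo the examples given earlier in the lecture.

```python
from collections import defaultdict

# Hypothetical clue weights for comprendere ('to understand' / 'to include').
# The first three clues echo the lecture's examples; the last two and all
# numeric values are invented for illustration.
CLUE_WEIGHTS = {
    "that_clause": {"understand": 2.0},
    "human_subject": {"understand": 1.5},
    "modifier_benissimo": {"understand": 1.0},
    "inanimate_subject": {"include": 1.5},
    "quantity_object": {"include": 1.0},
}


def rank_senses(clues):
    """Sum the weights of the observed clues and rank senses by total score."""
    scores = defaultdict(float)
    for clue in clues:
        for sense, weight in CLUE_WEIGHTS.get(clue, {}).items():
            scores[sense] += weight
    # Highest score first; ties are broken alphabetically by sense name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two clues pointing in different directions: neither is an absolute
# constraint, so the stronger preference wins.
ranking = rank_senses(["human_subject", "quantity_object"])
print(ranking)  # [('understand', 1.5), ('include', 1.0)]
```

In a data-driven setting the weights would be estimated from a semantically tagged corpus rather than set by hand, which is one way of combining the symbolic and the statistical sides of the table above.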