Linguistic Data Analysis Laboratory
Laurea specialistica in Theoretical and Applied Linguistics, Università di Pavia
Andrea Sansò
Academic year 2005-2006
Advanced course, 10 CFU

Laboratory for the Analysis of Linguistic Resources
Part Four
• Lexicons
• Resources for linguistic typology
• Tools and technologies for creating linguistic resources

Lexicons
A definition:
“A computational lexicon is a very complex – and expensive – component to be built adequately. It must contain, in an explicit and formalised way, all the information which a native speaker uses in everyday situations, from the simpler orthographic, phonetic, morphologic information, to the more complex syntactic, semantic, pragmatic, logical, ontological, multilingual information. A ‘complete’ lexicon should practically incorporate our ‘knowledge of the world’, and represent it in an explicit and formal way”
N. Calzolari, “Computational lexicons and corpora. Complementary components in human language technology”, in P. van Sterkenburg (ed.), Linguistics Today – Facing a Greater Challenge, 89-107. Amsterdam-Philadelphia: J. Benjamins, 2004.

Lexicons and corpora
This knowledge of the world is a changing, continually growing object that cannot be “frozen” in a static lexicon.
“The only way of reflecting and capturing all the potentialities of a language relies on trying to extract the linguistic and lexical information not only from ‘experts’, i.e. native speakers or linguists, but from the texts themselves in which the language is actually used, with a continuous process of enrichment. From these considerations the importance of corpora obviously emerges.”
N. Calzolari, ibidem

Lexicons and corpora
LC  POS tagging / lemmatisation
CL  frequencies of different linguistic objects
CL  proper nouns / named entity recognition
LC  syntactic parsing
CL  updating / tuning a lexicon
CL  collocational data
CL  semantic clustering and ‘nuances’ of meaning
LC  semantic mark-up
CL  lexical knowledge acquisition
LC  word sense disambiguation
CL  validation of lexical models
CL  corpus-based computational lexicography

Lexicons and corpora
Example: Italian chiedere vs. domandare
From a theoretical (introspective) point of view the two verbs are synonyms, and print dictionaries give them the same definition.
But usage differs:
• domandare is almost always used in the interrogative sense (ask to know), whereas chiedere is often used in the imperative sense (ask to have);
• chiedere is far more frequent than domandare.

FrameNet
“FrameNet (FN) is a corpus-based lexicon-building project that documents the links between lexical items and the semantic frame(s) they evoke; it accomplishes this by annotating sets of sentences that exemplify the items being described, and performing various operations on the resulting annotations. The basic units in FN descriptions are the frame and the lexical unit (LU), the latter understood as the pairing of a “word” with just one of its meanings; thus, a word with four meanings is treated as four lexical units. In most cases, for a word to have more than one meaning implies that it belongs to more than one frame.”
Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato, “FrameNet as a ‘Net’”, in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon 2004, pp. 1091-1094.

FrameNet
Main components of the FrameNet database: (1) the frame ontology, (2) the set of annotated sentences, and (3) the set of lexical entries.
• The basis of the ontology is the set of frames, each of which consists of an informal characterization of a situation type (the frame definition), together with a collection of frame elements (FEs). The FEs are the semantic roles of the entities involved in each frame. FE names are used as labels for the words or phrases that are in grammatical construction with the L(exical) U(nit)s that evoke that particular frame. For example, the frame that includes the English verb inform has as its core FEs SPEAKER, ADDRESSEE and MESSAGE.

• The example sentences are selected by FrameNet annotators as representing the typical uses of the LUs belonging to individual frames. Each set of annotations is centered around a particular LU; the sentence’s constituents are labeled (with FE names) according to the ways in which they fill in information about the frame. For example, sentences (1) and (2) have SPEAKER appearing as subject, and ADDRESSEE as object; the MESSAGE FE appears as a that clause in sentence (1), and as an event-naming nominalization introduced by of in sentence (2).
(1) [SPEAKER We] informed [ADDRESSEE the press] [MESSAGE that the prime minister has resigned]
(2) [SPEAKER We] informed [ADDRESSEE the press] [MESSAGE of the prime minister’s resignation]

• The lexical entry for each LU is a summary of what has been recorded in its annotations, presented as valence descriptions, showing all the ways in which its frame elements can be realized, such as the alternative syntactic realizations of the MESSAGE just shown for the verb inform. The collection of annotated sentences is made available in the database as evidence for the analysis.

The first and most obvious way in which LUs are related to each other is through membership in the same frame. Thus inform shares a frame with the verbs notify and announce, and also with the nouns notification and announcement, and the verb resign shares frame membership with its nominal partner resignation, and with verbal expressions like abdicate, step down and stand down. But LUs can also be related to each other in other ways, either because their frames are related to other frames, or through semantic properties (called semantic types in the FN database) assigned to LUs individually rather than through their frames.

FrameNet: semantic types
The FrameNet database allows the assignment of semantic types to LUs, FEs and frames. The perception verbs hear vs. listen are distinguished as passive versus active perception verbs, and so, respectively, are see vs. look. Hearing and seeing are things that happen to you, listening and looking are things that you do, and this difference is considered important enough to merit entry into separate frames. In the FN database, hear and see and the passive perception uses of other sensory words, such as feel, taste and smell, belong to the Perception experience frame; the verbs look and listen belong to the Perception active frame, along with the corresponding active uses of feel, taste and smell.

FrameNet: subframes
Subframes are used for representing subevents; frames that represent complex processes have subframes representing their subparts. To take a simple example, the Motion scenario frame has three subframes, Departing, Motion, and Arriving. In this case, the subframes are temporally ordered, but in general, subframes need not be completely ordered with respect to each other. For example, the Commercial transaction frame has two subframes, Commerce goods-transfer and Commerce money-transfer, but these are not ordered with respect to each other. In some commercial transactions, you pay in advance, in others, only after receiving the goods or services.
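The FrameNet notions described above (frames with core FEs, LUs tied to one frame each, annotated sentences, valence summaries, subframes) can be pictured with a small data model. The following Python sketch is illustrative only: the class names, the function names, the realization labels ("subject", "that-clause", "of-phrase") and the frame names "inform-frame" and "resign-frame" are assumptions made for this example, not the schema or the API of the FrameNet database. Only the LUs, FEs and subframes mentioned in the slides are used as data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Frame:
    name: str
    core_fes: Tuple[str, ...] = ()
    subframes: List["Frame"] = field(default_factory=list)
    subframes_ordered: bool = False  # True when subframes are temporally ordered


@dataclass(frozen=True)
class LexicalUnit:
    lemma: str
    pos: str
    frame_name: str  # an LU pairs a word with ONE meaning, hence one frame


@dataclass
class Annotation:
    lu: LexicalUnit
    # Each span: (FE label, text, syntactic realization); labels are hypothetical
    spans: List[Tuple[str, str, str]]


def valence(annotations: List[Annotation]) -> Dict[str, Set[str]]:
    """Summarize, per FE, the realizations attested in the annotations."""
    summary: Dict[str, Set[str]] = {}
    for ann in annotations:
        for fe, _text, realization in ann.spans:
            summary.setdefault(fe, set()).add(realization)
    return summary


def frame_mates(lu: LexicalUnit, lexicon: List[LexicalUnit]) -> List[str]:
    """Lemmas of the other LUs that evoke the same frame as `lu`."""
    return sorted(o.lemma for o in lexicon
                  if o.frame_name == lu.frame_name and o != lu)


# "inform-frame" and "resign-frame" are placeholder names for this sketch.
informing = Frame("inform-frame", core_fes=("SPEAKER", "ADDRESSEE", "MESSAGE"))
inform = LexicalUnit("inform", "verb", informing.name)
lexicon = [
    inform,
    LexicalUnit("notify", "verb", informing.name),
    LexicalUnit("announce", "verb", informing.name),
    LexicalUnit("notification", "noun", informing.name),
    LexicalUnit("announcement", "noun", informing.name),
    LexicalUnit("resign", "verb", "resign-frame"),
    LexicalUnit("resignation", "noun", "resign-frame"),
]

# Sentences (1) and (2) from the slides, centered on the LU inform.
annotations = [
    Annotation(inform, [
        ("SPEAKER", "We", "subject"),
        ("ADDRESSEE", "the press", "object"),
        ("MESSAGE", "that the prime minister has resigned", "that-clause"),
    ]),
    Annotation(inform, [
        ("SPEAKER", "We", "subject"),
        ("ADDRESSEE", "the press", "object"),
        ("MESSAGE", "of the prime minister's resignation", "of-phrase"),
    ]),
]

motion = Frame(
    "Motion scenario",
    subframes=[Frame("Departing"), Frame("Motion"), Frame("Arriving")],
    subframes_ordered=True,
)
commerce = Frame(
    "Commercial transaction",
    subframes=[Frame("Commerce goods-transfer"), Frame("Commerce money-transfer")],
    subframes_ordered=False,  # payment may precede or follow the goods
)

patterns = valence(annotations)
mates = frame_mates(inform, lexicon)
print(patterns["MESSAGE"])
print(mates)
```

In this sketch the valence summary is derived from the annotations rather than stored separately, which mirrors the description of a lexical entry as a summary of what the annotations record.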
FrameNet in action…
http://framenet.icsi.berkeley.edu/index.php
All the documentation is collected in a manual:
http://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&Itemid=126

FrameNet in other languages
Salsa Project – FrameNet in German: http://www.coli.uni-saarland.de/projects/salsa/
Spanish FrameNet: http://gemini.uab.es/SFN/index.html

WordNet
A lexical reference system available online: http://wordnet.princeton.edu
Word meanings are represented by sets of synonyms (synsets). Relations such as meronymy, hypernymy, antonymy, etc. are also represented.
Up-to-date bibliography: http://mira.csci.unt.edu/~wordnet/

Other multilingual lexicons
Mimida: http://www.gittens.nl/SemanticNetworks.html
A multilingual lexicon based on WordNet and on dictionaries freely available on the web.
MultiWordNet: http://multiwordnet.itc.it/english/home.php
A multilingual lexicon (Italian, Spanish, Hebrew, Romanian) whose synsets are aligned, wherever possible, with the synsets of the Princeton WordNet. Developed at IRST-ITC in Povo (TN).
EuroWordNet: http://www.illc.uva.nl/EuroWordNet
A similar project for the European languages; a demo can be downloaded.
The individual WordNets are linked to an Interlingual Index, which is based on the American WordNet and makes it possible to move from a word in one language to an equivalent word in another.
This index also gives access to a shared ontology of 63 semantic distinctions, which provides a common semantic base for the various languages.

Other initiatives
EAGLES project (Expert Advisory Group for Language Engineering Standards): http://www.ilc.cnr.it/EAGLES96/home.html
• development of standards in morphosyntax, syntax and semantics
• awareness of the interdependence between lexical specifications and corpus tagsets / syntactic annotations
• the standards developed were used in building the resources (both corpora and lexicons) created within the European projects PAROLE and SIMPLE

Other initiatives
ISLE project (International Standards for Language Engineering), Computational Lexicon Working Group: http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm
• a continuation of the EAGLES project
• development of a general schema for encoding multilingual lexical information (MILE, Multilingual ISLE Lexical Entry)
• commitment to reaching consensus on de facto standards through a bottom-up procedure
• commitment to maximising interaction and synergies with those working on the semantic web

Other initiatives
PAROLE project: http://www.ub.es/gilcub/SIMPLE/simple.html
• goal: to produce in Europe an initial core of harmonised corpora and lexicons (Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
• Encoded information:
  • Morphology: written forms, including stems and variants; morphosyntactic category; inflected forms; morphological features; derivation; abridged forms
  • Syntax: subcategorization patterns; grammatical relations of subcategorised complements; control; diathesis and lexical alternations; pronominalization; linear order constraints; constraints on the syntactic context where the lexical entry is inserted; idioms and collocations

Other initiatives
SIMPLE project: http://www.ub.es/gilcub/SIMPLE/simple.html
• adds a semantic layer to PAROLE
• “The first attempt to tackle harmonised encoding of semantic types and semantic (subcategorisation) frames on a large scale, i.e. for so many languages and with wide coverage”
• Semantic information: semantic type; domain information; lexicographic gloss; argument structure for predicative semantic units; event type, to characterise the aspectual properties of verbal predicates; links of the arguments to the syntactic subcategorization frames; ‘qualia’ structure, represented by a very large and granular set of semantic relations and features; regular polysemous alternations (e.g. container for content); hyponymy, synonymy, etc.

Two types of typological databases
Databases that collect and document primary language data, e.g.
• Agreement database
• Autotyp
• Reflexives and intensifiers database
• Stresstyp
• ...
Databases documenting secondary language data, e.g.
• Noun Phrase Universals Database (Edinburgh)
• The Universals Archive (Konstanz)
• Das grammatikalische Raritätenkabinett (Konstanz)
http://ling.uni-konstanz.de/pages/proj/sprachbau.htm

Typological databases
http://www.lotschool.nl/Research/ltrc/databases/index.htm contains a list of the typological databases developed within the LTRC project (Utrecht).
Particularly user-friendly:
• Typological Database of Intensifiers and Reflexives (TDIR): http://noam.philologie.fu-berlin.de/~gast/tdir/index.htm
• Reduplication database: http://ling.uni-graz.at/redup/
• The SMG databases: http://www.smg.surrey.ac.uk/

Typological databases
World Atlas of Language Structures
The World Atlas of Language Structures consists of 142 maps with accompanying texts on diverse features (such as vowel inventory size, noun-genitive order, passive constructions, and 'hand'/'arm' polysemy), each of which is the responsibility of a single author (or team of authors).
Each map shows between 120 (35) and 1110 languages, each language being represented by a dot, and different dot colors showing different values of the feature. Altogether 2,650 languages are shown on the maps, and more than 58,000 dots give information on features in particular languages.
Tools for typological research: http://lingweb.eva.mpg.de/fieldtools/tools.htm

Tools and technologies for creating resources
Specialised tools
Fieldwork:
• Shoebox: http://www.sil.org/computing/shoebox
• Fieldworks Data Notebook: http://fieldworks.sil.org (open source)
Speech analysis:
• Praat: http://fonsg3.hum.uva.nl/praat (free)
• SpeechAnalyzer: http://www.sil.org/computing/speechtools/speechanalyzer.htm (version 2.1 is not free; version 1.5 is free)
Annotation tools:
• CLAN: http://childes.psy.cmu.edu (free)
Other tools can be found in the links section of the LARL page (categories: concordancing tools and other linguistic resources).

Tools and technologies for creating resources
Specialised tools
Morphological taggers:
• Morph-it: a morphological tagger for Italian; an online demo is available at http://sslmitdev-online.sslmit.unibo.it/linguistics/morph-it.php
POS taggers:
• CLAWS: www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
• TREE tagger: www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

Tools and technologies for creating resources
Specialised tools
Text encoding:
• DBT (DataBase Testuale): software for textual analysis and full-text querying developed by E. Picchi (ILC, CNR, Pisa): http://www.ilc.cnr.it/pisystem/demo/index.html
The LARL holds a corpus of L2 Italian and the LIP corpus (Lessico di frequenza dell’italiano parlato), both of which can be queried through DBT.