Laboratory for the Analysis of Linguistic Resources
Part three
Linguistic resources: some key words

A bit of history

“The term linguistic resource refers to (usually large) sets of language data and descriptions in machine readable form, to be used in building, improving, or evaluating natural language and speech algorithms or systems. Examples of linguistic resources are written and spoken corpora, lexical databases, grammars, and terminologies, although the term may be extended to include basic software tools for the preparation, collection, management, or use of other resources.”
(A. Zampolli, J.J. Godfrey, 1997)

“An increasing awareness of the potential economic and social impact of natural language and speech systems has attracted attention, and some support, from national and international funding authorities. Their interest, naturally, is in technology and systems that work, that make economic sense, and that deal with real language uses (whether scientifically interesting or not).”
(A. Zampolli, J.J. Godfrey, 1997)

The definition is interesting because within a few years some things have changed:
• First, strategic interest in linguistic resources has grown at an exceptional rate: compare the number of papers presented at the first Language Resources and Evaluation Conference (about 200; Granada 1998) with the number presented at the fourth LREC (more than 550; Lisbon 2004).
• Second, the field has moved to a less “ancillary” view of linguistic resources: the economic interest, prevalent in the Godfrey/Zampolli definition, does not exclude the scientific and cultural interest of creating linguistic resources as a scientific enterprise valid per se (otherwise the enormous volume of resources for endangered or minority languages would be hard to explain). In essence, a view has emerged that is more concerned with the correct and copious documentation of linguistic phenomena, even apart from any immediate economic usefulness.

Linguistics (computational, but not only) began to show real interest in linguistic resources as a strategic field only from the 1990s onwards:
“Only about a decade ago, around the ‘80s, it was considered by many colleagues almost a ‘shame’ to have to deal with data, such a trivial matter! Only methods and algorithms were considered by many scientifically valuable. The problem was that these rule-based methods were often valid for the examples at stake, but not effective for real situations. This was particularly true in the written or textual area, while in the spoken area statistical methods, and therefore data, were recognised as valuable, or even necessary, well in advance.”
(N. Calzolari, Lisbon 2004)

1998: Antonio Zampolli launches the idea of an international conference devoted to linguistic resources (LREC, Language Resources and Evaluation Conference)
Best-represented areas at the first edition of LREC: morphology, tagging, treebanks
Best-represented areas at the fourth edition: summarisation, question answering, speech-to-speech translation, cross-lingual information retrieval, information extraction, document classification, automatic indexing of broadcast news, topic detection, semantic web and ontologies…

Some key words

Key word 1: evaluation/validation
Key word 2: reusability
Key word 3: extendibility
Key word 4: portability (or inter-operability)
Key word 5: dissemination
Bird, S. & Simons, G. (2003). Seven Dimensions of Portability for Language Documentation and Description. Language 79:3 (pp. 557-582).

Key word 1: evaluation
From the website of ELRA (European Language Resources Association):
“Evaluation is important to the Language Engineering industry on many levels. It enables research and development teams to validate research hypotheses and assess progress and system development. It also identifies promising research directions or technology with a view to bringing it to market. Evaluation also enables funding agencies to determine whether their investment has led to significant progress. Finally, a side effect of evaluation campaigns are the production of high quality training/test data, evaluation software, methodologies, metrics and protocols, all of which may be made available in the form of ‘evaluation packages’ and distributed in the same way as traditional language resources. Such evaluation packages would enable all research teams in a field to compare and benchmark their systems”.

Key word 1: evaluation
From the website of ELRA (European Language Resources Association):
“ELRA/ELDA’s aim in the context of evaluation is to set up a European Evaluation Infrastructure for NLP technologies. This infrastructure is to be largely inspired by the EC funded ELSE project. In so doing, ELRA/ELDA aims to become the European clearing house for evaluation resources, in the same way as it is for language resources. Using its experience in the commissioning, production, validation, packaging and legal distribution of language resources, ELRA/ELDA is well placed to carry out this activity. ELRA/ELDA also aims to provide evaluation services to third parties (consumer organisations, industries with specific needs, funding agencies etc.) wishing to evaluate and benchmark their system/product, by capitalising on its experience and expertise built up from the evaluation projects in which it is involved.”

Key word 2: reusability
The reusability of linguistic resources depends closely on the creation of standards. If a linguistic resource is created and annotated with idiosyncratic procedures, no one apart from the group that created it will be able to use it profitably.
What is meant by a reusable resource?
“By reusable we mean that the language resource must outlast the project where the resource was created and be usable as it is for different purposes by different users in different environments.”
(Grönroos and Miettinen 2004)

Key word 2: reusability
The concept of reusability also has a technical side, tied to the lifespan of resources:
“Today’s linguists can access printed and hand-written documentation that is hundreds of years old. However, much digital language documentation and description becomes inaccessible within a decade of its creation … Funded documentation projects are usually tied to software versions, file formats, and system configurations having a lifespan of three to five years. The issue is acute for endangered languages. In the very generation when the rate of language death is at its peak, we have chosen to use moribund technologies, and to create endangered data.”
(Bird & Simons 2003: 557)

Key word 2: reusability (and key word 4: portability)
Reusability also means the capacity of a resource to be as cross-platform as possible and to be used by different scholarly communities for different purposes.
Bird & Simons (2003: 558) use a macro-category that sums up all aspects of reusability, namely portability, a term normally applied to software (cf. PDF, the acronym of Portable Document Format) whose use is extended to data:
“…Portability is usually viewed as an issue for software, but here our focus is on data …”

Key word 2: reusability
A working committee of the International Organization for Standardization (ISO) deals with the creation of standards for linguistic resources:
ISO TC 37 SC4: http://www.tc37sc4.org/
It is developing a Data Category Registry (DCR) and a Linguistic Annotation Framework (LAF) [cf. Nancy Ide & Laurent Romary, “A registry of standard data categories for linguistic annotation”, in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon 2004, 135-138]
“…data categories include both attributes … such as SYNTACTIC CATEGORY and GRAMMATICAL GENDER, as well as a set of associated atomic values taken by such attributes, such as NOUN and FEMININE … In principle, the DCR provides a set of reference concepts, while the annotator provides a Data Category Specification (DCS) that comprises a mapping between his or her scheme-specific instantiations and the concepts in the DCR” (Ide & Romary 2004: 135)

Key word 2: reusability
ISO TC 37 SC4: http://www.tc37sc4.org/
“Some of the data categories already defined in ISO 12620, for example, include general-purpose management data categories (e.g., SOURCE, RESPONSIBILITY, DATE, etc.) as well as linguistic categories (e.g., PART OF SPEECH), which can provide a base for extension. In addition, it should certainly be possible to utilize results from previous or existing projects such as EAGLES/ISLE to provide a base set of categories for consideration. We intend to proceed cautiously, implementing categories that are widely used and relatively low-level, to ensure acceptance by the community.
By building up slowly, the DCR should eventually contain a wide range of data categories, with their complete history, [and] data category description” (Ide & Romary 2004: 136-137)

Key word 2: reusability
ISO TC 37 SC4: http://www.tc37sc4.org/
A further goal: “articulation of a detailed technical proposal for an XML format able to represent a feature structure analysis with a precise description of the underlying formal mechanism to ensure the coherence and soundness of the standard in line with major theoretical works in this domain” (Kiyong Lee, Lou Burnard, et al., “Towards an international standard on feature structure representation”, in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon 2004, 373-376)

Key word 2: reusability
ISO TC 37 SC4: http://www.tc37sc4.org/
While awaiting the results of this working group, reusability must be pursued in any case, by comparing one's work with other linguistic resources as far as possible and by choosing, where possible, the so-called best practices (that is, the procedures and methodological choices that characterise the main existing linguistic resources, enjoy broad consensus, and increase the likelihood that a resource will survive in the long term).

Key word 3: extendibility
• to other tasks and applications
• to other platforms
• of modules

Key word 3: extendibility
An exemplary case: the guidelines of the TEI consortium
Anyone who wishes can add compatible modules to encode particular types of linguistic and extralinguistic information, with the sole restriction that the encoding schemes should, where possible, be discussed within a community of interested people and be compatible with the higher-level specifications.
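
The mapping described by Ide & Romary (a scheme-specific tag set linked to reference concepts in a registry) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the registry excerpt, the tag set, and the function names are hypothetical, and the XML element names (fs, f, symbol) follow TEI-style feature-structure tagging rather than the format proposed by the ISO working group.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a registry: attribute -> permitted atomic values.
DCR = {
    "SYNTACTIC CATEGORY": {"NOUN", "VERB"},
    "GRAMMATICAL GENDER": {"FEMININE", "MASCULINE"},
}

# Hypothetical Data Category Specification for one annotation scheme:
# scheme-specific tag -> (registry attribute, registry value).
DCS = {
    "N": ("SYNTACTIC CATEGORY", "NOUN"),
    "V": ("SYNTACTIC CATEGORY", "VERB"),
    "f": ("GRAMMATICAL GENDER", "FEMININE"),
    "m": ("GRAMMATICAL GENDER", "MASCULINE"),
}


def to_reference_concepts(tags):
    """Translate scheme-specific tags into registry attribute-value pairs."""
    features = {}
    for tag in tags:
        attribute, value = DCS[tag]  # KeyError for a tag the DCS does not map
        if value not in DCR.get(attribute, set()):
            raise ValueError(f"{attribute}={value} is not in the registry")
        features[attribute] = value
    return features


def to_feature_structure(features):
    """Serialise attribute-value pairs as an XML feature structure."""
    fs = ET.Element("fs")
    for attribute, value in features.items():
        f = ET.SubElement(fs, "f", name=attribute)
        ET.SubElement(f, "symbol", value=value)
    return ET.tostring(fs, encoding="unicode")


# A feminine noun annotated with the scheme-specific tags "N" and "f".
features = to_reference_concepts(["N", "f"])
xml_text = to_feature_structure(features)
print(xml_text)
```

The point of the design is that a second project with different tags only needs its own DCS table: the reference concepts, and therefore the comparability of the two resources, stay the same.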

Key word 4: portability
Bird & Simons 2003: seven dimensions of the problem:
4.1 CONTENT
4.2 FORMAT
4.3 DISCOVERY
4.4 ACCESS
4.5 CITATION
4.6 PRESERVATION
4.7 RIGHTS

Key word 4: portability
4.1 CONTENT
4.1.1. Coverage: if the coverage of a linguistic resource is not carefully balanced, our ability to interpret linguistic facts on the basis of that resource may be compromised (e.g. unattested meanings, collocations, and constructions)
4.1.2. Terminology: “Language documentation and description of all types depend critically on technical notation and vocabulary, and ambiguous or unknown terms compromise portability” (B. & S. 2003: 563): a problem above all for typological research!

Key word 4: portability – Some recommendations (and best practices)
4.1.1. Coverage: aim for a “record that is sufficiently broad in scope, rich in detail, and authentic in portrayal that future generations will be able to study and experience the language, even if no speakers remain”
4.1.2. Terminology: devote specific effort to the comparability of analogous linguistic resources (“map the terminology and abbreviations used in description to a common ontology of linguistic items”)

Key word 4: portability
4.2. FORMAT
“By format we mean the manner in which the information is represented electronically. The area of format involves four key concepts: the openness of the format, the encoding of characters within textual information, the markup of structure in the information, and the rendering of information in human-readable displays” (B. & S. 2003: 563)
An approach that does not depend on proprietary solutions is to be preferred!
“It is a basic requirement of language resources that they should be presented to human readers in conventionally formatted displays” (ibidem, 565): an aspect too often neglected by the creators of linguistic resources!
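
The four concepts of format named above (open format, explicit character encoding, descriptive markup, human-readable rendering) can be seen together in a minimal Python sketch. The lexical entry, its element names, and the render_html function are hypothetical and chosen only for illustration; they do not come from Bird & Simons or from any published schema.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical lexical entry in descriptive markup: the tags say what each
# piece of information is, not how it should look. The element names are
# invented for this example.
ENTRY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<entry>
  <headword>città</headword>
  <pos>noun</pos>
  <gloss lang="en">city</gloss>
</entry>"""

# The character encoding is stated in the XML declaration and applied
# explicitly when the text is turned into bytes.
ENTRY_BYTES = ENTRY_XML.encode("utf-8")


def render_html(entry):
    """Derive a presentational (HTML) view from the descriptive source."""
    headword = html.escape(entry.findtext("headword"))
    pos = html.escape(entry.findtext("pos"))
    gloss = html.escape(entry.findtext("gloss"))
    return f"<p><b>{headword}</b> <i>{pos}</i> {gloss}</p>"


entry = ET.fromstring(ENTRY_BYTES)
page = render_html(entry)
print(page)
```

The descriptive XML remains the archival copy; the HTML is a disposable view that can be regenerated, or replaced by another layout, without touching the data.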

Key word 4: portability – Some solutions (and best practices)
4.2. FORMAT
• “The best practice is one that puts data into a format that is not proprietary”
• “The best practice is one that fully documents what the character codes in the resource document”
• “The best practice is one that represents all of the information using a transparent descriptive mark-up”
• “The best practice is one that supplements the information resource with all the auxiliary software resources that are needed to render it for display”
• “Prefer descriptive mark-up over presentational mark-up”
• “Prefer XML (with an accompanying DTD or schema) over other schemes of descriptive mark-up”
• “Provide one or more human-readable versions of the material, using presentational markup (e.g. HTML) or other convenient formats”

Key word 4: portability (and key word 5: dissemination)
4.3. DISCOVERY
“A given resource, even if it is of the highest quality, is of little practical value if the people who could benefit from it do not know that it exists” (B. & S. 2003: 565)
In many cases a resource becomes known only by word of mouth, which says a lot about how much work remains to be done on this front!

Key word 4: portability (and key word 5: dissemination) – Some solutions (and best practices)
4.3. DISCOVERY
• “The best practice is one that makes it easy for anyone to discover that a resource exists”
• “The best practice is one that makes it easy for anyone to judge the relevance of a resource based on its description”
• “Any resource presented in HTML on the web should contain metadata with keywords and description for use by conventional search engines”

Key word 4: portability (and key word 5: dissemination)
4.4. ACCESS
Here we are dealing with the complexities of the human soul!
“Commonly, researchers want to be recognized for the labor that went into creating primary language documentation, but do not want to make the materials available to others until they have derived maximum personal benefit” (ibidem, p. 566).

Key word 4: portability (and key word 5: dissemination) – Some solutions (and best practices)
4.4. ACCESS
“The best practice is one that makes it easy for users to obtain a complete copy of the resource”
Or: “The best practice is one in which there is a clearly documented procedure by which users may obtain a copy of the resource”

Key word 4: portability
4.5. CITATION
Citing linguistic resources in scientific publications is a particular case of the more general problem of citing electronic documents:
• persistence of URLs
• lack of guidance from the authors of the resources
• some solutions: state the access date for resources that change often; archive the relevant data (strings, lexical entries, etc.) on one's own computer so that they remain retrievable

Key word 4: portability – Some solutions (and best practices)
4.5. CITATION
• “The best practice is one that makes it easy for electronic language documentation and description to be cited”
• “The best practice is one that makes it possible for users to cite particular versions that never change”

Key word 4: portability
4.6. PRESERVATION
Problems tied to the longevity and durability of electronic products and of data in binary format.
These problems are partly overcome by the use of non-proprietary formats and by continuous maintenance of the resources (after all, some resources created in the 1960s are still alive and usable).

Key word 4: portability – Some solutions (and best practices)
4.6. PRESERVATION
“The best practice is one that stores resources in formats that are likely to remain usable for generations to come”

Key word 4: portability
4.7. RIGHTS
Problems of copyright, protection of sensitive data, usage licences, etc.

Key word 4: portability – Some solutions (and best practices)
4.7. RIGHTS
“The best practice is one that clearly states the terms of use as part of the resource package”

Key word 5: dissemination
• A political and economic problem…
“For every natural language, computer-readable basic resources … are increasingly needed… Especially in countries with a strong and modern economy, enormous efforts have already been invested in developing such resources, but often without common purpose and synergy. Property rights tend to be jealously guarded by industrial and academical developers alike. Enormous amounts of monetary support are wasted on projects that perforce must start by reproducing the work of others, since they can’t use the previous results, and whose results in their turn either remain hidden or just evaporate. We appear not to be standing on the shoulders of our predecessors but rather on their toes…” (Cornelis H. A. Koster & Stefan Gradmann, “The language belongs to the People!”, in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, 2004, 353-356)

Key word 5: dissemination
• The role of supranational institutions
E.g. ELRA (European Language Resources Association)
“ELRA has been, since its foundation in 1995, a conduit for the distribution of speech, written and terminology databases, enabling key players to have access to Language Resources (LRs) for technology development and technology evaluation. ELRA's initial mission was to establish itself as a self-supported, centralized Not-for-profit organization for the collection, distribution, and validation of speech, text, terminology resources and tools” (Khalid Choukri, “Recent Activities within the European Language Resources Association: issues on sharing Language Resources and Evaluation”, in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon 2004, 933-936)

Key word 5: dissemination
• The role of supranational institutions
E.g. ELRA (European Language Resources Association)
“In order to play its role, ELRA created a structured and publicly available catalogue of Language Resources. A set of description forms was prepared, aiming to help the providers describe what they propose to ELRA for distribution in a more uniform and consistent way and the users have a quick access to the main features” (Choukri, 2004, 935)
www.elda.fr

Key word 5: dissemination (and key word 1: validation/evaluation)
• The role of supranational institutions
E.g. ELRA (European Language Resources Association)
Validation Manual for Lexica
http://www.elra.info/services/valcom.php
“Validation of a lexicon’s documentation is the act of checking that certain very basic information is present in the documentation. This involves a human reading the documentation and checking it against the criteria. … By lexicon documentation we mean the explanatory files that accompany the lexicon files themselves. These are files such as general and specific documentation, ‘read me’ files, operating instructions etc.”

Key word 5: dissemination (and key word 1: validation/evaluation)
From: http://www.elra.info/services/valcom.php
“Firstly, the documentation should be written in English (also for lexical resources for other languages than English), and it should clearly present core administrative information: contact data for the resource (e.g. name, address, e-mail, URL), the number and types of physical media involved (e.g. CDs), the precise contents of each piece of physical medium, and copyright statements … if relevant.”

Key word 5: dissemination (and key word 1: validation/evaluation)
From: http://www.elra.info/services/valcom.php
“Secondly, the documentation should describe the formal properties of the lexicon. These are constituted by the basic technical information needed in order to access and use the data: character set(s) used, data format (e.g. mark-up language), system(s) needed to view and/or access the data, and the number, names and organisation of files belonging to the lexicon, plus the procedure for accessing them.
Thirdly, the documentation should contain the content information necessary to serve as a specification of the linguistic content. This covers the items lexicon size, lexicon coverage, intended application(s), natural language(s), data structure of an entry, entry types, attributes and their values, POS assignment and other relevant linguistic specifications.”

Key word 5: dissemination
• The role of supranational institutions
E.g. the ENABLER Consortium (European National Activities for Basic Language Resources)
“The ENABLER Consortium conducted the Survey of LRs to get a global picture of the situation on LRs, in order to be able to compare the various conditions that hold across different languages and – on this basis – to suggest more sound recommendations. The Survey provides an overview of the results of National Projects and activities on LRs of different types (written, spoken, multimodal, lexical resources and related tools).”
http://www.ilsp.gr/enabler/

Key word 5: dissemination
• The role of supranational institutions
The Open Language Archives Community (http://www.openarchives.org)
“OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.”

Key word 5: dissemination
• The role of supranational institutions
“The OLAC gateway at the LINGUIST List site (http://linguistlist.org/olac) permits users to search the contents of all archives from a single location. Anyone in the wider linguistics community can participate, not only by using the search facilities, but also by documenting their own resources, or by helping create and evaluate new best practice recommendations.”

Key word 5: dissemination
• The OLAC Metadata standard
http://www.language-archives.org/OLAC/metadata.html
An XML format for recording all the “meta-linguistic” information about one's own linguistic resource, so that the resource is easier to find. It is similar in conception to the library cards used by the US Library of Congress.
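
To give an idea of what such a metadata record looks like and why it helps discovery, here is a minimal Python sketch. It is not the OLAC schema: it uses only elements of the Dublin Core element set (on which OLAC metadata is based, to the best of our knowledge), and the record wrapper element, the sample values, and the function names are hypothetical. The specification at the URL above should be consulted for the actual format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def build_record(fields):
    """Build a catalogue-card-like record from (element, value) pairs."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


def matches(record, keyword):
    """Naive keyword search over all metadata values of a record."""
    keyword = keyword.lower()
    return any(keyword in (element.text or "").lower() for element in record)


# Invented sample values, for illustration only.
record = build_record([
    ("title", "Sample lexicon of Italian"),
    ("creator", "Example Lab"),
    ("language", "it"),
    ("subject", "lexicon"),
    ("format", "text/xml"),
    ("rights", "Terms of use stated in the resource package"),
])

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
print(matches(record, "lexicon"))
```

Because every archive describes its holdings with the same small set of elements, a gateway can search all of them with one query, which is what the word-of-mouth model of discovery lacks.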