Linguistic Resources Analysis Lab
Part Three
Linguistic resources:
some keywords
A bit of history
The term linguistic resource refers to (usually large) sets of
language data and descriptions in machine readable form, to
be used in building, improving, or evaluating natural
language and speech algorithms or systems. Examples of
linguistic resources are written and spoken corpora, lexical
databases, grammars, and terminologies, although the term
may be extended to include basic software tools for the
preparation, collection, management, or use of other
resources
A. Zampolli, J.J. Godfrey, 1997
An increasing awareness of the potential economic and
social impact of natural language and speech systems has
attracted attention, and some support, from national and
international funding authorities. Their interest, naturally, is
in technology and systems that work, that make economic
sense, and that deal with real language uses (whether
scientifically interesting or not).
A. Zampolli, J.J. Godfrey, 1997
The definition above is interesting because in just a few years
several things have changed:
• first, strategic interest in linguistic resources has
grown at an exceptional rate: compare the roughly 200
papers presented at the first Language Resources and
Evaluation Conference (Granada, 1998) with the more than
550 presented at the fourth LREC (Lisbon, 2004)
• second, linguistic resources are now viewed as less
“ancillary”: the economic interest that dominates the
Godfrey/Zampolli definition does not rule out the
scientific and cultural interest of creating linguistic
resources as a scientific enterprise valid in its own right
(otherwise the enormous volume of resources for
endangered or minority languages would be hard to explain);
in short, a view has emerged that is tied more closely to
the accurate and abundant documentation of linguistic
phenomena, even apart from any immediate economic usefulness.
Linguistics (computational and otherwise) began to show real interest in
linguistic resources as a strategic area only in the 1990s:
Only about a decade ago, around the ‘80s, it was considered by many
colleagues almost a ‘shame’ to have to deal with data, such a trivial
matter! Only methods and algorithms were considered by many
scientifically valuable. The problem was that these rule-based methods
were often valid for the examples at stake, but not effective for real
situations. This was particularly true in the written or textual area, while in
the spoken area statistical methods, and therefore data, were recognised as
valuable, or even necessary, well in advance.
N. Calzolari, Lisbon 2004
1998: Antonio Zampolli launches the idea of an international
conference dedicated to linguistic resources (LREC, Language
Resources and Evaluation Conference)
Best-represented areas at the first LREC:
morphology, tagging, treebanks
Best-represented areas at the fourth LREC: summarisation,
question answering, speech-to-speech translation, cross-lingual
information retrieval, information extraction, document
classification, automatic indexing of broadcast news, topic
detection, semantic web and ontologies…
Some keywords
Keyword 1: evaluation/validation
Keyword 2: reusability
Keyword 3: extendibility
Keyword 4: portability (or inter-operability)
Keyword 5: dissemination
Bird, S. & Simons, G. (2003). Seven Dimensions of Portability
for Language Documentation and Description. Language 79(3),
557-582.
Keyword 1: evaluation
From the ELRA (European Language Resources Association) website:
“Evaluation is important to the Language Engineering industry on many levels.
It enables research and development teams to validate research hypotheses and
assess progress and system development. It also identifies promising research
directions or technology with a view to bringing it to market. Evaluation also
enables funding agencies to determine whether their investment has led to
significant progress. Finally, a side effect of evaluation campaigns are the
production of high quality training/test data, evaluation software,
methodologies, metrics and protocols, all of which may be made available in
the form of ‘evaluation packages’ and distributed in the same way as traditional
language resources. Such evaluation packages would enable all research teams
in a field to compare and benchmark their systems”.
Keyword 1: evaluation
From the ELRA (European Language Resources Association) website:
ELRA/ELDA’s aim in the context of evaluation is to set up a European Evaluation
Infrastructure for NLP technologies. This infrastructure is to be largely inspired
by the EC funded ELSE project. In so doing, ELRA/ELDA aims to become the
European clearing house for evaluation resources, in the same way as it is for
language resources. Using its experience in the commissioning, production,
validation, packaging and legal distribution of language resources, ELRA/ELDA is
well placed to carry out this activity.
ELRA/ELDA also aims to provide evaluation services to third parties (consumer
organisations, industries with specific needs, funding agencies etc.) wishing to
evaluate and benchmark their system/product, by capitalising on its experience and
expertise built up from the evaluation projects in which it is involved.
Keyword 2: reusability
The reusability of linguistic resources depends closely on the
creation of standards. If a linguistic resource is created and
annotated with idiosyncratic procedures, no one apart from the group
that created it will be able to use it profitably.
What is meant by a reusable resource?
By reusable we mean that the language resource must outlast the project
where the resource was created and be usable as it is for different purposes
by different users in different environments.
Grönroos and Miettinen 2004
Keyword 2: reusability
Reusability also has a technical side, tied to the lifespan
of resources:
Today’s linguists can access printed and hand-written documentation that
is hundreds of years old. However, much digital language documentation
and description becomes inaccessible within a decade of its creation …
Funded documentation projects are usually tied to software versions, file
formats, and system configurations having a lifespan of three to five years.
The issue is acute for endangered languages. In the very generation when
the rate of language death is at its peak, we have chosen to use moribund
technologies, and to create endangered data.
Bird & Simons 2003: 557
Keyword 2: reusability (and keyword 4: portability)
In addition, reusability also means a resource's capacity to be
as cross-platform as possible and to be used by different
scholarly communities for different purposes.
Bird & Simons (2003: 558) use a macro-category that sums up
all aspects of reusability, namely portability, a term normally
applied to software (cf. PDF, short for Portable Document
Format) whose use they extend to data:
…Portability is usually viewed as an issue for software, but here our focus
is on data …
Keyword 2: reusability
A subcommittee of the International Organization for Standardization (ISO)
works on creating standards for linguistic resources:
ISO TC 37 SC4: http://www.tc37sc4.org/
It is developing a Data Category Registry (DCR) and a Linguistic Annotation
Framework (LAF) [cf. Nancy Ide & Laurent Romary, “A registry of standard data
categories for linguistic annotation”, in Proceedings of the Fourth International
Conference on Language Resources and Evaluation, Lisbon 2004, 135-138]
“…data categories include both attributes … such as SYNTACTIC CATEGORY
and GRAMMATICAL GENDER, as well as a set of associated atomic values taken
by such attributes, such as NOUN and FEMININE … In principle, the DCR
provides a set of reference concepts, while the annotator provides a Data Category
Specification (DCS) that comprises a mapping between his or her scheme-specific
instantiations and the concepts in the DCR” (Ide & Romary 2004: 135)
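The relationship between registry and specification described in the quotation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries DCR and DCS, the tag names ("N", "fem", ...) and the two helper functions are hypothetical and do not reproduce the actual ISO registry or its format.

```python
# Hypothetical reference concepts standing in for entries of a
# Data Category Registry: attribute -> set of atomic values.
DCR = {
    "SYNTACTIC CATEGORY": {"NOUN", "VERB", "ADJECTIVE"},
    "GRAMMATICAL GENDER": {"FEMININE", "MASCULINE", "NEUTER"},
}

# Hypothetical Data Category Specification written by an annotator:
# scheme-specific tag -> (DCR attribute, DCR value).
DCS = {
    "N": ("SYNTACTIC CATEGORY", "NOUN"),
    "V": ("SYNTACTIC CATEGORY", "VERB"),
    "fem": ("GRAMMATICAL GENDER", "FEMININE"),
    "masc": ("GRAMMATICAL GENDER", "MASCULINE"),
}


def validate_dcs(dcs, dcr):
    """Return the scheme-specific tags that do not point at a known DCR concept."""
    return [
        tag
        for tag, (attribute, value) in dcs.items()
        if attribute not in dcr or value not in dcr[attribute]
    ]


def to_reference(tags, dcs):
    """Translate a list of scheme-specific tags into DCR attribute/value pairs."""
    return {dcs[tag][0]: dcs[tag][1] for tag in tags}


print(validate_dcs(DCS, DCR))          # tags with no counterpart in the registry
print(to_reference(["N", "fem"], DCS))  # the annotation expressed in DCR terms
```

Two resources that use different tagsets become comparable once each ships its own mapping of this kind to the same reference concepts.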
Keyword 2: reusability
ISO TC 37 SC4: http://www.tc37sc4.org/
“Some of the data categories already defined in ISO 12620, for example,
include general-purpose management data categories (e.g., SOURCE,
RESPONSIBILITY, DATE, etc.) as well as linguistic categories (e.g.,
PART OF SPEECH), which can provide a base for extension. In addition,
it should certainly be possible to utilize results from previous or existing
projects such as EAGLES/ISLE to provide a base set of categories for
consideration. We intend to proceed cautiously, implementing categories
that are widely used and relatively low-level, to ensure acceptance by the
community. By building up slowly, the DCR should eventually contain a
wide range of data categories, with their complete history, [and] data
category description” (Ide & Romary 2004: 136-137)
Keyword 2: reusability
ISO TC 37 SC4: http://www.tc37sc4.org/
Another goal: “articulation of a detailed technical proposal for an
XML format able to represent a feature structure analysis with a
precise description of the underlying formal mechanism to ensure the
coherence and soundness of the standard in line with major theoretical
works in this domain” (Kiyong Lee, Lou Burnard, et al., “Towards an
international standard on feature structure representation”, in
Proceedings of the 4th International Conference on Language
Resources and Evaluation, Lisbon 2004, 373-376)
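To give an idea of what an XML representation of a feature structure can look like, here is a minimal Python sketch. It assumes element names in the style of the TEI feature-structure markup (fs, f, symbol); the function feature_structure, the feature names and the values are hypothetical, and the output is not claimed to conform to the standard proposed in the paper.

```python
import xml.etree.ElementTree as ET


def feature_structure(features, fs_type=None):
    """Build an <fs> element from a (possibly nested) dict of features."""
    fs = ET.Element("fs")
    if fs_type:
        fs.set("type", fs_type)
    for name, value in features.items():
        f = ET.SubElement(fs, "f", name=name)
        if isinstance(value, dict):
            # A complex value is itself a feature structure.
            f.append(feature_structure(value))
        else:
            # An atomic value is recorded as a symbol.
            ET.SubElement(f, "symbol", value=str(value))
    return fs


fs = feature_structure(
    {"pos": "noun", "agreement": {"gender": "feminine", "number": "singular"}},
    fs_type="word",
)
print(ET.tostring(fs, encoding="unicode"))
```

Nesting one fs inside a feature is what lets the XML mirror the recursive structure of the formal object.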
Keyword 2: reusability
ISO TC 37 SC4: http://www.tc37sc4.org/
Pending the results of this working group, reusability must be
pursued anyway, by comparing one's work with other linguistic
resources as far as possible and by adopting, wherever possible,
so-called best practices (that is, the procedures and methodological
choices that characterize the main existing linguistic resources,
enjoy broad consensus, and raise the odds that a resource
survives in the long run).
Keyword 3: extendibility
• to other tasks and applications
• to other platforms
• of modules
Keyword 3: extendibility
An exemplary case: the guidelines of the TEI consortium
Anyone may add compatible modules for encoding particular
types of linguistic and extralinguistic information; the only
restriction is that the encoding schemes should, where possible,
be discussed within a community of interested people and
be compatible with the higher-level specifications.
Keyword 4: portability
Bird & Simons 2003:
Seven dimensions of the problem:
4.1 CONTENT
4.2 FORMAT
4.3 DISCOVERY
4.4 ACCESS
4.5 CITATION
4.6 PRESERVATION
4.7 RIGHTS
Keyword 4: portability
4.1 CONTENT
4.1.1. Coverage: if the coverage of a linguistic resource is not
carefully balanced, our ability to interpret linguistic facts on the
basis of that resource may be compromised (e.g. unattested meanings,
collocations, and constructions)
4.1.2. Terminology: “Language documentation and description of all types
depend critically on technical notation and vocabulary, and
ambiguous or unknown terms compromise portability” (B. & S.
2003: 563)
This is a problem above all for typological research!
Keyword 4: portability. Some recommendations (and best
practices)
4.1.1. Coverage: aim for a “record that is sufficiently
broad in scope, rich in detail, and authentic in portrayal
that future generations will be able to study and
experience the language, even if no speakers remain”
4.1.2. Terminology: devote specific effort to problems of
comparability between similar linguistic resources (“map the
terminology and abbreviations used in description to a
common ontology of linguistic items”)
Keyword 4: portability
4.2. FORMAT
“By format we mean the manner in which the information is represented
electronically. The area of format involves four key concepts: the
openness of the format, the encoding of characters within textual
information, the markup of structure in the information, and the
rendering of information in human-readable displays” (B. & S. 2003:
563)
Approaches that do not depend on proprietary solutions are to be preferred!
“It is a basic requirement of language resources that they should be
presented to human readers in conventionally formatted displays”
(ibidem, 565): an aspect too often neglected by the creators of
linguistic resources!
Keyword 4: portability. Some solutions (and best
practices)
4.2. FORMAT
“The best practice is one that puts data into a format
that is not proprietary”
“The best practice is one that fully documents what the
character codes in the resource document”
“The best practice is one that represents all of the
information using a transparent descriptive mark-up”
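One modest way to document the character codes in a resource is to ship an inventory of every code point it uses. The sketch below assumes the text is already decoded as Unicode; the function name character_inventory and the sample string are hypothetical, and this is an illustration rather than a procedure prescribed by Bird & Simons.

```python
import unicodedata
from collections import Counter


def character_inventory(text):
    """List each non-space character as (code point, Unicode name, frequency)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in sorted(counts.items())
    ]


sample = "ŋa ɓe"  # hypothetical transcription containing two IPA letters
for code_point, name, frequency in character_inventory(sample):
    print(code_point, name, frequency)
```

An inventory of this kind, stored next to the data, tells a future user which special characters to expect even if the original fonts and tools are gone.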
Keyword 4: portability. Some solutions (and best practices)
4.2. FORMAT
“The best practice is one that supplements the information resource
with all the auxiliary software resources that are needed to render it
for display”
“Prefer descriptive mark-up over presentational mark-up”
“Prefer XML (with an accompanying DTD or schema) over other
schemes of descriptive mark-up”
“Provide one or more human-readable versions of the material,
using presentational markup (e.g. HTML) or other convenient
formats”
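The split between descriptive storage and presentational display can be illustrated with a short Python sketch. The entry format (entry, form, pos, gloss) and the function render_entry are hypothetical, invented only for this example; a real resource would follow its own DTD or schema.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical lexical entry in descriptive markup: the tags say what
# each piece of information is, not how it should look.
ENTRY_XML = """<entry>
  <form>casa</form>
  <pos>noun</pos>
  <gloss lang="en">house</gloss>
</entry>"""


def render_entry(xml_text):
    """Derive a human-readable HTML line from the descriptive XML."""
    entry = ET.fromstring(xml_text)
    form = html.escape(entry.findtext("form", default=""))
    pos = html.escape(entry.findtext("pos", default=""))
    gloss = html.escape(entry.findtext("gloss", default=""))
    # Presentational choices (bold headword, italic part of speech) live here only.
    return f"<p><b>{form}</b> <i>{pos}</i> {gloss}</p>"


print(render_entry(ENTRY_XML))
```

Because the layout is generated, it can be changed or replaced without touching the archived data.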
Keyword 4: portability (and keyword 5:
dissemination)
4.3. DISCOVERY
“A given resource, even if it is of the highest quality, is
of little practical value if the people who could benefit
from it do not know that it exists” (B. & S. 2003: 565)
In many cases a resource becomes known by word of
mouth, which shows how much work remains to be done
on this front!
Keyword 4: portability (and keyword 5:
dissemination). Some solutions (and best practices)
4.3. DISCOVERY
“The best practice is one that makes it easy for anyone to
discover that a resource exists”
“The best practice is one that makes it easy for anyone to
judge the relevance of a resource based on its
description”
Keyword 4: portability (and keyword 5:
dissemination). Some solutions (and best practices)
4.3. DISCOVERY
“Any resource presented in HTML on the web should
contain metadata with keywords and description for
use by conventional search engines”
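A minimal sketch of this recommendation in Python: generating the two standard HTML meta elements for a resource's web page. The function meta_tags and the sample description and keywords are hypothetical.

```python
import html


def meta_tags(description, keywords):
    """Return HTML <meta> elements carrying a description and a keyword list."""
    desc = html.escape(description, quote=True)
    keys = html.escape(", ".join(keywords), quote=True)
    return (
        f'<meta name="description" content="{desc}">\n'
        f'<meta name="keywords" content="{keys}">'
    )


print(meta_tags(
    "Annotated corpus of spoken Italian",       # hypothetical resource
    ["corpus", "Italian", "spoken language"],
))
```

The output belongs in the head section of the page that presents the resource.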
Keyword 4: portability (and keyword 5:
dissemination)
4.4. ACCESS
Here we are dealing with the complexities
of the human soul!
“Commonly, researchers want to be recognized for the
labor that went into creating primary language
documentation, but do not want to make the materials
available to others until they have derived maximum
personal benefit” (ibidem, p. 566).
Keyword 4: portability (and keyword 5:
dissemination). Some solutions (and best practices)
4.4. ACCESS
“The best practice is one that makes it easy for users to obtain
a complete copy of the resource”
Or:
“The best practice is one in which there is a clearly
documented procedure by which users may obtain a
copy of the resource”
Keyword 4: portability
4.5. CITATION
Citing linguistic resources in scientific publications is a
special case of the more general problem of citing
electronic documents:
• persistence of URLs
• lack of citation guidance from the authors of the resources
• some solutions: state the access date for resources that
change often, and archive the relevant data (strings, lexical
entries, etc.) on your own computer so that they remain
retrievable
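The two suggestions above (record the access date, keep a local copy of the cited data) can be combined in a small record, sketched here in Python. The function citation_snapshot, the example URL and the lexical entry are hypothetical; adding a SHA-256 checksum is an assumption of this sketch, not something the slides prescribe.

```python
import hashlib
from datetime import date


def citation_snapshot(url, content, accessed):
    """Bundle a cited excerpt with its source URL, access date and checksum."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {
        "url": url,
        "accessed": accessed.isoformat(),
        "sha256": digest,    # lets one verify later that the copy is unchanged
        "content": content,  # the archived string or lexical entry itself
    }


record = citation_snapshot(
    "http://example.org/lexicon/casa",  # placeholder URL
    "casa, n. 'house'",
    date(2004, 5, 26),
)
print(record["url"], "accessed", record["accessed"])
```

Such a record can be saved alongside the publication's working files, so that the cited data stay retrievable even if the URL stops resolving.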
Keyword 4: portability. Some solutions (and best
practices)
4.5. CITATION
“The best practice is one that makes it easy for electronic
language documentation and description to be cited”
“The best practice is one that makes it possible for users
to cite particular versions that never change”
Keyword 4: portability
4.6. PRESERVATION
Problems tied to the longevity and durability of electronic
products and of data in binary format. They are partly
overcome by using non-proprietary formats and by
continuous maintenance of the resources (after all, some
resources created in the 1960s are still alive and
usable)
Keyword 4: portability. Some solutions (and best
practices)
4.6. PRESERVATION
“The best practice is one that stores resources in
formats that are likely to remain usable for
generations to come”
Keyword 4: portability
4.7. RIGHTS
Issues of copyright, protection of sensitive data,
licences of use, etc.
Keyword 4: portability. Some solutions (and best
practices)
4.7. RIGHTS
“The best practice is one that clearly states the terms of
use as part of the resource package”
Keyword 5: dissemination
• A political and economic problem…
For every natural language, computer-readable basic resources … are
increasingly needed… Especially in countries with a strong and modern
economy, enormous efforts have already been invested in developing such
resources, but often without common purpose and synergy. Property rights
tend to be jealously guarded by industrial and academical developers alike.
Enormous amounts of monetary support are wasted on projects that perforce
must start by reproducing the work of others, since they can’t use the previous
results, and whose results in their turn either remain hidden or just evaporate.
We appear not to be standing on the shoulders of our predecessors but rather on
their toes… (Cornelis H. A. Koster & Stefan Gradmann, “The language belongs
to the People!”, in Proceedings of the 4th International Conference on Language
Resources and Evaluation, Lisbon, 2004, 353-356)
Keyword 5: dissemination
• The role of supranational institutions
E.g. ELRA (European Language Resources Association)
“ELRA has been, since its foundation in 1995, a conduit for the distribution of
speech, written and terminology databases, enabling key players to have access to
Language Resources (LRs) for technology development and technology evaluation.
ELRA's initial mission was to establish itself as a self-supported, centralized
Not-for-profit organization for the collection, distribution, and validation of speech, text,
terminology resources and tools”
(Khalid Choukri, “Recent Activities within the European Language Resources
Association: issues on sharing Language Resources and Evaluation”, in Proceedings
of the 4th International Conference on Language Resources and Evaluation, Lisbon
2004, 933-936)
Keyword 5: dissemination
• The role of supranational institutions
E.g. ELRA (European Language Resources Association)
“In order to play its role, ELRA created a structured and publicly available catalogue
of Language Resources. A set of description forms was prepared, aiming to help the
providers describe what they propose to ELRA for distribution in a more uniform
and consistent way and the users have a quick access to the main features”
(Choukri, 2004, 935)
www.elda.fr
Keyword 5: dissemination (and keyword 1: validation/evaluation)
• The role of supranational institutions
E.g. ELRA (European Language Resources Association)
Validation Manual for Lexica: http://www.elra.info/services/valcom.php
Validation of a lexicon’s documentation is the act of checking that
certain very basic information is present in the documentation. This
involves a human reading the documentation and checking it against
the criteria. … By lexicon documentation we mean the explanatory
files that accompany the lexicon files themselves. These are files
such as general and specific documentation, ‘read me’ files, operating
instructions etc.
Keyword 5: dissemination (and keyword 1:
validation/evaluation)
From: http://www.elra.info/services/valcom.php
Firstly, the documentation should be written in English (also
for lexical resources for other languages than English), and it
should clearly present core administrative information: contact
data for the resource (e.g. name, address, e-mail, URL), the
number and types of physical media involved (e.g. CDs), the
precise contents of each piece of physical medium, and
copyright statements … if relevant.
Keyword 5: dissemination (and keyword 1: validation/evaluation)
From: http://www.elra.info/services/valcom.php
Secondly, the documentation should describe the formal properties of the
lexicon. These are constituted by the basic technical information needed in order
to access and use the data: character set(s) used, data format (e.g. mark-up
language), system(s) needed to view and/or access the data, and the number,
names and organisation of files belonging to the lexicon, plus the procedure for
accessing them.
Thirdly, the documentation should contain the content information necessary to
serve as a specification of the linguistic content. This covers the items lexicon
size, lexicon coverage, intended application(s), natural language(s), data
structure of an entry, entry types, attributes and their values, POS assignment
and other relevant linguistic specifications.
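The three groups of criteria quoted from the manual can be kept as a simple checklist. The Python sketch below is only a bookkeeping aid under stated assumptions: the manual describes a check carried out by a human reader, the item labels are paraphrases of the quoted text, and the names REQUIRED and missing_items are hypothetical.

```python
# Checklist paraphrasing the three groups of information named in the
# quoted passages; the labels are not official ELRA identifiers.
REQUIRED = {
    "administrative": [
        "contact data",
        "number and types of media",
        "contents of each medium",
        "copyright statement",
    ],
    "formal": [
        "character set",
        "data format",
        "systems needed",
        "file organisation",
        "access procedure",
    ],
    "content": [
        "lexicon size",
        "lexicon coverage",
        "intended applications",
        "natural languages",
        "entry structure",
        "entry types",
        "attributes and values",
        "POS assignment",
    ],
}


def missing_items(documented):
    """Return, per group, the checklist items the reader did not find."""
    missing = {}
    for group, items in REQUIRED.items():
        absent = [item for item in items if item not in documented]
        if absent:
            missing[group] = absent
    return missing


# Items a (hypothetical) validator ticked off while reading the documentation.
found = {"contact data", "character set", "data format", "lexicon size"}
for group, items in missing_items(found).items():
    print(group, "->", ", ".join(items))
```

The judgement of whether each item is really present and adequate remains with the person reading the documentation.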
Keyword 5: dissemination
• The role of supranational institutions
E.g. the ENABLER Consortium (European National Activities for Basic Language
Resources)
The ENABLER Consortium conducted the Survey of LRs to get a global picture of
the situation on LRs, in order to be able to compare the various conditions that hold
across different languages and – on this basis – to suggest more sound
recommendations. The Survey provides an overview of the results of National
Projects and activities on LRs of different types (written, spoken, multimodal, lexical
resources and related tools).
http://www.ilsp.gr/enabler/
Keyword 5: dissemination
• The role of supranational institutions
The Open Language Archives Community
(http://www.language-archives.org)
OLAC is an international partnership of institutions and
individuals who are creating a worldwide virtual library of
language resources by: (i) developing consensus on best current
practice for the digital archiving of language resources, and (ii)
developing a network of interoperating repositories and
services for housing and accessing such resources.
Keyword 5: dissemination
• The role of supranational institutions
The OLAC gateway at the LINGUIST List site
(http://linguistlist.org/olac) permits users to search the
contents of all archives from a single location.
Anyone in the wider linguistics community can participate,
not only by using the search facilities, but also by documenting
their own resources, or by helping create and evaluate new best
practice recommendations.
Keyword 5: dissemination
• The OLAC Metadata standard
http://www.language-archives.org/OLAC/metadata.html
An XML format for recording all the “meta-linguistic”
information about one's own linguistic resource, so that
it is easier to find. It is similar in concept to the
catalogue cards used by the US Library of Congress.
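As a rough illustration of such a record, the Python sketch below builds a small XML description from Dublin Core elements, on which the OLAC format is based. It is a simplified sketch under stated assumptions: the OLAC namespace URI should be checked against the specification at the address above, the function metadata_record and the sample values are hypothetical, and the typed refinements of real OLAC records are not modelled.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
# Assumed OLAC namespace URI: verify against the specification before use.
OLAC = "http://www.language-archives.org/OLAC/1.1/"

ET.register_namespace("dc", DC)
ET.register_namespace("olac", OLAC)


def metadata_record(title, creator, language_code, description):
    """Build a minimal metadata record made of four Dublin Core elements."""
    record = ET.Element(f"{{{OLAC}}}olac")
    fields = (
        ("title", title),
        ("creator", creator),
        ("language", language_code),
        ("description", description),
    )
    for tag, text in fields:
        ET.SubElement(record, f"{{{DC}}}{tag}").text = text
    return record


record = metadata_record(
    "Sample lexicon of Italian",  # hypothetical resource
    "Rossi, Maria",               # hypothetical author
    "ita",
    "A small machine-readable lexicon with part-of-speech information.",
)
print(ET.tostring(record, encoding="unicode"))
```

Like a catalogue card, the record describes the resource without containing it, which is what allows a central service to index many archives at once.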