Linguistica Computazionale
• Prima parte
Il modello BNC e la
rappresentazione dell’universo
linguistico
.
Un corpus di lingua per quanto grande esso sia, per
essere rappresentativo dell’universo d’uso di una lingua
deve rappresentare i vari domini dell’ universo stesso in
modo statisticamente significativo
Deve avere un design
Selection features
• Texts were chosen for inclusion according to three selection
features:
– domain (subject field),
– time (within certain dates)
– medium (book, periodical, etc.).
The purpose of these selection features was to ensure that the corpus
contained a broad range of different language styles
• the corpus could be regarded as a microcosm of current British English
in its entirety, not just of particular types.
• different types of text could be compared and contrasted with each
other.
– Half of the books in the “Books and Periodicals” class were selected
at random from Whitaker’s Books in Print 1992.
Sample size and method
•
Text samples normally consist of a continuous stretch of discourse
from within the whole.
•
Only one sample was taken from any one text. Samples were taken
randomly from the beginning, middle or end of longer texts.
•
For books, a target sample size of 40,000 words was chosen.
•
For the most part, texts which in their entirety were shorter than
40,000
– words were further reduced by ten per cent for copyright reasons; a few
texts longer than
– the target size were however included in their entirety.
• Some types of written material are composite in
structure:
• , the physical object in written form is composed
of more than one text unit.(newspaper or
magazine)
• in one issue of a newspaper were grouped
according to domain,
– “Business”articles, “
– Leisure” articles, etc.
Domain
• Classification according to subject field seems hardly
appropriate to texts which are fictional or which are
generally perceived to be literary or creative.
• Consequently, these texts are all labelled imaginative
and are not assigned to particular subject areas.
– The evidence from catalogues of books and periodicals suggests
that imaginative texts account for significantly less than 25 per
cent of published output,
• All other texts are treated as informative and are
assigned to one of the eight domains listed below.
informative
Selection procedures employed
• Roughly half the titles were randomly selected from
available candidates identified in Whitaker’s Books in
Print (BIP), 1992,
• . Each text randomly chosen was accepted only if it
fulfilled certaincriteria:
– it had to be published by a British publisher,
– contain sufficient pages of text to make its incorporation
worthwhile
– consist mainly of written text,
– fall within the designated time limits,
– and cost less than a set price.
• the remaining half were selected
systematically to make up the target
percentages in each category
Selection Procedure
•
•
•
Each selection feature was divided into classes
“Medium” into books, periodicals, unpublished etc.;
“Domain” into imaginative, informative, etc.
target percentages were set for each class.
•
Seventy-five per cent of the samples were to be drawn from
informative texts, and the remaining 25 per cent from imaginative
texts.
•
titles were to be taken from a variety of media, in the following
proportions:
– 60 per cent from books,
– 30 per cent from periodicals,
– 10 per cent from miscellaneous sources (published, unpublished, and
written to be spoken).
Bestsellers
•
Because of their wide reception, bestsellers were obvious candidates for
selection.
•
The lists used were those that appeared in the Bookseller at the end of
the years 1987 to 1993 inclusive.
•
Some of the books in the lists were rejected, for a variety of reasons.
Obviously books that had already been selected by the random
method were excluded, as were those by non-UK authors.
•
In addition, a limit of 120,000 words from any one author was imposed,
and books belonging to a domain or category whose quota had already
been reached were not selected.
•
Other bestseller lists were obtained from The Guardian, the British
Council, and from Blackwells Paperback Shop.
•
The titles yielded by this search were mostly in the Imaginative category.
Literary prizes
• The criteria for inclusion were the same as for bestsellers.
The prize winners, together with runners-up and
shortlisted titles, were taken from several sources,
principally Anne Strachan Prizewinning literature: UK
literary award winners, London, 1989.
• For 1990 onwards the sources used were: the last issue
of the Bookseller for each year; The Guardian Index,
1989–, entries under the term “Literature”; and The Times
Index, 1989-, entries under the term “Literature —
Awards”.
• Literary prizes are in the main awarded to works that fall
into the Imaginative category, but there are some
Informative ones also.
Library loans
• The source of statistics in this category was the record of
loans under Public Lending Right, kindly provided by Dr
J. Parker, the Registrar.
• The information comprised lists of the hundred most
issued books and the hundred most issued children’s
books, in both cases for the years 1987 to 1993.
• The lists consist almost exclusively of imaginative
literature, and many titles found there
• also appear in the lists of bestsellers and prize winners.
Additional texts
• As collection proceeded, monitoring
disclosed potential shortfalls in certain
domains. A further selection was therefore
made, based on the “Short Loan”
collections of seven University libraries.
• (Short Loan collections typically contain
books required for academic courses,
which are consequently in heavy demand.)
Periodicals and magazines
• Periodicals, magazines and newspapers account for 30
per cent of the total text in the corpus.
• Of these, about 250 titles were issues of newspapers.
These were selected to cover as wide a spectrum of
interests and language as possible.
• Newspapers were selected to represent as wide a
geographic spread as possible: The Scotsman and the
Belfast Telegraph are both represented, for example.
Other media
•
In addition to samples from books, periodicals, and magazines, the written
part of the corpus contains about seven million words classified as
– “Miscellaneous Published”, “Miscellaneous Unpublished”, or as “Written to be
spoken”.
•
The distinction between “published” and “unpublished” is not an easy one;
publicity leaflets,brochures, fact sheets, and similar items,
•
school and university essays, unpublished creative writing or letters, and
internal company memoranda.
•
•
The “written to be spoken” material includes scripted material, intended to be
read aloud such as television news broadcasts; transcripts of more informal
broadcast materials such as discussions or phone-ins are included in the
spoken part of the corpus.
BNC spoken
• ten million words of orthographically transcribed speech,
covering a wide range of speech variation.
• A large proportion of the spoken part of the corpus —over
four million words—comprises spontaneous conversational
English.
• The importance of conversational dialogue to linguistic study
is unquestionable: it is the dominant component of general
language both in terms of language reception and language
production.
• As with the written part of the corpus, the most important
considerations in constructing the spoken part were
sampling and representativeness.
Sampling BNC spoken:
demographic component
•
A comprehensive list of text types can be drawn up but there is no
accurate way of estimating the relative proportions of each text type
other than by a priori linguistically motivated analysis.
•
An alternative approach, one well known to sociological
researchers, is demographic sampling, (approximately half of the
spoken part of the corpus)
•
. The sampling frame was defined in terms of the language
production of the population of British English speakers in the United
Kingdom.
•
Representativeness was achieved by sampling a spread of
language producers in terms of age, gender, social group, and
region, and recording their language output over a set period of
time.
• Established random location sampling procedures were
used to select individual members of the population by
personal interview from across the country taking into
account age, gender, and social group.
•
Selected individuals used a portable tape recorder to
record their own speech and the speech of people they
conversed with over a period of up to a week.
• In this way a unique record of the language people use
in everyday conversation was constructed.
Sampling procedure
• 124 adults (aged 15+) were recruited from across the
United Kingdom. Recruits were of both sexes and from
all age groups and social classes.
• as far as possible, equal numbers of men and women,
equal numbers from each of the six age groups, and
equal numbers from each of four social classes.
• Additional recordings were gathered for the BNC as part
of the University of Bergen COLT Teenager Language
Project.
BNC spoken
Sample size
• recruiting 1000 people would have given
greater statistical validity
• practical difficulties and cost implications
of recruiting1000 people and transcribing
50–100 million words of speech.
• It is also important to stress that the total
number of participants in all conversations
was well in excess of a thousand.
context-governed part
• that many types of spoken text are produced only rarely in
comparison with the total output of all “speech producers”:
broadcast interviews, lectures, legal proceedings,
• few producers and many receivers.
• A corpus constituted solely on the demographic model
would thus omit important spoken text types.
• Consequently, the demographic component of the corpus
was complemented with a separate text typology intended
to cover the full range of linguistic variation found in spoken
language;
• As in other spoken corpora, the range of text types
was selected according to a priori linguistically
motivated categories.
–
–
–
–
educational,
business,
public/institutional,
leisure.
• Each is divided into the subcategories
• monologue (40 per cent)
• dialogue (60 per cent).
• Each monologue subcategory therefore totals 10 per
cent of the context-governed part of the corpus, and
each dialogue subcategory 15 per cent.
Sampling procedure
• For the most part, a variety of text types were
sampled within three geographic regions.
• However, some text types, such as
parliamentary proceedings, and most broadcast
categories,
• Different sampling strategies were required for
each text type.
Educational and informative:
•
Lectures, talks, educational demonstrations Within each
sampling area a university (or college of further education) and a
school were selected. A range of lectures and talks was recorded,
varying the topic, level, and speaker gender.
•
News commentaries Regional sampling was not applied, but both
national and regional broadcasting companies were sampled. The
topic, level, and gender of commentator was varied.
•
Classroom interaction Schools were regionally sampled and the
level (generally based on student age) and topic were varied.
Business:
•
Company talks and interviews Sampling took into account
company size, areas of activity, and gender of speakers.
•
Trade union talks Talks to union members, branch meetings and
annual conferences were all sampled.
•
Sales demonstrations A range of topics was included.
•
Business meetings Companies were selected according to size,
area of activity, and purpose of meeting.
•
Consultations These included medical, legal, business and
professional consultations.
Public/ or institutional:
• Political speeches Regional sampling of local politics, plus
speeches in both theHouse of Commons and the House of
Lords.
• Sermons Different denominations were sampled.
• Public/government talks Regional sampling of local
inquiries and meetings, plus national issues at different
levels.
• Council meetings Regionally sampled, covering parish,
town, district, and county councils.
• Religious meetings Includes church meetings, group
discussions, and so on.
• Parliamentary proceedings Sampling of main sessions
and committees, House of Commons and House of Lords.
• Legal proceedings Royal Courts of Justice, and local
Magistrates and similar courts were sampled.
Leisure:
•
Speeches Regionally sampled, covering a variety of occasions and
speakers.
•
Sports commentaries Exclusively broadcast, sampling a variety of
sports, commentators, and TV/radio channels.
•
Talks to clubs Regionally sampled, covering a range of topics and
speakers.
•
Broadcast chat shows and phone-ins Only those that include a
significant amount of unscripted speech were selected from both
television and radio.
•
Club meetings Regionally sampled, covering a wide range of clubs.
Sample size
•
•
Composition of the spoken component
A total of 757 texts (6,153,671 words) make up the contextgoverned part of the corpus.
•
. Each monologue text type contains up to 200,000 words of text,
and each dialogue text typeup to 300,000 words.
•
•
The length of text units within each text type vary — for example,
news
commentaries may be only a few minutes long (several hundred
words), lectures are typically up to one hour (10,000 words), and
some business meetings and parliamentary proceedings may last
for several hours (20,000 words+).
•
For the context-governed part of the corpus an upper limit of
10,0000 words per text unit was generally imposed,
• Within each subcategory a range of text types was
defined.
•
This range was not fixed, and the design was flexible
enough to allow the inclusion of additional text types. T
• the sampling methodology was different for each text
type but the overall aim was to achieve a balanced
selection within each, taking into account such features
as region, level, gender of speakers, and topic.
• Other features, such as purpose, were applied on the
basis of post hoc judgements.
C-ORAL-ROM
• Sampling strategy
• the representation of language information
in present days Speken Resources
Comparable corpora with respect to the corpus
design matrix.
Frequency lexicons criteria.
comparable sampling of the universe
comparable variations
TOTAL
LEMMA
S
Open
Fundamental Class
Closed
Class
ITALIAN
15286
2390
2118
187
FRENCH
11801
1981
1778
178
SPANISH
11743
1749
1489
165
PORTUGUESE
11453
1684
1381
224
. Percentage of Open class and Closed class forms in the C-ORAL-ROM corpora
Percentage of Nouns and Verbs in the C-ORAL-ROM corpora
Italian nouns-verbs
20,62%
16,27%
Portuguese nouns-verbs
22,71%
19,84%
18,99%
19,55%
19,10%
17,60%
23,54%
22,41%
18,76% 19,85%
18,67%
13,45%
21,07%
20,24%
18,16%
18,72%
17,58%
16,07%
17,71%
17,34%
13,82%
10,16%
inf_dial
inf_mon
form_dial
form_mon
% nouns
media
telephone
inf_dial
inf_mon
% verbs
%nouns
Spanish nouns-verbs
15,16% 15,22%
12,86%
11,30%
inf_dial
inf_mon
16,69%
media
telephone
%verbs
French nouns-verbs
18,31% 18,10%
15,04%
14,80%
form_dial form_mon
%nouns
form_dial form_mon
%verbs
15,01%
12,80%
19,8%
9,18%
12,7%
media
telephone
18,6%
15,6%
inf_dial
inf_mon
18,3% 19,2% 19,2% 19,9%
18,0% 16,9% 17,8%
12,4%
form_dial form_mon
%nouns
media
telephone
%verbs
Variation of the percentage of Nouns and Verb in the main contexts of the design
– the ratio of nouns and verbs changes
according to variation parameters following
a generic grouping principle:
»
»
»
»
Informal dialogues
Informal monologues
Formal dialogues
Formal monologues
strumenti di conoscenza e di
accessibilità linguistica
• Le risorse linguistiche digitali
• Gli stumenti computazionali
le concordanze
• Kwic (Kwords in contexts)
• Ant-Conc
• Con-cap
• Word-Smith-Tool
informazione da corpus e olismo
informazione da corpus e olismo
collocazioni / legame di
associazione tra parole nel testo
• Composizionalità sintattico semantica
– Gianni ha visto un topolino grigio (verde / bianco / grazioso)
– ……………… in cane marrone
– ………………. la pizza croccante
• costruzioni non volatili
– a notte fonda (? scura / profonda ) c’era la luna piena (?ampia / estesa )
– gli alberghi sono cari in alta (? elevata / avanzata )stagione
– Gianni ha la vista lunga
olismo e collocazioni
tipi di associazioni
• polirematiche
– ferro da stiro, legna da ardere, vaso da notte,
• termini tecnici (corte d’assise, legge delega, sistema
operativo)
• verbi a supporto (fare attenzione, mettere a posto,
prendere freddo,
• nomi propri composti
• idioms
– tagliare la corda, tirare le quoia, gatta morta, acqua cheta
• complementi tipici / collocate
– (infrangere le regole,
proprietà
•
•
•
•
elevata convenzionalità
ridotta composizionalità semantica
rigidità strutturale
selezione reciproca
• i parlanti usano la collocata come blocco
precostituito
il problema delle misure di
associazione
• stabilire il legame di associazione tra due
(o più parole in un testo)
– ricorrenza statisticamente significativa
• Identificazione della frequenza assoluta
dei bigrammi in un corpus
– indice di associazione di ogni bigramma
– determinazione di una soglia di associazione
frequenza di associazioni
grammaticali
• frequenza assoluta del bigramma e
frequenza assoluta
mutual information
• date due parole v1 e v2 si confronta la
probabilità di osservare la coppia con la
probabilità di osservare ogni parola
separatamente
se la frequenza assoluta di una espressione è molto più alta di
quella nella sue associazioni queste non sono significative
i bigrammi più frequentemente associati non corrispondono a collocazioni sensate
mutual information senza hapax
corpus driven analisys
the word largely occurred more frequently with negative words or expressions,
while broadly appeared more frequently with positive ones.
Collocations can be in a syntactic relation (such as verb-object: 'make' and 'decision')
Uno sguardo di insieme sui corpora
• Corpora di riferimento inglesi e risorse
italiane
I primi corpora di riferimento
•
Lessico di frequenza della lingua italiana contemporanea (LIF),
elaborato nel 1971 al cnuce (Centro Nazionale Universitario di Calcolo
elettronico) di Pisa (cfr. Bortolini et alii , 1971).
Primo grande progetto di costruzione di un lessico di frequenza per la lingua
italiana . Il lessico contiene circa 5.000 lemmi ordinati per frequenza e
secondo l'ordine alfabetico, tratti dallo spoglio di testi per un complesso di
500.000 parole., testi scritti tra il 1947 e il 1968, (teatro, romanzi, cinema,
periodici, sussidiari) Non disponibile.
•
Lessico di frequenza dell'italiano parlato (LIP), curato da De Mauro,
Mancini, Vedovelli e Voghera (1993)
costituisce la controparte del lif per l'italiano parlato. Il corpus è costituito da
circa 500.000 parole grafiche, trascrizioni di registrazioni effettuate a Milano,
Firenze, Roma e Napoli, pari a quasi 57 ore di parlato. Le tipologie del parlato
rappresentate sono dialoghi faccia a faccia e non, a presa di parola libera e
non, monologhi faccia a faccia e non.
Ora on line nel BAdip URL: http://languageserver.unigraz.at/badip/badip/20_corpusLip.php
Letterari
•
Tesoro della lingua italiana delle origini (TLIO) è un database
testuale (nato nel 1995 e inaugurato nel 1998) composto da circa
1.780 testi per circa 20 milioni di parole, tratte da scritti in lingua
italiana prima del 1375, in prosa e in poesia. Una prima versione
della banca dati fu implementata in dbt di Eugenio Picchi. Il
database è interrogabile online con registrazione gratuita al sito
dell'Istituto Opera del Vocabolario Italiano (OVI):
URL: http://tlio.ovi.cnr.it/TLIO/
•
Letteratura Italiana Zanichelli Picchi & Stoppelli CD-rom
Contiene il testo integrale di 1000 opere della letteratura italiana.
245 autori dalle origini fino a D’Annunzio e Pirandello più 19
anonimi, 4 antologie poetiche e l'intera serie delle riviste Il Caffè e Il
Conciliatore
• CT "Corpus Taurinense"
• Pogetti (COFIN), condotti da Bice Mortara Garavelli e
Lorenzo Renzi, OVI, IMS, DimaLogic..
•
•
•
•
Corpus di Italiano antico (XIII secolo, Firenze)
259,299 tokens
21,087 types
7,599 lemmas
Interamente lemmatizzato, POS-tagged second
specifiche EAGLES. Annotato per corpus design (generi
letterari) e forme filologiche. http://www.corpora.unito.it/
La seconda generazione e i corpora per la
consultazione e la ricerca on line
• Corpus di Italiano Scritto contemporaneo
(CORIS/CODIS): coordinato da R. Rossini Favretti, dal
1998.
– ..Formato da due corpus distinti. Il COrpus di Riferimento
dell'Italiano Scritto ( Coris ) . Il corpus contiene 100 milioni di
parole aggiornato ogni due anni con nuovo materiale di controllo.
I testi ivi contenuti sono prevalentemente di narrativa prodotta
negli anni Ottanta e Novanta. Il corpus è elaborato con criteri
linguistici
– . COrpus Dinamico dell'Italiano Scritto ( Codis ) che permette
la selezione ed eventuale esclusione di sottocorpora considerati
non rilevanti per specifiche ricerche.
– Il corpus è disponibile su cd-rom e per consultazione online.
URL: http://corpora.dslo.unibo.it/coris_ita.html
Corpus La reppubblica (Bologna Forlì) 2004
• corpus di italiano giornalistico
. pos-tagged
. lemmatized .
– categorized in terms of genre and topic
General labels:
Topic labels:
.
news-report and comment;
church, culture, economics,
education, news, politics, science,
society, sport, weather.
LABLITA-C-ORAL-ROM
• Corpora di LABLITA dal 1973 si occupa della
raccolta e gestione di corpora di parlato italiano
con lo standard di trascrizione chat (cfr. Childes) e
l’allineamento testo suono Winpitch pro
– un corpus di italiano parlato spontaneo adulto che
riguardano situazioni comunicative diafasiche diverse
per un totale di 60 ore;
– un corpus della lingua dei media (cinema, radio e
televisione);
– un corpus di 100 ore di italiano registrato nella fase del
primo apprendimento (in bambini di 18-36 mesi).
– Il corpus allineato C-ORAL-ROM italian nel corpus
comparabile del parlato romanzo
other
•
Corpus e Lessico di Frequenza dell'Italiano Scritto (ColFI) 3.150.075 occorrenze lessicali
tratte da quotidiani, periodici e libri di varia natura bilanciate secondo le letture degli italiani.
. http://www.istc.cnr.it/material/database/colfis/
•
Lessico di frequenza dell'italiano radiofonico (LIR) è un progetto di analisi del lessico e del
corpus del parlato radiofonico nato nel 1998. Il corpus di circa 60 ore, è trascritto
ortograficamente, allineato all'audio mediate software apposito, lemmatizzato e pubblicato
su cd-rom. Accademia della Crusca
•
Corpus di italiano televisivo (CIT) Perugia- attualità, intrattenimento, pubblicità, sport e
telegiornali. Il Cit è annotato secondo gli standard della Text Encoding Initiative (TEI).
URL: http://www.sspina.it/cit/cit.htm
•
Athenaeum Corpus
corpus di italiano scritto accademico, dell'Università di Torino; POS-taggati e classificati
per argomento e tipo testuale (articoli della rivista L'Ateneo e del notiziario Dall'Universita',
documenti ufficiali, e-mail prodotte dai vari dipartimenti e uffici amministrativi) / 306.927
token; 32.221 type; 11.748 lemmas
•
Jus Jurium
(in progress) è un corpus in lingua italiana che intende coprire la totalità dell'universo di
discorso legale oggi corrente in Italia. Non si tratta di un database giuridico essendo le sue
finalità precipuamente linguistiche. Il corpus è etichettato per parti del discorso ed ha un
robusto markup testuale e diplomatico. Ancora non interrogabile.
http://www.corpora.unito.it
http://www.bmanuel.org/projects/
•
•
VALICO "Varietà di Apprendimento della Lingua Italiana: Corpus
Online" e VINCA "Varietà di Italiano di Nativi Corpus Appaiato". La
risorsa è consultabile ed interrogabile on-line. Sotto la supervisione di
Manuel Barbera, Carla Marello ed Elisa Corino,
VALICO è un corpus multilingue di e per apprendenti di italiano come L2.
VINCA è il corpus di testi scritti da italofoni appaiato a VALICO.
formato da testi trascritti annotati per parte del discorso, per tipo testuale,
per lingua madre dell'apprendente.
Sono state raccolte per lo più composizioni libere, traduzioni e composizioni
scritte elicitate a partire da stimoli iconici.
Il bilanciamento mira ad avere in VALICO la stessa quantità di testi ( e
token) per gruppi di studenti con lingua madre o L1 più rappresentata fra
quelle presenti nel corpus e cioè inglese, francese, spagnolo, tedesco. E
anche una stessa (minore) quantità di testi per gruppi di studenti con lingue
madri meno rappresentate come maltese, polacco, giapponese, arabo,
serbo, portoghese, ungherese.
http://www.bmanuel.org/projects/
•
API/AVIP/IPar (laboratorio di linguistica della Scuola Normale di Pisa, il CIRASS e
l’Orientale di Napoli, il Politecnico di Bari e l’Università del Piemonte Orientale)
–
•
materiale fonico spontaneo di lingua italiana, conformi alle specifiche di codifica e
annotazione di Eagles. Il materiale dei corpora (files e software) è disponibile su cd-rom,
distribuiti dal CIRASS e via ftp sempre dal sito del CIRASS (ftp.cirass.unina.it).
URL: http://www.cirass.unina.it/
Corpora Linguistici per l'Italiano Parlato e Scritto (CLIPS) (audio, etichettatura
e documentazione)
– circa 100 ore di parlato, equamente ripartito tra voci maschili e voci femminili,
in parte trascritto ortograficamente e etichettato foneticamente. Le registrazioni
sono state effettuate in 15 località italiane scelte in base a criteri di
rappresentatività linguistica e socioeconomica: Bari, Bergamo, Bologna,
Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma,
Perugia, Roma, Venezia. Per ogni località è stato raccolto
- parlato radiotelevisivo (notiziari, interviste, talk shows);
- parlato dialogico (240 dialoghi raccolti secondo le modalità del map task e del ‘gioco
delle differenze’, dei quali 30 etichettati foneticamente, 90 trascritti
ortograficamente, studenti universitari);
- parlato letto da parlanti non professionisti (20 frasi lette a garantire la copertura
delle frequenze medio-alte del lessico italiano);
d) parlato telefonico (conversazioni tra circa 300 parlatori e un portiere d’albergo
simulato)
e) parlato letto da 20 parlanti professionisti (160 frasi atte a garantire la copertura delle
sequenza fonotattiche dell’italiano e delle frequenze medio-altedel lessico italiano)
registrato in camera anecoica
•
URL: http://www.clips.unina.it/.
il Web come un corpus
• La rete fornisce
– accessibilità ai contenuti
– strumenti di estrazione dell’informazione
linguistica
– corpora in rete
WaCky Web-as-Corpus kool
ynitiative,
•
ITWAC: a 2 billion word corpus constructed from the Web limiting
the crawl to the .it domain and using medium-frequency words from
the Repubblica corpus and basic Italian vocabulary lists as seeds.
•
UKWAC: a 2 billion word corpus constructed from the Web limiting
the crawl to the .uk domain and using medium-frequency words
from the BNC as seeds.
•
DEWAC: a 1.7 billion word corpus constructed from the Web
limiting the crawl to the .de domain and using using mediumfrequency words from the SudDeutsche Zeitung corpus and basic
German vocabulary lists as seed.
NUNC "NewsgroupsUseNet Corpora".
•
NUNC è un corpus multilingue (It. De. Fr. En. Es. Ma. Su. Ee. Pt.)
basato sulla lingua dei newsgroups (oltre 600 milioni di parole per ogni
lingua). La risorsa è libera e interrogabile on-line. Barbera S. Colombo,
E. Corino, C. Marello, Suddiviso in sottocorpora per ricerche mirate
Corpus generico di lingua italiana:
•
–
–
–
–
–
–
–
•
•
•
NUNC Italiano (I parte)
NUNC Italiano (II parte)
Corpus specialistico sull'argomento cucina NUNC Cucina
Corpus specialistico sull'argomento motori NUNC Motori
Corpus specialistico sull'argomento fotografia NUNC Foto
Corpus specialistico sull'argomento fotografia non elaborato NUNC Foto
Corpus specialistico sull'argomento cinema NUNC Cinema
NUNC multilingue
Corpus generico di lingua spagnola NUNC Spagnolo
Corpus specialistico sull'argomento cucina di lingua spagnola NUNC
Cucina Spagnolo
http://www.corpora.unito.it
http://www.bmanuel.org/projects/
tipi di ricerche
•
Concordanze e collocation (ordinate secondo z-score, t-score, mutual
information, occorrenze, log-likelihood);
•
Ricerca per lemma e per forma
•
Colligation;
•
PoS pattern;
•
Definizione della lunghezza del contesto dei risultati della query
•
Ordinamento per contesto sinistro e destro
•
ricerca full text con boosting sulle keyword
•
ricerca per similarità, per vicinanza per espressione regolare
CoRIS
concordanze e criteri di
ordinamento dei risultati
• per forme
– mangiare
– mangiare + con
• contesto destro e sinistro
– ordine alfabetico
CoRIS collocate
• criteri di ordinamento
http://dev.sslmit.unibo.it/corpora/query.php?mode=simple&path=&name=Repubblica
g0au7h51
Corpus repubblica
• Ricerche per forma e lemma
• restrizioni sui Domini pratici
• advanced query CQL
• Liste di frequenza
ricerca per dominio: economia
ricerca per dominio: sport
NUNC lemmi
• non ho liste di frequenze non ho dati
relativi (andare vs andare bene)
– [lemma= 'andare']
– [lemma= 'andare'] [word= 'bene']
– [lemma='tagliare'][]{0,3}[lemma='carne']
[lemma='stare'][pos='PRE'][pos='VER:infi']
• ricerca semplice
• ricerca di associazione
• Ricerca per forma pos e lemma
l
L’italiano tra le altre lingue e il
contesto del web
Lingua
cinese mandarino
inglese
hindi + urdu
spagnolo
russo
bengali
arabo
portoghese
maleo-indonesiano
giapponese
francese
tedesco
…
Italiano
Lingue romanze > 800 milioni
N. di parlanti
1 miliardo
1 miliardo
900 milioni
450 milioni
320 milioni
250 milioni
250 milioni
200 milioni
160 milioni
145 milioni
125 milioni
125 milioni
75 milioni
EU
23 lingue
ufficiali
27 stati
500 milioni di
cittadini
La rete
• Internet è attualmente il più grande deposito di
informazione linguistica esistente
• è insieme ambiente e mezzo privilegiato
dell’uso di una lingua
– lo spazio entro il quale sia gli usi funzionali sia gli
usi creativi del linguaggio sono esercitati con
sempre maggior frequenza.
la rete è grande e è cresciuta
rapidamente
• Google index in 1998 26 million pages
• Google index by 2000 reached one billion mark.
• 29.7 billion pages on the World Wide Web as of February 2007.
• più di venti miliardi di pagine web indicizzate dai motori
di ricerca
• 1 trillion (as in 1,000,000,000,000) unique URLs on the
web at once! in 2008
previsioni nel 2002
• Fifty per cent of the content on the Web is in languages
other than English
• . It is estimated that, by 2003, two-thirds of all Internet
users will be non-English speakers, with the greatest
expansion coming from Asia and Latin America.
•
It is further estimated that, by that same year, at least
one-third of Web users will prefer to conduct their online
activities in a language other than English
• Some forecasters even predict that, by 2007, Chinese will
be the primary language used on the World Wide Web.
Internet Statistics: Distribution of languages on the Internet 2002
Chart of Web content (milions of webpages by language)
• English
1142,5 56,4%
• German
156,2 7,7%
• French
113,1 5,6%
• Japanese
98,3
4,9%
• Languages used to access Google
• Spanish
59,9
3,0%
in January 2002 (vs 2001)
• Chinese
48,2
2,4%
English 57% (64%)
• Italian
41,1
2,0%
German 12% (9%)
Japanish 7% (8%)
• Dutch
38,8
1,9%
Spanish 6% (5%)
• Russian
33,7
1,7%
French 5%
(4%)
• Korean
30,8
1,5%
Chinese 3% (1%)
• Portuguese 29,4
1,5%
Italian 2%
(2%)
• Swedish
15,1
0,7%
Other 8%
(4%
• Polish
14,8
0,7%
• Danish
12,3
0,6%
• Czech
11,5
0,6%
• Turkish
4,9
0,2%
• Hungarian
4,1
0,2%
• Greek
2,0
0,1%
• Other
168,0 8,3%
•
Total Web pages
2024,7
• 2005 English has dropped to 30%
Sviluppo futuro
Il web è multilingue
• A livello degli utenti
• A livello dei contenuti
• L’inglese lingua franca nel mondo globale
non ha cancellato la diversità linguistica
• Ma l’inglese è la lingua dell’universo di
riferimento globale
US / GLOBAL
http://www.google.com/intl/en/press/zeitgeist2008/
GB
• L’italiano non è lingua globale
• l’italiano è presente nell’universo globale della rete
• le parole chiave dell’italiano sono relative
– ad un universo culturale
– ad un insieme di usi funzionali
• per garantire il ruolo globale di una lingua e di una cultura
– è necessario garantire accessibilità ai suoi contenuti di rete
– è necessario rafforzare la possibilità di fruizione e utilizzo
– garantire gli strumenti per il suo apprendimento
• L’italiano ha una vasta presenza di comunità semiitaliofone nel mondo
• L’italiano ha un patrimonio culturale attuale e una eredità
culturale di valore globale
Progetto RIDIRE.it
(Risorsa Dinamica Italiana di Rete)
Progetto FIRB
• repository dei contenuti della rete più
rappresentativi per la cultura italiana
– accesso e disseminazione della cultura italiana in rete
• integrato da strumenti di computazione
dell’informazione linguistica in rete
– sfruttamento completo delle potenzialità dei grandi
corpora per il consolidamento del possesso della lingua
italiana.
– accesso selettivo alla fraseologia italiana
risorsa dinamica di rete vs corpus di riferimento
statico.
• la risorsa dinamica non necessita la determinazione
preteoretica del peso di ciascun dominio in un sampling
• dato un insieme comunque vasto di domini rappresentati, la
risorsa dinamica consentirà l’estrazione di un numero
illimitato di corpora con bilanciamenti o selezioni diverse, a
seconda delle esigenze dell’utente
• A) domini nei quali la lingua si caratterizza per scelte legate
al suo uso funzionale.
All’interno di ciascun “dominio funzionale” compaiono documenti
appartenenti a qualsiasi dominio semantico possible
• Informazione
• economia e affari
• amministrazione e legislazione
• B) domini semantici che identificano i campi dell’eccellenza
italiana nel mondo:
•
•
•
•
•
•
•
•
•
Letteratura
moda
design-architettura
cucina
sport
religione
arti figurative
cinema
musica
RIDIRE contiene i dati relative allo
specifico uso linguistico italiano in tutti I
domini
Costituirà la sorgente d’informazione
necessaria per consolidare le capacità di
effettivo utilizzo dell’italiano da parte degli
apprendenti.
•
L’infrastruttura, promossa, mantenuta e diffusa dalla Società
internazionale di linguistica e filologia italiana (SILFI) è rivolta a
professori di lingua e cultura italiana nel mondo e a tutti i soggetti che
vogliano potenziare le loro capacità nell’uso linguistico italiano per motivi
di
– formazione (studenti)
– consolidamento delle proprie radici identitarie (emigrati di seconda e terza
generazione)
– lavoro e affari (cittadini di paesi in zone di espansione dell’italiano).
•
RIDIRE.it garantirà l’accesso a contenuti che rappresentano l’uso
linguistico italiano sia dal punto di vista funzionale sia dal punto di vista
dell’eccellenza italiana nel mondo.
•
fornirà strumenti di estrazione dell'informazione linguistica che
consentono l’apprezzamento dell’uso linguistico italiano nei domini
rappresentati.
•
• portale con un search engine selettivo per
dominio sui contenuti italiani in rete
Uso e fruizione dell’italiano nel
contesto culturale globale del
web

Scaricare

lezioni prof. Moneglia

Scarica questa sequenza di foto in formato pps

Presentation Slides - ACORN Aston Corpus Network

Sistemi di pensiero - Sentieri della mente

Le tecnologie del linguaggio umano incontrano la lingua di internet

Professione s. Diletta

Umbrellas in action - Homepage di Roberto Bin

Large linguistically-processed Web corpora for multiple languages

Che le logiche commerciali siano diventate, di fatto, le linee guida

Fabio Manganiello

Handout