Digital Italian
An overview of Italian corpora
A linguistic corpus:
a body of texts / transcripts collected for
linguistic purposes,
computerized,
representative of the variety studied,
balanced,
annotated.
Annotation
Linguistic annotation: can be useful or restrictive
Extra-linguistic annotation: useful for sociolinguistic research
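The two kinds of annotation can be pictured as two layers of data attached to the same text. The following is a minimal sketch, not the format of any corpus named in these slides: the `Token` and `Utterance` classes, the `lemma_frequencies` helper, the metadata keys, and the two sample utterances are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """One word with its linguistic annotation (hypothetical layout)."""
    form: str   # the word as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag


@dataclass
class Utterance:
    """A stretch of text plus extra-linguistic annotation about its source."""
    tokens: List[Token]
    metadata: Dict[str, str] = field(default_factory=dict)


def lemma_frequencies(utterances: List[Utterance], **filters: str) -> Counter:
    """Count lemmas, keeping only utterances whose metadata match all filters."""
    counts: Counter = Counter()
    for utt in utterances:
        if all(utt.metadata.get(key) == value for key, value in filters.items()):
            counts.update(tok.lemma for tok in utt.tokens)
    return counts


# Invented sample data, for illustration only.
corpus = [
    Utterance(
        [Token("vado", "andare", "VERB"), Token("a", "a", "ADP"), Token("casa", "casa", "NOUN")],
        {"region": "north", "age_group": "18-30"},
    ),
    Utterance(
        [Token("andiamo", "andare", "VERB"), Token("a", "a", "ADP"), Token("casa", "casa", "NOUN")],
        {"region": "south", "age_group": "31-50"},
    ),
]

all_counts = lemma_frequencies(corpus)
north_counts = lemma_frequencies(corpus, region="north")
```

The lemma layer lets "vado" and "andiamo" be counted as the same verb, while the metadata layer lets the count be restricted to one group of speakers, which is what sociolinguistic queries rely on.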
Italian corpora
General / Specialized
Written / Spoken
Diachronic / Synchronic
General corpora: Written Italian
Corpus e lessico di frequenza dell’italiano scritto (COLFIS)
Corpus di riferimento dell’italiano scritto / Corpus dinamico dell’italiano scritto (CORIS/CODIS)
COLFIS - structure
COLFIS (over three and a half million words)
Newspapers (Il Corriere della Sera, La Repubblica, La Stampa): economy, news of local interest, society, crime news, internal / external affairs, science, show biz and sports.
Periodicals: other, arts, science and technology, cars and boats, children and youngsters, home and hobby, women’s magazines, photo love story, general information, society, radio and television, sport, travels and ecology.
Books: other, arts, children, SF, detective and spy stories, hobby and travel, classics, modern narrative, romance, essays, natural and exact sciences, human and social sciences, theatre and poetry.
CORIS/CODIS – structure
CORIS / CODIS (one hundred million words)
Press: newspaper, periodical, supplement; national, local / specialist, non-specialist / connotated, non-connotated.
Fiction: novels, short stories; Italian, foreign / for adults, for children / crime, adventure, SF, women's literature.
Academic Prose: human sciences, natural sciences, physics, experimental sciences; books, reviews / scientific, popular / history, philosophy, arts, literary criticism, law, economy, biology, etc.
Legal and Administrative Prose: legal, bureaucratic, administrative; books, reviews.
Miscellanea: books on religion, travel, cookery, hobbies, etc.; books, reviews.
Ephemera: letters, leaflets, instructions; private, public / printed form, electronic form.
General corpora: Spoken Italian
Lessico di frequenza dell’italiano parlato (LIP) -> Banca dati dell’italiano parlato (BADIP).
Archivio delle varietà dell’italiano parlato (AVIP).
LABLITA
Spoken and written Italian:
Corpora e lessici dell’italiano parlato e scritto (CLIPS)
CLIPS (the spoken corpus)
Radio and television speech: entertainment, informative programmes, cultural and educational programmes, commercials.
Field recordings: map task dialogues and spot-the-difference game.
Readings: readings by the speakers themselves or by professional dubbing actors.
Telephone speech: conversations between a fake tour operator and three hundred people.
Specialized corpora
Corpus di italiano televisivo (CIT)
La Repubblica
CIT – structure
CIT
Current affairs
Studio broadcast. On-field broadcast.
Entertainment (games, talk shows, variety shows)
Commercials
Text
Text. Slogans.
Sports news
Commentaries. Play-by-play.
Studio broadcast. On-field broadcast.
Text
Newscast
Headlines. Studio broadcast. On-field broadcast.
Corpus di italiano televisivo
La Repubblica – structure
La Repubblica
Year: 1985 - 2000
Genre: News, Comment
Topic: Religion, Culture, Economics, Education, News, Politics, Science, Society, Sport, Weather, Unclassified
La Repubblica
Thank you!
Anne-Marie OBRETIN
MRes in European Languages and Cultures
University of Exeter
Presentation Slides - ACORN Aston Corpus Network