_________________________________________________________________
‘‘Until the phenomena of any
branch of knowledge have been
submitted to measurement and
number, it cannot assume the
dignity of a science.’’
(Sir Francis Galton, 1822 −
1911)
__________________________________________________________________
Contents
Acknowledgements
List of Abbreviations
Part I
History, Criteria and Research
1 Introduction
2 History of Corpora
2.1 Early Corpus Linguistics
2.2 The Chomskyian Revolution
2.3 Modern Corpus Linguistics
3 Corpus-Based Research
3.1 Definitions
3.2 Some Arguments in Favour of Corpus-Based Research
3.3 Corpus Outline and Creation
3.3.1 Synchronic Corpus Design Criteria
3.3.2 Diachronic Corpus Design Criteria
3.4 Kinds of Corpora
3.5 Tools for Corpus Exploitation
3.5.1 Concordancers
3.5.2 Frequency Tables
3.5.3 Taggers
3.5.4 Parsers
3.5.5 Ready-Available Tools versus Own Programming
3.6 Corpus Networks
Part II
Applications
4 Applications of Corpora
4.1 The Use of Corpora in Linguistics
4.1.1 Corpora and Grammar
4.1.2 Corpora and Lexicography/Terminology
4.1.3 Corpora and Morphology
4.1.4 Corpora and Semantics
4.1.5 Corpora and Pragmatics
4.1.6 Corpora, Stylistics and Discourse Studies
4.1.7 Corpora, Language Teaching and Learning
4.1.8 Corpora and Ethnolinguistics
4.2 The Use of Corpora in Translation
4.2.1 Parallel, Multilingual and Comparable Corpora
4.2.2 Machine Translation
4.2.3 Translation Memory Systems
4.2.4 Corpora vs. Termbanks
4.2.5 Translation Teaching and Translation Research
4.2.6 Thinking Globally - Acting Locally
4.2.7 Critical Comments
4.2.8 Conclusions
5 Case Study
5.1 The Problem
5.2 Formulation of the Hypothesis
5.3 Selecting the Corpus
5.4 Choosing the Tools
5.5 Summarising the Restrictions
5.6 The Study
5.6.1 Synonymy
5.6.2 Cacophony
5.7 Conclusions
Part III
Conclusion and outlook
6 Drawing Conclusions
Appendices
1 Glossary
2 Major Corpora Available
3 Software Available for Corpus-Based Research
4 Results of a Collocation Search of Tra and Fra
Bibliography
Index
_________________________________________________________________
Acknowledgements
I am particularly grateful to my parents, who have been an invaluable
support throughout all these years of "foreignness" in Graz.
Above all, I would like to thank my supervisor, Dr. Ursula Stachl-Peier.
Ursula, thank you for being not only a brilliant linguist, but also a
wonderful friend, and for having made this thesis less inadequate than it
still remains.
This thesis reflects the precious work and commitment of numerous
corpus researchers throughout the years. It is thanks to these far-sighted
linguists that we are nowadays faced with the expanding universe of
corpus studies. It is to them that I shall dedicate this work of mine.
_________________________________________________________________
List of Abbreviations
AI              Artificial intelligence
ARCHER          American Representative Corpus of Historical English Registers
BNC             British National Corpus
Brown           Brown Corpus of Standard Written American English
CALL            Computer-assisted language learning
CAT             Computer-aided translation
CRATER          Corpus Resources and Terminology Extraction
CSAE            Corpus of Spoken American English
GPEC            Guangzhou Petroleum English Corpus
ICAME           International Computer Archive of Modern English
ICE             International Corpus of English
ICLE            International Corpus of Learner English
Lancaster/IBM   Lancaster/IBM Spoken English Corpus
LIP             Lessico di frequenza dell'italiano parlato
LLC             London-Lund Corpus
LOB             Lancaster-Oslo/Bergen Corpus
LSP             Language for specific purposes
MT              Machine translation
OCP             Oxford Concordancing Program
POS             Part-of-speech
SEC             IBM-Lancaster Spoken English Corpus
SEU             Survey of English Usage Corpus
_________________________________________________________________
PART I
HISTORY, CRITERIA AND RESEARCH
_________________________________________________________________
1 Introduction
Over the last 40 years linguistic research has undergone major
changes. While many have deplored this, arguing that it has led to a lack
of focus and to inconsistency, others (e.g. Svartvik 1990) have pointed out
that it has greatly contributed to academic cross-fertilisation and the
development of new approaches, which will hopefully help us to better
understand the intricacies of human language processing. In the wake of
the new insights associated with the 1950s and Chomsky, new ways of
analysing language were pioneered, while older approaches were virtually
abandoned. Among the approaches dismissed as unscientific or
inappropriate was the corpus-based approach to language, which up to then
had been the main way of gathering language data. Corpus linguistics, as
this discipline is generally called, became neglected, but it by no means
disappeared.
Nowadays, the usefulness of corpora¹ is being rediscovered and they
are proving an excellent resource for a wide range of research tasks, not
only because they give evidence of current language usage, but also
because they permit us to compare synchronic and diachronic shifts
within a language and so become the foundation of analysis.
¹ The word ‘corpus’ derives from Latin and means ‘body’, a body of texts. Any collection of more
than one text can therefore be defined as a ‘corpus’, which seems a simple enough notion. In
linguistics, however, it is slightly more complicated, as various criteria of inclusion have to be
taken into consideration when a corpus is compiled (see Section 3.1).
One of the aims of my thesis is to provide an overview of the many
possible applications that corpora can have in language and translation
studies.
When I set out to write this thesis I also had another goal in mind - a
slightly subtler one. What I wanted to do was to help bring about a change
in well-established views of the role of teachers and learners. Nowadays
students are more and more often asked to organise and manage their
own learning: they are given freedom of choice about which subjects to
study, but most of the time they are not told how to go about it. The
teacher is no longer the sole provider of knowledge, and s/he often falls
victim to sheer economic constraints. This new setting turns out to be a
very big challenge, a challenge to the student and the teacher alike. I
strongly believe that a corpus-based approach to language and
translation teaching is a tool which will allow us to keep up with the times
as it ensures easy and immediate access to empirical language data.
Students are encouraged to deduce rules from naturally occurring
evidence and no longer rely solely on introspection-derived data, whether
presented by their teachers or produced by themselves.
This thesis consists of three main parts. After this short introduction,
Part One continues with Chapter 2, which deals with the history of
linguistic research based on corpus evidence, from the early 1950s through
the Chomskyian revolution to the present. Chapter 3 discusses the main
elements of corpus-based research, exploring issues such as corpus
outline, corpus creation, the different kinds of corpora and the tools
needed to exploit a corpus, and finally focusing on the possibility of
establishing a computer network for corpus studies. In Part Two,
Chapter 4 places the emphasis on the numerous applications of
corpus-based research, discussing possible areas of research in both
linguistics and translation studies. The case study (Chapter 5) - which
completes Part Two - introduces an example of corpus exploitation which
might be of relevance to linguists as well as to translators.
Conclusions are then drawn in Part Three. Section 6.1 suggests
possible topics for further research for translation students, while Section
6.2 focuses on teaching corpora and describes course design issues.
This thesis aims to be above all a basic introduction to corpus-based
research for (translation) students and teachers with no or very limited
knowledge of the area. The focus is on possible applications in a variety
of domains, including grammar, lexicography, morphology, semantics,
pragmatics, stylistics, ethnolinguistics and language teaching. It is hoped
that by showing the potential of corpus-based research, more teachers
and students will be motivated to try out similar studies themselves.
The same intention guides the Case Study, which tries to show very
faithfully the kind of problems that may emerge during a project, and
where solutions might be found.
This thesis will not discuss in any detail the more technical,
computational issues, such as quantitative data analysis, factor analysis,
multidimensional scaling, cluster analysis or chi-square tests. Suggestions
for further reading on these subjects are included in most of the standard
books on computational linguistics (e.g. McEnery (1992), Souter and
Atwell (1993)).
_________________________________________________________________
2 History of Corpora
The main aim of Chapter 2 is to give a brief overview of the history of
evidence-based language analysis.² I shall first describe early approaches
to corpus linguistics, and then - in Section 2.2 - focus on Noam Chomsky,
who is arguably the most influential person in 20th-century linguistics. I
shall discuss his objections to corpus linguistics and arguments in favour
of rationalism. In 2.3 I shall concentrate on more recent trends in
linguistics and the revival of corpus linguistics as one of the most valid
tools for language modelling.
2.1 Early Corpus Linguistics
‘Early Corpus Linguistics’ is not a canonised definition of a fixed period
of time. The term was first used by McEnery (1996) to define all work
carried out in the various areas of linguistics before the methodology
advocated by Noam Chomsky became predominant.
‘Corpus linguistics’ has actually been mainstream for a long time,
although it was initially discussed under a variety of names and labels.
Different branches of linguistics, such as field linguistics and the
post-Bloomfieldian³ structuralist tradition of Harris and Hill, based their
studies on a methodology which nowadays might very well be called
‘corpus-based’.
² The information included in this Chapter is based on accounts in several textbooks and on
internet sites (Chomsky 1957, 1962, 1965; Cravetto et al. 1997; McEnery and Wilson 1996 as
well as http://caseyd.meer.net/dj/Chomsky/Chomsky.html). Exact page references are only given
when I quote directly from a source.
³ See Appendix 1.
In the 1950s then, Chomsky brought about a revolution which favoured
an approach which was based on data derived from native speaker
introspection and totally excluded corpus evidence. As we will see later in
this chapter, his doctrine put a spoke in the still fragile wheel of corpus
linguistics.
Before the 1950s several corpus-based studies had been carried out,
not only in the field of language acquisition - maybe the most obvious
application of corpus-based research - but also in the areas of language
pedagogy (Fries and Traver 1940), spelling conventions (Käding 1897),
comparative linguistics (Eaton 1940), syntax (Fries 1952), semantics
(Lorge 1949) and lexis (Gougenheim et al. 1956). Language acquisition
studies first applied the methodology of corpus-based description in the
first half of the 20th century. Research focused on child language on the
basis of parental diaries recording dialogues and utterances. Generally,
however, these were simply a collection of transcribed interactions which
rarely complied with even the basic principles of corpus studies, such as
representativeness. This material was still being used in the late 1970s by
Chomskyians as a source of normative data, which in fact made this field
of research one of the few areas that successfully continued throughout
the Chomsky-dominated period and so represented a continuum in
corpus linguistics. It also laid the foundations for a new generation of
corpus linguists like Svartvik and Leech, whose early career was in
language acquisition before they started to apply corpus-based
methodologies to other domains.
Corpus evidence was first used to establish spelling conventions by
Käding in 1897, when he put together an impressive 11-million-word
corpus of German texts to analyse and correlate the frequency of
letters and letter sequences in the German language (for details see
Käding 1897).
Comparative linguists also included corpora in their studies. One of
them was Eaton, who in 1940 compared the frequency of word meanings
in German, Italian, French and Dutch. The exploitation of the corpus
allowed him to derive useful information about semantic links across
languages and communication situations. The modernity of such a study
is proved by the fact that only in the second half of the 1990s were
McEnery and Oakes able to create a corpus which was large enough to
produce similar results (McEnery and Wilson 1996:3).
Arguably the two most important early fields of application of corpus
studies are syntax and semantics. In 1952 Fries created a corpus of
transcribed telephone conversations which he then transformed into a
descriptive grammar of English. His pioneering work provided a model for
the one developed by Quirk et al. in 1985, more than 30 years later (see
Fries 1952).
Semantic studies were also carried out for other European languages.
French was analysed in detail by Gougenheim et al. (1956), who
transcribed a corpus of spoken language gathered from 275 informants
with the aim of describing high frequency lexical and grammatical choices
(for further details see McEnery and Wilson 1996:4).
2.2 The Chomskyian Revolution
Between 1957 and 1965 Noam Chomsky published three books which
brought about a radical change in linguistics (see Chomsky 1957, 1962,
1965). Chomsky’s criticism of behaviourist⁴ approaches initiated a new
kind of research which was founded on rationalism and focused on
introspective judgement rather than external data analysis. Introspection,
Chomskyians argued, was quicker and more reliable (for further
comments see http://listserv.acsu.buffalo.edu/cgi-bin/wa?S1=anthro-l).
In some ways, Chomsky’s success was also the result of a certain
degree of arrogance on the part of early computer linguists who believed
that corpora were collections of all potential utterances. They
consequently argued that corpora were the only valid method of
describing language and therefore its “primary explicandum” (Leech
1991:8).
Chomsky strongly objected to this view of language as a finite medium.
He pointed out that language structures vary in line with personal style
preferences, generic constraints or in response to contextual (situational)
needs, that neologisms are constantly added to enrich the lexis of
languages and meanings also change. Corpora - which are finite, and
according to Chomsky “skewed” – therefore were unsuitable as models
for language:
“Any natural corpus will be skewed. Some sentences won’t occur
because they are obvious, others because they are false, still others
because they are impolite. The corpus, if natural, will be so wildly
skewed that the description [based upon it] would be no more than a
mere list.” (Chomsky 1962:159)
One of the main tenets of Chomsky was that linguistic theories should
be cognitively plausible and able to simulate and recreate natural
language processing.
Chomsky therefore stressed the need for new analytical tools, which - he
argued - would have to focus on competence instead of performance
(Chomsky 1962). Competence is introspection. Introspection is the
knowledge of a language that we derive from our own experience.
Chomsky did not actually deny the importance of performance, but he
was convinced that it was competence that both explains and
characterises a language. For his language model Chomsky therefore
discarded performance data, as these were considered to be too weak to
mirror the linguistic behaviour of a language community, and were
influenced by too many language-independent factors, including physical
shape of the speaker, his/her moral principles, etc.
One of the major drawbacks I see in Chomsky’s approach is that
language rules risk becoming (or remaining) the domain of influence of
those in power. Despite Chomsky’s professed reluctance to prescribe and
repeated claims in all textbooks influenced by Generative Grammar that all
the versions produced by native speakers are equally acceptable,
decisions on what ‘standard’ language is continue to be the prerogative of
the educated elite.
⁴ See Appendix 1.
Another critical issue in Chomsky’s theory, I believe, is his assertion
that only data derived through introspection will not be skewed. To me,
introspection-based data are themselves a kind of evidence and equally
skewed because no native speaker will ever produce the full range of
utterances possible in a language. They may perhaps be a weaker
variation of empiricism, as we do not yet have a methodology available to
retrieve and classify uniformly such personal references.
Another point is that introspection, although it can be recorded, is often
left unspoken. Recordings can easily be analysed, yet thought processes
remain unobservable because they cannot be shared with other people.
Corpus evidence, on the other hand, is publicly available and can
therefore be commented on by all. Even if we try to ignore the fact that
any kind of recording is a corpus, it can still be argued that modelling and
- as a direct consequence - identifying the rules of the language used by a
certain language community must be an endeavour shared by the entire
community and not only an effort engaged in by a linguistic enclave.
Again, this touches on linguistic empowerment. Competence and
performance recognise in their conceptions of language analysis different
linguistic ‘leaders’. While introspection focuses on the individual, trying to
provide him/her with the necessary tools for language analysis, corpus
linguistics makes use of already existing tools to draw conclusions from
naturally occurring data.
The Chomskyian revolution had far-reaching consequences, not simply
in linguistics. His emphasis on cognitive plausibility encouraged
computational linguistics to build systems which would simulate human
intelligence and carry out intelligent tasks.
2.3 Modern Corpus Linguistics
Despite Chomsky’s success, corpus-based work continued throughout
the 1950s and 1960s, especially in those fields where introspection failed
to achieve satisfactory results. Phonetics and language acquisition were a
case in point. It suddenly became obvious that introspection - especially in
child language acquisition - can only be applied once metalinguistic
awareness has been developed, in other words we can apply competence
to language modelling only when we are aware of being linguistically
competent.
Different corpus linguistics projects were started. Between 1959 and
1961 Randolph Quirk began working on his Survey of English Usage
(SEU) Corpus. Very shortly afterwards Nelson Francis and Henry Kučera
from Brown University in Providence (Rhode Island, United States) set out
to put together the Brown Corpus, a sample of printed American English,
which is still considered the standard reference for language enquiries. In
1975 Jan Svartvik and his team at Lund University began to transcribe -
and so to render machine-readable - the spoken part of the SEU corpus.
The advent of the computerised corpus, that is a collection of
machine-readable texts, is indeed a major turning point in corpus
linguistics. The availability of institutional and private computing
facilities fuelled the growth of corpora, which from 1965 onwards started
to grow in size and number: the largest corpus available nowadays is
the Bank of English corpus, a monitor corpus created at the University of
Birmingham in collaboration with Collins COBUILD, which includes more
than 200 million words of British English and is constantly being added to.
In recent years, a new trend has started which promises exciting new
opportunities. Corpus researchers like McEnery and Wilson have realised
that artificial data - collected via introspection - can have a place in corpus
linguistics, albeit with the proviso that corpus evidence will “act as a
control, a yardstick” (1996:16). Corpus linguistics - as McEnery and
Wilson (ibid.) put it - should be a synthesis of introspection and
performance analysis, a mix of artificial and natural observation. Fillmore
sums up this symbiosis very well:
“I don’t think there can be any corpora, however large, that contain
information about all of the areas of English lexicon and grammar that
I want to explore… [but] every corpus I have had the chance to
examine, however small, has taught me facts I couldn’t imagine
finding out any other way. My conclusion is that the two types of
linguists need one another.” (Fillmore 1992:35)
_________________________________________________________________
3 Corpus-Based Research
This Chapter focuses on corpus-based research and provides a more
detailed description of the methodologies used. Section 3.1 gives a
definition of ‘corpus’ and ‘corpus-based research’; in Section 3.2 I shall
outline some of the strengths of a corpus-based approach which have
been cited in the literature to prove its validity. Before giving a detailed
description of the different kinds of corpora in Section 3.4, I shall delineate
in Section 3.3 the main points to be considered when building a corpus.
The tools needed for corpus exploitation are explored in Section 3.5, while
Section 3.6 outlines criteria for building a computerised infrastructure for
corpus-based work.
3.1 Definitions
In Chapter 1 I already defined a ‘corpus’ as any collection of more than
one text. In modern linguistics, however, this collection of texts must fulfil
certain criteria to be considered a corpus. As stated by McEnery and
Wilson (1996:21), a ‘corpus’ must display four main features:
♦ representativeness
♦ finite size
♦ machine-readable form
♦ status as a standard reference.
Representativeness
Representativeness is a major point. There are basically two ways of
collecting data: either you record every single utterance of a specific
language variety, or you build a sample of the entire population of texts
that you want to analyse. As already pointed out in Chapter 2, a living
language constantly grows and changes, which means that its lexical and
syntactic structures are in theory infinite. The first approach is therefore
impossible to implement.
Generally, therefore, corpus linguists will opt for the second
methodology. However, sampling also has its pitfalls (see Noam
Chomsky’s criticism of corpora being “skewed” in the previous chapter). When
compiling a corpus, we are influenced by many factors (e.g. availability in
electronic form, easy retrieval, ready-made text collections) that
automatically - and sometimes unconsciously - determine the range of
texts from which the corpus will be sampled. Representativeness therefore
can never be totally objective.
While this may be a major drawback, I believe that sampling is still a
legitimate approach, provided we are aware that samples can never
reproduce a language variety completely accurately and faithfully, and
provided we ensure that the collected corpus is balanced. Biber (1993b)
has outlined a number of steps to produce an appropriately balanced
corpus: before starting to build a corpus, clearly state the aim of the study,
specify the linguistic variety to be analysed, and indicate what he
(ibid.:243) calls the “sampling frame” - the entire population of texts from
which samples are taken. Samples must ‘average out’ and provide a
reasonably accurate picture of the entire language population.
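The idea of drawing samples from a sampling frame can be illustrated with a short sketch. It is not Biber's procedure: it simply assumes that every text in the frame has been assigned to a category and that a "balanced" corpus takes the same number of texts from each category. The function name `sample_balanced`, the text identifiers and the three categories are all hypothetical.

```python
import random
from collections import defaultdict

def sample_balanced(frame, per_category, seed=0):
    """Draw up to `per_category` texts from each category of the frame."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_category = defaultdict(list)
    for text_id, category in frame:
        by_category[category].append(text_id)
    sample = {}
    for category in sorted(by_category):
        ids = by_category[category]
        k = min(per_category, len(ids))  # a small category contributes all it has
        sample[category] = sorted(rng.sample(ids, k))
    return sample

# Hypothetical sampling frame: (text identifier, text category) pairs.
frame = [
    ("press_01", "press"), ("press_02", "press"), ("press_03", "press"),
    ("fiction_01", "fiction"), ("fiction_02", "fiction"),
    ("academic_01", "academic"), ("academic_02", "academic"),
    ("academic_03", "academic"), ("academic_04", "academic"),
]

balanced = sample_balanced(frame, per_category=2)
for category, ids in balanced.items():
    print(category, ids)
```

Equal allocation per category is only one possible design choice; a frame could just as well be sampled in proportion to the size of each category, depending on the stated aim of the study.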
Finite Size
The second feature mentioned is the size of the corpus, which should
be finite. Not all corpora are finite, however: ‘monitor corpora’, such as
John Sinclair’s Bank of English, are open-ended collections. Texts are
constantly added to the corpus in order to update the material already
collected and so produce reasonably exhaustive samples of language
use.
Determining the size of a corpus is one of the most difficult tasks in
corpus creation. In order to facilitate this task, computational linguists
have elaborated algorithms which are able to approximately quantify
variables such as chance and significance.⁵ When the total number of
words is reached, collection stops and the corpus is thereafter not
increased in size. Apart from monitor corpora, the only exception to this
principle is represented by the London-Lund Corpus (LLC), which was
enlarged in the mid-1970s by Sidney Greenbaum in order to cover a wider
variety of genres.
Machine-Readable Form
A corpus also has to be available in machine-readable form. As we saw
in Chapter 2, an essential difference between early and modern corpus
linguistics is the ready availability of microcomputers. Before the advent of
computerised data processing, corpus exploitation was a very long,
expensive and error-prone procedure: just think of Käding’s
11-million-word corpus and the 5,000 Prussian analysts he needed to go
through the corpus. Svartvik was one of the first linguists who applied the
principle of machine-readability to data collection by phonetically
transcribing the spoken texts of Quirk’s SEU. Incidentally, the LLC is one
of the few corpora still available in book format.
Although machine-readability is nowadays to be considered as
absolutely necessary, there are still some exceptions that need to be
mentioned. A complete concordance of the Lancaster-Oslo/Bergen
Corpus (LOB) is available only on microfiche, while some other spoken
corpora such as the Lancaster/IBM Spoken English Corpus offer copies
taped for phonetic analysis.
The advantages of machine-readable corpora can be summed up under
the following three main headings:
♦ thanks to corpus exploitation tools - e.g. concordancers, frequency
listers, parsers - data can be searched and manipulated easily and
time-effectively, which simplifies the analysis of results;
♦ they can easily be enriched by adding information about grammar and
lexis⁶;
♦ they can be made available to researchers within a couple of minutes
via Internet connections.
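To give a rough impression of what the first of these advantages means in practice, the following minimal sketch builds a frequency list and a simple keyword-in-context concordance from a machine-readable text. It is an illustration only, not a description of any existing concordancer: the function names, the crude tokenisation and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (a crude approximation)."""
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top)

def concordance(tokens, keyword, width=3):
    """Return each occurrence of `keyword` with `width` tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Invented sample text standing in for a machine-readable corpus.
text = "The corpus is finite. A corpus is a body of texts."
tokens = tokenize(text)

print(frequency_list(tokens, top=3))
for line in concordance(tokens, "corpus"):
    print(line)
```

Real corpus tools add much more (lemmatisation, sorting of context, handling of annotation), but the principle of searching and counting over machine-readable text is the same.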
Standard Reference
The fourth requirement is that a corpus should also be a standard
reference. Sharing a collection of texts with the rest of the research
community can turn an appropriately designed corpus into a yardstick
for language modelling, which can then also be used for later research
projects. A further advantage is that using a single source of linguistic
information makes it easier to compare different studies, because the
opinions expressed can be judged exclusively on the basis of the claims
made by the scholar who carried out the analysis.
3.2 Some Arguments in Favour of Corpus-Based Research
The literature cites numerous arguments in favour of corpus-based
research. Perhaps the greatest advantage of corpus linguistics over other
approaches that I can see is that corpus linguistics is not restricted either
to theory or just to practice, but that it combines both. It makes available
the methodology that is required to carry out studies into language usage,
but not without also insisting that the empirical data are included in an
overall theoretical description.
In the following, I shall quote a few more arguments that have been
used in favour of corpus studies.
⁵ For further reading on quantitative data analysis see McEnery and Wilson 1996, pp. 66-86.
⁶ For further information about corpus annotation see tagged corpora (Section 3.4).
Corpus-based studies vs. introspection-derived data
First and maybe most importantly, a corpus-based approach provides
naturally recorded, linguistically comprehensive examples. We often
explain a phenomenon or a grammar rule by means of introspectively
created examples. Although we are convinced of their validity, we have to
admit that frequently the examples we produce are either clichés or rather
idiosyncratic. Evidently, we need proof from natural language use: in
grammar teaching there is no point in analysing a language variety that
either does not exist in reality, or is considered a sort of sublanguage used
by a closed circle of language users (e.g. prototypical examples used
during a language learning class).
If we use a corpus, then this corpus might of course also contain such
prototypical or idiosyncratic examples, but, because of the greater
representativeness of the corpus, these examples form part of the
knowledge of a wider linguistic community, and therefore must be
accepted as commonly shared rules.
Corpora as Material for Inductive Learning
Possibly the biggest advantage of the corpus-based approach⁷ is that it
allows inductive learning and is always learner-centred. In the final
analysis it is the learner that decides what s/he wants to focus on, what
s/he wants to learn, how s/he wants to acquire knowledge or skills and at
which pace. The learner can therefore exploit the corpus for his/her own
purposes, which may indeed vary between learners. (Incidentally, corpora
have also been successfully used in teacher training; see Renouf 1997).
Corpora and Reusability
Another important characteristic of the corpus-based approach is the
reusability of linguistic resources. We have already mentioned among the
four main features of a corpus that it is a standard reference. Public
availability and source reusability are, in my opinion, closely linked.
Together they assure coherence, a much appreciated quality in linguistics.
A coherent approach to language study permits project comparison and
contributes to a global analysis of language use.
Corpora and Interdisciplinarity
Closely linked to the issue of reusability is the issue of interdisciplinarity.
Various linguistic fields can all exploit the same corpus to conduct stylistic,
syntactical and lexical studies. The results can then be used as the basis
of cross-cultural studies.
Corpora and Flexibility
Corpora enable all kinds of studies. They can for instance be provided
with extra tags which are added after every word to describe its status.
This is the so-called ‘annotated corpus’, which McEnery and Wilson
(1996:24) call “a repository of linguistic information”, because it makes
explicit what in the plain text was still implicit (for further details see
Section 3.4).
Corpora and Negative Results
Another major advantage of corpus analyses is that even a negative
result is an analysable result.
Corpora and Specificity
Yet another advantage of a corpus-based approach is its specificity: the
choice of texts to be included in the corpus and the design criteria applied
can reflect a specific attitude to language analysis, which means that we
can modify not only the methodology, but also the goal. Different
languages or language varieties require different analytical standards or
approaches. The corpus can be built in respect of these standards and
therefore become the only valid tool to analyse specific connotations of a
language or language variety (e.g. a sublanguage of a dialect).
⁷ The various possibilities offered (e.g. CALL) will be described in Part Two, Chapter 4, Section 2.
Corpora and Language Promotion
One - often unintended - outcome of corpus studies is that analyses of
a given language or language variety help to put this language on the
linguistic map, promoting both research into this variety and its use.
3.3 Corpus Outline and Creation
The validity of a study depends primarily on the sampled corpus. This
section aims to delineate some of the basic corpus design criteria involved
in corpus creation. I have mentioned before that corpus-based research is
diverse and extremely flexible, that it allows for a wide range of linguistic
and non-linguistic studies, all of which require the inclusion of special
features and therefore need to be sampled differently. Principally, the
corpus criteria discussed in this section are meant to describe the design
of corpora that will be exploited for linguistic purposes. A basic
distinction is made between synchronic and diachronic corpus design criteria.
3.3.1 Synchronic Corpus Design Criteria
The major text databanks available are synchronic corpora, that is they
describe the state of the language at a certain point in time. The samples
of texts making up the corpus generally comprise different language
varieties, all produced during the same period of time.
Representing (part of) a language is obviously a problematic task. It is
very difficult to determine the full extent of linguistic variations, or even all
the contextual variables that need to be covered in order to deliver a
complete language description. However, attention to certain features will
balance out imprecisions and ensure corpus representativeness. The
main issues of corpus design may then be summarised under 7 headings:
♦ target domain selection;
♦ sampling;
♦ diversity;
♦ size;
♦ comparability;
♦ distribution;
♦ other issues.
Target domain selection
The first step is to determine the purpose of the study, that is to select
the target domain. Target domain selection is extremely important when
building a corpus. It involves deciding which language variety to focus on,
choosing the register and, possibly, limiting the claims of the research
project. We cannot proceed to sampling without knowing precisely what
we are actually looking for.
Sampling
Once we have decided what we are going to study, we can start to
sample our corpus. Two major approaches can be opted for: proportional
sampling and stratified sampling. To sample a corpus proportionally
means to find a group of people and record all examples of language they
produce and receive - spoken, written or both, depending on the kind of
corpus we are compiling - over a certain period of time. You can then
proportionally divide the language varieties which your subjects have been
exposed to and then build the corpus on the basis of the data collected.
The drawback, as Biber (1998) states, is that proportional samples are
fairly homogeneous, and cannot normally be used for language variation
studies. A proportional corpus that aims to mirror the spoken language
used in everyday situations, for instance, is unlikely to include many
examples of more elevated registers. The samples would therefore display
very similar characteristics, which means that any model of language use
generated on the basis of these corpora would be wrong.
If we search for a sample of predetermined language variants that
describe a given language - or compare it to others -, we need the
stratified approach. A corpus constructed using a stratified approach
includes and categorises all varieties and registers of the language that
we have decided to analyse. Having catalogued and drawn samples from
all the different categories of text that occur in a given language, we can
then link the texts with the categories or sub-categories. This needs to be
specified in the description included in the general information about the
corpus (e.g. in text headers). A good example is the work of Stig Johansson,
who in 1978 gave an exhaustive account of the categories applied to the
compilation of the LOB corpus.
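The stratified approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category names, the toy population and the function `stratified_sample` are hypothetical and do not reproduce the LOB design.

```python
import random
from collections import defaultdict


def stratified_sample(texts, per_category, seed=0):
    """Draw the same number of texts from every predetermined category.

    `texts` is a list of (category, text_id) pairs; all names are hypothetical.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, text_id in texts:
        by_category[category].append(text_id)
    sample = {}
    for category, ids in sorted(by_category.items()):
        # Take at most `per_category` texts, never more than are available.
        sample[category] = rng.sample(ids, min(per_category, len(ids)))
    return sample


# Toy population: everyday conversation dominates, elevated registers are rare.
population = (
    [("conversation", f"conv{i}") for i in range(90)]
    + [("press", f"press{i}") for i in range(8)]
    + [("legal", f"legal{i}") for i in range(2)]
)
sample = stratified_sample(population, per_category=2)
```

Every category is represented by two texts, however rare it is in the population. A proportional sample of the same size would be likely to consist almost entirely of conversation.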
A further important aspect in sampling is the background of the
language user. Even if the text falls into a specific category and extra
information about the person who produced it may be considered of little
importance, there is sometimes a need for contextual knowledge. In a
corpus trying to describe a specific literary movement, for example, it is
crucial to know if the writer has always belonged to that particular
movement or if the text sample merely represents a period of his/her life
as an artist.
Sampling also involves copyright issues. Copyright is a major
impediment to encoding and storing modern literary and commercial
material, also for those who only want to compile corpora for their own
personal use. Not all sources respect copyright. Documents made
available through the Internet, for example, are often of uncertain status:
some need an explicit authorisation in order to be copied, others contain a
simple copyright disclaimer. Mailing lists - at least in the USA - are
assumed to be implicitly licensed for textual reproduction or
retransmission, while the use of anonymised extracts for study purposes
within an institution is considered to be ‘fair dealing’. Without such silent
agreement it would be difficult to retrieve written and spoken material in
machine-readable form. In other cases - such as the major corpora of
the English language - limited access to data sources is often allowed to
educational institutions, especially if the source itself is an educational
institution, or belongs to some public authority. If doubts exist, the data
owner will have to be asked for permission, even if this may often seem a
mere formality or a waste of time.
Diversity
The next design criterion I would like to touch upon is diversity.
Experience tells us that if we intend to study language use in general,
we must include as many variants as possible. There is no such thing as
‘general language’; instead there are many language varieties which differ
in the use of lexical, grammatical and discourse features. Furthermore,
each language variety includes different registers, and each register has
its own pattern of use.
To ensure diversity in a corpus, Biber (1998) suggests that two areas
need to be considered: register variation and subject matter. Firstly,
register variation must be represented appropriately. Speakers of a
language make use of different registers, depending on the person they
are talking or writing to. Including only some of these registers would
mean that an incorrect description of language use is produced, which
would invalidate the corpus. The second is subject matter. This is of major
interest for lexicographers, since the frequency of many words depends
on the theme of the interaction. These two issues are closely linked: for all
studies, in fact, you need a sample of a great range of subject matters
and, within each subject matter, of all different registers used.⁸
There is a third aspect, not mentioned by Biber, which refers to diversity
amongst language users rather than language use. From a linguistic point
of view, dialect and idiolect⁹ can also introduce diversity and should
therefore be considered.
⁸ This of course only applies to corpora which aim to reflect ‘general language’. For more
specialised studies, such as “Academic German”, diversity needs to be redefined.
Size
The fourth criterion listed above is size. Size means numbers: the
number of words included in the corpus, and also the number of texts from
the various text categories, the number of samples from each text, and the
number of words in each sample.¹⁰ The issue of size is important and
should be approached very carefully: if an unbalanced number of texts is
included, some text categories can have an undue influence on the results
of the analysis.
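Whether a text category carries undue weight can be checked by computing each category's share of all tokens. The sketch below is a minimal illustration: the function `category_shares`, the toy data and the whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter


def category_shares(samples):
    """Return each text category's share of all tokens in the corpus.

    `samples` is a list of (category, text) pairs. Tokens are split on
    whitespace, which is a simplification.
    """
    tokens = Counter()
    for category, text in samples:
        tokens[category] += len(text.split())
    total = sum(tokens.values())
    return {category: count / total for category, count in tokens.items()}


# Toy corpus: two press samples of four tokens, one fiction sample of two.
samples = [
    ("press", "a b c d"),
    ("press", "e f g h"),
    ("fiction", "i j"),
]
shares = category_shares(samples)
```

In this toy corpus press accounts for eight of the ten tokens, so any frequency count would mainly reflect press usage.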
Equally important is the choice of samples from each text. A text can
include more than one register or, more generally, different patterns of
language use. If a corpus does not include all features of the specific
pattern(s) analysed, it will misrepresent the linguistic category to be
sampled. Greenbaum (1991) gives an exhaustive account of the issues
involved when deciding on corpus size in his description of the
International Corpus of English (ICE) and outlines its importance for
representativeness. The ICE, for instance, includes a core corpus of
1,000,000 tokens, which should be mainly used for international
comparison. To this core corpus it is possible to add (parts of) a
specialised corpus (e.g. business letters, student essays, etc.), which is
(are) felt to be of value to researchers working in a particular region. A
third corpus which contains texts without specific categorisation can then
also be compiled. All three corpora together form the monitor corpus
which can then be used to analyse a given regional variety.
Comparability and Distribution
Comparability and distribution are two minor issues in corpus design,
that is they do not necessarily apply to all corpora. As far as comparability
is concerned, it is interesting to note that the design of a corpus is
sometimes subject to limitations which may - to some extent - conflict with
the goals set by the compiler. Tradition in corpus linguistics likes the
design of a new corpus to follow the pattern set by corpora already
compiled, so that the two can be easily compared, and this might not
always match the corpus creator’s needs.
⁹ See Halliday and Hasan 1985:41.
¹⁰ The International Corpus of English (ICE), the Brown Corpus of American English and the
LOB all agree that each text in the core corpus should contain about 2,000 words.
An example of the importance of comparability is the Corpus of Spoken
American English (CSAE), which was designed along the lines of the LLC.
This had two clear advantages: first, the corpus creator was able to
combine all that had been learnt from earlier experience with technological
innovations and new theoretical developments; secondly, the creation of
two comparable corpora allowed cross-cultural studies and also
cooperation in drawing up methodological and analytical frameworks.
Distribution involves the form in which the final product is published and
the question of which institution will distribute it. The most frequently
used channel of dissemination is the CD-ROM, principally because it is
small, light, easy to send, and has considerable research potential
because it delivers the corpus in machine-readable form. Some corpora,
such as the CSAE and the LLC, are still available as printed books. Other
channels include microfiche, tapes and the world wide web. In some
cases, an institution assumes responsibility for the distribution, as, for
example, the Norwegian Computing Centre for the Humanities at Bergen,
which is responsible for the distribution of both the International Computer
Archive of Modern English (ICAME) and the ICE.
Other issues
Other issues in corpus design include compilation and annotation.
Compilation involves some ‘strategic’ decisions. One is the question of
sampling method. The classic approach, for instance, implies scanning
text samples and then editing them. A further problem could then be the
processing of non-language material (e.g. mathematical formulae, symbolic
expressions, figures, diagrams, pseudocode, etc.). A way of solving
these problems is to confine text samples to the running
text and therefore omit such materials. Omissions can then be marked
by protocols (e.g. *EQ* for mathematical expressions, *FI* for figures, etc.).
Finally, proof-reading and cross-checking belong to corpus compilation as
well.
The second issue is annotation. It is up to the corpus builder to decide
whether or not s/he wants to encode the corpus. Some corpora (e.g. ICE
and LOB) have been compiled in two versions, one with and one without
annotations. With respect to discourse analysis, for example, John
Sinclair’s advice is indeed very appropriate:
The safest policy is to keep the text as it is, unprocessed and clear of
any other codes. (Sinclair 1991:21)
Nevertheless, corpus annotation is opted for when analysis tools (e.g.
parsers) are to be used, such as in lexicographic or grammatical studies.
Given constraints on time, money and the availability of texts, it is often
necessary to make compromises. Every corpus has limitations, but a
well-designed one can still be very useful for investigating a variety of
linguistic issues.
3.3.2 Diachronic Corpus Design Criteria
Designing a diachronic (also known as ‘historic’) corpus - a collection of
texts that accounts for language development across a specific period of
time - can be even more complicated than creating a synchronic corpus.
In addition to the basic issues of corpus design outlined in the previous
section, a diachronic corpus compiler is faced with the problem of
representativeness. Since the corpus aims to cover a precise range of
linguistic variations and registers across a specified period of time, it might
be possible to opt for exhaustive sampling. That means that the ‘final
product’ will include all linguistic variants and all registers of that specific
period. However, actual practice is often far more complicated than theory.
The design of a representative diachronic corpus which will be used to
study a specific literary style raises serious questions about sampling
methods. The most complex approach is what Biber defines as the “multipurpose diachronic corpus” (1998:251), which is designed to represent a
wide range of registers across historical periods, such as the Helsinki
Historical English Corpus and A Representative Corpus of
Historical English Registers (ARCHER).
In addition to the standard variables (time and region), a major point in
diachronic corpus design is the range of registers which are to be included
in the corpus. This is not an easy task: there are, in fact, various factors
that play a role, one of these being the number of texts available. It can be
very difficult to find sufficient texts to exhaustively cover a certain register.
A case in point is spoken interaction. The ARCHER Corpus, for instance,
includes several speech-based registers, but the majority of transcribed
texts of spoken discourse are derived from drama and fiction, in which
spoken dialogues reflect the author’s intuitions and representations.
A further issue concerning registers is their variability across time. This
is even more difficult to identify and, consequently, to analyse. Both
the Helsinki and the ARCHER Corpus avoid this problem by treating
register variation as a continuum, that is as one single register, leaving it
to the analyst to describe the dramatic ways in which a register can evolve
over time.
Dialects and idiolects should also be given attention. The corpus
sampler must decide how to catalogue them, delivering text type divisions
that take the sociolinguistic aspect into consideration as well. In her paper
about the Cambridge Corpus of Early Modern English (1600-1800), Wright
(1993) champions the importance of idiolect, calling for a clear
discrimination of the relation between the “state of the language” and
“individual usage” (ibid:29), highlighting the necessity of genre division
across a particular period of time. She adds that it is necessary to set up
“stringent functional/situational criteria” (ibid:27), because notions such as
register and genre can vary conspicuously across time.
The next step in diachronic corpus design, then, is text selection. As
Biber (1998:253) states, the best criterion is to include a random selection
of the texts available for a specific register in each period. In order to be
able to do this, a complete listing of all texts available from the period of
time analysed is absolutely necessary. With literary registers this can be
easily achieved thanks to exhaustive bibliographies from which a random
sample can be selected, while for other registers the ideal sampling
method can be much more difficult to identify. Different approaches can
therefore be used, as in the case of the ARCHER Corpus (see Appendix).
A final issue concerns automated corpus tagging. Because
spelling and other orthographic conventions might vary considerably over
time, interactive checking and editing of automated annotation is essential
when building a diachronic corpus. This approach can be very
time-consuming, but it is necessary to guarantee a correct linguistic analysis
(e.g. parsing).
3.4 Kinds of Corpora
In principle, corpora can deliver information about many different
aspects of human interaction. To make them optimally suitable for the
many research fields, however, studies with a different focus require
different parameters and therefore an appropriate approach to sampling.
A basic methodological choice is to decide what kind of corpus suits the
analysis best. There are, in fact, many corpus ‘templates’ which can either
be used as they are, or remodelled to meet a specific priority. Every
‘template’ acts as a gateway to specific information about specific
features: spoken corpora¹¹, for example, provide information about
pronunciation standards, while parallel corpora focus on the translation of
the same collection of texts in one or more languages.
¹¹ Spoken corpora include recorded oral material which is then transcribed. Spoken corpora enable
researchers to study interaction from a phonetician’s perspective; they have also proved useful in
discourse analysis, sociolinguistics and even psychology. However, there are various problems
related to a spoken corpus. The major challenge of spoken corpora is their representation in written
form (transcription). Spoken language has no explicit punctuation. It is therefore up to the corpus
compiler to decide whether to attempt to transcribe the corpus in the form of orthographic
sentences or whether to use intonation units (prosodic annotation), which tend to capture features
such as stress, intonation, pauses, ‘body language’ (e.g. eye contact) and other non-verbal material
(e.g. coughs, laughs, etc.). In order to avoid interpretation errors (e.g. inserting wrong punctuation
marks), transcriptions are normally made by using the scripts which were used by the speakers
(informants) (e.g. the Lancaster/IBM Spoken English Corpus, which is made up of radio
broadcasts), or punctuation is added by the informants themselves. An example of orthographic
transcriptions of speech is the Lancaster/IBM Corpus.
In this section I shall outline the basic characteristics of the corpora
used most frequently in linguistic research.
Raw and annotated corpora
Possibly the major distinction between corpora is whether they consist
of raw or annotated texts. The main difference between these two kinds of
corpora is that annotated corpora are provided with additional linguistic
information (annotation). This information can be prosodic (focusing on
intonation units), semantic, syntactic, generic, contextual, and so on.¹²
¹² In fact, even filenames can provide information. The LOB Corpus, for instance, is divided into
different sections, with filenames indicating the section and whether or not that particular section
has been tagged (e.g. a computer file named loba.tag tells us that we are dealing with section
A of the tagged version of the LOB Corpus).
The most common form of annotated corpora is the grammatically
tagged one. In a grammatically tagged corpus, every word has been
assigned a word class label (part-of-speech tag). The following example is
taken from the untagged and tagged versions of the LOB Corpus:
Untagged Sample
A move to stop Mr Gaitskell from nominating any more labour life
peers is to be made at a meeting of labour MPs tomorrow.
Tagged Sample
^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
nominating_VBG any_DTI more_AP labour_NN life_NN peers_NNS
is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN of_IN
labour_NN \0MPs_NPTS tomorrow_NR ._.
(Source: Biber 1998:258)
The tags used in the LOB Corpus are the original Brown Corpus tags:
AP      post-determiner
AT      article
BE      infinitive form of the verb “to be” (be)
BEZ     third person singular of the verb “to be” (is)
DTI     single/plural determiner or quantifier
IN      preposition
NN      singular noun
NNS     plural noun
NP      proper noun
NPT     noun of style or title, singular
NPTS    noun of style or title, plural
NR      adverbial noun
TO      infinitive marker (to)
VB      verb
VBG     present participle/gerund
VBN     past participle
.       end of the sentence
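Because each token in the tagged sample has the form word_TAG, the annotation can be separated from the text mechanically. The following Python sketch assumes exactly the layout of the sample above; the function name `parse_tagged` is hypothetical, and the handling of the `^` marker and the `\0` prefix is inferred from the sample rather than from the LOB documentation.

```python
from collections import Counter


def parse_tagged(line):
    """Split LOB-style `word_TAG` tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if token == "^":
            # Marker that opens the sentence in the sample; it carries no tag.
            continue
        if token.startswith("\\0"):
            # Prefix that precedes Mr and MPs in the sample.
            token = token[2:]
        # Split on the last underscore so that "._." yields (".", ".").
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


sample = (
    r"^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN "
    r"nominating_VBG any_DTI more_AP labour_NN life_NN peers_NNS "
    r"is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN of_IN "
    r"labour_NN \0MPs_NPTS tomorrow_NR ._."
)
pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)
```

Here `pairs` starts with `("a", "AT")`, and `tag_counts` gives a frequency list of tags, for instance five singular nouns (NN) in this sentence.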
Another type of annotation is parsing. Parsed corpora offer a syntactic
analysis of a corpus, identifying subjects, verbs, objects, as well as more
complex syntactic information. Sometimes, corpora of this kind are also
represented as tree diagrams and are therefore known as “treebanks”.
A typical example of a tree diagram of the sentence Andrea turned on the
lights looks as follows:
            S
      ______|______
     |             |
     NP            VP
     |        _____|_____
     PN      |           |
     |       V           PP
   Andrea    |       ____|____
           turned   |         |
                    P         NP
                    |      ___|___
                    on    |       |
                          AT      N
                          |       |
                         the    lights
where ‘S’ stands for sentence, ‘NP’ for noun phrase, ‘VP’ for verb phrase,
‘PP’ for prepositional phrase, ‘N’ for noun, ‘V’ for verb, ‘P’ for preposition,
‘PN’ for proper noun, and ‘AT’ for article. This kind of graphic annotation is
extremely space-consuming. Parsed corpora therefore tend to use
different annotation labels, where the constituents are indicated by
opening and closing square brackets. The example above, then, would
read:
[S [NP Andrea_PN1 NP] [VP turned_VVD [PP on_II [NP the_AT1
lights_NN1 NP] PP] VP] S]
where the tag set used corresponds to that used to annotate the British
National Corpus (BNC). In order not to completely lose the visual
properties of the tree diagram, the bracket-based method is sometimes
displayed with indentations:
[S
   [NP Andrea NP]
   [VP turned
      [PP on
         [NP the lights NP]
      PP]
   VP]
S]
This is the representational layout chosen for the Penn Treebank project.
As I shall describe in Chapter 4, both tagged and parsed corpora can be
extremely useful tools for research if exploited with appropriate computer
programs (e.g. concordancers).
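The labelled-bracket notation shown earlier in this section can be read back into a tree structure by a program. The sketch below is a minimal illustration: it assumes that tokens are separated by whitespace and that every constituent is closed with its own label, and the function name `parse_brackets` is hypothetical.

```python
def parse_brackets(text):
    """Parse labelled-bracket notation into nested (label, children) tuples."""
    stack = [("ROOT", [])]
    for token in text.split():
        if token.startswith("["):
            # Opening bracket: start a new constituent under the current one.
            node = (token[1:], [])
            stack[-1][1].append(node)
            stack.append(node)
        elif token.endswith("]"):
            # Closing bracket: its label must match the open constituent.
            label, _ = stack.pop()
            if label != token[:-1]:
                raise ValueError(f"closing {token} does not match [{label}")
        else:
            # Plain word (with or without a tag): attach it as a leaf.
            stack[-1][1].append(token)
    if len(stack) != 1:
        raise ValueError("unclosed constituent")
    return stack[0][1][0]


tree = parse_brackets(
    "[S [NP Andrea_PN1 NP] [VP turned_VVD "
    "[PP on_II [NP the_AT1 lights_NN1 NP] PP] VP] S]"
)
```

Because every closing label is compared with the constituent that is currently open, a missing or misplaced closing bracket raises an error instead of silently producing a wrong tree.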
Generic and contextual information is frequently encoded in the
document header, which can include, for example, the title of the
document, the name, age and sex of the language producer, the date of
publication, the language variety, the subject domain, and so on. Such a
header can be very useful when refining the search for text types or
particular variables within a range of texts. In corpora such as the
Longman-Lancaster Corpus and the Helsinki Corpus, this type of
information is given in COCOA format. COCOA was an early computer
program that extracted indexes of words in context from machine-readable
texts. The system was then also applied to other concordancing
programs, such as the Oxford Concordance Program (OCP). A “COCOA
reference” consists of a balanced set of angled brackets (< >) containing
two values (“entities”): a code signifying a particular variable name (e.g. ‘A’
normally stands for ‘Author’), and a string providing the information
needed. The following example shows a COCOA document header from
the Helsinki Corpus, where ‘X’ indicates that the information was either not
available, or not relevant to the text:
<B CEPRIV1>                 Short descriptive code
<Q E1 XX CORP EBEAUM>       Text identifier
<N LET TO HUSBAND>          Name of text
<A BEAUMONT ELIZABETH>      Author’s name
<C E1>                      Sub-period
<O 1500-1570>               Date of original
<M X>                       Date of manuscript
<K X>                       Contemporaneity of original and manuscript
<D ENGLISH>                 Dialect
<V PROSE>                   Verse or prose
<T LET PRIV>                Text type
<G X>                       Relationship to foreign original
<F X>                       Language of foreign original
<W WRITTEN>                 Relationship to spoken language
<X FEMALE>                  Sex of author
<Y X>                       Age of author
<H HIGH>                    Author’s social status
<U X>                       Audience description
<E INT UP>                  Participant relationship
<J INTERACTIVE>             Interactive/non-interactive
<I INFORMAL>                Formal/informal
<Z X>                       Prototypical text category
<S SAMPLE X>                Sample
(Source: McEnery et al. 1996:31)
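Since every COCOA reference pairs a one-letter code with a value string, a header of this kind can be turned into a lookup table with a single regular expression. The sketch below is an illustration only: the function name `parse_cocoa` is hypothetical, and treating a bare ‘X’ as a missing value follows the description of the Helsinki header given above.

```python
import re

# "<", optional spaces, a one-letter code, spaces, the value, optional spaces, ">".
COCOA_REF = re.compile(r"<\s*(\w)\s+([^>]*?)\s*>")


def parse_cocoa(header):
    """Map each one-letter COCOA code to its value (None where the value is X)."""
    return {
        code: (None if value == "X" else value)
        for code, value in COCOA_REF.findall(header)
    }


header = """
<B CEPRIV1>
<A BEAUMONT ELIZABETH>
<O 1500-1570>
<M X>
<V PROSE >
< E INT UP>
<T LET PRIV>
"""
fields = parse_cocoa(header)
```

With the fields in a dictionary, a search can be restricted to particular text types, for example to texts whose `T` field is `LET PRIV`.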
Spoken corpora are frequently available in a prosodically annotated
version. The following example is from the prosodically transcribed LLC:
1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# .                    /
1 8 14 1480 1 1 A 20 *((4sylls))*                                    /
1 8 14 1490 1 1 B 11 *I ^w\on’t have one th/anks#* - -               /
1 8 14 1500 1 1 A 11 ^aren’t you •going to sit d/own#                /
1 8 14 1510 1 1 B 11 ^[/ \ m] #                                      /
1 8 14 1520 1 1 A 11 ^have my _coffee in p=eace# - -                 /
1 8 14 1530 1 1 B 11 ^quite a nice •room to !s\it in ((/actually))#  /
1 8 14 1540 1 1 B 11 *^\isn’t* it#                                   /
1 8 14 1550 1 1 A 11 *^y/ \ es#* - -                                 /
(Source: McEnery and Wilson 1996:55)
The codes used by the compilers of the LLC are:
#       end of tone group
^       onset
/       rising nuclear tone
\       falling nuclear tone
/\      rise-fall nuclear tone
_       level nuclear tone
[]      enclose partial words and phonetic symbols
‘       normal stress
!       booster: higher pitch than preceding prominent syllable
=       booster: continuance
(( ))   unclear
**      simultaneous speech
-       pause of one stress unit
Prosodic transcription is a very difficult task for which highly skilled
phoneticians are required and, unlike part-of-speech (POS) annotation, it
cannot be delegated to the computer.
A further problem of prosodically annotated corpora is consistency, or
rather the lack of it. The identification of intonation patterns is a matter of
perception, and it is difficult to ensure that the same parameters are
maintained throughout the whole corpus.
An additional problem is that, given the huge size of most corpora,
generally more than one phonetician needs to be involved. However,
there is a very real danger that different annotators will apply different
standards. As far as the Lancaster/IBM corpus is concerned, the solution
to this problem was to have a small part of the corpus (approximately 9%)
annotated by both the transcribers involved. These “overlap-passages”
then served as a reference for the transcription parameters chosen by the
two phoneticians and, therefore, as a yardstick for comparison.
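One simple way of using such overlap passages is to count how often the two transcribers chose the same label. The sketch below computes plain percentage agreement. This measure, the label names and the function `agreement_rate` are assumptions made for illustration; the source does not state how the two Lancaster/IBM transcriptions were actually compared.

```python
def agreement_rate(labels_a, labels_b):
    """Share of items on which two transcribers chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both transcribers must label the same items")
    if not labels_a:
        raise ValueError("no labels to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical tone labels for five syllables of an overlap passage.
transcriber_1 = ["fall", "rise", "level", "fall", "rise-fall"]
transcriber_2 = ["fall", "rise", "fall", "fall", "rise-fall"]
rate = agreement_rate(transcriber_1, transcriber_2)
```

In this toy passage the transcribers agree on four of the five syllables.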
Another problem in compiling a spoken corpus is that of raw text
recoverability. Since prosodic annotation is carried out syllable by syllable
(and not word by word!), symbols have to be inserted within the word. As
the example from the LLC shows, the annotated text looks fragmented
and the original can be recovered only by deleting every annotation mark
separately.
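Recovering the raw text amounts to deleting a fixed set of characters. The following Python sketch does this for the marks in the LLC code list, plus the bullet that appears in the sample. It rests on two assumptions: that none of these characters occurs in the orthographic words themselves, and that pauses are written as free-standing “-” or “.” tokens. The function name `strip_prosody` is hypothetical.

```python
import re

# Prosodic marks that are inserted inside or around words.
_MARKS = re.compile(r"[\^/\\_!=#*•()\[\]]")
# Pause marks ("-" or ".") that stand alone between spaces.
_PAUSES = re.compile(r"(?<!\S)[-.](?!\S)")


def strip_prosody(line):
    """Remove LLC-style prosodic marks and return the plain words."""
    text = _MARKS.sub("", line)
    text = _PAUSES.sub(" ", text)
    # Collapse the whitespace left behind by the deleted marks.
    return " ".join(text.split())


plain = [
    strip_prosody(r"^what a_bout a cigar\ette# ."),
    strip_prosody(r"^have my _coffee in p=eace# - -"),
    strip_prosody(r"^quite a nice •room to !s\it in ((/actually))#"),
]
```

The apostrophe is deliberately left untouched, since in the sample it belongs to words such as won’t, although the code list also uses a similar character for normal stress.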
Monolingual and multilingual corpora
Corpora can also be classified according to whether they consist of
texts in one or more than one language (or language variety): a
monolingual corpus is a database of texts produced exclusively in one
language (or language variety), while a multilingual corpus deals with texts
in several different languages.
Arguably the most useful type of multilingual corpus from a translator’s
perspective is the parallel corpus which includes original texts and their
translations. In order to be able to mutually cross-check translation units,
translations need to be aligned. Computational linguistics has successfully
tried to develop automated alignment tools which identify so-called
“anchor points” within the sentence, that is to say the computer searches
the text for lexical or grammatical units that are mutual translations.
Alignment and parallel corpora have proved a very useful tool in language
analysis, second language tuition and obviously in translation teaching. A
further advantage of aligned corpora is that they can be used as an
inductive tool in cross-cultural analyses and the development of machine
translation (MT) and computer-aided translation (CAT) systems.¹³
Unfortunately, however, because of the difficulties involved in obtaining a
sufficient number of texts which are translations of each other, almost all
parallel aligned corpora currently available contain only highly specialised
texts. The most famous parallel corpora - the Canadian Hansard (a
parallel corpus in French and English of the proceedings of the Canadian
Parliament) and the corpus of IBM technical manuals (English and
French) - in fact cover a restricted range of domains and text types.
¹³ For more detailed discussion of MT and CAT see Chapter 4.
In the last five years many parallel corpora projects have been started,
including:
♦ INTERSECT (International Sample of English Contrastive Texts)
The INTERSECT Project at Brighton University began in the Spring of
1994. The aim is to construct and analyse a parallel bilingual corpus of
French and English written texts, adding other languages later if
resources permit.
♦ LINGUA
A project involving the construction of multilingual corpora for English,
French, Greek and some others, for use in language pedagogy.
♦ MULTEXT
A project started in 1994 by Ide and Véronis which aims to develop
parallel corpus resources for a subset of European languages.
♦ MULTEXT-EAST
A research project that focuses on parallel and comparable corpora in
Eastern European languages.
♦ TRIPTIC (Trilingual Parallel Text Information Corpus)
TRIPTIC is a trilingual corpus developed for the analysis of
prepositions in English, French and Dutch.
♦ CRATER (Corpus Resources and Terminology Extraction)
Research on the CRATER project aims to achieve automatic bilingual
lexicon construction: it therefore concerns automatic alignment of
parallel texts, both at the sentence and word level. Below is an
example of French-English aligned sentences from the CRATER
corpus:
sub d = 22 ----------&
the location register should as a minimum
contain the following information about a mobile
station :
-----&
l ‘ enregistreur de localisation doit contenir
au moins les renseignements suivants sur une
station mobile :
sub d = 386 ----------&
handover is the action of switching a call in
progress from one cell to another ( or radio
channels in the same cell ) .
-----&
le transfert intercellulaire consiste à commuter
une communication en cours d ‘ une cellule à une
autre cellule ( ou d’ une voie radioélectrique à
l ‘ intérieur de la même cellule ) .
sub d = 380 ----------&
the location register , other than the home
location register used by an msc to retrieve
information for , for instance , handling of
calls to or from a roaming mobile station ,
currently located in its area .
-----&
enregistreur de localisation , autre que l ‘
enregistreur de localisation nominal , utilisé
par un ccm pour la recherce d ‘ informations en
vue , par exemple , de l ‘ établissement de
communication en provenance ou à destination d ‘
une station mobile en déplacement ,
temporairement située dans sa zone .
(Source McEnery and Wilson 1996:59)
♦ The English-Norwegian Parallel Corpus
This parallel corpus is planned as an open text bank and will be
expanded when resources to do so are available. It is intended as a
general research tool available beyond the present project for applied
and theoretical linguistic research. There will be two main parts:
• A core corpus consisting of original texts and their translations
(English to Norwegian and Norwegian to English). Initially, the
focus was on novels and fairly general non-fiction books. In order
to include material by a maximally large number of translators, the
texts of the core corpus are limited to text extracts (chunks of
10,000 words or more). Provided that there is sufficient funding, the
amount and variety of text will be increased to include more
specialised material, including legal texts.
• A supplementary corpus containing texts which are translations,
but not of the matched source texts. The main object of this
supplementary corpus is to analyse possible features of
"translationese" (that is, features typical of translated texts) and, in
general, to increase the amount and variety of the material.
♦ ETAP
The project conducted by the University of Uppsala aims to create and
annotate a parallel corpus for the recognition of translation
equivalents. This computerised multilingual corpus is based on
Swedish source texts with translations into Dutch, English, Finnish,
French, German, Italian and Spanish.
♦ FECCS (Finnish-English Contrastive Corpus Studies)
A project in Contrastive Linguistics at the University of Jyväskylä,
Finland which uses a bilingual Finnish-English corpus.
♦ The Proteus Project
The Proteus Project is a machine translation project of the Computer
Science department of New York University and the Autonomous
University of Madrid. They use parallel corpora in English and
Spanish.
♦ Text-based contrastive studies in English
A project at Lund University in Sweden which aims to develop a
parallel corpus of texts in Swedish and English which can be used for
cross-linguistic studies.
♦ Translearn
A European project aimed at the development of a translation support
tool. The languages covered are English, French, Greek and
Portuguese.
♦ The Translation Corpus of English and German
The Technical University of Chemnitz-Zwickau is currently compiling a
translation corpus of English and German texts. The corpus at present
includes EU-material, academic textbooks, modern fiction and tourist
brochures (approx. 500,000 words in total). The researchers are
looking at aspects such as culture-specific problems in translation and
translationese.
♦ Corpora project Språkteknologi, University of Uppsala
The project aims to develop two multilingual text corpora and to
integrate them with lexical resources. The primary objective is to
create a reference corpus for research in Machine Translation.
♦ The Scania Corpus
The Scania Corpus is a collection of truck manuals from Scania.
Swedish is the source language and the texts have been translated
into seven languages: English, French, German, Spanish, Dutch,
Italian and Finnish. The Swedish component adds up to 300,000
words and is the largest part of the corpus. The smallest component,
Finnish, consists of approximately 200,000 words. The goal is to build
a corpus of 2,000,000 words. This corpus is unlikely to ever become
available, since the material is ‘commercial in confidence’.
♦ The Swedish Immigrant Newspaper Corpus
The Swedish Immigrant Newspaper Corpus (Swedish: Invandrartidningen)
is available at Uppsala University in nine different languages: Swedish,
Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian
and Spanish. Work on this corpus has only just begun so there is no
information about the number of words contained.
♦ The Swedish Government Corpus
This collection, consisting of Swedish political texts, is still at the
planning stage. It will contain declarations by the Swedish government
(regeringsförklaringen).
♦ The Scandinavian Project of Contrastive Corpus Studies
There is an ongoing Scandinavian project involving four partners in
Norway, Finland, Denmark and Sweden. Sweden is represented by
the Department of English at Lund University with their ‘Text-based
Contrastive Studies in English’ (Aijmer et al. 1996). All four corpora
will have the same structure. Each corpus consists of two parts: one
parallel corpus comprising original texts together with their
translations, and one comparable corpus consisting of original texts in
both languages. The aim is to use the corpora in contrastive studies
between the Scandinavian languages and English. The parallel corpus
between Swedish and English will eventually consist of 1,600,000
words, and comprise a large range of different text samples (each
sample 10,000-15,000 words). The corpus will become available as
soon as all the copyright restrictions are resolved. The Finnish corpus
consists of approximately 2 million words; however, the parallel texts
have not yet been aligned. So far there has been no work on POS tagging.
Another type of multilingual corpus is the comparable corpus, which
is in fact a collection of ‘similar’ monolingual corpora that apply the
same sampling criteria and cover the same subjects for every language
(or variety) considered. The main aim of this type of corpus is to compare
languages - or varieties thereof - produced in similar communication
situations, without the distortions which might appear in translated texts.
One example of such a corpus, which is also described by McEnery
and Wilson (1996), is the Aarhus corpus of Danish, French and English
contract law, which consists of three monolingual contract law corpora
sampled according to the same criteria and containing no translations.
Another interesting application of comparable corpora is in the fields of
dialectology and language variation studies. Good examples are the LOB
corpus (British English) and the Kolhapur Corpus (Indian English), which
use the same genres and sample sizes as the Brown corpus.
General and specific corpora
A further kind of classification of corpora is based on the distinction
between general corpora and specific corpora. General corpora - also
known as “reference corpora” - are very large databases compiled to be a
representative selection from the language as a whole or of a clearly
defined part of it. A case in point is the monitor corpus, a collection of texts
drawn from different subject fields or registers. The best example of a
monitor corpus is the Bank of English, developed at
Birmingham University by John Sinclair’s team in collaboration with Collins
COBUILD. This collection of texts is an open-ended entity: texts are
constantly being added to it, so that it gets bigger and bigger all the time.
Currently it comprises over 200 million words of British English, drawn
from different registers, yet focusing more on written (90 million) rather
than spoken (10 million) texts. New texts are added on a regular basis,
while ‘old’ texts are sometimes either stored on extra CD-ROMs or even
deleted: this process enables the compiler to provide a general overview
of current language use and ‘monitor’ its development across time.
Monitor corpora are primarily of importance in lexicographic work,
because they allow lexicographers to search a stream of very recent texts
for the occurrence of new words or for changes in meaning of old ones.
They also represent a valid field for research, because they include a
broad range of registers and text types, which means that language can
be modelled more accurately.
General corpora can be used for research in various fields. Specialised
(or LSP) corpora, by contrast, are created for a special purpose; many are
in fact used for work on spoken language, others are sublanguage
corpora, learner corpora and developmental corpora. LSP corpora
(corpora of language for specific purposes) can be exploited to provide
many different kinds of domain-specific material for language learning.
Sublanguage corpora consist of texts that are chosen from a particular
variety of a language, i.e. from a particular dialect or subject area. Early
examples are the Guangzhou Petroleum English Corpus (GPEC) and the
Computer Science Corpus of the Hong Kong University of Science and
Technology (HKUST). Besides learning and teaching purposes,
sublanguage corpora can also be used in language engineering: machine
translation cannot be realistically trialled on general language, but it
becomes feasible when the task is restricted to a particular domain, or
sublanguage.
Learner corpora are databases that aim to improve our understanding
of language learning from an unusual point of view. Instead of describing
language as it should be, this kind of corpus focuses on the analysis of
the commonest mistakes made by non-native speakers in order to
develop methodologies to avoid them. Although limited to only one aspect
(free writing) of one type of sublanguage (that of advanced foreign learners
of English), perhaps the best example of this type of corpus is the
International Corpus of Learner English (ICLE). The ICLE is a comparable
corpus whose design, compilation and processing are described in detail
in Granger (1993).
The last kind of corpus sampled for special purposes is represented by
developmental corpora. This kind of database aims to represent the
language used by native speakers whose linguistic competence has not
yet reached maturity; that is to say, it tries to depict a type of raw
language which is developing extremely fast and is subject to numerous
influences. Because language teaching is mostly concentrated on children
during the periods of primary and secondary education, developmental
corpora have lately become mainstream. In order to create reference works
that really suit the needs of young learners, it is necessary to design a
corpus that corresponds to the target language behaviour of the learners.
CHILDES, the child language database designed in Pittsburgh by a team
of researchers at the Department of Psychology of Carnegie Mellon
University, is an example of such a corpus. Although this kind of corpus is
most useful in language acquisition research, it also has a very practical
application to the development of language-teaching and testing
materials.
This section aimed to outline the major kinds of corpora used in language
research (for examples of corpora in languages other than English see
Appendix 2). Obviously, it cannot be a full account. There are numerous
types and subtypes not listed here that can be designed in connection
with a specific kind of study. This is, I feel, the most exciting characteristic
of corpora: they are flexible, that is, they can be adapted to optimally suit
specific needs. In order to exploit corpora even more effectively, however,
researchers use tailor-made computer tools which enable them to obtain
useful information about the specific characteristics of natural language.
These will be described in the next section.
3.5 Tools for Corpus Exploitation
Before outlining the different analytical tools which are used in corpus
linguistics, I think it is necessary to briefly describe some of the more
frequently used operations in corpus work and to explain what tools can
actually do for the linguist, rather than what kind of data structures they
manipulate. For further details about informal specifications of operations,
the required input and the resulting output see Lager (1995). The list below
is not exhaustive (see Lager 1995 for more functions): I have restricted the
discussion to these operations because they deliver sufficient quantitative
data for all the types of corpus analysis that are of relevance to the
translator. The operations can of course be combined: automatic
part-of-speech tagging involves automatic disambiguation, concordancing
may imply searching, etc. (see Lager 1995).
♦ Searching: takes a text, raw or annotated, as well as a target item, and
points to segments in the text where that specific item is found.
♦ Concordancing: takes a text, tagged or untagged, as well as a target
item, and produces a concordance, that is a list of words and phrases
in context.
♦ Parsing: takes as input a (segment of) text as well as a grammar, and
delivers syntactic information about all items in different forms (e.g.
parse trees).
♦ Counting: takes a text, raw or annotated, as well as a target item, and
returns the number of text segments that match that specific item.
♦ Tabling: takes a text, raw or annotated, as well as a target
specification, and produces a table (e.g. a frequency table, a table of
collocations, etc.).
♦ Collocating: given a description as well as a target item, produces a
list of collocations, that is a list of words that co-occur more often than
expected by chance.
♦ Automatic part-of-speech tagging: takes a text as well as a lexicon
(and sometimes some kind of rules or highly probable links) and
delivers information about the text at the level of part-of-speech.
♦ Lemmatising: takes a text as well as a lexicon (and sometimes some
kind of rules), and produces a description of the text which specifies the
lemma from which different inflected forms have been derived.
♦ Manual/automatic disambiguation: given a number of alternative
descriptions as well as either the user’s act of choosing (interaction), or
rules for automatically selecting between them, returns a description.
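To make one of these operations concrete, the sketch below illustrates collocating in its simplest form. It is only an illustration under stated assumptions: the function name, the window size (span) and the minimum count are hypothetical choices, and "expected by chance" is approximated by assuming that words are distributed independently of each other.

```python
from collections import Counter
import re

def collocates(text, node, span=2, min_count=2):
    """List words found near `node` more often than chance would predict.

    Returns (word, observed, expected) tuples, strongest association first.
    """
    words = re.findall(r"\w+", text.lower())
    freq = Counter(words)
    observed = Counter()
    slots = 0  # number of window positions inspected around the node
    for i, w in enumerate(words):
        if w != node:
            continue
        window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
        slots += len(window)
        observed.update(window)
    results = []
    for w, obs in observed.items():
        # Chance rate: the word's overall share of the text times the slots seen.
        expected = freq[w] * slots / len(words)
        if obs >= min_count and obs > expected:
            results.append((w, obs, expected))
    return sorted(results, key=lambda r: r[1] / r[2], reverse=True)
```

Professional tools usually replace the simple observed/expected ratio with a statistical measure such as mutual information or a t-score, but the input (a text and a target item) and the output (a ranked list of co-occurring words) are the same.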
The four most useful tools for corpus-based research in translation-oriented
studies (concordancers, frequency tables, taggers and
parsers) are described in the following section. Again, the list of tools
discussed is not exhaustive: new programs are constantly being written
to meet specific purposes (e.g. language-specific analysers of derivational
and inflectional morphology). See Biber (1998) for further comments on the
issue of already available concordancing packages vs. own programming.
3.5.1 Concordancers
Concordancers enable you to discover patterns that exist in natural
language by rearranging text in such a way that these patterns become
clearly visible. Concordancing programs allow you to look for single lexical
items or lexical groups. The principal objective of collocation searches is
to identify the lexical items a given word or lexical group can collocate
with. The example given below was produced by Conc 1.80b3 (a
Macintosh application) from a plain ASCII text version of the first chapter
of Lewis Carroll’s “Alice’s Adventures in Wonderland”. Note that the line
numbers are automatically calculated by the application.
(Source: http://lonestar.texas.net/~brazos/alice/aliceinw.htm)
A printout like this, with the keyword in a straight column down the
middle of the page and as much of the context as will fit running in one
line to right and left, is known as a KWIC (keyword in context) concordance.
Many concordancers will also let you print out contexts consisting of a
complete sentence, or a fixed number of words, or a whole paragraph, or
allow you to trace any occurrence back to the original text.
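The KWIC layout can be reproduced in a few lines. The following is a minimal sketch rather than a description of how Conc or any other package works; the function name kwic and the width parameter (the number of characters of context on each side) are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return KWIC lines with `keyword` starting in the same column."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    pattern = r"\b%s\b" % re.escape(keyword)
    lines = []
    for m in re.finditer(pattern, flat, re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-justify the left context so that every keyword lines up.
        lines.append(left.rjust(width) + m.group(0) + right)
    return lines
```

Printing the returned lines one per row gives the familiar display with the keyword in a straight column; whole words only are matched, so a search for "her" does not return "whether".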
Tribble and Jones (1990) outline three main types of concordancing
software:
♦ streaming concordancers: they “read” a text line-by-line and produce
concordanced text either to screen, printer or disk as they chunk
through the documents you are analysing. This kind of software is very
accessible: you can use the macro option of any word processor or
even develop one yourself if you have some programming knowledge,
the most widely used programming language currently being Perl. There is,
however, a major drawback. Although not limited to a particular size of
text file, the concordancer might take a long time to work through a long
document (50,000+ words). An example of a streaming concordancer
is Conc 1.80b3 for the Macintosh.
♦ text-indexers: they create an index of your text in one (sometimes
lengthy) operation and then permit a large variety of text retrieval
activities, including concordancing. Although ideal for large-scale
research, text-indexers are still relatively little used, mainly because
they might prove daunting to those with little computing experience or
with limited time or motivation for learning how to use them. Maybe the
best example of text-indexing software is WordCruncher.
♦ in-memory concordancers: this software loads a complete file - or set
of files - into the memory of the computer in one operation. The text
can then be consulted in a variety of ways, the results obtained being
presented to the user more or less instantaneously. The most common
in-memory concordancer is the Longman Mini-Concordancer.
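The difference between these types can be illustrated by the text-indexing approach: the text is processed once, and every later query is answered from the index rather than by re-reading the text. The sketch below is a simplified illustration and not a description of WordCruncher; the function names build_index and concordance are hypothetical.

```python
import re
from collections import defaultdict

def build_index(text):
    """Index a text once: map each word form to the positions where it occurs."""
    words = re.findall(r"\w+", text.lower())
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[word].append(position)
    return words, index

def concordance(words, index, keyword, span=3):
    """Use the index to fetch `span` words of context around each occurrence."""
    hits = []
    for pos in index.get(keyword.lower(), []):
        left = " ".join(words[max(0, pos - span):pos])
        right = " ".join(words[pos + 1:pos + 1 + span])
        hits.append((left, words[pos], right))
    return hits
```

Building the index is the (sometimes lengthy) one-off operation; each subsequent lookup only touches the positions stored for the keyword, which is why this design suits large-scale research.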
3.5.2 Frequency Tables
Before analysing concordances, however, it is worth remembering that
there are also other extremely useful sources of information about texts.
One of these is a list of word frequencies (also known as a frequency
table), which can be obtained either by means of an extra application
(e.g. Mike Scott’s Frequency Lister), or simply by activating one of the
many features provided by concordancing software. The example given
below is taken from a full frequency table of Chapter 1 of Lewis Carroll’s
“Alice in Wonderland” produced with the index function of Conc 1.80b3:
(Source: http://lonestar.texas.net/~brazos/alice/aliceinw.htm)
The number in brackets shows the frequency of the tokens listed in the
first column, while the numeric string indicates the line numbers where
that token was detected.
By creating a frequency table like the one illustrated above before
running a concordance across a text it is possible to preselect the most
analytically relevant items. If wordlists are created before you proceed to
analysis, a great deal of guesswork can be avoided. Furthermore, a
frequency table may even reveal stylistic characteristics of a text that
would otherwise have gone unnoticed.
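A frequency table of the kind described here (token, frequency in brackets, line numbers) can be sketched as follows. This is an illustration only: the function names and the simple tokenisation rule (runs of letters and apostrophes) are assumptions, not the behaviour of Conc.

```python
import re
from collections import defaultdict

def frequency_table(lines):
    """Return (token, frequency, line numbers) rows, most frequent first."""
    occurrences = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for token in re.findall(r"[a-z']+", line.lower()):
            occurrences[token].append(number)
    # Most frequent tokens first; ties in alphabetical order.
    ordered = sorted(occurrences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(token, len(nums), nums) for token, nums in ordered]

def format_table(table):
    """Render rows as 'token (frequency) line numbers'."""
    return ["%s (%d) %s" % (tok, n, " ".join(map(str, nums)))
            for tok, n, nums in table]
```

Because the rows are sorted by frequency, the top of the table immediately suggests which items are worth running through a concordancer.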
3.5.3 Taggers
A tagger is a computer program that assigns grammatical information
to words. For instance, a tagger might tell us that the word program in the
previous sentence is a noun in the nominative singular, or that the word
program is a present tense verb in the sentence “They program well”.
Most taggers use the following modules:
♦ a tokeniser isolates words and punctuation marks
♦ a lexical analyser inspects each word and adds tags that indicate the
grammatical properties of the words (e.g. part of speech and
inflectional properties). If a word can serve several grammatical
functions, several tags are added as alternatives, as in the following
example:
He_P (P for pronouns)
chairs_Npl_Vpres (Npl for plural noun; Vpres for present tense verb)
the_DET (DET for determiner)
conference_N (N for uninflected or singular noun)
♦ when a word could represent two or more grammatical categories, the
context needs to be consulted to disambiguate the word. The final
stage in tagging, then, is disambiguation: a disambiguator tries to
select the correct alternative by removing contextually illegitimate tags.
As a result of successful disambiguation, the above sample would be
analysed as follows:
He_P
chairs_Vpres
the_DET
conference_N
This last operation is by far the most difficult subproblem in tagging. In
spite of nearly 40 years of research, no perfect solutions are in view,
although considerable progress has been made.
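The three modules can be sketched as a toy pipeline that reproduces the example above. Everything in it is an assumption made for illustration: the lexicon is hand-made, the tag UNK for unknown words is invented, and the two disambiguation rules are deliberately crude; real taggers rely on large lexicons and on statistical or extensive rule-based disambiguation.

```python
import re

# Hypothetical toy lexicon: each word form lists every tag it may carry.
LEXICON = {
    "he": ["P"],
    "they": ["P"],
    "chairs": ["Npl", "Vpres"],
    "program": ["N", "Vpres"],
    "the": ["DET"],
    "conference": ["N"],
    "well": ["ADV"],
}

def tokenise(sentence):
    """Module 1: isolate words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def analyse(tokens):
    """Module 2: attach every tag the lexicon allows ('UNK' if unknown)."""
    return [(tok, list(LEXICON.get(tok.lower(), ["UNK"]))) for tok in tokens]

def disambiguate(analysed):
    """Module 3: remove contextually illegitimate tags with two toy rules."""
    result = []
    for i, (tok, tags) in enumerate(analysed):
        prev = result[i - 1][1] if i > 0 else []
        if len(tags) > 1 and prev == ["P"]:
            # Directly after a pronoun, discard noun readings.
            tags = [t for t in tags if not t.startswith("N")] or tags
        elif len(tags) > 1 and prev == ["DET"]:
            # Directly after a determiner, discard verb readings.
            tags = [t for t in tags if not t.startswith("V")] or tags
        result.append((tok, tags))
    return result

def show(tagged):
    """Render tokens in the word_TAG notation used above."""
    return " ".join("%s_%s" % (tok, "_".join(tags)) for tok, tags in tagged)
```

Running show(analyse(tokenise("He chairs the conference"))) gives the ambiguous analysis, and wrapping the analysis in disambiguate() leaves a single tag per word. The sketch also shows why disambiguation is the hard part: each rule covers only one narrow context, and a word remains ambiguous wherever no rule applies.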
Taggers are extremely useful in linguistics. In the previous section I
have already outlined the advantages of a tagged corpus. What has not
been mentioned yet is the fact that tagging (and parsing) software has
also found application in another major field of applied linguistics:
machine translation (see Chapter 4, Section 2.2).
3.5.4 Parsers
The simple act of encoding phrases or sentences in a target language
is actually not a difficult task for a computer. Since the process is largely
mechanical, the machine even has an advantage - in terms of pure speed
- over a human being. Additionally, the database of words to which it has
virtually instant access is considerably larger than the active vocabulary
that most people carry in their heads.
Where the real challenge resides is in analysing the phrase correctly
into its constituent elements before the translation process starts. This is
known as "parsing" the phrase. In computer technology, a parser is a
program that receives input in the form of annotated text, interactive
online commands, or some other user-defined interface and breaks it
up into parts (e.g. singular or plural nouns, verbs, adjectives etc.) together
with their attributes or options. It then draws a map of the phrase or
sentence either in a linear or in a schematic form. A parser may also
check to see that all necessary input has been provided, otherwise
signalling an error in the syntactic construction of the sentence.
Despite the very advanced technology applied to the process of
parsing, however, automatic syntactic analyses may simply not be
sufficient. The poor performance of most machine translation programs, in
fact, strongly suggests that such analyses should be supervised by
humans: interaction with the machine allows the human analyst to make
difficult linguistic judgements, while the computer takes care of
record-keeping. Again, performance and competence are symbiotic.
3.5.5 Ready-Available Tools versus Own Programming
An issue that continues to cause major disagreement among
researchers is the question of whether or not corpus users should be able
to create their own analytical tools. While many corpus users feel that the
commercially available software does not provide for the kind of analysis
that they need, not many are familiar with the programming languages
used in corpus linguistics. Conversely, there are many linguistic software
developers who have little knowledge of linguistics and therefore no
insight into the real needs of corpus users. The following exchange is
quite typical of the kinds of arguments put forward by those supporting the
use of ready-made software:
Date: Wed, 29 Jul 1998 12:50:41 +0200
To: [email protected]
From: Henning Reetz <[email protected]>
Subject: Re: Corpora: Corpus Linguistics User Needs
Sender: [email protected]
1)
Writing a program is one thing. Testing and proving its correctness is
another thing. Even for simple statistical problems I prefer to use
standard statistical packages because I expect their algorithms to be
better tested than my own code (but I compare always their results
with examples from text books; if both disagree, I compute the
problem on the example data by hand and found more often bugs in
the textbooks than in the programs). Being an experienced
programmer having written many thousands lines of code, I prefer to
use standard software.
2)
I don't have to be a car mechanic to drive a car. Why do I have to be a
programmer to use a corpora? --- But I have to know as a driver what
petrol my car takes, how good the breaks are, etc. As a user of a
program, I cannot simple trust the program but have to be aware of
its bugs or problems. I think it is a good policy to test a function by
hand on a small data set and do cross-checks and plausibility tests
on large data sets.
3)
Why re-invent the wheel?
Henning Reetz
Allgemeine Sprachwissenschaft
Universität Konstanz
(Source: [email protected],
in response to a question by Mason and Berglund on 27th July)
The point made by Reetz is well-grounded. Interdisciplinarity supported
by close cooperation is often enough to solve the problem. Being able to
write your own software, however, has several advantages. As Biber
(1998) points out, creating your own programs allows you to conduct
analyses that would otherwise not be possible, either because no readily
available tool explores the pattern of use you are aiming to analyse, or
because it does not apply the scale of analysis you have chosen for your
study. A further advantage is that someone familiar with programming
languages would be able to modify existing programs and so increase their
speed and accuracy (Biber ibid.).
Another argument in favour of programming your own tools is that you
can tailor the analysis process to fit your research needs. For studies that
are based on - or simply require - human assessments, you can develop
an interactive interface, where the user takes over from the computer
whenever s/he feels s/he should. See the following comment:
To: [email protected]
Date: Wed, 29 Jul 1998 11:14:17 -0400
Subject: Re: Corpora: Corpus Linguistics User Needs
X-Juno-Line-Breaks: 0-4,9-10,14-15,17-31
X-Juno-Att: 0
MIME-Version: 1.0
From: [email protected] (C Hogan)
Sender: [email protected]
Henning Reetz writes:
>I don't have to be a car mechanic to drive a car. Why do I have to
>be a programmer to use a corpus?
The argument here turns on the meaning of the word "use": It is not
necessary to be a car mechanic if all you want out of your car is to
drive it to work, turn right and left, stop and accelerate, etc. On the
other hand, if you would like to put in a new engine, or tune-up your
car, then yes, you do need to be an auto mechanic.
Similarly for corpus linguistics: if all you want to do is get word counts
from your corpus, then you can probably rely on existing software. If,
however, you want to do really custom stuff, then you should
probably learn to program.
(Source: [email protected])
In my own opinion, it is not vital for a young corpus linguist to know a
particular computer language particularly well. However, a demonstrated
ability and, above all, willingness to acquire programming skills is
almost indispensable. I believe that a corpus linguist has first to be a pure
linguist, and only later - when the basic knowledge of linguistic processes
has been assimilated - become a software developer. Modern linguists –
translators included - are very versatile, mainly because they have to cope
with the information madness of the society we live in. They are very far
indeed from the celebrated stereotype of the ‘armchair linguist’ created in
the early 1990s by Fillmore, that is they are aware of the need for
interdisciplinary learning. If they choose to focus on linguistic issues,
rather than implementing a more integrated approach, then, there must be
a valid reason, and that is – at least at the beginning of their career - time.
I simply believe that such priorities must be respected.
3.6 Corpus Networks
This section introduces an aspect which in recent years has been at the
forefront of theoretical discourse and of practice-oriented debates alike.
Over the years corpora have become bigger and bigger,
requiring not only more capable hardware, but also the collaboration of
different people, who are all assigned specific tasks and have
to carry them out simultaneously (e.g. exploitation, updating, support, etc.).
It is because of the need for flexible access to stored data that networking
has become necessary.
Here I shall focus on basic technical details and outline the reasons
why I believe that a computer network is essential for corpus exploitation,
both for language learning and for language teaching purposes.
The major advantage of networking is that the content of hard disks can
be shared. Only one machine needs to hold the data, while the other
computers simply access them through the network. Another advantage is
that the data can only be changed by the project management, so that the
corpus remains secure from unwanted modifications.
However, a networked system also has some crucial drawbacks.
Hughes (1997), for instance, points out that shared files cannot be
distributed among several computers: it would in fact be quite difficult to
know where particular parts of the data are stored. Normally only one
computer is used to distribute the shared files, that is to say that all
users have to access the data on its hard disk. As a consequence, this
machine may become a bottleneck for the whole system, and is moreover of
no further use as a standard machine. The solution to the problem represented
by a large number of users is to dedicate an extra computer to the task of
running the network. This kind of machine is called a ‘server’, while all
computers dedicated to accessing the data are known as ‘clients’. In the
long run, this solution turns out to be an easy way of creating a shared
resource and is therefore ideally suited to an institutional framework, because it
supplies data rapidly and necessitates periodical servicing only on the
server. Moreover, support is easier because it is more
structured, and growth simply involves an increase of resources or client
machines. Even the huge amount of resources on the Internet can be
exploited more effectively: a case in point of a client-server network is the
British National Corpus (BNC), which can be accessed remotely through the net by
means of a specifically designed retrieval program called SARA.
A computer network offers some evident improvements to language
learning. First of all, the simultaneous availability of several machines is
an essential component of a learning-by-doing method. Learners can
practise language use directly and independently, without having to wait either
for the teacher to answer their questions or for some corpus tool to give
them evidence of language use. This kind of approach stimulates the
learner’s curiosity about practical applications and, provided s/he knows how
to exploit a corpus appropriately, not only delivers an exhaustive answer
but also encourages him/her to develop his/her own approach to research in
general. However, it should be mentioned that it is up to the learner
to take advantage of the facilities s/he is being offered: merely knowing
about the possibility of self-training (as in the case of computer-aided
language learning) is clearly not enough. Furthermore, the initiative has to
be given enough space within the institution (e.g. a room specifically
designed for this purpose) and be supported by additional courses that
focus on the teaching of corpus exploitation, that is, ‘teaching to self-teach’.
Corpus resources run on a computer network offer some ‘strategic’
advantages for language teaching as well. Again, there is a huge variety
of practical perspectives that vary according to the methodology adopted
by the teacher. One major point in favour of a computing infrastructure is
that it enables the teacher to teach language courses without bringing
his/her research activities to a halt. One possible benefit for the teacher,
in fact, might consist in the supervision of individual research projects that
the students have to carry out using specific resources made available by
a source whose characteristics are well known to both teacher and
students. It might be very interesting, for example, to analyse in detail the
different uses of the word cheers in a corpus of American, British and
Hibernian English, or even to confine the study to a specific corpus (e.g.
the BNC) and highlight the difference between two similar words, such as
recently and lately, taking due account of their pragmatic connotations.
Finally, installing a computer network also means being able to profit
from particular software licences and agreements (e.g. the campus licence
of Translator’s Workbench [19]), and therefore to save large amounts of money.
[19] „Translator’s Workbench“ is a trademark of Trados GmbH, Stuttgart, Germany.
PART II
APPLICATIONS
4 Applications of Corpora
In this chapter the focus is on applications. In Section 4.1 I shall begin
with a short overview of the uses of corpora in some major domains of
linguistics, and then discuss some of the advantages of implementing a
corpus-based approach in the fields of translation and interpreting
(Section 4.2).
4.1 The Use of Corpora in Linguistics
The main advantage of a corpus-based approach, as I have suggested,
is its capacity to deliver evidence of language usage. This section
describes in detail how corpus-oriented studies can contribute to the
various branches of linguistics.
4.1.1 Corpora and Grammar
One of the earliest applications for corpus-oriented approaches has
been grammar. Using language corpora in grammatical analyses has a
double advantage, as it supports both the deductive and the inductive
construction of theories.
Parsed corpora in particular (i.e. corpora annotated with grammatical
information) tell us a great deal about which syntactic structures are
associated with which linguistic contexts. In other words, empirical data
give the grammarian the possibility to derive the rules and to develop
theories underlying language use.
Corpora are also increasingly used to support inductive grammar
theories. Aarts (1991), for instance, describes how primarily rationalist
formal grammars developed at Nijmegen University are tested on natural
language using computer corpora. In other words, empiricism (corpus
evidence) and rationalism (introspection) are successfully combined to
draw up a comprehensive grammar. Possibly one of the most famous
advocates of a combined approach is Michael Halliday, whose theory of
systemic grammar [20] is perhaps the best example of an effective
symbiosis between corpora and grammar.
Examples of Practical Applications
In grammar studies, corpora can be used to raise awareness of
specific grammatical phenomena in comparative analyses of target
language and source language structures. For instance, learners of
English trying to comprehend the different usage of some and any will be
more motivated if they can examine for themselves the contexts in which
the two determiners occur and compare them with the German
equivalents irgendeiner, jeder, etc. (see Wolff 1996:78).
Another exciting application is the Internet Grammar of English, which
is available at http://www.ucl.ac.uk/internet-grammar/intro/intro.html and
was developed by the Survey of English Usage group headed by R. Quirk
(1996-1998).
Another interesting internet site which draws on corpora for grammar
teaching is http://www.ccl.umist.ac.uk/projects/salsa/ . The University of
Manchester developed SALSA, a program for students of English,
French, German, and Spanish who have an interest not only in learning
the language, but also in learning facts about the language, and who want
to acquire proficiency in the use of the linguistic metalanguage. In
addition, the multilingual nature of the package offers the opportunity to
compare linguistic phenomena in freely selected language pairs. SALSA
emulates the learning situation in which the students can freely
experiment with different ways of analysing language without their
attempts being recorded or marked. Instead, the program compares the
input with the expected answer in order to provide feedback and help.
Although the primary function of the software is to provide practice in
syntactic analysis, SALSA offers short hypertext tutorials as well. These
tutorials are not intended as yet another book on syntactic theory, but aim
to explain the objectives and methods of the practical lesson(s) which
follow in SALSA or give a synopsis of the topic in question before the
student attempts the relevant exercises. Whenever feasible, the software
user can call up a set of examples which corresponds to the nature of the
tutorial, e.g. when the noun is being described the students can draw on a
list of nouns in the language of their choice. These examples are taken
from the same database as the sentences for the exercises so that
learners familiarise themselves gradually with the new material.
[20] M.A.K. Halliday’s theory of systemic grammar is based upon the notion of language as a set of
choices for each instance from which the speaker must select one. In each situation various
choices are more or less likely to be selected by the speaker: Halliday uses this idea of a
probabilistically ordered choice to interpret many aspects of linguistic variation and change in
terms of the differing probabilities of linguistic systems. It is, for example, one of Halliday’s
suggestions that the notion of a register, such as that of conversational speech, is really equivalent
to a set of these kinds of variations in the probabilities of the grammar. (McEnery and Wilson
1996:95-96)
4.1.2 Corpora and Lexicography/Terminology
Lexicography is the only branch of linguistics that was making massive
use of empirical evidence long before corpus linguistics became
mainstream. Nevertheless, the advantages of a machine-readable,
representative corpus are obvious. First of all, corpus-based research is
very efficient: the only thing a lexicographer needs to do is to switch on
the computer and call up a word or phrase. In a few seconds, s/he will
obtain a huge amount of information about that particular entry and is able
to draw his/her own conclusions within a very short period of time. This
means that collections of linguistic information about lexical items (i.e.
vocabularies) can be updated more quickly and, depending on the kind of
corpus examined, can reach a very high degree of accuracy and
completeness.
Moreover, corpus exploitation tools such as frequency listers and
concordancers (which can, for example, sort hits by the first word to the
right or to the left of the keyword) allow the lexicographer to catalogue
entries on the basis of the context in which they are embedded and to
establish a scale of importance within all possible meanings. (McEnery
and Wilson 1996:91)
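The sorting behaviour described above can be illustrated with a minimal sketch in Python. It is not the code of any concordancer mentioned in this chapter: the function names (kwic, sort_by_first_right, sort_by_first_left), the tokenisation rule and the sample sentences are assumptions made purely for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    # Assumption: a token is any run of letters or apostrophes, lower-cased.
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_by_first_right(hits):
    # Order hits alphabetically by the first word to the right of the keyword.
    return sorted(hits, key=lambda h: h[2][0] if h[2] else "")

def sort_by_first_left(hits):
    # Order hits alphabetically by the first word to the left of the keyword.
    return sorted(hits, key=lambda h: h[0][-1] if h[0] else "")

# Invented sample text, not taken from any corpus.
sample = ("The bank raised interest rates. She sat on the river bank. "
          "A bank account was opened. The bank closed early.")

for left, kw, right in sort_by_first_right(kwic(sample, "bank")):
    print(f"{' '.join(left):>25}  {kw}  {' '.join(right)}")
```

Sorting on the right-hand neighbour groups recurrent patterns such as bank account together, which is what lets the lexicographer separate the senses of an entry by context.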
Another advantage of corpora is that they can be enriched
with extra linguistic and non-linguistic information. As already mentioned
when introducing the COCOA references, text headers can deliver data
about different variables, including the age and gender of the author, his
or her social status, the date the text was produced, its genre and register
variety, etc. With this extra input, the lexicographer can
assign plain text to prototypical text categories on the basis of linguistic
factors and to social contexts on the basis of non-linguistic parameters, and
so produce a well-grounded analysis of language use.
Representativeness, however, remains a complex issue. Corpora
obviously cannot solve all the lexicographer’s problems, as Della
Summers points out:
“Frequency is a powerful tool in the lexicographer’s arsenal of
resources, allowing her to make informed linguistic decisions about
how to frame the entry and analyse the lexical patterns associated
with words in a more objective and consistent way. However, in
dictionary-making editorial judgement is of paramount importance,
because blindly following the corpus, no matter how carefully it may
be constructed to represent the target language type accurately, can
lead to oddities. We expect our motto: ‘Corpus-based, but not
corpus-bound’ to hold good for many years to come.” (Della
Summers 1996:266)
Examples of Practical Applications
One easily replicable example of lexicographic research using corpora
is described by Tim Johns (http://sun1.bham.ac.uk/johnstf/neo.htm). He
used a corpus comprising the 1994 electronic editions of the Guardian and
the Observer newspapers to search for neologisms in the world of computing,
and in particular for terms relating to the Internet. What he found was that
some of the new terms were actually old words that were given new
meanings, whereas others were completely new coinages. Most of these
neologisms came from the USA, and were jokey and informal.
In addition to the obvious usefulness of corpora in lexicographic
research, there has been a trend in recent years to draw on corpora in
quite specific terminological studies. One of these is of particular interest
to translators and is therefore described in more detail here. Susanne
Lenz conducted a corpus-based lexical study in the field of terminology
and showed how a corpus-based approach can help detect mutual lexical
influences between German and English. As evidence she used two large
corpora, the Bank of English and the more recent IDS-Korpora. For
reasons of topicality, environmental "terms" were chosen: items which fall
somewhere between terminology and general language, e.g., green dot,
green bin, habitat. In her study, the focus was on terms selected from a
German and an English glossary and their respective translations. These
glossaries had been compiled as part of an earlier project which
investigated communicative structures and interfaces in the environmental
field of various industrial companies, and were chosen mainly for
contextual reasons, to ensure cohesion with regard to type of text and
circulation. She first selected a limited number of terms from the
glossaries and then cross-checked them with the German IDS-Korpora
and the Bank of English. The selected items were revisited throughout the
corpora in order to look for common contextual patterns and, if possible,
also for chronological evidence that might be attributable to mutual lexical
influence between German and English in this field. Her findings show
that the use of large corpora in this type of research promises interesting
results, although further bilingual corpus alignment is required if maximally
reliable statements are to be produced. [21]
[21] For tools specifically developed to extract terminologies and lexicographic data from corpora
see QUIRK (http://www.mcs.surrey.ac.uk/Research/CS/AI/SystemQ/index2.html), a project
conducted by Sharp Laboratories of Europe Ltd, Cambridge University Press, and the University
of Cambridge Computer Laboratory.
The last point I shall make concerns future trends. Some recent
projects have been trying to use word sense frequencies as the basis of
categorisation, and to produce sense-ordered frequency dictionaries (see,
for instance, the research project conducted by Christian-Albrechts-Universität
in Kiel and Bowling Green State University in Ohio; the aim is to create a
sense-ordered frequency dictionary for mediaeval German epic). Such
semantics-based dictionaries would undoubtedly be of great interest to
translators, as they provide the kind of information that is not included in
traditional dictionaries.
4.1.3 Corpora and Morphology
One interesting application of corpora has been their use in
morphological analyses in deduction-based language learning. Learners
themselves systematically collect data on word-formation rules by means
of the wildcard option. This allows them to identify semantic similarities
and differences between, for example, English words ending in *ic vs.
*ical, such as classic/classical, historic/historical, economic/economical, or
the usual contexts of English and French adverbs (*ly and *ment
respectively) and adjectives (Legenhausen 1996:69).
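A wildcard query of this kind can be approximated in a few lines of Python. The sketch below is only an illustration under stated assumptions: the helper names (wildcard_to_regex, wildcard_hits, ic_ical_pairs) are hypothetical, the asterisk is taken to stand for any run of letters, and the sample sentence is invented.

```python
import re
from collections import Counter

def wildcard_to_regex(pattern):
    # Assumption: '*' in a concordancer-style pattern means "any run of letters".
    return re.compile("^" + re.escape(pattern).replace(r"\*", "[a-z]*") + "$")

def wildcard_hits(tokens, pattern):
    """Count every token that matches the wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return Counter(t for t in tokens if rx.match(t))

def ic_ical_pairs(tokens):
    """Return the -ic/-ical pairs for which both members occur in the tokens."""
    ic = set(wildcard_hits(tokens, "*ic"))
    ical = set(wildcard_hits(tokens, "*ical"))
    return sorted((w, w + "al") for w in ic if w + "al" in ical)

# Invented sample sentence, not corpus data.
sample = ("A historic victory in a historical novel; the classic car played "
          "classical music. An economic policy is not always economical.")
tokens = re.findall(r"[a-z]+", sample.lower())

print(wildcard_hits(tokens, "*ic"))
print(ic_ical_pairs(tokens))
```

Note that the query *ic also returns music, which has no *ical counterpart in the sample: wildcard searches over-generate, and it is the learner who has to sift the hits and compare their contexts.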
4.1.4 Corpora and Semantics
There are two major reasons for using a corpus-based approach in
semantics: the first is that a corpus helps identify criteria for assigning
meanings to linguistic items; the second is that corpora permit a
prototypical approach to the linguistic categorisation of lexical units.
Concordances show lexical items embedded in contexts which normally
specify their meanings in that particular phrase or in a given period. This
use of contexts may also reveal deviations from established uses: a
monitor corpus, for instance, can show changes in the meaning and the
use of words as well as the expansion of their semantic fields.
Corpora have also played a major role in eroding the belief in the
possibility of hard and fast categorisation. Conventional studies attempted
to formulate unambiguous descriptions of word meanings and to ignore
that alternative options existed. Corpora have proved this methodology to
be wrong. McEnery and Wilson (1996) sum up this point clearly and
concisely:
“In looking empirically at natural language in corpora it becomes
clear that this ‘fuzzy’ model accounts better for the data: there are
often no clear-cut category boundaries but rather gradients of
membership which are connected with frequency of inclusion rather
than simple inclusion or exclusion. Corpora are invaluable in
determining the existence and scale of such gradients.” (McEnery
and Wilson ibid:97)
The notion of ‘frequency of inclusion’ suggests a prototypical approach
to language analysis which is based upon definite, clearly established
items serving as descriptors of a specific function within the text. Linguistic
units gravitate towards these; their meanings vary according to numerous
factors (e.g. time, place, speaker, situation, etc.). The corpus is so far the
only linguistic tool which enables the definition of this “gradience of
membership”.
Examples of Practical Applications
One current application is the DELIS project (Descriptive Lexical
Specifications), which is conducted jointly by several universities and
publishers [22] and aims to produce a dictionary describing the major
semantic classes of English, French, Italian, Danish and Dutch, and also
the interaction between syntax and semantics.
Another more ambitious project is the FRACAS project (Framework for
Computational Semantics) [23].
[22] The Center for Sprogteknologi at Kobenhavns Universitet; Research Unit for Computational
Linguistics, Helsingin Yliopisto; Linguacubun Ltd., London; Istituto di Linguistica
Computazionale del CNR, Università di Pisa; Sonovision ITEP Technologies, Paris; Vrije
Universiteit Amsterdam; Van Dale publishers, Utrecht; as consultants: Université de
Clermont-Ferrand, Den Danske Ordbog, Oxford University Press.
[23] This project is conducted jointly by the Centre for Cognitive Science (CCS) and Human
Communication Research Centre (HCRC), University of Edinburgh; the National Centre for
Mathematics and Computer Science (CWI), Foundation Mathematical Centre (SMC) in
Amsterdam; the Institut für Computerlinguistik, Universität Saarbrücken; and the Stanford
Research Institute (SRI) International, Cambridge Computer Science Research Centre,
Cambridge, U.K.
4.1.5 Corpora and Pragmatics
Up to now very few corpus-based studies have been carried out in
pragmatics. One of the reasons is that the usual design criteria for corpora
do not make them well suited to pragmatic analysis. I mentioned earlier that
most corpora try to be representative and so include rather short samples,
which are removed from their original social and textual context. A
pragmatic analysis, however, needs an unabridged version of the text in
order to be able to extrapolate all meanings and reactions in particular
situations. Another problem is that pragmatic elements cannot be
extracted from corpora by means of a simple concordance. The solution to
this problem suggested by McEnery and Wilson is that pragmatic
elements could be represented by some kind of matrix which associates
words with meanings and correlates them with effects or reactions. What
they advocate, then, is the creation of a linguistic database that is able to
bridge the gap between what is actually said and what is really meant.
Examples of Practical Applications
One example of a pragmatics-related study is described by McEnery and
Wilson (1996): Stenström (1987) carried out a study in which she looked
at ‘carry-on signals’ such as right, right o, and all right, and was able to
classify the use of these signals according to a typology of their various
functions. She found, for instance, that right, the most frequent locution,
had many functions and was very often used either as a response or to
both evaluate a previous response and terminate the exchange. All right,
by contrast, acted as a boundary between two stages in the discourse, while
locutions such as it’s alright and that’s right were used as responses to
apologies. On the basis of quantitative data, Stenström was then able to
infer that the use of carry-on signals in conversational English was
strongly linked to the channel used, i.e. telephone English (McEnery and
Wilson 1996).
A more ambitious project is being conducted by the ISSCO at the
University of Geneva and the Institut Dalle Molle pour les Etudes
Sémantiques et Cognitives. They use corpora of dialogues to identify
regularities in how the beliefs and intentions of the interactants are
reflected in language.
4.1.6 Corpora, Stylistics and Discourse Studies
While earlier applications of corpora were largely restricted to the study
of low-level grammatical and lexical phenomena, several recent projects
have gone beyond the word or sentence level and tried to identify generic
and textual patterns, taking into consideration pragmatic as well as
discoursal features.
Biber (1998) summarises the pros and cons of corpus-based
methodologies. While it is true that conventional concordancers are not
really able to identify discourse-level features, their usefulness can be
improved through interactivity, with the concordancer producing a fast and
reliable list of all the discourse characteristics the researcher wants to
identify, which are then checked by the researcher who decides whether
the items identified comply with the given specification.
Examples of Practical Applications
One possible application described by Biber uses concordancers to
track surface grammatical features over the course of the whole text and
to produce what Biber (1998:108) refers to as a “discourse map”, i.e. the
monitoring of the development of discourse patterns through texts. These
discourse maps can then be used as the basis for textual comparisons
aimed at finding typical patterns of text design in various genres and
registers.
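The idea of tracking a surface feature through the course of a text can be sketched as follows. This is not Biber's procedure, whose details are not given here, but a deliberately simple approximation: the text is cut into consecutive segments of equal length, and the proportion of tokens belonging to a chosen feature set is computed for each segment. The function name discourse_map, the segment size and the feature list are all assumptions.

```python
import re

def discourse_map(text, feature_words, segment_size=10):
    """Return the share of feature tokens in each consecutive text segment."""
    tokens = re.findall(r"[a-z']+", text.lower())
    features = set(feature_words)
    profile = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        hits = sum(1 for t in segment if t in features)
        profile.append(hits / len(segment))
    return profile

# Invented two-sentence text; the feature tracked is first-person pronouns.
text = "I think we agree. The data show results."
print(discourse_map(text, ["i", "we"], segment_size=4))
```

Plotting such a profile for texts from different genres would show where in the text a feature clusters, which is the kind of comparison the discourse maps are meant to support.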
A further example of the use of corpora in stylistics is given by Tim
Johns, who compiled a corpus of articles and letters from the scientific
research journal Nature from the year 1989. Johns examined in detail five
reporting verbs (indicate, show, suggest, find, demonstrate), which had
previously been identified on the basis of the formal criterion that they
were those most frequently followed by that-clause complements. Even
though other complements were not included, related nominals
(indication(s), suggestion(s), finding(s), demonstration) were taken into
account when they were followed by a that-clause. The main features of
the syntactic environment of each verb were then identified with the
assistance of the program MicroConcord (OUP). [24]
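The formal criterion used to select the verbs, i.e. how often a verb is immediately followed by that, can be approximated with a short Python sketch. It makes no claim to reproduce Johns's procedure or MicroConcord: the table of inflected forms, the function name that_clause_counts and the sample text are assumptions, and the sketch does not distinguish the verb form finding from the nominal.

```python
import re
from collections import Counter

# Hypothetical table of inflected forms for the five reporting verbs.
FORMS = {
    "indicate": ["indicate", "indicates", "indicated", "indicating"],
    "show": ["show", "shows", "showed", "shown", "showing"],
    "suggest": ["suggest", "suggests", "suggested", "suggesting"],
    "find": ["find", "finds", "found", "finding"],
    "demonstrate": ["demonstrate", "demonstrates", "demonstrated",
                    "demonstrating"],
}
LEMMA = {form: lemma for lemma, forms in FORMS.items() for form in forms}

def that_clause_counts(text):
    """Count, per verb, the forms immediately followed by 'that'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word, following in zip(tokens, tokens[1:]):
        if following == "that" and word in LEMMA:
            counts[LEMMA[word]] += 1
    return counts

# Invented sample text, not taken from Nature.
sample = ("These results suggest that the gene is active. We found that the "
          "rate fell. The data show a trend, which indicates that cooling "
          "occurred. Earlier work suggested that this was rare.")
print(that_clause_counts(sample))
```

A word-adjacency test of this kind misses that-clauses separated from the verb by other material, so in practice the hits would still need to be checked in a concordance.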
Another popular area has been stylistic variation analysis. A good
example is the research project carried out by Thomas and Short (1996),
which used a corpus-based approach to examine patterns of speech and
thought presentation in contemporary prose fiction and newspaper
reports. They compiled quite a small corpus (40 extracts of approximately
2,000 words each, giving a total corpus of 88,631 words), which was split
into four roughly equal parts. There are 10 extracts from each of the
following areas:
– ‘high’ literature (21,911 words);
– popular fiction (23,301 words);
– broadsheet newspapers (22,814 words);
– tabloid newspapers (20,605 words).
Contrary to other major approaches to corpus compiling, the corpus had
not been built for general research purposes, but for very specific tasks,
which were (Thomas and Short ibid.):
♦ to enable investigation of the similarities and differences in the
presence and patterning of speech and thought categories in literary
and non-literary texts;
♦ to investigate a linguistic phenomenon which is textual/discoursal; and
♦ to explore the possibilities for automatic parsing of texts for speech
and thought categories.
[24] For further information about the nature and the results of the analysis go to
http://sun1.bham.ac.uk/johnstf/Five_vbs.htm
Thomas and Short were able to test hypotheses in an empirical way as
well as to determine and quantify categories across text-types. They also
found that the use of corpora forced them to label every single part of a
text and not to ignore examples inconvenient to the theory. Further, the
presence of ambiguous codings (4.5% of the total) and frequent overlap
between categories underlined once again the importance of pragmatic
factors in assigning items to linguistic categories, indirectly supporting the
notion of a continuum in speech and thought presentation and so helping
to dismiss hard and fast categorisation.
Another example is the COLT project (Bergen Corpus of London
Teenage Language), which is the first large English corpus focusing on
the speech of teenagers. In a pilot version, information on the use of
linguistic items by different groups (age, gender, socio-economic class,
location, etc.) can be obtained.
In the future, as more researchers familiarise themselves with these
techniques and learn how to use them, we will hopefully be able to learn
more about patterns of discourse that hold across texts and registers.
4.1.7 Corpora, Language Teaching and Learning
Language teaching is another typical field of application of corpus
linguistics. The use of real life examples has many advantages: it exposes
students to real communicative situations at early stages of their language
comprehension process, provides an empirical basis for progression in
language learning and helps address individual students’ needs.
Mindt (1992, 1996) convincingly showed that many grammar textbooks
pay little attention to real life. Most grammar textbooks introduce future
time references and modal verbs quite late. This may present
considerable difficulties for foreign language learners who might not
understand native speaker utterances. He therefore argues that corpus
studies should be used to inform the production of teaching resources, so
that more common choices of usage are given more attention than those
which are less common.
Examples of Practical Applications
One example of how corpora can be used to address the individual
needs of students is the exploitation of LSP corpora. Major LSP corpora
available at the moment include the Guangzhou Petroleum English
Corpus and the Hong Kong University of Science and Technology
(HKUST) Learner Corpus [25].
Another way of tailoring language learning to students’ individual
requirements is offered by CALL (computer-assisted language
learning) programs. CALL’s primary objective is to create a user-friendly
computational environment that allows the student to exploit resources by
focusing on his or her specific needs.
A major application of corpora to CALL has been implemented by Tim
Johns, who designed user-centred didactic software for language
acquisition. Cloze and Contexts [26] enable the teacher to automatically
generate learning materials from any corpus s/he might have sampled.
These two tools not only represent a convenient way of introducing
concordance-based methods in language teaching, but they also help
students appreciate the advantages of concordancing, which might then
be introduced at a later stage. Further, student performance is logged in
text files, which can allow the teacher to monitor learning strategies. As
Johns points out, CALL methods allow every student to become a
“Sherlock Holmes” (Johns 1997:101), a ‘language detective’ who learns to
recognise and interpret clues from a context.
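How learning materials might be generated automatically from a text can be shown with a minimal Python sketch of a cloze exercise. It is not a reconstruction of Johns's Cloze program, whose internals are not described here: the function name make_cloze, the fixed gap interval and the sample sentence are assumptions.

```python
import re

def make_cloze(text, every=5, blank="_____"):
    """Blank out every n-th word; return the gapped text and the answer key."""
    words = text.split()
    answers = []
    for i in range(every - 1, len(words), every):
        # Keep trailing punctuation outside the gap.
        match = re.match(r"^(\w[\w'-]*)(\W*)$", words[i])
        if match:
            answers.append(match.group(1))
            words[i] = blank + match.group(2)
    return " ".join(words), answers

# Invented sample sentence standing in for a corpus extract.
sample = "Corpora give learners direct access to authentic language data."
exercise, key = make_cloze(sample, every=4)
print(exercise)
print(key)
```

A fixed interval is the simplest possible gapping rule; a teacher could equally blank out only words from a frequency list or from a given part of speech.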
However, CALL can also be applied to more specific areas of language
teaching and learning, as McEnery, Baker and Wilson (1995)
demonstrated. The researchers compared the performance of students in
part-of-speech analysis tasks, using corpus-based computer vs. traditional
human teaching methods. Their focus was on accuracy of participant
response over time. Of seventeen first-year English language
undergraduates who participated in the seven-week experiment, nine
were taught grammar via the traditional classroom-based “human teacher”
method, while the rest used CyberTutor, a corpus-based computer-aided
linguistic learning program. What McEnery et al. found out was that the
computer-aided group out-performed their human-taught counterparts in
terms of accuracy and number of words analysed.
[25] See appendix for further details on these two corpora.
4.1.8 Corpora and Ethnolinguistics
Ethnolinguistic studies have only recently begun to make use of
machine-readable corpora and few examples of practical applications are
as yet available.
Examples of Practical Applications
One notable exception is a study by Leech and Fallon (1992) in which
the authors investigated differences between British and American English
which might be attributed to differences in life-styles and cultural attitudes.
They first prepared frequency lists and then compared these. What they
found was rather suggestive: travel words, for instance, were more
frequent in American English, perhaps because of the larger size of the
United States. Further, words belonging to the domain of criminality and
the military proved to be more frequent in the American corpus, which
might support the theory of the American ‘gun culture’.
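The comparison of frequency lists can be sketched in Python as follows. Because two corpora rarely have exactly the same size, the sketch compares relative frequencies (occurrences per 1,000 tokens) rather than raw counts. The function names freq_list and compare and the two toy texts are assumptions; the figures they yield are not Leech and Fallon's data.

```python
import re
from collections import Counter

def freq_list(text):
    """Build a word frequency list from raw text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def compare(freq_a, freq_b, words):
    """Return (word, per-1,000 frequency in A, per-1,000 frequency in B)."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    rows = []
    for word in words:
        rel_a = 1000 * freq_a[word] / total_a
        rel_b = 1000 * freq_b[word] / total_b
        rows.append((word, rel_a, rel_b))
    return rows

# Toy texts invented for illustration; they stand in for two corpora.
corpus_a = freq_list("gun travel travel road")
corpus_b = freq_list("tea travel garden garden road road road road")

for word, rel_a, rel_b in compare(corpus_a, corpus_b, ["travel", "road"]):
    print(f"{word:<8} A: {rel_a:6.1f}  B: {rel_b:6.1f}")
```

A difference in relative frequency is only a starting point: whether it reflects a cultural difference, as suggested above, or merely the composition of the samples has to be argued separately.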
Another example of a possible application of corpora in ethnolinguistics
is the Austrian-based one-year project Racism at the Top [27], led by Ruth
Wodak and Teun A. van Dijk in cooperation with seven EU countries. The
overall aim was to investigate the role of top politicians in the reproduction
of racism and anti-racism in Europe's increasingly multicultural
societies. More specifically, Wodak and van Dijk
investigated how leading politicians write and speak about immigration,
minorities, and other 'ethnic' issues in Great Britain, France, Germany,
Spain, Italy, Austria and the Netherlands. The results of this project
yielded crucial insight into the ways leading politicians influence public
discourse and attitudes about ethnic issues.
[26] For further information about the Cloze and Contexts programs see Johns 1997 as well as Tim
Johns’s CALL page at http://sun1.bham.ac.uk/johnstf/timcall.htm
[27] Further information about the nature and the results of the project can be obtained from Teun A.
van Dijk’s home page at http://www.let.uva.nl/~teun/
Despite the rather restricted scope of the work so far, this seems a very
promising line of research which should be more closely integrated with
work in cultural studies.
4.2 The Use of Corpora in Translation
Translation practice has recently undergone major changes: quality
standards have improved in an attempt to meet increasingly stiff
competition, new text types and genres have been created, and language
technologies have been enjoying a major boost. One of the
consequences of this development is that translation can no longer be
taught merely in a hands-on, “learning by doing” fashion (if this was ever
possible), where translation exercises are used to improve language
proficiency. What is required is sound theoretical knowledge that will allow
students to identify the distinctive features of texts and to develop
translation strategies for the various situational contexts and translation
briefs. As professional translators they will also need to be familiar with all
the tools that can help them to fulfil this task. For this, they require
interdisciplinary knowledge. In a world where the translator can no longer
cope with the information overload resulting from contemporary
technologies, s/he will necessarily have to look in his/her bag of tricks for
new tools able to combine accuracy and rapidity.
While we may not agree with Gavioli’s suggestion (1996) that
translation should be considered as a non-standard LSP situation, she is
right in saying that translators (and translation students) are language
experts rather than specialists in a certain discipline who nevertheless
require a high degree of technical competence. The acquisition of such
specialised knowledge is not normally a major focus in the translation
course curriculum. Indeed, because it is not possible to foresee what
translators will be asked to translate during the course of their
professional careers, it may even be an ill-conceived move to offer
translation courses in too narrowly circumscribed specialist areas. A more
promising approach, I feel, is to focus on paradigms that are maximally
generalisable.
It is in this area that I feel corpora will prove most useful.[28]
4.2.1 Parallel, Multilingual and Comparable Corpora
In translation, corpora are being used for a wide range of analyses.
Amongst the most obvious applications is the use of comparable and
parallel corpora as reference material for terminological analyses.
Friedbichler et al. (1997), for instance, demonstrate how aligned parallel
corpora enable translators to pinpoint the terms, collocations or
colligations standardly used by target-language experts in a given
specialist area. Whether the translator then uses these standard
collocations or decides to adopt a foreignising translation (see Venuti
1995) is a different matter. What is important is that translators know
which terms/phrases reflect regular usage and that they therefore have a
real choice.
Parallel corpora can also be used to provide information on language-pair
specific translation behaviour, highlighting equivalence relationships
between lexical items or structures in source and target languages (Kenny
1998; Kenny, forthcoming). On a more general level, parallel corpora can
also be used to analyse the ‘side effects’ of the translation process, for
instance unconventional linguistic phenomena such as errors or interference
between similar languages. Comparable corpora[29] have been used, for
example, by Baker (1993:239-247) in a large-scale study aimed at
identifying ‘translation universals’. Drawing on Toury’s notion of laws of
translation, she and her associates have focused on identifying those
features that occur exclusively (or with a suspiciously low or high
frequency) in translated texts, which - provided they are not the result of
interference and have subsequently been confirmed by the analysis of
comparable corpora in other languages - can be considered as candidates
for translation universals (see also Laviosa-Braithwaite 1998 and Kohn
1996:46-48). Based on some preliminary studies, Baker’s hypothesis is
that translated texts tend to be more explicit, unambiguous, and
grammatically conventional than their source texts or original texts in the
target language. She further argues that translations also tend to avoid
source text redundancy and to exaggerate stereotypical characteristics of
the target language (see Baker 1993:243-245). Kenny (1998) found that
translations tend to display a low type-token ratio, low lexical density and
low sentence length vis-à-vis original texts in the same language, which
seems to support the hypothesis that simplification may be a translation
universal (see also Laviosa-Braithwaite, 1998).
[28] See http://www.sslmit.unibo.it/cult.htm
[29] Baker defines a comparable corpus as a collection of texts originally
written in a language, say English, alongside a collection of texts translated
(from one or more languages) into English (Kenny 1998).
While the notion of a language universal is not new (see Toury 1991;
Baker 1993:243-247), earlier research had to rely on manual analysis,
which proved very time-consuming. Corpus linguistics and modern tools
such as Scott’s WordSmith Tools allow for rapid processing of linguistic
patterns in vast quantities of text and produce comprehensive statistical
data.
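To give a rough idea of the kind of statistics involved, the following Python sketch computes two of the measures mentioned above, the type-token ratio and the mean sentence length. It is only an illustration under simplifying assumptions: the function names and the crude tokenisation and sentence-splitting rules are my own and are not taken from WordSmith Tools or any other package.

```python
import re

def tokenize(text):
    """Lower-case the text and return its word tokens (a deliberately crude rule)."""
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())

def type_token_ratio(text):
    """Number of distinct word forms (types) divided by the number of running words (tokens)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting sentences at ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

sample = "The cat sat. The cat ran away."
print(type_token_ratio(sample), mean_sentence_length(sample))
```

Since the type-token ratio falls as a text gets longer, such figures should only be compared across samples of similar size.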
Despite this obvious success of corpora in translation research, much
work still needs to be done. So far only a few researchers have dealt with
the problem of which type of corpus (parallel, comparable, multilingual,
monolingual) is best used in which kind of study to achieve optimal
results.
Another issue that is rarely addressed is how theoretical and applied
aspects could best be merged. The extent to which theoretical
investigations are expected to feed into translation pedagogy will in turn
affect the type of corpora used. That theory and applied translation
studies should be closely related was convincingly argued by James
Holmes. In his seminal work of 1972 entitled "The Name and Nature of
Translation Studies", which essentially laid the foundations of the
discipline, Holmes stated that translation practice should not be divorced
from theory and that the discipline should be receptive to developments in
other fields of study. It was his firm conviction that the theoretical
component, whose aim is essentially to describe the phenomena of
translating and translation and to establish general principles by means of
which these phenomena can be explained and predicted, should be
closely linked to applied areas of translation studies such as translator
training, foreign-language learning and translation criticism.
Multilingual parallel corpora provide an effective means of integrating a
theoretical component into the pedagogical material required for a
translator training course, helping to analyse translation universals and to
identify the tolerance factors that constitute "adequate" and "acceptable"
translations.
4.2.2 Machine Translation
MT systems have adopted a variety of different approaches and have also
evolved considerably since their beginnings in the 1950s, when
translation was considered to be essentially on a par with code-breaking:
the ‘first-generation direct systems’ tried to implement dictionary-based
direct replacement on the word level.
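The principle of dictionary-based direct replacement, and its main weakness, can be shown with a few lines of Python. This is a toy sketch: the glossary and the example sentence are invented for illustration and do not reproduce any historical system.

```python
# Hypothetical toy German-English glossary; a real system would need a large bilingual dictionary.
GLOSSARY = {"ich": "I", "habe": "have", "das": "the", "buch": "book", "gelesen": "read"}

def direct_translate(sentence, glossary=GLOSSARY):
    """Replace each source word by its dictionary equivalent, keeping the source word order."""
    output = []
    for word in sentence.split():
        # Words missing from the glossary are passed through unchanged.
        output.append(glossary.get(word.lower(), word))
    return " ".join(output)

print(direct_translate("Ich habe das Buch gelesen"))  # prints "I have the book read"
```

Every word is rendered correctly, yet the output keeps German word order, which is exactly the kind of problem that word-level replacement cannot solve.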
During the 1960s the basic techniques of word transfer were
revised and gave way to the ‘indirect method’: the transfer approach
involved structural analysis of the input text, a bilingual mapping at an
abstract level, and synthesis of the target text, while the interlingua
approach avoided the bilingual transfer stage and instead used a more
abstract universal representation. (Somers 1998)
The pyramid diagram depicted below and probably first used by
Vauquois in 1968 shows the essentials of MT systems: the deeper the
analysis, the less transfer is needed, the ideal case being the interlingua
approach, where there is no transfer at all.
[Figure: pyramid diagram of MT approaches, with the labels Source text, Analysis, Direct translation, Transfer, Interlingua, Generation and Target text. Source: Somers 1998:145]
Although much improved compared to the early systems, even
second-generation MT programs were unable to produce fully acceptable texts.
MT systems required comprehensive pre- or post-editing of texts to
ensure they fulfilled the end-user’s needs. This was generally considered
too time-consuming and so prevented the wider distribution of such
systems amongst translators and companies.
In recent years, a ‘third generation’ of MT systems has evolved which
tries to incorporate real-world knowledge. The new paradigm that is being
developed in MT research is called ‘artificial intelligence’ (AI). AI
researchers essentially agree that, in order to be able to ‘teach’
knowledge to an MT system, huge amounts of naturally occurring language
are needed: the source they need is obviously text corpora. Two major
examples of such ‘third generation’ MT systems which make extensive
use of corpora are example-based MT and statistics-based MT.
In example-based systems, translation is produced by comparing the
input with a corpus of typical translated examples, extracting the closest
matches and using them as a model for the target text. This is done in two
stages: ‘matching’ the input with examples, and ‘recombining’ the
target-language fragments extracted. This approach is considered to be more
like the way humans translate and its result is said to be more ‘stylish’[30],
since it is not solely based on the structural analysis of the input text.
(Somers 1998:148)
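The ‘matching’ stage described above can be sketched as follows. The example base, the similarity measure (a Dice coefficient over word sets) and all names are assumptions made for this illustration; actual example-based systems use more refined matching, and the ‘recombining’ stage is left out altogether.

```python
# Hypothetical example base of (source, target) pairs, invented for illustration.
EXAMPLES = [
    ("press the red button", "drücken Sie die rote Taste"),
    ("open the front cover", "öffnen Sie die vordere Abdeckung"),
    ("close the front cover", "schließen Sie die vordere Abdeckung"),
]

def dice(a, b):
    """Dice coefficient over the word sets of two sentences (1.0 means identical sets)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def closest_example(source, examples=EXAMPLES):
    """Matching stage: return (score, source, target) for the most similar stored example."""
    return max((dice(source, src), src, tgt) for src, tgt in examples)

score, src, tgt = closest_example("close the back cover")
print(score, src, tgt)
```

For the input "close the back cover" the closest stored example is "close the front cover"; a recombination step would then have to replace the fragment for "front" in the stored translation.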
Statistics-based systems are essentially a non-linguistics-based
technique. They attempt to translate purely on the basis of probabilities
calculated by considering millions of words of parallel (or comparable)
text, thus trying to determine lexical equivalents and target-language word
order. (ibid.)
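How lexical equivalents can be derived from parallel text without any linguistic knowledge may be illustrated with a minimal sketch. It assumes a sentence-aligned toy corpus (invented here) and scores word pairs with the Dice coefficient over sentence-level co-occurrence, a much simpler association measure than the probability models that real statistics-based systems estimate from millions of words.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned toy corpus (English, German), invented for illustration.
PARALLEL = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]

def cooccurrence_scores(pairs):
    """Score every (source word, target word) pair with the Dice coefficient."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_freq.update(s_words)                 # sentences containing each source word
        tgt_freq.update(t_words)                 # sentences containing each target word
        joint.update(product(s_words, t_words))  # sentence pairs containing both words
    return {(s, t): 2 * n / (src_freq[s] + tgt_freq[t]) for (s, t), n in joint.items()}

def best_equivalent(word, scores):
    """Return the target word with the highest score for the given source word, if any."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

scores = cooccurrence_scores(PARALLEL)
print(best_equivalent("book", scores), best_equivalent("house", scores))
```

Even on three sentence pairs the counts single out buch for book and haus for house; target-language word order would have to be modelled separately.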
Examples of Corpus-Integrating MT Systems:
SPARKLE
SPARKLE is one of several MT systems currently being developed at
various European centres that are designed to be context-sensitive,
making use of phrase-level syntactic analysis to solve the problem of
lexical clusters and disambiguation. Computational concordancing is used
to systematically examine words and phrases which occur in the proximity
of a given term. This approach has proved particularly successful in the
area of special-domain languages where the number of standardised
collocations appears to be more restricted. With more context-sensitive
MT software the output will require substantially less time for post-editing.
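The idea of examining the words that occur in the proximity of a given term can be sketched in a few lines. The function below is a generic illustration of window-based collocate counting and makes no claim about how SPARKLE itself is implemented; the function name, the window size and the sample text are assumptions.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens to the left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

text = "acute renal failure may follow acute heart failure or chronic renal failure"
print(collocates(text.split(), "failure", window=2).most_common(3))
```

In a special-domain corpus, the most frequent collocates of a term obtained in this way point to its standardised collocations.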
PARGRAM
The major goals of the PARGRAM project are the analysis and
encoding of important and most generally occurring syntactic structures in
German, and the development of parallel analyses for cross-linguistic
phenomena (e.g. binding, modification). The parallel nature of the
analyses is ensured through the concurrent development of German,
English, and French Lexical Functional Grammars (cf. the LFG websites
in Essex and Stanford). The researchers also strive for maximally broad
coverage, coupled with efficient processing. A spin-off of their work is that
they are accumulating extensive experience in the encoding of large
grammars.
[30] Literal translation is obviously an exception.
4.2.3 Translation Memory Systems
The development of AI-based MT systems was one of the responses
put forward by computational linguistics in an attempt to improve the
disappointing performance of MT. Another was to develop systems that
would support, rather than replace, the translator. The basic idea was to
create computer tools which were able to reuse previously translated
passages. Earlier translations are stored in a database – the so-called
translation memory – where sentences of the source text are aligned with
corresponding sentences of the target text.
Translation memory systems can be particularly useful if the
source-language text is an updated version of a document (for instance a
computer manual). When the translator starts to translate the new text in
the translation editor, the system automatically segments the source text
and looks up each segment in the translation memory database. If a segment
has occurred previously, the stored version is offered as a possible
equivalent. The output can then be accepted, amended or even rejected;
that is, the translator remains responsible for drawing analogies and for
structuring the target text during the translation process. (Freigang 1998)
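The segment-and-look-up cycle just described can be sketched as follows. This is a simplified illustration, not the procedure of any commercial workbench: the memory contents, the sentence-splitting rule, the similarity threshold and the use of Python's difflib for near matches are all assumptions.

```python
import difflib
import re

# Hypothetical translation memory: source segments aligned with stored target segments.
MEMORY = {
    "Switch off the printer.": "Schalten Sie den Drucker aus.",
    "Remove the paper tray.": "Entfernen Sie das Papierfach.",
}

def segment(text):
    """Split a text into sentence-like segments after ., ! or ? (a crude rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def lookup(text, memory=MEMORY, threshold=0.8):
    """Pair each segment with a stored translation, or None when nothing is similar enough."""
    results = []
    for seg in segment(text):
        if seg in memory:  # exact match: the stored version can be offered as it stands
            results.append((seg, memory[seg], "exact"))
            continue
        close = difflib.get_close_matches(seg, list(memory), n=1, cutoff=threshold)
        if close:  # near match: offered as a suggestion that still needs amending
            results.append((seg, memory[close[0]], "fuzzy"))
        else:
            results.append((seg, None, "none"))
    return results

for row in lookup("Switch off the printer. Remove the paper trays. Check the cable."):
    print(row)
```

In each case the decision to accept, amend or reject the suggestion remains with the translator.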
[Figure: screenshot of Translator’s Workbench for Windows by Trados]
Translation memories can therefore be considered as a special kind of
parallel corpus, which, in addition to its practical applications, also
provides interesting data for cross-linguistic studies and the study of
language use in translation in general.
While translators using Translation Memories[31] were originally restricted
to exploiting the texts they (or their colleagues) had translated, several
new projects are currently under way which try to improve the workbench
programs so that they can also draw on comparable corpora (for details
see Chapter 6).
[31] A very interesting project about translation support tools entitled “Linguistic Engineering for
Generation and Translation of Documentation” is being conducted at the Department of Computer
and Information Science of Linköping University. For further details see Ahrenberg et al. 1996.
4.2.4 Corpora vs. Termbanks
Until recently, building corpora was the privilege of a handful of specialists
in the field of language engineering. But since the advent of the new
media, most notably of CD-ROMs and the World Wide Web, the number
of electronic texts available has been increasing exponentially each year.
In the field of the medical sciences, for example, more and more
prestigious journals are publishing annual full-text collections of their
hard-copy issues on CD-ROM and the proceedings of many specialist
conferences are available in electronic form. Similar trends can be
observed in many other disciplines. As the benefits of concordancing rely
heavily on the quality and adequacy of the corpus, this means that it is
now becoming profitable for every professional translator working in a
specific domain to compile his or her own custom-designed domain-specific
corpus.
But why do we need domain-specific corpora when we have term
banks, some terminologists may wonder. First of all, corpus research is a
highly efficient tool for compiling more authoritative data banks.
Furthermore, bilingual data banks can be incorporated in the
corresponding domain-specific corpus, in which they would act as a pivot
between two unaligned source and target language corpora. In addition,
and this is the crucial point for professional translators, a well-designed
representative corpus is far richer and much more adaptable to the
various language queries a translator is confronted with. Experience has
shown that, once the initial learning phase has been overcome, finding the
proper terms - especially the more common ones which are likely to be
available from data banks - is an issue of decreasing importance, while
embedding the key terms in the appropriate idiom and hitting the
adequate domain-specific register, phraseology and style remain
time-consuming tasks in final-draft revision even for translators with
extensive experience. It is precisely in this latter context that professional
translators having a representative specialised corpus at hand will save a
considerable amount of referencing time and, at the same time, enhance
the quality of their translations.
4.2.5 Translation Teaching and Translation Research
Although an old adage has it that practice makes perfect, in translation
programmes this approach will often cause frustration, as learners are
told they need to improve their performance yet are not offered any
advice on how this could be achieved.
If we are to keep pace with new trends in translation and the
translation market, I believe more creative approaches are needed,
especially approaches which promote self-access study. Using corpora
seems to me one of the best ways of enhancing autonomous learning skills.
Silvia Bernardini (1997:3) also supports this view when she suggests
that activities involving self-access use of a large corpus for learning
rather than reference purposes may help students develop the skills and
strategies that are necessary complements to the translation task.
Bernardini summarises her starting point as follows:
I want to let learners find out for themselves the solution to a problem
they are (or are made) aware of, or the answer to a curiosity or
doubt. Besides, however, I also want them to develop procedures
and strategies which allow them to take maximum advantage of the
resources they have - in this case a large corpus - in order to
accomplish the task successfully and economically. Finally, I want
them to feel free to look around, to notice unexpected - or indeed
expected - phenomena, to deviate from their path in order to follow a
new one, or go back to the old one if the new one reaches a dead
end. Clearly, the aim here is not the acquisition of descriptively
adequate knowledge that, or competence, although this is a valuable,
and indeed likely, outcome of large-corpus concordancing. Instead,
what is at stake is the development of a number of skills that can be
grouped under the heading of knowledge how to, or capacity. In
other words, we focus on processes rather than products, on
methods rather than outcomes, on resourcefulness, awareness and
reflectiveness rather than learnedness (Bernardini, 1997:3; my
emphases).
Examples of Practical Applications
Lynne Bowker at Dublin City University similarly promotes the use of
corpora and corpus tools in the translation class. In an experiment she
conducted with a group of final-year students she convincingly proved that
the quality of the translations produced by the students substantially
improved (both with regard to comprehension errors, specifically errors
resulting from a lack of comprehension of the subject field, and production
errors, including wrong choice of term, un-idiomatic constructions,
grammatical errors, and incorrect register) when a target-language
comparable corpus was used.
A similar experiment was carried out by Federica Scarpa at the
Università per Interpreti e Traduttori of Trieste with a group of final-year
Italian students. The study was carried out on the section of the corpus
consisting of original English texts and their translations into Italian and,
conversely, of original Italian texts and their translations into English.
Concordance queries were undertaken at different levels of "delicacy": at
the word-level the focus was on alerting students to basic translation
problems such as "false friends" (e.g. prima facie equivalents such as in
fact and infatti, eventually and eventualmente), and, at the paragraph
level, the students investigated the different strategies used to signal the
same pragmatic feature in the two languages (e.g. the greater
grammaticalisation of modality in English compared to Italian, where
modal functions of auxiliaries have been taken over to some extent by
other items). Scarpa stresses that this type of activity discourages a
word-to-word approach to translation and enhances the critical awareness of
the students, often disturbing received ideas such as the assumption that
published translations must be accurate.
A very specific application is described by Robert Spence. He carried
out an experiment in corpus-based translation teaching at the
Fachrichtung 8.6 (Angewandte Sprachwissenschaft sowie Übersetzen
und Dolmetschen) of the Universität des Saarlandes. Spence analysed
two text corpora: the first a corpus of 100 student translations of a short
news report on world population growth and the second a corpus of 37
student translations of a tourist guide to the Chamber of the House of
Commons. Most of the translations were done by German native speaker
students. The texts included in the first corpus were assessed for errors,
which were then classified according to their likely origin (in relation to the
metafunctions, strata and ranks of the systemic functional model of text as
instantiation of “meaning potential”) and in terms of their likely effect (on
the “usability” of the translation). In analysing the second corpus, the
focus was on the relation between register, genre and ideology, and on
the role of microregisterial variation as a tool for identifying genre-specific
text structures. The experiment had three main aims:
♦ to investigate the phenomenon of Learner English, and in particular
the phenomenon of L1 (and possibly also L3) interference, in a highly
constrained text-creation environment (i.e., in relation to translation
rather than free composition);
♦ to explore didactic applications of corpora of student L2 errors in the
context of an undergraduate course in translation;
♦ to ascertain the feasibility of using such corpora in interaction with a
multilingual systemic functional computational generative grammar
and parser as part of a future computer-aided approach to the difficult
task of "learning to translate".
The primary role of corpora in cross-linguistic research has also been
advocated by Stig Johansson, who recently examined the agreement
between bilingual dictionary entries and the correspondences observed
in the corpus material. By comparing the Norwegian modal particle nok with
its English counterpart probably (1998:13), he successfully shows that
corpus-based analysis gives a far richer picture of correspondences
across languages than dictionaries do. A comparison like the one carried
out by Johansson gives new insight into translation and provides a new
perspective on the languages compared.
In the same paper, Johansson (ibid.:16) also shows very clearly how a
linguistic context can affect the meaning of words. The general
conclusions he drew from the analysis of the English noun mind, for
instance, are that English and Norwegian tend to refer to mental
processes in different ways and that correspondences are highly sensitive
to context. His study demonstrates that there is no single preferred
Norwegian counterpart of the English noun mind, and in approximately
half of the cases Norwegian opts for a form without a corresponding noun.
(Johansson ibid.:18)
A detailed example of one possible application of corpora for
translational purposes is presented by Margaret Rogers (1997). In her
study of synonymy and equivalence in German and English
special-language texts, Rogers considers the linguistic behaviour of two sets of
potential synonyms in English and German from the domain of genetic
engineering, based on a corpus of texts aimed at a scientific but not
necessarily expert readership. The analysis resulted in a number of
constraints which are of relevance to translators as text creators. Her
study also showed that translators should not merely rely on dictionaries,
which often present synonyms as decontextualised lexemes, but aim
primarily to spot - by means of corpus exploitation - possible relations of
overlap and exclusion which are neither logically predictable nor amenable
to standardisation procedures.
Guyda Armstrong (1996) also contributed to the introduction of the
corpus-based approach into translation (studies). In order to prompt
students to investigate the development of Machiavelli’s political thought,
Armstrong, a teacher of Italian at the University of Edinburgh, had
second-year translation students run Machiavelli’s[32] Il Principe through the
TACT program (see Appendix 3). The exercise focused on the key
concepts of virtù (‘prowess’), fortuna (‘fate’) and the associated concept of
prudenzia (‘caution’). The students investigated these words using various
(corpus-based) methodologies, first analysing the single words
individually, before moving on to more sophisticated collocational
searches which included all three items. Finally, the students were asked
to compare the distribution graphs of all three words and draw some
general conclusions about the overall structure of Il Principe. In the end,
students were able to recognise Machiavelli’s lexical choices and the
meaning he assigned to them. Armstrong (ibid.), however, points out that
the corpus-based approach does not compensate for inadequate
preparation, but offers a possibility of looking at the text from a new
perspective, maybe discovering unexpected leads which can be followed
up elsewhere (e.g. by means of an etymological search, a study of the
development of Machiavelli’s political concepts, or an analysis of synonyms
and associated words). This kind of work not only supports the translator’s
lexical choice in the target language, but can also be of help for students
of political science and history, therefore promoting the interdisciplinary
use of corpora.
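The data behind a distribution graph of the kind the students compared can be obtained with a very small routine. The sketch below is a generic illustration and not the method used by TACT; the function name, the number of slices and the toy token list are assumptions.

```python
def distribution(tokens, word, parts=10):
    """Count how often `word` occurs in each of `parts` consecutive, roughly equal slices."""
    counts = [0] * parts
    for i, token in enumerate(tokens):
        if token == word:
            # Map the token position onto a slice index between 0 and parts - 1.
            counts[i * parts // len(tokens)] += 1
    return counts

tokens = "a b a c a b c c".split()
print(distribution(tokens, "a", parts=4))  # prints [1, 1, 1, 0]
print(distribution(tokens, "c", parts=4))  # prints [0, 1, 0, 2]
```

Plotting such counts for several words side by side shows where in a text each concept is concentrated.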
In a very recent study of the translation of the German modal
particle doch into French, Feyer (1998) started from the general
assumption that all nuances concealed behind German modal particles
can basically be expressed in other languages as well. In order to
focus on the problems encountered when translating such linguistic items,
Feyer (1998:118-124) decided to compile a corpus of written literary texts
including a large amount of spoken language, since this kind of artistic
production was felt to provide the optimal test-bed for a contrastive
analysis. The corpus included major works by Austrian and German
authors (Bernhard, Böll, Dürrenmatt, Horváth, Schneider and Konsalik) as
well as their translations into French. Feyer believed that the very
expressive writing skills of these ‘word-jugglers’ would be challenging
enough for a translator to interpret. In her detailed analysis of linguistic
and cultural patterns of the modal particle doch, she convincingly shows
that there is almost no lexical correspondence between German and
French, but also that the meaning can generally be conveyed in various
ways, depending on the kind of sentence one is dealing with. (Feyer
1998:130-259) On a more general note, her study demonstrates that the
translator remains the one in charge of deciding which TL solution to opt
for, making clear that lexical and semantic variation must not be
confused with inaccuracy. Indeed, variation is sometimes even necessary
in order to render the translation culture-specific. (Feyer 1998:279) Feyer,
then, supports the interpretative and creative side of translation, giving an
account of how it is possible to assess both the role and the
behaviour of translators, analyse word structures, and develop clever
translation strategies by means of a corpus-based approach to
translational issues.
What all these studies have in common is a strong belief in the
necessity to design translation training courses that focus on processes
rather than products, mainly because it is impossible to teach translation
trainees all the words or acquaint them with the entire range of texts that
they will be confronted with in their professional lives. What they therefore
need are strategies that will help them cope with new terminologies and
with unfamiliar genres and their conventions.
Corpora are seen as tools that allow trained learners to:
♦ solve problems on their own, using the available resources, which
should also boost their creativity and resourcefulness, since they will
need to learn where to look for solutions to a given problem that may
arise in the course of the translation task
♦ develop greater awareness of culture-, situation-, genre- and
text-dependent language use
♦ improve their ability to cope in new situations
♦ develop the technical skills necessary for efficient corpus use, such as
computing and logical skills
[32] Machiavelli, one of the most prominent and linguistically complex politicians of the 16th
century, used to assign new meanings to old words, which resulted in his works being sometimes
misunderstood by later generations.
4.2.6 Thinking Globally – Acting Locally
Corpus-oriented studies are going global. Before the advent of the
corpus-based approach, the major fields of linguistic study (e.g. grammar
and lexicography) were normally strictly separated. Corpora allow
scholars to tackle different tasks simultaneously, and thus to unite and
integrate different fields of research and approaches. This has improved
descriptive cross-linguistic research and has therefore led to more
comprehensive and coherent language descriptions. (See also Johansson 1998:21)
A greater ‘globality’ - i.e. closer integration of different disciplines - is
being promised by the multimedia technologies. Multimedia technologies
permit the integration of both spoken and written language - two research
fields between which there has traditionally been little cooperation - as
well as non-verbal data. The fact that all the different types of data would
be stored, analysed and described on a single platform would immensely
improve the representation, manipulation and retrieval of corpus data (see
also McEnery and Wilson 1996:173). A truly multimedia corpus would for
instance allow users to switch between a section of transcribed text and a
segment of a video recording showing the interaction, which could then be
annotated on many different levels (e.g. transcription, grammatical
analysis of the text, on-line notes describing the social background of the
speaker, analysis of the sequence in terms of its discourse structure, an
ethnographic description of the context, a detailed analysis of non-verbal
elements, etc.).
In the excitement over the vast potential of improved interdisciplinary
research we must not forget, however, that specific research projects still
require tailor-made solutions. Over the last two decades many (kinds of)
corpora have been made available to the international research
community. As researchers, our role is to identify our needs and to exploit
linguistic resources accordingly, and not merely to assume that sampling
criteria and parameters that were outlined for other projects will be
applicable to our own studies. In other words, corpus exploitation very
much depends on a 'think globally - act locally' philosophy.
4.2.7 Critical Comments
While recent years have seen a considerable increase in the number
of corpus-based investigations in translation, not all translation theorists
are convinced that corpora can really provide all the solutions, and some
have sounded a note of caution (Malmkjær, forthcoming, quoted in
Kenny 1998:53).
One point that has been made is that corpus exploitation is mostly
statistics-oriented, that is, its advantages can be fully understood only by
translators au fait with computational linguistics. While I agree that
knowledge of statistics is necessary if the use of corpus-based research
results is to be maximised, I cannot concur with the critics’ claim that one
needs to be a computational linguist to decode and interpret statistical
information. If more corpus-based work were to be used during translation
training, what would of course need to be done is to include a more
comprehensive introduction to computational issues in the translation
curriculum. Provided the focus was on enhancing awareness of the
translation process and end users’ needs, this would also help students
develop a perception of translational skills not merely as a means to a
(highly practice-oriented) end, but as something that should be analysed
and discussed from a more theoretical perspective. (See also Kohn
1996:48)
A further drawback has been mentioned by Kenny (1998:53). Referring
mainly to the use of comparable corpora in literary translation research,
she found that, because new genres are often introduced from one literature
to another, there was nothing comparable in the “host literature”.
The same problem may arise in non-literary genres in less widely used
and taught languages. A case in point is Irish Gaelic, where many
(non-literary) genres are modelled on English so that there are no ‘native’ texts
with which to compare translations. (Kenny ibid.)
One further point of criticism often mentioned is that corpus linguistics
has traditionally applied a strict bottom-up approach. Data were collected
and statistically evaluated before any theories about generalisable usage
patterns were proposed. Most translation theorists, however, have
adopted a top-down direction. Theories were drawn up, and only later
tested against real-language evidence.
While it has been convincingly shown that the two approaches are not
mutually exclusive and may well complement each other (see Aston
1997:2; Chafe 1992; Leech 1991; Svartvik 1992; Kohn 1996:48), it also
seems that many translation theorists are reluctant to engage in
corpus-based research, possibly because this would imply that their theories
would have to be restricted to a fairly narrow domain, whereas traditional
translation theories (e.g. those of Reiss and Newmark) seem to have been
all-inclusive and promulgated for all translational events.
4.2.8 Conclusions
I hope that the examples described in this chapter have shown that
corpora and the programs available to exploit them are immensely useful
tools for translators. Once translators - and translation students -
understand the many different types of analyses they can carry out with
and on corpora, the ease and efficiency with which such investigations
can be conducted should provide sufficient impetus to make them
interested in issues that go beyond purely practical applications.
Comparisons of larger bodies of texts and their translations should also
encourage them - as Venuti puts it - to be ‘suspicious’, and to query the
transparency of translations. They should make them eager to find out
what is really concealed behind the word, what the author of the text really
wanted to express, and what strategies the translators employed in their
efforts to render this meaning in the target language.
Apart from inspiring such theoretical interests, work with corpora -
and the elaboration of translation strategies this permits - obviously also
allows translators to keep up with current developments in language
production, and therefore ensures both high quality and productivity.
Indeed, I feel that corpus-based work is the only way to ensure this.
_________________________________________________________________
5
Case Study
This chapter tries to demonstrate how corpora might be used by
translators and translation students to solve a specific linguistic problem.
The principal aim of this chapter is to show what difficulties they might
encounter when trying to select a suitable corpus and how initial
hypotheses may have to be revised following some pilot analyses. It also
shows some of the limitations of corpus-based work, and sounds a note of
caution regarding the validity of its results. The focus is on procedural
issues; other potential applications are described in the previous chapters.
5.1 The Problem
When I tried to decide which kind of case study would provide the most
suitable framework to show the kinds of problems that may arise during
the investigation of linguistic patterns, I was at first very tempted
to replicate one of the studies that have been carried out within translation
studies. However, given the limited scope of an undergraduate thesis, and
the likelihood that most readers of this thesis are unfamiliar with corpus
tools, I
decided that a more limited case study that focussed on a clearly defined
linguistic problem would be better able to demonstrate the pros and cons
of corpus applications.
The linguistic problem that I then chose to analyse was the difference
between the use of the prepositions tra and fra in Italian. As a native
speaker I have often been asked by fellow-students which they should
use in which contexts. Generally, I was able to tell them which I preferred,
yet when asked why, my explanations rarely went beyond “the other one
does not sound right”.
In this chapter, then, I shall first present my own hypothesis about the
use of tra and fra in the Italian language and describe the reasons why I
want to describe their usage patterns in spoken Italian (Section 5.2).
Section 5.3 will describe the corpus which was used to test my
hypothesis, while Section 5.4 will deal with the tools used in the analysis.
In the light of the results of a trial run, I shall then reformulate the claims
as to the validity of the study (Section 5.5). The actual study will be
presented in Section 5.6, and in Section 5.7 I shall offer possible
interpretations of the findings and some concluding remarks.
5.2 Formulation of the Hypothesis
In the Italian language, tra and fra are considered to be synonymous
prepositions which basically indicate:[33]
♦ a relation between (or among) two or more people or things, as in fra
le due possibilità (between two possibilities) or tra fratelli (among
brothers)
♦ a position (in the middle of, amid, amidst), as in tra la folla (in the
middle of the crowd)
♦ a movement (through), as in il sentiero s’insinuava fra i monti (the path
wound through the mountains)
♦ a time reference (in, within), as in tra due giorni (in two days’ time)
Most people would maintain that the two prepositions are fully
synonymous and totally interchangeable. A quick collocation search of a
corpus[34] of 289,426 tokens, however, produced 452 hits for tra and only
96 hits for fra, which suggests that there is a degree of preference for the
former preposition.
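The raw counts above can be put on a common scale by normalising them against corpus size. The following is a minimal sketch in Python, not part of the original study: the figures are those quoted in the text, while the helper name per_10k and the choice of a 10,000-token base are assumptions made purely for illustration.

```python
# Counts quoted in the text; the 10,000-token base is an illustrative choice.
CORPUS_TOKENS = 289_426
HITS = {"tra": 452, "fra": 96}


def per_10k(hits: int, corpus_size: int) -> float:
    """Relative frequency expressed as occurrences per 10,000 tokens."""
    return hits / corpus_size * 10_000


if __name__ == "__main__":
    for preposition, hits in HITS.items():
        rate = per_10k(hits, CORPUS_TOKENS)
        print(f"{preposition}: {hits} hits, {rate:.2f} per 10,000 tokens")
    # Ratio of the two raw counts: how strongly tra is preferred over fra.
    print(f"tra/fra ratio: {HITS['tra'] / HITS['fra']:.2f}")
```

Normalised figures of this kind make it easier to compare the preference for tra with counts drawn from corpora of a different size.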
My native-speaker hunch has always been that the use of fra is
motivated by phonological constraints: fra, I believed, was used to avoid
cacophonous repetitions, especially, I conjectured, in speech that was
trying to sound more accomplished.[35] If my hypothesis was correct, the
case study would produce regular cotextual patterns that would show
which preposition was preferably used in which environment.
[33] The information included has been gathered from numerous grammar books (e.g. Krenn 1996,
Renzi et al. 1995, Dardano and Trifone 1985, Salvi and Vanelli 1992, Levi and Dosi 1982) as
well as from a dictionary of frequency of contemporary Italian language (Bartolini et al. 1971).
5.3 Selecting the Corpus
To test the hypothesis formulated in the previous section I needed to
find a corpus which was able to provide data suitable for a qualitative
analysis. Since corpus compilation is one of the most difficult tasks, I
thought it was vital to ask more experienced people which kind of corpus
they felt would be the most appropriate for my purposes. I therefore
posted a message to the ICAME mailing list ([email protected]). My
query was answered by two subscribers who suggested two different
ways of compiling a suitable corpus. One was Ralf Steinberger, who
works as a researcher at the Joint Research Center of the European
Commission. He suggested the following:
Dear Andrea,
I can think of two sources for Italian corpora:
1) The ECI corpus, obtainable at ELRA
(http://www.icp.inpg.fr/ELRA/cata/tabtext.html).
2) You can download Italian texts from the European Union web
sites, as many texts exist in all official EU languages. This is a bit
tiresome, but if you only need 200.000 words, you can do this in less
than half an hour. One possible site is: http://eu
The latter source is quite EU-biased, of course, so it is certainly not
literature.
For prose, you may find something at the Oxford Text Archive, but I
do not know their internet address. Maybe it is http://www.ox.ac.uk/...
Good luck,
Ralf
(Source: private e-mail correspondence)
[34] See Section 5.3 for a detailed description of the corpus mentioned.
[35] Some support for my hypothesis is found in the following quote:
“Queste ragioni di eufonia diedero qualche pensiero al Manzoni che, adeguandosi
anche in questo particolare all'uso fiorentino del tempo, sostituì i fra della prima
edizione dei Promessi Sposi con tra: nel capitolo IX, dove aveva scritto "fra tre o
quattro confidenti", per evitare il brutto tra tre, "se l'è cavata correggendo: 'tra
quattro o cinque confidenti'. Sennonchè le cifre non sempre son così elastiche come
erano per sua fortuna qui!” (D'Ovidio 1933:102, quoted in Serianni 1989:299)
The second was Elisabeth Burr, a lecturer at the Romance Languages
Department of the Gerhard Mercator University at Duisburg, who
mentioned the possibility of on-line research via tactweb:
Dear Andrea,
I have created two corpora of Italian newspaper language. Part of
one of them (ca. 750.000 words) is available via the Oxford Text
Archive for teaching and research. You could, however, also use my
tactweb page and do your study online. Have a look at:
http://www.uni-duisburg.de/FB3/ROMANISTIK/PERSONAL/Burr/burr.htm
You'll find a link from there. The part of the corpus which is online
contains about 75.000 words. In the near future I am planning to put
more material on-line for a seminar I am teaching. So if you can wait
a bit longer, you might be able to get enough material together. The
part which is on-line already and what I am going to put there is not
POS-tagged, however. I have done some POS-tagging but it still has
to be corrected.
All the best for your research
Elisabeth Burr
(Source: private e-mail correspondence)
These sources, however, did not really meet my needs: the use of the
ELRA corpus - as well as most of the material from the Oxford Archive - is
subject to a subscription fee, while an on-line corpus can be neither
downloaded nor exploited by means of collocation software other than the
built-in search engines, which again did not provide the kind of information
I was interested in.
A further source of data was suggested to me by Guy Aston, Associate
Professor of English Linguistics at the Scuola Superiore di Lingue
Moderne per Interpreti e Traduttori of Forlì, who mailed me concordances
of tra and fra from the LIP Corpus as well as a wordlist. While his material
was very useful, unfortunately, I was unable to gain access to the entire
LIP corpus, so I had to opt for yet another source of electronic texts, the
Associazione Liber Liber homepage (http://gsi.it/LiberLiber/index.htm).
This choice was mainly due to the fact that this copyright-free source of
data allowed me to compile my own corpus by selecting only those texts
that I considered appropriate for my study. A further reason was that the
Liber Liber Association collects both transcribed spoken texts and literary
masterpieces by major Italian - and, exceptionally, non-Italian - writers,
which promised to be extremely interesting.
After much further pondering of which type of texts I should choose, I
decided to sample the transcribed spoken subcorpus, since I assumed it
would be closer to the language used by Italians in unplanned
interactions, albeit in a formal setting. It also seemed unconstrained with
regard to the lexicon and syntax used and could therefore be assumed to
contain different stylistic registers.
The texts I first selected for my corpus comprised all the transcribed
records of the Commissione Parlamentare Antimafia (Parliamentary
Commission against Mafia Crimes) which were made available to Liber
Liber on 15th May 1995. Since this corpus comprised over 1.6 million
words,[36] it proved unmanageable for my concordancer. I therefore had to
select a smaller sub-corpus. The problem that arose at this stage was
how representativeness could be ensured in this small sub-selection.
In order to overcome this obstacle, I resorted to a little trick. I
introduced an extra variable: only the hearings chaired by Tiziana Parenti
were included (in total 28 hearings, and 422,590 tokens). This of course
makes the study less representative; however, I felt that even this ‘limited’
representativeness was sufficient for the purpose of this study.
Another basic problem I was faced with was that of POS-tagging.
Although a tagged corpus would have offered me the chance to look at
my corpus from a statistical point of view as well, I decided not to have it
tagged. The main reason for my decision was that the kind of analysis I
wanted to carry out did not require syntactical information or lexical
categorisation. The meanings of both prepositions can easily be
extracted from any Italian monolingual dictionary or grammar book. I was,
as I stated above, primarily interested in differences in usage patterns
between tra and fra in the spoken language.
[36] This larger corpus of 1,676,863 tokens was first posted to the ICAME mailing list, and then put
on the net for public availability (http://www.bhak-bludenz.ac.at/mdgrosse). The corpus I decided
to exploit for my purposes, then, could well be defined as a ‘trimmed’ version of this general
corpus.
5.4 Choosing the Tools
Once the corpus was compiled, I proceeded to the selection of the
tools for its exploitation.
At present, the two most comprehensive concordancing programs
running in a Macintosh environment are Conc 1.80b3 and SysConc 2.5.
Both text browsers load the entire text into memory for processing and
can therefore handle only relatively small corpora, which, as explained
above, was the main reason why I resized my corpus. Conc is a
statistics-oriented research concordancer developed in 1996 by John Thomson at
the Summer Institute of Linguistics in Dallas. It is very fast, and produces
both KWIC concordances and indices (see Appendix for further details).
SysConc has been developed by Christian Matthiessen and Canzhong
Wu, respectively Associate Professor and Research Assistant at the
Natural Language Laboratory of the Speech, Hearing and Language
Research Centre of the Department of Linguistics at Macquarie
University, Sydney. Although SysConc cannot browse text files bigger
than 2 megabytes, that is to say about 300,000 tokens in MS Word
format, its output is much better structured (e.g. through bar
graphs, frequency maps and hierarchies) than Conc’s, which means that
regular patterns of language use may be spotted more easily. It also
supports collocational searches (searches for two items within a
pre-established collocational range, with or without wildcards) and feature
searches (searches for a number of items, with or without wildcards), with the
possibility of highlighting irregular verb forms.
Although some of Conc’s features were also interesting, such as its
split-screen display of text and concordance, the potential of SysConc
as well as its user-friendly interface convinced me that it was more suitable,
and I chose it for my analysis.
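For readers unfamiliar with concordancers, the idea behind a KWIC (key word in context) display can be illustrated with a few lines of Python. This is only a sketch of the general technique and makes no claim about how Conc or SysConc work internally; the function names, the simple letter-based tokeniser and the sample sentence are all assumptions introduced for the example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Very rough tokeniser (an assumption): lower-cased runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def kwic(tokens: list[str], keyword: str, width: int = 4) -> list[str]:
    """Return one line per hit, with `width` tokens of cotext on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left cotext so that the keywords line up.
            lines.append(f"{left:>40} [{token}] {right}")
    return lines


if __name__ == "__main__":
    sample = (
        "Il tema del rapporto tra economia e finanza; "
        "una questione di rapporti fra Governo e Parlamento."
    )
    tokens = tokenize(sample)
    for line in kwic(tokens, "tra") + kwic(tokens, "fra"):
        print(line)
```

Like the two programs described above, this sketch holds the whole text in memory, which is the reason why such tools can only handle relatively small corpora.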
5.5 Summarising the Restrictions
Before I start the actual analysis I shall summarise the main issues
discussed above:
♦ This analysis focusses on differences between tra and fra in spoken
Italian. It does not attempt to produce statistical data, or data that will
hold true for all modes (spoken and written) and all text types and
genres.
♦ As far as the size and the representativeness of the corpus are
concerned, it has to be admitted that this study cannot be considered
a deep analysis of these two Italian prepositions. Nonetheless, its
results can still provide quite significant information: a frequency of
452 hits in a corpus of 300,000 tokens is sufficiently high to provide a
good basis for hypothesis testing. Moreover, the fact that the guiding
principle in text selection was maximum consistency (achieved by
including only the hearings chaired by Tiziana Parenti) should also
ensure maximum corpus validity.[37]
♦ Another problem, which I have not yet mentioned because it is not
directly related to sampling criteria is the question of whether or not a
collection of transcribed hearings can be considered as true
representations of spoken language. The corpus I used seems to be
heavily normalised: common features of spoken language such as
pauses, interruptions and false starts have been edited out. Despite
these shortcomings, it still appears to be an accurate enough reflection
of spoken language in a formal setting.
[37] Of course, homogeneity is a double-edged weapon: while the data become more credible, the
findings can no longer be generalised unconditionally, as this might result in a misleading
description of language use.
5.6 The Study
Essentially, the case study aims to prove two sub-hypotheses:
♦ that tra and fra are synonyms
♦ that avoidance of cacophony is the primary factor determining their use
in spoken Italian.
In this section, I shall first attempt to address the question of
synonymity, and then discuss some points that support my second
hypothesis.
5.6.1 Synonymity
In order to show that the two prepositions are synonymous, I
searched the corpus for similarities in the cotexts in which
fra and tra occur.
As far as syntax is concerned, even a rather superficial analysis of the
frequency tables produces interesting results: the cotexts of tra and fra
are very similar.
[Figure: Frequency table of the Italian preposition tra]
[Figure: Frequency table of the Italian preposition fra]
The frequency tables show the number of tokens of all words that
collocate with each of the two prepositions, as well as the right-hand and
left-hand collocates. As can easily be gathered from the tables
reproduced above, the same grammatical classes precede and follow
tra and fra: the first collocate on the left is mostly a noun, while the first
collocate on the right is normally an article or a pronoun.
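A frequency table of first left-hand and right-hand collocates of the kind described here can be approximated in a few lines of Python. The sketch below is an illustration of the principle only, not the SysConc procedure used in the study; the function name and the toy token list are assumptions, and no part-of-speech information is involved, since the corpus was not tagged.

```python
from collections import Counter


def adjacent_collocates(tokens: list[str], keyword: str) -> tuple[Counter, Counter]:
    """Count the words immediately to the left and right of each hit."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return left, right


if __name__ == "__main__":
    # Toy data (hypothetical); a real run would use the tokenised corpus.
    tokens = "il rapporto tra i paesi e il rapporto tra le forze".split()
    left, right = adjacent_collocates(tokens, "tra")
    print("left collocates: ", left.most_common())
    print("right collocates:", right.most_common())
```

Running the same function for tra and for fra and comparing the two sets of counters gives the raw material for the comparison of cotexts made in this section.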
Similarities in the semantic structure, on the other hand, cannot be
extrapolated from a simple frequency list. Even if frequency tables
contain various hints, a more detailed analysis of collocations is required.
After a first analysis of left-hand and right-hand collocates, the hypothesis
that the two prepositions are fully synonymous seems to be supported:
quite a number of nouns, including rapporto/i, distinzione, collegamento,
coordinamento, are followed as often by tra as by fra.
...che il tema del rapporto tra criminalità organizzata ed effetti...
...approccio con il grande tema del rapporto tra economia, finanza e...
...una ricognizione sul tema del rapporto fra mafia ed enti locali…
...ma è una questione di rapporti fra Governo, Parlamento e…
...almeno qui, avessimo chiara la distinzione tra Governo e Stato...
...tutti assieme, senza distinzione tra maggioranza e opposizione...
...la giusta distinzione fra i pubblici ministeri è evidente che esisterà...
...riguardano: la distinzione fra intermediari finanziari ed i soggetti...
...come "ufficiale di collegamento" tra i paesi dell'Unione europea e...
...che bisogna creare un collegamento tra istituzioni governative e...
...daremo avvio ad un collegamento fra tutti i paesi amici per...
...era emerso alcun collegamento fra queste persone e la criminalità...
...occuparsi del coordinamento tra l'azione dello Stato e quella svolta...
...che ha compiti di coordinamento tra gli enti governativi e quelli non...
...possibilità che il coordinamento fra le forze di polizia possa essere...
(Source: Mafia Corpus, 1998)
The only problem that remains is this: if the two prepositions are true synonyms
and interchangeable in all contexts, why then are there 452 occurrences
of tra and only 96 occurrences of fra in the corpus? If the difference
between them is neither semantic nor syntactic, what motivates their
choice?
My second hypothesis is that the use of fra and tra is guided by
phonological constraints. This hypothesis will be tested in the next
section.
5.6.2 Cacophony
Before entering into a detailed discussion of cacophony, I shall give a
very brief introduction to some basic phonological concepts:
The first consonant in fra is a labio-dental fricative; the initial consonant
in tra is a dental plosive. The concatenation of identical sounds is
generally considered to be cacophonous in Italian, while the alternating
use of fricatives and plosives is seen as more euphonious.[38]
[38] See also Serianni 1989:298-299.
This hypothesis is borne out by the following examples in my corpus:
...per proporre intese fra tutti i paesi per arrivare ad una armonizza...
... dopodomani, la prossima settimana, fra due settimane e fra tre mesi...
...Questi casi, fra l'altro, sono apparsi su tutti i giornali...
... si articola lungo più direttrici tutte fra loro strettamente connesse...
...l'effettivo isolamento del detenuto. Fra questi si annoverano quelli...
... mafiose, prime fra tutte le attività economiche e finanziarie....
...di infiltrazioni, di relazioni fra settori economici, istituzionali,...
... riguarda i rapporti intrattenuti fra i detenuti ed il mondo esterno...
... nuovo rapporto che ha cercato di instaurare fra cittadino e Stato...
...senz’altro si rileva uno scarto fra l’entità del fenomeno e la quantità...
(Source: Mafia Corpus, 1998)
There were, however, other examples in my corpus which did not
support this hypothesis:
… lo svolgimento delle elezioni in Germania, tra i quattro o cinque paesi…
…sono stati assunti), primo tra tutti la revisione della legge che consente…
… credo che lo scarto tra entrate ed uscite annue sia elevatissimo…
…cultura della legalità soprattutto tra i giovani, in particolare nella scuola…
…Ricordo la drammatica notte tra il 19 e il 20 luglio 1992, quando i ministri…
… al coordinamento tra attività "ordinarie" e "antimafia" nelle…
…Un disegno di legge si è infranto tra le proteste delle organizzazioni…
…sicurezza che non rientra tra quelle riservate ai detenuti sottoposti…
…il trait d'union tra il detenuto e il tribunale di sorveglianza…
…di cui da tempo parliamo, tra struttura e personale addetti alle indagini…
…la risposta: il contatto tra magistrati e pentiti, per le ragioni indicate…
…momento di attrito tra il potere giudiziario e quello amministrativo…
(Source: Mafia Corpus, 1998)
In total, the distribution of fra and tra across ‘euphonic’ and
‘cacophonous’ cotexts was as follows:
Preposition   Total Occurrences   Euphonic   Cacophonous
FRA           96                  27         5
TRA           452                 35         41
The rest of the occurrences can be considered neutral, that is, no dental
plosive or fricative consonant occurred in the immediate cotext.
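The three-way distinction between euphonic, cacophonous and neutral cotexts could in principle be operationalised automatically. The Python sketch below shows one possible heuristic and is explicitly not the procedure used for the manual counts above: it assumes that the "immediate cotext" is the single word on either side of the preposition, that spelling (the letters t/d and f/v) is an adequate stand-in for pronunciation, and that a repeated sound outweighs an alternating one when both occur. Because these assumptions are much cruder than a manual reading, the heuristic should not be expected to reproduce the figures in the table.

```python
PLOSIVES = set("td")    # dental plosives, approximated by spelling
FRICATIVES = set("fv")  # labio-dental fricatives, approximated by spelling


def classify_cotext(preposition: str, left_word: str, right_word: str) -> str:
    """Label one occurrence as 'cacophonous', 'euphonic' or 'neutral'."""
    neighbours = (left_word + right_word).lower()
    has_plosive = any(ch in PLOSIVES for ch in neighbours)
    has_fricative = any(ch in FRICATIVES for ch in neighbours)
    if preposition == "tra":
        same, other = has_plosive, has_fricative
    elif preposition == "fra":
        same, other = has_fricative, has_plosive
    else:
        raise ValueError("expected 'tra' or 'fra'")
    if same:
        return "cacophonous"  # the preposition's own sound class recurs
    if other:
        return "euphonic"     # plosive and fricative alternate
    return "neutral"          # neither sound class in the neighbouring words


if __name__ == "__main__":
    print(classify_cotext("fra", "relazioni", "settori"))
    print(classify_cotext("tra", "attrito", "il"))
    print(classify_cotext("fra", "casi", "loro"))
```

A phonemic transcription of the corpus, rather than its spelling, would be needed to make such a classification reliable.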
With regard to fra my initial hypothesis seems to be confirmed: out of
96 occurrences only 5 instances can be considered as cacophonous.
A further point in favour of the hypothesis is the fact that all set phrases
and idioms present in the corpus actually avoid cacophony (e.g. ‘tra
virgolette’ instead of ‘fra virgolette’).
The collocation results obtained for tra, by contrast, do not really
confirm my hypothesis.
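The contrast between the two prepositions becomes clearer when the counts in the table are converted into proportions. The following Python sketch simply redoes this arithmetic; the counts are those given in the table, whereas the function and key names are assumptions chosen for the example. It yields 64 neutral cotexts for fra and 376 for tra; among the non-neutral cotexts, 5 of 32 (about 16%) are cacophonous for fra, against 41 of 76 (about 54%) for tra.

```python
# Counts taken from the table above.
COUNTS = {
    "fra": {"total": 96, "euphonic": 27, "cacophonous": 5},
    "tra": {"total": 452, "euphonic": 35, "cacophonous": 41},
}


def summarise(counts: dict) -> dict:
    """Derive the neutral remainder and the share of cacophonous cotexts."""
    marked = counts["euphonic"] + counts["cacophonous"]
    return {
        "neutral": counts["total"] - marked,
        "cacophonous_share_of_total": counts["cacophonous"] / counts["total"],
        "cacophonous_share_of_marked": counts["cacophonous"] / marked,
    }


if __name__ == "__main__":
    for preposition, counts in COUNTS.items():
        summary = summarise(counts)
        print(
            f"{preposition}: neutral={summary['neutral']}, "
            f"cacophonous/total={summary['cacophonous_share_of_total']:.1%}, "
            f"cacophonous/marked={summary['cacophonous_share_of_marked']:.1%}"
        )
```

These are descriptive proportions only; in keeping with the restrictions set out in Section 5.5, no claim of statistical significance is attached to them.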
There may be several reasons why repetition of tra occurred:
♦ Pauses between tra and the succeeding cotext (continuity of
discourse): because the corpus omits pauses, interruptions and false
starts, it is impossible to assess to what extent these might have had an
effect on the results. It seems reasonable to assume, however, that
language processing constraints play an important role, i.e. that
utterance planning up to and including the preposition was completed
before the remainder of the sentence was planned. As tra appears to
be the default choice, this leads to repetition of sounds if the
lexical item that is later chosen as the one that can most appropriately
construe the intended meaning contains dental plosives. To what
extent this may be true would, however, need to be verified with an
appropriate corpus and through additional experiments, which is
beyond the scope of this thesis.
♦ Emphasis of a statement: the t(r) sound may be deliberately repeated
to focus attention on this part of the sentence. No similar effect can be
achieved through repeating the fricatives ‘f/v’, since these consonants
cannot be pronounced as loudly as dental plosives can.
♦ Easier pronunciation: repeating dental plosives is easier because
many Italian words and word clusters feature dental plosives (e.g.
ministro, struttura, tra l’altro, etc.). My corpus contained a total of 60
occurrences of dental plosive repetition, some of which are reproduced
below:
...per esempio, tra magistrati di vari gradi, tra magistrati che si...
...Vanno considerati, tra gli altri, i limiti di resistenza umana;...
...la separazione, di cui da tempo parliamo, tra struttura e personale...
…soffermarmi sui rapporti tra la distrettuale, le procure ed i tribunali...
...gruppo di lavoro interministeriale (tra ministro dell'interno e ministro...
...credo che lo scarto tra entrate ed uscite annue sia elevatissimo...
...del trattamento; tra l'altro, il magistrato di sorveglianza decide...
...provvedimento del giudice, tra l'altro motivato, per poterlo limitare...
...regime dell'articolo 41-bis. Tra l'altro, di questo mi dà conferma l'ultima...
...vigente e di consentire, tra l'altro, il ricorso a strumenti di indagine...
...rivisitata, tenendo conto tra l'altro delle oggettive difficoltà...
(Source: Mafia Corpus, 1998)
5.7 Conclusions
On the basis of the findings presented, it is fair to conclude that the
Italian prepositions tra and fra are synonymous. The
cacophony hypothesis, on the other hand, could not be fully verified. It
appears to be supported by the occurrences of fra in the corpus; tra,
however, requires further investigation.
Even though the results may not be what I had hoped to achieve, I felt
that by describing very faithfully how I went through the various steps,
from initial hypothesis formulation to corpus selection and final
interpretation of the results and what problems I encountered during the
process, I could perhaps demonstrate more realistically the advantages,
as well as the pitfalls of corpus-based analysis.
_________________________________________________________________
PART III
CONCLUSION AND OUTLOOK
_________________________________________________________________
6
Drawing Conclusions
In the previous chapters, I have dealt in detail with the most crucial
arguments in favour of - and against - the application of corpora in
language and translation studies. In doing so, I have tried to discuss the
issues from a variety of different perspectives, focussing first on more
general aspects before providing specific examples.
This last chapter looks to the future. It summarises the main implications
of a corpus-based approach and makes suggestions for new fields of
application, both in linguistics and translation research.
It tries above all to get across the one message that to me seems to be
the most important one of all, which is: times are changing, and so are
corpora - and hopefully our approach to teaching translation.
The Discipline of the Future
What we hear and read is so often mediated language that it is
probably fair to say that exposure to translated material is now a regular
feature of most people's daily existence. Given that this trend is likely to
continue in the new millennium, I believe that it is high time that translators
and translation scholars as well as linguists and lay people started to
rethink and reconsider their views of what translation entails and how
translation studies should be conducted.
Linguists in particular need to recognise that translation is a central
mode of communication in modern societies. So far, their attitude towards
translation has been at best ambivalent and at worst dismissive,
short-sighted, and highly prescriptive. If they considered translation at all they
generally focussed on how linguistics could be employed to ‘put matters
right’, rather than on translation as a phenomenon in its own right, which
does not necessarily have to conform to the linguist's preconceived ideas
of what counts as correct or incorrect use of language.
Seeing translation as a skill which can be improved through enhanced
sensitivity to linguistic patterns is of course a legitimate view. However, it
is also a rather limited and unsophisticated perspective, given the much
more productive role theoretical linguists could play in translation studies.
The growing interest within translation studies in exploiting corpus
linguistics for a variety of translation-related analyses, including the
examination of translation-specific features of language use (e.g.
‘translation universals’), should provide sufficient motivation for linguists to
enter into more fruitful partnerships with translation scholars that are
aimed at developing descriptive methodologies for translation studies.
Translators and translation teachers will also need to revise their views
and methods. One of the major aims of this thesis has been to show the
benefits of the implementation of corpus-based techniques in translation
research and teaching. These resources will only be fully exploited,
however, if there is a basic willingness to change the status quo, and if
there exists a consistent institutional policy that encourages such change.
Obviously, effecting innovative strategies will be difficult and those in
charge of course design will have to be ready to take risks, as it may not
yet be possible to enshrine the use of novel technologies in translation
curricula.
There are plenty of examples of institutions that have been prepared to
confront the challenge posed by the new technologies and which have
developed pioneering projects. At the Consiglio Nazionale delle Ricerche (CNR)
in Pisa, for instance, Peters and Picchi (1997:267-271) have integrated a
lexical database and a text management system into a prototype
workstation. The system includes many different components which can
be exploited by the translator and the lexicographer, by the language
learner, or by any user interested in exploiting to the full the possibility of
dynamically accessing, browsing, and extracting the different kinds of
linguistic information contained in dictionary and text databases (Peters
and Picchi 1997:271). Given the potential of universities in terms of
available human and technical resources, it is difficult to understand why
they should not engage in similar projects.
This is particularly true of the School for Translators and Interpreters at
Graz University. Even though our School has shown that it is aware of the
great importance of new technologies by obtaining a campus licence for a
major Translation Memory tool, and although students have ready access
to a variety of concordancing programs and statistical software, only a few
translation classes make use of corpus-based tools and electronically
available sources. As a consequence, the number of students attending
tutorials aimed at familiarising them with TM and other tools is very limited.
I believe that no professional translator today can afford not to use a
computerised environment: computer literacy is a must for anyone
entering the translation market. I also believe that institutions training
translators have an obligation to show the students which computer-based
resources are available and how they might help them improve the quality
of their output, both during their course and - perhaps even more
importantly - also later, when they are given their first professional
assignments.
There are a host of ways in which this could be achieved. Using
corpora and concordancing programs in the translation class would be
one possible approach. This way, students would be introduced to data
cataloguing, including parallel text management, semantic and lexical
disambiguation, stylistic analyses, etc. Another would be the
implementation of TM and MT routines in the translation class, which,
apart from its obvious practical benefits, would have the additional
advantage of allowing the department to compile a huge parallel (or even
multilingual) corpus made up of original texts and students’ translations.
A further area where it is easy to see possible applications of corpus
linguistics is that of language acquisition. Learners’ corpora could be
compiled in the more language-acquisition oriented classes, which would
represent very interesting material for a variety of applied linguistics
research projects. The results of such analyses could then be used as the
basis for computer-based self-instruction exercises. Quite apart from the
didactic potential of such projects, they would, I believe, also improve the
reputation of the university as an innovative research institution which
keeps abreast of new developments in order to meet the increasingly
exacting standards of the professional world.
One final argument in favour of a corpus-based approach is, I believe,
the great motivational potential of corpora in the translation class.
Students who discover language through corpora are constantly
challenged as they are obliged to analyse texts and reflect on the linguistic
and textual evidence they find, to make decisions and explain their
choices, and to query and justify their own textual production. This ability
to reflect and to challenge received views is among the most important
objectives of third-level education. Corpora and corpus-based
methodologies, I believe, can greatly contribute towards attaining this
goal.
_________________________________________________________________
APPENDICES
_________________________________________________________________
1 Glossary
♦ Alignment
The practice of defining explicit links between texts in a
parallel corpus.
♦ Annotation
The practice of adding explicit additional information to
machine-readable texts, as well as the physical representation of such
information.
♦ ASCII (American Standard Code for Information Interchange) A
numerical coding system for computerised text. When people refer to a
computer document being ‘in ASCII’, they usually mean that it consists
only of the characters that fall within the near-universally adopted lower
range of ASCII codes, 0-127, which cover unaccented Latin characters,
Arabic numerals, and a basic range of punctuation. Such files, which
may also be referred to as ‘text only’, present far fewer problems than
formatted word-processor files when it comes to manipulating data with
different types of software and on different computing platforms.
♦ Behaviourism Psychological doctrine developed at the end of the 19th
century which focused exclusively on observable behaviour. The most
valuable achievement realised by this discipline was to exclude
introspection from scientific study. John Watson - probably the first real
behaviourist - typified the approach and dismissed introspection as
untestable: he was convinced that the study of language had to be
based on objectivity, namely that the only valid scientific approach was to
limit study to specific stimuli and the consequent observable peripheral
muscular and glandular responses. Besides Watson, who actually
developed a complete behavioural theory, other behaviourists worth
mentioning include Hull, Tolman and Skinner.
♦ COBUILD COBUILD is an acronym for COllins Birmingham University
International Language Database. This is a joint project between
industry (HarperCollins Publishers) and the University of Birmingham,
which began in 1980. A large corpus of contemporary English was
gathered from spoken and written sources, and each word in turn was
studied for its lexical, grammatical, semantic, stylistic and pragmatic
features. The information was entered into a database from which the
Cobuild dictionaries and other publications were compiled.
♦ COCOA Reference A balanced set of angled brackets (<>) containing
two things: a code standing for a particular type of information, and a
string or set of strings, which are the instantiations of that information.
♦ Colligation
Collocation patterns based on syntactic groups rather
than individual words.
♦ Compile Collect and put together (e.g. texts for a corpus).
♦ Concordancer
A program which identifies a pattern (usually a word)
within a text, and prints out instances of its occurrence along with a
specified amount of context.
♦ Corpus
A collection of naturally occurring language text, usually in
machine-readable form and compiled to be representative of a
particular kind of language.
♦ Co-text The co-text of a selected word or phrase consists of the other
words on either side of it. This is a more precise term than context or
verbal context, but it is not much used.
♦ KWAL (Key Word and Line)
A form of concordance which can allow
several lines of context either side of the key word.
♦ KWIC (Key Word In Context) The most common type of concordance
output, in which the search item, or key word, is presented with a single
line of context. When several lines of output are presented, the key word
is aligned vertically, giving the impression of a column.
♦ Lemma
The headword form that one would look for if looking up a
word in a dictionary, e.g. the word-form eats belongs to the lemma
EAT.
♦ Lemmatisation The process or result of assigning each word-form in a
text to its lemma.
♦ Machine-readable
A term to describe textual resources which have
been stored on computer. It refers specifically to text which has been
encoded as characters, rather than images (such as a fax).
♦ Match When your search string is found in the corpus, it is referred to
as a match or hit.
♦ Mailing List
A mailing list is an e-mail-based bulletin board. E-mails
are sent to a particular site for inclusion in an electronic mailshot. When
the administrator of the mailing list feels that a new mailshot is ready,
the collected messages are posted to people who have specifically
subscribed to the mailing list.
♦ Natural Language
Term used for human language, as opposed to
artificial languages such as those used in computer programming and
formal logic (e.g. PROLOG).
♦ Parsing
A form of grammatical analysis which represents all of the
grammatical relationships (syntactic structures) within a sentence.
♦ Running Words
This term is used in measuring the length of a text.
Each successive word-form is counted once, whether or not that
particular form has occurred before. For example, the sentence
“Andrea is a very cool guy.” contains 6 running words.
♦ SGML (Standard Generalised Mark-up Language)
Mark-up system
used for electronic texts.
♦ Sublanguage
A constrained variety of a language. Although a
sublanguage may be naturally occurring, its key feature is that it lacks
the productivity generally associated with language.
♦ String Combination of letters/characters.
♦ Structural Linguistics At the beginning of the 20th century, attention
shifted to the fact that not only language change, but language
structure as well, is systematic and governed by regular rules and
principles. The attention of the world's linguists turned more and more
to the study of grammar, understood as the organisation of the sound
system of a language and the internal structure of its words and
sentences. By the 1920s, the program of 'structural linguistics', inspired
in large part by the ideas of the Swiss linguist Ferdinand de Saussure,
was developing sophisticated methods of grammatical analysis.
Structural linguistics focused on the synchronic analysis of language
and contributed greatly to the evolution of phonology. Major structural
schools were the Prague School (Trubetzkoy), the Copenhagen School
(Hjelmslev) and American structuralism (Bloomfield).
♦ Tag
A code attached to words in a text representing some feature or
set of features relating to those words.
♦ Tagger
A program which assigns labels to words or other units in a
machine-readable text. Currently the most common type of tagger is
one which assigns part-of-speech labels, typically using a probabilistic
algorithm, based on frequencies observed in previously tagged, or
annotated, text corpora.
♦ TEI (Text Encoding Initiative)
An international project to define
standards for the format of machine-readable texts.
♦ Text Continuous spoken or written language.
♦ Treebank
A corpus which has been annotated with phrase structure
information.
♦ Universals of translation
Linguistic features typically occurring in
translated rather than original texts. They are thought to be
independent of the influence of the specific language pairs involved in
the process of translation (Baker 1993:243).
♦ Word-Form
This term is used for any unique string of characters,
bounded by spaces. Hence eat, eating, ate, eaten are all different
word-forms of the same lemma (eat).
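
By way of illustration, three of the notions defined in this glossary (running words, word-forms and the KWIC concordance) can be sketched in a few lines of Python. This is only a minimal sketch and does not reproduce the method of any tool discussed in this work: the function names (running_words, word_forms, kwic), the tokenisation rule and the second sentence of the sample text are assumptions made purely for the example.

```python
import re

def running_words(text):
    """Return every word token in order of occurrence (the running words)."""
    # Assumption: a word is a run of letters, optionally joined by internal
    # apostrophes or hyphens; real concordancers apply more careful rules.
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text)

def word_forms(tokens):
    """Return the distinct word-forms among the tokens, ignoring case."""
    return {token.lower() for token in tokens}

def kwic(text, keyword, width=3):
    """Return one KWIC line per match, with `width` words of co-text per side."""
    tokens = running_words(text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left co-text so that the key words form a column.
            lines.append(f"{left:>30}  {token}  {right}")
    return lines

# Hypothetical sample: the glossary's own sentence plus one invented sentence.
sample = "Andrea is a very cool guy. Andrea eats what a cool guy eats."
tokens = running_words(sample)
print(len(tokens), "running words,", len(word_forms(tokens)), "word-forms")
for line in kwic(sample, "cool"):
    print(line)
```

Counting every token gives the running words, while collapsing repeated forms gives the word-forms; grouping word-forms such as eat and eats under one lemma would additionally require a lemma list, which this sketch deliberately leaves out.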
_________________________________________________________________
2 Major Corpora Available
This appendix is by no means an exhaustive listing. It merely aims to
provide an insight into the major corpora available at the time of
writing, as well as a contact address for further information on each
corpus. The list is divided into three main categories (written,
spoken, written and spoken) and arranged alphabetically. The main
features of every entry are highlighted, so that parsed, tagged, historical
or any other kind of specialised corpora can be easily identified.
Written
♦ The Aarhus Corpus of Contract Law
Features: multilingual corpus made up of three 1,000,000-word
subcorpora of Danish, English and French respectively. Texts are taken
from the area of contract law. This is not a parallel corpus.
Contact: The Aarhus School of Business, Fuglesangs Allé 4, DK-8210
Aarhus V, Denmark.
♦ The ACL/DCI Corpus (Association for Computational Linguistics/Data
Collection Initiative)
Features: monolingual corpus of 63 million words of written American
English (40 million words from the Wall Street Journal, 23 million words
from scientific abstracts).
Contact: Department of Linguistics, University of Pennsylvania,
Philadelphia, PA 19104 USA.
♦ The American Printing House for the Blind Corpus (APHB)
Features: monolingual treebanked corpus of fiction text produced for
IBM USA at Lancaster University.
Contact: not available for research purposes.
♦ The Augustan Prose Sample
Features: historical corpus of about 80,000 words of British English
reading material from between c.1675 and 1705.
Contact: Oxford Text Archive, Oxford University Computing Service, 13
Banbury Rd., Oxford, OX2 6NN (e-mail: [email protected]).
♦ The Australian Corpus of English (ACE)
Features: 1-million-word monolingual corpus of Australian English,
compiled to be comparable with the Brown Corpus.
Contact: School of English, Linguistics & Media, Macquarie University,
North Ryde NSW 2109, Australia.
♦ The BAF Corpus
Features: French-English bitext of about 400,000 words per language.
It comprises four subcategories:
- Four institutional texts (including a representative excerpt of the
so-called Hansard corpus) for a total size close to 300,000 words per
language.
- Five scientific articles of about 50,000 words per language each.
- A technical document with 39,328 English words for 46,828
French ones.
- Jules Verne’s novel “De la terre à la lune” (40,161 English
words vs. 53,181 French words). This corpus is very interesting
because the translations are sometimes divergent (75% of 1-to-1
patterns). In fact, it is not even clear whether the English version is
really a translation of the French one or if it has been translated from
an abridged version. The English version has a lot of missing
segments.
Contact: RALI, Département d'Informatique et recherche
opérationnelle, Université de Montréal, C.P. 6128, succursale
Centre-ville, Montréal (Québec), Canada, H3C 3J7. Team leader is Pierre
Isabelle (e-mail: [email protected]). The BAF corpus has
its own webpage at http://www-rali.iro.umontreal.ca/arc-a2/BAF/
♦ The Brown Corpus
Features: monolingual corpus of about 1 million words of written
American English dating from 1961 including many different registers.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Canadian Hansard Corpus
Features: a corpus of proceedings from the Canadian parliament. The
corpus is a parallel French-English corpus of about 750,000 words of
each language. The English version of the corpus has been part-of-speech
tagged and parsed at Lancaster University.
Contact: Department of Linguistics, University of Pennsylvania,
Philadelphia, PA 19104, USA (raw text corpus only!). The parsed and
tagged version is not available for distribution.
♦ The Crater Corpus (ITU Corpus)
Features: a trilingual parallel corpus of French, English and Spanish
from the telecommunications domain. It is available in part-of-speech
tagged, lemmatised and aligned form.
Contact: Department of Linguistics and Modern English Language,
Lancaster University, Lancaster LA1 4YT, UK.
♦ The CURIA
Features: an ongoing text collection project sponsored by the Royal
Irish Academy to make available machine-readable texts in the several
languages used in Ireland during its history - Irish (both old and
modern), Hiberno-Latin and Hiberno-English.
Contact: Royal Irish Academy, Dawson Street, Dublin, Ireland (e-mail:
[email protected]). An e-mail discussion list provides periodic
updates on the work.
♦ The Freiburg Corpus
Features: monolingual corpus of about 1 million words of written British
English from material published in 1991. The corpus aims to parallel as
closely as possible the contents of the LOB, in order to enable the
study of language change in the 30 years separating the two corpora.
Contact: Institut für Englische Sprache und Literatur, Albert-Ludwigs
Universität, D-7800 Freiburg, Germany.
♦ The Guangzhou Petroleum English Corpus
Features: a sublanguage corpus of 411,612 words of written English
from the petrochemicals domain.
Contact: Guangzhou Training College of the Chinese Petroleum
University, Guangzhou, China.
♦ The Helsinki Diachronic Corpus
Features: historical corpus of about 1.5 million words from 850 to 1710.
The corpus is divided into 3 periods and 11 subperiods and covers many
registers.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Helsinki Corpus of Early American English
Features: historical corpus of about 500,000 words of late 17th- and
early 18th-century North American English.
Contact: Department of English, University of Helsinki, Porthania 311,
00100 Helsinki, Finland.
♦ The Helsinki Corpus of Older Scots
Features: historical corpus of 830,000 words from 15 registers dated
from 1450 to 1700.
Contact: Department of English, University of Helsinki, Porthania 311,
00100 Helsinki, Finland.
♦ The Hong Kong University of Science and Technology (HKUST)
Learner Corpus
Features: learner corpus of about 6 million words (with on-going
collection) of written undergraduate assignments and “A” level Use of
English scripts from the Hong Kong Examination Authority.
Contact: Language Center, Hong Kong University of Science and
Technology, Clear Water Bay, Hong Kong.
♦ The Innsbruck Computer Archive of Middle English Texts
Features: historical corpus of about 2 million words of Middle English
prose from 1100 to 1500. Texts are arranged alphabetically.
Contact: [email protected].
♦ The International Corpus of Learner English (ICLE)
Features: learner corpus of about 1 million words of written English
texts from nine different language backgrounds: Chinese, Czech,
Dutch, Finnish, French, German, Japanese, Spanish, and Swedish.
Contact: University of Louvain, B-1348 Louvain-La-Neuve, Belgium.
♦ The Kolhapur Corpus
Features: monolingual corpus of 1 million words of written Indian
English from 1978. The corpus uses the same genres and proportions
as the Brown Corpus and the LOB Corpus.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Lampeter Corpus of Early Modern English Tracts
Features: historical corpus of about 500,000 words of pamphlet
literature dating between 1640 and 1740. This corpus contains whole
texts rather than smaller samples from texts.
Contact: TU Chemnitz-Zwickau, D-09107 Chemnitz, Germany.
♦ The Lancaster-Leeds Treebank
Features: a subsample of about 45,000 words taken from the LOB
corpus. The corpus is tagged for part-of-speech and fully parsed.
Contact: Department of Linguistics and Modern English Language,
Lancaster University, Lancaster LA1 4YT, UK.
♦ The Lancaster-Oslo/Bergen Corpus (LOB)
Features: monolingual corpus of about 1 million words of written British
English, all published in 1961. Many different registers are included.
The genre categories are parallel to those of the Brown corpus. The
entire corpus has been part-of-speech tagged, and various subsamples
have also been parsed (see: Lancaster Parsed Corpus; Lancaster-Leeds
Treebank).
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Lancaster Parsed Corpus
Features: 133,000 words from the LOB Corpus that have been
syntactically analysed.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres gate
31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Longman-Lancaster Corpus
Features: monolingual corpus of about 30 million words of written
British and American English covering a broad range of subject fields
from the early 1900s to the 1980s.
Contact: Longman Dictionaries, Longman House, Burnt Mill, Harlow,
Essex, CM20 2JE UK.
♦ The Melbourne-Surrey Corpus
Features: monolingual corpus of 100,000 words from Australian
newspapers.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Newdigate Newsletter Corpus
Features: historical corpus of 750,000 words of manuscript newsletters
from 1674 to 1692.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Scottish Dramatical Texts Corpus
Features: monolingual corpus of about 101,000 words of drama in
traditional and Glaswegian Scots.
Contact: School of English, The Queen’s University of Belfast, Belfast,
BT7 1NN, UK.
♦ The SUSANNE Corpus
Features: a part-of-speech tagged, parsed and lemmatized subset of
the Brown corpus (about 128,000 words) and the LOB corpus.
Contact: Oxford Text Archive, Oxford University Computing Service, 13
Banbury Rd., Oxford OX2 6NN, UK (e-mail: [email protected]).
♦ Thesaurus Linguae Graecae (TLG)
Features: a machine-readable collection of most of ancient Greek
literature.
Contact: TLG Project, University of California at Irvine, Irvine, CA
92717-5550, USA.
♦ The Tosca Corpus
Features: a monolingual corpus of about 1,500,000 words of written
English from dates between 1976 and 1986. The corpus is part-of-speech
tagged and parsed.
Contact: Department of English, University of Nijmegen, Erasmusplein
1, NL-6525 HT Nijmegen, The Netherlands.
♦ The Zurich Corpus of English Newspapers (ZEN)
Features: historical corpus of London newspapers from the mid 1660s
to the beginning of the twentieth century.
Contact: University of Zurich, Plattenstraße 47, CH-8032, Zurich,
Switzerland.
Spoken
♦ The Corpus of Spoken American English (CSAE)
Features: this monolingual corpus (still under construction) aims to
reach the size of 200,000 words of spoken American English.
Contact: Department of Linguistics, University of California at Santa
Barbara, Santa Barbara, CA 93106, USA.
♦ The Helsinki Corpus of English Dialects
Features: a dialect corpus of about 245,000 words of spoken English
from several regions of England. Speakers are elderly and rural in
conversation with fieldworkers.
Contact: Department of English, University of Helsinki, Porthania 311,
00100 Helsinki, Finland.
♦ The IBM-Lancaster Spoken English Corpus (SEC)
Features: monolingual corpus of 52,000 prosodically annotated and
part-of-speech tagged words of spoken British English, mostly from
BBC recordings. The Machine-Readable Spoken English Corpus
(MARSEC) is a version of the SEC which exists in the form of a
relational database and also includes some additional information, such
as phonetic transcription.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The London-Lund Corpus
Features: monolingual corpus of about half a million prosodically
annotated words of spoken British English collected in the 1960s and
early 1970s. The corpus includes mainly conversational genres, with
some additional categories such as legal proceedings and commentary
added later.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Northern Ireland Transcribed Corpus of Speech (NITC)
Features: dialect corpus of about 400,000 words of spoken material
from 42 locations and over three age groups (children, middle-aged and
elderly). The data represents conversations with fieldworkers.
Contact: Oxford Text Archive, Oxford University Computing Service, 13
Banbury Rd., Oxford OX2 6NN, UK (e-mail: [email protected]).
♦ The Polytechnic of Wales Corpus (POW)
Features: monolingual corpus of 61,000 words of children’s spoken
language. The corpus has been parsed using the Hallidayan
Systemic-Functional Grammar.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
Written and Spoken
♦ A Representative Corpus of Historical English Registers (ARCHER)
Features: historical corpus of about 2 million words of British and
American English covering the time from 1650 to 1990. Both written
and speech-based registers are available.
Contact: Douglas Biber, Department of English, Northern Arizona
University, Flagstaff, AZ 86011-6032, USA (e-mail:
[email protected]).
♦ The Bank of English
Features: monitor corpus of more than 200 million words of British
English (mostly written) built by Collins COBUILD at Birmingham
University, constantly growing. The data have been part-of-speech
tagged and parsed.
Contact: The Bank of English, Westmere, 50 Edgbaston Park Road,
Birmingham B15 2RX, UK.
♦ The Birmingham Corpus
Features: a monolingual corpus of about 20,000,000 words
(approximately 90% written and 10% spoken). The corpus consists
mainly of British English, although some other varieties are also
represented.
Contact: The Bank of English, Westmere, 50 Edgbaston Park Road,
Birmingham B15 2RX, UK.
♦ The British National Corpus (BNC)
Features: monolingual corpus of about 100 million words of British
English (90 million written, 10 million spoken) covering many different
registers. The entire corpus is part-of-speech tagged, while only a
one-million-word subset is parsed.
Contact: British National Corpus, Oxford University Computing Service,
13 Banbury Rd., Oxford OX2 6NN, UK (e-mail:
[email protected]).
♦ The CHILDES Project
Features: collection of children’s spoken and written language and
language pathologies. The samples are mainly American and British
English, but other languages are also represented.
Contact: CHILDES Project, Department of Psychology, Carnegie
Mellon University, Pittsburgh, PA 15213, USA (e-mail:
[email protected]).
♦ The International Corpus of English (ICE)
Features: a collection of 1-million-word corpora - one written and one
spoken - of different varieties of English. Samples are collected in each
country or region in which English is a first or major language (e.g. East
Africa, Australia, New Zealand, as well as the UK and USA). Collection
is still in progress.
Contact: Survey of English Usage, University College London, Gower
Street, London WC1E 6BT UK.
♦ The Nijmegen Corpus
Features: monolingual corpus of about 130,000 parsed words of written
and spoken British English (120,000 written, 10,000 spoken). The
spoken part is made up of transcribed sports commentary.
Contact: TOSCA Group, Department of Language and Speech,
University of Nijmegen, Erasmusplein 1, NL-6525 HT Nijmegen, The
Netherlands (e-mail: [email protected]).
♦ The Penn Treebank
Features: a monolingual, part-of-speech tagged and parsed corpus consisting
primarily of articles from the Wall Street Journal but also including some
samples of spoken language.
Contact: Penn Treebank, Department of Computer and Information
Science, University of Pennsylvania, Philadelphia, PA 19104, USA.
♦ The Survey of English Usage (SEU)
Features: monolingual corpus of about 1 million words of British English
collected from 1953 to 1987, divided evenly into spoken and written.
The spoken texts make up the London-Lund Corpus.
Contact: Survey of English Usage, University College London, Gower
Street, London WC1E 6BT, UK.
_________________________________________________________________
3 Software Available for Corpus-Based Research
This appendix provides some basic information about the major
tools for text analysis. Most of the entries concern concordancing
software, since concordancers are practically the sine qua non of
corpus exploitation and a very useful tool for the non-linguist as well.
Needless to say, this software is nothing but a collection of computer
programs. In other words, it will not work miracles: the ‘output’ still
needs to be analysed, filed and compared with other quantitative data in
order to produce ‘results’.
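
The step from ‘output’ to ‘results’ can be illustrated with frequency lists. Raw counts from texts of different sizes are not directly comparable, so they are usually normalised first. The following is a minimal sketch under stated assumptions, not the procedure of any package listed below: the helper names (frequency_list, per_thousand) and the two toy texts are invented for the example, and tokenisation is reduced to runs of letters.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word-form counts, number of running words) for a text."""
    # Assumption: tokens are lower-cased runs of letters; punctuation is ignored.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return Counter(tokens), len(tokens)

def per_thousand(count, total):
    """Normalise a raw count to occurrences per 1,000 running words."""
    return 1000 * count / total if total else 0.0

# Two hypothetical mini-texts standing in for corpora of different sizes.
text_a = "the translator checks the corpus and the glossary"
text_b = "a corpus is a collection of texts"

freq_a, size_a = frequency_list(text_a)
freq_b, size_b = frequency_list(text_b)

for word in ("the", "corpus"):
    rate_a = per_thousand(freq_a[word], size_a)
    rate_b = per_thousand(freq_b[word], size_b)
    print(f"{word:<8} A: {rate_a:7.1f}  B: {rate_b:7.1f}  (per 1,000 words)")
```

Even with normalised figures, interpreting a difference between two corpora remains the analyst's task; the program only supplies the numbers.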
Tools For IBM-Compatible Personal Computers
♦ Corpusbench
Features: this tool enables word counts, concordancing, simple
grammatical and morphological analyses (e.g. past tense “ed”). It can
handle large corpora, but it needs to construct a text database.
Contact: Textware Direct, Hörscholmsgrade. 20 2 DK-2200
Københaven N, Copenhagen, Denmark.
♦ International Corpus of English Utility Program (ICEUP)
Features: for use only with the International Corpus of English.
Contact: Survey of English Usage, University College London, Gower
Street, London WC1E 6BT, UK.
♦ LEXA
Features: LEXA is a sophisticated corpus analysis system. It produces
lexical databases and concordances. The program is able to handle
texts marked with COCOA references. It goes beyond the basic
frequency and concordance features of most corpus analysis programs
and also enables simple tagging and lemmatization routines to be run.
Contact: International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagres
gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ Longman Mini Concordancer
Features: for use with ASCII texts of fewer than 50,000 words. It provides
frequency lists, KWICs, and some text statistics. It is also possible to
call up concordances of specific collocations using the program.
Contact: Longman Group UK, Longman House, Burnt Mill, Harlow,
Essex CM20 2JE, UK.
♦ MicroConcord
Features: fast classroom concordancer. It is basically a searcher that
produces a concordance. MicroConcord also offers other features, such
as word counts, simple syntactic analyses and some morphological
analyses (e.g. past tense “ed”). It can be used with a variety of
languages and alphabets. Site licences are available.
Contact: in the US: Athelstan, PO Box 8025, La Jolla, CA 92038-8025;
in Europe: Oxford University Press, Walton Street, Oxford, OX2 6DP,
UK. A downloadable demo is available from the Oxford University Press
website.
♦ Micro-OCP
Features: slow research concordancer. Micro-OCP is a full-featured
concordancer, with many options for tailoring the search and the
concordance, including user-definable alphabets and references. It can
produce indexes, statistics, frequency lists and save the output to file. It
can be used with a variety of languages. Apparently, there are no size
limits.
Contact: Electronic Publishing, Oxford University Press, 200 Madison
Ave., New York, NY 10016, USA.
♦ MonoConc
Features: quite fast and user-friendly concordancing program
developed by Michael Barlow at Rice University (USA).
Contact: Athelstan, PO Box 8025, La Jolla CA 92038-8025 USA.
♦ Nijmegen Linguistic DataBase Software (LDB)
Features: allows browsing, concordancing, and syntactic pattern
searches specifically with the Nijmegen Corpus. It can also be used
with other parsed corpora that have been adapted for use with the LDB.
Contact: TOSCA Group, Department of Language and Speech,
University of Nijmegen, Erasmusplein 1, NL-6525 HT Nijmegen, The
Netherlands (e-mail: [email protected]).
♦ SARA
Features: a sophisticated concordancer designed specifically to handle
texts which use TEI/SGML markup. SARA is necessary to browse the
text collection of the British National Corpus.
Contact: Electronic Publishing, Oxford University Press, 200 Madison
Ave., New York, NY 10016, USA.
♦ TACT
Features: freeware package. The functionality of TACT is quite similar
to that of Wordcruncher. The program’s basic outputs are KWAL and
KWIC concordances and frequency lists. It also enables the user to
produce graphs of the distribution of words through a text or corpus.
Further features are a basic collocation list generator and the ability to
group words for searching according to user-defined categories (e.g.
semantic fields). TACT requires the user to convert the raw text into a
TACT database using a program called MAKBAS, a task that is quite
difficult for a non-expert.
Contact: Centre for Computing in the Humanities, Room 14297A,
Robarts Library, University of Toronto, Toronto, Ontario, M5S 1A5,
Canada (e-mail: [email protected]). Also available by anonymous
FTP from the latter (ftp://epas.utoronto.ca) and from ICAME
(ftp://nora.hd.uib.no).
♦ TransSearch
Features: bilingual concordancing tool designed exclusively to query
the Canadian Hansard texts, currently a database of seven years of
Canadian parliamentary debates (the Hansards), from 1986 to 1993. A
useful option of TransSearch is the possibility of submitting searches
through either a simple or a bilingual interface.
Contact: The former Computer-Aided Translation Research Team of
the Centre for Information Technology Innovation (CITI) now constitutes
the core of the RALI laboratory of the University of Montreal, Canada.
For further information contact Université de Montréal, CP 6128-A,
Montréal, Québec, H3C 3J7, Canada, or go to the RALI website
(currently http://www-rali.iro.umontreal.ca).
♦ Wordcruncher
Features: user-friendly package able to produce frequency listings,
KWAL and KWIC concordances, and concordances of user-selected
collocations. It can also produce word distribution statistics. Like TACT,
Wordcruncher requires texts to be in a specially indexed format. The
LOB, Brown, London-Lund, Kolhapur and Helsinki Diachronic corpora
are available on CD-ROM from ICAME in a ready-indexed form for use
with Wordcruncher.
Contact: Johnston & Company, PO Box 446, American Fork, UT
84003, USA.
♦ Wordsmith Tools
Features: suite produced by Mike Scott at the University of Liverpool
which includes a concordancer, a text aligner, a frequency lister as well
as a variety of other tools. The only program suite based on a Windows
environment. Currently the best set of tools available. For purchase
conditions see the Oxford University Press catalogue.
Contact: Further information can be obtained at Mike Scott’s Wordsmith
site web-published by the Oxford University Press (currently
http://www.liv.ac.uk/~ms2928/homepage.html).
Tools For Apple Macintosh Computers
♦ Conc 1.8
Features: research concordancer. It works with small texts only but it is
very fast. An attractive feature of Conc 1.8 is its split screen display of
text and concordance: users can click in the concordance window to
see the full context, and vice-versa. Conc 1.8 has a variety of options
for including or excluding words, sorting, exporting concordance to a file
and producing statistics.
Contact: International Academic Bookstore, Summer Institute of
Linguistics, 7500 West Camp Wisdom Road, Dallas TX 75236. This
software can also be downloaded from the site of the Summer Institute
of Linguistics.
♦ Concorder
Features: a fairly simple KWIC concordancer.
Contact: Les Publications CRM, Université de Montréal, CP 6128-A,
Montréal, Québec, H3C 3J7, Canada.
♦ FreeText Browser
Features: fast research concordancer based on a HyperCard stack. It
has no limitation on file size, but also no print/extract capability.
However, settings can be modified. It is a very nice tool for ad hoc
browsing: it delivers three windows, showing words with frequency,
concordance and text.
Contact: FreeText Browser, PO Box 598, Kensington, MD 20895, USA
(e-mail: [email protected]). It can also be downloaded from
the Umich Mac HyperCard Archive.
♦ SysConc 2.5
Features: tool for extracting linguistic patterns from a large corpus of
texts. It searches for specific lexical items, collocational patterns, or a
group of items of any semantic type set by the user. SysConc displays
the search results in a list, so that a larger context for a given item can
be obtained when the user requires it. It also shows statistics for the
words around the searched items, displaying them as a bar graph and
their collocations in a hierarchical pattern.
Contact: School of English, Linguistics & Media, Macquarie University,
North Ryde NSW 2109, Australia. It can also be downloaded free of
charge from the Macquarie Systemic Modelling Group home page
(currently http://minerva.ling.mq.edu.au).
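The window-based statistics that tools of this kind report for the words around a search item can be illustrated with a short sketch. The Python fragment below is not SysConc's own code: the function name collocates, the span parameter and the sample sentence are assumptions made purely for illustration. It counts how often each word occurs within a fixed number of tokens on either side of a node word.

```python
import re
from collections import Counter


def collocates(text, node, span=4):
    """Count the words found within `span` tokens of each occurrence of `node`.

    Hypothetical helper for illustration only; tokenisation is a plain
    lower-cased split on word characters, not a linguistic tokeniser.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Tokens to the left and to the right of the node, node excluded.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


if __name__ == "__main__":
    sample = "il rapporto tra mafia e politica, il legame tra mafia ed economia"
    # Print a crude text bar chart, most frequent collocate first.
    for word, freq in collocates(sample, "tra", span=1).most_common():
        print(f"{word:<10} {'#' * freq}")
```

A raw count of this kind favours very frequent words; a real tool would normally weigh the counts against overall word frequency before ranking collocates.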
Part-of-Speech Taggers
♦ CLAWS
Features: the Constituent Likelihood Automatic Wordtagging System is
a part-of-speech tagger for English which makes use of a probabilistic
model trained on large amounts of manually corrected analysed text.
Contact: Department of Linguistics and Modern English Language,
Lancaster University, Lancaster LA1 4YT, UK.
♦ Xerox Tagger
Features: a part-of-speech tagger, developed at the Xerox PARC
laboratories, whose basic tagging program is language-independent
and is being used at the Universidad Autónoma de Madrid to tag the
Spanish part of the CRATER corpus.
Contact: available by anonymous FTP from
ftp://ftp.parc.xerox.com/pub/tagger.
_________________________________________________________________
4
Results of a Collocation Search of Tra and Fra
In this last appendix I reproduce, for reference, the collocations of the
two Italian prepositions tra and fra. The number at the beginning of
each string identifies its order of appearance in the corpus.
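For readers who wish to produce a comparable listing with their own programming, the following minimal sketch in Python shows one way of generating numbered keyword-in-context lines. It is not the tool used for this appendix: the function name kwic, the width parameter and the sample text are hypothetical, and the context width is an assumption chosen to resemble the lines reproduced here.

```python
import re


def kwic(text, keywords, width=44):
    """Return numbered keyword-in-context lines, in order of appearance.

    Hypothetical helper for illustration only. `width` is the number of
    characters of context kept on each side of the keyword.
    """
    # Whole-word match, ignoring case, so that sentence-initial "Tra" is found
    # while words such as "traffico" are not.
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    # Collapse line breaks and repeated spaces so each context stays on one line.
    flat = " ".join(text.split())
    lines = []
    for n, hit in enumerate(pattern.finditer(flat), start=1):
        left = flat[max(0, hit.start() - width):hit.start()]
        right = flat[hit.end():hit.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{n:>4} {left:>{width}}{hit.group(0)}{right}")
    return lines


if __name__ == "__main__":
    sample = ("Il rapporto tra mafia e politica.\n"
              "Tra l'altro, il traffico continua fra le parti.")
    for line in kwic(sample, ["tra", "fra"], width=12):
        print(line)
```

Because the context is cut by character count rather than by word, the first and last words of each line are usually truncated, as they are in the listing that follows.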
1 tivo, di ciascun gruppo. Ritengo quindi che tra una settimana- dieci giorni la Commissione
2 za che procede a ripartire i relativi oneri tra i due rami del Parlamento.
3 emo subito dopo - dell'usura e del rapporto tra banche, finanziarie ed intermediatori finan
4 distrettuali e, in particolare, il rapporto tra le procure distrettuali antimafia e la DNA.
5 erso un obiettivo preciso. Quella odierna è tra l'altro la prima seduta "vera" della Commis
6 anti per evitare di perdere tempo prezioso; tra l'altro, nell'elenco delle audizioni si dov
7 la lotta alla mafia, oltre che dei rapporti tra mafia e politica, qualora ve ne fossero. Si
8 re, acquisire gli atti relativi ai rapporti tra mafia e massoneria e, in generale, tutti gl
9 da parte della Commissione. Condivido, tra l'altro, una sua dichiarazione che ho letto
10 razione, che dobbiamo rendere obbligatoria, tra le forze di polizia e dell'esercito. La col
11 polizia e dell'esercito. La collaborazione tra carabinieri, polizia e Guardia di finanza,
12 nza allargato ai rappresentanti dei gruppi. Tra le richieste che dovremmo porre al ministro
13 voluto sottolineare questo piccolo problema tra i tanti. ANTONIO BARGONE. Avevo posto un
14 doci ad ascoltare quanto ci vengono a dire. Tra l'altro, dobbiamo anche tenere presente qua
15 la criminalità economica, cioè del rapporto tra crimine organizzato ed economia. Da questa
16 e immediatamente - che il tema del rapporto tra criminalità organizzata ed effetti sull'eco
17 endo apparire come il tipico saputello, che tra l'altro non sono e ripeto - non intendo ess
18 o approccio con il grande tema del rapporto tra economia, finanza e criminalità organizzata
19 e cosa fare fino alle prossime scadenze. Tra l'altro, questi difetti di organizzazione d
20 In caso contrario, ci troveremmo a fissare tra dieci giorni una riunione in cui si definis
21 cupazione di non creare una sovrapposizione tra l'ufficio di presidenza e la Commissione pl
22 o quella di non creare una contrapposizione tra l'ufficio di presidenza e la Commissione pl
23 ani ed americani che operavano, in simbiosi tra loro ed in collegamento con la mafia colomb
24 o la tendenza verso una stretta interazione tra realtà criminali diverse, ha favorito il co
25 minali diverse, ha favorito il collegamento tra differenti settori dello scambio illegale e
26 delinquenziali siffatte, che interagiscono tra loro proponendosi come un sistema complesso
27 ell'azione antimafia, un quadro di raccordo tra il momento della valutazione strategica del
28 ale esistente prevede un raccordo immediato tra Consiglio generale e strutture di contrasto
29 traverso Pagina 40 una costante interazione tra il momento dell'acquisizione conoscitiva e
30 un programma d'intervento il quale prevede, tra l'altro, l'adozione, di concerto con il min
TRA
31 reati economici ed il traffico di armi. Tra le organizzazioni impegnate a vario titolo
32 posizione trainante di rilievo, provvedendo tra l'altro all'istituzione di speciali agenzie
33 rare un più elevato livello di cooperazione tra gli organismi di polizia impegnati nella pr
34 luppo di iniziative di collaborazione anche tra paesi extracomunitari, specie laddove quest
35 anica collaborazione di carattere operativo tra gli organismi investigativi attivi nell'are
36 uare una più efficace divisione dei compiti tra polizia e carabinieri. Come tutti sappiamo,
37 nsiderando che la Commissione antimafia ha, tra i suoi compiti, quello di verificare che tu
38 ia possibile su questo versante un raccordo tra le esperienze di alta professionalità dei v
39 O, ROS), unificando un'azione oggi dispersa tra i vari corpi, specializzando l'intervento g
40 lo svolgimento delle elezioni in Germania, tra i quattro o cinque paesi interessati al fen
41 esi interessati al fenomeno - la Germania è tra questi - per verificare in quale modo si po
42 pibile! Non soltanto: in molte grandi città tra cui Napoli, nei centri dove è presente un h
43 in cui è certamente presente una collusione tra amministrazioni locali e forze mafiose. In
44 , stia tranquillo. Ho fatto una distinzione tra la sua volontà e l'azione complessiva del G
45 ai rapporti, pericolosamente in estensione, tra economia e criminalità. Un problema di coor
46 no contrario a quello che ci si prefiggeva. Tra l'altro, io stesso già due anni fa avevo la
47 n questa sede e che vorrei riprendere. Sono tra gli ammiratori dei carabinieri, sia ben chi
48 nistro accennava, di maggiore coordinamento tra le forze dell'ordine siano accelerate e che
49 piccola norma che ero riuscito ad ottenere. Tra l'altro, al di là di tutte le domande e di
50 molto serio di porsi di fronte al rapporto tra mafia e politica. Piuttosto, signor mini
51 . Concludo con un'ultima questione. Tra i personaggi a rischio nella lotta contro l
52 onsiglio comunale di Trani - e leggiamo che tra le motivazioni da cui ha tratto origine un
53 sione straordinaria possiamo constatare che tra questi è contenuto il permesso per tale dis
54 è giustissima, ma va evitata la cogestione tra Parlamento e Governo. Si tratta di un atto
55 prannominato "Gigi l'americano", il quale è tra gli arrestati. Le accuse sono di associazio
56 rre individuare e recidere i legami mafiosi tra la criminalità e la struttura che non sempr
57 n all'interno della struttura né tanto meno tra i diretti interessati. Credo sia utile ed o
58 CO ed averlo fatto funzionare con successo (tra parentesi, lo SCO ha gestito per due anni u
59 ontinuità nella gestione della direzione. Tra gli spostamenti che però non vengono quasi
60 finora, cioè di occuparsi del coordinamento tra l'azione dello Stato e quella svolta nella
61 volontariato. Credo che la collaborazione tra questi due mondi, che finora non si sono pa
62 che unità) proprio per il legame fortissimo tra i suoi componenti. Oggi abbiamo la grande o
63 ato comunque costituito un gruppo di lavoro tra le forze di polizia che presenterà entro la
64 da commissariato a commissariato: infatti, tra due persone che svolgono le stesse funzioni
65 studio e di approfondimento su questi temi tra esperti del ministero, della Bocconi (già d
66 considero l'Unione europea una via di mezzo tra estero e territorio nazionale (non è estera
67 diterraneo come "ufficiale di collegamento" tra i paesi dell'Unione europea e quelli non fa
68 io, perché l'argomento riguarda il rapporto tra Governo e settore del credito, le funzioni
69 rventi della Banca d'Italia, i collegamenti tra Tesoro, Banca d'Italia e settore del credit
70 ha comportato la chiusura di più di 2 mila tra società finanziarie e fiduciarie; siamo sem
71 roppo, però, non sono stati assunti), primo tra tutti la revisione della legge che consente
72 41- bis. Per quanto riguarda i rapporti tra economia e criminalità, si tratta di un tem
73 risposto che bisogna creare un collegamento tra istituzioni governative e istituzioni non g
74 ia in Sicilia ed il fatto che i funzionari, tra cui il segretario comunale, rimangono al lo
75 pre necessario trovare un giusto equilibrio tra la consistente presenza di forze dell'ordin
76 migrazione, che ha compiti di coordinamento tra gli enti governativi e quelli non governati
77 arantiscono, io credo, il giusto equilibrio tra la sicurezza e la possibilità per l'interes
78 di protezione per un centinaio di persone, tra cui molti politici o ex politici. Si tratta
79 perseguito attraverso molteplici strumenti, tra cui il controllo degli assetti proprietari,
80 a quella fascia di clienti che è al confine tra la bancabilità e la non bancabilità. Ind
81 dell'economia del paese, tende a stabilirsi tra economia criminale ed economia legale. M
82 rito che solo le banche possono raccogliere tra il pubblico fondi con l'impegno di restituz
83 ooperative finanziarie di raccogliere fondi tra i loro soci. Per corrispondere all'esigenza
84 izzazione favorendo la più ampia diffusione tra il pubblico degli elenchi degli intermediar
85 rio contributo tecnico. E' stata condivisa, tra l'altro, la scelta di svincolare la figura
86 ediari sono tenuti a conservare. I rapporti tra intermediari e organi inquirenti potranno d
87 ati gravi produttori di ricchezza illecita, tra i quali quindi anche i fatti di usura. R
88 e su basi non codificate, si va realizzando tra le autorità dei paesi ad economia matura; c
89 aese impegnati nell'azione antiriciclaggio, tra i quali la Banca d'Italia e l'Ufficio itali
90 avete sempre assunto circa il collegamento tra la proliferazione degli sportelli bancari e
91 tassimo un pochino il limite che intercorre tra il soggetto bancabile e quello non bancabil
92 NO VIOLANTE. Inoltre, credo che lo scarto tra entrate ed uscite annue sia elevatissimo, n
93 are un solo istituto o un solo paese. Tra i paradisi fiscali noti ho contato 14 paesi
94 oro del 1991, per ragioni contabili interne tra banca centrale e singole banche, li esclude
95 della provincia di Agrigento, per esempio, tra gli arrestati per vicende connesse all'usur
96 one alla necessità di definire nuove regole tra banche ed utenti per quanto riguarda la cer
97 ripetersi di vicende di questo genere, che tra l'altro riguardano transazioni di decine di
98 ca d'Italia ha fatto un'analisi comparativa tra la situazione economicosociale di alcune a
99 etta separazione dell'attività di vigilanza tra le autorità dei paesi, in particolare di qu
100 o, al loro interno, manchevolezze o abbiano tra i loro dipendenti elementi infedeli che con
101 L'onorevole Del Prete ha citato alcuni tra i casi più clamorosi: mi riferisco alle due
102 chiedono) che rendiamo pubbliche. Io stesso tra una settimana, a Foggia, svolgerò un interv
103 uoi lavori. Esiste quindi una differenza tra l'emendamento ed il testo attuale. Non ho b
104 automaticamente, già porta ad una divisione tra noi. PRESIDENTE. I proponenti insistono p
105 Nel corso degli incontri che si sono svolti tra lei, signor presidente, ed il capogruppo de
106 il tema degli insediamenti mafiosi nel nord tra quelli propri dei gruppi di lavoro della Co
107 in alcun modo i lavori della Commissione. Tra l'altro, è in vigore quello provvisorio.
108 cinque Commissari eletti dalla Commissione tra i suoi membri. Tra questi la Commissione el
109 eletti dalla Commissione tra i suoi membri. Tra questi la Commissione elegge il Presidente.
110 (penso, in particolare, alla Calabria, dove tra breve inizieranno processi molto importanti
111 nto riguarda il filone relativo al rapporto tra mafia e politica, occorre fare riferimento
112 ricche, è diventato "importante" - lo dico tra virgolette - perché è finalizzato anche all
113 rta un'internazionalizzazione del discorso: tra poco tempo si terrà, come è noto, la confer
114 trasparenza e per un diverso modo di porsi tra cittadini e istituzioni, è tema prioritario
115 e di una cultura della legalità soprattutto tra i giovani, in particolare nella scuola, per i
116 mi riferisco alla questione dei rapporti tra mafia e sistema eversivo. Mentre ho affront
117 o vale per altre questioni, come i rapporti tra Cosa nostra e la banda della Magliana, che
118 ntegrazione (su frodi comunitarie, rapporti tra mafia e massoneria, e così via) riprendono
119 tratta di intrecci, nemmeno tanto occulti, tra politica, economia e mafia. Non diamo a que
120 piamo bene che la mafia ha sempre sguazzato tra grembiuli e cappucci. Ma si tratta anche di
121 di trasparenza. PRESIDENTE. Comunico che tra venti minuti avranno inizio votazioni alla
122 alisi delle seguenti tematiche: connessioni tra mafia e politica negli organi dello Stato e
123 e politica della mafia, di capire cioè come tra mafia e politica si fosse stabilito un rapp
124 zione di quella che è stata la coabitazione tra il potere politico e la mafia. In questo mo
125 zione dai Presidenti delle Camere di intesa tra di loro. Quindi, tali adempimenti non dipen
126 rdo che vengano costituiti. Dirò subito che tra i compiti della Pagina 392 Commissione bica
127 ono registrati. Ricordo la drammatica notte tra il 19 e il 20 luglio 1992, quando i ministr
128 data l'esigenza di conoscenza del rapporto tra l'apposita commissione presso il Ministero
129 enza su quelle altrui. Occorrono un accordo tra i gruppi parlamentari e buona volontà da pa
130 iguardano aspetti interessanti dei rapporti tra mafia e politica, quelli tra mafia ed econo
131 i dei rapporti tra mafia e politica, quelli tra mafia ed economia e della mafia nel nord. Q
132 è rappresentato dal rapporto che intercorre tra noi e chi governa questo paese, nel senso c
133 lla legge, anche se si tratta di una legge, tra le tante, che mai è stata applicata. L'urge
134 votazione. GIUSEPPE ARLACCHI. Poiché sono tra i firmatari dell'ordine del giorno Bargone,
135 che in questo caso potrebbe venir fuori che tra qualche giorno i gruppi di lavoro si trovin
136 i gruppi di lavoro si trovino in disaccordo tra loro sulle formulazioni suggerite. PRESID
137 Ho messo come priorità quella del rapporto tra mafia, politica ed economia, e questa è un'
138 ne possano essere raggiunti". Tale modifica tra l'altro toglie quell'apprezzamento sul lavo
139 curiosa scissione che esiste sempre di più tra il dibattito politico sulla mafia fuori da
140 no comunque trovare una stretta connessione tra di loro, così da realizzare la conclamata e
141 Procure distrettuali; al coordinamento tra attività "ordinarie" e "antimafia" nelle pr
142 a" nelle procure distrettuali e al raccordo tra queste ultime e le procure circondariali;
143 .P., effettivo coordinamento delle indagini tra diversi PM) o che potrebbero favorire un ac
144 rapporti e delle strutture di collegamento tra gli altri servizi centrali e periferici (S.
145 ati e gli stretti legami e interconnessioni tra gli stessi, si trasforma in una mera formal
146 ruzione dei molteplici aspetti dei rapporti tra mafia e politica e tra mafia ed economia, c
147 aspetti dei rapporti tra mafia e politica e tra mafia ed economia, che si possa pervenire,
148 serio lavoro conoscitivo sulle connessioni tra mafia e politica, e tra mafia ed economia,
149 o sulle connessioni tra mafia e politica, e tra mafia ed economia, non trascurando di verif
150 articolare, riescono a consolidare i legami tra ambiente governativo, militare, apparati di
151 consente di individuare le interconnessioni tra i diversi settori nell'ambito di ciascuna t
152 ciascuna tematica e delle diverse tematiche tra di loro, in una circolarità che eviti un la
153 e seguenti tematiche: 1) Connessioni tra mafia e politica negli organi dello Stato e
154 o sia opportuno ribadire in questa sede. Tra l'altro, questo viaggio in Russia mi ha riv
155 ste una grande collaborazione, per esempio, tra Washington e Mosca. Credo che debba essere
156 trovare là occasioni di lavoro importanti; tra l'altro, c'è tutta la riconversione dell'in
157 a domanda sulla mafia, sulla ricongiunzione tra la mafia italiana e quella dei paesi dell'e
158 . Svolgerò ora alcune premesse generali. Tra i suoi obiettivi primari e fondamentali, il
159 atezza assumono i temi relativi ai rapporti tra i vari organismi di polizia ed al loro coor
160 oni per un recupero del rapporto fiduciario tra cittadino ed istituzioni e per l'acquisizio
161 no: il riequilibrio dell'equità competitiva tra gli operatori, grandi e piccoli, anche tram
162 perfezionati i meccanismi di coordinamento tra le diverse autorità, amministrative e di po
163 istri. Era altamente affettuoso. Lei sa che tra il presidente e i tifosi non ci può essere
164 e e del gruppo di lavoro interministeriale (tra ministro dell'interno e ministro di grazia
165 sia per ragioni di continuità di competenza tra uffici giudiziari sia per esigenze di funzi
166 so", si aggiungono infatti quelle legate... tra l'altro sento di cifre che, sia pure senza
167 upposti per un sempre più diffuso "scambio" tra le mafie tradizionali e quelle straniere (l
168 e al fine di rendere più agevoli i rapporti tra le autorità giudiziarie (specie in tema di
169 no esperienze pregresse e la collaborazione tra i vari paesi, anche per quanto riguarda le
170 della sua audizione, che esiste un rapporto tra istituzioni, sistema bancario e mondo econo
171 sibile. Ecco perché la Commissione ha posto tra i suoi compiti quello di indagare sul ricic
172 Governo? Un disegno di legge si è infranto tra le proteste delle organizzazioni sindacali
173 n viso a cattivo gioco, si tengono riunioni tra i questori, prima dello sciopero generale,
174 o futuro. Sarebbe auspicabile, di intesa tra la Presidenza del Consiglio e i ministri de
175 nare e quindi emettere sentenze più rapide. Tra coloro che aspettano di essere processati i
176 o peggiorativo. C'è un problema di coerenza tra indirizzi e proposte del Presidente del Con
177 dei problemi che dobbiamo affrontare. Oggi tra funzionari amministrativi e magistrati vi è
178 nti forse si potrebbe trovare un equilibrio tra le esigenze di bilancio e le esigenze di un
179 non aveva affatto queste caratteristiche e, tra l'altro, prevedeva anche uno specifico cont
180 ente sottoposti a tale regime ha raggiunto, tra la fine del 1992 e il primo semestre del 19
181 sottoposti al regime dell'articolo 41-bis, tra la fine del 1992 e il primo semestre del 19
182 ezione di massima sicurezza che non rientra tra quelle riservate ai detenuti sottoposti al
183 mi risulta gode di notevole prestigio anche tra i rappresentanti dell'opposizione. Si tratt
184 sto cospicuo ampliamento, che ha comportato tra l'altro il frazionamento di una direzione (
185 entrale della lotta alla mafia, il rapporto tra mafia e politica. Eppure di tale questione
186 legami, spesso ambigui e sempre insidiosi, tra mafia e politica, tra criminalità organizza
187 i e sempre insidiosi, tra mafia e politica, tra criminalità organizzata e consenso elettora
188 se lei ritenga che il problema del rapporto tra mafia e politica sia stato in tutto o in pa
189 478 economica, realizzando quella saldatura tra criminalità organizzata e criminalità degli
190 rtamente quanto meno un accenno al rapporto tra mafia e amministrazione, politica e istituz
191 pio, intenda istituire un corpo specifico); tra l'altro, secondo i calcoli, i pentiti sono
192 one di tutti noi sulla differenza esistente tra il mantenimento normativo del suddetto arti
193 rtuno questo suo accenno), con riferimento, tra l'altro, alla questione della celebrazione
194 sorvolo). La celebrazione dei processi, che tra primo e secondo grado è una celebrazione so
195 e debbo dirle con franchezza circa il nesso tra il problema mafia e il turismo: al riguardo
196 di Palermo, esponente di forza Italia (che tra l'altro conosco da anni ed ho sempre avuto
197 il mondo e vi è quindi quasi un'incoerenza tra un tipo di interpretazione che si può dare
198 iglio, nelle sue comunicazioni ha collocato tra i vari settori dell'illecito le case da gio
199 una situazione di obiettiva incompatibilità tra il magistrato che gestisce il pentito e il
200 uttorie, depotenzia la "periferia" - sempre tra virgolette, perché è un termine che non mi
201 di ciò che accadeva, dei rapporti illeciti tra politica, imprenditoria, criminalità. Lei h
202 fossero o quali potessero essere i rapporti tra mafia e massoneria. Abbiamo avuto un interl
203 ll'opposizione verificare la corrispondenza tra parole e fatti, indurre e stimolare il Gove
204 erativi speciali dell'Arma dei carabinieri; tra i 400 e i 500 uomini lavorano nel servizio
205 , salta agli occhi il problema del rapporto tra diverse forze di polizia. Siamo convinti ch
206 spetto fondamentale, si tratta del rapporto tra il Governo ed il mondo della cultura, degli
207 rio perché mostrava possibilità di intrecci tra politica e criminalità che essi conoscevano
208 567 decreti, tanto che nel periodo compreso tra la fine del 1992 e l'inizio del 1993 il tot
209 rio previsto dall'articolo 41-bis sono 436; tra questi, ve ne sono alcuni per i quali i dec
210 144 facenti parte di altre cosche mafiose, tra cui la stidda (in questo caso si tratta di
211 ratta di persone inserite in famiglie unite tra di loro da matrimoni o da alleanze che spes
212 lla quale non si può uscire. Naturalmente, tra i personaggi nei cui confronti è stato appl
213 rovvedimenti in vigore nel periodo compreso tra la fine del 1992 e l'inizio del 1993 si è g
214 si propone, quello cioè di tagliare i fili tra un certo tipo di detenuti e coloro che sono
215 la possibilità di emanare decreti e perché, tra i decreti emanati dal direttore generale, 5
216 ole: il cosiddetto radio-carcere funziona e tra i detenuti vi sono amici, coimputati, corre
217 prontare un congresso che si terrà a Napoli tra il 21 e il 23 novembre, al quale affluirann
218 vi è la necessità di effettuare interventi, tra l'altro, sul parco macchine e ricordo, per
219 prattutto con riferimento alla 'ndrangheta, tra i cui componenti vi sono migliaia di detenu
220 atterizzate da una grande presenza mafiosa. Tra l'altro, in Calabria vi sono 157 cosche, pe
221 sociali, che rappresentano il trait d'union tra il detenuto e il tribunale di sorveglianza.
222 relazione sulla diversità dei comportamenti tra questo e quell'istituto, ho cercato di farv
223 ati come un supporto, come un trait d'union tra il detenuto, la famiglia, il posto di lavor
224 istrati di sorveglianza, obbligatoriamente. Tra noi e i magistrati c'è una specie di amore
225 ne penitenziaria ci sono tanti problemi, ma tra quelli che hanno priorità bisogna annoverar
226 i diritti inviolabili Pagina 522 dell'uomo, tra cui quello della libertà personale, la cui
227 , deve sempre sussistere un contemperamento tra potestà punitiva, tutela della sicurezza pu
228 oblemi che avvertiamo in merito al rapporto tra il legislativo e la magistratura, è chiaro
229 e equilibrio, sotto il profilo percentuale, tra le due ipotesi? ANTONELLA GIULIANA MAGNAV
230 in generale nella materia del trattamento; tra l'altro, il magistrato di sorveglianza deci
231 è più grave, genera solidarietà nel carcere tra tutti i detenuti: con De Lorenzo è solidale
232 potrebbe addirittura creare una solidarietà tra i mafiosi colpiti da questa norma, che perc
233 a di persone dal carcere significa portarle tra la popolazione e addossarne il controllo ai
234 'è bisogno di un provvedimento del giudice, tra l'altro motivato, per poterlo limitare. Lo
235 sottoposti al regime dell'articolo 41-bis. Tra l'altro, di questo mi dà conferma l'ultima
236 e al tribunale debbono marciare altre cose, tra cui il carcere. Si sono dimenticati del car
237 sua banda che ha commesso parecchi omicidi, tra cui quello di un agente di pubblica sicurez
238 ontavano a 26 mila unità circa, il rapporto tra reclusi definitivi e quelli in attesa di gi
239 e, come ho detto, siamo in sede di rinvio; tra breve il problema dovrebbe essere risolto
240 o un altro principio, del quale sottolineo, tra virgolette, la gravità. Tale principio, inf
241 luto, non può assicurare che non conversino tra loro. PRESIDENTE. O che la posta segua vi
242 lla coerenza, direi della consequenzialità, tra intenzioni ed atti di Governo e tra progett
243 ialità, tra intenzioni ed atti di Governo e tra progetti e loro realizzazione. E credo che
244 che si ponga anche un problema di coerenza tra parole e parole e tra intenzioni e intenzio
245 problema di coerenza tra parole e parole e tra intenzioni e intenzioni, soprattutto quando
246 oni per un recupero del rapporto fiduciario tra cittadino ed istituzioni e per l'acquisizio
247 rte della banca è più economica che morale. Tra cliente bancabile e cliente non bancabile l
248 . Si pone, quindi, il problema del rapporto tra aree depresse del Mezzogiorno e capacità di
249 enomeno. Non vede lei una contraddizione tra quanto ella ha proclamato, ossia di voler c
250 a Mario Pirani che non è l'ultimo arrivato tra i giornalisti - lei attaccò con forza i giu
251 icando! su corruzione e mafia, sui rapporti tra affari e mafia, siano comunisti, cioè che a
252 richiesta a magistrati (del popolo, lo dico tra parentesi) che l'amministrano in altro modo
253 edere la normativa vigente e di consentire, tra l'altro, il ricorso a strumenti di indagine
254 al cliente. E' infatti sempre la difformità tra l'entità del patrimonio delle persone, i mo
255 onale antimafia, che si collocano a cavallo tra la fine del 1991 ed i primi mesi del 1992,
256 non approdò ad esiti di rilievo. Peraltro, tra le stesse forze di polizia, i cui responsab
257 ato alla migliore distribuzione dei compiti tra i vari corpi ed allo sviluppo di un'azione
258 ne, nell'ambito delle associazioni mafiose, tra attività delinquenziali primarie (fonte del
259 ma completamente rivisitata, tenendo conto tra l'altro delle oggettive difficoltà connesse
260 necessità di attuare una netta separazione tra chi investiga sui fatti dichiarati dal pent
261 cano nel più generale contesto dei rapporti tra mafia e politica. Di tale relazione voglio
262 ali privilegiati per infittire le relazioni tra i suoi appartenenti e coloro i quali, in qu
263 genericamente allo stesso tema dei rapporti tra mafia e pubblica amministrazione: per evita
264 con riguardo al problema del coordinamento tra le forze di polizia, che già è stato affron
265 oni per il recupero del rapporto fiduciario tra cittadino ed istituzioni, sia creando una n
266 anche i presupposti per un diffuso scambio tra le organizzazioni criminali, che nella dime
267 che nella dimensione internazionale vedono, tra l'altro, un mezzo più sicuro e proficuo di
268 del codice di procedura penale, i rapporti tra le autorità giudiziarie. Sotto questo profi
269 ece che fosse la cupola mafiosa a scegliere tra i candidati inclusi nelle liste dei vari pa
270 le diverse province i "personaggi" (lo dico tra virgolette) da appoggiare in ogni singola c
271 no sull'argomento o a mancato coordinamento tra la Presidenza della Commissione e palazzo C
272 protettori. Con il distinguo che si opera tra i carcerati con l'applicazione dell'articol
273 a... La pericolosa commistione esistente tra imputati e condannati, sia pure in primo gr
274 tituto in cui vi siano una vera separazione tra imputato e condannato sia pure in prima
275 ia pure in prima battuta -, una distinzione tra uomini e donne e speciali accorgimenti per
276 ontrollo. Noi non vogliamo creare conflitti tra i poteri dello Stato; le nostre pronunce so
277 la pronuncia è difficile, vi è un conflitto tra poteri dello Stato, perché potere esecutivo
278 con disagio questa conflittualità immanente tra il potere esecutivo e quello giudiziario, s
279 nza però realizzare alcunché. PRESIDENTE. Tra l'altro, costano anche poco. MARCELLO G
280 la questione della conflittualità oggettiva tra alcune norme e la disciplina carceraria, mi
281 PRESIDENTE. Come viene diviso l'istituto tra collaboratori e detenuti soggetti a regime
282 stato mandato, anche perché non è detto che tra coloro che sono sottoposti al regime di cui
283 losi custodi della nuova identità, cosa che tra l'altro comporta anche delle spese, perché
284 enta di mettere a proprio agio il soggetto. Tra gli aspetti paradossali della vicenda di qu
285 tirla e non di discriminare nella sicurezza tra persona e persona. GIUSEPPE SC
286 aspettative dei collaboratori di giustizia. Tra l'altro amministriamo soldi dello Stato e q
287 dino. Ognuno dice quello che pensa. Io sono tra coloro che sono stanchi di vedere uno Stato
288 petenza. Per quanto riguarda la distinzione tra i pentiti e coloro che sono infiltrati dell
289 i fondi, vista la crescita esponenziale che tra il 1992 e il 1994 emerge dalle cifre, esso
290 o di un testimone né di un suo familiare. Tra l'altro, devo dire che, tranne due casi, ch
291 "). Vi sono stati diversi collaboratori che tra l'altro hanno mandato in tilt la questura d
292 o inoltre conoscere il rapporto, se esiste, tra personale impiegato e persone tutelate: inf
293 iale. Per quanto riguarda il rapporto tra i collaboratori ed il personale del Servizi
294 il direttore del servizio, sull'interazione tra le procedure applicative del programma di p
295 eve svolgere, ossia quella di trait d'union tra voi, che siete impegnati direttamente sul c
296 agneremmo del tempo -, però è opportuno che tra voi e la Commissione antimafia si instauri
297 te sconsigliabile, ed addirittura negativo. Tra l'altro, non sempre tutti coloro che sono p
298 ste fare, intendete operare una distinzione tra il collaboratore, il pentito ed il testimon
299 Valentini abbia dei dati più puntuali, però tra il 1^ novembre 1993 e il 1^ novembre 1994 s
300 di diritto amministrativo... PRESIDENTE. Tra poco lo sapremo. GIACOMO GARRA. Fate un r
301 no della magistratura reggina, per esempio, tra magistrati di vari gradi, tra magistrati ch
302 per esempio, tra magistrati di vari gradi, tra magistrati che si occupano di pentiti ed al
303 de e si sviluppa il dibattito o il rapporto tra chi pone le domande stesse e chi risponde.
304 a giustizia siano in qualche modo collegati tra loro e, nel caso in cui tale collegamento e
305 almeno qui, avessimo chiara la distinzione tra Governo e Stato. Chi vi parla è tra coloro
306 inzione tra Governo e Stato. Chi vi parla è tra coloro che, da anni, si sono schierati cont
307 sul terreno della mafia, ma è anche da anni tra coloro che cercano di difendere lo Stato ne
308 difeso lo Stato e lo difende, distinguendo tra Stato e governi. Se questo Governo farà meg
309 sto Governo farà meglio degli altri, saremo tra Pagina 632 quelli che gliene daranno atto e
310 a pagina a caso in cui si parla di disagio, tra le tante che erano state redatte in quel pe
311 dichiarazione e di tutti gli interstizi che tra dichiarazione e dichiarazione possono prese
312 presentato particolari margini di rischio, tra fenomeni criminali e tessuto istituzionale.
313 pentiti emerga la permanenza di un rapporto tra mafia e politica che sia attuale e se vi si
314 tito non ha reso dichiarazioni sui rapporti tra mafia e politica perché purtroppo alcuni po
315 tà di ascoltare alcuni pentiti sui rapporti tra mafia e politica, ovviamente senza che ques
316 l'arresto, per effetto della detenzione. Ma tra questi due poli ci sono mille gamme interme
317 e interessare sotto il profilo del rapporto tra mafia e politica, la domanda se ci siano re
318 guerra di religione, come contrapposizione tra posizioni filosofiche diverse - a volte add
319 - a volte addirittura a livello di litigio tra tifosi da stadio - che non, ripeto, con la
320 gistrati è stata quella di tranquillizzare, tra virgolette, affermando che non c'è nulla di
321 E qui bisogna ancora una volta distinguere tra collaborante e collaborante, cioè tra un so
322 guere tra collaborante e collaborante, cioè tra un soggetto che ha vissuto un'esperienza cr
323 si potrebbe immaginare. Vanno considerati, tra gli altri, i limiti di resistenza umana; in
324 di difesa. Cosa accadrebbe, ad esempio, se tra i mille ricordi dell'esperienza criminale d
325 tra valutazione circa il rapporto esistente tra Cosa nostra e le associazioni di tipo masso
326 indagini sul problema cruciale del rapporto tra due tipi di organizzazioni eversive: Cosa n
327 è la separazione, di cui da tempo parliamo, tra struttura e personale addetti alle indagini
328 oblemi per quanto concerne il coordinamento tra i pubblici ministeri interessati? Quali vie
329 e ad instaurare un coordinamento nel lavoro tra i pubblici ministeri? Inoltre, quando vi
330 In ordine alla valutazione del rapporto tra Cosa nostra e massoneria ed a che punto sia
331 questa ipotesi consolidatisi e sviluppatisi tra componenti di Cosa nostra e momenti, profil
332 e sembrava una cosa incredibile; novecento, tra l'altro è il numero complessivo dei collabo
333 lato in questa sede di una giusta equazione tra segretezza e sicurezza. Dottor Lo Forte,
334 ossa esserlo anche la risposta: il contatto tra magistrati e pentiti, per le ragioni indica
335 resenza dello Stato "in periferia" (lo dico tra virgolette perché la lotta alla mafia non d
336 siciliani per poi soffermarmi sui rapporti tra la distrettuale, le procure ed i tribunali
337 ocura distrettuale di Palermo sia un unicum tra le altre distrettuali: non c'è altra distre
338 sta DDA, costretti a quotidiani spostamenti tra Palermo, Agrigento e Sciacca, attraverso st
339 i passaggi, resta il problema dei rapporti tra procure distrettuali e procure circondarial
340 parrocchiale che, di fronte all'alternativa tra l'esortazione evangelica ad ispirarsi all'o
341 ta verifica sul campo dell'attuale rapporto tra Cosa nostra e la società civile a confermar
342 atitanti e ad alcune questioni carcerarie), tra i quali anche quello relativo alla protezio
343 i Leonardo Messina, l'esistenza di contatti tra collaboratori di giustizia, nel momento in
344 ina ed ai suoi o no? Abbiamo una spaccatura tra gli uomini d'onore reclusi nelle carceri di
345 r quanto riguarda la questione dei contatti tra collaboratori di giustizia ed agenti dei se
346 A noi non risultano contatti di questo tipo tra i collaboratori di giustizia, le cui dichia
347 i una spaccatura o distinzione di strategie tra uomini d'onore detenuti e quelli liberi, ev
348 e organicità l'esistenza di una spaccatura tra le due componenti dell'organizzazione. MI
349 si traduce in un disagio, in una difficoltà tra i pentiti, si può essere portati a chieders
350 I, Procuratore della Repubblica di Palermo. Tra l'altro non è neanche rientrata. E' vero ch
351 di Grasso e di Vigna vengono sostituite - e tra l'altro circolano nomi, mi pare non smentit
352 elencati, che PM di primo grado non erano (tra questi c'era anche il dottor Ilarda), c'è u
353 ssere, così com'è oggi, per molti detenuti, tra i più significativi, una scatola vuota, un
354 r Caselli, ovvero la relazione strettissima tra segnali istituzionali, atti pubblici, misur
355 per uscire da questa grave situazione che, tra l'altro, ha provocato un clima di tensione
356 mafiosa in alcune amministrazioni comunali, tra le quali quelle di Corleone e San Giuseppe
357 otto una vecchia consuetudine di collusione tra amministrazione ed ambienti mafiosi, restit
358 ba esserci un rapporto chiaro e trasparente tra il presidente e i membri della Commissione,
359 o lavorare tutti assieme, senza distinzione tra maggioranza e opposizione. Questa dovrebbe
360 itico, dell'asprezza del dibattito politico tra le forze qui rappresentate, ma è di tipo is
361 anto uno sforzo volontaristico, ritengo che tra una settimana ci ritroveremo di fronte agli
362 mmo dovuto fare per procedere in tal senso. Tra l'altro, abbiamo denunciato fin dall'inizio
363 ventilato da qualcuno, di creare spaccature tra nord e sud, se è vero, com'è vero, che alla
364 e possa estrinsecarsi una certa litigiosità tra le componenti politiche della Commissione,
365 ella Commissione antimafia: una discussione tra me e lui si concluse proprio sulla necessit
366 empre molto vicina alla passione politica e tra sede e sede muta la risonanza delle azioni
367 stata rimarcata una differenza territoriale tra nord e sud, non certo per orgoglio o razzi
368 ortata ad esempio Pagina 697 una differenza tra nord e sud, in un momento in cui, invece, s
369 . Non voglio farmi illusioni, perché so che tra noi molti non sono amici, né posso pretende
370 Sono profondamente convinto che lo scontro tra il presidente ed il commissario Ayala non s
371 alle opinioni di ciascuno (che non sempre, tra l'altro, occorre tenere presenti, così come
372 e sospetti. Sono certissima di trovarmi qui tra persone più che oneste, lontane da ogni sos
373 isogno piuttosto che almeno in questa sede, tra i 51 componenti della Commissione, ci si ri
374 o abbia stimolato un momento di riflessione tra di noi. Avremmo tutti - io per prima, non m
375 ARLACCHI. Sul caso Ayala c'è un contrasto tra dichiarazioni fatte da Ayala e dichiarazion
376 in Sicilia debba essere fatto; siamo stati tra coloro che hanno ritenuto opportuno di rinv
377 one non è personale, ma attiene al rapporto tra il presidente della Commissione ed i compon
378 o e che li dobbiamo affrontare uno per uno. Tra questi problemi ne individuo tre sollevati
379 ì agli altri un problema di incompatibilità tra la sua posizione e il suo incarico. PRESI
380 ne per rinnovare le intese assunte nel 1993 tra la Commissione antimafia della precedente l
381 ni per effetto di una convenzione stipulata tra il Ministero di grazia e giustizia e, appun
382 e queste associazioni esistano ancora oggi. Tra l'altro, in quella zona si sono registrati
383 'è dubbio. ANTONIO BARGONE. L'esperienza, tra l'altro, ci insegna che la partecipazione d
384 voro mafia ed economia, abbiamo individuato tra gli obiettivi di primo piano quello relativ
385 - ma anche misure di carattere preventivo, tra cui l'utilizzo degli intermediari quale str
386 esercizio in via prevalente; la confluenza tra gli intermediari altresì dei soggetti che s
387 eranti in presumibile regime di abusivismo. Tra le problematiche emerse, la Guardia di fina
388 gli organi che hanno l'obbligo di riferire, tra gli altri, al Tesoro, in ordine all'osserva
389 , è consentita dagli operatori autorizzati. Tra l'altro, a questo proposito le farò una dom
390 el merito delle ispezioni), ma oggi non c'è tra noi chi non sia andato in banca per fare un
391 enzioni con i singoli istituti, soprattutto tra UIC e Banca d'Italia, tra UIC e ISVAP, tra
392 tuti, soprattutto tra UIC e Banca d'Italia, tra UIC e ISVAP, tra UIC e Consob. Ciò in modo
393 tra UIC e Banca d'Italia, tra UIC e ISVAP, tra UIC e Consob. Ciò in modo che, ognuno per i
394 dato (e che è stato il primo ostacolo posto tra le ruote) e la disponibilità di tutti quest
395 rci. PRESIDENTE. Lo so, ma la convenzione tra l'UIC e la Consob viene fatta, in base alla
396 gruppo di lavoro che si occupa dei rapporti tra mafia ed economia ha avvertito la necessità
397 asparenza: tenuto conto del tempo trascorso tra la data di presentazione della domanda e il
398 za del rapporto causale è questo il punto tra l'attentato subito e gli atteggiamenti di c
399 potere al commissario antiracket (il quale, tra l'altro, opera presso la Presidenza del Con
400 o del lavoro che sta svolgendo sul rapporto tra criminalità ed economia, potrebbe formulare
401 a di fideiussione. Qualora si appurasse che tra coloro i quali hanno proposto domanda di ri
402 r comprendere i motivi di queste reiezioni. Tra le diverse zone del paese, le regioni del s
403 afiosa. I problemi cominciarono a sorgere tra la fine del 1992 e l'inizio del 1993, per l
404 elle spese. Ciò detto, si deve sapere se tra i presupposti dell'applicazione del program
405 ano possibilità di collusioni o di incontri tra i soggetti. Questo è un punto centrale, per
406 normativa primaria, della differenziazione tra struttura investigativa e di protezione, pr
407 r sempre. In queste situazioni, il rapporto tra il cambiamento delle generalità e l'offerta
408 erverranno, e dopo aver valutato i rapporti tra la commissione e l'autorità giudiziaria.
409 in una situazione di sudditanza - sia detto tra virgolette - rispetto al capo della polizia
410 lizia giudiziaria determina una commistione tra i due aspetti della protezione e dell'inves
411 deve essere individualizzato in relazione, tra l'altro, allo stato di pericolo; tale indiv
412 vo - se necessario dagli organi competenti, tra i quali ovviamente doveva collocarsi, dopo
413 ubblica che non parleranno più dei rapporti tra politica ed istituzioni (vedete Buscetta, f
414 compenso esistente in seno alla commissione tra i componenti cosiddetti laici (cinque, oltr
415 rché sempre di più assistiamo a commistioni tra potere politico e criminalità organizzata.
416 o diverso, c'erano situazioni assai diverse tra loro che occorreva riportare ad una certa r
417 mbra si tratti di una collaborazione dovuta tra organi istituzionali e non mi pare ci sia a
418 li interventi che svolgeremo di qui a poco, tra i magistrati che fanno parte della commissi
419 ono valutazioni divergenti - e ve ne sono - tra il dottor Loris D'Ambrosio, il dottor Vigna
420 ttor Vigna ed altri che oggi interverranno (tra questi la procura di Palermo), ciò signific
421 temente alimentarsi del confronto razionale tra opinioni diverse, debbo dire che nell'ambit
422 o, vi è un'ampia circolazione di idee anche tra la Direzione distrettuale antimafia ed i co
423 upano di altri affari. I temi più generali, tra cui questi fondamentali riguardanti la lott
424 no su due temi fondamentali: la separazione tra la fase delle investigazioni e quella della
425 onatorio, quanto più vi sia una distinzione tra organo dell'investigazione e organo della p
426 meno immediati e concreti; è la distinzione tra le sfere istituzionali di competenza della
427 ti assolutamente impropri e non auspicabili tra organi della giurisdizione e organi dell'am
428 mente quelli relativi ad eventuali rapporti tra l'organizzazione criminale e componenti del
429 e quali sono le regole di un serio rapporto tra il collaborante e le istituzioni dello Stat
430 ura migliorare e razionalizzare il rapporto tra questi due aspetti. Ritengo tuttavia che, a
431 to virtualmente un principio di distinzione tra i poteri dello Stato, che va salvaguardato
432 olutamente non necessario di conflittualità tra autorità giudiziaria ed organo amministrati
433 empo, inoltre, potrebbe durare il conflitto tra l'autorità giudiziaria proponente, che rite
434 quanto concerne il problema delle divisione tra custodia ed investigazione, problema che, p
435 dalla Commissione antimafia, quale un forum tra le procure distrettuali e molte procure ord
436 cipi costituzionali che regolano i rapporti tra l'amministrazione e la giurisdizione. Natur
437 obiettare che vi è una profonda differenza tra i due settori, cioè tra la valutazione in o
438 profonda differenza tra i due settori, cioè tra la valutazione in ordine alla possibilità d
439 a Siclari, ha finito con il creare problemi tra il pubblico ministero e la commissione. Que
440 come avviene nei rapporti di collaborazione tra autorità diverse, e tra amministrazione e g
441 i di collaborazione tra autorità diverse, e tra amministrazione e giurisdizione. Mai, però,
442 esta materia e di razionalizzare i rapporti tra gli uffici della procura e l'organismo cent
443 - che peraltro non è mai mancato in passato tra la commissione e la procura della Repubblic
444 tratta di un settore che si trova al limite tra la legislazione primaria e quella di second
445 essivo. Se è vero che tale procura ha avuto tra i suoi poteri anche quello di effettuare un
446 anche quello di effettuare un coordinamento tra le procure distrettuali, è chiaro che una s
447 tiamo occupando, sia necessario distinguere tra gli Pagina 781 aspetti formali e quelli sos
448 ssibilità di ricorrere ad alternative e che tra queste vi sia anche la detenzione extracarc
449 siamo sicuramente su un terreno di confine tra amministrazione e giurisdizione. Siamo su u
450 ratta di due attività profondamente diverse tra loro: mentre quella giudiziaria ovviamente
451 mio parere dannosissimo momento di attrito tra il potere giudiziario e quello amministrati
452 Lo Forte, cioè quello relativo ai rapporti tra mafia e politica e mafia ed istituzioni, il
1 dal Presidente per procedere all'elezione, fra i suoi componenti, di due Vicepresidenti e
2 nati dai presidenti delle Camere, di intesa fra loro. 2. Le spese per il funzionamento d
3 ssere assolti in tempi del tutto residuali, fra una votazione e l'altra o fra l'una o l'alt
4 to residuali, fra una votazione e l'altra o fra l'una o l'altra seduta di Commissione. Imma
5 li questioni, per poi cominciare a lavorare fra due o tre settimane. Si tratterebbe, a mio
6 he intanto la burocrazia diminuisca i tempi fra la confisca e l'assegnazione; mi rendo cont
7 nche una ricognizione sul tema del rapporto fra mafia ed enti locali. In merito a quest'asp
8 unque serio, ma è una questione di rapporti fra Governo, Parlamento e sistema bancario e no
9 tati Uniti: daremo avvio ad un collegamento fra tutti i paesi amici per rafforzare la lotta
10 cune gestioni commissariali (non ricordo se fra esse vi era anche quella del comune di Terl
11 re, le connessioni giuridiche ed economiche fra i soggetti prenditori del credito. Passo a
12 ve o statutarie che ne regolano l'attività; fra queste, evidentemente, rientra sicuramente
13 anizzata è rappresentata dalla cooperazione fra le autorità preposte ai controlli. Con la s
14 ttoscrizione dei Memoranda of understanding fra le "Vigilanze" dei paesi comunitari si dà c
15 resentato un significativo foro di incontro fra le diverse delegazioni nazionali dell'inter
16 oggi continui ad esistere. Mi scuso se fra poco dovrò allontanarmi, ma avrò il piacere
17 che si faceva sul serio soprattutto perché fra gli studiosi - fra i quali cito il Rey - si
18 serio soprattutto perché fra gli studiosi - fra i quali cito il Rey - si cominciava a prosp
19 ovrà essere anche la questione dei rapporti fra le varie forze di polizia, anche con riferi
20 alutare la possibilità che il coordinamento fra le forze di polizia possa essere potenziato
21 i flussi di spesa pubblica, con l'intreccio fra le imprese mafiose e gli eventuali appoggi
22 livello internazionale, per proporre intese fra tutti i paesi per arrivare ad una armonizza
23 n dolore. A me non ha fatto piacere quando, fra i primi, ho scritto delle collusioni dell'a
24 domani, dopodomani, la prossima settimana, fra due settimane e fra tre mesi. Il senso dell
25 la prossima settimana, fra due settimane e fra tre mesi. Il senso dell'ordine del giorno,
26 'entra l'autorità giudiziaria. Questi casi, fra l'altro, sono apparsi su tutti i giornali;
27 ella Commissione alcuni elementi, contenuti fra l'altro nel programma, ai quali si intende
28 che si articola lungo più direttrici tutte fra loro strettamente connesse ed alla cui scel
29 curare l'effettivo isolamento del detenuto. Fra questi si annoverano quelli dell'Asinara e
30 essere dalle organizzazioni mafiose, prime fra tutte le attività economiche e finanziarie.
FRA
31 à economica, di infiltrazioni, di relazioni fra settori economici, istituzionali, imprendit
32 il tribunale di Reggio Calabria, dal quale fra poco saranno scarcerate centinaia di person
33 il periodo attuale. La distinzione fra i vari periodi è utile, anche se costringe
34 generale non era emerso alcun collegamento fra queste persone e la criminalità organizzata
35 pplicazione, tenendo conto di vari elementi fra i quali anche la giurisprudenza costituzion
36 sa efficacia ma che si sottrae al conflitto fra persone di ottima volontà da una parte e da
37 dinaria potrebbe superare. Alla discussione fra 41bis sì, 41-bis no, la risposta é una norm
38 ticolo 41-bis), ossia una netta separazione fra i detenuti in base ai reati commessi, possa
39 egli ultimi avvenimenti e della commistione fra detenuti normali e politici, lei non pensa
40 roduce altri reati e strane alleanze, anche fra la delinquenza comune e politica. La sua
41 i problemi cui accenna l'onorevole Simeone, fra l'altro per gli avvenimenti emersi recentem
42 o in zone molto diverse. Il collegamento fra organizzazioni criminali diverse, comunque,
43 sidente, che la differenza di comportamento fra banche del nord e del sud è abissale: lei c
44 un'ispezione, fate un'indagine comparativa fra gli istituti bancari del nord e del sud: è
45 in alcuni interventi è quello del rapporto fra dichiarazioni e fatti. Credo sia compito in
46 to che lo stesso è imputato di molti reati, fra cui un certo numero di omicidi, che inducon
47 o istituti nuovi, come Opera a Milano, dove fra l'altro abbiamo creato cento posti nell'osp
48 ri, educatori e chi più ne ha più ne metta: fra questi vi è un conflitto terribile, come av
49 ispetto e che per una vita è stato detenuto fra i detenuti. PRESIDENTE. Tutti i magistrat
50 ia in discussione! LUIGI ROSSI. Scusi, ma fra l'articolo 41-bis e l'articolo 13 della Cos
51 odia). In molti di essi, in particolar modo fra i giovani, vi è un'adesione al malumore pop
52 poi individuare un sistema di collegamento fra la magistratura e il carcere ai fini delle
53 ma anzi attiva quell'indiscussa solidarietà fra gli stessi uomini d'onore"; infatti gli uom
54 Per quanto riguarda i rapporti intrattenuti fra i detenuti ed il mondo esterno hanno parlat
55 ecché ne pensi il direttore, è pacifico che fra di loro vi siano contatti; ma all'esterno a
56 perché magari un giorno devono essere qui e fra tre a San Gimignano per un altro reato. Sug
57 n provincia di Lecce, dove ora si ammazzano fra loro: è una tragedia che il fenomeno sia ar
58 illazioni cui abbiamo assistito alla Camera fra maggioranza e Governo, che hanno portato di
59 ri come mai continui il contenzioso in atto fra l'esecutivo e la magistratura, che suscita
60 iose, se non addirittura in odore di mafia. Fra l'altro, ho raccolto vivaci critiche sulle
61 segnalazione che dovrà avere a presupposto, fra gli indici di anomalia delle operazioni ste
62 nuovo rapporto che ha cercato di instaurare fra cittadino e Stato) e, dall'altro, all'inver
63 e, prevedendo eventualmente un collegamento fra i due momenti ai limitati fini di garantire
64 ato per quanto riguarda i problemi del sud, fra i quali in primo luogo vi è quello della ma
65 ntito l'efficace funzionamento del sistema. Fra quei punti vanno certamente inseriti la sca
66 uello informativo e operativo. Sono questi, fra gli altri, gli obiettivi della prossima con
67 tica nazionale ed internazionale, che tenga fra l'altro conto delle risultanze del referend
68 presenti nell'animo siciliano: il dualismo fra società e Stato, il ripiegamento sulla fami
69 re la questione delle sue fonti di guadagno fra le quali la principale è il commercio delle
70 o male, per poi trovare la medicina adatta. Fra le medicine vi è anche l'isolamento, indubb
71 questo tipo di servizio: vi è una rotazione fra tutto il personale? C'è una determinata cat
72 iudici faziosi e, in qualche caso, di faide fra giudici), ha addirittura assunto un'altra i
73 in cui si affronta il capitolo dei rapporti fra mafia e politica, tutto da verificare. I gi
74 beni. Permanenza di un rapporto attuale fra mafia e politica. Ferdinando Imposimato ha
75 à difficile una linea di demarcazione netta fra militare e colletto bianco: non esiste fors
76 che costituivano il momento di collegamento fra l'organizzazione militare e la società, cio
77 tuale della verifica delle tesi di accusa fra le informazioni che provengono dall'interno
78 la giusta autonomia e la giusta distinzione fra i pubblici ministeri è evidente che esister
79 ichiarazione chiara, ma ciò non è avvenuto. Fra l'altro aggiungo che l'intervista con le di
80 zioni, e non avendolo fatto credo di essere fra coloro che possono essere autorizzati a for
81 di recuperare una compatibilità funzionale fra i suoi componenti (perché si tratta di comp
82 bile, ma almeno che neutralizzi le distanze fra i gruppi politici per un lavoro in comune a
83 n ordinario, banale e fisiologico confronto fra opzioni diverse e nemmeno mi è sembrato un
84 n ordinario, banale e fisiologico confronto fra opzioni diverse e nemmeno un ordinario, ban
85 OPELLITI. Potremmo convocare la Commissione fra le 13,30 e le 14,30, tenendo conto degli im
86 ci auguriamo, quindi, che a breve riavremo fra noi l'onorevole Ayala e che questa Commissi
87 uesta sede abbiamo soltanto discusso, ed io fra i primi ho fatto rilevare che impostare l'a
88 gressa normativa riguardano: la distinzione fra intermediari finanziari ed i soggetti non o
89 vità svolta, alla dimensione ed al rapporto fra indebitamento e patrimonio sono iscritti in
90 ne di servizi di pagamento, la demarcazione fra operatività nei confronti del pubblico e no
91 stessa disciplina, per stabilire i confini fra l'una e l'altra normativa. Comunque, le
92 uno scarto enorme, o comunque consistente, fra l'entità del fenomeno del riciclaggio e qua
93 lo ripeto, senz'altro si rileva uno scarto fra l'entità del fenomeno e la quantità di atti
94 do esattamente l'espressione da lei usata - fra il testo unico della legge bancaria e la le
95 Gazzetta Ufficiale, nel quale sono dettate, fra l'altro, e sempre in attuazione dell'artico
96 rganizzazioni criminali di tipo mafioso - e fra queste naturalmente tutta la materia riguar
__________________________________________________________________
Bibliography
Aarts, J. (1991) “Intuition-Based and Observation-Based Grammars”, in:
Aijmer and Altenberg 1991:44-62
Aarts, J./de Haan, P./Oostdijk, N. (eds.) (1993) English Language
Corpora: Design, Analysis and Exploitation, Amsterdam: Rodopi
(Language and Computers: Studies in Practical Linguistics)
Aarts, J./Meijs, W. (eds.) (1984) Corpus Linguistics, Amsterdam: Rodopi
Aarts, J./Meijs, W. (eds.) (1986) Corpus Linguistics II, Amsterdam: Rodopi
Aarts, J./Meijs, W. (eds.) (1990) Theory and Practice in Corpus Linguistics,
Amsterdam: Rodopi
Ahrenberg, L./Merkel, M. (1996) “On Translation Corpora and Translation
Support Tools: A Project Report”, in Aijmer et al. 1996
Aijmer, K./Altenberg, B. (eds.) (1991) English Corpus Linguistics:
Studies in Honour of Jan Svartvik, London: Longman
Aijmer, K./Altenberg, B./Johansson, M. (eds.) (1996) Languages in
Contrast: Papers From a Symposium on Text-Based Cross-Linguistic
Studies, Lund: Lund University Press
Armstrong, G. (1996) “Computer-Assisted Literary Analysis Using the
TACT Text-Retrieval Program”, in: Computers & Texts 11(8)
Aston, G. (1997) “Small and Large Corpora in Language Learning”, paper
presented at the PALC Conference, University of Lodz, Poland
Baker, M. (1993) “Corpus Linguistics and Translation Studies. Implications
and Applications”, in: Baker et al. 1993:233-250
Baker, M. (1995) “Corpora in Translation Studies: An Overview and some
Suggestions for Future Research”, in: Target 7(2):223-243
Baker, M. (ed.) (1998) Routledge Encyclopedia of Translation Studies,
London: Routledge (Translating and Interpreting - Encyclopedia)
Baker, M./Francis, G./Tognini-Bonelli, E. (eds.) (1993) Text and
Technology: In Honour of John Sinclair, Amsterdam: John Benjamins
Barnbrook, G. (1996) Language and Computers, Edinburgh: Edinburgh
University Press
Bergenholtz, H./Schaeder, B. (eds.) (1979) Empirische Textwissenschaft:
Aufbau und Auswertung von Text-Corpora, Königstein: Scriptor
Verlag
Bernardini, S. (1997) “A ‘Trainee’ Translator’s Perspective on Corpora”,
paper presented at the international conference Corpus Use and
Learning to Translate, Centro Residenziale Universitario, Bertinoro
39
Biber, D. (1988) Variation Across Speech and Writing, Cambridge:
Cambridge University Press
Biber, D. (1993) “Co-occurrence patterns among collocations: a tool for
corpus-based lexical knowledge acquisition”, in: Computational
Linguistics 19(3):549-556
Biber, D. (1993b) “Representativeness in corpus design”, in: Literary and
Linguistic Computing 8:243-257
Biber, D./Conrad, S./Reppen, R. (1998) Corpus Linguistics: Investigating
Language Structure and Use, Cambridge: Cambridge University
Press (Cambridge Approaches to Linguistics)
Biber, D./Conrad, S./Reppen, R. (1998b) “Corpus-based Approaches in
Applied Linguistics”, in: Applied Linguistics 15:169-189
Bortolini, U./Tagliavini, C./Zampolli, A. (1971) Lessico di frequenza della
lingua italiana contemporanea, Milano: IBM Italia
Brill, E. (1993) A corpus-based approach to language learning, PhD
Thesis, University of Pennsylvania: Department of Computing
Calzolari, N./Bindi, R. (1990) “Acquisition of lexical information from a
large textual Italian corpus”, in: Proceedings of the Thirteenth
International Conference on Computational Linguistics, Helsinki
39
See footnote 35
Chafe, W. (1992) “The Importance of Corpus Linguistics to Understanding
the Nature of Language”, in Svartvik 1992a:79-97
Chomsky, N. (1957) Syntactic Structures, The Hague: Mouton
Chomsky, N. (1962) Paper given at the University of Texas 1958, 3rd
Texas Conference on Problems of Linguistic Analysis in English,
Austin, University of Texas
Chomsky, N. (1965) Aspects of the Theory of Syntax, Cambridge, MA:
MIT Press
Cravetto, E./De Petri, L./Bosso, S./Ferrantino, R./Mazzucchetti,
E./Pellicoro, E./Prato, G./Recupero, A./Rosso, C./Servetto, P./Vono,
E. (1997) Dizionario Enciclopedico Multimediale, Torino: Garzanti-UTET
Crowdy, S. (1993) “Spoken corpus design”, in: Literary and Linguistic
Computing 8(4):259-296
Dardano, M./Trifone, P. (1983) Grammatica italiana con nozioni di
linguistica, Bologna: Zanichelli
Dardano, M./Trifone, P. (1985) La lingua italiana, Bologna: Zanichelli
De Mauro, T./Mancini, F./Vedovelli, M./Voghera, M. (1993) Lessico di
frequenza dell'italiano parlato, Milano: Etaslibri
D’Ovidio, F. (1933) Le correzioni ai “Promessi Sposi” e la questione della
lingua, Napoli: Guida
Eaton, H. (1940) Semantic Frequency List for English, French, German
and Spanish, Chicago: Chicago University Press
Feyrer, C. (1998) Modalität im Kontrast: Ein Beitrag zur
übersetzungsorientierten Modalpartikelforschung anhand des
Deutschen und des Französischen, Frankfurt am Main: Peter Lang
(Europäische Hochschulschriften, Reihe XXI, Linguistik)
Fillmore, C. (1992) “‘Corpus linguistics’ or ‘Computer-aided armchair
linguistics’”, in: Svartvik 1992a:35-60
Freigang, K.-H. (1998) “Machine-Aided Translation”, in: Baker 1998:134-136
Friedbichler, I./Friedbichler, M. (1997) “Korpusgestütztes Übersetzen
jenseits der Wortgrenzen”, in: Lebende Sprachen 2/97:49-53
Friedbichler, I./Friedbichler, M. (1997) “The Potential of Domain-Specific
Target-Language Corpora For The Translator’s Workbench”, paper
presented at the international conference Corpus Use and Learning
to Translate, Centro Residenziale Universitario, Bertinoro
40
Fries, C. (1952) The Structure of English: An Introduction to the
Construction of Sentences, New York: Harcourt-Brace
Fries, C./Traver, A. (1940) English Word Lists. A Study of their Adaptability
for Instruction, Washington DC: American Council on Education
Fries, U./Tottie, G./Schneider, P. (eds.) (1994) Creating and Using English
Language Corpora, Amsterdam: Rodopi
Gavioli, L. (1996) “Corpora And The Concordancer In Learning ESP: An
Experiment In A Course Of Interpreters And Translators”, paper
presented at the 18th Congress of the Associazione Italiana di
Anglistica, Genova
Gavioli, L./Zanettin, F. (1997a) “Corpus Use and Learning to Translate”,
paper presented at the international conference Corpus Use and
Learning to Translate, Centro Residenziale Universitario, Bertinoro⁴¹
Gavioli, L./Zanettin, F. (1997b) “Comparable Corpora And Translation: A
Pedagogic Perspective”, paper presented at the international
conference Corpus Use and Learning to Translate, Centro
Residenziale Universitario, Bertinoro⁴²
Gougenheim, G./Michéa, R./Rivenc, P./Sauvageot, A. (1956) L’Elaboration
du français élémentaire, Paris: Didier
Granger, S. (1993) “International Corpus of Learner English”, in: Aarts et
al. 1993:57-71
⁴⁰ Copyright notice: The moral rights of the author(s) to be identified as author(s) of this work are
asserted in accordance with ss. 77 and 78 of the Copyright, Designs and Patents Act 1988. This
work may be reproduced without the consent of the author, in part or in whole in any manner and
in any medium subject only to the two following conditions:
– no charge shall be made for the copy containing the work or the excerpt
– a copy of this notice shall precede the work or the excerpt
⁴¹ See footnote 35
Greenbaum, S. (1991) “The Development of the International Corpus of
English”, in: Aijmer and Altenberg 1991:83-91
Halliday, M. A. K./Hasan, R. (1976) Cohesion in English, London:
Longman
Halliday, M. A. K./Hasan, R. (1985) An Introduction to Functional
Grammar, London: Edward Arnold
Hughes, G. (1997) “Developing a Computing Infrastructure for
Corpus-based Teaching”, in: Wichmann et al. 1997:292-307
Jimenez, M. M. (1995) Sprache, Computer und Übersetzen, Diplomarbeit,
Übersetzer- und Dolmetscherinstitut, Graz
Johansson, S. (ed.) (1982) Computer Corpora in English Language
Research, Bergen: Norwegian Computing Centre for the Humanities
Johansson, S./Stenström, A.-B. (eds.) (1991) English Computer Corpora:
Selected Papers and Research Guide, Berlin: Mouton de Gruyter
Johansson, S./Oksefjell, S. (1998) Corpora and Cross-linguistic Research,
Amsterdam: Rodopi
Johansson, S. (1998) “On the Role of Corpora in Cross-linguistic
Research”, in: Johansson and Oksefjell 1998
Johns, T. (1997) “Contexts: the Background, Development and Trialling of
a Concordance-based CALL Program”, in: Wichmann et al. 1997:100-115
Käding, J. (1897) Häufigkeitswörterbuch der deutschen Sprache, Steglitz:
privately published
Kenny, D. (1998) “Corpora in Translation Studies”, in: Baker 1998:50-53
Kenny, D. (forthcoming) Developing A Corpus-Based Methodology For
Investigating Universal Features Of Translation, PhD thesis
Klaudy, K./Lambert, J./Sohár, A. (eds.) (1996) Translation Studies in
Hungary, Budapest: Scholastica
Klaudy, K./Kohn, J. (eds.) (1997) Transferre Necesse Est, Budapest:
Scholastica
⁴² See footnote 35
Kohn, J. (1996) “What Can (Corpus) Linguistics Do for Translation?”, in:
Klaudy et al. 1996:39-52
Krenn, H. (1996) Italienische Grammatik, Ismaning: Max Hueber Verlag
Kytö, M./Ihalainen, O./Rissanen, M. (eds.) (1988) Corpus Linguistics Hard
and Soft, Amsterdam: Rodopi
Laffling, J. (1991) Towards High-Precision Machine Translation: Based on
Contrastive Textology, New York: Foris Publications (Distributed
Language Translation:7)
Lager, T. (1995) A Logical Approach to Computational Corpus
Linguistics, PhD thesis, University of Göteborg: Department of
Linguistics
Laviosa-Braithwaite, S. (1997) “Investigating Simplification in an English
Comparable Corpus of Newspaper Articles”, in: Klaudy and Kohn
1997:531-540
Laviosa-Braithwaite, S. (1998) “Universals of Translation”, in: Baker
1998:288-291
Leech, G. (1991) “The State of The Art In Corpus Linguistics”, in: Aijmer
and Altenberg 1991:8-29
Leech, G./Candlin, C. (eds.) (1986) Computers in English Language
Teaching, London: Longman
Leech, G./Fallon, R. (1992) “Computer Corpora - What Do They Tell Us
About Culture?”, in: ICAME Journal 16:29-50
Legenhausen, L. (ed.) (1996) Computers in the Foreign Language
Classroom, proceedings of the workshop no. 2 of the annual meeting
of the European Centre for Modern Languages, Graz: unpublished
Leitner, G. (ed.) (1992) New Dimensions in English Language Corpora,
Berlin: Mouton de Gruyter
Levi, E./Dosi, A. (1982) I dubbi della grammatica, Milano: Longanesi & C.
Lorge, I. (1949) Semantic Content of the 570 Commonest English Words,
New York: Addison Wesley
Louw, B. (1997) “The Role of Corpora in Critical Literary Appreciation”, in:
Wichmann et al. 1997:240-251
Maia, B. (1997a) “Making Corpora: A Learning Process”, paper presented
at the international conference Corpus Use and Learning to
Translate, Centro Residenziale Universitario, Bertinoro⁴³
Maia, B. (1997b) “Sentence Structure and Thematization in Comparable
and Parallel Texts”, in: Klaudy and Kohn 1997:541-547
McEnery, T. (1992) Computational Linguistics, Wilmslow: Sigma Press
McEnery, T./Wilson, A. (1993) “The Role Of Corpora In Computer-Assisted
Language Learning”, in: Computer Assisted Language
Learning 6(3):233-248
McEnery, T./Wilson, A. (eds.) (1996) Corpus Linguistics, Edinburgh:
Edinburgh University Press (Edinburgh Textbooks in Empirical
Linguistics)
McEnery, T./Baker, P./Wilson, A. (1995) “A Statistical Analysis Of Corpus
Based Computer Vs. Traditional Human Teaching Methods Of Part
Of Speech Analysis”, in: Computer Assisted Language Learning
8(2/3):259-274
Meijs, W. (ed.) (1987) Corpus Linguistics and Beyond, Amsterdam:
Rodopi
Merkel, M. (1993) “When And Why Should Translations Be Reused?”,
paper presented at the XIII VAAKKI symposium, Vaasa
Merkel, M. (1996) “Consistency And Variation in Technical Translations –
A Study of Translators’ Attitudes”, in: Proceedings from Unity in
Diversity, Translation Studies Conference, Dublin
Mindt, D. (1992) Zeitbezug im Englischen: eine didaktische Grammatik
des englischen Futurs, Tübingen: Gunter Narr
Mindt, D. (1996) “English Corpus Linguistics and the Foreign Language
Teaching Syllabus”, in: Thomas and Short 1996:232-247
Newmark, P. (²1994) La traduzione: problemi e metodi, Milano: Garzanti
(Strumenti di studio)
⁴³ See footnote 35
Peters, C./Picchi, E. (1997) “Reference Corpora and Lexicons for
Translators and Translation Studies”, in: Trosborg 1997:247-274
Porozinskaya, G. (1997) “Aspects of Literary and MT Editing in Teaching
Translation”, in: Klaudy and Kohn 1997:553-557
Quirk, R./Greenbaum, S./Leech, G./Svartvik, J. (1985) A Comprehensive
Grammar of the English Language, London: Longman
Reinke, U. (1997) “Computergestützte Kommunikation im
Übersetzungsunterricht?”, in: Lebende Sprachen 4/97:145-153
Reiß, K. (²1983) Texttyp und Übersetzungsmethode: der operative Text,
Heidelberg: Groos
Renouf, A. (1987) “Corpus Development”, in: Sinclair 1987:1-40
Renouf, A. (1997) “Teaching Corpus Linguistics to Teachers of English”,
in: Wichmann et al. 1997:255-266
Renzi, L./Salvi, G./Cardinaletti, A. (1995) Grande grammatica italiana di
consultazione, Bologna: il Mulino
Rico Pérez, C./Martín De Santa Olalla Sánchez, A. (1997) “New Trends in
Machine Translation”, in: Meta 4/97:605-621
Rissanen, M. (1989) “Three Problems Connected With The Use Of
Diachronic Corpora”, in: ICAME Journal 13:16-19
Rogers, M. (1997) “Synonymy and Equivalence in Special-language
Texts: A Case Study in German and English Texts on Genetic
Engineering”, in: Trosborg 1997:217-245
Salvi, G./Vanelli, L. (1992) Grammatica essenziale di riferimento della
lingua italiana, Firenze: Istituto Geografico De Agostini, Le Monnier
Serianni, L. (1989) Grammatica Italiana: suoni, forme, costrutti, Torino:
UTET
Short, M./Semino, E./Culpeper, J. (1996) “Using a Corpus For Stylistics
Research: Speech And Thought Presentation”, in: Thomas and Short
1996:110-131
Sinclair, J. (ed.) (1987) Looking Up, London: Collins
Sinclair, J. (ed.) (²1992) Corpus, Concordance, Collocation, Oxford:
Oxford University Press (Describing English Language)
Somers, H. L. (1998) “Machine Translation: Applications”, in: Baker
1998:136-139
Somers, H. L. (1998) “Machine Translation: History”, in: Baker 1998:140-143
Somers, H. L. (1998) “Machine Translation: Methodology”, in: Baker
1998:143-149
Souter, C./Atwell, E. (eds.) (1993) Corpus-Based Computational
Linguistics, Amsterdam: Rodopi
Stenström, A.-B. (1987) “Carry-on Signals in English Conversation”, in:
Meijs 1987:87-119
Stubbs, M. (1996) Text and Corpus Analysis: Computer-assisted Studies
of Language and Culture, Oxford: Blackwell (Language in Society)
Summers, D. (1996) “Computer Lexicography – The Importance of
Representativeness in Relation to Frequency”, in: Thomas and Short
1996:260-266
Svartvik, J. (ed.) (1990) The London-Lund Corpus of Spoken English:
Description and Research, Lund: Lund University Press
Svartvik, J. (ed.) (1992a) Directions in Corpus Linguistics, Berlin: Mouton
de Gruyter
Svartvik, J. (1992b) “Corpus Linguistics Comes of Age”, in: Svartvik
1992a:7-13
Thomas, J./Short, M. (eds.) (1996) Using Corpora for Language
Research: Studies in Honour of Geoffrey Leech, New York:
Longman
Tribble, C. (1997) “Improvising Corpora in ELT: Quick-And-Dirty Ways Of
Developing Corpora For Language Teaching”, paper presented at
the first international conference Practical Applications in Language
Corpora, University of Lodz, Poland
Tribble, C./Jones, G. (1990) Concordances in the Classroom: A Resource
Book for Teachers, London: Longman
Trosborg, A. (ed.) (1997) Text Typology and Translation, Amsterdam:
John Benjamins Publishing Co. (Benjamins translation library:26)
Toury, G. (1991) “What are Descriptive Studies into Translation Likely to
Yield apart from Isolated Descriptions”, in: van Leuven-Zwart and
Naaijkens 1991:172-192
Van Leuven-Zwart, K./Naaijkens, T. (eds.) (1991) Translation Studies: The
State of the Art: Proceedings from the First James S. Holmes
Symposium on Translation Studies, Amsterdam: Rodopi
Varantola, K. (1997) “Translators, Dictionaries and Text Corpora”, paper
presented at the international conference Corpus Use and Learning
to Translate, Centro Residenziale Universitario, Bertinoro⁴⁴
Venuti, L. (ed.) (1992) Rethinking Translation, New York: Routledge
Venuti, L. (1992) “Introduction”, in: Venuti 1992:1-15
Venuti, L. (1995) The Translator's Invisibility: A History of Translation, New
York: Routledge
Wichmann, A./Fligelstone, S./McEnery, A./Knowles, G. (1997) Teaching
and Language Corpora, London: Longman (Applied Linguistics and
Language Study)
Wolff, D. (1996) “MULTICONCORD: A Multilingual Parallel
Concordancer”, in: Legenhausen 1996:74-79
Wright, S. (1993) “In Search of History: English Language In the
Eighteenth Century”, in: Aarts et al. 1993:25-39
⁴⁴ See footnote 35
__________________________________________________________________
Index
Alignment, 39
Annotation, 30, 33, 35, 38
Artificial intelligence (AI), 77
CALL, 71
Chomsky, Noam, 8, 13-14, 19
COCOA reference, 36, 63
Collins COBUILD, 16, 45
Collocating, 48
Comparability, 25, 28
Compilation, 29
Comparative linguistics, 12-13
Competence, 14, 15
Computer-aided translation (CAT), 39
Concordancers, 48-50, 63, 68, 96
  Streaming, 50
  Text-indexers, 50
  In-memory, 50
Concordancing, 47
Copyright, 26, 44
CORPORA
  American Representative Corpus of Historical English Registers (ARCHER), 31
  Bank of English, 16, 19, 57, 64
  British National Corpus (BNC), 35, 45
  Brown Corpus of American English, 28, 33, 44
  Canadian Hansard, 39
  Child Language Database (CHILDES), 46
  Computer Science Corpus of the Hong Kong University (HKUST), 45, 71
  Corpora Project Språkteknologi, 43
  Corpus of Spoken American English (CSAE), 29
  CRATER, 40
  English-Norwegian Parallel Corpus, 41
  ETAP, 42
  FECCS, 42
  Guangzhou Petroleum English Corpus (GPEC), 45, 71
  Helsinki Historical English Corpus, 31, 36
  IDS-Korpora, 64
  International Computer Archive of Modern English (ICAME), 29
  International Corpus of English (ICE), 28
  INTERSECT, 40
  Kolhapur Corpus of Indian English, 44
  Lancaster/IBM Spoken English Corpus, 20
  Lancaster-Oslo/Bergen Corpus (LOB), 20, 26, 33
  LINGUA, 40
  London-Lund Corpus (LLC), 20, 29, 37, 38, 39
  Longman-Lancaster Corpus, 36
  MULTEXT, 40
  MULTEXT-EAST, 40
  Penn Treebank, 36
  Proteus Project, 42
  Scandinavian Project of Contrastive Corpus Studies, 43
  Scania Corpus, 43
  Survey of English Usage (SEU), 16, 20, 61
  Swedish Government Corpus, 43
  Swedish Immigrant Newspaper Corpus, 43
  Text-Based Contrastive Studies in English, 42
  Translation Corpus of English and German, 42
  Translearn, 42
  TRIPTIC, 40
CORPUS
  Annotated, 23, 33
  Comparable, 74-75, 80
  Core, 28, 41
  Developmental, 46
  Diachronic, 30
  General, 44-45
  Learner, 46
  Monitor, 16, 19, 20, 28, 45, 65
  Monolingual, 39
  Morphology, 65
  Multilingual, 39, 74
  Multimedia, 88
  Parallel, 39-40, 74
  Parsed, 34, 36, 60
  Prosodic, 37
  Raw, 33
  Reference, 44
  Specialised, 28, 45
  Spoken, 32, 38
  Sublanguage, 45
  Supplementary, 41
  Synchronic, 24
  Tagged, 33, 36, 52
  Untagged, 33
Corpus-based research, 18
Corpus-based studies, 21
Corpus creation, 24
Corpus outline, 24
Counting, 47
Cross-cultural studies, 23, 29
Design criteria, 23, 24, 30
Dialect, 27, 31
Disambiguation, 48, 52
Discourse studies, 68
Distribution, 25, 28
Diversity, 24, 27
Document header, 36
Electronic texts, 81
Ethnolinguistics, 72
Exploitation tools, 21
Flexibility, 23
Frequency tables, 50-51, 98-99
Grammar, 60
Idiolect, 27, 31
Inductive learning, 22
Interdisciplinarity, 23
Introspection, 15, 16, 21, 61
Keyword in Context (KWIC), 49
Language acquisition, 15
LANGUAGE
  Learning, 57, 70
  Pedagogy, 12
  Promotion, 24
  Teaching, 58, 70-71, 82
  Variety, 23, 24
Lemmatising, 48
Lexicography, 62
Machine-readable form, 16, 18, 20, 26, 62
Machine translation (MT), 39, 43, 53, 76-78, 108
  Example-based, 77
  Statistics-based, 78
Networking, 56
Parsers, 32, 53, 84
Parsing, 47
Part-of-speech (POS) tag, 33, 38
POS tagging, 48, 95
Performance, 14, 15, 16
Permission, 27
Pragmatics, 67
Registers, 31-32
Register variation, 27
Representativeness, 18-19, 24, 30, 63, 97
Reusability, 22
Sampling, 19, 24-25, 28, 32
  Proportional, 25
  Stratified, 26
Searching, 47
Semantics, 13, 65
Size, 20, 25, 28, 97
  Finite, 18-19
Specificity, 23
Spelling conventions, 12
Standard reference, 16, 18, 21
Subject matter, 27
Stylistics, 68
Syntax, 13
Tabling, 47
Taggers, 52-53
Target domain selection, 25
Termbanks, 81
Terminology, 62, 64
Translation, 73-90
Translation memory systems (TM), 79, 108
Translation research, 82
Universals of translation, 75