“Until the phenomena of any branch of knowledge have been submitted to measurement and number, it cannot assume the dignity of a science.”
(Sir Francis Galton, 1822-1911)

Contents

Acknowledgements
List of Abbreviations

Part I History, Criteria and Research
1 Introduction
2 History of Corpora
2.1 Early Corpus Linguistics
2.2 The Chomskyian Revolution
2.3 Modern Corpus Linguistics
3 Corpus-Based Research
3.1 Definitions
3.2 Some Arguments in Favour of Corpus-Based Research
3.3 Corpus Outline and Creation
3.3.1 Synchronic Corpus Design Criteria
3.3.2 Diachronic Corpus Design Criteria
3.4 Kinds of Corpora
3.5 Tools for Corpus Exploitation
3.5.1 Concordancers
3.5.2 Frequency Tables
3.5.3 Taggers
3.5.4 Parsers
3.5.5 Ready-Available Tools versus Own Programming
3.6 Corpus Networks

Part II Applications
4 Applications of Corpora
4.1 The Use of Corpora in Linguistics
4.1.1 Corpora and Grammar
4.1.2 Corpora and Lexicography/Terminology
4.1.3 Corpora and Morphology
4.1.4 Corpora and Semantics
4.1.5 Corpora and Pragmatics
4.1.6 Corpora, Stylistics and Discourse Studies
4.1.7 Corpora, Language Teaching and Learning
4.1.8 Corpora and Ethnolinguistics
4.2 The Use of Corpora in Translation
4.2.1 Parallel, Multilingual and Comparable Corpora
4.2.2 Machine Translation
4.2.3 Translation Memory Systems
4.2.4 Corpora vs. Termbanks
4.2.5 Translation Teaching and Translation Research
4.2.6 Thinking Globally - Acting Locally
4.2.7 Critical Comments
4.2.8 Conclusions
5 Case Study
5.1 The Problem
5.2 Formulation of the Hypothesis
5.3 Selecting the Corpus
5.4 Choosing the Tools
5.5 Summarising the Restrictions
5.6 The Study
5.6.1 Synonymy
5.6.2 Cacophony
5.7 Conclusions

Part III Conclusion and Outlook
6 Drawing Conclusions

Appendices
1 Glossary
2 Major Corpora Available
3 Software Available for Corpus-Based Research
4 Results of a Collocation Search of Tra and Fra

Bibliography
Index

Acknowledgements

I am particularly grateful to my parents, who have been an invaluable support throughout all these years of "foreignness" in Graz.

Above all, I would like to thank my supervisor, Dr. Ursula Stachl-Peier. Ursula, thank you for being not only a brilliant linguist, but also a wonderful friend, and for having made this thesis less inadequate than it would otherwise be.

This thesis reflects the precious work and commitment of numerous corpus researchers throughout the years.
It is thanks to these far-sighted linguists that we are nowadays faced with the expanding universe of corpus studies. It is to them that I shall dedicate this work of mine.

List of Abbreviations

AI              Artificial intelligence
ARCHER          American Representative Corpus of Historical English Registers
BNC             British National Corpus
Brown           Brown Corpus of Standard Written American English
CALL            Computer-assisted language learning
CAT             Computer-aided translation
CRATER          Corpus Resources and Terminology Extraction
CSAE            Corpus of Spoken American English
GPEC            Guangzhou Petroleum English Corpus
ICAME           International Computer Archive of Modern English
ICE             International Corpus of English
ICLE            International Corpus of Learner English
Lancaster/IBM   Lancaster/IBM Spoken English Corpus
LIP             Lessico di frequenza dell'italiano parlato
LLC             London-Lund Corpus
LOB             Lancaster-Oslo/Bergen Corpus
LSP             Language for specific purposes
MT              Machine translation
OCP             Oxford Concordancing Program
POS             Part-of-speech
SEC             IBM-Lancaster Spoken English Corpus
SEU             Survey of English Usage Corpus

PART I HISTORY, CRITERIA AND RESEARCH

1 Introduction

Over the last 40 years linguistic research has undergone major changes. While many have deplored this, arguing that it has led to a lack of focus and to inconsistency, others (e.g. Svartvik 1990) have pointed out that it has greatly contributed to academic cross-fertilisation and the development of new approaches, which will hopefully help us to better understand the intricacies of human language processing.

In the wake of the new insights associated with the 1950s and Chomsky, new ways of analysing language were pioneered, while older approaches were virtually abandoned.
Among those dismissed as unscientific or inappropriate was the corpus-based approach to language, which up to then had been the main way of gathering language data. Corpus linguistics, as this discipline is generally called, became neglected, but it by no means disappeared.

Nowadays, the usefulness of corpora [1] is being rediscovered and they are proving an excellent resource for a wide range of research tasks, not only because they give evidence of current language usage, but also because they permit us to compare synchronic and diachronic shifts within a language and so become the foundation of analysis.

[1] The word ‘corpus’ derives from Latin and means ‘body’, here a body of texts. Any collection of more than one text can therefore be defined as a ‘corpus’, which seems a simple enough notion. In linguistics, however, it is slightly more complicated, as various criteria of inclusion have to be taken into consideration when a corpus is compiled (see Section 3.1).

One of the aims of my thesis is to provide an overview of the many possible applications that corpora can have in language and translation studies.

When I set out to write this thesis I also had another, slightly subtler goal in mind. What I wanted to do was to help bring about a change in well-established views of the role of teachers and learners. Nowadays students are more and more often asked to organise and manage their own learning; they are given freedom of choice about which subjects to study, but most of the time they are not told how to go about it. The teacher is no longer the sole provider of knowledge, and s/he often falls victim to sheer economic constraints. This new setting turns out to be a very big challenge for the student and the teacher alike.
I strongly believe that a corpus-based approach to language and translation teaching is a tool which will allow us to keep up with the times, as it ensures easy and immediate access to empirical language data. Students are encouraged to deduce rules from naturally occurring evidence and no longer rely solely on introspection-derived data, whether presented by their teachers or produced by themselves.

This thesis consists of three main parts. After a short introduction, Chapter 2 in Part One deals with the history of linguistic research based on corpus evidence, from the early studies before the 1950s through the Chomskyian revolution to the present day. Chapter 3 discusses the main elements of corpus-based research, exploring issues such as corpus outline, corpus creation, the different kinds of corpora and the tools needed to exploit a corpus, and finally focusing on the possibility of establishing a computer network for corpus studies.

In Part Two, Chapter 4 places the emphasis on the numerous applications of corpus-based research, discussing possible areas of research in both linguistics and translation studies. The case study (Chapter 5), which completes Part Two, introduces an example of corpus exploitation which might be of relevance to linguists as well as to translators.

Conclusions are then drawn in Part Three. Section 6.1 suggests possible topics for further research for translation students, while Section 6.2 focuses on teaching corpora and describes course design issues.

This thesis aims to be above all a basic introduction to corpus-based research for (translation) students and teachers with no or very limited knowledge of the area. The focus is on possible applications in a variety of domains, including grammar, lexicography, morphology, semantics, pragmatics, stylistics, ethnolinguistics and language teaching.
It is hoped that by showing the potential of corpus-based research, more teachers and students will be motivated to try out similar studies themselves. The same intention guides the Case Study, which tries to show very faithfully the kind of problems that may emerge during a project, and where solutions might be found. It will not discuss in any detail the more technical, computational issues, such as quantitative data analysis, factor analysis, multidimensional scaling, cluster analysis or chi-square tests. Suggestions for further reading on these subjects are included in most of the standard books on computational linguistics (e.g. McEnery (1992), Souter and Atwell (1993)).

2 History of Corpora

The main aim of Chapter 2 is to give a brief overview of the history of evidence-based language analysis. [2] I shall first describe early approaches to corpus linguistics, and then, in Section 2.2, focus on Noam Chomsky, who is arguably the most influential person in 20th-century linguistics. I shall discuss his objections to corpus linguistics and his arguments in favour of rationalism. In Section 2.3 I shall concentrate on more recent trends in linguistics and the revival of corpus linguistics as one of the most valid tools for language modelling.

2.1 Early Corpus Linguistics

‘Early Corpus Linguistics’ is not a canonised definition of a fixed period of time. The term was first used by McEnery (1996) to define all work carried out in the various areas of linguistics before the methodology advocated by Noam Chomsky became predominant.

‘Corpus linguistics’ has actually been mainstream for a long time, although it was initially discussed under a variety of names and labels. Different branches of linguistics, such as field linguistics and the post-Bloomfieldian structuralist tradition [3] of Harris and Hill, based their studies on a methodology which nowadays might very well be called ‘corpus-based’.
[2] The information included in this Chapter is based on accounts in several textbooks and on internet sites (Chomsky 1957, 1962, 1965; Cravetto et al. 1997; McEnery and Wilson 1996 as well as http://caseyd.meer.net/dj/Chomsky/Chomsky.html). Exact page references are only given when I quote directly from a source.

[3] See Appendix 1.

In the 1950s, Chomsky brought about a revolution that favoured an approach based on data derived from native-speaker introspection and excluded corpus evidence altogether. As we will see later in this chapter, his doctrine put a spoke in the still fragile wheel of corpus linguistics.

Before the 1950s several corpus-based studies had been carried out, not only in the field of language acquisition (maybe the most obvious application of corpus-based research), but also in the areas of language pedagogy (Fries and Traver 1940), spelling conventions (Käding 1897), comparative linguistics (Eaton 1940), syntax (Fries 1952), semantics (Lorge 1949) and lexis (Gougenheim et al. 1956).

Language acquisition studies first applied the methodology of corpus-based description in the first half of the 20th century. Research focused on child language on the basis of parental diaries recording dialogues and utterances. Generally, however, these were simply collections of transcribed interactions which rarely complied with even the basic principles of corpus studies, such as representativeness. This material was still being used in the late 1970s by Chomskyians as a source of normative data, which in fact made this field of research one of the few areas that successfully continued throughout the Chomsky-dominated period and so represented a continuum in corpus linguistics.
It also laid the foundations for a new generation of corpus linguists like Svartvik and Leech, whose early careers were in language acquisition before they started to apply corpus-based methodologies to other domains.

Corpus evidence was first used to establish spelling conventions by Käding in 1897, when he put together an impressive 11-million-word corpus of German texts to analyse and correlate the frequency of letters and letter sequences in the German language (for details see Käding 1897).

Comparative linguists also included corpora in their studies. One of them was Eaton, who in 1940 compared the frequency of word meanings in German, Italian, French and Dutch. The exploitation of the corpus allowed him to derive useful information about semantic links across languages and communication situations. The modernity of such a study is shown by the fact that only in the second half of the 1990s were McEnery and Oakes able to create a corpus large enough to produce similar results (McEnery and Wilson 1996:3).

Arguably the two most important early fields of application of corpus studies are syntax and semantics. In 1952 Fries created a corpus of transcribed telephone conversations which he then transformed into a descriptive grammar of English. His pioneering work provided a model for the one developed by Quirk et al. in 1985, more than 30 years later (see Fries 1952).

Semantic studies were also carried out for other European languages. French was analysed in detail by Gougenheim et al. (1956), who transcribed a corpus of spoken language gathered from 275 informants with the aim of describing high-frequency lexical and grammatical choices (for further details see McEnery and Wilson 1996:4).

2.2 The Chomskyian Revolution

Between 1957 and 1965 Noam Chomsky published three books which brought about a radical change in linguistics (see Chomsky 1957, 1962, 1965). [4]
Chomsky’s criticism of behaviourist approaches initiated a new kind of research which was founded on rationalism and focused on introspective judgement rather than external data analysis. Introspection, Chomskyians argued, was quicker and more reliable (for further comments see http://listserv.acsu.buffalo.edu/cgi-bin/wa?S1=anthro-l).

In some ways, Chomsky’s success was also the result of a certain degree of arrogance on the part of early corpus linguists, who believed that corpora were collections of all potential utterances. They consequently argued that corpora were the only valid method of describing language and therefore its “primary explicandum” (Leech 1991:8).

Chomsky strongly objected to this view of language as a finite medium. He pointed out that language structures vary in line with personal style preferences, generic constraints or contextual (situational) needs, that neologisms are constantly added to enrich the lexis of languages, and that meanings also change. Corpora, which are finite and, according to Chomsky, “skewed”, were therefore unsuitable as models for language:

“Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list.” (Chomsky 1962:159)

One of Chomsky’s main tenets was that linguistic theories should be cognitively plausible and able to simulate and recreate natural language processing. Chomsky therefore stressed the need for new analytical tools which, he argued, would have to focus on competence instead of performance (Chomsky 1962). Competence is the knowledge of a language that we derive from our own experience, and it is accessed through introspection.
Chomsky did not actually deny the importance of performance, but he was convinced that it was competence that both explains and characterises a language. For his language model Chomsky therefore discarded performance data, as these were considered too weak to mirror the linguistic behaviour of a language community and were influenced by too many language-independent factors, including the physical condition of the speaker, his/her moral principles, etc.

One of the major drawbacks I see in Chomsky’s approach is that language rules risk becoming (or remaining) the domain of influence of those in power. Despite Chomsky’s professed reluctance to prescribe, and despite repeated claims in textbooks influenced by Generative Grammar that all the versions produced by native speakers are equally acceptable, decisions on what ‘standard’ language is continue to be the prerogative of the educated elite.

[4] See Appendix 1.

Another critical issue in Chomsky’s theory, I believe, is his assertion that only data derived through introspection will not be skewed. To me, introspection-based data are themselves a kind of evidence and equally skewed, because no native speaker will ever produce the full range of utterances possible in a language. They may perhaps be a weaker variation of empiricism, as we do not yet have a methodology available to retrieve and classify such personal references uniformly.

Another point is that introspection, although it can be recorded, is often left unspoken. Recordings can easily be analysed, yet thought processes remain unobservable because they cannot be shared with other people. Corpus evidence, on the other hand, is publicly available and can therefore be commented on by all.
Even if we try to ignore the fact that any kind of recording is a corpus, it can still be argued that modelling and, as a direct consequence, identifying the rules of the language used by a certain language community must be an endeavour shared by the entire community and not only an effort engaged in by a linguistic enclave.

Again, this touches on linguistic empowerment. Competence-based and performance-based approaches recognise different linguistic ‘leaders’ in their conceptions of language analysis. While introspection focuses on the individual, trying to provide him/her with the necessary tools for language analysis, corpus linguistics makes use of already existing tools to draw conclusions from naturally occurring data.

The Chomskyian revolution had far-reaching consequences, and not only in linguistics. Chomsky’s emphasis on cognitive plausibility encouraged computational linguists to build systems which would simulate human intelligence and carry out intelligent tasks.

2.3 Modern Corpus Linguistics

Despite Chomsky’s success, corpus-based work continued throughout the 1950s and 1960s, especially in those fields where introspection failed to achieve satisfactory results. Phonetics and language acquisition were a case in point. It became obvious that introspection, especially in child language acquisition, can only be applied once metalinguistic awareness has been developed; in other words, we can apply competence to language modelling only when we are aware of being linguistically competent.

Different corpus linguistics projects were started. Between 1959 and 1961 Randolph Quirk began working on his Survey of English Usage (SEU) Corpus.
Very shortly afterwards Nelson Francis and Henry Kučera from Brown University in Providence (Rhode Island, United States) set out to put together the Brown Corpus, a sample of printed American English, which is still considered the standard reference for language enquiries. In 1975 Jan Svartvik and his team at Lund University began to transcribe, and so to render machine-readable, the spoken part of the SEU corpus.

The advent of the computerised corpus, that is, a collection of machine-readable texts, is indeed a major turning point in corpus linguistics. The availability of institutional and private computing facilities fuelled the growth of corpora, which from 1965 onwards started to become bigger and bigger in size and number. The largest corpus available at the time of writing is the Bank of English, a monitor corpus created at the University of Birmingham in collaboration with Collins COBUILD, which includes more than 200 million words of British English and is constantly being added to.

In recent years, a new trend has started which promises exciting new opportunities. Corpus researchers like McEnery and Wilson have realised that artificial data, collected via introspection, can have a place in corpus linguistics, albeit with the proviso that corpus evidence will “act as a control, a yardstick” (1996:16). Corpus linguistics, as McEnery and Wilson (ibid.) put it, should be a synthesis of introspection and performance analysis, a mix of artificial and natural observation. Fillmore sums up this symbiosis very well:

“I don’t think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore… [but] every corpus I have had the chance to examine, however small, has taught me facts I couldn’t imagine finding out any other way.
My conclusion is that the two types of linguists need one another.” (Fillmore 1992:35)

3 Corpus-Based Research

This Chapter focuses on corpus-based research and provides a more detailed description of the methodologies used. Section 3.1 gives a definition of ‘corpus’ and ‘corpus-based research’; in Section 3.2 I shall outline some of the strengths of a corpus-based approach which have been cited in the literature to prove its validity. Before giving a detailed description of the different kinds of corpora in Section 3.4, I shall delineate in Section 3.3 the main points to be considered when building a corpus. The tools needed for corpus exploitation are explored in Section 3.5, while Section 3.6 outlines criteria for building a computerised infrastructure for corpus-based work.

3.1 Definitions

In Chapter 1 I defined a ‘corpus’ as any collection of more than one text. In modern linguistics, however, this collection of texts must fulfil certain criteria to be considered a corpus. As stated by McEnery and Wilson (1996:21), a ‘corpus’ must display four main features:

♦ representativeness
♦ finite size
♦ machine-readable form
♦ status as a standard reference.

Representativeness

Representativeness is a major point. There are basically two ways of collecting data: either you record every single utterance of a specific language variety, or you build a sample of the entire population of texts that you want to analyse. As already pointed out in Chapter 2, a living language constantly grows and changes, which means that its lexical and syntactic structures are in theory infinite. The first approach is therefore impossible to implement. Generally, therefore, corpus linguists will opt for the second methodology.
However, sampling also has its pitfalls (see Noam Chomsky’s criticism of corpora being “skewed” in Chapter 2). When compiling a corpus, we are influenced by many factors (e.g. availability in electronic form, easy retrieval, ready-made text collections) that automatically and sometimes unconsciously determine the range of texts from which the corpus will be sampled. Representativeness can therefore never be totally objective. While this may be a major drawback, I believe that sampling is still a legitimate approach, provided we are aware that samples can never reproduce a language variety completely accurately and faithfully, and provided we ensure that the collected corpus is balanced.

Biber (1993b) has outlined a number of steps to produce an appropriately balanced corpus: before starting to build a corpus, clearly state the aim of the study, specify the linguistic variety to be analysed, and indicate what he (ibid.:243) calls the “sampling frame”, the entire population of texts from which samples are taken. Samples must ‘average out’ and provide a reasonably accurate picture of the entire language population.

Finite Size

The second feature mentioned is the size of the corpus, which should be finite. Not all corpora are finite, however: ‘monitor corpora’, such as John Sinclair’s Bank of English, are open-ended collections. Texts are constantly added to the corpus in order to update the material already collected and so produce reasonably exhaustive samples of language use.

Determining the size of a corpus is one of the most difficult tasks in corpus creation. In order to facilitate this task, computational linguists have elaborated algorithms which are able to approximately quantify variables such as chance and significance. [5] When the total number of words is reached, collection stops and the corpus is thereafter not increased in size.
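The stopping rule for a finite corpus can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `collect_fixed_size`, the whitespace tokenisation and the toy texts are hypothetical and are not taken from any existing corpus project.

```python
def collect_fixed_size(texts, target_words):
    """Add texts in order until the word target is reached, then stop.

    Assumption: a 'word' is any whitespace-separated token, and the last
    text is truncated so that the corpus never exceeds the target.
    """
    corpus, total = [], 0
    for text in texts:
        if total >= target_words:
            break  # the corpus is closed; later texts are ignored
        tokens = text.split()[: target_words - total]
        corpus.append(" ".join(tokens))
        total += len(tokens)
    return corpus, total


# Hypothetical toy texts standing in for candidate samples.
candidates = [
    "the cat sat on the mat",
    "corpora are finite samples",
    "this text arrives too late",
]
corpus, size = collect_fixed_size(candidates, target_words=8)
print(size)    # 8
print(corpus)  # ['the cat sat on the mat', 'corpora are']
```

A monitor corpus, by contrast, would have no such target and would simply keep accepting new texts.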
Apart from monitor corpora, the only exception to this principle is represented by the London-Lund Corpus (LLC), which was enlarged in the mid-1970s by Sidney Greenbaum in order to cover a wider variety of genres.

Machine-Readable Form

A corpus also has to be available in machine-readable form. As we saw in Chapter 2, an essential difference between early and modern corpus linguistics is the ready availability of microcomputers. Before the advent of computerised data processing, corpus exploitation was a very long, expensive and error-prone procedure: just think of Käding’s 11-million-word corpus and the 5,000 Prussian analysts he needed to go through it. Svartvik was one of the first linguists to apply the principle of machine-readability to data collection, by phonetically transcribing the spoken texts of Quirk’s SEU. Incidentally, the LLC is one of the few corpora still available in book format.

Although machine-readability is nowadays considered absolutely necessary, there are still some exceptions that need to be mentioned. A complete concordance of the Lancaster-Oslo/Bergen Corpus (LOB) is available only on microfiche, while some spoken corpora, such as the Lancaster/IBM Spoken English Corpus, offer taped copies for phonetic analysis.

The advantages of machine-readable corpora can be summed up under the following 3 main headings:

♦ thanks to corpus exploitation tools (e.g. concordancers, frequency listers, parsers), data can be searched and manipulated easily and time-effectively, which simplifies the analysis of results;
♦ they can easily be enriched by adding information about grammar and lexis [6];
♦ they can be made available to researchers within a couple of minutes via Internet connections.

Standard Reference

The fourth requirement is that a corpus should also be a standard reference.
Sharing a collection of texts with the rest of the research community can turn an appropriately designed corpus into a yardstick for language modelling, which can then also be used for later research projects. A further advantage is that by using a single source of linguistic information it is easier to compare different studies, because the opinions expressed can be judged exclusively on the basis of the claims made by the scholar who carried out the analysis.

3.2 Some Arguments in Favour of Corpus-Based Research

The literature cites numerous arguments in favour of corpus-based research. Perhaps the greatest advantage of corpus linguistics over other approaches that I can see is that it is restricted neither to theory nor to practice, but combines both. It makes available the methodology that is required to carry out studies into language usage, but not without also insisting that the empirical data are included in an overall theoretical description. In the following, I shall quote a few more arguments that have been used in favour of corpus studies.

[5] For further reading on quantitative data analysis see McEnery and Wilson 1996, pp. 66-86.

[6] For further information about corpus annotation see tagged corpora (Chapter 3, Section 4).

Corpus-based studies vs. introspection-derived data

First and maybe most importantly, a corpus-based approach provides naturally recorded, linguistically comprehensive examples. We often explain a phenomenon or a grammar rule by means of introspectively created examples. Although we are convinced of their validity, we have to admit that frequently the examples we produce are either clichés or rather idiosyncratic.
Evidently, we need proof from natural language use: in grammar teaching there is no point in analysing a language variety that either does not exist in reality, or is considered a sort of sublanguage used by a closed circle of language users (e.g. prototypical examples used during a language learning class). If we use a corpus, then this corpus might of course also contain such prototypical or idiosyncratic examples, but, because of the greater representativeness of the corpus, these examples form part of the knowledge of a wider linguistic community and must therefore be accepted as commonly shared rules.

Corpora as Material for Inductive Learning

Possibly the biggest advantage of the corpus-based approach [7] is that it allows inductive learning and is always learner-centred. In the final analysis it is the learner who decides what s/he wants to focus on, what s/he wants to learn, how s/he wants to acquire knowledge or skills and at which pace. The learner can therefore exploit the corpus for his/her own purposes, which may indeed vary between learners. (Incidentally, corpora have also been successfully used in teacher training; see Renouf 1997.)

Corpora and Reusability

Another important characteristic of the corpus-based approach is the reusability of linguistic resources. We have already mentioned among the four main features of a corpus that it is a standard reference. Public availability and source reusability are, in my opinion, closely linked. Together they ensure coherence, a much appreciated quality in linguistics. A coherent approach to language study permits project comparison and contributes to a global analysis of language use.

Corpora and Interdisciplinarity

Closely linked to the issue of reusability is the issue of interdisciplinarity. Various linguistic fields can all exploit the same corpus to conduct stylistic, syntactical and lexical studies.
The results can then be used as the basis of cross-cultural studies.

Corpora and Flexibility

Corpora enable all kinds of studies. They can, for instance, be provided with extra tags which are added after every word to describe its status. This is the so-called ‘annotated corpus’, which McEnery and Wilson (1996:24) call “a repository of linguistic information”, because it makes explicit what was only implicit in the plain text (for further details see Section 3.4).

Corpora and Negative Results

Another major advantage of corpus analyses is that even a negative result is an analysable result.

Corpora and Specificity

Yet another advantage of a corpus-based approach is its specificity: the choice of texts to be included in the corpus and the design criteria applied can reflect a specific attitude to language analysis, which means that we can modify not only the methodology, but also the goal. Different languages or language varieties require different analytical standards or approaches. The corpus can be built to meet these standards and therefore become the only valid tool for analysing specific connotations of a language or language variety (e.g. a sublanguage of a dialect).

[7] The various possibilities offered (e.g. CALL) will be described in Part Two, Chapter 4, Section 2.

Corpora and Language Promotion

One outcome of corpus studies, often an unintended one, is that analyses of a given language or language variety help to put this language on the linguistic map, promoting both research into this variety and its use.

3.3 Corpus Outline and Creation

The validity of a study depends primarily on the sampled corpus. This section aims to delineate some of the basic corpus design criteria involved in corpus creation.
I have mentioned before that corpus-based research is diverse and extremely flexible, and that it allows for a wide range of linguistic and non-linguistic studies, all of which require the inclusion of special features and therefore need to be sampled differently. The corpus criteria discussed in this section are primarily meant to describe the design of corpora that will be exploited for linguistic purposes. A basic distinction is made between synchronic and diachronic corpus design criteria.

3.3.1 Synchronic Corpus Design Criteria

The major text databanks available are synchronic corpora, that is, they describe the state of the language at a certain point in time. The samples of texts making up the corpus generally comprise different language varieties, all produced during the same period of time. Representing (part of) a language is obviously a problematic task. It is very difficult to determine the full extent of linguistic variation, or even all the contextual variables that need to be covered in order to deliver a complete language description. However, attention to certain features will balance out imprecisions and ensure corpus representativeness. The main issues of corpus design may then be summarised under seven headings:

♦ target domain selection;
♦ sampling;
♦ diversity;
♦ size;
♦ comparability;
♦ distribution;
♦ other issues.

Target domain selection

The first step is to determine the purpose of the study, that is, to select the target domain. Target domain selection is extremely important when building a corpus. It involves deciding which language variety to focus on, choosing the register and, possibly, limiting the claims of the research project. We cannot proceed to sampling without knowing precisely what we are actually looking for.

Sampling

Once we have decided what we are going to study, we can start to sample our corpus.
Two major approaches are available: proportional sampling and stratified sampling. To sample a corpus proportionally means to find a group of people and record all examples of language they produce and receive - spoken, written or both, depending on the kind of corpus we are compiling - over a certain period of time. The language varieties to which the subjects have been exposed can then be represented in the corpus in proportion to their share of the data collected. The drawback, as Biber (1998) states, is that proportional samples are fairly homogeneous and cannot normally be used for language variation studies. A proportional corpus that aims to mirror the spoken language used in everyday situations, for instance, is unlikely to include many examples of more elevated registers. The samples would therefore display very similar characteristics, which means that any model of language use generated on the basis of such a corpus would misrepresent the language as a whole.

If we are looking for a sample of predetermined language variants that describe a given language - or compare it to others - we need the stratified approach. A corpus constructed using a stratified approach includes and categorises all varieties and registers of the language that we have decided to analyse. Having catalogued and drawn samples from all the different categories of text that occur in a given language, we can then link the texts with the categories or sub-categories. This needs to be specified in the description included in the general information about the corpus (e.g. in text headers). A good example is the work of Stig Johansson, who in 1978 gave an exhaustive account of the categories applied in the compilation of the LOB corpus.

A further important aspect in sampling is the background of the language user.
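To make the contrast between proportional and stratified sampling described above more concrete, the following Python sketch may be helpful. It is only an illustration under stated assumptions: the pool of texts, the category names and the function names are hypothetical, and proportional sampling is approximated here by a simple random draw from a pool that is assumed to mirror what the subjects actually produced and received.

```python
import random
from collections import Counter, defaultdict


def proportional_sample(texts, k, seed=0):
    """Draw k texts at random, so that each category tends to appear
    in proportion to its share of the pool (assumed to mirror exposure)."""
    rng = random.Random(seed)
    return rng.sample(texts, k)


def stratified_sample(texts, per_category, seed=0):
    """Draw the same number of texts from every category, however rare."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for text in texts:
        by_category[text["category"]].append(text)
    sample = []
    for category in sorted(by_category):
        pool = by_category[category]
        sample.extend(rng.sample(pool, min(per_category, len(pool))))
    return sample


def category_counts(sample):
    """Count how many sampled texts fall into each category."""
    return dict(Counter(text["category"] for text in sample))


# Hypothetical pool: everyday conversation dominates, elevated registers are rare.
pool = (
    [{"id": f"conv{i}", "category": "conversation"} for i in range(90)]
    + [{"id": f"press{i}", "category": "press"} for i in range(8)]
    + [{"id": f"legal{i}", "category": "legal"} for i in range(2)]
)

print(category_counts(proportional_sample(pool, 6)))
print(category_counts(stratified_sample(pool, 2)))
```

In a pool like this one, the proportional draw will usually consist almost entirely of conversation, whereas the stratified draw always contains every category, which is why the latter is the approach needed for language variation studies.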
Even if the text falls into a specific category and extra information about the person who produced it may be considered of little importance, there is sometimes a need for contextual knowledge. In a corpus trying to describe a specific literary movement, for example, it is crucial to know if the writer has always belonged to that particular movement or if the text sample merely represents a period of his/her life as an artist.

Sampling also involves copyright issues. Copyright is a major impediment to encoding and storing modern literary and commercial material, even for those who only want to compile corpora for their own personal use. The copyright status of a source is not always clear. Documents made available through the Internet, for example, are often of uncertain status: some need an explicit authorisation in order to be copied, others contain a simple copyright disclaimer. Mailing lists - at least in the USA - are assumed to be implicitly licensed for textual reproduction or retransmission, while the use of anonymised extracts for study purposes within an institution is considered to be ‘fair dealing’. Without such tacit agreement it would be difficult to retrieve written and spoken material in machine-readable form. In other cases - such as the major corpora of the English language - limited access to data sources is often granted to educational institutions, especially if the source itself is an educational institution or belongs to some public authority. If doubts exist, the data owner will have to be asked for permission, even if this may often seem a mere formality or a waste of time.

Diversity

The next design criterion I would like to touch upon is diversity. Experience tells us that if we intend to study language use in general, we must include as many variants as possible.
There is no such thing as ‘general language’; instead, there are many language varieties which differ in their use of lexical, grammatical and discourse features. Furthermore, each language variety includes different registers, and each register has its own pattern of use. To ensure diversity in a corpus, Biber (1998) suggests that two areas need to be considered: register variation and subject matter. Firstly, register variation must be represented appropriately. Speakers of a language make use of different registers, depending on the person they are talking or writing to. Including only some of these registers would produce an incorrect description of language use, which would invalidate the corpus. The second area is subject matter. This is of major interest to lexicographers, since the frequency of many words depends on the theme of the interaction. These two issues are closely linked: for all studies, in fact, you need a sample of a great range of subject matters and, within each subject matter, of all the different registers used. [8]

There is a third aspect, not mentioned by Biber, which refers to diversity amongst language users rather than language use. From a linguistic point of view, dialect and idiolect can also introduce diversity and should therefore be considered. [9]

[8] This of course only applies to corpora which aim to reflect ‘general language’. For more specialised studies, such as “Academic German”, diversity needs to be redefined.

Size

The next criterion listed above is size. Size means numbers: the number of words included in the corpus, and also the number of texts from the various text categories, the number of samples from each text, and the number of words in each sample.
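The figures that make up corpus size in the sense just defined can be counted mechanically. The following Python sketch is a minimal illustration, assuming that each text is stored with a category label and that words can be approximated by splitting on whitespace; the data structure, the function name and the miniature corpus are hypothetical.

```python
from collections import defaultdict


def size_report(corpus):
    """Return the total word count and, per category, the number of
    texts and the number of words."""
    report = defaultdict(lambda: {"texts": 0, "words": 0})
    for text in corpus:
        n_words = len(text["content"].split())  # crude whitespace tokenisation
        report[text["category"]]["texts"] += 1
        report[text["category"]]["words"] += n_words
    total = sum(entry["words"] for entry in report.values())
    return total, dict(report)


# Hypothetical miniature corpus with two text categories.
corpus = [
    {"category": "press", "content": "A move to stop Mr Gaitskell"},
    {"category": "press", "content": "is to be made at a meeting"},
    {"category": "fiction", "content": "Andrea turned on the lights"},
]

total, per_category = size_report(corpus)
print(total)
for category, entry in sorted(per_category.items()):
    share = entry["words"] / total
    print(category, entry["texts"], entry["words"], f"{share:.0%}")
```

A report of this kind makes it easy to see at a glance whether one text category contributes a disproportionate share of the words.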
The issue of size [10] is important and should be approached very carefully: if an unbalanced number of texts is included, some text categories can have an undue influence on the results of the analysis. Equally important is the choice of samples from each text. A text can include more than one register or, more generally, different patterns of language use. If a corpus does not include all features of the specific pattern(s) analysed, it will misrepresent the linguistic category to be sampled. Greenbaum (1991) gives an exhaustive account of the issues involved in deciding on corpus size in his description of the International Corpus of English (ICE) and outlines its importance for representativeness. The ICE, for instance, includes a core corpus of 1,000,000 tokens, which is intended mainly for international comparison. To this core corpus it is possible to add (parts of) a specialised corpus (e.g. business letters, student essays, etc.) felt to be of value to researchers working in a particular region. A third corpus which contains texts without specific categorisation can then also be compiled. All three corpora together form the monitor corpus, which can then be used to analyse a given regional variety.

Comparability and Distribution

Comparability and distribution are two minor issues in corpus design, that is, they do not necessarily apply to all corpora. As far as comparability is concerned, it is interesting to note that the design of a corpus is sometimes subject to limitations which may - to some extent - conflict with the goals set by the compiler. Tradition in corpus linguistics favours designing a new corpus along the pattern set by corpora already compiled, so that the two can be easily compared, and this might not always match the corpus creator’s needs. An example of the importance of comparability is the Corpus of Spoken American English (CSAE), which was designed along the lines of the LLC. This had two clear advantages: first, the corpus creator was able to combine all that had been learnt from earlier experience with technological innovations and new theoretical developments; secondly, the creation of two comparable corpora allowed cross-cultural studies and also cooperation in drawing up methodological and analytical frameworks.

[9] See Halliday and Hasan 1985:41.
[10] The International Corpus of English (ICE), the Brown Corpus of American English and the LOB all agree that each text in the core corpus should contain about 2,000 words.

Distribution involves the form in which the final product is published and the question of which institution will distribute it. The most frequently used channel of dissemination is the CD-ROM, principally because it is small, light, easy to send, and has considerable research potential because it delivers the corpus in machine-readable form. Some corpora, such as the CSAE and the LLC, are still available as printed books. Other channels include microfiche, tapes and the World Wide Web. In some cases, an institution assumes responsibility for the distribution; the Norwegian Computing Centre for the Humanities at Bergen, for example, is responsible for the distribution of both the International Computer Archive of Modern English (ICAME) and the ICE.

Other issues

Other issues in corpus design include compilation and annotation. Compilation involves some ‘strategic’ decisions. One is the question of sampling method. The classic approach, for instance, implies scanning text samples and then editing them. A further problem could then be the processing of non-language material (e.g. mathematical formulae, symbolic expressions, figures, diagrams, pseudocode, etc.).
A way of solving these problems is to confine text samples to the running text and therefore omit such materials. Omissions can then be marked by placeholder codes (e.g. *EQ* for mathematical expressions, *FI* for figures, etc.). Finally, proof-reading and cross-checking belong to corpus compilation as well.

The second issue is annotation. It is up to the corpus builder to decide whether or not s/he wants to encode the corpus. Some corpora (e.g. ICE and LOB) have been compiled in two versions, one with and one without annotations. With respect to discourse analysis, for example, John Sinclair’s advice is indeed very appropriate:

The safest policy is to keep the text as it is, unprocessed and clear of any other codes. (Sinclair 1991:21)

Nevertheless, corpus annotation is opted for when analysis tools (e.g. parsers) are to be used, such as in lexicographic or grammatical studies.

Given constraints on time, money and the availability of texts, it is often necessary to make compromises. Every corpus has limitations, but a well-designed one can still be very useful for investigating a variety of linguistic issues.

3.3.2 Diachronic Corpus Design Criteria

Designing a diachronic (also known as ‘historic’) corpus - a collection of texts that accounts for language development across a specific period of time - can be even more complicated than creating a synchronic corpus. In addition to the basic issues of corpus design outlined in the previous section, a diachronic corpus compiler is faced with the problem of representativeness. Since the corpus aims to cover a precise range of linguistic variations and registers across a specified period of time, it might be possible to opt for exhaustive sampling. That means that the ‘final product’ will include all linguistic variants and all registers of that specific period. However, actual practice is often far more complicated than theory.
The design of a representative diachronic corpus which will be used to study a specific literary style raises serious questions about sampling methods. The most complex approach is what Biber defines as the “multipurpose diachronic corpus” (1998:251), which is designed to represent a wide range of registers across historical periods, such as the Helsinki Historical English Corpus and A Representative Corpus of Historical English Registers (ARCHER).

In addition to the standard variables (time and region), a major point in diachronic corpus design is the range of registers to be included in the corpus. Deciding on this range is not an easy task: various factors play a role, one of these being the number of texts available. It can be very difficult to find sufficient texts to cover a certain register exhaustively. A case in point is spoken interaction. The ARCHER Corpus, for instance, includes several speech-based registers, but the majority of transcribed texts of spoken discourse are derived from drama and fiction, in which spoken dialogues reflect the author’s intuitions and representations.

A further issue concerning registers is their variability across time. This is even more difficult to identify and, consequently, to analyse. Both the Helsinki and the ARCHER Corpus avoid this problem by treating register variation as a continuum, that is as one single register, leaving it to the analyst to describe the dramatic ways in which a register can evolve over time.

Dialects and idiolects should also be given attention. The corpus sampler must decide how to catalogue them, delivering text type divisions that take the sociolinguistic aspect into consideration as well.
In her paper about the Cambridge Corpus of Early Modern English (1600-1800), Wright (1993) champions the importance of idiolect, calling for a clear discrimination of the relation between the “state of the language” and “individual usage” (ibid:29), and highlighting the necessity of genre division across a particular period of time. She adds that it is necessary to set up “stringent functional/situational criteria” (ibid:27), because notions such as register and genre can vary conspicuously across time.

The next step in diachronic corpus design, then, is text selection. As Biber (1998:253) states, the best criterion is to include a random selection of the texts available for a specific register in each period. In order to be able to do this, a complete listing of all texts available from the period of time analysed is absolutely necessary. With literary registers this can be easily achieved thanks to exhaustive bibliographies from which a random sample can be selected, while for other registers the ideal sampling method can be much more difficult to identify. Different approaches can therefore be used, as in the case of the ARCHER Corpus (see Appendix).

A final issue concerns automated corpus tagging. Because spelling and other orthographic conventions might vary considerably over time, interactive checking and editing of automated annotation is essential when building a diachronic corpus. This approach can be very time-consuming, but it is necessary to guarantee a correct linguistic analysis (e.g. parsing).

3.4 Kinds of Corpora

In principle, corpora can deliver information about many different aspects of human interaction. To make them optimally suitable for the many research fields, however, studies with a different focus require different parameters and therefore an appropriate approach to sampling.
A basic methodological choice is to decide what kind of corpus suits the analysis best. There are, in fact, many corpus ‘templates’ which can either be used as they are, or remodelled to meet a specific priority. Every ‘template’ acts as a gateway to specific information about specific features: spoken corpora [11], for example, provide information about pronunciation standards, while parallel corpora focus on the translation of the same collection of texts into one or more languages. In this section I shall outline the basic characteristics of the corpora used most frequently in linguistic research.

[11] Spoken corpora include recorded oral material which is then transcribed. Spoken corpora enable researchers to study interaction from a phonetician’s perspective; they have also proved useful in discourse analysis, sociolinguistics and even psychology. However, there are various problems related to a spoken corpus. The major challenge of a spoken corpus is its representation in written form (transcription). Spoken language has no explicit punctuation. It is therefore up to the corpus compiler to decide whether to attempt to transcribe the corpus in the form of orthographic sentences or whether to use intonation units (prosodic annotation), which tend to capture features such as stress, intonation, pauses, ‘body language’ (e.g. eye contact) and other non-verbal material (e.g. coughs, laughs, etc.). In order to avoid interpretation errors (e.g. inserting wrong punctuation marks), transcriptions are normally made by using the scripts which were used by the speakers (informants) (e.g. the Lancaster/IBM English Spoken Corpus, which is made up of radio broadcasts), or punctuation is added by the informants themselves. An example of orthographic transcriptions of speech is the Lancaster/IBM Corpus.

Raw and annotated corpora

Possibly the major distinction between corpora is whether they consist of raw or annotated texts. The main difference between these two kinds of corpora is that annotated corpora are provided with additional linguistic information (annotation). This information can be prosodic (focusing on intonation units), semantic, syntactic, generic, contextual, and so on. [12] The most common form of annotated corpus is the grammatically tagged one. In a grammatically tagged corpus, every word has been assigned a word class label (part-of-speech tag). The following example is taken from the untagged and tagged versions of the LOB Corpus:

Untagged Sample
A move to stop Mr Gaitskell from nominating any more labour life peers is to be made at a meeting of labour MPs tomorrow.

Tagged Sample
^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN nominating_VBG any_DTI more_AP labour_NN life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.

(Source: Biber 1998:258)

The tags used in the LOB Corpus are the original Brown Corpus tags:

AP     post-determiner
AT     article
BE     infinitive form of the verb “to be” (be)
BEZ    third person singular of the verb “to be” (is)
DTI    singular/plural determiner or quantifier
IN     preposition
NN     singular noun
NNS    plural noun
NP     proper noun
NPT    noun of style or title, singular
NPTS   noun of style or title, plural
NR     adverbial noun
TO     infinitive marker (to)
VB     verb
VBG    present participle/gerund
VBN    past participle
.      end of the sentence

[12] In fact, even filenames can provide information. The LOB Corpus, for instance, is divided into different sections, with filenames indicating the section and whether or not that particular section has been tagged (e.g. a computer file named loba.tag tells us that we are dealing with section A of the tagged version of the LOB Corpus).

Another type of annotation is parsing. Parsed corpora offer a syntactic analysis of a corpus, identifying subjects, verbs, objects, as well as more complex syntactic information. Sometimes, corpora of this kind are also represented as tree diagrams and are therefore known as “treebanks”. A typical example of a tree diagram of the sentence Andrea turned on the lights looks as follows:

              S
       _______|________
      NP               VP
      |          ______|______
      PN        V             PP
      |         |          ___|____
   Andrea    turned       P        NP
                          |      __|___
                          on    AT     N
                                |      |
                               the   lights

where ‘S’ stands for sentence, ‘NP’ for noun phrase, ‘VP’ for verb phrase, ‘PP’ for prepositional phrase, ‘N’ for noun, ‘V’ for verb, ‘P’ for preposition, ‘PN’ for proper noun, and ‘AT’ for article.

This kind of graphic annotation is extremely space-consuming. Parsed corpora therefore tend to use a different notation, in which the constituents are indicated by opening and closing square brackets. The example above, then, would read:

[S [NP Andrea_PN1 NP] [VP turned_VVD [PP on_II [NP the_AT1 lights_NN1 NP] PP] VP] S]

where the tag set used corresponds to that used to annotate the British National Corpus (BNC). In order not to lose the visual properties of the tree diagram completely, the bracket-based method is sometimes displayed with indentations:

[S
   [NP Andrea NP]
   [VP turned
      [PP on
         [NP the lights NP]
      PP]
   VP]
S]

This is the representational layout chosen for the Penn Treebank project. As I shall describe in Chapter 4, both tagged and parsed corpora can be extremely useful tools for research if exploited with appropriate computer programs (e.g. concordancers).

Generic and contextual information is frequently encoded in the document header, which can include, for example, the title of the document, the name, age and sex of the language producer, the date of publication, the language variety, the subject domain, and so on.
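As an aside on the labelled bracketing used for parsed corpora, the notation is regular enough to be read back into a tree by a short program. The following Python sketch is a minimal illustration and not the software of any corpus project: the function names are hypothetical, and it assumes that every constituent opens with ‘[LABEL’ and closes with the matching ‘LABEL]’, as in the Andrea example.

```python
def parse_brackets(text):
    """Parse labelled bracketing such as '[NP the lights NP]' into
    nested (label, children) tuples; words are kept as plain strings."""
    tokens = text.replace("[", " [").split()
    pos = 0

    def parse_node():
        nonlocal pos
        label = tokens[pos][1:]  # an opening token looks like '[NP'
        pos += 1
        children = []
        while True:
            if pos >= len(tokens):
                raise ValueError(f"missing closing bracket for {label}")
            token = tokens[pos]
            if token == label + "]":  # the matching close, e.g. 'NP]'
                pos += 1
                return (label, children)
            if token.startswith("["):
                children.append(parse_node())
            else:
                children.append(token)
                pos += 1

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the outermost bracket")
    return tree


def leaves(node):
    """Collect the words of a parsed tree from left to right."""
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("[S [NP Andrea NP] [VP turned [PP on [NP the lights NP] PP] VP] S]")
print(tree)
print(" ".join(leaves(tree)))
```

Because every closing bracket repeats its label, a sketch of this kind also notices when a constituent has been left unclosed, which is one practical argument for the redundancy of the notation.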
Such a header can be very useful when refining the search for text types or particular variables within a range of texts. In corpora such as the Longman-Lancaster Corpus and the Helsinki Corpus, this type of information is given in COCOA format. COCOA was an early computer program that extracted indexes of words in context from machine-readable texts. Its reference system was then also applied to other concordancing programs, such as the Oxford Concordance Program (OCP). A “COCOA reference” consists of a balanced pair of angle brackets (< >) containing two values (“entities”): a code signifying a particular variable name (e.g. ‘A’ normally stands for ‘Author’), and a string providing the information needed. The following example shows a COCOA document header from the Helsinki Corpus, where ‘X’ indicates that the information was either not available or not relevant to the text:

<B CEPRIV1>                Short descriptive code
<Q E1 XX CORP EBEAUM>      Text identifier
<N LET TO HUSBAND>         Name of text
<A BEAUMONT ELIZABETH>     Author’s name
<C E1>                     Sub-period
<O 1500-1570>              Date of original
<M X>                      Date of manuscript
<K X>                      Contemporaneity of original and manuscript
<D ENGLISH>                Dialect
<V PROSE>                  Verse or prose
<T LET PRIV>               Text type
<G X>                      Relationship to foreign original
<F X>                      Language of foreign original
<W WRITTEN>                Relationship to spoken language
<X FEMALE>                 Sex of author
<Y X>                      Age of author
<H HIGH>                   Author’s social status
<U X>                      Audience description
<E INT UP>                 Participant relationship
<J INTERACTIVE>            Interactive/non-interactive
<I INFORMAL>               Formal/informal
<Z X>                      Prototypical text category
<S SAMPLE X>               Sample

(Source: McEnery et al. 1996:31)

Spoken corpora are frequently available in a prosodically annotated version.
The following example is from the prosodically transcribed LLC:

1 8 14 1470 1 1 A 11   ^what a_bout a cigar\ette# . /
1 8 14 1480 1 1 A 20   *((4sylls))* /
1 8 14 1490 1 1 B 11   *I ^w\on’t have one th/anks#* - - /
1 8 14 1500 1 1 A 11   ^aren’t you •going to sit d/own# /
1 8 14 1510 1 1 B 11   ^[/\m]# /
1 8 14 1520 1 1 A 11   ^have my _coffee in p=eace# - - /
1 8 14 1530 1 1 B 11   ^quite a nice •room to !s\it in ((/actually))# /
1 8 14 1540 1 1 B 11   *^\isn’t* it# /
1 8 14 1550 1 1 A 11   *^y/\es#* - - /

(Source: McEnery and Wilson 1996:55)

The codes used by the compilers of the LLC are:

#      end of tone group
^      onset
/      rising nuclear tone
\      falling nuclear tone
/\     rise-fall nuclear tone
_      level nuclear tone
[ ]    enclose partial words and phonetic symbols
‘      normal stress
!      booster: higher pitch than preceding prominent syllable
=      booster: continuance
(( ))  unclear
* *    simultaneous speech
-      pause of one stress unit

Prosodic transcription is a very difficult task for which highly skilled phoneticians are required and, unlike part-of-speech (POS) annotation, it cannot be delegated to the computer. A further problem of prosodically annotated corpora is consistency, or rather the lack of it. The identification of intonation patterns is a matter of perception, and it is difficult to ensure that the same parameters are maintained throughout the whole corpus. An additional problem is that, given the huge size of most corpora, generally more than one phonetician needs to be involved, and there is a very real danger that different annotators will apply different standards. As far as the Lancaster/IBM corpus is concerned, the solution to this problem was to have a small part of the corpus (approximately 9%) annotated by both the transcribers involved.
These “overlap passages” then served as a reference for the transcription parameters chosen by the two phoneticians and, therefore, as a yardstick for comparison.

Another problem in compiling a spoken corpus is that of raw text recoverability. Since prosodic annotation is carried out syllable by syllable (and not word by word), symbols have to be inserted within the word. As the example from the LLC shows, the annotated text looks fragmented and the original can be recovered only by deleting every annotation mark separately.

Monolingual and multilingual corpora

Corpora can also be classified according to whether they consist of texts in one or more than one language (or language variety): a monolingual corpus is a database of texts produced exclusively in one language (or language variety), while a multilingual corpus deals with texts in several different languages. Arguably the most useful type of multilingual corpus from a translator’s perspective is the parallel corpus, which includes original texts and their translations. In order to be able to cross-check translation units, translations need to be aligned. Computational linguistics has had some success in developing automated alignment tools which identify so-called “anchor points” within the sentence, that is to say, the computer searches the text for lexical or grammatical units that are mutual translations. Alignment and parallel corpora have proved very useful tools in language analysis, second language tuition and, obviously, translation teaching. A further advantage of aligned corpora is that they can be used as an inductive tool in cross-cultural analyses and in the development of machine translation (MT) and computer-aided translation (CAT) systems. [13]
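The idea of “anchor points” mentioned above can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of any existing alignment tool: the small bilingual word list, the example sentences and the function names are all hypothetical, and real systems use far richer evidence than a handful of word pairs.

```python
def anchor_score(src_sentence, tgt_sentence, lexicon):
    """Count lexicon pairs whose source word occurs in the source sentence
    and whose translation occurs in the target sentence."""
    src_words = set(src_sentence.lower().split())
    tgt_words = set(tgt_sentence.lower().split())
    return sum(1 for s, t in lexicon.items() if s in src_words and t in tgt_words)


def align(src_sentences, tgt_sentences, lexicon):
    """Pair each source sentence with the target sentence that shares the
    most anchor points with it (None if no anchor point is found)."""
    pairs = []
    for i, src in enumerate(src_sentences):
        scores = [anchor_score(src, tgt, lexicon) for tgt in tgt_sentences]
        best = max(range(len(tgt_sentences)), key=scores.__getitem__)
        pairs.append((i, best if scores[best] > 0 else None))
    return pairs


# Hypothetical anchor lexicon (English word -> French word).
lexicon = {
    "handover": "transfert",
    "cell": "cellule",
    "register": "enregistreur",
    "station": "station",
}

english = [
    "the location register should contain information about a mobile station",
    "handover is the action of switching a call from one cell to another",
]
french = [
    "le transfert intercellulaire consiste à commuter une communication d' une cellule à une autre",
    "l' enregistreur de localisation doit contenir des renseignements sur une station mobile",
]

print(align(english, french, lexicon))
```

In this toy example the sentences have been given in a different order in the two languages on purpose, so that the pairing depends on the anchor words alone rather than on sentence position.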
Unfortunately, however, because of the difficulties involved in obtaining a sufficient number of texts which are translations of each other, almost all parallel aligned corpora currently available contain only highly specialised texts. The most famous parallel corpora - the Canadian Hansard (a parallel corpus in French and English of the proceedings of the Canadian Parliament) and the corpus of IBM technical manuals (English and French) - in fact cover a restricted range of domains and text types.

[13] For more detailed discussion of MT and CAT see Chapter 4.

In the last five years many parallel corpora projects have been started, including:

♦ INTERSECT (International Sample of English Contrastive Texts)
The INTERSECT Project at Brighton University began in the spring of 1994. The aim is to construct and analyse a parallel bilingual corpus of French and English written texts, adding other languages later if resources permit.

♦ LINGUA
A project involving the construction of multilingual corpora for English, French, Greek and some others, for use in language pedagogy.

♦ MULTEXT
A project started in 1994 by Ide and Véronis which aims to develop parallel corpus resources for a subset of European languages.

♦ MULTEXT-EAST
A research project that focuses on parallel and comparable corpora in Eastern European languages.

♦ TRIPTIC (Trilingual Parallel Text Information Corpus)
TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch.

♦ CRATER (Corpus Resources and Terminology Extraction)
Research on the CRATER project aims to achieve automatic bilingual lexicon construction: it therefore concerns automatic alignment of parallel texts, both at the sentence and word level.
Below is an example of French-English aligned sentences from the CRATER corpus:

sub d = 22
----------& the location register should as a minimum contain the following information about a mobile station :
-----& l ‘ enregistreur de localisation doit contenir au moins les renseignements suivants sur une station mobile :

sub d = 386
----------& handover is the action of switching a call in progress from one cell to another ( or radio channels in the same cell ) .
-----& le transfert intercellulaire consiste à commuter une communication en cours d ‘ une cellule à une autre cellule ( ou d’ une voie radioélectrique à l ‘ intérieur de la même cellule ) .

sub d = 380
----------& the location register , other than the home location register used by an msc to retrieve information for , for instance , handling of calls to or from a roaming mobile station , currently located in its area .
-----& enregistreur de localisation , autre que l ‘ enregistreur de localisation nominal , utilisé par un ccm pour la recherce d ‘ informations en vue , par exemple , de l ‘ établissement de communication en provenance ou à destination d ‘ une station mobile en déplacement , temporairement située dans sa zone .

(Source: McEnery and Wilson 1996:59)

♦ The English-Norwegian Parallel Corpus
This parallel corpus is planned as an open text bank and will be expanded when resources to do so are available. It is intended as a general research tool available beyond the present project for applied and theoretical linguistic research. There will be two main parts:
• A core corpus consisting of original texts and their translations (English to Norwegian and Norwegian to English). Initially, the focus was on novels and fairly general non-fiction books. In order to include material by a maximally large number of translators, the texts of the core corpus are limited to text extracts (chunks of 10,000 words or more).
Provided that there is sufficient funding, the amount and variety of text will be increased to include more specialised material, including legal texts.
• A supplementary corpus containing texts which are translations, but not of the matched source texts. The main object of this supplementary corpus is to analyse possible features of "translationese" (that is, features typical of translated texts) and, in general, to increase the amount and variety of the material.
♦ ETAP
The project conducted by the University of Uppsala aims to create and annotate a parallel corpus for the recognition of translation equivalents. This computerised multilingual corpus is based on Swedish source texts with translations into Dutch, English, Finnish, French, German, Italian and Spanish.
♦ FECCS (Finnish-English Contrastive Corpus Studies)
A project in Contrastive Linguistics at the University of Jyväskylä, Finland, which uses a bilingual Finnish-English corpus.
♦ The Proteus Project
The Proteus Project is a machine translation project of the Computer Science Department of New York University and the Autonomous University of Madrid. It uses parallel corpora in English and Spanish.
♦ Text-based contrastive studies in English
A project at Lund University in Sweden which aims to develop a parallel corpus of texts in Swedish and English which can be used for cross-linguistic studies.
♦ Translearn
A European project aimed at the development of a translation support tool. The languages covered are English, French, Greek and Portuguese.
♦ The Translation Corpus of English and German
The Technical University of Chemnitz-Zwickau is currently compiling a translation corpus of English and German texts. The corpus at present includes EU material, academic textbooks, modern fiction and tourist brochures (approx. 500,000 words in total).
The researchers are looking at aspects such as culture-specific problems in translation and translationese.
♦ Corpora project Språkteknologi, University of Uppsala
The project aims to develop two multilingual text corpora and to integrate them with lexical resources. The primary objective is to create a reference corpus for research in machine translation.
♦ The Scania Corpus
The Scania Corpus is a collection of truck manuals from Scania. Swedish is the source language and the texts have been translated into seven languages: English, French, German, Spanish, Dutch, Italian and Finnish. The Swedish component adds up to 300,000 words and is the largest part of the corpus. The smallest component, Finnish, consists of approximately 200,000 words. The goal is to build a corpus of 2,000,000 words. This corpus is unlikely ever to become available, since the material is ‘commercial in confidence’.
♦ The Swedish Immigrant Newspaper Corpus
The Swedish Immigrant Newspaper Corpus (swe. Invandratidningen) is available at Uppsala University in nine different languages: Swedish, Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian and Spanish. Work on this corpus has only just begun, so there is no information about the number of words it contains.
♦ The Swedish Government Corpus
This collection, consisting of Swedish political texts, is still at the planning stage. It will contain declarations by the Swedish government (regeringsförklaringen).
♦ The Scandinavian Project of Contrastive Corpus Studies
There is an ongoing Scandinavian project involving four partners in Norway, Finland, Denmark and Sweden. Sweden is represented by the Department of English at Lund University with its ‘Text-based Contrastive Studies in English’ project (Aijmer et al. 1996). All four corpora will have the same structure.
Each corpus consists of two parts: one parallel corpus comprising original texts together with their translations, and one comparable corpus consisting of original texts in both languages. The aim is to use the corpora in contrastive studies between the Scandinavian languages and English. The parallel corpus of Swedish and English will eventually consist of 1,600,000 words and comprise a large range of different text samples (each sample 10,000-15,000 words). The corpus will become available as soon as all the copyright restrictions are resolved. The Finnish corpus consists of approximately 2 million words; however, the parallel texts have not yet been aligned. So far there has been no work on POS tagging.

Another type of multilingual corpus is the comparable corpus, which is in fact a collection of ‘similar’ monolingual corpora that apply the same sampling criteria and cover the same subjects for every language (variety) considered. The main aim of this type of corpus is to compare languages - or varieties thereof - produced in similar communication situations, without the distortions which might appear in translated texts. One example of a comparable corpus, which is also described by McEnery and Wilson (1996), is the Aarhus corpus of Danish, French and English contract law, which consists of three monolingual contract law corpora that are sampled according to the same criteria but do not include translations.

Another interesting application of comparable corpora is in the fields of dialectology and language variation studies. Good examples are the LOB corpus (British English) and the Kolhapur Corpus (Indian English), which use the same genres and sample sizes as the Brown corpus.

General and specific corpora

A further classification of corpora is based on the distinction between general corpora and specific corpora.
General corpora - also known as “reference corpora” - are very large databases compiled to be a representative selection from the language as a whole or of a clearly defined part of it. A case in point is the monitor corpus, a collection of texts drawn from different subject fields or registers. The best example of a monitor corpus is the Bank of English, developed at Birmingham University by John Sinclair’s team in collaboration with Collins COBUILD. This collection of texts is an open-ended entity: texts are constantly being added to it, so that it keeps growing. Currently it comprises over 200 million words of British English, drawn from different registers, yet focusing more on written (90 million) than on spoken (10 million) texts. New texts are added on a regular basis, while ‘old’ texts are sometimes either stored on extra CD-ROMs or even deleted: this process enables the compiler to provide a general overview of current language use and to ‘monitor’ its development across time.

Monitor corpora are primarily of importance in lexicographic work, because they allow lexicographers to search a stream of very recent texts for the occurrence of new words or for changes in the meaning of old ones. They also represent a valid field for research, because they include a broad range of registers and text types, which means that language can be modelled more accurately. General corpora can be used for research in various fields.

Specialised (or LSP) corpora, by contrast, are created for a special purpose; many are in fact used for work on spoken language, others are sublanguage corpora, learner corpora and developmental corpora. LSP corpora (corpora of language for specific purposes) can be exploited to provide many different kinds of domain-specific material for language learning.
Sublanguage corpora consist of texts that are chosen from a particular variety of a language, i.e. from a particular dialect or subject area. Early examples are the Guangzhou Petroleum English Corpus (GPEC) and the Computer Science Corpus of the Hong Kong University of Science and Technology (HKUST). Besides learning and teaching purposes, sublanguage corpora can also be used in language engineering: machine translation cannot be realistically trialled on general language, but it becomes feasible when the task is restricted to a particular domain, or sublanguage.

Learner corpora are databases that aim to improve our understanding of language learning from an unusual point of view. Instead of describing language as it should be, this kind of corpus focuses on the analysis of the commonest mistakes made by non-native speakers in order to develop methodologies to avoid them. Although limited to only one aspect (free writing) of one type of sublanguage (advanced foreign learners of English), perhaps the best example of this type of corpus is the International Corpus of Learner English (ICLE). The ICLE is a comparable corpus whose design, compilation and processing are described in detail in Granger (1993).

The last kind of corpus sampled for special purposes is the developmental corpus. This kind of database aims to represent the language used by native speakers whose linguistic competence has not yet reached maturity, that is to say, it tries to depict a type of raw language which is developing extremely fast and is subject to numerous influences. Because language teaching is mostly concentrated on children during the periods of primary and secondary education, developmental corpora have lately become mainstream.
In order to create reference works that really suit the needs of young learners, it is necessary to design a corpus that corresponds to the target language behaviour of the learners. CHILDES, the child language database designed in Pittsburgh by a team of researchers at the Department of Psychology of Carnegie Mellon University, is an example of such a corpus. Although this kind of corpus is most useful in language acquisition research, it also has a very practical application in the development of language-teaching and testing materials.

This section aimed to outline the major kinds of corpora used in language research.[14] Obviously, it cannot be a full account. There are numerous types and subtypes not listed here that can be designed in connection with a specific kind of study. This is, I feel, the most exciting characteristic of corpora: they are flexible, that is, they can be adapted to suit specific needs optimally. In order to exploit corpora even more effectively, however, researchers use tailor-made computer tools which enable them to obtain useful information about the specific characteristics of natural language. These will be described in the next section.

3.5 Tools for Corpus Exploitation

Before outlining the different analytical tools which are used in corpus linguistics, I think it is necessary to describe briefly some of the more frequently used operations in corpus work[15] and to explain what tools can actually do for the linguist, rather than what kind of data structures they manipulate.[16]
♦ Searching: takes a text, raw or annotated, as well as a target item, and points to segments in the text where that specific item is found.
♦ Concordancing: takes a text, tagged or untagged, as well as a target item, and produces a concordance, that is a list of words and phrases in context.
♦ Parsing: takes as input a (segment of) text as well as a grammar, and delivers syntactic information about all items in different forms (e.g. parse trees).
♦ Counting: takes a text, raw or annotated, as well as a target item, and returns the number of text segments that match that specific item.
♦ Tabling: takes a text, raw or annotated, as well as a target specification, and produces a table (e.g. a frequency table, a table of collocations, etc.).
♦ Collocating: given a description as well as a target item, produces a list of collocations, that is a list of words that co-occur more often than expected by chance.
♦ Automatic part-of-speech tagging: takes a text as well as a lexicon (and sometimes some kind of rules or highly probable links) and delivers information about the text at the level of part-of-speech.
♦ Lemmatising: takes a text as well as a lexicon (and sometimes some kind of rules), and produces a description of the text which specifies the lemma from which different inflected forms have been derived.
♦ Manual/automatic disambiguation: given a number of alternative descriptions as well as either the user’s act of choosing (interaction), or rules for automatically selecting between them, returns a description.

14 For examples of corpora in languages other than English see Appendix 2.
15 For further details about informal specifications of operations, the required input, and the resulting output see Lager (1995).
16 The list is not exhaustive. See Lager (1995) for more functions. The reason why I have decided to restrict my discussion to these operations is that they deliver sufficient quantitative data for all the types of corpus analysis that are of relevance to the translator. These operations can of course be combined: automatic part-of-speech tagging involves automatic disambiguation, concordancing may imply searching, etc. (see Lager 1995).
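Several of the operations listed above (searching, counting, tabling and collocating) can be sketched in a few lines of code. The following is only a minimal illustration, not the specification given by Lager (1995): the function names, the toy `sample` text and the choice of a simple observed/expected ratio as the measure of "more often than expected by chance" are all assumptions made for the purpose of the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Split raw text into lower-case word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z']+", text.lower())


def search(tokens, target):
    """Searching: return the positions at which the target item occurs."""
    return [i for i, tok in enumerate(tokens) if tok == target]


def count(tokens, target):
    """Counting: return the number of segments that match the target item."""
    return len(search(tokens, target))


def frequency_table(tokens):
    """Tabling: return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def collocates(tokens, target, window=2):
    """Collocating: rank words found within `window` tokens of the target.

    Each word receives the ratio of its observed frequency near the target
    to the frequency expected if words were distributed by chance.
    A ratio above 1 means "more often than expected by chance".
    """
    overall = Counter(tokens)
    observed = Counter()
    for i in search(tokens, target):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        observed.update(tokens[j] for j in range(start, end) if j != i)
    slots = sum(observed.values())
    ratios = {
        word: obs / (slots * overall[word] / len(tokens))
        for word, obs in observed.items()
    }
    return sorted(ratios.items(), key=lambda pair: pair[1], reverse=True)


# Toy text, invented for illustration only.
sample = "The white rabbit ran. The white rabbit hid. The cat ran."
tokens = tokenize(sample)

print(search(tokens, "rabbit"))
print(count(tokens, "rabbit"))
print(frequency_table(tokens)[:3])
print(collocates(tokens, "rabbit", window=1))
```

On a text this small the ratios are of course meaningless as statistics; corpus tools typically use more robust association measures, and the ratio is used here only because it makes the idea of "observed versus expected" explicit.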
The four most useful tools for corpus-based research in translation-oriented studies (i.e. concordancers, frequency tables, taggers and parsers)[17] are described in the following section.

17 Again, the list of tools discussed is not exhaustive. New programs are constantly being written to meet specific purposes (e.g. language-specific analysers of derivational and inflectional morphology). See Biber (1998) for further comments on the issue of already available concordancing packages vs. own programming.

3.5.1 Concordancers

Concordancers enable you to discover patterns that exist in natural language by rearranging text in such a way that these patterns become clearly visible. Concordancing programs allow you to look for single lexical items or lexical groups. The principal objective of collocation searches is to identify the lexical items a given word or lexical group can collocate with. The example given below was produced by Conc 1.80b3 (a Macintosh application) from a plain ASCII text version of the first chapter of Lewis Carroll’s “Alice’s Adventures in Wonderland”. Note that the line numbers are automatically calculated by the application.

(Source: http://lonestar.texas.net/~brazos/alice/aliceinw.htm)

A printout like this, with the keyword in a straight column down the middle of the page and as much of the context as will fit on one line running to the right and left, is known as a KWIC (keyword in context) concordance. Many concordancers will also let you print out contexts consisting of a complete sentence, or a fixed number of words, or a whole paragraph, or allow you to trace any occurrence back to the original text.
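The KWIC layout described above can be reproduced with a very small program. The following is a hedged sketch rather than a description of how Conc or any other concordancer works internally: the function name `kwic` and the `width` parameter are assumptions, and the sample is simply the opening of the first chapter of “Alice’s Adventures in Wonderland”.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column.

    `width` is the number of characters of context kept on each side.
    """
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so that every keyword starts
        # at the same character position.
        lines.append(f"{left:>{width}} {match.group(0)} {right}")
    return lines


sample = (
    "Alice was beginning to get very tired of sitting by her sister on the "
    "bank, and of having nothing to do: once or twice she had peeped into "
    "the book her sister was reading, but it had no pictures or "
    "conversations in it"
)

lines = kwic(sample, "sister")
for line in lines:
    print(line)
```

Because the context is cut by character count, words at the edges of a line may be truncated; cutting by a fixed number of words instead would be an equally reasonable design choice.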
Tribble and Jones (1990) outline three main types of concordancing software:
♦ streaming concordancers: they “read” a text line by line and produce concordanced text to screen, printer or disk as they work through the documents you are analysing. This kind of software is very accessible: you can use the macro option of any word processor or even develop one yourself if you have some programming knowledge, the most widely used programming language for this purpose currently being Perl. There is, however, a major drawback. Although not limited to a particular size of text file, the concordancer might take a long time to work through a long document (50,000+ words). An example of a streaming concordancer is Conc 1.80b3 for the Macintosh.
♦ text-indexers: they create an index of your text in one (sometimes lengthy) operation and then permit a large variety of text retrieval activities, including concordancing. Although ideal for large-scale research, text-indexers are still relatively little used, mainly because they might prove daunting to those with little computing experience or with limited time or motivation for learning how to use them. Perhaps the best example of text-indexing software is WordCruncher.
♦ in-memory concordancers: this software loads a complete file - or set of files - into the memory of the computer in one operation. The text can then be consulted in a variety of ways, the results obtained being presented to the user more or less instantaneously. The most common in-memory concordancer is the Longman Mini-Concordancer.

3.5.2 Frequency Tables

Before analysing concordances, however, it is worth remembering that there are also other extremely useful sources of information about texts. One of these is a list of word frequencies (also known as a frequency table), which can be obtained either by means of an extra application (e.g.
Mike Scott’s Frequency Lister), or simply by activating one of the many features provided by concordancing software. The example given below is taken from a full frequency table of Chapter 1 of Lewis Carroll’s “Alice in Wonderland” produced with the index function of Conc 1.80b3:

(Source: http://lonestar.texas.net/~brazos/alice/aliceinw.htm)

The number in brackets shows the frequency of the tokens listed in the first column, while the numeric string indicates the line numbers where that token was detected. By creating a frequency table like the one illustrated above before running a concordance across a text, it is possible to preselect the most analytically relevant items. If wordlists are created before you proceed to analysis, a great deal of guesswork can be avoided. Furthermore, a frequency table may even reveal stylistic characteristics of a text that would otherwise have gone unnoticed.

3.5.3 Taggers

A tagger is a computer program that assigns grammatical information to words. For instance, a tagger might tell us that the word program in the previous sentence is a noun in the nominative singular, or that the word program is a present tense verb in the sentence “They program well”. Most taggers use the following modules:
♦ they isolate words and punctuation marks
♦ a lexical analyser inspects each word and adds tags that indicate the grammatical properties of the words (e.g. part of speech and inflectional properties).
If a word can serve several grammatical functions, several tags are added as alternatives, as in the following example:

He_P chairs_Npl_Vpres the_DET conference_N

(P for pronouns; Npl for plural noun; Vpres for present tense verb; DET for determiner; N for uninflected or singular noun)

♦ when a word could represent two or more grammatical categories, the context needs to be consulted to disambiguate the word. The final stage in tagging, then, is disambiguation: a disambiguator tries to select the correct alternative by removing contextually illegitimate tags. As a result of successful disambiguation, the above sample would be analysed as follows:

He_P chairs_Vpres the_DET conference_N

This last operation is by far the most difficult subproblem in tagging. In spite of nearly 40 years of research, no perfect solutions are in view, although considerable progress has been made.

Taggers are extremely useful in linguistics. In the previous section I have already outlined the advantages of a tagged corpus. What has not been mentioned yet is the fact that tagging (and parsing) software has also found application in another major field of applied linguistics: machine translation.[18]

3.5.4 Parsers

The simple act of encoding phrases or sentences in a target language is actually not a difficult task for a computer. Since the process is largely mechanical, the machine even has an advantage - in terms of pure speed - over a human being. Additionally, the database of words to which it has virtually instant access is considerably larger than the active vocabulary that most people carry in their heads. The real challenge lies in analysing the phrase correctly into its constituent elements before the translation process starts. This is known as "parsing" the phrase.
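The two steps discussed above, disambiguating alternative tags and then analysing the sentence into its constituents, can be sketched using the sample sentence from the tagging example. Both the contextual rule and the three-rule grammar below are assumptions invented for this illustration; real taggers and parsers rely on far larger rule sets or on probabilities.

```python
def read_tagged(text):
    """Turn 'word_TAG1_TAG2' items into (word, [tags]) pairs."""
    items = []
    for item in text.split():
        word, *tags = item.split("_")
        items.append((word, tags))
    return items


def disambiguate(items):
    """Toy contextual rule: directly after a pronoun, keep only verb readings.

    If a word is still ambiguous after the rule, the first tag is kept,
    which is an arbitrary simplification.
    """
    result = []
    previous_tag = None
    for word, tags in items:
        if previous_tag == "P" and len(tags) > 1:
            verb_tags = [tag for tag in tags if tag.startswith("V")]
            if verb_tags:
                tags = verb_tags
        result.append((word, tags[0]))
        previous_tag = tags[0]
    return result


def parse_np(tagged, pos):
    """NP -> P | DET N. Return (tree, next_position) or None."""
    if pos < len(tagged) and tagged[pos][1] == "P":
        return ("NP", tagged[pos][0]), pos + 1
    if (pos + 1 < len(tagged) and tagged[pos][1] == "DET"
            and tagged[pos + 1][1].startswith("N")):
        return ("NP", tagged[pos][0], tagged[pos + 1][0]), pos + 2
    return None


def parse_sentence(tagged):
    """S -> NP VP and VP -> V NP. Return a nested tuple, or None on failure."""
    subject = parse_np(tagged, 0)
    if subject is None:
        return None
    subject_tree, pos = subject
    if pos >= len(tagged) or not tagged[pos][1].startswith("V"):
        return None
    verb = tagged[pos][0]
    obj = parse_np(tagged, pos + 1)
    if obj is None or obj[1] != len(tagged):
        return None
    return ("S", subject_tree, ("VP", verb, obj[0]))


ambiguous = "He_P chairs_Npl_Vpres the_DET conference_N"
tagged = disambiguate(read_tagged(ambiguous))
tree = parse_sentence(tagged)
print(tagged)
print(tree)
```

Returning `None` when no rule applies mirrors the point made in this section that a parser can signal an error in the syntactic construction of a sentence; a practical system would report where and why the analysis failed.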
In computer technology, a parser is a program that receives input in the form of annotated text, interactive online commands, or some other user-defined interface and breaks it up into parts (e.g. singular or plural nouns, verbs, adjectives etc.) together with their attributes or options. It then draws a map of the phrase or sentence either in a linear or in a schematic form. A parser may also check that all necessary input has been provided, and otherwise signal an error in the syntactic construction of the sentence.

Despite the very advanced technology applied to the process of parsing, however, automatic syntactic analyses may simply not be sufficient. The poor performance of most machine translation programs, in fact, strongly suggests that such analyses should be supervised by humans: interaction with the machine allows the human analyst to make difficult linguistic judgements, while the computer takes care of record-keeping. Again, performance and competence are symbiotic.

18 See Chapter 4 Section 2.2.

3.5.5 Ready-Available Tools versus Own Programming

An issue that continues to cause major disagreement among researchers is the question of whether or not corpus users should be able to create their own analytical tools. While many corpus users feel that the commercially available software does not provide for the kind of analysis that they need, not many are familiar with the programming languages used in corpus linguistics. Conversely, there are many linguistic software developers who have little knowledge of linguistics and therefore no insight into the real needs of corpus users.
The following exchange is quite typical of the kinds of arguments put forward by those supporting the use of ready-made software:

Date: Wed, 29 Jul 1998 12:50:41 +0200
To: [email protected]
From: Henning Reetz <[email protected]>
Subject: Re: Corpora: Corpus Linguistics User Needs
Sender: [email protected]

1) Writing a program is one thing. Testing and proving its correctness is another thing. Even for simple statistical problems I prefer to use standard statistical packages because I expect their algorithms to be better tested than my own code (but I compare always their results with examples from text books; if both disagree, I compute the problem on the example data by hand and found more often bugs in the textbooks than in the programs). Being an experienced programmer having written many thousands lines of code, I prefer to use standard software.

2) I don't have to be a car mechanic to drive a car. Why do I have to be a programmer to use a corpora? --- But I have to know as a driver what petrol my car takes, how good the breaks are, etc. As a user of a program, I cannot simple trust the program but have to be aware of its bugs or problems. I think it is a good policy to test a function by hand on a small data set and do cross-checks and plausibility tests on large data sets.

3) Why re-invent the wheel?

Henning Reetz
Allgemeine Sprachwissenschaft
Universität Konstanz

(Source: [email protected], in response to a question by Mason and Berglund on 27th July)

The point made by Reetz is well-grounded. Interdisciplinarity supported by close cooperation is often enough to solve the problem. Being able to write your own software, however, has several advantages.
As Biber (1998) points out, creating your own programs allows you to conduct analyses that would otherwise not be possible, either because no readily available tool explores the pattern of use you are aiming to analyse, or because it does not apply the scale of analysis you have chosen for your study. A further advantage is that someone familiar with programming languages would be able to modify existing programs and so increase their speed and accuracy (Biber ibid.).

Another argument in favour of programming your own tools is that you can tailor the analysis process to fit your research needs. For studies that are based on - or simply require - human assessments, you can develop an interactive interface, where the user takes over from the computer whenever s/he feels s/he should. See the following comment:

To: [email protected]
Date: Wed, 29 Jul 1998 11:14:17 -0400
Subject: Re: Corpora: Corpus Linguistics User Needs
X-Juno-Line-Breaks: 0-4,9-10,14-15,17-31
X-Juno-Att: 0
MIME-Version: 1.0
From: [email protected] (C Hogan)
Sender: [email protected]

Henning Reetz writes:
>I don't have to be a car mechanic to drive a car. Why do I have to
>be a programmer to use a corpus?

The argument here turns on the meaning of the word "use": It is not necessary to be a car mechanic if all you want out of your car is to drive it to work, turn right and left, stop and accelerate, etc. On the other hand, if you would like to put in a new engine, or tune-up your car, then yes, you do need to be an auto mechanic.

Similarly for corpus linguistics: if all you want to do is get word counts from your corpus, then you can probably rely on existing software. If, however, you want to do really custom stuff, then you should probably learn to program.

(Source: [email protected])

In my opinion, it is not vital for a young corpus linguist to know a particular computer language particularly well.
However, a track record of the ability and, above all, the willingness to acquire programming skills is almost indispensable. I believe that a corpus linguist has first to be a pure linguist, and only later - when the basic knowledge of linguistic processes has been assimilated - become a software developer. Modern linguists - translators included - are very versatile, mainly because they have to cope with the information overload of the society we live in. They are very far indeed from the celebrated stereotype of the ‘armchair linguist’ created in the early 1990s by Fillmore, that is, they are aware of the need for interdisciplinary learning. If they choose to focus on linguistic issues rather than implementing a more integrated approach, there must be a valid reason, and that is - at least at the beginning of their career - time. I simply believe that such priorities must be respected.

3.6 Corpus Networks

This section introduces an aspect which in recent years has been at the forefront of theoretical discourse and of practice-oriented debates alike. Over the years corpora have become bigger and bigger, requiring not only more capable hardware, but also the collaboration of different people, who are all assigned specific tasks and have to carry them out simultaneously (e.g. exploitation, updating, support, etc.). It is because of the need for flexible access to stored data that networking has become necessary.

Here I shall focus on basic technical details and outline the reasons why I believe that a computer network is essential for corpus exploitation, both for language learning and for language teaching purposes.

The major advantage of networking is that the content of hard disks can be shared. Only one machine needs to hold the data, while the other computers simply access them through the network.
Another advantage is that the data can only be changed by the project management, so that the corpus remains secure from unwanted modifications.

However, a networked system also has some crucial drawbacks. Hughes (1997), for instance, points out that shared files cannot be distributed among several computers: it would in fact be quite difficult to know where particular parts of the data are stored. Normally only one computer is used to distribute the shared files, that is to say, all users have to access the data on its hard disk. As a consequence, this machine may become a bottleneck for the whole system, and it is of no further use as a standard machine.

The solution to the problem posed by a large number of users is to dedicate an extra computer to the task of running the network. This kind of machine is called a ‘server’, while all computers dedicated to accessing the data are known as ‘clients’. In the long run, this solution turns out to be an easy way of creating a shared resource and therefore fits ideally into an institutional framework, because it supplies data rapidly and requires periodic servicing only on the server. Moreover, support is easier because it is more structured, and growth simply involves an increase in resources or client machines. Even the huge amount of resources on the Internet can be exploited more effectively: a case in point of a server-client network is the British National Corpus (BNC), which can be accessed remotely through the net by means of a specifically designed retrieval program called SARA.

A computer network offers some evident improvements to language learning. First of all, the simultaneous availability of different machines is an essential component of a learning-by-doing method.
Students can practise language use directly and independently, without having to wait either for the teacher to answer their questions or for some corpus tool to give them evidence of language use. This kind of approach boosts the student’s curiosity about practical applications, and - provided s/he knows how to exploit a corpus appropriately - not only delivers an exhaustive answer, but also encourages him/her to develop his/her own attitude to research in general. However, it should be mentioned that it is the task of the student to take advantage of the structure s/he is being offered: merely knowing about the possibility of self-training (as in the case of computer-aided language learning) is clearly not enough. Furthermore, the initiative has to be given enough space within the institution (e.g. a room specifically designed for this purpose) and be supported by additional courses that focus on the teaching of corpus exploitation, that is ‘teaching to self-teach’.

Corpus resources run on a computer network offer some ‘strategic’ advantages for language teaching as well. Again, there is a huge variety of practical perspectives that vary according to the methodology adopted by the teacher. One major point in favour of a computing infrastructure is that it enables the teacher to teach language courses without bringing his/her research activities to a halt. One possible form of feedback for the teacher, in fact, might consist in the supervision of individual research projects that the students have to carry out using specific resources made available by a source whose characteristics are well known to both teacher and students. It might be very interesting, for example, to analyse in detail the different uses of the word cheers in a corpus of American, British and Hiberno-English, or even confine the study to a single corpus (e.g.
the BNC) and highlight the difference between two similar words, such as recently and lately, taking due account of their pragmatic connotations.

Finally, installing a computer network also means being able to profit from particular software licences and agreements (e.g. the campus licence of Translator’s Workbench 19), and therefore to save considerable amounts of money.

19 “Translator’s Workbench” is a trademark of Trados GmbH, Stuttgart, Germany.

PART II APPLICATIONS

4 Applications of Corpora

In this chapter the focus is on applications. In Section 4.1 I shall begin with a short overview of the uses of corpora in some major domains of linguistics, and then discuss some of the advantages of implementing a corpus-based approach in the fields of translation and interpreting (Section 4.2).

4.1 The Use of Corpora in Linguistics

The main advantage of a corpus-based approach, as I have suggested, is its capacity to deliver evidence of language usage. This section describes in detail how corpus-oriented studies can contribute to the various branches of linguistics.

4.1.1 Corpora and Grammar

One of the earliest applications of corpus-oriented approaches has been grammar. Using language corpora in grammatical analyses has a double advantage, as it supports both the deductive and the inductive construction of theories. Parsed corpora in particular, i.e. corpora annotated with grammatical information, tell us a great deal about which syntactic structures are associated with which linguistic contexts. In other words, empirical data give the grammarian the possibility to derive the rules and to develop theories underlying language use. Corpora are also increasingly used to support inductive grammar theories.
Aarts (1991), for instance, describes how primarily rationalist formal grammars developed at Nijmegen University are tested on natural language using computer corpora. In other words, empiricism (corpus evidence) and rationalism (introspection) are successfully combined to draw up a comprehensive grammar. Possibly one of the most famous advocates of a combined approach is Michael Halliday, whose theory of systemic grammar 20 is perhaps the best example of an effective symbiosis between corpora and grammar.

20 M.A.K. Halliday’s theory of systemic grammar is based upon the notion of language as a set of choices for each instance from which the speaker must select one. In each situation various choices are more or less likely to be selected by the speaker: Halliday uses this idea of a probabilistically ordered choice to interpret many aspects of linguistic variation and change in terms of the differing probabilities of linguistic systems. It is, for example, one of Halliday’s suggestions that the notion of a register, such as that of conversational speech, is really equivalent to a set of these kinds of variations in the probabilities of the grammar. (McEnery and Wilson 1996:95-96)

Examples of Practical Applications

In grammar studies, corpora can be used to raise awareness of specific grammatical phenomena in comparative analyses of target-language and source-language structures. For instance, learners of English trying to comprehend the different usage of some and any will be more motivated if they can themselves examine the contexts in which the two determiners occur and compare them with the German equivalents irgendeiner, jeder, etc. (see Wolff 1996:78). Another exciting application is the Internet Grammar of English, which is available at http://www.ucl.ac.uk/internet-grammar/intro/intro.html and was developed by the Survey of English Usage group headed by R. Quirk (1996-1998).

Another interesting Internet site which draws on corpora for grammar teaching is http://www.ccl.umist.ac.uk/projects/salsa/. The University of Manchester developed SALSA, a program for students of English, French, German, and Spanish who have an interest not only in learning the language, but also in learning facts about the language, and who want to acquire proficiency in the use of the linguistic metalanguage. In addition, the multilingual nature of the package offers the opportunity to compare linguistic phenomena in freely selected language pairs. SALSA emulates a learning situation in which the students can freely experiment with different ways of analysing language without their attempts being recorded or marked. Instead, the program compares the input with the expected answer in order to provide feedback and help. Although the primary function of the software is to provide practice in syntactic analysis, SALSA offers short hypertext tutorials as well. These tutorials are not intended as yet another book on syntactic theory, but aim to explain the objectives and methods of the practical lesson(s) which follow in SALSA, or to give a synopsis of the topic in question before the student attempts the relevant exercises. Whenever feasible, the software user can call up a set of examples which corresponds to the nature of the tutorial, e.g. when the noun is being described the students can draw on a list of nouns in the language of their choice. These examples are taken from the same database as the sentences for the exercises, so that learners familiarise themselves gradually with the new material.
4.1.2 Corpora and Lexicography/Terminology

Lexicography is the only branch of linguistics that was making massive use of empirical evidence long before corpus linguistics became mainstream. Nevertheless, the advantages of a machine-readable, representative corpus are obvious. First of all, corpus-based research is very efficient: all a lexicographer needs to do is to switch on the computer and call up a word or phrase. In a few seconds, s/he will obtain a huge amount of information about that particular entry and is able to draw his/her own conclusions within a very short period of time. This means that collections of linguistic information about lexical items (i.e. vocabularies) can be updated more quickly and, depending on the kind of corpus examined, can reach a very high degree of accuracy and completeness. Moreover, corpus exploitation tools such as frequency listers and concordancers, which can, for example, sort hits by the first word to the right or to the left of the keyword, allow the lexicographer to catalogue entries on the basis of the context in which they are embedded and to establish a scale of importance among all possible meanings (McEnery and Wilson 1996:91).

Another advantage is that corpora can be enriched with extra linguistic and non-linguistic information. As already mentioned when introducing the COCOA references, text headers can deliver data about different variables, including the age and gender of the author, his or her social status, the date the text was produced, its genre and register variety, etc. With this extra input, the lexicographer can assign plain text to prototypical text categories on the basis of linguistic factors, and to social contexts on the basis of non-linguistic parameters, and so produce a well-grounded analysis of language use.
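The sorting of concordance hits by the first word to the right or to the left of the keyword, as mentioned in this section, can be illustrated with a minimal sketch. It is not the algorithm of any particular concordancer: the function names kwic and sort_hits, the simple tokenisation and the sample sentences are all assumptions made purely for illustration.

```python
import re
from typing import List, Tuple

Hit = Tuple[List[str], str, List[str]]  # (left context, keyword, right context)

def kwic(text: str, keyword: str, width: int = 3) -> List[Hit]:
    """Collect every occurrence of `keyword` with `width` words of context.

    Tokenisation is deliberately naive (lower-cased alphabetic strings),
    and the context window ignores sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_hits(hits: List[Hit], side: str = "right") -> List[Hit]:
    """Sort hits alphabetically by the first word right or left of the keyword."""
    if side == "right":
        return sorted(hits, key=lambda h: h[2][0] if h[2] else "")
    return sorted(hits, key=lambda h: h[0][-1] if h[0] else "")

# Invented sample text, not taken from any corpus.
sample = ("The bank raised interest rates. She sat on the bank of the river. "
          "A bank holiday falls on Monday. The central bank cut rates.")

for left, kw, right in sort_hits(kwic(sample, "bank"), side="right"):
    print(f"{' '.join(left):>25}  [{kw}]  {' '.join(right)}")
```

Sorting on the right brings together patterns such as bank holiday and bank of, while sorting on the left groups modifiers such as central bank; this grouping is what lets the lexicographer separate senses by context.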
Representativeness, however, remains a complex issue. Corpora obviously cannot solve all the lexicographer’s problems, as Della Summers points out:

“Frequency is a powerful tool in the lexicographer’s arsenal of resources, allowing her to make informed linguistic decisions about how to frame the entry and analyse the lexical patterns associated with words in a more objective and consistent way. However, in dictionary-making editorial judgement is of paramount importance, because blindly following the corpus, no matter how carefully it may be constructed to represent the target language type accurately, can lead to oddities. We expect our motto: ‘Corpus-based, but not corpus-bound’ to hold good for many years to come.” (Della Summers 1996:266)

Examples of Practical Applications

One easily replicable example of lexicographic research using corpora is described by Tim Johns (http://sun1.bham.ac.uk/johnstf/neo.htm). He used the 1994 electronic editions of the Guardian and the Observer newspapers as a corpus to search for neologisms in the world of computing, and in particular for terms relating to the Internet. What he found was that some of the new terms were actually old words that had been given new meanings, whereas others were completely new coinages. Most of these neologisms came from the USA and were joky and informal.

In addition to the obvious usefulness of corpora in lexicographic research, there has been a trend in recent years to draw on corpora in quite specific terminological studies. One of these is of particular interest to translators and is therefore described in more detail. Susanne Lenz conducted a corpus-based lexical study in the field of terminology and showed how a corpus-based approach can help detect mutual lexical influences between German and English. As evidence she used two large corpora, the Bank of English and the more recent IDS-Korpora.
For reasons of topicality, environmental "terms" were chosen: items which fall somewhere between terminology and general language, e.g. green dot, green bin, habitat. In her study, the focus was on terms selected from a German and an English glossary and their respective translations. These glossaries had been compiled as part of an earlier project which investigated communicative structures and interfaces in the environmental field of various industrial companies, and were chosen mainly for contextual reasons, to ensure cohesion with regard to type of text and circulation. She first selected a limited number of terms from the glossaries and then cross-checked them against the German IDS-Korpora and the Bank of English. The selected items were then traced throughout the corpora in order to look for common contextual patterns and, if possible, also for chronological evidence that might be attributable to mutual lexical influence between German and English in this field. Her findings show that the use of large corpora in this type of research promises interesting results, although further bilingual corpus alignment is required if maximally reliable statements are to be produced. 21

21 For tools specifically developed to extract terminologies and lexicographic data from corpora see QUIRK (http://www.mcs.surrey.ac.uk/Research/CS/AI/SystemQ/index2.html), a project conducted by Sharp Laboratories of Europe Ltd, Cambridge University Press, and the University of Cambridge Computer Laboratory.

The last point I shall make concerns future trends. Some recent projects have been trying to use word sense frequencies as the basis of categorisation, and to produce sense-ordered frequency dictionaries (see for instance the research project conducted by Christian-Albrechts-Universität in Kiel and Bowling Green State University in Ohio; the aim is to create a sense-ordered frequency dictionary for mediaeval German epic). Such semantics-based dictionaries would undoubtedly be of great interest to translators, as they provide the kind of information that is not included in traditional dictionaries.

4.1.3 Corpora and Morphology

One interesting application of corpora has been their use in morphological analyses in deduction-based language learning. Learners themselves systematically collect data on word-formation rules by means of the wildcard option. This allows them to identify semantic similarities and differences between, for example, English words ending in *ic vs. *ical, such as classic/classical, historic/historical, economic/economical, or the usual contexts of English and French adverbs (*ly and *ment respectively) and adjectives (Legenhausen 1996:69).

4.1.4 Corpora and Semantics

There are two major reasons for using a corpus-based approach in semantics: the first is that a corpus helps identify criteria for assigning meanings to linguistic items; the second is that corpora permit a prototypical approach to the linguistic categorisation of lexical units. Concordances show lexical items embedded in contexts which normally specify their meanings in that particular phrase or in a given period. This use of contexts may also reveal deviations from established uses: a monitor corpus, for instance, can show changes in the meaning and the use of words as well as the expansion of their semantic fields.

Corpora have also played a major role in eroding the belief in the possibility of hard and fast categorisation. Conventional studies attempted to formulate unambiguous descriptions of word meanings and ignored the fact that alternative options existed.
Corpora have proved this methodology to be wrong. McEnery and Wilson (1996) sum up this point clearly and concisely:

“In looking empirically at natural language in corpora it becomes clear that this ‘fuzzy’ model accounts better for the data: there are often no clear-cut category boundaries but rather gradients of membership which are connected with frequency of inclusion rather than simple inclusion or exclusion. Corpora are invaluable in determining the existence and scale of such gradients.” (McEnery and Wilson ibid.:97)

The notion of ‘frequency of inclusion’ suggests a prototypical approach to language analysis which is based upon definite, clearly established items serving as descriptors of a specific function within the text. Linguistic units gravitate towards these; their meanings vary according to numerous factors (e.g. time, place, speaker, situation). The corpus is so far the only linguistic tool which enables the definition of this “gradience of membership”.

Examples of Practical Applications

One current application is the DELIS project (Descriptive Lexical Specifications), which is conducted jointly by several universities and publishers 22 and aims to produce a dictionary describing the major semantic classes of English, French, Italian, Danish and Dutch, as well as the interaction between syntax and semantics. Another, more ambitious project is the FRACAS project (Framework for Computational Semantics) 23.

22 The Center for Sprogteknologi at Københavns Universitet; Research Unit for Computational Linguistics, Helsingin Yliopisto; Linguacubun Ltd., London; Istituto di Linguistica Computazionale del CNR, Università di Pisa; Sonovision ITEP Technologies, Paris; Vrije Universiteit Amsterdam; Van Dale publishers, Utrecht; as consultants: Université de Clermont-Ferrand, Den Danske Ordbog, Oxford University Press.
23 This project is conducted jointly by the Centre for Cognitive Science (CCS) and Human Communication Research Centre (HCRC), University of Edinburgh; the National Centre for Mathematics and Computer Science (CWI), Foundation Mathematical Centre (SMC) in Amsterdam; the Institut für Computerlinguistik, Universität Saarbrücken; and the Stanford Research Institute (SRI), International Cambridge Computer Science Research Centre in Cambridge, U.K.

4.1.5 Corpora and Pragmatics

Up to now very few corpus-based studies have been carried out in pragmatics. One of the reasons is that the usual design criteria for corpora do not really recommend them for pragmatic analysis. As I mentioned earlier, most corpora try to be representative and so include rather short samples, which are removed from their original social and textual context. A pragmatic analysis, however, needs an unabridged version of the text in order to be able to extrapolate all meanings and reactions in particular situations. Another problem is that pragmatic elements cannot be extracted from corpora by means of a simple concordance. The solution to this problem suggested by McEnery and Wilson is that pragmatic elements could be represented by some kind of matrix which associates words with meanings and correlates them with effects or reactions. What they advocate, then, is the creation of a linguistic database that is able to bridge the gap between what is actually said and what is really meant.

Examples of Practical Applications

One example of a pragmatics-related study is described by McEnery and Wilson (1996): Stenström (1987) carried out a study in which she looked at ‘carry-on signals’ such as right, right o and all right, and was able to classify the use of these signals according to a typology of their various functions. She found, for instance, that right, the most frequent locution, had many functions and was very often used either as a response or both to evaluate a previous response and to terminate the exchange. All right, by contrast, acted as a boundary between two stages in the discourse, while locutions such as it’s alright and that’s right were used as responses to apologies. On the basis of quantitative data, Stenström was then able to infer that the use of carry-on signals in conversational English was strongly linked to the channel used, i.e. telephone English (McEnery and Wilson 1996).

A more ambitious project is being conducted by ISSCO at the University of Geneva and the Institut Dalle Molle pour les Etudes Sémantiques et Cognitives. They use corpora of dialogues to identify regularities in how the beliefs and intentions of the interactants are reflected in language.

4.1.6 Corpora, Stylistics and Discourse Studies

While earlier applications of corpora were largely restricted to the study of low-level grammatical and lexical phenomena, several recent projects have gone beyond the word or sentence level and tried to identify generic and textual patterns, taking into consideration pragmatic as well as discoursal features. Biber (1998) summarises the pros and cons of corpus-based methodologies. While it is true that conventional concordancers are not really able to identify discourse-level features, their usefulness can be improved through interactivity: the concordancer produces a fast and reliable list of all the discourse characteristics the researcher wants to identify, and the researcher then checks this list and decides whether the items identified comply with the given specification.
Examples of Practical Applications

One possible application described by Biber uses concordancers to track surface grammatical features over the course of the whole text and to produce what Biber (1998:108) refers to as a “discourse map”, i.e. a record of the development of discourse patterns through texts. These discourse maps can then be used as the basis for textual comparisons aimed at finding typical patterns of text design in various genres and registers.

A further example of the use of corpora in stylistics is given by Tim Johns, who compiled a corpus of articles and letters from the 1989 issues of the scientific research journal Nature. Johns examined in detail five reporting verbs (indicate, show, suggest, find, demonstrate), which had previously been identified on the basis of the formal criterion that they were those most frequently followed by that-clause complements. Even though other complements were not included, related nominals (indication(s), suggestion(s), finding(s), demonstration) were taken into account when they were followed by a that-clause. The main features of the syntactic environment of each verb were then identified with the assistance of the program MicroConcord (OUP) 24.

Another popular area has been stylistic variation analysis. A good example is the research project carried out by Thomas and Short (1996), which used a corpus-based approach to examine patterns of speech and thought presentation in contemporary prose fiction and newspaper reports. They compiled quite a small corpus (40 extracts of approximately 2,000 words each, giving a total corpus of 88,631 words), which was split into four roughly equal parts. There are 10 extracts from each of the following areas:
– ‘high’ literature (21,911 words);
– popular fiction (23,301 words);
– broadsheet newspapers (22,814 words);
– tabloid newspapers (20,605 words).
Contrary to other major approaches to corpus compiling, the corpus had not been built for general research purposes, but for very specific tasks, which were (Thomas and Short ibid.):
♦ to enable investigation of the similarities and differences in the presence and patterning of speech and thought categories in literary and non-literary texts;
♦ to investigate a linguistic phenomenon which is textual/discoursal; and
♦ to explore the possibilities for automatic parsing of texts for speech and thought categories.

24 For further information about the nature and the results of the analysis go to http://sun1.bham.ac.uk/johnstf/Five_vbs.htm

Thomas and Short were able to test hypotheses in an empirical way as well as to determine and quantify categories across text-types. They also found that the use of corpora forced them to label every single part of a text and not to ignore examples inconvenient to the theory. Further, the presence of ambiguous codings (4.5% of the total) and frequent overlap between categories underlined once again the importance of pragmatic factors in assigning items to linguistic categories, indirectly supporting the notion of a continuum in speech and thought presentation and so helping to dismiss hard and fast categorisation.

Another example is the COLT project (Bergen Corpus of London Teenage Language), which is the first large English corpus focusing on the speech of teenagers. In a pilot version, information on the use of linguistic items by different groups (age, gender, socio-economic class, location, etc.) can be obtained.

In the future, as more researchers familiarise themselves with these techniques and learn how to use them, we will hopefully be able to learn more about patterns of discourse that hold across texts and registers.
4.1.7 Corpora, Language Teaching and Learning

Language teaching is another typical field of application of corpus linguistics. The use of real-life examples has many advantages: it exposes students to real communicative situations at early stages of their language comprehension process, provides an empirical basis for progression in language learning and helps address individual students’ needs. Mindt (1992, 1996) convincingly showed that many grammar textbooks pay little attention to real-life usage. Most grammar textbooks introduce future time references and modal verbs quite late. This may present considerable difficulties for foreign language learners, who might not understand native speaker utterances. He therefore argues that corpus studies should be used to inform the production of teaching resources, so that more common choices of usage are given more attention than those which are less common.

Examples of Practical Applications

One example of how corpora can be used to address the individual needs of students is the exploitation of LSP corpora. Major LSP corpora available at the moment include the Guangzhou Petroleum English Corpus and the Hong Kong University of Science and Technology (HKUST) Learner Corpus 25.

Another way of fitting language learning to students’ individual requirements is offered by CALL (computer-assisted language learning) programs. CALL’s primary objective is to create a user-friendly computational environment that allows the student to exploit resources by focusing on his or her specific needs. A major application of corpora to CALL has been implemented by Tim Johns, who designed user-centred didactic software for language acquisition. Cloze and Contexts 26 enable the teacher to automatically generate learning materials from any corpus s/he might have sampled.
These two tools not only represent a convenient way of introducing concordance-based methods in language teaching, but they also help students appreciate the advantages of concordancing, which might then be introduced at a later stage. Further, student performance is logged in text files, which allows the teacher to monitor learning strategies. As Johns points out, CALL methods allow every student to become a “Sherlock Holmes” (Johns 1997:101), a ‘language detective’ who learns to recognise and interpret clues from a context.

However, CALL can also be applied to more specific areas of language teaching and learning, as McEnery, Baker and Wilson (1995) demonstrated. The researchers compared the performance of students in part-of-speech analysis tasks, using corpus-based computer vs. traditional human teaching methods. Their focus was on accuracy of participant response over time. Of seventeen first-year English language undergraduates who participated in the seven-week experiment, nine were taught grammar via the traditional classroom-based “human teacher” method, while the rest used CyberTutor, a corpus-based computer-aided linguistic learning program. What McEnery et al. found was that the computer-aided group out-performed their human-taught counterparts in terms of accuracy and number of words analysed.

25 See appendix for further details on these two corpora.

4.1.8 Corpora and Ethnolinguistics

Ethnolinguistic studies have only recently begun to make use of machine-readable corpora and few examples of practical applications are as yet available.

Examples of Practical Applications

One notable exception is a study by Leech and Fallon (1992) in which the authors investigated differences between British and American English which might be attributed to differences in life-styles and cultural attitudes. They first prepared frequency lists and then compared these.
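The general procedure of preparing two frequency lists and comparing them can be sketched as follows. This is only an illustration of the idea, not a reconstruction of Leech and Fallon's method: the function names, the per-thousand normalisation and the two toy "corpora" are assumptions introduced here, and the texts are invented.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

def freq_list(text: str) -> Counter:
    """Raw word frequencies for a text (naive lower-cased tokenisation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def per_thousand(counts: Counter) -> Dict[str, float]:
    """Normalise raw counts so that corpora of different sizes are comparable."""
    total = sum(counts.values())
    return {word: 1000 * n / total for word, n in counts.items()}

def compare(text_a: str, text_b: str) -> List[Tuple[str, float, float]]:
    """List (word, rate in A, rate in B), largest difference first."""
    a = per_thousand(freq_list(text_a))
    b = per_thousand(freq_list(text_b))
    rows = [(w, a.get(w, 0.0), b.get(w, 0.0)) for w in set(a) | set(b)]
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))

# Invented toy texts standing in for two corpora of unequal size.
corpus_a = "the plane the car the road"
corpus_b = "the tea the garden"

for word, rate_a, rate_b in compare(corpus_a, corpus_b):
    print(f"{word:<8} {rate_a:8.1f} {rate_b:8.1f}")
```

Normalising before comparing matters because the two corpora will rarely contain exactly the same number of words; a real study would also test whether a difference is statistically significant before interpreting it culturally.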
What they found was rather suggestive: travel words, for instance, were more frequent in American English, perhaps because of the larger size of the United States. Further, words belonging to the domains of criminality and the military proved to be more frequent in the American corpus, which might support the theory of the American ‘gun culture’.

Another example of a possible application of corpora in ethnolinguistics is the Austrian-based one-year project Racism at the Top 27, led by Ruth Wodak and Teun A. van Dijk in cooperation with seven EU countries. The overall aim was to investigate the role of top politicians in the reproduction of racism and anti-racism in Europe's increasingly multicultural societies. More specifically, Wodak and van Dijk investigated how leading politicians write and speak about immigration, minorities, and other 'ethnic' issues in Great Britain, France, Germany, Spain, Italy, Austria and the Netherlands. The results of this project yielded crucial insight into the ways leading politicians influence public discourse and attitudes about ethnic issues. Despite the rather restricted scope of the work so far, it seems a very promising line which should be more closely integrated with work in cultural studies.

26 For further information about the Cloze and Contexts programs see Johns 1997 as well as Tim Johns’s CALL page at http://sun1.bham.ac.uk/johnstf/timcall.htm

27 Further information about the nature and the results of the project can be obtained from Teun A. van Dijk’s home page at http://www.let.uva.nl/~teun/

4.2 The Use of Corpora in Translation

Translation practice has recently undergone major changes: quality standards have improved in an attempt to meet increasingly stiff competition, new text types and genres have been created, and language technologies have been enjoying a major boost.
One of the consequences of this development is that translation can no longer be taught merely in a hands-on, “learning by doing” fashion (if this was ever possible), where translation exercises are used to improve language proficiency. What is required is sound theoretical knowledge that will allow students to identify the distinctive features of texts and to develop translation strategies for the various situational contexts and translation briefs. As professional translators they will also need to be familiar with all the tools that can help them to fulfil this task. For this, they require interdisciplinary knowledge. In a world where the translator can no longer cope with the information overload resulting from contemporary technologies, s/he will necessarily have to look in his/her bag of tricks for new tools able to combine accuracy and speed.

While we may not agree with Gavioli’s suggestion (1996) that translation should be considered as a non-standard LSP situation, she is right in saying that translators (and translation students) are language experts, rather than specialists in a certain discipline, who nevertheless require a high degree of technical competence. The acquisition of such specialised knowledge is not normally a major focus in the translation course curriculum. Indeed, because it is not possible to foresee what translators will be asked to translate during the course of their professional careers, it may even be an ill-conceived move to offer translation courses in too narrowly circumscribed specialist areas. A more promising approach, I feel, is to focus on paradigms that are maximally generalisable. It is in this area that I feel corpora will prove most useful. 28

4.2.1 Parallel, Multilingual and Comparable Corpora

In translation, corpora are being used for a wide range of analyses.
Amongst the most obvious applications is the use of comparable and parallel corpora as reference material for terminological analyses. Friedbichler et al. (1997), for instance, demonstrate how aligned parallel corpora enable translators to pinpoint the terms, collocations or colligations standardly used by target-language experts in a given specialist area. Whether the translator then uses these standard collocations or decides to adopt a foreignising translation (see Venuti 1995) is a different matter. What is important is that translators know which terms/phrases reflect regular usage and that they therefore have a real choice.

Parallel corpora can also be used to provide information on language-pair-specific translation behaviour, highlighting equivalence relationships between lexical items or structures in source and target languages (Kenny 1998; Kenny, forthcoming). On a more general level, parallel corpora can also be used to analyse the ‘side effects’ of the translation process, for instance unconventional linguistic phenomena such as errors, interference between similar languages, etc.

Comparable corpora 29 have been used for example by Baker (1993:239-247) in a large-scale study aimed at identifying ‘translation universals’. Drawing on Toury’s notion of laws of translation, she and her associates have focused on identifying those features that occur exclusively (or with a suspiciously low or high frequency) in translated texts, which - provided they are not the result of interference and have subsequently been confirmed by the analysis of comparable corpora in other languages - can be considered as candidates for translation universals (see also Laviosa-Braithwaite 1998 and Kohn 1996:46-48). On the basis of some preliminary studies, Baker hypothesises that translated texts tend to be more explicit, unambiguous, and grammatically conventional than their source texts or original texts in the target language. She further argues that translations also tend to avoid source text redundancy and to exaggerate stereotypical characteristics of the target language (see Baker 1993:243-245). Kenny (1998) found that translations tend to display high type-token ratio, low lexical density and low sentence length vis-à-vis original texts in the same language, which seems to support the hypothesis that simplification may be a translation universal (see also Laviosa-Braithwaite 1998).

28 See http://www.sslmit.unibo.it/cult.htm
29 Baker defines a comparable corpus as a collection of texts originally written in a language, say English, alongside a collection of texts translated (from one or more languages) into English (Kenny 1998).

While the notion of language universal is not new (see Toury 1991; Baker 1993:243-247), earlier research had to rely on manual analysis, which proved very time-consuming. Corpus linguistics and modern tools such as Scott’s WordSmith Tools allow for rapid processing of linguistic patterns in vast quantities of text, and produce comprehensive statistical data.

Despite this obvious success of corpora in translation research, much work still needs to be done. So far only a few researchers have dealt with the problem of which type of corpus (parallel, comparable, multilingual, monolingual) is best used in which kind of study to achieve optimal results. Another issue that is rarely addressed is that of how theoretical and applied aspects could best be merged.
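The type-token ratio and sentence length mentioned above in connection with Kenny (1998) can be computed in a few lines. The following is only a minimal sketch in Python: the function names and the naive regular-expression tokeniser are illustrative assumptions, not the procedure used by Kenny or by WordSmith Tools. Lexical density is left out because it requires distinguishing content words from function words, which needs a tagger or a word list.

```python
import re

def tokenize(text):
    """Lower-case the text and return its word tokens (naive: runs of letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting naively on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

# Hypothetical sample text, used only to show the calls.
sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)
print(len(tokens), "tokens,", len(set(tokens)), "types")
print("type-token ratio:", type_token_ratio(tokens))
print("mean sentence length:", mean_sentence_length(sample))
```

Because the type-token ratio falls as a text gets longer, comparisons between translated and original texts are only meaningful on samples of equal size.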
The extent to which theoretical investigations are expected to feed into translation pedagogy will again affect the type of corpora used. That theory and applied translation studies should be closely related was convincingly argued by James Holmes. In his seminal work of 1972 entitled "The Name and Nature of Translation Studies", which essentially laid the foundations of the discipline, Holmes stated that translation practice should not be divorced from theory and that the discipline should be receptive to developments in other fields of study. It was his firm conviction that the theoretical component, whose aim is essentially to describe the phenomena of translating and translation and to establish general principles by means of which these phenomena can be explained and predicted, should be closely linked to applied areas of translation studies such as translator training, foreign-language learning and translation criticism. Multilingual parallel corpora provide an effective means of integrating a theoretical component into the pedagogical material required for a translator training course, helping to analyse translation universals and to identify the tolerance factors that constitute "adequate" and "acceptable" translations.

4.2.2 Machine Translation

MT systems have adopted a variety of different approaches and have also evolved considerably since their beginnings in the 1950s, when translation was considered to be essentially on a par with code-breaking: the ‘first-generation direct systems’ tried to implement dictionary-based direct replacement on the word level.
During the 1960s, then, the basic techniques of word transfer were revised, and gave way to the ‘indirect method’: the transfer approach involved structural analysis of the input text, a bilingual mapping at an abstract level, and synthesis of the target text, while the interlingua approach avoided the bilingual transfer stage and instead used a more abstract universal representation (Somers 1998). The pyramid diagram depicted below and probably first used by Vauquois in 1968 shows the essentials of MT systems: the deeper the analysis, the less transfer is needed, the ideal case being the interlingua approach, where there is no transfer at all.

[Figure: pyramid diagram with the labels Interlingua, Transfer, Analysis, Generation, Direct translation, Source text and Target text. Source: Somers 1998:145]

Although much improved compared to the early systems, even second-generation MT programs were unable to produce fully acceptable texts. MT systems required comprehensive pre- or post-editing of texts to ensure they fulfilled the end-user’s needs. This was generally considered too time-consuming and so prevented the wider distribution of such systems amongst translators and companies.

In recent years, a ‘third generation’ of MT systems has evolved which tries to incorporate real-world knowledge. The new paradigm that is being developed in MT research is called ‘artificial intelligence’ (AI). AI researchers essentially agree that, in order to be able to ‘teach’ knowledge to an MT system, huge amounts of naturally occurring language are needed: the source they need is obviously text corpora. Two major examples of such ‘third generation’ MT systems which make extensive use of corpora are example-based MT and statistics-based MT. In example-based systems, translation is produced by comparing the input with a corpus of typical translated examples, extracting the closest matches and using them as a model for the target text.
This is done in two stages: ‘matching’ the input with examples, and ‘recombining’ the target-language fragments extracted. This approach is considered to be more like the way humans translate and its result is said to be more ‘stylish’ 30, since it is not solely based on the structural analysis of the input text (Somers 1998:148).

Statistics-based systems are essentially a non-linguistics-based technique. They attempt to translate purely on the basis of probabilities calculated by considering millions of words of parallel (or comparable) text, thus trying to determine lexical equivalents and target-language word order (ibid.).

Examples of Corpus-Integrating MT Systems:

SPARKLE

SPARKLE is one of several MT systems currently being developed at various European centres that are designed to be context-sensitive, making use of phrase-level syntactic analysis to solve the problem of lexical clusters and disambiguation. Computational concordancing is used to systematically examine words and phrases which occur in the proximity of a given term. This approach has proved particularly successful in the area of special-domain languages where the number of standardised collocations appears to be more restricted. With more context-sensitive MT software the output will require substantially less time for post-editing.

PARGRAM

The major goals of the PARGRAM project are the analysis and encoding of important and most generally occurring syntactic structures in German, and the development of parallel analyses for cross-linguistic phenomena (e.g. binding, modification). The parallel nature of the analyses is ensured through the concurrent development of German, English, and French Lexical Functional Grammars (cf. the LFG websites in Essex and Stanford). The researchers also strive for maximally broad coverage, coupled with efficient processing. A spin-off of their work is that they are accumulating extensive experience in the encoding of large grammars.

30 Literal translation is obviously an exception.

4.2.3 Translation Memory Systems

The development of AI-based MT systems was one of the responses put forward by computational linguistics in an attempt to improve the disappointing performance of MT. Another was to develop systems that would support, rather than replace, the translator. The basic idea was to create computer tools which were able to reuse previously translated passages. Earlier translations are stored in a database – the so-called translation memory – where sentences of the source text are aligned with corresponding sentences of the target text.

Translation memory systems can be particularly useful if the source-language text is an updated version of a document (for instance a computer manual). When the translator starts to translate the new text using the translation editor, the system automatically segments the source text and looks up each segment in the translation memory database. If a segment has occurred previously, the stored version is offered as a possible equivalent. The output can then be accepted, amended or even rejected, that is, the translator remains responsible for drawing analogies and for structuring the target text during the translation process (Freigang 1998).

[Figure: Screenshot of Translator’s Workbench for Windows by Trados]

Translation memories can therefore be considered as a special kind of parallel corpus, which, in addition to its practical applications, also provides interesting data for cross-linguistic studies and the study of language use in translation in general.
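The segment-and-look-up cycle described above can be illustrated with a small sketch. It is written in Python under stated assumptions: the memory contents, the function names and the 0.75 similarity threshold are all hypothetical, and the character-based similarity from the standard library's `difflib` merely stands in for whatever matching a commercial product such as Translator's Workbench actually performs.

```python
import re
from difflib import SequenceMatcher

# Hypothetical miniature translation memory: source segment -> stored translation.
MEMORY = {
    "Press the power button to switch on the printer.":
        "Drücken Sie die Netztaste, um den Drucker einzuschalten.",
    "Remove the paper tray.":
        "Entfernen Sie das Papierfach.",
}

def segment(text):
    """Split a text into sentence-like segments after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    """Character-based similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lookup(new_segment, memory, threshold=0.75):
    """Return (score, stored_source, stored_target) for the best match, or None."""
    best = None
    for source, target in memory.items():
        score = similarity(new_segment, source)
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None

# An updated version of the manual: one sentence changed, one sentence new.
new_text = "Press the power button to switch off the printer. Clean the rollers."
for seg in segment(new_text):
    match = lookup(seg, MEMORY)
    if match is None:
        print(f"{seg!r}: no match, translate from scratch")
    else:
        score, source, target = match
        print(f"{seg!r}: {score:.0%} match, suggestion: {target!r}")
```

The suggestion for the changed sentence still has to be amended by hand, which is the sense in which the translator remains responsible for the final text.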
While translators using Translation Memories 31 were originally restricted to exploiting the texts they (or their colleagues) had translated, several new projects are currently underway which try to improve the workbench programs so that they can also draw on comparable corpora (for details see Chapter 6).

31 A very interesting project about translation support tools entitled ”Linguistic Engineering for Generation and Translation of Documentation” is being conducted at the Department of Computer and Information Science of Linköping University. For further details see Ahrenberg et al. 1996.

4.2.4 Corpora vs. Termbanks

Until recently, building corpora was the privilege of a handful of specialists in the field of language engineering. But since the advent of the new media, most notably of CD-ROMs and the World Wide Web, the number of electronic texts available has been increasing exponentially each year. In the field of the medical sciences, for example, more and more prestigious journals are publishing annual full-text collections of their hardcopy issues on CD-ROM and the proceedings of many specialist conferences are available electronically. Similar trends can be observed in many other disciplines. As the benefits of concordancing rely heavily on the quality and adequacy of the corpus, this means that it is now becoming profitable for every professional translator working in a specific domain to compile his or her own custom-designed domain-specific corpus.

But why do we need domain-specific corpora when we have term banks, some terminologists may wonder. First of all, corpus research is a highly efficient tool for compiling more authoritative data banks. Furthermore, bilingual data banks can be incorporated in the corresponding domain-specific corpus in which they would act as a pivot between two unaligned source and target language corpora.
In addition, and this is the crucial point for professional translators, a well-designed representative corpus is far richer and much more adaptable to the various language queries a translator is confronted with. Experience has shown that once the initial learning phase has been overcome, finding the proper terms - especially the more common ones which are likely to be available from data banks - is an issue of decreasing importance, while embedding the key terms in the appropriate idiom and hitting the adequate domain-specific register, phraseology and style remain time-consuming tasks in final-draft revision even for translators with extensive experience. It is precisely in this latter context that professional translators having a representative specialised corpus at hand will save a considerable amount of referencing time and, at the same time, enhance the quality of their translations.

4.2.5 Translation Teaching and Translation Research

Although an old adage has it that practice makes perfect, in translation programmes this approach will often cause frustration, as learners are told they need to improve their performance yet are not offered any advice on how this could be achieved. If we are to keep pace with new trends in translation and the translation market, I believe more creative approaches are needed, especially approaches which promote self-access study. Using corpora seems to me one of the best ways of enhancing autonomous learning skills. Silvia Bernardini (1997:3) also supports this view when she suggests that activities involving self-access use of a large corpus for learning rather than reference purposes may help students develop the skills and strategies that are necessary complements to the translation task.
Bernardini summarises her starting point as follows:

I want to let learners find out for themselves the solution to a problem they are (or are made) aware of, or the answer to a curiosity or doubt. Besides, however, I also want them to develop procedures and strategies which allow them to take maximum advantage of the resources they have - in this case a large corpus - in order to accomplish the task successfully and economically. Finally, I want them to feel free to look around, to notice unexpected - or indeed expected - phenomena, to deviate from their path in order to follow a new one, or go back to the old one if the new one reaches a dead end. Clearly, the aim here is not the acquisition of descriptively adequate knowledge that, or competence, although this is a valuable, and indeed likely, outcome of large-corpus concordancing. Instead, what is at stake is the development of a number of skills that can be grouped under the heading of knowledge how to, or capacity. In other words, we focus on processes rather than products, on methods rather than outcomes, on resourcefulness, awareness and reflectiveness rather than learnedness (Bernardini, 1997:3; my emphases).

Examples of Practical Applications

Lynne Bowker at Dublin City University similarly promotes the use of corpora and corpus tools in the translation class. In an experiment she conducted with a group of final-year students she convincingly showed that the quality of the translations produced by the students improved substantially (both with regard to comprehension errors, specifically errors resulting from a lack of comprehension of the subject field, and production errors, including wrong choice of term, unidiomatic constructions, grammatical errors, and incorrect register) when a target-language comparable corpus was used.
A similar experiment was carried out by Federica Scarpa at the Università per Interpreti e Traduttori of Trieste with a group of final-year Italian students. The study was carried out on the section of the corpus consisting of original English texts and their translations into Italian and, conversely, of original Italian texts and their translations into English. Concordance queries were undertaken at different levels of "delicacy": at the word level the focus was on alerting students to basic translation problems such as "false friends" (e.g. prima facie equivalents such as in fact and infatti, eventually and eventualmente), and, at the paragraph level, the students investigated the different strategies used to signal the same pragmatic feature in the two languages (e.g. the greater grammaticalisation of modality in English compared to Italian, where modal functions of auxiliaries have been taken over to some extent by other items). Scarpa stresses that this type of activity discourages a word-for-word approach to translation and enhances the critical awareness of the students, often disturbing received ideas such as the assumption that published translations must be accurate.

A very specific application is described by Robert Spence. He carried out an experiment in corpus-based translation teaching at the Fachrichtung 8.6 (Angewandte Sprachwissenschaft sowie Übersetzen und Dolmetschen) of the Universität des Saarlandes. Spence analysed two text corpora: the first a corpus of 100 student translations of a short news report on world population growth and the second a corpus of 37 student translations of a tourist guide to the Chamber of the House of Commons. Most of the translations were done by German native speaker students.
The texts included in the first corpus were assessed for errors, which were then classified according to their likely origin (in relation to the metafunctions, strata and ranks of the systemic functional model of text as instantiation of “meaning potential”) and in terms of their likely effect (on the “usability” of the translation). In analysing the second corpus, the focus was on the relation between register, genre and ideology, and on the role of microregisterial variation as a tool for identifying genre-specific text structures. The experiment had three main aims:
♦ to investigate the phenomenon of Learner English, and in particular the phenomenon of L1 (and possibly also L3) interference, in a highly constrained text-creation environment (i.e., in relation to translation rather than free composition);
♦ to explore didactic applications of corpora of student L2 errors in the context of an undergraduate course in translation;
♦ to ascertain the feasibility of using such corpora in interaction with a multilingual systemic functional computational generative grammar and parser as part of a future computer-aided approach to the difficult task of "learning to translate".

The primary role of corpora in cross-linguistic research has also been advocated by Stig Johansson, who recently examined the agreement between bilingual dictionary entries and the correspondences observed in the corpus material. By comparing the Norwegian modal particle nok with its English counterpart probably (1998:13), he successfully shows that corpus-based analysis gives a far richer picture of correspondences across languages than dictionaries do. A comparison like the one carried out by Johansson gives new insight into translation and provides a new perspective on the languages compared. In the same paper, Johansson (ibid.:16) also shows very clearly how a linguistic context can affect the meaning of words.
The general conclusions he drew from the analysis of the English noun mind, for instance, are that English and Norwegian tend to refer to mental processes in different ways and that correspondences are highly sensitive to context. His study demonstrates that there is no single preferred Norwegian counterpart of the English noun mind, and that in approximately half of the cases Norwegian opts for a form without a corresponding noun (Johansson ibid.:18).

A detailed example of one possible application of corpora for translational purposes is presented by Margaret Rogers (1997). In her study of synonymy and equivalence in German and English special-language texts, Rogers considers the linguistic behaviour of two sets of potential synonyms in English and German from the domain of genetic engineering, based on a corpus of texts aimed at a scientific but not necessarily expert readership. The analysis resulted in a number of constraints which are of relevance to translators as text creators. Her study also showed that translators should not merely rely on dictionaries, which often present synonyms as decontextualised lexemes, but aim primarily to spot - by means of corpus exploitation - possible relations of overlap and exclusion which are neither logically predictable nor amenable to standardisation procedures.

Guyda Armstrong (1996) also contributed to the implementation of the corpus-based approach in translation (studies). In order to get students to investigate the development of Machiavelli’s political thought, Armstrong, a teacher of Italian at the University of Edinburgh, let second-year students in translation run Machiavelli’s 32 Il Principe through the TACT program (see Appendix 3). The exercise focused on the key concepts of virtù (‘prowess’), fortuna (‘fate’) and the associated concept of prudenzia (‘caution’).
The students investigated these words using various (corpus-based) methodologies, first analysing the words individually and then moving on to more sophisticated collocational searches which included all three items. Finally, the students were asked to compare the distribution graphs of all three words and draw some general conclusions about the overall structure of Il Principe. In the end, students were able to recognise Machiavelli’s lexical choices and the meaning he assigned to them. Armstrong (ibid.), however, points out that the corpus-based approach does not compensate for inadequate preparation, but offers a possibility of looking at the text from a new perspective, maybe discovering unexpected leads which can be followed up elsewhere (e.g. by means of an etymological search, the development of Machiavelli’s political concepts, an analysis of synonyms and associated words, etc.). This kind of work not only supports the translator’s lexical choice in the target language, but can also be of help to students of political science and history, thereby promoting the interdisciplinary use of corpora.

In a very recent study of the translation of the German modal particle doch into French, Feyer (1998) started from the general assumption that all nuances concealed behind German modal particles can basically be expressed in other languages as well. In order to focus on the problems encountered when translating such linguistic items, Feyer (1998:118-124) decided to compile a corpus of written literary texts including a large amount of spoken language, since this kind of artistic production was felt to provide the optimal test-bed for a contrastive analysis. The corpus included major works of Austrian and German authors (Bernhard, Böll, Dürrenmatt, Horváth, Schneider and Konsalik) as well as their translations into French.
Feyer believed that the very expressive writing skills of these ‘word-jugglers’ would be challenging enough for a translator to interpret. In her detailed analysis of linguistic and cultural patterns of the modal particle doch, she convincingly shows that there is almost no lexical correspondence between German and French, but also that the meaning can generally be got across in various ways, depending on the kind of sentence one is dealing with (Feyer 1998:130-259). On a more general note, her study demonstrates that the translator remains the one in charge of deciding which TL solution to opt for, making clear that lexical and semantic variation must not be confused with inaccuracy. Indeed, variation is sometimes even necessary in order to render the translation culture-specific (Feyer 1998:279). Feyer, then, supports the interpretative and creative side of translation, giving an account of how it can be possible to assess both the role and the behaviour of translators, analyse word structures, and develop clever translation strategies by means of a corpus-based approach to translational issues.

What all these studies have in common is a strong belief in the necessity to design translation training courses that focus on processes rather than products, mainly because it is impossible to teach translation trainees all the words or acquaint them with the entire range of texts that they will be confronted with in their professional lives. What they therefore need are strategies that will help them cope with new terminologies and with unfamiliar genres and their conventions. Corpora are seen as tools that allow trained learners to:
♦ solve problems on their own, using the available resources, which should also boost their creativity and resourcefulness, since they will need to learn where to look for solutions to a given problem that may arise in the course of the translation task
♦ develop greater awareness of culture-, situation-, genre- and text-dependent language use
♦ improve their ability to cope in new situations
♦ develop the technical skills necessary for efficient corpus use, such as computing and logical skills

32 Machiavelli, one of the most prominent and linguistically complex politicians of the 16th Century, used to assign new meanings to old words, which resulted in his works being sometimes misunderstood by later generations.

4.2.6 Thinking Globally – Acting Locally

Corpus-oriented studies are going global. Before the advent of the corpus-based approach, the major fields of linguistic study (e.g. grammar and lexicography) were normally strictly separated. Corpora allow scholars to tackle different tasks simultaneously, and thus to unite and integrate different fields of research and approaches. This has improved descriptive cross-linguistic research and has therefore led to more comprehensive and coherent language descriptions (see also Johansson 1998:21).

A greater ‘globality’ - i.e. closer integration of different disciplines - is being promised by the multimedia technologies. Multimedia technologies permit the integration of both spoken and written language - two research fields between which there has traditionally been little cooperation - as well as non-verbal data. The fact that all the different types of data would be stored, analysed and described on a single platform would immensely improve the representation, manipulation and retrieval of corpus data (see also McEnery and Wilson 1996:173). A truly multimedia corpus would for instance allow users to switch between a section of transcribed text and a segment of a video recording showing the interaction, which could then be annotated on many different levels (e.g.
transcription, grammatical analysis of the text, on-line notes describing the social background of the speaker, analysis of the sequence in terms of its discourse structure, an ethnographic description of the context, a detailed analysis of non-verbal elements, etc.).

In the excitement over the vast potential of improved interdisciplinary research we must not forget, however, that specific research projects still require tailor-made solutions. Over the last two decades many (kinds of) corpora have been made available to the international research community. As researchers, our role is to identify our needs and to exploit linguistic resources accordingly, and not merely to assume that sampling criteria and parameters that were outlined for other projects will be applicable to our own studies. In other words, corpus exploitation very much depends on a 'think globally - act locally' philosophy.

4.2.7 Critical Comments

While recent years have seen a considerable increase in the number of corpus-based investigations in translation, not all translation theorists are convinced that corpora can really provide all the solutions, and some have sounded a note of caution (Malmkjær, forthcoming, quoted in Kenny 1998:53). One point that has been made is that corpus exploitation is mostly statistics-oriented, that is, its advantages can be fully understood only by translators au fait with computational linguistics. While I agree that knowledge of statistics is necessary if the use of corpus-based research results is to be maximised, I cannot concur with the critics’ claim that one needs to be a computational linguist to decode and interpret statistical information. If more corpus-based work were to be used during translation training, what would of course need to be done is to include a more comprehensive introduction to computational issues in the translation curriculum.
Provided the focus was on enhancing awareness of the translation process and end users’ needs, this would also help students develop a perception of translational skills not merely as a means to a (highly practice-oriented) end, but as something that should be analysed and discussed from a more theoretical perspective (see also Kohn 1996:48).

A further drawback has been mentioned by Kenny (1998:53). Referring mainly to the use of comparable corpora in literary translation research, she found that, because new genres are often introduced from one literature to another, there may be nothing comparable in the “host literature”. The same problem may arise in non-literary genres in less widely used and taught languages. A case in point is Irish Gaelic, where many (non-literary) genres are modelled on English so that there are no ‘native’ texts with which to compare translations (Kenny ibid.).

One further point of criticism often mentioned is that corpus linguistics has traditionally applied a strict bottom-up approach. Data were collected and statistically evaluated before any theories about generalisable usage patterns were proposed. Most translation theorists, however, have adopted a top-down direction. Theories were drawn up, and only later tested against real-language evidence. While it has been convincingly shown that the two approaches are not mutually exclusive and may well complement each other (see Aston 1997:2; Chafe 1992; Leech 1991; Svartvik 1992; Kohn 1996:48), it also seems that many translation theorists are reluctant to engage in corpus-based research, possibly because this would imply that their theories would have to be restricted to a fairly narrow domain, while traditionally translation theories (e.g. those of Reiss and Newmark) seem to have been all-inclusive and promulgated for all translational events.
4.2.8 Conclusions

I hope that the examples described in this chapter have shown that corpora and the programs available to exploit them are immensely useful tools for translators. Once translators - and translation students - understand the many different types of analyses they can carry out with and on corpora, the ease and efficiency with which such investigations can be conducted should provide sufficient impetus to make them interested in issues that go beyond purely practical applications. Comparisons of larger bodies of texts and their translations should also encourage them - as Venuti puts it - to be ‘suspicious’, and to query the transparency of translations. They should make them eager to find out what is really concealed behind the word, what the author of the text really wanted to express, and what strategies the translators employed in their efforts to render this meaning in the target language. Apart from inspiring such more theoretical interests, work with corpora - and the elaboration of translation strategies this permits - obviously also allows translators to keep up with current developments in language production, and therefore assures both high quality and productivity. Indeed, I feel that corpus-based work is the only way that will ensure this.

5 Case Study

This chapter tries to demonstrate how corpora might be used by translators and translation students to solve a specific linguistic problem. The principal aim of this chapter is to show what difficulties they might encounter when trying to select a suitable corpus and how initial hypotheses may have to be revised following some pilot analyses. It also shows some of the limitations of corpus-based work, and sounds a note of caution regarding the validity of its results. The focus is on procedural issues; other potential applications are described in the previous chapters.
5.1 The Problem

When I tried to decide which kind of case study would provide the most suitable framework to show the kinds of problems that may arise during the investigation of linguistic patterns, I was at first, of course, very tempted to replicate one of the studies that have been carried out within translation studies. However, given the limited scope of an undergraduate thesis, and the fact that most readers of this thesis are unlikely to be familiar with corpus tools, I decided that a more limited case study that focussed on a clearly defined linguistic problem would be better able to demonstrate the pros and cons of corpus applications.

The linguistic problem that I then chose to analyse was the difference between the use of the prepositions tra and fra in Italian. As a native speaker I have often been asked by fellow students which they should use in which contexts. Generally, I was able to tell them which I preferred, yet when asked why, my explanations rarely went beyond “the other one does not sound right”.

In this chapter, then, I shall first present my own hypothesis about the use of tra and fra in the Italian language and describe the reasons why I want to describe their usage patterns in spoken Italian (Section 5.2). Section 5.3 will describe the corpus which was used to test my hypothesis, while Section 5.4 will deal with the tools used in the analysis. In the light of the results of a trial run, I shall then reformulate the claims as to the validity of the study (Section 5.5). The actual study will be presented in Section 5.6, and in Section 5.7 I shall offer possible interpretations of the findings and some concluding remarks.
5.2 Formulation of the Hypothesis

In the Italian language, tra and fra are considered to be synonymous prepositions which basically indicate:³³
♦ a relation between (or among) two or more people or things, as in fra le due possibilità (between the two possibilities) or tra fratelli (among brothers)
♦ a position (in the middle of, amid, amidst), as in tra la folla (in the middle of the crowd)
♦ a movement (through), as in il sentiero s’insinuava fra i monti (the path wound through the mountains)
♦ a time reference (in, within), as in tra due giorni (in two days’ time)

Most people would maintain that the two prepositions are fully synonymous and totally interchangeable. A quick collocation search of a corpus³⁴ of 289,426 tokens, however, produced 452 hits for tra and only 96 hits for fra, which suggests that there is a degree of preference for the former preposition.

My native-speaker hunch has always been that the use of fra is motivated by phonological constraints: fra, I believed, was used to avoid cacophonous repetitions, especially, I conjectured, in speech that was trying to sound more accomplished.³⁵ If my hypothesis was correct, the case study would produce regular cotextual patterns that would show which preposition was preferably used in which environment.

³³ The information included has been gathered from numerous grammar books (e.g. Krenn 1996, Renzi et al. 1995, Dardano and Trifone 1985, Salvi and Vanelli 1992, Levi and Dosi 1982) as well as from a frequency dictionary of contemporary Italian (Bartolini et al. 1971).

5.3 Selecting the Corpus

To test the hypothesis formulated in the previous section I needed to find a corpus which was able to provide data suitable for a qualitative analysis.
Since corpus compilation is one of the most difficult tasks, I thought it was vital to ask more experienced people which kind of corpus they felt would be the most appropriate for my purposes. I therefore posted a message to the ICAME mailing list ([email protected]). My query was answered by two subscribers who suggested two different ways of compiling a suitable corpus. One was Ralf Steinberger, who works as a researcher at the Joint Research Center of the European Commission. He suggested the following:

Dear Andrea,
I can think of two sources for Italian corpora:
1) The ECI corpus, obtainable at ELRA (http://www.icp.inpg.fr/ELRA/cata/tabtext.html).
2) You can download Italian texts from the European Union web sites, as many texts exist in all official EU languages. This is a bit tiresome, but if you only need 200.000 words, you can do this in less than half an hour. One possible site is: http://eu
The latter source is quite EU-biased, of course, so it is certainly not literature. For prose, you may find something at the Oxford Text Archive, but I do not know their internet address. Maybe it is http://www.ox.ac.uk/...
Good luck,
Ralf
(Source: private e-mail correspondence)

³⁴ See Section 5.3 for a detailed description of the corpus mentioned.
³⁵ Some support for my hypothesis is found in the following quote: ”Queste ragioni di eufonia diedero qualche pensiero al Manzoni che, adeguandosi anche in questo particolare all'uso fiorentino del tempo, sostituì i fra della prima edizione dei Promessi Sposi con tra: nel capitolo IX, dove aveva scritto "fra tre o quattro confidenti", per evitare il brutto tra tre, "se l'è cavata correggendo: 'tra quattro o cinque confidenti'. Sennonchè le cifre non sempre son così elastiche come erano per sua fortuna qui!” (D'Ovidio 1933:102, quoted in Serianni 1989:299) [In English: These considerations of euphony gave Manzoni some trouble. Conforming in this detail too to the Florentine usage of the time, he replaced the fra of the first edition of the Promessi Sposi with tra; in chapter IX, where he had written "fra tre o quattro confidenti", he avoided the ugly tra tre and "got out of it by correcting to 'tra quattro o cinque confidenti'. Except that figures are not always as elastic as, luckily for him, they were here!"]

The second was Elisabeth Burr, a lecturer at the Romance Languages Department of the Gerhard Mercator University at Duisburg, who mentioned the possibility of on-line research via tactweb:

Dear Andrea,
I have created two corpora of Italian newspaper language. Part of one of them (ca. 750.000 words) is available via the Oxford Text Archive for teaching and research. You could, however, also use my tactweb page and do your study online. Have a look at: http://www.uni-duisburg.de/FB3/ROMANISTIK/PERSONAL/Burr/burr.htm
You'll find a link from there. The part of the corpus which is online contains about 75.000 words. In the near future I am planning to put more material on-line for a seminar I am teaching. So if you can wait a bit longer, you might be able to get enough material together. The part which is on-line already and what I am going to put there is not POS-tagged, however. I have done some POS-tagging but it still has to be corrected.
All the best for your research
Elisabeth Burr
(Source: private e-mail correspondence)

These sources, however, did not really meet my needs: the use of the ELRA corpus - as well as most of the material from the Oxford Text Archive - is subject to a subscription fee, while an on-line corpus can be neither downloaded nor exploited by means of collocation software other than the built-in search engines, which again did not provide the kind of information I was interested in.

A further source of data was suggested to me by Guy Aston, Associate Professor of English Linguistics at the Scuola Superiore di Lingue Moderne per Interpreti e Traduttori of Forlì, who mailed me concordances of tra and fra from the LIP Corpus as well as a wordlist.
While his material was very useful, I was unfortunately unable to gain access to the entire LIP corpus, so I had to opt for yet another source of electronic texts, the Associazione Liber Liber homepage (http://gsi.it/LiberLiber/index.htm). This choice was mainly due to the fact that this copyright-free source of data allowed me to compile my own corpus by selecting only those texts that I considered appropriate for my study. A further reason was that the Liber Liber Association collects both transcribed spoken texts and literary masterpieces by major Italian - and, exceptionally, non-Italian - writers, which promised to be extremely interesting.

After much further pondering of which type of texts I should choose, I decided to sample the transcribed spoken subcorpus, since I assumed it would be closer to the language used by Italians in unplanned interactions, albeit in a formal setting. It also seemed unconstrained with regard to the lexicon and syntax used and could therefore be assumed to contain different stylistic registers.

The texts I first selected for my corpus comprised all the transcribed records of the Commissione Parlamentare Antimafia (Parliamentary Commission against Mafia Crimes) which were made available to Liber Liber on 15th May 1995. Since this corpus comprised over 1.6 million words,³⁶ it proved unmanageable for my concordancer. I therefore had to select a smaller sub-corpus. The problem that arose at this stage was how representativeness could be ensured in this small sub-selection. In order to overcome this obstacle, I resorted to a little trick. I introduced an extra variable: only the hearings chaired by Tiziana Parenti were included (in total 28 hearings, and 422,590 tokens). This of course makes the study less representative; however, I felt that even this ‘limited’ representativeness was sufficient for the purpose of this study.
Another basic problem I was faced with was that of POS-tagging. Although a tagged corpus would have offered me the chance to look at my corpus from a statistical point of view as well, I decided not to have it tagged. The main reason for my decision was that the kind of analysis I wanted to carry out did not require syntactical information or lexical categorisation. The meanings of both prepositions can easily be extracted from any Italian monolingual dictionary or grammar book. I was, as I stated above, primarily interested in differences in usage patterns between tra and fra in the spoken language.

³⁶ This larger corpus of 1,676,863 tokens was first posted to the ICAME mailing list, and then put on the net for public availability (http://www.bhak-bludenz.ac.at/mdgrosse). The corpus I decided to exploit for my purposes, then, could well be defined as a ‘trimmed’ version of this general corpus.

5.4 Choosing the Tools

Once the corpus was compiled, I proceeded to the selection of the tools for its exploitation. At present, the two most comprehensive concordancing programs running in a Macintosh environment are Conc 1.80b3 and SysConc 2.5. Both text browsers load the entire text into memory for processing and can therefore handle only relatively small corpora, which, as explained above, was the main reason why I resized my corpus. Conc is a statistics-oriented research concordancer developed in 1996 by John Thomson at the Summer Institute of Linguistics in Dallas. It is very fast, and produces both KWIC concordances and indices (see Appendix 3 for further details).
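To make the notion of a KWIC concordance more tangible, the following is a minimal sketch in Python of what such a program does with a text held in memory. It is not a reconstruction of Conc or SysConc: the function name kwic, the width parameter and the toy sample string (pieced together from short fragments of the kind found in the corpus) are assumptions introduced purely for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit of `keyword`, with `width` characters of cotext on each side."""
    # \b restricts the search to whole words, so "tra" does not match inside "trattamento".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left cotext so that the key word forms a vertical column.
        lines.append(f"{left:>{width}} {match.group(0)} {right:<{width}}")
    return lines

# Hypothetical toy sample, not the actual corpus.
sample = ("il tema del rapporto tra economia e finanza, "
          "una questione di rapporti fra Governo e Parlamento, "
          "senza distinzione tra maggioranza e opposizione")

for line in kwic(sample, "tra"):
    print(line)
```

Because the whole text is kept in one string, the sketch shares the limitation mentioned above: it is only practical for relatively small corpora.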
SysConc has been developed by Christian Matthiessen and Canzhong Wu, respectively Associate Professor and Research Assistant at the Natural Language Laboratory of the Speech, Hearing and Language Research Centre of the Department of Linguistics at Macquarie University, Sydney. Although SysConc cannot browse text files bigger than 2 megabytes, that is to say about 300,000 tokens in MS Word format, its information output is much better structured (i.e. through bar graphs, frequency maps and hierarchies) than Conc’s, which means that regular patterns of language use may be spotted more easily. It also allows you to perform collocational searches (search of two items in a pre-established collocational range, with or without wildcards) and a feature search (search of a number of items, with or without wildcards), with the possibility to highlight irregular verb forms. Although some of Conc’s features were also interesting, such as its split-screen display of text and concordance, the potential of SysConc as well as its user-friendly interface convinced me that it was more suitable, and I chose it for my analysis.

5.5 Summarising the Restrictions

Before I start the actual analysis I shall summarise the main issues discussed above:
♦ This analysis focusses on differences between tra and fra in spoken Italian. It does not attempt to produce statistical data, or data that will hold true for all modes (spoken and written) and all text types and genres.
♦ As far as the size and the representativeness of the corpus are concerned, it has to be admitted that this study cannot be considered a deep analysis of these two Italian prepositions. Nonetheless, its results can still provide quite significant information: a frequency of 452 hits in a corpus of 300,000 tokens is sufficiently high to provide a good basis for hypothesis testing.
Moreover, the fact that the guiding principle in text selection was maximum consistency (achieved by including only the hearings chaired by Tiziana Parenti) should also ensure maximum corpus validity.³⁷
♦ Another problem, which I have not yet mentioned because it is not directly related to sampling criteria, is the question of whether or not a collection of transcribed hearings can be considered a true representation of spoken language. The corpus I used seems to be heavily normalised: common features of spoken language such as pauses, interruptions and false starts have been edited out. Despite these shortcomings, it still appears to be an accurate enough reflection of spoken language in a formal setting.

³⁷ Of course, homogeneity is a double-edged sword: while data become more credible, the findings can no longer be generalised unconditionally as they might result in a misleading description of language use.

5.6 The Study

Essentially, the case study aims to test two sub-hypotheses:
♦ that tra and fra are synonyms
♦ that avoidance of cacophony is the primary factor determining their use in spoken Italian.

In this section, I shall first attempt to address the question of synonymy, and then discuss some points that support my second hypothesis.

5.6.1 Synonymy

In order to establish whether the two prepositions are synonymous, I searched the corpus for any similarities concerning the cotexts in which fra and tra occur. As far as syntax is concerned, even a rather superficial analysis of the frequency tables produces interesting results: the cotexts of tra and fra are very similar.
[Figure: Frequency table of the Italian preposition tra]
[Figure: Frequency table of the Italian preposition fra]

The frequency tables show the number of tokens of all words that collocate with each of the two prepositions, and also the right-hand and left-hand collocates. As can easily be gathered from the tables reproduced above, the same grammatical classes precede and follow tra and fra: the first collocate on the left is mostly a noun, while the first collocate on the right is normally an article or a pronoun.

Similarities in the semantic structure, on the other hand, cannot be extrapolated from a simple frequency list. Even if frequency tables contain various hints, a more detailed analysis of collocations is required. After a first analysis of left-hand and right-hand collocates, the hypothesis that the two prepositions are fully synonymous seems to be supported: quite a number of nouns, including rapporto/i, distinzione, collegamento, coordinamento, are followed as often by tra as by fra.

...che il tema del rapporto tra criminalità organizzata ed effetti...
...approccio con il grande tema del rapporto tra economia, finanza e...
...una ricognizione sul tema del rapporto fra mafia ed enti locali…
...ma è una questione di rapporti fra Governo, Parlamento e…
...almeno qui, avessimo chiara la distinzione tra Governo e Stato...
...tutti assieme, senza distinzione tra maggioranza e opposizione...
...la giusta distinzione fra i pubblici ministeri è evidente che esisterà...
...riguardano: la distinzione fra intermediari finanziari ed i soggetti...
...come "ufficiale di collegamento" tra i paesi dell'Unione europea e...
...che bisogna creare un collegamento tra istituzioni governative e...
...daremo avvio ad un collegamento fra tutti i paesi amici per...
...era emerso alcun collegamento fra queste persone e la criminalità...
...occuparsi del coordinamento tra l'azione dello Stato e quella svolta...
...che ha compiti di coordinamento tra gli enti governativi e quelli non...
...possibilità che il coordinamento fra le forze di polizia possa essere...
(Source: Mafia Corpus, 1998)

The only problem that remains is: if the two prepositions are true synonyms and interchangeable in all contexts, why then are there 452 occurrences of tra and only 96 occurrences of fra in the corpus? If the difference between them is neither semantic nor syntactic, what motivates their choice? My second hypothesis is that the use of fra and tra is guided by phonological constraints. This hypothesis will be tested in the next section.

5.6.2 Cacophony

Before entering into a detailed discussion of cacophony, I shall give a very brief introduction to some basic phonological concepts: the first consonant in fra is a labio-dental fricative; the initial consonant in tra is a dental plosive. The concatenation of identical sounds is generally considered to be cacophonous in Italian, while the alternating use of fricatives and plosives is seen as more euphonious.³⁸

³⁸ See also Serianni 1989:298-299

This hypothesis is borne out by the following examples in my corpus:

...per proporre intese fra tutti i paesi per arrivare ad una armonizza...
... dopodomani, la prossima settimana, fra due settimane e fra tre mesi...
...Questi casi, fra l'altro, sono apparsi su tutti i giornali...
... si articola lungo più direttrici tutte fra loro strettamente connesse...
...l'effettivo isolamento del detenuto. Fra questi si annoverano quelli...
... mafiose, prime fra tutte le attività economiche e finanziarie....
...di infiltrazioni, di relazioni fra settori economici, istituzionali,...
... riguarda i rapporti intrattenuti fra i detenuti ed il mondo esterno...
... nuovo rapporto che ha cercato di instaurare fra cittadino e Stato...
...senz’altro si rileva uno scarto fra l’entità del fenomeno e la quantità...
(Source: Mafia Corpus, 1998)

There were, however, other examples in my corpus which did not support this hypothesis:

… lo svolgimento delle elezioni in Germania, tra i quattro o cinque paesi…
…sono stati assunti), primo tra tutti la revisione della legge che consente…
… credo che lo scarto tra entrate ed uscite annue sia elevatissimo…
…cultura della legalità soprattutto tra i giovani, in particolare nella scuola…
…Ricordo la drammatica notte tra il 19 e il 20 luglio 1992, quando i ministri…
… al coordinamento tra attività "ordinarie" e "antimafia" nelle…
…Un disegno di legge si è infranto tra le proteste delle organizzazioni…
…sicurezza che non rientra tra quelle riservate ai detenuti sottoposti…
…il trait d'union tra il detenuto e il tribunale di sorveglianza…
…di cui da tempo parliamo, tra struttura e personale addetti alle indagini…
…la risposta: il contatto tra magistrati e pentiti, per le ragioni indicate…
…momento di attrito tra il potere giudiziario e quello amministrativo…
(Source: Mafia Corpus, 1998)

In total, the distribution of fra and tra across ‘euphonic’ and ‘cacophonous’ cotexts was as follows:

Preposition   Total occurrences   Euphonic   Cacophonous
FRA                          96         27             5
TRA                         452         35            41

The rest of the occurrences can be considered neutral, that is, no dental plosive or fricative consonant occurred in the immediate cotext.

With regard to fra my initial hypothesis seems to be confirmed: out of 96 occurrences only 5 instances can be considered cacophonous. A further point in favour of the hypothesis is the fact that all set phrases and idioms present in the corpus actually avoid cacophony (e.g. ‘tra virgolette’ instead of ‘fra virgolette’).
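The figures in the table above are easier to compare once the neutral occurrences and the share of each category have been worked out. The short Python sketch below does only this arithmetic on the counts reported in the table; the names counts and summarise are introduced here for illustration, and the result is a descriptive summary, not a statistical test.

```python
# Counts copied from the table above (Mafia Corpus, 1998).
counts = {
    "fra": {"total": 96, "euphonic": 27, "cacophonous": 5},
    "tra": {"total": 452, "euphonic": 35, "cacophonous": 41},
}

def summarise(row):
    """Derive the neutral count and the share of each category for one preposition."""
    # Neutral occurrences are whatever is left after the two marked categories.
    neutral = row["total"] - row["euphonic"] - row["cacophonous"]
    shares = {
        "euphonic": row["euphonic"] / row["total"],
        "cacophonous": row["cacophonous"] / row["total"],
        "neutral": neutral / row["total"],
    }
    return neutral, shares

for preposition, row in counts.items():
    neutral, shares = summarise(row)
    formatted = {name: f"{value:.1%}" for name, value in shares.items()}
    print(preposition, "neutral:", neutral, formatted)
```

On these counts, 64 occurrences of fra and 376 occurrences of tra are neutral, and the cacophonous share is about 5.2% for fra against about 9.1% for tra. Given the size and composition of the corpus, these proportions should be read as an indication only.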
The collocation results obtained for tra, by contrast, do not really confirm my hypothesis. There may be several reasons why repetition of tra occurred:
♦ Pauses between tra and the succeeding cotext (continuity of discourse): because the corpus omits pauses, interruptions and false starts, it is impossible to assess to what extent this might have had an effect on the results. It seems reasonable to assume, however, that language processing constraints play an important role, i.e. that utterance planning up to and including the preposition was completed before the remainder of the sentence was planned. As tra appears to be the default choice, this obviously leads to repetition of sounds if the lexical item that is later chosen as the one that can most appropriately construe the intended meaning contains dental plosives. To what extent this may be true would, however, need to be verified with an appropriate corpus and through additional experiments, which is beyond the scope of this thesis.
♦ Emphasis of a statement: the t(r) sound may be deliberately repeated to focus attention on this part of the sentence. No similar effect can be achieved through repeating the fricatives ‘f/v’, since these consonants cannot be pronounced as loudly as dental plosives can.
♦ Easier pronunciation: repeating dental plosives is easier because many Italian words and word clusters feature dental plosives (e.g. ministro, struttura, tra l’altro).

My corpus contained a total of 60 occurrences of dental plosive repetition, some of which are reported below:

...per esempio, tra magistrati di vari gradi, tra magistrati che si...
...Vanno considerati, tra gli altri, i limiti di resistenza umana;...
...la separazione, di cui da tempo parliamo, tra struttura e personale...
…soffermarmi sui rapporti tra la distrettuale, le procure ed i tribunali...
...gruppo di lavoro interministeriale (tra ministro dell'interno e ministro...
...credo che lo scarto tra entrate ed uscite annue sia elevatissimo...
...del trattamento; tra l'altro, il magistrato di sorveglianza decide...
...provvedimento del giudice, tra l'altro motivato, per poterlo limitare...
...regime dell'articolo 41-bis. Tra l'altro, di questo mi dà conferma l'ultima...
...vigente e di consentire, tra l'altro, il ricorso a strumenti di indagine...
...rivisitata, tenendo conto tra l'altro delle oggettive difficoltà...
(Source: Mafia Corpus, 1998)

5.7 Conclusions

On the basis of the findings presented, it is fair to conclude that the Italian prepositions tra and fra are synonyms of each other. The cacophony hypothesis, on the other hand, could not be fully verified. It appears to be supported by the occurrences of fra in the corpus; tra, however, requires further investigation.

Even though the results may not be what I had hoped to achieve, I felt that by describing very faithfully how I went through the various steps, from initial hypothesis formulation to corpus selection and final interpretation of the results, and what problems I encountered during the process, I could perhaps demonstrate more realistically the advantages, as well as the pitfalls, of corpus-based analysis.

PART III CONCLUSION AND OUTLOOK

6 Drawing Conclusions

In the previous chapters, I have dealt in detail with the most crucial arguments in favour of - and against - the application of corpora in language and translation studies. In doing so, I have tried to discuss the issues from a variety of different perspectives, focussing first on more general aspects before providing specific examples. This last chapter looks to the future.
It summarises the main implications of a corpus-based approach and makes suggestions for new fields of application, both in linguistics and translation research. It tries above all to get across the one message that to me seems to be the most important one of all, which is: times are changing, and so are corpora - and hopefully our approach to teaching translation.

The Discipline of the Future

What we hear and read is so often mediated language that it is probably fair to say that exposure to translated material is now a regular feature of most people's daily existence. Given that this trend is likely to continue in the new millennium, I believe that it is high time that translators and translation scholars as well as linguists and lay people started to rethink and reconsider their views of what translation entails and how translation studies should be conducted.

Linguists in particular need to recognise that translation is a central mode of communication in modern societies. So far, their attitude towards translation has been at best ambivalent and at worst dismissive, short-sighted, and highly prescriptive. If they considered translation at all they generally focussed on how linguistics could be employed to ‘put matters right’, rather than on translation as a phenomenon in its own right, which does not necessarily have to conform to the linguist's preconceived ideas of what counts as correct or incorrect use of language. Seeing translation as a skill which can be improved through enhanced sensitivity to linguistic patterns is of course a legitimate view. However, it is also a rather limited and unsophisticated perspective, given the much more productive role theoretical linguists could play in translation studies.
The growing interest within translation studies in exploiting corpus linguistics for a variety of translation-related analyses, including the examination of translation-specific features of language use (e.g. ‘translation universals’), should provide sufficient motivation for linguists to enter into more fruitful partnerships with translation scholars that are aimed at developing descriptive methodologies for translation studies.

Translators and translation teachers will also need to revise their views and methods. One of the major aims of this thesis has been to show the benefits of the implementation of corpus-based techniques in translation research and teaching. These resources will only be fully exploited, however, if there is a basic willingness to change the status quo, and if there exists a consistent institutional policy that encourages such change. Obviously, effecting innovative strategies will be difficult and those in charge of course design will have to be ready to take risks, as it may not yet be possible to enshrine the use of novel technologies in translation curricula.

There are plenty of examples of institutions that have been prepared to confront the challenge posed by the new technologies and which have developed pioneering projects. At the Consiglio Nazionale delle Ricerche (CNR) in Pisa, for instance, Peters and Picchi (1997:267-271) have integrated a lexical database and a text management system into a prototype workstation.

The system includes many different components which can be exploited by the translator and the lexicographer, by the language learner, or by any user interested in using to the full the possibility of being able to dynamically access, browse, and extract the different kinds of linguistic information contained in dictionary and text databases. (Peters and Picchi 1997:271)
Given the potential of universities in terms of available human and technical resources, it is difficult to understand why they should not engage in similar projects. This is particularly true of the School for Translators and Interpreters at Graz University. Even though our School has shown that it is aware of the great importance of new technologies by obtaining a campus licence for a major Translation Memory tool, and although students have ready access to a variety of concordancing programs and statistical software, only a few translation classes make use of corpus-based tools and electronically available sources. As a consequence, the number of students attending tutorials aimed at familiarising them with TM and other tools is very limited.

I believe that no professional translator today can afford not to use a computerised environment: computer literacy is a must for anyone entering the translation market. I also believe that institutions training translators have an obligation to show the students which computer-based resources are available and how they might help them improve the quality of their output, both during their course and - perhaps even more importantly - also later, when they are given their first professional assignments.

There is a host of different ways in which this could be achieved. Using corpora and concordancing programs in the translation class would be one possible approach. This way, students would be introduced to data cataloguing, including parallel text management, semantic and lexical disambiguation, stylistic analyses, etc. Another would be the implementation of TM and MT routines in the translation class, which, apart from its obvious practical benefits, would have the additional advantage of allowing the department to compile a huge parallel (or even multilingual) corpus made up of original texts and students’ translations.
A further area where it is easy to see possible applications of corpus linguistics is that of language acquisition. Learners’ corpora could be compiled in the more language-acquisition oriented classes, which would represent very interesting material for a variety of applied linguistics research projects. The results of such analyses could then be used as the basis for computer-based self-instruction exercises.

Quite apart from the didactic potential of such projects, they would, I believe, also improve the reputation of the university as an innovative research institution which keeps abreast of new developments in order to meet the increasingly exacting standards of the professional world.

One final argument in favour of a corpus-based approach is, I believe, the great motivational potential of corpora in the translation class. Students who discover language through corpora are constantly challenged as they are obliged to analyse texts and reflect on the linguistic and textual evidence they find, to make decisions and explain their choices, and to query and justify their own textual production. This ability to reflect and to challenge received views is among the most important objectives of third-level education. Corpora and corpus-based methodologies, I believe, can greatly contribute towards attaining this goal.

APPENDICES

1 Glossary

♦ Alignment The practice of defining explicit links between texts in a parallel corpus.
♦ Annotation The practice of adding explicit additional information to machine-readable texts, as well as the physical representation of such information.
♦ ASCII (American Standard Code for Information Interchange) A numerical coding system for computerised text.
When people refer to a computer document being ‘in ASCII’, they usually mean that it consists only of the characters that fall within the near-universally adopted lower range of ASCII codes, 0-127, which cover unaccented Latin characters, Arabic numerals, and a basic range of punctuation. Such files, which may also be referred to as ‘text only’, present far fewer problems than formatted word-processor files when it comes to manipulating data with different types of software and on different computing platforms.
♦ Behaviourism Psychological doctrine developed at the end of the 19th century which focused exclusively on observable behaviour. The most valuable achievement realised by this discipline was to exclude introspection from scientific study. John Watson - probably the first real behaviourist - typified the approach and dismissed introspection as untestable: he was convinced that the study of language had to be based on objectivity, that is, that the only valid scientific approach was to limit study to specific stimuli and the consequent observable peripheral muscular and glandular responses. Alongside Watson, who actually developed a complete behavioural theory, other behaviourists such as Hull, Tolman and Skinner should also be mentioned.
♦ COBUILD COBUILD is an acronym for COllins Birmingham University International Language Database. This is a joint project between industry (HarperCollins Publishers) and the University of Birmingham, which began in 1980. A large corpus of contemporary English was gathered from spoken and written sources, and each word in turn was studied for its lexical, grammatical, semantic, stylistic and pragmatic features. The information was entered into a database from which the Cobuild dictionaries and other publications were edited.
♦ COCOA Reference: A balanced set of angled brackets (<>) containing two things: a code standing for a particular type of information, and a string or set of strings, which are the instantiations of that information.
♦ Colligation: Collocation patterns based on syntactic groups rather than individual words.
♦ Compile: Collect and put together (e.g. texts for a corpus).
♦ Concordancer: A program which identifies a pattern (usually a word) within a text, and prints out instances of its occurrence along with a specified amount of context.
♦ Corpus: A collection of naturally occurring language text, usually in machine-readable form and compiled to be representative of a particular kind of language.
♦ Co-text: The co-text of a selected word or phrase consists of the other words on either side of it. This is a more precise term than context or verbal context, but it is not much used.
♦ KWAL (Key Word and Line): A form of concordance which can allow several lines of context either side of the key word.
♦ KWIC (Key Word In Context): The most common type of concordance output, in which the search item, or key word, is presented with a single line of context. When several lines of output are presented, the key word is aligned vertically, giving the impression of a column.
♦ Lemma: The headword form that one would look for if looking up a word in a dictionary, e.g. the word-form eats belongs to the lemma EAT.
♦ Lemmatisation: The process or result of assigning each word-form in a text to its lemma.
♦ Machine-readable: A term to describe textual resources which have been stored on computer. It refers specifically to text which has been encoded as characters, rather than images (such as a fax).
♦ Match: When your search string is found in the corpus, it is referred to as a match or hit.
♦ Mailing List: A mailing list is an e-mail-based bulletin board. E-mails are sent to a particular site for inclusion in an electronic mailshot.
When the administrator of the mailing list feels that a new mailshot is ready, the collected messages are posted to people who have specifically subscribed to the mailing list.
♦ Natural Language: Term used for human language, as opposed to artificial languages used, for example, for computer programming and formal logic (e.g. PROLOG).
♦ Parsing: A form of grammatical analysis which represents all of the grammatical relationships (syntactic structures) within a sentence.
♦ Running Words: This term is used in measuring the length of a text. Each successive word-form is counted once, whether or not that particular form has occurred before. For example, the sentence “Andrea is a very cool guy.” contains 6 running words.
♦ SGML (Standard Generalised Mark-up Language): Mark-up system used for electronic texts.
♦ Sublanguage: A constrained variety of a language. Although a sublanguage may be naturally occurring, its key feature is that it lacks the productivity generally associated with language.
♦ String: Combination of letters/characters.
♦ Structural Linguistics: At the beginning of the 20th century, attention shifted to the fact that not only language change, but language structure as well, is systematic and governed by regular rules and principles. The attention of the world's linguists turned more and more to the study of grammar, understood as the organisation of the sound system of a language and the internal structure of its words and sentences. By the 1920s, the program of 'structural linguistics', inspired in large part by the ideas of the Swiss linguist Ferdinand de Saussure, was developing sophisticated methods of grammatical analysis. Structural linguistics focused on the synchronic analysis of language and contributed greatly to the evolution of phonology.
Major structural schools were the Prague School (Trubeckoj), the Copenhagen School (Hjelmslev) and American structuralism (Bloomfield).
♦ Tag: A code attached to words in a text representing some feature or set of features relating to those words.
♦ Tagger: A program which assigns labels to words or other units in a machine-readable text. Currently the most common type of tagger is one which assigns part-of-speech labels, typically using a probabilistic algorithm based on frequencies observed in previously tagged, or annotated, text corpora.
♦ TEI (Text Encoding Initiative): An international project to define standards for the format of machine-readable texts.
♦ Text: Continuous spoken or written language.
♦ Treebank: A corpus which has been annotated with phrase structure information.
♦ Universals of translation: Linguistic features typically occurring in translated rather than original texts. They are thought to be independent of the influence of the specific language pairs involved in the process of translation. (Baker 1993:243)
♦ Word-Form: This term is used for any unique string of characters, bounded by spaces. Hence eat, eating, ate, eaten are all different word-forms of the same lemma (eat).

2 Major Corpora Available

This appendix is by no means an exhaustive list. It merely aims to provide an insight into the major corpora available at the time of writing, as well as a contact address for further information on each corpus. The list is divided into three main categories (written, spoken, written and spoken) and arranged alphabetically. The main features of every entry are highlighted, so that parsed, tagged, historical or any other kind of specialised corpora can be easily identified.

Written

♦ The Aarhus Corpus of Contract Law
Features: multilingual corpus made up of three 1,000,000-word subcorpora of Danish, English and French respectively.
Texts are taken from the area of contract law. This is not a parallel corpus.
Contact: The Aarhus School of Business, Fuglesangs Allé 4, DK-8210 Aarhus V, Denmark.
♦ The ACL/DCI Corpus (Association for Computational Linguistics/Data Collection Initiative)
Features: monolingual corpus of 63 million words of written American English (40 million words from the Wall Street Journal, 23 million words from scientific abstracts).
Contact: Department of Linguistics, University of Pennsylvania, Philadelphia, PA 19104 USA.
♦ The American Printing House for the Blind Corpus (APHB)
Features: monolingual treebanked corpus of fiction text produced for IBM USA at Lancaster University.
Contact: not available for research purposes.
♦ The Augustan Prose Sample
Features: historical corpus of about 80,000 words of British English reading material from between c.1675 and 1705.
Contact: Oxford Text Archive, Oxford University Computing Service, 13 Banbury Rd., Oxford, OX2 6NN (e-mail: [email protected]).
♦ The Australian Corpus of English (ACE)
Features: 1-million-word monolingual corpus of Australian English, compiled to be comparable with the Brown Corpus.
Contact: School of English, Linguistics & Media, Macquarie University North Ryde NSW 2109, Australia.
♦ The BAF Corpus
Features: French-English bitext of about 400,000 words per language. It comprises four subcategories:
- Four institutional texts (including a representative excerpt of the so-called Hansard corpus), with a total size close to 300,000 words per language;
- Five scientific articles of about 50,000 words per language each;
- One piece of technical documentation, with 39,328 English words against 46,828 French ones;
- Jules Verne's novel “De la terre à la lune” (40,161 English words vs. 53,181 French words).
This corpus is very interesting because the translations are sometimes divergent (75% of 1-to-1 patterns).
In fact, it is not even clear whether the English version is really a translation of the French one or whether it has been translated from an abridged version. The English version has a lot of missing segments.
Contact: RALI, Département d'Informatique et recherche opérationnelle, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal (Québec), Canada, H3C 3J7. Team leader is Pierre Isabelle (e-mail: [email protected]). The BAF corpus has its own webpage at http://www-rali.iro.umontreal.ca/arc-a2/BAF/
♦ The Brown Corpus
Features: monolingual corpus of about 1 million words of written American English dating from 1961, including many different registers.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Canadian Hansard Corpus
Features: a corpus of proceedings from the Canadian parliament. The corpus is a parallel French-English corpus of about 750,000 words of each language. The English version of the corpus has been part-of-speech tagged and parsed at Lancaster University.
Contact: Department of Linguistics, University of Pennsylvania, Philadelphia, PA 19104, USA (raw text corpus only!). The parsed and tagged version is not available for distribution.
♦ The Crater Corpus (ITU Corpus)
Features: a trilingual parallel corpus of French, English and Spanish from the telecommunications domain. It is available in part-of-speech tagged, lemmatised and aligned form.
Contact: Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, UK.
♦ The CURIA
Features: an ongoing text collection project sponsored by the Royal Irish Academy to make available machine-readable texts in the several languages used in Ireland during its history - Irish (both old and modern), Hiberno-Latin and Hiberno-English.
Contact: Royal Irish Academy, Dawson Street, Dublin, Ireland (e-mail: [email protected]). An e-mail discussion list provides periodic updates on the work.
♦ The Freiburg Corpus
Features: monolingual corpus of about 1 million words of written British English from material published in 1991. The corpus aims to parallel as closely as possible the contents of the LOB, in order to enable the study of language change in the 30 years separating the two corpora.
Contact: Institut für Englische Sprache und Literatur, Albert-Ludwigs-Universität, D-7800 Freiburg, Germany.
♦ The Guangzhou Petroleum English Corpus
Features: a sublanguage corpus of 411,612 words of written English from the petrochemicals domain.
Contact: Guangzhou Training College of the Chinese Petroleum University, Guangzhou, China.
♦ The Helsinki Diachronic Corpus
Features: historical corpus of about 1.5 million words from 850 to 1710. The corpus is divided into 3 periods and 11 subperiods and covers many registers.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Helsinki Corpus of Early American English
Features: historical corpus of about 500,000 words of North American English from the late 17th and early 18th centuries.
Contact: Department of English, University of Helsinki, Porthania 311, 00100 Helsinki, Finland.
♦ The Helsinki Corpus of Older Scots
Features: historical corpus of 830,000 words from 15 registers dated from 1450 to 1700.
Contact: Department of English, University of Helsinki, Porthania 311, 00100 Helsinki, Finland.
♦ The Hong Kong University of Science and Technology (HKUST) Learner Corpus
Features: learner corpus of about 6 million words (with on-going collection) of written undergraduate assignments and “A” level Use of English scripts from the Hong Kong Examination Authority.
Contact: Language Center, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
♦ The Innsbruck Computer Archive of Middle English Texts
Features: historical corpus of about 2 million words of Middle English prose from 1100 to 1500. Texts are arranged alphabetically.
Contact: [email protected].
♦ The International Corpus of Learner English (ICLE)
Features: learner corpus of about 1 million words of written English texts from nine different language backgrounds: Chinese, Czech, Dutch, Finnish, French, German, Japanese, Spanish, and Swedish.
Contact: University of Louvain, B-1348 Louvain-la-Neuve, Belgium.
♦ The Kolhapur Corpus
Features: monolingual corpus of 1 million words of written Indian English from 1978. The corpus uses the same genres and proportions as the Brown Corpus and the LOB Corpus.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Lampeter Corpus of Early Modern English Tracts
Features: historical corpus of about 500,000 words of pamphlet literature dating between 1640 and 1740. This corpus contains whole texts rather than smaller samples from texts.
Contact: TU Chemnitz-Zwickau, D-09107 Chemnitz, Germany.
♦ The Lancaster-Leeds Treebank
Features: a subsample of about 45,000 words taken from the LOB corpus.
The corpus is tagged for part-of-speech and fully parsed.
Contact: Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, UK.
♦ The Lancaster-Oslo/Bergen Corpus (LOB)
Features: monolingual corpus of about 1 million words of written British English, all published in 1961. Many different registers are included. The genre categories are parallel to those of the Brown corpus. The entire corpus has been part-of-speech tagged, and various subsamples have also been parsed (see: Lancaster Parsed Corpus; Lancaster-Leeds Treebank).
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Lancaster Parsed Corpus
Features: 133,000 words from the LOB Corpus that have been syntactically analysed.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Longman-Lancaster Corpus
Features: monolingual corpus of about 30 million words of written British and American English covering a broad range of subject fields from the early 1900s to the 1980s.
Contact: Longman Dictionaries, Longman House, Burnt Mill, Harlow, Essex, CM20 2JE UK.
♦ The Melbourne-Surrey Corpus
Features: monolingual corpus of 100,000 words from Australian newspapers.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Newdigate Newsletter Corpus
Features: historical corpus of 750,000 words of manuscript newsletters from 1674 to 1692.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Scottish Dramatical Texts Corpus
Features: monolingual corpus of about 101,000 words of drama in traditional and Glaswegian Scots.
Contact: School of English, The Queen’s University of Belfast, Belfast, BT7 1NN, UK.
♦ The SUSANNE Corpus
Features: a part-of-speech tagged, parsed and lemmatized subset of the Brown corpus (about 128,000 words) and the LOB corpus.
Contact: Oxford Text Archive, Oxford University Computing Service, 13 Banbury Rd., Oxford OX2 6NN, UK (e-mail: [email protected]).
♦ Thesaurus Linguae Graecae (TLG)
Features: a machine-readable collection of most of ancient Greek literature.
Contact: TLG Project, University of California at Irvine, Irvine, CA 92717-5550, USA.
♦ The Tosca Corpus
Features: a monolingual corpus of about 1,500,000 words of written English dating from between 1976 and 1986. The corpus is part-of-speech tagged and parsed.
Contact: Department of English, University of Nijmegen, Erasmusplein 1, NL-6525 HT Nijmegen, The Netherlands.
♦ The Zurich Corpus of English Newspapers (ZEN)
Features: historical corpus of London newspapers from the mid 1660s to the beginning of the twentieth century.
Contact: University of Zurich, Plattenstraße 47, CH-8032, Zurich, Switzerland.

Spoken

♦ The Corpus of Spoken American English (CSAE)
Features: this monolingual corpus (still under construction) aims to reach a size of 200,000 words of spoken American English.
Contact: Department of Linguistics, University of California at Santa Barbara, Santa Barbara, CA 93106, USA.
♦ The Helsinki Corpus of English Dialects
Features: a dialect corpus of about 245,000 words of spoken English from several regions of England. The speakers are elderly rural people in conversation with fieldworkers.
Contact: Department of English, University of Helsinki, Porthania 311, 00100 Helsinki, Finland.
♦ The IBM-Lancaster Spoken English Corpus (SEC)
Features: monolingual corpus of 52,000 prosodically annotated and part-of-speech tagged words of spoken British English, mostly from BBC recordings. The Machine-Readable Spoken English Corpus (MARSEC) is a version of the SEC which exists in the form of a relational database and also includes some additional information, such as phonetic transcription.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The London-Lund Corpus
Features: monolingual corpus of about half a million prosodically annotated words of spoken British English collected in the 1960s and early 1970s. The corpus includes mainly conversational genres, with some additional categories such as legal proceedings and commentary added later.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ The Northern Ireland Transcribed Corpus of Speech (NITC)
Features: dialect corpus of about 400,000 words of spoken material from 42 locations and three age groups (children, middle-aged and elderly). The data represent conversations with fieldworkers.
Contact: Oxford Text Archive, Oxford University Computing Service, 13 Banbury Rd., Oxford OX2 6NN, UK (e-mail: [email protected]).
♦ The Polytechnic of Wales Corpus (POW)
Features: monolingual corpus of 61,000 words of children’s spoken language.
The corpus has been parsed using Hallidayan Systemic-Functional Grammar.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).

Written and Spoken

♦ A Representative Corpus of Historical English Registers (ARCHER)
Features: historical corpus of about 2 million words of British and American English covering the period from 1650 to 1990. Both written and speech-based registers are available.
Contact: Douglas Biber, Department of English, Northern Arizona University, Flagstaff, AZ 86011-6032, USA (e-mail: [email protected]).
♦ The Bank of English
Features: constantly growing monitor corpus of more than 200 million words of British English (mostly written), built by Collins COBUILD at Birmingham University. The data have been part-of-speech tagged and parsed.
Contact: The Bank of English, Westmere, 50 Edgbaston Park Road, Birmingham B15 2RX, UK.
♦ The Birmingham Corpus
Features: a monolingual corpus of about 20,000,000 words (approximately 90% written and 10% spoken). The corpus consists mainly of British English, although some other varieties are also represented.
Contact: The Bank of English, Westmere, 50 Edgbaston Park Road, Birmingham B15 2RX, UK.
♦ The British National Corpus (BNC)
Features: monolingual corpus of about 100 million words of British English (90 million written, 10 million spoken) covering many different registers. The entire corpus is part-of-speech tagged, while only a one-million-word subset is parsed.
Contact: British National Corpus, Oxford University Computing Service, 13 Banbury Rd., Oxford OX2 6NN, UK (e-mail: [email protected]).
♦ The CHILDES Project
Features: collection of children’s spoken and written language and language pathologies.
The samples are mainly American and British English, but other languages are also represented.
Contact: CHILDES Project, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, USA (e-mail: [email protected]).
♦ The International Corpus of English (ICE)
Features: a collection of 1-million-word corpora - one written and one spoken - of different varieties of English. Samples are collected in each country or region in which English is a first or major language (e.g. East Africa, Australia, New Zealand, as well as the UK and USA). Collection is still in progress.
Contact: Survey of English Usage, University College London, Gower Street, London WC1E 6BT UK.
♦ The Nijmegen Corpus
Features: monolingual corpus of about 130,000 parsed words of written and spoken British English (120,000 written, 10,000 spoken). The spoken part consists of transcribed sports commentary.
Contact: TOSCA Group, Department of Language and Speech, University of Nijmegen, Erasmusplein 1, NL-6525 HT Nijmegen, The Netherlands (e-mail: [email protected]).
♦ The Penn Treebank
Features: a monolingual, part-of-speech tagged and parsed corpus consisting primarily of articles from the Wall Street Journal but also including some samples of spoken language.
Contact: Penn Treebank, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA.
♦ The Survey of English Usage (SEU)
Features: monolingual corpus of about 1 million words of British English collected from 1953 to 1987, divided evenly into spoken and written. The spoken texts make up the London-Lund Corpus.
Contact: Survey of English Usage, University College London, Gower Street, London WC1E 6BT UK.
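
The sizes quoted throughout this appendix are word counts, presumably in the sense of the Glossary's 'running words'. The following Python sketch, which is not taken from any of the tools described in this thesis, illustrates how running words and distinct word-forms could be counted. The function names, the whitespace-based tokenisation, the stripping of surrounding punctuation and the lower-casing of word-forms are all assumptions made for illustration; real corpus projects apply their own, usually more elaborate, conventions.

```python
# Characters stripped from the edges of a token (an illustrative choice).
PUNCTUATION = ".,;:!?()[]-\"'„“”‘’"


def running_words(text):
    """Return every successive word-form, repeats included."""
    tokens = []
    for raw in text.split():              # word-forms are bounded by spaces
        word = raw.strip(PUNCTUATION)     # drop surrounding punctuation
        if word:                          # skip tokens that were only punctuation
            tokens.append(word)
    return tokens


def word_forms(text):
    """Return the set of distinct word-forms (case-insensitive by assumption)."""
    return {word.lower() for word in running_words(text)}


if __name__ == "__main__":
    sentence = "Andrea is a very cool guy."
    print(len(running_words(sentence)), "running words")
    print(len(word_forms(sentence)), "distinct word-forms")
```

Applied to the Glossary's example sentence, `running_words` returns six items, in line with the definition given there.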

3 Software Available for Corpus-Based Research

This appendix provides some basic information about the major tools for text analysis. Most of the entries are concordancing programs, since concordancers are practically the sine qua non of corpus exploitation and a very useful tool for the non-linguist as well. It goes without saying that this software is nothing but a collection of computer programs. In other words, it will not work miracles: the ‘output’ still needs to be analysed, filed and compared with other quantitative data in order to produce ‘results’.

Tools For IBM-Compatible Personal Computers

♦ Corpusbench
Features: this tool enables word counts, concordancing, and simple grammatical and morphological analyses (e.g. past tense “-ed”). It can handle large corpora, but it needs to construct a text database.
Contact: Textware Direct, Hörscholmsgrade. 20 2 DK-2200 København N, Copenhagen, Denmark.
♦ International Corpus of English Utility Program (ICEUP)
Features: for use only with the International Corpus of English.
Contact: Survey of English Usage, University College London, Gower Street, London WC1E 6BT, UK.
♦ LEXA
Features: LEXA is a sophisticated corpus analysis system. It produces lexical databases and concordances. The program is able to handle texts marked with COCOA references. It goes beyond the basic frequency and concordance features of most corpus analysis programs and also enables simple tagging and lemmatization routines to be run.
Contact: International Computer Archive of Modern English (ICAME), Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway (e-mail: [email protected]).
♦ Longman Mini Concordancer
Features: for use with ASCII texts of fewer than 50,000 words. It provides frequency lists, KWICs, and some text statistics. It is also possible to call up concordances of specific collocations using the program.
Contact: Longman Group UK, Longman House, Burnt Mill, Harlow, Essex CM20 2JE, UK.
♦ MicroConcord
Features: fast classroom concordancer. It is basically a searcher that produces a concordance. MicroConcord also offers other features, such as word counts, simple syntactic analyses and some morphological analyses (e.g. past tense “-ed”). It can be used with a variety of languages and alphabets. Site licences are available.
Contact: in the US: Athelstan, PO Box 8025, La Jolla, CA 92038-8025; in Europe: Oxford University Press, Walton Street, Oxford, OX2 6DP, UK. A downloadable demo is available from the Oxford University Press website.
♦ Micro-OCP
Features: slow research concordancer. Micro-OCP is a full-featured concordancer, with many options for tailoring the search and the concordance, including user-definable alphabets and references. It can produce indexes, statistics and frequency lists, and save the output to file. It can be used with a variety of languages. Apparently, there are no size limits.
Contact: Electronic Publishing, Oxford University Press, 200 Madison Ave., New York, NY 10016, USA.
♦ MonoConc
Features: quite fast and user-friendly concordancing program developed by Michael Barlow at Rice University (USA).
Contact: Athelstan, PO Box 8025, La Jolla CA 92038-8025 USA.
♦ Nijmegen Linguistic DataBase Software (LDB)
Features: allows browsing, concordancing, and syntactic pattern searches specifically with the Nijmegen Corpus. It can also be used with other parsed corpora that have been adapted for use with the LDB.
Contact: TOSCA Group, Department of Language and Speech, University of Nijmegen, Erasmusplein 1, NL-6525 HT Nijmegen, The Netherlands (e-mail: [email protected]).
♦ SARA
Features: a sophisticated concordancer designed specifically to handle texts which use TEI/SGML markup. SARA is necessary to browse the text collection of the British National Corpus.
Contact: Electronic Publishing, Oxford University Press, 200 Madison Ave., New York, NY 10016, USA.
♦ TACT
Features: freeware package. The functionality of TACT is quite similar to that of Wordcruncher. The program’s basic outputs are KWAL and KWIC concordances and frequency lists. It also enables the user to produce graphs of the distribution of words through a text or corpus. Further features are a basic collocation list generator and the ability to group words for searching according to user-defined categories (e.g. semantic fields). TACT requires the user to convert the raw text into a TACT database using a program called MAKBAS, which is quite a difficult task for a non-expert.
Contact: Centre for Computing in the Humanities, Room 14297A, Robarts Library, University of Toronto, Toronto, Ontario, M5S 1A5, Canada (e-mail: [email protected]). Also available by anonymous FTP from the latter (ftp://epas.utoronto.ca) and from ICAME (ftp://nora.hd.uib.no).
♦ TransSearch
Features: bilingual concordancing tool designed to query exclusively the Canadian Hansard texts, currently a database of seven years of Canadian parliamentary debates (the Hansards), from 1986 to 1993. A nice option of TransSearch is the possibility of submitting searches using either a simple or a bilingual interface.
Contact: The former Computer-Aided Translation Research Team of the Centre for Information Technology Innovation (CITI) now constitutes the core of the RALI laboratory of the University of Montreal, Canada.
For further information contact Université de Montréal, CP 6128-A, Montréal, Québec, H3C 3J7, Canada, or go to the RALI website (currently http://www-rali.iro.umontreal.ca).
♦ Wordcruncher
Features: user-friendly package able to produce frequency listings, KWAL and KWIC concordances, and concordances of user-selected collocations. It can also produce word distribution statistics. Like TACT, Wordcruncher requires texts to be in a specially indexed format. The LOB, Brown, London-Lund, Kolhapur and Helsinki Diachronic corpora are available on CD-ROM from ICAME in a ready-indexed form for use with Wordcruncher.
Contact: Johnston & Company, PO Box 446, American Fork, UT 84003, USA.
♦ Wordsmith Tools
Features: suite produced by Mike Scott at the University of Liverpool which includes a concordancer, a text aligner and a frequency lister as well as a variety of other tools. The only program suite based on a Windows environment. Currently the best set of tools available. For purchase conditions see the Oxford University Press catalogue.
Contact: Further information can be obtained at Mike Scott’s Wordsmith site, web-published by the Oxford University Press (currently http://www.liv.ac.uk/~ms2928/homepage.html).

Tools For Apple Macintosh Computers

♦ Conc 1.8
Features: research concordancer. It works with small texts only but it is very fast. An attractive feature of Conc 1.8 is its split-screen display of text and concordance: users can click in the concordance window to see the full context, and vice versa. Conc 1.8 has a variety of options for including or excluding words, sorting, exporting the concordance to a file and producing statistics.
Contact: International Academic Bookstore, Summer Institute of Linguistics, 7500 West Camp Wisdom Road, Dallas TX 75236. This software can also be downloaded from the site of the Summer Institute of Linguistics.
♦ Concorder
Features: a fairly simple KWIC concordancer.
Contact: Les Publications CRM, Université de Montréal, CP 6128-A, Montréal, Québec, H3C 3J7, Canada.
♦ FreeText Browser
Features: fast research concordancer based on a HyperCard stack. It has no limitation on file size, but also no print/extract capability. However, settings can be modified. It is a very nice tool for ad hoc browsing: it displays three windows, showing words with frequency, concordance and text.
Contact: FreeText Browser, PO Box 598, Kensington, MD 20895, USA (e-mail: [email protected]). It can also be downloaded from the Umich Mac HyperCard Archive.
♦ SysConc 2.5
Features: tool for extracting linguistic patterns from a large corpus of texts. It searches for specific lexical items, collocational patterns, or a group of items of any semantic type set by the user. SysConc displays the search results in a list, so that the user can call up a larger context for any item when required. It also shows statistics for the words around the searched items, presenting them as a bar graph and their collocations as a hierarchical pattern.
Contact: School of English, Linguistics & Media, Macquarie University North Ryde NSW 2109, Australia. It can also be downloaded free of charge from the Macquarie Systemic Modelling Group home page (currently http://minerva.ling.mq.edu.au).

Part-of-Speech Taggers

♦ CLAWS
Features: the Constituent Likelihood Automatic Wordtagging System is a part-of-speech tagger for English which makes use of a probabilistic model trained on large amounts of manually corrected analysed text.
Contact: Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, UK.
♦ Xerox Tagger
Features: a part-of-speech tagger, developed at the Xerox PARC laboratories, whose basic tagging program is language-independent and is being used at the Universidad Autónoma de Madrid to tag the Spanish part of the CRATER corpus.
Contact: available by anonymous FTP from ftp://ftp.parc.xerox.com/pub/tagger.

4 Results of a Collocation Search of Tra and Fra

In this last appendix I have reproduced the collocations of the two Italian prepositions tra and fra for reference. The number at the beginning of each string identifies its order of appearance in the corpus.

TRA

1 tivo, di ciascun gruppo. Ritengo quindi che tra una settimana- dieci giorni la Commissione
2 za che procede a ripartire i relativi oneri tra i due rami del Parlamento.
3 emo subito dopo - dell'usura e del rapporto tra banche, finanziarie ed intermediatori finan
4 distrettuali e, in particolare, il rapporto tra le procure distrettuali antimafia e la DNA.
5 erso un obiettivo preciso. Quella odierna è tra l'altro la prima seduta "vera" della Commis
6 anti per evitare di perdere tempo prezioso; tra l'altro, nell'elenco delle audizioni si dov
7 la lotta alla mafia, oltre che dei rapporti tra mafia e politica, qualora ve ne fossero. Si
8 re, acquisire gli atti relativi ai rapporti tra mafia e massoneria e, in generale, tutti gl
9 da parte della Commissione. Condivido, tra l'altro, una sua dichiarazione che ho letto
10 razione, che dobbiamo rendere obbligatoria, tra le forze di polizia e dell'esercito. La col
11 polizia e dell'esercito. La collaborazione tra carabinieri, polizia e Guardia di finanza,
12 nza allargato ai rappresentanti dei gruppi. Tra le richieste che dovremmo porre al ministro
13 voluto sottolineare questo piccolo problema tra i tanti. ANTONIO BARGONE. Avevo posto un
14 doci ad ascoltare quanto ci vengono a dire. Tra l'altro, dobbiamo anche tenere presente qua
15 la criminalità economica, cioè del rapporto tra crimine organizzato ed economia. Da questa
16 e immediatamente - che il tema del rapporto tra criminalità organizzata ed effetti sull'eco
17 endo apparire come il tipico saputello, che tra l'altro non sono e ripeto - non intendo ess
18 o approccio con il grande tema del rapporto tra economia, finanza e criminalità organizzata
19 e cosa fare fino alle prossime scadenze. Tra l'altro, questi difetti di organizzazione d
20 In caso contrario, ci troveremmo a fissare tra dieci giorni una riunione in cui si definis
21 cupazione di non creare una sovrapposizione tra l'ufficio di presidenza e la Commissione pl
22 o quella di non creare una contrapposizione tra l'ufficio di presidenza e la Commissione pl
23 ani ed americani che operavano, in simbiosi tra loro ed in collegamento con la mafia colomb
24 o la tendenza verso una stretta interazione tra realtà criminali diverse, ha favorito il co
25 minali diverse, ha favorito il collegamento tra differenti settori dello scambio illegale e
26 delinquenziali siffatte, che interagiscono tra loro proponendosi come un sistema complesso
27 ell'azione antimafia, un quadro di raccordo tra il momento della valutazione strategica del
28 ale esistente prevede un raccordo immediato tra Consiglio generale e strutture di contrasto
29 traverso Pagina 40 una costante interazione tra il momento dell'acquisizione conoscitiva e
30 un programma d'intervento il quale prevede, tra l'altro, l'adozione, di concerto con il min
31 reati economici ed il traffico di armi. Tra le organizzazioni impegnate a vario titolo
32 posizione trainante di rilievo, provvedendo tra l'altro all'istituzione di speciali agenzie
33 rare un pió elevato livello di cooperazione tra gli organismi di polizia impegnati nella pr
34 luppo di iniziative di collaborazione anche tra paesi extracomunitari, specie laddove quest
35 anica collaborazione di carattere operativo tra gli organismi investigativi attivi nell'are
36 uare una pió efficace divisione dei compiti tra polizia e carabinieri. Come tutti sappiamo,
37 nsiderando che la Commissione antimafia ha, tra i suoi compiti, quello di verificare che tu
38 ia possibile su questo versante un raccordo tra le esperienze di alta professionalità dei v
39 O, ROS), unificando un'azione oggi dispersa tra i vari corpi, specializzando l'intervento g
40 lo svolgimento delle elezioni in Germania, tra i quattro o cinque paesi interessati al fen
41 esi interessati al fenomeno - la Germania è tra questi - per verificare in quale modo si po
42 pibile! Non soltanto: in molte grandi città tra cui Napoli, nei centri dove è presente un h
43 in cui è certamente presente una collusione tra amministrazioni locali e forze mafiose. In
44 , stia tranquillo. Ho fatto una distinzione tra la sua volontà e l'azione complessiva del G
45 ai rapporti, pericolosamente in estensione, tra economia e criminalità. Un problema di coor
46 no contrario a quello che ci si prefiggeva. Tra l'altro, io stesso già due anni fa avevo la
47 n questa sede e che vorrei riprendere. Sono tra gli ammiratori dei carabinieri, sia ben chi
48 nistro accennava, di maggiore coordinamento tra le forze dell'ordine siano accelerate e che
49 piccola norma che ero riuscito ad ottenere. Tra l'altro, al di là di tutte le domande e di
50 molto serio di porsi di fronte al rapporto tra mafia e politica. Piuttosto, signor mini
51 . Concludo con un'ultima questione. Tra i personaggi a rischio nella lotta contro l
52 onsiglio comunale di Trani - e leggiamo che tra le motivazioni da cui ha tratto origine un
53 sione straordinaria possiamo constatare che tra questi è contenuto il permesso per tale dis
54 è giustissima, ma va evitata la cogestione tra Parlamento e Governo. Si tratta di un atto
55 prannominato "Gigi l'americano", il quale è tra gli arrestati. Le accuse sono di associazio
56 rre individuare e recidere i legami mafiosi tra la criminalità e la struttura che non sempr
57 n all'interno della struttura né tanto meno tra i diretti interessati. Credo sia utile ed o
58 CO ed averlo fatto funzionare con successo (tra parentesi, lo SCO ha gestito per due anni u
59 ontinuità nella gestione della direzione. Tra gli spostamenti che però non vengono quasi
60 finora, cioè di occuparsi del coordinamento tra l'azione dello Stato e quella svolta nella
61 volontariato. Credo che la collaborazione tra questi due mondi, che finora non si sono pa
62 che unità) proprio per il legame fortissimo tra i suoi componenti. Oggi abbiamo la grande o
63 ato comunque costituito un gruppo di lavoro tra le forze di polizia che presenterà entro la
64 da commissariato a commissariato: infatti, tra due persone che svolgono le stesse funzioni
65 studio e di approfondimento su questi temi tra esperti del ministero, della Bocconi (già d
66 considero l'Unione europea una via di mezzo tra estero e territorio nazionale (non è estera
67 diterraneo come "ufficiale di collegamento" tra i paesi dell'Unione europea e quelli non fa
68 io, perché l'argomento riguarda il rapporto tra Governo e settore del credito, le funzioni
69 rventi della Banca d'Italia, i collegamenti tra Tesoro, Banca d'Italia e settore del credit
70 ha comportato la chiusura di più di 2 mila tra società finanziarie e fiduciarie; siamo sem
71 roppo, però, non sono stati assunti), primo tra tutti la revisione della legge che consente
72 41- bis. Per quanto riguarda i rapporti tra economia e criminalità, si tratta di un tem
73 risposto che bisogna creare un collegamento tra istituzioni governative e istituzioni non g
74 ia in Sicilia ed il fatto che i funzionari, tra cui il segretario comunale, rimangono al lo
75 pre necessario trovare un giusto equilibrio tra la consistente presenza di forze dell'ordin
76 migrazione, che ha compiti di coordinamento tra gli enti governativi e quelli non governati
77 arantiscono, io credo, il giusto equilibrio tra la sicurezza e la possibilità per l'interes
78 di protezione per un centinaio di persone, tra cui molti politici o ex politici. Si tratta
79 perseguito attraverso molteplici strumenti, tra cui il controllo degli assetti proprietari,
80 a quella fascia di clienti che è al confine tra la bancabilità e la non bancabilità. Ind
81 dell'economia del paese, tende a stabilirsi tra economia criminale ed economia legale. M
82 rito che solo le banche possono raccogliere tra il pubblico fondi con l'impegno di restituz
83 ooperative finanziarie di raccogliere fondi tra i loro soci. Per corrispondere all'esigenza
84 izzazione favorendo la più ampia diffusione tra il pubblico degli elenchi degli intermediar
85 rio contributo tecnico. E' stata condivisa, tra l'altro, la scelta di svincolare la figura
86 ediari sono tenuti a conservare. I rapporti tra intermediari e organi inquirenti potranno d
87 ati gravi produttori di ricchezza illecita, tra i quali quindi anche i fatti di usura. R
88 e su basi non codificate, si va realizzando tra le autorità dei paesi ad economia matura; c
89 aese impegnati nell'azione antiriciclaggio, tra i quali la Banca d'Italia e l'Ufficio itali
90 avete sempre assunto circa il collegamento tra la proliferazione degli sportelli bancari e
91 tassimo un pochino il limite che intercorre tra il soggetto bancabile e quello non bancabil
92 NO VIOLANTE. Inoltre, credo che lo scarto tra entrate ed uscite annue sia elevatissimo, n
93 are un solo istituto o un solo paese. Tra i paradisi fiscali noti ho contato 14 paesi
94 oro del 1991, per ragioni contabili interne tra banca centrale e singole banche, li esclude
95 della provincia di Agrigento, per esempio, tra gli arrestati per vicende connesse all'usur
96 one alla necessità di definire nuove regole tra banche ed utenti per quanto riguarda la cer
97 ripetersi di vicende di questo genere, che tra l'altro riguardano transazioni di decine di
98 ca d'Italia ha fatto un'analisi comparativa tra la situazione economicosociale di alcune a
99 etta separazione dell'attività di vigilanza tra le autorità dei paesi, in particolare di qu
100 o, al loro interno, manchevolezze o abbiano tra i loro dipendenti elementi infedeli che con
101 L'onorevole Del Prete ha citato alcuni tra i casi più clamorosi: mi riferisco alle due
102 chiedono) che rendiamo pubbliche. Io stesso tra una settimana, a Foggia, svolgerò un interv
103 uoi lavori. Esiste quindi una differenza tra l'emendamento ed il testo attuale. Non ho b
104 automaticamente, già porta ad una divisione tra noi. PRESIDENTE. I proponenti insistono p
105 Nel corso degli incontri che si sono svolti tra lei, signor presidente, ed il capogruppo de
106 il tema degli insediamenti mafiosi nel nord tra quelli propri dei gruppi di lavoro della Co
107 in alcun modo i lavori della Commissione. Tra l'altro, è in vigore quello provvisorio.
108 cinque Commissari eletti dalla Commissione tra i suoi membri. Tra questi la Commissione el
109 eletti dalla Commissione tra i suoi membri. Tra questi la Commissione elegge il Presidente.
110 (penso, in particolare, alla Calabria, dove tra breve inizieranno processi molto importanti
111 nto riguarda il filone relativo al rapporto tra mafia e politica, occorre fare riferimento
112 ricche, è diventato "importante" - lo dico tra virgolette - perché è finalizzato anche all
113 rta un'internazionalizzazione del discorso: tra poco tempo si terrà, come è noto, la confer
114 trasparenza e per un diverso modo di porsi tra cittadini e istituzioni, è tema prioritario
115 e di una cultura della legalità soprattutto tra i giovani, in particolare nella scuola, per i
116 mi riferisco alla questione dei rapporti tra mafia e sistema eversivo. Mentre ho affront
117 o vale per altre questioni, come i rapporti tra Cosa nostra e la banda della Magliana, che
118 ntegrazione (su frodi comunitarie, rapporti tra mafia e massoneria, e così via) riprendono
119 tratta di intrecci, nemmeno tanto occulti, tra politica, economia e mafia. Non diamo a que
120 piamo bene che la mafia ha sempre sguazzato tra grembiuli e cappucci. Ma si tratta anche di
121 di trasparenza. PRESIDENTE. Comunico che tra venti minuti avranno inizio votazioni alla
122 alisi delle seguenti tematiche: connessioni tra mafia e politica negli organi dello Stato e
123 e politica della mafia, di capire cioè come tra mafia e politica si fosse stabilito un rapp
124 zione di quella che è stata la coabitazione tra il potere politico e la mafia. In questo mo
125 zione dai Presidenti delle Camere di intesa tra di loro. Quindi, tali adempimenti non dipen
126 rdo che vengano costituiti. Dirò subito che tra i compiti della Pagina 392 Commissione bica
127 ono registrati. Ricordo la drammatica notte tra il 19 e il 20 luglio 1992, quando i ministr
128 data l'esigenza di conoscenza del rapporto tra l'apposita commissione presso il Ministero
129 enza su quelle altrui. Occorrono un accordo tra i gruppi parlamentari e buona volontà da pa
130 iguardano aspetti interessanti dei rapporti tra mafia e politica, quelli tra mafia ed econo
131 i dei rapporti tra mafia e politica, quelli tra mafia ed economia e della mafia nel nord. Q
132 è rappresentato dal rapporto che intercorre tra noi e chi governa questo paese, nel senso c
133 lla legge, anche se si tratta di una legge, tra le tante, che mai è stata applicata. L'urge
134 votazione. GIUSEPPE ARLACCHI. Poiché sono tra i firmatari dell'ordine del giorno Bargone,
135 che in questo caso potrebbe venir fuori che tra qualche giorno i gruppi di lavoro si trovin
136 i gruppi di lavoro si trovino in disaccordo tra loro sulle formulazioni suggerite. PRESID
137 Ho messo come priorità quella del rapporto tra mafia, politica ed economia, e questa è un'
138 ne possano essere raggiunti". Tale modifica tra l'altro toglie quell'apprezzamento sul lavo
139 curiosa scissione che esiste sempre di più tra il dibattito politico sulla mafia fuori da
140 no comunque trovare una stretta connessione tra di loro, così da realizzare la conclamata e
141 Procure distrettuali; al coordinamento tra attività "ordinarie" e "antimafia" nelle pr
142 a" nelle procure distrettuali e al raccordo tra queste ultime e le procure circondariali;
143 .P., effettivo coordinamento delle indagini tra diversi PM) o che potrebbero favorire un ac
144 rapporti e delle strutture di collegamento tra gli altri servizi centrali e periferici (S.
145 ati e gli stretti legami e interconnessioni tra gli stessi, si trasforma in una mera formal
146 ruzione dei molteplici aspetti dei rapporti tra mafia e politica e tra mafia ed economia, c
147 aspetti dei rapporti tra mafia e politica e tra mafia ed economia, che si possa pervenire,
148 serio lavoro conoscitivo sulle connessioni tra mafia e politica, e tra mafia ed economia,
149 o sulle connessioni tra mafia e politica, e tra mafia ed economia, non trascurando di verif
150 articolare, riescono a consolidare i legami tra ambiente governativo, militare, apparati di
151 consente di individuare le interconnessioni tra i diversi settori nell'ambito di ciascuna t
152 ciascuna tematica e delle diverse tematiche tra di loro, in una circolarità che eviti un la
153 e seguenti tematiche: 1) Connessioni tra mafia e politica negli organi dello Stato e
154 o sia opportuno ribadire in questa sede. Tra l'altro, questo viaggio in Russia mi ha riv
155 ste una grande collaborazione, per esempio, tra Washington e Mosca. Credo che debba essere
156 trovare là occasioni di lavoro importanti; tra l'altro, c'è tutta la riconversione dell'in
157 a domanda sulla mafia, sulla ricongiunzione tra la mafia italiana e quella dei paesi dell'e
158 . Svolgerò ora alcune premesse generali. Tra i suoi obiettivi primari e fondamentali, il
159 atezza assumono i temi relativi ai rapporti tra i vari organismi di polizia ed al loro coor
160 oni per un recupero del rapporto fiduciario tra cittadino ed istituzioni e per l'acquisizio
161 no: il riequilibrio dell'equità competitiva tra gli operatori, grandi e piccoli, anche tram
162 perfezionati i meccanismi di coordinamento tra le diverse autorità, amministrative e di po
163 istri. Era altamente affettuoso. Lei sa che tra il presidente e i tifosi non ci può essere
164 e e del gruppo di lavoro interministeriale (tra ministro dell'interno e ministro di grazia
165 sia per ragioni di continuità di competenza tra uffici giudiziari sia per esigenze di funzi
166 so", si aggiungono infatti quelle legate... tra l'altro sento di cifre che, sia pure senza
167 upposti per un sempre più diffuso "scambio" tra le mafie tradizionali e quelle straniere (l
168 e al fine di rendere più agevoli i rapporti tra le autorità giudiziarie (specie in tema di
169 no esperienze pregresse e la collaborazione tra i vari paesi, anche per quanto riguarda le
170 della sua audizione, che esiste un rapporto tra istituzioni, sistema bancario e mondo econo
171 sibile. Ecco perché la Commissione ha posto tra i suoi compiti quello di indagare sul ricic
172 Governo? Un disegno di legge si è infranto tra le proteste delle organizzazioni sindacali
173 n viso a cattivo gioco, si tengono riunioni tra i questori, prima dello sciopero generale,
174 o futuro. Sarebbe auspicabile, di intesa tra la Presidenza del Consiglio e i ministri de
175 nare e quindi emettere sentenze più rapide. Tra coloro che aspettano di essere processati i
176 o peggiorativo. C'è un problema di coerenza tra indirizzi e proposte del Presidente del Con
177 dei problemi che dobbiamo affrontare. Oggi tra funzionari amministrativi e magistrati vi è
178 nti forse si potrebbe trovare un equilibrio tra le esigenze di bilancio e le esigenze di un
179 non aveva affatto queste caratteristiche e, tra l'altro, prevedeva anche uno specifico cont
180 ente sottoposti a tale regime ha raggiunto, tra la fine del 1992 e il primo semestre del 19
181 sottoposti al regime dell'articolo 41-bis, tra la fine del 1992 e il primo semestre del 19
182 ezione di massima sicurezza che non rientra tra quelle riservate ai detenuti sottoposti al
183 mi risulta gode di notevole prestigio anche tra i rappresentanti dell'opposizione. Si tratt
184 sto cospicuo ampliamento, che ha comportato tra l'altro il frazionamento di una direzione (
185 entrale della lotta alla mafia, il rapporto tra mafia e politica. Eppure di tale questione
186 legami, spesso ambigui e sempre insidiosi, tra mafia e politica, tra criminalità organizza
187 i e sempre insidiosi, tra mafia e politica, tra criminalità organizzata e consenso elettora
188 se lei ritenga che il problema del rapporto tra mafia e politica sia stato in tutto o in pa
189 478 economica, realizzando quella saldatura tra criminalità organizzata e criminalità degli
190 rtamente quanto meno un accenno al rapporto tra mafia e amministrazione, politica e istituz
191 pio, intenda istituire un corpo specifico); tra l'altro, secondo i calcoli, i pentiti sono
192 one di tutti noi sulla differenza esistente tra il mantenimento normativo del suddetto arti
193 rtuno questo suo accenno), con riferimento, tra l'altro, alla questione della celebrazione
194 sorvolo). La celebrazione dei processi, che tra primo e secondo grado è una celebrazione so
195 e debbo dirle con franchezza circa il nesso tra il problema mafia e il turismo: al riguardo
196 di Palermo, esponente di forza Italia (che tra l'altro conosco da anni ed ho sempre avuto
197 il mondo e vi è quindi quasi un'incoerenza tra un tipo di interpretazione che si può dare
198 iglio, nelle sue comunicazioni ha collocato tra i vari settori dell'illecito le case da gio
199 una situazione di obiettiva incompatibilità tra il magistrato che gestisce il pentito e il
200 uttorie, depotenzia la "periferia" - sempre tra virgolette, perché è un termine che non mi
201 di ciò che accadeva, dei rapporti illeciti tra politica, imprenditoria, criminalità. Lei h
202 fossero o quali potessero essere i rapporti tra mafia e massoneria. Abbiamo avuto un interl
203 ll'opposizione verificare la corrispondenza tra parole e fatti, indurre e stimolare il Gove
204 erativi speciali dell'Arma dei carabinieri; tra i 400 e i 500 uomini lavorano nel servizio
205 , salta agli occhi il problema del rapporto tra diverse forze di polizia. Siamo convinti ch
206 spetto fondamentale, si tratta del rapporto tra il Governo ed il mondo della cultura, degli
207 rio perché mostrava possibilità di intrecci tra politica e criminalità che essi conoscevano
208 567 decreti, tanto che nel periodo compreso tra la fine del 1992 e l'inizio del 1993 il tot
209 rio previsto dall'articolo 41-bis sono 436; tra questi, ve ne sono alcuni per i quali i dec
210 144 facenti parte di altre cosche mafiose, tra cui la stidda (in questo caso si tratta di
211 ratta di persone inserite in famiglie unite tra di loro da matrimoni o da alleanze che spes
212 lla quale non si può uscire. Naturalmente, tra i personaggi nei cui confronti è stato appl
213 rovvedimenti in vigore nel periodo compreso tra la fine del 1992 e l'inizio del 1993 si è g
214 si propone, quello cioè di tagliare i fili tra un certo tipo di detenuti e coloro che sono
215 la possibilità di emanare decreti e perché, tra i decreti emanati dal direttore generale, 5
216 ole: il cosiddetto radio-carcere funziona e tra i detenuti vi sono amici, coimputati, corre
217 prontare un congresso che si terrà a Napoli tra il 21 e il 23 novembre, al quale affluirann
218 vi è la necessità di effettuare interventi, tra l'altro, sul parco macchine e ricordo, per
219 prattutto con riferimento alla 'ndrangheta, tra i cui componenti vi sono migliaia di detenu
220 atterizzate da una grande presenza mafiosa. Tra l'altro, in Calabria vi sono 157 cosche, pe
221 sociali, che rappresentano il trait d'union tra il detenuto e il tribunale di sorveglianza.
222 relazione sulla diversità dei comportamenti tra questo e quell'istituto, ho cercato di farv
223 ati come un supporto, come un trait d'union tra il detenuto, la famiglia, il posto di lavor
224 istrati di sorveglianza, obbligatoriamente. Tra noi e i magistrati c'è una specie di amore
225 ne penitenziaria ci sono tanti problemi, ma tra quelli che hanno priorità bisogna annoverar
226 i diritti inviolabili Pagina 522 dell'uomo, tra cui quello della libertà personale, la cui
227 , deve sempre sussistere un contemperamento tra potestà punitiva, tutela della sicurezza pu
228 oblemi che avvertiamo in merito al rapporto tra il legislativo e la magistratura, è chiaro
229 e equilibrio, sotto il profilo percentuale, tra le due ipotesi? ANTONELLA GIULIANA MAGNAV
230 in generale nella materia del trattamento; tra l'altro, il magistrato di sorveglianza deci
231 è pió grave, genera solidarietà nel carcere tra tutti i detenuti: con De Lorenzo è solidale
232 potrebbe addirittura creare una solidarietà tra i mafiosi colpiti da questa norma, che perc
233 a di persone dal carcere significa portarle tra la popolazione e addossarne il controllo ai
234 'è bisogno di un provvedimento del giudice, tra l'altro motivato, per poterlo limitare. Lo
235 sottoposti al regime dell'articolo 41-bis. Tra l'altro, di questo mi dà conferma l'ultima
236 e al tribunale debbono marciare altre cose, tra cui il carcere. Si sono dimenticati del car
237 sua banda che ha commesso parecchi omicidi, tra cui quello di un agente di pubblica sicurez
238 ontavano a 26 mila unità circa, il rapporto tra reclusi definitivi e quelli in attesa di gi
239 e, come ho detto, siamo in sede di rinvio; tra breve il problema dovrebbe essere risolto
240 o un altro principio, del quale sottolineo, tra virgolette, la gravità. Tale principio, inf
241 luto, non può assicurare che non conversino tra loro. PRESIDENTE. O che la posta segua vi
242 lla coerenza, direi della consequenzialità, tra intenzioni ed atti di Governo e tra progett
243 ialità, tra intenzioni ed atti di Governo e tra progetti e loro realizzazione. E credo che
244 che si ponga anche un problema di coerenza tra parole e parole e tra intenzioni e intenzio
245 problema di coerenza tra parole e parole e tra intenzioni e intenzioni, soprattutto quando
246 oni per un recupero del rapporto fiduciario tra cittadino ed istituzioni e per l'acquisizio
247 rte della banca è più economica che morale. Tra cliente bancabile e cliente non bancabile l
248 . Si pone, quindi, il problema del rapporto tra aree depresse del Mezzogiorno e capacità di
249 enomeno. Non vede lei una contraddizione tra quanto ella ha proclamato, ossia di voler c
250 a Mario Pirani che non è l'ultimo arrivato tra i giornalisti - lei attaccò con forza i giu
251 icando! su corruzione e mafia, sui rapporti tra affari e mafia, siano comunisti, cioè che a
252 richiesta a magistrati (del popolo, lo dico tra parentesi) che l'amministrano in altro modo
253 edere la normativa vigente e di consentire, tra l'altro, il ricorso a strumenti di indagine
254 al cliente. E' infatti sempre la difformità tra l'entità del patrimonio delle persone, i mo
255 onale antimafia, che si collocano a cavallo tra la fine del 1991 ed i primi mesi del 1992,
256 non approdò ad esiti di rilievo. Peraltro, tra le stesse forze di polizia, i cui responsab
257 ato alla migliore distribuzione dei compiti tra i vari corpi ed allo sviluppo di un'azione
258 ne, nell'ambito delle associazioni mafiose, tra attività delinquenziali primarie (fonte del
259 ma completamente rivisitata, tenendo conto tra l'altro delle oggettive difficoltà connesse
260 necessità di attuare una netta separazione tra chi investiga sui fatti dichiarati dal pent
261 cano nel più generale contesto dei rapporti tra mafia e politica. Di tale relazione voglio
262 ali privilegiati per infittire le relazioni tra i suoi appartenenti e coloro i quali, in qu
263 genericamente allo stesso tema dei rapporti tra mafia e pubblica amministrazione: per evita
264 con riguardo al problema del coordinamento tra le forze di polizia, che già è stato affron
265 oni per il recupero del rapporto fiduciario tra cittadino ed istituzioni, sia creando una n
266 anche i presupposti per un diffuso scambio tra le organizzazioni criminali, che nella dime
267 che nella dimensione internazionale vedono, tra l'altro, un mezzo più sicuro e proficuo di
268 del codice di procedura penale, i rapporti tra le autorità giudiziarie. Sotto questo profi
269 ece che fosse la cupola mafiosa a scegliere tra i candidati inclusi nelle liste dei vari pa
270 le diverse province i "personaggi" (lo dico tra virgolette) da appoggiare in ogni singola c
271 no sull'argomento o a mancato coordinamento tra la Presidenza della Commissione e palazzo C
272 protettori. Con il distinguo che si opera tra i carcerati con l'applicazione dell'articol
273 a... La pericolosa commistione esistente tra imputati e condannati, sia pure in primo gr
274 tituto in cui vi siano una vera separazione tra imputato e condannato sia pure in prima
275 ia pure in prima battuta -, una distinzione tra uomini e donne e speciali accorgimenti per
276 ontrollo. Noi non vogliamo creare conflitti tra i poteri dello Stato; le nostre pronunce so
277 la pronuncia è difficile, vi è un conflitto tra poteri dello Stato, perché potere esecutivo
278 con disagio questa conflittualità immanente tra il potere esecutivo e quello giudiziario, s
279 nza però realizzare alcunché. PRESIDENTE. Tra l'altro, costano anche poco. MARCELLO G
280 la questione della conflittualità oggettiva tra alcune norme e la disciplina carceraria, mi
281 PRESIDENTE. Come viene diviso l'istituto tra collaboratori e detenuti soggetti a regime
282 stato mandato, anche perché non è detto che tra coloro che sono sottoposti al regime di cui
283 losi custodi della nuova identità, cosa che tra l'altro comporta anche delle spese, perché
284 enta di mettere a proprio agio il soggetto. Tra gli aspetti paradossali della vicenda di qu
285 tirla e non di discriminare nella sicurezza tra persona e persona. GIUSEPPE SC
286 aspettative dei collaboratori di giustizia. Tra l'altro amministriamo soldi dello Stato e q
287 dino. Ognuno dice quello che pensa. Io sono tra coloro che sono stanchi di vedere uno Stato
288 petenza. Per quanto riguarda la distinzione tra i pentiti e coloro che sono infiltrati dell
289 i fondi, vista la crescita esponenziale che tra il 1992 e il 1994 emerge dalle cifre, esso
290 o di un testimone né di un suo familiare. Tra l'altro, devo dire che, tranne due casi, ch
291 "). Vi sono stati diversi collaboratori che tra l'altro hanno mandato in tilt la questura d
292 o inoltre conoscere il rapporto, se esiste, tra personale impiegato e persone tutelate: inf
293 iale. Per quanto riguarda il rapporto tra i collaboratori ed il personale del Servizi
294 il direttore del servizio, sull'interazione tra le procedure applicative del programma di p
295 eve svolgere, ossia quella di trait d'union tra voi, che siete impegnati direttamente sul c
296 agneremmo del tempo -, però è opportuno che tra voi e la Commissione antimafia si instauri
297 te sconsigliabile, ed addirittura negativo. Tra l'altro, non sempre tutti coloro che sono p
298 ste fare, intendete operare una distinzione tra il collaboratore, il pentito ed il testimon
299 Valentini abbia dei dati più puntuali, però tra il 1^ novembre 1993 e il 1^ novembre 1994 s
300 di diritto amministrativo... PRESIDENTE. Tra poco lo sapremo. GIACOMO GARRA. Fate un r
301 no della magistratura reggina, per esempio, tra magistrati di vari gradi, tra magistrati ch
302 per esempio, tra magistrati di vari gradi, tra magistrati che si occupano di pentiti ed al
303 de e si sviluppa il dibattito o il rapporto tra chi pone le domande stesse e chi risponde.
304 a giustizia siano in qualche modo collegati tra loro e, nel caso in cui tale collegamento e
305 almeno qui, avessimo chiara la distinzione tra Governo e Stato. Chi vi parla è tra coloro
306 inzione tra Governo e Stato. Chi vi parla è tra coloro che, da anni, si sono schierati cont
307 sul terreno della mafia, ma è anche da anni tra coloro che cercano di difendere lo Stato ne
308 difeso lo Stato e lo difende, distinguendo tra Stato e governi. Se questo Governo farà meg
309 sto Governo farà meglio degli altri, saremo tra Pagina 632 quelli che gliene daranno atto e
310 a pagina a caso in cui si parla di disagio, tra le tante che erano state redatte in quel pe
311 dichiarazione e di tutti gli interstizi che tra dichiarazione e dichiarazione possono prese
312 presentato particolari margini di rischio, tra fenomeni criminali e tessuto istituzionale.
313 pentiti emerga la permanenza di un rapporto tra mafia e politica che sia attuale e se vi si
314 tito non ha reso dichiarazioni sui rapporti tra mafia e politica perché purtroppo alcuni po
315 tà di ascoltare alcuni pentiti sui rapporti tra mafia e politica, ovviamente senza che ques
316 l'arresto, per effetto della detenzione. Ma tra questi due poli ci sono mille gamme interme
317 e interessare sotto il profilo del rapporto tra mafia e politica, la domanda se ci siano re
318 guerra di religione, come contrapposizione tra posizioni filosofiche diverse - a volte add
319 - a volte addirittura a livello di litigio tra tifosi da stadio - che non, ripeto, con la
320 gistrati è stata quella di tranquillizzare, tra virgolette, affermando che non c'è nulla di
321 E qui bisogna ancora una volta distinguere tra collaborante e collaborante, cioè tra un so
322 guere tra collaborante e collaborante, cioè tra un soggetto che ha vissuto un'esperienza cr
323 si potrebbe immaginare. Vanno considerati, tra gli altri, i limiti di resistenza umana; in
324 di difesa. Cosa accadrebbe, ad esempio, se tra i mille ricordi dell'esperienza criminale d
325 tra valutazione circa il rapporto esistente tra Cosa nostra e le associazioni di tipo masso
326 indagini sul problema cruciale del rapporto tra due tipi di organizzazioni eversive: Cosa n
327 è la separazione, di cui da tempo parliamo, tra struttura e personale addetti alle indagini
328 oblemi per quanto concerne il coordinamento tra i pubblici ministeri interessati? Quali vie
329 e ad instaurare un coordinamento nel lavoro tra i pubblici ministeri? Inoltre, quando vi
330 In ordine alla valutazione del rapporto tra Cosa nostra e massoneria ed a che punto sia
331 questa ipotesi consolidatisi e sviluppatisi tra componenti di Cosa nostra e momenti, profil
332 e sembrava una cosa incredibile; novecento, tra l'altro è il numero complessivo dei collabo
333 lato in questa sede di una giusta equazione tra segretezza e sicurezza. Dottor Lo Forte,
334 ossa esserlo anche la risposta: il contatto tra magistrati e pentiti, per le ragioni indica
335 resenza dello Stato "in periferia" (lo dico tra virgolette perché la lotta alla mafia non d
336 siciliani per poi soffermarmi sui rapporti tra la distrettuale, le procure ed i tribunali
337 ocura distrettuale di Palermo sia un unicum tra le altre distrettuali: non c'è altra distre
338 sta DDA, costretti a quotidiani spostamenti tra Palermo, Agrigento e Sciacca, attraverso st
339 i passaggi, resta il problema dei rapporti tra procure distrettuali e procure circondarial
340 parrocchiale che, di fronte all'alternativa tra l'esortazione evangelica ad ispirarsi all'o
341 ta verifica sul campo dell'attuale rapporto tra Cosa nostra e la società civile a confermar
342 atitanti e ad alcune questioni carcerarie), tra i quali anche quello relativo alla protezio
343 i Leonardo Messina, l'esistenza di contatti tra collaboratori di giustizia, nel momento in
344 ina ed ai suoi o no? Abbiamo una spaccatura tra gli uomini d'onore reclusi nelle carceri di
345 r quanto riguarda la questione dei contatti tra collaboratori di giustizia ed agenti dei se
346 A noi non risultano contatti di questo tipo tra i collaboratori di giustizia, le cui dichia
347 i una spaccatura o distinzione di strategie tra uomini d'onore detenuti e quelli liberi, ev
348 e organicità l'esistenza di una spaccatura tra le due componenti dell'organizzazione. MI
349 si traduce in un disagio, in una difficoltà tra i pentiti, si può essere portati a chieders
350 I, Procuratore della Repubblica di Palermo. Tra l'altro non è neanche rientrata. E' vero ch
351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 di Grasso e di Vigna vengono sostituite - e tra elencati, che PM di primo grado non erano (tra ssere, così com'è oggi, per molti detenuti, tra r Caselli, ovvero la relazione strettissima tra per uscire da questa grave situazione che, tra mafiosa in alcune amministrazioni comunali, tra otto una vecchia consuetudine di collusione tra ba esserci un rapporto chiaro e trasparente tra o lavorare tutti assieme, senza distinzione tra itico, dell'asprezza del dibattito politico tra anto uno sforzo volontaristico, ritengo che tra mmo dovuto fare per procedere in tal senso. Tra ventilato da qualcuno, di creare spaccature tra e possa estrinsecarsi una certa litigiosità tra ella Commissione antimafia: una discussione tra empre molto vicina alla passione politica e tra stata rimarcata una differenza territoriale tra ortata ad esempio Pagina 697 una differenza tra . Non voglio farmi illusioni, perché so che tra Sono profondamente convinto che lo scontro tra alle opinioni di ciascuno (che non sempre, tra e sospetti.
Sono certissima di trovarmi qui tra isogno piuttosto che almeno in questa sede, tra o abbia stimolato un momento di riflessione tra ARLACCHI. Sul caso Ayala c'è un contrasto tra in Sicilia debba essere fatto; siamo stati tra one non è personale, ma attiene al rapporto tra o e che li dobbiamo affrontare uno per uno. Tra ç agli altri un problema di incompatibilità tra ne per rinnovare le intese assunte nel 1993 tra ni per effetto di una convenzione stipulata tra e queste associazioni esistano ancora oggi. Tra 150 l'altro circolano nomi, mi pare non smentit questi c'era anche il dottor Ilarda), c'è u i più significativi, una scatola vuota, un segnali istituzionali, atti pubblici, misur l'altro, ha provocato un clima di tensione le quali quelle di Corleone e San Giuseppe amministrazione ed ambienti mafiosi, restit il presidente e i membri della Commissione, maggioranza e opposizione. Questa dovrebbe le forze qui rappresentate, ma è di tipo is una settimana ci ritroveremo di fronte agli l'altro, abbiamo denunciato fin dall'inizio nord e sud, se è vero, com'è vero, che alla le componenti politiche della Commissione, me e lui si concluse proprio sulla necessit sede e sede muta la risonanza delle azioni nord e sud, non certo per orgoglio o razzi nord e sud, in un momento in cui, invece, s noi molti non sono amici, né posso pretende il presidente ed il commissario Ayala non s l'altro, occorre tenere presenti, così come persone più che oneste, lontane da ogni sos i 51 componenti della Commissione, ci si ri di noi. Avremmo tutti - io per prima, non m dichiarazioni fatte da Ayala e dichiarazion coloro che hanno ritenuto opportuno di rinv il presidente della Commissione ed i compon questi problemi ne individuo tre sollevati la sua posizione e il suo incarico. 
PRESI la Commissione antimafia della precedente l il Ministero di grazia e giustizia e, appun l'altro, in quella zona si sono registrati 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 'è dubbio. ANTONIO BARGONE. L'esperienza, voro mafia ed economia, abbiamo individuato - ma anche misure di carattere preventivo, esercizio in via prevalente; la confluenza eranti in presumibile regime di abusivismo. gli organi che hanno l'obbligo di riferire, , è consentita dagli operatori autorizzati. el merito delle ispezioni), ma oggi non c'è enzioni con i singoli istituti, soprattutto tuti, soprattutto tra UIC e Banca d'Italia, tra UIC e Banca d'Italia, tra UIC e ISVAP, dato (e che è stato il primo ostacolo posto rci. PRESIDENTE. Lo so, ma la convenzione gruppo di lavoro che si occupa dei rapporti asparenza: tenuto conto del tempo trascorso za del rapporto causale è questo il punto potere al commissario antiracket (il quale, o del lavoro che sta svolgendo sul rapporto a di fideiussione. Qualora si appurasse che r comprendere i motivi di queste reiezioni. afiosa. I problemi cominciarono a sorgere elle spese. Ciò detto, si deve sapere se ano possibilità di collusioni o di incontri normativa primaria, della differenziazione r sempre. 
In queste situazioni, il rapporto erverranno, e dopo aver valutato i rapporti in una situazione di sudditanza - sia detto lizia giudiziaria determina una commistione deve essere individualizzato in relazione, vo - se necessario dagli organi competenti, ubblica che non parleranno più dei rapporti compenso esistente in seno alla commissione tra tra tra tra Tra tra Tra tra tra tra tra tra tra tra tra tra tra tra tra Tra tra tra tra tra tra tra tra tra tra tra tra tra 151 l'altro, ci insegna che la partecipazione d gli obiettivi di primo piano quello relativ cui l'utilizzo degli intermediari quale str gli intermediari altresì dei soggetti che s le problematiche emerse, la Guardia di fina gli altri, al Tesoro, in ordine all'osserva l'altro, a questo proposito le farò una dom noi chi non sia andato in banca per fare un UIC e Banca d'Italia, tra UIC e ISVAP, tra UIC e ISVAP, tra UIC e Consob. Ciò in modo UIC e Consob. Ciò in modo che, ognuno per i le ruote) e la disponibilità di tutti quest l'UIC e la Consob viene fatta, in base alla mafia ed economia ha avvertito la necessità la data di presentazione della domanda e il l'attentato subito e gli atteggiamenti di c l'altro, opera presso la Presidenza del Con criminalità ed economia, potrebbe formulare coloro i quali hanno proposto domanda di ri le diverse zone del paese, le regioni del s la fine del 1992 e l'inizio del 1993, per l i presupposti dell'applicazione del program i soggetti. Questo è un punto centrale, per struttura investigativa e di protezione, pr il cambiamento delle generalità e l'offerta la commissione e l'autorità giudiziaria. 
virgolette - rispetto al capo della polizia i due aspetti della protezione e dell'inves l'altro, allo stato di pericolo; tale indiv i quali ovviamente doveva collocarsi, dopo politica ed istituzioni (vedete Buscetta, f i componenti cosiddetti laici (cinque, oltr 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 rchÇ sempre di più assistiamo a commistioni tra o diverso, c'erano situazioni assai diverse tra mbra si tratti di una collaborazione dovuta tra li interventi che svolgeremo di qui a poco, tra ono valutazioni divergenti - e ve ne sono - tra ttor Vigna ed altri che oggi interverranno (tra temente alimentarsi del confronto razionale tra o, vi è un'ampia circolazione di idee anche tra upano di altri affari. I temi più generali, tra no su due temi fondamentali: la separazione tra onatorio, quanto più vi sia una distinzione tra meno immediati e concreti; è la distinzione tra ti assolutamente impropri e non auspicabili tra mente quelli relativi ad eventuali rapporti tra e quali sono le regole di un serio rapporto tra ura migliorare e razionalizzare il rapporto tra to virtualmente un principio di distinzione tra olutamente non necessario di conflittualità tra empo, inoltre, potrebbe durare il conflitto tra quanto concerne il problema delle divisione tra dalla Commissione antimafia, quale un forum tra cipi costituzionali che regolano i rapporti tra obiettare che vi è una profonda differenza tra profonda differenza tra i due settori, cioè tra a Siclari, ha finito con il creare problemi tra come avviene nei rapporti di collaborazione tra i di collaborazione tra autorità diverse, e tra esta materia e di razionalizzare i rapporti tra - che peraltro non è mai mancato in passato tra tratta di un settore che si trova al limite tra essivo. Se è vero che tale procura ha avuto tra anche quello di effettuare un coordinamento tra 152 potere politico e criminalità organizzata. 
loro che occorreva riportare ad una certa r organi istituzionali e non mi pare ci sia a i magistrati che fanno parte della commissi il dottor Loris D'Ambrosio, il dottor Vigna questi la procura di Palermo), ciò signific opinioni diverse, debbo dire che nell'ambit la Direzione distrettuale antimafia ed i co cui questi fondamentali riguardanti la lott la fase delle investigazioni e quella della organo dell'investigazione e organo della p le sfere istituzionali di competenza della organi della giurisdizione e organi dell'am l'organizzazione criminale e componenti del il collaborante e le istituzioni dello Stat questi due aspetti. Ritengo tuttavia che, a i poteri dello Stato, che va salvaguardato autorità giudiziaria ed organo amministrati l'autorità giudiziaria proponente, che rite custodia ed investigazione, problema che, p le procure distrettuali e molte procure ord l'amministrazione e la giurisdizione. Natur i due settori, cioè tra la valutazione in o la valutazione in ordine alla possibilità d il pubblico ministero e la commissione. Que autorità diverse, e tra amministrazione e g amministrazione e giurisdizione. Mai, però, gli uffici della procura e l'organismo cent la commissione e la procura della Repubblic la legislazione primaria e quella di second i suoi poteri anche quello di effettuare un le procure distrettuali, è chiaro che una s 447 tiamo occupando, sia necessario distinguere 448 ssibilità di ricorrere ad alternative e che 449 siamo sicuramente su un terreno di confine 450 ratta di due attività profondamente diverse 451 mio parere dannosissimo momento di attrito 452 Lo Forte, cioè quello relativo ai rapporti tra tra tra tra tra tra 153 gli Pagina 781 aspetti formali e quelli sos queste vi sia anche la detenzione extracarc amministrazione e giurisdizione. 
Siamo su u loro: mentre quella giudiziaria ovviamente il potere giudiziario e quello amministrati mafia e politica e mafia ed istituzioni, il 154 1 dal Presidente per procedere all'elezione, fra i suoi componenti, di due Vicepresidenti e 2 nati dai presidenti delle Camere, di intesa fra loro. 2. Le spese per il funzionamento d 3 ssere assolti in tempi del tutto residuali, fra una votazione e l'altra o fra l'una o l'alt 4 to residuali, fra una votazione e l'altra o fra l'una o l'altra seduta di Commissione. Imma 5 li questioni, per poi cominciare a lavorare fra due o tre settimane. Si tratterebbe, a mio 6 he intanto la burocrazia diminuisca i tempi fra la confisca e l'assegnazione; mi rendo cont 7 nche una ricognizione sul tema del rapporto fra mafia ed enti locali. In merito a quest'asp 8 unque serio, ma ä una questione di rapporti fra Governo, Parlamento e sistema bancario e no 9 tati Uniti: daremo avvio ad un collegamento fra tutti i paesi amici per rafforzare la lotta 10 cune gestioni commissariali (non ricordo se fra esse vi era anche quella del comune di Terl 11 re, le connessioni giuridiche ed economiche fra i soggetti prenditori del credito. Passo a 12 ve o statutarie che ne regolano l'attività; fra queste, evidentemente, rientra sicuramente 13 anizzata ä rappresentata dalla cooperazione fra le autorità preposte ai controlli. Con la s 14 ttoscrizione dei Memoranda of understanding fra le "Vigilanze" dei paesi comunitari si dà c 15 resentato un significativo foro di incontro fra le diverse delegazioni nazionali dell'inter 16 oggi continui ad esistere. 
Mi scuso se fra poco dovrò allontanarmi, ma avrò il piacere 17 che si faceva sul serio soprattutto perché fra gli studiosi - fra i quali cito il Rey - si 18 serio soprattutto perché fra gli studiosi - fra i quali cito il Rey - si cominciava a prosp 19 ovrà essere anche la questione dei rapporti fra le varie forze di polizia, anche con riferi 20 alutare la possibilità che il coordinamento fra le forze di polizia possa essere potenziato 21 i flussi di spesa pubblica, con l'intreccio fra le imprese mafiose e gli eventuali appoggi 22 livello internazionale, per proporre intese fra tutti i paesi per arrivare ad una armonizza 23 n dolore. A me non ha fatto piacere quando, fra i primi, ho scritto delle collusioni dell'a 24 domani, dopodomani, la prossima settimana, fra due settimane e fra tre mesi. Il senso dell 25 la prossima settimana, fra due settimane e fra tre mesi. Il senso dell'ordine del giorno, 26 'entra l'autorità giudiziaria. Questi casi, fra l'altro, sono apparsi su tutti i giornali; 27 ella Commissione alcuni elementi, contenuti fra l'altro nel programma, ai quali si intende 28 che si articola lungo più direttrici tutte fra loro strettamente connesse ed alla cui scel 29 curare l'effettivo isolamento del detenuto. Fra questi si annoverano quelli dell'Asinara e 30 essere dalle organizzazioni mafiose, prime fra tutte le attività economiche e finanziarie. FRA 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 à economica, di infiltrazioni, di relazioni il tribunale di Reggio Calabria, dal quale il periodo attuale. La distinzione generale non era emerso alcun collegamento pplicazione, tenendo conto di vari elementi sa efficacia ma che si sottrae al conflitto dinaria potrebbe superare. Alla discussione ticolo 41-bis), ossia una netta separazione egli ultimi avvenimenti e della commistione roduce altri reati e strane alleanze, anche i problemi cui accenna l'onorevole Simeone, o in zone molto diverse. 
Il collegamento sidente, che la differenza di comportamento un'ispezione, fate un'indagine comparativa in alcuni interventi ä quello del rapporto to che lo stesso ä imputato di molti reati, o istituti nuovi, come Opera a Milano, dove ri, educatori e chi più ne ha più ne metta: ispetto e che per una vita ä stato detenuto ia in discussione! LUIGI ROSSI. Scusi, ma odia). In molti di essi, in particolar modo poi individuare un sistema di collegamento ma anzi attiva quell'indiscussa solidarietà Per quanto riguarda i rapporti intrattenuti ecché ne pensi il direttore, ä pacifico che perché magari un giorno devono essere qui e n provincia di Lecce, dove ora si ammazzano illazioni cui abbiamo assistito alla Camera ri come mai continui il contenzioso in atto iose, se non addirittura in odore di mafia. segnalazione che dovrà avere a presupposto, nuovo rapporto che ha cercato di instaurare fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra Fra fra fra 155 settori economici, istituzionali, imprendit poco saranno scarcerate centinaia di person i vari periodi ä utile, anche se costringe queste persone e la criminalità organizzata i quali anche la giurisprudenza costituzion persone di ottima volontà da una parte e da 41bis sì, 41-bis no, la risposta é una norm i detenuti in base ai reati commessi, possa detenuti normali e politici, lei non pensa la delinquenza comune e politica. La sua l'altro per gli avvenimenti emersi recentem organizzazioni criminali diverse, comunque, banche del nord e del sud ä abissale: lei c gli istituti bancari del nord e del sud: è dichiarazioni e fatti. Credo sia compito in cui un certo numero di omicidi, che inducon l'altro abbiamo creato cento posti nell'osp questi vi ä un conflitto terribile, come av i detenuti. PRESIDENTE. 
Tutti i magistrat l'articolo 41-bis e l'articolo 13 della Cos i giovani, vi ä un'adesione al malumore pop la magistratura e il carcere ai fini delle gli stessi uomini d'onore"; infatti gli uom i detenuti ed il mondo esterno hanno parlat di loro vi siano contatti; ma all'esterno a tre a San Gimignano per un altro reato. Sug loro: ä una tragedia che il fenomeno sia ar maggioranza e Governo, che hanno portato di l'esecutivo e la magistratura, che suscita l'altro, ho raccolto vivaci critiche sulle gli indici di anomalia delle operazioni ste cittadino e Stato) e, dall'altro, all'inver 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 e, prevedendo eventualmente un collegamento ato per quanto riguarda i problemi del sud, ntito l'efficace funzionamento del sistema. uello informativo e operativo. Sono questi, tica nazionale ed internazionale, che tenga presenti nell'animo siciliano: il dualismo re la questione delle sue fonti di guadagno o male, per poi trovare la medicina adatta. questo tipo di servizio: vi ä una rotazione iudici faziosi e, in qualche caso, di faide in cui si affronta il capitolo dei rapporti beni. Permanenza di un rapporto attuale à difficile una linea di demarcazione netta che costituivano il momento di collegamento tuale della verifica delle tesi di accusa la giusta autonomia e la giusta distinzione ichiarazione chiara, ma ciò non ä avvenuto. zioni, e non avendolo fatto credo di essere di recuperare una compatibilità funzionale bile, ma almeno che neutralizzi le distanze n ordinario, banale e fisiologico confronto n ordinario, banale e fisiologico confronto OPELLITI. 
Potremmo convocare la Commissione ci auguriamo, quindi, che a breve riavremo uesta sede abbiamo soltanto discusso, ed io gressa normativa riguardano: la distinzione vità svolta, alla dimensione ed al rapporto ne di servizi di pagamento, la demarcazione stessa disciplina, per stabilire i confini uno scarto enorme, o comunque consistente, lo ripeto, senz'altro si rileva uno scarto do esattamente l'espressione da lei usata - fra fra Fra fra fra fra fra Fra fra fra fra fra fra fra fra fra Fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra fra 156 i due momenti ai limitati fini di garantire i quali in primo luogo vi ä quello della ma quei punti vanno certamente inseriti la sca gli altri, gli obiettivi della prossima con l'altro conto delle risultanze del referend società e Stato, il ripiegamento sulla fami le quali la principale ä il commercio delle le medicine vi ä anche l'isolamento, indubb tutto il personale? C'ä una determinata cat giudici), ha addirittura assunto un'altra i mafia e politica, tutto da verificare. I gi mafia e politica. Ferdinando Imposimato ha militare e colletto bianco: non esiste fors l'organizzazione militare e la società, cio le informazioni che provengono dall'interno i pubblici ministeri ä evidente che esister l'altro aggiungo che l'intervista con le di coloro che possono essere autorizzati a for i suoi componenti (perché si tratta di comp i gruppi politici per un lavoro in comune a opzioni diverse e nemmeno mi ä sembrato un opzioni diverse e nemmeno un ordinario, ban le 13,30 e le 14,30, tenendo conto degli im noi l'onorevole Ayala e che questa Commissi i primi ho fatto rilevare che impostare l'a intermediari finanziari ed i soggetti non o indebitamento e patrimonio sono iscritti in operatività nei confronti del pubblico e no l'una e l'altra normativa. 
Comunque, le l'entità del fenomeno del riciclaggio e qua l'entità del fenomeno e la quantità di atti il testo unico della legge bancaria e la le 157 95 Gazzetta Ufficiale, nel quale sono dettate, fra l'altro, e sempre in attuazione dell'artico 96 rganizzazioni criminali di tipo mafioso - e fra queste naturalmente tutta la materia riguar __________________________________________________________________ Bibliography Aarts, J. (1991) “Intuition-Based and Observation-Based Grammars”, in: Aijmer and Altenberg 1991:44-62 Aarts, J./de Haan, P./Oostdijk, N. (eds.) (1993) English Language Corpora: Design, Analysis and Exploitation, Amsterdam: Rodopi (Language and Computers: Studies in Practical Linguistics) Aarts, J./Meijs W. (eds.) (1984) Corpus Linguistics, Amsterdam:Rodopi Aarts, J./Meijs W. (eds.) (1986) Corpus Linguistics II, Amsterdam:Rodopi Aarts, J./Meijs W. (eds.) (1990) Theory and Practice In Corpus Linguistics, Amsterdam: Rodopi Ahrenberg, L./Merkel, M. (1996) “On Translation Corpora and Translation Support Tools: A Project Report”, in Aijmer et al. 1996 Aijmer, K. and Altenberg, B. (eds.) (1991) English Corpus Linguistics: Studies in Honour of Jan Svartvik, London: Longman Aijmer, K./Altenberg, B./Johansson, M. (eds.) (1996) Languages in Contrast: Papers From a Symposium on Text-Based Cross-Linguistic Studies, Lund: Lund University Press Armstrong, G. (1996) “Computer-Assisted Literary Analysis Using the TACT Text-Retrieval Program”, in: Computers & Texts 11(8) Aston, G. (1997) “Small and Large Corpora in Language Learning”, paper presented at the PALC Conference, University of Lodz, Poland Baker, M. (1993) “Corpus Linguistics and Translation Studies. Implications and Applications”, in: Baker et al. 1993:233-250 Baker, M. (1995) “Corpora in Translation Studies: An Overview and some Suggestions for Future Research”, in: Target 7(2):223-243 Baker, M. (ed.) 
(1998) Routledge Encyclopedia of Translation Studies, London: Routledge (Translating and Interpreting - Encyclopedia)
Baker, M./Francis, G./Tognini-Bonelli, E. (eds.) (1993) Text and Technology: In Honour of John Sinclair, Amsterdam: John Benjamins
Barnbrook, G. (1996) Language and Computers, Edinburgh: Edinburgh University Press
Bergenholtz, H./Schaeder, B. (eds.) (1979) Empirische Textwissenschaft: Aufbau und Auswertung von Text-Corpora, Königstein: Scriptor Verlag
Bernardini, S. (1997) “A ‘Trainee’ Translator’s Perspective on Corpora”, paper presented at the international conference Corpus Use and Learning to Translate, Centro Residenziale Universitario, Bertinoro 39
Biber, D. (1988) Variation Across Speech and Writing, Cambridge: Cambridge University Press
Biber, D. (1993) “Co-occurrence patterns among collocations: a tool for corpus-based lexical knowledge acquisition”, in: Computational Linguistics 19(3):549-556
Biber, D. (1993b) “Representativeness in corpus design”, in: Literary and Linguistic Computing 8:243-257
Biber, D./Conrad, S./Reppen, R. (1998) Corpus Linguistics: Investigating Language Structure and Use, Cambridge: Cambridge University Press (Cambridge Approaches to Linguistics)
Biber, D./Conrad, S./Reppen, R. (1998b) “Corpus-based Approaches in Applied Linguistics”, in: Applied Linguistics 15:169-189
Bortolini, U./Tagliavini, C./Zampolli, A. (1971) Lessico di frequenza della lingua italiana contemporanea, Milano: IBM Italia
Brill, E. (1993) A corpus-based approach to language learning, PhD Thesis, University of Pennsylvania: Department of Computing
Calzolari, N./Bindi, R. (1990) “Acquisition of lexical information from a large textual Italian corpus”, in: Proceedings of the Thirteenth International Conference on Computational Linguistics, Helsinki
39 See footnote 35
Chafe, W.
(1992) “The Importance of Corpus Linguistics to Understanding the Nature of Language”, in: Svartvik 1992a:79-97
Chomsky, N. (1957) Syntactic Structures, The Hague: Mouton
Chomsky, N. (1962) Paper given at the University of Texas 1958, 3rd Texas Conference on Problems of Linguistic Analysis in English, Austin: University of Texas
Chomsky, N. (1965) Aspects of the Theory of Syntax, Cambridge, MA: MIT Press
Cravetto, E./De Petri, L./Bosso, S./Ferrantino, R./Mazzucchetti, E./Pellicoro, E./Prato, G./Recupero, A./Rosso, C./Servetto, P./Vono, E. (1997) Dizionario Enciclopedico Multimediale, Torino: GarzantiUTET
Crowdy, S. (1993) “Spoken corpus design”, in: Literary and Linguistic Computing 8(4):259-296
Dardano, M./Trifone, P. (1983) Grammatica italiana con nozioni di linguistica, Bologna: Zanichelli
Dardano, M./Trifone, P. (1985) La lingua italiana, Bologna: Zanichelli
De Mauro, T./Mancini, F./Vedovelli, M./Voghera, M. (1993) Lessico di frequenza dell'italiano parlato, Milano: Etaslibri
D’Ovidio, F. (1933) Le correzioni ai “Promessi Sposi” e la questione della lingua, Napoli: Guida
Eaton, H. (1940) Semantic Frequency List for English, French, German and Spanish, Chicago: Chicago University Press
Feyrer, C. (1998) Modalität im Kontrast: Ein Beitrag zur übersetzungsorientierten Modalpartikelforschung anhand des Deutschen und des Französischen, Frankfurt am Main: Peter Lang (Europäische Hochschulschriften, Reihe XXI, Linguistik)
Fillmore, C. (1992) “‘Corpus linguistics’ or ‘Computer-aided armchair linguistics’”, in: Svartvik 1992a:35-60
Freigang, K.-H. (1998) “Machine-Aided Translation”, in: Baker 1998:134-136
Friedbichler, I./Friedbichler, M. (1997) “Korpusgestütztes Übersetzen jenseits der Wortgrenzen”, in: Lebende Sprachen 2/97:49-53
Friedbichler, I./Friedbichler, M.
(1997) “The Potential of Domain-Specific Target-Language Corpora For The Translator’s Workbench”, paper presented at the international conference Corpus Use and Learning to Translate, Centro Residenziale Universitario, Bertinoro 40
Fries, C. (1952) The Structure of English: An Introduction to the Construction of Sentences, New York: Harcourt-Brace
Fries, C./Traver, A. (1940) English Word Lists. A Study of their Adaptability for Instruction, Washington DC: American Council on Education
Fries, U./Tottie, G./Schneider, P. (eds.) (1994) Creating and Using English Language Corpora, Amsterdam: Rodopi
Gavioli, L. (1996) “Corpora And The Concordancer In Learning ESP: An Experiment In A Course Of Interpreters And Translators”, paper presented at the 18th Congress of the Associazione Italiana di Anglistica, Genova
Gavioli, L./Zanettin, F. (1997a) “Corpus Use and Learning to Translate”, paper presented at the international conference Corpus Use and Learning to Translate, Centro Residenziale Universitario, Bertinoro 41
Gavioli, L./Zanettin, F. (1997b) “Comparable Corpora And Translation: A Pedagogic Perspective”, paper presented at the international conference Corpus Use and Learning to Translate, Centro Residenziale Universitario, Bertinoro 42
Gougenheim, G./Michéa, R./Rivenc, P./Sauvageot, A. (1956) L’Elaboration du français élémentaire, Paris: Didier
Granger, S. (1993) “International Corpus of Learner English”, in: Aarts et al. 1993:57-71
40 Copyright notice: The moral rights of the author(s) to be identified as author(s) of this work are asserted in accordance with ss. 77 and 78 of the Copyright, Designs and Patents Act 1988.
This work may be reproduced without the consent of the author, in part or in whole in any manner and in any medium subjected only to the two following conditions:
– no charge shall be made for the copy containing the work or the excerpt
– a copy of this notice shall precede the work or the excerpt
41 See footnote 35
Greenbaum, S. (1991) “The Development of the International Corpus of English”, in: Aijmer and Altenberg 1991:83-91
Halliday, M. A. K./Hasan, R. (1976) Cohesion in English, London: Longman
Halliday, M. A. K./Hasan, R. (1985) An Introduction to Functional Grammar, London: Edward Arnold
Hughes, G. (1997) “Developing a Computing Infrastructure for Corpus-based Teaching”, in: Wichmann et al. 1997:292-307
Jimenez, M. M. (1995) Sprache, Computer und Übersetzen, Diplomarbeit, Übersetzer- und Dolmetscherinstitut, Graz
Johansson, S. (ed.) (1982) Computer Corpora in English Language Research, Bergen: Norwegian Computing Centre for the Humanities
Johansson, S./Stenström, A.-B. (eds.) (1991) English Computer Corpora: Selected Papers and Research Guide, Berlin: Mouton de Gruyter
Johansson, S./Oksefjell, S. (1998) Corpora and Cross-linguistic Research, Amsterdam: Rodopi
Johansson, S. (1998) “On the Role of Corpora in Cross-linguistic Research”, in: Johansson and Oksefjell 1998
Johns, T. (1997) “Contexts: the Background, Development and Trialling of a Concordance-based CALL Program”, in: Wichmann et al. 1997:100-115
Käding, J. (1897) Häufigkeitswörterbuch der deutschen Sprache, Steglitz: privately published
Kenny, D. (1998) “Corpora in Translation Studies”, in: Baker 1998:50-53
Kenny, D. (forthcoming) Developing A Corpus-Based Methodology For Investigating Universal Features Of Translation, PhD thesis
Klaudy, K./Lambert, J./Sohár, A. (eds.) (1996) Translation Studies in Hungary, Budapest: Scholastica
Klaudy, K./Kohn, J. (eds.)
(1997) Transferre Necesse Est, Budapest: Scholastica
42 See footnote 35
Kohn, J. (1996) “What Can (Corpus) Linguistics Do for Translation?”, in: Klaudy et al. 1996:39-52
Krenn, H. (1996) Italienische Grammatik, Ismaning: Max Hueber Verlag
Kytö, M./Ihalainen, O./Rissanen, M. (eds.) (1988) Corpus Linguistics Hard and Soft, Amsterdam: Rodopi
Laffling, J. (1991) Towards High-Precision Machine Translation: Based on Contrastive Textology, New York: Foris Publications (Distributed Language Translation:7)
Lager, T. (1995) A Logical Approach to Computational Corpus Linguistics, PhD Thesis, University of Göteborg: Department of Linguistics
Laviosa-Braithwaite, S. (1997) “Investigating Simplification in an English Comparable Corpus of Newspaper Articles”, in: Klaudy and Kohn 1997:531-540
Laviosa-Braithwaite, S. (1998) “Universals of Translation”, in: Baker 1998:288-291
Leech, G. (1991) “The State of The Art In Corpus Linguistics”, in: Aijmer and Altenberg 1991:8-29
Leech, G./Candlin, C. (eds.) (1986) Computers in English Language Teaching, London: Longman
Leech, G./Fallon, R. (1992) “Computer Corpora - What Do They Tell Us About Culture”, in: ICAME Journal 16:29-50
Legenhausen, L. (ed.) (1996) Computers in the Foreign Language Classroom, proceedings of the workshop no. 2 of the annual meeting of the European Centre for Modern Languages, Graz: unpublished
Leitner, G. (ed.) (1992) New Dimensions in English Language Corpora, Berlin: Mouton de Gruyter
Levi, E./Dosi, A. (1982) I dubbi della grammatica, Milano: Longanesi & C.
Lorge, I. (1949) Semantic Content of the 570 Commonest English Words, New York: Addison Wesley
Louw, B. (1997) “The Role of Corpora in Critical Literary Appreciation”, in: Wichmann et al. 1997:240-251
Maia, B.
(1997a) “Making Corpora: A Learning Process”, paper presented at the international conference Corpus Use and Learning to Translate, Centro Residenziale Universitario, Bertinoro 43
Maia, B. (1997b) “Sentence Structure and Thematization in Comparable and Parallel Texts”, in: Klaudy and Kohn 1997:541-547
McEnery, T. (1992) Computational Linguistics, Wilmslow: Sigma Press
McEnery, T./Wilson, A. (1993) “The Role Of Corpora In Computer-Assisted Language Learning”, in: Computer Assisted Language Learning 6(3):233-248
McEnery, T./Wilson, A. (eds.) (1996) Corpus Linguistics, Edinburgh: Edinburgh University Press (Edinburgh Textbooks in Empirical Linguistics)
McEnery, T./Baker, P./Wilson, A. (1995) “A Statistical Analysis Of Corpus Based Computer Vs. Traditional Human Teaching Methods Of Part Of Speech Analysis”, in: Computer Assisted Language Learning 8(2/3):259-274
Meijs, W. (ed.) (1987) Corpus Linguistics and Beyond, Amsterdam: Rodopi
Merkel, M. (1993) “When And Why Should Translations Be Reused?”, paper presented at the XIII VAAKKI symposium, Vaasa
Merkel, M. (1996) “Consistency And Variation in Technical Translations – A Study of Translators’ Attitudes”, in: Proceedings from Unity in Diversity, Translation Studies Conference, Dublin
Mindt, D. (1992) Zeitbezug im Englischen: eine didaktische Grammatik des englischen Futurs, Tübingen: Gunter Narr
Mindt, D. (1996) “English Corpus Linguistics and the Foreign Language Teaching Syllabus”, in: Thomas and Short 1996:232-247
Newmark, P. (²1994) La traduzione: problemi e metodi, Milano: Garzanti (Strumenti di studio)
43 See footnote 35
Peters, C./Picchi, E. (1997) “Reference Corpora and Lexicons for Translators and Translation Studies”, in: Trosborg 1997:247-274
Porozinskaya, G. (1997) “Aspects of Literary and MT Editing in Teaching Translation”, in: Klaudy and Kohn 1997:553-557
Quirk, R./Greenbaum, S./Leech, G./Svartvik, J.
(1985) A Comprehensive Grammar of the English Language, London: Longman Reinke, U. (1997) “Computergestützte Kommunikation im Übersetzungsunterricht?”, in Lebende Sprachen 4/97:145-153 2 Reiß, K. ( 1983) Texttyp und Übersetzungsmethode : der operative Test, Heidelberg : Groos Renouf, A. (1987) “Corpus Development”, in: Sinclair 1987:1-40 Renouf, A. (1997) “Teaching Corpus Linguistics to Teachers of English”, in: Wichmann et al. 1997:255-266 Renzi, L./Salvi, G./Cardinaletti, A. (1995) Grande grammatica italiana di consultazione, Bologna: il Mulino Rico Pérez, C./Martín De Santa Olalla Sánchez, A. (1997) “New Trends in Machine Translation”, in Meta 4/97:605-621 Rissanen, M. (1989) “Three Problems Connected With The Use Of Diachronic Corpora”, in: ICAME Journal 13:16-19 Rogers, M. (1997) “Synonymy and Equivalence in Special-language Texts: A Case Study in German and English Texts on Genetic Engineering”, in: Trosborg 1997:217-245 Salvi, G./Vanelli, L. (1992) Grammatica essenziale di riferimento della lingua italiana, Firenze: Istituto Geografico De Agostini, Le Monnier Serianni, L. (1989) Grammatica Italiana: suoni, forme, costrutti, Torino: UTET Short, M./Semino, E./Culpeper, J. (1996) “Using a Corpus For Stylistics Research: Speech And Thought Presentation”, in: Thomas and Short 1996:110-131 Sinclair, J. (ed.) (1987) Looking Up, London: Collins 2 Sinclair, J. (ed.) ( 1992) Corpus, Concordance, Collocation, Oxford: Oxford University Press (Describing English Language) 165 Bibliography __________________________________________________________________ Somers, H. L. (1998) “Machine Translation: Applications”, in: Baker 1998:136-139 Somers, H. L. (1998) “Machine Translation: History”, in: Baker 1998:140143 Somers, H. L. (1998) “Machine Translation: Methodology”, in: Baker 1998:143-149 Souter, C. and Atwell, E. (eds.) (1993) Corpus Based Computational Linguistics, Amsterdam: Rodopi Stenström, A.-B. 
(1987) “Carry-on Signals in English Conversation”, in: Meijs 1987:87-119 Stubbs, M. (1996) Text and Corpus Analysis, Computer-assisted Studies of Language and Culture, Oxford: Blackwell (Language in Society) Summers, D. (1996), “Computer Lexicography – The Importance of Representativeness in Relation to Frequency”, in: Thomas and Short 1996:260-266 Svartvik, J. (ed.) (1990) The London-Lund Corpus of Spoken English: Description and Research, Lund: Lund University Press Svartvik, J. (ed.) (1992a) Directions in Corpus Linguistics, Berlin: Mouton de Gruyter Svartvik, J. (1992b) “Corpus Linguistics Comes of Age”, in: Svartvik 1992a:7-13 Thomas, J./Short, M. (eds.) (1996) Using Corpora for Language Research, Studies in the Honour of Geoffrey Leech, New York: Longman Tribble, C. (1997) “Improvising Corpora in ELT: Quick-And-Dirty Ways Of Developing Corpora For Language Teaching”, paper presented at the first international conference Practical Applications in Language Corpora, University of Lodz, Poland Tribble, C./Jones, G. (1990) Concordances in the Classroom: A Resource Book for Teachers, London: Longman Trosborg, A. (ed.) (1997) Text Typology and Translation, Amsterdam: John Benjamins Publishing Co. (Benjamins translation library:26) 166 Bibliography __________________________________________________________________ Toury, G. (1991) “What are Descriptive Studies into Translation Likely to Yield apart from Isolated Descriptions”, in: van Leuven-Zwart and Naaijkens, 1991:172-192 Van Leuven-Zwart, K./Naaijkens, T. (eds.) (1991) Translation Studies: The State of the Art: Proceedings from the First James S. Holmes Symposium on Translation Studies, Amsterdam: Rodopi Varantola, K. (1997) “Translators, Dictionaries and Text Corpora”, paper presented at the international conference Corpus Use and Learning to Translate, Centro Residenziale Universitario, Bertinoro 44 Venuti, L. (ed.) (1992) Rethinking Translation, New York: Routledge Venuti, L. 
(1992) “Introduction”, in: Venuti 1992:1-15 Venuti, L. (1995) The Translator's Invisibility: A History of Translation, New York: Routledge Wichmann, A./Fligelstone, S./McEnery, A./Knowles, G. (1997) Teaching and Language Corpora, London: Longman (Applied Linguistics and Language Study) Wolff, Dieter (1996) MULTICONCORD: A Multilingual Parallel Concordancer, in: Legenhausen 1996:74-79 Wright, S. (1993) “In Search of History: English Language In the Eighteenth Century”, in: Aarts et al. 1993:25-39 44 See footnote 35 167 __________________________________________________________________ Index Alignment, 39 Annotation, 30, 33, 35, 38 Artificial intelligence (AI), 77 CALL, 71 Chomsky, Noam, 8, 13-14, 19 COCOA reference, 36, 63 Collins COBUILD, 16, 45 Collocating, 48 Comparability, 25, 28 Compilation, 29 Comparative linguistics, 12-13 Competence, 14, 15 Computer-aided translation (CAT), 39 Concordancers, 48-50, 63, 68, 96 Streaming, 50 Text-indexers, 50 In-memory, 50 Concordancing, 47 Copyright, 26, 44 CORPORA American Representative Corpus of Historical English Registers (ARCHER), 31 Bank of English, 16, 19, 57, 64 British National Corpus (BNC), 35, 45 Brown Corpus of American English, 28, 33, 44 Canadian Hansard, 39 Child Language Database (CHILDES7, 46 Computer Science Corpus of the Hong Kong University (HKUST), 45, 71 Corpora Project Språteknologi, 43 Corpus of Spoken American English (CSAE), 29 CRATER, 40 English-Norwegian Parallel Corpus, 41 ETAP, 42 FECCS, 42 Guangzhou Petroleum English Corpus (GPEC), 45, 71 Helsinki Historical English Corpus, 31, 36 IDS-Korpora, 64 International Archive of Modern English (ICAME), 29 International Corpus of English (ICE), 28 INTERSECT, 40 Kolhapur Corpus of Indian English, 44 Lancaster/IBM Spoken English Corpus, 20 Lancaster-Oslo/Bergen Corpus (LOB), 20, 26, 33 LINGUA, 40 London-Lund Corpus (LLC), 20, 29, 37, 38, 39 Longman-Lancaster Corpus, 36 MULTEXT, 40 MULTEXT-EAST, 40 Penn Treebank, 36 Proteus Project, 42 Scandinavian Project 
of Contrastive Corpus Studies, 43 Scania Corpus, 43 Survey of English Usage (SEU), 16, 20, 61 168 Index __________________________________________________________________ Swedish Government Corpus, 43 Swedish Immigrant Newspaper Corpus, 43 Text-Based Contrastive Studies in English, 42 Translation Corpus of English and German, 42 Translearn, 42 TRIPTIC, 40 CORPUS Annotated, 23, 33 Comparable, 74-75, 80 Core, 28, 41 Developmental, 46 Diachronic, 30 General, 44-45 Learner, 46 Monitor, 16, 19, 20, 28, 45, 65 Monolingual, 39 Morphology, 65 Multilingual, 39, 74 Multimedia, 88 Parallel, 39-40, 74 Parsed, 34, 36, 60 Prosodic, 37 Raw, 33 Reference, 44 Specialised, 28, 45 Spoken, 32, 38 Sublanguage, 45 Supplementary, 41 Synchronic, 24 Tagged, 33, 36, 52 Untagged, 33 Corpus-based research, 18 Corpus-based studies, 21 Corpus creation, 24 Corpus outline, 24 Counting, 47 Cross-cultural studies, 23, 29 Design criteria, 23, 24,30 Dialect, 27, 31 Disambiguation, 48, 52 Discourse studies, 68 Distribution, 25, 28 Diversity, 24, 27 Document header, 36 Electronic texts, 81 Ethnolinguistics, 72 Exploitation tools, 21 Flexibility, 23 Frequency tables, 50-51, 98-99 Grammar, 60 Idiolect, 27, 31 Inductive learning, 22 Interdisciplinarity, 23 Introspection, 15, 16, 21, 61 Keyword in Context (KWIC), 49 Language acquisition, 15 LANGUAGE Learning, 57, 70 Pedagogy, 12 Promotion, 24 Teaching, 58, 70-71, 82 Variety, 23, 24 Lemmatising, 48 Lexicography, 62 Machine-readable form, 16, 18, 20, 26, 62 Machine translation (MT), 39, 43, 53, 76-78, 108 Example-based, 77 Statistics-based, 78 Networking, 56 Parsers, 32, 53, 84 Parsing, 47 Part-of-speech (POS) tag, 33, 38 POS tagging, 48, 95 Performance, 14, 15, 16 Permission, 27 Pragmatics, 67 Registers, 31-32 Register variation, 27 Representativeness, 18-19, 24, 30, 63, 97 Reusability, 22 Sampling, 19, 24-25, 28, 32 Proportional, 25 Stratified, 26 Searching, 47 Semantics, 13, 65 Size, 20, 25, 28, 97 Finite, 18-19 169 Index 
__________________________________________________________________ Specificity, 23 Spelling conventions, 12 Standard reference, 16, 18, 21 Subject matter, 27 Stylistics, 68 Syntax, 13 Tabling, 47 Taggers, 52-53 Target domain selection, 25 Termbanks, 81 Terminology, 62, 64 Translation, 73-90 Translation memory systems (TM), 79, 108 Translation research, 82 Universals of translation, 75 170