From Lexicon to Text: Pre-Target Structures Annotation in an L2 Italian Corpus

*Université Paris VIII  °Università di Salerno
Third International Lablita Workshop in Corpus Linguistics, June 2008

Today we present the first results of a classification of pre-target structures collected during a wider research project on the acquisition of second-language Italian syntax, based on a small test corpus of 23,000 words of written texts.

The main goal of the project was to compare learners' productions with those of native speakers with regard to the use and frequency of Noun and Verb Phrases in different types of text (Turco in press; Voghera 2008; Policarpi, Rombi, Voghera in press).

Contrasting needs…

Within this context we were looking for a tagging system:
- which allowed a straightforward comparison between non-native and native productions;
- which made minimal use of ad hoc categories for the description of L2 texts, so as to make them truly comparable with those of native speakers;
- which allowed the retrieval of non-native structures.

Consequently, we adapted a tagging system that was not originally designed for L2 analysis: AN.ANA.S., supported by XML and developed within the Tree-bank Project of the University of Salerno (cf. www.parlaritaliano.it).

What initially seemed a tentative step turned out to be a challenge: to conceive a system that could preserve the richness of the original annotation without leaving out the specificity of L2 texts.

AN.ANA.S.

AN.ANA.S. (Annotation and Analysis of Syntax) is a syntactic annotation system based on a manual approach (Voghera et al. 2004, 2005). Its main properties are:
1. alignment of the syntactic annotation with the signal, so as to have a multilevel representation;
2. focus on intraclausal relations;
3. flexible structures conceived for the annotation of a wide range of different texts: spoken and written, dialogues and monologues.

AN.ANA.S. L2 - TAGSET

We are going to show you the AN.ANA.S. L2 tagset and its DTD.
A DTD (Document Type Definition) is a set of declarations written in a special-purpose syntax. The DTD is part of the original XML specification and makes it possible to specify which elements and attributes may be used in a particular type of XML document and how they may be structured (W3C: World Wide Web Consortium).

XGATE

For the annotation and analysis of our corpus we used the XGate application (Cutugno, D’Anna 2006). XGate’s main purpose is to make text coding much easier and more user-friendly thanks to the support of XML.
– The Editor function allows us to create and modify XML files once a DTD has been defined.
– The Query function allows us to query the databases in order to obtain quantitative results.

AN.ANA.S. L2 allows us to retrieve pre-target structures
per level of syntactic encoding:
– Sentence
– Clause
– Phrase
per level of textual encoding:
– Text
– Paragraph
Finally, it allows us to retrieve lexical deviations affecting phrase heads.

Interlanguage

The present research has been developed within the framework of the functional approach applied to L2 studies (Huebner 1983; Long/Sato 1984; Givón 1984; Tomlin 1984; Sato 1990; Dittmar 1992; Perdue 1990, 1993; Klein/Perdue 1992; Giacalone Ramat/Crocco 1995) and the perspective of Interlanguage (IL) (Selinker 1972, 1992).

IL is considered as a series of grammars developed by the language learner at different points in the L2 acquisition process: “a separate linguistic system based on the observable output which results from a learner’s attempted production of a target language norm” (Selinker, 1972: 214).

IL grammar can be «systematic», «permeable», «transitional» and «discrete» (Selinker, 1972; Adjemian, 1976; Selinker, 1992; Perdue 1993, etc.).

L2 acquisition is systematic and, to a large extent, universal: it reflects how cognitive mechanisms control acquisition, irrespective of the learners' personal background, their mother tongue, or the setting in which they learn.
ILs at different stages of acquisition/learning present systematic linguistic features that can be described in terms of pre-target structures.

Pre-target Structures (1)

1) Structures perceived as ungrammatical with respect to the target language
– ho nato 5 marzo ‘(I) have born 5th March’, instead of sono nato il 5 marzo
– mi ha piaciuto ‘me has liked’, instead of mi è piaciuto ‘I liked’
– me va matto ‘me goes mad’, instead of vado matto ‘I go mad’
– in la casa, instead of nella casa ‘in the house’ [BEGNARR]

Pre-target Structures (2)

2) Structures that match the grammar of the target language but do not fit the context and/or do not convey the intended meaning
– Sono molto buono ‘I am very good’, instead of Sto molto bene ‘I am very well’
– Vorrei alti gradi all’Università ‘I’d like high degrees at the University’: alti gradi ‘high degrees’, instead of voti alti ‘high marks’ [BEGNARR]
– Un soggiorno ammobiliato e comodo e non più grande. Una cucina pratica e l’appartamento è non più caro. ‘A living room furnished is(?) comfortable and not more big. A functional kitchen room and the flat is not more expensive’: più grande ‘more big’, instead of troppo grande ‘too big’; più caro ‘more expensive’, instead of troppo caro ‘too expensive’ [BEGDESC]

The Corpus

We have tagged a test corpus of 23,000 written words. The quantitative analysis we present here is based on about 6,000 words per learning level:
Beginner: 6,179
Intermediate: 6,136
Advanced: 6,151

Collection of home-written productions: narrative, descriptive and argumentative compositions.

Subjects: 50 undergraduate students taking optional Italian language courses, chosen by students as part of a combined honours degree (Greenwich University, London).

Different mother tongues (L1s): English, Spanish, French, Portuguese, Greek, etc. … and English as a second language.

Some Remarks

As we well know, annotation is not a straightforward process.
1) A pre-target structure may have different scopes, that is, it may involve different levels of codification. Many cases of pre-target structures are clearly defined, but in many others a structure necessarily has to be tagged at more than one level.

ex. il giardino dovere Ø grande e balla con molta fiori
the garden must-INF be-Ø big and dance-3PERS/SING PRES SIMP with many-FEM/SING flowers-FEM/PLUR [BEGNARR]

2) We believe that there is not just one grammar of the target language. We have kept away from a tempting prescriptive approach, typically based on discrete and decontextualized features of language. Besides, as far as Italian is concerned, we must take into account the deep diatopic differences that can interfere with the learning process (Dal Negro/Molinelli 2002; Lepschy 2005).

ex. Ci stavano delle differenze
There-EXIST stay-3PERS-PLUR/IMPERF some differences
ci stavano ‘there stayed’, instead of (???) c’erano ‘there were’ [ADVNARR]

Levels of pre-target structures

TEXT LEVEL
Incipit of a letter
Tanti saluti da Pheonix. Visitavamo nostri amici per una settimana in albergo.
Many regards from Phoenix. We used to visit the-Ø our friends for a week at the-Ø hotel [BEGNARR]

PARAGRAPH LEVEL
In la casa le camere dovere grande e splendidamente decorate con i bagni. Che chiamo un bellissimo e speciale casa abitare in*.
in the house the rooms must BE-Ø big and splendidly decorated with the bathrooms. That (I) call a beautiful-MASC and special house-FEM to live in.
*instead of quello che io considero essere una casa bellissima e speciale in cui abitare ‘what I consider to be a beautiful and special house to live in’ [BEGDESC]

SENTENCE LEVEL
Non potrei nuotare e la mia mamma ha deserta insegnarla.
(I) could not swim and the my mother has deserted*-FEM teach her (PRON-COREF of “swim”)
*instead of e la mia mamma ha desiderato insegnarmelo (?) ‘and my mother wished to teach it to me’ (?)
[BEGNARR]

CLAUSE LEVEL
Il giorni più belle è mia sorella giornata del matrimonio*
the-MASC/SING most beautiful-FEM/PLUR day-MASC/PLUR is my sister day of the wedding
*instead of la giornata del matrimonio di mia sorella ‘my sister’s wedding day’ [BEGDESC]

Tutta la classe hanno mostrato i loro piatti e ha giudicato dalla mia preside*
The whole class have shown their dishes and has judged by my headmaster
*instead of …è (stata) giudicata dalla mia preside ‘…it is/has been judged by my headmaster’ [INTNARR]

PHRASE LEVEL
la spiaggia più bello del mondo
The-FEM/SING beach-FEM/SING most beautiful-MASC/SING in the world [BEGNARR]

Sono scrittura voi dall’Italia*
(I) am writing-NOUN you-PERS/PRON/PLUR from Italy
*instead of Vi sto scrivendo dall’Italia ‘I’m writing to you from Italy’ [BEGNARR]

LEXICON
Il sole sta lucidando* e fa caldo
The sun is polishing and it is warm [BEGNARR]
*instead of BRILLARE = TO SHINE

Ho picchiato una macchina parcheggiata
(I) beat*-1PERS/SING PRES PERF a parked car down [BEGNARR]
*instead of TAMPONARE (?) = TO HIT

What data and what for?

We present here the first quantitative data on pre-target structures per linguistic level and learning level. Since we have been working on a test corpus, we do not claim statistical significance for our data. Rather, we show a trend analysis of the relative distribution of different pre-target structures.

How many pre-target structures?

We found 1,271 pre-target structures, distributed across the three learning levels as follows.
[Figure: pie chart, Pre-Target Structures per Learning Level; labels BEG, INT, ADV; slice values 17%, 51%, 32%]
[Figure: pie chart, Pre-Target Structures per Linguistic Level; labels TEXT, PARAGRAPH, SENTENCE, CLAUSE, PHRASE; slice values 2%, 11%, 12%, 51%, 16%]

Pre-Target Structures per Linguistic Level & per Learning Levels

Even if the advanced learners present the lowest share of pre-target structures, the types are basically the same.
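To put the learning-level shares on a comparable footing, the percentages can be turned into approximate counts and into a rate per 1,000 words. The sketch below is only illustrative: it assumes that the 51%, 32% and 17% slices belong to BEG, INT and ADV respectively (the slides state only that the advanced learners have the lowest share), and it uses the sub-corpus sizes given in the corpus description. The names `SHARES`, `WORDS` and `estimate` are hypothetical, and the resulting figures are rounded estimates, not counts reported by the study.

```python
# Approximate counts and rates derived from the reported percentage shares.
# ASSUMPTION: the 51/32/17 split maps to BEG/INT/ADV in that order.
TOTAL_PRE_TARGET = 1271

SHARES = {"BEG": 0.51, "INT": 0.32, "ADV": 0.17}  # assumed mapping
WORDS = {"BEG": 6179, "INT": 6136, "ADV": 6151}   # words per learning level


def estimate(total, shares, words):
    """Return {level: (approximate_count, rate_per_1000_words)}."""
    result = {}
    for level, share in shares.items():
        count = total * share
        result[level] = (round(count), 1000 * count / words[level])
    return result


if __name__ == "__main__":
    for level, (count, rate) in estimate(TOTAL_PRE_TARGET, SHARES, WORDS).items():
        print(f"{level}: about {count} structures, {rate:.1f} per 1,000 words")
```

Because the three sub-corpora are almost the same size, the per-1,000-word rates follow the raw shares closely; the normalisation would matter more with unevenly sized samples.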
[Figure: bar chart of pre-target structures at TEXT, PARAGRAPH, SENTENCE, CLAUSE and PHRASE level for BEG, INT and ADV learners; vertical axis from 0 to 60]

LEXICON

Pre-target structures affect the lexical level in 9% of cases: as expected, the number at the advanced level is nearly half that at the beginner level, although there is no linear progression from beginners to advanced learners.
[Figure: bar chart of LEXICON pre-target structures for BEG, INT and ADV learners; vertical axis from 0% to 12%]

We can notice that pre-target structures follow the same distribution across the three learning levels: from the beginning, learners seem to have relatively good control of the highest levels of textual and syntactic planning, i.e. the sentential and clausal levels. By contrast, all learners seem to perform less well at the phrase level.

This supports results from studies on other languages (e.g. English, Kroll 1990), where it has been found that learners may exhibit varying degrees of control over writing. “We cannot predict students’ ability to perform in one area on the basis of their performance in the other area” (Kroll 1990: 150).

Hypotheses

H1: We have fewer pre-target structures at the textual level because textual competence is related to higher education and strictly dependent on the type of writing assessment. Learners were relatively culturally homogeneous.

Considering the type of text and the learners we have taken into account, the text level can be seen as the least “marked” learning level, for:
1. Cultural reason: common literacy tradition
2. Linguistic reason: common textual-literary tradition and typological proximity
3. Psychological reason: transferability and learner’s perception
(Eckman 1977, 1985; Kellerman 1979)

Hypotheses

H2: In Italian it is at the phrase level that most of the choices related to grammatical categories must be made: gender, number, definiteness, case or preposition choice… All this unavoidably leads to the production of a higher number of deviations from the target structures.

Our data support the idea that, at the first stages, linguistic learning proceeds from the top planning levels to the bottom ones. This means that learners take advantage of the textual frame to offset the deficits that may affect lower levels (e.g. the phrase).

An ill-structured phrase receives significance from a well-formed textual structure. Let’s look at the following example…

Example
Abbiamo parlati tutto il giorno e la notte, e da allora, noi amore l’altro* e ci sposiamo.
We have spoken-PLUR all day and all night long, and since then, we love-NOUN the other and we get married
*instead of ci amiamo ‘we love each other’ [INTNARR]

Conversely, a well-formed phrase loses significance if inserted in an ill-formed textual structure.

Incipit of a letter
Tanti saluti da Pheonix. Visitavamo nostri amici per una settimana in albergo.
Many regards from Phoenix. We used to visit the-Ø our friends for a week at the-Ø hotel [BEGNARR]

Final remarks and comments

As often happens, research opens up new questions and insights.

As far as the annotation is concerned, AN.ANA.S. L2 has done a pretty good job. However, we need to formalize the possibility of tagging a structure at more than one level. If this becomes feasible, we will be able to distinguish deviant structures with a local and/or a global scope. Local and global are not mutually exclusive.
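One way to picture the multi-level tagging called for here is to let each annotated structure carry a set of levels rather than a single one. The following is a minimal sketch and not part of AN.ANA.S. L2: the class and function names are hypothetical, and the boundary between local and global scope (`global_from`) is an assumption made only for illustration, since no such boundary is defined in this presentation. The example tags in modo da potrei at both clause and sentence level purely for demonstration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Linguistic levels, ordered from lexicon to text."""
    LEXICON = 0
    PHRASE = 1
    CLAUSE = 2
    SENTENCE = 3
    PARAGRAPH = 4
    TEXT = 5


@dataclass(frozen=True)
class PreTargetStructure:
    """A deviant structure that may be tagged at several levels at once."""
    form: str
    levels: frozenset


def scopes(structure, global_from=Level.SENTENCE):
    """Return the set of scopes ("local", "global") of a structure.

    ASSUMPTION: levels at or above `global_from` count as global,
    all lower levels as local. The threshold is illustrative only.
    """
    return {"global" if level >= global_from else "local"
            for level in structure.levels}


example = PreTargetStructure(
    form="in modo da potrei",
    levels=frozenset({Level.CLAUSE, Level.SENTENCE}),  # illustrative tagging
)
print(sorted(scopes(example)))  # ['global', 'local']
```

Returning a set rather than a single label keeps the two scopes non-exclusive: a structure tagged at several levels can be both local and global at once.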
From a linguistic viewpoint, even if we know that both top-down and bottom-up strategies are at work in language learning (among others Selinker et al., 2004), it could be interesting to explore this top-down frequency pattern of pre-target structures:
– by analyzing pre-target structures across different text types;
– by moving from a test corpus to a larger corpus;
– by comparing spoken data with written data;
– by comparing the acquisitional stages in L2 speaking and writing (e.g. L2 Italian oral descriptions by Progetto Pavia).

Last but not least

Perhaps we could have better entitled our contribution From Text to Lexicon rather than From Lexicon to Text.

Bibliography

Adjemian, C. (1976) “On the nature of interlanguage systems”. Language Learning 26 (2), 297-320.
Cutugno, D’Anna (2006) “Limiti e complessità del recupero delle informazioni da tree-bank sintattiche”. Atti del convegno della SLI, Vercelli, settembre 2006.
Dal Negro, S./Molinelli, P. (2002) Comunicare nella torre di Babele. Repertori plurilingui in Italia oggi. Roma: Carocci.
Dittmar, N. (1992) “Grammaticalization in second language acquisition”. Studies in Second Language Acquisition 14, 249–257.
Eckman, S. (1977) “Markedness and contrastive analysis hypothesis”. Language Learning 27, 315-330.
Eckman, S. (1985) “Some theoretical and pedagogical implications of the markedness differential hypothesis”. Studies in Second Language Acquisition 7, 289-307.
Huebner, T. (1983) A Longitudinal Analysis of The Acquisition of English. Ann Arbor, MI: Karoma.
Kellerman, E. (1979) “Transfer and non-transfer: where we are now”. Studies in Second Language Acquisition 2, 37-57.
Klein, W., Perdue, C. (1992) Utterance structure. Developing grammars again. Amsterdam: Benjamins.
Kroll, B. (1990) Second Language Writing: Research and Insights for the Classroom. Cambridge: Cambridge University Press.
Givón, T. (1984) On Understanding Grammar. New York: New York Academic Press.
Lepschy, G. (2005) “Lo standard”. In Lepschy, A.L./Tamponi, A.R. (a cura di), Prospettive dell’italiano come lingua straniera. Perugia: Guerra, 15-21.
Long, M. H., & Sato, C. J. (1984) “Methodological issues in interlanguage studies: an interactionist perspective”. In Davies, A., Criper, C., & Howatt, A. P. R. (eds.), Interlanguage (pp. 253-80). Edinburgh: Edinburgh University Press.
Perdue, C. (1990) “Complexification of the simple clause in the narrative discourse of adult language learners”. Linguistics 28, 983–1009.
Perdue, C. (1993) Adult language acquisition: cross-linguistic perspectives. Cambridge: Cambridge University Press.
Policarpi, Rombi, Voghera (in press) Classi lessicali e strategie sintattiche: nomi e verbi in sincronia e diacronia, accettato al Congresso SILFI 2008.
Sato, C. J. (1990) “Origins of complex syntax in interlanguage development”. Studies in Second Language Acquisition 10, 371-95.
Selinker, L. (1972) “Interlanguage”. International Review of Applied Linguistics 10, 209-31.
Selinker, L. (1992) Rediscovering interlanguage. New York: Longman.
Selinker, L. et al. (2004) “Linguistic structure with processing in second language research: is « unified theory » possible?”. Second Language Research 20, 77-94.
Tomlin, R.S. (1984) “The treatment of foreground-background in the on-line descriptive discourse of second language learners”. Studies in Second Language Research 9, 49-83.
Turco, G. (in press) “Complessità sintattica nell’italiano scritto L2”.
Voghera, M., Cutugno, F. (2004) “AN.ANA.S.: Analisi sintattica e annotazione XML a contatto”. In Albano Leoni F., Cutugno F., Pettorino M., Savy R. (a cura di), Il parlato italiano. Atti del Convegno Nazionale. Napoli: D'Auria Editore, M03.
Voghera, M., Basile, G., Cutugno, F., Fiorentino, G. (2005) “Sintassi in AN.ANA.S.”. In Albano Leoni F., Giordano R. (a cura di), Italiano Parlato. Analisi di un dialogo. Napoli: Liguori, 187-209.
Voghera, M. (2008) “La grammatica nei testi”. In A.L. Lepschy & A. Ledgeway (eds.), Didattica della lingua italiana: testo e contesto.
W3C: World Wide Web Consortium, http://www.w3.org/TR/REC-xml (accessed 27 May 2008)
www.parlaritaliano.it

Phrase-level errors

Examples
Errors concerning the head lexeme or properties of the head lexeme:
– NP -> la bella colore
– VP -> ha <deserta> insegnarla (= desiderato); valency errors…
– PP
Errors concerning agreement and/or word order within the phrase

Clause-level errors

Government of the subordinating connective: in modo da potrei instead of in modo da poter
In-between phrases:
– Per proteggere la macchina della pioggia
Verb valency:
– Ho voglia un campo di calcio (intended: Voglio un campo di calcio; nominal transposition)
– Non piace essere disordinata [not it likes to be messy]
– La Ø più importante è che …. [the-FEM most important Ø is that …]

Sentence-level errors

In-between clauses: subordinators/coordinators
Lack of
– Ex.: In la casa le camere dovere grande e splendidamente decorate con i bagni. Che chiamo un bellissimo e speciale casa abitare in. (transl. a special house to live in ?)
– Ex.:
Il giorno più bello della mia vita è il giorno <0> ho incontrato il mo ragazzo
– Vorrei un giardino <in modo da potrei> godere più meglio : + INF
– Gradisco fare il giardinaggio <in modo da mi assicurerò> che ho lotti dei fiori in esso <0> <da potrei godere> più meglio la sera

Paragraph-level errors

Text-level errors
Copy p. 11 ???

These errors can be further classified into the traditional categories used by Ellis of…. A few examples

APPENDIX

Pre-target structures can be deviant structures with different scope, i.e. they must be considered deviant at a local or global level.

Local vs. Global

Usually a deviant structure is considered to have a local scope when…. + example
– Ho nato 5 marzo 1983 [(I) have born 5th March 1983]
Usually a deviant structure is considered to have a global scope when…. + example
But in real linguistic parole, local and global are not mutually exclusive: example of a local deviation that is reflected at the global level.
The same pre-target structure may involve different levels of linguistic codification: from lexicon to text.

Example
Text tagged with XGate; some comments on recursion and levels of dependency….

DISCARDED
– Ex. Abbiamo bevuto molte tutti il tempo, era molto brilliante. Abbiamo le fotografie molte. [BEGNARR]
– We have drunk a lot for a long time, it was very brilliant. We have the photos-MASC/PLUR much-FEM/PLUR

Remarks

Doubt 1: a pre-target structure involves more than one linguistic level
Ex. Vorrei avere Ø mio ufficio nella mia casa, in modo da potrei studiare*.
[(I) like-1PERS/SING/COND to have my office in my house so to (I) can-1PERS/SING/COND study] ???
*in modo da poter studiare / in modo che possa studiare [BEGDESC]

Doubt 2: In some cases, which linguistic level should be tagged as WF=F?
Ex.
ho deciso di andare al Casino giocare e scommettere al Roulette
[(I) decide-PRES PERF go-INF to the Casino play-INF and bet on the Roulette] [INTNARR]

Remarks

Doubt 3: whether a case should be tagged as well-formed or not (WF=T/F)
Abbiamo bevuto molte tutti il tempo, era molto brilliante. Abbiamo le fotografie molte.
We have drunk a lot for a long time, it was very brilliant. We have the photos-MASC/PLUR much-FEM/PLUR [BEGNARR]

Remarks

Nonetheless, whenever feasible, we try to preserve the correctness of the clause as far as possible.
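The WF=T/F flag raised in the doubts above lends itself to simple per-level retrieval. As a rough sketch of the idea, and not of XGate's actual Query function, the Python fragment below counts the elements flagged as not well-formed at each level of a small XML sample. The element names, the lowercase `wf` attribute and the sample annotation itself are hypothetical: the real AN.ANA.S. L2 DTD is not reproduced here, and the flags in the sample do not represent the authors' tagging decisions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation: element and attribute names are illustrative only.
SAMPLE = """
<text wf="T">
  <paragraph wf="T">
    <sentence wf="F">
      <clause wf="F">
        <phrase type="VP" wf="F">ho nato</phrase>
        <phrase type="NP" wf="F">5 marzo</phrase>
      </clause>
    </sentence>
    <sentence wf="T">
      <clause wf="T">
        <phrase type="VP" wf="T">fa caldo</phrase>
      </clause>
    </sentence>
  </paragraph>
</text>
"""

LEVELS = ("text", "paragraph", "sentence", "clause", "phrase")


def count_not_well_formed(xml_string):
    """Count the elements carrying wf="F" at each annotation level."""
    root = ET.fromstring(xml_string.strip())
    counts = Counter({level: 0 for level in LEVELS})
    for element in root.iter():  # iter() also visits the root element
        if element.tag in LEVELS and element.get("wf") == "F":
            counts[element.tag] += 1
    return counts


if __name__ == "__main__":
    for level, n in count_not_well_formed(SAMPLE).items():
        print(f"{level}: {n}")
```

In this sample the same deviation is flagged at phrase, clause and sentence level while the paragraph and the text remain well-formed, which is the kind of multi-level case the doubts describe.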