STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS
Massimo Poesio
Università di Venezia

Course objectives
- An introduction to the use of corpora and to statistical methods

Course outline
- Foundations of statistics, use of corpora
- Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference
- POS tagging: tagsets, Brill tagger, HMM tagging
- System evaluation
- The lexicon
- Probabilistic grammars, statistical parsing

Today
- Statistics and Linguistics (Abney, 1996)
- Foundations of probability
- Corpora

Practical details
- Schedule: 10:30-13:00, 14:30-17:00
- Labs: 17:00-18:00 (not today)
- Office hours: 9:30-10:30, 18:00-19:00
- Email: [email protected]
- Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/

Empiricism vs. Rationalism
- Chomskyan linguistics:
  – Assumption: linguistic knowledge is mostly innate
  – Emphasis on explanation
  – Primary goal: simplicity of the theory
- Empirical methods:
  – Assumption: linguistic knowledge primarily derives from generalizations over experience
  – Emphasis on data
  – Primary goal: fact discovery
- Computational Linguistics between 1960 & 1980 was mostly Chomskyan

Problems statistical methods are meant to address
- Ambiguity resolution: previous choices were
  – Narrow domains to avoid ambiguity
  – Hand-coded rules
  – Hand-tuned preference weights
- Adaptation to new domains
- Measuring improvement

Case study: POS tagging
- “Time flies like an arrow”

  Word     Possible tags
  Time     N/V
  flies    N/V
  like     V/N/CJ
  an       Det
  arrow    N

- Ambiguity across the vocabulary:

  Number of tags    Number of word types
  1                 35340
  2                 3760
  3                 264
  4                 61
  5                 12
  6                 2
  7                 1

The rise of statistical methods
- The first area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR)
- ASR techniques were then used for POS tagging, and then in all areas of CL
- A synthesis of statistical methods and linguistic insights is now underway

Modern empiricism in Computational Linguistics
- Large data collections
- Rigorous collection techniques (interannotator agreement)
- Rigorous evaluation techniques
- Discovery of generalizations: via learning techniques

Statistics & the study of language?
- Theoretical advances
  – Language acquisition: the role of experience
  – Linguistic theory: graded grammaticality
  – Language change: shifts in grammaticality
- Empirical
  – Quantify linguistic phenomena
  – Analyze data
  – Test hypotheses
- Psychological
  – Express preferences

Some interesting statistics about language
- Lexical biases
  – Category: “bank” = Noun 85%, Verb 15%
  – Sense: Bank(river) 22%, Bank(money) 78%
- Syntax
  – Subcategorization of “realised”: NP 20%, S 65%, Other 15%
- Semantics / discourse
  – “he” in subject position 65% of the time

Corpora
- The use of statistical techniques has been made possible by the availability of CORPORA: large collections of text, typically ANNOTATED with linguistic information:
  – The Brown corpus (1M words) and the British National Corpus (100M words), annotated with POS tags (English)
  – Penn Treebank (4M words), syntactically annotated (English)
  – SEMCOR (250K words), annotated with word-sense information
  – The MapTask, annotated with dialogue information
  – Italian: CORIS (100M+ words, Bologna), Si-TAL (220K words, written, annotated with syntactic and word-sense information), IPAR (‘MapTask Italiano’)

Basic uses of corpora: Collocations
- COMPOUNDS: “computer program”, “disk drive”, “calcio di rigore” (Italian: ‘penalty kick’)
- PHRASAL VERBS: “wake up”, “come on”
- PHRASAL EXPRESSIONS: “bacon and eggs”, “the bees’ knees”, “siamo alla frutta” (Italian idiom, roughly ‘we have hit rock bottom’)

Bigrams: New York

  Frequency    Word 1    Word 2
  80871        of        the
  58841        in        the
  26430        to        the
  …            …         …
  15494        to        be
  …            …         …
  12622        from      the
  11428        New       York
  …            …         …

Statistical Language Processing
- Statistical inference:
  – Collect statistics about occurrences of X
  – Predict new occurrences
- Example: language modeling
  – Problem: predict the word that follows, given the previous ones
  – Find W_n that maximizes P(W_n | W_1 ... W_{n-1})
- Applications:
  – Speech recognition
  – Spell-checking
  – POS tagging
  – …

Bibliography
- Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996.
- Textbooks:
  – Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall
    More general, and easier to follow
  – Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press
    More complete, and written from a more linguistic perspective, but technically more advanced
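
Worked sketch: bigram counts and next-word prediction
The bigram frequency table and the language-modeling task stated in these slides (find the word W_n that maximizes the probability of W_n given the preceding words) can be tried out on a very small scale. The Python sketch below is only an illustration under stated assumptions, not the method used in the course: it conditions on the single previous word (a bigram approximation of the full history), uses unsmoothed relative frequencies, and runs on a made-up toy corpus. The names bigram_counts, predict_next and toy_corpus are hypothetical.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a sequence of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def predict_next(counts, prev_word):
    """Return (word, probability) maximizing the relative-frequency
    estimate of P(word | prev_word), or None if prev_word never
    appears as the first element of a bigram.

    Assumption: only the previous word is used as context, and no
    smoothing is applied, so unseen bigrams get probability zero.
    """
    followers = {w2: c for (w1, w2), c in counts.items() if w1 == prev_word}
    if not followers:
        return None
    total = sum(followers.values())
    best = max(followers, key=followers.get)
    return best, followers[best] / total


# Made-up toy corpus, for illustration only.
toy_corpus = "the cat sat on the mat and the cat slept".split()
counts = bigram_counts(toy_corpus)

# Frequency-ranked bigrams, in the spirit of the table in the slides.
for (w1, w2), freq in counts.most_common(3):
    print(freq, w1, w2)

# Most likely word after "the" under the bigram approximation.
print(predict_next(counts, "the"))
```

On a real corpus the same counting step would produce a table like the one in the Bigrams slide, and the zero probability assigned to unseen pairs is one reason the course outline lists smoothing among the basic techniques.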