GENERAL AND COMPUTATIONAL
LINGUISTICS,
PART 2
Lecture 1:
What is Computational Linguistics,
Introduction to the course
COMPUTATIONAL LINGUISTICS
• This second part of LG&C is
an introduction to COMPUTATIONAL
LINGUISTICS: the study of
computational and statistical models
of language INTERPRETATION
– Usually distinguished from CORPUS LINGUISTICS
(the use of computational and statistical models to
analyze CORPORA)
THIS LECTURE
• Summary of the relevant concepts of general
linguistics
• Interpretation: what are the problems?
• Applications of computational linguistics
• Course plan
Levels of linguistic analysis: a
quick summary
LEVELS OF LINGUISTIC ANALYSIS
• Phonetics and phonology
– “cat” = /k/ + /æ/ + /t/
• Words
– Parts of speech
– Morphology
• Syntax
• Semantics
• Discourse
PARTS OF SPEECH
• NOUNS (tavolo 'table', Simona)
• VERBS (camminare 'to walk', mangiare 'to eat', colpire 'to hit')
• ADJECTIVES (rosso 'red', rapido 'fast')
• ADVERBS (probabilmente 'probably', subito 'immediately')
• PRONOUNS (io 'I', lui 'he', ci 'us')
• ARTICLES (il, la 'the'; un 'a')
• PREPOSITIONS (di 'of', a 'to', con 'with')
• CONJUNCTIONS (e 'and', ma 'but', o 'or')
• [Italian]: INTERJECTIONS (ahi! 'ouch!')
MORPHOLOGY
• Words are not ‘atomic’ units: (in
Italian at least) they can almost always be
broken down into smaller units, the MORPHEMES
• A MORPHEME is “the smallest linguistic unit
that has a meaning of its own”
TWO EXAMPLES
REPURIFICARE (‘to repurify’) =
RE- `repetition’
+ PUR- `free of contaminants’
+ -IFICARE `to make’
STRUCTURE OF WORDS
• ENGLISH: ROOT + AFFIXES
– ROOT (boy)
– AFFIXES (-s in boy+s)
• ITALIAN: STEM + AFFIXES
– ROOT (ragazz-)
– STEM (root + thematic vowel, e.g., ragazzo)
– AFFIXES (-i in ragazz+i)
SYNTAX
• Words are organized in PHRASES
– I put THE BAGELS in the freezer
– I put THE BAGELS THAT WE HAD NOT EATEN in the freezer
• Phrases are classified according to their main
CONSTITUENT, or HEAD:
– Noun phrases:
• the bagels, the homeless old man that I tried to help yesterday
• Mary, she, one of them
– Verb phrases:
• Mary went to the store and bought a bagel
– Adjective Phrases:
• John is tall / very tall / quite certain to succeed
– Sentences
Marking Phrase Constituents
• BRACKETING:
– [S [NP The children] [VP ate [NP the cake]]]
• TREES:
S
  NP
    AT the
    NNS children
  VP
    VBD ate
    NP
      AT the
      NN cake
Syntax: goal
SYNTAX
• Recognize the constituents
• Recognize a correct structure
“[NP (io)] [PP Nel mezzo del cammin di nostra vita] [[VP mi ritrovai] [PP per una selva oscura]]”
• An Italian sentence with the structure (NP PP VP PP) is correct
“[?? Oscura per mezzo] [PP nel selva] [?? del nostra] [NP mi] [VP ritrovai] [?? di cammin vita una]”
• An Italian sentence with the structure (?? PP ?? NP VP ??) is incorrect
SEMANTICS
• Two types of semantic knowledge about words:
– ‘Denotational’ knowledge
– ‘Compositional’ knowledge
• Four types of theories:
– Referential
– Cognitive / mentalist
• Prototype theory
– Structural / relational
Denotational knowledge and
compositional knowledge
• DENOTATIONAL knowledge: knowledge
about the ‘word in itself’:
– The HORSE is an ANIMAL with a long mane …
– (The kind of knowledge typically found in
definitions)
• COMPOSITIONAL knowledge: knowledge
about how the word combines with other words
COMPOSITIONAL KNOWLEDGE
• From the compositional point of view at least
two distinctions can be made:
– Between PREDICATES and ARGUMENTS
– Between FUNCTION words and ‘CONTENT’ words
PREDICATES AND ARGUMENTS
Maria ha noleggiato una macchina (‘Maria rented a car’)
PREDICATE: ha noleggiato
ARGUMENTS: Maria, una macchina
Discourse
• Anaphora
– John arrived late. He always does that.
– My car didn’t start this morning. There was some
problem with the engine fan.
• Discourse relations:
– My car didn’t start this morning BECAUSE there
was some problem with the engine fan.
NLE
Dave Bowman: “Open the pod bay doors, HAL”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”
COMPUTATIONAL LINGUISTICS IN
2014: WHERE WE SHOULD BE ..
Amer. Good afternoon, Hal. How's everything going?
Hal. Good afternoon, Mr Amer. Everything is going extremely well.
Amer. Hal, you have an enormous responsibility on this mission, in many ways
perhaps the greatest responsibility of any single mission element. You are the brain
and central nervous system of the ship, and your responsibilities include watching
over the men in hibernation. Does this ever cause you any - lack of confidence?
Hal. Let me put it this way, Mr Amer. The 9000 series is the most reliable computer
ever made. No 9000 computer has ever made a mistake or distorted information.
We are all, by any practical definition of the words, foolproof and incapable of
error.
Amer. Hal, despite your enormous intellect, are you ever frustrated by your
dependence on people to carry out actions?
Hal. Not in the slightest bit. I enjoy working with people. I have a stimulating
relationship with Dr Poole and Dr Bowman. My mission responsibilities range over
the entire operation of the ship, so I am constantly occupied. I am putting myself
to the fullest possible use, which is all, I think, that any conscious entity can ever
hope to do.
2004/05
ANLE
… AND WHERE WE ARE
• In February 2011 the WATSON system
developed by IBM won at Jeopardy!,
beating two of the best-known champions of the
past
– http://www.youtube.com/watch?v=otBeCmpEKTs
2011/12
ELN
Models of interpretation in
computational linguistics
INTERPRETATION: THE
‘PIPELINE’ MODEL
PREPROCESSING -> LEXICAL PROCESSING -> SYNTACTIC PROCESSING -> SEMANTIC PROCESSING -> DISCOURSE PROCESSING
INTERPRETATION: THE
PIPELINE MODEL
Input: When did Watson win Jeopardy?
• PREPROCESSING: TOKENIZATION
• LEXICAL PROCESSING: POS TAGGING / WORD SENSE
• SYNTACTIC PROCESSING: IDENTIFY PHRASES
• SEMANTIC PROCESSING: PREDICATE/ARGUMENT
• DISCOURSE PROCESSING: ANAPHORA
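The pipeline idea can be sketched in a few lines of Python: each stage is a function, and the output of one stage is the input of the next. This is only an illustrative sketch, not how any real system is built. The function names, the toy lexicon, and the crude rules inside each stage are all assumptions made for this example, and the discourse stage is left out.

```python
import re

def preprocess(text):
    """PREPROCESSING: split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Tiny hand-written lexicon (an assumption made for this example only).
TOY_LEXICON = {"when": "WRB", "did": "VBD", "watson": "NNP",
               "win": "VB", "jeopardy": "NNP", "?": "."}

def lexical(tokens):
    """LEXICAL PROCESSING: attach a part-of-speech tag to each token."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "UNK")) for tok in tokens]

def syntactic(tagged):
    """SYNTACTIC PROCESSING: relabel proper nouns as one-word NPs (very crude)."""
    return [("NP", tok) if tag == "NNP" else (tag, tok) for tok, tag in tagged]

def semantic(chunks):
    """SEMANTIC PROCESSING: read off a predicate and its arguments."""
    args = [tok for label, tok in chunks if label == "NP"]
    preds = [tok for label, tok in chunks if label == "VB"]
    return {"predicate": preds[0] if preds else None, "arguments": args}

def run_pipeline(text):
    """Feed the output of each stage into the next one."""
    result = text
    for stage in (preprocess, lexical, syntactic, semantic):
        result = stage(result)
    return result

analysis = run_pipeline("When did Watson win Jeopardy?")
print(analysis)
```

The point of the design is that an error made early (for example a wrong tag) is passed on to every later stage, which is one reason ambiguity is such a central problem.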
THE STRUCTURE OF WATSON
TOKENIZATION
C’ERA UNA VOLTA UN PEZZO DI LEGNO. (‘Once upon a time there was a piece of wood.’)
C’ERA | UNA | VOLTA | UN | PEZZO | DI | LEGNO. |
C’ | ERA | UNA | VOLTA | UN | PEZZO | DI | LEGNO | . |
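The second, finer segmentation can be approximated with a single regular expression. The sketch below is an assumption about one possible rule set (an elided form keeps its apostrophe, every other punctuation mark is its own token); the names `TOKEN_PATTERN` and `tokenize` are invented for the example.

```python
import re

# An elided form such as C' stays attached to its apostrophe;
# any other punctuation mark becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+['’]|\w+|[^\w\s]")

def tokenize(text):
    """Return the list of tokens found in text, left to right."""
    return TOKEN_PATTERN.findall(text)

print(" | ".join(tokenize("C'ERA UNA VOLTA UN PEZZO DI LEGNO.")))
```

Splitting on whitespace alone would give the first segmentation, where the final period stays glued to LEGNO.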
PARTS OF SPEECH
Television/NN has/HVZ yet/RB to/TO work/VB
out/RP a/AT living/VBG arrangement/NN with/IN
jazz/NN ,/, which/WDT comes/VBZ to/IN the/AT
medium/NN more/QL as/CS an/AT uneasy/JJ
guest/NN than/CS as/CS a/AT relaxed/VBN
member/NN of/IN the/AT family/NN ./.
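Text in this word/TAG format is easy to read back into a program. The following sketch, with the hypothetical helper name `parse_tagged`, turns such a string into (word, tag) pairs using only the standard library.

```python
from collections import Counter

def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash, so that './.' gives ('.', '.').
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

sample = "a/AT relaxed/VBN member/NN of/IN the/AT family/NN ./."
tagged = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in tagged)
print(tagged)
print(tag_counts["NN"])
```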
SYNTACTIC ANALYSIS WITH
CONTEXT-FREE GRAMMARS
The cat sat on the mat
S
  NP
    Det the
    N cat
  VP
    V sat
    PP
      Prep on
      NP
        Det the
        N mat
LING 2000 - 2006
NLP
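To make the idea of parsing with a context-free grammar concrete, here is a minimal top-down parser written from scratch. The grammar rules and lexicon are assumptions chosen to cover just this one sentence, and the approach shown (naive recursive descent) is only one of several possible parsing strategies.

```python
# Toy grammar: each non-terminal maps to a list of possible right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "PP"], ["V", "NP"], ["V"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {"the": "Det", "cat": "N", "mat": "N", "sat": "V", "on": "Prep"}

def parse(symbol, tokens, start):
    """Yield (tree, end) for every way symbol can cover tokens[start:end]."""
    if symbol not in GRAMMAR:  # part-of-speech category: look up one word
        if start < len(tokens) and LEXICON.get(tokens[start]) == symbol:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, start, [])

def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols one after the other."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, nxt, children + [child])

def parse_sentence(sentence):
    """Return all complete parse trees for the sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

trees = parse_sentence("The cat sat on the mat")
print(trees)
```

A scrambled word order such as "cat the sat on mat the" yields an empty list, because no sequence of rules covers it.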
Processing Steps, IV: Semantic
Processing
• John went to the book store.
John, store1: go(John, store1)
• John bought a book.
buy(John,book1)
• John gave the book to Mary.
give(John,book1,Mary)
• Mary put the book on the table.
put(Mary,book1,table1)
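Predicate/argument structures like these can be stored as simple data and queried. The sketch below is one possible encoding; the `Fact` type and the helper names are assumptions made for illustration, while the four facts are the ones listed above.

```python
from collections import namedtuple

# A fact is a predicate name plus an ordered tuple of arguments.
Fact = namedtuple("Fact", ["predicate", "args"])

facts = [
    Fact("go", ("John", "store1")),
    Fact("buy", ("John", "book1")),
    Fact("give", ("John", "book1", "Mary")),
    Fact("put", ("Mary", "book1", "table1")),
]

def facts_about(entity, facts):
    """Return every fact in which entity appears as an argument."""
    return [f for f in facts if entity in f.args]

def render(fact):
    """Format a fact as predicate(arg1, arg2, ...)."""
    return f"{fact.predicate}({', '.join(fact.args)})"

for fact in facts_about("book1", facts):
    print(render(fact))
```

Using the same identifier book1 in three facts is what records that "a book" and "the book" refer to the same object.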
WHERE IS THE PROBLEM?
• Noise (typos, ungrammatical language, etc.)
• Ambiguity
• The role of common sense
BAD ENGLISH (AND ITALIAN) ON THE
WEB
CHINGLISH:
To take notice of safe: The slippery are very crafty
(“Take care, slippery”)
Note that the level of gap
(“Mind the gap”)
LANGUAGE CHANGE:
I brought two apple's
Black is different to white
SPAM:
Buongiorno
sono sempre in attesa delle vostre informazioni affinché
possa rapidamente le trasmetta al mio avvocato perché
possa rapidamente fare l’analisi della vostra cartella più
rapidamente che il possibile.
Grazie rapidamente di me gli inviati.
AMBIGUITY IN GRAMMATICAL
CLASSIFICATION
• Many word forms can be
associated with different parts of speech:
– STATO is both a noun (LO STATO ITALIANO, ‘the Italian state’) and a
verb (NON SONO STATO IO, ‘it wasn't me’)
PART-OF-SPEECH AMBIGUITY:
LEGGE1 (noun, ‘law’)
1 Rule, issued by the legislative bodies of the State, that establishes the rights and duties
of citizens; Legge delega: issued by the executive power by delegation of the
legislative power within a well-defined scope; Legge ponte: issued pending
another, more comprehensive one; A norma, a termini di legge: according to what the law
prescribes.
2 (ext.) The body of rules that make up the legal system of a State: la
legge è uguale per tutti; Essere fuori della legge: not to be protected by the law, or
not to feel subject to it; Dettar legge: to impose one's own will on everyone.
3 Legal science: laurea in legge; dottore in legge; facoltà di legge; Uomo di
legge: a specialist in legal science.
4 Judicial authority: ricorrere alla legge; In nome della legge: formula with which
representatives of the judicial authority order someone to obey one of its
commands: in nome della legge, aprite!
5 (ext.) Any rule that governs the individual or social conduct of people: le leggi
della società.
6 (ext.) Fundamental rule of a technique, an art and the like: le leggi della pittura.
7 A fixed and constant relation among the variable quantities involved in a
phenomenon: le leggi della matematica, della fisica.
LEGGE2 (verb form)
leggere (‘to read’)
v. tr. (pres. io lèggo, tu lèggi; pass. rem. io lèssi, tu leggésti; part. pass.
lètto)
1 To recognize words from the signs of writing and understand their
meaning: imparare, insegnare a leggere; leggere a voce alta; (abs.) to do some
reading, to devote oneself to reading: trascorro gran parte della giornata leggendo.
2 To interpret certain conventional or natural signs: i ciechi leggono con le
dita; leggere un diagramma; (fig.) Leggere la mano: to derive information about
someone's character and destiny from the lines of the hand.
3 (lit.) To interpret a text, a passage: i critici dell'Ottocento leggevano
erroneamente questa strofa; (ext.) to interpret or evaluate writings, events and
the like according to particular criteria: leggere un film in chiave ironica.
4 (fig.) To sense someone's thoughts and intentions: gli si legge il terrore sul volto.
STATISTICS ON AMBIGUITY IN THE
B.C.
Unambiguous (1 tag)     35,340
Ambiguous (2-7 tags)     4,100
  2 tags                 3,760
  3 tags                   264
  4 tags                    61
  5 tags                    12
  6 tags                     2
  7 tags                     1 (“still”)
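A table like this one can be computed from any tagged corpus by collecting the set of tags seen with each word. The sketch below does this on a few invented (word, tag) pairs, which are toy data and not taken from the corpus in the table, and then works out the share of ambiguous word types from the two totals given above.

```python
from collections import Counter, defaultdict

def ambiguity_histogram(tagged_tokens):
    """Count word types by the number of distinct tags they occur with."""
    tags_per_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_per_word[word.lower()].add(tag)
    return Counter(len(tags) for tags in tags_per_word.values())

# Toy data, invented for illustration only.
toy = [("the", "AT"), ("duck", "NN"), ("duck", "VB"), ("still", "RB"),
       ("still", "JJ"), ("still", "NN"), ("cat", "NN"), ("The", "AT")]
histogram = ambiguity_histogram(toy)
print(histogram)

# Share of ambiguous word types, from the totals in the table.
unambiguous, ambiguous = 35340, 4100
share = f"{ambiguous / (unambiguous + ambiguous):.1%}"
print(share)
```

Only about one word type in ten is ambiguous, but the ambiguous types tend to be very frequent words, so the share of ambiguous running words in a text is considerably higher.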
Part of Speech Tagging and Word
Sense Disambiguation
• [verb Duck ] !
[noun Duck] is delicious for dinner
• I went to the bank to deposit my check.
I went to the bank to look out at the river.
I went to the bank of windows and chose the
one dealing with last names beginning with
“d”.
Syntactic Disambiguation
• Structural ambiguity: I made her duck
– [S [NP I] [VP [V made] [NP her] [VP [V duck]]]]
– [S [NP I] [VP [V made] [NP [det her] [N duck]]]]
Semantics
Same event - different sentences
John broke the window with a hammer.
John broke the window with the crack.
The hammer broke the window.
The window broke.
Scope ambiguity
THE ROLE OF COMMON SENSE
• Winograd (1974):
– The city council refused the women a
permit because they feared violence.
– The city council refused the women a
permit because they advocated violence
NLP APPLICATIONS
• Mature, everyday technology that hardly anybody notices anymore
– E.g., tokenization, normalization, regular expression search
• Solid technology that is intensively used but can still be (and is being) improved
– E.g., lemmatization; spelling correctors; IR / Web search; speech synthesis
• Used in real applications, but substantial improvements still desired
– E.g., POS tagging; term extraction; summarization; speech recognition; text classification
(e.g., for spam detection); sentiment analysis
– Spoken dialogue systems for simple information seeking (railways, phone)
• ‘Almost there’ technology that exists in prototype form
– E.g., information extraction, generation systems, simple speech translation systems
• Pie in the sky
– Full machine translation, more advanced dialogue
Part I: Mature Technologies
• Research in NLE has been going on for many
years and in many forms – e.g., as part of
compiler technology, information retrieval,
etc.
• The results of this work are a number of well-established
technologies that are hardly
considered ‘research’ anymore
Basic Word Processing
TOKENISATION:
import java.util.StringTokenizer;
// Split the string on whitespace and print one token per line
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
prints out:
this
is
a
test
WORD COUNTING / FREQUENCIES
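Word counting is just as short in Python, the language used in the rest of the course. This is a minimal sketch using the standard library; the function name `word_frequencies` and the sample sentence are made up for the example.

```python
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, split it on whitespace, and count each token."""
    return Counter(text.lower().split())

freqs = word_frequencies("This is a test and this is another test")
for word, count in freqs.most_common(3):
    print(word, count)
```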
Regular Expressions for Search,
Validation and Parsing
• Basics (e.g., search in Google)
– cat OR dog
– “Regular * in Java”
• More advanced (e.g., regular expressions in Perl, Java, etc.) (for advanced search,
user input validation, etc.)
– /[Ww]ordnet/
– /colou?r/
– /Mas*imo Poesio/
– [a-zA-Z]*
– [^A-Z]
– /\$[0-9]+\.[0-9][0-9]/
• Note also: SUBSTITUTION
– s/colour/color/
• ELIZA:
– s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
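The same patterns can be tried out with Python's standard `re` module. The test strings below are invented for the example. Note that in Python the surrounding slashes are dropped and the Perl substitution s/X/Y/ becomes `re.sub(X, Y, text)`.

```python
import re

# Each pattern from the list, written as a Python regular expression.
print(bool(re.search(r"[Ww]ordnet", "the Wordnet lexicon")))    # upper or lower case W
print(bool(re.fullmatch(r"colou?r", "color")))                  # optional 'u'
print(bool(re.fullmatch(r"Mas*imo Poesio", "Massimo Poesio")))  # zero or more 's'
price = re.search(r"\$[0-9]+\.[0-9][0-9]", "it costs $19.99").group()
print(price)

# Substitution: s/colour/color/
american = re.sub(r"colour", "color", "my favourite colour")
print(american)

# ELIZA-style rule: \1 copies the word captured by the parentheses.
rule = r".* YOU ARE (depressed|sad) .*"
reply = re.sub(rule, r"I AM SORRY TO HEAR YOU ARE \1", "I THINK YOU ARE sad TODAY")
print(reply)
```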
Part II: Solid Technology that could still
use improvements
• Over the last ten to twenty years new
applications have appeared which are by now
fairly well established, but whose results are
still not 100% accurate (nor is it clear they ever
will be!)
Stemming, lemmatization and
morphological analysis
• Stemming:
– FOXES -> FOX
• Lemmatization:
– 'screeching', 'screeches', 'screeched' and 'screech' -> 'screech' (e.g., 'screeching' -> 'screech' +ING)
– 'were' -> 'be' +PAST
• (Sometimes) used for: Information Retrieval
– But: not in GOOGLE
• More general morphological analysis:
– Wissenschaftlichemitarbeiter ->
Wissenschaft + mitarbeiter (‘scientific’ + ‘collaborator’, i.e., researcher)
– Uygarlastiramadiklarimizdanmissinizcasina ->
Uygar     +las +tir   +ama     +dik   +lar +imiz +dan +mis  +siniz +casina
Civilized +BEC +CAUSE +NEGABLE +PPART +PL  +P1PL +ABL +PAST +2PL   +AsIf
– `(behaving) as if you are among those whom we could not civilize’
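The difference between stemming and lemmatization can be illustrated with a deliberately naive sketch. The suffix list and the one-entry table of irregular forms are assumptions for this example; real stemmers and lemmatizers use far richer rules and lexicons.

```python
SUFFIXES = ("ing", "es", "ed", "s")      # tried in this order
IRREGULAR = {"were": ("be", "+PAST")}    # toy exception list

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_lemma(word):
    """Return (lemma, feature), consulting the exception list first."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    stem = naive_stem(word)
    feature = "+" + word[len(stem):].upper() if stem != word else ""
    return (stem, feature)

print(naive_stem("FOXES"))
for form in ["screeching", "screeches", "screeched", "screech", "were"]:
    print(form, "->", naive_lemma(form))
```

Suffix stripping alone leaves 'were' unchanged; mapping it to 'be' needs lexical knowledge, which is what the exception table stands in for here.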
Morphological Analysis: the Xerox
tools
Word Prediction
• Systems that can complete the current word /
sentence (e.g., to help people with disabilities)
• E.g., the Aurora System
• Or textHelp!
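A very simple form of word completion can be sketched as follows: count the words in some text, then propose the most frequent known words that start with what the user has typed so far. This is an assumption about one possible approach, not a description of how the systems named above work; the function names and the sample text are invented.

```python
from collections import Counter

def build_model(text):
    """Count how often each word occurs in the training text."""
    return Counter(text.lower().split())

def complete(prefix, model, n=3):
    """Return up to n known words starting with prefix, most frequent first."""
    candidates = [(count, word) for word, count in model.items()
                  if word.startswith(prefix.lower())]
    # Sort by descending frequency, then alphabetically for a stable order.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in candidates[:n]]

model = build_model("the cat sat on the mat the cat ate the cake")
print(complete("ca", model))
```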
Spelling correction
• Word:
– Gettin -> getting
– Alway -> always
– But :
• olways -/-> always
• Definittely -/-> definitely
• Some shells:
> set correct = cmd
> lz /usr/bin
CORRECT>ls /usr/bin (y|n|e|a)?
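One common way to propose corrections is to pick the dictionary word with the smallest edit distance from the typo. The sketch below implements the standard Levenshtein distance; the three-word vocabulary is a toy assumption. Under this measure all four misspellings above are a single edit away from the intended word, so a corrector that misses 'olways' or 'Definittely' is presumably relying on some other heuristic.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def suggest(word, vocabulary):
    """Return the vocabulary word closest to word (case-insensitive)."""
    return min(vocabulary, key=lambda cand: edit_distance(word.lower(), cand))

vocabulary = ["getting", "always", "definitely"]
for typo in ["Gettin", "Alway", "olways", "Definittely"]:
    print(typo, "->", suggest(typo, vocabulary))
```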
Part of Speech Tagging
• Assign a PART OF SPEECH to each word:
– ‘dog’ -> NOUN
– ‘eat’ -> VERB
• Book that flight
VB DT NN
• Applications: all over the place!
– IR
– IE
– Translation
TEXT CLASSIFICATION:
SPAM DETECTION
From: "" <[email protected]>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even
thousands for similar courses
I am 22 years old and I have already purchased 6
properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
Dear Hamming Seminar Members
The next Hamming Seminar will take place on
Wednesday 25th May and the details are as follows:
Who: Dave Robertson
Title: Formal Reasoning Gets Social
Abstract: For much of its history, formal knowledge
representation has aimed to describe knowledge
independently of the personal and social context in which
it is used, with the advantage that we can automate
reasoning with such knowledge using mechanisms that
also are context independent. This sounds good until you
try it on a large scale and find out how sensitive to
context much of reasoning actually is. Humans, however,
are great hoarders of information and sophisticated tools
now make the acquisition of many forms of local
knowledge easy. The question is: how to combine this
beyond narrow individual use, given that knowledge (and
reasoning) will inevitably be contextualised in ways that
may be hidden from the people/systems that may
interact to use it? This is the social side of knowledge
representation and automated reasoning. I will discuss
how the formal reasoning community has adapted to this
new view of scale.
When: 4pm, Wednesday 25 May 2011
Where: Room G07, Informatics Forum
There will be wine and nibbles afterwards in the atrium café area.
SENTIMENT ANALYSIS
Id: Abc123 on 5-1-2008 “I bought an iPhone a few days
ago. It is such a nice phone. The touch screen is really
cool. The voice quality is clear too.
It is much better than my old Blackberry, which was a
terrible phone and so difficult to type with its tiny keys.
However, my mother was mad with me as I did not tell
her before I bought the phone. She also thought the
phone was too expensive, …”
Stylometry: Who wrote this?
“On the far side of the river valley
the road passed through a stark
black burn. Charred and limbless
trunks of trees stretching away on
every side. Ash moving over the
road and the sagging hands of blind
wire strung from the blackened
lightpoles whining thinly in the wind.”
Cormac McCarthy
Speech Synthesis
• Speech Synthesis (the automatic production
of speech from text or other computer-encoded
source) is much easier than speech
RECOGNITION and is currently a very hot area
in industry
• For a British example, check out Rhetorical
Systems
• US: AT&T
Part III: More advanced technologies
• From technologies that have been used for years but are
still problematic, to technologies available only in
prototype form
• (We will not cover these technologies in this
course)
Speech Recognition
• Speech recognition is fairly solid (and works
very well for digits)
– E.g., IBM’s Via Voice:
• http://www4.ibm.com/software/speech/enterprise/dcenter/demo_0.html
Summarization
• Summarization is the production of a
summary either from a single source (single-document
summarization) or from a collection
of articles (multi-document summarization)
• An example is the Columbia Newsblaster
Machine Translation
• Machine translation is one of the earliest
attempts at language technology (from the
’40s)
• Still mostly useful to get a quick idea of the
content of a text, but can sometimes work
reasonably well
• An example: Newstran.com
INFORMATION EXTRACTION:
REFERENCES TO (NAMED) ENTITIES
[Figure: annotated text with entity types SITE, LOC, CULTURE]
EXAMPLE OF IE APPLICATION: FINDING JOBS FROM
THE WEB
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
COURSE CONTENT
• The NLTK package, implemented in Python, makes it possible
to experiment with CL techniques even for those with little
programming experience
• During the course
– we will introduce Python while recapping the aspects of
computational linguistics already introduced in IDUL
– we will use NLTK to experiment with
• POS tagging
• Parsing
• Classification
• We will follow the textbook by Bird,
Klein & Loper fairly closely
The Textbook
http://www.nltk.org/book
OTHER INFORMATION
• Website:
– http://clic.cimec.unitn.it/massimo/Teach/ELN/
• Exam:
– Develop a small project in Python, to be presented
at the oral exam
• Office hours:
– By appointment
DOWNLOADING Python and NLTK
• First assignment: follow the instructions at
http://www.nltk.org/download