Information Extraction from Text
Why study it?
• It is a particularly complex application.
• It draws on most of the resources used in text-analysis tasks.
• Studying it therefore gives a good overview of the problems and
technologies involved in natural language analysis.
What is Information Extraction from Text?
• Information retrieval (IR): search for and find information in texts
in response to specific requests.
• Passage retrieval: search for and find passages (paragraphs, sentences)
within a text that can answer given questions.
• Information extraction (IE): find information that can fill
predefined schemas (templates).
• Question answering (QA): answer general questions posed by a user;
it combines IE and IR.
• Text understanding: model how humans understand texts.
Types of questions
• IR
• Passage retrieval
• IE
• Question answering
• Text understanding
Predefined: fixed aspects of the textual information
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
IE
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
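As a minimal sketch of slot filling, assuming three hand-written surface patterns are enough for this one passage, records like the ones in the table could be produced as follows. The abridged `TEXT`, the pattern set, and the `extract` helper are illustrative assumptions and not part of the original slides; a real extractor would need far more robust patterns or a learned model.

```python
import re

# Abridged version of the sample news passage (assumption: shortened for the sketch).
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against the "
    "economic philosophy of open-source software. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered."
)

# Very naive building blocks: two capitalized words for a person,
# one or more capitalized words for an organization, a closed list of titles.
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
ORG = r"(?P<org>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
TITLE = r"(?P<title>CEO|VP|founder)"

PATTERNS = [
    re.compile(rf"{ORG} {TITLE} {NAME}"),           # "Microsoft Corporation CEO Bill Gates"
    re.compile(rf"{NAME}, an? {ORG} {TITLE}"),      # "Bill Veghte, a Microsoft VP"
    re.compile(rf"{NAME}, {TITLE} of (?:the )?{ORG}"),  # "Richard Stallman, founder of the ..."
]


def extract(text):
    """Return one record (dict of slots) per pattern match, in pattern order."""
    rows = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            rows.append({
                "NAME": m.group("name"),
                "TITLE": m.group("title"),
                "ORGANIZATION": m.group("org"),
            })
    return rows


rows = extract(TEXT)
for row in rows:
    print(row)
```

Note that this sketch fills the ORGANIZATION slot with the literal span found in the text, so the first record carries "Microsoft Corporation" rather than a normalized "Microsoft".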
What is “Information Extraction”
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
(aka “named entity extraction”)
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
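The clustering step can be sketched as follows, assuming segmentation and classification have already produced typed mentions. The `PER`/`ORG` labels, the token-subset rule in `compatible`, and the `Cluster` class are illustrative assumptions, not the method of the original slides; titles such as CEO and VP are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    label: str                      # entity type shared by all mentions
    mentions: list = field(default_factory=list)

    @property
    def canonical(self):
        # Use the longest mention as the cluster's display name.
        return max(self.mentions, key=len)


def compatible(a, b):
    """Two mentions may corefer if one's tokens are a subset of the other's."""
    ta, tb = set(a.split()), set(b.split())
    return ta <= tb or tb <= ta


def cluster_mentions(mentions):
    """Greedy single pass: attach each mention to the first compatible cluster."""
    clusters = []
    for text, label in mentions:
        for c in clusters:
            if c.label == label and compatible(text, c.canonical):
                c.mentions.append(text)
                break
        else:
            clusters.append(Cluster(label, [text]))
    return clusters


# Mentions in text order (types are assumed output of the classification step).
MENTIONS = [
    ("Microsoft Corporation", "ORG"), ("Bill Gates", "PER"),
    ("Microsoft", "ORG"), ("Gates", "PER"), ("Microsoft", "ORG"),
    ("Bill Veghte", "PER"), ("Microsoft", "ORG"),
    ("Richard Stallman", "PER"), ("Free Software Foundation", "ORG"),
]

clusters = cluster_mentions(MENTIONS)
for c in clusters:
    print(c.label, c.canonical, c.mentions)
```

With this rule the four Microsoft mentions collapse into one entity and "Gates" joins "Bill Gates", while "Bill Veghte" stays separate because neither token set contains the other.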
An example: FASTUS (1993)
Bridgestone Sports Co. said Friday it had set up a joint venture in
Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at
20 million new Taiwan dollars, will start production in January
1990 with production of 20,000 iron and “metal wood” clubs a
month.
TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”
            “a local concern”
            “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000
ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990
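One way to picture these templates as data is a pair of linked records. The sketch below is only an illustration: the class and field names are assumptions chosen to mirror the slot names, and the amount follows the 20 million new Taiwan dollars stated in the passage.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    activity: str
    company: str
    product: str
    start_date: str


@dataclass
class TieUp:
    relationship: str
    entities: List[str] = field(default_factory=list)
    joint_venture_company: Optional[str] = None   # None means "slot not filled"
    activity: Optional[Activity] = None
    amount: Optional[str] = None


activity_1 = Activity(
    activity="PRODUCTION",
    company="Bridgestone Sports Taiwan Co.",
    product="iron and 'metal wood' clubs",
    start_date="DURING: January 1990",
)

tie_up_1 = TieUp(
    relationship="TIE-UP",
    entities=["Bridgestone Sports Co.", "a local concern", "a Japanese trading house"],
    joint_venture_company="Bridgestone Sports Taiwan Co.",
    activity=activity_1,          # the TIE-UP template points at the ACTIVITY template
    amount="NT$20000000",
)

print(tie_up_1)
```

Optional fields make it explicit that a template can be only partially filled, which matters later when extraction output is scored.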
How FASTUS works
1. Complex words and proper names
   e.g. “set up”, “new Taiwan dollars”
2. Simple phrases: noun groups, verb groups, particles
   e.g. “a Japanese trading house”, “had set up”
3. Complex phrases
   e.g. “production of 20,000 iron and metal wood clubs”
4. Relevant events: construction of simple templates
   e.g. [company] [set up] [Joint-Venture] with [company]
5. Template merging, when templates carry information about the same event
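The last two stages can be sketched roughly as follows. This is a toy illustration, not FASTUS itself: the chunk list stands in for the output of the first three stages with irrelevant material already dropped, a sliding-window comparison stands in for the system's finite-state patterns, and the chunk type names and the `merge` rule are assumptions.

```python
# Event pattern from stage 4: [company] [set up] [Joint-Venture] with [company]
PATTERN = ["COMPANY", "SET_UP", "JOINT_VENTURE", "WITH", "COMPANY"]

# Assumed output of stages 1-3 for the first sentence: (text, chunk type) pairs.
CHUNKS = [
    ("Bridgestone Sports Co.", "COMPANY"),
    ("had set up", "SET_UP"),
    ("a joint venture", "JOINT_VENTURE"),
    ("with", "WITH"),
    ("a local concern", "COMPANY"),
    ("and", "AND"),
    ("a Japanese trading house", "COMPANY"),
]


def match_tie_up(chunks):
    """Stage 4: find the event pattern and build a simple partial template."""
    types = [t for _, t in chunks]
    n = len(PATTERN)
    for i in range(len(types) - n + 1):
        if types[i:i + n] == PATTERN:
            entities = [chunks[i][0], chunks[i + n - 1][0]]
            j = i + n
            # Absorb coordinated partners: "and [company]".
            while j + 1 < len(chunks) and chunks[j][1] == "AND" and chunks[j + 1][1] == "COMPANY":
                entities.append(chunks[j + 1][0])
                j += 2
            return {"relationship": "TIE-UP", "entities": entities}
    return None


def merge(t1, t2):
    """Stage 5: merge two partial templates unless a shared slot conflicts."""
    for slot in t1.keys() & t2.keys():
        if t1[slot] != t2[slot]:
            return None
    return {**t1, **t2}


first = match_tie_up(CHUNKS)
# Partial template assumed to come from the second sentence.
second = {
    "relationship": "TIE-UP",
    "joint_venture_company": "Bridgestone Sports Taiwan Co.",
    "amount": "NT$20000000",
}
merged = merge(first, second)
print(merged)
```

The merge rule here is deliberately crude: it only checks that shared slots agree, whereas deciding that two templates describe the same event generally needs more evidence than that.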
Another example: a wrong template
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Wrong template:
  Name of the Venture: Yaxing Benz
  Products: buses and bus chassis
  Location: Yangzhou, China
  Companies involved: (1) Name: X?
                          Country: German
                      (2) Name: Y?
                          Country: China
The right template
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Crime-Type: Murder
Type: Stabbing
The killed: Name: Jurgen Pfrang
            Age: 51
            Profession: Deputy general manager
Location: Nanjing, China
Who performs the interpretation?
(1) IR: user
(2) Passage retrieval: user
(3) IE: system
(4) Question answering: system
(5) Text understanding: system
[Figure: IR system. Labels: query, collection of texts, text characterization, knowledge, interpretation.]
[Figure: Passage retrieval and IR. Labels: query, collection of texts, text characterization, knowledge, interpretation.]
[Figure: IE system. Labels: queries, collection of texts, natural language processing, texts, templates, knowledge, interpretation.]
IE: a pragmatic approach to NLP
[Figure: IE and the general approach to natural language processing/understanding. Labels: texts, templates, predefined.]
Performance evaluation
(1) IR: clear methodology
(2) Passage retrieval: unclear methodology
(3) IE: clear methodology
(4) Question answering: fairly vague methodology
(5) Text understanding: very vague methodology
Query
N: correct documents
M: retrieved documents
C: retrieved documents that are correct
[Figure: the set of documents, with the sets N and M and their overlap C.]
Precision: P = C / M
Recall: R = C / N
F-value: F = 2PR / (P + R)
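The three measures can be computed directly from the sets N and M. The sketch below assumes documents are identified by hypothetical ids and that an empty denominator should yield 0.0; the function name and sample ids are illustrative.

```python
def precision_recall_f(correct, retrieved):
    """Return (P, R, F) for a set of correct items N and retrieved items M."""
    correct, retrieved = set(correct), set(retrieved)
    c = len(correct & retrieved)                     # C: retrieved and correct
    p = c / len(retrieved) if retrieved else 0.0     # P = C / M
    r = c / len(correct) if correct else 0.0         # R = C / N
    f = 2 * p * r / (p + r) if p + r else 0.0        # F = 2PR / (P + R)
    return p, r, f


correct_docs = {"d1", "d2", "d3", "d4"}   # N = 4
retrieved_docs = {"d3", "d4", "d5"}       # M = 3, of which C = 2 are correct

p, r, f = precision_recall_f(correct_docs, retrieved_docs)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")   # P=0.67 R=0.50 F=0.57
```

Here P = 2/3 and R = 2/4, so F = 4/7: the F-value is the harmonic mean of the two and sits closer to the lower one.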
Query
N: correct templates
M: retrieved templates
C: correct templates that were retrieved
[Figure: the set of documents, with the sets N and M and their overlap C.]
Precision: P = C / M
Recall: R = C / N
F-value: F = 2PR / (P + R)
All of this is complicated further by the possibility of partially
filled templates.
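One possible way to handle partially filled templates is to count individual slots rather than whole templates. The sketch below assumes exact string match on slot values and treats each (slot, value) pair as one item; this scoring scheme, the `slot_scores` name, and the sample system response are assumptions for illustration, not a scheme prescribed by the slides.

```python
def slot_scores(gold, response):
    """Slot-level (P, R, F): each filled (slot, value) pair counts as one item."""
    gold_slots = {(k, v) for k, v in gold.items() if v is not None}
    resp_slots = {(k, v) for k, v in response.items() if v is not None}
    c = len(gold_slots & resp_slots)
    p = c / len(resp_slots) if resp_slots else 0.0
    r = c / len(gold_slots) if gold_slots else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Gold template for the Pfrang passage (6 filled slots).
gold = {
    "Crime-Type": "Murder",
    "Type": "Stabbing",
    "Name": "Jurgen Pfrang",
    "Age": "51",
    "Profession": "Deputy general manager",
    "Location": "Nanjing, China",
}

# Hypothetical system output: two slots left empty, one slot wrong.
response = {
    "Crime-Type": "Murder",
    "Type": None,
    "Name": "Jurgen Pfrang",
    "Age": "51",
    "Profession": None,
    "Location": "Yangzhou, China",
}

p, r, f = slot_scores(gold, response)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")   # P=0.75 R=0.50 F=0.60
```

The response fills 4 slots, 3 of them correctly, against 6 gold slots, so empty slots lower recall while the wrong location lowers precision.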