XML, Information Extraction
and
Document structuring
Maria Teresa PAZIENZA
Roma Tor Vergata University
Italy
In short
Why XML?
XML was created so that richly structured
documents could be shared over the web.
Sharing them requires integrating heterogeneous
and distributed data and information
sources.
Crema, 29 June 2001
In short
XML (Extensible Markup Language)
is a markup language for documents containing
structured information.
XML is designed to deliver structured
content over the web.
A markup language is a mechanism to identify structures in a document.
In short
Structured information contains both content
(words, pictures, etc.) and some indication of
what role that content plays (for example,
content in a section heading has a different
meaning from content in a footnote, which
means something different than content in a
figure caption or content in a database table,
etc.).
Almost all documents have some structure.
In short
The word "document" refers not only to
traditional documents, but also to the myriad
of other XML "data formats". These include
vector graphics, e-commerce transactions,
mathematical equations, object meta-data,
server APIs, and a thousand other kinds of
structured information.
In short
XML specifies neither semantics nor a tag set. In fact
XML is really a meta-language for describing
markup languages. In other words, XML provides a
facility to define tags and the structural
relationships between them.
Since there's no predefined tag set, there can't be any
preconceived semantics. All of the semantics of an
XML document will be defined either by the
applications that process it or by stylesheets.
XML documents
1. XML documents are composed of markup and
content.
2. Elements are the most common form of markup.
3. If an element is not empty, it begins with a
start-tag, <element>, and ends with an end-tag,
</element>.
XML documents
Attributes are name-value pairs that occur
inside start-tags after the element name.
For example,
<div class="preface"> is a div element with
the attribute class having the value preface.
In XML, all attribute values must be quoted.
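
As a minimal sketch of the point above, the div element can be read with Python's standard xml.etree.ElementTree module. The closing tag is added here only so that the fragment is well-formed; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The element from the slide, with an end-tag added so it can be parsed alone.
div = ET.fromstring('<div class="preface"></div>')

element_name = div.tag             # name of the element
class_value = div.get("class")     # value of the "class" attribute
all_attributes = dict(div.attrib)  # every name-value pair of the start-tag

# An unquoted attribute value is not well-formed XML, so the parser rejects it.
try:
    ET.fromstring("<div class=preface></div>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(element_name, class_value, all_attributes, unquoted_accepted)
```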
XML document
Example 1. A Simple XML Document
<?xml version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>
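
The nesting in Example 1 can be made visible by parsing it and walking the resulting tree. This is a sketch using Python's standard library; the helper name outline is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

OLDJOKE = """<?xml version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>"""

root = ET.fromstring(OLDJOKE)

def outline(element, depth=0):
    """Return (depth, tag) pairs for the element and its descendants, in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

tree_outline = outline(root)

# Mixed content: the text of <burns> is interleaved with its <quote> child.
burns_text = "".join(root.find("burns").itertext())

# <applause/> is an empty element: no children and no text.
applause = root.find("applause")
applause_is_empty = len(applause) == 0 and applause.text is None

for depth, tag in tree_outline:
    print("  " * depth + tag)
```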
XML currently
XML is not actually a markup language; it is a
standard that specifies a syntax with which
anyone can create their own markup language.
It is a tag-based language for describing tree structures
with a linear syntax.
The markup language you create will depend on the
task you are trying to accomplish.
XML currently
XML makes it possible to separate data from the
processes that act on that data.
XML semantics
A single ontology covering every application
context will never be possible.
No one ontology will suit all subjects and
domains, and a community as large and
heterogeneous as the Web will not agree
on a complex ontology for describing all its
concerns.
DTD (Document Type Definition)
It is the closest thing XML offers to ontological
modeling.
It defines the legal nesting of tags and introduces
attributes for them.
Defining tags, their nesting and attributes for tags
may be seen as defining an ontology.
XML and DTD
An XML document is valid if it is well-formed and,
when the document uses a DTD, it conforms to that DTD.
DTDs are not required for XML documents; they
make it possible to define stronger
constraints on documents.
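
Well-formedness can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports well-formedness errors but does not validate against a DTD, so it only illustrates the first half of the definition; the function name is_well_formed is an assumption.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "nested": is_well_formed("<a><b>text</b></a>"),       # tags properly nested
    "overlapping": is_well_formed("<a><b>text</a></b>"),  # tags overlap
    "unclosed": is_well_formed("<a><b>text</b>"),         # missing end-tag
}
print(checks)
```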
DTD
A DTD consists of three kinds of declarations:
Element declarations, which define composite tags
and value ranges for elementary tags
Attribute declarations, which define the attributes of tags
Entity declarations
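
The three kinds of declaration can be sketched in a small internal DTD. The note vocabulary below is invented for illustration and does not come from the talk. Python's standard parser reads the declarations and expands the entity, but it does not enforce the element and attribute declarations; a validating parser would be needed for that.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: one composite element, two elementary ones,
# one attribute and one entity.
DOC = """<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ATTLIST note lang CDATA #REQUIRED>
  <!ENTITY venue "Crema">
]>
<note lang="en"><to>Reader</to><body>Talk given in &venue;.</body></note>"""

note = ET.fromstring(DOC)

child_tags = [child.tag for child in note]  # nesting described by <!ELEMENT note ...>
language = note.get("lang")                 # attribute described by <!ATTLIST ...>
body_text = note.find("body").text          # &venue; replaced by its declared value

print(child_tags, language, body_text)
```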
Issues to be addressed
The availability of large amounts of data on the Web raises
several issues that the XML standards do not address:
1) Extracting data from large repositories of XML
documents
2) Translating XML data between different ontologies
(DTDs)
3) Integrating XML data from multiple XML sources
4) Transporting large amounts of selected XML data to
users
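
Issues 1 and 2 can be illustrated on a toy scale. The sketch below extracts data with a path query and renames tags to move between two vocabularies. The catalog document, the tag names and TAG_MAP are all assumptions made for the example, and a real translation between DTDs usually involves more than renaming.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document.
CATALOG = ET.fromstring("""
<catalog>
  <item kind="laptop"><name>Latitude</name><cpu>500</cpu></item>
  <item kind="desktop"><name>Tower</name><cpu>700</cpu></item>
  <item kind="laptop"><name>Slim</name><cpu>600</cpu></item>
</catalog>""")

# 1) Extraction: select the names of laptop items with a path expression.
laptop_names = [n.text for n in CATALOG.findall("./item[@kind='laptop']/name")]

# 2) Translation: copy the tree while renaming tags according to a mapping.
TAG_MAP = {"catalog": "PRODUCTS", "item": "PRODUCT",
           "name": "MODEL", "cpu": "PROCESSOR-SPEED"}

def translate(element, tag_map):
    """Return a copy of the element whose tags are renamed via tag_map."""
    copy = ET.Element(tag_map.get(element.tag, element.tag), dict(element.attrib))
    copy.text = element.text
    copy.tail = element.tail
    for child in element:
        copy.append(translate(child, tag_map))
    return copy

translated = translate(CATALOG, TAG_MAP)
translated_models = [m.text for m in translated.iter("MODEL")]

print(laptop_names, translated.tag, translated_models)
```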
XML
A first attempt
to structure the entropy
of the Web world
NAMIC project
NAMIC aims to extract relevant facts/events
from the news streams of large European
news agencies and newspaper producers,
to provide hypertextual structures within
each (monolingual) stream and then to
produce cross-lingual links among
streams.
NAMIC project
Language-specific processors (LPs) are
responsible for text processing and event
matching in independent text units in each
stream. LPs compile an objective
representation of each source text, including
the detected morphosyntactic information,
categorization into news-standard (IPTC)
classes and a description of relevant events.
NAMIC Schema news

<NEWS NEWSID="ita_1" DATE="6/7/2000" PLACE="PARIGI"
AGENCY="ANSA" LANG="ITA">
<TITLE> <PP IDEN="0">
<P>UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER
POLITICA ITALIA(2)</P>
</PP>
</TITLE>
Schema news

<BODY><PP IDEN="0">
 <P>''I risultati raggiunti - ha proseguito il sottosegretario - si devono all'
impegno di polizia, carabinieri, guardia di finanza, marina e aeronautica
militare. Ma in questa occasione abbiamo ribadito che, proprio perche'
l'Italia e' un paese di frontiera e i clandestini la considerano spesso un
passaggio per arrivare nel resto dell' Europa, serve l'impegno di tutti. In
particolare, abbiamo chiesto piu' spazio e piu' forza investigativa per
l'Europol''.</P>
 </PP> <PP IDEN="1">
 <P>Nella riunione di oggi, il ministro degli interni francese, Jean-Pierre
Chevenement, ha proposto all'Unione europea ''cinque piste'' in favore del
co-sviluppo: privilegiare le iniziative dei migranti in favore dello sviluppo dei
paesi di origine; favorire ''l'immigrazione alterna'' (come fa il Mali, con i
giovani di un villaggio che vanno per un periodo in Francia poi tornano e
lasciano il posto a coetanei di villaggi vicini); favorire l'accesso alla
formazione, con corsi programmati come prevede il decreto italiano;
facilitare la libera circolazione nell'Ue dei protagonisti delle politiche di
sviluppo; favorire la partnership negoziata su un piano di parita' fra gli stati
dell'Ue e i paesi di origine.</P>
 </PP>

</BODY>
</NEWS>

<NEWS NEWSID="ita_1" DATE="6/7/2000" PLACE="PARIGI" AGENCY="ANSA"
LANG="ITA"> <CAT TYPE="cro" />

<TITLE> <PP IDEN="0">
 <P>UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER
POLITICA ITALIA(2)</P> 
<SYNTACTIC_GRAPH>
<LNS>
<LEX_HANDLE IDEN="1" POSTAG="NPR"
SURFACE="UE"> <TOKEN I="1" S="UE" SS="0" SE="2" POS="1"
/> <LEMMA IDEN="0" SURFACE="UE" SYNTCAT="nome.proprio"
MORPHFEAT="mas.fem.plur.sing." />

</LEX_HANDLE>
<LEX_HANDLE IDEN="2" POSTAG="COP" SURFACE=":"> <TOKEN
I="2" S=":" SS="2" SE="3" POS="2" /> <LEMMA IDEN="1"
SURFACE=":" SYNTCAT="cong.coord.duep."
MORPHFEAT="invariante" />

</LEX_HANDLE>
<LEX_HANDLE IDEN="3" POSTAG="NCS"
SURFACE="IMMIGRAZIONE"> <TOKEN I="3"
S="IMMIGRAZIONE" SS="4" SE="16" POS="3" /> <LEMMA
IDEN="2" SURFACE="immigrazione" SYNTCAT="nome.comune"
MORPHFEAT="fem.sing." />

</LEX_HANDLE>
<LEX_HANDLE IDEN="4" POSTAG="COP" SURFACE=";"> <TOKEN I="4" S=";"
SS="16" SE="17" POS="4" /> <LEMMA IDEN="3" SURFACE=";"
SYNTCAT="cong.coord.pvirg." MORPHFEAT="invariante" />
</LEX_HANDLE><LEX_HANDLE IDEN="5" POSTAG="NCS"
SURFACE="APPREZZAMENTO"> <TOKEN I="5" S="APPREZZAMENTO"
SS="18" SE="31" POS="5" /> <LEMMA IDEN="4" SURFACE="apprezzamento"
SYNTCAT="nome.comune" MORPHFEAT="mas.sing." />
</LEX_HANDLE><LEX_HANDLE IDEN="6" POSTAG="PSE" SURFACE="A"> <TOKEN I="6" S="A"
SS="32" SE="33" POS="6" /> <LEMMA IDEN="5" SURFACE="a"
SYNTCAT="prep.sempl." MORPHFEAT="invariante" />
</LEX_HANDLE><LEX_HANDLE IDEN="7" POSTAG="NPR" SURFACE="PARIGI"> <TOKEN I="7"
S="PARIGI" SS="34" SE="40" POS="7" /> <LEMMA IDEN="1"
SURFACE="parigi" SYNTCAT="nome.proprio"
MORPHFEAT="invariante"> <NEC CAT="citta" /> </LEMMA>
</LEX_HANDLE>
<LEX_HANDLE IDEN="8" POSTAG="PSE" SURFACE="PER"> <TOKEN I="8"
S="PER" SS="41" SE="44" POS="8" /> <LEMMA IDEN="7" SURFACE="per"
SYNTCAT="prep.sempl." MORPHFEAT="invariante" />
</LEX_HANDLE><LEX_HANDLE IDEN="9" POSTAG="NCS" SURFACE="POLITICA"> <TOKEN
I="9" S="POLITICA" SS="45" SE="53" POS="9" /> <LEMMA IDEN="8"
SURFACE="politica" SYNTCAT="nome.comune" MORPHFEAT="fem.sing."
/> <LEMMA IDEN="8" SURFACE="politico" SYNTCAT="nome.comune"
MORPHFEAT="mas.fem.sing." />
</LEX_HANDLE><LEX_HANDLE IDEN="10" POSTAG="NPR" SURFACE="ITALIA"> <TOKEN
I="10" S="ITALIA" SS="54" SE="60" POS="10" /> <LEMMA IDEN="1"
SURFACE="italia" SYNTCAT="nome.proprio"
MORPHFEAT="invariante"> <NEC CAT="paese" /> </LEMMA>
</LEX_HANDLE><LEX_HANDLE IDEN="11" POSTAG="COS" SURFACE="("> <TOKEN I="11" S="("
SS="60" SE="61" POS="11" /> <LEMMA IDEN="10" SURFACE="("
SYNTCAT="cong.subord.paren." MORPHFEAT="invariante" />
</LEX_HANDLE><LEX_HANDLE IDEN="12" POSTAG="NUM" SURFACE="2"> <TOKEN I="12"
S="2" SS="61" SE="62" POS="12" /> <LEMMA IDEN="11"
SURFACE="numero_card" SYNTCAT="nome.comune"
MORPHFEAT="invariante" />
</LEX_HANDLE>
<SYNT_LINK IDEN="117" HEAD="222" MODIFIER="227"
TYPE="PP_PP" PLAUS="0.16666667" />
<SYNT_LINK IDEN="118" HEAD="220" MODIFIER="227"
TYPE="PP_PP" PLAUS="0.16666667" />
<SYNT_LINK IDEN="119" HEAD="217" MODIFIER="227"
TYPE="PP_PP" PLAUS="0.16666667" />
<SYNT_LINK IDEN="120" HEAD="215" MODIFIER="227"
TYPE="PP_PP" PLAUS="0.16666667" />
<SYNT_LINK IDEN="121" HEAD="211" MODIFIER="227"
TYPE="NP_PP" PLAUS="0.16666667" />
</SRS>
</SYNTACTIC_GRAPH>
</PP>
</BODY>
</NEWS>
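
In the annotated title above, the SS and SE attributes of each TOKEN appear to be start and end character offsets into the title text. That reading is an assumption, not something the slides state. The sketch below checks it on three of the LEX_HANDLE elements, copied from the sample and wrapped in an LNS element, and also collects the named-entity category attached to a lemma.

```python
import xml.etree.ElementTree as ET

TITLE = "UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER POLITICA ITALIA(2)"

# Three LEX_HANDLE elements taken from the sample, re-indented for readability.
FRAGMENT = ET.fromstring("""
<LNS>
  <LEX_HANDLE IDEN="1" POSTAG="NPR" SURFACE="UE">
    <TOKEN I="1" S="UE" SS="0" SE="2" POS="1" />
  </LEX_HANDLE>
  <LEX_HANDLE IDEN="3" POSTAG="NCS" SURFACE="IMMIGRAZIONE">
    <TOKEN I="3" S="IMMIGRAZIONE" SS="4" SE="16" POS="3" />
  </LEX_HANDLE>
  <LEX_HANDLE IDEN="7" POSTAG="NPR" SURFACE="PARIGI">
    <TOKEN I="7" S="PARIGI" SS="34" SE="40" POS="7" />
    <LEMMA IDEN="1" SURFACE="parigi" SYNTCAT="nome.proprio" MORPHFEAT="invariante">
      <NEC CAT="citta" />
    </LEMMA>
  </LEX_HANDLE>
</LNS>""")

# Compare each token's surface string with the slice of the title it points to.
spans = [(t.get("S"), TITLE[int(t.get("SS")):int(t.get("SE"))])
         for t in FRAGMENT.iter("TOKEN")]
offsets_consistent = all(surface == sliced for surface, sliced in spans)

# Map each surface form to the category of its NEC element, when there is one.
named_entities = {handle.get("SURFACE"): nec.get("CAT")
                  for handle in FRAGMENT.iter("LEX_HANDLE")
                  for nec in handle.iter("NEC")}

print(spans, offsets_consistent, named_entities)
```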
CROSSMARC project
It will develop a technology for e-retail product
comparison.
It will be able to process pages written in
several languages and will employ language
technology methods for information
extraction which will be extended and tailored
to the characteristics of e-shopping.
prodotto
<?xml version="1.0" encoding="UTF-8" ?>
<document>
<source>Dell Latitude LSH 500 Lire 5516000 Pentium III 500 Mhz, 128
MbB Sdram, disco fisso da 20 GB, schermo TFT Svga da 12.1 pollici,
chip grafico NeoMagic MagicMedia 256AV con 2.5 MB, lettore esterno per
Cd-Rom, Ethernet 10/100 Mbit/sec. Integrata
</source>
</document>
prodotto1
<?xml version="1.0" encoding="UTF-8" ?>
<document>
<source>Dell Latitude LSH 500 Lire 5516000 Pentium III 500 Mhz, 128 MbB Sdram,
disco fisso da 20 GB, schermo TFT Svga da 12.1 pollici, chip grafico NeoMagic
MagicMedia 256AV con 2.5 MB, lettore esterno per Cd-Rom, Ethernet 10/100 Mbit/sec.
Integrata </source>
<tokenization>
<TOKEN Id="1" Label="dell">Dell</TOKEN>
<TOKEN Id="2" Label="latitude">Latitude</TOKEN>
<TOKEN Id="3" Label="lsh">LSH</TOKEN>
<TOKEN Id="4" Label="500">500</TOKEN>
<TOKEN Id="5" Label="lire">Lire</TOKEN>
<TOKEN Id="6" Label="5516000">5516000</TOKEN>
<TOKEN Id="7" Label="pentium">Pentium</TOKEN>
<TOKEN Id="8" Label="iii">III</TOKEN>
<TOKEN Id="9" Label="500">500</TOKEN>
<TOKEN Id="10" Label="mhz">Mhz</TOKEN>
………………
prodotto1
<TOKEN Id="35" Label="con">con</TOKEN>
<TOKEN Id="36" Label="2.5">2.5</TOKEN>
<TOKEN Id="37" Label="mb">MB</TOKEN>
<TOKEN Id="38" Label=",">,</TOKEN>
<TOKEN Id="39" Label="lettore">lettore</TOKEN>
<TOKEN Id="40" Label="esterno">esterno</TOKEN>
<TOKEN Id="41" Label="per">per</TOKEN>
<TOKEN Id="42" Label="cd-rom">Cd-Rom</TOKEN>
<TOKEN Id="43" Label=",">,</TOKEN>
<TOKEN Id="44" Label="ethernet">Ethernet</TOKEN>
<TOKEN Id="45" Label="10">10</TOKEN>
<TOKEN Id="46" Label="/">/</TOKEN>
<TOKEN Id="47" Label="100">100</TOKEN>
<TOKEN Id="48" Label="mbit">Mbit</TOKEN>
<TOKEN Id="49" Label="/">/</TOKEN>
<TOKEN Id="50" Label="sec.">sec.</TOKEN>
<TOKEN Id="51" Label="integrata">integrata</TOKEN>
</tokenization>
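
A tokenization layer of this shape can be produced with a few lines of code. The sketch below is not the CROSSMARC tokenizer, whose rules are not given in the slides. It uses an assumed regular expression that happens to reproduce the visible tokens (for instance 12.1, Cd-Rom and sec. stay whole, while 10/100 is split), numbers the tokens from 1 and lowercases the Label.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rule: runs of word characters, dots and hyphens form one token;
# any other non-space character is a token of its own.
TOKEN_PATTERN = re.compile(r"[\w.-]+|[^\w\s]")

def tokenize(text):
    """Build a <tokenization> element with one numbered <TOKEN> per match."""
    tokenization = ET.Element("tokenization")
    for number, match in enumerate(TOKEN_PATTERN.finditer(text), start=1):
        token = ET.SubElement(tokenization, "TOKEN",
                              Id=str(number), Label=match.group().lower())
        token.text = match.group()
    return tokenization

result = tokenize("Dell Latitude LSH 500 Lire 5516000 Pentium III 500 Mhz, 128")
labels = [token.get("Label") for token in result]
first_token_xml = ET.tostring(result[0], encoding="unicode")

print(labels)
print(first_token_xml)
```

A sentence-final full stop would stay attached to the preceding word under this rule, so a real tokenizer would need a more careful treatment of punctuation.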
prodotto2
</tokenization>
<named-entities>
<named-entity sem-type="Processor Name" normal="Intel Pentium III">
<TOKEN Id="7" Label="pentium">Pentium</TOKEN>
<TOKEN Id="8" Label="iii">III</TOKEN>
</named-entity>
<named-entity sem-type="Processor Speed" normal="500 MHz">
<TOKEN Id="9" Label="500">500</TOKEN>
<TOKEN Id="10" Label="mhz">Mhz</TOKEN>
</named-entity>
<named-entity sem-type="Drive Types" normal="CD-ROM">
<TOKEN Id="42" Label="cd-rom">Cd-Rom</TOKEN>
</named-entity>
<named-entity sem-type="Screen Type" normal="Active Matrix (TFT)">
<TOKEN Id="23" Label="tft">TFT</TOKEN>
</named-entity>
<named-entity sem-type="Ports" normal="10/100Base-T">
<TOKEN Id="44" Label="ethernet">Ethernet</TOKEN>
<TOKEN Id="45" Label="10">10</TOKEN>
<TOKEN Id="46" Label="/">/</TOKEN>
<TOKEN Id="47" Label="100">100</TOKEN>
</named-entity>
</named-entities>
</document>
prodotto3
<PRODUCT>
<DESCRIPTION>
<MANUF>Dell</MANUF>
<MODEL>Latitude LSH 500</MODEL>
<NUMEX TYPE="MONEY" NORM="2848.77" UNIT="EUR">Lire 5516000</NUMEX>
<PROCESSOR>Pentium III</PROCESSOR>
<NUMEX TYPE="SPEED" NORM="500" UNIT="Mhz">500 Mhz</NUMEX> ,
<NUMEX TYPE="CAPACITY" NORM="128" UNIT="Mbyte">128 Mb</NUMEX>
<TERM>Sdram</TERM> , <TERM>disco fisso</TERM> da
<NUMEX TYPE="CAPACITY" NORM="20000" UNIT="Mbyte">20 GB</NUMEX> ,
<TERM>schermo TFT Svga</TERM> da
<NUMEX TYPE="LENGHT" NORM="12.1" UNIT="inch">12.1 pollici</NUMEX> ,
chip grafico NeoMagic MagicMedia
<NUMEX TYPE="SIMPLE">256AV</NUMEX> con
<NUMEX TYPE="CAPACITY" NORM="2.5" UNIT="Mbyte">2.5 MB</NUMEX> ,
<TERM>lettore</TERM> <LOC_ATTR>esterno</LOC_ATTR> per
<TERM>Cd-Rom</TERM> , <TERM>Ethernet 10/100</TERM> Mbit/sec.
<LOC_ATTR>integrata</LOC_ATTR>
</DESCRIPTION>
</PRODUCT>
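
The NUMEX elements pair a surface string with a normalized value and unit. As a sketch of how the CAPACITY cases could be computed, the function below converts a surface form to Mbyte. The function name and the unit table are assumptions; the decimal factor of 1000 is chosen because the sample normalizes 20 GB to 20000.

```python
import re

# Assumed conversion table to Mbyte (decimal factor, matching 20 GB -> 20000).
MBYTE_PER_UNIT = {"mb": 1, "gb": 1000}

def normalize_capacity(surface):
    """Return NUMEX-style attributes for a capacity string, or None if it is not one."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", surface)
    if match is None or match.group(2).lower() not in MBYTE_PER_UNIT:
        return None
    value = float(match.group(1)) * MBYTE_PER_UNIT[match.group(2).lower()]
    # Print whole numbers without a trailing ".0", as in the sample.
    norm = str(int(value)) if value.is_integer() else str(value)
    return {"TYPE": "CAPACITY", "NORM": norm, "UNIT": "Mbyte"}

ram = normalize_capacity("128 Mb")
disk = normalize_capacity("20 GB")
video = normalize_capacity("2.5 MB")
not_capacity = normalize_capacity("12.1 pollici")

print(ram, disk, video, not_capacity)
```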
prodotto4
<PRODUCT>
<MANUF>Dell</MANUF>
<MODEL>Latitude LSH 500</MODEL>
<PRICE>2848.77</PRICE>
<PROCESSOR>
<PROCESSOR-NAME>PIII</PROCESSOR-NAME>
<PROCESSOR-SPEED>500</PROCESSOR-SPEED>
</PROCESSOR>
<SCREEN>
<SCREEN-TYPE>TFT</SCREEN-TYPE>
<SCREEN-SIZE>12.1</SCREEN-SIZE>
</SCREEN>
<MEMORY>
<STANDARD-RAM>128</STANDARD-RAM>
</MEMORY>
<HARD-DISK>
<CAPACITY>20000</CAPACITY>
</HARD-DISK>
<MEDIA>
<DRIVE-TYPE>CDROM</DRIVE-TYPE>
</MEDIA>
<PORTS>
<PORT>Eth10/100</PORT>
</PORTS>
</PRODUCT>
ontodemo
<?xml version="1.0" encoding="UTF-8" ?>
<category> Hardware
<product> Laptop Computers
<feature> Operating System
<attribute>Operating System</attribute>
</feature>
<feature> Processor
<attribute>Processor Name</attribute>
<attribute>Processor Speed</attribute>
</feature>
<feature> Screen
<attribute>Screen Type</attribute>
<attribute>Screen Size</attribute>
<attribute>Maximum Resolution</attribute>
</feature>
…..…. </feature>
</product>
</category>
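
An ontology file of this shape can be loaded into a simple lookup table from features to their attributes. The sketch below uses only the two complete features visible above; the helper name label is an assumption, and the elided part of the file is left out.

```python
import xml.etree.ElementTree as ET

ONTOLOGY = ET.fromstring("""
<category> Hardware
  <product> Laptop Computers
    <feature> Processor
      <attribute>Processor Name</attribute>
      <attribute>Processor Speed</attribute>
    </feature>
    <feature> Screen
      <attribute>Screen Type</attribute>
      <attribute>Screen Size</attribute>
      <attribute>Maximum Resolution</attribute>
    </feature>
  </product>
</category>""")

def label(element):
    """Return the text that precedes the element's first child, without padding."""
    return (element.text or "").strip()

category_name = label(ONTOLOGY)
product_name = label(ONTOLOGY.find("product"))
features = {label(feature): [label(a) for a in feature.findall("attribute")]
            for feature in ONTOLOGY.iter("feature")}

print(category_name, product_name, features)
```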
Gazetteer
<?xml version="1.0" encoding="UTF-8" ?> <data>
<surface normal="Windows NT" sem-type="Operating System">
<T>Windows</T>
<T>NT</T>
</surface>
<surface normal="Windows NT" sem-type="Operating System">
<T>WinNT</T>
</surface>
<surface normal="Windows NT" sem-type="Operating System">
<T>NT</T>
<T>4</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
<T>Windows</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
<T>Windows</T>
<T>95</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
<T>Windows95</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
<T>Windows98</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
<T>Windows</T>
<T>98</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
<T>95</T>
<T>/</T>
<T>98</T>
<T>Win95</T>
</surface>
<surface normal="Windows 95/98" sem-type="Operating System">
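
A gazetteer of this kind maps token sequences to a normalized name and a semantic type. The sketch below shows one way such a lookup could work: longest match first, ignoring case. Both choices are assumptions, as the slides do not describe the matching strategy. Only three entries from the sample are used.

```python
import xml.etree.ElementTree as ET

GAZETTEER = ET.fromstring("""
<data>
  <surface normal="Windows NT" sem-type="Operating System"><T>Windows</T><T>NT</T></surface>
  <surface normal="Windows NT" sem-type="Operating System"><T>WinNT</T></surface>
  <surface normal="Windows 95/98" sem-type="Operating System"><T>Windows</T><T>98</T></surface>
</data>""")

# Index every surface form by its lowercased token sequence.
entries = {}
for surface in GAZETTEER.iter("surface"):
    key = tuple(t.text.lower() for t in surface.findall("T"))
    entries[key] = (surface.get("normal"), surface.get("sem-type"))

def annotate(tokens):
    """Return (start, end, normal, sem_type) for each gazetteer match, longest first."""
    found, position = [], 0
    longest = max(len(key) for key in entries)
    while position < len(tokens):
        for size in range(min(longest, len(tokens) - position), 0, -1):
            key = tuple(t.lower() for t in tokens[position:position + size])
            if key in entries:
                found.append((position, position + size) + entries[key])
                position += size
                break
        else:
            position += 1  # no entry starts at this token
    return found

matches = annotate(["Notebook", "con", "Windows", "98", "o", "WinNT"])
print(matches)
```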