XML, Information Extraction and Document Structuring
Maria Teresa PAZIENZA
Roma Tor Vergata University, Italy
Crema, 29 June 2001

In short
Why XML? XML was created so that richly structured documents could be shared over the web. Sharing them requires integrating heterogeneous and distributed data and information sources.

XML (Extensible Markup Language) is a markup language for documents containing structured information. XML is designed to deliver structured content over the web. A markup language is a mechanism to identify structures in a document.

Structured information contains both content (words, pictures, etc.) and some indication of what role that content plays: content in a section heading has a different meaning from content in a footnote, which in turn means something different from content in a figure caption or in a database table. Almost all documents have some structure.

The word "document" refers not only to traditional documents, but also to the myriad of other XML "data formats". These include vector graphics, e-commerce transactions, mathematical equations, object meta-data, server APIs, and a thousand other kinds of structured information.

XML specifies neither semantics nor a tag set. In fact, XML is really a meta-language for describing markup languages: it provides a facility to define tags and the structural relationships between them. Since there is no predefined tag set, there cannot be any preconceived semantics. All of the semantics of an XML document are defined either by the applications that process it or by stylesheets.

XML documents
1. XML documents are composed of markup and content.
2. Elements are the most common form of markup.
3. If an element is not empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.
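The idea that markup records the role each piece of content plays can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree. This is only an illustration: the tag names (article, heading, para, footnote) and the helper roles are invented for the example and are not part of any schema discussed in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are invented for this illustration.
DOC = """<article>
  <heading>Why XML?</heading>
  <para>Structured documents can be shared over the web.</para>
  <footnote>Almost all documents have some structure.</footnote>
</article>"""


def roles(xml_text):
    """Return (role, content) pairs for each child element of the root.

    The element name (start-tag/end-tag pair) gives the role;
    the text between the tags is the content.
    """
    root = ET.fromstring(xml_text)
    return [(child.tag, (child.text or "").strip()) for child in root]


if __name__ == "__main__":
    for role, content in roles(DOC):
        print(f"{role}: {content}")
```

The same words would mean something different under a different tag, which is why an application or stylesheet has to supply the semantics.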
Attributes are name-value pairs that occur inside start-tags after the element name. For example, <div class="preface"> is a div element with the attribute class having the value preface. In XML, all attribute values must be quoted.

XML document
Example 1. A Simple XML Document
<?xml version="1.0"?>
<oldjoke>
  <burns>Say <quote>goodnight</quote>, Gracie.</burns>
  <allen><quote>Goodnight, Gracie.</quote></allen>
  <applause/>
</oldjoke>

XML currently
Strictly speaking, XML is not itself a markup language; it is a standard that specifies a syntax with which anyone can create their own markup language. It is a tag-based language for describing tree structures with a linear syntax. The markup language you create will depend on the task you are trying to accomplish.

XML makes it possible to separate data from the processes that act on that data.

XML semantics
A single ontology for all application contexts will never be possible. No one ontology will suit all subjects and domains, nor will a community as large and heterogeneous as the Web agree on a complex ontology for describing all its concerns.

DTD (Document Type Definition)
A DTD is the closest thing XML offers to ontological modeling. It defines the legal nesting of tags and introduces attributes for them. Defining tags, their nesting, and their attributes may be seen as defining an ontology.

XML and DTD
An XML document is valid if it is well-formed and, when it uses a DTD, it conforms to that DTD. DTDs are not required for XML documents; they provide a way to define stronger constraints on documents.
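The difference between a well-formed document and a broken one can be checked mechanically. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; the helper name is_well_formed and the two broken inputs are assumptions made for this illustration. It tests well-formedness only: ElementTree does not validate against a DTD, so checking validity would require a validating parser.

```python
import xml.etree.ElementTree as ET

# The simple document from Example 1, written as one string.
OLDJOKE = (
    "<oldjoke>"
    "<burns>Say <quote>goodnight</quote>, Gracie.</burns>"
    "<allen><quote>Goodnight, Gracie.</quote></allen>"
    "<applause/>"
    "</oldjoke>"
)

# Hypothetical broken inputs, invented for this illustration.
UNQUOTED = "<div class=preface>text</div>"       # attribute value not quoted
UNCLOSED = "<oldjoke><burns>Say goodnight</oldjoke>"  # <burns> never closed


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    for name, text in [("oldjoke", OLDJOKE),
                       ("unquoted attribute", UNQUOTED),
                       ("unclosed element", UNCLOSED)]:
        print(name, "->", is_well_formed(text))
```

A document that passes this check may still be invalid with respect to a DTD, for instance if it nests tags in an order the DTD does not allow.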
DTD
A DTD consists of three kinds of declarations:
Element declarations, which define composed tags and value ranges for elementary tags
Attribute declarations, which define the attributes of tags
Entity declarations

Issues to be addressed
The availability of large amounts of data on the Web raises several issues that the XML standards do not address:
1) Extracting data from large repositories of XML documents
2) Translating XML data between different ontologies (DTDs)
3) Integrating XML data from multiple XML sources
4) Transporting large amounts of selected XML data to users

XML
A first attempt to structure the entropy of the Web world

NAMIC project
NAMIC aims to extract relevant facts and events from the news streams of large European news agencies and newspaper producers, to provide hypertextual structures within each (monolingual) stream, and then to produce cross-lingual links among streams.

Language-specific processors (LPs) are responsible for text processing and event matching in independent text units in each stream. LPs compile an objective representation of each source text, including the detected morphosyntactic information, categorization into news-standard (IPTC) classes, and a description of relevant events.

NAMIC Schema news
<NEWS NEWSID="ita_1" DATE="6/7/2000" PLACE="PARIGI" AGENCY="ANSA" LANG="ITA">
  <TITLE>
    <PP IDEN="0">
      <P>UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER POLITICA ITALIA(2)</P>
    </PP>
  </TITLE>
  <BODY>
    <PP IDEN="0">
      <P>''I risultati raggiunti - ha proseguito il sottosegretario - si devono all' impegno di polizia, carabinieri, guardia di finanza, marina e aeronautica militare. Ma in questa occasione abbiamo ribadito che, proprio perche' l'Italia e' un paese di frontiera e i clandestini la considerano spesso un passaggio per arrivare nel resto dell' Europa, serve l'impegno di tutti.
In particolare, abbiamo chiesto piu' spazio e piu' forza investigativa per l'Europol''.</P>
    </PP>
    <PP IDEN="1">
      <P>Nella riunione di oggi, il ministro degli interni francese, Jean-Pierre Chevenement, ha proposto all'Unione europea ''cinque piste'' in favore del co-sviluppo: privilegiare le iniziative dei migranti in favore dello sviluppo dei paesi di origine; favorire ''l'immigrazione alterna'' (come fa il Mali, con i giovani di un villaggio che vanno per un periodo in Francia poi tornano e lasciano il posto a coetanei di villaggi vicini); favorire l'accesso alla formazione, con corsi programmati come prevede il decreto italiano; facilitare la libera circolazione nell'Ue dei protagonisti delle politiche di sviluppo; favorire la partnership negoziata su un piano di parita' fra gli stati dell'Ue e i paesi di origine.</P>
    </PP>
  </BODY>
</NEWS>

<NEWS NEWSID="ita_1" DATE="6/7/2000" PLACE="PARIGI" AGENCY="ANSA" LANG="ITA">
  <CAT TYPE="cro" />
  <TITLE>
    <PP IDEN="0">
      <P>UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER POLITICA ITALIA(2)</P>
      <SYNTACTIC_GRAPH>
        <LNS>
          <LEX_HANDLE IDEN="1" POSTAG="NPR" SURFACE="UE">
            <TOKEN I="1" S="UE" SS="0" SE="2" POS="1" />
            <LEMMA IDEN="0" SURFACE="UE" SYNTCAT="nome.proprio" MORPHFEAT="mas.fem.plur.sing." />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="2" POSTAG="COP" SURFACE=":">
            <TOKEN I="2" S=":" SS="2" SE="3" POS="2" />
            <LEMMA IDEN="1" SURFACE=":" SYNTCAT="cong.coord.duep." MORPHFEAT="invariante" />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="3" POSTAG="NCS" SURFACE="IMMIGRAZIONE">
            <TOKEN I="3" S="IMMIGRAZIONE" SS="4" SE="16" POS="3" />
            <LEMMA IDEN="2" SURFACE="immigrazione" SYNTCAT="nome.comune" MORPHFEAT="fem.sing." />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="4" POSTAG="COP" SURFACE=";">
            <TOKEN I="4" S=";" SS="16" SE="17" POS="4" />
            <LEMMA IDEN="3" SURFACE=";" SYNTCAT="cong.coord.pvirg."
            MORPHFEAT="invariante" />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="5" POSTAG="NCS" SURFACE="APPREZZAMENTO">
            <TOKEN I="5" S="APPREZZAMENTO" SS="18" SE="31" POS="5" />
            <LEMMA IDEN="4" SURFACE="apprezzamento" SYNTCAT="nome.comune" MORPHFEAT="mas.sing." />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="6" POSTAG="PSE" SURFACE="A">
            <TOKEN I="6" S="A" SS="32" SE="33" POS="6" />
            <LEMMA IDEN="5" SURFACE="a" SYNTCAT="prep.sempl." MORPHFEAT="invariante" />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="7" POSTAG="NPR" SURFACE="PARIGI">
            <TOKEN I="7" S="PARIGI" SS="34" SE="40" POS="7" />
            <LEMMA IDEN="1" SURFACE="parigi" SYNTCAT="nome.proprio" MORPHFEAT="invariante">
              <NEC CAT="citta" />
            </LEMMA>
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="8" POSTAG="PSE" SURFACE="PER">
            <TOKEN I="8" S="PER" SS="41" SE="44" POS="8" />
            <LEMMA IDEN="7" SURFACE="per" SYNTCAT="prep.sempl." MORPHFEAT="invariante" />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="9" POSTAG="NCS" SURFACE="POLITICA">
            <TOKEN I="9" S="POLITICA" SS="45" SE="53" POS="9" />
            <LEMMA IDEN="8" SURFACE="politica" SYNTCAT="nome.comune" MORPHFEAT="fem.sing." />
            <LEMMA IDEN="8" SURFACE="politico" SYNTCAT="nome.comune" MORPHFEAT="mas.fem.sing." />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="10" POSTAG="NPR" SURFACE="ITALIA">
            <TOKEN I="10" S="ITALIA" SS="54" SE="60" POS="10" />
            <LEMMA IDEN="1" SURFACE="italia" SYNTCAT="nome.proprio" MORPHFEAT="invariante">
              <NEC CAT="paese" />
            </LEMMA>
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="11" POSTAG="COS" SURFACE="(">
            <TOKEN I="11" S="(" SS="60" SE="61" POS="11" />
            <LEMMA IDEN="10" SURFACE="(" SYNTCAT="cong.subord.paren."
            MORPHFEAT="invariante" />
          </LEX_HANDLE>
          <LEX_HANDLE IDEN="12" POSTAG="NUM" SURFACE="2">
            <TOKEN I="12" S="2" SS="61" SE="62" POS="12" />
            <LEMMA IDEN="11" SURFACE="numero_card" SYNTCAT="nome.comune" MORPHFEAT="invariante" />
          </LEX_HANDLE>

          <SYNT_LINK IDEN="117" HEAD="222" MODIFIER="227" TYPE="PP_PP" PLAUS="0.16666667" />
          <SYNT_LINK IDEN="118" HEAD="220" MODIFIER="227" TYPE="PP_PP" PLAUS="0.16666667" />
          <SYNT_LINK IDEN="119" HEAD="217" MODIFIER="227" TYPE="PP_PP" PLAUS="0.16666667" />
          <SYNT_LINK IDEN="120" HEAD="215" MODIFIER="227" TYPE="PP_PP" PLAUS="0.16666667" />
          <SYNT_LINK IDEN="121" HEAD="211" MODIFIER="227" TYPE="NP_PP" PLAUS="0.16666667" />
        </SRS>
      </SYNTACTIC_GRAPH>
    </PP>
  </BODY>
</NEWS>

CROSSMARC project
CROSSMARC will develop technology for e-retail product comparison. It will process pages written in several languages and will employ language-technology methods for information extraction, extended and tailored to the characteristics of e-shopping.

prodotto
<?xml version="1.0" encoding="UTF-8" ?>
<document>
  <source>Dell Latitude LSH 500 Lire 5516000 Pentium III 500 Mhz, 128 Mb Sdram, disco fisso da 20 GB, schermo TFT Svga da 12.1 pollici, chip grafico NeoMagic MagicMedia 256AV con 2.5 MB, lettore esterno per Cd-Rom, Ethernet 10/100 Mbit/sec. Integrata </source>
</document>

prodotto1
<?xml version="1.0" encoding="UTF-8" ?>
<document>
  <source>Dell Latitude LSH 500 Lire 5516000 Pentium III 500 Mhz, 128 Mb Sdram, disco fisso da 20 GB, schermo TFT Svga da 12.1 pollici, chip grafico NeoMagic MagicMedia 256AV con 2.5 MB, lettore esterno per Cd-Rom, Ethernet 10/100 Mbit/sec.
Integrata </source>
  <tokenization>
    <TOKEN Id="1" Label="dell">Dell</TOKEN>
    <TOKEN Id="2" Label="latitude">Latitude</TOKEN>
    <TOKEN Id="3" Label="lsh">LSH</TOKEN>
    <TOKEN Id="4" Label="500">500</TOKEN>
    <TOKEN Id="5" Label="lire">Lire</TOKEN>
    <TOKEN Id="6" Label="5516000">5516000</TOKEN>
    <TOKEN Id="7" Label="pentium">Pentium</TOKEN>
    <TOKEN Id="8" Label="iii">III</TOKEN>
    <TOKEN Id="9" Label="500">500</TOKEN>
    <TOKEN Id="10" Label="mhz">Mhz</TOKEN>
    ………………
    <TOKEN Id="35" Label="con">con</TOKEN>
    <TOKEN Id="36" Label="2.5">2.5</TOKEN>
    <TOKEN Id="37" Label="mb">MB</TOKEN>
    <TOKEN Id="38" Label=",">,</TOKEN>
    <TOKEN Id="39" Label="lettore">lettore</TOKEN>
    <TOKEN Id="40" Label="esterno">esterno</TOKEN>
    <TOKEN Id="41" Label="per">per</TOKEN>
    <TOKEN Id="42" Label="cd-rom">Cd-Rom</TOKEN>
    <TOKEN Id="43" Label=",">,</TOKEN>
    <TOKEN Id="44" Label="ethernet">Ethernet</TOKEN>
    <TOKEN Id="45" Label="10">10</TOKEN>
    <TOKEN Id="46" Label="/">/</TOKEN>
    <TOKEN Id="47" Label="100">100</TOKEN>
    <TOKEN Id="48" Label="mbit">Mbit</TOKEN>
    <TOKEN Id="49" Label="/">/</TOKEN>
    <TOKEN Id="50" Label="sec.">sec.</TOKEN>
    <TOKEN Id="51" Label="integrata">integrata</TOKEN>
  </tokenization>

prodotto2
  </tokenization>
  <named-entities>
    <named-entity sem-type="Processor Name" normal="Intel Pentium III">
      <TOKEN Id="7" Label="pentium">Pentium</TOKEN>
      <TOKEN Id="8" Label="iii">III</TOKEN>
    </named-entity>
    <named-entity sem-type="Processor Speed" normal="500 MHz">
      <TOKEN Id="9" Label="500">500</TOKEN>
      <TOKEN Id="10" Label="mhz">Mhz</TOKEN>
    </named-entity>
    <named-entity sem-type="Drive Types" normal="CD-ROM">
      <TOKEN Id="42" Label="cd-rom">Cd-Rom</TOKEN>
    </named-entity>
    <named-entity sem-type="Screen Type" normal="Active Matrix (TFT)">
      <TOKEN Id="23" Label="tft">TFT</TOKEN>
    </named-entity>
    <named-entity sem-type="Ports" normal="10/100Base-T">
      <TOKEN Id="44" Label="ethernet">Ethernet</TOKEN>
      <TOKEN Id="45" Label="10">10</TOKEN>
      <TOKEN
Id="46" Label="/">/</TOKEN>
      <TOKEN Id="47" Label="100">100</TOKEN>
    </named-entity>
  </named-entities>
</document>

prodotto3
<PRODUCT>
  <DESCRIPTION>
    <MANUF>Dell</MANUF>
    <MODEL>Latitude LSH 500</MODEL>
    <NUMEX TYPE="MONEY" NORM="2848.77" UNIT="EUR">Lire 5516000</NUMEX>
    <PROCESSOR>Pentium III</PROCESSOR>
    <NUMEX TYPE="SPEED" NORM="500" UNIT="Mhz">500 Mhz</NUMEX> ,
    <NUMEX TYPE="CAPACITY" NORM="128" UNIT="Mbyte">128 Mb</NUMEX>
    <TERM>Sdram</TERM> ,
    <TERM>disco fisso</TERM> da
    <NUMEX TYPE="CAPACITY" NORM="20000" UNIT="Mbyte">20 GB</NUMEX> ,
    <TERM>schermo TFT Svga</TERM> da
    <NUMEX TYPE="LENGHT" NORM="12.1" UNIT="inch">12.1 pollici</NUMEX> ,
    chip grafico NeoMagic MagicMedia
    <NUMEX TYPE="SIMPLE">256AV</NUMEX> con
    <NUMEX TYPE="CAPACITY" NORM="2.5" UNIT="Mbyte">2.5 MB</NUMEX> ,
    <TERM>lettore</TERM>
    <LOC_ATTR>esterno</LOC_ATTR> per
    <TERM>Cd-Rom</TERM> ,
    <TERM>Ethernet 10/100</TERM> Mbit/sec.
    <LOC_ATTR>integrata</LOC_ATTR>
  </DESCRIPTION>
</PRODUCT>

prodotto4
<PRODUCT>
  <MANUF>Dell</MANUF>
  <MODEL>Latitude LSH 500</MODEL>
  <PRICE>2848.77</PRICE>
  <PROCESSOR>
    <PROCESSOR-NAME>PIII</PROCESSOR-NAME>
    <PROCESSOR-SPEED>500</PROCESSOR-SPEED>
  </PROCESSOR>
  <SCREEN>
    <SCREEN-TYPE>TFT</SCREEN-TYPE>
    <SCREEN-SIZE>12.1</SCREEN-SIZE>
  </SCREEN>
  <MEMORY>
    <STANDARD-RAM>128</STANDARD-RAM>
  </MEMORY>
  <HARD-DISK>
    <CAPACITY>20000</CAPACITY>
  </HARD-DISK>
  <MEDIA>
    <DRIVE-TYPE>CDROM</DRIVE-TYPE>
  </MEDIA>
  <PORTS>
    <PORT>Eth10/100</PORT>
  </PORTS>
</PRODUCT>

ontodemo
<?xml version="1.0" encoding="UTF-8" ?>
<category> Hardware
  <product> Laptop Computers
    <feature> Operating System
      <attribute>Operating System</attribute>
    </feature>
    <feature> Processor
      <attribute>Processor Name</attribute>
      <attribute>Processor Speed</attribute>
    </feature>
    <feature> Screen
      <attribute>Screen Type</attribute>
      <attribute>Screen Size</attribute>
      <attribute>Maximum Resolution</attribute>
    </feature>
    …..….
    </feature>
  </product>
</category>

Gazetteer
<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <surface normal="Windows NT" sem-type="Operating System">
    <T>Windows</T>
    <T>NT</T>
  </surface>
  <surface normal="Windows NT" sem-type="Operating System">
    <T>WinNT</T>
  </surface>
  <surface normal="Windows NT" sem-type="Operating System">
    <T>NT</T>
    <T>4</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="Operating System">
    <T>Windows</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="Operating System">
    <T>Windows</T>
    <T>95</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="OperatingSystem">
    <T>Windows95</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="OperatingSystem">
    <T>Windows98</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="OperatingSystem">
    <T>Windows</T>
    <T>98</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="OperatingSystem">
    <T>95</T>
    <T>/</T>
    <T>98</T>
    <T>Win95</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="Operating System">