DB Group @ unimo
Semi-automatic compound nouns
annotation for data integration systems
Tuesday, 23 June 2009
SEBD 2009
Sonia Bergamaschi
Serena Sorrentino
www.dbgroup.unimo.it
Dipartimento di Ingegneria dell’Informazione
Università di Modena e Reggio Emilia, via Vignolese 905, 41100 Modena
Università di Modena e Reggio Emilia 1
The Problem
• Data integration systems: produce a comprehensive global schema
successfully integrating data from heterogeneous structured and semistructured data sources
– Starting from the “meanings” associated with schema elements, it is possible to
discover mappings among the elements of different schemata
• Lexical Annotation:
– the explicit inclusion of the “meaning” of a data source element (i.e.
class/attribute name) w.r.t. a thesaurus (WordNet (WN) in our case)
– Automatic Lexical Annotation becomes crucial as a starting point for mapping
discovery
• Problem :
– many schema names are non-dictionary words (compound nouns,
acronyms, abbreviations, etc.), i.e. they are not present in the lexical resource
– in this work, we will concentrate only on non-dictionary Compound Nouns
(CNs)
– the result of lexical annotation is strongly affected by the presence of these
non-dictionary CNs in the schema
Proposed Solution & Motivation
• In some approaches the constituents of a CN are treated as single words.
E.g. the CN “teacher_judgment” is split into two tokens (“teacher” and
“judgment”) and its relatedness to other source elements is calculated as
the average relatedness between each token and the other elements
• As a result, a large set of relationships among different schemata is discovered,
including a great number of false positive relationships
• We propose a semi-automatic method for the lexical annotation of non-dictionary CNs
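The token-averaging approach criticised on this slide can be sketched as follows. This is only an illustration: the relatedness scores are made-up toy values, and `token_average_relatedness` and `TOY_RELATEDNESS` are hypothetical names, not part of any system mentioned in the talk.

```python
from statistics import mean

# Hypothetical pairwise relatedness scores in [0, 1]; toy values, not measured.
TOY_RELATEDNESS = {
    ("teacher", "professor"): 0.9,
    ("judgment", "professor"): 0.1,
}

def token_average_relatedness(compound, other, scores):
    """Split a CN on '_' and average each token's relatedness to `other`."""
    tokens = compound.lower().split("_")
    # Token pairs missing from the table are treated as unrelated (0.0).
    return mean(scores.get((tok, other), 0.0) for tok in tokens)

score = token_average_relatedness("teacher_judgment", "professor", TOY_RELATEDNESS)
```

With these toy values the CN scores halfway between its two tokens, so it may be matched with “professor” even though “teacher_judgment” denotes an act rather than a person; this is the kind of false positive the slide refers to.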
Compound Noun annotation
• Compound Noun (CN): a word composed of more than one word; the component words are called
CN constituents
– In order to perform semi-automatic CN annotation, a method for their
interpretation has to be devised
• The interpretation of a CN is the task of determining the semantic
relationships among the constituents of a CN
• CNs can be divided into four categories: endocentric, exocentric, copulative
and appositional; we consider only endocentric CNs
• Endocentric CN: consists of a head (i.e. the part that contains the basic
meaning of the whole CN) and modifiers, which restrict this meaning. An
endocentric CN exhibits a modifier-head structure: a sequence of nouns
composed of a head noun and one or more modifiers, where the head
noun always occurs after the modifiers
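The modifier-head structure can be illustrated with a minimal sketch. It assumes that the constituents of a schema name are separated by underscores or spaces, and the function name `split_endocentric` is hypothetical; the talk itself relies on a parser for this step.

```python
def split_endocentric(compound):
    """Return (modifiers, head) for a CN written with '_' or space separators.

    Assumes an endocentric CN, where the head is always the last constituent.
    """
    tokens = compound.replace("_", " ").split()
    if len(tokens) < 2:
        raise ValueError("a compound noun needs at least two constituents")
    return tokens[:-1], tokens[-1]

modifiers, head = split_endocentric("teacher_judgment")
```

Here `head` is "judgment" and `modifiers` is the list containing "teacher".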
Compound Noun annotation
• Our restriction is motivated by several observations:
• the vast majority of schema CNs fall into the endocentric category
• endocentric CNs are the most common type of CN in English
• exocentric and copulative CNs, which are represented by a single word,
are often present in a dictionary (e.g. “loudmouth”, “sleepwalk”, etc.)
• appositional compounds are not very common in English and are less likely
to be used as schema elements (e.g. “sweet-sour”)
• Our method can be summed up in four main steps:
• CN constituents disambiguation
• redundant constituents identification and pruning
• CN interpretation via semantic relationships
• creation of a new WN meaning for a CN
CN constituents disambiguation & pruning
• CN constituents disambiguation
– Compound Noun syntactic analysis: syntactic analysis of CN
constituents, performed by a parser
– Disambiguating head and modifier: by applying our CWSD (Combined
Word Sense Disambiguation) algorithm, each word is automatically
mapped into its corresponding WordNet 2.0 synsets
• Redundant constituents identification and pruning
– Redundant words: words that do not contribute new information, i.e. words whose
information can be derived from the schema or from the lexical resource
– E.g. for the attribute “company_address” of the class “company”, the constituent “company”
is not considered, as the relationship holding between a class and its
attributes is implicit in the schema
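The schema-derived case of pruning can be sketched as below. This is a simplified illustration that only drops constituents equal to the owning class name; redundancy derived from the lexical resource is not modelled, and `prune_redundant` is a hypothetical name.

```python
def prune_redundant(attribute, class_name):
    """Drop constituents of an attribute name that repeat its class name."""
    tokens = attribute.lower().split("_")
    kept = [tok for tok in tokens if tok != class_name.lower()]
    # Never prune everything: fall back to the original constituents.
    return kept or tokens

remaining = prune_redundant("company_address", "company")
```

For the example on the slide, only "address" is left to be annotated.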
CN interpretation via semantic relationships
• Our goal is to select, among a set of predefined semantic relationships,
the one that best captures the relation between the head and the modifier
• 9 possible semantic relationships: CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE,
BE, FROM (Levi’s semantic relationships set)
• the semantic relationship between the head and the modifier of a CN
is the same as the one holding between their top-level WN nouns in the WN
hierarchy
• The top-level concepts of the WN hierarchy are the 25 unique
beginners for WN English nouns defined by Miller
CN interpretation via semantic relationships
• To each pair of unique beginners we associate the relationship from
Levi's set that best describes their combined meaning
• For example, we interpret the CN “teacher judgment” by the MAKE
relationship, as “teacher” is a hyponym of “person” and “judgment” is a
hyponym of “act”, and for the pair (person, act) of unique beginners we
choose the relationship MAKE
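[Figure: WN hyponym chains Person#1 → … → Educator#1 → Teacher#1 and Act#2 → … → Judgment#2; the MAKE relationship chosen for (Person#1, Act#2) is carried over to (Teacher#1, Judgment#2)]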
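The lookup described on this slide can be sketched with a toy, hand-written fragment of the hierarchy. The hypernym links below are abridged (intermediate synsets are omitted), the relationship table contains only the (person, act) pair given in the talk, and the names `HYPERNYM`, `BEGINNER_RELATION`, `unique_beginner` and `interpret` are hypothetical.

```python
# Toy child -> parent links; an abridged stand-in for the WN noun hierarchy.
HYPERNYM = {
    "teacher#1": "educator#1",
    "educator#1": "person#1",
    "judgment#2": "act#2",
}

# Relationship chosen for each (modifier beginner, head beginner) pair.
# Only the pair from the slide is listed; the full table is not reproduced here.
BEGINNER_RELATION = {("person#1", "act#2"): "MAKE"}

def unique_beginner(synset):
    """Climb the hypernym links until a synset with no parent is reached."""
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
    return synset

def interpret(modifier, head):
    """Return the relationship for a CN, or None if the pair is not in the table."""
    return BEGINNER_RELATION.get((unique_beginner(modifier), unique_beginner(head)))

relation = interpret("teacher#1", "judgment#2")
```

Because the table is keyed on ordered pairs, swapping modifier and head does not return MAKE in this sketch.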
Creation of a new WN meaning for a CN
• (a) Gloss definition: we create the gloss to be associated with a CN, starting
from the relationship associated with the CN and exploiting the glosses of the
CN constituents
Teacher#1 gloss: A person whose occupation is teaching.
Judgment#2 gloss: The act of judging or assessing a person or situation or event.
Modifier gloss + MAKE + Head gloss
Teacher_judgment gloss: A person whose occupation is teaching make the act of judging or assessing a person or situation or event.
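The gloss construction can be sketched as a simple concatenation of the modifier gloss, the relationship name and the head gloss. This is an assumption about the mechanics based on the example above, not the actual implementation; `compose_gloss` is a hypothetical name.

```python
def compose_gloss(modifier_gloss, relation, head_gloss):
    """Join modifier gloss, relationship name and head gloss into one sentence."""
    mod = modifier_gloss.strip().rstrip(".")
    head = head_gloss.strip().rstrip(".")
    # Lower-case the first letter of the head gloss so it reads as one sentence.
    head = head[:1].lower() + head[1:]
    return f"{mod} {relation.lower()} {head}."

gloss = compose_gloss(
    "A person whose occupation is teaching.",
    "MAKE",
    "The act of judging or assessing a person or situation or event.",
)
```

The relationship name is inserted verbatim, so the verb is not inflected to agree with the subject.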
Creation of a new WN meaning for a CN
• (b) Inclusion of the new CN meaning in WN: as the concept denoted by a
CN is a subset of the concept denoted by the head, we create
– a hyponym relationship between the new CN meaning and its head
meaning
– a generic relationship RT (Related Term), corresponding to WN
relationships such as member meronym, part meronym, etc., between the CN
meaning and its modifier
– we use the WNEditor tool to create/manage the new meaning and to
set new relationships between it and WN meanings
[Figure: in WNEditor, Teacher_judgment#1 is linked to judgment#2 by a hypernym/hyponym relationship and to Teacher#1 by a Related To relationship]
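The two relationships created for a new CN meaning can be sketched with a small in-memory structure. `ToyLexicon` and `add_cn_meaning` are hypothetical names used for illustration only; they do not reflect the WNEditor tool or its interface.

```python
from collections import defaultdict

class ToyLexicon:
    """Minimal in-memory store of glosses and typed links between meanings."""

    def __init__(self):
        self.glosses = {}
        self.relations = defaultdict(set)  # meaning -> {(relation, target)}

    def add_synset(self, synset, gloss):
        self.glosses[synset] = gloss

    def relate(self, source, relation, target, inverse):
        # Store the link in both directions so either side can be queried.
        self.relations[source].add((relation, target))
        self.relations[target].add((inverse, source))

def add_cn_meaning(lexicon, cn, gloss, head, modifier):
    """Insert a CN meaning as a hyponym of its head, related (RT) to its modifier."""
    lexicon.add_synset(cn, gloss)
    lexicon.relate(cn, "hypernym", head, inverse="hyponym")
    lexicon.relate(cn, "RT", modifier, inverse="RT")
    return lexicon

lexicon = add_cn_meaning(
    ToyLexicon(),
    "teacher_judgment#1",
    "A person whose occupation is teaching make the act of judging "
    "or assessing a person or situation or event.",
    head="judgment#2",
    modifier="teacher#1",
)
```

In this representation, ("hypernym", "judgment#2") stored under "teacher_judgment#1" means that the head is the hypernym of the new meaning, i.e. the CN is its hyponym.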
Example
[Figure: the new meaning Teacher_judgment#1 with its hypernym and Related To relationships]
Evaluation: Experimental Result
• CN annotation extends the automatic annotation tool within the MOMIS
system
• Evaluation over a real data source environment: three sources of an
application scenario of the NeP4B project (491 schema elements), which
contain many CNs (about 50%).
• Without CN annotation, CWSD obtains a very low recall value. Our
method increases the recall without significantly worsening precision.
However, the recall value is still not very high, owing to the presence of many
acronyms.
• A CN has been considered correctly annotated if the Levi's relationship
selected manually by the user is the same as the one returned by our method
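The correctness criterion above can be turned into precision and recall figures as sketched below. The talk does not spell out the formulas, so the standard definitions are assumed here (precision over annotated CNs, recall over all CNs in the gold standard). The CN names `cn_a` and `cn_b` and their relationships are placeholders, not data from the experiments.

```python
def precision_recall(predicted, gold):
    """Score predicted Levi relationships against the user's manual choices.

    predicted: CN -> relationship, or None when the method gives no annotation.
    gold:      CN -> relationship selected manually by the user.
    """
    annotated = {cn: rel for cn, rel in predicted.items() if rel is not None}
    correct = sum(1 for cn, rel in annotated.items() if gold.get(cn) == rel)
    precision = correct / len(annotated) if annotated else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Placeholder data for illustration only.
gold = {"teacher_judgment": "MAKE", "cn_a": "IN", "cn_b": "FOR"}
predicted = {"teacher_judgment": "MAKE", "cn_a": "HAVE", "cn_b": None}
precision, recall = precision_recall(predicted, gold)
```

Under these definitions, a CN left without annotation (for instance one containing an acronym) lowers recall but not precision.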
Conclusion
• The experimental results showed the effectiveness of our method, which
significantly improves the result of the lexical annotation process
• Our method may be applied in general in the context of mapping
discovery, ontology merging and data integration systems
• Future work will be devoted to investigating the role of the set of
semantic relationships chosen for the CN interpretation process
• We will extend the tool with a component which deals with acronym
and abbreviation expansion (to appear at the 28th International Conference on
Conceptual Modeling, ER 2009)
Thanks for your attention!