www.dbgroup.unimo.it
Special Session on Agricultural Metadata & Semantics
2nd International Conference on Metadata & Semantics Research
October 10-11, 2007 - Corfu, Greece
Creating and Querying an Integrated
Ontology for Molecular and
Phenotypic Cereals Data
Sonia Bergamaschi, Antonio Sala
www.dbgroup.unimo.it
DII - Dipartimento di Ingegneria dell’Informazione
Università di Modena e Reggio Emilia, via Vignolese 905, 41100 Modena
Funded by:
FIRB project NeP4B: Networked Peers for Business (www.dbgroup.unimo.it/nep4b) and IST FP6 STREP project STASIS (www.dbgroup.unimo.it/stasis)
Motivations
• Numerous public data sources have been created and are now available to researchers in molecular biology
• Accessing this large amount of data is difficult because of:
– Numerous sources
– Heterogeneous interfaces and structures
– Low IT skills of the users
• What is needed:
– Extracting and fusing information from different data sources
– Presenting the information through a single interface, transparently and easily, independently of the format of the different sources
CEREALAB project
• Conducted by the Agrarian Faculty of the University of Modena and Reggio Emilia and funded by the Regional Government of Emilia Romagna
• The aim is to perform intelligent data integration of existing databases, i.e. to create a Global Virtual Schema (GVV) for the genotypic selection of cereal cultivars
• Genotypic selection of cereal cultivars:
– Extracting genotypic data correlated to phenotypic traits
• 3 species involved:
– Wheat
– Barley
– Rice
The MOMIS System
MOMIS (Mediator Environment for Multiple Information Sources) (www.dbgroup.unimo.it/Momis) is a framework for extracting and integrating information from heterogeneous, distributed data sources, and for managing queries over them
Figure: a Global Virtual View (GVV) placed above a layer of Wrappers over the sources Graingenes, Gramene, CEREALAB, GRIN, CRA and ERData. Labels shown in the figure: gene, FHB.
Integration Process Overview
Figure: the MOMIS integration process. Steps shown: WRAPPING of the structured sources (CEREALAB, GRIN, Graingenes) into ODLI3 local schemata; AUTOMATIC/MANUAL and SEMI-AUTOMATIC ANNOTATION of schema elements with synsets; COMMON THESAURUS GENERATION from schema-derived, lexicon-derived, user-supplied and inferred relationships; clusters generation; GVV GENERATION (CEREALAB) with global classes and mapping tables; OWL Export.
Data Sources
• Molecular data:
– Gramene, Relational DB (www.gramene.org)
– Graingenes, Relational DB (wheat.pw.usda.gov/GG2)
– CEREALAB experimental data, Relational DB
• Phenotypic data:
– GRIN, Excel files (www.ars-grin.gov)
– CEREALAB repository, Relational DB (created by collecting data from the specific literature on regional germplasms and from the Italian National Council of Research in Agriculture (CRA))
• Considered separately, each of these data sources provides incomplete information for the purposes of the CEREALAB project, and the sources sometimes overlap
Local Source Schemata Annotation
Local Source Annotation
• Assigns a meaning (a WordNet synset) to each local class and attribute name of a local schema
• Performed semi-automatically
• A WordNet Editor is available to extend WordNet with new domain-dependent terms and synsets
• This extension step has to be performed only the first time a domain is handled
Automatic Annotation (present in WordNet)
Gene: a segment of DNA that is involved in producing a polypeptide chain; it can include regions preceding and following the coding DNA as well as introns between the exons; it is considered a unit of heredity; "genes were formerly called factors"
Manual Annotation (not present in WordNet)
Marker: a segment of DNA with an identifiable physical location on a chromosome, for any feature that has been genetically mapped
Existing WordNet sense of marker: some conspicuous object used to distinguish or mark something; "the buoys were markers for the channel"
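The annotation step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon is a tiny in-memory stand-in for WordNet, the synset identifiers are hypothetical, and "propose the first sense" is an assumed heuristic, not the MOMIS algorithm. Names with no sense in the lexicon are returned separately so that a designer can add a domain-specific synset, which is the role the slide assigns to the WordNet Editor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    synset_id: str  # hypothetical identifier, not a real WordNet offset
    gloss: str


# Toy stand-in for WordNet: term -> senses, preferred sense first.
BASE_LEXICON = {
    "gene": (Synset("gene#1", "a segment of DNA that is involved in producing a polypeptide chain"),),
    "marker": (Synset("marker#1", "some conspicuous object used to distinguish or mark something"),),
}


def extend_lexicon(lexicon, term, synset):
    """Manual step: return a copy of the lexicon with a new preferred sense for term."""
    extended = dict(lexicon)
    extended[term] = (synset,) + tuple(lexicon.get(term, ()))
    return extended


def annotate(names, lexicon):
    """Propose the preferred sense for each schema name; collect names with no sense."""
    annotated, unresolved = {}, []
    for name in names:
        senses = lexicon.get(name.lower(), ())
        if senses:
            annotated[name] = senses[0]
        else:
            unresolved.append(name)
    return annotated, unresolved
```

With the base lexicon, `annotate(["gene", "marker"], BASE_LEXICON)` gives marker its generic sense; after `extend_lexicon` adds the DNA-segment sense, the same call on the extended lexicon proposes the domain sense instead.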
Common Thesaurus Generation
• MOMIS constructs a Common Thesaurus including SYN (synonym), BT/NT (broader/narrower term), and RT (related term) relationships among schema elements
• The Common Thesaurus is constructed through an incremental process in which the following relationships are added:
– schema-derived relationships
– lexicon-derived relationships
– designer-supplied relationships
– inferred relationships
• As an example:
– gene is identified as a BT of allele (as gene is a direct hypernym of allele)
– marker is identified as an NT of gene (as genetic marker has been added as a direct hyponym of gene)
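The lexicon-derived step in the example above can be illustrated with a minimal Python sketch. It assumes a toy table of direct hypernym links (only the two links named on the slide) and encodes each relationship as a triple `(a, REL, b)`, read as "a is a REL of b". The `infer_bt` function is a naive transitive closure used here purely as a stand-in for the inference step; the slides do not say how MOMIS infers relationships.

```python
# Direct hypernym links from the (extended) lexicon: term -> direct hypernym.
# Toy data limited to the two links mentioned in the example.
DIRECT_HYPERNYM = {
    "allele": "gene",
    "marker": "gene",
}


def lexicon_derived(terms, direct_hypernym):
    """Return BT/NT triples for pairs of annotated terms linked by a direct hypernym."""
    relationships = set()
    for term in terms:
        parent = direct_hypernym.get(term)
        if parent in terms:
            relationships.add((parent, "BT", term))  # parent is a broader term of term
            relationships.add((term, "NT", parent))  # term is a narrower term of parent
    return relationships


def infer_bt(relationships):
    """Naive inference: close BT under transitivity and add the matching NT triples."""
    closed = set(relationships)
    changed = True
    while changed:
        changed = False
        bt_pairs = [(a, b) for (a, rel, b) in closed if rel == "BT"]
        for a, b in bt_pairs:
            for c, d in bt_pairs:
                if b == c and a != d and (a, "BT", d) not in closed:
                    closed.add((a, "BT", d))
                    closed.add((d, "NT", a))
                    changed = True
    return closed
```

Calling `lexicon_derived({"gene", "allele", "marker"}, DIRECT_HYPERNYM)` yields the two relationships from the example together with their inverses.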
Global Virtual Schema Generation
MOMIS identifies classes that describe the same or semantically related concepts in different sources and groups them into clusters (global classes)
Mappings are generated between each global class and the local classes in its cluster (according to a GAV, Global-As-View, approach)
A Mapping Table (MT) is automatically generated for each global class of a GVV
gene(Global) | gene(CEREALAB) | gene(Gramene)
allele | allele | allele
locus | locus (listed for only one of the two local classes)
name(Join) | name | name
reference | title | reference
… | … | …
The designer may interactively refine and complete the proposed integration by adding Data Conversion Functions from local to global attributes, or Resolution Functions for global attributes to solve data conflicts among local attribute values
Source, Reference: a publication (or a passage from a publication) that is referred to
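A Mapping Table and a Data Conversion Function can be pictured with the following Python sketch. The dictionary holds only the rows that are fully legible on the slide (allele, name, reference); the conversion function (stripping whitespace from CEREALAB titles) and all record values are hypothetical, chosen only to show where such a function would plug in.

```python
# Mapping Table for the global class "gene":
# global attribute -> {source: local attribute}.
MAPPING_TABLE = {
    "allele": {"CEREALAB": "allele", "Gramene": "allele"},
    "name": {"CEREALAB": "name", "Gramene": "name"},
    "reference": {"CEREALAB": "title", "Gramene": "reference"},
}

# Hypothetical Data Conversion Functions, keyed by (source, local attribute).
CONVERSIONS = {
    ("CEREALAB", "title"): str.strip,
}


def to_global(source, local_record, mapping_table=MAPPING_TABLE, conversions=CONVERSIONS):
    """Translate one local record into the attribute names of the global class."""
    global_record = {}
    for global_attr, per_source in mapping_table.items():
        local_attr = per_source.get(source)
        if local_attr is None or local_attr not in local_record:
            continue  # this source has no mapping or no value for the attribute
        convert = conversions.get((source, local_attr), lambda value: value)
        global_record[global_attr] = convert(local_record[local_attr])
    return global_record
```

For example, a CEREALAB record with a `title` field comes back with that value under the global attribute `reference`, while a Gramene record keeps its `reference` field unchanged.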
Join Conditions
• Object identification: merging data from different sources requires identifying the different instantiations of the same real-world object
• Join Conditions among pairs of local classes belonging to the same global class make it possible to identify instances of the same object and fuse them
(CEREALAB.gene.name) = (gramene.gene.name)
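A minimal sketch of how such a join condition drives fusion, assuming the records have already been renamed to global attributes and that `name` is unique within each source. The function behaves like a full outer join on the join attribute. The default conflict policy ("keep the first value seen") and the `resolvers` parameter are illustrative assumptions standing in for Resolution Functions.

```python
def fuse(left, right, join_attr="name", resolvers=None):
    """Full outer join of two record lists on join_attr, merging matched records."""
    resolvers = resolvers or {}
    merged = {}
    for record in list(left) + list(right):
        target = merged.setdefault(record[join_attr], {})
        for attr, value in record.items():
            if value is None:
                continue  # a missing value never overrides anything
            if attr in target and target[attr] != value:
                # Conflict between sources: apply the resolver, default "keep first".
                resolve = resolvers.get(attr, lambda old, new: old)
                target[attr] = resolve(target[attr], value)
            else:
                target[attr] = value
    return list(merged.values())
```

Records that share the same `name` are merged into one object; records found in only one source are kept as they are.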
The Global Virtual Schema (GVV)
• Each GVV element is automatically annotated with respect to WordNet and can be exported in OWL
• The GVV can be seen as an Ontology of the underlying sources
• This Ontology correlates the molecular data of Gramene, Graingenes and the CEREALAB project with the phenotypic data of the GRIN database and those collected by the CEREALAB project
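To give a rough idea of what an OWL export of global classes could look like, the sketch below serialises classes and their attributes as OWL in Turtle syntax. Everything here is an assumption for illustration: the base namespace, the `class_attribute` naming scheme, and the choice to model every attribute as an `owl:DatatypeProperty`. The slides do not describe the actual structure of the MOMIS export.

```python
def gvv_to_turtle(global_classes, base="http://example.org/gvv#"):
    """Serialise {class name: [attribute names]} as a small OWL ontology in Turtle."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for class_name, attributes in global_classes.items():
        lines.append(f":{class_name} a owl:Class .")
        for attribute in attributes:
            # One datatype property per attribute, scoped to its class by name.
            lines.append(
                f":{class_name}_{attribute} a owl:DatatypeProperty ; rdfs:domain :{class_name} ."
            )
    return "\n".join(lines) + "\n"
```

Calling `gvv_to_turtle({"gene": ["name", "allele"]})` returns a Turtle document declaring one class and two properties.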
The Integrated Ontology
• Molecular data associated with germplasms and phenotypic evaluations
The Integrated Ontology
Phenotypic data are divided into six categories, chosen from those of greatest interest to cereal breeders:
– Abiotic Stress
– Biotic Stress
– Growth and Development related traits
– Quality traits
– Yield traits
The Integrated Ontology
• For each trait, the specific evaluation of a germplasm for that trait is available
The Integrated Ontology
• Molecular data are related to phenotypic data by indicating their presence in a germplasm for which a quantitative phenotypic evaluation is available
• Information is provided about the specific molecular markers that can identify genes or QTLs that express a particular phenotypic trait
• In this way, genotypic selection of cereal cultivars can be performed starting from phenotypic data
Querying the Integrated Ontology
The MOMIS Query Manager is the coordinated set of functions that takes an incoming query (the global query) and performs the following steps:
• Query rewriting
– the global query is rewritten as an equivalent set of queries expressed on the local sources (local queries)
• Local query execution
– the local queries are sent to and executed on the local sources
• Fusion and reconciliation
– the local answers are fused into the global answer
• A Graphical User Interface has been developed to compose queries over the GVV regardless of the specific languages of the source databases
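The three steps listed above (rewriting, local execution, fusion) can be sketched end to end in Python. This is a toy pipeline under several assumptions, not the MOMIS Query Manager: the sources are SQLite connections used as stand-ins for the real databases, the global query is restricted to selecting attributes of a gene by name, and the table and column names in `MAPPINGS` are hypothetical apart from the title/reference mapping shown earlier in the slides.

```python
import sqlite3

# Global attribute -> local column, per source (hypothetical table and column names).
MAPPINGS = {
    "cerealab": {"table": "gene", "columns": {"name": "name", "reference": "title"}},
    "gramene": {"table": "gene", "columns": {"name": "name", "reference": "reference"}},
}


def rewrite(global_attrs, name_value):
    """Query rewriting: build one parameterised local SQL query per source."""
    attrs = list(dict.fromkeys(["name", *global_attrs]))  # always select the join attribute
    local_queries = {}
    for source, mapping in MAPPINGS.items():
        columns = mapping["columns"]
        select = ", ".join(f"{columns[a]} AS {a}" for a in attrs)
        sql = f"SELECT {select} FROM {mapping['table']} WHERE {columns['name']} = ?"
        local_queries[source] = (sql, (name_value,))
    return local_queries


def execute(local_queries, connections):
    """Local query execution: run each local query on its own source."""
    answers = {}
    for source, (sql, params) in local_queries.items():
        cursor = connections[source].execute(sql, params)
        names = [description[0] for description in cursor.description]
        answers[source] = [dict(zip(names, row)) for row in cursor.fetchall()]
    return answers


def fuse(answers, join_attr="name"):
    """Fusion: merge rows that agree on join_attr, keeping the first non-null value."""
    merged = {}
    for rows in answers.values():
        for row in rows:
            target = merged.setdefault(row[join_attr], {})
            for attr, value in row.items():
                if target.get(attr) is None:
                    target[attr] = value
    return list(merged.values())


def answer_global_query(global_attrs, name_value, connections):
    """Run the three steps in sequence and return the single global answer."""
    return fuse(execute(rewrite(global_attrs, name_value), connections))
```

Because the local column names are renamed to global attribute names inside each local query (`title AS reference`), the fusion step only ever sees global attributes.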
Example
“Find information about wheat QTLs that express resistance to the Fusarium fungus”
Conclusions
• The MOMIS system allows the straightforward creation of a Global Virtual Schema that integrates data from the CEREALAB research project with data coming from the Gramene, Graingenes and GRIN databases
• The integration process provides a single interface for the 3 sources according to a common ontology
• Querying the 3 sources through a GUI is completely transparent and easy for the user
• A single, unified answer is obtained