DB Group @ unimo
1st International Workshop on Interoperability
through Semantic Data and Service Integration
25 June 2009
Camogli, Italy
Semantic Analysis for an
Advanced ETL framework
S.Bergamaschi1, F. Guerra2, M. Orsini1, C. Sartori3 , M. Vicini1
1DII
- Università di Modena e Reggio Emilia, via Vignolese 905, Modena, Italy
2DEA - Università di Modena e Reggio Emilia, v.le Berengario 51, Modena, Italy
3DEIS - Università di Bologna, v.le Risorgimento 2, Bologna, Italy
1
DB Group @ unimo
Outline
1.
2.
Motivation
Related Work
•
•
3.
4.
5.
MOMIS
RELEVANT
Semantic Extraction from data sources
A transformation function based on RELEVANT
Conclusion and future work
2
DB Group @ unimo
Motivation
•
•
Enterprise Information Systems provide a technology platform that enables organizations to
integrate and coordinate their business processes.
The data warehouse enables business intelligence analysis on data coming from multiple sources.
Traditional architectures of data warehouse systems rely on extraction, transformation and
loading (ETL) tools for:
1. the identification of relevant information at the source side,
2. the extraction of this information,
3. the customization and integration of the information coming from multiple sources
into a common format,
4. the cleaning of the resulting data set on the basis of database and business rules,
5. the propagation of the data to the data warehouse and/or data marts.
• ETL processes are typically based on constraints and requirements expressed in
natural language in the form of comments and documentations.
• ETL tasks are manually executed. In the context of traditional databases, this fact
does not represent a real big issue:
1.
2.
the processes requiring the manual user intervention involve data source schemata
that are generally fixed;
all the data sources collected in the data warehouses typically belong to the same
company and are known and managed by the same designers.
3
DB Group @ unimo
Motivation (2)
• The reference scenario is changed, since data sources populating the data warehouse
may not be directly known and managed by the designers. We have classified four
major critical activities:
–
–
–
–
Automating Extraction processes. Designers may no longer rely on internal documents,
comments and previous knowledge on the data source contents. Techniques for identifying
the information held by the data sources and extracting the relevant data for populating the
data warehouse are required;
Automating Transformation processes. Data from different sources may not be
homogeneous, i.e. different metrics may be used for expressing similar values, synonyms
may be used for representing the same values (and vice-versa the same value in different
sources may describe a different concept), and values may be expressed with different
granularity levels;
Relaxing the Transformation processes. Deeply homogenizing transformations risk to
flatten data warehouse structure and contents, thus allowing less accurate data analysis on
such data;
Speeding-up and easying ETL process execution. Fast and simple ETL execution is a
crucial competition factor when the sources populating the data warehouse are dynamic.
4
DB Group @ unimo
Motivation (3)
•
•
We introduce a tool for automating the extraction process and the formulation of
transformation functions which provides a real support in dealing with the above
issues.
By means of a semantic analysis coupled with clustering techniques, the tool
performs two tasks:
1. it works at the schema level, identifying the parts of the schemata of the data sources
which are related to the data warehouse, thus supporting the extraction process;
2. it works at the instance level, grouping the attribute values semantically related thus
defining a transformation function for populating with “homogeneous” values the data
warehouse.
5
DB Group @ unimo
MOMIS
•
•
The Mediator EnvirOnment for Multiple Information Sources (MOMIS) is a
semiautomatic data integration system that follows a global-as-view approach to
provide an intelligent access to heterogeneous, structured and semi-structured
information sources.
MOMIS is based on clustering techniques applied to a set of metadata collected in a
Common Thesaurus in the form of relationships describing inter- and intra-schema
knowledge about classes and attributes of the local source schemata. The
relationships are:
1. extracted from descriptions of local schemata;
2. obtained from the relationships existing in the WordNet database between the meanings
associated to the source elements;
3. provided by the integration designer;
4. inferred by means of Description Logics techniques.
6
DB Group @ unimo
MOMIS (2)
7
DB Group @ unimo
RELEVANT
•
•
•
The idea is that analyzing an attribute domain, we may find values which may be
clustered because strongly related.
Providing a name to these clusters, we may refer to a relevant value name which
encompasses a set of values.
More formally, given a class C and one of its attributes At, a relevant value for it,
rvAt is a pair rvAt =< rvnAt, valuesAt >
rvnAt is the name of the relevant value, while valuesAt is the set of values referring
to it.
8
DB Group @ unimo
RELEVANT (2)
•
•
•
•
Data pre-processing: a binary representation of the values of an attribute is obtained
with two matrices: the syntactic matching table (MT) and the root elements matching
table (MTR).
Similarity Computation: two tasks are enabled: the selection of the metrics for
computing the similarity between pairs of attribute values and the selection of the
semantics to be used: the semantics of containment, the semantics of clusters or a
combination of both.
Relevant Values elicitation: this module implements some clustering algorithms to
compute the set of relevant values on the basis of the choices made at step 2.
Validation: some standard techniques to evaluate cluster quality are implemented.
9
DW
MOMIS Functionalities
RELEVANT Functionalities
Extraction of
schema descriptions
Similarity Computation
Semi-automatic
annotation
Knowledge engine
Syntax / Membership / Thesaurus
Thesaurus
DB Group @ unimo
Semantic extraction from data sources
Clustering Technique
New
source
Mappings between DW
and new source elements
Cluster Analysis
Clusters
10
DB Group @ unimo
Semantic extraction from data sources (2)
1.
2.
3.
4.
5.
The schema descriptions of data sources are extracted by means of wrappers.
The annotation according to a reference ontology allows the specification of a unique meaning to
the schema description..
The knowledge engine is the component in charge of creating a thesaurus of relationships
between the schema elements (in terms of classes and attributes).
RELEVANT is applied to the descriptions of sources extracted in the first step with the aim of
computing clusters of related attributes, exploiting: 1) syntactic similarity; 2) memberships; 3)
semantic similarity. The application of a clustering algorithm generates clusters of similar
elements.
Mappings are automatically generated by analyzing the clustering result. The following cases are
possible:
1. A cluster contains attributes from the data warehouse schema and the new data sources: for
each data warehouse attribute a mapping to each attribute in the cluster is generated.
2. A cluster contains only attributes of the new data sources: it is not exploited for the mapping
generation. This cluster is due to the choice of a too selective clustering threshold.
3. A cluster contains only attributes of the data warehouse schema: it is not exploited for the
mapping generation. This kind of cluster indicates that there are attributes in the data warehouse schema
which are very close and may, perhaps, be fused into a unique table.
11
DB Group @ unimo
Example
•
•
This real scenario an ongoing experiment within the CROSS lab project, funded by
Italian Emilia Romagna region.
It concerns the ETL process for the creation of a data warehouse for a commercial
business intelligence application of the DataFlow company, which developed a new
application, called Bollicine Community business Intelligence (BCI).
–
•
•
•
BCI is based on a predefined data warehouse which is automatically available for companies
which already have the DataFlow Business management software.
The consortium wants to provide its members with the BCI solution, reaching two
objectives: 1) the analysis and planning of the enterprise market starting from its
past data; 2) developing a performance benchmarking w.r.t. general indexes (KPI)
obtained by aggregating data of all the members.
To reach this goal it is necessary to load all the data about the consortium members
in the BCI data warehouse.
We experimented in this context our tool: preliminary qualitative results show that by
loading their data in a declarative way, the enterprises considerably save human
resources.
12
DB Group @ unimo
Example
13
DB Group @ unimo
Example (2)
•
Since the attributes
FAMILY DESCRIPTION(S1), CATEGORY DESCRIPTION(S2), CATEGORY
DESCRIPTION(DW)
are annotated with the same ‘‘description” concept in WordNet and CLASS LABEL is
annotated with “label” that is a hyponymterm of ‘‘description” in WordNet, the
knowledge engine generates this set of SYN relationship in the thesaurus.
ARTICLE.CATEGORY_DESCRIPTION(DW) SYN MERCHANDISE.FAMILY_DESCRIPTION(S1)
ARTICLE.CATEGORY_DESCRIPTION(DW) SYN ARTICLE.CATEGORY_DESCRIPTION(S2)
ARTICLE.CATEGORY_DESCRIPTION(DW) BT PRODUCT.CLASS_LABEL(S3)
ARTICLE.CATEGORY_DESCRIPTION(S2) SYN MERCHANDISE.FAMILY_DESCRIPTION(S1)
ARTICLE.CATEGORY_DESCRIPTION(S2) BT PRODUCT.CLASS_LABEL(S3)
MERCHANDISE.FAMILY_DESCRIPTION(S1) BT PRODUCT.CLASS_LABEL(S3)
•
These relationships may be exploited with the semantic similarity thus obtaining a
unique cluster with all the attributes and consequently a set of mappings between
the DW CATEGORY_DESCRIPTION and the corresponding attributes in the new
sources.
14
DB Group @ unimo
A transformation function based on RELEVANT
•
We provide a new kind of transformation function based on semantic analysis for
string values.
–
•
Semantically related values of a chosen attribute in the new data source and the
correspondent values into the data warehouse target are grouped, thus providing a
semantic reconciliation of the attribute values.
The transformation function works according to the following steps:
–
–
Attribute domains analysis. RELEVANT is used for evaluating if the domains of the new
source and the data warehouse attribute are compatible, i.e. they describe similar
properties for the attributes.
The user may select to transform and load into the target attribute only the synthesized set
of values represented by the relevant values.
15
DB Group @ unimo
Example
•
Let us consider BCI table describing articles sold by companies
ARTICLE(CODE,DESCR,CATEGORY_DESCRIPTION,SUBCATEGORY_DESCRIPTION)
• The following fragment shows some of the attribute values:
Source: #1
Attribute: FAMILY_DESCRIPTION = {NOT DEFINED, WATER, BEER, WINE,
SOFT DRINK, FOOD)
Source: #2
Attribute: CATEGORY_DESCRIPTION = {HARD LIQUOR, BOTTLE WINE, NOT
DEFINED, JUICE DRINK, MINERAL WATER, BOTTLE BEER, SEVERAL
ALCOHOLIC BEVERAGE)
Source: #3
Attribute: CLASS_LABEL = {NOT DEFINED, MINERAL WATER, BEER, WINE,
ALCOHOLIC DRINK, FOOD, BOOZE, FOOD CATERING)
16
DB Group @ unimo
Example (2)
The RELEVANT application to these values defines a set of 8 clusters whose names are
loaded into the data warehouse attribute instead of the 22 original values:
RV1: NOT DEFINED {NOT DEFINED}
RV2: WATER {MINERAL WATER, WATER}
RV3: FOOD {FOOD, FOOD, FOOD CATERING}
RV4: SEVERAL {SEVERAL}
RV5: BEER {BEER, BOTTLE BEER}
RV6: WINE {WINE, BOTTLE WINE}
RV7: ALCOHOLIC DRINK {ALCOHOLIC DRINK, ALCOHOLIC BEVERAGE, HARD
LIQUOR, BOOZE}
RV8: SOFT DRINK {SOFT DRINK, JUICE DRINK}
17
DB Group @ unimo
Example (3)
•
The name assigned to each cluster is obtained in this case by using the priority list of sources:
the function is automatically written in the form of a SQL query on the basis of the COALESCE
function and applied to the clusters. UPDATE CLUSTER_ARTICLE_CATEGORY_DESCRIPTION
SET TARGET = COALESCE
((SELECT MIN(CL1.CATEGORY_DESCRIPTION)
FROM CLUSTER_ARTICLE_CATEGORY_DESCRIPTION AS CL1
WHERE CHARINDEX(RTRIM(LTRIM(CL1.CATEGORY_DESCRIPTION)),
CLUSTER_ARTICLE_CATEGORY_DESCRIPTION.CLUSTER) > 0 AND
CL1.SOURCE = ’S1’),
(SELECT MIN(CL1.CATEGORY_DESCRIPTION)
FROM CLUSTER_ARTICLE_CATEGORY_DESCRIPTION AS CL1
WHERE CHARINDEX(RTRIM(LTRIM(CL1.CATEGORY_DESCRIPTION)),
CLUSTER_ARTICLE_CATEGORY_DESCRIPTION.CLUSTER) > 0 AND
CL1.SOURCE = ’S3’),
(SELECT MIN(CL1.CATEGORY_DESCRIPTION)
OM CLUSTER_ARTICLE_CATEGORY_DESCRIPTION AS CL1
ERE CHARINDEX(RTRIM(LTRIM(CL1.CATEGORY_DESCRIPTION)),
CLUSTER_ARTICLE_CATEGORY_DESCRIPTION.CLUSTER) > 0 AND
CL1.SOURCE = ’S2’))
18
DB Group @ unimo
Example (4)
19
DB Group @ unimo
Conclusion
•
In this paper we proposed to couple and extend our previous research on data
integration and data analysis for creating an ETL tool. We focused our work on:
–
–
•
•
the extraction phase, by implementing a technique that semi-automatically defines
mappings between a data warehouse schema and a new data source,
the transformation phase, by proposing a new function based on relevant values,
particularly useful for supporting drill down operations.
We experimented our approach on a real scenario, thus obtaining qualitative results
on the effectiveness of the approach.
Future work will be addressed on identifying a benchmark and a set of measures in
order to perform a complete technique evaluation.
20
Scarica

Semantic Analysis for an Advanced ETL framework