DB Group @ unimo 1st International Workshop on Interoperability through Semantic Data and Service Integration 25 June 2009 Camogli, Italy Semantic Analysis for an Advanced ETL framework S.Bergamaschi1, F. Guerra2, M. Orsini1, C. Sartori3 , M. Vicini1 1DII - Università di Modena e Reggio Emilia, via Vignolese 905, Modena, Italy 2DEA - Università di Modena e Reggio Emilia, v.le Berengario 51, Modena, Italy 3DEIS - Università di Bologna, v.le Risorgimento 2, Bologna, Italy 1 DB Group @ unimo Outline 1. 2. Motivation Related Work • • 3. 4. 5. MOMIS RELEVANT Semantic Extraction from data sources A transformation function based on RELEVANT Conclusion and future work 2 DB Group @ unimo Motivation • • Enterprise Information Systems provide a technology platform that enables organizations to integrate and coordinate their business processes. The data warehouse enables business intelligence analysis on data coming from multiple sources. Traditional architectures of data warehouse systems rely on extraction, transformation and loading (ETL) tools for: 1. the identification of relevant information at the source side, 2. the extraction of this information, 3. the customization and integration of the information coming from multiple sources into a common format, 4. the cleaning of the resulting data set on the basis of database and business rules, 5. the propagation of the data to the data warehouse and/or data marts. • ETL processes are typically based on constraints and requirements expressed in natural language in the form of comments and documentations. • ETL tasks are manually executed. In the context of traditional databases, this fact does not represent a real big issue: 1. 2. the processes requiring the manual user intervention involve data source schemata that are generally fixed; all the data sources collected in the data warehouses typically belong to the same company and are known and managed by the same designers. 3 DB Group @ unimo Motivation (2) • The reference scenario is changed, since data sources populating the data warehouse may not be directly known and managed by the designers. We have classified four major critical activities: – – – – Automating Extraction processes. Designers may no longer rely on internal documents, comments and previous knowledge on the data source contents. Techniques for identifying the information held by the data sources and extracting the relevant data for populating the data warehouse are required; Automating Transformation processes. Data from different sources may not be homogeneous, i.e. different metrics may be used for expressing similar values, synonyms may be used for representing the same values (and vice-versa the same value in different sources may describe a different concept), and values may be expressed with different granularity levels; Relaxing the Transformation processes. Deeply homogenizing transformations risk to flatten data warehouse structure and contents, thus allowing less accurate data analysis on such data; Speeding-up and easying ETL process execution. Fast and simple ETL execution is a crucial competition factor when the sources populating the data warehouse are dynamic. 4 DB Group @ unimo Motivation (3) • • We introduce a tool for automating the extraction process and the formulation of transformation functions which provides a real support in dealing with the above issues. By means of a semantic analysis coupled with clustering techniques, the tool performs two tasks: 1. it works at the schema level, identifying the parts of the schemata of the data sources which are related to the data warehouse, thus supporting the extraction process; 2. it works at the instance level, grouping the attribute values semantically related thus defining a transformation function for populating with “homogeneous” values the data warehouse. 5 DB Group @ unimo MOMIS • • The Mediator EnvirOnment for Multiple Information Sources (MOMIS) is a semiautomatic data integration system that follows a global-as-view approach to provide an intelligent access to heterogeneous, structured and semi-structured information sources. MOMIS is based on clustering techniques applied to a set of metadata collected in a Common Thesaurus in the form of relationships describing inter- and intra-schema knowledge about classes and attributes of the local source schemata. The relationships are: 1. extracted from descriptions of local schemata; 2. obtained from the relationships existing in the WordNet database between the meanings associated to the source elements; 3. provided by the integration designer; 4. inferred by means of Description Logics techniques. 6 DB Group @ unimo MOMIS (2) 7 DB Group @ unimo RELEVANT • • • The idea is that analyzing an attribute domain, we may find values which may be clustered because strongly related. Providing a name to these clusters, we may refer to a relevant value name which encompasses a set of values. More formally, given a class C and one of its attributes At, a relevant value for it, rvAt is a pair rvAt =< rvnAt, valuesAt > rvnAt is the name of the relevant value, while valuesAt is the set of values referring to it. 8 DB Group @ unimo RELEVANT (2) • • • • Data pre-processing: a binary representation of the values of an attribute is obtained with two matrices: the syntactic matching table (MT) and the root elements matching table (MTR). Similarity Computation: two tasks are enabled: the selection of the metrics for computing the similarity between pairs of attribute values and the selection of the semantics to be used: the semantics of containment, the semantics of clusters or a combination of both. Relevant Values elicitation: this module implements some clustering algorithms to compute the set of relevant values on the basis of the choices made at step 2. Validation: some standard techniques to evaluate cluster quality are implemented. 9 DW MOMIS Functionalities RELEVANT Functionalities Extraction of schema descriptions Similarity Computation Semi-automatic annotation Knowledge engine Syntax / Membership / Thesaurus Thesaurus DB Group @ unimo Semantic extraction from data sources Clustering Technique New source Mappings between DW and new source elements Cluster Analysis Clusters 10 DB Group @ unimo Semantic extraction from data sources (2) 1. 2. 3. 4. 5. The schema descriptions of data sources are extracted by means of wrappers. The annotation according to a reference ontology allows the specification of a unique meaning to the schema description.. The knowledge engine is the component in charge of creating a thesaurus of relationships between the schema elements (in terms of classes and attributes). RELEVANT is applied to the descriptions of sources extracted in the first step with the aim of computing clusters of related attributes, exploiting: 1) syntactic similarity; 2) memberships; 3) semantic similarity. The application of a clustering algorithm generates clusters of similar elements. Mappings are automatically generated by analyzing the clustering result. The following cases are possible: 1. A cluster contains attributes from the data warehouse schema and the new data sources: for each data warehouse attribute a mapping to each attribute in the cluster is generated. 2. A cluster contains only attributes of the new data sources: it is not exploited for the mapping generation. This cluster is due to the choice of a too selective clustering threshold. 3. A cluster contains only attributes of the data warehouse schema: it is not exploited for the mapping generation. This kind of cluster indicates that there are attributes in the data warehouse schema which are very close and may, perhaps, be fused into a unique table. 11 DB Group @ unimo Example • • This real scenario an ongoing experiment within the CROSS lab project, funded by Italian Emilia Romagna region. It concerns the ETL process for the creation of a data warehouse for a commercial business intelligence application of the DataFlow company, which developed a new application, called Bollicine Community business Intelligence (BCI). – • • • BCI is based on a predefined data warehouse which is automatically available for companies which already have the DataFlow Business management software. The consortium wants to provide its members with the BCI solution, reaching two objectives: 1) the analysis and planning of the enterprise market starting from its past data; 2) developing a performance benchmarking w.r.t. general indexes (KPI) obtained by aggregating data of all the members. To reach this goal it is necessary to load all the data about the consortium members in the BCI data warehouse. We experimented in this context our tool: preliminary qualitative results show that by loading their data in a declarative way, the enterprises considerably save human resources. 12 DB Group @ unimo Example 13 DB Group @ unimo Example (2) • Since the attributes FAMILY DESCRIPTION(S1), CATEGORY DESCRIPTION(S2), CATEGORY DESCRIPTION(DW) are annotated with the same ‘‘description” concept in WordNet and CLASS LABEL is annotated with “label” that is a hyponymterm of ‘‘description” in WordNet, the knowledge engine generates this set of SYN relationship in the thesaurus. ARTICLE.CATEGORY_DESCRIPTION(DW) SYN MERCHANDISE.FAMILY_DESCRIPTION(S1) ARTICLE.CATEGORY_DESCRIPTION(DW) SYN ARTICLE.CATEGORY_DESCRIPTION(S2) ARTICLE.CATEGORY_DESCRIPTION(DW) BT PRODUCT.CLASS_LABEL(S3) ARTICLE.CATEGORY_DESCRIPTION(S2) SYN MERCHANDISE.FAMILY_DESCRIPTION(S1) ARTICLE.CATEGORY_DESCRIPTION(S2) BT PRODUCT.CLASS_LABEL(S3) MERCHANDISE.FAMILY_DESCRIPTION(S1) BT PRODUCT.CLASS_LABEL(S3) • These relationships may be exploited with the semantic similarity thus obtaining a unique cluster with all the attributes and consequently a set of mappings between the DW CATEGORY_DESCRIPTION and the corresponding attributes in the new sources. 14 DB Group @ unimo A transformation function based on RELEVANT • We provide a new kind of transformation function based on semantic analysis for string values. – • Semantically related values of a chosen attribute in the new data source and the correspondent values into the data warehouse target are grouped, thus providing a semantic reconciliation of the attribute values. The transformation function works according to the following steps: – – Attribute domains analysis. RELEVANT is used for evaluating if the domains of the new source and the data warehouse attribute are compatible, i.e. they describe similar properties for the attributes. The user may select to transform and load into the target attribute only the synthesized set of values represented by the relevant values. 15 DB Group @ unimo Example • Let us consider BCI table describing articles sold by companies ARTICLE(CODE,DESCR,CATEGORY_DESCRIPTION,SUBCATEGORY_DESCRIPTION) • The following fragment shows some of the attribute values: Source: #1 Attribute: FAMILY_DESCRIPTION = {NOT DEFINED, WATER, BEER, WINE, SOFT DRINK, FOOD) Source: #2 Attribute: CATEGORY_DESCRIPTION = {HARD LIQUOR, BOTTLE WINE, NOT DEFINED, JUICE DRINK, MINERAL WATER, BOTTLE BEER, SEVERAL ALCOHOLIC BEVERAGE) Source: #3 Attribute: CLASS_LABEL = {NOT DEFINED, MINERAL WATER, BEER, WINE, ALCOHOLIC DRINK, FOOD, BOOZE, FOOD CATERING) 16 DB Group @ unimo Example (2) The RELEVANT application to these values defines a set of 8 clusters whose names are loaded into the data warehouse attribute instead of the 22 original values: RV1: NOT DEFINED {NOT DEFINED} RV2: WATER {MINERAL WATER, WATER} RV3: FOOD {FOOD, FOOD, FOOD CATERING} RV4: SEVERAL {SEVERAL} RV5: BEER {BEER, BOTTLE BEER} RV6: WINE {WINE, BOTTLE WINE} RV7: ALCOHOLIC DRINK {ALCOHOLIC DRINK, ALCOHOLIC BEVERAGE, HARD LIQUOR, BOOZE} RV8: SOFT DRINK {SOFT DRINK, JUICE DRINK} 17 DB Group @ unimo Example (3) • The name assigned to each cluster is obtained in this case by using the priority list of sources: the function is automatically written in the form of a SQL query on the basis of the COALESCE function and applied to the clusters. UPDATE CLUSTER_ARTICLE_CATEGORY_DESCRIPTION SET TARGET = COALESCE ((SELECT MIN(CL1.CATEGORY_DESCRIPTION) FROM CLUSTER_ARTICLE_CATEGORY_DESCRIPTION AS CL1 WHERE CHARINDEX(RTRIM(LTRIM(CL1.CATEGORY_DESCRIPTION)), CLUSTER_ARTICLE_CATEGORY_DESCRIPTION.CLUSTER) > 0 AND CL1.SOURCE = ’S1’), (SELECT MIN(CL1.CATEGORY_DESCRIPTION) FROM CLUSTER_ARTICLE_CATEGORY_DESCRIPTION AS CL1 WHERE CHARINDEX(RTRIM(LTRIM(CL1.CATEGORY_DESCRIPTION)), CLUSTER_ARTICLE_CATEGORY_DESCRIPTION.CLUSTER) > 0 AND CL1.SOURCE = ’S3’), (SELECT MIN(CL1.CATEGORY_DESCRIPTION) OM CLUSTER_ARTICLE_CATEGORY_DESCRIPTION AS CL1 ERE CHARINDEX(RTRIM(LTRIM(CL1.CATEGORY_DESCRIPTION)), CLUSTER_ARTICLE_CATEGORY_DESCRIPTION.CLUSTER) > 0 AND CL1.SOURCE = ’S2’)) 18 DB Group @ unimo Example (4) 19 DB Group @ unimo Conclusion • In this paper we proposed to couple and extend our previous research on data integration and data analysis for creating an ETL tool. We focused our work on: – – • • the extraction phase, by implementing a technique that semi-automatically defines mappings between a data warehouse schema and a new data source, the transformation phase, by proposing a new function based on relevant values, particularly useful for supporting drill down operations. We experimented our approach on a real scenario, thus obtaining qualitative results on the effectiveness of the approach. Future work will be addressed on identifying a benchmark and a set of measures in order to perform a complete technique evaluation. 20