XML Synopses
Stefania Marrara

Introduction
- In the last few years XML has spread to many applications.
- The proposed query language, XQuery, is complex and still needs further study.
- Studies of the expressive power of XQuery show that aggregates are very important, yet XQuery does not offer even basic OLAP features.
- Approximate queries offer a good solution to the computational cost of aggregate queries:
  - Complex traversals of the data hierarchy
  - Non-trivial predicates on the path structure and the value content
- Synopsis: a small collection of data that is representative of a huge one.
- Extending the synopsis approach (histograms, sampling and wavelets) to XML is difficult because structure and heterogeneous data coexist in the same document.

Related work
- Buneman et al. [B1] compress the XML tree by using an appropriate bisimulation relation and evaluate an XPath query over the compressed instance. The goal is to compute an exact answer to a path query.
- Polyzotis, Garofalakis and Ioannidis [PGI2] propose TreeSketch: a clustering of document elements in which each cluster represents elements with similarly structured sub-trees.

A histogram-based approach for computing aggregate queries
The approach is divided into two logical steps:
- Creation of the synopsis (designer and user)
- Querying the synopsis
The aim of this work is to generate the synopsis and the approximate queries automatically, by means of rule sets that create the appropriate XQuery query.

Creation of the synopsis: TR-Rules
[Figure: a collection of XML data documents and the XML Schema of the collection are the input of an XQuery transformation, together with the synopsis elements Es and the set of histogram parameters P. The output is the XML data synopsis, whose documents contain <hist> elements.]
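As a rough illustration of the synopsis-creation step, the Python sketch below counts how often each value of a summarized element occurs for every combination of grouping values across a small collection. It is not the TR-Rules XQuery transformation itself: the function name build_histogram, its parameters, the helper car, and the sample documents are all hypothetical, chosen only to show the kind of counting a histogram synopsis relies on.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_histogram(docs, record_path, value_path, group_paths):
    """Count value occurrences per combination of grouping values.

    Hypothetical helper: record_path selects the repeated elements,
    value_path the summarized value, group_paths the grouping values.
    """
    hist = Counter()
    for text in docs:
        root = ET.fromstring(text)
        for record in root.findall(record_path):
            value = record.findtext(value_path)
            groups = tuple(record.findtext(p) for p in group_paths)
            # Skip records that lack the value or any grouping element.
            if value is None or None in groups:
                continue
            key = (tuple(g.strip() for g in groups), value.strip())
            hist[key] += 1
    return hist


def car(color, city, model):
    """Build one hypothetical <car> element as a string."""
    return (f"<car><color>{color}</color><selling><details>"
            f"<city>{city}</city><model>{model}</model>"
            f"</details></selling></car>")


# Hypothetical two-document collection (invented sample data).
docs = [
    "<list>" + car("white", "Milan", "Fiat Brava")
    + car("white", "Milan", "Fiat Brava")
    + car("blue", "Rome", "Fiat Punto") + "</list>",
    "<list>" + car("white", "Verona", "Fiat Marea")
    + car("blue", "Rome", "Fiat Punto") + "</list>",
]

synopsis = build_histogram(
    docs,
    record_path="car",
    value_path="selling/details/model",
    group_paths=["color", "selling/details/city"],
)

for (groups, model), count in sorted(synopsis.items()):
    print(groups, model, count)
```

The counts are keyed by the grouping values first, so one bucket list per (color, city) pair can be read off directly; a real synopsis would additionally apply the histogram parameters P, which this sketch leaves out.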
Query on the Synopsis: QTR-Rules
[Figure: QTR-Rules diagram]

Synopsis elements
Es = {(path_e, <path_g>)}
   = {(list/car/selling/details/model, <list/car/color, list/car/selling/details/city>)}
[Figure: two bar charts, one for Color: white and one for Color: blue, over the models Fiat Brava, Fiat Punto and Fiat Marea, with one series per city (Verona, Milan, Rome).]

Tree representation of the synopsis
[Figure: Example of synopsis. A tree rooted at list with car children; each car has a color (white, blue) and a selling/details branch with a city (Milan, Rome) and a model node holding a histogram over Fiat Brava, Fiat Punto and Fiat Marea.]

Query example
  <total>{
    count(distinct-values(
      for $det in doc("cars.xml")/list/car/selling/details
      where $det/model = "Fiat Brava"
      return $det/city
    ))
  }</total>

Transformed query
  <total>{
    count(distinct-values(
      for $det in doc("cars_syn.xml")/list/car/selling/details
      where $det/model/hist/bucket/bv = "Fiat Brava"
      return $det/city
    ))
  }</total>

Conclusions
- Prototype tool
- First experiments show a strong reduction of the space occupied by the data: 500 bytes x n documents -> 750 bytes
- Huge collections of data… (n = 100, 1000, 10000…)
- Error measure

Ongoing and Future work
- Optimal refreshing of the synopsis as the original collection is updated (needs an XQuery update language)

Bibliography
[B1] P. Buneman, M. Grohe and C. Koch. "Path queries on compressed XML". In Proc. of the 29th Int. Conf. on Very Large Data Bases, 2003.
[PGI2] N. Polyzotis, M. Garofalakis and Y. Ioannidis. "Approximate XML Query Answers". ACM SIGMOD, 2004.
S. Marrara. "Aggregate queries in XQuery". PhD Thesis, Politecnico di Milano, 2005.