XML Synopses
Stefania Marrara
Introduction
In the last few years XML has spread to many
applications
The proposed query language XQuery is
complex and still needs further study
Studies of the expressive power of XQuery
show that aggregates are very important,
yet XQuery does not offer even basic
OLAP features
Introduction
Approximate queries offer a good solution to the
computational cost of aggregate queries, which
comes from:
Complex traversals of the data hierarchy
Non-trivial predicates on the path structure and the value
content
Synopsis: a small collection of data representative of
a huge one
Extending the synopsis approach (histograms,
sampling and wavelets) to XML presents big
difficulties due to the presence of structure and of
heterogeneous data in the same document.
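As a reminder of what a histogram synopsis does in the flat (non-XML) case, here is a minimal Python sketch. It is not taken from the slides: the `prices` values, the bucket count and the function names are all hypothetical, and it assumes values are spread uniformly inside each bucket.

```python
from collections import Counter

def build_histogram(values, lo, hi, buckets):
    """Equi-width histogram: one count per bucket over [lo, hi)."""
    width = (hi - lo) / buckets
    counts = Counter()
    for v in values:
        # Clamp so that a value equal to hi lands in the last bucket.
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return [counts[i] for i in range(buckets)], width

def estimate_count(hist, width, lo, a, b):
    """Estimate how many values fall in [a, b), assuming a uniform
    spread of values inside each bucket."""
    total = 0.0
    for i, c in enumerate(hist):
        b_lo = lo + i * width
        b_hi = b_lo + width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total

# Hypothetical data, used only for illustration.
prices = [1, 2, 3, 11, 12, 25, 26, 27, 28, 39]
hist, width = build_histogram(prices, lo=0, hi=40, buckets=4)
aligned = estimate_count(hist, width, 0, 0, 20)     # range aligned with buckets
partial = estimate_count(hist, width, 0, 20, 25)    # range cutting a bucket
```

The range aligned with bucket boundaries is answered exactly, while the range that cuts through a bucket is only estimated. None of this handles structure: the sketch works on a flat list of numbers, which is exactly what an XML document is not.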
Related work
Buneman et al. [B1] compress the XML tree by using
an appropriate bisimulation relation and evaluate
XPath queries over the compressed instance. The goal is
to compute an exact answer to a path query.
Polyzotis, Garofalakis and Ioannidis [PGI2] propose
TreeSketch:
Clustering of document elements, where each cluster represents
elements with structurally similar sub-trees.
A histogram-based approach for
aggregate query computation
The approach is divided into two logical
steps:
Creation of the synopsis (designer and user)
Querying the synopsis
The aim of this work is the automatic generation
of the synopsis and of the approximate
queries by means of sets of rules that
create the appropriate XQuery queries.
Creation of the synopsis: TR-Rules
[Figure: the XML data collection and the XML Schema of the collection, together with the synopsis elements Es and the set of histogram parameters P, feed an XQuery transformation that produces the XML data synopsis, whose documents contain <hist> elements.]
Query on the Synopsis: QTR-Rules
Synopsis elements
Es = {(path_e, <path_g>)}
   = {(list/car/selling/details/model,
       <list/car/color, list/car/selling/details/city>)}
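One way to read the pair (path_e, <path_g>) is "build a histogram over the values of path_e, one per combination of the grouping paths in path_g". The Python sketch below illustrates that reading only. The sample document is hypothetical, the function name is invented, and the result is kept as a plain dictionary rather than as the XML synopsis that the TR-Rules produce through XQuery.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample shaped like the paths in Es; not data from the slides.
CARS_XML = """
<list>
  <car>
    <color>white</color>
    <selling>
      <details><city>Milan</city><model>Fiat Brava</model></details>
      <details><city>Milan</city><model>Fiat Brava</model></details>
      <details><city>Rome</city><model>Fiat Punto</model></details>
    </selling>
  </car>
  <car>
    <color>blue</color>
    <selling>
      <details><city>Milan</city><model>Fiat Marea</model></details>
    </selling>
  </car>
  <car>
    <color>white</color>
    <selling>
      <details><city>Milan</city><model>Fiat Punto</model></details>
    </selling>
  </car>
</list>
"""

def build_synopsis(root):
    """Group by (color, city) and count the values of model in each group.

    path_e = list/car/selling/details/model
    path_g = <list/car/color, list/car/selling/details/city>
    """
    groups = defaultdict(Counter)
    for car in root.findall("car"):
        color = car.findtext("color")
        for det in car.findall("selling/details"):
            key = (color, det.findtext("city"))
            groups[key][det.findtext("model")] += 1
    return groups

synopsis = build_synopsis(ET.fromstring(CARS_XML.strip()))
for (color, city), hist in sorted(synopsis.items()):
    print(color, city, dict(hist))
```

Each dictionary entry plays the role of one histogram: a (color, city) group with a count per model value.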
[Figure: two bar charts, one for Color: white and one for Color: blue, each showing values for the models Fiat Brava, Fiat Punto and Fiat Marea in the cities Verona, Milan and Rome.]
Tree representation of the synopsis
[Figure: tree of the synopsis. The root list has car children; each car has a color (white, blue, …) and a selling subtree whose details nodes contain a city (Milan, Rome, …) and a model; each model node carries a histogram over Fiat Brava, Fiat Punto and Fiat Marea.]
Example of synopsis
Query example
<total>{
  count(distinct-values(
    for $det in doc("cars.xml")/list/car/selling/details
    where $det/model = "Fiat Brava"
    return $det/city )) } </total>
Transformed query
<total>{
  count(distinct-values(
    for $det in doc("cars_syn.xml")/list/car/selling/details
    where $det/model/hist/bucket/bv = "Fiat Brava"
    return $det/city )) } </total>
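Comparing the two queries, the rewriting amounts to two changes: the document name is switched to the synopsis, and the path that ends at the histogram element (model) is extended with hist/bucket/bv. The Python sketch below mimics just these two changes with string substitution. It is an illustration inferred from this one example, not the QTR-Rules themselves; the function name and its parameters are hypothetical.

```python
import re

# Suffix taken from the transformed query above.
HIST_SUFFIX = "/hist/bucket/bv"

def rewrite_query(query, doc_name, syn_name, hist_steps):
    """Rewrite a query on the original document into one on the synopsis.

    hist_steps lists the element names that carry a histogram in the
    synopsis (an assumption: here only "model").
    """
    out = query.replace(f'doc("{doc_name}")', f'doc("{syn_name}")')
    for step in hist_steps:
        # Match $var/step only when step is the last step of the path.
        pattern = r'(\$\w+/' + re.escape(step) + r')(?![\w/])'
        out = re.sub(pattern, r'\1' + HIST_SUFFIX, out)
    return out

original = '''<total>{
  count(distinct-values(
    for $det in doc("cars.xml")/list/car/selling/details
    where $det/model = "Fiat Brava"
    return $det/city )) } </total>'''

rewritten = rewrite_query(original, "cars.xml", "cars_syn.xml", ["model"])
print(rewritten)
```

Text substitution is enough for this single query, but it is fragile. A real rule set would presumably work on the parsed query and on the schema of the collection.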
Conclusions
Prototype tool
First experiments show a strong reduction of the
space occupied by the data:
500 bytes x n documents
750 bytes
Huge collections of data… (n=100, 1000,
10000…)
Error measure
Ongoing and Future work
Optimal refreshing of the synopsis as the
original collection is updated (needs an XQuery
update language)
Bibliography
P. Buneman, M. Grohe and C. Koch. "Path
queries on compressed XML". In Proc. of
the 29th Int. Conf. on Very Large Data Bases,
2003.
N. Polyzotis, M. Garofalakis and Y. Ioannidis.
"Approximate XML Query Answers". ACM
SIGMOD, 2004.
S. Marrara. "Aggregate queries in XQuery".
PhD Thesis, Politecnico di Milano, 2005.