DISSIMILARITIES AND MATCHING
BETWEEN SYMBOLIC OBJECTS
Prof. Donato Malerba
Department of Informatics,
University of Bari, Italy
[email protected]
ASSO School
Athens, Greece
October 6-8, 2003
COMPUTING DISSIMILARITIES:
Fare clic per modificare
lo
WHY?
stile del titolo dello schema
• Several data analysis techniques are based on
Fare
clic per
modificare(or
glisimilarity)
stili del testo
quantifying
a dissimilarity
measure
between
multivariate data.
dello
schema
•Secondo
Clustering livello
• Discriminant
analysis
Terzo livello
• Visualization-based
• Quarto livello approaches
Quinto livello
• Symbolic –objects
are a kind of multivariate data.
•
Ex.: [colour={red, black}][weight  {60,70,80}][height  []1.50,1.60]
• The dissimilarity measures presented here are among
those investigated in the ASSO Project.
2
A case study
Fare
clic
per
modificare
lo
• Abalone features survey
stile
titolo
dello
schema
• Abalonesdel
are members
of a large
class (Gastropoda)
of molluscs
having one-piece shells.
Fare
clicof per
gli stili
del
testo
• 4177 cases
marinemodificare
crustaceans described
by the
following
attributes:
dello schema
Attribute
Data Type
Unit of
Name
meas.
Sex Terzo Nominal
livello
Length
Continuous
mm
• Quarto
livello mm
Diameter
Continuous
– Continuous
Quinto livellomm
Height
Whole weight
Continuous
grams
Shucked weight Continuous
grams
Viscera weight Continuous
grams
Shell weight
Continuous
grams
Rings
Integer
Secondo livello
Description
M, F. I (infant)
Longest shell measurement
Perpendicular to length
Measured with meat in shell
Weight of the whole abalone
Weight of the meat
Gut weight after bleeding
Weigh of the dried shell
Number of rings
3
Fare
clic
per
modificare
lo
The construction of SO
stile del titolo dello schema
DB2SO: facility of the ASSO system to generate (Boolean or
Probabilistic)
symbolic
objects fromgli
relational
databases.
Fare
clic per
modificare
stili del
testo
Input
:
dello
schema
• Secondo
a set of groupslivello
or classes C1, C2, …, CK
• a set
of n individuals
Terzo
livello k each of which is described by p
variables•YQuarto
1, …, Ylivello
p and is assigned to one or more groups
Output:
– Quinto livello
•a set of K symbolic objects ei described by p variables Y1, …,
Yp
Example: Nine symbolic objects, one for each interval of:
Number of rings
4
TABLE OF BOOLEAN SYMBOLIC
Fare clic perOBJECTS
modificare lo
stile del titolo dello schema
Fare clic per modificare gli stili del testo
dello schema
Secondo livello
Terzo livello
• Quarto livello
– Quinto livello
5
COMPUTATION OF DISSIMILARITIES
FareBETWEEN
clic per SYMBOLIC
modificare
lo
OBJECTS
stile del titolo dello schema
Fare clic per modificare gli stili del testo
Dissimilarity
matrix
dello schema
Secondo livello
SO
Terzo livello
1
• Quarto livello 2
– Quinto livello3
4
…
1
2
3
0.0000
0.2053 0.0000
12.8626 15.0793 0.0000
14.0338 15.0403 8.6463
4
0.0000
6
Fare
clic
per
modificare
lo
The MID property
stile del titolo dello schema
the degree of dissimilarity between crustaceans computed on
Fare
clic per attributes
modificare
del testo
the independent
shouldglibestili
proportional
to the
dissimilarity
in the dependent attribute (i.e., the difference in
dello schema
the number of rings). This property is called monotonic
Secondo
livello(MID).
increasing
dissimilarity
Terzo livello
• Quarto livello
– Quinto livello
7
A b alo ne - U _ 1
A b alo ne - U _ 2
The MID
Fare
clic per modificare lo
property
stile del titolo dello schema
15
2,5
12
2
9
1,5
6
1
3
0,5
0
0
1-3
4-6
7-9
1012
1315
1618
1921
22- 2524 29
1-3
1618
19- 22- 2521 24 29
0,15
0,1
gli stili del testo
1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29
0,05
0
1-3
1315
1618
19- 22- 2521 24 29
1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29
Abalone - SO_4
Abalone - SO_3
0,4
0,3
0,2
0,1
0
1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29
1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29
Abalone - C_1
Abalone - SO_5
1,2
0,9
0,6
0,3
0
1012
0,25
0,2
0,15
0,1
0,05
0
1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29
1,5
1,2
0,9
0,6
0,3
0
4-6 7-9
Abalone - SO_2
Abalone - SO_1
0,4
0,3
0,2
0,1
0
monotonic increasing
dissimilarity (MID).
1315
0,2
The degree of
dissimilarity
Fare
clicbetween
per modificare
crustaceans computed on
dello schema
the independent
attributes
shouldlivello
be
Secondo
proportional
tolivello
the
Terzo
dissimilarity in the
• Quarto livello
dependent attribute (i.e.,
Quinto
the difference– in
the livello
number of rings). This
property is called
1012
A b alo ne - U _ 4
Abalone - U_3
1,5
1,2
0,9
0,6
0,3
0
4-6 7-9
1,2
0,9
0,6
0,3
0
1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29
1-3 4-6 7-9 10- 13- 16- 19- 8
22- 2512 15 18 21 24 29
BOOLEAN SYMBOLIC OBJECTS
Fare clic per modificare
lo
(BSO’S)
stile del titolo dello schema
A BSO is a conjunction of boolean elementary
events: clic per modificare gli stili del testo
Fare
[Ydello
 [Y2=A2]  ...  [Yp=Ap]
1=A1] schema
where
each variable
Secondo
livello Yi takes values in Yi and Ai is
a subset of Yi
Terzo livello
Let a and• Quarto
b be livello
two BSO’s:
a = [Y1=A1]– Quinto
 [Y2=A
livello
2]  ...  [Yp=Ap]
b = [Y1=B1]  [Y2=B2]  ...  [Yp=Bp]
where each variable Yj takes values in Yj and Aj
and Bj are subsets of Yj. We are interested to
compute the dissimilarity d(a,b).
9
Fare clic
per
modificare
lo
CONSTRAINED BSO’S
stile del titolo dello schema
Two types of dependencies between variables:
• Hierarchical
mother-daughter
Fare
clic perdependence
modificare (gli
stili del testo): A
variable
i may be inapplicable if another variable Yj takes
dello Yschema
its values in a subset Sj  Yj. This dependence is
Secondo livello
expressed as a rule:
Terzo livello
if [Yj = Sj] then [Yi = NA]
• Quarto livello
• Logical dependence
: This case occurs, if a subset
– Quinto livello
Sj  Yj of a variable Yj is related to a subset Si  Yi of a
variable Yi by a rule such as:
if [Yj = Sj] then [Yi = Si]
10
DISSIMILARITY AND SIMILARITY
Fare clic perMEASURES
modificare lo
stile del titolo dello schema
Dissimilarity Measure
* = d(a,a)  d(a,b)
Fare
gli stili
del<testo
d: EERclic
suchper
that dmodificare
= d(b,a)
 a,bE
a
dello schemaSimilarity Measure
s: EE
 R such that
s*a = s(a,a)  s(a,b) = s(b,a)  0  a,bE
Secondo
livello
Generally:
Terzo livello
 a  E: d*•a Quarto
= d* and
s*a= s* and specifically, d* = 0 while s*= 1
livello
– Quinto
livello can be transformed into
Dissimilarity
measures
similarity measures (and viceversa):
d=(s)
( s=-1(d) )
where:
(s) strictly decreasing function, and (1) = 0, (0) = 
11
DISSIMILARITY AND SIMILARITY
Fare clic
per
modificare
lo
MEASURES: PROPERTIES
stile del titolo dello schema
Some properties that a dissimilarity measure d on E may
satisfy
Fareare:
clic per modificare gli stili del testo
1. dello
d(a, b) = schema
0   c  E: d(a, c) = d(b, c)
(eveness)
2. d(a,
b) = 0  a =livello
b
Secondo
(definiteness)
3. d(a, b)
 d(a, c) livello
+ d(c, b)
Terzo
(triangle inequality)
• Quartoc),
livello
4. d(a, b)  max(d(a,
d(c, b))
– Quinto livello
(ultrametric inequality )
5. d(a, b) + d(c, d)  max(d(a, c) + d(b, d), d(a, d) +d(b, c)) (Buneman's inequality)
6. Let (E, +) be a group, then d(a, b) = d(a+c, b+c) (translation invariance )
 A dissimilarity function that satisfies proprieties 2 and 3 is called metric.
 A dissimilarity function that satisfies only property 3 is called pseudo
metric or semi- distance.
12
DISSIMILARITY MEASURES
Fare clic BETWEEN
per modificare
lo
BSO’S
stile del titolo dello schema
Author(s) (Year)  Notation from the SODAS Package
Fare clic per modificare gli stili del testo
• Gowda & Diday (1991)  U_1
dello
schema
• Ichino & Yaguchi (1994)  U_2, U_3, U_4
Secondo
livello SO_1, SO_2
• De
Carvalho (1994)
Terzo livello
• De Carvalho
(1996, 1998)  SO_3, SO_4, SO_5, C_1
• Quarto livello
– Quinto livello
U: only for unconstrained BSO’s
C: only for constrained BSO’s
SO: for both constrained and unconstrained BSO’s
13
GOWDA & DIDAY’S DISSIMILARITY
Fare clic perMEASURE
modificare lo
stile del titolo dello schema
Gowda & Diday’s dissimilarity measures for two BSO’s a and b:
p
Fare
clic
per
modificare
gli
stili
del
testo
U_1
D
(
A
,
B
)
D(a, b) = 
j
j
j 1
dello schema
If Yj is a continuous variable:
Secondo
livello
D(Aj, Bj) = D(Aj, Bj) + Ds(Aj, Bj) + Dc(Aj, Bj)
Terzo livello
while if Yj is a nominal variable:
• Quarto livello
D(Aj, Bj) = Ds(A
Dc(Aj, Bj)
j, Bj) +
– Quinto
livello
where the components are defined so that their values are
normalized between 0 and 1:
Aj
Bj
• D(Aj, Bj) due to position,
• Ds(Aj, Bj) due to span,
D
Ds
Dc 14
• Dc(Aj, Bj) due to content
GOWDA & DIDAY’S DISSIMILARITY
Fare clic perMEASURE
modificare lo
stile del titolo dello schema
Properties:
Fare
clic per modificare gli stili del testo
D(a,
b) schema
= 0  a = b (definiteness property),
dello
No proof is reported for the triangle inequality property
Secondo livello
Terzo livello
• Quarto livello
– Quinto livello
15
ICHINO & YAGUCHI’S
FareDISSIMILARITY
clic per modificare
lo
MEASURES
stile del titolo dello schema
Ichino & Yaguchi’s dissimilarity measures are based on the
Cartesian operators join  and meet .
Fare
clic per modificare gli stili del testo
For continuous variables:
Aj  Bj
dello schema
Aj
Bj
Aj  Bj
Secondo
Aj  Bj livello
Terzo
livello
Aj  Bj
while for
nominal
variables:
• Quarto livello
Aj  Bj = Aj  Bj
– Quinto livello
Aj  Bj = Aj  Bj
Given a pair of subsets (Aj, Bj) of Yj the componentwise
dissimilarity(Aj,Bj) is:
(Aj, Bj) =Aj  Bj Aj  Bj+ (2Aj  BjAj Bj)
where 0    0.5 and Ajis defined depending on variable16types.
ICHINO & YAGUCHI’S
FareDISSIMILARITY
clic per modificare
lo
MEASURES
stile del titolo dello schema
(Aj,Bj) are aggregated by an aggregation function such as
the generalised Minkowski’s distance of order q:
Fare
clic per modificare gli stili del testo
U_2
dello schema
d q ( a, b)  q
  ( A j , B j )
p
q
j 1
Secondo
livello on the chosen units of measurements
Drawback
: dependence
Terzo
livello of the componentwise dissimilarity:
Solution
: normalization
U_3
• Quarto livello
p
livello
d q (–a,Quinto
b)  q 
 ( Aj , B j ) q ,

j 1

 ( Aj , B j ) 
 ( Aj , B j )
Yj
The weighted formulation guarantees that dq(a,b)[0,1].
p
U_4
d q (a, b)  q  c j ( A j , B j )q
j 1
The above measures are metrics
17
DE CARVALHO’S DISSIMILARITY
Fare clic per
modificare
lo
MEASURES
stile del titolo dello schema
A straightforward extension of similarity measures for
classical clic
data matrices
with nominal
Fare
per modificare
glivariables.
stili del testo
dello schema
Agreement
Agreement
=(AjBj)
Secondo livello
Disagreement
=(c(Aj)Bj)
Terzo livello
Total
(Bj)
• Quarto livello
– Quinto livello
Disagreement
Total
=(Ajc(Bj))
(Aj)
=(c(Aj)c(Bj)) (c(Aj))
(c(Bj))
(Yi)
where (Vj) is either the cardinality of the set Vj (if Yj is a
nominal variable) or the length of the interval Vj (if Yj is a
continuous variable).
18
DE CARVALHO’S DISSIMILARITY
Fare clic per
modificare
lo
MEASURES
stile
del
titolo
dello
schema
Five different similarity measures s , i = 1, ..., 5,
i
defined:
Fare
clic per modificare
gli stili
del testo
si Comparison
Function Range
Property
s1 /(++)
dello
schema
[0,1]
s2 2/(2++)
[0,1]
Secondo
livello
s3 /(+2+2)
[0,1]
s4 Terzo
[0,1]
½[/(+)+/(+)]
livello
½
S5 /[(+)(+)]
[0,1]
• Quarto livello
are
metric
semi metric
metric
semi metric
semi metric
– Quinto livello
The corresponding dissimilarities are di = 1  si.
The di are aggregated by an aggregation function AF such
as the generalised Minkowski metric, thus obtaining:
SO_1
p
d ai (a, b)  q
 w j di ( A j , B j )
q
j 1
1 i  5
19
DE CARVALHO’S EXTENSION OF
ICHINO
YAGUCHI’S
DISSIMILARITY
Fare
clic&per
modificare
lo
MEASURE
stile del titolo dello schema
A different componentwise dissimilarity measure:
Fare clic per modificaregli
A j , stili
B j  del testo
 A j , B j  
dello schema
 A j  B j 
Secondo
livello
where
 is defined
as in Ichino & Yaguchi’s dissimilarity
Terzo livello
measure.
• Quartofunction
livello
The aggregation
AF suggested by De Carvalho is:
– Quinto livello
SO_2
q
p 1

d q (a, b)  q   ( A j , B j )

j 1 p
This measure is a metric.
20
THE DESCRIPTION-POTENTIAL
Fare clic per
modificare
lo
APPROACH
stile
del
titolo
dello
schema
All dissimilarity measures considered so far are defined
by two
functions: a comparison function (componentwise measure) and
Fare
clic per
modificare
gli stili del testo
an aggregation
function
.
A dello
differentschema
approach is based on the concept of description
potential (a) of a symbolic object a.
Secondo livello
Terzo livello
p
( a )    ( A j )
• Quarto livello
j 1
– Quinto livello
where (Vj) is either the cardinality of the set Vj (if Yj is a nominal
variable) or the length of the interval Vj (if Yj is a continuous
variable).
21
THE DESCRIPTION-POTENTIAL
Fare clic perAPPROACH
modificare lo
stile del titolo dello schema
SO_3
d1 (a, b)  (a  b)  (a  b)  [2(a  b)  (a)  (b)]
Fare clic per modificare gli stili del testo
dello schema
(a  b)  (a  b)  [2(a  b)  (a)  (b)]
SO_4

d 2 ( a, b) 
Secondo livello
(a E )
Terzo livello
• Quarto 
livello
(a  b)  (a  b)  [2(a  b)  (a)  (b)]
SO_5 d 2 (a, –b)Quinto

livello
(a  b)
The triangular inequality does not hold for SO_3 and SO_4,
which are equivalent. SO_5 is a metric.
22
DESCRIPTION POTENTIAL FOR
Fare clic
per
modificare
lo
CONSTRAINED BSO’S
stile del titolo dello schema
Given a BSO a and a logical dependence expressed by the
rule:
Fare clic per modificare gli stili del testo
if [Yj = Sj] then [Yi = Si]
dello schema
the incoherent restriction a’ of a is defined as:
Secondo livello
a’= [Y1=A1]  ...  [Yj-1=Aj-1]  [Yj=Aj Sj]  ...  [Yi-1=Ai-1]
Terzo livello
 [Yi=Ai (Yi\Si)]  ...  [Yp=Ap]
• Quarto livello
Then the description
of a is:
– Quinto potential
livello
p
(a )   ( A j )  (a)
j 1
A similar extension exists for hierarchical dependencies.
23
DISSIMILARITY MEASURES FOR
Fare clic
per
modificare
lo
CONSTRAINED BSO’S
stile
del
titolo
dello
schema
•The extended definition of description potential can be
applied to the computation of the distances SO_3, SO_4
Fare
clic per modificare gli stili del testo
and SO_5.
•De
Carvalho
proposed an extension of ’, so that SO_2
dello
schema
can also be applied to constrained BSO.
livello
•HeSecondo
also proposed
an extension of , , , and  in order
to takeTerzo
into account
livello of constraints. Therefore, SO_1 can
also be applied
tolivello
constrained BSO.
• Quarto
p
q
Finally, C_1 is– defined
as
follows:
Quinto livello
 di A j , B j 
where:
j 1
0
if
Y

NA


d
(
a
,
b
)


q
j
q
p
( j )  
 ( j )
1 otherwise
j 1
If all BSO’s are coherent, then the dissimilarity measures
24
do not change.
Fare clic per
modificare
lo
MATCHING
stile del titolo dello schema
• Matching is the process of comparing two or more
Fare
per modificare
gli stili
del testo
structuresclic
to discover
their similarities
or differences.
dello schema
• Similarity
judgements in the matching process
are
directional: They have a
Secondo livello
• referent, a, a prototype or the description of a class of
objectsTerzo livello
Quarto
livello of the prototype or an instance of a
• subject,• b
, a variant
– Quinto livello
class of objects.
• Matching two structures is a common problem to many
domains, like symbolic classification, pattern recognition,
data mining and expert systems.
31
Fare clicMATCHING
per modificare
lo
BSO’S
stile del titolo dello schema
•Generally, a BSO represents a class description and plays
the
role ofclic
the per
referent
in the matching
process.
Fare
modificare
gli stili
del testo
a:
[color = {black, white}]  [height =[170, 200]]
dello schema
describes a set of individuals either black or white, whose
Secondo
livello [170,200]. Such a set of individuals
height
is in the interval
is called
extension
of the BSO. The extension is a subset of
Terzo
livello
the universe
 of individuals
.
• Quarto
livello
– Quinto
livello
Given two BSO’s
a and
b, the matching operators define
whether b is the description of an individual in the extension
of a.
• In the ASSO software two matching operators for BSO’s
have been defined.
32
Fare
clic
per
modificare
lo
CANONICAL MATCHING OPERATOR
stile del titolo dello schema
• The result of the canonical matching operator is either 0
(false) or 1 (true).
Fare
clic
per
modificare
gli
stili
del
testo
• If E denotes the space of BSO’s described by a set of p
dello schema
variables
Yi taking values in the corresponding domains Yi,
thenSecondo
the matching
operator is a function:
livello
Match: E × E  {0, 1}
Terzo livello
such that for any two BSO’s a, b  E:
• Quarto livello
a –=Quinto
[Y1=Alivello
1]  [Y2=A2] 
...  [Yp=Ap]
b = [Y1=B1]  [Y2=B2]  ...  [Yp=Bp]
it happens that:
• Match(a,b) = 1
• Match(a,b) = 0
if BiAi for each i=1, 2, , p,
otherwise.
33
Fare
clic
per
modificare
lo
CANONICAL MATCHING OPERATOR
stile del titolo dello schema
Examples:
Fare clic per modificare gli stili del testo
District1
[profession={farmer, driver}]  [age=[24,34]]
dello =schema
Indiv1 = [profession=farmer]  [age=28]
Secondo livello
Indiv2 = [profession=salesman]  [age=[27,28]]
Terzo livello
• Quarto livello
Match(District1,Indiv1)
– Quinto livello= 1
Match(District1,Indiv2) = 0
34
Fare
clic
per
modificare
lo
CANONICAL MATCHING OPERATOR
stile del titolo dello schema
• The canonical matching function satisfies two out of three
Fare
modificare
propertiesclic
of aper
similarity
measure: gli stili del testo
dello
•  a,schema
b  E: Match(a, b)  0
•  a, b  E:livello
Match(a, a)  Match(a, b)
Secondo
while itTerzo
does not
satisfy the commutativity or simmetry
livello
property: • Quarto livello
– Quinto
livello
•  a, b 
E: Match(a,
b) = Match(b, a)
because of the different role played by a and b.
35
Fare
clic
per
modificare
lo
FLEXIBLE MATCHING OPERATOR
stile del titolo dello schema
• The requirement BiAi for each i=1, 2, , p, might be too
strict for real-world problems, because of the presence of
Fare
clicdescription
per modificare
gli stiliof del
testo
noise in the
of the individuals
the universe.
dello schema
• Example:
District1
= [profession={farmer,
driver}]  [age=[24,34]]
Secondo
livello
Indiv3livello
= [profession=farmer]  [age=23]
Terzo
=0
• QuartoMatch(District1,Indiv3)
livello
• It is necessary
to rely
on a flexible definition of matching
– Quinto
livello
operator, which returns a number in [0,1] corresponding to
the degree of match between two BSO’s, that is
flexible-matching: E × E  [0,1]
36
Fare
clic
per
modificare
lo
FLEXIBLE MATCHING OPERATOR
stile del titolo dello schema
For any two BSO’s a and b,
i) flexible-matching(a,b)=1 if Match(a,b)=true,
Fare
clic per modificare
gli
stili del testo
ii) flexible-matching(a,b)
[0,1)
otherwise.
dello
The
resultschema
of the flexible matching can be interpreted as the
probability of a matching b provided that a change is made in b.
Secondo livello
Let Ea = {b' E | Match(a,b')=1} and P(b | b') be the
Terzo
livello of observing b given that the original
conditional
probability
Quarto
observation• was
b'.livello
Then
– Quinto livello
flexible - matching( a, b) =
max P(b | b' )
def b' Ea
that is flexible-matching(a,b) equals the maximum conditional
37
probability over the space of BSO’s canonically matched by a.
FLEXIBLE MATCHING: AN
Fare clic per
modificare
lo
APPLICATION
stile del titolo dello schema
• Credit card applications (Quinlan)
Fare
clic
per
modificare
gli
stili
del
testo
• Fifteen variables whose names and values have
dello
schema to meaningless symbols to
been changed
protect
the confidentiality
of the data.
Secondo
livello
+
Terzo livello
• Quarto livello
• class variable:
positive in case of approval of
– Quinto negative
livello
credit facilities,
otherwise.
• Training set: 490 cases
• 6 rules generated by Quinlan’s system C4.5
38
FLEXIBLE MATCHING: AN
Fare clic per
modificare
lo
APPLICATION
stile del titolo dello schema
Rule
Class
Conditions
41
[Y3 > 1.54]  [ Y9 = f ]  [ Y4  {u, y}] 
Fare
clic
per
modificare gli stili del testo
 [Y6{c,d, cc, i, j, k, m, r, q, w, e, aa, ff}]
43
[ Y4  {u, y}]  [ Y8 <= 1.71 ] [ Y9 = f ]
dello
schema
[ Y3 <= 0.835]  [ Y6  {c,d,i,k,m,q,w,e,aa }] 
Secondo livello
[ Y7  {v,bb}]  [Y14 > 102]  [Y15 <= 500]
30 Terzo
+ livello
[ Y9 = t ]
34
[Y3livello
<= 0.125 ]  [Y14 > 221 ]
•+Quarto
46
+ – Quinto
[Y4  {l}
]
livello
6
-
• Such rules can be easily represented by means
of Boolean symbolic objects.
• Both matching operators can be considered in
order to test the validity of the induced rules.39
Fare
clic
per modificare
lo
A new
dissimilarity
measure
stile del titolo dello schema
Flexible
is asymmetric.
Fare clic matching
per modificare
gli stili delHowever
testo
it
is possible
dello
schemato “symmetrize” it  New
dissimilarity
measure SO_6
Secondo livello
livello as
It isTerzo
computed
d(a,b) =•
Quarto livello
– Quinto livello
= 1-(flexible_matching(a,b)+flexible_matching(b,a))/2
40
Dissimilarity measure Parameters
U_1 (Gowda & Diday)
U_2 (Ichino & Yaguchi)
Constraints
Default
none
Gamma
[0 .. 0.5]
0.5
Order of power
1 .. 10
2
U_3 (Normalized Ichino Gamma
[0 .. 0.5]
0.5
& Yaguchi)
Order of power
1 .. 10
2
U_4 (Weighted
Gamma
[0 .. 0.5]
0.5
Normalized Ichino &
Order of power
1 .. 10
2
Yaguchi)
List of weights, one per var. Sum(weights) = 1.0 Equal weights
C_1 (Normalized De
Comparison function
D1, D2, D3, D4, D5
D1
Carvalho)
Order of power
1 .. 10
2
SO_1 (De Carvalho)
Comparison function
D1, D2, D3, D4, D5
D1
Order of power
1 .. 10
2
List of weights, one per var. Sum(weights) = 1.0 Equal weights
SO_2 (De Carvalho)
Gamma
[0 .. 0.5]
0.5
Order of power
1 .. 10
2
SO_3 (De Carvalho)
Gamma
[0 .. 0.5]
0.5
Order of power
1 .. 10
2
SO_4 (Normalized De
Gamma
[0 .. 0.5]
0.5
Carvalho)
Order of power
1 .. 10
2
SO_5 (Normalized De
Gamma
[0 .. 0.5]
0.5
Carvalho)
Order of power
1 .. 10
2
SO_6 (Symmetrized
none
Flexible Matching)
PROBABILISTIC SYMBOLIC OBJECT
Fare clic per (PSO’S)
modificare lo
stile del titolo dello schema
Probabilistic symbolic objects (PSO’s) involve modal
Fare
clic per
modificare gli stili del testo
(probabilistic)
variables.
dello schema
Each cell represents the set of weighted values that the
Secondo
livellofor a symbolic object, where a
variable
can take
Terzoweighting
livello system is adopted.
probabilistic
• Quarto livello
Quinto itlivello
In case of –PSO,
isn’t possible to use dissimilarity
measures for BSO because they don’t take the
probabilities into consideration and so this determines a
notable information loss.
Therefore, new dissimilarity measures for PSO are
42
needed.
Defining dissimilarity measures
forclic
probabilistic
symbolic
Fare
per modificare
lo
objects
stile del titolo
dello schema
Steps:
Fare clic per modificare gli stili del testo
1. dello
Define
coefficients measuring the divergence between
schema
two probability distributions
Secondo livello





Kullback-Leibler
divergence
Terzo livello
Chi-square
divergence
• Quarto livello
Hellinger
– Quinto livello
similarity coefficient (*)
K-divergence
Variation distance
non-symmetric
coefficients
symmetric coefficient
(*) from them two dissimilarity measures, namely the Renyi’s and Chernoff’s
coefficients, are obtained
43
Defining dissimilarity measures
probabilistic
symbolic
Farefor
clic
per modificare
lo
objects
stile del titolo
dello schema
Steps:
Fare clic per modificare gli stili del testo
2. dello
Symmetrize
the
schema
coefficients
Secondo
livello
m(P,Q)=
Terzo livello
non
symmetric
m(Q,P) + m(P,Q)
• Quarto livello
3. Aggregate
the contribution of all
– Quinto livello
variables
to compute the dissimilarity
between two symbolic objects
 PSO Dissimilarity measures
44
Fare
clic per
Mixture
SO modificare lo
stile del titolo dello schema
Some SO’s can be described by both nonFare
clic
per
modificare
gli
stili
del
testo
modal and modal variables
dello schema
They are neither BSO’s nor PSO’s
Secondo
livello measure, then?
What
dissimilarity
InTerzo
ASSO livello
it has been proposed to combine
• Quarto livello
the result
of two dissimilarity measure, one
– Quinto livello
for modal
and the other for non-modal.
Combination can be either additive or
multiplicative.
 This possibility should be taken with great
care!!!
45
Fare clic per
modificare
lo
REFERENCES
stile
del
titolo
dello
•
Esposito
F., Malerba
D., V. Tamma,
H.-H.schema
Bock. Classical
resemblance measures. Chapter 8.1
•Fare
Esposito
F., Malerba
D., V. Tamma. Dissimilarity
measures
for
clic
per
modificare
gli
stili
del
testo
symbolic objects. Chapter 8.3
schema
• dello
Esposito
F., Malerba D., F.A. Lisi. Matching symbolic objects.
Chapter 8.4
Secondo
livello
in H.-H.
Bock, E. Diday
(eds.): Analysis of Symbolic Data. Exploratory
methods
for extracting
Terzo
livello statistical information from complex data.
•
•
Springer Verlag, Heidelberg, 2000.
• Quarto livello
D. Malerba, L. Sanarico, & V. Tamma (2000). A comparison of
Quinto livello
dissimilarity–measures
for Boolean symbolic data. In P. Brito, J.
Costa, & D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on
“Dealing with Structured Data in Machine Learning and Statistics”,
Barcelona.
D. Malerba, F. Esposito, V. Gioviale, & V. Tamma. Comparing
Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings
of EKT-NTTS, vol. 1, pp. 473-481.
46
Fare clic per
modificare
lo
REFERENCES
stile
del titolo dello schema
•
D. Malerba, F. Esposito, M. Monopoli (2002). Estrazione e matching
di oggetti simbolici da database relazionali. Atti del Decimo
Convegno
su Sistemi Evoluti
Basi di
Datitesto
SEBD’2002,
Fare
clicNazionale
per modificare
gliperstili
del
265-272.
schema
• dello
D. Malerba,
F. Esposito, & M. Monopoli (2002). Comparing
dissimilarity measures
Secondo
livellofor probabilistic symbolic objects. In A.
•
•
Zanasi, C. A. Brebbia, N.F.F. Ebecken, P. Melli (Eds.) Data Mining
livello Information Systems, Vol 6, 31-40, WIT
III, Terzo
Series Management
Press, Southampton,
UK.
• Quarto livello
E. Diday, F. –Esposito
Quinto (2003).
livello An Introduction to Symbolic Data
Analysis and the Sodas Software, Intelligent Data Analysis, 7, 6,
(in press).
Other project reports
47
Fare clic METHOD
per modificare
lo
DISS
stile del titolo dello schema
• Dissimilarity measures
between
and gli stili del testo
Fare
clicboth
per BSO’s
modificare
PSO’s.
dello schema
• Input: Asso file of SO’s
Secondo livello
• Output
for dissimilarities:
Terzo livello
Report•+Quarto
Assolivello
file with
dissimilarity
matrix
– Quinto
livello
DI
• Developer: Dipartimento
method
di Informatica, University
of Bari, Italy.
Report
48
file
TWO
DIAGRAMS lo
FareUSE
clicCASE
per modificare
stile del titolo dello schema
Create an ASSO chaining
with the DISS method
Set up testo
parameters of
Fare clic Create
peran ASSO
modificare gli stili del
the DISS method
chaining with the DISS
dello schema
method
Secondo livello
up parameters
TerzoSet
livello
the DISS method
User
of
• Quarto livello
– Quinto livello
User
Run the DISS method and generate
a new ASSO file with a
dissimilarity matrix
Create a new chaining with the
new ASSO file
Run the DISS method
and generate a report
file
View report file
Run VDISS and visualize
the dissimilarity measure,
the bi-dimensional
mapping & the graphical
49
representation
Fare clic
per
modificare
lo
PARAMETER SETUP
stile del titolo dello schema
• The user can select a subset of variables Yi on which the
dissimilarity
measure
or the matching
operator
Fare
clic per
modificare
gli stili del
testohas to
computed .
dello schema
Secondo livello
Terzo livello
• Quarto livello
– Quinto livello
50
Fare clic
per
modificare
lo
PARAMETER SETUP
stile
del
dello
schema
• The user
cantitolo
select a number
of parameters.
Dissimilarity measure
Fare
clic per modificare gli stili del testo
dello schema
combine
Secondo livello
Terzo livello
Name of
• Quarto livello
the new
– Quinto livello
ASSO
file
?
?
51
OUTPUT
SODAS FILE
Fare clic
per modificare
lo
stile
del titolo
• The output
ASSO file dello
containsschema
both the same
input data and an additional dissimilarity matrix.
Fare
clic
per
modificare
gli
stili
del
testo
The dissimilarity between the i-th and the j-th
dello
BSO isschema
written in the cell (entry) (i, j) of the
Secondo livello
matrix.
livellopart of the dissimilarity matrix is
• OnlyTerzo
the lower
• Quarto livello
reported
in the file, since dissimilarities are
– Quinto livello
symmetric.
abalone output file
52
OUTPUT REPORT FILE
The report file is organized as follows:
Output report file
Fare clic per
modificare
lo
Output
stile del titolo dello schema
Visualization of the dissimilarity table
Fare clic per modificare gli stili del testo
dello schema
Secondo livello
Terzo livello
• Quarto livello
– Quinto livello
54
Fare clic per
modificare
lo
Output
stile
del titolo
schema
Visualization
of dello
a line
graph
dissimilarities
Fare clic per modificare gli stili del testo
dello schema
of
Secondo livello
Terzo livello
• Quarto livello
– Quinto livello
The number of
lines in each
graph is equal to
the number of
SOs minus one
Each
line
represents
the
dissimilarity
between a given
SO
and
the
subsequent SOs
in the file
55
Fare clic per
modificare
lo
Output
stile del titolo dello schema
Visualization of a scatterplot of Sammon’s
Fare
clicmapping
per modificare
gli stili delspace
testo
nonlinear
into a bidimensional
dello schema
Secondo livello
Terzo livello
• Quarto livello
– Quinto livello
56
Scarica

B j - Dipartimento di Informatica