DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS Prof. Donato Malerba Department of Informatics, University of Bari, Italy [email protected] ASSO School Athens, Greece October 6-8, 2003 COMPUTING DISSIMILARITIES: Fare clic per modificare lo WHY? stile del titolo dello schema • Several data analysis techniques are based on Fare clic per modificare(or glisimilarity) stili del testo quantifying a dissimilarity measure between multivariate data. dello schema •Secondo Clustering livello • Discriminant analysis Terzo livello • Visualization-based • Quarto livello approaches Quinto livello • Symbolic –objects are a kind of multivariate data. • Ex.: [colour={red, black}][weight {60,70,80}][height []1.50,1.60] • The dissimilarity measures presented here are among those investigated in the ASSO Project. 2 A case study Fare clic per modificare lo • Abalone features survey stile titolo dello schema • Abalonesdel are members of a large class (Gastropoda) of molluscs having one-piece shells. Fare clicof per gli stili del testo • 4177 cases marinemodificare crustaceans described by the following attributes: dello schema Attribute Data Type Unit of Name meas. Sex Terzo Nominal livello Length Continuous mm • Quarto livello mm Diameter Continuous – Continuous Quinto livellomm Height Whole weight Continuous grams Shucked weight Continuous grams Viscera weight Continuous grams Shell weight Continuous grams Rings Integer Secondo livello Description M, F. I (infant) Longest shell measurement Perpendicular to length Measured with meat in shell Weight of the whole abalone Weight of the meat Gut weight after bleeding Weigh of the dried shell Number of rings 3 Fare clic per modificare lo The construction of SO stile del titolo dello schema DB2SO: facility of the ASSO system to generate (Boolean or Probabilistic) symbolic objects fromgli relational databases. Fare clic per modificare stili del testo Input : dello schema • Secondo a set of groupslivello or classes C1, C2, …, CK • a set of n individuals Terzo livello k each of which is described by p variables•YQuarto 1, …, Ylivello p and is assigned to one or more groups Output: – Quinto livello •a set of K symbolic objects ei described by p variables Y1, …, Yp Example: Nine symbolic objects, one for each interval of: Number of rings 4 TABLE OF BOOLEAN SYMBOLIC Fare clic perOBJECTS modificare lo stile del titolo dello schema Fare clic per modificare gli stili del testo dello schema Secondo livello Terzo livello • Quarto livello – Quinto livello 5 COMPUTATION OF DISSIMILARITIES FareBETWEEN clic per SYMBOLIC modificare lo OBJECTS stile del titolo dello schema Fare clic per modificare gli stili del testo Dissimilarity matrix dello schema Secondo livello SO Terzo livello 1 • Quarto livello 2 – Quinto livello3 4 … 1 2 3 0.0000 0.2053 0.0000 12.8626 15.0793 0.0000 14.0338 15.0403 8.6463 4 0.0000 6 Fare clic per modificare lo The MID property stile del titolo dello schema the degree of dissimilarity between crustaceans computed on Fare clic per attributes modificare del testo the independent shouldglibestili proportional to the dissimilarity in the dependent attribute (i.e., the difference in dello schema the number of rings). This property is called monotonic Secondo livello(MID). increasing dissimilarity Terzo livello • Quarto livello – Quinto livello 7 A b alo ne - U _ 1 A b alo ne - U _ 2 The MID Fare clic per modificare lo property stile del titolo dello schema 15 2,5 12 2 9 1,5 6 1 3 0,5 0 0 1-3 4-6 7-9 1012 1315 1618 1921 22- 2524 29 1-3 1618 19- 22- 2521 24 29 0,15 0,1 gli stili del testo 1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29 0,05 0 1-3 1315 1618 19- 22- 2521 24 29 1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29 Abalone - SO_4 Abalone - SO_3 0,4 0,3 0,2 0,1 0 1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29 1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29 Abalone - C_1 Abalone - SO_5 1,2 0,9 0,6 0,3 0 1012 0,25 0,2 0,15 0,1 0,05 0 1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29 1,5 1,2 0,9 0,6 0,3 0 4-6 7-9 Abalone - SO_2 Abalone - SO_1 0,4 0,3 0,2 0,1 0 monotonic increasing dissimilarity (MID). 1315 0,2 The degree of dissimilarity Fare clicbetween per modificare crustaceans computed on dello schema the independent attributes shouldlivello be Secondo proportional tolivello the Terzo dissimilarity in the • Quarto livello dependent attribute (i.e., Quinto the difference– in the livello number of rings). This property is called 1012 A b alo ne - U _ 4 Abalone - U_3 1,5 1,2 0,9 0,6 0,3 0 4-6 7-9 1,2 0,9 0,6 0,3 0 1-3 4-6 7-9 10- 13- 16- 19- 22- 2512 15 18 21 24 29 1-3 4-6 7-9 10- 13- 16- 19- 8 22- 2512 15 18 21 24 29 BOOLEAN SYMBOLIC OBJECTS Fare clic per modificare lo (BSO’S) stile del titolo dello schema A BSO is a conjunction of boolean elementary events: clic per modificare gli stili del testo Fare [Ydello [Y2=A2] ... [Yp=Ap] 1=A1] schema where each variable Secondo livello Yi takes values in Yi and Ai is a subset of Yi Terzo livello Let a and• Quarto b be livello two BSO’s: a = [Y1=A1]– Quinto [Y2=A livello 2] ... [Yp=Ap] b = [Y1=B1] [Y2=B2] ... [Yp=Bp] where each variable Yj takes values in Yj and Aj and Bj are subsets of Yj. We are interested to compute the dissimilarity d(a,b). 9 Fare clic per modificare lo CONSTRAINED BSO’S stile del titolo dello schema Two types of dependencies between variables: • Hierarchical mother-daughter Fare clic perdependence modificare (gli stili del testo): A variable i may be inapplicable if another variable Yj takes dello Yschema its values in a subset Sj Yj. This dependence is Secondo livello expressed as a rule: Terzo livello if [Yj = Sj] then [Yi = NA] • Quarto livello • Logical dependence : This case occurs, if a subset – Quinto livello Sj Yj of a variable Yj is related to a subset Si Yi of a variable Yi by a rule such as: if [Yj = Sj] then [Yi = Si] 10 DISSIMILARITY AND SIMILARITY Fare clic perMEASURES modificare lo stile del titolo dello schema Dissimilarity Measure * = d(a,a) d(a,b) Fare gli stili del<testo d: EERclic suchper that dmodificare = d(b,a) a,bE a dello schemaSimilarity Measure s: EE R such that s*a = s(a,a) s(a,b) = s(b,a) 0 a,bE Secondo livello Generally: Terzo livello a E: d*•a Quarto = d* and s*a= s* and specifically, d* = 0 while s*= 1 livello – Quinto livello can be transformed into Dissimilarity measures similarity measures (and viceversa): d=(s) ( s=-1(d) ) where: (s) strictly decreasing function, and (1) = 0, (0) = 11 DISSIMILARITY AND SIMILARITY Fare clic per modificare lo MEASURES: PROPERTIES stile del titolo dello schema Some properties that a dissimilarity measure d on E may satisfy Fareare: clic per modificare gli stili del testo 1. dello d(a, b) = schema 0 c E: d(a, c) = d(b, c) (eveness) 2. d(a, b) = 0 a =livello b Secondo (definiteness) 3. d(a, b) d(a, c) livello + d(c, b) Terzo (triangle inequality) • Quartoc), livello 4. d(a, b) max(d(a, d(c, b)) – Quinto livello (ultrametric inequality ) 5. d(a, b) + d(c, d) max(d(a, c) + d(b, d), d(a, d) +d(b, c)) (Buneman's inequality) 6. Let (E, +) be a group, then d(a, b) = d(a+c, b+c) (translation invariance ) A dissimilarity function that satisfies proprieties 2 and 3 is called metric. A dissimilarity function that satisfies only property 3 is called pseudo metric or semi- distance. 12 DISSIMILARITY MEASURES Fare clic BETWEEN per modificare lo BSO’S stile del titolo dello schema Author(s) (Year) Notation from the SODAS Package Fare clic per modificare gli stili del testo • Gowda & Diday (1991) U_1 dello schema • Ichino & Yaguchi (1994) U_2, U_3, U_4 Secondo livello SO_1, SO_2 • De Carvalho (1994) Terzo livello • De Carvalho (1996, 1998) SO_3, SO_4, SO_5, C_1 • Quarto livello – Quinto livello U: only for unconstrained BSO’s C: only for constrained BSO’s SO: for both constrained and unconstrained BSO’s 13 GOWDA & DIDAY’S DISSIMILARITY Fare clic perMEASURE modificare lo stile del titolo dello schema Gowda & Diday’s dissimilarity measures for two BSO’s a and b: p Fare clic per modificare gli stili del testo U_1 D ( A , B ) D(a, b) = j j j 1 dello schema If Yj is a continuous variable: Secondo livello D(Aj, Bj) = D(Aj, Bj) + Ds(Aj, Bj) + Dc(Aj, Bj) Terzo livello while if Yj is a nominal variable: • Quarto livello D(Aj, Bj) = Ds(A Dc(Aj, Bj) j, Bj) + – Quinto livello where the components are defined so that their values are normalized between 0 and 1: Aj Bj • D(Aj, Bj) due to position, • Ds(Aj, Bj) due to span, D Ds Dc 14 • Dc(Aj, Bj) due to content GOWDA & DIDAY’S DISSIMILARITY Fare clic perMEASURE modificare lo stile del titolo dello schema Properties: Fare clic per modificare gli stili del testo D(a, b) schema = 0 a = b (definiteness property), dello No proof is reported for the triangle inequality property Secondo livello Terzo livello • Quarto livello – Quinto livello 15 ICHINO & YAGUCHI’S FareDISSIMILARITY clic per modificare lo MEASURES stile del titolo dello schema Ichino & Yaguchi’s dissimilarity measures are based on the Cartesian operators join and meet . Fare clic per modificare gli stili del testo For continuous variables: Aj Bj dello schema Aj Bj Aj Bj Secondo Aj Bj livello Terzo livello Aj Bj while for nominal variables: • Quarto livello Aj Bj = Aj Bj – Quinto livello Aj Bj = Aj Bj Given a pair of subsets (Aj, Bj) of Yj the componentwise dissimilarity(Aj,Bj) is: (Aj, Bj) =Aj Bj Aj Bj+ (2Aj BjAj Bj) where 0 0.5 and Ajis defined depending on variable16types. ICHINO & YAGUCHI’S FareDISSIMILARITY clic per modificare lo MEASURES stile del titolo dello schema (Aj,Bj) are aggregated by an aggregation function such as the generalised Minkowski’s distance of order q: Fare clic per modificare gli stili del testo U_2 dello schema d q ( a, b) q ( A j , B j ) p q j 1 Secondo livello on the chosen units of measurements Drawback : dependence Terzo livello of the componentwise dissimilarity: Solution : normalization U_3 • Quarto livello p livello d q (–a,Quinto b) q ( Aj , B j ) q , j 1 ( Aj , B j ) ( Aj , B j ) Yj The weighted formulation guarantees that dq(a,b)[0,1]. p U_4 d q (a, b) q c j ( A j , B j )q j 1 The above measures are metrics 17 DE CARVALHO’S DISSIMILARITY Fare clic per modificare lo MEASURES stile del titolo dello schema A straightforward extension of similarity measures for classical clic data matrices with nominal Fare per modificare glivariables. stili del testo dello schema Agreement Agreement =(AjBj) Secondo livello Disagreement =(c(Aj)Bj) Terzo livello Total (Bj) • Quarto livello – Quinto livello Disagreement Total =(Ajc(Bj)) (Aj) =(c(Aj)c(Bj)) (c(Aj)) (c(Bj)) (Yi) where (Vj) is either the cardinality of the set Vj (if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable). 18 DE CARVALHO’S DISSIMILARITY Fare clic per modificare lo MEASURES stile del titolo dello schema Five different similarity measures s , i = 1, ..., 5, i defined: Fare clic per modificare gli stili del testo si Comparison Function Range Property s1 /(++) dello schema [0,1] s2 2/(2++) [0,1] Secondo livello s3 /(+2+2) [0,1] s4 Terzo [0,1] ½[/(+)+/(+)] livello ½ S5 /[(+)(+)] [0,1] • Quarto livello are metric semi metric metric semi metric semi metric – Quinto livello The corresponding dissimilarities are di = 1 si. The di are aggregated by an aggregation function AF such as the generalised Minkowski metric, thus obtaining: SO_1 p d ai (a, b) q w j di ( A j , B j ) q j 1 1 i 5 19 DE CARVALHO’S EXTENSION OF ICHINO YAGUCHI’S DISSIMILARITY Fare clic&per modificare lo MEASURE stile del titolo dello schema A different componentwise dissimilarity measure: Fare clic per modificaregli A j , stili B j del testo A j , B j dello schema A j B j Secondo livello where is defined as in Ichino & Yaguchi’s dissimilarity Terzo livello measure. • Quartofunction livello The aggregation AF suggested by De Carvalho is: – Quinto livello SO_2 q p 1 d q (a, b) q ( A j , B j ) j 1 p This measure is a metric. 20 THE DESCRIPTION-POTENTIAL Fare clic per modificare lo APPROACH stile del titolo dello schema All dissimilarity measures considered so far are defined by two functions: a comparison function (componentwise measure) and Fare clic per modificare gli stili del testo an aggregation function . A dello differentschema approach is based on the concept of description potential (a) of a symbolic object a. Secondo livello Terzo livello p ( a ) ( A j ) • Quarto livello j 1 – Quinto livello where (Vj) is either the cardinality of the set Vj (if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable). 21 THE DESCRIPTION-POTENTIAL Fare clic perAPPROACH modificare lo stile del titolo dello schema SO_3 d1 (a, b) (a b) (a b) [2(a b) (a) (b)] Fare clic per modificare gli stili del testo dello schema (a b) (a b) [2(a b) (a) (b)] SO_4 d 2 ( a, b) Secondo livello (a E ) Terzo livello • Quarto livello (a b) (a b) [2(a b) (a) (b)] SO_5 d 2 (a, –b)Quinto livello (a b) The triangular inequality does not hold for SO_3 and SO_4, which are equivalent. SO_5 is a metric. 22 DESCRIPTION POTENTIAL FOR Fare clic per modificare lo CONSTRAINED BSO’S stile del titolo dello schema Given a BSO a and a logical dependence expressed by the rule: Fare clic per modificare gli stili del testo if [Yj = Sj] then [Yi = Si] dello schema the incoherent restriction a’ of a is defined as: Secondo livello a’= [Y1=A1] ... [Yj-1=Aj-1] [Yj=Aj Sj] ... [Yi-1=Ai-1] Terzo livello [Yi=Ai (Yi\Si)] ... [Yp=Ap] • Quarto livello Then the description of a is: – Quinto potential livello p (a ) ( A j ) (a) j 1 A similar extension exists for hierarchical dependencies. 23 DISSIMILARITY MEASURES FOR Fare clic per modificare lo CONSTRAINED BSO’S stile del titolo dello schema •The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 Fare clic per modificare gli stili del testo and SO_5. •De Carvalho proposed an extension of ’, so that SO_2 dello schema can also be applied to constrained BSO. livello •HeSecondo also proposed an extension of , , , and in order to takeTerzo into account livello of constraints. Therefore, SO_1 can also be applied tolivello constrained BSO. • Quarto p q Finally, C_1 is– defined as follows: Quinto livello di A j , B j where: j 1 0 if Y NA d ( a , b ) q j q p ( j ) ( j ) 1 otherwise j 1 If all BSO’s are coherent, then the dissimilarity measures 24 do not change. Fare clic per modificare lo MATCHING stile del titolo dello schema • Matching is the process of comparing two or more Fare per modificare gli stili del testo structuresclic to discover their similarities or differences. dello schema • Similarity judgements in the matching process are directional: They have a Secondo livello • referent, a, a prototype or the description of a class of objectsTerzo livello Quarto livello of the prototype or an instance of a • subject,• b , a variant – Quinto livello class of objects. • Matching two structures is a common problem to many domains, like symbolic classification, pattern recognition, data mining and expert systems. 31 Fare clicMATCHING per modificare lo BSO’S stile del titolo dello schema •Generally, a BSO represents a class description and plays the role ofclic the per referent in the matching process. Fare modificare gli stili del testo a: [color = {black, white}] [height =[170, 200]] dello schema describes a set of individuals either black or white, whose Secondo livello [170,200]. Such a set of individuals height is in the interval is called extension of the BSO. The extension is a subset of Terzo livello the universe of individuals . • Quarto livello – Quinto livello Given two BSO’s a and b, the matching operators define whether b is the description of an individual in the extension of a. • In the ASSO software two matching operators for BSO’s have been defined. 32 Fare clic per modificare lo CANONICAL MATCHING OPERATOR stile del titolo dello schema • The result of the canonical matching operator is either 0 (false) or 1 (true). Fare clic per modificare gli stili del testo • If E denotes the space of BSO’s described by a set of p dello schema variables Yi taking values in the corresponding domains Yi, thenSecondo the matching operator is a function: livello Match: E × E {0, 1} Terzo livello such that for any two BSO’s a, b E: • Quarto livello a –=Quinto [Y1=Alivello 1] [Y2=A2] ... [Yp=Ap] b = [Y1=B1] [Y2=B2] ... [Yp=Bp] it happens that: • Match(a,b) = 1 • Match(a,b) = 0 if BiAi for each i=1, 2, , p, otherwise. 33 Fare clic per modificare lo CANONICAL MATCHING OPERATOR stile del titolo dello schema Examples: Fare clic per modificare gli stili del testo District1 [profession={farmer, driver}] [age=[24,34]] dello =schema Indiv1 = [profession=farmer] [age=28] Secondo livello Indiv2 = [profession=salesman] [age=[27,28]] Terzo livello • Quarto livello Match(District1,Indiv1) – Quinto livello= 1 Match(District1,Indiv2) = 0 34 Fare clic per modificare lo CANONICAL MATCHING OPERATOR stile del titolo dello schema • The canonical matching function satisfies two out of three Fare modificare propertiesclic of aper similarity measure: gli stili del testo dello • a,schema b E: Match(a, b) 0 • a, b E:livello Match(a, a) Match(a, b) Secondo while itTerzo does not satisfy the commutativity or simmetry livello property: • Quarto livello – Quinto livello • a, b E: Match(a, b) = Match(b, a) because of the different role played by a and b. 35 Fare clic per modificare lo FLEXIBLE MATCHING OPERATOR stile del titolo dello schema • The requirement BiAi for each i=1, 2, , p, might be too strict for real-world problems, because of the presence of Fare clicdescription per modificare gli stiliof del testo noise in the of the individuals the universe. dello schema • Example: District1 = [profession={farmer, driver}] [age=[24,34]] Secondo livello Indiv3livello = [profession=farmer] [age=23] Terzo =0 • QuartoMatch(District1,Indiv3) livello • It is necessary to rely on a flexible definition of matching – Quinto livello operator, which returns a number in [0,1] corresponding to the degree of match between two BSO’s, that is flexible-matching: E × E [0,1] 36 Fare clic per modificare lo FLEXIBLE MATCHING OPERATOR stile del titolo dello schema For any two BSO’s a and b, i) flexible-matching(a,b)=1 if Match(a,b)=true, Fare clic per modificare gli stili del testo ii) flexible-matching(a,b) [0,1) otherwise. dello The resultschema of the flexible matching can be interpreted as the probability of a matching b provided that a change is made in b. Secondo livello Let Ea = {b' E | Match(a,b')=1} and P(b | b') be the Terzo livello of observing b given that the original conditional probability Quarto observation• was b'.livello Then – Quinto livello flexible - matching( a, b) = max P(b | b' ) def b' Ea that is flexible-matching(a,b) equals the maximum conditional 37 probability over the space of BSO’s canonically matched by a. FLEXIBLE MATCHING: AN Fare clic per modificare lo APPLICATION stile del titolo dello schema • Credit card applications (Quinlan) Fare clic per modificare gli stili del testo • Fifteen variables whose names and values have dello schema to meaningless symbols to been changed protect the confidentiality of the data. Secondo livello + Terzo livello • Quarto livello • class variable: positive in case of approval of – Quinto negative livello credit facilities, otherwise. • Training set: 490 cases • 6 rules generated by Quinlan’s system C4.5 38 FLEXIBLE MATCHING: AN Fare clic per modificare lo APPLICATION stile del titolo dello schema Rule Class Conditions 41 [Y3 > 1.54] [ Y9 = f ] [ Y4 {u, y}] Fare clic per modificare gli stili del testo [Y6{c,d, cc, i, j, k, m, r, q, w, e, aa, ff}] 43 [ Y4 {u, y}] [ Y8 <= 1.71 ] [ Y9 = f ] dello schema [ Y3 <= 0.835] [ Y6 {c,d,i,k,m,q,w,e,aa }] Secondo livello [ Y7 {v,bb}] [Y14 > 102] [Y15 <= 500] 30 Terzo + livello [ Y9 = t ] 34 [Y3livello <= 0.125 ] [Y14 > 221 ] •+Quarto 46 + – Quinto [Y4 {l} ] livello 6 - • Such rules can be easily represented by means of Boolean symbolic objects. • Both matching operators can be considered in order to test the validity of the induced rules.39 Fare clic per modificare lo A new dissimilarity measure stile del titolo dello schema Flexible is asymmetric. Fare clic matching per modificare gli stili delHowever testo it is possible dello schemato “symmetrize” it New dissimilarity measure SO_6 Secondo livello livello as It isTerzo computed d(a,b) =• Quarto livello – Quinto livello = 1-(flexible_matching(a,b)+flexible_matching(b,a))/2 40 Dissimilarity measure Parameters U_1 (Gowda & Diday) U_2 (Ichino & Yaguchi) Constraints Default none Gamma [0 .. 0.5] 0.5 Order of power 1 .. 10 2 U_3 (Normalized Ichino Gamma [0 .. 0.5] 0.5 & Yaguchi) Order of power 1 .. 10 2 U_4 (Weighted Gamma [0 .. 0.5] 0.5 Normalized Ichino & Order of power 1 .. 10 2 Yaguchi) List of weights, one per var. Sum(weights) = 1.0 Equal weights C_1 (Normalized De Comparison function D1, D2, D3, D4, D5 D1 Carvalho) Order of power 1 .. 10 2 SO_1 (De Carvalho) Comparison function D1, D2, D3, D4, D5 D1 Order of power 1 .. 10 2 List of weights, one per var. Sum(weights) = 1.0 Equal weights SO_2 (De Carvalho) Gamma [0 .. 0.5] 0.5 Order of power 1 .. 10 2 SO_3 (De Carvalho) Gamma [0 .. 0.5] 0.5 Order of power 1 .. 10 2 SO_4 (Normalized De Gamma [0 .. 0.5] 0.5 Carvalho) Order of power 1 .. 10 2 SO_5 (Normalized De Gamma [0 .. 0.5] 0.5 Carvalho) Order of power 1 .. 10 2 SO_6 (Symmetrized none Flexible Matching) PROBABILISTIC SYMBOLIC OBJECT Fare clic per (PSO’S) modificare lo stile del titolo dello schema Probabilistic symbolic objects (PSO’s) involve modal Fare clic per modificare gli stili del testo (probabilistic) variables. dello schema Each cell represents the set of weighted values that the Secondo livellofor a symbolic object, where a variable can take Terzoweighting livello system is adopted. probabilistic • Quarto livello Quinto itlivello In case of –PSO, isn’t possible to use dissimilarity measures for BSO because they don’t take the probabilities into consideration and so this determines a notable information loss. Therefore, new dissimilarity measures for PSO are 42 needed. Defining dissimilarity measures forclic probabilistic symbolic Fare per modificare lo objects stile del titolo dello schema Steps: Fare clic per modificare gli stili del testo 1. dello Define coefficients measuring the divergence between schema two probability distributions Secondo livello Kullback-Leibler divergence Terzo livello Chi-square divergence • Quarto livello Hellinger – Quinto livello similarity coefficient (*) K-divergence Variation distance non-symmetric coefficients symmetric coefficient (*) from them two dissimilarity measures, namely the Renyi’s and Chernoff’s coefficients, are obtained 43 Defining dissimilarity measures probabilistic symbolic Farefor clic per modificare lo objects stile del titolo dello schema Steps: Fare clic per modificare gli stili del testo 2. dello Symmetrize the schema coefficients Secondo livello m(P,Q)= Terzo livello non symmetric m(Q,P) + m(P,Q) • Quarto livello 3. Aggregate the contribution of all – Quinto livello variables to compute the dissimilarity between two symbolic objects PSO Dissimilarity measures 44 Fare clic per Mixture SO modificare lo stile del titolo dello schema Some SO’s can be described by both nonFare clic per modificare gli stili del testo modal and modal variables dello schema They are neither BSO’s nor PSO’s Secondo livello measure, then? What dissimilarity InTerzo ASSO livello it has been proposed to combine • Quarto livello the result of two dissimilarity measure, one – Quinto livello for modal and the other for non-modal. Combination can be either additive or multiplicative. This possibility should be taken with great care!!! 45 Fare clic per modificare lo REFERENCES stile del titolo dello • Esposito F., Malerba D., V. Tamma, H.-H.schema Bock. Classical resemblance measures. Chapter 8.1 •Fare Esposito F., Malerba D., V. Tamma. Dissimilarity measures for clic per modificare gli stili del testo symbolic objects. Chapter 8.3 schema • dello Esposito F., Malerba D., F.A. Lisi. Matching symbolic objects. Chapter 8.4 Secondo livello in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting Terzo livello statistical information from complex data. • • Springer Verlag, Heidelberg, 2000. • Quarto livello D. Malerba, L. Sanarico, & V. Tamma (2000). A comparison of Quinto livello dissimilarity–measures for Boolean symbolic data. In P. Brito, J. Costa, & D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on “Dealing with Structured Data in Machine Learning and Statistics”, Barcelona. D. Malerba, F. Esposito, V. Gioviale, & V. Tamma. Comparing Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings of EKT-NTTS, vol. 1, pp. 473-481. 46 Fare clic per modificare lo REFERENCES stile del titolo dello schema • D. Malerba, F. Esposito, M. Monopoli (2002). Estrazione e matching di oggetti simbolici da database relazionali. Atti del Decimo Convegno su Sistemi Evoluti Basi di Datitesto SEBD’2002, Fare clicNazionale per modificare gliperstili del 265-272. schema • dello D. Malerba, F. Esposito, & M. Monopoli (2002). Comparing dissimilarity measures Secondo livellofor probabilistic symbolic objects. In A. • • Zanasi, C. A. Brebbia, N.F.F. Ebecken, P. Melli (Eds.) Data Mining livello Information Systems, Vol 6, 31-40, WIT III, Terzo Series Management Press, Southampton, UK. • Quarto livello E. Diday, F. –Esposito Quinto (2003). livello An Introduction to Symbolic Data Analysis and the Sodas Software, Intelligent Data Analysis, 7, 6, (in press). Other project reports 47 Fare clic METHOD per modificare lo DISS stile del titolo dello schema • Dissimilarity measures between and gli stili del testo Fare clicboth per BSO’s modificare PSO’s. dello schema • Input: Asso file of SO’s Secondo livello • Output for dissimilarities: Terzo livello Report•+Quarto Assolivello file with dissimilarity matrix – Quinto livello DI • Developer: Dipartimento method di Informatica, University of Bari, Italy. Report 48 file TWO DIAGRAMS lo FareUSE clicCASE per modificare stile del titolo dello schema Create an ASSO chaining with the DISS method Set up testo parameters of Fare clic Create peran ASSO modificare gli stili del the DISS method chaining with the DISS dello schema method Secondo livello up parameters TerzoSet livello the DISS method User of • Quarto livello – Quinto livello User Run the DISS method and generate a new ASSO file with a dissimilarity matrix Create a new chaining with the new ASSO file Run the DISS method and generate a report file View report file Run VDISS and visualize the dissimilarity measure, the bi-dimensional mapping & the graphical 49 representation Fare clic per modificare lo PARAMETER SETUP stile del titolo dello schema • The user can select a subset of variables Yi on which the dissimilarity measure or the matching operator Fare clic per modificare gli stili del testohas to computed . dello schema Secondo livello Terzo livello • Quarto livello – Quinto livello 50 Fare clic per modificare lo PARAMETER SETUP stile del dello schema • The user cantitolo select a number of parameters. Dissimilarity measure Fare clic per modificare gli stili del testo dello schema combine Secondo livello Terzo livello Name of • Quarto livello the new – Quinto livello ASSO file ? ? 51 OUTPUT SODAS FILE Fare clic per modificare lo stile del titolo • The output ASSO file dello containsschema both the same input data and an additional dissimilarity matrix. Fare clic per modificare gli stili del testo The dissimilarity between the i-th and the j-th dello BSO isschema written in the cell (entry) (i, j) of the Secondo livello matrix. livellopart of the dissimilarity matrix is • OnlyTerzo the lower • Quarto livello reported in the file, since dissimilarities are – Quinto livello symmetric. abalone output file 52 OUTPUT REPORT FILE The report file is organized as follows: Output report file Fare clic per modificare lo Output stile del titolo dello schema Visualization of the dissimilarity table Fare clic per modificare gli stili del testo dello schema Secondo livello Terzo livello • Quarto livello – Quinto livello 54 Fare clic per modificare lo Output stile del titolo schema Visualization of dello a line graph dissimilarities Fare clic per modificare gli stili del testo dello schema of Secondo livello Terzo livello • Quarto livello – Quinto livello The number of lines in each graph is equal to the number of SOs minus one Each line represents the dissimilarity between a given SO and the subsequent SOs in the file 55 Fare clic per modificare lo Output stile del titolo dello schema Visualization of a scatterplot of Sammon’s Fare clicmapping per modificare gli stili delspace testo nonlinear into a bidimensional dello schema Secondo livello Terzo livello • Quarto livello – Quinto livello 56