DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS Prof. Donato Malerba Department of Informatics, University of Bari, Italy [email protected] ASSO School Porto, Portugal May 13-15, 2002 COMPUTING DISSIMILARITIES: Fare clic per modificare lo WHY? stile del titolo dello schema • Several data analysis techniques are based on quantifying a dissimilarity measure Fare clic per modificare(or glisimilarity) stili del testo between multivariate data. dello schema • Clustering Secondo livello • Discriminant analysis Terzo livello • Visualization-based approaches • Quarto livello • Symbolic –objects are a kind of multivariate data. Quinto livello • Ex.: [colour={red, black}][weight {60,70,80}][height []1.50,1.60] • The dissimilarity measures presented here are among those investigated in the SODAS Project. 2 BOOLEAN SYMBOLIC OBJECTS Fare clic per modificare lo (BSO’S) stile del titolo dello schema A BSO is a conjunction of boolean elementary events: clic per modificare gli stili del testo Fare [Ydello [Y2=A2] ... [Yp=Ap] 1=A1] schema where each variable Secondo livello Yi takes values in Yi and Ai is a subset of Yi Terzo livello Let a and• Quarto b be livello two BSO’s: a = [Y1=A1]– Quinto [Y2=A livello 2] ... [Yp=Ap] b = [Y1=B1] [Y2=B2] ... [Yp=Bp] where each variable Yj takes values in Yj and Aj and Bj are subsets of Yj. We are interested to compute the dissimilarity d(a,b). 7 Fare clic per modificare lo CONSTRAINED BSO’S stile del titolo dello schema Two types of dependencies between variables: • Hierarchical mother-daughter Fare clic perdependence modificare (gli stili del testo): A variable i may be inapplicable if another variable Yj takes dello Yschema its values in a subset Sj Yj. This dependence is Secondo livello expressed as a rule: Terzo livello if [Yj = Sj] then [Yi = NA] • Quarto livello • Logical dependence : This case occurs, if a subset – Quinto livello Sj Yj of a variable Yj is related to a subset Si Yi of a variable Yi by a rule such as: if [Yj = Sj] then [Yi = Si] 8 DISSIMILARITY AND SIMILARITY Fare clic perMEASURES modificare lo stile del titolo dello schema Dissimilarity Measure * = d(a,a) d(a,b) Fare gli stili del<testo d: EERclic suchper that dmodificare = d(b,a) a,bE a dello schemaSimilarity Measure s: EE R such that s*a = s(a,a) s(a,b) = s(b,a) 0 a,bE Secondo livello Generally: Terzo livello a E: d*•a Quarto = d* and s*a= s* and specifically, d* = 0 while s*= 1 livello – Quinto livello can be transformed into Dissimilarity measures similarity measures (and viceversa): d=(s) ( s=-1(d) ) where: (s) strictly decreasing function, and (1) = 0, (0) = 9 DISSIMILARITY AND SIMILARITY Fare clic per modificare lo MEASURES: PROPERTIES stile del titolo dello schema Some properties that a dissimilarity measure d on E may satisfy Fareare: clic per modificare gli stili del testo 1. dello d(a, b) = schema 0 c E: d(a, c) = d(b, c) (eveness) 2. d(a, b) = 0 a =livello b Secondo (definiteness) 3. d(a, b) d(a, c) livello + d(c, b) Terzo (triangle inequality) • Quartoc), livello 4. d(a, b) max(d(a, d(c, b)) – Quinto livello (ultrametric inequality ) 5. d(a, b) + d(c, d) max(d(a, c) + d(b, d), d(a, d) +d(b, c)) (Buneman's inequality) 6. Let (E, +) be a group, then d(a, b) = d(a+c, b+c) (translation invariance ) A dissimilarity function that satisfies proprieties 2 and 3 is called metric. A dissimilarity function that satisfies only property 3 is called pseudo metric or semi- distance. 10 DISSIMILARITY MEASURES Fare clic BETWEEN per modificare lo BSO’S stile del titolo dello schema Author(s) (Year) Notation from the SODAS Package Fare clic per modificare gli stili del testo • Gowda & Diday (1991) U_1 dello schema • Ichino & Yaguchi (1994) U_2, U_3, U_4 Secondo livello SO_1, SO_2 • De Carvalho (1994) Terzo livello • De Carvalho (1996, 1998) SO_3, SO_4, SO_5, C_1 • Quarto livello – Quinto livello U: only for unconstrained BSO’s C: only for constrained BSO’s SO: for both constrained and unconstrained BSO’s 11 GOWDA & DIDAY’S DISSIMILARITY Fare clic perMEASURE modificare lo stile del titolo dello schema Gowda & Diday’s dissimilarity measures for two BSO’s a and b: p Fare clic per modificare gli stili del testo U_1 D ( A , B ) D(a, b) = j j j 1 dello schema If Yj is a continuous variable: Secondo livello D(Aj, Bj) = D(Aj, Bj) + Ds(Aj, Bj) + Dc(Aj, Bj) Terzo livello while if Yj is a nominal variable: • Quarto livello D(Aj, Bj) = Ds(A Dc(Aj, Bj) j, Bj) + – Quinto livello where the components are defined so that their values are normalized between 0 and 1: Aj Bj • D(Aj, Bj) due to position, • Ds(Aj, Bj) due to span, D Ds Dc 12 • Dc(Aj, Bj) due to content GOWDA & DIDAY’S DISSIMILARITY Fare clic perMEASURE modificare lo stile del titolo dello schema Properties: Fare clic per modificare gli stili del testo D(a, b) schema = 0 a = b (definiteness property), dello No proof is reported for the triangle inequality property Secondo livello Terzo livello • Quarto livello – Quinto livello 13 ICHINO & YAGUCHI’S FareDISSIMILARITY clic per modificare lo MEASURES stile del titolo dello schema Ichino & Yaguchi’s dissimilarity measures are based on the Cartesian operators join and meet . Fare clic per modificare gli stili del testo For continuous variables: Aj Bj dello schema Aj Bj Aj Bj Secondo Aj Bj livello Terzo livello Aj Bj while for nominal variables: • Quarto livello Aj Bj = Aj Bj – Quinto livello Aj Bj = Aj Bj Given a pair of subsets (Aj, Bj) of Yj the componentwise dissimilarity(Aj,Bj) is: (Aj, Bj) =Aj Bj Aj Bj+ (2Aj BjAj Bj) where 0 0.5 and Ajis defined depending on variable14types. ICHINO & YAGUCHI’S FareDISSIMILARITY clic per modificare lo MEASURES stile del titolo dello schema (Aj,Bj) are aggregated by an aggregation function such as the generalised Minkowski’s distance of order q: Fare clic per modificare gli stili del testo U_2 dello schema d q ( a, b) q ( A j , B j ) p q j 1 Secondo livello on the chosen units of measurements Drawback : dependence Terzo livello of the componentwise dissimilarity: Solution : normalization U_3 • Quarto livello p livello d q (–a,Quinto b) q ( Aj , B j ) q , j 1 ( Aj , B j ) ( Aj , B j ) Yj The weighted formulation guarantees that dq(a,b)[0,1]. p U_4 d q (a, b) q c j ( A j , B j )q j 1 The above measures are metrics 15 DE CARVALHO’S DISSIMILARITY Fare clic per modificare lo MEASURES stile del titolo dello schema A straightforward extension of similarity measures for classical clic data matrices with nominal Fare per modificare glivariables. stili del testo dello schema Agreement Agreement =(AjBj) Secondo livello Disagreement =(c(Aj)Bj) Terzo livello Total (Bj) • Quarto livello – Quinto livello Disagreement Total =(Ajc(Bj)) (Aj) =(c(Aj)c(Bj)) (c(Aj)) (c(Bj)) (Yi) where (Vj) is either the cardinality of the set Vj (if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable). 16 DE CARVALHO’S DISSIMILARITY Fare clic per modificare lo MEASURES stile del titolo dello schema Five different similarity measures s , i = 1, ..., 5, i defined: Fare clic per modificare gli stili del testo si Comparison Function Range Property s1 /(++) dello schema [0,1] s2 2/(2++) [0,1] Secondo livello s3 /(+2+2) [0,1] s4 Terzo ½[/(+)+/(+)] [0,1] livello ½ s3 /[(+)(+)] [0,1] • Quarto livello are metric semi metric metric semi metric semi metric – Quinto livello The corresponding dissimilarities are di = 1 si. The di are aggregated by an aggregation function AF such as the generalised Minkowski metric, thus obtaining: SO_1 p d ai (a, b) q w j di ( A j , B j ) q j 1 1 i 5 17 DE CARVALHO’S EXTENSION OF ICHINO YAGUCHI’S DISSIMILARITY Fare clic&per modificare lo MEASURE stile del titolo dello schema A different componentwise dissimilarity measure: Fare clic per modificaregli A j , stili B j del testo A j , B j dello schema A j B j Secondo livello where is defined as in Ichino & Yaguchi’s dissimilarity Terzo livello measure. • Quartofunction livello The aggregation AF suggested by De Carvalho is: – Quinto livello SO_2 q p 1 d q (a, b) q ( A j , B j ) j 1 p This measure is a metric. 18 THE DESCRIPTION-POTENTIAL Fare clic per modificare lo APPROACH stile del titolo dello schema All dissimilarity measures considered so far are defined by two functions: a comparison function (componentwise measure) and Fare clic per modificare gli stili del testo an aggregation function . A dello differentschema approach is based on the concept of description potential (a) of a symbolic object a. Secondo livello Terzo livello p ( a ) ( A j ) • Quarto livello j 1 – Quinto livello where (Vj) is either the cardinality of the set Vj (if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable). 19 THE DESCRIPTION-POTENTIAL Fare clic perAPPROACH modificare lo stile del titolo dello schema SO_3 d1 (a, b) (a b) (a b) [2(a b) (a) (b)] Fare clic per modificare gli stili del testo dello schema (a b) (a b) [2(a b) (a) (b)] SO_4 d 2 ( a, b) Secondo livello (a E ) Terzo livello • Quarto livello (a b) (a b) [2(a b) (a) (b)] SO_5 d 2 (a, –b)Quinto livello (a b) The triangular inequality does not hold for SO_3 and SO_4, which are equivalent. SO_5 is a metric. 20 DESCRIPTION POTENTIAL FOR Fare clic per modificare lo CONSTRAINED BSO’S stile del titolo dello schema Given a BSO a and a logical dependence expressed by the rule: Fare clic per modificare gli stili del testo if [Yj = Sj] then [Yi = Si] dello schema the incoherent restriction a’ of a is defined as: Secondo livello a’= [Y1=A1] ... [Yj-1=Aj-1] [Yj=Aj Sj] ... [Yi-1=Ai-1] Terzo livello [Yi=Ai (Yi\Si)] ... [Yp=Ap] • Quarto livello Then the description of a is: – Quinto potential livello p (a ) ( A j ) (a) j 1 A similar extension exists for hierarchical dependencies. 21 DISSIMILARITY MEASURES FOR Fare clic per modificare lo CONSTRAINED BSO’S stile del titolo dello schema •The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 Fare clic per modificare gli stili del testo and SO_5. •De Carvalho proposed an extension of ’, so that SO_2 dello schema can also be applied to constrained BSO. livello •HeSecondo also proposed an extension of , , , and in order to takeTerzo into account livello of constraints. Therefore, SO_1 can also be applied tolivello constrained BSO. • Quarto p q Finally, C_1 is– defined as follows: Quinto livello di A j , B j where: j 1 0 if Y NA d ( a , b ) q j q p ( j ) ( j ) 1 otherwise j 1 If all BSO’s are coherent, then the dissimilarity measures 22 do not change. Fare clic per modificare lo MATCHING stile del titolo dello schema • Matching is the process of comparing two or more Fare per modificare gli stili del testo structuresclic to discover their similarities or differences. dello schema • Similarity judgements in the matching process are directional: They have a Secondo livello • referent, a, a prototype or the description of a class of objectsTerzo livello Quarto livello of the prototype or an instance of a • subject,• b , a variant – Quinto livello class of objects. • Matching two structures is a common problem to many domains, like symbolic classification, pattern recognition, data mining and expert systems. 29 Fare clicMATCHING per modificare lo BSO’S stile del titolo dello schema •Generally, a BSO represents a class description and plays the role ofclic the per referent in the matching process. Fare modificare gli stili del testo a: [color = {black, white}] [height =[170, 200]] dello schema describes a set of individuals either black or white, whose Secondo livello [170,200]. Such a set of individuals height is in the interval is called extension of the BSO. The extension is a subset of Terzo livello the universe of individuals . • Quarto livello – Quinto livello Given two BSO’s a and b, the matching operators define whether b is the description of an individual in the extension of a. • In the SODAS software two matching operators for BSO’s have been defined. 30 Fare clic per modificare lo CANONICAL MATCHING OPERATOR stile del titolo dello schema • The result of the canonical matching operator is either 0 (false) or 1 (true). Fare clic per modificare gli stili del testo • If E denotes the space of BSO’s described by a set of p dello schema variables Yi taking values in the corresponding domains Yi, thenSecondo the matching operator is a function: livello Match: E × E {0, 1} Terzo livello such that for any two BSO’s a, b E: • Quarto livello a –=Quinto [Y1=Alivello 1] [Y2=A2] ... [Yp=Ap] b = [Y1=B1] [Y2=B2] ... [Yp=Bp] it happens that: • Match(a,b) = 1 • Match(a,b) = 0 if BiAi for each i=1, 2, , p, otherwise. 31 Fare clic per modificare lo CANONICAL MATCHING OPERATOR stile del titolo dello schema Examples: Fare clic per modificare gli stili del testo District1 [profession={farmer, driver}] [age=[24,34]] dello =schema Indiv1 = [profession=farmer] [age=28] Secondo livello Indiv2 = [profession=salesman] [age=[27,28]] Terzo livello • Quarto livello Match(District1,Indiv1) – Quinto livello= 1 Match(District1,Indiv2) = 0 32 Fare clic per modificare lo CANONICAL MATCHING OPERATOR stile del titolo dello schema • The canonical matching function satisfies two out of three Fare modificare propertiesclic of aper similarity measure: gli stili del testo dello • a,schema b E: Match(a, b) 0 • a, b E:livello Match(a, a) Match(a, b) Secondo while itTerzo does not satisfy the commutativity or simmetry livello property: • Quarto livello – Quinto livello • a, b E: Match(a, b) = Match(b, a) because of the different role played by a and b. 33 Fare clic per modificare lo FLEXIBLE MATCHING OPERATOR stile del titolo dello schema • The requirement BiAi for each i=1, 2, , p, might be too strict for real-world problems, because of the presence of Fare clicdescription per modificare gli stiliof del testo noise in the of the individuals the universe. dello schema • Example: District1 = [profession={farmer, driver}] [age=[24,34]] Secondo livello Indiv3livello = [profession=farmer] [age=23] Terzo =0 • QuartoMatch(District1,Indiv3) livello • It is necessary to rely on a flexible definition of matching – Quinto livello operator, which returns a number in [0,1] corresponding to the degree of match between two BSO’s, that is flexible-matching: E × E [0,1] 34 Fare clic per modificare lo FLEXIBLE MATCHING OPERATOR stile del titolo dello schema For any two BSO’s a and b, i) flexible-matching(a,b)=1 if Match(a,b)=true, Fare clic per modificare gli stili del testo ii) flexible-matching(a,b) [0,1) otherwise. dello The resultschema of the flexible matching can be interpreted as the probability of a matching b provided that a change is made in b. Secondo livello Let Ea = {b' E | Match(a,b')=1} and P(b | b') be the Terzo livello of observing b given that the original conditional probability Quarto observation• was b'.livello Then – Quinto livello flexible - matching( a, b) = max P(b | b' ) def b' Ea that is flexible-matching(a,b) equals the maximum conditional 35 probability over the space of BSO’s canonically matched by a. FLEXIBLE MATCHING: AN Fare clic per modificare lo APPLICATION stile del titolo dello schema • Credit card applications (Quinlan) Fare clic per modificare gli stili del testo • Fifteen variables whose names and values have dello schema to meaningless symbols to been changed protect the confidentiality of the data. Secondo livello + Terzo livello • Quarto livello • class variable: positive in case of approval of – Quinto negative livello credit facilities, otherwise. • Training set: 490 cases • 6 rules generated by Quinlan’s system C4.5 36 FLEXIBLE MATCHING: AN Fare clic per modificare lo APPLICATION stile del titolo dello schema Rule Class Conditions 41 [Y3 > 1.54] [ Y9 = f ] [ Y4 {u, y}] Fare clic per modificare gli stili del testo [Y6{c,d, cc, i, j, k, m, r, q, w, e, aa, ff}] 43 [ Y4 {u, y}] [ Y8 <= 1.71 ] [ Y9 = f ] dello schema [ Y3 <= 0.835] [ Y6 {c,d,i,k,m,q,w,e,aa }] Secondo livello [ Y7 {v,bb}] [Y14 > 102] [Y15 <= 500] 30 Terzo + livello [ Y9 = t ] 34 [Y3livello <= 0.125 ] [Y14 > 221 ] •+Quarto 46 + – Quinto [Y4 {l} ] livello 6 - • Such rules can be easily represented by means of Boolean symbolic objects. • Both matching operators can be considered in order to test the validity of the induced rules.37 Fare clic per modificare lo REFERENCES stile del titolo dello • Esposito F., Malerba D., V. Tamma, H.-H.schema Bock. Classical resemblance measures. Chapter 8.1 •Fare Esposito F., Malerba D., V. Tamma. Dissimilarity measures for clic per modificare gli stili del testo symbolic objects. Chapter 8.3 schema • dello Esposito F., Malerba D., F.A. Lisi. Matching symbolic objects. Chapter 8.4 Secondo livello in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting Terzo livello statistical information from complex data. • • Springer Verlag, Heidelberg, 2000. • Quarto livello D. Malerba, L. Sanarico, & V. Tamma (2000). A comparison of Quinto livello dissimilarity–measures for Boolean symbolic data. In P. Brito, J. Costa, & D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on “Dealing with Structured Data in Machine Learning and Statistics”, Barcelona. D. Malerba, F. Esposito, V. Gioviale, & V. Tamma. Comparing Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings of EKT-NTTS, vol. 1, pp. 473-481. 38 Fare clic per modificare lo METHOD DI stile del titolo dello schema • Both dissimilarity measures and matching operators between BSO’s available in Fare clic perare modificare the method DI (Distance dello schema matrix) of the SODAS software. Secondo livello • Input: Sodaslivello file of BSO’s Terzo • Output for dissimilarities: • Quarto livello Report + Sodas filelivello with – Quinto distance matrix • Output for matching operators: Report • Developer: Dipartimento di Informatica, University of Bari, Italy. gli stili del testo DI method Report 39 file TWO DIAGRAMS Fare clicUSE perCASE modificare lo stile del titolo dello schema Create a SODAS chaining with the DI method Fare clic Create pera SODAS modificare gli stili del testo chaining with the DI dello schema Set up parameters of method the DI method Secondo livello up parameters TerzoSet livello the DI method of • Quarto livello – Quinto livello User User Run the DI method and generate a new SODAS file with a dissimilarity matrix Create a new chaining with the new SODAS file Run the DI method and generate a report file View report file 40 Fare clic per modificare lo PARAMETER SETUP stile del titolo dello schema • The user can select a subset of variables Yi on which the dissimilarity measure or the matching operator Fare clic per modificare gli stili del testohas to computed . dello schema Secondo livello Terzo livello • Quarto livello – Quinto livello 41 Fare clic per modificare lo PARAMETER SETUP stile del titolo dello schema • The user can select a number of parameters. Fare clic per modificare gli stili del testo dello schema Dissimilarity measure Secondo livelloor matching operator Terzo livello Name •ofQuarto livello the new – Quinto livello SODAS file 42 Dissimilarity measure Parameters Constraints Default U_1 (Gowda & Diday) none U_2 (Ichino & Yaguchi) Gamma [0 .. 0.5] 0.5 Order of power 1 .. 10 2 U_3 (Normalized Ichino Gamma [0 .. 0.5] 0.5 & Yaguchi) Order of power 1 .. 10 2 U_4 (Weighted Gamma [0 .. 0.5] 0.5 Normalized Ichino & Order of power 1 .. 10 2 Yaguchi) List of weights, one per var. Sum(weights) = 1.0 Equal weights C_1 (Normalized De Comparison function D 1, D 2, D 3, D 4, D 5 D 1 Carvalho) Order of power 1 .. 10 2 SO_1 (De Carvalho) Comparison function D 1, D 2, D 3, D 4, D 5 D 1 Order of power 1 .. 10 2 List of weights, one per var. Sum(weights) = 1.0 Equal weights SO_2 (De Carvalho) Gamma [0 .. 0.5] 0.5 Order of power 1 .. 10 2 SO_3 (De Carvalho) Gamma [0 .. 0.5] 0.5 Order of power 1 .. 10 2 SO_4 (Normalized De Gamma [0 .. 0.5] 0.5 Carvalho) Order of power 1 .. 10 2 SO_5 (Normalized De Gamma [0 .. 0.5] 0.5 Carvalho) Order of power 1 .. 10 2 OUTPUT SODAS FILE Fare clic per modificare lo stile delSODAS titolo delloboth schema • The output file contains the same input data and an additional dissimilarity matrix. The dissimilarity between theper i-th modificare and the j-th BSO written the cell Fare clic gliisstili delintesto (entry) (i, j) of the matrix. dello schema • Only the lower part of the dissimilarity matrix is reported inSecondo the file, since dissimilarities are symmetric. livello Terzo livello TRIANGE_MATRIX = ( • Quarto livello (0) , – Quinto livello (0.20531, 0) , (12.8626, 15.0793, 0) , (14.0338, 15.0403, 8.64626, 0) , (14.1655, 15.1651, 11.4512, 6.90074, 0) , 44 …) OUTPUT REPORT FILE The report file is organized as follows: Page 1 SODAS 07/26/01 Sodas The Statistical Package for Symbolic Data Analysis Version 1.0 - 05 January 2001 **************D I S T A N C E Data Information: M E A S U R E S*********** OUTPUT REPORT FILE Input Sodas File: C:\ASSO_L~1\SPERIM~3\NUOVAS~1\ABALO.SDS 28 8 Boolean Symbolic Objects (BSOs) read. Variables selected for each BSO: 1 -- 8 Selected Distance Function: U_1 Gowda & Diday Distance Matrix BSO 1 2 3 4 1 0 2 0.2053 0 3 12.86 15.08 0 4 14.03 15.04 8.646 0 5 14.17 15.17 11.45 6.901 -----------------------------------------------------------------------Page 2 SODAS 07/26/01 Fare clic per modificare lo PARAMETER SETUP stile del titolo dello schema • Only one parameter has to be specified for matching BSO’s. clic per modificare gli stili del testo Fare • dello The Class BSO represents the BSO that will be used as schema referent (default: 2nd BSO). Secondo • In the case oflivello canonical matching the subject can be any Terzo BSO. livello • In the case of livello flexible matching the subject must be a • Quarto BSO representing individual. – Quinto an livello Matching operator Parameters CM (Canonical Matching) Class BSO FM (Flexible Matching) Class BSO Constraints 1 .. no BSO 1 .. no BSO Default 2 2 47 OUTPUT REPORT FILE The report file is organized as follows: Page 1 SODAS 12/31/99 Sodas The Statistical Package for Symbolic Data Analysis *********** C A N O N I C A L M A T C H I N G ********** Data Information 29 Boolean Symbolic Objects (BSOs) read. 1 : Boolean Symbolic Object (BSO) selected as class. Matching Vector BSO 1 1 Match 2 No Match ... 29 No Match This procedure was completed at 17:57:33 FURTHER DEVELOPMENTS IN THE ASSO Fare clic perPROJECT modificare lo stile del titolo dello schema Two modules, DISS and MATCH. Both modules will be upgraded to work with both BSO’s and Problabilistic Fare clic per modificare gli stili del testo Symbolic Objects (PSO’s). dello schema Formally, a PSO, which is made up of M probabilistic Secondo livello elementary events (PEE) ai , is defined as: Terzo livello M M a= a a [ Y c • Quarto livello i = i ij i 1 i 1 – Quinto livello pija j 1... ki ] Each PEE ai is attached to an observed quantitative variables Yi which takes ki values cija with probability pija (which sum up to 1). 49