ESTEEM:
Trust-aware P2P data integration
Carola Aiello, Tiziana Catarci,
Diego Milano, Monica Scannapieco
Dipartimento di Informatica e Sistemistica
Università di Roma “La Sapienza”
Outline



Previous projects
ESTEEM objectives
Research issues and directions of the unit:

• Data quality: Quality-aware query processing
• Privacy: Privacy-aware record matching
• Trust: Trust model for data sources
DaQuinCIS project (2003)



MIUR – COFIN/PRIN
Main focus: data quality in
cooperative information systems
(CISs)
Data Quality Problems:


Record Matching
Quality-driven query processing
Motivations

A real example: e-Government project to
integrate data about Italian companies
[Figure: the query "Company XYZ?" is posed to a DATA INTEGRATION LAYER over three sources: Chambers of Commerce, Social Insurance Agency, Accident Insurance Agency. Attributes shown: Id, Name, Type of activity, City, Address.]
The Three Real Records
ID | Name | Type of Activity | Address | City
CNCBTB765SDV | Meat production of Bartoletti Benito | Retail of bovine and ovine meats | National Street dei Giovi | Novi Ligure
0111232223 | Bartoletti Benito Meat production | Grocer’s shop, beverages | 9, Rome Street | Pizzolo Formigaro
CNCBTR765LDV | Meat production in Piemonte of Bartoletti Benito | Butcher | 4, Mazzini Square | Ovada

Which is the actual company XYZ to be returned to the client?
• One of the 3? Which?
• A “merge” of the 3?
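To make the record matching question concrete, here is a minimal sketch that compares the three company names with a word-level Jaccard similarity. The similarity measure and the 0.5 threshold are assumptions chosen for illustration; the slides do not say which matching technique the project uses.

```python
from itertools import combinations

# Company names of the three records, keyed by their ID.
names = {
    "CNCBTB765SDV": "Meat production of Bartoletti Benito",
    "0111232223": "Bartoletti Benito Meat production",
    "CNCBTR765LDV": "Meat production in Piemonte of Bartoletti Benito",
}

THRESHOLD = 0.5  # hypothetical cut-off, for illustration only


def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-level Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def candidate_matches(records, threshold=THRESHOLD):
    """Return (id1, id2, score) for every pair scoring at or above the threshold."""
    matches = []
    for (id1, n1), (id2, n2) in combinations(records.items(), 2):
        score = jaccard(n1, n2)
        if score >= threshold:
            matches.append((id1, id2, round(score, 3)))
    return matches


matches = candidate_matches(names)
for id1, id2, score in matches:
    print(id1, id2, score)
```

With this threshold all three pairs are flagged as candidate matches, which is exactly the situation the slide describes: the integration layer still has to decide which record (or which merge) to return.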
Objectives of the Research

Given a set of distributed and
heterogeneous data sources that are
affected by data quality problems:

1. Improve the quality of each data
source
   • Record matching across sources

2. Provide unified and transparent access
to data sources
   • Data Integration & Quality-driven query
processing
Improving quality of addresses in Italian PA (2004)





Collaboration agreement between AIPA (now CNIPA) and ISTAT
April 2002 - July 2004
Proposal of standard formats for the acquisition and
exchange of addresses
Proposal for redesigning the address update
flows
Methodology for measuring the quality of
addresses
Experimental measurement of address quality
in three national archives:
• Agenzia delle Entrate (Revenue Agency)
• Chambers of Commerce
• INPS
Data Quality and Data Privacy
(Current)



Joint activity with Purdue University, Indiana,
USA
Publishing elementary data may violate privacy
requirements, even when data are anonymized
• anonymization removes principal identifiers
like SSN, Name+Surname+DOB, etc.
Privacy-aware record matching
• only the result of the intersection (A∩B) across
data sets is shared and nothing else (neither A - A∩B nor B - A∩B)
ESTEEM Objectives



Study of trust and data quality
issues in P2P systems
Specification of P2P data integration
systems with trust requirements
Definition of quality- and trust-aware
query processing algorithms
P2P Systems

P2P systems


loosely coupled, dynamic, open
Data sharing in such systems



no centralized global schema
peer mappings dynamically built
new peers can make new data schemas
available
Data Quality
EmployeeS1
EmployeeID | Name | Surname | Salary | Email
arpa78 | John | Smith | 2000 | [email protected]
eugi98 | Edward | Monroe | 1500 | [email protected]
ghjk09 | Anthony | Wite | 1250 | [email protected]
treg23 | Marianne | Collins | 1150 | [email protected]

EmployeeS2
EmployeeID | Name | Surname | Salary | Email
arpa78 | John | Smith | 2600 | [email protected]
eugi98 | Edward | Monroe | 1500 | [email protected]
ghjk09 | Anthony | White | 1250 | [email protected]
dref43 | Marianne | Collins | 1150 | [email protected]

Attribute conflict: same EmployeeID, different attribute values (e.g. arpa78: Salary 2000 vs 2600)
Key conflict: same employee, different EmployeeID (Marianne Collins: treg23 vs dref43)
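The two kinds of conflict can be detected mechanically. The sketch below is only an illustration on the EmployeeS1 and EmployeeS2 data: it treats an attribute conflict as "same key, different value" and a key conflict as "different key, identical remaining attributes". The Email column is left out because its values are not legible in the source, and the function names are hypothetical.

```python
FIELDS = ("name", "surname", "salary")

# EmployeeID -> (Name, Surname, Salary); Email omitted (not legible in the source).
employee_s1 = {
    "arpa78": ("John", "Smith", 2000),
    "eugi98": ("Edward", "Monroe", 1500),
    "ghjk09": ("Anthony", "Wite", 1250),
    "treg23": ("Marianne", "Collins", 1150),
}
employee_s2 = {
    "arpa78": ("John", "Smith", 2600),
    "eugi98": ("Edward", "Monroe", 1500),
    "ghjk09": ("Anthony", "White", 1250),
    "dref43": ("Marianne", "Collins", 1150),
}


def attribute_conflicts(s1, s2):
    """Keys present in both sources whose attribute values differ."""
    conflicts = []
    for key in sorted(s1.keys() & s2.keys()):
        for field, v1, v2 in zip(FIELDS, s1[key], s2[key]):
            if v1 != v2:
                conflicts.append((key, field, v1, v2))
    return conflicts


def key_conflicts(s1, s2):
    """Pairs of different keys whose remaining attributes are identical."""
    only_s1 = {k: v for k, v in s1.items() if k not in s2}
    only_s2 = {k: v for k, v in s2.items() if k not in s1}
    return [(k1, k2)
            for k1, v1 in only_s1.items()
            for k2, v2 in only_s2.items()
            if v1 == v2]


print(attribute_conflicts(employee_s1, employee_s2))
print(key_conflicts(employee_s1, employee_s2))
```

Exact equality is used here only to keep the example short; on real data the key conflict check would need an approximate record matching step.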
Quality-aware query processing - 1



Key conflicts require the application of
Record Matching techniques
Attribute conflicts are solved by query-time
conflict resolution techniques
The resolution of such conflicts in P2P
systems is an open issue:


Definition of a quality-aware semantics for query
answering in P2P systems
Need to develop techniques for solving such
conflicts according to the defined semantics
Quality-aware query processing - 2


Query language supporting the
specification of conflict resolution
strategies
Important in P2P systems: search
space pruning on the basis of the
quality characterization of sources
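As an illustration of both points, the sketch below lets a query name a resolution strategy per attribute and prunes sources whose quality score falls below a threshold before any value is considered. The `Source` class, the quality scores, the strategy names and the third source S3 are all hypothetical; the slides do not define a concrete query language.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    quality: float  # hypothetical quality score in [0, 1]
    rows: dict      # key -> {attribute: value}


def prune(sources, min_quality):
    """Drop sources whose quality score is below the threshold."""
    return [s for s in sources if s.quality >= min_quality]


def best_quality(candidates):
    """Strategy: take the value offered by the highest-quality source."""
    return max(candidates, key=lambda c: c[0])[1]


def maximum(candidates):
    """Strategy: take the largest value, whatever its source."""
    return max(value for _, value in candidates)


def answer(sources, key, strategies, min_quality=0.0):
    """Resolve each requested attribute with its own strategy."""
    result = {}
    for attribute, strategy in strategies.items():
        candidates = [(s.quality, s.rows[key][attribute])
                      for s in prune(sources, min_quality)
                      if key in s.rows and attribute in s.rows[key]]
        if candidates:
            result[attribute] = strategy(candidates)
    return result


sources = [
    Source("S1", 0.9, {"arpa78": {"salary": 2000}}),
    Source("S2", 0.6, {"arpa78": {"salary": 2600}}),
    Source("S3", 0.2, {"arpa78": {"salary": 9999}}),  # made-up low-quality source
]
print(answer(sources, "arpa78", {"salary": best_quality}))
print(answer(sources, "arpa78", {"salary": maximum}, min_quality=0.5))
```

Pruning happens before resolution, so a low-quality source never contributes a candidate value and, in a P2P setting, would not need to be contacted at all.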
Privacy


How to protect privacy when sharing data?
With sources S1 and S2 issuing
queries Q1 and Q2 respectively, at the end
of the interaction:
• S1 must learn the result of Q1 and nothing else
• S2 must learn the result of Q2 and nothing else
[Figure: S1 and S2 exchanging Query Q1 / Result Q1 and Query Q2 / Result Q2]
Privacy-aware Record Matching - 1
[Figure: two data sets A and B and their intersection A∩B]

Secure set intersection: (i) exact
matching only; (ii) not record-oriented;
(iii) expensive
Private data sharing: (i) exact
matching only; (ii) schema-unaware
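For readers unfamiliar with secure set intersection, the sketch below shows one classic construction based on commutative encryption (modular exponentiation with secret exponents). It is a toy under stated assumptions: both parties are simulated in one process, the modulus is far too small for real use, and nothing here is claimed to be the protocol used in the project. It also shows limitation (i): only exactly equal values match.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized modulus, not secure


def hash_to_group(value):
    """Map a string to an integer in [1, P-1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_key():
    """Secret exponent coprime with P-1, so exponentiation is injective."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(values, key):
    """Commutative 'encryption': x -> x**key mod P."""
    return [pow(v, key, P) for v in values]


def private_intersection(items_a, items_b):
    """Items of A that also occur in B, compared only in doubly encrypted form."""
    key_a, key_b = random_key(), random_key()  # each party keeps its own key
    a_once = encrypt([hash_to_group(x) for x in items_a], key_a)  # A -> B
    b_once = encrypt([hash_to_group(x) for x in items_b], key_b)  # B -> A
    a_twice = encrypt(a_once, key_b)       # B re-encrypts A's values, order kept
    b_twice = set(encrypt(b_once, key_a))  # A re-encrypts B's values
    # (x**a)**b == (x**b)**a mod P, so equal items collide after both steps.
    return [x for x, c in zip(items_a, a_twice) if c in b_twice]


print(private_intersection(["CNCBTB765SDV", "0111232223"],
                           ["0111232223", "CNCBTR765LDV"]))
```

Note that CNCBTB765SDV and CNCBTR765LDV differ by two characters and are therefore not matched, which is why exact-matching protocols fall short for record matching on dirty data.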
Privacy-aware Record Matching - 2

Algorithms that enable
privacy-aware record matching in
P2P contexts


The third-party problem
First proposals: ElAbbadi, ICDE 2006, but
exact matching only
Trust
• Trust is typically associated with a
source as a whole
• Need for a finer-grained
characterization

E.g.: the Ministero delle Finanze (Ministry of Finance) is
reliable with respect to Codici Fiscali (tax codes)
Trust Model for Data Sources - 1


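Previous proposals: trust associated with the whole
organization (peer)
Our proposal: trust associated with the pair <Organization, Data Type>

\[ R(Org_k, D) = \sum_{Org_i \in O} \frac{C_{i,k,D}}{n_{i,k,D}} \]

where C_{i,k,D} is the number of <D, Org_k> complaints sent
by Org_i, and n_{i,k,D} is the number of D exchanges of
Org_k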
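A small numeric sketch may help. It assumes the formula on this slide reads R(Org_k, D) = sum over Org_i in O of C_{i,k,D} / n_{i,k,D}, with C the complaint counts and n the exchange counts; the organization names, data types and counts below are invented for illustration, and the slide does not say how the resulting value is scaled or thresholded.

```python
def trust_value(complaints, exchanges, org_k, data_type):
    """Sum, over all organizations Org_i, of complaints / exchanges
    for the pair (org_k, data_type). Pairs with zero exchanges are skipped."""
    total = 0.0
    for (org_i, k, d), n in exchanges.items():
        if k == org_k and d == data_type and n > 0:
            total += complaints.get((org_i, k, d), 0) / n
    return total


# Hypothetical counts, keyed by (Org_i, Org_k, D).
exchanges = {
    ("Org1", "OrgK", "address"): 10,
    ("Org2", "OrgK", "address"): 20,
    ("Org1", "OrgK", "tax_code"): 5,
}
complaints = {
    ("Org1", "OrgK", "address"): 2,
    ("Org2", "OrgK", "address"): 5,
}

print(trust_value(complaints, exchanges, "OrgK", "address"))
print(trust_value(complaints, exchanges, "OrgK", "tax_code"))
```

The same organization gets different values for different data types, which is the point of attaching trust to the pair <Organization, Data Type> rather than to the organization as a whole.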
Trust Model for Data Sources - 2


Drawback: centralized
Need for:
• A decentralized model
• A more flexible model (e.g. trust associated with
views)
Trust Model for Data Sources - 3

More general trust characterization
based on the evaluation of a peer’s
assertions on some metadata:
• Data quality-aware: trust computed on
the basis of the declared quality of
provided data
• Privacy-aware: trust computed on the
basis of the declared privacy level

Different roles for providers and
consumers: e.g. a provider can decide
not to release data if a requester is not
privacy-trusted (or to adopt a specific
technique)
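The provider-side decision can be sketched as a simple policy. Everything below is hypothetical: the `Requester` class, the numeric privacy-trust score, the two thresholds and the "protected release" outcome standing in for "a specific technique" are assumptions made only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Requester:
    name: str
    privacy_trust: float  # hypothetical score in [0, 1] derived from declared privacy level


def release_decision(requester, trusted_at=0.7, protect_at=0.4):
    """Decide how a provider answers a request.

    - at or above trusted_at: release the data as is
    - between protect_at and trusted_at: release with a protective technique
    - below protect_at: do not release
    """
    if requester.privacy_trust >= trusted_at:
        return "release"
    if requester.privacy_trust >= protect_at:
        return "release_protected"
    return "deny"


for peer in (Requester("P1", 0.9), Requester("P2", 0.5), Requester("P3", 0.1)):
    print(peer.name, release_decision(peer))
```

A symmetric policy on the consumer side could use the quality-aware trust of a provider to decide whether to accept its data.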