Sampling design issues in Italian experience
on scanner data and the
possible integration with microdata coming
from traditional data collection
Claudia De Vitiis
In collaboration with:
C. Casciano, N. Cibella, A. Guandalini, F. Inglese, G. Seri, M. Terribili, F. Tiero
ISTAT - ITALY
Workshop scanner data. Rome 1-2 October 2015
Summary
1. Aims of the presentation
2. The new general sampling design for CPI
3. The context of the first experiments of sampling
from Scanner Data
4. Selection of elementary items from Scanner Data
5. First results
6. Open issues and conclusions
Workshop scanner data. Rome 1-2 October 2015
1. Aims of the presentation

Scanner data is a big opportunity to introduce improvements in the
CPI compiling not only for the data collection but also with regards
the sampling perspective

This presentation focuses on a comparison among indices of
elementary aggregates compiled using different sub-sets of series
obtained through different “selection schemes”


Elementary Index Bias

Elementary Index Sampling Variance
First experiment on a small data set:

One province, some consumption segments (Italian COICOP6),
permanent series
Workshop scanner data. Rome 1-2 October 2015
2. The new general sampling design for CPI
The current sample strategy of the CPI survey (at territorial level)

Three purposive sampling stages:
–
The first stage units are the chief towns of provinces (established by law)
–
The second stage units are the outlets, purposively chosen in each PSU
to be representative of the consumer behaviour
–
The most sold items of a fixed basked of products are observed in the
sample outlets

The elementary indices are obtained at municipality level by
unweighted geometric mean

The general index is calculated by subsequent aggregation of
elementary indices, using weights at different levels based on
population and national account data on consumer expenditure
Workshop scanner data. Rome 1-2 October 2015
2. The new general sampling design for CPI
A working group established at ISTAT is developing a probability sample
strategy

Based on the hypothesis that turnover is a good proxy of final household
consumption (expenditure)

Outlets and items are selected using probabilities proportional to the
turnover

Selection list for the outlets (local units) is obtained from business register,
ASIA-UL  ASIA-PV


The archive contains information useful for the selection and the
definition of the inclusion probability
For the item level different approaches will coexist at the beginning:

Scanner data for food and grocery in the modern distribution allow the
use of sampling methods and index compilation using weights from
quantities (or expenditures)

For traditional distribution and the other sectors, data collection and
index compilation will continue unchanged at first
Workshop scanner data. Rome 1-2 October 2015
3. The context of the first experiments of sampling
from Scanner Data

Among the first analyses on the scanner data universe we carried out
some tentative experiments of sample selection of series

Data referred to the first Italian provinces for which ISTAT got data,
for year 2014

Weekly data on turnover and quantities per EAN-code (GTIN) and
outlet allow obtaining weekly unit values

One series is considered present in a specific month if it has
associated a non zero turnover in at least one of the three central
weeks of the month

The first issue we analysed is the continuity of series (=EAN by
outlet) and the coverage of a panel of permanent series
 Following figures show examples of the coverage rate of a fixed
basket of series taken in December 2013 during 2014 months
Workshop scanner data. Rome 1-2 October 2015
Figure 1a. Presence of series fixed in Dec2013 in 2014 months - Coverage of
single series and total turnover (all products, Turin province)
100.00
93.32
90.00
85.19
91.68
83.02
89.36
85.41
81.62
80.00
81.11
79.32
77.54
83.16
76.23
82.41
73.89
81.77
81.33
71.94
82.75
82.29
80.49
73.46
73.58
73.53
74.22
sep
oct
nov
dec
70.00
60.00
50.00
40.00
30.00
20.00
10.00
0.00
jan
feb
mar
apr
may
jun
Coverage of items (%)
Workshop scanner data. Rome 1-2 October 2015
jul
ago
Total sales coverage (%)
Figure 1b. Presence of series fixed in Dec2013 in 2014 months - Coverage of single
series and total turnover (Coicop 6 digits - Coffee segment, Turin province)
100.00
94.63
91.56
96.02
95.97
90.23
90.00
92.89
89.14
87.78
90.81
87.12
92.58
86.38
90.73
85.05
92.00
83.09
91.75
88.14
85.02
89.07
84.50
85.32
85.26
86.10
80.00
70.00
60.00
50.00
40.00
30.00
20.00
10.00
0.00
jan
feb
mar
apr
may
jun
Coverage of items (%)
Workshop scanner data. Rome 1-2 October 2015
jul
ago
Total sales coverage (%)
sep
oct
nov
dec
3. The context of the first experiments of sampling from SD
Very first exercise on permanent series

The permanent series are defined as having non-zero turnover for at
least one relevant week (one of the first three full weeks) every
month for 13 months (Dec 2013-Dec 2014)

After having verified that permanent series provide a good coverage
of the total turnover


For these first analyses we focus on this PANEL, postponing the
issues related to the item life cycle, replacement of missing
items and seasonality
Our reference universe for the first experiments is constituted only of
permanent series and indices are evaluated on this sub-set of series
for each consumption segment
Workshop scanner data. Rome 1-2 October 2015
Table 1. Total turnover for all series, relevant week series and panel series, five Italian
provinces (2014)
TURNOVER
Province
% COVERAGE
NUMBER
OF PANEL
SERIES
All
Series
(A)
Relevant
weeks all series
(B)
Relevant weeks
Panel
Series
(C)
B/A
C/B
Turin
81.250.067
56.074.338
46.175.718
69,0
82,4
40.234
Ancona
22.847.504
15.988.337
12.487.957
70,0
78,1
16.516
Cagliari
18.308.833
12.726.186
9.598.138
69,5
75,4
9.165
Palermo
18.374.304
12.711.236
8.531.003
69,2
67,1
9.375
Piacenza
11.139.258
7.771.635
6.727.649
69,8
86,6
6.737
Workshop scanner data. Rome 1-2 October 2015
3. The context of the first experiments of sampling from SD

For the outlets of Turin for which we have data, we focus on three
consumption segments (Coicop 6 digits):

Coffee (01.2.1.1.0)

Pasta (01.1.1.6.1)

Mineral water (01.2.2.1.0)
Table 2. Total turnover for all series, relevant week series and panel series, 3 segments in
Turin (2014)
TURNOVER
Consumption
segment
All
Series
(A)
Relevant
weeks all
series
(B)
% COVERAGE
Relevant weeks
Panel
Series
(C)
B/A
C/B
NUMBER
OF PANEL
SERIES
Coffee
28.622.978
19.665.517
15.692.414
68,7
79,8
9.608
Pasta
26.192.517
17.902.061
13.631.744
68,4
76,2
23.636
Mineral water
26.434.572
18.506.760
16.851.559
70,0
91,1
6.990
Workshop scanner data. Rome 1-2 October 2015
4. The selection of elementary items from SD

Comparison of probabilistic and non-probabilistic sampling selection
schemes for different aggregation index formula for elementary
aggregates

Cut-off selection of series based on thresholds of covered
turnover: the index is compiled using all series covering 60%
or 80% of all turnover in each outlet for the consumption
segment, in previous year 2013

Probability sampling: pps (size= previous year turnover) for
two sampling rates (5% and 10%), selection of 500 samples

Reference method: most sold items in each outlet for
representative products (current fixed basket approach)
Workshop scanner data. Rome 1-2 October 2015
Table 3. Percentage and average number of items per outlet covering 60 and 80% of turnover,
3 segments in Turin (2014)
Percentage of Series
Consumption
segment
Turnover
threshold
60%
Turnover threshold
80%
Average Number of Items per Outlet
Total
Covering
60% of turnover
Covering
80% of
turnover
Coffee
16,2
36,1
46
8
17
Pasta
23,4
44,8
114
27
51
Mineral water
12,1
26,3
34
4
9
Workshop scanner data. Rome 1-2 October 2015
4. The selection of elementary items from SD

Sample series are selected from a sample of outlets: 30 out of 127
of outlets of retail trade modern distribution in Turin province

Outlet are selected by stratified SRS sampling with allocation
proportional to turnover of strata (6 chain by 2 types of outlet)

For each sample we compiled the elementary fixed base indices for
12 months with three classical aggregation formulas: Jevons
(unweighted), Fisher (ideal) and Lowe (weights from quantities of
previous year)

For the sample selection and weighting of indices we refer to total
annual turnover

Comparison of each estimate with the corresponding universe
value, evaluated on the complete set of panel series
Workshop scanner data. Rome 1-2 October 2015
5. First results

For each aggregation formula comparison of values obtained on the
different subset of series
 Bias
 Variance
Workshop scanner data. Rome 1-2 October 2015
Figure 2a. Lowe Indices for elementary aggregate : comparison of universe and different
sub-sets (Coffee -Turin 2014)
Lowe - Coffee
 The mean of the
sample estimates
is perfectly
overlapped to the
“true” value U
115
110
 Cut-off samples
over-estimate but
follow trend
 Most sold items
under-estimates
and alter trend
105
100
Lowe_U
95
Lowe_s80
Lowe_s60
Lowe_1
90
Dec
Lowe_s
Gen
Feb
Mar
Apr
May
Workshop scanner data. Rome 1-2 October 2015
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Figure 2b. Fisher Indices for elementary aggregate : comparison of universe and different subsets (Coffee -Turin 2014)
Fisher - Coffee
 The mean of the
sample estimates is
quite overlapped to
the “true” value U
115
 Cut-off samples overestimate but follow
the trend
110
 Most sold items
under-estimate and
alter trend
105
100
Fish_U
Fish_s80
95
Fish_s60
Fish_1
Fish_s
90
Dec
Gen
Feb
Mar
Apr
May
Workshop scanner data. Rome 1-2 October 2015
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Figure 2c. Jevons Indices for elementary aggregate : comparison of universe and different
sub-sets (Coffee -Turin 2014)
 The mean of the
sample estimates
strongly overestimates the “true”
value U and stresses
trend
Jevons - Coffee
115
110
 Cut-off samples overestimate but follow
quite the trend
 Most sold items
substancially follow
trend and levels
105
100
Jevo_U
Jevo_s80
95
Jevo_s60
Jevo_1
Jevo_s
90
Dec
Gen
Feb
Mar
Apr
May
Workshop scanner data. Rome 1-2 October 2015
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Figure 3. Lowe Indices for elementary aggregate : comparison of universe and different
sub-sets (Coffee and Pasta -Turin 2014)
115
Coffee
115
Pasta
110
110
105
105
100
Lowe_U
Lowe_s80
Lowe_s60 100
Lowe_1
Lowe_s
95
95
90
Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
90
Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
 Pasta: The mean of the sample estimates and cut-off values are overlapped to the “true” value U
 Most sold items over-estimates and alter trend
 Cut-off values and best selling items show opposite performance for the two product: this can be
explained by the different number of items and turnover distributions?
Workshop scanner data. Rome 1-2 October 2015
Figure 4. Fisher and Jevons Indices for elementary aggregate : comparison of universe and
different sub-sets (Pasta -Turin 2014)
Fisher
115
115
110
110
Jevons
Fish_U
Fish_s80 105
105
Fish_s60
Fish_1
100
Fish_s
95
100
95
90
Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
 The mean of the sample estimates is quite
overlapped to the “true” value U
 Cut-off samples slightly under-estimate but
follow the trend
 Most sold items stress trend
Workshop scanner data. Rome 1-2 October 2015
90
Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
 The mean of the sample estimates strongly
over-estimates the “true” value U and
accentuates trend
 Cut-off samples over-estimate but quite follow
trend
 Most sold items substancially follow trend and
levels
Figure 5a. Confidence band for Lowe Indices for elementary aggregate in comparison with universe values,
pps sample (Coffee Turin 2014)
Lowe - Coffee pps f=5%
115
110
Real value
105
pps f=5%
UB
100
LB
95
90
Dec Gen Feb Mar
Apr May Jun
Jul
Aug Sep Oct Nov Dec
Lowe - Coffee pps f=10%
115
110
Real value
105
pps f=10%
UB
100
LB
95
90
Dec Gen Feb
Mar
Apr
May Jun
Workshop scanner data. Rome 1-2 October 2015
Jul
Aug Sep
Oct
Nov Dec
Figure 5b. Confidence band for Jevons Indices for elementary aggregate of Coffee in comparison with of
universe values, pps sample (Turin 2014)
Jevons - Coffee pps 5%
115
110
Real value
105
pps f=5%
100
UB
LB
95
90
Dec Gen
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Jevons - Coffee pps 10%
115
110
Real value
105
pps f=10%
UB
100
LB
95
90
Dec Gen
Feb
Mar
Apr
Workshop scanner data. Rome 1-2 October 2015
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Table 4. Bias and Relative Sampling Error distribution of monthly Lowe, Fisher and
Jevons indices for pps samples of series
Bias
Consumption
Segment
Coffee
Pasta
Lowe Index
Sampling
rate
Fisher Index
Jevons Index
min
max
min
max
min
max
5%
-0.07
0.12
-0.45
0.10
1.87
5.88
10%
-0.02
0.04
-0.17
0.06
0.73
3.51
5%
-0.13
0.03
-0.25
0.23
-2.26
0.03
10%
-0.05
0.03
-0.06
0.08
-2.43
0.12
Sampling relative error (%)
Consumption
Segment
Coffee
Pasta
Sampling
rate
Sample size
5%
Lowe Index
Fisher Index
Jevons Index
min
max
min
max
min
max
190
1.17
1.36
4.73
5.65
0.90
1.20
10%
380
0.65
0.91
2.27
2.79
0.43
0.58
5%
450
1.03
1.29
3.99
4.87
0.91
1.19
10%
900
0.65
0.93
2.81
3.60
0.51
0.70
Workshop scanner data. Rome 1-2 October 2015
The results highlight the following first evidences about the performance
of different series selection schemes
 Cut-off based sample are much less biased with respect to a selection of most
sold items: in general cut-off slightly overestimate while the most sold items
approach underestimate inflation (even inflation vs deflation); anyway the
results depend on the product category
 Probability pps sample produces approximately unbiased estimates for indices
using weights (Lowe and Fisher), though the second one shows a very high
variance.
 Probability srs sample produces approximately unbiased estimates when using
unweighted indices (Jevons)
 Sample scheme is not neutral with respect to index choice
 Increasing the sampling rate produce a remarkable improvement of the bias in
all indices, in addition to an obvious reduction of sampling error
 First replication on other product segments show similar evidence but
depending on the distribution of turnover
Workshop scanner data. Rome 1-2 October 2015
6. Open issues and conclusions

The sample allocation criteria are still under study, both for outlet and
items; analysis of variability will be made

In-depth studies should take into consideration the cycle of life of
items and all related implications (new items, replacements…)

A big issue to deal with is the integration between scanner data and
traditional data for index compilation, different hypothesis are under
evaluation
 Combining indices obtained with different approaches
 Gradually abandon manual collection, at least for food and grocery,
considering the high expenditure coverage of modern distribution
 Aim at define and realise a probability sampling also for traditional
distribution
Workshop scanner data. Rome 1-2 October 2015
Thank you for your attention !
Workshop scanner data. Rome 1-2 October 2015
Scarica

Presentazione di PowerPoint