Sampling design issues in Italian experience on scanner data and the possible integration with microdata coming from traditional data collection Claudia De Vitiis In collaboration with: C. Casciano, N. Cibella, A. Guandalini, F. Inglese, G. Seri, M. Terribili, F. Tiero ISTAT - ITALY Workshop scanner data. Rome 1-2 October 2015 Summary 1. Aims of the presentation 2. The new general sampling design for CPI 3. The context of the first experiments of sampling from Scanner Data 4. Selection of elementary items from Scanner Data 5. First results 6. Open issues and conclusions Workshop scanner data. Rome 1-2 October 2015 1. Aims of the presentation Scanner data is a big opportunity to introduce improvements in the CPI compiling not only for the data collection but also with regards the sampling perspective This presentation focuses on a comparison among indices of elementary aggregates compiled using different sub-sets of series obtained through different “selection schemes” Elementary Index Bias Elementary Index Sampling Variance First experiment on a small data set: One province, some consumption segments (Italian COICOP6), permanent series Workshop scanner data. Rome 1-2 October 2015 2. The new general sampling design for CPI The current sample strategy of the CPI survey (at territorial level) Three purposive sampling stages: – The first stage units are the chief towns of provinces (established by law) – The second stage units are the outlets, purposively chosen in each PSU to be representative of the consumer behaviour – The most sold items of a fixed basked of products are observed in the sample outlets The elementary indices are obtained at municipality level by unweighted geometric mean The general index is calculated by subsequent aggregation of elementary indices, using weights at different levels based on population and national account data on consumer expenditure Workshop scanner data. Rome 1-2 October 2015 2. The new general sampling design for CPI A working group established at ISTAT is developing a probability sample strategy Based on the hypothesis that turnover is a good proxy of final household consumption (expenditure) Outlets and items are selected using probabilities proportional to the turnover Selection list for the outlets (local units) is obtained from business register, ASIA-UL ASIA-PV The archive contains information useful for the selection and the definition of the inclusion probability For the item level different approaches will coexist at the beginning: Scanner data for food and grocery in the modern distribution allow the use of sampling methods and index compilation using weights from quantities (or expenditures) For traditional distribution and the other sectors, data collection and index compilation will continue unchanged at first Workshop scanner data. Rome 1-2 October 2015 3. The context of the first experiments of sampling from Scanner Data Among the first analyses on the scanner data universe we carried out some tentative experiments of sample selection of series Data referred to the first Italian provinces for which ISTAT got data, for year 2014 Weekly data on turnover and quantities per EAN-code (GTIN) and outlet allow obtaining weekly unit values One series is considered present in a specific month if it has associated a non zero turnover in at least one of the three central weeks of the month The first issue we analysed is the continuity of series (=EAN by outlet) and the coverage of a panel of permanent series Following figures show examples of the coverage rate of a fixed basket of series taken in December 2013 during 2014 months Workshop scanner data. Rome 1-2 October 2015 Figure 1a. Presence of series fixed in Dec2013 in 2014 months - Coverage of single series and total turnover (all products, Turin province) 100.00 93.32 90.00 85.19 91.68 83.02 89.36 85.41 81.62 80.00 81.11 79.32 77.54 83.16 76.23 82.41 73.89 81.77 81.33 71.94 82.75 82.29 80.49 73.46 73.58 73.53 74.22 sep oct nov dec 70.00 60.00 50.00 40.00 30.00 20.00 10.00 0.00 jan feb mar apr may jun Coverage of items (%) Workshop scanner data. Rome 1-2 October 2015 jul ago Total sales coverage (%) Figure 1b. Presence of series fixed in Dec2013 in 2014 months - Coverage of single series and total turnover (Coicop 6 digits - Coffee segment, Turin province) 100.00 94.63 91.56 96.02 95.97 90.23 90.00 92.89 89.14 87.78 90.81 87.12 92.58 86.38 90.73 85.05 92.00 83.09 91.75 88.14 85.02 89.07 84.50 85.32 85.26 86.10 80.00 70.00 60.00 50.00 40.00 30.00 20.00 10.00 0.00 jan feb mar apr may jun Coverage of items (%) Workshop scanner data. Rome 1-2 October 2015 jul ago Total sales coverage (%) sep oct nov dec 3. The context of the first experiments of sampling from SD Very first exercise on permanent series The permanent series are defined as having non-zero turnover for at least one relevant week (one of the first three full weeks) every month for 13 months (Dec 2013-Dec 2014) After having verified that permanent series provide a good coverage of the total turnover For these first analyses we focus on this PANEL, postponing the issues related to the item life cycle, replacement of missing items and seasonality Our reference universe for the first experiments is constituted only of permanent series and indices are evaluated on this sub-set of series for each consumption segment Workshop scanner data. Rome 1-2 October 2015 Table 1. Total turnover for all series, relevant week series and panel series, five Italian provinces (2014) TURNOVER Province % COVERAGE NUMBER OF PANEL SERIES All Series (A) Relevant weeks all series (B) Relevant weeks Panel Series (C) B/A C/B Turin 81.250.067 56.074.338 46.175.718 69,0 82,4 40.234 Ancona 22.847.504 15.988.337 12.487.957 70,0 78,1 16.516 Cagliari 18.308.833 12.726.186 9.598.138 69,5 75,4 9.165 Palermo 18.374.304 12.711.236 8.531.003 69,2 67,1 9.375 Piacenza 11.139.258 7.771.635 6.727.649 69,8 86,6 6.737 Workshop scanner data. Rome 1-2 October 2015 3. The context of the first experiments of sampling from SD For the outlets of Turin for which we have data, we focus on three consumption segments (Coicop 6 digits): Coffee (01.2.1.1.0) Pasta (01.1.1.6.1) Mineral water (01.2.2.1.0) Table 2. Total turnover for all series, relevant week series and panel series, 3 segments in Turin (2014) TURNOVER Consumption segment All Series (A) Relevant weeks all series (B) % COVERAGE Relevant weeks Panel Series (C) B/A C/B NUMBER OF PANEL SERIES Coffee 28.622.978 19.665.517 15.692.414 68,7 79,8 9.608 Pasta 26.192.517 17.902.061 13.631.744 68,4 76,2 23.636 Mineral water 26.434.572 18.506.760 16.851.559 70,0 91,1 6.990 Workshop scanner data. Rome 1-2 October 2015 4. The selection of elementary items from SD Comparison of probabilistic and non-probabilistic sampling selection schemes for different aggregation index formula for elementary aggregates Cut-off selection of series based on thresholds of covered turnover: the index is compiled using all series covering 60% or 80% of all turnover in each outlet for the consumption segment, in previous year 2013 Probability sampling: pps (size= previous year turnover) for two sampling rates (5% and 10%), selection of 500 samples Reference method: most sold items in each outlet for representative products (current fixed basket approach) Workshop scanner data. Rome 1-2 October 2015 Table 3. Percentage and average number of items per outlet covering 60 and 80% of turnover, 3 segments in Turin (2014) Percentage of Series Consumption segment Turnover threshold 60% Turnover threshold 80% Average Number of Items per Outlet Total Covering 60% of turnover Covering 80% of turnover Coffee 16,2 36,1 46 8 17 Pasta 23,4 44,8 114 27 51 Mineral water 12,1 26,3 34 4 9 Workshop scanner data. Rome 1-2 October 2015 4. The selection of elementary items from SD Sample series are selected from a sample of outlets: 30 out of 127 of outlets of retail trade modern distribution in Turin province Outlet are selected by stratified SRS sampling with allocation proportional to turnover of strata (6 chain by 2 types of outlet) For each sample we compiled the elementary fixed base indices for 12 months with three classical aggregation formulas: Jevons (unweighted), Fisher (ideal) and Lowe (weights from quantities of previous year) For the sample selection and weighting of indices we refer to total annual turnover Comparison of each estimate with the corresponding universe value, evaluated on the complete set of panel series Workshop scanner data. Rome 1-2 October 2015 5. First results For each aggregation formula comparison of values obtained on the different subset of series Bias Variance Workshop scanner data. Rome 1-2 October 2015 Figure 2a. Lowe Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee -Turin 2014) Lowe - Coffee The mean of the sample estimates is perfectly overlapped to the “true” value U 115 110 Cut-off samples over-estimate but follow trend Most sold items under-estimates and alter trend 105 100 Lowe_U 95 Lowe_s80 Lowe_s60 Lowe_1 90 Dec Lowe_s Gen Feb Mar Apr May Workshop scanner data. Rome 1-2 October 2015 Jun Jul Aug Sep Oct Nov Dec Figure 2b. Fisher Indices for elementary aggregate : comparison of universe and different subsets (Coffee -Turin 2014) Fisher - Coffee The mean of the sample estimates is quite overlapped to the “true” value U 115 Cut-off samples overestimate but follow the trend 110 Most sold items under-estimate and alter trend 105 100 Fish_U Fish_s80 95 Fish_s60 Fish_1 Fish_s 90 Dec Gen Feb Mar Apr May Workshop scanner data. Rome 1-2 October 2015 Jun Jul Aug Sep Oct Nov Dec Figure 2c. Jevons Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee -Turin 2014) The mean of the sample estimates strongly overestimates the “true” value U and stresses trend Jevons - Coffee 115 110 Cut-off samples overestimate but follow quite the trend Most sold items substancially follow trend and levels 105 100 Jevo_U Jevo_s80 95 Jevo_s60 Jevo_1 Jevo_s 90 Dec Gen Feb Mar Apr May Workshop scanner data. Rome 1-2 October 2015 Jun Jul Aug Sep Oct Nov Dec Figure 3. Lowe Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee and Pasta -Turin 2014) 115 Coffee 115 Pasta 110 110 105 105 100 Lowe_U Lowe_s80 Lowe_s60 100 Lowe_1 Lowe_s 95 95 90 Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 90 Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Pasta: The mean of the sample estimates and cut-off values are overlapped to the “true” value U Most sold items over-estimates and alter trend Cut-off values and best selling items show opposite performance for the two product: this can be explained by the different number of items and turnover distributions? Workshop scanner data. Rome 1-2 October 2015 Figure 4. Fisher and Jevons Indices for elementary aggregate : comparison of universe and different sub-sets (Pasta -Turin 2014) Fisher 115 115 110 110 Jevons Fish_U Fish_s80 105 105 Fish_s60 Fish_1 100 Fish_s 95 100 95 90 Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec The mean of the sample estimates is quite overlapped to the “true” value U Cut-off samples slightly under-estimate but follow the trend Most sold items stress trend Workshop scanner data. Rome 1-2 October 2015 90 Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec The mean of the sample estimates strongly over-estimates the “true” value U and accentuates trend Cut-off samples over-estimate but quite follow trend Most sold items substancially follow trend and levels Figure 5a. Confidence band for Lowe Indices for elementary aggregate in comparison with universe values, pps sample (Coffee Turin 2014) Lowe - Coffee pps f=5% 115 110 Real value 105 pps f=5% UB 100 LB 95 90 Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Lowe - Coffee pps f=10% 115 110 Real value 105 pps f=10% UB 100 LB 95 90 Dec Gen Feb Mar Apr May Jun Workshop scanner data. Rome 1-2 October 2015 Jul Aug Sep Oct Nov Dec Figure 5b. Confidence band for Jevons Indices for elementary aggregate of Coffee in comparison with of universe values, pps sample (Turin 2014) Jevons - Coffee pps 5% 115 110 Real value 105 pps f=5% 100 UB LB 95 90 Dec Gen Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jevons - Coffee pps 10% 115 110 Real value 105 pps f=10% UB 100 LB 95 90 Dec Gen Feb Mar Apr Workshop scanner data. Rome 1-2 October 2015 May Jun Jul Aug Sep Oct Nov Dec Table 4. Bias and Relative Sampling Error distribution of monthly Lowe, Fisher and Jevons indices for pps samples of series Bias Consumption Segment Coffee Pasta Lowe Index Sampling rate Fisher Index Jevons Index min max min max min max 5% -0.07 0.12 -0.45 0.10 1.87 5.88 10% -0.02 0.04 -0.17 0.06 0.73 3.51 5% -0.13 0.03 -0.25 0.23 -2.26 0.03 10% -0.05 0.03 -0.06 0.08 -2.43 0.12 Sampling relative error (%) Consumption Segment Coffee Pasta Sampling rate Sample size 5% Lowe Index Fisher Index Jevons Index min max min max min max 190 1.17 1.36 4.73 5.65 0.90 1.20 10% 380 0.65 0.91 2.27 2.79 0.43 0.58 5% 450 1.03 1.29 3.99 4.87 0.91 1.19 10% 900 0.65 0.93 2.81 3.60 0.51 0.70 Workshop scanner data. Rome 1-2 October 2015 The results highlight the following first evidences about the performance of different series selection schemes Cut-off based sample are much less biased with respect to a selection of most sold items: in general cut-off slightly overestimate while the most sold items approach underestimate inflation (even inflation vs deflation); anyway the results depend on the product category Probability pps sample produces approximately unbiased estimates for indices using weights (Lowe and Fisher), though the second one shows a very high variance. Probability srs sample produces approximately unbiased estimates when using unweighted indices (Jevons) Sample scheme is not neutral with respect to index choice Increasing the sampling rate produce a remarkable improvement of the bias in all indices, in addition to an obvious reduction of sampling error First replication on other product segments show similar evidence but depending on the distribution of turnover Workshop scanner data. Rome 1-2 October 2015 6. Open issues and conclusions The sample allocation criteria are still under study, both for outlet and items; analysis of variability will be made In-depth studies should take into consideration the cycle of life of items and all related implications (new items, replacements…) A big issue to deal with is the integration between scanner data and traditional data for index compilation, different hypothesis are under evaluation Combining indices obtained with different approaches Gradually abandon manual collection, at least for food and grocery, considering the high expenditure coverage of modern distribution Aim at define and realise a probability sampling also for traditional distribution Workshop scanner data. Rome 1-2 October 2015 Thank you for your attention ! Workshop scanner data. Rome 1-2 October 2015