4.7 Zenga index (svyzenga) | Poverty and Inequality with Complex Survey Data

Proposed by Zenga (⊕2007Zenga, Michele. 2007. “Inequality Curve and Inequality Index Based on the Ratios Between Lower and Upper Arithmetic Means.” Statistica e Applicazioni 1 (4): 3–27.), the Zenga index is another inequality measure related to the Lorenz curve with interesting properties. For continuous populations, it is defined as:

The intuition behind the Zenga index is based on argument similar to the Quintile Share Ratio or the Palma index. While the QSR and Palma indices are based on ratios of lower and upper shares, the Zenga scores (but not the measure itself!) are the ratios of lower and upper means. The Zenga index is the average of those scores.

For a given percentile $p$, when the mean income of the poorest is not much smaller than the mean income of the richest, the ratio is close to one, resulting in $Z(p)$ close to zero. However, when the lower mean is much smaller than the upper mean, the ratio goes to 0, resulting in a $Z(p)$ close to one.

Take the mean income of the poorest 20% of a population; this is the lower mean. Its “complement” is the mean income of the richest 80%; this is the upper mean. If the lower mean is $100 and the upper mean is $1,000, then the respective point on the Zenga curve would associate the fraction $20$% to $1 - ( \$100 / \$1,000 ) = 0.9$. In other words, $Z(20\%) = 90\%$.

In this case, we took a fixed $p = 20$%; however, the full Zenga curve would be computed varying $p$ from $0$% to $100$%. Then, the Zenga index is the integral of this curve in the interval $(0,1)$. In other words, the Zenga index is the mean of the Y axis of the Zenga coordinates/scores. Unlike the relationship between the Gini coefficient and the Lorenz curve, this allows for a more straightforward graphical interpretation (see Figure 1 from Langel and Tillé (⊕2012Langel, Matti, and Yves Tillé. 2012. “Inference by Linearization for Zenga’s New Inequality Index: A Comparison with the Gini Index.” Metrika 75 (8): 1093–1110. https://doi.org/10.1007/s00184-011-0369-1.)) of the Zenga index.

In practice, convey uses the Barabesi, Diana, and Perri (⊕2016Barabesi, Lucio, Giancarlo Diana, and Pier Francesco Perri. 2016. “Linearization of Inequality Indices in the Design-Based Framework.” Statistics 50 (5): 1161–72. https://doi.org/10.1080/02331888.2015.1135924.) estimators, based on the finite population Zenga index and using Deville’s (⊕1999Deville, Jean-Claude. 1999. “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.” Survey Methodology 25 (2): 193–203. http://www.statcan.gc.ca/pub/12-001-x/1999002/article/4882-eng.pdf.) influence function method for variance estimation. Alternatively, Langel and Tillé (⊕2012Langel, Matti, and Yves Tillé. 2012. “Inference by Linearization for Zenga’s New Inequality Index: A Comparison with the Gini Index.” Metrika 75 (8): 1093–1110. https://doi.org/10.1007/s00184-011-0369-1.) uses an estimator based on smoothed quantiles and a variance estimator based on the Demnati and Rao (⊕2004Demnati, A., and J. N. K. Rao. 2004. “Linearization Variance Estimators for Survey Data.” Survey Methodology 30 (1): 17–26. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2004001/article/6991-eng.pdf.) linearization method. However, Barabesi, Diana, and Perri (⊕2016Barabesi, Lucio, Giancarlo Diana, and Pier Francesco Perri. 2016. “Linearization of Inequality Indices in the Design-Based Framework.” Statistics 50 (5): 1161–72. https://doi.org/10.1080/02331888.2015.1135924.) finds that both estimators behave very similarly.

According to Langel and Tillé (⊕2012Langel, Matti, and Yves Tillé. 2012. “Inference by Linearization for Zenga’s New Inequality Index: A Comparison with the Gini Index.” Metrika 75 (8): 1093–1110. https://doi.org/10.1007/s00184-011-0369-1.), “the Zenga index is significantly less affected by extreme observations than the Gini index. This is a very important advantage of the Zenga index because inference from income data is often confronted with extreme values”.

4.7.1 Replication Example

While it is not possible to reproduce the exact results because their simulation includes random sampling, we can replicate a Monte Carlo experiment similar to the one presented by Barabesi, Diana, and Perri (⊕2016Barabesi, Lucio, Giancarlo Diana, and Pier Francesco Perri. 2016. “Linearization of Inequality Indices in the Design-Based Framework.” Statistics 50 (5): 1161–72. https://doi.org/10.1080/02331888.2015.1135924.) in Table 2.

Load and prepare the data set:

library(convey)
library(survey)
library(parallel)

# create a temporary file on the local disk
tf <- tempfile()

# store the location of the zip file
data_zip <-
  "https://www.bancaditalia.it/statistiche/tematiche/indagini-famiglie-imprese/bilanci-famiglie/distribuzione-microdati/documenti/ind12_ascii.zip"

# download to the temporary file
download.file(data_zip , tf , mode = 'wb')

# unzip the contents of the archive to a temporary directory
td <- file.path(tempdir() , "shiw2012")
data_files <- unzip(tf , exdir = td)

# load the household, consumption and income from SHIW (2012) survey data
hhr.df <- read.csv(grep("carcom12" , data_files , value = TRUE))
cns.df <- read.csv(grep("risfam12" , data_files , value = TRUE))
inc.df <- read.csv(grep("rper12" , data_files , value = TRUE))

# fixes names
colnames(cns.df)[1] <- tolower(colnames(cns.df))[1]

# household count should be 8151
stopifnot(sum(!duplicated(hhr.df$nquest)) == 8151)

# person count should be 20022
stopifnot(sum(!duplicated(paste0(
  hhr.df$nquest , "-" , hhr.df$nord
))) == 20022)

# income recipients: should be 12986
stopifnot(sum(hhr.df$PERC) == 12986)

# combine household roster and income datasets
pop.df <-
  merge(
    hhr.df ,
    inc.df ,
    by = c("nquest" , "nord") ,
    all.x = TRUE ,
    sort = FALSE
  )

# compute household-level income
hinc.df <-
  pop.df[pop.df$PERC == 1 ,
         c("nquest" , "nord" , "PERC" , "Y" , "YL" , "YM" , "YT" , "YC") ,
         drop = FALSE]

hinc.df <-
  aggregate(rowSums(hinc.df[, c("YL"  , "YM" , "YT") , drop = FALSE]) ,
            list("nquest" = hinc.df$nquest) ,
            sum ,
            na.rm = TRUE)

pop.df <-
  merge(pop.df ,
        hinc.df ,
        by = "nquest" ,
        all.x = TRUE ,
        sort = FALSE)

# combine with consumption data
pop.df <-
  merge(pop.df ,
        cns.df ,
        by = "nquest" ,
        all.x = TRUE ,
        sort = FALSE)

# treat household-level sample as the finite population
pop.df <-
  pop.df[!duplicated(pop.df$nquest) , , drop = FALSE]

For the Monte Carlo experiment, we take 5000 samples of (expected) size 1000 using Poisson sampling1111 but not Conditional Poisson Sampling.:

# set up monte carlo attributes
mc.rep <- 5000L
n.size = 1000L
N.size = nrow(pop.df)

# calculate first order probabilities
pop.df$pik <-
  n.size * (abs(pop.df$C) / sum(abs(pop.df$C)))

# calculate second order probabilities
pikl <- pop.df$pik %*% t(pop.df$pik)
diag(pikl) <- pop.df$pik

# set random number seed
set.seed(1997)

# create list of survey design objects
# using poisson sampling
survey.list <-
  lapply(seq_len(mc.rep) ,
         function(this.rep) {
           smp.ind <- which(runif(N.size) <= pop.df$pik)
           smp.df <- pop.df[smp.ind , , drop = FALSE]
           svydesign(
             ids = ~ nquest ,
             data = smp.df ,
             fpc = ~ pik ,
             nest = FALSE ,
             pps = ppsmat(pikl[smp.ind , smp.ind]) ,
             variance = "HT"
           )
         })

… and estimate the Zenga index using each sample:

# estimate zenga index for each sample
zenga.estimate.list <-
  lapply(survey.list ,
         function(this.sample) {
           svyzenga( ~ x ,
                     subset(this.sample , x > 0) ,
                     na.rm = TRUE)
         })

Then, we evaluate the Percentage Relative Bias (PRB) of the Zenga index estimator. Under this scenario, the PRB of the Zenga index estimator is 0.3397%, a result similar to the 0.321 shown in Table 2.

# compute the (finite population) Zenga index parameter
theta.pop <-
  convey:::CalcZenga(pop.df$x , ifelse(pop.df$x > 0 , 1 , 0))

# estimate the expected value of the Zenga index estimator
# using the average of the estimates
theta.exp <- mean(sapply(zenga.estimate.list , coef))

# estimate the percentage relative bias
100 * (theta.exp / theta.pop - 1)

## [1] 0.3396837

Then, we evaluate the PRB of the variance estimator of the Zenga index estimator. Under this scenario, the PRB of the Zenga index variance estimator is -0.4954%, another result similar to the -0.600 shown in Table 2.

# estimate the variance of the Zenga index estimator
# using the variance of the estimates
(vartheta.popest <- var(sapply(zenga.estimate.list , coef)))

## [1] 9.47244e-05

# estimate the expected value of the Zenga index variance estimator
# using the expected of the variance estimates
(vartheta.exp <- mean(sapply(zenga.estimate.list , vcov)))

## [1] 9.425515e-05

# estimate the percentage relative bias
100 * (vartheta.exp / vartheta.popest - 1)

## [1] -0.4953863

Next, we evaluate the Percentage Coverage Rate (PCR). In theory, under repeated sampling, the estimated 95% CIs should cover the population parameter 95% of the time. We can evaluate that using:

# estimate confidence intervals of the Zenga index
# for each of the samples
est.cis <-
  t(sapply(zenga.estimate.list, function(this.stat)
    c(confint(this.stat))))

# evaluate
prop.table(table((theta.pop > est.cis[, 1]) &
                   (theta.pop < est.cis[, 2])))

## 
##  FALSE   TRUE 
## 0.0548 0.9452

Our estimated 95% CIs cover the parameter 94.52% of the time, which is very close to the nominal rate and also similar to the 94.5 PCR shown in Table 2.

For additional usage examples of svyzenga, type ?convey::svyzenga in the R console.

4.7.2 Real World Examples

This section displays example results using nationally-representative surveys from both the United States and Brazil. We present a variety of surveys, levels of analysis, and subpopulation breakouts to provide users with points of reference for the range of plausible values of the svyzenga function.

To understand the construction of each survey design object and respective variables of interest, please refer to section 1.4 for CPS-ASEC, section 1.5 for PNAD Contínua, and section 1.6 for SCF.

4.7.2.1 CPS-ASEC Household Income

svyzenga( ~ htotval , subset(cps_household_design, htotval >= 0))

##           zenga     SE
## htotval 0.82849 0.0013

svyby( ~ htotval ,
       ~ sex ,
       subset(cps_household_design , htotval >= 0) ,
       svyzenga)

##           sex   htotval  se.htotval
## male     male 0.8162638 0.002082333
## female female 0.8369836 0.001590366

4.7.2.2 CPS-ASEC Family Income

svyzenga( ~ ftotval , subset(cps_family_design, ftotval >= 0))

##          zenga     SE
## ftotval 0.8037 0.0016

svyby( ~ ftotval ,
       ~ sex ,
       subset(cps_family_design , ftotval >= 0) ,
       svyzenga)

##           sex   ftotval  se.ftotval
## male     male 0.7859458 0.002818295
## female female 0.8183288 0.001978021

4.7.2.3 CPS-ASEC Worker Earnings

svyzenga( ~ pearnval , subset(cps_ftfy_worker_design, pearnval >= 0))

##            zenga     SE
## pearnval 0.74189 0.0021

svyby( ~ pearnval ,
       ~ sex ,
       subset(cps_ftfy_worker_design , pearnval >= 0),
       svyzenga)

##           sex  pearnval se.pearnval
## male     male 0.7478463 0.002621419
## female female 0.7239070 0.003807160

4.7.2.4 PNAD Contínua Per Capita Income

svyzenga(
  ~ deflated_per_capita_income ,
  subset(pnadc_design , deflated_per_capita_income >= 0) ,
  na.rm = TRUE
)

##                             zenga     SE
## deflated_per_capita_income 0.8375 0.0018

svyby(
  ~ deflated_per_capita_income ,
  ~ sex ,
  subset(pnadc_design , deflated_per_capita_income >= 0) ,
  svyzenga ,
  na.rm = TRUE
)

##           sex deflated_per_capita_income se.deflated_per_capita_income
## male     male                  0.8382859                   0.001975951
## female female                  0.8366108                   0.001893388

4.7.2.5 PNAD Contínua Worker Earnings

svyzenga(~ deflated_labor_income ,
         subset(pnadc_design , deflated_labor_income >= 0) ,
         na.rm = TRUE)

##                         zenga     SE
## deflated_labor_income 0.80485 0.0023

svyby(
  ~ deflated_labor_income ,
  ~ sex ,
  subset(pnadc_design , deflated_labor_income >= 0) ,
  svyzenga ,
  na.rm = TRUE
)

##           sex deflated_labor_income se.deflated_labor_income
## male     male             0.8053165              0.002638859
## female female             0.7977516              0.002651340

4.7.2.6 SCF Family Net Worth

scf_MIcombine(with(subset(scf_design , networth >= 0) , svyzenga(~ networth)))

## Warning in subset.svyimputationList(scf_design, networth >= 0): subset differed
## between imputations

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, networth >= 0), svyzenga(~networth)))
##            results           se
## networth 0.9725735 0.0008273708

scf_MIcombine(with(
  subset(scf_design , networth >= 0) ,
  svyby(~ networth, ~ hhsex , svyzenga)
))

## Warning in subset.svyimputationList(scf_design, networth >= 0): subset differed
## between imputations

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, networth >= 0), svyby(~networth, 
##     ~hhsex, svyzenga)))
##          results          se
## male   0.9705150 0.001063789
## female 0.9641109 0.003379364

4.7.2.7 SCF Family Income

scf_MIcombine(with(subset(scf_design , income >= 0) , svyzenga(~ income)))

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, income >= 0), svyzenga(~income)))
##          results          se
## income 0.8793776 0.004642581

scf_MIcombine(with(
  subset(scf_design , income >= 0) ,
  svyby(~ income, ~ hhsex , svyzenga)
))

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, income >= 0), svyby(~income, 
##     ~hhsex, svyzenga)))
##          results          se
## male   0.8738066 0.005360720
## female 0.7942315 0.008618038