3.2 At Risk of Poverty Ratio (svyarpr)

✔️ EU standard like ARPT, easy to understand, interpret, and implement
✔️ proportion of individuals below ARPT -- a "companion" function that uses svyarpt() internally
✔️ measure is easy to understand
❌ does not account for the intensity or inequality among the poor
❌ not very common outside of the EU
❌ just another name for the `svyfgt( g = 0 , thresh = "relq" )`

The at-risk-of-poverty rate (ARPR) is the share of persons with an income below the at-risk-of-poverty threshold (ARPT). The logic behind this measure is that although most people below the ARPT cannot be considered “poor”, they are the ones most vulnerable to becoming poor in the event of a negative economic phenomenon like a recession.

The ARPR estimator is a composite estimator, taking into account both the sampling error in the proportion itself and that in the ARPT estimate. The details of the linearization of the ARPR are discussed by Deville (⊕1999Deville, Jean-Claude. 1999. “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.” Survey Methodology 25 (2): 193–203. http://www.statcan.gc.ca/pub/12-001-x/1999002/article/4882-eng.pdf.) and Osier (⊕2009Osier, Guillaume. 2009. “Variance Estimation for Complex Indicators of Poverty and Inequality.” Journal of the European Survey Research Association 3 (3): 167–95. http://ojs.ub.uni-konstanz.de/srm/article/view/369.).

3.2.1 Replication Example

The R vardpoor package (⊕Breidaks, Liberts, and Ivanova 2016Breidaks, Juris, Martins Liberts, and Santa Ivanova. 2016. “Vardpoor: Estimation of Indicators on Social Exclusion and Poverty and Its Linearization, Variance Estimation.” Riga, Latvia: CSB.), created by researchers at the Central Statistical Bureau of Latvia, includes a ARPR coefficient calculation using the ultimate cluster method. The example below reproduces those statistics.

Load and prepare the same data set:

# load the convey package
library(convey)

# load the survey library
library(survey)

# load the vardpoor library
library(vardpoor)

# load the vardpoor library
library(laeken)

# load the synthetic EU statistics on income & living conditions
data(eusilc)

# make all column names lowercase
names(eusilc) <- tolower(names(eusilc))

# add a column with the row number
dati <- data.table::data.table(IDd = 1:nrow(eusilc), eusilc)

# calculate the arpr coefficient
# using the R vardpoor library
varpoord_arpr_calculation <-
  varpoord(
    # analysis variable
    Y = "eqincome",
    
    # weights variable
    w_final = "rb050",
    
    # row number variable
    ID_level1 = "IDd",
    
    # row number variable
    ID_level2 = "IDd",
    
    # strata variable
    H = "db040",
    
    N_h = NULL ,
    
    # clustering variable
    PSU = "rb030",
    
    # data.table
    dataset = dati,
    
    # arpr coefficient function
    type = "linarpr",
    
    # get linearized variable
    outp_lin = TRUE
    
  )


# construct a survey.design
# using our recommended setup
des_eusilc <-
  svydesign(
    ids = ~ rb030 ,
    strata = ~ db040 ,
    weights = ~ rb050 ,
    data = eusilc
  )

# immediately run the convey_prep function on it
des_eusilc <- convey_prep(des_eusilc)

# coefficients do match
varpoord_arpr_calculation$all_result$value

## [1] 14.44422

coef(svyarpr( ~ eqincome , des_eusilc)) * 100

## eqincome 
## 14.44422

# linearized variables do not match
# because Fprime is the derivative wrt
# to the estimated threshold, not the estimated quantile
# for more details, see
# https://github.com/ajdamico/convey/issues/372#issuecomment-1656264143
#
# vardpoor
lin_arpr_varpoord <- varpoord_arpr_calculation$lin_out$lin_arpr
# convey
lin_arpr_convey <- attr(svyarpr( ~ eqincome , des_eusilc), "lin")

# check equality
all.equal(lin_arpr_varpoord, 100 * lin_arpr_convey)

## [1] "Mean relative difference: 0.2264738"

# variances do not match exactly
attr(svyarpr( ~ eqincome , des_eusilc) , 'var') * 10000

##            eqincome
## eqincome 0.07599778

varpoord_arpr_calculation$all_result$var

## [1] 0.08718569

# standard errors do not match exactly
varpoord_arpr_calculation$all_result$se

## [1] 0.2952722

SE(svyarpr( ~ eqincome , des_eusilc)) * 100

##           eqincome
## eqincome 0.2756769

The variance estimator and the linearized variable $z$ are both defined in Linearization-Based Variance Estimation. The functions convey::svyarpr and vardpoor::linarpr produce the same linearized variable $z$ .

However, the measures of uncertainty do not line up. One of the reasons is that library(vardpoor) defaults to an ultimate cluster method that can be replicated with an alternative setup of the survey.design object.

# within each strata, sum up the weights
cluster_sums <-
  aggregate(eusilc$rb050 , list(eusilc$db040) , sum)

# name the within-strata sums of weights the `cluster_sum`
names(cluster_sums) <- c("db040" , "cluster_sum")

# merge this column back onto the data.frame
eusilc <- merge(eusilc , cluster_sums)

# construct a survey.design
# with the fpc using the cluster sum
des_eusilc_ultimate_cluster <-
  svydesign(
    ids = ~ rb030 ,
    strata = ~ db040 ,
    weights = ~ rb050 ,
    data = eusilc ,
    fpc = ~ cluster_sum
  )

# again, immediately run the convey_prep function on the `survey.design`
des_eusilc_ultimate_cluster <-
  convey_prep(des_eusilc_ultimate_cluster)

# does not match
attr(svyarpr( ~ eqincome , des_eusilc_ultimate_cluster) , 'var') * 10000

##            eqincome
## eqincome 0.07586194

varpoord_arpr_calculation$all_result$var

## [1] 0.08718569

# does not match
varpoord_arpr_calculation$all_result$se

## [1] 0.2952722

SE(svyarpr( ~ eqincome , des_eusilc_ultimate_cluster)) * 100

##           eqincome
## eqincome 0.2754305

Still, there is a difference in the estimates. This is discussed in detail in this issue. In order to still provide additional examples for our code, we proceed with a Monte Carlo experiment.

Using the eusilcP data from the simPop package (⊕Templ et al. 2017Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Olivier Dupriez. 2017. “Simulation of Synthetic Complex Data: The R Package simPop.” Journal of Statistical Software 79 (10): 1–38. https://doi.org/10.18637/jss.v079.i10.), we can compute the actual value of the at risk of poverty rate for that population:

# load libraries
library(sampling)

## 
## Attaching package: 'sampling'

## The following objects are masked from 'package:survival':
## 
##     cluster, strata

library(survey)
library(convey)
library(parallel)

# load pseudo population data
data("eusilcP" , package = "simPop")

# compute population median
q50.pop <-
  convey:::computeQuantiles(eusilcP$eqIncome , rep(1 , length(eusilcP$eqIncome)) , .50)

# compute population poverty threshold
# as 60% of the median
thresh.pop <- .60 * q50.pop

# compute population at risk of poverty rate
(theta.pop <-
    mean(eusilcP$eqIncome <= thresh.pop , na.rm = TRUE))

## [1] 0.1469295

Now, to study the distribution of the estimator under a particular sampling design, we select 5000 samples under one-stage cluster sampling of 100 households using the cluster function from the sampling package (⊕Tillé and Matei 2021Tillé, Yves, and Alina Matei. 2021. Sampling: Survey Sampling. https://CRAN.R-project.org/package=sampling.), and use the svyarpr function to estimate the ARPR for each of those samples:

# define the number of monte carlo replicates
mc.rep <- 5000L

# simulation function
arpr_sim_fun <- function(this.iter) {
  
  set.seed(this.iter)
  
  
  library(survey)
  library(convey)
  library(sampling)
  
  # load pseudo population data
  data("eusilcP" , package = "simPop")
  
  # compute size-like variable for PPS sampling design
  eusilcP$aux <-
    log(ifelse(eusilcP$eqIncome >= 1000 , eusilcP$eqIncome , 1000))
  
  
  # select sample
  tt <-
    sampling::cluster(
      data = eusilcP[sample.int(nrow(eusilcP) , nrow(eusilcP) , replace = FALSE) , ] ,
      clustername = "hid" ,
      size = 1000L ,
      method = "systematic" ,
      pik = eusilcP$aux
    )
  
  # collect data
  this.sample <- getdata(eusilcP , tt)
  
  # create survey design object
  this.desobj <-
    svydesign(
      ids = ~ hid ,
      probs = ~ Prob ,
      data = this.sample ,
      nest = FALSE
    )
  
  # prepare for convey functions
  this.desobj <- convey_prep(this.desobj)
  
  # compute estimates
  svyarpr( ~ eqIncome , this.desobj)
  
}

# run replications
cl <- makeCluster(detectCores() - 1)

arpr.estimate.list <-
  clusterApply(cl, seq_len(mc.rep) , arpr_sim_fun)

stopCluster(cl)

Then, we evaluate the Percentage Relative Bias (PRB) of the ARPR estimator. Under this scenario, the PRB of the ARPR estimator is -0.19830%.

# estimate the expected value of the ARPR estimator
# using the average of the estimates
(theta.exp <- mean(sapply(arpr.estimate.list , coef)))

## [1] 0.1466381

# estimate the percentage relative bias
(percentage_relative_bias_arpr <- 100 * (theta.exp / theta.pop - 1))

## [1] -0.1982981

stopifnot(round(percentage_relative_bias_arpr , 4) == -0.19830)

For the variance estimator, we have:

# estimate the variance of the ARPR estimator
# using the empirical variance of the estimates
(vartheta.popest <- var(sapply(arpr.estimate.list , coef)))

## [1] 0.0001048728

# estimate the expected value of the ARPR variance estimator
# using the (estimated) expected value of the variance estimates
(vartheta.exp <- mean(sapply(arpr.estimate.list , vcov)))

## [1] 0.0001081994

# estimate the percentage relative bias of the variance estimator
( percentage_relative_bias_variance <- 100 *  (vartheta.exp / vartheta.popest - 1) )

## [1] 3.172023

stopifnot( round( percentage_relative_bias_variance , 4 ) == 3.1720 )

Under this scenario, the PRB of the ARPR variance estimator is 3.1720%.

Our simulation shows that the Bias Ratio of this estimator is approximately 2%:

# compute bias ratio
100 * abs( theta.exp - theta.pop ) / sqrt( vartheta.popest )

## [1] 2.84509

if the normal approximation holds, a small bias ratio still allows for approximately valid estimates of the confidence intervals.

Next, we evaluate the Percentage Coverage Rate (PCR). In theory, under repeated sampling, the estimated 95% CIs should cover the population parameter approximately 95% of the time. We can evaluate that using:

# estimate confidence intervals of the ARPR
# for each of the samples
est.coverage <-
  sapply(arpr.estimate.list, function(this.stat)
    confint(this.stat)[, 1] <= theta.pop &
      confint(this.stat)[, 2] >= theta.pop)

# evaluate empirical coverage
(empirical_coverage <- mean(est.coverage))

## [1] 0.9516

stopifnot(round(empirical_coverage , 2) == 0.95)

Our coverages are not too far from the nominal coverage level of 95%.

For additional usage examples of svyarpr, type ?convey::svyarpr in the R console.

3.2.2 Real World Examples

This section displays example results using nationally-representative surveys from both the United States and Brazil. We present a variety of surveys, levels of analysis, and subpopulation breakouts to provide users with points of reference for the range of plausible values of the svyarpr function.

To understand the construction of each survey design object and respective variables of interest, please refer to section 1.4 for CPS-ASEC, section 1.5 for PNAD Contínua, and section 1.6 for SCF.

3.2.2.1 CPS-ASEC Household Income

svyarpr(~ htotval , cps_household_design)

##            arpr     SE
## htotval 0.30477 0.0019

svyby(~ htotval , ~ sex , cps_household_design , svyarpr)

##           sex   htotval  se.htotval
## male     male 0.2569898 0.002699161
## female female 0.3530735 0.002753344

3.2.2.2 CPS-ASEC Family Income

svyarpr(~ ftotval , cps_family_design)

##            arpr     SE
## ftotval 0.23395 0.0024

svyby(~ ftotval , ~ sex , cps_family_design , svyarpr)

##           sex   ftotval  se.ftotval
## male     male 0.1897022 0.003094091
## female female 0.2808015 0.003609486

3.2.2.3 CPS-ASEC Worker Earnings

svyarpr(~ pearnval , cps_ftfy_worker_design)

##             arpr     SE
## pearnval 0.22889 0.0063

svyby(~ pearnval , ~ sex , cps_ftfy_worker_design , svyarpr)

##           sex  pearnval se.pearnval
## male     male 0.1957581 0.005582147
## female female 0.2719207 0.007937258

3.2.2.4 PNAD Contínua Per Capita Income

svyarpr( ~ deflated_per_capita_income , pnadc_design , na.rm = TRUE)

##                               arpr     SE
## deflated_per_capita_income 0.27695 0.0035

svyby(~ deflated_per_capita_income ,
      ~ sex ,
      pnadc_design ,
      svyarpr ,
      na.rm = TRUE)

##           sex deflated_per_capita_income se.deflated_per_capita_income
## male     male                  0.2705489                   0.003702496
## female female                  0.2830603                   0.003569612

3.2.2.5 PNAD Contínua Worker Earnings

svyarpr( ~ deflated_labor_income , pnadc_design , na.rm = TRUE)

##                          arpr     SE
## deflated_labor_income 0.15562 0.0015

svyby( ~ deflated_labor_income , ~ sex , pnadc_design , svyarpr , na.rm = TRUE)

##           sex deflated_labor_income se.deflated_labor_income
## male     male             0.1290385              0.001682285
## female female             0.1917583              0.002170694

3.2.2.6 SCF Family Net Worth

scf_MIcombine(with(scf_design , svyarpr( ~ networth)))

## Multiple imputation results:
##       with(scf_design, svyarpr(~networth))
##       scf_MIcombine(with(scf_design, svyarpr(~networth)))
##            results          se
## networth 0.4070338 0.005282615

scf_MIcombine(with(scf_design , svyby( ~ networth, ~ hhsex , svyarpr)))

## Multiple imputation results:
##       with(scf_design, svyby(~networth, ~hhsex, svyarpr))
##       scf_MIcombine(with(scf_design, svyby(~networth, ~hhsex, svyarpr)))
##          results          se
## male   0.3430431 0.008215832
## female 0.5726547 0.015545485

3.2.2.7 SCF Family Income

scf_MIcombine(with(scf_design , svyarpr( ~ income)))

## Multiple imputation results:
##       with(scf_design, svyarpr(~income))
##       scf_MIcombine(with(scf_design, svyarpr(~income)))
##         results          se
## income 0.297591 0.007471504

scf_MIcombine(with(scf_design , svyby( ~ income, ~ hhsex , svyarpr)))

## Multiple imputation results:
##       with(scf_design, svyby(~income, ~hhsex, svyarpr))
##       scf_MIcombine(with(scf_design, svyby(~income, ~hhsex, svyarpr)))
##          results          se
## male   0.2092876 0.007433473
## female 0.5261313 0.017859170