✔️ EU standard like ARPT, easy to understand, interpret, and implement
✔️ proportion of individuals below ARPT -- a "companion" function that uses svyarpt() internally
✔️ measure is easy to understand
❌ does not account for the intensity or inequality among the poor
❌ not very common outside of the EU
❌ just another name for the `svyfgt( g = 0 , thresh = "relq" )`
The at-risk-of-poverty rate (ARPR) is the share of persons with an income below the at-risk-of-poverty threshold (ARPT). The logic behind this measure is that although most people below the ARPT cannot be considered “poor”, they are the ones most vulnerable to becoming poor in the event of a negative economic phenomenon like a recession.
The ARPR estimator is a composite estimator, taking into account both the sampling error in the proportion itself and that in the ARPT estimate. The details of the linearization of the ARPR are discussed by Deville (1999Deville, Jean-Claude. 1999. “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.” Survey Methodology 25 (2): 193–203. http://www.statcan.gc.ca/pub/12-001-x/1999002/article/4882-eng.pdf.) and Osier (2009Osier, Guillaume. 2009. “Variance Estimation for Complex Indicators of Poverty and Inequality.” Journal of the European Survey Research Association 3 (3): 167–95. http://ojs.ub.uni-konstanz.de/srm/article/view/369.).
The R
vardpoor
package (Breidaks, Liberts, and Ivanova 2016Breidaks, Juris, Martins Liberts, and Santa Ivanova. 2016. “Vardpoor: Estimation of Indicators on Social Exclusion and Poverty and Its Linearization, Variance Estimation.” Riga, Latvia: CSB.), created by researchers at the Central Statistical Bureau of Latvia, includes a ARPR coefficient calculation using the ultimate cluster method. The example below reproduces those statistics.
Load and prepare the same data set:
# load the convey package
library(convey)
# load the survey library
library(survey)
# load the vardpoor library
library(vardpoor)
# load the vardpoor library
library(laeken)
# load the synthetic EU statistics on income & living conditions
data(eusilc)
# make all column names lowercase
names(eusilc) <- tolower(names(eusilc))
# add a column with the row number
dati <- data.table::data.table(IDd = 1:nrow(eusilc), eusilc)
# calculate the arpr coefficient
# using the R vardpoor library
varpoord_arpr_calculation <-
varpoord(
# analysis variable
Y = "eqincome",
# weights variable
w_final = "rb050",
# row number variable
ID_level1 = "IDd",
# row number variable
ID_level2 = "IDd",
# strata variable
H = "db040",
N_h = NULL ,
# clustering variable
PSU = "rb030",
# data.table
dataset = dati,
# arpr coefficient function
type = "linarpr",
# get linearized variable
outp_lin = TRUE
)
# construct a survey.design
# using our recommended setup
des_eusilc <-
svydesign(
ids = ~ rb030 ,
strata = ~ db040 ,
weights = ~ rb050 ,
data = eusilc
)
# immediately run the convey_prep function on it
des_eusilc <- convey_prep(des_eusilc)
# coefficients do match
varpoord_arpr_calculation$all_result$value
## [1] 14.44422
## eqincome
## 14.44422
# linearized variables do not match
# because Fprime is the derivative wrt
# to the estimated threshold, not the estimated quantile
# for more details, see
# https://github.com/ajdamico/convey/issues/372#issuecomment-1656264143
#
# vardpoor
lin_arpr_varpoord <- varpoord_arpr_calculation$lin_out$lin_arpr
# convey
lin_arpr_convey <- attr(svyarpr( ~ eqincome , des_eusilc), "lin")
# check equality
all.equal(lin_arpr_varpoord, 100 * lin_arpr_convey)
## [1] "Mean relative difference: 0.2264738"
## eqincome
## eqincome 0.07599778
## [1] 0.08718569
## [1] 0.2952722
## eqincome
## eqincome 0.2756769
The variance estimator and the linearized variable \(z\) are both defined in Linearization-Based Variance Estimation. The functions convey::svyarpr
and vardpoor::linarpr
produce the same linearized variable \(z\).
However, the measures of uncertainty do not line up. One of the reasons is that library(vardpoor)
defaults to an ultimate cluster method that can be replicated with an alternative setup of the survey.design
object.
# within each strata, sum up the weights
cluster_sums <-
aggregate(eusilc$rb050 , list(eusilc$db040) , sum)
# name the within-strata sums of weights the `cluster_sum`
names(cluster_sums) <- c("db040" , "cluster_sum")
# merge this column back onto the data.frame
eusilc <- merge(eusilc , cluster_sums)
# construct a survey.design
# with the fpc using the cluster sum
des_eusilc_ultimate_cluster <-
svydesign(
ids = ~ rb030 ,
strata = ~ db040 ,
weights = ~ rb050 ,
data = eusilc ,
fpc = ~ cluster_sum
)
# again, immediately run the convey_prep function on the `survey.design`
des_eusilc_ultimate_cluster <-
convey_prep(des_eusilc_ultimate_cluster)
# does not match
attr(svyarpr( ~ eqincome , des_eusilc_ultimate_cluster) , 'var') * 10000
## eqincome
## eqincome 0.07586194
## [1] 0.08718569
## [1] 0.2952722
## eqincome
## eqincome 0.2754305
Still, there is a difference in the estimates. This is discussed in detail in this issue. In order to still provide additional examples for our code, we proceed with a Monte Carlo experiment.
Using the eusilcP
data from the simPop
package (Templ et al. 2017Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Olivier Dupriez. 2017. “Simulation of Synthetic Complex Data: The R Package simPop.” Journal of Statistical Software 79 (10): 1–38. https://doi.org/10.18637/jss.v079.i10.), we can compute the actual value of the at risk of poverty rate for that population:
##
## Attaching package: 'sampling'
## The following objects are masked from 'package:survival':
##
## cluster, strata
library(survey)
library(convey)
library(parallel)
# load pseudo population data
data("eusilcP" , package = "simPop")
# compute population median
q50.pop <-
convey:::computeQuantiles(eusilcP$eqIncome , rep(1 , length(eusilcP$eqIncome)) , .50)
# compute population poverty threshold
# as 60% of the median
thresh.pop <- .60 * q50.pop
# compute population at risk of poverty rate
(theta.pop <-
mean(eusilcP$eqIncome <= thresh.pop , na.rm = TRUE))
## [1] 0.1469295
Now, to study the distribution of the estimator under a particular sampling design, we select 5000 samples under one-stage cluster sampling of 100 households using the cluster
function from the sampling
package (Tillé and Matei 2021Tillé, Yves, and Alina Matei. 2021. Sampling: Survey Sampling. https://CRAN.R-project.org/package=sampling.), and use the svyarpr
function to estimate the ARPR for each of those samples:
# define the number of monte carlo replicates
mc.rep <- 5000L
# simulation function
arpr_sim_fun <- function(this.iter) {
set.seed(this.iter)
library(survey)
library(convey)
library(sampling)
# load pseudo population data
data("eusilcP" , package = "simPop")
# compute size-like variable for PPS sampling design
eusilcP$aux <-
log(ifelse(eusilcP$eqIncome >= 1000 , eusilcP$eqIncome , 1000))
# select sample
tt <-
sampling::cluster(
data = eusilcP[sample.int(nrow(eusilcP) , nrow(eusilcP) , replace = FALSE) , ] ,
clustername = "hid" ,
size = 1000L ,
method = "systematic" ,
pik = eusilcP$aux
)
# collect data
this.sample <- getdata(eusilcP , tt)
# create survey design object
this.desobj <-
svydesign(
ids = ~ hid ,
probs = ~ Prob ,
data = this.sample ,
nest = FALSE
)
# prepare for convey functions
this.desobj <- convey_prep(this.desobj)
# compute estimates
svyarpr( ~ eqIncome , this.desobj)
}
# run replications
cl <- makeCluster(detectCores() - 1)
arpr.estimate.list <-
clusterApply(cl, seq_len(mc.rep) , arpr_sim_fun)
stopCluster(cl)
Then, we evaluate the Percentage Relative Bias (PRB) of the ARPR estimator. Under this scenario, the PRB of the ARPR estimator is -0.19830%.
# estimate the expected value of the ARPR estimator
# using the average of the estimates
(theta.exp <- mean(sapply(arpr.estimate.list , coef)))
## [1] 0.1466381
# estimate the percentage relative bias
(percentage_relative_bias_arpr <- 100 * (theta.exp / theta.pop - 1))
## [1] -0.1982981
For the variance estimator, we have:
# estimate the variance of the ARPR estimator
# using the empirical variance of the estimates
(vartheta.popest <- var(sapply(arpr.estimate.list , coef)))
## [1] 0.0001048728
# estimate the expected value of the ARPR variance estimator
# using the (estimated) expected value of the variance estimates
(vartheta.exp <- mean(sapply(arpr.estimate.list , vcov)))
## [1] 0.0001081994
# estimate the percentage relative bias of the variance estimator
( percentage_relative_bias_variance <- 100 * (vartheta.exp / vartheta.popest - 1) )
## [1] 3.172023
Under this scenario, the PRB of the ARPR variance estimator is 3.1720%.
Our simulation shows that the Bias Ratio of this estimator is approximately 2%:
## [1] 2.84509
if the normal approximation holds, a small bias ratio still allows for approximately valid estimates of the confidence intervals.
Next, we evaluate the Percentage Coverage Rate (PCR). In theory, under repeated sampling, the estimated 95% CIs should cover the population parameter approximately 95% of the time. We can evaluate that using:
# estimate confidence intervals of the ARPR
# for each of the samples
est.coverage <-
sapply(arpr.estimate.list, function(this.stat)
confint(this.stat)[, 1] <= theta.pop &
confint(this.stat)[, 2] >= theta.pop)
# evaluate empirical coverage
(empirical_coverage <- mean(est.coverage))
## [1] 0.9516
Our coverages are not too far from the nominal coverage level of 95%.
For additional usage examples of svyarpr
, type ?convey::svyarpr
in the R
console.
This section displays example results using nationally-representative surveys from both the United States and Brazil. We present a variety of surveys, levels of analysis, and subpopulation breakouts to provide users with points of reference for the range of plausible values of the svyarpr
function.
To understand the construction of each survey design object and respective variables of interest, please refer to section 1.4 for CPS-ASEC, section 1.5 for PNAD Contínua, and section 1.6 for SCF.
## arpr SE
## htotval 0.30477 0.0019
## sex htotval se.htotval
## male male 0.2569898 0.002699161
## female female 0.3530735 0.002753344
## arpr SE
## ftotval 0.23395 0.0024
## sex ftotval se.ftotval
## male male 0.1897022 0.003094091
## female female 0.2808015 0.003609486
## arpr SE
## pearnval 0.22889 0.0063
## sex pearnval se.pearnval
## male male 0.1957581 0.005582147
## female female 0.2719207 0.007937258
## arpr SE
## deflated_per_capita_income 0.27695 0.0035
## sex deflated_per_capita_income se.deflated_per_capita_income
## male male 0.2705489 0.003702496
## female female 0.2830603 0.003569612
## arpr SE
## deflated_labor_income 0.15562 0.0015
## sex deflated_labor_income se.deflated_labor_income
## male male 0.1290385 0.001682285
## female female 0.1917583 0.002170694
## Multiple imputation results:
## m <- length(results)
## scf_MIcombine(with(scf_design, svyarpr(~networth)))
## results se
## networth 0.4070338 0.005282615
## Multiple imputation results:
## m <- length(results)
## scf_MIcombine(with(scf_design, svyby(~networth, ~hhsex, svyarpr)))
## results se
## male 0.3430431 0.008215832
## female 0.5726547 0.015545485
## Multiple imputation results:
## m <- length(results)
## scf_MIcombine(with(scf_design, svyarpr(~income)))
## results se
## income 0.297591 0.007471504
## Multiple imputation results:
## m <- length(results)
## scf_MIcombine(with(scf_design, svyby(~income, ~hhsex, svyarpr)))
## results se
## male 0.2092876 0.007433473
## female 0.5261313 0.017859170