4.12 J-Divergence and Decomposition (svyjdiv, svyjdivdec)

✔️ can be interpreted in terms of GEI indices
✔️ can be group-decomposed into within-inequality and between-inequality
✔️ does not need to (explicitly) choose inequality aversion parameters
❌ does not handle zero or negative incomes
❌ hard to interpret
❌ the decomposition interpretation is not exactly the same as the GEI, but very similar

The J-divergence measure (Rohde 2016Rohde, Nicholas. 2016. “J-Divergence Measurements of Economic Inequality.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 179 (3): 847–70. https://doi.org/10.1111/rssa.12153.) can be seen as the sum of \(GE_0\) and \(GE_1\), satisfying axioms that, individually, those two indices do not. The J-divergence measure is defined as:

\[ J = \frac{1}{N} \sum_{i \in U} \bigg( \frac{ y_i }{ \mu } -1 \bigg) \ln \bigg( \frac{y_i}{\mu} \bigg) \]

Since it is a sum of two additive decomposable measures, \(J\) itself is decomposable. The within and between parts of the J-divergence decomposition are:

\[ \begin{aligned} J_B &= \sum_{g \in G} \frac{N_g}{N} \bigg( \frac{ \mu_g }{ \mu } - 1 \bigg) \ln \bigg( \frac{\mu_g}{\mu} \bigg) \\ J_W &= \sum_{g \in G} \bigg[ \frac{Y_g}{Y} GE^{(1)}_g + \frac{N_g}{N} GE^{(0)}_g \bigg] = \sum_{g \in G} \bigg[ \bigg( \frac{Y_g N + YN_g}{YN} \bigg) J_g \bigg] - \sum_{g \in G} \bigg[ \frac{Y_g}{Y} GE^{(1)}_g + \frac{N_g}{N} GE^{(0)}_g \bigg] \end{aligned} \]

where the subscript \(g\) denotes one of the \(G\) subgroups in the population. The main difference with the additively decomposable Generalized Entropy family is that the “within” component of the J-divergence index cannot be seen as a weighted contribution of the J-divergence across groups. So \(J_W\) cannot be interpreted as a weighted mean of the J-divergence across groups, but as the sum of two terms: (1) the income-share weighted mean of the \(GE^{(1)}\) across groups; and (2) the population-share weighted mean of the \(GE^{(0)}\) across groups. In other words, the “within” component still accounts for the inequality in each group; it just does not have a direct interpretation.


4.12.1 Monte Carlo Simulation

First, we should check that the finite-population values make sense. The J-divergence can be seen as the sum of \(GE^{(0)}\) and \(GE^{(1)}\). So, taking the starting population from the svyzenga section of this text, we have:

# compute finite population J-divergence
(jdivt.pop <-
   (convey:::CalcJDiv(pop.df$x , ifelse(pop.df$x > 0 , 1 , 0))))
## [1] 0.4332649
# compute finite population GE indices
(gei0.pop <-
    convey:::CalcGEI(pop.df$x , ifelse(pop.df$x > 0 , 1 , 0) , 0))
## [1] 0.2215037
(gei1.pop <-
    convey:::CalcGEI(pop.df$x , ifelse(pop.df$x > 0 , 1 , 0) , 1))
## [1] 0.2117612
# check equality; should be true
all.equal(jdivt.pop , gei0.pop + gei1.pop)
## [1] TRUE

As expected, the J-divergence matches the sum of GEs in the finite population. And as we’ve checked the GE measures before, the J-divergence computation function seems safe.

In order to assess the estimators implemented in svyjdiv and svyjdivdec, we can run a Monte Carlo experiment. Using the same samples we used in the svyzenga replication example, we have:

# estimate J-divergence with each sample
jdiv.estimate.list <-
  lapply(survey.list ,
         function(this.sample) {
           svyjdiv( ~ x ,
                    subset(this.sample , x > 0) ,
                    na.rm = TRUE)
         })

# compute the (finite population overall) J-divergence
(theta.pop <-
    convey:::CalcJDiv(pop.df$x , ifelse(pop.df$x > 0 , 1 , 0)))
## [1] 0.4332649
# estimate the expected value of the J-divergence estimator
# using the average of the estimates
(theta.exp <- mean(sapply(jdiv.estimate.list , coef)))
## [1] 0.4327823
# estimate the percentage relative bias
100 * (theta.exp / theta.pop - 1)
## [1] -0.1113765
# estimate the variance of the J-divergence estimator
# using the variance of the estimates
(vartheta.popest <- var(sapply(jdiv.estimate.list , coef)))
## [1] 0.0005434848
# estimate the expected value of the J-divergence index variance estimator
# using the expected of the variance estimates
(vartheta.exp <- mean(sapply(jdiv.estimate.list , vcov)))
## [1] 0.0005342947
# estimate the percentage relative bias of the variance estimator
100 * (vartheta.exp / vartheta.popest - 1)
## [1] -1.690964

For the decomposition, we repeat the same procedure:

# estimate J-divergence decomposition with each sample
jdivdec.estimate.list <-
  lapply(survey.list ,
         function(this.sample) {
           svyjdivdec( ~ x ,
                       ~ SEX ,
                       subset(this.sample , x > 0) ,
                       na.rm = TRUE)
         })

# compute the (finite population) J-divergence decomposition per sex
jdivt.pop <-
  convey:::CalcJDiv(pop.df$x , ifelse(pop.df$x > 0 , 1 , 0))

overall.mean <- mean(pop.df$x[pop.df$x > 0])

group.mean <-
  c(by(pop.df$x[pop.df$x > 0] , list("SEX" = factor(pop.df$SEX[pop.df$x > 0])) , FUN = mean))

group.pshare <-
  c(prop.table(by(rep(1 , sum(
    pop.df$x > 0
  )) , list(
    "SEX" = factor(pop.df$SEX[pop.df$x > 0])
  ) , FUN = sum)))

jdivb.pop <-
  sum(group.pshare * (group.mean / overall.mean - 1) * log(group.mean / overall.mean))

jdivw.pop <- jdivt.pop - jdivb.pop

(theta.pop <- c(jdivt.pop , jdivw.pop , jdivb.pop))
## [1] 0.433264877 0.428398928 0.004865949
# estimate the expected value of the J-divergence decomposition estimator
# using the average of the estimates
(theta.exp <- rowMeans(sapply(jdivdec.estimate.list , coef)))
##       total      within     between 
## 0.432782322 0.427480257 0.005302065
# estimate the percentage relative bias
100 * (theta.exp / theta.pop - 1)
##      total     within    between 
## -0.1113765 -0.2144430  8.9626164

The estimated PRB for the total is the same as before, so we will focus on the within and between components. While the within component has a small relative bias (-0.21%), the between component PRB is significant, amounting to 8.96%.

For the variance estimator, we do:

# estimate the variance of the J-divergence estimator
# using the variance of the estimates
(vartheta.popest <-
   diag(var(t(
     sapply(jdivdec.estimate.list , coef)
   ))))
##        total       within      between 
## 5.434848e-04 5.391901e-04 8.750879e-06
# estimate the expected value of the J-divergence index variance estimator
# using the expected of the variance estimates
(vartheta.exp <-
    rowMeans(sapply(jdivdec.estimate.list , function(z)
      diag(vcov(
        z
      )))))
##        total       within      between 
## 5.342947e-04 5.286750e-04 8.891772e-06
# estimate the percentage relative bias of the variance estimator
100 * (vartheta.exp / vartheta.popest - 1)
##     total    within   between 
## -1.690964 -1.950177  1.610034

The PRB of the variance estimators for both components are small: -1.95% for the within component and 1.61% for the between component.

How much should we care about the between component bias? Our simulations show that the Squared Bias of this estimator accounts for less than 2% of its Mean Squared Error:

theta.bias2 <- (theta.exp - theta.pop) ^ 2
theta.mse <- theta.bias2 + vartheta.popest
100 * (theta.bias2 / theta.mse)
##      total     within    between 
## 0.04282728 0.15627854 2.12723212

Next, we evaluate the Percentage Coverage Rate (PCR). In theory, under repeated sampling, the estimated 95% CIs should cover the population parameter 95% of the time. We can evaluate that using:

# estimate confidence intervals of the Zenga index
# for each of the samples
est.coverage <-
  t(sapply(jdivdec.estimate.list, function(this.stat)
    confint(this.stat)[, 1] <= theta.pop &
      confint(this.stat)[, 2] >= theta.pop))

# evaluate
colMeans(est.coverage)
##   total  within between 
##  0.9390  0.9376  0.9108

Our coverages are not too far from the nominal coverage level of 95%, however the bias of the between component estimator can affect its coverage rate.

For additional usage examples of svyjdiv or svyjdivdec, type ?convey::svyjdiv or ?convey::svyjdivdec in the R console.

4.12.2 Real World Examples

This section displays example results using nationally-representative surveys from both the United States and Brazil. We present a variety of surveys, levels of analysis, and subpopulation breakouts to provide users with points of reference for the range of plausible values of the svyjdiv function.

To understand the construction of each survey design object and respective variables of interest, please refer to section 1.4 for CPS-ASEC, section 1.5 for PNAD Contínua, and section 1.6 for SCF.

4.12.2.1 CPS-ASEC Household Income

svyjdiv(
  ~ htotval ,
  subset(cps_household_design , htotval > 0)
)
##         j-divergence     SE
## htotval      0.91009 0.0091
svyby(
  ~ htotval ,
  ~ sex ,
  subset(cps_household_design , htotval > 0) ,
  svyjdiv
)
##           sex   htotval se.htotval
## male     male 0.8468719 0.01304558
## female female 0.9601656 0.01229789

4.12.2.2 CPS-ASEC Family Income

svyjdiv(
  ~ ftotval ,
  subset(cps_family_design , ftotval > 0)
)
##         j-divergence     SE
## ftotval      0.78947 0.0097
svyby(
  ~ ftotval ,
  ~ sex ,
  subset(cps_family_design , ftotval > 0) ,
  svyjdiv
)
##           sex   ftotval se.ftotval
## male     male 0.7255059  0.0150296
## female female 0.8478025  0.0123227

4.12.2.3 CPS-ASEC Worker Earnings

svyjdiv(
  ~ pearnval ,
  subset(cps_ftfy_worker_design , pearnval > 0)
)
##          j-divergence   SE
## pearnval      0.63767 0.01
svyby(
  ~ pearnval ,
  ~ sex ,
  subset(cps_ftfy_worker_design , pearnval > 0) ,
  svyjdiv
)
##           sex  pearnval se.pearnval
## male     male 0.6493827  0.01177377
## female female 0.5888245  0.01833507

4.12.2.4 PNAD Contínua Per Capita Income

svyjdiv(
  ~ deflated_per_capita_income ,
  subset(pnadc_design , deflated_per_capita_income > 0),
  na.rm = TRUE
)
##                            j-divergence     SE
## deflated_per_capita_income      0.99732 0.0162
svyby(
  ~ deflated_per_capita_income ,
  ~ sex ,
  subset(pnadc_design , deflated_per_capita_income > 0),
  svyjdiv ,
  na.rm = TRUE
)
##           sex deflated_per_capita_income se.deflated_per_capita_income
## male     male                  1.0074672                    0.01724635
## female female                  0.9864209                    0.01639167

4.12.2.5 PNAD Contínua Worker Earnings

svyjdiv(
  ~ deflated_labor_income ,
  subset(pnadc_design , deflated_labor_income > 0) ,
  na.rm = TRUE
)
##                       j-divergence     SE
## deflated_labor_income      0.92486 0.0178
svyby(
  ~ deflated_labor_income ,
  ~ sex ,
  subset(pnadc_design , deflated_labor_income > 0) ,
  svyjdiv ,
  na.rm = TRUE
)
##           sex deflated_labor_income se.deflated_labor_income
## male     male             0.9412711               0.02098598
## female female             0.8628107               0.01653023

4.12.2.6 SCF Family Net Worth

scf_MIcombine(with(subset(scf_design , networth > 0) , svyjdiv(~ networth)))
## Warning in subset.svyimputationList(scf_design, networth > 0): subset differed
## between imputations
## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, networth > 0), svyjdiv(~networth)))
##           results         se
## networth 3.755306 0.06761237
scf_MIcombine(with(
  subset(scf_design , networth > 0) ,
  svyby(~ networth, ~ hhsex , svyjdiv)
))
## Warning in subset.svyimputationList(scf_design, networth > 0): subset differed
## between imputations
## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, networth > 0), svyby(~networth, 
##     ~hhsex, svyjdiv)))
##         results         se
## male   3.604860 0.07501126
## female 3.311156 0.26160038

4.12.2.7 SCF Family Income

scf_MIcombine(with(subset(scf_design , income > 0) , svyjdiv(~ income)))
## Warning in subset.svyimputationList(scf_design, income > 0): subset differed
## between imputations
## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, income > 0), svyjdiv(~income)))
##        results        se
## income 1.67673 0.1004542
scf_MIcombine(with(
  subset(scf_design , income > 0) ,
  svyby(~ income, ~ hhsex , svyjdiv)
))
## Warning in subset.svyimputationList(scf_design, income > 0): subset differed
## between imputations
## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(subset(scf_design, income > 0), svyby(~income, 
##     ~hhsex, svyjdiv)))
##          results         se
## male   1.6283323 0.10752174
## female 0.8483978 0.07380227