4.2 The Gender Pay Gap (svygpg)

✔️ easy to understand
✔️ the difference between men's and women's average wages, expressed as a fraction of men's average wage
✔️ alternatively: women's average wage is `( 1 - GPG ) x men's average wage`
❌ not an inequality measure in the Pigou-Dalton Principle sense
❌ ignores inequality within each group (among men and among women)
❌ binary gender only

Although the Gender Pay Gap (GPG) is not an inequality measure in the usual sense (that is, with respect to the Pigou-Dalton principle), it can still be a useful instrument to evaluate the effects of gender discrimination. Put simply, it expresses the relative difference between the average hourly earnings of men and women, presenting this difference as a percentage of men's average hourly earnings. Like some other functions described in this text, the GPG can also be calculated using wealth or assets in place of earnings.

In mathematical terms, this index is defined as:

\[ GPG = \frac{ \bar{y}_{male} - \bar{y}_{female} }{ \bar{y}_{male} } \]

As we can see from the formula, if there is no difference between the groups, $GPG = 0$. If $GPG > 0$, women's average hourly earnings are $100 \times GPG$ percent smaller than men's. If $GPG < 0$, women's average hourly earnings exceed men's by $100 \times |GPG|$ percent of men's average. In other words, the larger the $GPG$, the larger the shortfall of women's hourly earnings.
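A minimal sketch of the formula on made-up data, using only base R. The wages, weights, and variable names below are hypothetical and unrelated to the eusilc data used later in this section; the point is only to show how the two weighted group means enter the ratio.

```r
# hypothetical hourly wages, survey weights, and gender labels
wages  <- c(20, 30, 25, 18, 24, 21)
w      <- c(1, 2, 1, 1, 2, 1)
gender <- c("male", "male", "male", "female", "female", "female")

# weighted mean hourly wage within each group
mean_male   <- weighted.mean(wages[gender == "male"], w[gender == "male"])
mean_female <- weighted.mean(wages[gender == "female"], w[gender == "female"])

# gender pay gap: shortfall of women's average as a fraction of men's average
gpg <- (mean_male - mean_female) / mean_male
gpg
```

In this toy example the men's weighted average is 26.25 and the women's is 21.75, so the gap is positive: women's average wage falls short of men's by about 17 percent of the men's average.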

We can also express this more directly: if the $GPG$ stays constant, then for every $1 increase in men's average hourly earnings, women's average hourly earnings are expected to increase by $(1 - GPG)$ dollars. For instance, assuming $GPG = 0.8$, for every $1.00 increase in men's average hourly earnings, women's average hourly earnings would increase by only $0.20.
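The relation behind this statement follows from rearranging the definition of the GPG given earlier. The second step assumes that the GPG is held fixed while men's average earnings change, which is an assumption made here for illustration:

\[ \bar{y}_{female} = ( 1 - GPG ) \, \bar{y}_{male} \quad \Rightarrow \quad \Delta \bar{y}_{female} = ( 1 - GPG ) \, \Delta \bar{y}_{male} \]

With $GPG = 0.8$ and $\Delta \bar{y}_{male} = 1$, this gives $\Delta \bar{y}_{female} = ( 1 - 0.8 ) \times 1 = 0.2$.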

The details of the linearization of the GPG are discussed by Deville (1999) and Osier (2009).

Deville, Jean-Claude. 1999. “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.” Survey Methodology 25 (2): 193–203. http://www.statcan.gc.ca/pub/12-001-x/1999002/article/4882-eng.pdf.
Osier, Guillaume. 2009. “Variance Estimation for Complex Indicators of Poverty and Inequality.” Journal of the European Survey Research Association 3 (3): 167–95. http://ojs.ub.uni-konstanz.de/srm/article/view/369.

4.2.1 Replication Example

The R vardpoor package (Breidaks, Liberts, and Ivanova 2016), created by researchers at the Central Statistical Bureau of Latvia, includes a GPG coefficient calculation using the ultimate cluster method. The example below reproduces those statistics.

Breidaks, Juris, Martins Liberts, and Santa Ivanova. 2016. “Vardpoor: Estimation of Indicators on Social Exclusion and Poverty and Its Linearization, Variance Estimation.” Riga, Latvia: CSB.

Load and prepare the same data set:

# load the convey package
library(convey)

# load the survey library
library(survey)

# load the vardpoor library
library(vardpoor)

# load the laeken library
library(laeken)

# load the synthetic EU statistics on income & living conditions
data(eusilc)

# make all column names lowercase
names(eusilc) <- tolower(names(eusilc))

# coerce the gender variable to numeric 1 or 2
eusilc$one_two <- as.numeric(eusilc$rb090 == "female") + 1

# add a column with the row number
dati <- data.table::data.table(IDd = 1:nrow(eusilc), eusilc)

# calculate the gpg coefficient
# using the R vardpoor library
varpoord_gpg_calculation <-
  varpoord(
    # analysis variable
    Y = "eqincome",
    
    # weights variable
    w_final = "rb050",
    
    # row number variable
    ID_level1 = "IDd",
    
    # row number variable
    ID_level2 = "IDd",
    
    # strata variable
    H = "db040",
    
    N_h = NULL ,
    
    # clustering variable
    PSU = "rb030",
    
    # data.table
    dataset = dati,
    
    # gpg coefficient function
    type = "lingpg" ,
    
    # gender variable
    gender = "one_two",
    
    # get linearized variable
    outp_lin = TRUE
  )



# construct a survey.design
# using our recommended setup
des_eusilc <-
  svydesign(
    ids = ~ rb030 ,
    strata = ~ db040 ,
    weights = ~ rb050 ,
    data = eusilc
  )

# immediately run the convey_prep function on it
des_eusilc <- convey_prep(des_eusilc)

# compare the coefficients
varpoord_gpg_calculation$all_result$value

## [1] 7.645389

coef(svygpg( ~ eqincome , des_eusilc , sex = ~ rb090)) * 100

##  eqincome 
## -8.278297
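The two coefficients printed above differ in both sign and size. One arithmetic relation that is consistent with the printed values, offered as an observation rather than a confirmed explanation of either package's behavior, is a swap of the reference group: if $g$ is the gap relative to men's average, then the gap relative to women's average is $-g / (1 - g)$. The sketch below uses only the two printed numbers; `gpg_male_ref` and `gpg_female_ref` are hypothetical names.

```r
# value printed by vardpoor above, converted from percent to a fraction
gpg_male_ref <- 7.645389 / 100

# same gap re-expressed with women's average as the denominator:
# (y_female - y_male) / y_female = -g / (1 - g)
gpg_female_ref <- -gpg_male_ref / (1 - gpg_male_ref)

# back to percent, for comparison with the convey output above
gpg_female_ref * 100
```

Under this reading, it is worth checking the order of the factor levels of the `sex` variable before interpreting the sign of a `svygpg` estimate.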

# compare the linearized variables
# vardpoor
lin_gpg_varpoord <- varpoord_gpg_calculation$lin_out$lin_gpg
# convey
lin_gpg_convey <-
  attr(svygpg( ~ eqincome , des_eusilc, sex = ~ rb090), "lin")

# check equality
all.equal(lin_gpg_varpoord, 100 * lin_gpg_convey[, 1])

## [1] "Mean relative difference: 2.172419"

# variances do not match exactly
attr(svygpg( ~ eqincome , des_eusilc , sex = ~ rb090) , 'var') * 10000

##           eqincome
## eqincome 0.8926311

varpoord_gpg_calculation$all_result$var

## [1] 0.6482346

# standard errors do not match exactly
varpoord_gpg_calculation$all_result$se

## [1] 0.8051301

SE(svygpg( ~ eqincome , des_eusilc , sex = ~ rb090)) * 100

##           eqincome
## eqincome 0.9447916

The variance estimator and the linearized variable $z$ are both defined in Linearization-Based Variance Estimation. The functions convey::svygpg and vardpoor::lingpg are expected to produce the same linearized variable $z$, although the `all.equal` check printed above reports a difference.

However, the measures of uncertainty do not line up, because library(vardpoor) defaults to an ultimate cluster method that can be replicated with an alternative setup of the survey.design object.

# within each stratum, sum up the weights
cluster_sums <-
  aggregate(eusilc$rb050 , list(eusilc$db040) , sum)

# name the within-strata sums of weights the `cluster_sum`
names(cluster_sums) <- c("db040" , "cluster_sum")

# merge this column back onto the data.frame
eusilc <- merge(eusilc , cluster_sums)

# construct a survey.design
# with the fpc using the cluster sum
des_eusilc_ultimate_cluster <-
  svydesign(
    ids = ~ rb030 ,
    strata = ~ db040 ,
    weights = ~ rb050 ,
    data = eusilc ,
    fpc = ~ cluster_sum
  )

# again, immediately run the convey_prep function on the `survey.design`
des_eusilc_ultimate_cluster <-
  convey_prep(des_eusilc_ultimate_cluster)

# compare the variances
attr(svygpg( ~ eqincome , des_eusilc_ultimate_cluster , sex = ~ rb090) ,
     'var') * 10000

##           eqincome
## eqincome 0.8910413

varpoord_gpg_calculation$all_result$var

## [1] 0.6482346

# compare the standard errors
varpoord_gpg_calculation$all_result$se

## [1] 0.8051301

SE(svygpg( ~ eqincome , des_eusilc_ultimate_cluster , sex = ~ rb090)) * 100

##           eqincome
## eqincome 0.9439499

For additional usage examples of svygpg, type ?convey::svygpg in the R console.

4.2.2 Real World Examples

This section displays example results using nationally representative surveys from both the United States and Brazil. We present a variety of surveys, levels of analysis, and subpopulation breakouts to provide users with points of reference for the range of plausible values of the svygpg function.

To understand the construction of each survey design object and respective variables of interest, please refer to section 1.4 for CPS-ASEC, section 1.5 for PNAD Contínua, and section 1.6 for SCF.

4.2.2.1 CPS-ASEC Household Income

svygpg(~ htotval , cps_household_design , sex = ~ sex)

##              gpg    SE
## htotval -0.21961 0.013
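As a reading aid, the point estimate and standard error printed above can be turned into an approximate 95% confidence interval. This sketch assumes a normal approximation and copies the two printed numbers by hand, so it inherits their rounding; `gpg_hat`, `se_hat`, and `ci` are hypothetical names.

```r
# values copied from the svygpg output above
gpg_hat <- -0.21961
se_hat  <- 0.013

# normal-approximation 95% interval: estimate +/- z * SE
z  <- qnorm(0.975)
ci <- c(lower = gpg_hat - z * se_hat, upper = gpg_hat + z * se_hat)
ci
```

Both endpoints are negative, so under this approximation the interval excludes zero.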

svyby(~ htotval , ~ a_maritl , cps_household_design , svygpg , sex = ~ sex)

##                                                            a_maritl     htotval
## married - civilian spouse present married - civilian spouse present -0.03181163
## married - AF spouse present             married - AF spouse present -0.32822461
## married - spouse absent                     married - spouse absent -0.56380658
## widowed                                                     widowed -0.14137283
## divorced                                                   divorced -0.10360249
## separated                                                 separated -0.54741878
## never married                                         never married -0.17299575
##                                   se.htotval
## married - civilian spouse present 0.01401497
## married - AF spouse present       0.17997337
## married - spouse absent           0.11379058
## widowed                           0.05201811
## divorced                          0.03514135
## separated                         0.10734551
## never married                     0.02834880

4.2.2.2 CPS-ASEC Family Income

svygpg(~ ftotval , cps_family_design , sex = ~ sex)

##              gpg     SE
## ftotval -0.19287 0.0141

svyby(~ ftotval , ~ a_maritl , cps_family_design , svygpg , sex = ~ sex)

##                                                            a_maritl     ftotval
## married - civilian spouse present married - civilian spouse present -0.03138597
## married - AF spouse present             married - AF spouse present -0.32822461
## married - spouse absent                     married - spouse absent -0.72848350
## widowed                                                     widowed -0.15551741
## divorced                                                   divorced -0.13440910
## separated                                                 separated -0.60357071
## never married                                         never married -0.35042219
##                                   se.ftotval
## married - civilian spouse present 0.01401296
## married - AF spouse present       0.17997337
## married - spouse absent           0.20717616
## widowed                           0.06510014
## divorced                          0.05955588
## separated                         0.14911941
## never married                     0.05862692

4.2.2.3 CPS-ASEC Worker Earnings

svygpg( ~ pearnval , cps_ftfy_worker_design , sex = ~ sex)

##               gpg     SE
## pearnval -0.26139 0.0142

svyby( ~ pearnval , ~ a_maritl , cps_ftfy_worker_design , svygpg , sex = ~ sex)

##                                                            a_maritl
## married - civilian spouse present married - civilian spouse present
## married - AF spouse present             married - AF spouse present
## married - spouse absent                     married - spouse absent
## widowed                                                     widowed
## divorced                                                   divorced
## separated                                                 separated
## never married                                         never married
##                                       pearnval se.pearnval
## married - civilian spouse present -0.332790903  0.01764578
## married - AF spouse present       -0.237551797  0.18083461
## married - spouse absent           -0.216038485  0.08486880
## widowed                            0.009018563  0.09229872
## divorced                          -0.149661580  0.04110491
## separated                         -0.453153005  0.10300059
## never married                     -0.063720063  0.02173721

4.2.2.4 PNAD Contínua Per Capita Income

svygpg(~ deflated_per_capita_income ,
        pnadc_design ,
        na.rm = TRUE ,
        sex = ~ sex)

##                                  gpg     SE
## deflated_per_capita_income -0.044081 0.0057

svyby(
  ~ deflated_per_capita_income ,
  ~ age_categories ,
  pnadc_design ,
  svygpg ,
  na.rm = TRUE ,
  sex = ~ sex
)

##    age_categories deflated_per_capita_income se.deflated_per_capita_income
## 1               1               -0.024306689                    0.03230496
## 2               2                0.009552321                    0.03181511
## 3               3               -0.005161343                    0.02841903
## 4               4                0.013540072                    0.02252232
## 5               5               -0.079247786                    0.02238818
## 6               6               -0.122895187                    0.02997464
## 7               7               -0.137640776                    0.03014413
## 8               8               -0.133957055                    0.02826733
## 9               9               -0.081122602                    0.02683335
## 10             10               -0.104618639                    0.02774374
## 11             11               -0.016451082                    0.02122297
## 12             12               -0.050554855                    0.02824782
## 13             13               -0.040804465                    0.01106613

4.2.2.5 PNAD Contínua Worker Earnings

svygpg(~ deflated_labor_income ,
        pnadc_design ,
        na.rm = TRUE ,
        sex = ~ sex)

##                            gpg   SE
## deflated_labor_income -0.26799 0.01

svyby(
  ~ deflated_labor_income ,
  ~ age_categories ,
  pnadc_design ,
  svygpg ,
  na.rm = TRUE ,
  sex = ~ sex
)

##    age_categories deflated_labor_income se.deflated_labor_income
## 1               1                   NaN                       NA
## 2               2                   NaN                       NA
## 3               3            -0.1125043               0.18641583
## 4               4            -0.2054371               0.06174511
## 5               5            -0.1435653               0.02236847
## 6               6            -0.1349623               0.02726206
## 7               7            -0.2124605               0.03378860
## 8               8            -0.2241276               0.03192653
## 9               9            -0.2877783               0.03401662
## 10             10            -0.3935752               0.04738784
## 11             11            -0.2486947               0.03233777
## 12             12            -0.4023927               0.05638749
## 13             13            -0.5192294               0.05859540

4.2.2.6 SCF Family Net Worth

scf_MIcombine(with(scf_design , svygpg( ~ networth , sex = ~ hhsex)))

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(scf_design, svygpg(~networth, sex = ~hhsex)))
##            results       se
## networth -2.714833 0.288726

scf_MIcombine(with(scf_design , svyby( ~ networth, ~ edcl , svygpg , sex = ~ hhsex)))

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(scf_design, svyby(~networth, ~edcl, svygpg, 
##     sex = ~hhsex)))
##                         results        se
## less than high school -1.575998 0.7734308
## high school or GED    -1.210320 0.3505109
## some college          -2.427703 0.4537274
## college degree        -2.438002 0.4078957

4.2.2.7 SCF Family Income

scf_MIcombine(with(scf_design , svygpg( ~ income , sex = ~ hhsex)))

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(scf_design, svygpg(~income, sex = ~hhsex)))
##         results       se
## income -2.00275 0.135422

scf_MIcombine(with(scf_design , svyby( ~ income, ~ edcl , svygpg , sex = ~ hhsex)))

## Multiple imputation results:
##       m <- length(results)
##       scf_MIcombine(with(scf_design, svyby(~income, ~edcl, svygpg, 
##     sex = ~hhsex)))
##                         results        se
## less than high school -0.753955 0.1693429
## high school or GED    -1.167613 0.1121024
## some college          -1.248036 0.1148635
## college degree        -2.178707 0.2109885