The twangRDC R package is a streamlined version of the twang R package created specifically for use in the RDCs. It uses gradient boosted models to estimate weights that address imbalance between nonequivalent groups. Results using twangRDC will not necessarily be reproduced using twang due to important differences in implementation. The twangRDC package allows for much larger datasets and many more covariates, but users should note that with smaller datasets, the twang package is computationally more efficient.
The twangRDC
package utilizes gradient boosted models to derive weights for nonequivalent groups. The algorithm alternates between adding iterations to the gradient boosted model (increasing its complexity) and evaluating balance. The algorithm automatically stops when additional iterations no longer improve balance. The package allows the user to generate weights for two different scenarios: missing data and construction of a comparison group.
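The alternation between boosting and balance checking can be sketched conceptually as follows. This is not the package's internal code; evaluate_balance() is a hypothetical stand-in for the package's balance criterion.

# conceptual sketch of the stopping rule, not twangRDC internals
# evaluate_balance() is a hypothetical stand-in: imagine it extends the boosted
# model to n.iters iterations and returns the balance criterion (smaller is better)
evaluate_balance = function(n.iters) (n.iters - 3000)^2 / 1e6

iters.per.step = 500
best.balance = Inf
n.iters = 0
repeat {
  n.iters = n.iters + iters.per.step     # add iterations to the boosted model
  balance = evaluate_balance(n.iters)    # re-evaluate balance
  if (balance < best.balance) {
    best.balance = balance               # balance improved, keep going
  } else {
    break                                # no further improvement, stop
  }
}
n.iters   # iteration count at which balance stopped improving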
If the data is representative of a population but some observations are missing data, the twangRDC package can generate missing data weights such that the complete data is representative of the population. This is achieved by weighting the complete data by the inverse probability of not being missing. The steps of the algorithm are described here assuming the user specifies strata in the model, but the steps are the same without strata by considering the data as coming from a single stratum.
At each step, the algorithm adds iters.per.step iterations to the gradient boosted model, re-evaluates balance within each stratum, and stops once additional iterations no longer improve balance.

If the goal is the construction of a comparison group, twangRDC can apply gradient boosted propensity score weights for the average treatment effect on the treated. Treatment observations receive a weight of 1, while comparison observations receive a weight equal to the odds of treatment based on the propensity score model. The steps of the algorithm are described here assuming the user specifies strata in the model, but the steps are the same without strata by considering the data as coming from a single stratum. As before, each step adds iters.per.step iterations to the gradient boosted model and re-evaluates balance, stopping once balance no longer improves.
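A minimal sketch of how the two kinds of weights are formed from a fitted probability is shown below; p.obs and p.trt are made-up stand-ins for probabilities that would come from the gradient boosted model, not output of twangRDC.

# sketch only: p.obs and p.trt stand in for model-based probabilities
# missing data weights: complete records are weighted by the inverse
# probability of not being missing
p.obs = 0.80               # hypothetical P(not missing | covariates)
w.complete = 1 / p.obs     # weight for a complete record

# comparison group (ATT) weights: treatment records get weight 1, comparison
# records get the odds of treatment implied by the propensity score
p.trt = 0.30               # hypothetical propensity score for a comparison record
w.treatment = 1
w.comparison = p.trt / (1 - p.trt)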
The arguments to the ps.xgb function are described in the table below.

Argument | Description |
---|---|
formula | A symbolic description of the model to be fit with an observed data indicator or a treatment indicator on the left side of the formula and the variables to be balanced on the right side. |
population | An indicator of whether the data represents a population. |
strata | An optional factor variable identifying the strata. If specified, balance is optimized within strata. |
data | The data. |
params | xgboost parameters. Details below. |
file | An optional filename to save intermediate results. |
max.steps | An integer specifying the maximum number of steps to take. |
iters.per.step | An integer specifying the number of iterations to add to the model at each step of the algorithm. |
id.var | A variable that uniquely identifies observations. |
min.width | An integer specifying the minimum number of iterations between the current number of iterations and the optimal value. |
save.model | A logical value indicating whether the xgboost model should be saved as part of the output object. |
weights | An optional variable that identifies user defined weights to be incorporated into the optimization. Sampling design weights are an example. |
We will highlight two uses of the twangRDC R package. First, we will generate weights that ensure individuals with Protected Identification Keys (PIKs) are representative of their geographic region. PIKs are linking keys used by the Census Bureau to match person records across datasets [1]. For example, using a PIK, a respondent in the 2000 Decennial Census can be linked to their response to the 2010 Decennial Census. PIKs are assigned through a probabilistic matching process that links survey or census records containing Personally Identifiable Information (PII), such as name, address, date of birth, and place of birth, to a Social Security reference file. Not all survey and census records can be linked to the reference file, and as such not all records will be assigned a PIK. Linkage failures may occur if the PII used to match to the reference file is of insufficient quality or if an individual has no matching record in the Social Security reference file. PIKs can be assigned for the large majority of 2000 Decennial Census records. A Decennial Census record must have a PIK in order to link it to administrative records that contain other information, such as mortality or annual residence.
Second, we will highlight using the twangRDC
R package to generate propensity score weights such that a group of southern metropolitan areas can be used as a comparison group for residents of Orleans Parish.
A simulated data file was created for use in this vignette. It contains simulated records for residents of Orleans Parish, Louisiana, and other metropolitan areas in the South census region. We built the file exclusively from public use data, but it mirrors the structure of restricted versions of the 2000 Decennial Census available through the FSRDC network [2]. Each simulated record includes basic individual demographic characteristics, basic household characteristics, and a set of neighborhood-level characteristics. Each record also includes two important indicators, one to simulate whether the individual record received a PIK and another to denote whether the individual lived in Orleans Parish.
The data was created by extracting all “short form” variables for households and individuals from the 5% Integrated Public Use Microdata Sample (IPUMS [3]) of the 2000 Decennial Census for the city of New Orleans (Orleans Parish) and for a selection of other southern metropolitan areas. We extracted contextual information from public use tract level tabulations of the 2000 Decennial Census (distributed by NHGIS [4]) and created select factor-based measures. We simulated assignment of households to census tracts and attached tract identifiers and characteristics.
We also simulated an indicator of PIK assignment (PIK=yes/no) for person records. Public use data do not include PIK assignment, so we estimated a predicted probability of receiving a PIK using estimated regression parameters from Bond et al. (2014) [5]. We added a random error to the predicted probability so that PIK assignment status is not deterministic and converted the predicted probability into a dichotomous PIK=yes/no variable.
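To make that construction concrete, the sketch below converts a noisy predicted probability into a binary indicator; the intercept and coefficient are placeholders, not the Bond et al. estimates, and the thresholding step is one simple way to dichotomize.

# illustrative sketch of the PIK simulation step; coefficients are placeholders
set.seed(1)
n = 5
age = sample(1:12, n, replace = TRUE)               # toy covariate
p = plogis(-0.5 + 0.1 * age)                        # placeholder predicted probability
p = pmin(pmax(p + rnorm(n, sd = 0.05), 0), 1)       # add random error, keep within [0, 1]
sim_pik = as.integer(p > 0.5)                       # dichotomize into PIK = yes/no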
Lastly, we pooled Orleans Parish records with those from other southern metropolitan areas, created an indicator for Orleans Parish residence, and, for the purposes of this vignette, sampled the data to shrink the size of the dataset. The simulated file contains individual records for 4,606 residents of Orleans Parish, LA and 9,519 individual records for residents of select southern metropolitan areas.
Table 1 provides a description of the data elements included in the simulated data file.
Data element | Description | Labels and Codings |
---|---|---|
metarea | A categorical variable for metropolitan area. | Atlanta, GA (52) Memphis, TN/AR/MS (492) New Orleans, LA (556) Washington, DC/MD/VA (884) |
c00_age12 | A categorical variable for age in years at the 2000 Decennial Census. | 0 to 2 years old (1) 3 to 5 years old (2) 6 to 9 years old (3) 10 to 14 years old (4) 15 to 18 years old (5) 19 to 24 years old (6) 25 to 34 years old (7) 35 to 44 years old (8) 45 to 54 years old (9) 55 to 64 years old (10) 65 to 74 years old (11) 75 and older (12) |
c00_sex | A binary indicator of sex as reported on the 2000 Decennial Census. | Male (0) Female (1) |
c00_race | A categorical variable for race as reported on the 2000 Decennial Census. | White (1) African American (2) American Indian or Alaskan Native (3) Asian (4) Other Asian or Pacific Islander (5) Some other race (6) |
c00_nphu | The number of persons in housing unit as reported on the 2000 Decennial Census. | 1 to 16 |
hhid | Household identifier. | |
tract_id_str | Census tract identifier. | |
concdis | Tract level factor measure of concentrated disadvantage. | |
res_stab | Tract level factor measure of residential stability. | |
imm_conc | Tract level factor measure of immigrant concentration. | |
sim_pik | Simulated binary indicator of PIK assignment. | No PIK assigned (0) PIK assigned (1) |
nola_rec | Binary indicator for record from Orleans Parish. | Not Orleans Parish Record (0) Orleans Parish Record (1) |
id | Unique identifier for each individual record. | |
As previously mentioned, we will generate weights that ensure individuals with PIKs are representative of their geographic region. To keep the computational time of this vignette down, we focus only on a subset of Orleans Parish. First, we load the twangRDC
package and our simulated dataset.
library(twangRDC)
#> Loading required package: xgboost
#> Loading required package: data.table
#> Loading required package: ggplot2
data(nola_south)
Next, it is important that the variables of the dataset are coded as intended. In this case, we convert several of the variables to factors.
# factors need to be coded as such
nola_south$metarea = as.factor(nola_south$metarea)
nola_south$c00_age12 = as.factor(nola_south$c00_age12)
nola_south$c00_race = as.factor(nola_south$c00_race)
In a final data preparation step, we limit the dataset to Orleans Parish and keep only 10 of its Census tracts.
# only consider Orleans parish
nola_only = subset(nola_south , metarea==556)
# keep only 10 tracts for computational speed
to.keep = unique(nola_only$tract_id_str)[1:10]
nola_only = nola_only[ nola_only$tract_id_str %in% to.keep, ]
In this case, we wish to generate weights to ensure that individuals with PIKs are representative of their entire Census tract. That is, for each Census tract, we want to generate weights such that the observations with PIKs within the Census tract are representative of the tract’s population. This is achieved by specifying population=TRUE
, which tells the function that the full data is representative of the underlying population, and strata="tract_id_str"
, which tells the function that we wish to generate weights that are representative within strata defined by Census tracts. The formula sim_pik ~ c00_age12 + c00_race + c00_sex
identifies the PIK indicator on the left hand side and the characteristics that we want to consider when estimating the weights on the right hand side.
# set the model parameters
params = list(eta = .01 , max_depth = 10 , subsample = 1 ,
max_delta_step=0 , gamma=0 , lambda=0 , alpha=0,
min_child_weight=5 , objective = "binary:logistic")
# fit the xgboost model
res.pik = ps.xgb(sim_pik ~ c00_age12 + c00_race + c00_sex ,
strata="tract_id_str",
data=nola_only,
params=params,
max.steps=10,
iters.per.step=500,
min.iter=500,
id.var="id",
population = TRUE)
The parameters of the underlying xgboost
model are specified in params
. These were described in detail in a previous section. The id.var="id"
provides a unique identifier for observations such that the generated weights can easily be merged back in with the original data. The other options specified in the ps.xgb
function control how frequently the algorithm checks for convergence and how many iterations should be considered before stopping. First, iters.per.step=500 tells the algorithm to evaluate convergence only at every 500th iteration. Larger values improve computational time by reducing the number of balance evaluations, while smaller values may achieve slightly better balance. Next, min.iter=500 tells the algorithm that at least 500 iterations must be used before stopping for convergence. Larger values ensure that more complex models are evaluated before determining the optimal set of weights. Finally, max.steps=10 indicates that the algorithm should evaluate the balance of the weights at most 10 times before stopping. The maximum number of iterations of the xgboost model is given by max.steps*iters.per.step, which in this case is 5000. In general, this value should be large to ensure that the optimal set of weights is achieved. The default value is max.steps=Inf, which will continue adding iterations to the model until the convergence criterion is met. Due to computational concerns, we recommend testing your code with values of iters.per.step and max.steps such that the total number of iterations is small (1,000 to 10,000). Once you have determined the model is working as intended, set max.steps to Inf.
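To make that recommendation concrete, here is a sketch of the two-stage workflow using the same arguments as the call above; test.pik and final.pik are just illustrative object names.

# test run: small total iteration budget (max.steps * iters.per.step = 5000)
test.pik = ps.xgb(sim_pik ~ c00_age12 + c00_race + c00_sex ,
                  strata="tract_id_str", data=nola_only, params=params,
                  max.steps=10, iters.per.step=500, min.iter=500,
                  id.var="id", population = TRUE)

# final run: add iterations until the convergence criterion is met
final.pik = ps.xgb(sim_pik ~ c00_age12 + c00_race + c00_sex ,
                   strata="tract_id_str", data=nola_only, params=params,
                   max.steps=Inf, iters.per.step=500, min.iter=500,
                   id.var="id", population = TRUE)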
Now that the weights have been estimated, we can evaluate their quality. First, we need to ensure that a sufficient number of iterations have been used such that the balance criterion is minimized.
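The convergence plot in Figure 1 can presumably be produced with the fitted object's plot method, as sketched below.

# plot the balance criterion versus the number of iterations
plot(res.pik)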
Figure 1: Balance criterion versus number of iterations.
The plot
function provides a plot of the balance criterion, which in this case is the average of the strata-specific maximum standardized differences of the covariates, versus the number of iterations. This figure is used to verify that the algorithm has run for a sufficient number of iterations such that a clear minimum has been achieved.
Now that it has been determined that we have achieved convergence, we can assess the quality of the weights using balance tables. First, the bal.table
function will produce the population mean for each covariate, as well as the unweighted and the weighted mean among those without missing data (here, those with PIK). It also provides the standardized difference for each covariate before and after weighting. The goal is for the standardized differences after weighting to all be close to zero.
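The table below can presumably be reproduced with a call like the following sketch.

# overall balance table for the PIK weights
bal.table(res.pik)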
 | Population Mean | Unadjusted Mean | Adjusted Mean | Unadjusted Standardized Difference | Adjusted Standardized Difference |
---|---|---|---|---|---|
c00_age12:1 | 0.0356422 | 0.0666667 | 0.2750985 | -0.1673414 | -1.2915909 |
c00_age12:2 | 0.0954780 | 0.0666667 | -0.1734456 | 0.0980398 | 0.9150981 |
c00_age12:3 | 0.4090560 | 0.0666667 | 0.2111771 | 0.6963951 | 0.4024714 |
c00_age12:4 | 0.0456368 | 0.0000000 | 0.0000000 | 0.2186757 | 0.2186757 |
c00_age12:5 | 0.0365216 | 0.0666667 | 0.0552749 | -0.1607017 | -0.0999729 |
c00_age12:6 | 0.0508074 | 0.0000000 | 0.0000000 | 0.2313590 | 0.2313590 |
c00_age12:7 | 0.0796625 | 0.2000000 | 0.2048223 | -0.4444268 | -0.4622366 |
c00_age12:8 | 0.1028669 | 0.0000000 | 0.0000000 | 0.3386175 | 0.3386175 |
c00_age12:9 | 0.0566212 | 0.2666667 | 0.2168434 | -0.9088254 | -0.6932501 |
c00_age12:10 | 0.0347675 | 0.1333333 | 0.1014885 | -0.5380512 | -0.3642167 |
c00_age12:11 | 0.0367022 | 0.1333333 | 0.1087408 | -0.5139146 | -0.3831240 |
c00_age12:12 | 0.0162376 | 0.0000000 | 0.0000000 | 0.1284743 | 0.1284743 |
c00_race:1 | 0.0813496 | 0.4000000 | 0.4810458 | -1.1656329 | -1.4621011 |
c00_race:2 | 0.8229004 | 0.6000000 | 0.5189542 | 0.5838867 | 0.7961859 |
c00_race:3 | 0.0060074 | 0.0000000 | 0.0000000 | 0.0777412 | 0.0777412 |
c00_race:5 | 0.0166267 | 0.0000000 | 0.0000000 | 0.1300301 | 0.1300301 |
c00_race:6 | 0.0731159 | 0.0000000 | 0.0000000 | 0.2808621 | 0.2808621 |
c00_sex | 0.2898852 | 0.6666667 | 0.8101164 | -0.8294560 | -1.1452498 |
Next, balance can be assessed within Census tract since our goal for this model was to generate weights such that observations with PIKs are representative of their Census tract. The type='strata'
option tells bal.table
to provide the maximum absolute standardized difference by the strata variable, in this case the Census tract. We additionally specify include.var=TRUE
, which identifies the covariate that the maximum absolute standardized difference corresponds to, decreasing = T
, which orders the strata in decreasing order of their weighted standardized differences, and n=3
, which only prints the top three strata.
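A call along the lines of the following sketch, using the options just described, produces the table below.

# worst three tracts by adjusted maximum standardized difference
bal.table(res.pik, type='strata', include.var=TRUE, decreasing=TRUE, n=3)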
 | Stratum | Unadjusted Maximum Standardized Difference | Adjusted Maximum Standardized Difference | Variable |
---|---|---|---|---|
9 | 22071010100 | 0.1731436 | 6.433743 | c00_age12:3 |
1 | 22071001722 | 0.2900086 | 4.518877 | c00_age12:3 |
8 | 22071004900 | 0.3273268 | 3.898170 | c00_age12:3 |
This table allows us to assess which strata had the worst balance after weighting, and among those strata, which covariates were problematic.
The get.weights
function extracts the weights at the optimal iteration. The resulting data contains the weights and the ID variable specified in id.var
. The weights can then be merged back in with the original data using the id.var
. Note that the base R merge function is slow compared to modern alternatives. If your data is large, consider data.table
or dplyr
.
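A sketch of extracting and merging the weights is shown below; the name of the weight column returned by get.weights is not shown in this vignette, so the merge keys only on the id variable.

# extract weights at the optimal iteration; the result contains id and the weights
w = get.weights(res.pik)

# merge the weights back onto the analysis data by the id variable
nola_only = merge(nola_only, w, by="id")

# for large data, a data.table join is typically much faster, e.g.:
# library(data.table)
# setDT(nola_only); setDT(w)
# nola_only = w[nola_only, on="id"]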
Our second example use of twangRDC
will generate a comparison group for Orleans Parish consisting of residents of other southern metropolitan areas. The steps of the process remain the same as the PIK weighting example, but with minor adjustments to the interpretation and specification of the model. First, population = FALSE
specifies that we are no longer interested in weights that represent a population, but instead, we wish to generate a comparison group using propensity score weights. Note that ps.xgb
only estimates weights for the average treatment effect on the treated, with the treatment group identified by records coded 1 on the left hand side of the formula. Our formula now includes three tract-level summaries that we wish to balance in addition to the three person-level characteristics. Finally, we no longer specify a stratification variable, as we are not attempting to create a comparison group for each tract of Orleans Parish, but instead a single comparison group that is representative of Orleans Parish as a whole.
# set the model parameters
params = list(eta = .1 , max_depth = 10 , subsample = 1 ,
max_delta_step=0 , gamma=0 , lambda=0 , alpha=0,
min_child_weight=5 , objective = "binary:logistic")
# fit the xgboost model
res.ps = ps.xgb(nola_rec ~ c00_age12 + c00_race + c00_sex + concdis + res_stab + imm_conc ,
data=nola_south,
params=params,
max.steps=10,
iters.per.step=500,
min.iter=500,
id.var="id",
population = FALSE)
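As in the first example, the convergence plot in Figure 2 and the balance tables that follow can presumably be produced with the same diagnostic functions; a sketch:

# check convergence of the propensity score model
plot(res.ps)

# overall balance table for the comparison group weights
bal.table(res.ps)

# maximum standardized differences; with no strata a single combined row is reported
bal.table(res.ps, type='strata', include.var=TRUE)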
Figure 2: Balance criterion versus number of iterations for the propensity score model.
 | Population Mean | Unadjusted Mean | Adjusted Mean | Unadjusted Standardized Difference | Adjusted Standardized Difference |
---|---|---|---|---|---|
c00_age12:1 | 0.0356422 | 0.0666667 | 0.2750985 | -0.1673414 | -1.2915909 |
c00_age12:2 | 0.0954780 | 0.0666667 | -0.1734456 | 0.0980398 | 0.9150981 |
c00_age12:3 | 0.4090560 | 0.0666667 | 0.2111771 | 0.6963951 | 0.4024714 |
c00_age12:4 | 0.0456368 | 0.0000000 | 0.0000000 | 0.2186757 | 0.2186757 |
c00_age12:5 | 0.0365216 | 0.0666667 | 0.0552749 | -0.1607017 | -0.0999729 |
c00_age12:6 | 0.0508074 | 0.0000000 | 0.0000000 | 0.2313590 | 0.2313590 |
c00_age12:7 | 0.0796625 | 0.2000000 | 0.2048223 | -0.4444268 | -0.4622366 |
c00_age12:8 | 0.1028669 | 0.0000000 | 0.0000000 | 0.3386175 | 0.3386175 |
c00_age12:9 | 0.0566212 | 0.2666667 | 0.2168434 | -0.9088254 | -0.6932501 |
c00_age12:10 | 0.0347675 | 0.1333333 | 0.1014885 | -0.5380512 | -0.3642167 |
c00_age12:11 | 0.0367022 | 0.1333333 | 0.1087408 | -0.5139146 | -0.3831240 |
c00_age12:12 | 0.0162376 | 0.0000000 | 0.0000000 | 0.1284743 | 0.1284743 |
c00_race:1 | 0.0813496 | 0.4000000 | 0.4810458 | -1.1656329 | -1.4621011 |
c00_race:2 | 0.8229004 | 0.6000000 | 0.5189542 | 0.5838867 | 0.7961859 |
c00_race:3 | 0.0060074 | 0.0000000 | 0.0000000 | 0.0777412 | 0.0777412 |
c00_race:5 | 0.0166267 | 0.0000000 | 0.0000000 | 0.1300301 | 0.1300301 |
c00_race:6 | 0.0731159 | 0.0000000 | 0.0000000 | 0.2808621 | 0.2808621 |
c00_sex | 0.2898852 | 0.6666667 | 0.8101164 | -0.8294560 | -1.1452498 |
Stratum | Unadjusted Maximum Standardized Difference | Adjusted Maximum Standardized Difference | Variable |
---|---|---|---|
Combined | 3.069672 | 3.120905 | imm_conc |
[1] Wagner, Deborah, and Mary Layne. The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications’ (CARRA) Record Linkage Software. US Census Bureau, Center for Administrative Records Research and Applications Working Paper #2014-01, 2014.

[3] Ruggles, Steven, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0

[4] Manson, Steven, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 14.0 [database]. Minneapolis, MN: IPUMS, 2019. http://doi.org/10.18128/D050.V14.0

[5] Bond, Brittany, J. David Brown, Adela Luque, and Amy O’Hara. The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the ACS. US Census Bureau, Center for Administrative Records Research and Applications Working Paper #2014-08, 2014.