split and combine_smsm

Introduction

The Introduction to synthACS briefly mentions the split and combine_smsm functionality in Sections 3.2 and 3.4 respectively. There, we note that deriving the sample synthetic micro data is a memory intensive process and advise using synthACS on a high performance machine. Of course, such a machine is not always available, which is when split and combine_smsm are needed.

A brief illustration of these two functions is provided in this vignette. The same example data is used as in the introductory vignette:

library(data.table)
library(acs)
library(synthACS)
library(retry)

ca_geo <- geo.make(state = "CA", county = "*")
ca_dat_SMSM <- pull_synth_data(2014, 5, ca_geo)

Overview of split() and combine_smsm()

The split and combine_smsm functions are used, respectively, to reduce the computational requirements of a large spatial microsimulation task into a set of smaller tasks and to recombine the results. They enable the well known “split-apply-combine” strategy for Data Analysis (Wickham 2011). In this case, the “apply” step is intentionally performed sequentially and not inside another function in order to minimize RAM usage and enable a garbage-collection step between intensive in-memory function calls.

The syntax for both is straightforward:

split takes a larger macroASC class object and splits it into n_splits smaller macroACS objects. Similarly combine_smsm takes several smaller smsm_set objects and combines them into a single, larger, smsm_set class object.

Example Code

An example of this is provided below:

# split()
n_splits <- 20
split_ca_dat <- split(ca_dat_SMSM, n_splits = n_splits)
tmp_opts <- vector("list", length= n_splits)

for (i in 1:n_splits) {
    # Section 3.3 of introduction: SMSM via simulated annealing
    # derive synthetic datasets  
    tmp_synth <- derive_synth_datasets(split_ca_dat[[i]], leave_cores = 0)
    
    # create constraints for simulated annealing
    a <- all_geog_constraint_age(tmp_synth, method = "macro.table")
    g <- all_geog_constraint_gender(tmp_synth, method = "macro.table")
    m <- all_geog_constraint_marital_status(tmp_synth, method = "macro.table")
    r <- all_geog_constraint_race(tmp_synth, method = "synthetic")
    e <- all_geog_constraint_edu(tmp_synth, method = "synthetic")
    
    cll <- all_geogs_add_constraint(attr_name = "age", attr_total_list = a, 
                                    macro_micro = tmp_synth)
    cll <- all_geogs_add_constraint(attr_name = "gender", attr_total_list = g, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    cll <- all_geogs_add_constraint(attr_name = "marital_status", attr_total_list = m, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    cll <- all_geogs_add_constraint(attr_name = "race", attr_total_list = r, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    cll <- all_geogs_add_constraint(attr_name = "edu_attain", attr_total_list = e, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    
    # anneal
    tmp_opts[[i]] <- all_geog_optimize_microdata(tmp_synth, seed = 6550L, verbose = TRUE,
                                          constraint_list_list = cll, p_accept = 0.4, max_iter = 10000L)
}

# create the string needed for combine_smsm(). 
paste0("tmp_opts[[", 1:n_splits, "]]", sep= ", ", collapse= "")
# [1] "tmp_opts[[1]], tmp_opts[[2]], tmp_opts[[3]], tmp_opts[[4]], tmp_opts[[5]], 
# tmp_opts[[6]], tmp_opts[[7]], tmp_opts[[8]], tmp_opts[[9]], tmp_opts[[10]], 
# tmp_opts[[11]], tmp_opts[[12]], tmp_opts[[13]], tmp_opts[[14]], tmp_opts[[15]], 
# tmp_opts[[16]], tmp_opts[[17]], tmp_opts[[18]], tmp_opts[[19]], tmp_opts[[20]], "

# copy and paste the resulting string, excluding the final trailing comma
opt_ca <- combine_smsm(tmp_opts[[1]], tmp_opts[[2]], tmp_opts[[3]], tmp_opts[[4]], tmp_opts[[5]], 
                       tmp_opts[[6]], tmp_opts[[7]], tmp_opts[[8]], tmp_opts[[9]], tmp_opts[[10]], 
                       tmp_opts[[11]], tmp_opts[[12]], tmp_opts[[13]], tmp_opts[[14]], 
                       tmp_opts[[15]], tmp_opts[[16]], tmp_opts[[17]], tmp_opts[[18]], 
                       tmp_opts[[19]], tmp_opts[[20]])

References

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. https://www.jstatsoft.org/v40/i01/.