sparrpowR: Power Analysis to Detect Spatial Relative Risk Clusters

Ian D. Buller (GitHub: @idblr)

2023-02-01

Start with the necessary packages and seed for the vignette.

loadedPackages <- c("geojsonsf", "ggmap", "ggplot2", "graphics", "grDevices", "sf", "sparrpowR", "spatstat.geom", "terra", "tidyterra")
invisible(lapply(loadedPackages, library, character.only = TRUE))
set.seed(1234) # for reproducibility

Import data from Open Data DC website.

# Washington, D.C. boundary
gis_path1 <- "https://opendata.arcgis.com/datasets/7241f6d500b44288ad983f0942b39663_10.geojson"
dc <- geojsonsf::geojson_sf(gis_path1)

# American Community Survey 2018 Census Tracts
gis_path2 <- "https://opendata.arcgis.com/datasets/faea4d66e7134e57bf8566197f25b3a8_0.geojson"
census <- geojsonsf::geojson_sf(gis_path2)

We want to create a realistic boundary (i.e., polygon) of our study area. We are going to spatially clip our DC boundary by the census tracts in an attempt to remove major bodies of water where people do not reside.

clipwin <- sf::st_union(census)
dcc <- sf::st_intersection(dc, clipwin)
# Plot
plot(sf::st_geometry(dc), main = "DC Boundary")
plot(sf::st_geometry(census),  main = "American Community Survey\n2018")
plot(sf::st_geometry(dcc), main = "Clipped Boundary")

Our developed method, sparrpowR, relies on the spatstat package suite to simulate data, which assumes point locations are on a planar (i.e., flat) surface. Our boundary is made up of geographical coordinates on Earth (i.e., a sphere), so we need to flatten our boundary by spatially projecting it with an appropriate spatial reference system (SRS). For the District of Columbia, we are going to use the World Geodetic System 1984 (WGS84) Universal Transverse Mercator (UTM) Zone 18N EPSG:32619. We then convert the boundary into a owin object required by the spatstat.geom package.

dcp <- sf::st_transform(sf::st_as_sf(dcc), crs = sf::st_crs(32618))
dco <- spatstat.geom::as.owin(sf::st_geometry(dcp))

In this hypothetical example, we want to estimate the local power of detecting a spatial case cluster relative to control locations. Study participants that tested positive for a disease (i.e., cases) are hypothesized to be located in a circular area around the Navy Yard, an Environmental Protection Agency (EPA) Superfund Site (see the success story).

navy <- data.frame(lon = 326414.70444451, lat = 4304571.1539442)
spf <- sf::st_as_sf(navy, coords = c("lon", "lat"), crs = sf::st_crs(32618))
# Plot
plot(sf::st_geometry(dcp), main = "Location of Hypothetical\nDisease Cluster")
plot(spf, col = "magenta", add = T, pch = 4, cex = 2)
graphics::legend("bottom", xpd = T, y.intersp = -1.5, legend = c("Navy Yard"), col = "magenta", pch = 4, cex = 1, bty = "n")

We will assume that approximately 50 cases (e.g., n_case = 50) were clustered around the center of the Navy Yard (e.g., samp_case = "MVN") with more cases near the center and fewer cases about 1 kilometers away (e.g., s_case = 1000).

If we were to conduct a study, where would we be sufficiently statistically powered to detect this spatial case cluster? To answer this question we will randomly sample approximately 950 participants (e.g., n_conrol = 950 or 5% disease prevalence) around the Navy Yard (e.g., samp_control = "MVN") sampling more participants near the center and fewer participants about 2 kilometers away (e.g., s_control = 2000). These participants would test negative for a disease (i.e., controls) were we to conduct a study. We can then resample control locations iteratively, as if we conducted the same study multiple times (e.g., sim_total = 100). We can conclude that we are sufficiently powered to detect a spatial clustering in areas where a statistically significant spatial case cluster was located in at least 80% (e.g., p_thresh = 0.8) of these theoretical studies. The spatial_power() function calculates both a one-tailed, lower tailed hypothesis (i.e., case clustering only) and a two-tailed hypothesis (i.e., case and control clustering). Use the cascon argument in the spatial_plots() function to plot either test.

start_time <- Sys.time() # record start time
sim_power <- sparrpowR::spatial_power(x_case = navy[[1]], y_case = navy[[2]], # center of cluster
                                      x_control = navy[[1]], y_control = navy[[2]], # center of cluster
                                      n_case = 50, n_control = 950, # sample size of case/control
                                      samp_case = "MVN", samp_control = "MVN", # samplers
                                      s_case = 1000, s_control = 2000, # approximate size of clusters
                                      alpha = 0.05, # critical p-value 
                                      sim_total = 100, # number of iterations
                                      win = dco, # study area
                                      resolution = 100, # number gridded knots on x-axis
                                      edge = "diggle", # correct for edge effects
                                      adapt = FALSE, # fixed-bandwidth
                                      h0 = NULL, # automatically select bandwidth for each iteration
                                      verbose = FALSE) # no printout
end_time <- Sys.time() # record end time
time_srr <- end_time - start_time # Calculate run time

The process above took about 10.5 minutes to run. Of the 100 iterations, we simulated 40 case locations and an average 766 (SD: 11.61) control locations for an average prevalence of 5.22% (SD: 0.08%). The average bandwidth for the statistic was 0.8 kilometers (SD: 0.01). Fewer case locations and controls locations were simulated than specified in the inputs due to being placed outside of our study window (i.e., Maryland, Virginia, or in the water features around the District of Columbia). Users can modify their inputs to achieve the correct number of cases and controls in their output.

We plot the statistical power for a one-tailed, lower-tail hypothesis (cascon = FALSE) at alpha = 0.05 using the spatial_plots() function.

cols <- c("deepskyblue", "springgreen", "red", "navyblue") # colors for plots
chars <- c(4,5) # symbols for point-locations
sizes <- c(0.5,0.5) # size of point-locations
p_thresh <- 0.8 # 80% of iterations with statistically significant results

## Data Visualization of Input and Power
sparrpowR::spatial_plots(input = sim_power, # use output of above simulation
                         p_thresh = p_thresh, # power cut-off
                         cascon = FALSE, # one-tail, lower tail hypothesis test (i.e., case clustering)
                         plot_pts = TRUE, # display the points in the second plot
                         chars = chars, # case, control
                         sizes = sizes, # case, control
                         cols = cols) # colors of plot

Now, lets overlay our results on top of a basemap. Here, we will use an open-source map from Stamen, that is unprojected in WGS84. We extract the rectangular box (i.e., bounding box) surrounding our polygon boundary of the District of Columbia (WGS84).

dcbb <- sf::st_bbox(sf::st_buffer(sf::st_as_sf(dc), dist = 0.015))
dcbbm <- matrix(dcbb, nrow = 2)
base_map <- ggmap::get_map(location = dcbbm, maptype = "terrain", source = "stamen")

Prepare the points from the first simulation for plotting in ggplot2 suite and prepare the original boundary for plotting in ggplot2 suite.

sim_pts <- sim_power$sim  # extract points from first iteration
sim_pts <- sf::st_as_sf(sim_pts) # convert to simple features
names(sim_pts)[1] <- "mark"
sf::st_crs(sim_pts) <- sf::st_crs(32618)
sim_pts_wgs84 <-  sf::st_transform(sim_pts, crs = sf::st_crs(4326)) # project to basemap

Prepare the SpatRaster from the simulation for plotting in ggplot2 suite.

pvalprop <- data.frame(x = sim_power$rx,
                          y = sim_power$ry,
                          z = sim_power$pval_prop_cas) # extract proportion significant
lrr_narm <- na.omit(pvalprop) # remove NAs
pvalprop_raster <- terra::rast(lrr_narm) # convert to SpatRaster
rm(pvalprop, lrr_narm) # conserve memory
terra::crs(pvalprop_raster) <- terra::crs(dcp) # set output project (UTM 18N)
pvalprop_raster <- terra::project(pvalprop_raster, dc) # unproject (WGS84)
rampcols <- grDevices::colorRampPalette(colors = c(cols[1], cols[2]), space="Lab")(length(terra::values(pvalprop_raster))) # set colorramp

Plot local power as a continuous outcome with point-locations using the ggplot2 suite.

ggmap::ggmap(base_map) + # basemap
  ggplot2::geom_sf(data = dcc, # original boundary,
                        fill = "transparent",
                        colour = "black",
                        inherit.aes = FALSE) +
  tidyterra::geom_spatraster(data = pvalprop_raster, # output SpatRaster
                             size = 0,
                             alpha = 0.5) +
  ggplot2::scale_fill_gradientn(colours = rampcols, na.value = NA) + # colors for SpatRaster
  ggplot2::geom_sf(data = sim_pts_wgs84[-1, ], # simulated point-locations
                  ggplot2::aes(color = mark, shape = mark),
                      alpha = 0.8,
                  inherit.aes = FALSE) + 
  ggplot2::scale_color_manual(values = cols[3:4]) + # fill of point-locations
  ggplot2::scale_shape_manual(values = chars) + # shape of point-locations
  ggplot2::labs(x = "", y = "", fill = "Power", color = "", shape = "") # legend labels

Plot local power as a categorical outcome with point-locations using the ggplot2 suite.

pvalprop_reclass <- pvalprop_raster
terra::values(pvalprop_reclass) <- cut(terra::values(pvalprop_raster), c(-Inf, p_thresh, Inf))

ggmap::ggmap(base_map) + # basemap 
  ggplot2::geom_sf(data = dcc, # original boundary,
                   fill = "transparent",
                   colour = "black",
                   inherit.aes = FALSE) +
  tidyterra::geom_spatraster(data = pvalprop_reclass, # output SpatRaster
                             size = 0,
                             alpha = 0.5) +
  ggplot2::scale_fill_manual(values = cols[c(1,2)],
                             labels = c("insufficient", "sufficient"),
                             na.translate = FALSE, 
                             na.value = NA) + # colors for SpatRaster
  ggplot2::labs(x = "", y = "", fill = "Power") # legend labels

Based on 100 iterations of multivariate normal sampling of approximately 766 control participants focused around the Navy Yard, we are sufficiently powered to detect the disease cluster in the Navy Yard area.

Advanced Features

We provide functionality to run the spatial_power() with parallel processing to speed up computation (parallel = TRUE). Parallelization is accomplished with the doFuture package, the future::multisession plan, and the %dorng% operator for the foreach package to produce reproducible results. (Note: simpler windows, such as unit circles, require substantially less computational resources.)

We also provide functionality to correct for multiple testing. A hypothesis is tested at each gridded knot and the tests are spatially correlated by nature. With the p_correct argument you can choose a multiple testing correction. The most conservative, p_correct = "Bonferroni" and p_correct = "Sidak", apply corrections that assumes independent tests, which are likely not appropriate for this setting but we include to allow for sensitivity tests. The p_correct = "FDR" applies a False Discovery Rate for the critical p-value that is not as conservative as the other two options.

Here, we use the same example as above, conducted in parallel with a False Discovery Rate procedure.

set.seed(1234) # reset RNG for reproducibility with previous run
start_time <- Sys.time() # record start time
sim_power <- sparrpowR::spatial_power(x_case = navy[[1]], y_case = navy[[2]], # center of cluster
                                      x_control = navy[[1]], y_control = navy[[2]], # center of cluster
                                      n_case = 50, n_control = 950, # sample size of case/control
                                      samp_case = "MVN", samp_control = "MVN", # samplers
                                      s_case = 1000, s_control = 2000, # approximate size of clusters
                                      alpha = 0.05, # critical p-value 
                                      sim_total = 100, # number of iterations
                                      win = dco, # study area
                                      resolution = 100, # number gridded knots on x-axis
                                      edge = "diggle", # correct for edge effects
                                      adapt = FALSE, # fixed-bandwidth
                                      h0 = NULL, # automatically select bandwidth for each iteration
                                      verbose = FALSE, # no printout
                                      parallel = TRUE, # Run in parallel
                                      n_core = 5, # Use 5 cores (depends on your system, default = 2)
                                      p_correct = "FDR") # use a correction for multiple testing (False Discovery Rate)
end_time <- Sys.time() # record end time
time_srr <- end_time - start_time # Calculate run time

cols <- c("deepskyblue", "springgreen", "red", "navyblue") # colors for plots
chars <- c(4,5) # symbols for point-locations
sizes <- c(0.5,0.5) # size of point-locations
p_thresh <- 0.8 # 80% of iterations with statistically significant results

## Data Visualization of Input and Power
sparrpowR::spatial_plots(input = sim_power, # use output of above simulation
                         p_thresh = p_thresh, # power cut-off
                         cascon = FALSE, # one-tail, lower tail hypothesis test (i.e., case clustering)
                         plot_pts = FALSE, # display the points in the second plot
                         chars = chars, # case, control
                         sizes = sizes, # case, control
                         cols = cols) # colors of plot

The process above took about 5 minutes to run, which is shorter than the first example. The zone with sufficient power to detect a case cluster is slightly smaller than the first example, too, due to the multiple testing correction.

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidyterra_0.3.1     terra_1.7-3         spatstat.geom_3.0-6
##  [4] spatstat.data_3.0-0 sparrpowR_0.2.7     sf_1.0-9           
##  [7] ggmap_3.0.1         ggplot2_3.4.0       geojsonsf_2.0.3    
## [10] knitr_1.42         
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-157           bitops_1.0-7           spatstat.sparse_3.0-0 
##   [4] doParallel_1.0.17      httr_1.4.4             tools_4.2.1           
##   [7] doRNG_1.8.6            bslib_0.4.2            utf8_1.2.2            
##  [10] R6_2.5.1               rpart_4.1.19           KernSmooth_2.23-20    
##  [13] mgcv_1.8-41            DBI_1.1.3              colorspace_2.1-0      
##  [16] withr_2.5.0            sp_1.6-0               tidyselect_1.2.0      
##  [19] gridExtra_2.3          curl_5.0.0             compiler_4.2.1        
##  [22] cli_3.6.0              Cairo_1.6-0            spatstat.explore_3.0-6
##  [25] labeling_0.4.2         sass_0.4.4             scales_1.2.1          
##  [28] classInt_0.4-8         proxy_0.4-27           spatstat_3.0-3        
##  [31] goftest_1.2-3          stringr_1.5.0          digest_0.6.31         
##  [34] spatstat.utils_3.0-1   rmarkdown_2.20         jpeg_0.1-10           
##  [37] pkgconfig_2.0.3        htmltools_0.5.4        parallelly_1.34.0     
##  [40] highr_0.10             fastmap_1.1.0          maps_3.4.1            
##  [43] rlang_1.0.6            rstudioapi_0.14        farver_2.1.1          
##  [46] jquerylib_0.1.4        generics_0.1.3         jsonlite_1.8.4        
##  [49] spatstat.random_3.1-3  dplyr_1.1.0            magrittr_2.0.3        
##  [52] s2_1.1.2               spatstat.linnet_3.0-4  dotCall64_1.0-2       
##  [55] Matrix_1.4-1           Rcpp_1.0.10            munsell_0.5.0         
##  [58] fansi_1.0.4            abind_1.4-5            viridis_0.6.2         
##  [61] lifecycle_1.0.3        stringi_1.7.12         yaml_2.3.6            
##  [64] plyr_1.8.8             misc3d_0.9-1           grid_4.2.1            
##  [67] parallel_4.2.1         listenv_0.9.0          deldir_1.0-6          
##  [70] lattice_0.20-45        splines_4.2.1          tensor_1.5            
##  [73] pillar_1.8.1           tcltk_4.2.1            rngtools_1.5.2        
##  [76] sparr_2.2-17           codetools_0.2-18       wk_0.7.1              
##  [79] glue_1.6.2             evaluate_0.20          doFuture_0.12.2       
##  [82] data.table_1.14.6      png_0.1-8              vctrs_0.5.2           
##  [85] spam_2.9-1             foreach_1.5.2          RgoogleMaps_1.4.5.3   
##  [88] gtable_0.3.1           purrr_1.0.1            polyclip_1.10-4       
##  [91] tidyr_1.3.0            future_1.31.0          cachem_1.0.6          
##  [94] xfun_0.36              e1071_1.7-12           class_7.3-20          
##  [97] viridisLite_0.4.1      tibble_3.1.8           spatstat.model_3.1-2  
## [100] iterators_1.0.14       fields_14.1            units_0.8-1           
## [103] globals_0.16.2