Developer Notes

library(zctaCrosswalk)
library(dplyr)

While creating this package I was acutely aware that ZCTAs change frequently. For example, back in 2016 I created a similar package called choroplethrZip. That package is now out of date, because the underlying data it stores became out of date. I expect that something similar will eventually happen with this package.

This vignette is written as a “note to my future self” in case I wind up needing to write a similar package again in the future. It is also intended to increase the number of people who understand how to create packages like this.

?zcta_crosswalk

The core data structure in this package is ?zcta_crosswalk:

data(zcta_crosswalk)

print(zcta_crosswalk, n = 5)
#> # A tibble: 46,960 × 9
#>   zcta  zcta_numeric state_name  state_usps state_fips state_fips_numeric
#>   <chr>        <int> <chr>       <chr>      <chr>                   <int>
#> 1 00601          601 puerto rico PR         72                         72
#> 2 00601          601 puerto rico PR         72                         72
#> 3 00602          602 puerto rico PR         72                         72
#> 4 00602          602 puerto rico PR         72                         72
#> 5 00603          603 puerto rico PR         72                         72
#> # ℹ 46,955 more rows
#> # ℹ 3 more variables: county_name <chr>, county_fips <chr>,
#> #   county_fips_numeric <int>

Most of the effort in creating this package was spent creating this data structure. Let’s see how it was created.

?get_zcta_crosswalk

Start by looking at the contents of the function ?get_zcta_crosswalk:

get_zcta_crosswalk = function() {

  url = "https://www2.census.gov/geo/docs/maps-data/data/rel2020/zcta520/tab20_zcta520_county20_natl.txt"
  zcta_crosswalk = read_delim(file = url, delim = "|")

  # Select and rename columns
  zcta_crosswalk = zcta_crosswalk |>
    rename(zcta        = .data$GEOID_ZCTA5_20,
           county_fips = .data$GEOID_COUNTY_20,
           county_name = .data$NAMELSAD_COUNTY_20) |>
    select(.data$zcta, .data$county_fips, .data$county_name)

  # 1. The county FIPS is always 5 characters. And the first 2 characters always
  # indicate the state. See https://en.wikipedia.org/wiki/FIPS_county_code.
  # Breaking out the state allows for easier state selection later.
  # 2. This file has all counties, some of which do not have a ZCTA. Remove
  # those counties.
  zcta_crosswalk |>
    mutate(state_fips = str_sub(.data$county_fips, 1, 2)) |>
    filter(!is.na(.data$zcta))
}

The function reads and transforms the contents of a URL. At the time of this writing that is the URL for the Census Bureau’s “2010 ZCTA to County Relationship File”, a file which I mentioned earlier.

This means that if Census publishes an updated dataset in the same format tomorrow you could just change the URL, rerun the code and get the updated data in R. (Note that I do not know when Census plans to update this dataset or whether they plan to publish it in the same format.)

If you open the URL referenced in ?get_zcta_crosswalk in a browser you will see rows like this:

221704258470394|90210|ZCTA5 90210|27823432|153478|G6350|B5|S|275901063468976|06037|Los Angeles County|10513491099|1787501506|G4020|H1|A|27823432|153478

This tells us that ZCTA 90210 is in Los Angeles County. It also tells us that Los Angles County has FIPS Code 06037.

Adding in State Information

Unfortunately, the file does not directly contain any state information. And since I wanted to run queries like “Get all ZCTAs in a given state”, I needed to add that in.

I started by splitting out the first two characters of each County FIPS Code into a new column called state_fips. This allows a user to search for ZCTAs in a state if they know the state’s FIPS code.

However, this does not help us if users want to select a state by it’s name or Postal Code Abbreviation. To address this limitation I created a new dataframe called state_names, and used it to join against the results of ?get_zcta_crosswalk:

data(state_names)

print(state_names, n = 5)
#> # A tibble: 56 × 4
#>   full       usps  fips_numeric fips_character
#>   <chr>      <chr>        <int> <chr>         
#> 1 alaska     AK               2 02            
#> 2 alabama    AL               1 01            
#> 3 arkansas   AR               5 05            
#> 4 arizona    AZ               4 04            
#> 5 california CA               6 06            
#> # ℹ 51 more rows

One thing to keep in mind is that while there are technically only 50 states, “state” in this dataset really means “any top level administrative region”. This dataset contains 56 states (the extra ones are: the District of Columbia, Puerto Rico, US Virigin Islands, American Samoa, Guam and the Northern Mariana Islands).

I believe that it would be useful for R to have a standalone package that contains a data frame like this for all FIPS codes. I did not break state_names out into a separate package because even though it has 56 state-level entities, the full list is much larger.

The code I used to generate state_names is in inst/gen_state_states.R.

Note that while R has two built-in vectors that deal with state names (state.abb and state.name), they cannot help us here because: (1) they do not contain FIPS codes and (2) they only contain 50 states.

Learning About ZCTAs

If you would like to learn more about ZCTAs (including how they differ from ZIP Codes), I recommend two references:

  1. My free course Mapmaking in R with Choroplethr has three sections dedicated to “ZIP Code Choropleths”.
  2. In 2017 I had the pleasure of meeting Jon Sperling, who is one of the creators of the ZCTA, at the Association of Public Data Users (APDU) conference. You can learn about that meeting, including a reference to one of his papers on the topic, here.

One of my recollections from that meeting is Jon explaining that ZIP Codes are designed to follow roads. This means that different sides of a single block can have different ZIP codes. Census geography, however, treats blocks as atomic. This means that all homes on a single block must have the same ZCTA. This difference in construction means that ZIPs and ZCTAs are unlikely to ever truly be identical.

Closing Thoughts

My primary concern with this dataset is that people will assume that it is a crosswalk for present-day ZIP Codes. As stated above, ZCTAs rarely (if ever) line up perfectly with ZIP Codes. Additionally, this dataset was published in 2020, and it is not clear how many changes have occurred to ZIP Codes in the interim.

Funding

I would like to thank my employer, MarketBridge, for supporting the development of this package. This package would not have been developed without their support.