zoomerjoin

DOI Lifecycle: experimental Codecov test coverage

zoomerjoin is an R package that empowers you to fuzzy-join massive datasets rapidly, and with little memory consumption. It is powered by high-performance implementations of Locality Sensitive Hashing, an algorithm that finds the matches records between two datasets without having to compare all possible pairs of observations. In practice, this means zoomerjoin can fuzzily-join datasets days, or even years faster than other matching packages. zoomerjoin has been used in-production to join datasets of hundreds of millions of names or vectors in a matter of hours.

Installation

Installing from CRAN:

You can install from the CRAN as you would with any other package. Please be aware that you will have to have Cargo (the rust toolchain and compiler) installed to build the package from source.

install.packages(zoomerjoin)

Installing Rust

If your operating system or version of R is not installed, you must have the Rust compiler installed to compile this package from sources. After the package is compiled, Rust is no longer required, and can be safely uninstalled.

Installing Rust on Linux or Mac:

To install Rust on Linux or Mac, you can simply run the following snippet in your terminal.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing Rust on Windows:

To install Rust on windows, you can use the Rust installation wizard, rustup-init.exe, found at this site. Depending on your version of Windows, you may see an error that looks something like this:

error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed

In this case, you should run rustup install stable-x86_64-pc_windows-gnu to install the missing toolchain. If you’re missing another toolchain, simply type this in the place of stable-x86_64-pc_windows-gnu in the command above.

Installing Package from Github:

Once you have rust installed Rust, you should be able to install the package with either the install.packages function as above, or using the install_github function from the devtools package or with the pkg_install function from the pak package.

## Install with devtools
# install.packages("devtools")
devtools::install_github("beniaminogreen/zoomerjoin")

## Install with pak
# install.packages("pak")
pak::pkg_install("beniaminogreen/zoomerjoin")

Loading The Package

Once the package is installed, you can load it into memory as usual by typing:

library(zoomerjoin)

Usage:

The flagship feature of zoomerjoins are the jaccard_join and euclidean family of functions, which are designed to be near drop-ins for the corresponding dplyr/fuzzyjoin commands:

The jaccard_join family of functions provide fast fuzzy-joins for strings using the Jaccard distance while the euclidean_join family provides fuzzy-joins for points or vectors using the Euclidean distance.

Example: Joining rows of the Database on Ideology, Money in Politics, and Elections

(DIME)

Here’s a snippet showing off how to use the jaccard_inner_join() merge two lists of political donors in the Database on Ideology, Money in Politics, and Elections (DIME). You can see a more detailed example of this vignette in the introductory vignette.

I start with two corpuses I would like to combine, corpus_1:

corpus_1 <- dime_data %>%
  head(500)
names(corpus_1) <- c("a", "field")
corpus_1
## # A tibble: 500 × 2
##        a field                                                                  
##    <dbl> <chr>                                                                  
##  1     1 ufwa cope committee                                                    
##  2     2 committee to re elect charles e. bennett                               
##  3     3 montana democratic party non federal account                           
##  4     4 mississippi power & light company management political action and educ…
##  5     5 napus pac for postmasters                                              
##  6     6 aminoil good government fund                                           
##  7     7 national women's political caucus of california                        
##  8     8 minnesota gun owners' political victory fund                           
##  9     9 metropolitan detroit afl cio cope committee                            
## 10    10 carpenters legislative improvement committee united brotherhood of car…
## # ℹ 490 more rows

And corpus_2:

corpus_2 <- dime_data %>%
  tail(500)
names(corpus_2) <- c("b", "field")
corpus_2
## # A tibble: 500 × 2
##        b field                                                                  
##    <dbl> <chr>                                                                  
##  1   501 citizens for derwinski                                                 
##  2   502 progressive victory fund greater washington americans for democratic a…
##  3   503 ingham county democratic party federal campaign fund                   
##  4   504 committee for a stronger future                                        
##  5   505 atoka country supper committee                                         
##  6   506 friends of democracy pac inc                                           
##  7   507 baypac                                                                 
##  8   508 international brotherhood of electrical workers local union 278 cope/p…
##  9   509 louisville & jefferson county republican executive committee           
## 10   510 democratic party of virginia                                           
## # ℹ 490 more rows

Both corpuses have an observation ID column, and a donor name column. We would like to join the two datasets on the donor names column, but the two can’t be directly joined because of misspellings. Because of this, we will use the jaccard_inner_join function to fuzzily join the two on the donor name column.

Importantly, Locality Sensitive Hashing is a probabilistic algorithm, so it may fail to identify some matches by random chance. I adjust the hyperparameters n_bands and band_width until the chance of true matches being dropped is negligible. By default, the package will issue a warning if the chance of a true match being discovered is less than 95%. You can use the jaccard_probability and jaccard_hyper_grid_search to help understand the probability any true matches will be discarded by the algorithm.

More details and a more thorough description of how to tune the hyperparameters can be can be found in the guided tour vignette.

set.seed(1)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width = 6, n_bands = 20, band_width = 6)
## Warning in jaccard_join(a, b, mode = "inner", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 92% chance of being compared.
## Please consider changing `n_bands` and `band_width`.

## Joining by 'field'
print(Sys.time() - start_time)
## Time difference of 0.009142399 secs
print(join_out)
## # A tibble: 19 × 4
##        a field.x                                                      b field.y 
##    <dbl> <chr>                                                    <dbl> <chr>   
##  1   292 bill bradley for u s senate '84                            913 bill br…
##  2    35 solarz for congress 82                                     671 solarz …
##  3   232 republican county committee of chester county              710 republi…
##  4    87 kentucky state democratic central executive committee      639 arizona…
##  5   478 united democrats for better government                     642 democra…
##  6   319 7th congressional district democratic party of wisconsin   792 8th con…
##  7   378 guarini for congress 1982                                  606 guarini…
##  8   378 guarini for congress 1982                                  883 guarini…
##  9   163 davies county republican executive committee               852 warren …
## 10   238 4th congressional district democratic party                792 8th con…
## 11    88 scheuer for congress 1980                                  667 scheuer…
## 12   216 kent county republican finance committee                   719 harford…
## 13   387 committee to re elect congressman staton                   805 committ…
## 14   302 americans for good government inc                          910 america…
## 15   216 kent county republican finance committee                   607 lake co…
## 16   122 tarrant county republican victory fund                     761 lake co…
## 17   238 4th congressional district democratic party                518 16th co…
## 18   230 pipefitters local union 524                                998 pipefit…
## 19    45 dole for senate committee                                  623 riegle …

Zoomerjoin is able to quickly find the matching columns without comparing all pairs of records. This saves more and more time as the size of each list increases, so it can scale to join datasets with millions or hundreds of millions of rows.

Contributing

Thanks for your interest in contributing to Zoomerjoin!

I am using a gitub-centric workflow to manage the package; You can file a bug report, request a new feature, or ask a question about the package by filing an issue on the issues page, where you will also find a range of templates to help you out. If you’d like to make changes to the code, you can write and file a pull request on this page. I’ll try to respond to all of these in a timely manner (within a week) although occasionally I may take longer to respond to a complicated question or issue.

Please also be aware of the contributor code of conduct for contributing to the repository.

Acknowledgments:

The Zoomerjoin was made using this SQL join illustration by Germanx and this speed limit sign from the Federal Highway Administration - MUTCD.

References:

Bonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.

Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd. ed.). Cambridge University Press, USA.

Broder, Andrei Z. (1997), “On the resemblance and containment of documents”, Compression and Complexity of Sequences: Proceedings. Positano, Salerno, Italy