Introduction to fedmatch

Chris Webster

Background

With an ever-expanding set of financial data providers, each covering many of the same firms and entities, matching data sets that do not share common identifiers has never been more important.fedmatch is a set of tools for performing record linkage between two data sets. It allows for a variety of different matching techniques, letting the user build a matching algorithm for their specific application. Although fedmatch was designed with economic data in mind (i.e. loans or companies), it is very flexible, so it can be used for any matching problem. With fedmatch, a researcher or analyst can quickly go from having 0 matches between two datasets to having many. With more time and care, they can use more advanced techniques to pull out even more matches.

Fedmatch has many features, including:

This vignette will explain the basics of fedmatch, including the merge_plus function and the clean_strings function. From there, other vignettes go further into the details of the different matching types.

Name cleaning

Before diving into the matching code, we’ll first go over the clean_strings function that can help standardize company names across data sets.

A basic example of clean strings looks like this:

raw_names <- c("Hamlin, Hamlin, McGill", "Schweibert & Cokely ", "Wexler McGill, LLC",
               "Davis and Main, Inc.")
clean_names <- clean_strings(raw_names)
clean_names
#> [1] "hamlin hamlin mcgill"  "schweibert and cokely" "wexler mcgill llc"    
#> [4] "davis and main inc"

Without any additional arguments, clean_strings does the following:

As described in the manual for clean_strings, one can specify further arguments to remove words or try different replacements. Fedmatch comes with a set of words that are commonly used for this, but you can use whatever you’d like. (You can also use word_frequency to look for common words in your data.)

fedmatch::corporate_words[1:5]
#>     abbr     long.names
#> 1: accep     acceptance
#> 2:  amer        america
#> 3: assoc     associates
#> 4:    cl company listed
#> 5: cmnty      community
scrubbed_names <- clean_strings(raw_names, common_words = fedmatch::corporate_words)
scrubbed_names
#> [1] "hamlin hamlin mcgill"                       
#> [2] "schweibert and cokely"                      
#> [3] "wexler mcgill limited liability corporation"
#> [4] "davis and main incorporated"

Through string cleaning, we can make it so that even if two different databases record names differently (e.g. “Hamlin Hamlin McGill INC” vs “Hamlin Hamlin McGill Incorporated”), we will still count these records as a match.

Basics: merge_plus

The workhorse of fedmatch is merge_plus. merge_plus is an extremely flexible function that can perform several different types of matches: exact, fuzzy, and multivar.

Exact matching

Here are the example datasets that come with fedmatch:

fedmatch::corp_data1
#>                Company Country State  SIC Revenue unique_key_1
#>  1:            Walmart     USA    OH 3300     485            1
#>  2:   Bershire Hataway     USA       2222     223            2
#>  3:              Apple     USA    CA 3384     215            3
#>  4:       Exxon Mobile     USA    TX 2222     205            4
#>  5:          McKesson  Germany    MA  222     192            5
#>  6: UnitedHealth Group     USA    MA   NA     184            6
#>  7:         CVS Health     USA    RI 1112     177            7
#>  8:     General Motors     USA    MI 2222     166            8
#>  9:               AT&T     USA    TN 4000     163            9
#> 10: Ford Motor Company     USA    MI   NA     151           10
fedmatch::corp_data2
#>                   Name country state_code SIC_code earnings unique_key_2
#>  1:            Walmart     USA         OH     3380  490,000            1
#>  2:  Bershire Hathaway     USA         NE     2220  220,000            2
#>  3:     Apple Computer     USA         CA       NA  220,000            3
#>  4: Exxon Mobile Inc.      USA         TX     2222  210,000            4
#>  5:     McKesson Corp.                 MA     2222  190,000            5
#>  6: UnitedHealth Group     USA         MA     1130  180,000            6
#>  7:                CVS                 RI     1122  180,000            7
#>  8:                 GM                 MI     2222  170,000            8
#>  9:             AT & T     USA         TN     4000  160,000            9
#> 10:         Ford Motor     USA         MI     2222  150,000           10

The most basic way to use merge_plus is by simply making it equivalent to base::merge.

basic_merge <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "exact", 
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")

This code will run merge on the “Company” and “Name” variables, and return cases where the two have an exact match. The only differences between this and base::merge are

  1. merge_plus requires data1 and data2 each to have a “unique key” that can be used to identify an observation.
  2. merge_plus returns a list.

Let’s take a look at each of the elements of the list returned by merge_plus. These will always be the same, no matter which match_type you select in merge_plus.

The first item is the matches themselves. This is a data.table with one row for each matching observation, along with all variables present in each data set.

print(basic_merge$matches)
#>               Company Country State  SIC Revenue unique_key_1 country
#> 1:            Walmart     USA    OH 3300     485            1     USA
#> 2: UnitedHealth Group     USA    MA   NA     184            6     USA
#>    state_code SIC_code earnings unique_key_2               Name tier
#> 1:         OH     3380  490,000            1            Walmart  all
#> 2:         MA     1130  180,000            6 UnitedHealth Group  all

The next item is matches_filter. In this example, it’s empty, because we didn’t supply the argument filter. If we did supply filter (which can either be a function that filters, or a numeric cutoff for a matchscore (more on this later)), we would see a subsample of the matches dataset.

print(basic_merge$matches_filter)
#> Null data.table (0 rows and 0 cols)

Next in the list is data1_nomatch and data2_nomatch, which return the rows that were not matched from the datasets.

print(basic_merge$data1_nomatch)
#>               Company Country State  SIC Revenue unique_key_1
#> 1:   Bershire Hataway     USA       2222     223            2
#> 2:              Apple     USA    CA 3384     215            3
#> 3:       Exxon Mobile     USA    TX 2222     205            4
#> 4:          McKesson  Germany    MA  222     192            5
#> 5:         CVS Health     USA    RI 1112     177            7
#> 6:     General Motors     USA    MI 2222     166            8
#> 7:               AT&T     USA    TN 4000     163            9
#> 8: Ford Motor Company     USA    MI   NA     151           10
print(basic_merge$data2_nomatch)
#>                  Name country state_code SIC_code earnings unique_key_2
#> 1:  Bershire Hathaway     USA         NE     2220  220,000            2
#> 2:     Apple Computer     USA         CA       NA  220,000            3
#> 3: Exxon Mobile Inc.      USA         TX     2222  210,000            4
#> 4:     McKesson Corp.                 MA     2222  190,000            5
#> 5:                CVS                 RI     1122  180,000            7
#> 6:                 GM                 MI     2222  170,000            8
#> 7:             AT & T     USA         TN     4000  160,000            9
#> 8:         Ford Motor     USA         MI     2222  150,000           10

Finally, there is match_evaluation, which is a data.table that summarizes how well the match worked. It shows the number of matches in each dataset broken down by tier (more on tiers later), along with the percent matched.

print(basic_merge$match_evaluation)
#>    tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1:  all       2                2                2           0.2           0.2
#>    new_unique_1 new_unique_2
#> 1:            2            2

Fuzzy matching

We can also use merge_plus to perform “fuzzy” matches. A fuzzy match uses a string distance algorithm to compute the distance between one string and a set of other strings, then picks the closest string that’s over a certain threshold. fedmatch uses stringdist::amatch to execute these matches, and you can read more about string distances in the stringdist package documentation.

Here is an example of how this is implemented in merge_plus.

fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", 
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#>    unique_key_2 unique_key_1 Country State  SIC Revenue            Company
#> 1:            1            1     USA    OH 3300     485            Walmart
#> 2:            2            2     USA       2222     223   Bershire Hataway
#> 3:            6            6     USA    MA   NA     184 UnitedHealth Group
#>                  Name country state_code SIC_code earnings tier
#> 1:            Walmart     USA         OH     3380  490,000  all
#> 2:  Bershire Hathaway     USA         NE     2220  220,000  all
#> 3: UnitedHealth Group     USA         MA     1130  180,000  all

We can see that we picked up an additional match here: “Bershire Hataway” and “Bershire Hathaway.” These are off by 1 character, so the exact match didn’t pick them up, but the fuzzy match did. We can also tweak the fuzzy match settings with the argument fuzzy_settings. This is a list that will be passed to stringdist::amatch.

fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", 
                          fuzzy_settings = build_fuzzy_settings(maxDist = .5),
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#>     unique_key_2 unique_key_1 Country State  SIC Revenue            Company
#>  1:            1            1     USA    OH 3300     485            Walmart
#>  2:            2            2     USA       2222     223   Bershire Hataway
#>  3:            3            3     USA    CA 3384     215              Apple
#>  4:            4            4     USA    TX 2222     205       Exxon Mobile
#>  5:            5            5 Germany    MA  222     192          McKesson 
#>  6:            6            6     USA    MA   NA     184 UnitedHealth Group
#>  7:            7            7     USA    RI 1112     177         CVS Health
#>  8:            9            9     USA    TN 4000     163               AT&T
#>  9:           10            8     USA    MI 2222     166     General Motors
#> 10:           10           10     USA    MI   NA     151 Ford Motor Company
#>                   Name country state_code SIC_code earnings tier
#>  1:            Walmart     USA         OH     3380  490,000  all
#>  2:  Bershire Hathaway     USA         NE     2220  220,000  all
#>  3:     Apple Computer     USA         CA       NA  220,000  all
#>  4: Exxon Mobile Inc.      USA         TX     2222  210,000  all
#>  5:     McKesson Corp.                 MA     2222  190,000  all
#>  6: UnitedHealth Group     USA         MA     1130  180,000  all
#>  7:                CVS                 RI     1122  180,000  all
#>  8:             AT & T     USA         TN     4000  160,000  all
#>  9:         Ford Motor     USA         MI     2222  150,000  all
#> 10:         Ford Motor     USA         MI     2222  150,000  all

So, cranking up the maxDist (maximum distance between strings, a threshold for determining matches) gave us a bunch more matches. Note that we return multiple matches per the same unique key sometimes, for example Ford Motor got matched to General Motors and Ford Motor Company. There are many tweaks that one can make via fuzzy_settings, and these change the match behavior significantly. It is worth exploring various options to see which make the most sense for your specific application. See the fuzzy matching vignette for more details, including a new method of string comparison that we call a “Weighted Jaccard” comparison.

Multivar matching

The final setting for match_type in merge_plus is a “multivariable match”, or “multivar” for short. This match is complex, and may take some playing around with the code to fully understand how it works. We’ll just go over the basic usage here. The idea behind the multivariable match is to use several variables from each dataset to execute a match, rather than just using the name of an entity.

One way is to take the set of variables (say, company name, state, and earnings), compare them with some numeric metric, and then perform a linear combination of those metrics to arrive at a final score. Then, you can compare each observation in one dataset to each other observation in the other dataset to pick the match with the highest score.

The other way is similar, but instead of a linear combination of scores, you can use a logit model. In this method, you create a hand-verified match set between your two datasets, then use a logit model to estimate how much each variable contributes to determining a match.

Here’s an example of the first method, the linear combination:

# for simplicity's sake, rename columns in corp_data2
data.table::setnames(corp_data2, c("Name", "country"), c("Company", "Country"))
multivar_linear_result <- merge_plus(corp_data1, corp_data2, 
                                     match_type = "multivar",
                                     by = c("Country", "Company"), 
                                     unique_key_1 = "unique_key_1",
                                     suffixes = c("_1", "_2"),
                                     unique_key_2 = "unique_key_2",
                                    
                                     multivar_settings = build_multivar_settings(compare_type = c("indicator", "stringdist"),
                                                               wgts = c(.5, .5),
                                                              top = 1))
multivar_linear_result$matches                                     
#>     unique_key_1          Company_1 Country_1 State  SIC Revenue
#>  1:            1            Walmart       USA    OH 3300     485
#>  2:            2   Bershire Hataway       USA       2222     223
#>  3:            3              Apple       USA    CA 3384     215
#>  4:            4       Exxon Mobile       USA    TX 2222     205
#>  5:            5          McKesson    Germany    MA  222     192
#>  6:            6 UnitedHealth Group       USA    MA   NA     184
#>  7:            7         CVS Health       USA    RI 1112     177
#>  8:            9               AT&T       USA    TN 4000     163
#>  9:            8     General Motors       USA    MI 2222     166
#> 10:           10 Ford Motor Company       USA    MI   NA     151
#>              Company_2 Country_2 state_code SIC_code earnings unique_key_2
#>  1:            Walmart       USA         OH     3380  490,000            1
#>  2:  Bershire Hathaway       USA         NE     2220  220,000            2
#>  3:     Apple Computer       USA         CA       NA  220,000            3
#>  4: Exxon Mobile Inc.        USA         TX     2222  210,000            4
#>  5:     McKesson Corp.                   MA     2222  190,000            5
#>  6: UnitedHealth Group       USA         MA     1130  180,000            6
#>  7: UnitedHealth Group       USA         MA     1130  180,000            6
#>  8:             AT & T       USA         TN     4000  160,000            9
#>  9:         Ford Motor       USA         MI     2222  150,000           10
#> 10:         Ford Motor       USA         MI     2222  150,000           10
#>     Country_compare Company_compare multivar_score tier
#>  1:               1       1.0000000      1.0000000  all
#>  2:               1       0.9882353      0.9941176  all
#>  3:               1       0.8714286      0.9357143  all
#>  4:               1       0.9333333      0.9666667  all
#>  5:               0       0.9285714      0.4642857  all
#>  6:               1       1.0000000      1.0000000  all
#>  7:               1       0.5333333      0.7666667  all
#>  8:               1       0.9111111      0.9555556  all
#>  9:               1       0.7333333      0.8666667  all
#> 10:               1       0.9111111      0.9555556  all

To specify this type of match, we put in the match_type as “multivar,” and then we specified how we wanted the match to run by passing the list multivar_settings. Each element of this list is a separate argument to go into multivar_match. The compare_type argument tells the multivar how to compare each variable in the by argument. Because “Country” is a binary variable, we specify “indicator”, and because “Company” is a string variable, we specify “stringdist.” You can see a full list of options for comparison in the multivar_match documentation, or equivalently, the merge_plus score_settings documentation. (merge_plus has an option to compute matchscores post-hoc, as a method of evaluation.)

Next, here’s an example the second method, using a logit model. First, we’ll set up a fake training table. Normally, one would construct a human-verified match set. Here, I just create a table where the first half are matches, the second half are a mix of matches and not, and then the two comparison variables are biased to be more of a match in the first half of the sample. This is just a way to ensure that our logit model gives us positive coefficients, so that our example makes a little more sense.

set.seed(111)
training_table <- data.table::data.table(match = c(rep(1, 5e4), sample(c(0,1 ), 5e4, replace = TRUE)),
                                Company_compare = seq(1, 0.00001, -.00001),
                                Country_compare = c(rep(1, 5e4), sample(c(1, 0), 5e4, replace = TRUE)))
# training_table
logit_model <- glm(match ~ Company_compare + Country_compare, family = "binomial",
                   data = training_table)
summary(logit_model)
#> 
#> Call:
#> glm(formula = match ~ Company_compare + Country_compare, family = "binomial", 
#>     data = training_table)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.9655  -0.7099   0.3077   0.5579   1.7452  
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -1.27701    0.01702  -75.04   <2e-16 ***
#> Company_compare  4.98075    0.04134  120.49   <2e-16 ***
#> Country_compare  0.56183    0.01821   30.85   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 112745  on 99999  degrees of freedom
#> Residual deviance:  83050  on 99997  degrees of freedom
#> AIC: 83056
#> 
#> Number of Fisher Scoring iterations: 5

Then, we plug our logit model into the multivar_settings. The code will then use our trained logit model on the variables we specified. Note that the name of the columns in the training set must match the name of the variables in the match datasets, with "_compare" at the end.

result <- merge_plus(corp_data1, corp_data2, by = c("Country", "Company"), unique_key_1 = "unique_key_1",
                        unique_key_2 = "unique_key_2", 
                     match_type = "multivar",
                     multivar_settings = list(logit = logit_model, compare_type = c("indicator", "stringdist"),
                        wgts = NULL),
                        suffixes = c("_1", "_2"))
result$matches
#>     unique_key_1          Company_1 Country_1 State  SIC Revenue
#>  1:            1            Walmart       USA    OH 3300     485
#>  2:            2   Bershire Hataway       USA       2222     223
#>  3:            3              Apple       USA    CA 3384     215
#>  4:            4       Exxon Mobile       USA    TX 2222     205
#>  5:            5          McKesson    Germany    MA  222     192
#>  6:            6 UnitedHealth Group       USA    MA   NA     184
#>  7:            7         CVS Health       USA    RI 1112     177
#>  8:            9               AT&T       USA    TN 4000     163
#>  9:            8     General Motors       USA    MI 2222     166
#> 10:           10 Ford Motor Company       USA    MI   NA     151
#>              Company_2 Country_2 state_code SIC_code earnings unique_key_2
#>  1:            Walmart       USA         OH     3380  490,000            1
#>  2:  Bershire Hathaway       USA         NE     2220  220,000            2
#>  3:     Apple Computer       USA         CA       NA  220,000            3
#>  4: Exxon Mobile Inc.        USA         TX     2222  210,000            4
#>  5:     McKesson Corp.                   MA     2222  190,000            5
#>  6: UnitedHealth Group       USA         MA     1130  180,000            6
#>  7:                CVS                   RI     1122  180,000            7
#>  8:             AT & T       USA         TN     4000  160,000            9
#>  9:         Ford Motor       USA         MI     2222  150,000           10
#> 10:         Ford Motor       USA         MI     2222  150,000           10
#>     Country_compare Company_compare multivar_score tier
#>  1:               1       1.0000000      0.9861508  all
#>  2:               1       0.9882353      0.9853272  all
#>  3:               1       0.8714286      0.9740476  all
#>  4:               1       0.9333333      0.9808013  all
#>  5:               0       0.9285714      0.9660386  all
#>  6:               1       1.0000000      0.9861508  all
#>  7:               0       0.8366667      0.9473611  all
#>  8:               1       0.9111111      0.9786024  all
#>  9:               1       0.7333333      0.9496635  all
#> 10:               1       0.9111111      0.9786024  all

Note the last few columns in the data.table: we see the comparison metrics, just like in the linear combination version of multivar_match. But, note that instead of computing a 50/50 linear combination like before, we are now computing a matchscore as the fitted probability of a match based on our logit model. In this toy example, the coefficients are a little strange because of the random data we fed in. But, we see the behavior we’d expect: a higher company name comparison and a country match gives us a higher matchscore.

Summary and next steps

We’ve covered the several different types of matching with fedmatch: exact matching, fuzzy matching, and multivar matching. Each match is useful in its own right, and they become even more useful when combined. That’s where the next step comes in: tier matching with the function tier_match. See the vignette for tier match for more details.