Introduction to DemografixeR

Matthias Brenninkmeijer

2020-05-06

Introduction

Let’s illustrate the usefulness of DemografixeR with a simple example. Say we know the first names of a sample of customers, but useful information about gender, age or nationality is unavailable:

Customers: Maria Ben Claudia Adam Hannah Robert

Names are strongly shaped by sociocultural factors: their popularity varies across time and location, and these naming conventions may be good predictors of other useful variables such as gender, age and nationality. Here’s where DemografixeR comes in:

DemografixeR allows R users to connect directly to the (1) genderize.io API, the (2) agify.io API and the (3) nationalize.io API to obtain the (1) gender, (2) age and (3) nationality of a name in a tidy format.

DemografixeR deals with the hassle of API pagination, missing values, duplicated names, trimming whitespace and parsing the results into a tidy format, giving the user time to analyze the data instead of tidying it.
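As a rough illustration of the kind of input cleaning described above, the following base R sketch trims whitespace, drops missing or empty entries, and keeps each distinct name only once before mapping results back onto the original vector. It is not the package's internal code: `unique_clean_names()` and `fake_result` are hypothetical names used only for this example, and `fake_result` merely stands in for an API response.

```r
# Illustrative only: these objects are hypothetical and not part of DemografixeR.
unique_clean_names <- function(x) {
  x <- trimws(x)                    # strip leading/trailing whitespace
  unique(x[!is.na(x) & nzchar(x)])  # drop NA and empty strings, keep each name once
}

raw <- c(" Maria", "Ben ", NA, "Maria", "")
to_query <- unique_clean_names(raw)
to_query

# Stand-in for an API response, keyed by the cleaned unique names.
fake_result <- c(Maria = "female", Ben = "male")

# Map the per-name results back onto the original (uncleaned) input;
# missing or empty inputs come back as NA.
mapped <- unname(fake_result[trimws(raw)])
mapped
```

Querying only the distinct, cleaned names keeps the number of API requests down, which matters when a daily request limit applies.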

To do so, DemografixeR is built around three main functions, which we will use to predict the key demographic variables of the previous sample of customers and so ‘fix’ the missing demographic information:

API                     R function         Estimated variable
https://genderize.io    genderize(name)    Gender
https://agify.io        agify(name)        Age
https://nationalize.io  nationalize(name)  Nationality

They all work similarly and can be integrated into many workflows. Using the previous group of customers, we obtain the following results:

Customers:              Maria   Ben   Claudia  Adam  Hannah  Robert
Estimated gender:       female  male  female   male  female  male
Estimated age:          21      48    45       34    27      59
Estimated nationality:  CY      AU    CL       PL    SL      US

To see how to get to these results, read on!

Get Started

Setup

First, we need to load the package:
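A minimal sketch of this step, assuming the package is already installed in your R library:

```r
# Attach DemografixeR so that genderize(), agify() and nationalize() are available
library(DemografixeR)
```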

API credentials

The following step is optional; it is only necessary if you plan to estimate gender, age or nationality for more than 1000 different names a day. In that case, you need to obtain an API key from the following link:

To use the API key, save it once with the save_api_key(key) function and you’re all set. All the functions will automatically retrieve the key once it is saved:

Please be careful when dealing with secrets/tokens/credentials and do not share them publicly. If you want to check which API key you have saved, retrieve it with the get_api_key() function. To fully remove the saved key, use the remove_api_key() function.
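The three key-management functions named above could be used roughly as follows. This is only a sketch: "__YOUR_API_KEY__" is a placeholder, not a real key, and the calls use no arguments beyond the ones mentioned in the text.

```r
library(DemografixeR)

# Placeholder value: substitute the key you obtained for your own account
save_api_key(key = "__YOUR_API_KEY__")

# Check which key is currently saved (avoid printing it in shared output)
get_api_key()

# Remove the saved key again
remove_api_key()
```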

Gender

We start by predicting the gender of our customers. For this we use the genderize(name) function:

We see that genderize(name) returns the estimated gender for each name as a character vector:
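A sketch of such a call, assuming the package is installed and the genderize.io API is reachable; the object names `customers` and `gender` are chosen here only for illustration:

```r
library(DemografixeR)

customers <- c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert")

# Default call: one estimated gender per name
gender <- genderize(name = customers)
gender

# Confirm the type of the returned object
class(gender)
```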

It is also possible to obtain a detailed data.frame object with additional information by setting the simplify parameter to FALSE. DemografixeR also works with ‘pipes’:
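For instance, a piped call along these lines should return the detailed result. It is a sketch that assumes the `%>%` pipe is taken from dplyr (used elsewhere in this document) and that the API is reachable; `gender_df` is just an illustrative name.

```r
library(DemografixeR)
library(dplyr)  # provides the %>% pipe

customers <- c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert")

# simplify = FALSE requests a data.frame instead of a character vector
gender_df <- customers %>% genderize(simplify = FALSE)
gender_df
```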

name     type    gender  probability  count
Maria    gender  female  0.98         334287
Ben      gender  male    0.95         77991
Claudia  gender  female  0.98         118604
Adam     gender  male    0.98         116396
Hannah   gender  female  0.97         13198
Robert   gender  male    0.99         177418

Age

We continue with the age estimation of our customers. As with the genderize(name) function, the simplify parameter also works with the agify(name) function to retrieve a data.frame:
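A sketch of the corresponding call, under the same assumptions as before (package installed, agify.io reachable); `age_df` is an illustrative name:

```r
library(DemografixeR)

customers <- c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert")

# Detailed age estimates, one row per name
age_df <- agify(name = customers, simplify = FALSE)
age_df
```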

name     type  age  count
Maria    age   21   517258
Ben      age   48   75632
Claudia  age   45   110105
Adam     age   34   110754
Hannah   age   27   12843
Robert   age   59   160915

Nationality

Last but not least, we finish with the nationality estimation. As with the genderize(name) and agify(name) functions, the simplify parameter also works with the nationalize(name) function to retrieve a data.frame:
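Again as a sketch under the same assumptions (package installed, nationalize.io reachable), with `nationality_df` as an illustrative name:

```r
library(DemografixeR)

customers <- c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert")

# Detailed nationality estimates as a data.frame
nationality_df <- nationalize(name = customers, simplify = FALSE)
nationality_df
```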

name     type         country_id  probability
Maria    nationality  CY          0.0550798
Ben      nationality  AU          0.0665534
Claudia  nationality  CL          0.0559340
Adam     nationality  PL          0.0905836
Hannah   nationality  SL          0.2673254
Robert   nationality  US          0.0909442

Other parameters

country_id parameter

Estimates are in many cases more accurate when the data is narrowed to a specific country. Luckily, both the genderize(name) and agify(name) functions support passing a country code parameter (following the common ISO 3166-1 alpha-2 country code convention). For obvious reasons, the nationalize(name) function does not:
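A sketch of a country-specific call, assuming the package is installed and the APIs are reachable. The name and country code are taken from the examples in this document; `gender_us` and `age_us` are illustrative names.

```r
library(DemografixeR)

# Restrict the estimates to the United States (ISO 3166-1 alpha-2 code "US")
gender_us <- genderize(name = "Hannah", country_id = "US")
age_us <- agify(name = "Hannah", country_id = "US")

gender_us
age_us
```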

To obtain a data.frame of all supported countries, use the supported_countries(type) function. Here’s an example of 5 countries:
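A sketch of such a call. It assumes that the type argument takes the name of the API as a string (here "genderize"), so check the function's help page if the call is rejected; `countries` is an illustrative name.

```r
library(DemografixeR)

# Assumption: type = "genderize" selects the country list of genderize.io
countries <- supported_countries(type = "genderize")

# Show the first five rows only
head(countries, 5)
```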

country_id  name                  total
AD          Andorra               29783
AE          United Arab Emirates  145847
AF          Afghanistan           23531
AG          Antigua and Barbuda   1723
AI          Anguilla              1081

In this case the total column reflects the number of observations the API has for each country. The beauty of the country_id parameter is that it accepts either a single character string or a character vector of the same length as the name parameter. An example illustrates this better:
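A sketch of a call with one country code per name, assuming the package is installed and agify.io is reachable; `age_by_country` is an illustrative name:

```r
library(DemografixeR)

# The i-th country code is applied to the i-th name
age_by_country <- agify(name = c("Hannah", "Ben"),
                        country_id = c("US", "GB"),
                        simplify = FALSE)
age_by_country
```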

name    type  age  count  country_id
Hannah  age   54   67     US
Ben     age   38   1980   GB

In the previous example we passed two names (Hannah and Ben) and two country codes (US and GB). The functions thus accept vectorized input, which is especially useful for workflows where a data.frame has one variable with names and another with country codes.
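Such a data.frame workflow might look like the following sketch. The table `customers_df` and its column names are hypothetical, dplyr is assumed to be installed, and the default simplified output (one value per name) is used so that the result fits into a new column.

```r
library(DemografixeR)
library(dplyr)

# Hypothetical customer table: one column with names, one with country codes
customers_df <- data.frame(name = c("Hannah", "Ben"),
                           country = c("US", "GB"),
                           stringsAsFactors = FALSE)

# Estimate each customer's age within their own country
customers_df <- customers_df %>%
  mutate(estimated_age = agify(name, country_id = country))

customers_df
```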

meta parameter

All three functions have a parameter named meta, which adds information about the API itself to the result, such as:

  • The number of names available in the current time window
  • The number of names left in the current time window
  • Seconds remaining until a new time window opens

Here’s an example:
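A sketch of a call with the meta information switched on, assuming the package is installed and genderize.io is reachable; `gender_meta` is an illustrative name:

```r
library(DemografixeR)

# meta = TRUE adds the API rate-limit columns to the detailed result
gender_meta <- genderize(name = "Hannah", simplify = FALSE, meta = TRUE)
gender_meta
```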

name    type    gender  probability  count  api_rate_limit  api_rate_remaining  api_rate_reset  api_request_timestamp
Hannah  gender  female  0.97         13198  1000            977                 7218            2020-05-05 21:59:42

sliced parameter

The nationalize(name) function has the useful sliced parameter. A name can have several estimated nationalities, and the nationalize(name) function automatically ranks them by probability. This logical parameter ‘slices’ the result, keeping only the country with the highest probability so that each name has a single estimate (one country per name); it is set to TRUE by default. To see all the potential countries a name can be associated with, set the parameter to FALSE:
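A sketch of an unsliced call, assuming the package is installed and nationalize.io is reachable; `all_countries` is an illustrative name:

```r
library(DemografixeR)

# sliced = FALSE keeps every candidate country instead of only the top one
all_countries <- nationalize(name = "Matthias", simplify = FALSE, sliced = FALSE)
all_countries
```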

name      type         country_id  probability
Matthias  nationality  DE          0.4161638
Matthias  nationality  AT          0.2650625
Matthias  nationality  CH          0.1106922

In the last example you can see that, instead of returning a single country code, the function returns multiple country codes with their associated probabilities.

Customers example

Let’s replicate the initial example with our group of customers. Voilà!

library(DemografixeR)
library(dplyr)

# Build the customer table; check.names = FALSE keeps the "Customers:" column name
df <- data.frame("Customers:" = c("Maria", "Ben", "Claudia",
                                  "Adam", "Hannah", "Robert"),
                 stringsAsFactors = FALSE,
                 check.names = FALSE)

# Add one estimated column per API
df <- df %>% mutate(`Estimated gender:` = genderize(`Customers:`),
                    `Estimated age:` = agify(`Customers:`),
                    `Estimated nationality:` = nationalize(`Customers:`))

# Transpose so that each variable becomes a row, then print as a table
df %>% t() %>% knitr::kable(col.names = NULL)

Customers:              Maria   Ben   Claudia  Adam  Hannah  Robert
Estimated gender:       female  male  female   male  female  male
Estimated age:          21      48    45       34    27      59
Estimated nationality:  CY      AU    CL       PL    SL      US

Further information

For more information access the package documentation at https://matbmeijer.github.io/DemografixeR.