Basic usage

library(varsExplore)
library(dplyr)

Motivation

One of the things that Stata has that RStudio lacks is the variable explorer. This extremely useful especially if you’re working with datasets with a large number of variables with hard to remember names, but descriptive labels. In Stata you just search in the variable explorer and then click on the variable to get its name into the console.

Stata’s variable explorer

As an example of a dataset like this, consider the Quality of Government standard dataset. Here’s the 2018 version of the cross-section data:

qog <- rio::import("http://www.qogdata.pol.gu.se/dataarchive/qog_std_cs_jan18.dta")

It has 194 observations (different countries) and 1882 variables.

The variables have names like wdi_acelu, bci_bci, eu_eco2gdpeurhab, gle_cgdpc etc. Not exactly things you want to remember.

Working with this in Stata is relatively easy because you just search in the variable explorer for things like “sanitation”, “corruption”, “GDP”, etc. and you find the variable names.

Unfortunately, RStudio doesn’t have a variable explorer panel. But you can improvise something like the following:

data.frame(Description = sjlabelled::get_label(qog)) %>% DT::datatable()

BAM! We just made a variable explorer! If you run this code in the console it opens the DT::datatable in the RStudio’s Viewer pane, which is pretty much replicating the Stata experience (except that it is read-only).

But we can do better! Why not include additional information, like the number of missing observations, summary statistics, or an overview of the values of each variable?

Introducing vars_explore

Full usage

vars_explore(qog)

This will create a searchable variable explorer, and calculate summary statistics for each variable:

The table is searchable, and you can, furthermore, arrange it, say, based on which variable has least missing values. For instance, search for “GDP per capita” and see which variable provides most complete information.

If you click on the + next to a variable, you will get both the summary statistics, and unique values and, if present, the value labels. The option value.labels.width limits how many characters to include in the “Value labels” and “Values” columns. Default is 500.

If you set minimal = TRUE, only “Variable”, “Description”, “Obs.”, and “Missing” will be shown (and none the summary stats will be calculated).

Creating summary statistics tables

By default, vars_explore only shows the summary in RStudio’s Viewer Pane, and doesn’t return anything. But you can change this by setting viewer = FALSE and silent = FALSE. This speeds it us massively because it’s the DT::datatable() that takes most of the time, and allows you to build summary stats (e.g. for a paper):

vdem_summary <- qog %>% 
  select(starts_with("vdem_")) %>%
  vars_explore(viewer = FALSE, silent = FALSE) %>% 
  select(-Values, -`Value labels`)

knitr::kable(vdem_summary)
Variable Description Type Obs. Missing Mean Median Std.Dev. Min Max
vdem_corr Political corruption index numeric 168 26 0.51 0.57 0.29 0.010031 0.968504
vdem_delibdem Deliberative democracy index numeric 168 26 0.41 0.39 0.26 0.002548 0.912120
vdem_dl_delib Deliberative component index numeric 168 26 0.66 0.74 0.25 0.020510 0.989109
vdem_edcomp_thick Electoral component index numeric 168 26 0.62 0.65 0.24 0.004832 0.939636
vdem_egal Egalitarian component index numeric 168 26 0.62 0.63 0.20 0.086303 0.969009
vdem_egaldem Egalitarian democracy index numeric 168 26 0.42 0.36 0.23 0.043189 0.869084
vdem_elvotbuy Election vote buying numeric 162 32 2.09 1.82 1.03 0.211176 3.982266
vdem_exbribe Executive bribery and corrupt exchanges numeric 168 26 1.93 1.79 1.06 0.097352 3.938878
vdem_excrptps Public sector corrupt exchanges numeric 168 26 1.82 1.69 1.01 0.113801 3.937302
vdem_execorr Executive corruption index numeric 168 26 0.48 0.52 0.30 0.010515 0.974874
vdem_exembez Executive embezzlement and theft numeric 168 26 2.13 2.11 1.11 0.091897 3.924967
vdem_exthftps Public sector theft numeric 168 26 0.06 -0.13 1.49 -3.137724 3.399789
vdem_gcrrpt Legislature corrupt activities numeric 168 26 1.85 1.67 0.89 0.299263 3.894069
vdem_gender Women political empowerment index numeric 168 26 0.74 0.79 0.17 0.274245 0.956486
vdem_jucorrdc Judicial corruption decision numeric 168 26 2.26 2.25 0.85 0.475427 3.825114
vdem_libdem Liberal democracy index numeric 168 26 0.43 0.40 0.26 0.015430 0.886297
vdem_liberal Liberal component index numeric 168 26 0.62 0.68 0.25 0.037652 0.975519
vdem_mecorrpt Media corrupt numeric 168 26 2.54 2.62 0.97 0.079457 3.966262
vdem_partip Participatory component index numeric 168 26 0.47 0.52 0.16 0.042004 0.711231
vdem_partipdem Participatory democracy index numeric 168 26 0.34 0.34 0.19 0.014923 0.672246
vdem_polyarchy Electoral democracy index numeric 168 26 0.55 0.57 0.24 0.026478 0.924651
vdem_pubcorr Public sector corruption index numeric 168 26 0.50 0.52 0.30 0.005353 0.981405

You can also opt to use View() instead of the Viewer Pane, which works much faster than DT::datatable(), although, given how RStudio works, this makes hard to see both the variable explorer and your script at the same time.

Alternatives

The best alternative to vars_explore is the vtable package. The downsides of vtable are

  1. it doesn’t provide a searchable table in the Viewer Pane, and
  2. it puts all summary stats in a single text column. This makes it hard to sort based on, say, the number of missing values.

You can, however, use vtable to generate a dataframe that can be opened with View(), just like you can with vars_explore(silent = FALSE, viewer = FALSE). Unfortunately, what makes vtable faster is precisely its limitations, as the most time consuming part is loading up the DT::datatable, rather than calculating anything. vtable works fast because it creates a simple HTML file, but that is not searchable in the Viewer Pane.

Another alternative is sjPlot::view_df, which provides summary stats in individual columns, but it is very slow. Also, like vtable, it doesn’t provide a searchable table in the Viewer Pane.

Acknowledgements

This was made possible by Reigo Hendrikson’s datatable2: http://www.reigo.eu/2018/04/extending-dt-child-row-example/

As far as I know, Reigo hasn’t made this available in a package. It is included in this package, with some minor modifications, and you can use it with varsExplore::datatable2().