Introduction to vtree 0.1.4

Nick Barrowman

01-Nov-2018 at 12:05

PNGdir <- tempdir()

Introduction

vtree is a tool for drawing variable trees. Variable trees display information about nested subsets of a data frame. This turns out to be useful in a wide variety of situations.

The figure below is a variable tree based on a data frame called FakeData (which represents 46 fictitious patients) in which subsets defined by Sex (M or F) are shown within subsets defined by disease Severity (Mild, Moderate, Severe, or NA).

The variable tree consists of nodes connected by arrows. At the top of the diagram above, the root node of the tree contains all 46 patients. The nodes in the next level of the tree (the children of the root node) represent patients with different values of Severity. The nodes in the level after that represent males and females within each value of Severity.

By default, if a variable has any missing values there will be a missing-value node for that variable. For example, 6 patients had missing Severity. Thus, unlike with some visualization methods, missing values are shown. Furthermore, like any node, a missing-value node can have children. For example, of the 6 patients with missing Severity, 3 are female and 3 are male. Thus the full missing-value structure of the specified variables in the data frame is displayed.

When there are two variables, a variable tree is equivalent to a two-way contingency table with either row or column percentages, depending on the ordering of the variables. In the tree above, Sex is shown within levels of Severity. This corresponds to the following contingency table, in which percentages within each column add to 100%. (These are known as column percentages.)

Sex   Mild       Moderate   Severe    NA
F     11 (58%)   11 (69%)   2 (40%)   3 (50%)
M      8 (42%)    5 (31%)   3 (60%)   3 (50%)
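If you want to check a table like this without vtree, base R's table() and prop.table() are enough. The sketch below rebuilds the counts shown above in a stand-in data frame called toy (a hypothetical name; it is not FakeData and only has the two columns needed here).

```r
# Stand-in data rebuilt from the counts in the table above.
# "toy" is hypothetical; the real FakeData has more columns.
counts  <- c(Mild = 19, Moderate = 16, Severe = 5, Missing = 6)
females <- c(Mild = 11, Moderate = 11, Severe = 2, Missing = 3)

severity <- rep(c("Mild", "Moderate", "Severe", NA), times = counts)
sex <- unlist(lapply(names(counts), function(k) {
  rep(c("F", "M"), times = c(females[[k]], counts[[k]] - females[[k]]))
}))
toy <- data.frame(Severity = severity, Sex = sex)

# Two-way table that keeps missing Severity as its own column
tab <- table(Sex = toy$Sex, Severity = toy$Severity, useNA = "ifany")

# Column percentages: within each Severity column, F and M add to 100%
colpct <- round(100 * prop.table(tab, margin = 2))

print(tab)
print(colpct)
```

Using margin = 1 instead would give row percentages, which corresponds to a tree with Severity shown within levels of Sex.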

Because variable trees visually depict subsetting, they can be easier to understand than contingency tables, especially when there are more than two variables.

Uses of vtree

Variable trees can represent:

Features of vtree

vtree is designed to be quick and easy to use, so that it is convenient for exploratory data analysis, but also flexible enough that it can be used to prepare publication-ready figures.

To make variable trees easy to interpret, vtree supports custom labeling of variables and nodes. One challenge is that as variables are added, variable trees can grow quickly. For this reason, vtree includes tools for pruning trees. Another feature that enhances the value of variable trees is the ability to display summary statistics for other variables in the nodes. For example, the mean of a continuous variable, such as age, can be displayed.

To summarize, vtree implements several additional features:

Technical overview (optional)

The root node of the variable tree represents the entire data frame. There is a child of the root node for each observed value of the first variable that was specified. Each of these child nodes represents a subset of the data frame with a specific value of the variable, and is labeled with the number of observations in the subset and the corresponding percentage of the number of observations in the entire data frame. In general, the nth level of the variable tree corresponds to the nth variable specified. Apart from the root node, each node in the variable tree represents the subset of its parent defined by a specific observed value of the variable at that level of the tree, and is labeled with the number of observations in that subset and the corresponding percentage of the number of observations in its parent node.

Note that a node always represents at least one observation. Unlike a contingency table, which can have empty cells, a variable tree cannot have any empty nodes.
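The description above can be sketched in a few lines of base R. This is only an illustration of the idea, not vtree's actual implementation: node_counts() and toy are hypothetical names, and the percentage here always uses the full parent count as the denominator, as in the description above.

```r
# Hypothetical helper: one row per node, with the count and the
# percentage of the parent node. Not vtree's own code.
node_counts <- function(df, vars, path = "root") {
  if (length(vars) == 0) return(NULL)
  values <- as.character(df[[vars[1]]])
  values[is.na(values)] <- "NA"            # missing values get their own node
  rows <- lapply(sort(unique(values)), function(value) {
    # Only observed values are visited, so no node can be empty
    child <- df[values == value, , drop = FALSE]
    label <- paste0(path, " > ", vars[1], "=", value)
    here <- data.frame(node = label,
                       n = nrow(child),
                       pct_of_parent = round(100 * nrow(child) / nrow(df)))
    # Recurse: the next variable subdivides this node
    rbind(here, node_counts(child, vars[-1], label))
  })
  do.call(rbind, rows)
}

# Small made-up data frame for illustration
toy <- data.frame(
  Severity = c("Mild", "Mild", "Mild", "Severe", "Severe", NA),
  Sex      = c("F", "F", "M", "M", "M", "F")
)

result <- node_counts(toy, c("Severity", "Sex"))
print(result)
```

Note that no row is produced for females with Severe disease: because that combination is not observed in toy, there is no such node.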

The vtree function

The basic syntax is: vtree(data_frame,variable_names). Numerous additional parameters can be supplied. For example, by default vtree produces a horizontal tree, but sometimes a vertical tree is preferable. When horiz=FALSE is specified, vtree generates a vertical tree.

Mini tutorial

Here is a simple example using just one variable:

vtree(FakeData,"Severity",horiz=FALSE)

Now here’s an example with two variables, Severity and Sex. The variables are specified with a space between them in a single character string like this: "Severity Sex". Alternatively the variables can be specified like this: c("Severity","Sex"). This takes a little more typing but is necessary if there are spaces in any of the variable names.

vtree(FakeData,"Severity Sex",horiz=FALSE)

The top node represents the entire data frame. Each subsequent level of the tree corresponds to a different variable (first Severity, then Sex). Within each level, each node represents the subset of its parent node where the variable has a specific value. Displayed in each node is the number of observations and the conditional percentage (i.e. the number of observations in the node expressed as a percentage of the observations in its parent node).

Percentages

Let’s return to the variable tree for Severity. Note that it is equivalent to a one-way table of frequencies and percentages. Missing values are always shown, but do not have associated percentages. The percentages shown are “valid percentages”, i.e. the denominator is the total number of non-missing values.

Alternatively, if you specify vp=FALSE, the denominator will be the total number of observations, including any missing values. In this case, a percentage is shown in each node, including nodes for missing values.

vtree(FakeData,"Severity",vp=FALSE,horiz=FALSE)
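The two denominators are easy to compare by hand. The sketch below uses the Severity counts from the introduction (19 Mild, 16 Moderate, 5 Severe, 6 missing) and plain base R arithmetic; it does not call vtree.

```r
# Severity counts from the introduction: 46 patients, 6 of them missing
n <- c(Mild = 19, Moderate = 16, Severe = 5)
n_missing <- 6

# Valid percentages (the default): the denominator excludes missing values
valid_pct <- round(100 * n / sum(n))

# With vp=FALSE: the denominator is all observations, and the
# missing-value node gets a percentage too
all_n <- c(n, "NA" = n_missing)
total_pct <- round(100 * all_n / sum(all_n))

print(valid_pct)
print(total_pct)
```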

Hiding variable names

By default, vtree shows the variable names next to the corresponding levels of the tree. These can be removed by specifying showlevels=FALSE.

Text wrapping

By default, vtree wraps text whenever a space occurs after at least 20 characters. This can be adjusted, for example, to 10 characters, by specifying splitwidth=10.
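To make the wrapping rule concrete, here is a rough base R sketch of one way to read it: a space becomes a line break once at least splitwidth characters have accumulated on the current line. The function wrap_after() is hypothetical and is not how vtree necessarily implements wrapping.

```r
# Hypothetical illustration of the wrapping rule described above.
wrap_after <- function(text, width = 20) {
  chars <- strsplit(text, "")[[1]]
  count <- 0                      # characters on the current line so far
  for (i in seq_along(chars)) {
    if (chars[i] == " " && count >= width) {
      chars[i] <- "\n"            # break at this space and start a new line
      count <- 0
    } else {
      count <- count + 1
    }
  }
  paste(chars, collapse = "")
}

wrapped <- wrap_after("Includes first-time visits", width = 10)
cat(wrapped, "\n")
```

With width = 10, the first space comes too early to trigger a break, so the text is split only at the second space.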

Pruning using the prune parameter

Suppose you don’t want the tree to show individuals whose disease is Mild or NA. You can use the prune parameter to remove those nodes, and all of their descendants.

vtree(FakeData,"Severity Sex",prune=list("Severity"=c("Mild","NA")))

Pruning is useful when the tree gets too big, or when you are not interested in parts of the tree. In practice, as variables are added, variable trees quickly get too large to display without some kind of pruning.

Pruning using the prunebelow parameter

The prune parameter completely eliminates nodes (along with their descendants). A disadvantage of this is that the frequencies shown in child nodes do not add up to the frequency shown in the parent node. For example, in the variable tree above, of a total of 46 patients, 16 have moderate disease and 5 have severe disease. One might wonder what happened to the other 25 patients.

An alternative is to prune below the specified nodes, i.e. show the nodes but not any of their descendants.

vtree(FakeData,"Severity Sex",prunebelow=list("Severity"=c("Mild","NA")))
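The difference between the two pruning styles can be sketched with the Severity counts alone. The code below is plain base R bookkeeping (no vtree call), using the counts from the introduction.

```r
# Severity counts from the introduction
severity_n <- c(Mild = 19, Moderate = 16, Severe = 5, "NA" = 6)
targets <- c("Mild", "NA")

# prune: the targeted nodes disappear entirely, so the visible
# children no longer add up to the root
shown_prune <- severity_n[!names(severity_n) %in% targets]
unaccounted <- sum(severity_n) - sum(shown_prune)

# prunebelow: every Severity node stays visible; only the targeted
# nodes lose their descendants
shown_prunebelow <- severity_n
expanded <- setdiff(names(severity_n), targets)

print(shown_prune)
print(unaccounted)
print(expanded)
```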

Pruning using the keep and follow parameters

There are two other ways to prune variable trees. The keep parameter is used to specify nodes that should not be pruned (all other nodes at that level of the tree will be pruned). The follow parameter is like the keep parameter except that no nodes at that level of the tree will be pruned. Instead, those nodes that are not “followed” will be pruned at the next level.
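As a rough sketch, and assuming that keep and follow take the same kind of named list as prune and prunebelow, the earlier pruning examples could be written the other way around. The calls are wrapped in a check so that the code runs even when vtree is not installed.

```r
# For one variable, keeping some nodes is the complement of pruning the rest
severity_nodes <- c("Mild", "Moderate", "Severe", "NA")
wanted <- c("Moderate", "Severe")
pruned_by_keep <- setdiff(severity_nodes, wanted)
print(pruned_by_keep)

if (requireNamespace("vtree", quietly = TRUE)) {
  library(vtree)

  # keep: only Moderate and Severe appear at the Severity level
  # (assumed to mirror the list syntax of prune)
  kept <- vtree(FakeData, "Severity Sex",
                keep = list(Severity = wanted))

  # follow: all Severity nodes appear, but only Moderate and Severe
  # have children (assumed to mirror the list syntax of prunebelow)
  followed <- vtree(FakeData, "Severity Sex",
                    follow = list(Severity = wanted))
}
```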

Renaming nodes using the labelnode parameter

By default, vtree names nodes (except for the root node) using the values of the variable in question. (If the variable is a factor, the levels of the factor are used.) Sometimes it is convenient to instead specify custom labels for nodes. You can use the labelnode argument to relabel the values. For example, you might want to use “Male” and “Female” instead of “M” and “F”. The labelnode argument is specified as a list whose element names are variable names. You specify the relabeling like this: "New label"="Old label". For example:

vtree(FakeData,"Severity Sex",labelnode=list(Sex=c("Male"="M","Female"="F")),horiz=FALSE)
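The direction of the mapping ("New label"="Old label") is easy to get backwards. The base R sketch below shows how a named vector of that form can translate old values into new labels; it illustrates the convention only and is not taken from vtree's source.

```r
# "New label" = "Old label", as in the labelnode argument
relabel <- c("Male" = "M", "Female" = "F")

# Made-up values for illustration
values <- c("F", "M", "M", NA, "F")

# Look up each old value among the mapping's values and
# take the corresponding name as the new label
new_values <- names(relabel)[match(values, relabel)]

# Values without an entry in the mapping keep their old label
unmatched <- is.na(new_values)
new_values[unmatched] <- values[unmatched]

print(new_values)
```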

Renaming variables using the labelvar parameter

It’s often useful to specify a more informative label in place of the variable name. The labelvar argument is used to specify these labels. As an example, suppose Severity represents severity on day 1.

vtree(FakeData,"Severity Group",labelvar=c("Severity"="Severity on day 1"),horiz=FALSE)

Fancy formatting of labels using HTML-style codes

NOTE: The section after this one shows how to use an easy alternative to HTML-style codes.

Graphviz (the open source graph visualization software that provides the basis for vtree) allows fancy formatting of text using HTML-style codes. In particular,

See https://www.graphviz.org/doc/info/shapes.html#html for more details.

Note: To use these HTML-style codes, it is necessary to specify HTMLtext=TRUE as in the example below:

vtree(FakeData,"Severity",HTMLtext=TRUE,horiz=FALSE,
  labelnode=list("Severity"=c(
    "<B>Mild</B><BR ALIGN='LEFT'/>"="Mild",
    "<I>Moderate</I><BR/>"="Moderate",
    "Severe<FONT POINT-SIZE='10'><SUP>Superscript</SUP></FONT><BR/>"="Severe",
    "NA<FONT POINT-SIZE='10'><SUB>Subscript</SUB></FONT><BR/>"="NA")),
  title="<FONT FACE='Times-Roman' COLOR='red' POINT-SIZE='20'>Whole group</FONT><BR/>")

Fancy formatting of labels using markdown-style codes

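The vtree package implements an easier way to perform some fancy formatting tasks, using markdown-style codes in place of HTML-style codes. In the example below, **text** is used for bold, *text* for italics, ^text^ for superscript, ~text~ for subscript, and %%red text%% for colored text.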

vtree(FakeData,"Severity",horiz=FALSE,
  labelnode=list("Severity"=c(
    "**Mild**"="Mild",
    "*Moderate*"="Moderate",
    "Severe^Superscript^"="Severe",
    "NA~subscript~"="NA")),
  title="%%red Whole group%%")

Checking for missing values with the check.is.na parameter

The check.is.na argument is used to produce a tree that only shows whether the specified variables are missing or not. Whereas the variables that vtree uses to build variable trees are usually categorical, this is a situation where non-categorical variables can be used, because their missingness is represented, not their actual values.

vtree(FakeData,"Severity Age",check.is.na=TRUE)
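The information in such a tree corresponds to a cross-tabulation of missingness indicators. Here is a base R sketch using a small made-up data frame (toy is hypothetical, and the "missing" / "not missing" labels are chosen for illustration rather than taken from vtree).

```r
# Made-up data: one categorical and one numeric variable
toy <- data.frame(
  Severity = c("Mild", NA, "Severe", "Mild", NA),
  Age      = c(34, 51, NA, 47, NA)
)

# Replace each variable by an indicator of whether it is missing
flag <- function(x) ifelse(is.na(x), "missing", "not missing")

# Cross-tabulate missingness of Age within missingness of Severity
tab <- table(Severity = flag(toy$Severity), Age = flag(toy$Age))
print(tab)
```

Because only missingness is tabulated, it does not matter that Age is numeric.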

Adding text to nodes using the text parameter

The argument text lets you add text to nodes. A list is specified whose element names are variable names, and text is specified as: "Label"="Text to add".

vtree(FakeData,"Severity",horiz=FALSE,
  text=list("Severity"=c("Mild"="Includes first-time visits")))

Displaying summary statistics using the summary parameter

A simple example

The summary argument can be used to specify summary statistics to display in each node. For example,

vtree(FakeData,"Severity",summary="Score %mean%",horiz=FALSE)

In each node the mean value of the Score variable is displayed. The summary argument ("Score %mean%") starts with the name of the variable to be summarized (Score), and is followed by a code (%mean%) indicating that we want to see the mean.

The following codes are defined:

  • %mean% mean
  • %SD% standard deviation
  • %min% minimum
  • %max% maximum
  • %pX% Xth percentile (e.g. p50 means the 50th percentile)
  • %median% median, i.e. p50
  • %IQR% IQR, i.e. p25, p75
  • %npct% n (%). By default “valid percentages” are used. Any missing values are also reported.
  • %list% list of the individual values
  • %mv% the number of missing values
  • %v% the name of the variable
  • %noroot% flag: Do not show summary in the root node.
  • %leafonly% flag: Only show summary in leaf nodes.

The summary argument can use any number of these codes, mixed with text and formatting codes.
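For reference, most of these codes have direct base R counterparts. The sketch below computes them for a made-up Score vector; the data are hypothetical, and details such as how missing values are handled or which quantile definition is used are assumptions here, not statements about vtree's internals.

```r
# Made-up scores and severity groups for illustration
score    <- c(12, 7, NA, 9, 15, 10)
severity <- c("Mild", "Mild", "Mild", "Severe", "Severe", "Severe")

# Base R counterparts of several summary codes, ignoring missing values
summaries <- c(
  mean   = mean(score, na.rm = TRUE),
  SD     = sd(score, na.rm = TRUE),
  min    = min(score, na.rm = TRUE),
  max    = max(score, na.rm = TRUE),
  median = median(score, na.rm = TRUE),
  p25    = unname(quantile(score, 0.25, na.rm = TRUE)),
  p75    = unname(quantile(score, 0.75, na.rm = TRUE)),
  mv     = sum(is.na(score))
)
print(summaries)

# A per-node summary is the same statistic computed within each subset
group_means <- tapply(score, severity, mean, na.rm = TRUE)
print(group_means)
```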

More than one variable

It is also possible to summarize more than one variable by specifying a vector of character strings, each one referring to a different variable.

For example,

vtree(FakeData,"Severity",horiz=FALSE,showlevels=FALSE,
  summary=c(
    "Score \nScore\n mean(SD) %mean%(%SD%)",
    "Pre \n\nPre\n range %min%, %max%"))

The %npct% code

Consider the following variable tree. Note that Viral represents whether a patient has (TRUE) or does not have (FALSE) a viral infection.

vtree(FakeData,"Severity Viral",horiz=FALSE,showlevels=FALSE)

The tree above has lots of nodes. If what we’re looking for is simply the number and percentage of patients with viral infection in each severity group, the %npct% code can be used. This results in a simpler tree:

vtree(FakeData,"Severity",summary="Viral \nViral %npct%",horiz=FALSE,showlevels=FALSE)

Note that in each node, “mv” indicates the number of missing values (if any).
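The quantities behind %npct% can be worked out by hand. In this base R sketch the viral vector is made up, and the exact layout of the label is an assumption for illustration, not vtree's own formatting.

```r
# Made-up TRUE/FALSE values with one missing value
viral <- c(TRUE, FALSE, TRUE, NA, FALSE, TRUE)

n_true  <- sum(viral, na.rm = TRUE)   # number with viral infection
n_valid <- sum(!is.na(viral))         # denominator for valid percentages
mv      <- sum(is.na(viral))          # number of missing values

# "n (%)" using valid percentages, followed by the missing-value count
label <- sprintf("%d (%d%%) mv=%d",
                 n_true, round(100 * n_true / n_valid), mv)
print(label)
```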

The %list% code

It is sometimes convenient to see individual values of a variable in each node. To do this, use the %list% code. It may also be convenient to not show the individual values in the root node. To do this, use the %noroot% code. For example,

vtree(FakeData,"Severity",summary="id \nid = %list% %noroot%",horiz=FALSE)

A single vector that is not part of the data frame

When you want to produce a flowchart for a single variable stored in a vector (i.e. to generate simple frequencies for that variable), there is a simplified way to call vtree. Rather than specifying a data frame and a variable name, simply pass in the vector.

vtree(c("A","A","A","B","C"),horiz=FALSE)
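The tree for a single vector carries the same information as a one-way frequency table. For comparison, here are the frequencies and percentages for the same vector in base R:

```r
x <- c("A", "A", "A", "B", "C")

# One-way frequencies, keeping missing values if there are any
freq <- table(x, useNA = "ifany")
pct  <- round(100 * prop.table(freq))

one_way <- data.frame(value = names(freq),
                      n     = as.vector(freq),
                      pct   = as.vector(pct))
print(one_way)
```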

Examining the DOT script generated by vtree

Specifying getscript=TRUE lets you capture the DOT script representing a flowchart. Here is an example:

dotscript <- vtree(FakeData,"Severity",getscript=TRUE)
cat(dotscript)
digraph boxes_and_circles {
graph [layout = dot, compound=true, nodesep=0.5, ranksep=0.5, fontsize=12]
node [fontname = Helvetica, fontcolor = black,shape = rectangle, color = black,margin =  0.2]
rankdir=LR;
Node_L0[style=invisible]
Node_L1[label=<Severity> shape=none]

edge[style=invis];
Node_L0->Node_L1

edge[style=solid];
Node_1->Node_2 Node_1->Node_3 Node_1->Node_4 Node_1->Node_5

Node_1[label=<Sample<BR/>46> color=black style="rounded,filled" fillcolor=<#EFF3FF>] Node_2[label=<Mild<BR/>19 (48%)> color=black style="rounded,filled" fillcolor=<#C6DBEF>] Node_3[label=<Moderate<BR/>16 (40%)> color=black style="rounded,filled" fillcolor=<#C6DBEF>] Node_4[label=<Severe<BR/>5 (12%)> color=black style="rounded,filled" fillcolor=<#C6DBEF>] Node_5[label=<NA<BR/>6> color=black style="rounded,filled" fillcolor=<#C6DBEF>]

}

The grVizToPNG function: Saving trees to PNG files

vtree uses the DiagrammeR package (which in turn is built on the open source graph visualization software, Graphviz).

Diagrams created using vtree will automatically render to HTML (for example, in the RStudio Viewer pane, or from R in a browser window). But to include them in Microsoft Word documents, you need to create a PNG file. The function grVizToPNG does this.

For example, if you saved the output of a call to vtree to an object called example1, you could use grVizToPNG to create a PNG file called example1.png like this:

grVizToPNG(example1)

Notes:

Embedding a PNG image into R Markdown output

Suppose you are using R Markdown, and wish to embed the PNG image generated by calling grVizToPNG into your output (for example a Word document). If you want the image scaled to, say, 3 inches tall, add this code inline (i.e. not in a code chunk):

![](example1.png){ height=3in }

If, in your call to grVizToPNG, you specified that graphics files should be stored in a subfolder called MyFolder, use the following code:

![](MyFolder/example1.png){ height=3in }