Abraham Lincoln once said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Aunt Margaret used to say, "If you dream of a forest, you'd better learn how to plant a tree."
data.tree says, "No matter if you are a lumberjack or a tree hugger. I will be your sanding block, and I will be your seed."

Chapter 1: The basics

About trees

Trees are ubiquitous in mathematics, computer science, data science, finance, and many other fields. They are especially useful when we face hierarchical data. For example, trees are used:

  • in decision theory (cf. decision trees)
  • in machine learning (e.g. classification trees)
  • in finance, e.g. to classify financial instruments into asset classes
  • in routing algorithms
  • in computer science and programming (e.g. binary search trees, XML)
  • in genealogy (e.g. family trees)

Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no high-level hierarchical data structure that could be used as conveniently and generically as, say, data.frame.

As a result, people often try to solve hierarchical problems in a tabular fashion, for instance with data.frames (or - perish the thought! - in Excel sheets). But hierarchies do not map well onto tables, and various workarounds are usually required.

This package offers an alternative. The data.tree package allows you to create hierarchies with the Node object. Node provides basic traversal, search, and sort operations. You can decorate Nodes with attributes and methods, extending the package to your needs.

The package also provides convenient methods for neatly printing trees, and converting trees to data.frames for integration with other packages.

The example in this vignette revolves around decision trees.

Tree creation

Let’s start by creating a tree of Nodes. In our example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 0) is the company. On level 1, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:

library(data.tree)
acme <- Node$new("Acme Inc.")
  accounting <- acme$AddChild("Accounting")
    software <- accounting$AddChild("New Software")
    standards <- accounting$AddChild("New Accounting Standards")
  research <- acme$AddChild("Research")
    newProductLine <- research$AddChild("New Product Line")
    newLabs <- research$AddChild("New Labs")
  it <- acme$AddChild("IT")
    outsource <- it$AddChild("Outsource")
    agile <- it$AddChild("Go agile")
    goToR <- it$AddChild("Switch to R")

print(acme)
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

Note that Node is an R6 reference class. Essentially, this has two implications:

  1. You can call methods on a Node in OO style
  2. You can call methods on a Node that modify it, without having to re-assign the result to a variable. This is different from value semantics, which are much more widely used in R.
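
The second point can be illustrated in base R alone. The following is a minimal sketch, not data.tree code: it contrasts a list (value semantics) with an environment (reference semantics). R6 objects are built on environments, which is why a Node behaves like the second case. The variable names are made up for illustration.

```r
# Value semantics: assigning a list creates an independent copy.
original <- list(cost = 100)
copy <- original
copy$cost <- 200
original$cost   # still 100

# Reference semantics: an environment is not copied on assignment,
# so both names refer to the same underlying object.
node <- new.env()
node$cost <- 100
alias <- node
alias$cost <- 200
node$cost       # now 200
```

This is why a call that modifies a Node takes effect without re-assignment.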

Calling methods in OO style

For example, we can check if a Node is the root:

acme$isRoot
## [1] TRUE

Custom attributes

Now, let’s associate some costs with the projects. We do this by setting custom attributes on the leaf Nodes:

software$cost <- 1000000
standards$cost <- 500000
newProductLine$cost <- 2000000
newLabs$cost <- 750000
outsource$cost <- 400000
agile$cost <- 250000
goToR$cost <- 50000

Also, we set the probabilities that the projects will be executed in the next year:

software$p <- 0.5
standards$p <- 0.75
newProductLine$p <- 0.25
newLabs$p <- 0.9
outsource$p <- 0.2
agile$p <- 0.05
goToR$p <- 1

Converting to data.frame

We can now convert the tree into a data.frame. Note that we always call such methods on the root Node:

acmedf <- as.data.frame(acme)

The same can be achieved by using the OO-style method Node$ToDataFrame:

acme$ToDataFrame()

Adding the project cost to our data.frame is easy to do with the Get method. We’ll explain the Get method in more detail below.

acmedf$level <- acme$Get("level")
acmedf$cost <- acme$Get("cost")

We could have achieved the same result in one go, using the OO-style ToDataFrame method:

acme$ToDataFrame("level", "cost")
##                           levelName level    cost
## 1  Acme Inc.                            0      NA
## 2   ¦--Accounting                       1      NA
## 3   ¦   ¦--New Software                 2 1000000
## 4   ¦   °--New Accounting Standards     2  500000
## 5   ¦--Research                         1      NA
## 6   ¦   ¦--New Product Line             2 2000000
## 7   ¦   °--New Labs                     2  750000
## 8   °--IT                               1      NA
## 9       ¦--Outsource                    2  400000
## 10      ¦--Go agile                     2  250000
## 11      °--Switch to R                  2   50000

Internally, the same method is called when printing a tree:

print(acme, "level", "cost")

Using Get when converting to data.frame and for printing

Above, we saw how we can pass the name of an attribute to the ellipsis argument of as.data.frame (or its OO-style counterpart ToDataFrame). We can also pass the result of the Get method directly. This allows us, for example, to format a column in a specific way. Details of the Get method are explained in the next section.

acme$ToDataFrame("level",
                  probability = acme$Get("p", format = FormatPercent)
                )
##                           levelName level probability
## 1  Acme Inc.                            0            
## 2   ¦--Accounting                       1            
## 3   ¦   ¦--New Software                 2     50.00 %
## 4   ¦   °--New Accounting Standards     2     75.00 %
## 5   ¦--Research                         1            
## 6   ¦   ¦--New Product Line             2     25.00 %
## 7   ¦   °--New Labs                     2     90.00 %
## 8   °--IT                               1            
## 9       ¦--Outsource                    2     20.00 %
## 10      ¦--Go agile                     2      5.00 %
## 11      °--Switch to R                  2    100.00 %

Chapter 2: Tree Traversal

Get method (Tree Traversal)

Tree traversal is one of the core concepts of trees; see, for example, the Wikipedia article on tree traversal. The Get method traverses the tree and collects values from each node. It then returns a vector containing the collected values.

Additional features of the Get method are:

  • execute a function on each node, and append the function’s result to the returned vector
  • execute a Node method on each node, and append the method’s return value to the returned vector
  • assign the function or method return value to a Node’s attribute

Traversal order

The Get method can traverse the tree in various ways. This is called traversal order.

Pre-Order

The default traversal mode is pre-order.

[Figure: pre-order traversal]

This is what is used e.g. in as.data.frame and its OO-style counterpart Node$ToDataFrame:

acme$ToDataFrame("level")
##                           levelName level
## 1  Acme Inc.                            0
## 2   ¦--Accounting                       1
## 3   ¦   ¦--New Software                 2
## 4   ¦   °--New Accounting Standards     2
## 5   ¦--Research                         1
## 6   ¦   ¦--New Product Line             2
## 7   ¦   °--New Labs                     2
## 8   °--IT                               1
## 9       ¦--Outsource                    2
## 10      ¦--Go agile                     2
## 11      °--Switch to R                  2

Post-Order

The post-order traversal mode returns children first, returning parents only after all children have been traversed:

[Figure: post-order traversal]

We can use it like this on the Get method:

data.frame(level = acme$Get('level', traversal = "post-order"))
##                          level
## New Software                 2
## New Accounting Standards     2
## Accounting                   1
## New Product Line             2
## New Labs                     2
## Research                     1
## Outsource                    2
## Go agile                     2
## Switch to R                  2
## IT                           1
## Acme Inc.                    0

This is useful if a parent’s value depends on its children, as we’ll see below.
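
To make the difference between the two orders concrete, here is a minimal sketch in base R. It uses nested lists as a stand-in for a tree, not the Node class, and the names miniTree, PreOrder and PostOrder are hypothetical helpers introduced only for this illustration.

```r
# A hypothetical miniature tree as nested lists (not data.tree Nodes).
miniTree <- list(
  name = "Acme Inc.",
  children = list(
    list(name = "Accounting", children = list(
      list(name = "New Software", children = list()),
      list(name = "New Accounting Standards", children = list())
    )),
    list(name = "IT", children = list(
      list(name = "Outsource", children = list())
    ))
  )
)

# Pre-order: visit the node itself first, then each child sub-tree.
PreOrder <- function(node) {
  c(node$name, unlist(lapply(node$children, PreOrder)))
}

# Post-order: visit all child sub-trees first, then the node itself.
PostOrder <- function(node) {
  c(unlist(lapply(node$children, PostOrder)), node$name)
}

PreOrder(miniTree)   # starts with "Acme Inc."
PostOrder(miniTree)  # ends with "Acme Inc."
```

The only difference between the two functions is whether the node's own name is placed before or after the results of its children.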

Ancestor

This is a non-standard traversal mode that does not traverse the entire tree. Instead, the ancestor mode starts from a Node, then walks the tree along the path from ancestor to ancestor, up to the root.

data.frame(level = agile$Get('level', traversal = "ancestor"))
##           level
## Go agile      2
## IT            1
## Acme Inc.     0
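
The idea behind the ancestor mode can be sketched in base R as a walk along parent pointers. This is an illustration under assumptions, not how data.tree implements it: MakeNode and AncestorPath are hypothetical helpers, and the nodes are plain environments.

```r
# Hypothetical helper: a node is an environment holding a parent pointer.
MakeNode <- function(name, parent = NULL) {
  node <- new.env()
  node$name <- name
  node$parent <- parent
  node
}

rootNode <- MakeNode("Acme Inc.")
itNode <- MakeNode("IT", parent = rootNode)
agileNode <- MakeNode("Go agile", parent = itNode)

# Start at a node and follow parent pointers until the root is passed.
AncestorPath <- function(node) {
  path <- character(0)
  while (!is.null(node)) {
    path <- c(path, node$name)
    node <- node$parent
  }
  path
}

AncestorPath(agileNode)
```

Only the nodes on the path to the root are visited; siblings such as Outsource never appear.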

Get using a function

Pass a function to the Get method

You can pass a standard R function to the Get method. For example:

ExpectedCost <- function(node) {
  result <- node$cost * node$p
  if(length(result) == 0) result <- NA
  return (result)
}

data.frame(acme$Get(ExpectedCost))
##                          acme.Get.ExpectedCost.
## Acme Inc.                                    NA
## Accounting                                   NA
## New Software                             500000
## New Accounting Standards                 375000
## Research                                     NA
## New Product Line                         500000
## New Labs                                 675000
## IT                                           NA
## Outsource                                 80000
## Go agile                                  12500
## Switch to R                               50000

The requirements for the function (ExpectedCost in the above example) are the following:

  • the first argument of the function is a Node
  • it needs to return a scalar

Using recursion

In the following examples, we use magrittr to enhance readability of the code.

library(magrittr)
ExpectedCost <- function(node) {
  result <- node$cost * node$p
  if(length(result) == 0) {
    if (node$isLeaf) result <- NA
    else {
      node$children %>% sapply(ExpectedCost) %>% sum -> result
    }
  }
  return (result)
}

data.frame(ec = acme$Get(ExpectedCost))
##                               ec
## Acme Inc.                2192500
## Accounting                875000
## New Software              500000
## New Accounting Standards  375000
## Research                 1175000
## New Product Line          500000
## New Labs                  675000
## IT                        142500
## Outsource                  80000
## Go agile                   12500
## Switch to R                50000

Add parameters to the passed function

The Get method accepts an ellipsis (...). Any additional parameters with which Get is called will be passed on to the ExpectedCost function. This gives us more flexibility. For instance, we don’t have to hard-code the sum function into ExpectedCost, but we can leave it to the caller to provide the function to use:

ExpectedCost <- function(node, fun = sum) {
  result <- node$cost * node$p
  if(length(result) == 0) {
    if (node$isLeaf) result <- NA
    else {
      node$children %>% sapply(function(x) ExpectedCost(x, fun = fun)) %>% fun -> result
    }
  }
  return (result)
}

data.frame(ec = acme$Get(ExpectedCost, fun = mean))
##                              ec
## Acme Inc.                357500
## Accounting               437500
## New Software             500000
## New Accounting Standards 375000
## Research                 587500
## New Product Line         500000
## New Labs                 675000
## IT                        47500
## Outsource                 80000
## Go agile                  12500
## Switch to R               50000

Assigning values using Get

We can tell the Get method to assign the value to a specific attribute for each Node it traverses. This is especially useful if the attribute parameter is a function, as in the previous examples. For instance, we can store the calculated expected cost for later use and printing:

acme$Get(function(x) x$p * x$cost, assign = "expectedCost")
##                Acme Inc.               Accounting             New Software 
##                       NA                       NA                   500000 
## New Accounting Standards                 Research         New Product Line 
##                   375000                       NA                   500000 
##                 New Labs                       IT                Outsource 
##                   675000                       NA                    80000 
##                 Go agile              Switch to R 
##                    12500                    50000
print(acme, "p", "cost", "expectedCost")
##                           levelName    p    cost expectedCost
## 1  Acme Inc.                          NA      NA           NA
## 2   ¦--Accounting                     NA      NA           NA
## 3   ¦   ¦--New Software             0.50 1000000       500000
## 4   ¦   °--New Accounting Standards 0.75  500000       375000
## 5   ¦--Research                       NA      NA           NA
## 6   ¦   ¦--New Product Line         0.25 2000000       500000
## 7   ¦   °--New Labs                 0.90  750000       675000
## 8   °--IT                             NA      NA           NA
## 9       ¦--Outsource                0.20  400000        80000
## 10      ¦--Go agile                 0.05  250000        12500
## 11      °--Switch to R              1.00   50000        50000

Combine assignment and calculation

In the recursion example above, each node recurses through all of its descendants down to the leaves, so the very same calculations are repeated several times.

We can avoid repeating calculations by piggy-backing on precalculated values. Obviously, this requires us to traverse the tree in post-order mode: We want to start calculating at the leaves, cache the results for later use, then walk back towards the root.

In the following example, we calculate the average expected cost, just as above. As this now depends only on a Node’s children, and because we walk the tree in post-order mode, we can be sure that our children have the value calculated when we traverse the parent.

ExpectedCost <- function(node, variableName = "avgExpectedCost", fun = sum) {
  #if the "cache" is filled, I return it. This stops the recursion
  if(!is.null(node[[variableName]])) return (node[[variableName]])
  
  #otherwise, I calculate from my own properties
  result <- node$cost * node$p
  
  #if the properties are not set, I aggregate my children's values using fun
  if(length(result) == 0) {
    if (node$isLeaf) result <- NA
    else {
      node$children %>%
      sapply(function(x) ExpectedCost(x, variableName = variableName, fun = fun)) %>%
      fun -> result
    }
  }
  return (result)
}

We can use our function like this:

invisible(
  acme$Get(ExpectedCost, fun = mean, traversal = "post-order", assign = "avgExpectedCost")
)
print(acme, "cost", "p", "avgExpectedCost")
##                           levelName    cost    p avgExpectedCost
## 1  Acme Inc.                             NA   NA          357500
## 2   ¦--Accounting                        NA   NA          437500
## 3   ¦   ¦--New Software             1000000 0.50          500000
## 4   ¦   °--New Accounting Standards  500000 0.75          375000
## 5   ¦--Research                          NA   NA          587500
## 6   ¦   ¦--New Product Line         2000000 0.25          500000
## 7   ¦   °--New Labs                  750000 0.90          675000
## 8   °--IT                                NA   NA           47500
## 9       ¦--Outsource                 400000 0.20           80000
## 10      ¦--Go agile                  250000 0.05           12500
## 11      °--Switch to R                50000 1.00           50000
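
The caching idea can also be reduced to a stand-alone base R sketch. It is an illustration under assumptions rather than data.tree code: NewNode, AddTo and CacheMeans are hypothetical helpers, and nodes are environments so that cached values persist between calls.

```r
# Hypothetical mini-tree built from environments (reference semantics).
NewNode <- function(name, value = NULL) {
  node <- new.env()
  node$name <- name
  node$value <- value
  node$children <- list()
  node
}
AddTo <- function(parent, child) {
  parent$children <- c(parent$children, list(child))
  invisible(child)
}

top <- NewNode("top")
left <- AddTo(top, NewNode("left"))
AddTo(left, NewNode("a", 10))
AddTo(left, NewNode("b", 20))
AddTo(top, NewNode("c", 30))

# Post-order walk: children are processed (and cached) before their parent,
# so every node's value is computed exactly once.
CacheMeans <- function(node) {
  for (child in node$children) CacheMeans(child)
  if (is.null(node$value)) {
    node$value <- mean(sapply(node$children, function(ch) ch$value))
  }
  invisible(node$value)
}

CacheMeans(top)
left$value  # 15
top$value   # 22.5
```

When the parent is reached, it only reads the values already stored on its direct children instead of descending to the leaves again.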

Formatting Get

We can pass a formatting function to the Get method, which will convert the returned value to a human-readable string for printing.

PrintMoney <- function(x) {
  format(x, digits=10, nsmall=2, decimal.mark=".", big.mark="'", scientific = FALSE)
}

print(acme, cost = acme$Get("cost", format = PrintMoney))
##                           levelName         cost
## 1  Acme Inc.                                  NA
## 2   ¦--Accounting                             NA
## 3   ¦   ¦--New Software             1'000'000.00
## 4   ¦   °--New Accounting Standards   500'000.00
## 5   ¦--Research                               NA
## 6   ¦   ¦--New Product Line         2'000'000.00
## 7   ¦   °--New Labs                   750'000.00
## 8   °--IT                                     NA
## 9       ¦--Outsource                  400'000.00
## 10      ¦--Go agile                   250'000.00
## 11      °--Switch to R                 50'000.00

Note that the format is not used for assignment with the assign parameter, but only for the values returned by Get:

acme$Get("cost", format = PrintMoney, assign = "cost2")
##                Acme Inc.               Accounting             New Software 
##                     "NA"                     "NA"           "1'000'000.00" 
## New Accounting Standards                 Research         New Product Line 
##             "500'000.00"                     "NA"           "2'000'000.00" 
##                 New Labs                       IT                Outsource 
##             "750'000.00"                     "NA"             "400'000.00" 
##                 Go agile              Switch to R 
##             "250'000.00"              "50'000.00"
print(acme, cost = acme$Get("cost2"))
##                           levelName    cost
## 1  Acme Inc.                             NA
## 2   ¦--Accounting                        NA
## 3   ¦   ¦--New Software             1000000
## 4   ¦   °--New Accounting Standards  500000
## 5   ¦--Research                          NA
## 6   ¦   ¦--New Product Line         2000000
## 7   ¦   °--New Labs                  750000
## 8   °--IT                                NA
## 9       ¦--Outsource                 400000
## 10      ¦--Go agile                  250000
## 11      °--Switch to R                50000

The format function is useful not only for formatting numbers, but also for displaying a printable representation of a Node field that is not a numeric (but e.g. a matrix).
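
As an illustration of the last point, a formatter for a matrix-valued field might look like the sketch below. PrintDim is a hypothetical helper, not part of data.tree, and the sketch assumes the formatter is handed one field value at a time. It is shown stand-alone rather than passed to Get.

```r
# Hypothetical formatter: summarise a matrix by its dimensions,
# and fall back to as.character for anything else.
PrintDim <- function(x) {
  if (is.matrix(x)) {
    paste0("matrix ", nrow(x), "x", ncol(x))
  } else {
    as.character(x)
  }
}

PrintDim(matrix(0, nrow = 2, ncol = 3))   # "matrix 2x3"
PrintDim(42)                              # "42"
```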

Set method

The Set method is the counterpart to the Get method. The Set method takes a vector or a single value as an input, and traverses the tree in a certain order. Each Node is assigned a value from the vector, one after the other.

Assigning values

employees <- c(NA, 52, NA, NA, 78, NA, NA, 39, NA, NA, NA)
acme$Set(employees)
print(acme, "employees")
##                           levelName employees
## 1  Acme Inc.                               NA
## 2   ¦--Accounting                          52
## 3   ¦   ¦--New Software                    NA
## 4   ¦   °--New Accounting Standards        NA
## 5   ¦--Research                            78
## 6   ¦   ¦--New Product Line                NA
## 7   ¦   °--New Labs                        NA
## 8   °--IT                                  39
## 9       ¦--Outsource                       NA
## 10      ¦--Go agile                        NA
## 11      °--Switch to R                     NA

The Set method can take multiple vectors as an input, and, optionally, you can define the name of the attribute:

secretaries <- c(NA, 5, NA, NA, 6, NA, NA, 2, NA, NA, NA)
acme$Set(secretaries, secPerEmployee = secretaries/employees)
print(acme, "employees", "secretaries", "secPerEmployee")
##                           levelName employees secretaries secPerEmployee
## 1  Acme Inc.                               NA          NA             NA
## 2   ¦--Accounting                          52           5     0.09615385
## 3   ¦   ¦--New Software                    NA          NA             NA
## 4   ¦   °--New Accounting Standards        NA          NA             NA
## 5   ¦--Research                            78           6     0.07692308
## 6   ¦   ¦--New Product Line                NA          NA             NA
## 7   ¦   °--New Labs                        NA          NA             NA
## 8   °--IT                                  39           2     0.05128205
## 9       ¦--Outsource                       NA          NA             NA
## 10      ¦--Go agile                        NA          NA             NA
## 11      °--Switch to R                     NA          NA             NA

Just as for the Get method, the traversal order is important for the Set method.

Often, it is useful to use Get and Set together:

ec <- acme$Get(function(x) x$p * x$cost)
acme$Set(expectedCost = ec)
print(acme, "p", "cost", "expectedCost")
##                           levelName    p    cost expectedCost
## 1  Acme Inc.                          NA      NA           NA
## 2   ¦--Accounting                     NA      NA           NA
## 3   ¦   ¦--New Software             0.50 1000000       500000
## 4   ¦   °--New Accounting Standards 0.75  500000       375000
## 5   ¦--Research                       NA      NA           NA
## 6   ¦   ¦--New Product Line         0.25 2000000       500000
## 7   ¦   °--New Labs                 0.90  750000       675000
## 8   °--IT                             NA      NA           NA
## 9       ¦--Outsource                0.20  400000        80000
## 10      ¦--Go agile                 0.05  250000        12500
## 11      °--Switch to R              1.00   50000        50000

This is equivalent to using Get with the assign parameter.

Deleting attributes

The Set method can also be used to assign a single value directly to all Nodes traversed. For example, to remove the avgExpectedCost attribute, we assign NULL on each node:

acme$Set(avgExpectedCost = NULL)

Note that attributes that have never been assigned also return NULL:

acme$newAttribute
## NULL

Chaining

As Node is an R6 reference object, we can chain the method calls:

acme$Set(avgExpectedCost = NULL)$Set(expectedCost = NA)
print(acme, "avgExpectedCost", "expectedCost")
##                           levelName avgExpectedCost expectedCost
## 1  Acme Inc.                                     NA           NA
## 2   ¦--Accounting                                NA           NA
## 3   ¦   ¦--New Software                          NA           NA
## 4   ¦   °--New Accounting Standards              NA           NA
## 5   ¦--Research                                  NA           NA
## 6   ¦   ¦--New Product Line                      NA           NA
## 7   ¦   °--New Labs                              NA           NA
## 8   °--IT                                        NA           NA
## 9       ¦--Outsource                             NA           NA
## 10      ¦--Go agile                              NA           NA
## 11      °--Switch to R                           NA           NA

This is equivalent to:

acme$Set(avgExpectedCost = NULL, expectedCost = NA)

A word on NULL and NA

Also note that setting a value to NA or to NULL looks equivalent when printing or converting to a data.frame, but internally it is not:

acme$avgExpectedCost
## NULL
acme$expectedCost
## [1] NA

The reason is that NULL is always converted to NA for printing, and when using the Get method.
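
The difference can be seen in base R alone. The sketch below uses a plain list as a stand-in for a Node (an assumption made for illustration). It also shows why the ExpectedCost functions above test length(result) == 0: arithmetic on NULL yields a zero-length vector rather than NA.

```r
# A plain list standing in for a node without "cost" and "p" attributes.
emptyNode <- list(name = "Accounting")

emptyNode$cost                         # NULL: the attribute does not exist
length(emptyNode$cost * emptyNode$p)   # 0: arithmetic on NULL gives a zero-length vector
length(NA)                             # 1: NA is a value of length one
is.null(NA)                            # FALSE
is.na(NA)                              # TRUE
```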

Chapter 3: Advanced Features

Aggregate method

For simple cases, you don’t have to write your own function to be passed along to the Get method. For example, the Aggregate method provides a shorthand for the oft-used case when a parent is the aggregate of its children’s values:

acme$Aggregate("cost", sum)
## [1] 4950000

We can use this in the Get method:

acme$Get("Aggregate", "cost", sum)
##                Acme Inc.               Accounting             New Software 
##                  4950000                  1500000                  1000000 
## New Accounting Standards                 Research         New Product Line 
##                   500000                  2750000                  2000000 
##                 New Labs                       IT                Outsource 
##                   750000                   700000                   400000 
##                 Go agile              Switch to R 
##                   250000                    50000

This is the equivalent of:

GetCost <- function(node) {
  result <- node$cost
  if(length(result) == 0) {
    if (node$isLeaf) stop(paste("Cost for ", node$name, " not available!"))
    else {
      node$children %>% sapply(GetCost) %>% sum -> result
    }
  }
  return (result)
}

acme$Get(GetCost)
##                Acme Inc.               Accounting             New Software 
##                  4950000                  1500000                  1000000 
## New Accounting Standards                 Research         New Product Line 
##                   500000                  2750000                  2000000 
##                 New Labs                       IT                Outsource 
##                   750000                   700000                   400000 
##                 Go agile              Switch to R 
##                   250000                    50000

Sort method

You can sort an entire tree by using the Sort method on the root. The method will sort recursively and, for each Node, sort children by a child attribute. As before, the child attribute can also be a function or a method (e.g. of a sub-class of Node, see below).

acme$Get(ExpectedCost, assign = "expectedCost")
##                Acme Inc.               Accounting             New Software 
##                  2192500                   875000                   500000 
## New Accounting Standards                 Research         New Product Line 
##                   375000                  1175000                   500000 
##                 New Labs                       IT                Outsource 
##                   675000                   142500                    80000 
##                 Go agile              Switch to R 
##                    12500                    50000
acme$Sort("expectedCost", decreasing = TRUE)
print(acme, "expectedCost")
##                           levelName expectedCost
## 1  Acme Inc.                             2192500
## 2   ¦--Research                          1175000
## 3   ¦   ¦--New Labs                       675000
## 4   ¦   °--New Product Line               500000
## 5   ¦--Accounting                         875000
## 6   ¦   ¦--New Software                   500000
## 7   ¦   °--New Accounting Standards       375000
## 8   °--IT                                 142500
## 9       ¦--Outsource                       80000
## 10      ¦--Switch to R                     50000
## 11      °--Go agile                        12500

Naturally, you can also sort a sub-tree by calling Sort on the sub-tree’s root node.
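
The recursive behaviour described above can be sketched in base R with nested lists. This is not the Sort method itself: SortTree and the unsorted tree are hypothetical, and because lists have value semantics the sketch returns a sorted copy instead of modifying the tree in place.

```r
# Hypothetical recursive sort: order each node's children by their "value"
# field, then apply the same rule to every child sub-tree.
SortTree <- function(node, decreasing = FALSE) {
  if (length(node$children) > 0) {
    keys <- sapply(node$children, function(ch) ch$value)
    node$children <- node$children[order(keys, decreasing = decreasing)]
    node$children <- lapply(node$children, SortTree, decreasing = decreasing)
  }
  node
}

unsorted <- list(name = "root", value = 60, children = list(
  list(name = "x", value = 10, children = list()),
  list(name = "y", value = 50, children = list(
    list(name = "y1", value = 20, children = list()),
    list(name = "y2", value = 30, children = list())
  ))
))

sorted <- SortTree(unsorted, decreasing = TRUE)
sapply(sorted$children, function(ch) ch$name)                # "y" "x"
sapply(sorted$children[[1]]$children, function(ch) ch$name)  # "y2" "y1"
```

Children are only ever reordered among their siblings, so the shape of the hierarchy is preserved.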

Subclassing Node

We can create a subclass of Node, and add custom methods to our subclass. This comes naturally to users with experience in OO languages such as Java, Python or C#:

library(R6)
MyNode <- R6Class("MyNode",
                    inherit = Node,
                    lock = FALSE,
                    
                    #public fields and function
                    public = list(
                        
                        p = NULL, 
                        
                        cost = NULL,
                        
                        AddChild = function(name) {
                          child <- MyNode$new(name)
                          invisible (self$AddChildNode(child))
                        }
                        
                    ),
                    
                    #active
                    active = list(
                      
                      expectedCost = function() {
                        if ( is.null(self$p) || is.null(self$cost)) return (NULL)
                        self$p * self$cost                    
                      }
                      
                    )
                )

The AddChild utility function in the subclass allows us to construct the tree just as before.

The expectedCost function is now an active binding of the class, and we can access it like a field, in a more R6-ish way.
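
How an active binding is accessed can be shown with a small stand-alone class. The sketch below is an assumption-laden illustration: Project is a hypothetical class that does not inherit from Node, so it runs with the R6 package alone.

```r
library(R6)

# Hypothetical stand-alone class (not derived from Node) with an active binding.
Project <- R6Class("Project",
  public = list(
    p = NULL,
    cost = NULL,
    initialize = function(p = NULL, cost = NULL) {
      self$p <- p
      self$cost <- cost
    }
  ),
  active = list(
    # Called on every access; "value" is only supplied on assignment.
    expectedCost = function(value) {
      if (!missing(value)) stop("expectedCost is read-only")
      if (is.null(self$p) || is.null(self$cost)) return(NULL)
      self$p * self$cost
    }
  )
)

demoProject <- Project$new(p = 0.5, cost = 1000000)
demoProject$expectedCost   # 5e+05, accessed like a field, without parentheses
```

Because the value is recomputed on each access, it always reflects the current p and cost.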