Trees are ubiquitous in mathematics, computer science, data science, finance, and many other fields. They are especially useful for representing hierarchical data.
Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no high-level hierarchical data structure that could be used as conveniently and generically as, say, data.frame.
As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames (or - perish the thought! - in Excel sheets). But hierarchies don’t marry with tables and various workarounds are usually required.
This package offers an alternative. The data.tree package allows you to create hierarchies with the Node object. Node provides basic traversal, search, and sort operations. You can decorate Nodes with attributes and methods, extending the package to your needs.
The package also provides convenient methods for neatly printing trees, and for converting trees to data.frames for integration with other packages.
The example in this vignette revolves around decision trees.
Let’s start by creating a tree of Nodes. In our example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 0) is the company. On level 1, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:
library(data.tree)
acme <- Node$new("Acme Inc.")
accounting <- acme$AddChild("Accounting")
software <- accounting$AddChild("New Software")
standards <- accounting$AddChild("New Accounting Standards")
research <- acme$AddChild("Research")
newProductLine <- research$AddChild("New Product Line")
newLabs <- research$AddChild("New Labs")
it <- acme$AddChild("IT")
outsource <- it$AddChild("Outsource")
agile <- it$AddChild("Go agile")
goToR <- it$AddChild("Switch to R")
print(acme)
##                           levelName
## 1  Acme Inc.
## 2   ¦--Accounting
## 3   ¦   ¦--New Software
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research
## 6   ¦   ¦--New Product Line
## 7   ¦   °--New Labs
## 8   °--IT
## 9       ¦--Outsource
## 10      ¦--Go agile
## 11      °--Switch to R
Note that Node is an R6 reference class. Essentially, this has two implications:
1. You can call methods on a Node in OO style.
2. You can call methods on a Node that modify it, without having to re-assign to a new variable. This is different from value semantics, which is much more widely used in R.
For example, we can check if a Node is the root:
acme$isRoot
## [1] TRUE
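The difference between reference and value semantics mentioned above can be illustrated without the package. The following is a minimal sketch in base R. It assumes only that a plain environment behaves like an R6 object in this respect (R6 objects are built on environments); the variable and function names are hypothetical.

```r
# Value semantics: a list is copied when it is modified inside a function.
valueNode <- list(name = "Acme Inc.", cost = 0)
SetCostByValue <- function(node, cost) {
  node$cost <- cost   # changes the local copy only
  node
}
invisible(SetCostByValue(valueNode, 100))
valueNode$cost        # still 0, because the result was not re-assigned

# Reference semantics: an environment is modified in place.
refNode <- new.env()
refNode$name <- "Acme Inc."
refNode$cost <- 0
SetCostByReference <- function(node, cost) {
  node$cost <- cost   # changes the caller's object
  invisible(node)
}
SetCostByReference(refNode, 100)
refNode$cost          # now 100, without any re-assignment
```

With value semantics you would have to write valueNode <- SetCostByValue(valueNode, 100) to keep the change.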
Now, let’s associate some costs with the projects. We do this by setting custom attributes on the leaf Nodes:
software$cost <- 1000000
standards$cost <- 500000
newProductLine$cost <- 2000000
newLabs$cost <- 750000
outsource$cost <- 400000
agile$cost <- 250000
goToR$cost <- 50000
Also, we set the probabilities that the projects will be executed in the next year:
software$p <- 0.5
standards$p <- 0.75
newProductLine$p <- 0.25
newLabs$p <- 0.9
outsource$p <- 0.2
agile$p <- 0.05
goToR$p <- 1
Converting to data.frame
We can now convert the tree into a data.frame. Note that we always call such methods on the root Node:
acmedf <- as.data.frame(acme)
The same can be achieved by using the OO-style method Node$ToDataFrame:
acme$ToDataFrame()
Adding the project cost to our data.frame is easy to do with the Get method. We’ll explain the Get method in more detail below.
acmedf$level <- acme$Get("level")
acmedf$cost <- acme$Get("cost")
We could have achieved the same result in one go, using the OO-style ToDataFrame method:
acme$ToDataFrame("level", "cost")
##                           levelName level    cost
## 1  Acme Inc.                            0      NA
## 2   ¦--Accounting                       1      NA
## 3   ¦   ¦--New Software                 2 1000000
## 4   ¦   °--New Accounting Standards     2  500000
## 5   ¦--Research                         1      NA
## 6   ¦   ¦--New Product Line             2 2000000
## 7   ¦   °--New Labs                     2  750000
## 8   °--IT                               1      NA
## 9       ¦--Outsource                    2  400000
## 10      ¦--Go agile                     2  250000
## 11      °--Switch to R                  2   50000
Internally, the same is called when printing a tree:
print(acme, "level", "cost")
Using Get when converting to data.frame and for printing
Above, we saw how to pass the name of an attribute through the ellipsis argument of as.data.frame. We can also pass the result of the Get method directly to as.data.frame or ToDataFrame. This allows, for example, formatting the column in a specific way. Details of the Get method are explained in the next section.
acme$ToDataFrame("level",
probability = acme$Get("p", format = FormatPercent)
)
## levelName level probability
## 1 Acme Inc. 0
## 2 ¦--Accounting 1
## 3 ¦ ¦--New Software 2 50.00 %
## 4 ¦ °--New Accounting Standards 2 75.00 %
## 5 ¦--Research 1
## 6 ¦ ¦--New Product Line 2 25.00 %
## 7 ¦ °--New Labs 2 90.00 %
## 8 °--IT 1
## 9 ¦--Outsource 2 20.00 %
## 10 ¦--Go agile 2 5.00 %
## 11 °--Switch to R 2 100.00 %
Get method (Tree Traversal)
Tree traversal is one of the core concepts of trees. See, for example, Tree Traversal on Wikipedia. The Get method traverses the tree and collects values from each node. It then returns a vector containing the collected values.
Additional features of the Get method are:
- call a Node method on each node, and append the method’s return value to the returned vector
- assign the collected value to a Node’s attribute
The Get method can traverse the tree in various ways. This is called traversal order.
The default traversal mode is pre-order.
This is what is used e.g. in as.data.frame and its OO-style counterpart Node$ToDataFrame:
acme$ToDataFrame("level")
## levelName level
## 1 Acme Inc. 0
## 2 ¦--Accounting 1
## 3 ¦ ¦--New Software 2
## 4 ¦ °--New Accounting Standards 2
## 5 ¦--Research 1
## 6 ¦ ¦--New Product Line 2
## 7 ¦ °--New Labs 2
## 8 °--IT 1
## 9 ¦--Outsource 2
## 10 ¦--Go agile 2
## 11 °--Switch to R 2
The post-order traversal mode returns children first, returning parents only after all their children have been traversed. We can use it like this with the Get method:
data.frame(level = acme$Get('level', traversal = "post-order"))
## level
## New Software 2
## New Accounting Standards 2
## Accounting 1
## New Product Line 2
## New Labs 2
## Research 1
## Outsource 2
## Go agile 2
## Switch to R 2
## IT 1
## Acme Inc. 0
This is useful if a parent’s value depends on its children, as we’ll see below.
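To make the two traversal orders concrete, here is a minimal sketch in base R that does not use the package at all. It assumes a tree stored as nested lists with a children element; the names tree, PreOrder and PostOrder are hypothetical and are not part of the package.

```r
# A tiny tree as nested lists; 'children' holds the sub-trees.
tree <- list(
  name = "Acme Inc.",
  children = list(
    list(name = "Accounting", children = list(
      list(name = "New Software", children = list()),
      list(name = "New Accounting Standards", children = list())
    )),
    list(name = "IT", children = list(
      list(name = "Outsource", children = list())
    ))
  )
)

# Pre-order: visit the node itself, then its children from first to last.
PreOrder <- function(node) {
  c(node$name, unlist(lapply(node$children, PreOrder)))
}

# Post-order: visit all children first, then the node itself.
PostOrder <- function(node) {
  c(unlist(lapply(node$children, PostOrder)), node$name)
}

PreOrder(tree)   # the root comes first
PostOrder(tree)  # the root comes last
```

The only difference between the two functions is whether the node's own name is placed before or after the results of its children.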
The ancestor mode is a non-standard traversal mode that does not traverse the entire tree. Instead, it starts from a Node, then walks the tree along the path from ancestor to ancestor, up to the root.
data.frame(level = agile$Get('level', traversal = "ancestor"))
## level
## Go agile 2
## IT 1
## Acme Inc. 0
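The ancestor walk can be sketched in a few lines of base R. This is only an illustration under the assumption that every node keeps a pointer to its parent; NewParentNode and AncestorNames are hypothetical helpers, not package functions.

```r
# Hypothetical minimal node: an environment with a name and a parent pointer.
NewParentNode <- function(name, parent = NULL) {
  node <- new.env()
  node$name <- name
  node$parent <- parent
  node
}

acmeSketch <- NewParentNode("Acme Inc.")
itSketch <- NewParentNode("IT", parent = acmeSketch)
agileSketch <- NewParentNode("Go agile", parent = itSketch)

# Collect names from the starting node up to the root.
AncestorNames <- function(node) {
  result <- character(0)
  while (!is.null(node)) {
    result <- c(result, node$name)
    node <- node$parent   # NULL at the root, which ends the loop
  }
  result
}

AncestorNames(agileSketch)
```

Unlike pre-order and post-order, this walk only touches the nodes on one path, so its cost depends on the depth of the starting node, not on the size of the tree.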
Get using a function
You can pass a standard R function to the Get method. For example:
ExpectedCost <- function(node) {
  # cost and p are only set on leaves; elsewhere the product is empty
  result <- node$cost * node$p
  if (length(result) == 0) result <- NA
  return(result)
}
data.frame(acme$Get(ExpectedCost))
## acme.Get.ExpectedCost.
## Acme Inc. NA
## Accounting NA
## New Software 500000
## New Accounting Standards 375000
## Research NA
## New Product Line 500000
## New Labs 675000
## IT NA
## Outsource 80000
## Go agile 12500
## Switch to R 50000
The function (ExpectedCost in the above example) must take a Node as its first argument.
In the following examples, we use magrittr to enhance readability of the code.
library(magrittr)
ExpectedCost <- function(node) {
  result <- node$cost * node$p
  if (length(result) == 0) {
    if (node$isLeaf) {
      result <- NA
    } else {
      # no own value: recurse into the children and add up their results
      result <- node$children %>% sapply(ExpectedCost) %>% sum
    }
  }
  return(result)
}
data.frame(ec = acme$Get(ExpectedCost))
## ec
## Acme Inc. 2192500
## Accounting 875000
## New Software 500000
## New Accounting Standards 375000
## Research 1175000
## New Product Line 500000
## New Labs 675000
## IT 142500
## Outsource 80000
## Go agile 12500
## Switch to R 50000
The Get method accepts an ellipsis (...). Any additional parameters with which Get is called will be passed on to the ExpectedCost function. This gives us more flexibility. For instance, we don’t have to hard-code the sum function into ExpectedCost, but can leave it to the caller to provide the function to use:
ExpectedCost <- function(node, fun = sum) {
  result <- node$cost * node$p
  if (length(result) == 0) {
    if (node$isLeaf) {
      result <- NA
    } else {
      # aggregate the children's results with the caller-supplied function
      result <- node$children %>%
        sapply(function(x) ExpectedCost(x, fun = fun)) %>%
        fun
    }
  }
  return(result)
}
data.frame(ec = acme$Get(ExpectedCost, fun = mean))
## ec
## Acme Inc. 357500
## Accounting 437500
## New Software 500000
## New Accounting Standards 375000
## Research 587500
## New Product Line 500000
## New Labs 675000
## IT 47500
## Outsource 80000
## Go agile 12500
## Switch to R 50000
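The mechanism behind this is ordinary R argument forwarding through the ellipsis. The following base R sketch shows the pattern in isolation; CollectFrom is a hypothetical stand-in for a traversal and says nothing about how the package implements Get internally.

```r
# Hypothetical stand-in for a traversal: apply f to every item and
# forward any extra arguments to f unchanged.
CollectFrom <- function(items, f, ...) {
  sapply(items, function(item) f(item, ...))
}

# A callback whose aggregation function is chosen by the caller.
Summarise <- function(values, fun = sum) fun(values)

departments <- list(
  Accounting = c(500000, 375000),
  IT = c(80000, 12500, 50000)
)

CollectFrom(departments, Summarise)              # uses the default, sum
CollectFrom(departments, Summarise, fun = mean)  # 'fun' travels through ...
```

The design choice is that the traversing function never needs to know which parameters the callback understands; it simply passes them along.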
Assigning values using Get
We can tell the Get method to assign the value to a specific attribute for each Node it traverses. This is especially useful if the attribute parameter is a function, as in the previous examples. For instance, we can store the calculated expected cost for later use and printing:
acme$Get(function(x) x$p * x$cost, assign = "expectedCost")
## Acme Inc. Accounting New Software
## NA NA 500000
## New Accounting Standards Research New Product Line
## 375000 NA 500000
## New Labs IT Outsource
## 675000 NA 80000
## Go agile Switch to R
## 12500 50000
print(acme, "p", "cost", "expectedCost")
## levelName p cost expectedCost
## 1 Acme Inc. NA NA NA
## 2 ¦--Accounting NA NA NA
## 3 ¦ ¦--New Software 0.50 1000000 500000
## 4 ¦ °--New Accounting Standards 0.75 500000 375000
## 5 ¦--Research NA NA NA
## 6 ¦ ¦--New Product Line 0.25 2000000 500000
## 7 ¦ °--New Labs 0.90 750000 675000
## 8 °--IT NA NA NA
## 9 ¦--Outsource 0.20 400000 80000
## 10 ¦--Go agile 0.05 250000 12500
## 11 °--Switch to R 1.00 50000 50000
In the above recursion example, the function walks, for each node, through all of its descendants down to the leaves, so the very same calculations are repeated many times.
We can avoid repeating calculations by piggy-backing on precalculated values. This requires us to traverse the tree in post-order mode: we want to start calculating at the leaves, cache the results for later use, then walk back towards the root.
In the following example, we calculate the average expected cost, just as above. As this now depends only on a Node’s children, and because we walk the tree in post-order mode, we can be sure that the children have their values calculated when we traverse the parent.
ExpectedCost <- function(node, variableName = "avgExpectedCost", fun = sum) {
  # if the "cache" is filled, return it; this stops the recursion
  if (!is.null(node[[variableName]])) return(node[[variableName]])
  # otherwise, calculate from the node's own properties
  result <- node$cost * node$p
  # if the properties are not set, aggregate the children's values with fun
  if (length(result) == 0) {
    if (node$isLeaf) {
      result <- NA
    } else {
      result <- node$children %>%
        sapply(function(x) ExpectedCost(x, variableName = variableName, fun = fun)) %>%
        fun
    }
  }
  return(result)
}
We can use our function like this:
invisible(
acme$Get(ExpectedCost, fun = mean, traversal = "post-order", assign = "avgExpectedCost")
)
print(acme, "cost", "p", "avgExpectedCost")
## levelName cost p avgExpectedCost
## 1 Acme Inc. NA NA 357500
## 2 ¦--Accounting NA NA 437500
## 3 ¦ ¦--New Software 1000000 0.50 500000
## 4 ¦ °--New Accounting Standards 500000 0.75 375000
## 5 ¦--Research NA NA 587500
## 6 ¦ ¦--New Product Line 2000000 0.25 500000
## 7 ¦ °--New Labs 750000 0.90 675000
## 8 °--IT NA NA 47500
## 9 ¦--Outsource 400000 0.20 80000
## 10 ¦--Go agile 250000 0.05 12500
## 11 °--Switch to R 50000 1.00 50000
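The effect of combining a cache with post-order traversal can be checked with a small base R sketch. It assumes nodes modelled as plain environments and counts how often a node is actually evaluated; NewCacheNode, CachedSum and VisitPostOrder are hypothetical names used only for this illustration.

```r
# Hypothetical minimal node: an environment with a value, children and a cache.
NewCacheNode <- function(value = NULL, children = list()) {
  node <- new.env()
  node$value <- value
  node$children <- children
  node$cache <- NULL
  node
}

calls <- 0  # counts how often a node is evaluated rather than read from cache

# Sum of the leaf values below a node, re-using a cached result if present.
CachedSum <- function(node) {
  if (!is.null(node$cache)) return(node$cache)
  calls <<- calls + 1
  if (length(node$children) == 0) {
    result <- node$value
  } else {
    result <- sum(sapply(node$children, CachedSum))
  }
  node$cache <- result
  result
}

# Post-order visit: children before parents, so every parent finds the
# caches of its children already filled.
VisitPostOrder <- function(node, f) {
  for (child in node$children) VisitPostOrder(child, f)
  f(node)
  invisible(NULL)
}

accountingSketch <- NewCacheNode(children = list(NewCacheNode(500000), NewCacheNode(375000)))
itSketch2 <- NewCacheNode(children = list(NewCacheNode(80000)))
cacheRoot <- NewCacheNode(children = list(accountingSketch, itSketch2))

VisitPostOrder(cacheRoot, CachedSum)
calls            # 6: each of the six nodes was evaluated exactly once
cacheRoot$cache  # 955000
```

Without the cache, calling the recursive function once per node would re-evaluate every leaf once for each of its ancestors.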
Formatting values with Get
We can pass a formatting function to the Get method, which will convert the returned value to a human-readable string for printing.
PrintMoney <- function(x) {
  format(x, digits = 10, nsmall = 2, decimal.mark = ".", big.mark = "'", scientific = FALSE)
}
print(acme, cost = acme$Get("cost", format = PrintMoney))
## levelName cost
## 1 Acme Inc. NA
## 2 ¦--Accounting NA
## 3 ¦ ¦--New Software 1'000'000.00
## 4 ¦ °--New Accounting Standards 500'000.00
## 5 ¦--Research NA
## 6 ¦ ¦--New Product Line 2'000'000.00
## 7 ¦ °--New Labs 750'000.00
## 8 °--IT NA
## 9 ¦--Outsource 400'000.00
## 10 ¦--Go agile 250'000.00
## 11 °--Switch to R 50'000.00
Note that the format is not used for assignment with the assign parameter, but only for the values returned by Get:
acme$Get("cost", format = PrintMoney, assign = "cost2")
## Acme Inc. Accounting New Software
## "NA" "NA" "1'000'000.00"
## New Accounting Standards Research New Product Line
## "500'000.00" "NA" "2'000'000.00"
## New Labs IT Outsource
## "750'000.00" "NA" "400'000.00"
## Go agile Switch to R
## "250'000.00" "50'000.00"
print(acme, cost = acme$Get("cost2"))
## levelName cost
## 1 Acme Inc. NA
## 2 ¦--Accounting NA
## 3 ¦ ¦--New Software 1000000
## 4 ¦ °--New Accounting Standards 500000
## 5 ¦--Research NA
## 6 ¦ ¦--New Product Line 2000000
## 7 ¦ °--New Labs 750000
## 8 °--IT NA
## 9 ¦--Outsource 400000
## 10 ¦--Go agile 250000
## 11 °--Switch to R 50000
The format function is useful not only for formatting numbers, but also for displaying a printable representation of a Node field that is not numeric (e.g. a matrix).
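A formatting function is just a function from a value to a string, so it can also decide how missing values are shown. The sketch below is a hypothetical variant of PrintMoney that prints a blank instead of "NA". It assumes the formatter is called with one value at a time, as the unpadded output above suggests, but this is not confirmed by the text.

```r
# Hypothetical formatter: like PrintMoney, but blank for missing values.
PrintMoneyOrBlank <- function(x) {
  if (is.null(x) || is.na(x)) return("")
  format(x, nsmall = 2, big.mark = "'", scientific = FALSE)
}

PrintMoneyOrBlank(NA)      # ""
PrintMoneyOrBlank(50000)   # "50'000.00"
```

It could be passed wherever PrintMoney is used in the examples above, for instance as the format argument of Get.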
Set method
The Set method is the counterpart to the Get method. It takes a vector or a single value as input, and traverses the tree in a certain order. Each Node is assigned a value from the vector, one after the other.
employees <- c(NA, 52, NA, NA, 78, NA, NA, 39, NA, NA, NA)
acme$Set(employees)
print(acme, "employees")
## levelName employees
## 1 Acme Inc. NA
## 2 ¦--Accounting 52
## 3 ¦ ¦--New Software NA
## 4 ¦ °--New Accounting Standards NA
## 5 ¦--Research 78
## 6 ¦ ¦--New Product Line NA
## 7 ¦ °--New Labs NA
## 8 °--IT 39
## 9 ¦--Outsource NA
## 10 ¦--Go agile NA
## 11 °--Switch to R NA
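How a vector is distributed over the nodes can be sketched in base R as follows. The sketch assumes pre-order traversal, which matches the output above, and models nodes as environments; NewSetNode and SetInPreOrder are hypothetical helpers and not the package's implementation.

```r
# Hypothetical minimal node: an environment with a name and child nodes.
NewSetNode <- function(name, children = list()) {
  node <- new.env()
  node$name <- name
  node$children <- children
  node
}

setRoot <- NewSetNode("Acme Inc.", list(
  NewSetNode("Accounting", list(NewSetNode("New Software"))),
  NewSetNode("IT")
))

# Walk the tree in pre-order and hand each node the next value of the vector.
SetInPreOrder <- function(root, attribute, values) {
  position <- 0
  visit <- function(node) {
    position <<- position + 1
    assign(attribute, values[[position]], envir = node)
    for (child in node$children) visit(child)
  }
  visit(root)
  invisible(root)
}

# Order of the values: Acme Inc., Accounting, New Software, IT
SetInPreOrder(setRoot, "employees", c(NA, 52, NA, 39))
setRoot$children[[1]]$employees   # 52 (Accounting)
setRoot$children[[2]]$employees   # 39 (IT)
```

The vector therefore has to be laid out in the same order in which the tree is traversed, which is why the traversal order matters for setting values.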
The Set method can take multiple vectors as input and, optionally, you can define the name of the attribute:
secretaries <- c(NA, 5, NA, NA, 6, NA, NA, 2, NA, NA, NA)
acme$Set(secretaries, secPerEmployee = secretaries/employees)
print(acme, "employees", "secretaries", "secPerEmployee")
## levelName employees secretaries secPerEmployee
## 1 Acme Inc. NA NA NA
## 2 ¦--Accounting 52 5 0.09615385
## 3 ¦ ¦--New Software NA NA NA
## 4 ¦ °--New Accounting Standards NA NA NA
## 5 ¦--Research 78 6 0.07692308
## 6 ¦ ¦--New Product Line NA NA NA
## 7 ¦ °--New Labs NA NA NA
## 8 °--IT 39 2 0.05128205
## 9 ¦--Outsource NA NA NA
## 10 ¦--Go agile NA NA NA
## 11 °--Switch to R NA NA NA
Just as for the Get method, the traversal order is important for the Set method.
Often, it is useful to use Get and Set together:
ec <- acme$Get(function(x) x$p * x$cost)
acme$Set(expectedCost = ec)
print(acme, "p", "cost", "expectedCost")
## levelName p cost expectedCost
## 1 Acme Inc. NA NA NA
## 2 ¦--Accounting NA NA NA
## 3 ¦ ¦--New Software 0.50 1000000 500000
## 4 ¦ °--New Accounting Standards 0.75 500000 375000
## 5 ¦--Research NA NA NA
## 6 ¦ ¦--New Product Line 0.25 2000000 500000
## 7 ¦ °--New Labs 0.90 750000 675000
## 8 °--IT NA NA NA
## 9 ¦--Outsource 0.20 400000 80000
## 10 ¦--Go agile 0.05 250000 12500
## 11 °--Switch to R 1.00 50000 50000
This is equivalent to using Get with the assign parameter.
The Set method can also be used to assign a single value directly to all Nodes traversed. For example, to remove the avgExpectedCost, we assign NULL on each node:
acme$Set(avgExpectedCost = NULL)
Note that unassigned attributes are also NULL:
acme$newAttribute
## NULL
As Node is an R6 reference object, we can chain the calls:
acme$Set(avgExpectedCost = NULL)$Set(expectedCost = NA)
print(acme, "avgExpectedCost", "expectedCost")
## levelName avgExpectedCost expectedCost
## 1 Acme Inc. NA NA
## 2 ¦--Accounting NA NA
## 3 ¦ ¦--New Software NA NA
## 4 ¦ °--New Accounting Standards NA NA
## 5 ¦--Research NA NA
## 6 ¦ ¦--New Product Line NA NA
## 7 ¦ °--New Labs NA NA
## 8 °--IT NA NA
## 9 ¦--Outsource NA NA
## 10 ¦--Go agile NA NA
## 11 °--Switch to R NA NA
This is equivalent to:
acme$Set(avgExpectedCost =NULL, expectedCost = NA)
NULL and NA
Also note that setting a value to NA or to NULL looks equivalent when printing to a data.frame, but internally it is not:
acme$avgExpectedCost
## NULL
acme$expectedCost
## [1] NA
The reason is that NULL is always converted to NA for printing, and when using the Get method.
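Why such a conversion is needed follows from how base R treats the two values: a vector cannot hold NULL, whereas NA keeps its position. A short base R illustration, with hypothetical example values:

```r
# A vector cannot hold NULL: it is silently dropped.
length(c(1, NULL, 3))   # 2
# NA is a placeholder that keeps its position.
length(c(1, NA, 3))     # 3

# Hypothetical per-node values, some of them missing (NULL).
values <- list(root = NULL, leafA = 500000, leafB = NULL)

# Replacing NULL by NA lets every node keep its slot in the result.
asVector <- sapply(values, function(v) if (is.null(v)) NA else v)
asVector

is.null(values$root)      # TRUE: there is no value at all
is.na(asVector[["root"]]) # TRUE: the slot exists, but the value is missing
```

A vector with one entry per node can therefore only be built if missing attributes are represented as NA.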
Aggregate method
For simple cases, you don’t have to write your own function to be passed along to the Get method. For example, the Aggregate method provides a shorthand for the oft-used case when a parent is the aggregate of its children’s values:
acme$Aggregate("cost", sum)
## [1] 4950000
We can use this in the Get method:
acme$Get("Aggregate", "cost", sum)
## Acme Inc. Accounting New Software
## 4950000 1500000 1000000
## New Accounting Standards Research New Product Line
## 500000 2750000 2000000
## New Labs IT Outsource
## 750000 700000 400000
## Go agile Switch to R
## 250000 50000
This is the equivalent of:
GetCost <- function(node) {
  result <- node$cost
  if (length(result) == 0) {
    # a leaf without a cost cannot be aggregated
    if (node$isLeaf) stop(paste0("Cost for ", node$name, " not available!"))
    result <- node$children %>% sapply(GetCost) %>% sum
  }
  return(result)
}
acme$Get(GetCost)
## Acme Inc. Accounting New Software
## 4950000 1500000 1000000
## New Accounting Standards Research New Product Line
## 500000 2750000 2000000
## New Labs IT Outsource
## 750000 700000 400000
## Go agile Switch to R
## 250000 50000
Sort method
You can sort an entire tree by using the Sort method on the root. The method will sort recursively and, for each Node, sort children by a child attribute. As before, the child attribute can also be a function or a method (e.g. of a sub-class of Node, see below).
acme$Get(ExpectedCost, assign = "expectedCost")
## Acme Inc. Accounting New Software
## 2192500 875000 500000
## New Accounting Standards Research New Product Line
## 375000 1175000 500000
## New Labs IT Outsource
## 675000 142500 80000
## Go agile Switch to R
## 12500 50000
acme$Sort("expectedCost", decreasing = TRUE)
print(acme, "expectedCost")
## levelName expectedCost
## 1 Acme Inc. 2192500
## 2 ¦--Research 1175000
## 3 ¦ ¦--New Labs 675000
## 4 ¦ °--New Product Line 500000
## 5 ¦--Accounting 875000
## 6 ¦ ¦--New Software 500000
## 7 ¦ °--New Accounting Standards 375000
## 8 °--IT 142500
## 9 ¦--Outsource 80000
## 10 ¦--Switch to R 50000
## 11 °--Go agile 12500
Naturally, you can also sort a sub-tree by calling Sort on the sub-tree’s root node.
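The idea of a recursive sort can be sketched in base R on a nested list. This is an illustration only: it assumes a tree stored as nested lists, returns a sorted copy because lists have value semantics, and uses the hypothetical names treeSketch and SortTree.

```r
# Hypothetical sub-tree as nested lists with an 'expectedCost' on every node.
treeSketch <- list(name = "IT", expectedCost = 142500, children = list(
  list(name = "Outsource", expectedCost = 80000, children = list()),
  list(name = "Go agile", expectedCost = 12500, children = list()),
  list(name = "Switch to R", expectedCost = 50000, children = list())
))

# Sort the children of a node by an attribute, then do the same for each child.
SortTree <- function(node, attribute, decreasing = FALSE) {
  if (length(node$children) > 0) {
    keys <- sapply(node$children, function(child) child[[attribute]])
    node$children <- node$children[order(keys, decreasing = decreasing)]
    node$children <- lapply(node$children, SortTree,
                            attribute = attribute, decreasing = decreasing)
  }
  node
}

sorted <- SortTree(treeSketch, "expectedCost", decreasing = TRUE)
sapply(sorted$children, function(child) child$name)
```

The children come back in the order Outsource, Switch to R, Go agile, which matches the IT branch in the printed tree above.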
Subclassing Node
We can create a subclass of Node, and add custom methods to our subclass. This comes naturally to users with experience in OO languages such as Java, Python or C#:
library(R6)
MyNode <- R6Class("MyNode",
  inherit = Node,
  lock = FALSE,
  # public fields and functions
  public = list(
    p = NULL,
    cost = NULL,
    # create children as MyNode rather than plain Node
    AddChild = function(name) {
      child <- MyNode$new(name)
      invisible(self$AddChildNode(child))
    }
  ),
  # active bindings
  active = list(
    expectedCost = function() {
      if (is.null(self$p) || is.null(self$cost)) return(NULL)
      self$p * self$cost
    }
  )
)
The AddChild utility function in the subclass allows us to construct the tree just as before.
The expectedCost function is now an active binding of the class, and we can access it in a more R6-ish way.
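Because expectedCost is declared in the active list, it is read like a field rather than called like a function. The base R sketch below shows the same behaviour with makeActiveBinding on a plain environment. It assumes that R6 active members behave like base R active bindings; nodeSketch is a hypothetical stand-in for a MyNode object.

```r
# Hypothetical node as a plain environment with two ordinary fields.
nodeSketch <- new.env()
nodeSketch$p <- 0.5
nodeSketch$cost <- 1000000

# An active binding looks like a field, but runs a function on every access.
makeActiveBinding("expectedCost", function() {
  nodeSketch$p * nodeSketch$cost
}, nodeSketch)

nodeSketch$expectedCost   # 500000
nodeSketch$p <- 0.25
nodeSketch$expectedCost   # 250000, recomputed from the new p
```

The value is never stored, so it cannot get out of sync with p and cost.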