This vignette provides a stylized example of the capabilities of the data.tree package. The code for this vignette was written in less than an hour. This is possible because, thanks to the data.tree package, the implementation of the training algorithm follows the algorithm’s pseudo code almost line by line.
If you are not a blind believer in organic food, and if you are older than 9, there is a non-zero probability that you’d tackle this problem differently than my dear family members. There are dozens of different methods to choose from, but chances are you’d come across a class of models called classification trees. These models let you classify observations (e.g. things, outcomes) according to the observations’ qualities, called features. Essentially, all of these models consist of creating a tree, where each node acts as a router. You insert your mushroom instance at the root of the tree and then, depending on the mushroom’s features (size, points, color, etc.), you follow a different path, until a leaf node spits out your mushroom’s class, i.e. whether it’s edible or not.
You might have guessed already that there are two different steps involved in using such a model: training (i.e. constructing the tree), and predicting (i.e. using the tree to predict whether a given mushroom is poisonous). This vignette provides code to do both, using one of the very early algorithms to classify data according to discrete features: ID3.
We will not go into more details about ID3. You will find lots of documentation on this and more refined algorithms on the internet. For example, lecture notes that build on similar data can be found here.
Also, be assured that this example is by no means meant to be used in the real world. It is overly simplistic, and far too little data is used for training. Nor do we claim that this is a complete discussion of the ID3 algorithm, let alone classification models.
On the contrary, the only reason we chose this example is to provide a simple-to-grasp application of the data.tree package, in order to demonstrate how easy it is to build hierarchical models with it.
As mentioned above, during the prediction step, each node routes our mushroom according to a feature. But how do we choose the feature? Should we first separate our set according to color or size? That is where classification models differ.
In ID3, at each node we pick the feature with the highest Information Gain. In a nutshell, this is the feature that splits the sample into the purest possible subsets. For example, in the case of mushrooms, “dots” might be a more sensible feature than “organic”.
We define a subset to be completely pure if it contains only a single class. For example, if a subset contains only poisonous mushrooms, it is completely pure. In R, assuming that the last column contains the class (i.e. the category to be predicted), this can be written as:
IsPure <- function(data) {
  length(unique(data[, ncol(data)])) == 1
}
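As a quick sanity check, here is a small usage sketch. It relies on the mushroom data set from the data.tree package, which is introduced further down in this vignette:

library(data.tree)
data(mushroom)

#all brown mushrooms in the sample are edible, so this subset is pure
IsPure(mushroom[mushroom$color == 'brown', ])

#the full sample mixes edible and toxic mushrooms, so it is not pure
IsPure(mushroom)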
The entropy is a measure of the impurity of a dataset: the purer the set, the lower its entropy.
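Formally, for a set with class counts \(c_1, \dots, c_k\) and class proportions \(p_i = c_i / \sum_j c_j\), the entropy is defined as

\[ H = -\sum_{i=1}^{k} p_i \log_2 p_i \]

with the convention that \(0 \cdot \log_2 0 = 0\). The following R function takes a vector of class counts and computes exactly this: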
Entropy <- function( vls ) {
  #vls is a vector of class counts; work with the class proportions vls/sum(vls)
  res <- vls/sum(vls) * log2(vls/sum(vls))
  #0 * log2(0) yields NaN, but by convention it contributes 0 to the entropy
  res[vls == 0] <- 0
  -sum(res)
}
If a dataset is completely pure, then it has entropy 0:
Entropy(c(10, 0))
## [1] 0
Entropy(c(0, 10))
## [1] 0
If a set contains 5 poisonous and 5 edible mushrooms, the entropy becomes 1, as the purity is at its lowest:
Entropy(c(5, 5))
## [1] 1
We can plot the entropy as a function of the number of edible mushrooms in a set of, say, 100 mushrooms. The curve is 0 at both extremes and reaches its maximum of 1 when the set contains 50 edible and 50 toxic mushrooms:
entropy <- function(edible) Entropy(c(edible, 100 - edible))
entropy <- Vectorize(entropy)
curve( entropy, from = 0, to = 100, xname = 'edible')
Mathematically, the information gain IG is defined as:
\[ IG(T,a) = H(T)-\sum_{v\in vals(a)}\frac{|\{\textbf{x}\in T|x_a=v\}|}{|T|} \cdot H(\{\textbf{x}\in T|x_a=v\}) \]
In words, the information gain measures the difference between the entropy before the split and the weighted sum of the entropies after the split, the weights being the relative sizes of the subsets.
So, let’s rewrite that in R:
InformationGain <- function( tble ) {
  #tble is a contingency table with one row per feature value and one column per class
  tble <- as.data.frame.matrix(tble)
  entropyBefore <- Entropy(colSums(tble))
  s <- rowSums(tble)
  entropyAfter <- sum(s / sum(s) * apply(tble, MARGIN = 1, FUN = Entropy))
  informationGain <- entropyBefore - entropyAfter
  return (informationGain)
}
For example, using the mushroom data set:
library(data.tree)
data(mushroom)
tble <- table(mushroom[,c('color', 'edibility')])
tble
##        edibility
## color   edible toxic
##   brown      2     0
##   green      1     0
##   red        1     1
InformationGain(tble)
## [1] 0.3219281
InformationGain(table(mushroom[,c('size', 'edibility')]))
## [1] 0.1709506
InformationGain(table(mushroom[,c('points', 'edibility')]))
## [1] 0.3219281
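As a sanity check, the value for color can be reproduced by hand from the table above (a small sketch using the Entropy function defined earlier): the whole sample contains 4 edible and 1 toxic mushrooms, and splitting by color yields the subsets brown (2 edible / 0 toxic), green (1/0) and red (1/1):

entropyBefore <- Entropy(c(4, 1))           #entropy of the full sample
entropyAfter  <- 2/5 * Entropy(c(2, 0)) +   #brown: pure, entropy 0
                 1/5 * Entropy(c(1, 0)) +   #green: pure, entropy 0
                 2/5 * Entropy(c(1, 1))     #red: mixed, entropy 1
entropyBefore - entropyAfter                #same value as InformationGain(tble)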
We are all set for the ID3 training algorithm. We start with the entire training data, and with a root. Then:

1. if the data set is pure (e.g. all toxic), construct a leaf having the name of the pure class (e.g. 'toxic')
2. otherwise
   1. choose the feature with the highest information gain (e.g. 'color')
   2. for each value of that feature (e.g. 'red', 'brown', 'green'):
      1. take the subset of the data set having that feature value
      2. construct a child having the name of that feature value (e.g. 'red')
      3. call the algorithm recursively on the child and the subset
For the following implementation, we assume that the classifying features are in columns 1 to n-1, whereas the class (the edibility) is in the last column.
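To see that the mushroom data set used below indeed follows this layout, you can inspect its structure (a minimal check; str is base R, nothing specific to data.tree):

library(data.tree)
data(mushroom)

#the features (color, size, points) come first, the class (edibility) is in the last column
str(mushroom)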
TrainID3 <- function(node, data) {

  node$obsCount <- nrow(data)

  #if the data-set is pure (e.g. all toxic), then
  if (IsPure(data)) {
    #construct a leaf having the name of the pure class (e.g. 'toxic')
    child <- node$AddChild(unique(data[,ncol(data)]))
    node$feature <- tail(names(data), 1)
    child$obsCount <- nrow(data)
    child$feature <- ''
  } else {
    #choose the feature with the highest information gain (e.g. 'color')
    ig <- sapply(colnames(data)[-ncol(data)],
                 function(x) InformationGain(
                   table(data[,x], data[,ncol(data)])
                 )
    )
    feature <- names(ig)[ig == max(ig)][1]
    node$feature <- feature

    #split the data-set into subsets, one for each value of that feature
    childObs <- split(data[, !(names(data) %in% feature)], data[, feature], drop = TRUE)

    for(i in 1:length(childObs)) {
      #construct a child having the name of that feature value (e.g. 'red')
      child <- node$AddChild(names(childObs)[i])

      #call the algorithm recursively on the child and the subset
      TrainID3(child, childObs[[i]])
    }
  }
}
We are ready to run the function:
tree <- Node$new("mushroom")
TrainID3(tree, mushroom)
print(tree, "feature", "obsCount")
##             levelName   feature obsCount
##  1 mushroom               color        5
##  2  ¦--brown          edibility        2
##  3  ¦   °--edible                      2
##  4  ¦--green          edibility        1
##  5  ¦   °--edible                      1
##  6  °--red                 size        2
##  7      ¦--large      edibility        1
##  8      ¦   °--edible                  1
##  9      °--small      edibility        1
## 10          °--toxic                   1
Now, let’s predict some classes. For this, we need a predict function, which will route data through our tree:
Predict <- function(tree, features) {
  #if the node's child is a leaf, return its name (the predicted class)
  if (tree$children[[1]]$isLeaf) return (tree$children[[1]]$name)
  #otherwise, follow the branch that matches the value of this node's feature
  child <- tree$children[[features[[tree$feature]]]]
  return (Predict(child, features))
}
And now we use it to predict:
Predict(tree, c(color = 'red',
                size = 'large',
                points = 'yes')
)
## [1] "edible"
Oops! Maybe organic wasn’t such a bad predictor, after all :-)
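For comparison, a brown mushroom would be routed straight into the pure 'brown' branch, regardless of its other features (the size and points values below are arbitrary placeholders), so the prediction should come out as edible:

Predict(tree, c(color = 'brown',
                size = 'small',
                points = 'no'))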