Intro

We stick to a rather simple, but not unrealistic example to explain some further functionalities: applying two classification learners to the famous iris data set (Anderson 1935), varying a few hyperparameters, and evaluating the effect on the classification performance.

First, we create a registry, the central meta-data object which records technical details and the setup of the experiments. We use an ExperimentRegistry where the job definition is split into creating problems and algorithms. See the paper on BatchJobs and BatchExperiments for a detailed explanation. Again, we use a temporary registry and make it the default registry.

library(batchtools)
reg = makeExperimentRegistry(file.dir = NA, seed = 1)
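
Setting file.dir = NA creates a temporary registry that is deleted when the R session ends. For a persistent study you would instead pass a path to a directory that does not yet exist, e.g. (the path here is a hypothetical example):

# hypothetical persistent registry; the directory must not already exist
# reg = makeExperimentRegistry(file.dir = "~/iris_benchmark", seed = 1)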

Problems and algorithms

By adding a problem to the registry, we define the data on which the computational jobs shall work. This can be a matrix, data frame, or array that stays the same for all subsequent experiments. But it can also be of a more dynamic nature, e.g., subsamples of a data set or random numbers drawn from a probability distribution. The function addProblem() therefore accepts the static part in its data argument, which is passed on to the function fun that generates a (possibly stochastic) problem instance. Any R object can be used as data. If only data is given, the problem instance is simply data itself. The argument fun has to be a function with the arguments data and job (and optionally further arbitrary parameters); job is an object of type Job which holds additional information about the job.
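
For the purely static case just mentioned, a call like the following would suffice; the instance passed to the algorithms is then the data itself (this hypothetical variant is not part of our study):

# hypothetical static problem: no fun given, so the instance is the data itself
# addProblem(name = "iris.static", data = iris)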

We want to split the iris data set into a training set and a test set. In this example we use subsampling, which simply draws a random fraction of the observations as training set. We define a problem function which returns the indices of the respective training and test sets for a split where a fraction ratio of the observations forms the training set:

subsample = function(data, job, ratio, ...) {
  # randomly draw a fraction `ratio` of the rows as training set;
  # the remaining rows form the test set
  n = nrow(data)
  train = sample(n, floor(n * ratio))
  test = setdiff(seq_len(n), train)
  list(test = test, train = train)
}

addProblem() files the problem to the file system and the problem gets recorded in the registry.

data("iris", package = "datasets")
addProblem(name = "iris", data = iris, fun = subsample, seed = 42)

The function call will be evaluated at a later stage on the workers. In this process, the data part will be loaded and passed to the function. Note that we set a problem seed to synchronize the experiments in the sense that the same resampled training and test sets are used for the algorithm comparison in each distinct replication.
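
To get a feel for what a problem instance looks like, we can also call the problem function directly, outside of batchtools (the job argument is unused in subsample, so we simply pass NULL):

set.seed(42)
str(subsample(iris, job = NULL, ratio = 0.67))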

The algorithms for the jobs are added to the registry in a similar manner. When using addAlgorithm(), an identifier as well as the algorithm function itself are required arguments. The algorithm must be given as a function with arguments job, data and instance. Further arbitrary arguments (e.g., hyperparameters or strategy parameters) may be defined analogously to the function in addProblem(). The objects passed via job and data are the same as above, while instance receives the return value of the evaluated problem function. The algorithm can return any R object, which will automatically be stored on the file system for later retrieval. Firstly, we create an algorithm which applies a support vector machine:

svm.wrapper = function(data, job, instance, ...) {
  library("e1071")
  # design parameters (here: kernel, epsilon) arrive via ... and are forwarded to svm()
  mod = svm(Species ~ ., data = data[instance$train, ], ...)
  pred = predict(mod, newdata = data[instance$test, ], type = "class")
  # confusion matrix on the test set
  table(data$Species[instance$test], pred)
}
addAlgorithm(name = "svm", fun = svm.wrapper)

Secondly, a random forest of classification trees:

forest.wrapper = function(data, job, instance, ntree, ...) {
  library("ranger")
  # the design parameter ntree is mapped to ranger's num.trees argument
  mod = ranger(Species ~ ., data = data[instance$train, ], num.trees = ntree,
    write.forest = TRUE)
  pred = predict(mod, data = data[instance$test, ])
  table(data$Species[instance$test], pred$predictions)
}
addAlgorithm(name = "forest", fun = forest.wrapper)

Both algorithms return a confusion matrix for the predictions on the test set, which will later be used to calculate the misclassification rate.

Note that the ... argument in the wrapper definitions allows us to avoid naming every design parameter explicitly: the SVM parameters kernel and epsilon, for example, are simply passed through to svm(). This is an advantage if we later want to extend the set of algorithm parameters in the experiment (a hypothetical extension is sketched after the experiment summary below). Only ntree is named explicitly in forest.wrapper, because it must be mapped to ranger's num.trees argument. The algorithms get recorded in the registry and the corresponding functions are stored on the file system.

Defined problems and algorithms can be queried:

getProblemIds()
## [1] "iris"
getAlgorithmIds()
## [1] "svm"    "forest"

Creating jobs

addExperiments() is used to parametrize problems and algorithms and thereby define computational jobs. To do so, you pass named lists of design data frames to addExperiments(), one list for problems and one for algorithms, whose elements must be named after the problem or algorithm they refer to. Each data frame holds the parameter constellations for the respective problem or algorithm function, with columns named after the target arguments. When the problem design and the algorithm design are combined in addExperiments(), each combination of a row of the problem design with a row of the algorithm design defines a distinct experiment. How often each of these experiments should be replicated can be determined with the argument repls.

# problem design: try two values for the ratio parameter
pdes = list(iris = data.frame(ratio = c(0.67, 0.9)))

# algorithm design: try combinations of kernel and epsilon exhaustively,
# try different number of trees for the forest
ades = list(
  svm = expand.grid(kernel = c("linear", "polynomial", "radial"), epsilon = c(0.01, 0.1)),
  forest = data.frame(ntree = c(100, 500, 1000))
)

addExperiments(pdes, ades, repls = 5)
## Adding 60 experiments ('iris'[2] x 'svm'[6] x repls[5]) ...
## Adding 30 experiments ('iris'[2] x 'forest'[3] x repls[5]) ...

The jobs are now available in the registry with an individual job ID for each. The function summarizeExperiments() returns a table which gives a quick overview of all defined experiments.

summarizeExperiments()
##    problem algorithm .count
## 1:    iris       svm     60
## 2:    iris    forest     30
summarizeExperiments(by = c("problem", "algorithm", "ratio"))
##    problem algorithm ratio .count
## 1:    iris       svm  0.67     30
## 2:    iris       svm  0.90     30
## 3:    iris    forest  0.67     15
## 4:    iris    forest  0.90     15
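
As an aside, the ... arguments in the wrappers make such designs easy to extend. A hypothetical extension (not executed here) could, for example, add e1071's cost parameter to the SVM design; svm.wrapper would need no change, since cost would simply be forwarded to svm():

# hypothetical extension of the svm design with a cost parameter
# ades.ext = list(svm = expand.grid(kernel = "radial", epsilon = 0.1, cost = c(1, 10)))
# addExperiments(pdes, ades.ext, repls = 5)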

Before you submit

Before submitting all jobs to the batch system, we encourage you to test each algorithm individually. Sometimes you may also want to submit only a subset of experiments because the jobs vastly differ in runtime, and another recurring task is collecting results for only a subset of experiments. For all these use cases, findExperiments() can be employed to conveniently select a particular subset of jobs. It returns the IDs of all experiments that match the given criteria. The selection can be based on matches of problem or algorithm names using prob.name or algo.name, respectively. You can also pass R expressions which will be evaluated in your problem parameter setting (prob.pars) or algorithm parameter setting (algo.pars); the expression is then expected to evaluate to a Boolean value. Furthermore, you can restrict the experiments to specific replication numbers.

To illustrate findExperiments(), we select two experiments: one applying the support vector machine and one applying the random forest with the parameter ntree = 1000. The selected experiment IDs are then passed to testJob().

id1 = head(findExperiments(algo.name = "svm"), 1)
print(id1)
##    job.id
## 1:      1
id2 = head(findExperiments(algo.name = "forest", algo.pars = (ntree == 1000)), 1)
print(id2)
##    job.id
## 1:     71
testJob(id = id1)
## Generating problem instance for problem 'iris' ...
## Applying algorithm 'svm' on problem 'iris' ...
##             pred
##              setosa versicolor virginica
##   setosa         17          0         0
##   versicolor      0         16         2
##   virginica       0          0        15
testJob(id = id2)
## Generating problem instance for problem 'iris' ...
## Applying algorithm 'forest' on problem 'iris' ...
##             
##              setosa versicolor virginica
##   setosa         17          0         0
##   versicolor      0         15         3
##   virginica       0          0        15

If something goes wrong, batchtools comes with a bunch of useful debugging utilities (see the separate vignette on error handling). A few helpers that typically come in handy are sketched below.
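
getStatus()          # tabulate the current state of all jobs
findErrors()         # IDs of jobs that terminated with an error
getErrorMessages()   # the captured error messages
getLog(id = 1)       # print the log file of a single job

If everything turns out fine, we can proceed with the calculation.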

Submitting and collecting results

To submit the jobs, we call submitJobs() and wait for all jobs to terminate using waitForJobs().

submitJobs()
## Submitting 90 jobs in 90 chunks using cluster functions 'Interactive' ...
waitForJobs()
## Syncing 90 files ...
## [1] TRUE
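
Here, all 90 jobs run sequentially via the interactive cluster functions. On a real batch system you would typically submit subsets of jobs and request computational resources; the valid resource names (e.g., walltime, memory) depend on the template of your cluster functions. A hypothetical call:

# hypothetical submission on a real cluster: only the forest jobs, with resources
# submitJobs(findExperiments(algo.name = "forest"),
#   resources = list(walltime = 3600, memory = 1024))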

After the jobs have finished, the results can be collected with reduceResultsDataTable(), where we directly compute the misclassification error (mce) on the test set from each stored confusion matrix:

# misclassification error: share of off-diagonal entries in the confusion matrix
results = reduceResultsDataTable(fun = function(res) list(mce = (sum(res) - sum(diag(res))) / sum(res)))
head(results)
##    job.id  mce
## 1:      1 0.04
## 2:      2 0.00
## 3:      3 0.06
## 4:      4 0.04
## 5:      5 0.02
## 6:      6 0.06
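
The stored result of a single job, i.e., the confusion matrix itself, can also be retrieved directly with loadResult():

loadResult(1)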

Next, we merge the results table with the table of job parameters using one of the join helpers provided by batchtools (here, we use an inner join):

tab = ijoin(getJobPars(), results)
head(tab)
##    job.id problem algorithm ratio     kernel epsilon ntree  mce
## 1:      1    iris       svm  0.67     linear    0.01    NA 0.04
## 2:      2    iris       svm  0.67     linear    0.01    NA 0.00
## 3:      3    iris       svm  0.67     linear    0.01    NA 0.06
## 4:      4    iris       svm  0.67     linear    0.01    NA 0.04
## 5:      5    iris       svm  0.67     linear    0.01    NA 0.02
## 6:      6    iris       svm  0.67 polynomial    0.01    NA 0.06

We now aggregate the results group-wise. You can use data.table, base::aggregate(), or the dplyr package for this purpose. Here, we use data.table to subset the table to jobs where the ratio is 0.67 and group by the algorithm and its hyperparameters:

tab[ratio == 0.67, list(mmce = mean(mce)), by = c("algorithm", "kernel", "epsilon", "ntree")]
##    algorithm     kernel epsilon ntree  mmce
## 1:       svm     linear    0.01    NA 0.032
## 2:       svm polynomial    0.01    NA 0.088
## 3:       svm     radial    0.01    NA 0.048
## 4:       svm     linear    0.10    NA 0.032
## 5:       svm polynomial    0.10    NA 0.088
## 6:       svm     radial    0.10    NA 0.048
## 7:    forest         NA      NA   100 0.048
## 8:    forest         NA      NA   500 0.044
## 9:    forest         NA      NA  1000 0.048
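
For readers who prefer dplyr, an equivalent aggregation looks like this (a sketch; it yields the same figures, possibly in a different row order):

library("dplyr")
tab %>%
  filter(ratio == 0.67) %>%
  group_by(algorithm, kernel, epsilon, ntree) %>%
  summarise(mmce = mean(mce))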