This package allows you to send function calls as cluster jobs (using LSF, SGE or SLURM) through a minimal interface provided by the `Q` function:
```r
# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2

# queue the function call
Q(fx, x=1:3, n_jobs=1)
# list(2, 4, 6)

# this will submit a cluster job that connects to the master via TCP
# the master will then send the function and argument chunks to the worker
# and the worker will return the results to the master
# until everything is done and you get back your result

# we can also use dplyr's mutate to modify data frames
library(dplyr)
iris %>%
    mutate(area = Q(`*`, e1=Sepal.Length, e2=Sepal.Width, n_jobs=1))
```
Computations are done entirely on the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. This removes the biggest bottleneck in distributed computing.
Using this approach, we can easily do load-balancing, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or one worker has a high load.
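As a sketch of how this plays out in practice (the sleep-based function below is made up purely to simulate calls of varying runtime), a smaller `chunk_size` gives finer-grained load-balancing at the cost of more network round-trips:

```r
# hypothetical example: 1000 calls of varying runtime across 10 workers;
# each worker requests a new chunk of 10 calls whenever it finishes its
# current one, so faster workers automatically process more of the total
fx = function(x) { Sys.sleep(runif(1)); x * 2 }
Q(fx, x=1:1000, n_jobs=10, chunk_size=10)
```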
First, we need the ZeroMQ system library. Most likely, your package manager will provide this:
```sh
brew install zeromq                 # Homebrew on macOS
sudo apt-get install libzmq3-dev    # Ubuntu
sudo yum install zeromq3-devel      # Fedora
```
The package on CRAN should be relatively up-to-date. You can install it using:
```r
install.packages('clustermq')
```
Alternatively, you can get the latest version from GitHub:
```r
# install.packages('devtools')
devtools::install_github('mschubert/clustermq', ref="develop")
```
An HPC cluster's scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what `clustermq` interfaces with in order to do computations. See the links below to set up the respective schedulers.
You can also use the schedulers above from your local machine via SSH.
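As a sketch, the scheduler is selected via R options before calling `Q`; the host below is a placeholder, and the exact template setup is described in the package's configuration documentation:

```r
# select the scheduler that Q() should submit jobs to
options(clustermq.scheduler = "slurm")   # or "lsf", "sge", ...

# or submit from your local machine via an SSH connection to the cluster
options(clustermq.scheduler = "ssh",
        clustermq.ssh.host = "user@login-node")  # placeholder login node
```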
The package is designed to distribute arbitrary function calls on HPC worker nodes. There are, however, a couple of caveats to observe as the R session running on a worker does not share your local memory.
The simplest example is a function call that is completely self-sufficient, with one argument (`x`) that we iterate through:
```r
fx = function(x) x * 2
Q(fx, x=1:3, n_jobs=1)
```

```
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
```
Non-iterated arguments are supported by the `const` argument:
```r
fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)
```

```
## [[1]]
## [1] 12
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 16
```
If a function relies on objects in its environment that are not passed as arguments, they can be exported using the `export` argument:
```r
fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)
```

```
## [[1]]
## [1] 12
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 16
```
Finally, if we want to use a package function we need to load it on the worker, either using a `library()` call or by prefixing it with `package_name::`:
```r
fx = function(x) {
    library(dplyr)
    x %>%
        mutate(area = Sepal.Length * Sepal.Width) %>%
        head()
}
Q(fx, x=list(iris), n_jobs=1)
```
```
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
## [[1]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  area
## 1          5.1         3.5          1.4         0.2  setosa 17.85
## 2          4.9         3.0          1.4         0.2  setosa 14.70
## 3          4.7         3.2          1.3         0.2  setosa 15.04
## 4          4.6         3.1          1.5         0.2  setosa 14.26
## 5          5.0         3.6          1.4         0.2  setosa 18.00
## 6          5.4         3.9          1.7         0.4  setosa 21.06
```
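The `package_name::` alternative mentioned above avoids attaching the package on the worker; a minimal sketch of the same computation:

```r
# reference the package function via its namespace instead of library()
fx = function(x) head(dplyr::mutate(x, area = Sepal.Length * Sepal.Width))
Q(fx, x=list(iris), n_jobs=1)
```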
The following arguments are supported by `Q`:

- `fun` - The function to call. This needs to be self-sufficient (because it will not have access to the master environment)
- `...` - All iterated arguments passed to the function. If there is more than one, all of them need to be named (see the sketch after this list)
- `const` - A named list of non-iterated arguments passed to `fun`
- `export` - A named list of objects to export to the worker environment (as shown above)
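A short sketch of the "more than one iterated argument" case (the function here is made up for illustration); the named vectors are iterated elementwise, in parallel:

```r
# with more than one iterated argument, all of them must be named
fx = function(x, y) x + y
Q(fx, x=1:3, y=10:12, n_jobs=1)
# list(11, 13, 15)
```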
Behavior can further be fine-tuned using the options below (a combined example follows the list):

- `fail_on_error` - Whether to stop if one of the calls returns an error
- `seed` - A common seed that is combined with the job number for reproducible results
- `memory` - Amount of memory to request for the job (`bsub -M`)
- `n_jobs` - Number of jobs to submit for all the function calls
- `job_size` - Number of function calls per job. If used in combination with `n_jobs`, the latter will be the overall limit
- `chunk_size` - How many calls a worker should process before reporting back to the master. Default: every worker will report back 100 times total
- `wait_time` - How long the master should wait between checking for results
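For instance, a sketch combining several of these options (the values are made up for illustration, not recommendations):

```r
fx = function(x) x^2

# 1000 calls spread over 5 jobs, requesting 1024 MB of memory per job;
# workers report back after every 50 calls, individual errors do not
# abort the run, and the seed makes per-call results reproducible
Q(fx, x=1:1000, n_jobs=5, memory=1024, chunk_size=50,
  fail_on_error=FALSE, seed=123)
```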
There are some packages that provide high-level parallelization of R function calls on a computing cluster. A thorough comparison of features and performance is available on the wiki.

In short, use `clustermq` if you want:

Use `batchtools` if:

Use `flowr`, `remake` or `Snakemake` if:

Don't use `batch` (last updated 2013) or `BatchJobs` (issues with SQLite on network-mounted storage).