Use case: Cache

The archivist package allows you to store, restore and search for R objects in repositories stored on a hard disk. There are different strategies for finding an object: by its name, by its date of creation, or by its metadata. The package is mainly designed as a repository of artifacts, but it can serve other use cases as well.

Let’s see how it can be used as a caching engine.

Let’s consider a function with a few arguments whose evaluation may take a significant amount of time. If there is a chance that the function will be executed with the same parameters more than once, it would be desirable to cache the results to avoid unnecessary evaluations.

Such a cache can be easily constructed with the archivist package.
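
Before bringing in the package, the idea itself can be sketched in memory with base R alone. The helper ‘cachedCall’, the environment ‘memo’ and the toy function ‘slowSquare’ below are hypothetical names used only for illustration. They are not part of archivist, and the cache they build disappears when the R session ends.

```r
# Environment used as an in-memory key-value store for results.
memo <- new.env()

# Hypothetical helper (not part of archivist): evaluate FUN(...) at most
# once per distinct combination of function name and argument values.
cachedCall <- function(FUN, ...) {
  # Text key built from the function's name as written by the caller
  # and the deparsed argument values.
  key <- paste(c(deparse(substitute(FUN)), deparse(list(...))), collapse = " ")
  if (!exists(key, envir = memo, inherits = FALSE)) {
    # Cache miss: evaluate the call and store the result under the key.
    assign(key, FUN(...), envir = memo)
  }
  # Return the stored value (freshly computed or previously cached).
  get(key, envir = memo, inherits = FALSE)
}

# Toy stand-in for an expensive computation; it counts real evaluations.
evaluations <- 0
slowSquare <- function(x) {
  evaluations <<- evaluations + 1
  x^2
}

first  <- cachedCall(slowSquare, 4)  # evaluated
second <- cachedCall(slowSquare, 4)  # same arguments: served from 'memo'
third  <- cachedCall(slowSquare, 5)  # new argument: evaluated again
```

A repository on disk plays the same role as ‘memo’, with the difference that the stored results survive the session.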

Heavyweight function

Let’s see an example. The ‘heavyweight’ function ‘getMaxDistribution’ summarizes the distribution of the maximum of N draws of random variables from distribution D, estimated from R replications.

getMaxDistribution <- function(D = rnorm, N = 10, R = 1000000) {
  res <- replicate(R, max(D(N)))
  summary(res)
}

system.time( getMaxDistribution(rnorm, 10) )
##    user  system elapsed 
##   7.350   0.244   7.596
system.time( getMaxDistribution(rexp, 20) )
##    user  system elapsed 
##   6.632   0.385   7.018
system.time( getMaxDistribution(rnorm, 10) )
##    user  system elapsed 
##   5.558   0.358   5.920

Now, let’s load the archivist package and prepare a repository for cached objects.

Cache preparation

require(devtools)
if (!require(archivist)) {
  install_github("pbiecek/archivist", build_vignettes = FALSE)
  library(archivist)  # attach the freshly installed package
}
## Warning: package 'jsonlite' was built under R version 3.1.1
cache <- function(cacheRepo, FUN, ...) {
  # Build the cache key: a hash of the arguments together with the function
  tmpl <- list(...)
  tmpl$.FUN <- FUN
  outputHash <- digest::digest(tmpl)
  # Look for an artifact already tagged with this key
  isInRepo <- searchInLocalRepo(paste0("cacheId:", outputHash), cacheRepo)
  if (length(isInRepo) > 0)
    return(loadFromLocalRepo(isInRepo[1], repoDir = cacheRepo, value = TRUE))

  # Cache miss: evaluate the call, tag the result with the key and store it
  output <- do.call(FUN, list(...))
  attr(output, "tags") <- paste0("cacheId:", outputHash)
  attr(output, "call") <- ""
  saveToRepo(output, repoDir = cacheRepo, archiveData = TRUE,
             archiveMiniature = FALSE, rememberName = FALSE)
  output
}

cacheRepo <- tempdir()
createEmptyRepo( cacheRepo )

How to work with cache

The ‘cacheRepo’ folder holds the results of function calls that have already been evaluated. How do we use it?

system.time( cache(cacheRepo, getMaxDistribution, rnorm, 10) )
##    user  system elapsed 
##   5.703   0.388   6.097
system.time( cache(cacheRepo, getMaxDistribution, rexp, 10) )
##    user  system elapsed 
##   5.692   0.402   6.110
system.time( cache(cacheRepo, getMaxDistribution, rnorm, 10) )
##    user  system elapsed 
##   0.003   0.001   0.004

The second evaluation of ‘getMaxDistribution’ with the same arguments (rnorm, 10) is much, much faster: the result is simply read from disk.

How does the ‘cache’ function work?

It creates an MD5 signature of the function FUN and its arguments and uses this signature as a key. If the key is already present in the cache repository, the stored object is simply restored. If it is not present, the call is evaluated and the result is stored. Note that if ‘cacheRepo’ is a shared folder, then you get a shared cache repository!
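
The key-building step can also be looked at in isolation. The sketch below assumes that the digest package is installed. ‘cacheKey’ is a hypothetical helper that repeats the first lines of ‘cache’; it is not part of archivist and is shown only to illustrate how the signature behaves.

```r
library(digest)  # provides digest(); MD5 is its default algorithm

# Hypothetical helper mirroring the key-building step of 'cache':
# hash the arguments together with the function itself.
cacheKey <- function(FUN, ...) {
  tmpl <- list(...)
  tmpl$.FUN <- FUN
  paste0("cacheId:", digest(tmpl))
}

k1 <- cacheKey(mean, c(1, 2, 3))
k2 <- cacheKey(mean, c(1, 2, 3))    # same function, same arguments
k3 <- cacheKey(mean, c(1, 2, 4))    # same function, different arguments
k4 <- cacheKey(median, c(1, 2, 3))  # different function, same arguments

sameCall      <- identical(k1, k2)  # repeated call maps to the same key
differentArgs <- identical(k1, k3)  # changed argument maps to a new key
differentFun  <- identical(k1, k4)  # changed function maps to a new key
```

Because the function is part of the hashed list, two different functions called with identical arguments do not share a cache entry.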