Memoisation

Memoisation

Memoisation allows the caching of the output for a function for a given set of inputs.

That is, for a set of arguments the result is only ever calculated once and then stored in a cache. If those set of arguments ever happen again, then the result is pulled from the cache rather than the function being re-run.

This technique is really useful when a function takes a long time to execute.

There are at least 2 memoisation packages in R:

  • memoise by Hadley Wickham [aut], Jim Hester [aut, cre], Kirill Müller [aut], Daniel Cook [aut].
  • R.cache by Henrik Bengtsson [aut, cre, cph]

I will use memoise simply because I’m more familiar with it.

My problem

A part of my solution method for nonograms was to generate all possible patterns for a particular row based upon the clues. (First discussed in this post)

The problem boils down to the following:

  • Find all integer sequences (of the target length) which add to the target sum
  • integer here means positive, non-zero whole number
  • target length is
    • always greater than 1
    • never greater than 35 (for the nonograms I’m attempting to solve)
  • target sum is never less than the target length
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Find all permutations of all vectors of positive integers of a given length 
#' which sum to a given value
#'
#' e.g. find_integers_with_sum(4, 2) gives
#'    1, 3
#'    2, 2
#'    3, 1
#'
#' @param target_length target length
#' @param target_sum target sum
#' @return matrix or raw values, with each row being a solution
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
find_integers_with_sum <- function(target_sum, target_length) {

  if (target_length == 1L) {
    return(matrix(as.raw(target_sum)))
  }

  max_choice <- target_sum - target_length + 1L
  solutions  <- vector('list', max_choice)
  lens       <- integer(max_choice)


  for (this_value in seq_len(max_choice)) {
    solution                <- find_integers_with_sum(target_sum - this_value, target_length - 1L)
    solutions[[this_value]] <- solution
    lens[this_value]        <- nrow(solution)
  }


  cbind(as.raw(rep.int(seq_len(max_choice), lens)), do.call(rbind, solutions), deparse.level = 0L)
}
find_integers_with_sum(target_sum=4, target_length=2)
     [,1] [,2]
[1,]   01   03
[2,]   02   02
[3,]   03   01

The challenge

When the target_sum and target_length get “large”, the number of solutions becomes huge.

  • find_integers_with_sum(25, 10)
    • 1,307,504 solutions
    • 10 seconds to generate.
    • 13MB matrix
  • find_integers_with_sum(30, 10)
    • 10 million solutions
    • 60 seconds to generate
    • 100MB matrix
  • find_integers_with_sum(35, 10)
    • over 52 million solutions
    • takes forever to generate (I’ve honestly never done it without memoising!)
    • 525MB matrix

The way find_integers_with_sums() recurses to find its solution means that it’s calling itself in a nested way to solve smaller and smaller versions of the problem.

This means that inner-most, smallest versions of the problem are solved mutiple times, but are always producing the same result.

This function will benefit from cacheing those inner results.

Memoisation

To memoise this function, I will be using the memoise package.

I will consider three possible backends:

  1. In-memory
  2. Filesystem in RAM disk (on OSX)
    • diskutil erasevolume HFS+ ‘ramdisk’ hdiutil attach -nomount ram://1165430
  3. Filesystem on SSD

I will be using find_integers_with_sum(30, 10) as my test case.

No cache

# Before doing anything else, I'm keeping a copy of the original un-cached function
find_integers_with_sum_ <- find_integers_with_sum

# run without memoisation
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})

In-memory cache

find_integers_with_sum  <- memoise(find_integers_with_sum_, cache = cache_memory())
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})

Filesystem on Ram disk

path <- "/Volumes/ramdisk"
unlink(path, recursive = TRUE)
find_integers_with_sum  <- memoise::memoise(find_integers_with_sum_, cache=cache_filesystem(path))
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})

Filesystem on SSD

path <- tempdir()
unlink(path, recursive = TRUE)
find_integers_with_sum  <- memoise::memoise(find_integers_with_sum_, cache=cache_filesystem(path))
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})

Timing Summary

Table 1: Run times
cache first_run second_run
None 60.00 60.000
Memory 1.60 0.001
Ram disk 3.28 0.130
SSD 3.46 0.130

Notes

  • Memoise in memory just uses a standard R environment to store objects
  • Memoise to filesystem uses saveRDS()
  • Memory cache is fastest (obviously).
  • No difference in speed between RAM disk and SSD
    • This may just be the SSD aggressively using its own disk cache? Not sure.
  • The filesystem cache in memoise does not do any compression.
    • On slower drives, it may be worth compressing the data

Future

  • Consider trying the following modifications to memoise
    • Add compression option to the filesystem cache (and s3 and gcs caches)
    • Revisit a way for caches to have limitations on the maximum size of a result which gets cached. I tried this in a recent post
    • Investigate a multi-tiered cache with small results saved to ram and bigger results saved to the filesystem.