Memoisation
Memoisation allows the caching of the output for a function for a given set of inputs.
That is, for a set of arguments the result is only ever calculated once and then stored in a cache. If those set of arguments ever happen again, then the result is pulled from the cache rather than the function being re-run.
This technique is really useful when a function takes a long time to execute.
There are at least 2 memoisation packages in R:
memoise
by Hadley Wickham [aut], Jim Hester [aut, cre], Kirill Müller [aut], Daniel Cook [aut].R.cache
by Henrik Bengtsson [aut, cre, cph]
I will use memoise
simply because I’m more familiar with it.
My problem
A part of my solution method for nonograms was to generate all possible patterns for a particular row based upon the clues. (First discussed in this post)
The problem boils down to the following:
- Find all integer sequences (of the target length) which add to the target sum
- integer here means positive, non-zero whole number
- target length is
- always greater than 1
- never greater than 35 (for the nonograms I’m attempting to solve)
- target sum is never less than the target length
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Find all permutations of all vectors of positive integers of a given length
#' which sum to a given value
#'
#' e.g. find_integers_with_sum(4, 2) gives
#' 1, 3
#' 2, 2
#' 3, 1
#'
#' @param target_length target length
#' @param target_sum target sum
#' @return matrix or raw values, with each row being a solution
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
find_integers_with_sum <- function(target_sum, target_length) {
if (target_length == 1L) {
return(matrix(as.raw(target_sum)))
}
max_choice <- target_sum - target_length + 1L
solutions <- vector('list', max_choice)
lens <- integer(max_choice)
for (this_value in seq_len(max_choice)) {
solution <- find_integers_with_sum(target_sum - this_value, target_length - 1L)
solutions[[this_value]] <- solution
lens[this_value] <- nrow(solution)
}
cbind(as.raw(rep.int(seq_len(max_choice), lens)), do.call(rbind, solutions), deparse.level = 0L)
}
find_integers_with_sum(target_sum=4, target_length=2)
[,1] [,2]
[1,] 01 03
[2,] 02 02
[3,] 03 01
The challenge
When the target_sum
and target_length
get “large”, the number of solutions
becomes huge.
find_integers_with_sum(25, 10)
- 1,307,504 solutions
- 10 seconds to generate.
- 13MB matrix
find_integers_with_sum(30, 10)
- 10 million solutions
- 60 seconds to generate
- 100MB matrix
find_integers_with_sum(35, 10)
- over 52 million solutions
- takes forever to generate (I’ve honestly never done it without memoising!)
- 525MB matrix
The way find_integers_with_sums()
recurses to find its solution means that it’s
calling itself in a nested way to solve smaller and smaller versions of the problem.
This means that inner-most, smallest versions of the problem are solved mutiple times, but are always producing the same result.
This function will benefit from cacheing those inner results.
Memoisation
To memoise this function, I will be using the memoise
package.
I will consider three possible backends:
- In-memory
- Filesystem in RAM disk (on OSX)
- diskutil erasevolume HFS+ ‘ramdisk’
hdiutil attach -nomount ram://1165430
- diskutil erasevolume HFS+ ‘ramdisk’
- Filesystem on SSD
I will be using find_integers_with_sum(30, 10)
as my test case.
No cache
# Before doing anything else, I'm keeping a copy of the original un-cached function
find_integers_with_sum_ <- find_integers_with_sum
# run without memoisation
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})
In-memory cache
find_integers_with_sum <- memoise(find_integers_with_sum_, cache = cache_memory())
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})
Filesystem on Ram disk
path <- "/Volumes/ramdisk"
unlink(path, recursive = TRUE)
find_integers_with_sum <- memoise::memoise(find_integers_with_sum_, cache=cache_filesystem(path))
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})
Filesystem on SSD
path <- tempdir()
unlink(path, recursive = TRUE)
find_integers_with_sum <- memoise::memoise(find_integers_with_sum_, cache=cache_filesystem(path))
system.time({find_integers_with_sum(30, 10)})
system.time({find_integers_with_sum(30, 10)})
Timing Summary
cache | first_run | second_run |
---|---|---|
None | 60.00 | 60.000 |
Memory | 1.60 | 0.001 |
Ram disk | 3.28 | 0.130 |
SSD | 3.46 | 0.130 |
Notes
- Memoise in memory just uses a standard R environment to store objects
- Memoise to filesystem uses
saveRDS()
- Memory cache is fastest (obviously).
- No difference in speed between RAM disk and SSD
- This may just be the SSD aggressively using its own disk cache? Not sure.
- The filesystem cache in
memoise
does not do any compression.- On slower drives, it may be worth compressing the data
Future
- Consider trying the following modifications to
memoise
- Add compression option to the filesystem cache (and s3 and gcs caches)
- Revisit a way for caches to have limitations on the maximum size of a result which gets cached. I tried this in a recent post
- Investigate a multi-tiered cache with small results saved to ram and bigger results saved to the filesystem.