smallfactor - store factors in raw
rather than integer
smallfactor
is an experiment to see what tradeoffs there might be for storing
a factor in a raw byte instead of an integer.
Since an R raw
value is only a single byte, but an integer
is 4 bytes,
there are space saving for large factors if stored in raw
vector -
as long as the number of levels is less than 256.
What’s in the box
smallfactor()
- similar tofactor()
, but creates an object backed by raw bytes rather than an integer.
Limitations
This package includes creating an object of the smallfactor
class,
and includes a few expected methods for common usage e.g. as.character.smallfactor()
,
as.data.frame.smallfactor()
.
It seems possible that you could write more methods to make smallfactor
behaviour
almost identical to regular factor
behaviour, but this package is not attempting
this (yet).
Installation
You can install from GitHub with:
# install.package('remotes')
remotes::install_github('coolbutuseless/smallfactor')
Example simple usage
small <- smallfactor(c('a', 'b', 'c', 'a', 'd'))
small
#> [smallfactor]
#> [1] a b c a d
#> Levels: a b c d
small[1:3]
#> [smallfactor]
#> [1] a b c
#> Levels: a b c d
df <- data.frame(value = small, y = 1:5)
df
#> value y
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 a 4
#> 5 d 5
library(dplyr)
df %>%
filter(value < 3)
#> value y
#> 1 a 1
#> 2 b 2
#> 3 a 4
Example - storing some DNA in 1/4 the space of a regular factor
library(smallfactor)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Generate some random DNA
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dna <- sample(c('A', 'T', 'G', 'C'), 1e6, replace = TRUE)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a `factor` and a `smallfactor` using the same basic syntax
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
small <- smallfactor(dna, levels = c('A', 'T', 'G', 'C'))
regular <- factor (dna, levels = c('A', 'T', 'G', 'C'))
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# `smallfactor` is approx 1/4 the size of the regular factor
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lobstr::obj_size(dna)
#> 8,000,272 B
lobstr::obj_size(regular)
#> 4,000,688 B
lobstr::obj_size(small)
#> 1,000,696 B