R packages - internal and external data

R packages can contain data as well as code. In fact, some packages contain only data and no code, e.g. {nycflights13}.

Data in a package is classed as being internal or external to the package. In broad terms, internal data is data that the functions in the package use, and external data is the data you’d like the user to see.

I recently wanted a dataset in the {phon} package to be both external and internal. That is, all the internal functions rely on a particular dataset, but I want to make that dataset easily accessible to users as well.

Now that I’ve figured out how to do it, I’m writing this post so that I don’t forget!

Thanks to Colin Fay and Hadley Wickham for their wizardly help!

Hadley’s R Packages Book is a great source of background reading on this topic.

Internal data

Internal data is data within the package, but not (generally) made available to the user.

{usethis} has a handy function to help set up this data called usethis::use_data(). The following code snippet saves the data objects x and y as internal data for your package:

usethis::use_data(x, y, internal = TRUE)

All functions within the package can now freely access the x and y variables, but the user won’t see them.
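For a sense of what this does on disk: with internal = TRUE, usethis::use_data() saves all the listed objects together in a single file, R/sysdata.rda, inside the package source. The sketch below imitates that step with base R only. The temporary directory and the values of x and y are made up for illustration, and this is not the actual {usethis} implementation.

```r
# Hypothetical objects standing in for the post's x and y
x <- c(1, 2, 3)
y <- "this is string y"

# A throwaway directory standing in for the package source tree
pkg_dir <- file.path(tempdir(), "mypackage")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# Internal data: every object goes into the one file R/sysdata.rda
sysdata <- file.path(pkg_dir, "R", "sysdata.rda")
save(x, y, file = sysdata)

# Load the file into a fresh environment to inspect what it holds
e <- new.env()
load(sysdata, envir = e)
internal_names <- sort(ls(e))
internal_names
```

When the package is installed, the objects in R/sysdata.rda end up in the package namespace, which is why the package's own functions can refer to x and y directly.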

If the user would like to access this data, they can use the triple-colon accessor:

mypackage:::x
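To see the difference between the two accessors without building a package, here is a small sketch that uses base R’s {stats} package as a stand-in for mypackage. In place of the internal data x it uses t.test.default, a function that lives in the {stats} namespace but is not exported. The choice of object is only for illustration.

```r
# t.test.default is in the stats namespace but not in its export list
is_exported <- "t.test.default" %in% getNamespaceExports("stats")

# `::` only reaches exported objects, so this lookup raises an error,
# which is caught here and turned into a marker string
double_colon <- tryCatch(
  stats::t.test.default,
  error = function(e) "not exported"
)

# `:::` reaches into the namespace whether or not the object is exported
triple_colon <- stats:::t.test.default

is_exported
double_colon
is.function(triple_colon)
```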

External data

External data is data contained in the package that is made available to the user, but is not (generally) directly available to the functions within the package, because it lives outside the package namespace.

Again, usethis::use_data() can be used to set up the data:

usethis::use_data(a, b, internal = FALSE)

Once the package is attached with library(), the variables a and b are available to the user:

library(mypackage)
a

#> [1] "this is string a"
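On disk, external data is laid out differently from internal data: with internal = FALSE, usethis::use_data() writes each object to its own file under data/, named after the object. The sketch below imitates that layout with base R only. The temporary directory and the values of a and b are assumptions for illustration, not the actual {usethis} implementation.

```r
# Hypothetical objects standing in for the post's a and b
a <- "this is string a"
b <- "this is string b"

# A throwaway directory standing in for the package source tree
pkg_dir <- file.path(tempdir(), "mypackage_external")
data_dir <- file.path(pkg_dir, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# External data: one .rda file per object, named after the object
save(a, file = file.path(data_dir, "a.rda"))
save(b, file = file.path(data_dir, "b.rda"))

external_files <- sort(list.files(data_dir))
external_files
```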

Accessing external data as if it were internal data

Data which is set up to be external can still be used within the package if you explicitly namespace every occurrence. That is, if the external data is called x, then every function in the package that uses it should refer to it as mypackage::x.
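As a rough sketch of the pattern, the function below reads a dataset through an explicit namespace prefix instead of relying on it being attached. Base R’s {datasets} package stands in for mypackage and mtcars stands in for x. The function name mean_mpg is hypothetical.

```r
# `datasets` plays the role of mypackage, `mtcars` the role of x.
# The dataset is always referred to with its package prefix, so the
# function does not depend on the package being attached.
mean_mpg <- function() {
  mean(datasets::mtcars$mpg)
}

result <- mean_mpg()
result
```

Inside your own package, the same function body would refer to mypackage::x in place of datasets::mtcars.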

There are other ways to do this, but this method is:

  1. known to work
  2. known to pass devtools::check()
  3. accepted by CRAN - see for example {proustr}

Thanks to Colin Fay for showing me this solution.

Exporting internal data to make it seem like external data

If data is set up to be internal to the package, a user can access it in one of two ways:

  1. through the triple-colon accessor, e.g. mypackage:::x, or
  2. directly, if the package explicitly exports the data

The first method is a little non-standard for day-to-day usage: most of the time you would rather not have to reach for “hidden” functions and objects with the triple-colon accessor.

The second method, from Hadley Wickham, is to explicitly export the data as follows (if using roxygen):

#' Cool data which is used internally in the package
#' @name x
#' @export
"x"

Now while the R Packages book insists that you “Never @export a data set”, this seems like the exception to the rule.

Should I externalise internal data? Or internalise external data?

I don’t know if there’s a better method (in terms of hackishness and/or pleasing the CRAN gods) but keeping data internal and manually exporting seems like less work, so that’s the one I’m going with.