Evaluating R code from potentially malicious sources

I’m looking into the idea of running R code which may originate from potentially malicious sources, e.g. code from a web interface, a database, or even a tweet!

If we allow user input of R code, eventually someone is going to try system("rm -rf /") or something equally dangerous, so I would like to forbid the use of system while still allowing the use of mean, +, c() etc.

Ignoring other considerations (like limiting runtime of the code) my idea consists of 2 parts:

Identify the functions in the supplied code
Check them against a whitelist of functions that are allowed to run

Note

I know this is dangerous
I want to run this in the current process and use the result i.e. no remote sandboxes or docker images.
If this is impossible, or there’s an actual way to do this, or a better way of thinking of the problem, I’m keen to hear it!

Ping me on twitter

Naive Approach for identifying function names in R code

A naive approach for identifying functions in an expression is just to get all the names in the expression and all the variables in the expression and find the difference.

i.e. the items which are names but not variables must be functions!

expr <- parse(text="a <- 1:5; mean(a);")

all.names(expr)

[1] "<-"   "a"    ":"    "mean" "a"

all.vars(expr)

[1] "a"

# These should be the function names!
setdiff(all.names(expr), all.vars(expr))

[1] "<-"   ":"    "mean"

This seems to work and gives all the function calls in the given code.

Why the naive approach fails…

The naive approach fails because a function can easily be assigned to a variable and then called.

This means that a name in an expression can easily be a variable which represents a function.

In the following code, myfunc is a variable which contains a function, so it is not classified as being a function name under the naive scheme.

expr <- parse(text="a <- 1:5; myfunc = mean; myfunc(a);")

all.names(expr)

[1] "<-"     "a"      ":"      "="      "myfunc" "mean"   "myfunc" "a"

all.vars(expr)

[1] "a"      "myfunc" "mean"

# Hoping to get all the function names - but 'myfunc' and 'mean' are both missing
setdiff(all.names(expr), all.vars(expr))

[1] "<-" ":"  "="

A more rigorous approach for identifying function names in R code (i.e. what Hadley suggests!)

In Hadley Wickham’s Advanced R book, the chapter on Expressions suggests that walking the Abstract Syntax Tree (AST) with a recursive function is the way to properly determine things about an expression.

So I’ve taken Hadley’s example code and tweaked it for my needs.

#-----------------------------------------------------------------------------
#' find all function calls in an expression
#'
#' Adapted from: @hadley's http://adv-r.had.co.nz/Expressions.html
#'
#' @param x expression
#'
#' @return a character vector of function names
#-----------------------------------------------------------------------------
find_calls <- function(x) {
  calls <- c()
  
  if (is.call(x)) {
    # Record the name of this call, then call find_calls recursively on its arguments
    this_call <- as.character(x[[1]])
    sub_calls <- purrr::flatten_chr(lapply(x[-1], find_calls))
    calls     <- c(this_call, sub_calls)
  } else if (is.pairlist(x)) {
    stop("Pairlists not allowed")
  } 
  
  calls
}

Since the function find_calls is for a single expression, use purrr to map the function over a vector of multiple expressions and get all function names.

library(purrr)  # provides map() and flatten_chr(), and re-exports %>%

exprs <- parse(text="a <- 1:5; myfunc = mean; myfunc(a);")

exprs %>%
  purrr::map(find_calls) %>% 
  purrr::flatten_chr()

[1] "<-"     ":"      "="      "myfunc"

Note that we have now correctly identified myfunc as a function, but not that it is really just calling mean.
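
A related gap is worth keeping in mind: a walker like this only reports names that appear in call position, so a function that is merely passed as an argument never shows up. The sketch below illustrates this with a hypothetical base-R variant of the walker, find_calls_base (my naming, not the code above), so that it runs without purrr. Nothing is evaluated; the code is only parsed and walked.

```r
# Hypothetical base-R variant of find_calls(), for illustration only.
# It returns the names found in call position, recursing into arguments.
find_calls_base <- function(x) {
  if (!is.call(x)) return(character(0))
  c(as.character(x[[1]]), unlist(lapply(as.list(x)[-1], find_calls_base)))
}

# 'system' is only an argument here, so it is not reported
find_calls_base(quote(sapply("echo hi", system)))
# [1] "sapply"

# The function name is just a string here, so it is not reported either
find_calls_base(quote(do.call("system", list("echo hi"))))
# [1] "do.call" "list"
```

So even a complete list of called names says nothing about higher-order functions such as sapply or do.call, which would have to be kept off any whitelist or handled separately.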

Naive method for whitelisting functions

Now that we can identify function names in R code, can we enforce a whitelist?

That is, can safe and unsafe functions be told apart just by comparing all the function names in the given code against a list of whitelisted functions?

TLDR: No.

A variable with the name of a whitelisted function can easily be made to point to an unsafe function. So in the example below, if the whitelisted functions include mean (which is pretty safe to call), a malicious user could simply have the name mean point to the system function and proceed to wreak havoc.

exprs <- parse(text="mean <- system; mean('echo bad stuff happens');")

exprs %>%
  purrr::map(find_calls) %>% 
  purrr::flatten_chr()

[1] "<-"   "mean"

We have only identified that a function named mean has been called, but are totally ignorant of the fact that this is really a call to system.
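
To make the failure concrete, here is a minimal sketch of what a name-only whitelist check might look like. The helper names (find_calls_base, disallowed_calls) and the whitelist contents are assumptions made up for this illustration. The code is only parsed, never evaluated.

```r
# Hypothetical base-R walker, repeated so the example runs on its own
find_calls_base <- function(x) {
  if (!is.call(x)) return(character(0))
  c(as.character(x[[1]]), unlist(lapply(as.list(x)[-1], find_calls_base)))
}

# Illustrative whitelist only; a real one would need far more thought
whitelist <- c("<-", "=", ":", "+", "c", "mean")

# Return the called names that are NOT on the whitelist
disallowed_calls <- function(code, allowed) {
  exprs <- parse(text = code)
  calls <- unlist(lapply(exprs, find_calls_base))
  setdiff(calls, allowed)
}

# A direct call to system is caught ...
disallowed_calls("system('echo bad stuff happens')", whitelist)
# [1] "system"

# ... but the renamed call passes the check with nothing flagged
disallowed_calls("mean <- system; mean('echo bad stuff happens')", whitelist)
# character(0)
```

The second result is the problem in a nutshell: every name in the code is on the whitelist, yet evaluating it would run system.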

Next idea: compare function signatures not just function names

Even being able to get all function calls and compare against a whitelist isn’t enough to make it safe to evaluate R code from a possibly malicious source.

My next idea is not just to compare function names, but to actually compare function signatures at each step of the evaluation i.e. for each expression:

find the function names
get() the functions from the working environment and calculate a function signature, e.g. a SHA-1 digest
compare the function name and signature against a known list of function names/signatures
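
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the helper names (find_calls_base, safe_eval) and the whitelist are made up, and instead of a SHA-1 digest it uses base R's identical() to compare each function against a trusted copy captured up front, which serves the same purpose here without needing an extra package.

```r
# Hypothetical base-R walker, repeated so the example runs on its own
find_calls_base <- function(x) {
  if (!is.call(x)) return(character(0))
  c(as.character(x[[1]]), unlist(lapply(as.list(x)[-1], find_calls_base)))
}

# Trusted copies of the whitelisted functions, captured before any user code
# runs. The whitelist contents are illustrative only.
allowed_names <- c("<-", "=", ":", "c", "mean")
trusted <- mget(allowed_names, envir = baseenv(), mode = "function")

# Check and evaluate one top-level expression at a time
safe_eval <- function(code, trusted) {
  env <- new.env(parent = baseenv())
  result <- NULL
  for (expr in as.list(parse(text = code))) {
    for (name in find_calls_base(expr)) {
      if (!name %in% names(trusted)) {
        stop("Call to non-whitelisted name: ", name)
      }
      # Look up what the name refers to right now, as a function
      current <- get(name, envir = env, mode = "function")
      if (!identical(current, trusted[[name]])) {
        stop("Name '", name, "' no longer refers to the trusted function")
      }
    }
    result <- eval(expr, envir = env)
  }
  result
}

safe_eval("a <- 1:5; mean(a)", trusted)
# [1] 3

# The rebinding trick is now rejected before the second expression is evaluated
tryCatch(
  safe_eval("mean <- system; mean('echo bad stuff happens')", trusted),
  error = function(e) conditionMessage(e)
)
# [1] "Name 'mean' no longer refers to the trusted function"
```

This is still not a sandbox. The check only runs between top-level expressions, so a rebinding and a call inside a single expression, such as c(mean <- system, mean('echo hi')), would pass the check and then run system during evaluation.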