mikefc

Problem: %in% isn’t strict enough for some of my use cases

%in% will silently accept all of the following:

  • arguments of different types
  • RHS arguments that don’t match any of the possible LHS inputs
  • Empty RHS or LHS

For example, the following is a nonsensical filter in which none of the elements we’re trying to match against are actually in the Species vector!

library(dplyr)  # provides %>% and filter()

iris %>% 
  filter(Species %in% c('leaf', 'tree', 'butterfly'))
# A tibble: 0 x 5
# ... with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
#   Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>
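To see why that filter is a problem, here is a minimal base R sketch (the variable name `wanted` is just for illustration) that lists which requested values are missing from Species, and confirms that the equivalent base R subset also returns zero rows without any warning.

```r
# Values we asked for (illustrative name, not from the original post)
wanted <- c('leaf', 'tree', 'butterfly')

# Requested values that are not levels of Species at all
missing_values <- setdiff(wanted, levels(iris$Species))
missing_values

# Base R equivalent of the filter: zero rows, and no warning or error
n_matched <- nrow(iris[iris$Species %in% wanted, ])
n_matched
```

Every requested value shows up in `missing_values`, which is exactly the situation I'd like to be told about.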

jonocarrol has raised this issue in a pretty neat PR to the forcats package.

I’m not trying to be as elegant as he is, so I’m just going to make something work in isolation.

My use case:

  • comparing atomic vectors only, usually character strings
  • lhs and rhs arguments must be of the same class
  • lhs and rhs arguments must have at least 1 element each

Some test cases.

Cases which should run without error

1:5         %in% 2:4            # Good
[1] FALSE  TRUE  TRUE  TRUE FALSE
c('a', 'b') %in% 'a'            # Good
[1]  TRUE FALSE

Cases for which I’d like an explicit error

c('a', 'b') %in% 'A'            # Bad: Some RHS values not in LHS
[1] FALSE FALSE
1:5         %in% c('a', 'b')    # Bad: type mismatch
[1] FALSE FALSE FALSE FALSE FALSE
1:5         %in% 5:8            # Bad: Some RHS values not in LHS
[1] FALSE FALSE FALSE FALSE  TRUE
1           %in% 1:5            # Bad: type mismatch
[1] TRUE
1:5         %in% integer(0)     # Bad: empty RHS
[1] FALSE FALSE FALSE FALSE FALSE
integer(0)  %in% 1:5            # Bad: empty LHS
logical(0)
1           %in% list(1, 2, 3)  # Bad: RHS not atomic
[1] TRUE
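The type-mismatch cases pass silently because %in% is a thin wrapper around match(), which coerces both sides to a common type before comparing. A small sketch of that behaviour (the variable names are only for illustration):

```r
# Integers are compared as character strings when the RHS is character,
# so a "type mismatch" can even produce matches
int_vs_chr <- 1:5 %in% c('1', '2')
int_vs_chr

# A factor is matched by its labels, not by its integer codes
fct_vs_chr <- factor('a') %in% 'a'
fct_vs_chr

# %in% gives the same answer as calling match() directly
via_match <- match(1:5, c('1', '2'), nomatch = 0L) > 0L
identical(int_vs_chr, via_match)
```

Whether that coercion is helpful or harmful depends on the use case; for mine, I'd rather be stopped with an error.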

%IN% - a stricter %in%

%IN% expects the following

  • LHS and RHS are atomic
  • LHS and RHS are not empty
  • LHS and RHS are identical classes
  • All elements in the RHS are in the LHS

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Stricter version of %in%
#'
#' Check arguments for type mismatch, emptiness, atomic-ness and assert that
#' all elements in RHS are also in the LHS
#'
#' @param lhs left-hand side
#' @param rhs right-hand side
#' 
#' @return A logical vector the same length as 'lhs' which is TRUE if the 
#' corresponding value in the lhs exists somewhere in the rhs.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'%IN%' <- function(lhs, rhs) {
  if (!is.atomic(lhs) || !is.atomic(rhs)) {
    stop("%IN%: LHS and RHS must be atomic")
  }
  if (length(rhs) == 0) {
    stop("%IN%: RHS has no elements")
  }
  if (length(lhs) == 0) {
    stop("%IN%: LHS has no elements")
  }
  if (!identical(class(lhs), class(rhs))) {
    stop("%IN%: Classes not identical. LHS: ", deparse(class(lhs)), " RHS: ", deparse(class(rhs)))
  }
  if (!all(rhs %in% lhs)) {
    stop("%IN%: Some RHS elements not in LHS - ", deparse(setdiff(rhs, lhs), nlines=1))
  }
  lhs %in% rhs
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Negated version
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'%NOTIN%' <- Negate('%IN%')

How %IN% handles cases which should run without error

1:5         %IN% 2:4            # Good
[1] FALSE  TRUE  TRUE  TRUE FALSE
c('a', 'b') %IN% 'a'            # Good
[1]  TRUE FALSE

How %IN% handles cases for which I’d like an explicit error

c('a', 'b') %IN% 'A'            # Bad: Some RHS values not in LHS
Error in c("a", "b") %IN% "A": %IN%: Some RHS elements not in LHS - "A"
1:5         %IN% c('a', 'b')    # Bad: type mismatch
Error in 1:5 %IN% c("a", "b"): %IN%: Classes not identical. LHS: "integer" RHS: "character"
1:5         %IN% 5:8            # Bad: Some RHS values not in LHS
Error in 1:5 %IN% 5:8: %IN%: Some RHS elements not in LHS - 6:8
1           %IN% 1:5            # Bad: type mismatch
Error in 1 %IN% 1:5: %IN%: Classes not identical. LHS: "numeric" RHS: "integer"
1:5         %IN% integer(0)     # Bad: empty RHS
Error in 1:5 %IN% integer(0): %IN%: RHS has no elements
integer(0)  %IN% 1:5            # Bad: empty LHS
Error in integer(0) %IN% 1:5: %IN%: LHS has no elements
1           %IN% list(1, 2, 3)  # Bad: RHS not atomic
Error in 1 %IN% list(1, 2, 3): %IN%: LHS and RHS must be atomic
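The cases above can also be checked programmatically. The following is a rough sketch only: `error_message()` is a hypothetical helper I've made up for this example, and %IN% is repeated in condensed form purely so the block runs on its own.

```r
# %IN% as defined earlier, condensed so this block is self-contained
`%IN%` <- function(lhs, rhs) {
  if (!is.atomic(lhs) || !is.atomic(rhs)) stop("%IN%: LHS and RHS must be atomic")
  if (length(rhs) == 0) stop("%IN%: RHS has no elements")
  if (length(lhs) == 0) stop("%IN%: LHS has no elements")
  if (!identical(class(lhs), class(rhs))) {
    stop("%IN%: Classes not identical. LHS: ", deparse(class(lhs)),
         " RHS: ", deparse(class(rhs)))
  }
  if (!all(rhs %in% lhs)) {
    stop("%IN%: Some RHS elements not in LHS - ",
         deparse(setdiff(rhs, lhs), nlines = 1))
  }
  lhs %in% rhs
}
`%NOTIN%` <- Negate(`%IN%`)

# Hypothetical helper: evaluate an expression and return its error
# message, or NULL if it ran without error
error_message <- function(expr) {
  tryCatch({ expr; NULL }, error = function(e) conditionMessage(e))
}

ok_case    <- error_message(1:5 %IN% 2:4)          # no error expected
empty_rhs  <- error_message(1:5 %IN% integer(0))   # empty RHS
not_atomic <- error_message(1 %IN% list(1, 2, 3))  # RHS not atomic

# The negated operator runs the same checks, then flips the result
negated <- 1:5 %NOTIN% 2:4
```

Because %NOTIN% is built with Negate(), it inherits every check from %IN%, so an invalid call to it fails with the same messages.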

Future

  • I feel that this version of %IN% may be a little too strict to be useful.
  • Not sure I’m really attached to the name %IN%
  • I assume the vctrs package will solve all of this for me soon :)
  • Waiting for someone to tell me that %IN% is equivalent to some base R function I’ve never heard of ;)