The Zombie/Vampire Apocalypse - A use case for strict membership tests

Problem: %in% isn’t strict enough for some of my use cases

This is a second attempt at creating a stricter version of %in% to test membership. (My first attempt is here)

My problems with %in% arise because:

  • It silently promotes arguments of different types.
    • I’d actually like an error when testing membership between vectors of different types
  • It is not robust to variations which might arise as the data evolves
    • See Zombies/Vampires use case below
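
As a quick illustration of the silent promotion, here is a minimal sketch in base R. The values and variable names are made up for this example.

```r
# %in% coerces both sides to a common type before matching, and it does so
# without any warning. Here the numbers are compared as strings.
num_vs_chr <- c(1, 2) %in% c("1", "3")
print(num_vs_chr)   # the number 1 matches the string "1"

# A string tested against a number simply comes back FALSE, with no error
chr_vs_num <- "zombie" %in% 1
print(chr_vs_num)
```

Neither call complains about the type mismatch, which is exactly the behaviour I’d like to turn into an error.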

In the following use-case, I show why %in% isn’t ideal with data sets which evolve and update.

R does not tell you when a membership test has gone out of date. Instead, the user must remain vigilant with every data update to ensure that the test is still valid, i.e.

  • Did the categories change?
  • Did I account for all spelling variations?
  • Did the underlying format of the data change?

I would like a membership test which fails very noisily if an assumption made about the data is no longer true.
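
To make "fails very noisily" concrete, this is roughly the kind of guard one would otherwise have to write by hand around every %in% call. It is only a sketch: known_afflictions and the sample data are assumptions invented for this illustration.

```r
# Hypothetical list of every value the script knows how to handle
known_afflictions <- c("healthy", "cough", "zombie", "werewolf")

# Hypothetical incoming data containing a value nobody planned for
affliction <- c("healthy", "cough", "zombie", "vampire")

# Stop loudly if any input value is not one of the known values,
# otherwise carry on with the usual %in% test
unknown <- setdiff(affliction, known_afflictions)
result <- tryCatch({
  if (length(unknown) > 0L) {
    stop("Unclassified afflictions: ", deparse(unknown))
  }
  affliction %in% c("zombie", "werewolf")
}, error = function(e) conditionMessage(e))

print(result)
```

Writing this by hand each time is easy to forget, which is the motivation for folding the check into the membership test itself.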

The Zombie/Vampire Apocalypse - A use case for strict membership tests

Testing for membership (testing with %in%)

  • Every week you get an updated list of team members.
  • Your job is to find the dangerous ones and recommend them to HR for review.
  • You wrote a script (using %in%) to find the dangerous team members.
library(dplyr)   # filter() and %>%
library(tibble)  # tribble()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(affliction %in% c('zombie', 'werewolf'))
# A tibble: 1 x 2
  name  affliction
  <chr> <chr>     
1 carl  zombie    

Membership changes (testing with %in%)

  • Debbie from the workshop is found to be a vampire.
  • However, you weren’t notified of the new classification, so according to your standard script only Carl is dangerous.
  • The next nightshift with Debbie is going to be interesting!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - debbie-the-vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members doesn't do its job and Debbie 
# (a vampire) isn't considered as dangerous!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(affliction %in% c('zombie', 'werewolf'))
# A tibble: 1 x 2
  name  affliction
  <chr> <chr>     
1 carl  zombie    

Membership typos (testing with %in%)

  • After the incident with Debbie, you’ve now updated your classification to include ‘vampire’ as a dangerous affliction.
  • Evan has also been diagnosed as a vampire.
  • However, the work-experience data-entry person has entered his affliction as ‘Vampire’.
  • So according to your standard script only Carl and Debbie are dangerous.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Evan-the-Vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire",
  'evan'    ,  "Vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members doesn't do its job and Evan 
# (a vampire) isn't considered dangerous because of different capitalisation!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(affliction %in% c('zombie', 'werewolf', 'vampire'))
# A tibble: 2 x 2
  name   affliction
  <chr>  <chr>     
1 carl   zombie    
2 debbie vampire   

Membership changes from text to a number (testing with %in%)

  • For privacy reasons, management has decided to have all afflictions represented by 0 or 1, where 0 is healthy and 1 represents dangerous afflictions.
  • You diligently update your scripts.
  • However, the work-experience data-entry person didn’t get the memo and still has afflictions represented by strings.
  • You now think nobody is dangerous!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Data has not yet been updated to numeric coding
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire",
  'evan'    ,  "Vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your code has been updated to expect numeric coding of afflictions, but
# there is a type mismatch and now no one is considered dangerous
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(affliction == 1)
# A tibble: 0 x 2
# … with 2 variables: name <chr>, affliction <chr>

Strict membership testing with is_within()

  • I need a function which takes 3 arguments
    1. the input values
    2. an ‘ingroup’ which defines what is to be matched against
    3. an ‘outgroup’ which defines what it must not match against
  • By explicitly specifying the allowed and dis-allowed values, is_within() can be much more robust to unexpected and unhandled values.
  • Because there are now 3 arguments, this can no longer be an infix operator like %in%
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' A strict version of '%in%' where both the in-group and out-group must be completely specified
#'
#' The membership test is strict.
#'   - if 'universe' is defined, then `outgroup = setdiff(universe, ingroup)`
#'   - Every value of 'x' must exist within either 'ingroup' or 'outgroup'
#'   - 'ingroup' and 'outgroup' must be disjoint sets
#'   - May specify only one of 'outgroup' or 'universe'
#'
#' @param x input values.
#' @param ingroup vector of values against which elements of 'x' should be checked
#'           for membership.
#' @param outgroup vector of values to which the elements of 'x' should not belong
#' @param universe vector of all possible values to expect
#'
#'
#' @return A logical vector the same length as 'x' which is TRUE if the
#' corresponding value in x is a member of 'ingroup' and is not a member
#' of 'outgroup'.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
is_within <- function(x, ingroup, outgroup=NULL, universe=NULL) {

  if (!xor(is.null(outgroup), is.null(universe))) {
    stop("is_within(): Must specify one (and only one) of 'outgroup' or 'universe'")
  }
  
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Define outgroup to be disjoint from ingroup if 'universe' given,
  # otherwise check that given ingroup/outgroup are disjoint
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (!is.null(universe)) {
    outgroup <- setdiff(universe, ingroup)
  } else {
    if (length(intersect(ingroup, outgroup)) > 0L) {
      stop("is_within(): 'ingroup' and 'outgroup' must not have overlapping elements. The following elements were found in both - ", 
           deparse(intersect(ingroup, outgroup)))
    }
  }
  
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Check classes match
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (length(intersect(class(x), intersect(class(ingroup), class(outgroup)))) == 0L) {
    stop("is_within(): Classes must be identical. x: ", deparse(class(x)),
         " ingroup: ", deparse(class(ingroup)), " outgroup: ", deparse(class(outgroup)))
  }
  
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Check inputs have length >= 1
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (length(x)       == 0L) { stop("is_within(): 'x' must have at least 1 element")}
  if (length(ingroup) == 0L) { stop("is_within(): 'ingroup' must have at least 1 element")}
  
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Actually perform the membership tests
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  res <- x %in% ingroup
  neg <- x %in% outgroup
  
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Check: input values must appear in one of 'ingroup' or 'outgroup', but not both.
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (any(!xor(res, neg))) {
    stop("is_within(): All elements should appear in the 'ingroup' or 'outgroup' vectors. The following input elements were not found in either - ", deparse(x[!xor(res, neg)]))
  }
  
  
  res
}
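
The examples in this post all pass 'outgroup' explicitly, so here is a minimal sketch of the 'universe' form. To keep the snippet runnable on its own it uses a condensed copy of is_within() with shortened error messages. The afflictions vector and the sample inputs are assumptions made up for this illustration.

```r
# Condensed copy of is_within() (same checks, shorter messages)
is_within <- function(x, ingroup, outgroup = NULL, universe = NULL) {
  if (!xor(is.null(outgroup), is.null(universe))) {
    stop("is_within(): specify exactly one of 'outgroup' or 'universe'")
  }
  if (!is.null(universe)) {
    outgroup <- setdiff(universe, ingroup)
  } else if (length(intersect(ingroup, outgroup)) > 0L) {
    stop("is_within(): 'ingroup' and 'outgroup' overlap")
  }
  if (length(intersect(class(x), intersect(class(ingroup), class(outgroup)))) == 0L) {
    stop("is_within(): Classes must be identical")
  }
  if (length(x) == 0L || length(ingroup) == 0L) {
    stop("is_within(): 'x' and 'ingroup' must have at least 1 element")
  }
  res <- x %in% ingroup
  neg <- x %in% outgroup
  if (any(!xor(res, neg))) {
    stop("is_within(): unclassified values - ", deparse(x[!xor(res, neg)]))
  }
  res
}

# Hypothetical list of every affliction we expect to see
afflictions <- c("healthy", "cough", "zombie", "werewolf")

# The outgroup is derived as setdiff(universe, ingroup)
flags <- is_within(c("healthy", "cough", "zombie"),
                   ingroup  = c("zombie", "werewolf"),
                   universe = afflictions)
print(flags)

# A value outside the universe still raises an error
msg <- tryCatch(
  is_within(c("healthy", "vampire"),
            ingroup  = c("zombie", "werewolf"),
            universe = afflictions),
  error = function(e) conditionMessage(e)
)
print(msg)
```

With 'universe', the list of expected values lives in one place and only the dangerous subset has to be spelled out at each call.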

Use case for strict membership tests using is_within()

Testing for membership (testing with is_within())

  • Every week you get an updated list of team members.
  • Your job is to find the dangerous ones and recommend them to HR for review.
  • You wrote a script to find the dangerous team members using is_within().
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members using `is_within()`
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(is_within(affliction, ingroup = c('zombie', 'werewolf'), outgroup=c("healthy", "cough")))
# A tibble: 1 x 2
  name  affliction
  <chr> <chr>     
1 carl  zombie    

Membership changes (testing with is_within())

  • Debbie from the workshop is found to be a vampire.
  • However, you weren’t notified of the new classification!
  • Luckily, is_within() ensures that all input values must be classified, and so raises an error to let you know that your script is out of date.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - debbie-the-vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Using `is_within()` your script fails noisily because 'vampire' is not
# defined in the ingroup or outgroup
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(is_within(affliction, ingroup = c('zombie', 'werewolf'), outgroup=c("healthy", "cough")))
Error in is_within(affliction, ingroup = c("zombie", "werewolf"), outgroup = c("healthy", : is_within(): All elements should appear in the 'ingroup' or 'outgroup' vectors. The following input elements were not found in either - "vampire"

Membership typos (testing with is_within())

  • You’ve updated your classification to include ‘vampire’ as a dangerous affliction.
  • Evan has also been diagnosed as a vampire.
  • However, the work-experience data-entry person has entered his affliction as ‘Vampire’.
  • is_within() finds that “Vampire” is not categorisable and raises an error.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Evan-the-Vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire",
  'evan'    ,  "Vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Using `is_within()` your script fails noisily because 'Vampire' (with a
# capital 'V') is not defined in the ingroup or outgroup
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(is_within(affliction, ingroup = c('zombie', 'werewolf', 'vampire'), outgroup=c("healthy", "cough")))
Error in is_within(affliction, ingroup = c("zombie", "werewolf", "vampire"), : is_within(): All elements should appear in the 'ingroup' or 'outgroup' vectors. The following input elements were not found in either - "Vampire"

Membership changes from text to a number (testing with is_within())

  • For privacy reasons, management has decided to have all afflictions represented by 0 or 1, where 0 is healthy and 1 represents dangerous afflictions.
  • You diligently update your scripts.
  • The work-experience data-entry person didn’t get the memo and still has afflictions represented by strings.
  • Because of the mismatch in input types, is_within() raises an error.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Data has not yet been updated to numeric coding
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire",
  'evan'    ,  "Vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members fails noisily
# because of the type mismatch
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(is_within(affliction, ingroup = 1L, outgroup=0L))
Error in is_within(affliction, ingroup = 1L, outgroup = 0L): is_within(): Classes must be identical. x: "character" ingroup: "integer" outgroup: "integer"
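
One consequence of the class check is worth spelling out: even once the data really is recoded to 0/1, a column typed as plain numbers (doubles) would still be rejected when the groups are given as integers. The sketch below reproduces only the class comparison from is_within(). The helper classes_differ() is a hypothetical name introduced for this illustration and is not part of the function above.

```r
# Numeric coding as it would arrive if typed without the 'L' suffix
x_double  <- c(0, 1, 1)
# The same coding stored as integers
x_integer <- c(0L, 1L, 1L)

# Hypothetical helper: the same comparison as the "Check classes match"
# step in is_within(). TRUE means is_within() would stop with an error.
classes_differ <- function(x, ingroup, outgroup) {
  length(intersect(class(x), intersect(class(ingroup), class(outgroup)))) == 0L
}

double_rejected  <- classes_differ(x_double,  ingroup = 1L, outgroup = 0L)
integer_rejected <- classes_differ(x_integer, ingroup = 1L, outgroup = 0L)

print(class(x_double))    # "numeric"
print(class(x_integer))   # "integer"
print(double_rejected)
print(integer_rejected)
```

So the groups need to be written in the same numeric type as the column, which is in keeping with the goal of failing loudly on any change in the underlying format.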

Membership changes - vampires are now not dangerous (testing with is_within())

  • After reading Twilight, management has decided vampirism is no longer a dangerous affliction.
  • You update your script to add ‘vampire’ in with the ‘healthy’ group.
  • However, you weren’t paying attention and left ‘vampire’ also grouped in with zombies and werewolves.
  • is_within() detects you’ve specified the term ‘vampire’ in two locations and raises an error.
team <- tribble(
  ~name     ,  ~affliction,
  "adam"    ,  "healthy",
  "barbara" ,  "cough",
  "carl"    ,  "zombie",
  'debbie'  ,  "vampire",
  'evan'    ,  "Vampire"
)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script now lists 'vampire' in both the ingroup and the outgroup, so
# `is_within()` fails noisily because the two groups overlap
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>% 
  filter(is_within(affliction, ingroup = c('zombie', 'werewolf', 'vampire'), outgroup=c("healthy", "cough", 'vampire')))
Error in is_within(affliction, ingroup = c("zombie", "werewolf", "vampire"), : is_within(): 'ingroup' and 'outgroup' must not have overlapping elements. The following elements were found in both - "vampire"

Summary

  • This seems like a much more useful/useable solution than Strict ‘%IN%’
  • I’m still not 100% happy with the name of the function (is_within) or its arguments (ingroup and outgroup).
    • It sort of reads OK: “is x within the ingroup, but never the outgroup”
    • It’s hard trying to find the right verb/phrase! And the verb/phrase needs to make sense when negated.
    • Suggestions welcomed!