Problem: %in% isn’t strict enough for some of my use cases
This is a second attempt at creating a stricter version of %in% to test membership. (My first attempt is here)
My problems with %in% arise because:
- It silently promotes arguments of different types.
  - I’d actually like an error when a membership test involves vectors of different types.
- It is not robust to variations which might arise as the data evolves.
  - See the Zombie/Vampire use case below.
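The silent promotion is easy to reproduce in base R. This is a minimal sketch with made-up values (not the team data used later): %in% is built on match(), which coerces both sides to a common type before comparing, so values of different types can match without any warning.

```r
# %in% coerces its arguments to a common type before matching,
# so values of different types can match without any warning.
num_vs_chr <- 1 %in% "1"        # double vs character
int_vs_dbl <- 1L %in% c(1, 2)   # integer vs double
lgl_vs_chr <- TRUE %in% "TRUE"  # logical vs character

# Each of these comparisons evaluates to TRUE
print(c(num_vs_chr, int_vs_dbl, lgl_vs_chr))
```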
In the following use case, I show why %in% isn’t ideal with data sets which evolve and update.
Rather than R letting you know when your membership test is no longer up to date, the user must remain vigilant with every data update to ensure that the test is still valid, i.e.
- Did the categories change?
- Did I account for all spelling variations?
- Did the underlying format of the data change?
I would like a membership test which fails very noisily if an assumption made about the data is no longer true.
The Zombie/Vampire Apocalypse - A use case for strict membership tests
Testing for membership (testing with %in%)
- Every week you get an updated list of team members.
- Your job is to find the dangerous ones and recommend them to HR for review.
- You wrote a script (using %in%) to find the dangerous team members.
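#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Packages used throughout: dplyr for filter() and %>%, tibble for tribble()
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
library(dplyr)
library(tibble)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(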
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(affliction %in% c('zombie', 'werewolf'))
# A tibble: 1 x 2
name affliction
<chr> <chr>
1 carl zombie
Membership changes (testing with %in%)
- Debbie from the workshop is found to be a vampire.
- However, you weren’t notified of the new classification, so according to your standard script only Carl is dangerous.
- The next nightshift with Debbie is going to be interesting!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - debbie-the-vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members doesn't do its job and Debbie
# (a vampire) isn't considered as dangerous!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(affliction %in% c('zombie', 'werewolf'))
# A tibble: 1 x 2
name affliction
<chr> <chr>
1 carl zombie
Membership typos (testing with %in%)
- After the incident with Debbie, you’ve now updated your classification to include ‘vampire’ as a dangerous affliction.
- Evan has also been diagnosed as a vampire.
- However, the work-experience data-entry person has entered his affliction as ‘Vampire’.
- So according to your standard script only Carl and Debbie are dangerous.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Evan-the-Vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire",
'evan' , "Vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members doesn't do its job and Evan
# (a vampire) isn't considered dangerous because of different capitalisation!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(affliction %in% c('zombie', 'werewolf', 'vampire'))
# A tibble: 2 x 2
name affliction
<chr> <chr>
1 carl zombie
2 debbie vampire
Membership changes from text to a number (testing with %in%)
- For privacy reasons, management has decided to have all afflictions represented by 0 or 1, where 0 is healthy and 1 represents dangerous afflictions.
- You diligently update your scripts.
- However, the work-experience data-entry person didn’t get the memo and still has afflictions represented by strings.
- You now think nobody is dangerous!
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Data has not yet been updated to numeric coding
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire",
'evan' , "Vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your code has been updated to expect numeric coding of afflictions, but
# there is a type mismatch and now no one is considered dangerous
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(affliction == 1)
# A tibble: 0 x 2
# … with 2 variables: name <chr>, affliction <chr>
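The empty result comes from the same coercion rule applied to ==. When a character vector is compared with a number, R converts the number to a string and compares strings, so nothing errors and nothing warns. A small base-R sketch with illustrative values (the "1" entry is an assumption added to show the one case that does match):

```r
affliction <- c("healthy", "zombie", "1")

# The number 1 is coerced to the string "1" before comparison, so only a
# literal "1" matches; the other values quietly compare as FALSE.
matches <- affliction == 1
print(matches)
```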
Strict membership testing with is_within()
- I need a function which takes 3 arguments
  - the input values
  - an ‘ingroup’ which defines what is to be matched against
  - an ‘outgroup’ which defines what it must not match against
- By explicitly specifying the allowed and disallowed values, is_within() can be much more robust to unexpected and unhandled values.
- Because there are now 3 arguments, this can no longer be an infix operator like %in%.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' A strict version of '%in%' where both the in-group and out-group must be completely specified
#'
#' The membership test is strict.
#' - if 'universe' is defined, then `outgroup = setdiff(universe, ingroup)`
#' - Every value of 'x' must exist within either 'ingroup' or 'outgroup'
#' - 'ingroup' and 'outgroup' must be disjoint sets
#' - May specify only one of 'outgroup' or 'universe'
#'
#' @param x input values.
#' @param ingroup vector of values against which elements of 'x' should be checked
#' for membership.
#' @param outgroup vector of values to which the elements of 'x' should not belong
#' @param universe vector of all possible values to expect
#'
#'
#' @return A logical vector the same length as 'x' which is TRUE if the
#' corresponding value in x is a member of 'ingroup' and is not a member
#' of 'outgroup'.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
is_within <- function(x, ingroup, outgroup=NULL, universe=NULL) {
  if (!xor(is.null(outgroup), is.null(universe))) {
    stop("is_within(): Must specify one (and only one) of 'outgroup' or 'universe'")
  }
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Define outgroup to be disjoint from ingroup if 'universe' given,
  # otherwise check that given ingroup/outgroup are disjoint
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (!is.null(universe)) {
    outgroup <- setdiff(universe, ingroup)
  } else {
    if (length(intersect(ingroup, outgroup)) > 0L) {
      stop("is_within(): 'ingroup' and 'outgroup' must not have overlapping elements. The following elements were found in both - ",
           deparse(intersect(ingroup, outgroup)))
    }
  }
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Check classes match
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (length(intersect(class(x), intersect(class(ingroup), class(outgroup)))) == 0L) {
    stop("is_within(): Classes must be identical. x: ", deparse(class(x)),
         " ingroup: ", deparse(class(ingroup)), " outgroup: ", deparse(class(outgroup)))
  }
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Check inputs have length >= 1
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (length(x) == 0L) { stop("is_within(): 'x' must have at least 1 element") }
  if (length(ingroup) == 0L) { stop("is_within(): 'ingroup' must have at least 1 element") }
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Actually perform the membership tests
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  res <- x %in% ingroup
  neg <- x %in% outgroup
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Check: input values must appear in one of 'ingroup' or 'outgroup', but not both.
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (any(!xor(res, neg))) {
    stop("is_within(): All elements should appear in the 'ingroup' or 'outgroup' vectors. The following input elements were not found in either - ",
         deparse(x[!xor(res, neg)]))
  }
  res
}
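The examples that follow only ever pass an outgroup, so here is a small sketch of what the universe argument amounts to, using only base R set operations. The universe vector and the input values are made up for illustration.

```r
# Hypothetical full list of afflictions the data is allowed to contain
universe <- c("healthy", "cough", "zombie", "werewolf")
ingroup  <- c("zombie", "werewolf")

# With 'universe' given, the outgroup is derived as everything else,
# which makes the two groups disjoint by construction.
outgroup <- setdiff(universe, ingroup)

# Values found in neither group are the ones that trigger the
# "not found in either" error.
x <- c("healthy", "zombie", "vampire")
unclassified <- x[!(x %in% ingroup | x %in% outgroup)]

print(outgroup)
print(unclassified)
```

With the function itself, the equivalent call would be is_within(x, ingroup = ingroup, universe = universe), which stops with an error here because "vampire" is outside the universe.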
Use case for strict membership tests using is_within()
Testing for membership (testing with is_within())
- Every week you get an updated list of team members.
- Your job is to find the dangerous ones and recommend them to HR for review.
- You wrote a script to find the dangerous team members using is_within().
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members using `is_within()`
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(is_within(affliction, ingroup = c('zombie', 'werewolf'), outgroup=c("healthy", "cough")))
# A tibble: 1 x 2
name affliction
<chr> <chr>
1 carl zombie
Membership changes (testing with is_within())
- Debbie from the workshop is found to be a vampire.
- However, you weren’t notified of the new classification!
- Luckily, is_within() ensures that all input values must be classified, and so raises an error to let you know that your script is out of date.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - debbie-the-vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Using `is_within()` your script fails noisily because 'vampire' is not
# defined in the ingroup or outgroup
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(is_within(affliction, ingroup = c('zombie', 'werewolf'), outgroup=c("healthy", "cough")))
Error in is_within(affliction, ingroup = c("zombie", "werewolf"), outgroup = c("healthy", : is_within(): All elements should appear in the 'ingroup' or 'outgroup' vectors. The following input elements were not found in either - "vampire"
Membership typos (testing with is_within())
- You’ve updated your classification to include ‘vampire’ as a dangerous affliction.
- Evan has also been diagnosed as a vampire.
- However, the work-experience data-entry person has entered his affliction as ‘Vampire’.
- is_within() finds that “Vampire” is not categorisable and raises an error.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Evan-the-Vampire joins the team
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire",
'evan' , "Vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Using `is_within()` your script fails noisily because 'Vampire' (with a
# capital 'V') is not defined in the ingroup or outgroup
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(is_within(affliction, ingroup = c('zombie', 'werewolf', 'vampire'), outgroup=c("healthy", "cough")))
Error in is_within(affliction, ingroup = c("zombie", "werewolf", "vampire"), : is_within(): All elements should appear in the 'ingroup' or 'outgroup' vectors. The following input elements were not found in either - "Vampire"
Membership changes from text to a number (testing with is_within())
- For privacy reasons, management has decided to have all afflictions represented by 0 or 1, where 0 is healthy and 1 represents dangerous afflictions.
- You diligently update your scripts.
- The work-experience data-entry person didn’t get the memo and still has afflictions represented by strings.
- Because of the mismatch in input types, is_within() raises an error.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Team Members - Data has not yet been updated to numeric coding
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire",
'evan' , "Vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script to find dangerous team members fails noisily
# because of the type mismatch
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(is_within(affliction, ingroup = 1L, outgroup=0L))
Error in is_within(affliction, ingroup = 1L, outgroup = 0L): is_within(): Classes must be identical. x: "character" ingroup: "integer" outgroup: "integer"
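One consequence of comparing class() values is worth spelling out: integer and double vectors have different classes in R, so the check treats 1L and 1 as a mismatch as well. Below is a sketch of the same test pulled out on its own; classes_overlap is a hypothetical helper name used only for this illustration, and the inputs are made up.

```r
# The class test from is_within(), isolated for illustration.
# 'classes_overlap' is a hypothetical name, not part of the function above.
classes_overlap <- function(x, ingroup, outgroup) {
  length(intersect(class(x), intersect(class(ingroup), class(outgroup)))) > 0L
}

chr_vs_int <- classes_overlap(c("zombie", "cough"), 1L, 0L)  # character vs integer
int_vs_dbl <- classes_overlap(c(0L, 1L), 1, 0)               # integer vs double
int_vs_int <- classes_overlap(c(0L, 1L), 1L, 0L)             # all integer

# Only the last combination passes the check
print(c(chr_vs_int, int_vs_dbl, int_vs_int))
```

So once the data does switch to numeric coding, the groups need to be written with the same type as the column (for example 1L and 0L for an integer column), otherwise the check still stops the script.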
Membership changes - vampires are now not dangerous (testing with is_within())
- After reading Twilight, management has decided vampirism is no longer a dangerous affliction.
- You update your script to add ‘vampire’ in with the ‘healthy’ group.
- However, you weren’t paying attention and left ‘vampire’ also grouped in with zombies and werewolves.
- is_within() detects you’ve specified the term ‘vampire’ in two locations and raises an error.
team <- tribble(
~name , ~affliction,
"adam" , "healthy",
"barbara" , "cough",
"carl" , "zombie",
'debbie' , "vampire",
'evan' , "Vampire"
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Your script now lists 'vampire' in both the ingroup and the outgroup, so
# `is_within()` fails noisily because the two groups overlap
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
team %>%
filter(is_within(affliction, ingroup = c('zombie', 'werewolf', 'vampire'), outgroup=c("healthy", "cough", 'vampire')))
Error in is_within(affliction, ingroup = c("zombie", "werewolf", "vampire"), : is_within(): 'ingroup' and 'outgroup' must not have overlapping elements. The following elements were found in both - "vampire"
Summary
- This seems like a much more useful/useable solution than Strict ‘%IN%’
- I’m still not 100% happy with the name of the function (is_within) or its arguments (ingroup and outgroup).
  - It sort of reads OK: “is x within the ingroup, but never the outgroup”
  - It’s hard trying to find the right verb/phrase! And the verb/phrase needs to make sense when negated.
- Suggestions welcomed!