Left join lhs and rhs, using a phased approach

We use three 'phases' to left-join lhs and rhs:

Exact match: matches exactly on both exact_by and phased_by.
Quasi-exact match: transforms the phased_by variables on lhs and rhs using quasi_fun and then matches on the transformed variables. The default for quasi_fun is efun::normalize_text, which removes spaces, dots, commas and non-ASCII characters to avoid encoding issues.
Fuzzy match using a two-way 'contains' approach: powered by fuzzyjoin::fuzzy_left_join, with the matching function match_fun = ~ stringr::str_detect(.x, .y) | stringr::str_detect(.y, .x), so two values match when either one contains the other.
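
As a rough illustration of the fuzzy phase on its own, the sketch below calls fuzzyjoin::fuzzy_left_join with a two-way 'contains' predicate. The data frames, the column names and the helper name two_way_contains are made up for this example, and the matching function is written as a plain function rather than a formula. It assumes the fuzzyjoin and stringr packages are installed, and it is not the internal code of phased_left_join.

```r
library(fuzzyjoin)  # provides fuzzy_left_join()

# Illustrative data; the column names are made up for this example.
lhs <- data.frame(name = c("Initech Holdings", "Umbrella"),
                  stringsAsFactors = FALSE)
rhs <- data.frame(name = c("Initech", "Acme"),
                  revenue = c(10, 20),
                  stringsAsFactors = FALSE)

# Two values match when either one contains the other.
# Note that str_detect() treats its second argument as a regular expression.
two_way_contains <- function(x, y) {
  stringr::str_detect(x, y) | stringr::str_detect(y, x)
}

# Left join: every lhs row is kept, with NA in the rhs columns when nothing matches.
fuzzy_step <- fuzzy_left_join(lhs, rhs, by = "name", match_fun = two_way_contains)
```

Because the values are used as regular expressions, keys containing characters such as dots or parentheses may match more (or less) than a literal 'contains' test would.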

Usage

phased_left_join(
  lhs,
  rhs,
  phased_by,
  exact_by = NULL,
  drop_join_vars = TRUE,
  quasi_fun = normalize_text,
  suffix = c(".x", ".y")
)

Arguments

lhs: A data.frame-like object.
rhs: A data.frame-like object.
phased_by: A character vector of variables to join by. Use a named vector to join by different variables from lhs and rhs (à la dplyr; see ?dplyr::join). The variables indicated here are subject to the phased approach.
exact_by: An optional character vector of variables to join by, always using an exact match.
drop_join_vars: Whether to drop auxiliary variables created for the match (e.g. transformed variables). Defaults to TRUE, but setting it to FALSE may be useful for debugging.
quasi_fun: The function to apply to the phased_by variables for the quasi-exact match. Defaults to efun::normalize_text.
suffix: If there are non-joined duplicate variables in lhs and rhs, these suffixes are added to the output to disambiguate them. Should be a character vector of length 2.
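
To make the arguments concrete, here is a hedged example call. It assumes that phased_left_join is exported by the efun package (the package that provides normalize_text). The data frames and column names are invented for illustration, and the call is guarded so that it only runs when efun is installed.

```r
# Illustrative inputs; 'company', 'firm', 'country' and 'revenue' are made-up names.
lhs <- data.frame(company = c("Acme Inc.", "Globex, Corp"),
                  country = c("US", "US"),
                  stringsAsFactors = FALSE)
rhs <- data.frame(firm = c("Acme Inc.", "Globex Corp"),
                  country = c("US", "US"),
                  revenue = c(10, 20),
                  stringsAsFactors = FALSE)

# Assumption: the function lives in the efun namespace.
if (requireNamespace("efun", quietly = TRUE)) {
  joined <- efun::phased_left_join(
    lhs, rhs,
    phased_by = c("company" = "firm"),  # named vector: lhs variable = rhs variable
    exact_by = "country",               # always matched exactly
    drop_join_vars = FALSE              # keep auxiliary variables for inspection
  )
  print(joined)
}
```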

Value

A joined data frame.

Details

The three phases are applied in the order listed in the description, and each phase works only on the rows left unmatched by the previous phases.
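
The ordering described above can be sketched in base R on a single key vector. This is a simplified illustration and not the package's implementation: normalize_sketch and phased_match_sketch are hypothetical names, only the first matching rhs element is kept, the 'contains' test is literal (fixed = TRUE) rather than regex-based, and the fuzzy phase is applied to the raw keys, which is an assumption.

```r
# Hypothetical stand-in for quasi_fun; not efun::normalize_text itself.
# Drops spaces, dots and commas, then any non-ASCII character.
normalize_sketch <- function(x) {
  gsub("[^\\x00-\\x7F]", "", gsub("[ .,]", "", x), perl = TRUE)
}

# For each lhs key, returns the index of the first matching rhs key
# (NA if no phase finds one) and the phase that produced the match.
phased_match_sketch <- function(lhs_key, rhs_key, quasi_fun = normalize_sketch) {
  phase <- rep(NA_character_, length(lhs_key))

  # Phase 1: exact match on the raw keys.
  idx <- match(lhs_key, rhs_key)
  phase[!is.na(idx)] <- "exact"

  # Phase 2: quasi-exact match, only for rows still unmatched.
  todo <- which(is.na(idx))
  hit <- match(quasi_fun(lhs_key[todo]), quasi_fun(rhs_key))
  idx[todo] <- hit
  phase[todo[!is.na(hit)]] <- "quasi"

  # Phase 3: two-way literal 'contains', only for rows still unmatched.
  for (i in which(is.na(idx))) {
    contains <- vapply(
      rhs_key,
      function(r) grepl(r, lhs_key[i], fixed = TRUE) ||
        grepl(lhs_key[i], r, fixed = TRUE),
      logical(1),
      USE.NAMES = FALSE
    )
    hits <- which(contains)
    if (length(hits) > 0) {
      idx[i] <- hits[1]
      phase[i] <- "fuzzy"
    }
  }

  data.frame(lhs_key, rhs_index = idx, phase, stringsAsFactors = FALSE)
}

res <- phased_match_sketch(
  c("Acme Inc.", "Globex, Corp", "Initech Holdings", "Umbrella"),
  c("Acme Inc.", "Globex Corp", "Initech")
)
```

In this example the first key is resolved in the exact phase, the second in the quasi-exact phase, the third in the fuzzy phase, and the fourth stays unmatched, which is the left-join behaviour of keeping lhs rows with missing rhs values.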