Left-joins lhs and rhs in three 'phases':

  • Exact match: rows are matched on identical values of both exact_by and phased_by.

  • Quasi-exact match: transforms the phased_by variables on lhs and rhs with quasi_fun, then matches on the transformed variables. The default quasi_fun is efun::normalize_text, which removes spaces, dots, commas and non-ASCII characters to avoid encoding issues.

  • Fuzzy match using a two-way 'contains' approach: powered by fuzzyjoin::fuzzy_left_join with match_fun = ~ stringr::str_detect(.x, .y) | stringr::str_detect(.y, .x), so a pair of values matches when either one is detected inside the other.
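The two-way 'contains' rule of the fuzzy phase can be illustrated with a minimal base-R sketch. The helper name two_way_contains is hypothetical and not part of the package. The sketch also assumes literal matching (fixed = TRUE), whereas stringr::str_detect interprets its pattern as a regular expression by default.

```r
# Hypothetical helper: TRUE when x[i] is found inside y[i] or y[i] inside x[i].
# Assumption: literal (fixed) matching; str_detect() would treat the
# pattern as a regular expression instead.
two_way_contains <- function(x, y) {
  mapply(
    function(a, b) grepl(b, a, fixed = TRUE) || grepl(a, b, fixed = TRUE),
    x, y,
    USE.NAMES = FALSE
  )
}

lhs_names <- c("Acme", "Acme Holdings", "Globex")
rhs_names <- c("Acme Corp", "Acme", "Initech")

# Compares the vectors pairwise: the first two pairs match in one
# direction each, the third pair does not match at all.
two_way_contains(lhs_names, rhs_names)
```

Because the rule is symmetric, it does not matter which of the two tables holds the longer variant of a name.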

phased_left_join(
  lhs,
  rhs,
  phased_by,
  exact_by = NULL,
  drop_join_vars = TRUE,
  quasi_fun = normalize_text,
  suffix = c(".x", ".y")
)

Arguments

lhs

A data.frame-like object (the left-hand side of the join).

rhs

A data.frame-like object (the right-hand side of the join).

phased_by

A character vector of variables to join by. Use a named vector to join by different variables from lhs and rhs (à la dplyr; see ?dplyr::join). The variables given here are subject to the phased approach.
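As a rough illustration of the named-vector convention, the sketch below resolves a named phased_by into lhs and rhs column names and performs only the exact phase with base::merge. The data frames and column names are made up for the example, and merge is a stand-in here, not what the function uses internally.

```r
# Made-up example data
lhs <- data.frame(company = c("Acme", "Globex"), revenue = c(10, 20),
                  stringsAsFactors = FALSE)
rhs <- data.frame(firm_name = c("Acme", "Initech"), country = c("US", "UK"),
                  stringsAsFactors = FALSE)

# dplyr convention: names are lhs columns, values are rhs columns
phased_by <- c(company = "firm_name")

# Exact left join only, using base R as a stand-in
joined <- merge(lhs, rhs,
                by.x = names(phased_by),
                by.y = unname(phased_by),
                all.x = TRUE)
joined
```

The lhs row for "Globex" has no exact counterpart in rhs, so its country is NA; rows like this are the ones the later phases would try to match.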

exact_by

An optional character vector of variables to join by, always using an exact match.

drop_join_vars

Whether to drop the auxiliary variables created for the match (e.g. transformed variables). Defaults to TRUE; setting it to FALSE can be useful for debugging.

quasi_fun

The function applied to the phased_by variables for the quasi-exact match. Defaults to efun::normalize_text.
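A custom quasi_fun only needs to take a character vector and return a character vector of the same length. The sketch below is a hypothetical normaliser that loosely follows the description of the default (dropping spaces, dots, commas and non-ASCII characters). It is not the efun::normalize_text implementation, and the name simple_normalize is made up.

```r
# Hypothetical normaliser; NOT the efun::normalize_text implementation.
simple_normalize <- function(x) {
  # Drop whitespace, dots and commas
  x <- gsub("[[:space:].,]", "", x)
  # Drop every character outside the printable ASCII range
  gsub("[^\\x20-\\x7E]", "", x, perl = TRUE)
}

# Two spellings of the same name collapse to the same key
simple_normalize(c("Acme, Inc.", "Acme Inc"))
```

Such a function could then be passed as quasi_fun = simple_normalize. A stricter variant might also lower-case its input with tolower(); whether the default does so is not stated here.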

suffix

If there are non-joined duplicate variables in lhs and rhs, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

Value

A joined data frame.

Details

The phases are applied in the order listed above, and each phase works only on the rows left unmatched by the previous phases.
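The phase ordering can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: it assumes a single key column, keeps only the first match per lhs row, treats "unmatched rows" as unmatched lhs rows, and runs the fuzzy phase on the untransformed keys with literal matching. The function name phased_match is hypothetical.

```r
# Hypothetical sketch of the three phases on a single key column.
phased_match <- function(lhs_key, rhs_key, quasi_fun) {
  phase <- rep(NA_character_, length(lhs_key))

  # Phase 1: exact match on the raw keys
  idx <- match(lhs_key, rhs_key)
  phase[!is.na(idx)] <- "exact"

  # Phase 2: quasi-exact match, only for lhs rows still unmatched
  todo <- which(is.na(idx))
  idx[todo] <- match(quasi_fun(lhs_key[todo]), quasi_fun(rhs_key))
  phase[todo[!is.na(idx[todo])]] <- "quasi"

  # Phase 3: two-way 'contains', only for lhs rows still unmatched
  for (i in which(is.na(idx))) {
    hit <- which(vapply(
      rhs_key,
      function(r) grepl(r, lhs_key[i], fixed = TRUE) ||
        grepl(lhs_key[i], r, fixed = TRUE),
      logical(1),
      USE.NAMES = FALSE
    ))
    if (length(hit) > 0) {
      idx[i] <- hit[1]
      phase[i] <- "fuzzy"
    }
  }

  data.frame(lhs = lhs_key, rhs = rhs_key[idx], phase = phase,
             stringsAsFactors = FALSE)
}

res <- phased_match(
  lhs_key   = c("Acme", "Glo bex", "Initech Ltd", "Umbrella"),
  rhs_key   = c("Acme", "Globex", "Initech", "Hooli"),
  quasi_fun = function(x) gsub("[[:space:].,]", "", x)
)
res
```

In this example each of the first three lhs keys is resolved by a different phase, and "Umbrella" stays unmatched, which in a left join means its rhs columns are NA.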