We use three 'phases' to left-join lhs and rhs:
1. Exact match: an exact match on both exact_by and phased_by.
2. Quasi-exact match: transforms the phased_by variables on lhs and rhs using quasi_fun, and then matches on the transformed variables. The default for quasi_fun is efun::normalize_text, which removes spaces, dots, commas and non-ASCII characters to avoid encoding issues.
3. Fuzzy match using a two-way 'contains' approach: this is powered by fuzzyjoin::fuzzy_left_join, using as matching function
   match_fun = ~ stringr::str_detect(.x, .y) | stringr::str_detect(.y, .x)
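To make the quasi-exact and fuzzy phases more concrete, here is a minimal sketch of the two comparisons involved. normalize_sketch is a hypothetical stand-in written only from the description above (it is not the actual efun::normalize_text implementation), and two_way_contains is an illustrative name wrapping the same expression as match_fun.

```r
library(stringr)

# Hypothetical stand-in for efun::normalize_text (assumption: it only drops
# non-ASCII characters, spaces, dots and commas, as described above).
normalize_sketch <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))        # code points of the string
    s <- intToUtf8(cp[cp < 128L])       # keep ASCII code points only
    gsub("[ .,]", "", s)                # drop spaces, dots and commas
  }, character(1), USE.NAMES = FALSE)
}

# Two-way 'contains': TRUE when either string is detected inside the other.
# Illustrative name; the expression mirrors the match_fun shown above.
two_way_contains <- function(x, y) {
  str_detect(x, y) | str_detect(y, x)
}

normalize_sketch(c("Acme, Inc.", "Caf\u00e9 S.A."))
# "AcmeInc" "CafSA"

two_way_contains(c("Acme Holdings", "Acme"), c("Acme", "Acme Holdings"))
# TRUE TRUE

two_way_contains("Acme", "Globex")
# FALSE
```

Note that stringr::str_detect interprets its second argument as a regular expression, so values containing characters such as dots or parentheses are treated as patterns rather than literal text in the fuzzy phase.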
phased_left_join(
lhs,
rhs,
phased_by,
exact_by = NULL,
drop_join_vars = TRUE,
quasi_fun = normalize_text,
suffix = c(".x", ".y")
)
lhs: A data.frame-like object.
rhs: A data.frame-like object.
phased_by: A character vector of variables to join by. Use a named vector to join by different variables from lhs and rhs (à la dplyr; see ?dplyr::join). The variables indicated here are subject to the phased approach.
exact_by: An optional character vector of variables to join by, always using an exact match.
drop_join_vars: Whether to drop auxiliary variables created for the match (e.g. transformed variables). Defaults to TRUE; setting it to FALSE may be useful for debugging.
quasi_fun: The function to apply to the phased_by variables for the quasi-exact match. Defaults to efun::normalize_text.
suffix: If there are non-joined duplicate variables in lhs and rhs, these suffixes are added to the output to disambiguate them. Should be a character vector of length 2.
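As an illustration of how the arguments fit together, the following is a hedged usage sketch. The data frames and their column names (company, firm_name, year, sales) are hypothetical, and the call assumes that phased_left_join is exported from the efun package with the signature shown in the usage above; it only runs if efun is installed.

```r
# Hypothetical example data; column names are illustrative only.
lhs <- data.frame(
  company = c("Acme Inc", "Globex, Corp.", "Initech"),
  year = c(2020L, 2020L, 2021L),
  sales = c(10, 20, 30),
  stringsAsFactors = FALSE
)
rhs <- data.frame(
  firm_name = c("Acme Inc", "Globex Corp", "Initech Systems"),
  year = c(2020L, 2020L, 2021L),
  sales = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Assumption: phased_left_join is available as efun::phased_left_join.
if (requireNamespace("efun", quietly = TRUE)) {
  joined <- efun::phased_left_join(
    lhs, rhs,
    phased_by = c(company = "firm_name"),  # named vector: lhs$company vs rhs$firm_name
    exact_by = "year",                     # always matched exactly
    drop_join_vars = FALSE,                # keep auxiliary variables for inspection
    suffix = c(".lhs", ".rhs")             # disambiguates the duplicated 'sales' column
  )
  print(joined)
}
```

Keeping drop_join_vars = FALSE here is only a debugging convenience, as suggested in the argument description.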
Returns a joined data frame.
We apply those phases in that order, and every phase works only on the unmatched rows from previous phases.
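The ordering described here can be sketched in base R for a single key column. This is a simplified illustration, not the package's implementation: phased_match_sketch and its default quasi_fun are hypothetical, only the first match per row is kept (a real left join would return every match), the 'contains' check uses literal rather than regex matching, and the fuzzy phase is assumed to compare the untransformed values.

```r
# Simplified, hypothetical sketch of the three phases on one key column.
phased_match_sketch <- function(lhs_key, rhs_key,
                                quasi_fun = function(x) gsub("[ .,]", "", x)) {
  match_idx <- rep(NA_integer_, length(lhs_key))
  phase <- rep(NA_character_, length(lhs_key))

  # Phase 1: exact match on the raw keys.
  m <- match(lhs_key, rhs_key)
  hit <- !is.na(m)
  match_idx[hit] <- m[hit]
  phase[hit] <- "exact"

  # Phase 2: quasi-exact match, only on rows still unmatched.
  todo <- which(is.na(match_idx))
  m <- match(quasi_fun(lhs_key[todo]), quasi_fun(rhs_key))
  hit <- !is.na(m)
  match_idx[todo[hit]] <- m[hit]
  phase[todo[hit]] <- "quasi"

  # Phase 3: two-way 'contains', only on rows still unmatched.
  todo <- which(is.na(match_idx))
  for (i in todo) {
    lhs_in_rhs <- grepl(lhs_key[i], rhs_key, fixed = TRUE)
    rhs_in_lhs <- vapply(rhs_key, function(r) grepl(r, lhs_key[i], fixed = TRUE),
                         logical(1), USE.NAMES = FALSE)
    hits <- which(lhs_in_rhs | rhs_in_lhs)
    if (length(hits) > 0) {
      match_idx[i] <- hits[1]  # simplification: keep the first match only
      phase[i] <- "fuzzy"
    }
  }

  data.frame(lhs_key = lhs_key, rhs_key = rhs_key[match_idx], phase = phase,
             stringsAsFactors = FALSE)
}

res <- phased_match_sketch(
  c("Acme Inc", "Globex, Corp.", "Initech", "Umbrella"),
  c("Acme Inc", "Globex Corp", "Initech Systems", "Hooli")
)
res$phase
# "exact" "quasi" "fuzzy" NA
```

Because each phase only looks at rows left unmatched by the earlier ones, a row that matches exactly is never reconsidered by the looser comparisons.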