Source: R/tar_duck_r.R, R/tar_duck_rmd.R
Creates a target that keeps a DBI connection to a DuckDB database, creating one DuckDB file per target. With tar_duck_r() the user writes R code and can interact with the DuckDB database directly via DBI (e.g. dbGetQuery(), dbExecute()) or through dbplyr-backed operations (so either write SQL as strings or let dbplyr write the SQL for you). With tar_duck_rmd() the user writes the code in an Rmd notebook, mainly to take advantage of SQL chunks and write pure SQL.
Watch out: this target factory does very unorthodox things! It does not have many guardrails, it can be fragile, and it could break in unexpected ways. We do ugly things here; there are side effects all over the place. It is still experimental and tailored to very specific use cases. See the details for a description of what it does, how it does it, and why we decided to do it that way. So, use it only if you can stomach all that (or if you are part of the team and kind of have to ;)).
tar_duck_r(
  target_name,
  command,
  duckdb_path = paste0("./duckdb/",
    targets::tar_deparse_language(substitute(target_name))),
  conn_name = "db"
)

tar_duck_rmd(
  sql_rmd,
  duckdb_path = paste0("./duckdb/", target_name),
  target_name = make.names(fs::path_ext_remove(fs::path_file(sql_rmd))),
  conn_name = "db"
)
target_name: the target name, as in name in targets::tar_target().
command: R code to run the target, just like in targets::tar_target(), but this code can (and should) take advantage of the connection to a DuckDB database whose name is given by conn_name.
duckdb_path: path of the DuckDB file. The default uses the target name as the file name (without extension) and saves it in a duckdb folder in the current directory. It can be set to any relative or absolute path.
conn_name: name to give to the DBI connection to the DuckDB database.
sql_rmd: path to the Rmd file.
Value: a custom target.
This target factory aims to support the following workflow / use case:
Write SQL code in an Rmd file.
Render the Rmd file, running the SQL code against a DuckDB database.
Each Rmd file runs against a single DuckDB file (a persistent DB), so typically each Rmd file creates one table in the DB as part of some data wrangling step.
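To make that concrete, here is a minimal sketch of what such an Rmd notebook could look like; the chunk option, table name, and CSV path are illustrative and not taken from the package (knitr's SQL engine receives a DBI connection through the connection chunk option, and tar_duck_rmd() exposes its connection under conn_name, "db" by default). If this file were called trips.Rmd, the default target name would be trips, so creating a table called trips lets the read method hand back a handle to it later:

```{sql, connection=db}
-- `db` is the DuckDB connection managed by tar_duck_rmd()
CREATE OR REPLACE TABLE trips AS
SELECT * FROM read_csv_auto('data/trips.csv');
```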
Here is a list of the idiosyncrasies of this target factory:
This is perhaps the most controversial: we keep the DBI connection in the global environment. There might be a better way, and we would need to take a closer look at how targets manages environments to find it. But for now this works as intended; you just have to be aware of it and, of course, make sure you do not create conflicts by defining variables in your code with the same name as the one given to the connection (conn_name).
To keep it as flexible as possible, we do not want to use the targets storage for the DuckDB files. Instead, we decided to keep the files somewhere else (duh, in the duckdb folder, or wherever you want, since you can simply pass the full path to the target factory). In the targets storage we keep only some metadata needed to re-create the DBI connection.
So, what about the DBI connection? Well, connections are one of those non-exportable objects that are tied to a given R session and cannot easily be saved and simply read back in another session, so the connection cannot be the return value of the target. Typical workarounds in targets include custom functions that create the connection and close it after the job is done, and hooks with a similar purpose (https://books.ropensci.org/targets/targets.html#return-value). tar_duck_* instead creates and closes the DBI connection behind the scenes and makes it available in the target's command environment, so that the user's code can use it simply by referring to the object with the name passed to tar_duck_* via the conn_name argument (see the sketch after this list).
So what does it mean to read/load a target that has already been built? In other words, what is the return object of a tar_duck_* target? We use a custom format in which the read method does the heavy lifting. The write method is simply saveRDS(), just like in any tar_target(). The read method, however, makes sure a DBI connection pointing to the duckdb file is available, so users should access the data they are interested in via that connection. Yet the most common use case we have is one table per target. Thus, if the target's database contains a table matching the target's name, we also return a dplyr/dbplyr handle to that table (see also the closing comment in the examples).
But what about dependencies? If you keep the data from each target in separate duckdb files, how can you write a new target that uses data from one or more upstream targets? Here tar_duck_* makes use of DuckDB's ATTACH functionality: it makes sure to attach all the duckdb files that the current target depends on to the current connection.
We currently hash the duckdb file when the target's command finishes and keep that hash in the target's storage. We do this to make sure changes propagate downstream; it could potentially also be used to detect changes in the data made outside the pipeline. Check the TODOs.
To keep the target scripts as concise as possible, tar_duck_rmd() uses the file name of the Rmd as the name of the target (it is just a default; you can override it by explicitly giving another name).
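Putting the pieces above together, here is a minimal sketch of a single tar_duck_r() target driven by raw SQL through DBI (the target name, CSV path, and SQL are illustrative, not the package's internal code; for targets that combine data from several upstream tar_duck_* targets via their attached databases, see the dbplyr-based examples below):

flowme::tar_duck_r(sales_duck, {
  # `db` is the DBI connection created behind the scenes
  # (conn_name = "db" by default); it points to ./duckdb/sales_duck
  DBI::dbExecute(db, "
    CREATE OR REPLACE TABLE sales_duck AS
    SELECT * FROM read_csv_auto('data/sales.csv')
  ")
  # the same connection can be queried directly
  DBI::dbGetQuery(db, "SELECT COUNT(*) AS n FROM sales_duck")
})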
Why this? Transitioning from a Postgres-backed data pipeline to DuckDB, we hit a couple of issues with the size of the DuckDB file: i) connecting/disconnecting to large files can be slow (EDIT: there have been improvements recently and DuckDB now has "Lazy-loading of storage metadata for faster startup times"), and ii) the files can get pretty big real quick, which can cause storage issues (how big? currently, one pipeline hit almost 2TB).

To deal with this, we want to experiment with another recently introduced DuckDB feature, ATTACH, which basically lets you connect to another DuckDB database that can be read from and written to from within the currently running DB/connection. So the plan is to try to keep one table per duckdb file, which should limit the individual file size to a typical table size (i.e. around 25GB), thereby helping to solve or circumvent the issues above. Instead of having a monolithic DuckDB file with all the data, we would have one file per table, and we can ATTACH other tables as needed on the fly.

Now, this can be cumbersome, because you need to handle each connection and manually attach each DB you need, and then manually edit DuckDB's search path to emulate a single database (or you would have to use fully qualified names to refer to the tables, because ATTACH puts each file in its own catalog). And here is where targets can be handy, because it is super smart in handling dependencies and loading them on the fly. So the idea is to leverage that power in a bunch of targets that handle the DBI connection to each duckdb file, opening it and attaching dependencies on the fly. Let's see how it goes ...
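For context, this is roughly what the manual bookkeeping looks like without the target factory (paths, aliases, and the search_path value are illustrative; check DuckDB's ATTACH documentation for the exact semantics):

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb/orders")
DBI::dbExecute(con, "ATTACH 'duckdb/customers' AS customers")
# without touching the search path, attached tables need fully
# qualified names (catalog.schema.table) ...
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM customers.main.customers")
# ... or you edit the search path to emulate a single database
DBI::dbExecute(con, "SET search_path = 'orders,customers'")
DBI::dbDisconnect(con, shutdown = TRUE)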
tar_duck_rmd(): write the SQL code in an Rmd notebook.
targets::tar_script({
  suppressMessages(library(dplyr))
  list(
    # a couple of targets to ingest the data
    # (note the use of the DBI connection, named `db`)
    flowme::tar_duck_r(band_members_duck, {
      dplyr::copy_to(db, dplyr::band_members, name = "band_members_duck",
                     overwrite = TRUE, temporary = FALSE)
    }),
    flowme::tar_duck_r(band_instruments_duck, {
      dplyr::copy_to(db, dplyr::band_instruments, name = "band_instruments_duck",
                     overwrite = TRUE, temporary = FALSE)
    }),
    # a "normal" target that depends on both upstream targets
    # band_members_duck and band_instruments_duck. `tar_duck_r`'s
    # custom load mechanism makes sure you get a valid
    # dplyr/dbplyr handle to each table, stored in a DuckDB database.
    # None of the data has been read into R; everything operates
    # seamlessly on DuckDB. But since this is a normal target, to
    # save the results you need to collect and return them.
    # Notice that although band_members_duck and
    # band_instruments_duck are stored in separate DuckDB databases,
    # you can operate seamlessly on both of them simultaneously.
    tar_target(join_to_targets, {
      band_members_duck |>
        left_join(band_instruments_duck) |>
        collect() |> # collect the data from DuckDB into R
        arrange(name)
    }),
    # Now, if you want to keep operating on DuckDB (e.g. because the
    # output will also be larger than memory), you would keep using
    # `tar_duck_r`. So, similar to the above, but now it is again a
    # `tar_duck_r` target and the result of the join should be saved
    # in a duckdb file.
    flowme::tar_duck_r(join_to_duck, {
      foo <- band_members_duck |>
        left_join(band_instruments_duck) |>
        collect() |>
        arrange(name)
      dplyr::copy_to(db, foo, name = "join_to_duck",
                     overwrite = TRUE, temporary = FALSE)
    }),
    NULL
  )
})
targets::tar_visnetwork()
# After running tar_make() you will find a duckdb file for each of
# the `tar_duck_r` targets above. Go check that.
# See the testthat files for examples of `tar_duck_rmd`
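# After tar_make(), reading one of the `tar_duck_r` targets (for example
# targets::tar_read(join_to_duck)) goes through the custom read method:
# it re-creates the DBI connection to that target's duckdb file and,
# since a table matching the target's name exists there, returns a
# dplyr/dbplyr handle to that table instead of pulling the data into R.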