This is a convenience function for importing data into a locally-run PostgreSQL server. It uses PostgreSQL's COPY command to import the data and, by default, automates the creation of the table, taking column names from the header (first row) of the file and guessing column types in the manner of ?data.table::fread.

pg_copy_file(
  con,
  file_path,
  table,
  sep = ";",
  nrows = 10000,
  header = TRUE,
  na.strings = c("", "NA", "NULL"),
  colClasses = NULL,
  col.names,
  encoding = "UTF-8",
  ...,
  unlogged = FALSE,
  drop_table = FALSE,
  cascade = FALSE,
  if_not_exists = FALSE,
  create_table = FALSE,
  execute = FALSE,
  quote = "\b",
  escape = quote,
  null = "NULL"
)
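
For a rough end-to-end sketch, the example below assumes a locally running PostgreSQL server, the RPostgres driver, and a hypothetical database name and file path; only pg_copy_file() itself comes from this package.

library(DBI)

# Connect to a locally running PostgreSQL server (RPostgres is assumed here;
# any DBI-compliant PostgreSQL driver should behave the same way).
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")   # hypothetical database

# Create the table from the file's header and inferred column types,
# then COPY the data in and get back a dplyr reference to the table.
flights <- pg_copy_file(
  con,
  file_path = "/tmp/flights.csv",   # hypothetical path, must be readable by the server
  table = "flights",
  sep = ",",
  create_table = TRUE,
  execute = TRUE
)

# The returned value can be used directly with dplyr verbs.
dplyr::count(flights)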

Arguments

con

A database connection.

file_path

Path to a text file (CSV, tab-delimited, etc.).

table

The table name, passed on to dbQuoteIdentifier(). Options are as follows (a short sketch follows the list):

  • a character string with the unquoted DBMS table name, e.g. "table_name",

  • a call to Id() with components to the fully qualified table name, e.g. Id(schema = "my_schema", table = "table_name")

  • a call to SQL() with the quoted and fully qualified table name given verbatim, e.g. SQL('"my_schema"."table_name"')
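
A minimal sketch of the three forms, assuming the DBI package, an existing connection con, and placeholder schema, table, and file names:

# Plain character string, quoted by dbQuoteIdentifier()
pg_copy_file(con, "/tmp/data.csv", table = "table_name")

# Fully qualified name built with DBI::Id()
pg_copy_file(con, "/tmp/data.csv", table = DBI::Id(schema = "my_schema", table = "table_name"))

# Quoted, fully qualified name passed verbatim via DBI::SQL()
pg_copy_file(con, "/tmp/data.csv", table = DBI::SQL('"my_schema"."table_name"'))

With the defaults (create_table = FALSE, execute = FALSE), calls like these only build the query without running it.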

sep

The separator between columns. Defaults to the character in the set [,\t |;:] that separates the sample of rows into the most number of lines with the same number of fields. Use NULL or "" to specify no separator; i.e. each line a single character column like base::readLines does.

nrows

The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by fread almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.

header

Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name.

na.strings

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".

colClasses

As in utils::read.csv; i.e., an unnamed vector of types corresponding to the columns in the file, or a named vector specifying types for a subset of the columns by name. The default, NULL means types are inferred from the data in the file. Further, data.table supports a named list of vectors of column names or numbers where the list names are the class names; see examples. The list form makes it easier to set a batch of columns to be a particular class. When column numbers are used in the list form, they refer to the column number in the file not the column number after select or drop has been applied. If type coercion results in an error, introduces NAs, or would result in loss of accuracy, the coercion attempt is aborted for that column with warning and the column's type is left unchanged. If you really desire data loss (e.g. reading 3.14 as integer) you have to truncate such columns afterwards yourself explicitly so that this is clear to future readers of your code.
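
As a short sketch of the accepted forms (the column names used here are hypothetical):

# Unnamed vector of types, one per column in the file
pg_copy_file(con, "/tmp/data.csv", table = "data",
             colClasses = c("integer", "character", "numeric"))

# Named vector setting types for a subset of columns
pg_copy_file(con, "/tmp/data.csv", table = "data",
             colClasses = c(id = "integer", price = "numeric"))

# Named list of column names per class (the data.table list form)
pg_copy_file(con, "/tmp/data.csv", table = "data",
             colClasses = list(character = c("id", "zip"), numeric = "price"))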

col.names

A vector of optional names for the variables (columns). The default is to use the header column if present or detected, or if not "V" followed by the column number. This is applied after check.names and before key and index.

encoding

default is "unknown". Other possible options are "UTF-8" and "Latin-1". Note: it is not used to re-encode the input, rather enables handling of encoded strings in their native encoding.

...

Other arguments used by individual methods.

unlogged

logical, whether to create an UNLOGGED table

drop_table

logical, whether to drop the table before creating it

cascade

logical, whether to add CASCADE to the DROP statement

if_not_exists

logical, to add IF NOT EXISTS to the query

create_table

logical, whether to create the table before copying. Otherwise, the table is assumed to already exist

execute

logical, whether to execute the query using con

quote

"quoting character to be used when a data value is quoted" the default value is just a dirty-little-trick to use a value very unlikely to appear

escape

"character that should appear before a data character that matches the QUOTE value"

null

"Specifies the string that represents a null value."

Value

a dplyr reference to the table

Details

This function uses PostgreSQL's COPY FROM filename command (https://www.postgresql.org/docs/current/static/sql-copy.html). According to the documentation, "The file must be accessible to the server and the name must be specified from the viewpoint of the server". Therefore, this function is intended to be used when you run the PostgreSQL server locally (although you could also use it against a remote server if you manage to place the file on the server machine).

  • Use UNLOGGED tables and, if possible, run the CREATE TABLE and COPY commands within the same transaction to improve performance (see the sketch after this list). See some discussion here

  • TODO: also use the COPY FROM PROGRAM syntax, to import directly from zip files: https://www.postgresql.org/docs/current/static/sql-copy.html

  • TODO: also perhaps use named pipes to read from zip files directly: http://www.ralree.com/2009/09/04/reading-compressed-files-with-postgres-using-named-pipes/ and https://stackoverflow.com/questions/41738829/importing-zipped-csv-file-into-postgresql
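
A minimal sketch of the UNLOGGED-table-plus-single-transaction tip, assuming a DBI connection con and a hypothetical file path; dbBegin() and dbCommit() are standard DBI calls:

# Create an UNLOGGED table and run the COPY within a single transaction.
DBI::dbBegin(con)
events <- pg_copy_file(
  con,
  file_path = "/tmp/events.csv",   # hypothetical path, visible to the server
  table = "events",
  unlogged = TRUE,                 # CREATE UNLOGGED TABLE ...
  create_table = TRUE,
  execute = TRUE
)
DBI::dbCommit(con)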