R/text-file.R
guess_types.Rd
Guess data types in a delimited text file (thin wrapper on data.table::fread)
guess_types(
  file,
  sep = "auto",
  sep2 = "auto",
  dec = ".",
  quote = "\"",
  nrows = 10000,
  header = "auto",
  na.strings = c("", "NA", "NULL"),
  skip = "__auto__",
  select = NULL,
  drop = NULL,
  colClasses = NULL,
  col.names,
  check.names = FALSE,
  encoding = "unknown",
  ...
)
file
File name in the working directory, a path to a file (passed through path.expand for convenience), or a URL starting with http://, file://, etc. Compressed files with extension .gz or .bz2 are supported if the R.utils package is installed.
sep
The separator between columns. Defaults to the character in the set [,\t |;:] that splits the sample of rows into the greatest number of lines with the same number of fields. Use NULL or "" to specify no separator, i.e., each line is read as a single character column, as base::readLines does.
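As a rough illustration of separator detection, the sketch below calls data.table::fread directly, on the assumption that guess_types forwards sep unchanged. The sample text is invented, and header = FALSE is set in the second call only to keep every line as data.

```r
library(data.table)

# Invented sample: three semicolon-separated columns.
txt <- "a;b;c\n1;2;3\n4;5;6\n"

# sep = "auto" (the default) should pick ";" because it splits every
# sampled line into the same number of fields.
detected <- fread(text = txt)

# sep = "" disables splitting: each line becomes one character value.
one_col <- fread(text = txt, sep = "", header = FALSE)

print(dim(detected))
print(dim(one_col))
```

Stating sep explicitly is the safer choice when a file is known to contain several candidate characters inside its fields.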
sep2
The separator within columns. A list column will be returned where each cell is a vector of values. This is much faster and uses less working memory than applying strsplit afterwards or similar techniques. sep2 can be different for each column and is the first character in the same set above, [,\t |;], other than sep, that exists inside each field outside quoted regions in the sample. NB: sep2 is not yet implemented.
dec
The decimal separator, as in utils::read.csv. If not "." (the default), then usually ",". See Details.
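A minimal sketch of a non-default decimal separator follows. It calls data.table::fread directly, assuming guess_types passes dec and sep through unchanged; the sample values are made up.

```r
library(data.table)

# Invented sample: semicolon-separated fields with comma decimals.
txt <- "price;qty\n1,5;2\n2,25;4\n"

# State both separators explicitly rather than relying on detection.
dt <- fread(text = txt, sep = ";", dec = ",")

# Inspect the inferred column types.
print(sapply(dt, class))
```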
quote
By default ("\""), if a field starts with a double quote, fread handles embedded quotes robustly, as explained under Details. If that fails, another attempt is made to read the field as is, i.e., as if quotes were disabled. Setting quote="" always reads the field as if quotes were disabled. You should not normally need to pass anything other than "" to quote, i.e., to turn quoting off.
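The difference between the default and quote="" can be sketched as below, again calling data.table::fread directly under the assumption that guess_types forwards quote as is. The input is an invented two-line sample.

```r
library(data.table)

# Invented sample with one quoted field.
txt <- 'id,name\n1,"Ann"\n'

# Default quoting: the surrounding double quotes are removed.
quoted <- fread(text = txt)

# quote = "": the field is read as is, quote characters included.
literal <- fread(text = txt, quote = "")

print(nchar(quoted$name))
print(nchar(literal$name))
```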
nrows
The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed, because fread already determines that almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; this is useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
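The dry run described for nrows=0 might look like the sketch below. It uses data.table::fread directly, on the assumption that guess_types hands nrows to fread unchanged; the column names and values are invented.

```r
library(data.table)

# Invented sample with an integer, a double and a character column.
txt <- "id,score,label\n1,2.5,x\n2,3.5,y\n"

# nrows = 0 keeps the column names and inferred types but no data rows.
empty <- fread(text = txt, nrows = 0)

# The guessed type of each column.
print(sapply(empty, class))
```

Running the same call over several files and comparing the resulting name and class vectors is one way to check format consistency before a full read.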
header
Does the first data line contain column names? The default is chosen according to whether every non-empty field on the first data line is of type character. If so, or if TRUE is supplied, any empty column names are given a default name.
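To illustrate the header rule, the sketch below reads two invented inputs with data.table::fread, assuming guess_types leaves header detection to fread.

```r
library(data.table)

# First line is all character, so it should be taken as column names.
with_names <- fread(text = "id,label\n1,x\n2,y\n")

# First line contains a number, so it should be treated as data and
# default names are generated instead.
no_names <- fread(text = "1,x\n2,y\n")

print(names(with_names))
print(names(no_names))
```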
na.strings
A character vector of strings which are to be interpreted as NA values. By default, ",," is read as NA for columns of all types, including type character, for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as a blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted, since that is how the string literal ,"NA", is distinguished from ,NA, (for example, when na.strings="NA").
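The quoted versus unquoted distinction can be sketched as follows. The example calls data.table::fread directly and passes na.strings explicitly, assuming guess_types forwards the argument unchanged; the data are invented.

```r
library(data.table)

# Invented sample: unquoted NA, quoted "NA", and an empty field.
txt <- 'id,name\n1,NA\n2,"NA"\n3,\n'

# Treat both the empty field and the unquoted string NA as missing.
dt <- fread(text = txt, na.strings = c("", "NA"))

# Which rows of name are missing?
print(is.na(dt$name))
```

Because only unquoted matches count, a file that quotes every field will keep its "NA" markers as ordinary strings and may need a cleanup step after reading.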
skip
If 0 (the default behavior; the usage shows the default as "__auto__"), reading starts on the first line and from there finds the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. skip>0 means ignore the first skip rows manually. skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
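A short sketch of skip="string" is shown below, using data.table::fread directly on the assumption that guess_types forwards skip. The preamble lines and data are invented.

```r
library(data.table)

# Invented sample: two lines of free-text preamble before the table.
txt <- "Export from lab system\nnotes: none\nid,value\n1,10\n2,20\n"

# Start reading at the line that contains the given substring.
dt <- fread(text = txt, skip = "id,value")

print(names(dt))
print(nrow(dt))
```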
select
A vector of column names or numbers to keep; the rest are dropped. select may specify types too, in the same way as colClasses, i.e., a vector of colname=type pairs, or a list of type=col(s) pairs. In all forms of select, the order in which the columns are specified determines the order of the columns in the result.
drop
A vector of column names or numbers to drop; the rest are kept.
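The following sketch shows select and drop side by side. It calls data.table::fread directly, assuming guess_types passes both arguments through; the sample table is invented, and the typed form of select assumes a reasonably recent data.table release.

```r
library(data.table)

# Invented sample with three columns.
txt <- "a,b,c\n1,x,2.5\n2,y,3.5\n"

# Keep two columns; the order given is the order returned.
kept <- fread(text = txt, select = c("c", "a"))

# Drop one column and keep the rest in file order.
dropped <- fread(text = txt, drop = "b")

# Select and set types in one step using colname = type pairs.
typed <- fread(text = txt, select = c(a = "character", c = "numeric"))

print(names(kept))
print(names(dropped))
print(sapply(typed, class))
```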
colClasses
As in utils::read.csv, i.e., an unnamed vector of types corresponding to the columns in the file, or a named vector specifying types for a subset of the columns by name. The default, NULL, means types are inferred from the data in the file. Further, data.table supports a named list of vectors of column names or numbers where the list names are the class names; see examples. The list form makes it easier to set a batch of columns to a particular class. When column numbers are used in the list form, they refer to the column number in the file, not the column number after select or drop has been applied.
If type coercion results in an error, introduces NAs, or would result in loss of accuracy, the coercion attempt is aborted for that column with a warning and the column's type is left unchanged. If you really desire data loss (e.g. reading 3.14 as integer), you have to truncate such columns afterwards yourself, explicitly, so that this is clear to future readers of your code.
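Both points above, the named list form and the refusal to coerce lossily, can be sketched as below. The code calls data.table::fread directly, assuming guess_types forwards colClasses; the data are invented, and suppressWarnings is used only to keep the demonstration quiet.

```r
library(data.table)

# Invented sample: codes with leading zeros and a decimal amount.
txt <- "id,code,amount\n1,007,3.14\n2,042,2.72\n"

# List form: class name -> columns. Reading code as character keeps
# the leading zeros that integer parsing would discard.
dt <- fread(text = txt, colClasses = list(character = c("id", "code")))

# Asking for a lossy type: the override is expected to be ignored with
# a warning, leaving amount as a double column.
forced <- suppressWarnings(
  fread(text = txt, colClasses = c(amount = "integer"))
)

print(sapply(dt, class))
print(class(forced$amount))
```

If whole numbers are really wanted, an explicit step such as as.integer(trunc(forced$amount)) after reading makes the data loss visible in the code.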
col.names
A vector of optional names for the variables (columns). The default is to use the header row if present or detected, or if not, "V" followed by the column number. This is applied after check.names and before key and index.
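A brief sketch of supplying names for a headerless input, using data.table::fread directly and assuming guess_types forwards col.names; the values are invented.

```r
library(data.table)

# Invented sample with no header row.
txt <- "1,2\n3,4\n"

# Replace the default V1, V2 names with explicit ones.
dt <- fread(text = txt, col.names = c("x", "y"))

print(names(dt))
```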
check.names
Default is FALSE. If TRUE, the names of the variables in the data.table are checked to ensure that they are syntactically valid variable names. If necessary, they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
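The effect of check.names can be sketched as follows, calling data.table::fread directly on the assumption that guess_types forwards the argument. The header is invented and deliberately contains a space, a hyphen and a duplicate; sep is stated explicitly so the space is not considered as a separator.

```r
library(data.table)

# Invented sample with awkward and duplicated column names.
txt <- "my col,my col,x-y\n1,2,3\n"

# Default: names are kept exactly as they appear in the file.
as_is <- fread(text = txt, sep = ",")

# check.names = TRUE: names are made syntactically valid and unique.
fixed <- fread(text = txt, sep = ",", check.names = TRUE)

print(names(as_is))
print(names(fixed))
```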
encoding
Default is "unknown". Other possible options are "UTF-8" and "Latin-1". Note: it is not used to re-encode the input; rather, it enables handling of encoded strings in their native encoding.
...
Further arguments passed to data.table::fread.
A data.table.