Fast Read and Fast Write

The fread() and fwrite() functions in the data.table R package are not only optimized for speed on large files, but also offer powerful and convenient features for working with small datasets. This vignette highlights their usability, flexibility, and performance for efficient data import and export.

1. fread()

1.1 Using command line tools directly

The fread() function from data.table can read data piped from shell commands, letting you filter or preprocess data before it even enters R.

# Create a sample file with some unwanted lines
writeLines(
'HEADER: Some metadata
HEADER: More metadata
1 2.0 3.0
2 4.5 6.7
HEADER: Yet more
3 8.9 0.1
4 1.2 3.4',
"example_data.txt")

library(data.table)
fread("grep -v HEADER example_data.txt")
#       V1    V2    V3
#    <int> <num> <num>
# 1:     1   2.0   3.0
# 2:     2   4.5   6.7
# 3:     3   8.9   0.1
# 4:     4   1.2   3.4

The -v option makes grep return all lines except those containing the string ‘HEADER’.

“Given the number of high quality engineers that have looked at the command tool grep over the years, it is most likely that it is as fast as you can get, as well as being correct, convenient, well documented online, easy to learn and search for solutions for specific tasks. If you need to perform more complex string filtering (e.g., matching strings at the beginning or end of lines), the grep syntax is very powerful. Learning its syntax is a transferable skill for other languages and environments.”

— Matt Dowle

Look at this example for more detail.

On Windows, command line tools like grep are available through various environments, such as Rtools, Cygwin, or the Windows Subsystem for Linux (WSL). On Linux and macOS, these tools are typically included with the operating system.

1.1.1 Reading directly from a text string

fread() can read data directly from a character string in R using the text argument. This is particularly handy for creating reproducible examples, testing code snippets, or working with data generated programmatically within your R session. Each line in the string should be separated by a newline character \n.

my_data_string = "colA,colB,colC\n1,apple,TRUE\n2,banana,FALSE\n3,orange,TRUE"
dt_from_text = fread(text = my_data_string)
print(dt_from_text)
#     colA   colB   colC
#    <int> <char> <lgcl>
# 1:     1  apple   TRUE
# 2:     2 banana  FALSE
# 3:     3 orange   TRUE
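
The text argument also accepts a character vector with one element per line, which can be easier to read than a single string with embedded \n. A minimal sketch (object names are illustrative only):

```r
library(data.table)

# One element per line instead of a single "\n"-separated string
lines_vec = c("colA,colB", "1,apple", "2,banana")
dt_from_lines = fread(text = lines_vec)

# The same data supplied as a single string
dt_from_string = fread(text = "colA,colB\n1,apple\n2,banana")

# Both forms should produce the same table
all.equal(dt_from_lines, dt_from_string)
```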

1.1.2 Reading from URLs

fread() can read data directly from web URLs by passing the URL as a character string to its file argument. This allows you to download and read data from the internet in one step.

# dt = fread("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
# print(dt)

1.1.3 Automatic decompression of compressed files

In many cases, fread() can automatically detect and decompress files with common compression extensions directly, without needing an explicit connection object or shell commands. This works by checking the file extension.

Supported extensions typically include:

.gz / .bz2 (gzip / bzip2): Supported when the R.utils package is installed.
.zip / .tar (ZIP / tar archives, single file): Supported; fread() reads the archived file when the archive contains exactly one file.

Note: If the archive contains more than one file, fread() stops with an error.
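
As a rough illustration, the sketch below writes a small gzip-compressed CSV with fwrite() and reads it back. It assumes the R.utils package may or may not be installed (fread() relies on it for .gz and .bz2 input), so the read is guarded; the object names are illustrative.

```r
library(data.table)

dt = data.table(id = 1:3, value = c(2.5, 4.5, 8.9))
gz_file = tempfile(fileext = ".csv.gz")

# With the default compress = "auto", fwrite() gzips output when the name ends in .gz
fwrite(dt, gz_file)

# Reading .gz input depends on R.utils, so only attempt it when available
if (requireNamespace("R.utils", quietly = TRUE)) {
  dt_back = fread(gz_file)
  print(all.equal(dt, dt_back))
}
unlink(gz_file)
```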

1.2 Automatic separator and skip detection

fread automates delimiter and header detection, eliminating the need for manual specification in most cases. You simply provide the filename—fread intelligently detects the structure:

Separator Detection

fread tests common separators (",", "\t", "|", space, ":", ";") and selects the one that results in the most consistent number of fields across sampled rows. For non-standard delimiters, you can override this using the sep= parameter.
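
A small sketch of both cases, using inline text so it runs without any files. The "~" separator is only an example of a delimiter outside the automatically tested set.

```r
library(data.table)

# Semicolon-separated input: the separator is detected automatically
dt_semicolon = fread(text = "a;b;c\n1;2;3\n4;5;6")

# Pipe-separated input is detected the same way
dt_pipe = fread(text = "a|b|c\n1|2|3\n4|5|6")

# A separator outside the tested set is supplied explicitly via sep=
dt_tilde = fread(text = "a~b~c\n1~2~3\n4~5~6", sep = "~")

names(dt_semicolon)
```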

Header Detection

After applying any skip or nrows settings (if specified), the first row with a consistent number of fields is examined:

If every non-empty field on this line is read as character (that is, none of them parses as a number or another non-character type), the line is typically used as the header (column names).

Otherwise (e.g., if the line contains fields detected as numeric), it is treated as a data row, and default column names (V1, V2, …) are assigned.

You can explicitly tell fread whether a header exists using header = TRUE or header = FALSE.
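
The three situations can be sketched with inline text (the column names id and score are made up for the example):

```r
library(data.table)

# First row is all character, rows below are numeric: used as the header
with_header = fread(text = "id,score\n1,2.5\n2,4.0")
names(with_header)   # "id" "score"

# First row already looks like data: default names are assigned
no_header = fread(text = "1,2.5\n2,4.0")
names(no_header)     # "V1" "V2"

# Override detection and keep the first row as data
forced = fread(text = "id,score\n1,2.5\n2,4.0", header = FALSE)
nrow(forced)         # 3
```

Note that with header = FALSE the former header row becomes data, so both columns in forced are read as character.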

Skip Detection

By default (skip="__auto__"), fread starts at the first line and looks for the first row with a consistent number of columns, which automatically skips blank lines and irregular metadata or comment lines (e.g., starting with #) before the header. To skip lines manually, use:

skip=n to skip the first n lines.
skip="string" to search for a line containing a substring (typically from the column names, like skip="Date"). Reading begins at the first matching line. This is useful for skipping metadata, or selecting sub-tables in multi-table files. This feature is inspired by the read.xls function in the gdata package.
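
Both manual forms can be sketched on a small inline input with two metadata lines in front of the table (the content of raw is invented for illustration):

```r
library(data.table)

raw = "Report produced by an instrument\nSite: A\nDate,Value\n2024-01-01,1.5\n2024-01-02,2.5"

# Skip a fixed number of leading lines
by_count = fread(text = raw, skip = 2)

# Or start reading at the first line that contains the substring "Date"
by_string = fread(text = raw, skip = "Date")

names(by_string)   # "Date" "Value"
```

The string form is usually more robust when the number of metadata lines varies between files.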

1.3 High-Quality Automatic Column Type Detection

Many real-world datasets contain columns that are initially blank, zero-filled, or appear numeric but later contain characters. To handle such inconsistencies, fread() employs a robust column type detection strategy.

Since v1.10.5, fread() samples rows by reading blocks of contiguous rows from multiple equally spaced points across the file, including the start, middle, and end. The total number of rows sampled is chosen dynamically based on the file size and structure, and is typically around 10,000, but can be smaller or slightly larger. This wide sampling helps detect type changes that occur later in the data (e.g., 001 to 0A0 or blanks becoming populated).

Efficient File Access with mmap

To implement this sampling efficiently, fread() uses the operating system’s memory-mapped file access (mmap), allowing it to jump to arbitrary positions in the file without sequential scanning. This lazy, on-demand strategy makes sampling nearly instantaneous, even for very large files.

If a jump lands within a quoted field that includes newlines, fread() tests subsequent lines until it finds 5 consecutive rows with the expected number of fields, ensuring correct parsing even in complex files.

Accurate and Optimized Type Detection

The type for each column is inferred based on the lowest required type from the following ordered list:

logical < integer < integer64 < double < character

This ensures:

Single up-front allocation of memory using the correct type
Avoidance of rereading the file or manually setting colClasses
Improved speed and memory efficiency

Out-of-Sample Type Exceptions

If a type change occurs outside the sampled rows, fread() automatically detects it and rereads the file to ensure correct type assignment, without requiring user intervention. For example, a column sampled as integer might later contain 00A — triggering an automatic reread as character.

All detection logic and any rereads are detailed when verbose=TRUE is enabled.
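
A compact sketch of the type hierarchy on inline data. The column names are illustrative; the code column mixes 001 and 0A0, so it should end up as character.

```r
library(data.table)

dt_types = fread(text = "flag,count,ratio,code\nTRUE,1,0.5,001\nFALSE,2,1.5,0A0")

# Lowest type that fits each column:
# flag = logical, count = integer, ratio = numeric (double), code = character
sapply(dt_types, class)

# Detection can still be overridden per column when needed
dt_override = fread(
  text = "flag,count,ratio,code\nTRUE,1,0.5,001\nFALSE,2,1.5,0A0",
  colClasses = c(count = "character")
)
class(dt_override$count)   # "character"
```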

1.4 Early Error Detection at End-of-File

Because the large sample explicitly includes the very end of the file, critical issues—such as an inconsistent number of columns, a malformed footer, or an opening quote without a matching closing quote—can be detected and reported almost instantly. This early error detection avoids the unnecessary overhead of processing the entire file or allocating excessive memory, only to encounter a failure at the final step. It ensures faster feedback and more efficient resource usage, especially when working with large datasets.

1.5 integer64 Support

By default, fread detects integers larger than 2³¹ and reads them as bit64::integer64 to preserve full precision. This behavior can be overridden in three ways:

Per-column: Use the colClasses argument to specify the type for individual columns.
Per-call: Use the integer64 argument in fread() to set how all detected integer64 columns are read.
Globally: Set the option datatable.integer64 in your R session or .Rprofile file to change the default behavior for all fread calls.

The integer64 argument (and corresponding option) accepts the following values:

"integer64" (default): Reads large integers as bit64::integer64 with full precision.
"double" or "numeric": Reads large integers as double-precision numbers, potentially losing precision silently (similar to utils::read.csv in base R).
"character": Reads large integers as character strings.

To check or set the global default, use:

# fread's default behavior is to treat large integers as "integer64"; however, this global setting can be changed:
options(datatable.integer64 = "double")   # Example: set globally to "double"
getOption("datatable.integer64") 
# [1] "double"
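
The per-call and per-column overrides might look like the sketch below. It passes the settings explicitly, so it does not depend on the global option, and it avoids the "integer64" setting so that the bit64 package is not needed. The data is made up.

```r
library(data.table)

big = "id,val\n1234567890123456789,1\n1234567890123456790,2"

# Per-call: read large integers as character strings (no precision loss)
as_chr = fread(text = big, integer64 = "character")
class(as_chr$id)    # "character"

# Per-call: read them as double (the last digits may be rounded)
as_dbl = fread(text = big, integer64 = "double")
class(as_dbl$id)    # "numeric"

# Per-column: colClasses sets the type of the named column only
per_col = fread(text = big, colClasses = list(character = "id"))
class(per_col$id)   # "character"
```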

1.6 Drop or Select Columns by Name or Position

To save memory and improve performance, use fread()’s select or drop arguments to read only the columns you need.

If you need only a few columns, use select.
If you want to exclude just a few, use drop—this avoids listing everything you want to keep.

Key points:

select: Vector of column names/positions to keep (discards others).
drop: Vector of column names/positions to discard (keeps others).
Do not use select and drop together—they are mutually exclusive.
fread() will warn you if any specified column is missing in the file.

For details, see the manual page by running ?fread in R.
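
A short sketch of both arguments on inline data (the column names are invented for the example):

```r
library(data.table)

csv_text = "id,name,score,notes\n1,ann,9.5,x\n2,bob,7.0,y"

# Keep only two columns, selected by name
kept = fread(text = csv_text, select = c("id", "score"))
names(kept)      # "id" "score"

# Drop one column by position and keep the rest
dropped = fread(text = csv_text, drop = 4)
names(dropped)   # "id" "name" "score"
```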

1.7 Automatic Quote Escape Detection (Including No-Escape)

fread automatically detects how quotes are escaped, including doubled ("") or backslash-escaped (\") quotes, without requiring user input. The escaping rule is determined from a large sample of the data (see section 1.3) and validated against the entire file.

Supported Scenarios:

Unescaped quotes inside quoted fields, e.g., "This "quote" is invalid, but fread works anyway", are supported as long as the column count remains consistent:

data.table::fread(text='x,y\n"This "quote" is invalid, but fread works anyway",1')
# Warning in data.table::fread(text = "x,y\n\"This \"quote\" is invalid, but
# fread works anyway\",1"): Found and resolved improper quoting in first 100
# rows. If the fields are not quoted (e.g. field separator does not appear within
# any field), try quote="" to avoid this warning.
#                                                  x     y
#                                             <char> <int>
# 1: This "quote" is invalid, but fread works anyway     1

Unquoted fields that contain a quote character, e.g., Invalid"Field,10,20, are correctly recognized as not being quoted fields:

data.table::fread(text='x,y\nNot"Valid,1')
#            x     y
#       <char> <int>
# 1: Not"Valid     1

Requirements & Limitations:

Escaping rules and column counts must be consistent throughout the file.
Not supported when fill=TRUE — in that case, the file must follow RFC4180-compliant quoting/escaping.

Version-Specific Robustness: From v1.10.6, fread resolves ambiguities more reliably across the entire file using full-column-count consistency (default is fill=FALSE). Warnings are issued if parsing fails due to ambiguity.

2. fwrite()

fwrite() is the fast file writer companion to fread(). It’s designed for speed, sensible defaults, and ease of use, mirroring many of the conveniences found in fread.

2.1 Intelligent and Minimalist Quoting (quote="auto")

When data is written as strings (either inherently, like character columns, or by choice, like dateTimeAs="ISO"), quote="auto" (default) intelligently quotes fields:

Contextual Quoting: Fields are quoted only when necessary. This happens if they contain the delimiter (sep), a double quote ("), a newline (\n), a carriage return (\r), or if the field is an empty string (""). Quoting the empty string is done to distinguish it from an NA value when the file is read.

Bypassed for Direct Numeric Output: If specific columns are written as their underlying numeric types (e.g., via dateTimeAs="epoch" for POSIXct, or if a user pre-converts Date to integer), then quoting logic is naturally bypassed for those numeric fields, contributing to efficiency.

dt_quoting_scenario = data.table(
  text_field = c("Contains,a,comma", "Contains \"a quote\"", "Clean_text", "", NA),
  numeric_field = 1:5
)
temp_quote_adv = tempfile(fileext = ".csv")

fwrite(dt_quoting_scenario, temp_quote_adv)
# Note the output: the empty string is quoted (""), but the NA is not.
cat(readLines(temp_quote_adv), sep = "\n")
# text_field,numeric_field
# "Contains,a,comma",1
# "Contains ""a quote""",2
# Clean_text,3
# "",4
# ,5

2.2 Fine-Grained Date/Time Serialization (dateTimeAs argument)

The dateTimeAs argument offers precise control over how POSIXct and Date columns are written:

dateTimeAs="ISO" (default): ISO 8601 format (e.g., YYYY-MM-DDTHH:MM:SS.ffffffZ for POSIXct and YYYY-MM-DD for Date), preserving sub-second precision for unambiguous interchange.
dateTimeAs="epoch": POSIXct as seconds since the epoch and Date as days since the epoch (numeric).

dt_timestamps = data.table(
  ts = as.POSIXct("2023-10-26 14:35:45.123456", tz = "GMT"),
  dt = as.Date("2023-11-15")
)
temp_dt_iso = tempfile(fileext = ".csv")
fwrite(dt_timestamps, temp_dt_iso, dateTimeAs = "ISO")
cat(readLines(temp_dt_iso), sep = "\n")
# ts,dt
# 2023-10-26T14:35:45.123456Z,2023-11-15
unlink(temp_dt_iso)
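
The "epoch" setting can be sketched the same way. The values below are chosen close to 1970-01-01 so the result is easy to check by hand: one day is 86400 seconds, and the Date is ten days after the epoch.

```r
library(data.table)

dt_epoch = data.table(
  ts = as.POSIXct("1970-01-02 00:00:00", tz = "UTC"),  # one day after the epoch
  d  = as.Date("1970-01-11")                           # ten days after the epoch
)
f_epoch = tempfile(fileext = ".csv")

# POSIXct is written as seconds since the epoch, Date as days since the epoch
fwrite(dt_epoch, f_epoch, dateTimeAs = "epoch")
epoch_lines = readLines(f_epoch)
cat(epoch_lines, sep = "\n")
unlink(f_epoch)
```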

2.3 Handling of bit64::integer64

Full Precision for Large Integers: fwrite writes bit64::integer64 columns by converting them to strings with full precision. This prevents data loss or silent conversion to double that might occur with less specialized writers. This is crucial for IDs or measurements requiring more than R’s standard 32-bit integer range or 53-bit double precision.

Direct Handling: This direct and careful handling of specialized numerics ensures data integrity and efficient I/O, without unnecessary intermediate conversions to less precise types.

if (requireNamespace("bit64", quietly = TRUE)) {
  dt_i64 = data.table(uid = bit64::as.integer64("1234567890123456789"), val = 100)
  temp_i64_out = tempfile(fileext = ".csv")
  fwrite(dt_i64, temp_i64_out)
  cat(readLines(temp_i64_out), sep = "\n")
  unlink(temp_i64_out)
}
# uid,val
# 1234567890123456789,100

2.4 Column Order and Subset Control

To control the order and subset of columns written to file, subset the data.table before calling fwrite(). The col.names argument in fwrite() is a logical (TRUE/FALSE) that controls whether the header row is written, not which columns are written.

dt = data.table(A = 1:3, B = 4:6, C = 7:9)

# Write only columns C and A, in that order
fwrite(dt[, .(C, A)], "out.csv")
cat(readLines("out.csv"), sep = "\n")
# C,A
# 7,1
# 8,2
# 9,3
file.remove("out.csv")
# [1] TRUE
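
When the columns to write are held in a character vector, the same idea can be combined with col.names = FALSE to omit the header row. A minimal sketch, with cols and f_out as illustrative names:

```r
library(data.table)

dt = data.table(A = 1:3, B = 4:6, C = 7:9)
f_out = tempfile(fileext = ".csv")

# Pick and order columns from a character vector, then write without a header
cols = c("C", "A")
fwrite(dt[, ..cols], f_out, col.names = FALSE)

out_lines = readLines(f_out)
cat(out_lines, sep = "\n")
# 7,1
# 8,2
# 9,3
unlink(f_out)
```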

3. A Note on Performance

While this vignette focuses on features and usability, the primary motivation for fread and fwrite is speed.

For users interested in detailed, up-to-date performance comparisons, we recommend these external blog posts which use the atime package for rigorous analysis:

data.table asymptotic timings: Compares fread and fwrite performance against other popular R packages like readr and arrow.
Benchmarking data.table with polars, duckdb, and pandas: Compares data.table I/O and grouping performance against leading Python libraries.

These benchmarks consistently show that fread and fwrite are highly competitive and often state-of-the-art for performance in the R ecosystem.