R: Merge multiple data.tables

mergelist {data.table}

R Documentation

Merge multiple data.tables

Description

Faster merge of multiple data.tables.

Usage

  mergelist(l, on, cols=NULL,
    how=c("left", "inner", "full", "right", "semi", "anti", "cross"),
    mult, join.many=getOption("datatable.join.many"))
  setmergelist(l, on, cols=NULL,
    how=c("left", "inner", "full", "right", "semi", "anti", "cross"),
    mult, join.many=getOption("datatable.join.many"))

Arguments

l

list of data.tables to merge.

on

character vector of column names to merge on; when missing, the key of the join-to table is used (see Details).

cols

Optional list of character column names corresponding to tables in l, used to subset columns during merges. NULL means all columns, all tables; NULL entries in a list means all columns for the corresponding table.

how

Character string, controls how to merge tables. Allowed values are "left" (default), "inner", "full", "right", "semi", "anti", and "cross". See Details.

mult

Character string, controls how to proceed when multiple rows in the join-to table match to the row in the join-from table. Allowed values are "error", "all", "first", "last". The default value depends on how; see Details. See Examples for how to detect duplicated matches. When using "all", we recommend specifying join.many=FALSE as a precaution to prevent unintended explosion of rows.

join.many

logical, defaulting to getOption("datatable.join.many"), which is TRUE by default; when FALSE and mult="all", an error is thrown when any many-to-many matches are detected between pairs of tables. This is essentially a stricter version of the allow.cartesian option in [.data.table. Note that the option "datatable.join.many" also controls the behavior of joins in [.data.table.

Details

Note: these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.

Merging is performed sequentially from "left to right", so that for l of 3 tables, it will do something like merge(merge(l[[1L]], l[[2L]]), l[[3L]]). Non-equi joins are not supported. Column names to merge on must be common in both tables on each merge.

Arguments on, how, mult, join.many could be lists as well, each of length length(l)-1L, to provide argument to be used for each single tables pair to merge, see examples.

The terms join-to and join-from indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns. Whether each refers to the "left" or "right" table of a pair depends on the how argument:

how %in% c("left", "semi", "anti"): join-to is RHS, join-from is LHS.
how %in% c("inner", "full", "cross"): LHS and RHS tables are treated equally, so that the terms are interchangeable.
how == "right": join-to is LHS, join-from is RHS.

Using mult="error" will throw an error when multiple rows in join-to table match to the row in join-from table. It should not be used just to detect duplicates, which might not have matching row, and thus would silently be missed.

When not specified, mult takes its default depending on the how argument:

When how %in% c("left", "inner", "full", "right"), mult="error".
When how %in% c("semi", "anti"), mult="last", although this is equivalent to mult="first".
When how == "cross", mult="all".

When the on argument is missing, it will be determined based how argument:

When how %in% c("left", right", "semi", "anti")}, \code{on} becomes the key column(s) of the \emph{join-to} table. \item When \code{how %in% c("inner", full"), if only one table has a key, then this key is used; if both tables have keys, then on = intersect(key(lhs), key(rhs)), having its order aligned to shorter key.

When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a right outer join can be used to optimize the sequence of merges, see Examples.

Value

A new data.table based on the merged objects.

For setmergelist, if possible, a copy of the inputs is avoided.

Note

Using how="inner" or how="full" together with mult!="all" is sub-efficient. Unlike during joins in [.data.table, it will apply mult on both tables. This ensures that the join is symmetric so that the LHS and RHS tables can be swapped, regardless of mult. It is always possible to apply a mult-like filter manually and join using mult="all".

Using join.many=FALSE is also sub-efficient. Note that it only takes effect when mult="all". If input data are verified not to have duplicate matches, then this can safely use the default TRUE. Otherwise, for mult="all" merges it is recommended to use join.many=FALSE, unless of course many-to-many joins, duplicating rows, are intended.

References

https://en.wikipedia.org/wiki/Snowflake_schema, https://en.wikipedia.org/wiki/Star_schema

Examples

l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8),
  data.table(id1=2:3, v2=1:2),
  data.table(id1=3:5, v3=1:3)
)
mergelist(l, on="id1")

## using keys
l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8),
  data.table(id1=3:5, id2=1:3, v2=1:3, key="id1"),
  data.table(id2=1:4, v3=4:1, key="id2")
)
mergelist(l)

## select columns
l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8, v2=8:1),
  data.table(id1=3:5, v3=1:3, v4=3:1, v5=1L, key="id1")
)
mergelist(l, cols=list(NULL, c("v3", "v5")))

## different arguments for each merge pair
l = list(
  data.table(id1=1:4, id2=4:1),
  data.table(id1=c(1:3, 1:2), v2=c(1L, 1L, 1:2, 2L)),
  data.table(id2=4:5)
)
mergelist(l,
  on = list("id1", "id2"),     ## first merge on id1, second on id2
  how = list("inner", "anti"), ## first inner join, second anti join
  mult = list("last", NULL))   ## use default 'mult' in second join

## detecting duplicates matches
l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8), ## dups in LHS are fine
  data.table(id1=c(2:3, 2L), v2=1:3),  ## dups in RHS
  data.table(id1=3:5, v3=1:3)
)
lapply(l[-1L], `[`, j = if (.N>1L) .SD, by = "id1") ## duplicated rows
try(mergelist(l, on="id1"))


## 'star schema' and 'snowflake schema' examples (realistic data sizes)

### populate fact: US population by state and date

gt = state.x77[, "Population"]
gt = data.table(state_id=seq_along(state.name), p=gt[state.name] / sum(gt), k=1L)
tt = as.IDate(paste0(as.integer(time(uspop)), "-01-01"))
tt = as.data.table(stats::approx(tt, c(uspop), tt[1L]:tt[length(tt)]))
tt = tt[, .(date=as.IDate(x), date_id=seq_along(x), pop=y, k=1L)]
fact = tt[gt, on="k", allow.cartesian=TRUE,
          .(state_id=i.state_id, date_id=x.date_id, population=x.pop * i.p)]
setkeyv(fact, c("state_id", "date_id"))

### populate dimensions: time and geography

time = data.table(key="date_id",
  date_id= seq_along(tt$date), date=tt$date,
  month_id=month(tt$date), month=month.name[month(tt$date)],
  year_id=year(tt$date)-1789L, year=as.character(year(tt$date)),
  week_id=week(tt$date), week=as.character(week(tt$date)),
  weekday_id=wday(tt$date)-1L, weekday=weekdays(tt$date)
)
time[weekday_id == 0L, weekday_id := 7L][]
geog = data.table(key="state_id",
  state_id=seq_along(state.name), state_abb=state.abb, state_name=state.name,
  division_id=as.integer(state.division),
  division_name=as.character(state.division),
  region_id=as.integer(state.region),
  region_name=as.character(state.region)
)
rm(gt, tt)

### denormalize 'star schema'

l = list(fact, time, geog)
str(l)
mergelist(l)

rm(l)

### turn 'star schema' into 'snowflake schema'

make.lvl = function(x, cols) {
  stopifnot(is.data.table(x))
  lvl = x[, unique(.SD), .SDcols=cols]
  setkeyv(lvl, cols[1L])
  setindexv(lvl, as.list(cols))
  lvl
}
time = list(
  date = make.lvl(
    time, c("date_id", "date", "year_id", "month_id", "week_id", "weekday_id")),
  weekday = make.lvl(time, c("weekday_id", "weekday")),
  week = make.lvl(time, c("week_id", "week")),
  month = make.lvl(time, c("month_id", "month")),
  year = make.lvl(time, c("year_id", "year"))
)
geog = list(
  state = make.lvl(geog, c("state_id", "state_abb", "state_name", "division_id")),
  division = make.lvl(geog, c("division_id", "division_name", "region_id")),
  region = make.lvl(geog, c("region_id", "region_name"))
)

### denormalize 'snowflake schema'

#### left join all
l = c(list(fact=fact), time, geog)
str(l)
mergelist(l)

rm(l)
#### merge hierarchies alone, reduce sizes in merges of geog dimension
ans = mergelist(list(
  fact,
  mergelist(time),
  mergelist(rev(geog), how="right")
))

rm(ans)
#### same but no unnecessary copies
ans = mergelist(list(
  fact,
  setmergelist(time),
  setmergelist(rev(geog), how="right")
))

[Package data.table version 1.17.99 Index]