mergelist {data.table} | R Documentation |
Merge multiple data.tables
Description
Faster merge of multiple data.table
s.
Usage
mergelist(l, on, cols=NULL,
how=c("left", "inner", "full", "right", "semi", "anti", "cross"),
mult, join.many=getOption("datatable.join.many"))
setmergelist(l, on, cols=NULL,
how=c("left", "inner", "full", "right", "semi", "anti", "cross"),
mult, join.many=getOption("datatable.join.many"))
Arguments
l |
|
on |
|
cols |
Optional |
how |
Character string, controls how to merge tables. Allowed values are |
mult |
Character string, controls how to proceed when multiple rows in the join-to table match to the row in the join-from table. Allowed values are |
join.many |
|
Details
Note: these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.
Merging is performed sequentially from "left to right", so that for l
of 3 tables, it will do something like merge(merge(l[[1L]], l[[2L]]), l[[3L]])
. Non-equi joins are not supported. Column names to merge on must be common in both tables on each merge.
Arguments on
, how
, mult
, join.many
could be lists as well, each of length length(l)-1L
, to provide argument to be used for each single tables pair to merge, see examples.
The terms join-to and join-from indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns.
Whether each refers to the "left" or "right" table of a pair depends on the how
argument:
-
how %in% c("left", "semi", "anti")
: join-to is RHS, join-from is LHS. -
how %in% c("inner", "full", "cross")
: LHS and RHS tables are treated equally, so that the terms are interchangeable. -
how == "right"
: join-to is LHS, join-from is RHS.
Using mult="error"
will throw an error when multiple rows in join-to table match to the row in join-from table. It should not be used just to detect duplicates, which might not have matching row, and thus would silently be missed.
When not specified, mult
takes its default depending on the how
argument:
When
how %in% c("left", "inner", "full", "right")
,mult="error"
.When
how %in% c("semi", "anti")
,mult="last"
, although this is equivalent tomult="first"
.When
how == "cross"
,mult="all"
.
When the on
argument is missing, it will be determined based how
argument:
When
how %in% c("left", right", "semi", "anti")}, \code{on} becomes the key column(s) of the \emph{join-to} table. \item When \code{how %in% c("inner", full")
, if only one table has a key, then this key is used; if both tables have keys, thenon = intersect(key(lhs), key(rhs))
, having its order aligned to shorter key.
When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a right outer join can be used to optimize the sequence of merges, see Examples.
Value
A new data.table
based on the merged objects.
For setmergelist
, if possible, a copy
of the inputs is avoided.
Note
Using how="inner"
or how="full"
together with mult!="all"
is sub-efficient. Unlike during joins in [.data.table
, it will apply mult
on both tables. This ensures that the join is symmetric so that the LHS and RHS tables can be swapped, regardless of mult
. It is always possible to apply a mult
-like filter manually and join using mult="all"
.
Using join.many=FALSE
is also sub-efficient. Note that it only takes effect when mult="all"
. If input data are verified not to have duplicate matches, then this can safely use the default TRUE
. Otherwise, for mult="all"
merges it is recommended to use join.many=FALSE
, unless of course many-to-many joins, duplicating rows, are intended.
References
https://en.wikipedia.org/wiki/Snowflake_schema, https://en.wikipedia.org/wiki/Star_schema
See Also
[.data.table
, merge.data.table
Examples
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8),
data.table(id1=2:3, v2=1:2),
data.table(id1=3:5, v3=1:3)
)
mergelist(l, on="id1")
## using keys
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8),
data.table(id1=3:5, id2=1:3, v2=1:3, key="id1"),
data.table(id2=1:4, v3=4:1, key="id2")
)
mergelist(l)
## select columns
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8, v2=8:1),
data.table(id1=3:5, v3=1:3, v4=3:1, v5=1L, key="id1")
)
mergelist(l, cols=list(NULL, c("v3", "v5")))
## different arguments for each merge pair
l = list(
data.table(id1=1:4, id2=4:1),
data.table(id1=c(1:3, 1:2), v2=c(1L, 1L, 1:2, 2L)),
data.table(id2=4:5)
)
mergelist(l,
on = list("id1", "id2"), ## first merge on id1, second on id2
how = list("inner", "anti"), ## first inner join, second anti join
mult = list("last", NULL)) ## use default 'mult' in second join
## detecting duplicates matches
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8), ## dups in LHS are fine
data.table(id1=c(2:3, 2L), v2=1:3), ## dups in RHS
data.table(id1=3:5, v3=1:3)
)
lapply(l[-1L], `[`, j = if (.N>1L) .SD, by = "id1") ## duplicated rows
try(mergelist(l, on="id1"))
## 'star schema' and 'snowflake schema' examples (realistic data sizes)
### populate fact: US population by state and date
gt = state.x77[, "Population"]
gt = data.table(state_id=seq_along(state.name), p=gt[state.name] / sum(gt), k=1L)
tt = as.IDate(paste0(as.integer(time(uspop)), "-01-01"))
tt = as.data.table(stats::approx(tt, c(uspop), tt[1L]:tt[length(tt)]))
tt = tt[, .(date=as.IDate(x), date_id=seq_along(x), pop=y, k=1L)]
fact = tt[gt, on="k", allow.cartesian=TRUE,
.(state_id=i.state_id, date_id=x.date_id, population=x.pop * i.p)]
setkeyv(fact, c("state_id", "date_id"))
### populate dimensions: time and geography
time = data.table(key="date_id",
date_id= seq_along(tt$date), date=tt$date,
month_id=month(tt$date), month=month.name[month(tt$date)],
year_id=year(tt$date)-1789L, year=as.character(year(tt$date)),
week_id=week(tt$date), week=as.character(week(tt$date)),
weekday_id=wday(tt$date)-1L, weekday=weekdays(tt$date)
)
time[weekday_id == 0L, weekday_id := 7L][]
geog = data.table(key="state_id",
state_id=seq_along(state.name), state_abb=state.abb, state_name=state.name,
division_id=as.integer(state.division),
division_name=as.character(state.division),
region_id=as.integer(state.region),
region_name=as.character(state.region)
)
rm(gt, tt)
### denormalize 'star schema'
l = list(fact, time, geog)
str(l)
mergelist(l)
rm(l)
### turn 'star schema' into 'snowflake schema'
make.lvl = function(x, cols) {
stopifnot(is.data.table(x))
lvl = x[, unique(.SD), .SDcols=cols]
setkeyv(lvl, cols[1L])
setindexv(lvl, as.list(cols))
lvl
}
time = list(
date = make.lvl(
time, c("date_id", "date", "year_id", "month_id", "week_id", "weekday_id")),
weekday = make.lvl(time, c("weekday_id", "weekday")),
week = make.lvl(time, c("week_id", "week")),
month = make.lvl(time, c("month_id", "month")),
year = make.lvl(time, c("year_id", "year"))
)
geog = list(
state = make.lvl(geog, c("state_id", "state_abb", "state_name", "division_id")),
division = make.lvl(geog, c("division_id", "division_name", "region_id")),
region = make.lvl(geog, c("region_id", "region_name"))
)
### denormalize 'snowflake schema'
#### left join all
l = c(list(fact=fact), time, geog)
str(l)
mergelist(l)
rm(l)
#### merge hierarchies alone, reduce sizes in merges of geog dimension
ans = mergelist(list(
fact,
mergelist(time),
mergelist(rev(geog), how="right")
))
rm(ans)
#### same but no unnecessary copies
ans = mergelist(list(
fact,
setmergelist(time),
setmergelist(rev(geog), how="right")
))