Merge multiple data.tables

Faster merge of multiple data.tables.

Usage

mergelist(l, on, cols=NULL,
    how=c("left", "inner", "full", "right", "semi", "anti", "cross"),
    mult, join.many=getOption("datatable.join.many"))
  setmergelist(l, on, cols=NULL,
    how=c("left", "inner", "full", "right", "semi", "anti", "cross"),
    mult, join.many=getOption("datatable.join.many"))

Arguments

l: list of data.tables to merge.
on: character vector of column names to merge on; when missing, the key of the join-to table is used (see Details).
cols: Optional list of character column names corresponding to tables in l, used to subset columns during merges. NULL means all columns, all tables; NULL entries in a list means all columns for the corresponding table.
how: Character string, controls how to merge tables. Allowed values are "left" (default), "inner", "full", "right", "semi", "anti", and "cross". See Details.
mult: Character string, controls how to proceed when multiple rows in the join-to table match to the row in the join-from table. Allowed values are "error", "all", "first", "last". The default value depends on how; see Details. See Examples for how to detect duplicated matches. When using "all", we recommend specifying join.many=FALSE as a precaution to prevent unintended explosion of rows.
join.many: logical, defaulting to getOption("datatable.join.many"), which is TRUE by default; when FALSE and mult="all", an error is thrown when any many-to-many matches are detected between pairs of tables. This is essentially a stricter version of the allow.cartesian option in [.data.table. Note that the option "datatable.join.many" also controls the behavior of joins in [.data.table.

Details

Note: these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.

Merging is performed sequentially from "left to right", so that for l of 3 tables, it will do something like merge(merge(l[[1L]], l[[2L]]), l[[3L]]). Non-equi joins are not supported. Column names to merge on must be common in both tables on each merge.

Arguments on, how, mult, join.many could be lists as well, each of length length(l)-1L, to provide argument to be used for each single tables pair to merge, see examples.

The terms join-to and join-from indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns. Whether each refers to the "left" or "right" table of a pair depends on the how argument:

how %in% c("left", "semi", "anti"): join-to is RHS, join-from is LHS.
how %in% c("inner", "full", "cross"): LHS and RHS tables are treated equally, so that the terms are interchangeable.
how == "right": join-to is LHS, join-from is RHS.

Using mult="error" will throw an error when multiple rows in join-to table match to the row in join-from table. It should not be used just to detect duplicates, which might not have matching row, and thus would silently be missed.

When not specified, mult takes its default depending on the how argument:

When how %in% c("left", "inner", "full", "right"), mult="error".
When how %in% c("semi", "anti"), mult="last", although this is equivalent to mult="first".
When how == "cross", mult="all".

When the on argument is missing, it will be determined based how argument:

When how %in% c("left", right", "semi", "anti")}, \code{on} becomes the key column(s) of the \emph{join-to} table. \item When \code{how %in% c("inner", full"), if only one table has a key, then this key is used; if both tables have keys, then on = intersect(key(lhs), key(rhs)), having its order aligned to shorter key.

When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a right outer join can be used to optimize the sequence of merges, see Examples.

Value

A new data.table based on the merged objects.

For setmergelist, if possible, a copy of the inputs is avoided.

Note

Using how="inner" or how="full" together with mult!="all" is sub-efficient. Unlike during joins in [.data.table, it will apply mult on both tables. This ensures that the join is symmetric so that the LHS and RHS tables can be swapped, regardless of mult. It is always possible to apply a mult-like filter manually and join using mult="all".

Using join.many=FALSE is also sub-efficient. Note that it only takes effect when mult="all". If input data are verified not to have duplicate matches, then this can safely use the default TRUE. Otherwise, for mult="all" merges it is recommended to use join.many=FALSE, unless of course many-to-many joins, duplicating rows, are intended.

References

https://en.wikipedia.org/wiki/Snowflake_schema, https://en.wikipedia.org/wiki/Star_schema

Examples

l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8),
  data.table(id1=2:3, v2=1:2),
  data.table(id1=3:5, v3=1:3)
)
mergelist(l, on="id1")
#>      id1    v1    v2    v3
#>    <int> <int> <int> <int>
#> 1:     1     1    NA    NA
#> 2:     2     2     1    NA
#> 3:     3     3     2     1
#> 4:     4     4    NA     2
#> 5:     2     5     1    NA
#> 6:     3     6     2     1
#> 7:     4     7    NA     2
#> 8:     5     8    NA     3

## using keys
l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8),
  data.table(id1=3:5, id2=1:3, v2=1:3, key="id1"),
  data.table(id2=1:4, v3=4:1, key="id2")
)
mergelist(l)
#>      id2   id1    v1    v2    v3
#>    <int> <int> <int> <int> <int>
#> 1:    NA     1     1    NA    NA
#> 2:    NA     2     2    NA    NA
#> 3:     1     3     3     1     4
#> 4:     2     4     4     2     3
#> 5:    NA     2     5    NA    NA
#> 6:     1     3     6     1     4
#> 7:     2     4     7     2     3
#> 8:     3     5     8     3     2

## select columns
l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8, v2=8:1),
  data.table(id1=3:5, v3=1:3, v4=3:1, v5=1L, key="id1")
)
mergelist(l, cols=list(NULL, c("v3", "v5")))
#>      id1    v1    v2    v3    v5
#>    <int> <int> <int> <int> <int>
#> 1:     1     1     8    NA    NA
#> 2:     2     2     7    NA    NA
#> 3:     3     3     6     1     1
#> 4:     4     4     5     2     1
#> 5:     2     5     4    NA    NA
#> 6:     3     6     3     1     1
#> 7:     4     7     2     2     1
#> 8:     5     8     1     3     1

## different arguments for each merge pair
l = list(
  data.table(id1=1:4, id2=4:1),
  data.table(id1=c(1:3, 1:2), v2=c(1L, 1L, 1:2, 2L)),
  data.table(id2=4:5)
)
mergelist(l,
  on = list("id1", "id2"),     ## first merge on id1, second on id2
  how = list("inner", "anti"), ## first inner join, second anti join
  mult = list("last", NULL))   ## use default 'mult' in second join
#>      id1   id2    v2
#>    <int> <int> <int>
#> 1:     2     3     2
#> 2:     3     2     1

## detecting duplicates matches
l = list(
  data.table(id1=c(1:4, 2:5), v1=1:8), ## dups in LHS are fine
  data.table(id1=c(2:3, 2L), v2=1:3),  ## dups in RHS
  data.table(id1=3:5, v3=1:3)
)
lapply(l[-1L], `[`, j = if (.N>1L) .SD, by = "id1") ## duplicated rows
#> [[1]]
#>      id1    v2
#>    <int> <int>
#> 1:     2     1
#> 2:     2     3
#> 
#> [[2]]
#> Empty data.table (0 rows and 1 cols): id1
#> 
try(mergelist(l, on="id1"))
#> Error in bmerge(i, x, icols, xcols, roll = 0, rollends = c(FALSE, TRUE),  : 
#>   mult='error' and multiple matches during merge

# \donttest{
## 'star schema' and 'snowflake schema' examples (realistic data sizes)

### populate fact: US population by state and date

gt = state.x77[, "Population"]
gt = data.table(state_id=seq_along(state.name), p=gt[state.name] / sum(gt), k=1L)
tt = as.IDate(paste0(as.integer(time(uspop)), "-01-01"))
tt = as.data.table(stats::approx(tt, c(uspop), tt[1L]:tt[length(tt)]))
tt = tt[, .(date=as.IDate(x), date_id=seq_along(x), pop=y, k=1L)]
fact = tt[gt, on="k", allow.cartesian=TRUE,
          .(state_id=i.state_id, date_id=x.date_id, population=x.pop * i.p)]
setkeyv(fact, c("state_id", "date_id"))

### populate dimensions: time and geography

time = data.table(key="date_id",
  date_id= seq_along(tt$date), date=tt$date,
  month_id=month(tt$date), month=month.name[month(tt$date)],
  year_id=year(tt$date)-1789L, year=as.character(year(tt$date)),
  week_id=week(tt$date), week=as.character(week(tt$date)),
  weekday_id=wday(tt$date)-1L, weekday=weekdays(tt$date)
)
#> Warning: The default behavior of week() is changing. Previously ('legacy' mode), week numbers advanced every 7th day of the year. The new 'sequential' mode ensures the first week always has 7 days. For example, as.IDate('2023-01-07') returns week 2 in legacy mode but week 1 in sequential mode (week 2 starts on '2023-01-08'). To adopt the new behavior now, set options(datatable.week = 'sequential'). To keep the old results and silence this warning, set options(datatable.week = 'legacy'). See https://github.com/Rdatatable/data.table/issues/2611
#> Warning: The default behavior of week() is changing. Previously ('legacy' mode), week numbers advanced every 7th day of the year. The new 'sequential' mode ensures the first week always has 7 days. For example, as.IDate('2023-01-07') returns week 2 in legacy mode but week 1 in sequential mode (week 2 starts on '2023-01-08'). To adopt the new behavior now, set options(datatable.week = 'sequential'). To keep the old results and silence this warning, set options(datatable.week = 'legacy'). See https://github.com/Rdatatable/data.table/issues/2611
time[weekday_id == 0L, weekday_id := 7L][]
#> Key: <date_id>
#>        date_id       date month_id    month year_id   year week_id   week
#>          <int>     <IDat>    <int>   <char>   <int> <char>   <int> <char>
#>     1:       1 1790-01-01        1  January       1   1790       1      1
#>     2:       2 1790-01-02        1  January       1   1790       1      1
#>     3:       3 1790-01-03        1  January       1   1790       1      1
#>     4:       4 1790-01-04        1  January       1   1790       1      1
#>     5:       5 1790-01-05        1  January       1   1790       1      1
#>    ---                                                                   
#> 65740:   65740 1969-12-28       12 December     180   1969      52     52
#> 65741:   65741 1969-12-29       12 December     180   1969      52     52
#> 65742:   65742 1969-12-30       12 December     180   1969      53     53
#> 65743:   65743 1969-12-31       12 December     180   1969      53     53
#> 65744:   65744 1970-01-01        1  January     181   1970       1      1
#>        weekday_id   weekday
#>             <int>    <char>
#>     1:          5    Friday
#>     2:          6  Saturday
#>     3:          7    Sunday
#>     4:          1    Monday
#>     5:          2   Tuesday
#>    ---                     
#> 65740:          7    Sunday
#> 65741:          1    Monday
#> 65742:          2   Tuesday
#> 65743:          3 Wednesday
#> 65744:          4  Thursday
geog = data.table(key="state_id",
  state_id=seq_along(state.name), state_abb=state.abb, state_name=state.name,
  division_id=as.integer(state.division),
  division_name=as.character(state.division),
  region_id=as.integer(state.region),
  region_name=as.character(state.region)
)
rm(gt, tt)

### denormalize 'star schema'

l = list(fact, time, geog)
str(l)
#> List of 3
#>  $ :Classes ‘data.table’ and 'data.frame':	3287200 obs. of  3 variables:
#>   ..$ state_id  : int [1:3287200] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ date_id   : int [1:3287200] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ population: num [1:3287200] 0.0669 0.0669 0.0669 0.0669 0.0669 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr [1:2] "state_id" "date_id"
#>  $ :Classes ‘data.table’ and 'data.frame':	65744 obs. of  10 variables:
#>   ..$ date_id   : int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ date      : IDate[1:65744], format: "1790-01-01" "1790-01-02" ...
#>   ..$ month_id  : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ month     : chr [1:65744] "January" "January" "January" "January" ...
#>   ..$ year_id   : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ year      : chr [1:65744] "1790" "1790" "1790" "1790" ...
#>   ..$ week_id   : int [1:65744] 1 1 1 1 1 1 2 2 2 2 ...
#>   ..$ week      : chr [1:65744] "1" "1" "1" "1" ...
#>   ..$ weekday_id: int [1:65744] 5 6 7 1 2 3 4 5 6 7 ...
#>   ..$ weekday   : chr [1:65744] "Friday" "Saturday" "Sunday" "Monday" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "date_id"
#>   ..- attr(*, "index")= int(0) 
#>  $ :Classes ‘data.table’ and 'data.frame':	50 obs. of  7 variables:
#>   ..$ state_id     : int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ state_abb    : chr [1:50] "AL" "AK" "AZ" "AR" ...
#>   ..$ state_name   : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
#>   ..$ division_id  : int [1:50] 4 9 8 5 9 8 1 3 3 3 ...
#>   ..$ division_name: chr [1:50] "East South Central" "Pacific" "Mountain" "West South Central" ...
#>   ..$ region_id    : int [1:50] 2 4 4 2 4 4 1 2 2 2 ...
#>   ..$ region_name  : chr [1:50] "South" "West" "West" "South" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "state_id"
mergelist(l)
#> Key: <state_id, date_id>
#>          state_id date_id population       date month_id    month year_id
#>             <int>   <int>      <num>     <IDat>    <int>   <char>   <int>
#>       1:        1       1 0.06691260 1790-01-01        1  January       1
#>       2:        1       2 0.06691903 1790-01-02        1  January       1
#>       3:        1       3 0.06692547 1790-01-03        1  January       1
#>       4:        1       4 0.06693190 1790-01-04        1  January       1
#>       5:        1       5 0.06693833 1790-01-05        1  January       1
#>      ---                                                                 
#> 3287196:       50   65740 0.35980124 1969-12-28       12 December     180
#> 3287197:       50   65741 0.35981283 1969-12-29       12 December     180
#> 3287198:       50   65742 0.35982442 1969-12-30       12 December     180
#> 3287199:       50   65743 0.35983600 1969-12-31       12 December     180
#> 3287200:       50   65744 0.35984759 1970-01-01        1  January     181
#>            year week_id   week weekday_id   weekday state_abb state_name
#>          <char>   <int> <char>      <int>    <char>    <char>     <char>
#>       1:   1790       1      1          5    Friday        AL    Alabama
#>       2:   1790       1      1          6  Saturday        AL    Alabama
#>       3:   1790       1      1          7    Sunday        AL    Alabama
#>       4:   1790       1      1          1    Monday        AL    Alabama
#>       5:   1790       1      1          2   Tuesday        AL    Alabama
#>      ---                                                                
#> 3287196:   1969      52     52          7    Sunday        WY    Wyoming
#> 3287197:   1969      52     52          1    Monday        WY    Wyoming
#> 3287198:   1969      53     53          2   Tuesday        WY    Wyoming
#> 3287199:   1969      53     53          3 Wednesday        WY    Wyoming
#> 3287200:   1970       1      1          4  Thursday        WY    Wyoming
#>          division_id      division_name region_id region_name
#>                <int>             <char>     <int>      <char>
#>       1:           4 East South Central         2       South
#>       2:           4 East South Central         2       South
#>       3:           4 East South Central         2       South
#>       4:           4 East South Central         2       South
#>       5:           4 East South Central         2       South
#>      ---                                                     
#> 3287196:           8           Mountain         4        West
#> 3287197:           8           Mountain         4        West
#> 3287198:           8           Mountain         4        West
#> 3287199:           8           Mountain         4        West
#> 3287200:           8           Mountain         4        West

rm(l)

### turn 'star schema' into 'snowflake schema'

make.lvl = function(x, cols) {
  stopifnot(is.data.table(x))
  lvl = x[, unique(.SD), .SDcols=cols]
  setkeyv(lvl, cols[1L])
  setindexv(lvl, as.list(cols))
  lvl
}
time = list(
  date = make.lvl(
    time, c("date_id", "date", "year_id", "month_id", "week_id", "weekday_id")),
  weekday = make.lvl(time, c("weekday_id", "weekday")),
  week = make.lvl(time, c("week_id", "week")),
  month = make.lvl(time, c("month_id", "month")),
  year = make.lvl(time, c("year_id", "year"))
)
geog = list(
  state = make.lvl(geog, c("state_id", "state_abb", "state_name", "division_id")),
  division = make.lvl(geog, c("division_id", "division_name", "region_id")),
  region = make.lvl(geog, c("region_id", "region_name"))
)

### denormalize 'snowflake schema'

#### left join all
l = c(list(fact=fact), time, geog)
str(l)
#> List of 9
#>  $ fact    :Classes ‘data.table’ and 'data.frame':	3287200 obs. of  3 variables:
#>   ..$ state_id  : int [1:3287200] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ date_id   : int [1:3287200] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ population: num [1:3287200] 0.0669 0.0669 0.0669 0.0669 0.0669 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr [1:2] "state_id" "date_id"
#>  $ date    :Classes ‘data.table’ and 'data.frame':	65744 obs. of  6 variables:
#>   ..$ date_id   : int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ date      : IDate[1:65744], format: "1790-01-01" "1790-01-02" ...
#>   ..$ year_id   : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ month_id  : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ week_id   : int [1:65744] 1 1 1 1 1 1 2 2 2 2 ...
#>   ..$ weekday_id: int [1:65744] 5 6 7 1 2 3 4 5 6 7 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "date_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__date_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__date")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__year_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:181] 1 366 731 1097 1462 1827 2192 2558 2923 3288 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 366
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__month_id")= int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "starts")= int [1:12] 1 5582 10665 16245 21645 27225 32625 38205 43785 49185 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 5581
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__week_id")= int [1:65744] 1 2 3 4 5 6 366 367 368 369 ...
#>   .. .. ..- attr(*, "starts")= int [1:53] 1 1082 2342 3602 4862 6122 7382 8642 9902 11162 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1260
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__weekday_id")= int [1:65744] 4 11 18 25 32 39 46 53 60 67 ...
#>   .. .. ..- attr(*, "starts")= int [1:7] 1 9393 18785 28177 37569 46961 56353
#>   .. .. ..- attr(*, "maxgrpn")= int 9392
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ weekday :Classes ‘data.table’ and 'data.frame':	7 obs. of  2 variables:
#>   ..$ weekday_id: int [1:7] 1 2 3 4 5 6 7
#>   ..$ weekday   : chr [1:7] "Monday" "Tuesday" "Wednesday" "Thursday" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "weekday_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__weekday_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:7] 1 2 3 4 5 6 7
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__weekday")= int [1:7] 5 1 6 7 4 2 3
#>   .. .. ..- attr(*, "starts")= int [1:7] 1 2 3 4 5 6 7
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ week    :Classes ‘data.table’ and 'data.frame':	53 obs. of  2 variables:
#>   ..$ week_id: int [1:53] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ week   : chr [1:53] "1" "2" "3" "4" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "week_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__week_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:53] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__week")= int [1:53] 1 10 11 12 13 14 15 16 17 18 ...
#>   .. .. ..- attr(*, "starts")= int [1:53] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ month   :Classes ‘data.table’ and 'data.frame':	12 obs. of  2 variables:
#>   ..$ month_id: int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ month   : chr [1:12] "January" "February" "March" "April" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "month_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__month_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__month")= int [1:12] 4 8 12 2 1 7 6 3 5 11 ...
#>   .. .. ..- attr(*, "starts")= int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ year    :Classes ‘data.table’ and 'data.frame':	181 obs. of  2 variables:
#>   ..$ year_id: int [1:181] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ year   : chr [1:181] "1790" "1791" "1792" "1793" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "year_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__year_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:181] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__year")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:181] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ state   :Classes ‘data.table’ and 'data.frame':	50 obs. of  4 variables:
#>   ..$ state_id   : int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ state_abb  : chr [1:50] "AL" "AK" "AZ" "AR" ...
#>   ..$ state_name : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
#>   ..$ division_id: int [1:50] 4 9 8 5 9 8 1 3 3 3 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "state_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__state_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__state_abb")= int [1:50] 2 1 4 3 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "starts")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__state_name")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__division_id")= int [1:50] 7 19 21 29 39 45 30 32 38 8 ...
#>   .. .. ..- attr(*, "starts")= int [1:9] 1 7 10 18 22 26 31 38 46
#>   .. .. ..- attr(*, "maxgrpn")= int 8
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ division:Classes ‘data.table’ and 'data.frame':	9 obs. of  3 variables:
#>   ..$ division_id  : int [1:9] 1 2 3 4 5 6 7 8 9
#>   ..$ division_name: chr [1:9] "New England" "Middle Atlantic" "South Atlantic" "East South Central" ...
#>   ..$ region_id    : int [1:9] 1 1 2 2 2 3 3 4 4
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "division_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__division_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:9] 1 2 3 4 5 6 7 8 9
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__division_name")= int [1:9] 6 4 2 8 1 9 3 7 5
#>   .. .. ..- attr(*, "starts")= int [1:9] 1 2 3 4 5 6 7 8 9
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__region_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:4] 1 3 6 8
#>   .. .. ..- attr(*, "maxgrpn")= int 3
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>  $ region  :Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#>   ..$ region_id  : int [1:4] 1 2 3 4
#>   ..$ region_name: chr [1:4] "Northeast" "South" "North Central" "West"
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "region_id"
#>   ..- attr(*, "index")= int(0) 
#>   .. ..- attr(*, "__region_id")= int(0) 
#>   .. .. ..- attr(*, "starts")= int [1:4] 1 2 3 4
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
#>   .. ..- attr(*, "__region_name")= int [1:4] 3 1 2 4
#>   .. .. ..- attr(*, "starts")= int [1:4] 1 2 3 4
#>   .. .. ..- attr(*, "maxgrpn")= int 1
#>   .. .. ..- attr(*, "anyna")= int 0
#>   .. .. ..- attr(*, "anyinfnan")= int 0
#>   .. .. ..- attr(*, "anynotascii")= int 0
#>   .. .. ..- attr(*, "anynotutf8")= int 0
mergelist(l)
#> Key: <state_id, date_id>
#>          region_id division_id state_id year_id month_id week_id weekday_id
#>              <int>       <int>    <int>   <int>    <int>   <int>      <int>
#>       1:         2           4        1       1        1       1          5
#>       2:         2           4        1       1        1       1          6
#>       3:         2           4        1       1        1       1          7
#>       4:         2           4        1       1        1       1          1
#>       5:         2           4        1       1        1       1          2
#>      ---                                                                   
#> 3287196:         4           8       50     180       12      52          7
#> 3287197:         4           8       50     180       12      52          1
#> 3287198:         4           8       50     180       12      53          2
#> 3287199:         4           8       50     180       12      53          3
#> 3287200:         4           8       50     181        1       1          4
#>          date_id population       date   weekday   week    month   year
#>            <int>      <num>     <IDat>    <char> <char>   <char> <char>
#>       1:       1 0.06691260 1790-01-01    Friday      1  January   1790
#>       2:       2 0.06691903 1790-01-02  Saturday      1  January   1790
#>       3:       3 0.06692547 1790-01-03    Sunday      1  January   1790
#>       4:       4 0.06693190 1790-01-04    Monday      1  January   1790
#>       5:       5 0.06693833 1790-01-05   Tuesday      1  January   1790
#>      ---                                                               
#> 3287196:   65740 0.35980124 1969-12-28    Sunday     52 December   1969
#> 3287197:   65741 0.35981283 1969-12-29    Monday     52 December   1969
#> 3287198:   65742 0.35982442 1969-12-30   Tuesday     53 December   1969
#> 3287199:   65743 0.35983600 1969-12-31 Wednesday     53 December   1969
#> 3287200:   65744 0.35984759 1970-01-01  Thursday      1  January   1970
#>          state_abb state_name      division_name region_name
#>             <char>     <char>             <char>      <char>
#>       1:        AL    Alabama East South Central       South
#>       2:        AL    Alabama East South Central       South
#>       3:        AL    Alabama East South Central       South
#>       4:        AL    Alabama East South Central       South
#>       5:        AL    Alabama East South Central       South
#>      ---                                                    
#> 3287196:        WY    Wyoming           Mountain        West
#> 3287197:        WY    Wyoming           Mountain        West
#> 3287198:        WY    Wyoming           Mountain        West
#> 3287199:        WY    Wyoming           Mountain        West
#> 3287200:        WY    Wyoming           Mountain        West

rm(l)
#### merge hierarchies alone, reduce sizes in merges of geog dimension
ans = mergelist(list(
  fact,
  mergelist(time),
  mergelist(rev(geog), how="right")
))

rm(ans)
#### same but no unnecessary copies
ans = mergelist(list(
  fact,
  setmergelist(time),
  setmergelist(rev(geog), how="right")
))
# }