Merge multiple data.tables
mergelist.Rd
Faster merge of multiple data.table
s.
Arguments
- l
list
ofdata.table
s to merge.- on
character
vector of column names to merge on; when missing, thekey
of the join-to table is used (see Details).- cols
Optional
list
ofcharacter
column names corresponding to tables inl
, used to subset columns during merges.NULL
means all columns, all tables;NULL
entries in a list means all columns for the corresponding table.- how
Character string, controls how to merge tables. Allowed values are
"left"
(default),"inner"
,"full"
,"right"
,"semi"
,"anti"
, and"cross"
. See Details.- mult
Character string, controls how to proceed when multiple rows in the join-to table match to the row in the join-from table. Allowed values are
"error"
,"all"
,"first"
,"last"
. The default value depends onhow
; see Details. See Examples for how to detect duplicated matches. When using"all"
, we recommend specifyingjoin.many=FALSE
as a precaution to prevent unintended explosion of rows.- join.many
logical
, defaulting togetOption("datatable.join.many")
, which isTRUE
by default; whenFALSE
andmult="all"
, an error is thrown when any many-to-many matches are detected between pairs of tables. This is essentially a stricter version of theallow.cartesian
option in[.data.table
. Note that the option"datatable.join.many"
also controls the behavior of joins in[.data.table
.
Details
Note: these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.
Merging is performed sequentially from "left to right", so that for l
of 3 tables, it will do something like merge(merge(l[[1L]], l[[2L]]), l[[3L]])
. Non-equi joins are not supported. Column names to merge on must be common in both tables on each merge.
Arguments on
, how
, mult
, join.many
could be lists as well, each of length length(l)-1L
, to provide argument to be used for each single tables pair to merge, see examples.
The terms join-to and join-from indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns.
Whether each refers to the "left" or "right" table of a pair depends on the how
argument:
how %in% c("left", "semi", "anti")
: join-to is RHS, join-from is LHS.how %in% c("inner", "full", "cross")
: LHS and RHS tables are treated equally, so that the terms are interchangeable.how == "right"
: join-to is LHS, join-from is RHS.
Using mult="error"
will throw an error when multiple rows in join-to table match to the row in join-from table. It should not be used just to detect duplicates, which might not have matching row, and thus would silently be missed.
When not specified, mult
takes its default depending on the how
argument:
When
how %in% c("left", "inner", "full", "right")
,mult="error"
.When
how %in% c("semi", "anti")
,mult="last"
, although this is equivalent tomult="first"
.When
how == "cross"
,mult="all"
.
When the on
argument is missing, it will be determined based how
argument:
When
how %in% c("left", right", "semi", "anti")}, \code{on} becomes the key column(s) of the \emph{join-to} table. \item When \code{how %in% c("inner", full")
, if only one table has a key, then this key is used; if both tables have keys, thenon = intersect(key(lhs), key(rhs))
, having its order aligned to shorter key.
When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a right outer join can be used to optimize the sequence of merges, see Examples.
Value
A new data.table
based on the merged objects.
For setmergelist
, if possible, a copy
of the inputs is avoided.
Note
Using how="inner"
or how="full"
together with mult!="all"
is sub-efficient. Unlike during joins in [.data.table
, it will apply mult
on both tables. This ensures that the join is symmetric so that the LHS and RHS tables can be swapped, regardless of mult
. It is always possible to apply a mult
-like filter manually and join using mult="all"
.
Using join.many=FALSE
is also sub-efficient. Note that it only takes effect when mult="all"
. If input data are verified not to have duplicate matches, then this can safely use the default TRUE
. Otherwise, for mult="all"
merges it is recommended to use join.many=FALSE
, unless of course many-to-many joins, duplicating rows, are intended.
References
https://en.wikipedia.org/wiki/Snowflake_schema, https://en.wikipedia.org/wiki/Star_schema
Examples
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8),
data.table(id1=2:3, v2=1:2),
data.table(id1=3:5, v3=1:3)
)
mergelist(l, on="id1")
#> id1 v1 v2 v3
#> <int> <int> <int> <int>
#> 1: 1 1 NA NA
#> 2: 2 2 1 NA
#> 3: 3 3 2 1
#> 4: 4 4 NA 2
#> 5: 2 5 1 NA
#> 6: 3 6 2 1
#> 7: 4 7 NA 2
#> 8: 5 8 NA 3
## using keys
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8),
data.table(id1=3:5, id2=1:3, v2=1:3, key="id1"),
data.table(id2=1:4, v3=4:1, key="id2")
)
mergelist(l)
#> id2 id1 v1 v2 v3
#> <int> <int> <int> <int> <int>
#> 1: NA 1 1 NA NA
#> 2: NA 2 2 NA NA
#> 3: 1 3 3 1 4
#> 4: 2 4 4 2 3
#> 5: NA 2 5 NA NA
#> 6: 1 3 6 1 4
#> 7: 2 4 7 2 3
#> 8: 3 5 8 3 2
## select columns
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8, v2=8:1),
data.table(id1=3:5, v3=1:3, v4=3:1, v5=1L, key="id1")
)
mergelist(l, cols=list(NULL, c("v3", "v5")))
#> id1 v1 v2 v3 v5
#> <int> <int> <int> <int> <int>
#> 1: 1 1 8 NA NA
#> 2: 2 2 7 NA NA
#> 3: 3 3 6 1 1
#> 4: 4 4 5 2 1
#> 5: 2 5 4 NA NA
#> 6: 3 6 3 1 1
#> 7: 4 7 2 2 1
#> 8: 5 8 1 3 1
## different arguments for each merge pair
l = list(
data.table(id1=1:4, id2=4:1),
data.table(id1=c(1:3, 1:2), v2=c(1L, 1L, 1:2, 2L)),
data.table(id2=4:5)
)
mergelist(l,
on = list("id1", "id2"), ## first merge on id1, second on id2
how = list("inner", "anti"), ## first inner join, second anti join
mult = list("last", NULL)) ## use default 'mult' in second join
#> id1 id2 v2
#> <int> <int> <int>
#> 1: 2 3 2
#> 2: 3 2 1
## detecting duplicates matches
l = list(
data.table(id1=c(1:4, 2:5), v1=1:8), ## dups in LHS are fine
data.table(id1=c(2:3, 2L), v2=1:3), ## dups in RHS
data.table(id1=3:5, v3=1:3)
)
lapply(l[-1L], `[`, j = if (.N>1L) .SD, by = "id1") ## duplicated rows
#> [[1]]
#> id1 v2
#> <int> <int>
#> 1: 2 1
#> 2: 2 3
#>
#> [[2]]
#> Empty data.table (0 rows and 1 cols): id1
#>
try(mergelist(l, on="id1"))
#> Error in bmerge(i, x, icols, xcols, roll = 0, rollends = c(FALSE, TRUE), :
#> mult='error' and multiple matches during merge
# \donttest{
## 'star schema' and 'snowflake schema' examples (realistic data sizes)
### populate fact: US population by state and date
gt = state.x77[, "Population"]
gt = data.table(state_id=seq_along(state.name), p=gt[state.name] / sum(gt), k=1L)
tt = as.IDate(paste0(as.integer(time(uspop)), "-01-01"))
tt = as.data.table(stats::approx(tt, c(uspop), tt[1L]:tt[length(tt)]))
tt = tt[, .(date=as.IDate(x), date_id=seq_along(x), pop=y, k=1L)]
fact = tt[gt, on="k", allow.cartesian=TRUE,
.(state_id=i.state_id, date_id=x.date_id, population=x.pop * i.p)]
setkeyv(fact, c("state_id", "date_id"))
### populate dimensions: time and geography
time = data.table(key="date_id",
date_id= seq_along(tt$date), date=tt$date,
month_id=month(tt$date), month=month.name[month(tt$date)],
year_id=year(tt$date)-1789L, year=as.character(year(tt$date)),
week_id=week(tt$date), week=as.character(week(tt$date)),
weekday_id=wday(tt$date)-1L, weekday=weekdays(tt$date)
)
time[weekday_id == 0L, weekday_id := 7L][]
#> Key: <date_id>
#> date_id date month_id month year_id year week_id week
#> <int> <IDat> <int> <char> <int> <char> <int> <char>
#> 1: 1 1790-01-01 1 January 1 1790 1 1
#> 2: 2 1790-01-02 1 January 1 1790 1 1
#> 3: 3 1790-01-03 1 January 1 1790 1 1
#> 4: 4 1790-01-04 1 January 1 1790 1 1
#> 5: 5 1790-01-05 1 January 1 1790 1 1
#> ---
#> 65740: 65740 1969-12-28 12 December 180 1969 52 52
#> 65741: 65741 1969-12-29 12 December 180 1969 52 52
#> 65742: 65742 1969-12-30 12 December 180 1969 53 53
#> 65743: 65743 1969-12-31 12 December 180 1969 53 53
#> 65744: 65744 1970-01-01 1 January 181 1970 1 1
#> weekday_id weekday
#> <int> <char>
#> 1: 5 Friday
#> 2: 6 Saturday
#> 3: 7 Sunday
#> 4: 1 Monday
#> 5: 2 Tuesday
#> ---
#> 65740: 7 Sunday
#> 65741: 1 Monday
#> 65742: 2 Tuesday
#> 65743: 3 Wednesday
#> 65744: 4 Thursday
geog = data.table(key="state_id",
state_id=seq_along(state.name), state_abb=state.abb, state_name=state.name,
division_id=as.integer(state.division),
division_name=as.character(state.division),
region_id=as.integer(state.region),
region_name=as.character(state.region)
)
rm(gt, tt)
### denormalize 'star schema'
l = list(fact, time, geog)
str(l)
#> List of 3
#> $ :Classes ‘data.table’ and 'data.frame': 3287200 obs. of 3 variables:
#> ..$ state_id : int [1:3287200] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ date_id : int [1:3287200] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ population: num [1:3287200] 0.0669 0.0669 0.0669 0.0669 0.0669 ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr [1:2] "state_id" "date_id"
#> $ :Classes ‘data.table’ and 'data.frame': 65744 obs. of 10 variables:
#> ..$ date_id : int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ date : IDate[1:65744], format: "1790-01-01" "1790-01-02" ...
#> ..$ month_id : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ month : chr [1:65744] "January" "January" "January" "January" ...
#> ..$ year_id : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ year : chr [1:65744] "1790" "1790" "1790" "1790" ...
#> ..$ week_id : int [1:65744] 1 1 1 1 1 1 2 2 2 2 ...
#> ..$ week : chr [1:65744] "1" "1" "1" "1" ...
#> ..$ weekday_id: int [1:65744] 5 6 7 1 2 3 4 5 6 7 ...
#> ..$ weekday : chr [1:65744] "Friday" "Saturday" "Sunday" "Monday" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "date_id"
#> ..- attr(*, "index")= int(0)
#> $ :Classes ‘data.table’ and 'data.frame': 50 obs. of 7 variables:
#> ..$ state_id : int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ state_abb : chr [1:50] "AL" "AK" "AZ" "AR" ...
#> ..$ state_name : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
#> ..$ division_id : int [1:50] 4 9 8 5 9 8 1 3 3 3 ...
#> ..$ division_name: chr [1:50] "East South Central" "Pacific" "Mountain" "West South Central" ...
#> ..$ region_id : int [1:50] 2 4 4 2 4 4 1 2 2 2 ...
#> ..$ region_name : chr [1:50] "South" "West" "West" "South" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "state_id"
mergelist(l)
#> Key: <state_id, date_id>
#> state_id date_id population date month_id month year_id
#> <int> <int> <num> <IDat> <int> <char> <int>
#> 1: 1 1 0.06691260 1790-01-01 1 January 1
#> 2: 1 2 0.06691903 1790-01-02 1 January 1
#> 3: 1 3 0.06692547 1790-01-03 1 January 1
#> 4: 1 4 0.06693190 1790-01-04 1 January 1
#> 5: 1 5 0.06693833 1790-01-05 1 January 1
#> ---
#> 3287196: 50 65740 0.35980124 1969-12-28 12 December 180
#> 3287197: 50 65741 0.35981283 1969-12-29 12 December 180
#> 3287198: 50 65742 0.35982442 1969-12-30 12 December 180
#> 3287199: 50 65743 0.35983600 1969-12-31 12 December 180
#> 3287200: 50 65744 0.35984759 1970-01-01 1 January 181
#> year week_id week weekday_id weekday state_abb state_name
#> <char> <int> <char> <int> <char> <char> <char>
#> 1: 1790 1 1 5 Friday AL Alabama
#> 2: 1790 1 1 6 Saturday AL Alabama
#> 3: 1790 1 1 7 Sunday AL Alabama
#> 4: 1790 1 1 1 Monday AL Alabama
#> 5: 1790 1 1 2 Tuesday AL Alabama
#> ---
#> 3287196: 1969 52 52 7 Sunday WY Wyoming
#> 3287197: 1969 52 52 1 Monday WY Wyoming
#> 3287198: 1969 53 53 2 Tuesday WY Wyoming
#> 3287199: 1969 53 53 3 Wednesday WY Wyoming
#> 3287200: 1970 1 1 4 Thursday WY Wyoming
#> division_id division_name region_id region_name
#> <int> <char> <int> <char>
#> 1: 4 East South Central 2 South
#> 2: 4 East South Central 2 South
#> 3: 4 East South Central 2 South
#> 4: 4 East South Central 2 South
#> 5: 4 East South Central 2 South
#> ---
#> 3287196: 8 Mountain 4 West
#> 3287197: 8 Mountain 4 West
#> 3287198: 8 Mountain 4 West
#> 3287199: 8 Mountain 4 West
#> 3287200: 8 Mountain 4 West
rm(l)
### turn 'star schema' into 'snowflake schema'
make.lvl = function(x, cols) {
stopifnot(is.data.table(x))
lvl = x[, unique(.SD), .SDcols=cols]
setkeyv(lvl, cols[1L])
setindexv(lvl, as.list(cols))
lvl
}
time = list(
date = make.lvl(
time, c("date_id", "date", "year_id", "month_id", "week_id", "weekday_id")),
weekday = make.lvl(time, c("weekday_id", "weekday")),
week = make.lvl(time, c("week_id", "week")),
month = make.lvl(time, c("month_id", "month")),
year = make.lvl(time, c("year_id", "year"))
)
geog = list(
state = make.lvl(geog, c("state_id", "state_abb", "state_name", "division_id")),
division = make.lvl(geog, c("division_id", "division_name", "region_id")),
region = make.lvl(geog, c("region_id", "region_name"))
)
### denormalize 'snowflake schema'
#### left join all
l = c(list(fact=fact), time, geog)
str(l)
#> List of 9
#> $ fact :Classes ‘data.table’ and 'data.frame': 3287200 obs. of 3 variables:
#> ..$ state_id : int [1:3287200] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ date_id : int [1:3287200] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ population: num [1:3287200] 0.0669 0.0669 0.0669 0.0669 0.0669 ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr [1:2] "state_id" "date_id"
#> $ date :Classes ‘data.table’ and 'data.frame': 65744 obs. of 6 variables:
#> ..$ date_id : int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ date : IDate[1:65744], format: "1790-01-01" "1790-01-02" ...
#> ..$ year_id : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ month_id : int [1:65744] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ week_id : int [1:65744] 1 1 1 1 1 1 2 2 2 2 ...
#> ..$ weekday_id: int [1:65744] 5 6 7 1 2 3 4 5 6 7 ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "date_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__date_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__date")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__year_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:181] 1 366 731 1097 1462 1827 2192 2558 2923 3288 ...
#> .. .. ..- attr(*, "maxgrpn")= int 366
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__month_id")= int [1:65744] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "starts")= int [1:12] 1 5582 10665 16245 21645 27225 32625 38205 43785 49185 ...
#> .. .. ..- attr(*, "maxgrpn")= int 5581
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__week_id")= int [1:65744] 1 2 3 4 5 6 366 367 368 369 ...
#> .. .. ..- attr(*, "starts")= int [1:53] 1 1082 2342 3602 4862 6122 7382 8642 9902 11162 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1260
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__weekday_id")= int [1:65744] 4 11 18 25 32 39 46 53 60 67 ...
#> .. .. ..- attr(*, "starts")= int [1:7] 1 9393 18785 28177 37569 46961 56353
#> .. .. ..- attr(*, "maxgrpn")= int 9392
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ weekday :Classes ‘data.table’ and 'data.frame': 7 obs. of 2 variables:
#> ..$ weekday_id: int [1:7] 1 2 3 4 5 6 7
#> ..$ weekday : chr [1:7] "Monday" "Tuesday" "Wednesday" "Thursday" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "weekday_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__weekday_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:7] 1 2 3 4 5 6 7
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__weekday")= int [1:7] 5 1 6 7 4 2 3
#> .. .. ..- attr(*, "starts")= int [1:7] 1 2 3 4 5 6 7
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ week :Classes ‘data.table’ and 'data.frame': 53 obs. of 2 variables:
#> ..$ week_id: int [1:53] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ week : chr [1:53] "1" "2" "3" "4" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "week_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__week_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:53] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__week")= int [1:53] 1 10 11 12 13 14 15 16 17 18 ...
#> .. .. ..- attr(*, "starts")= int [1:53] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ month :Classes ‘data.table’ and 'data.frame': 12 obs. of 2 variables:
#> ..$ month_id: int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ month : chr [1:12] "January" "February" "March" "April" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "month_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__month_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__month")= int [1:12] 4 8 12 2 1 7 6 3 5 11 ...
#> .. .. ..- attr(*, "starts")= int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ year :Classes ‘data.table’ and 'data.frame': 181 obs. of 2 variables:
#> ..$ year_id: int [1:181] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ year : chr [1:181] "1790" "1791" "1792" "1793" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "year_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__year_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:181] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__year")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:181] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ state :Classes ‘data.table’ and 'data.frame': 50 obs. of 4 variables:
#> ..$ state_id : int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ state_abb : chr [1:50] "AL" "AK" "AZ" "AR" ...
#> ..$ state_name : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
#> ..$ division_id: int [1:50] 4 9 8 5 9 8 1 3 3 3 ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "state_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__state_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__state_abb")= int [1:50] 2 1 4 3 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "starts")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__state_name")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__division_id")= int [1:50] 7 19 21 29 39 45 30 32 38 8 ...
#> .. .. ..- attr(*, "starts")= int [1:9] 1 7 10 18 22 26 31 38 46
#> .. .. ..- attr(*, "maxgrpn")= int 8
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ division:Classes ‘data.table’ and 'data.frame': 9 obs. of 3 variables:
#> ..$ division_id : int [1:9] 1 2 3 4 5 6 7 8 9
#> ..$ division_name: chr [1:9] "New England" "Middle Atlantic" "South Atlantic" "East South Central" ...
#> ..$ region_id : int [1:9] 1 1 2 2 2 3 3 4 4
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "division_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__division_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:9] 1 2 3 4 5 6 7 8 9
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__division_name")= int [1:9] 6 4 2 8 1 9 3 7 5
#> .. .. ..- attr(*, "starts")= int [1:9] 1 2 3 4 5 6 7 8 9
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__region_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:4] 1 3 6 8
#> .. .. ..- attr(*, "maxgrpn")= int 3
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> $ region :Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
#> ..$ region_id : int [1:4] 1 2 3 4
#> ..$ region_name: chr [1:4] "Northeast" "South" "North Central" "West"
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr "region_id"
#> ..- attr(*, "index")= int(0)
#> .. ..- attr(*, "__region_id")= int(0)
#> .. .. ..- attr(*, "starts")= int [1:4] 1 2 3 4
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
#> .. ..- attr(*, "__region_name")= int [1:4] 3 1 2 4
#> .. .. ..- attr(*, "starts")= int [1:4] 1 2 3 4
#> .. .. ..- attr(*, "maxgrpn")= int 1
#> .. .. ..- attr(*, "anyna")= int 0
#> .. .. ..- attr(*, "anyinfnan")= int 0
#> .. .. ..- attr(*, "anynotascii")= int 0
#> .. .. ..- attr(*, "anynotutf8")= int 0
mergelist(l)
#> Key: <state_id, date_id>
#> region_id division_id state_id year_id month_id week_id weekday_id
#> <int> <int> <int> <int> <int> <int> <int>
#> 1: 2 4 1 1 1 1 5
#> 2: 2 4 1 1 1 1 6
#> 3: 2 4 1 1 1 1 7
#> 4: 2 4 1 1 1 1 1
#> 5: 2 4 1 1 1 1 2
#> ---
#> 3287196: 4 8 50 180 12 52 7
#> 3287197: 4 8 50 180 12 52 1
#> 3287198: 4 8 50 180 12 53 2
#> 3287199: 4 8 50 180 12 53 3
#> 3287200: 4 8 50 181 1 1 4
#> date_id population date weekday week month year
#> <int> <num> <IDat> <char> <char> <char> <char>
#> 1: 1 0.06691260 1790-01-01 Friday 1 January 1790
#> 2: 2 0.06691903 1790-01-02 Saturday 1 January 1790
#> 3: 3 0.06692547 1790-01-03 Sunday 1 January 1790
#> 4: 4 0.06693190 1790-01-04 Monday 1 January 1790
#> 5: 5 0.06693833 1790-01-05 Tuesday 1 January 1790
#> ---
#> 3287196: 65740 0.35980124 1969-12-28 Sunday 52 December 1969
#> 3287197: 65741 0.35981283 1969-12-29 Monday 52 December 1969
#> 3287198: 65742 0.35982442 1969-12-30 Tuesday 53 December 1969
#> 3287199: 65743 0.35983600 1969-12-31 Wednesday 53 December 1969
#> 3287200: 65744 0.35984759 1970-01-01 Thursday 1 January 1970
#> state_abb state_name division_name region_name
#> <char> <char> <char> <char>
#> 1: AL Alabama East South Central South
#> 2: AL Alabama East South Central South
#> 3: AL Alabama East South Central South
#> 4: AL Alabama East South Central South
#> 5: AL Alabama East South Central South
#> ---
#> 3287196: WY Wyoming Mountain West
#> 3287197: WY Wyoming Mountain West
#> 3287198: WY Wyoming Mountain West
#> 3287199: WY Wyoming Mountain West
#> 3287200: WY Wyoming Mountain West
rm(l)
#### merge hierarchies alone, reduce sizes in merges of geog dimension
ans = mergelist(list(
fact,
mergelist(time),
mergelist(rev(geog), how="right")
))
rm(ans)
#### same but no unnecessary copies
ans = mergelist(list(
fact,
setmergelist(time),
setmergelist(rev(geog), how="right")
))
# }