frollapply {data.table}    R Documentation

Rolling user-defined function

Description

Fast rolling user-defined function (UDF) to calculate on a sliding window. Experimental. Please read, at least, caveats section below. For "time-aware" (irregularly spaced time series) rolling function see frolladapt.

Usage

  frollapply(X, N, FUN, ..., by.column=TRUE, fill=NA,
    align=c("right","left","center"), adaptive=FALSE, partial=FALSE,
    give.names=FALSE, simplify=TRUE, x, n)

Arguments

X

Atomic vector, data.frame, data.table or list on which FUN is calculated over a sliding window. How X is handled depends on the by.column argument. Vectorized input is supported: for by.column=TRUE it needs to be a data.table, data.frame or list, and for by.column=FALSE a list of data.frames/data.tables, but not a list of lists.

N

Integer, non-negative, rolling window size. This is the total number of values included in the aggregate function. For an adaptive rolling function the window size has to be provided as a vector, with one element for each individual value of X. Vectorized input is supported: N then needs to be a vector or, for adaptive rolling, a list of vectors.

FUN

The function to be applied on subsets of X.

...

Extra arguments passed to FUN. Note that arguments passed to ... cannot have the same names as arguments of frollapply.

by.column

Logical. When TRUE (default), X of type list/data.frame/data.table is treated as vectorized input rather than as an object to apply the rolling window on. Setting it to FALSE allows the rolling window to be applied on multiple variables, using a data.frame, data.table or list as a whole. For details see the by.column argument section below.

fill

An object; value to pad by. Defaults to NA. When partial=TRUE this argument is ignored.

align

Character, specifying the "alignment" of the rolling window, defaulting to "right". For details see froll.

adaptive

Logical, default FALSE. Should the rolling function be calculated adaptively? For details see froll.

partial

Logical, default FALSE. Should the rolling window size(s) provided in N be trimmed to available observations? For details see froll.

give.names

Logical, default FALSE. When TRUE, names are automatically generated corresponding to the names of X and the names of N. If the answer is an atomic vector, the argument is ignored; see examples.

simplify

Logical or a function. When TRUE (default) then internal simplifylist function is applied on a list storing results of all computations. When FALSE then list is returned without any post-processing. Argument can take a function as well, then the function is applied to a list that would have been returned when simplify=FALSE. If results are not automatically simplified when simplify=TRUE then, for backward compatibility, one should use simplify=FALSE explicitly. See simplify argument section below for details.

x

Deprecated, use X instead.

n

Deprecated, use N instead.

Value

Argument simplify impacts the type returned. Its default value TRUE is set for convenience and backward compatibility, but it is advised to use simplify=unlist (or other desired function) instead.

by.column argument

Setting by.column to FALSE allows a function to be applied on multiple variables rather than on a single vector. X is then expected to be a data.frame, a data.table or a list of equal-length vectors, and the window size provided in N refers to the number of rows (or the length of the vectors in a list). See examples for use cases. The error "incorrect number of dimensions" is commonly observed when by.column was not set to FALSE while FUN expects its input to be a data.frame/data.table.
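
As an illustration of the paragraph above, here is a minimal sketch of a rolling weighted mean that needs two columns at once. The column names price and volume and the helper vwap are made up for this example, and the sketch assumes the argument names documented on this page.

```r
library(data.table)

# Illustrative data; 'price' and 'volume' are hypothetical column names.
dt = data.table(price = c(10, 11, 12, 13, 14), volume = c(1, 2, 1, 2, 1))

# FUN receives the whole 3-row window, so it can combine both columns.
vwap = function(w) weighted.mean(w[["price"]], w[["volume"]])

# N = 3 refers to the number of rows because by.column=FALSE.
res = frollapply(dt, 3, vwap, by.column=FALSE)
res
```

The first two positions have no complete window and are filled; the three complete windows work out to weighted means of 11, 12 and 13. With the default by.column=TRUE the same call would instead roll over price and volume separately, and vwap would fail because it would receive a plain vector.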

simplify argument

When set to TRUE, the default, results from the rolling function, which are normally stored in a list, may be simplified with either unlist or rbindlist. It also attempts to match the type, size and names of the fill argument to the results of the function. One should avoid simplify=TRUE when writing robust code. One reason is performance, as explained in the Performance consideration section below. Another is backward compatibility. For both reasons one should always provide the desired function to simplify explicitly. In a future version we may change the internal simplifylist function; simplify=TRUE may then return an object of a different type, breaking downstream code.
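
A minimal sketch of the explicit alternatives described above, assuming the argument names documented on this page. The variable names are illustrative only.

```r
library(data.table)

x = c(1, 2, 3, 4, 5)

# No post-processing: one list element per position of x, fill included.
as_list = frollapply(x, 3, sum, simplify=FALSE)

# Explicit simplification: the supplied function is applied to that list.
as_vec = frollapply(x, 3, sum, simplify=unlist)

str(as_list)
as_vec
```

Passing unlist (or any other function) makes the returned type a decision of the calling code rather than of the internal simplifylist function.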

Caveats

With great power comes great responsibility.

  1. An optimization used to avoid repeated allocation of window subsets (explained in more detail in the Implementation section below) may, in special cases, return rather surprising results:

    setDTthreads(1)
    frollapply(c(1, 9), N=1L, FUN=identity) ## unexpected
    #[1] 9 9
    frollapply(c(1, 9), N=1L, FUN=list) ## unexpected
    #      V1
    #   <num>
    #1:     9
    #2:     9
    setDTthreads(2, throttle=1) ## disable throttle
    frollapply(c(1, 9), N=1L, FUN=identity) ## good only because threads >= input
    #[1] 1 9                                ## on Linux and macOS
    frollapply(c(1, 5, 9), N=1L, FUN=identity) ## unexpected again
    #[1] 5 5 9
    

    The problem occurs, in scenarios that are rather unlikely for rolling computations, when the object returned from a function can be its own input (e.g. identity) or a reference to it (e.g. list). In such cases one has to add an extra copy call:

    setDTthreads(1)
    frollapply(c(1, 9), N=1L, FUN=function(x) copy(identity(x))) ## only 'copy' would be equivalent here
    #[1] 1 9
    frollapply(c(1, 9), N=1L, FUN=function(x) copy(list(x)))
    #      V1
    #   <num>
    #1:     1
    #2:     9
    
  2. FUN calls are internally passed to parallel::mcparallel to evaluate them in parallel. We inherit a few limitations from the parallel package, explained below. This optimization can be disabled completely by calling setDTthreads(1); the limitations listed below then do not apply, because all iterations of FUN evaluation are made sequentially without use of the parallel package. Note that on Windows this optimization is always disabled due to the lack of fork, which the parallel package relies on. One can use options(datatable.verbose=TRUE) to get extra information on whether frollapply is running multithreaded or not.

    • Warnings produced inside the function are silently ignored; for consistency we ignore warnings also when running single threaded path.

    • FUN should not use any on-screen devices, GUI elements, tcltk, multithreaded libraries. Note that setDTthreads(1L) is passed to forked processes, therefore any data.table code inside FUN will be forced to be single threaded. It is advised to not call setDTthreads inside FUN. frollapply is already parallelized and nested parallelism is rarely a good idea.

    • Any operation that could misbehave when run in parallel has to be handled, for example writing to the same file from multiple CPU threads.

      old = setDTthreads(1L)
      frollapply(iris, 5L, by.column=FALSE, FUN=fwrite, file="rolling-data.csv", append=TRUE)
      setDTthreads(old)
      
    • Objects returned from forked processes, i.e. from FUN, are serialized. This may cause problems for objects that are not meant to be serialized, like data.table. We handle that for the data.table class internally in frollapply whenever FUN returns a data.table (which is checked on the result of the first FUN call, so it assumes the function is type stable). If a data.table is nested in another object returned from FUN then the problem may still manifest; in such a case one has to call setDT on objects returned from FUN. This can also be handled nicely via the simplify argument, by passing a function that calls setDT on nested data.table objects returned from FUN. In any case, returning a data.table from FUN should, in the majority of cases, be avoided for performance reasons, see the UDF optimization section for details.

      setDTthreads(2, throttle=1) ## disable throttle
      ## frollapply will fix DT in most cases
      ans = frollapply(1:2, 2, data.table)
      .selfref.ok(ans)
      #[1] TRUE
      ans = frollapply(1:2, 2, data.table, simplify=FALSE)
      .selfref.ok(ans[[2L]])
      #[1] TRUE
      
      ## nested DT not fixed
      ans = frollapply(1:2, 2, function(x) list(data.table(x)), fill=list(data.table(NA)), simplify=FALSE)
      .selfref.ok(ans[[2L]][[1L]])
      #[1] FALSE
      #### now if we want to use it
      set(ans[[2L]][[1L]],, "newcol", 1L)
      #Error in set(ans[[2L]][[1L]], , "newcol", 1L) :
      #  This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.
      #### fix as explained in error message
      ans = lapply(ans, lapply, setDT)
      .selfref.ok(ans[[2L]][[1L]])
      #[1] TRUE
      
      ## fix inside frollapply via simplify
      simplifix = function(x) lapply(x, lapply, setDT)
      ans = frollapply(1:2, 2, function(x) list(data.table(x)), fill=list(data.table(NA)), simplify=simplifix)
      .selfref.ok(ans[[2L]][[1L]])
      #[1] TRUE
      
      ## automatic fix may not work for a non-type stable function
      f = function(x) (if (x[1L]==1L) data.frame else data.table)(x)
      ans = frollapply(1:3, 2, f, fill=data.table(NA), simplify=FALSE)
      .selfref.ok(ans[[3L]])
      #[1] FALSE
      #### fix inside frollapply via simplify
      simplifix = function(x) lapply(x, function(y) if (is.data.table(y)) setDT(y) else y)
      ans = frollapply(1:3, 2, f, fill=data.table(NA), simplify=simplifix)
      .selfref.ok(ans[[3L]])
      #[1] TRUE
      
      setDTthreads(2, throttle=1024) ## enable throttle
      
  3. Due to possible future improvements in how results returned from the rolling function are simplified, the default simplify=TRUE may not be backward compatible for functions whose results are not already simplified automatically. See the simplify argument section for details.
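
Regarding the limitation in item 2 that warnings raised inside FUN are ignored: one possible workaround, sketched here under the assumption that an NA result is an acceptable marker, is to catch the warning inside FUN itself. The helper name checked is hypothetical.

```r
library(data.table)

x = c(1, 2, -1, 4, 5)

# log() of a negative number warns and returns NaN.
# The handler runs inside FUN, so the condition is turned into an
# explicit NA before frollapply has a chance to discard the warning.
checked = function(w) tryCatch(mean(log(w)), warning = function(cond) NA_real_)

res = frollapply(x, 2, checked)
res
```

Every window that contains the negative value comes back as NA, so the problem stays visible in the result instead of disappearing silently.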

Performance consideration

frollapply is meant to run any UDF. If one needs a common function like mean, sum, max, etc., then one should use the highly optimized rolling functions, implemented in C, that are described in the froll manual.
The most crucial optimizations are the ones applied to the UDF itself. Those are discussed in the UDF optimization section below.
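
A small sketch of the point above: for a common function such as mean, the dedicated frollmean should give the same answer as the generic frollapply, so there is no need to pay for UDF evaluation. Variable names are illustrative.

```r
library(data.table)

x = c(3, 1, 4, 1, 5, 9, 2, 6)

via_udf = frollapply(x, 3, mean)  # evaluates the R function once per window
via_c   = frollmean(x, 3)         # dedicated rolling mean from froll

# Both should agree up to floating point tolerance.
all.equal(as.numeric(via_udf), via_c)
```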

UDF optimization

FUN will be evaluated many times, so it should be highly optimized. Tips below are not specific to frollapply and can be applied to any code that is meant to run in many iterations.
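
As one illustration of keeping FUN lightweight, the sketch below replaces a convenient high-level call with a lower-level one in a rolling regression. This is general R advice and an assumption of this example, not a tip taken from this page; f_lm and f_fit are hypothetical helper names.

```r
library(data.table)

set.seed(42)
x = data.table(v1 = rnorm(30), v2 = rnorm(30))

# Convenient but heavy: formula handling, model frame, full "lm" object.
f_lm = function(w) coef(lm(v2 ~ v1, data = w))

# Leaner: call the fitting routine directly on a design matrix
# (a column of ones for the intercept, then the regressor).
f_fit = function(w) .lm.fit(cbind(1, w[["v1"]]), w[["v2"]])$coefficients

a = frollapply(x, 5, f_lm, by.column=FALSE, simplify=FALSE)
b = frollapply(x, 5, f_fit, by.column=FALSE, simplify=FALSE)

# Same coefficients for the last window, apart from the names.
all.equal(unname(a[[30L]]), b[[30L]])
```

Whether the leaner version is noticeably faster depends on the window size and the data, so it is worth measuring on a realistic input before committing to it.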

Implementation

Evaluation of a UDF leaves very limited room for optimization, therefore speed improvements in frollapply should not be expected to be as good as in other data.table fast functions. frollapply is implemented almost exclusively in R, rather than C. Its speed improvement comes from two optimizations that have been applied:

  1. No repeated allocation of a rolling window subset.
    An object (of the type of X and the size of N) is allocated once (for each CPU thread), and on each iteration this object is re-used by copying the expected subset of data into it. This means we still have to subset the data on each iteration, but we only copy data into the pre-allocated window object instead of allocating a new one each time. Allocation carries a much bigger overhead than copying. The faster FUN evaluates, the bigger the relative speedup, because the cost of allocating a subset does not depend on how fast or slow FUN evaluates. See the caveats section for possible edge cases caused by this optimization.

  2. Parallel evaluation of FUN calls.
    Until now (September 2025) all the multithreaded code in data.table was using OpenMP. It can be used only in the C language and it has very low overhead. Unfortunately it could not be applied in frollapply, because to evaluate a UDF from C code one has to call R's C API, which is not thread safe (it can be run only from single threaded C code). Therefore frollapply uses the parallel package to provide parallelism at the R language level. It uses fork parallelism, which has low overhead as well (unless results of the computation are big in size, which is not an issue for rolling statistics). Fork is not available on Windows. See the caveats section for limitations caused by using this optimization.
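
For reference, the semantics that the first optimization preserves can be written as a naive loop that allocates a fresh subset on every iteration. The helper naive_roll below is hypothetical, handles only the default right alignment, and does not reflect how frollapply is written internally.

```r
library(data.table)

# Naive reference: x[(i - n + 1):i] allocates a new vector on each iteration,
# which is the cost frollapply avoids by re-using one window object.
naive_roll = function(x, n, FUN, fill = NA_real_) {
  vapply(seq_along(x), function(i) {
    if (i < n) fill else FUN(x[(i - n + 1L):i])
  }, numeric(1L))
}

x = c(2, 4, 6, 8, 10)
ref = naive_roll(x, 3, sum)
res = frollapply(x, 3, sum)
all.equal(ref, as.numeric(res))
```

Because the naive version hands FUN a new vector each time, it is not affected by the aliasing caveat described for identity and list.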

Note

Be aware that rolling functions operate on the physical order of the input. If the intent is to roll values in a vector by a logical window, for example an hour or a day, then one has to ensure that there are no gaps in the input, or use an adaptive rolling function to handle gaps. For the latter we provide the helper function frolladapt, which generates adaptive window sizes.
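
The difference can be sketched on a small series with a gap. The data below is made up, and the window sizes are computed by hand only to show the idea; this is the job frolladapt is provided for, and its treatment of incomplete windows at the start of the series may differ from this hand-rolled version.

```r
library(data.table)

# Hypothetical daily series with day 4 missing.
day   = c(1L, 2L, 3L, 5L, 6L)
value = c(10, 20, 30, 40, 50)

# Fixed window of 3 rows: ignores the gap, so the last window
# covers days 3, 5 and 6 rather than a 3-day span.
fixed = frollapply(value, 3, sum)

# Hand-computed window sizes: number of observations within the last 3 days.
w = vapply(day, function(d) sum(day > d - 3L & day <= d), integer(1L))
adapted = frollapply(value, w, sum, adaptive=TRUE)

w
fixed
adapted
```

With the adaptive window the value for day 6 sums only days 5 and 6, whereas the fixed window also pulls in day 3.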

See Also

froll, frolladapt, shift, data.table, setDTthreads

Examples

frollapply(1:16, 4, median)
frollapply(1:9, 3, toString)

## vectorized input
x = list(1:10, 10:1)
n = c(3, 4)
frollapply(x, n, sum)
## give names
x = list(data1 = 1:10, data2 = 10:1)
n = c(small = 3, big = 4)
frollapply(x, n, sum, give.names=TRUE)

## by.column=FALSE
x = as.data.table(iris)
flow = function(x) {
  v1 = x[[1L]]
  v2 = x[[2L]]
  (v1[2L] - v1[1L] * (1+v2[2L])) / v1[1L]
}
x[, "flow" := frollapply(.(Sepal.Length, Sepal.Width), 2L, flow, by.column=FALSE),
  by = Species][]

## rolling regression: by.column=FALSE
f = function(x) coef(lm(v2 ~ v1, data=x))
x = data.table(v1=rnorm(120), v2=rnorm(120))
frollapply(x, 4, f, by.column=FALSE)

[Package data.table version 1.17.99 Index]