ASOF Joins, OLS Regression, and extra summarizers

April 29, 2025

Since sparklyr.flint, a sparklyr extension for leveraging Flint time series functionalities through sparklyr, was introduced in September, we have made a number of enhancements to it, and have successfully submitted sparklyr.flint 0.2 to CRAN.

In this blog post, we highlight the following new features and improvements from sparklyr.flint 0.2: ASOF joins, OLS regression, extra summarizers, and better integration with sparklyr.

ASOF Joins

For those unfamiliar with the term, ASOF joins are temporal join operations based on inexact matching of timestamps. Within the context of Apache Spark, a join operation, loosely speaking, matches records from two data frames (let’s call them left and right) based on some criteria. A temporal join implies matching records in left and right based on timestamps, and with inexact matching of timestamps permitted, it is typically useful to join left and right along one of the following temporal directions:

Looking behind: if a record from left has timestamp t, then it gets matched with ones from right having the most recent timestamp less than or equal to t.
Looking ahead: if a record from left has timestamp t, then it gets matched with ones from right having the smallest timestamp greater than or equal to (or alternatively, strictly greater than) t.

However, oftentimes it is not useful to consider two timestamps as “matching” if they are too far apart. Therefore, an additional constraint on the maximum amount of time to look behind or look ahead is usually also part of an ASOF join operation.
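To make the look-behind rule and the tolerance constraint concrete, here is a minimal sketch in plain R that needs no Spark connection. The helper `asof_lookbehind()` and the toy data are hypothetical and exist only for illustration; they are not part of sparklyr.flint, and the sketch keeps a single match per left record rather than handling several right records that share a timestamp.

```r
# Plain-R sketch of a look-behind ASOF join with a tolerance.
# `asof_lookbehind()` is a hypothetical helper, not a sparklyr.flint function.
asof_lookbehind <- function(left, right, tol) {
  # For each left timestamp t, pick the right row with the most recent
  # timestamp in [t - tol, t]; use NA when no right row qualifies.
  idx <- vapply(left$t, function(t) {
    ok <- which(right$t <= t & right$t >= t - tol)
    if (length(ok) == 0L) NA_integer_ else ok[which.max(right$t[ok])]
  }, integer(1))
  # Left outer join: every left row is kept, unmatched rows get NA in v.
  cbind(left, v = right$v[idx])
}

# Toy data (assumed for illustration only).
left  <- data.frame(t = c(1, 5, 10), u = c("a", "b", "c"))
right <- data.frame(t = c(2, 4, 9.5), v = c(20, 40, 95))

res <- asof_lookbehind(left, right, tol = 1)
print(res)
```

In this toy data the first left record has no right record at or before its timestamp, so its `v` is missing, while the other two pick up the closest earlier right record within one time unit.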

In sparklyr.flint 0.2, all ASOF join functionalities of Flint are accessible via the asof_join() method. For example, given 2 timeseries RDDs left and right:

library(sparklyr)
library(sparklyr.flint)

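# Connection settings are an assumption; adjust `master` for your cluster.
sc <- spark_connect(master = "local")
# Input values below are inferred from the printed join results.
left <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10)), name = "left") %>%
  from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
right <- copy_to(sc, tibble::tibble(t = seq(10) + 1, v = seq(10) + 1L), name = "right") %>%
  from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")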

The following prints the result of matching each record from left with the most recent record(s) from right that are at most 1 second behind.

print(asof_join(left, right, tol = "1s", direction = ">=") %>% to_sdf())

## # Source: spark<?> [?? x 3]
##    time                    u     v
##    <dttm>              <int> <int>
##  1 1970-01-01 00:00:01     1    NA
##  2 1970-01-01 00:00:02     2     2
##  3 1970-01-01 00:00:03     3     3
##  4 1970-01-01 00:00:04     4     4
##  5 1970-01-01 00:00:05     5     5
##  6 1970-01-01 00:00:06     6     6
##  7 1970-01-01 00:00:07     7     7
##  8 1970-01-01 00:00:08     8     8
##  9 1970-01-01 00:00:09     9     9
## 10 1970-01-01 00:00:10    10    10

Whereas if we change the temporal direction to “<”, then each record from left will be matched with any record(s) from right that is strictly in the future and is at most 1 second ahead of the current record from left:

print(asof_join(left, right, tol = "1s", direction = "<") %>% to_sdf())

## # Source: spark<?> [?? x 3]
##    time                    u     v
##    <dttm>              <int> <int>
##  1 1970-01-01 00:00:01     1     2
##  2 1970-01-01 00:00:02     2     3
##  3 1970-01-01 00:00:03     3     4
##  4 1970-01-01 00:00:04     4     5
##  5 1970-01-01 00:00:05     5     6
##  6 1970-01-01 00:00:06     6     7
##  7 1970-01-01 00:00:07     7     8
##  8 1970-01-01 00:00:08     8     9
##  9 1970-01-01 00:00:09     9    10
## 10 1970-01-01 00:00:10    10    11

Notice that regardless of which temporal direction is selected, a left outer join is always performed (i.e., all timestamp values and u values of left from above will always be present in the output, and the v column in the output will contain NA whenever there is no record from right that meets the matching criteria).

OLS Regression

You might be wondering whether the version of this functionality in Flint is more or less identical to lm() in R. It turns out it has much more to offer than lm() does. An OLS regression in Flint will compute useful metrics such as the Akaike information criterion and the Bayesian information criterion, both of which are useful for model selection purposes, and the calculations of both are parallelized by Flint to fully utilize the computational power available in a Spark cluster.

In addition, Flint supports ignoring regressors that are constant or nearly constant, which becomes useful when an intercept term is included. To see why this is the case, we need to briefly examine the goal of the OLS regression, which is to find some column vector of coefficients \(\mathbf{\beta}\) that minimizes \(\|\mathbf{y} - \mathbf{X} \mathbf{\beta}\|^2\), where \(\mathbf{y}\) is the column vector of response variables, and \(\mathbf{X}\) is a matrix consisting of columns of regressors plus an entire column of \(1\)s representing the intercept terms. The solution to this problem is \(\mathbf{\beta} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}\), assuming the Gram matrix \(\mathbf{X}^\intercal\mathbf{X}\) is non-singular.

However, if \(\mathbf{X}\) contains a column of all \(1\)s for the intercept terms, and another column formed by a regressor that is constant (or nearly so), then the columns of \(\mathbf{X}\) will be linearly dependent (or nearly so) and \(\mathbf{X}^\intercal\mathbf{X}\) will be singular (or nearly so), which presents an issue computation-wise. But if a regressor is constant, then it essentially plays the same role as the intercept terms do. So simply excluding such a constant regressor from \(\mathbf{X}\) solves the problem.

Also, speaking of inverting the Gram matrix, readers who remember the concept of “condition number” from numerical analysis may be thinking to themselves that computing \(\mathbf{\beta} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}\) could be numerically unstable if \(\mathbf{X}^\intercal\mathbf{X}\) has a large condition number. This is why Flint also outputs the condition number of the Gram matrix in the OLS regression result, so that one can sanity-check that the underlying quadratic minimization problem being solved is well-conditioned.
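The normal-equation solution and the conditioning concern discussed above can be reproduced on a single machine with base R. The following is only a sketch under stated assumptions: it uses the built-in mtcars data with mpg regressed on hp and wt, a made-up constant column named `const`, and base R functions (`crossprod()`, `solve()`, `kappa()`, `qr()`), not Flint's own implementation. It does not claim that `kappa()` reproduces the exact condition number Flint reports.

```r
# Base-R sketch (no Spark needed) of the OLS normal equations and of the
# rank deficiency caused by a constant regressor next to an intercept.
y <- mtcars$mpg
X <- cbind(intercept = 1, hp = mtcars$hp, wt = mtcars$wt)

gram <- crossprod(X)                   # Gram matrix X^T X
beta <- solve(gram, crossprod(X, y))   # solves (X^T X) beta = X^T y
cond <- kappa(gram, exact = TRUE)      # 2-norm condition number of X^T X

# A constant regressor is a multiple of the intercept column, so the
# columns of X_bad are linearly dependent and X_bad^T X_bad is singular.
X_bad <- cbind(X, const = 5)           # `const` is a hypothetical regressor
rank_bad <- qr(X_bad)$rank             # smaller than ncol(X_bad)

print(beta)
print(cond)
print(c(rank = rank_bad, columns = ncol(X_bad)))
```

Dropping the `const` column restores full column rank, which is the same remedy the text describes for constant regressors.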

So, to summarize, the OLS regression functionality implemented in Flint not only outputs the solution to the problem, but also calculates useful metrics that help data scientists assess the sanity and predictive quality of the resulting model.

To see OLS regression in action with sparklyr.flint, one can run the following example:

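mtcars_sdf <- copy_to(sc, mtcars, overwrite = TRUE) %>%
  dplyr::mutate(time = 0L)
mtcars_ts <- from_sdf(mtcars_sdf, is_sorted = TRUE, time_unit = "SECONDS")
# Formula inferred from the coefficients printed further down (hp and wt).
model <- ols_regression(mtcars_ts, mpg ~ hp + wt) %>% to_sdf()

print(model %>% dplyr::select(akaikeIC, bayesIC, cond))

## # Source: spark<?> [?? x 3]
##   akaikeIC bayesIC    cond
##      <dbl>   <dbl>   <dbl>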
## 1     155.    159. 345403.

# ^ output says condition number of the Gram matrix was within reason

and obtain \(\mathbf{\beta}\), the vector of optimal coefficients, with the following:

print(model %>% dplyr::pull(beta))

## [[1]]
## [1] -0.03177295 -3.87783074

Extra Summarizers

The EWMA (Exponential Weighted Moving Average), EMA half-life, and the standardized moment summarizers (namely, skewness and kurtosis), along with a few others that were missing in sparklyr.flint 0.1, are now fully supported in sparklyr.flint 0.2.

Better Integration With `sparklyr`

While sparklyr.flint 0.1 included a collect() method for exporting data from a Flint time-series RDD to an R data frame, it did not have a similar method for extracting the underlying Spark data frame from a Flint time-series RDD. This was clearly an oversight. In sparklyr.flint 0.2, one can call to_sdf() on a timeseries RDD to get back a Spark data frame that is usable in sparklyr (e.g., as shown by the model %>% dplyr::select(...) examples from above, where model is the result of to_sdf()). One can also get to the underlying Spark data frame JVM object reference by calling spark_dataframe() on a Flint time-series RDD (this is usually unnecessary in the vast majority of sparklyr use cases though).

Conclusion

We have presented a number of new features and improvements introduced in sparklyr.flint 0.2 and deep-dived into some of them in this blog post. We hope you are as excited about them as we are.

Thanks for reading!

Acknowledgement

The author would like to thank Mara (@batpigandme), Sigrid (@skeydan), and Javier (@javierluraschi) for their fantastic editorial inputs on this blog post!
