higher dplyr interface, extra sdf_* capabilities, and RDS-based serialization routines

By Jules Jackson

April 28, 2025

0

176

higher dplyr interface, extra sdf_* capabilities, and RDS-based serialization routines

We’re thrilled to announce sparklyr 1.5 is now
out there on CRAN!

To put in sparklyr 1.5 from CRAN, run

On this weblog submit, we’ll spotlight the next elements of sparklyr 1.5:

Higher dplyr interface

A big fraction of pull requests that went into the sparklyr 1.5 launch have been centered on making
Spark dataframes work with varied dplyr verbs in the identical means that R dataframes do.
The total listing of dplyr-related bugs and have requests that have been resolved in
sparklyr 1.5 will be present in right here.

On this part, we’ll showcase three new dplyr functionalities that have been shipped with sparklyr 1.5.

Stratified sampling

Stratified sampling on an R dataframe will be completed with a mix of dplyr::group_by() adopted by
dplyr::sample_n() or dplyr::sample_frac(), the place the grouping variables specified within the dplyr::group_by()
step are those that outline every stratum. For example, the next question will group mtcars by quantity
of cylinders and return a weighted random pattern of measurement two from every group, with out substitute, and weighted by
the mpg column:

## # A tibble: 6 x 11
## # Teams:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##             
## 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 2  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
## 3  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
## 4  21       6 160     110  3.9   2.62  16.5     0     1     4     4
## 5  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
## 6  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2

Ranging from sparklyr 1.5, the identical will also be completed for Spark dataframes with Spark 3.0 or above, e.g.,:

library(sparklyr)

sc  spark_connect(grasp = "native", model = "3.0.0")
mtcars_sdf  copy_to(sc, mtcars, change = TRUE, repartition = 3)

mtcars_sdf %>%
  dplyr::group_by(cyl) %>%
  dplyr::sample_n(measurement = 2, weight = mpg, change = FALSE) %>%
  print()

# Supply: spark> [?? x 11]
# Teams: cyl
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
            
1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
3  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
5  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
6  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2

or

## # Supply: spark> [?? x 11]
## # Teams: cyl
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##             
## 1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
## 2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
## 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
## 4  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 5  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
## 6  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
## 7  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
## 8  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3

Row sums

The rowSums() performance provided by dplyr is useful when one must sum up
numerous columns inside an R dataframe which can be impractical to be enumerated
individually.
For instance, right here we now have a six-column dataframe of random actual numbers, the place the
partial_sum column within the end result accommodates the sum of columns b via d inside
every row:

## # A tibble: 5 x 7
##         a     b     c      d     e      f partial_sum
##                   
## 1 0.781   0.801 0.157 0.0293 0.169 0.0978        1.16
## 2 0.696   0.412 0.221 0.941  0.697 0.675         2.27
## 3 0.802   0.410 0.516 0.923  0.190 0.904         2.04
## 4 0.200   0.590 0.755 0.494  0.273 0.807         2.11
## 5 0.00149 0.711 0.286 0.297  0.107 0.425         1.40

Starting with sparklyr 1.5, the identical operation will be carried out with Spark dataframes:

## # Supply: spark> [?? x 7]
##         a     b     c      d     e      f partial_sum
##                   
## 1 0.781   0.801 0.157 0.0293 0.169 0.0978        1.16
## 2 0.696   0.412 0.221 0.941  0.697 0.675         2.27
## 3 0.802   0.410 0.516 0.923  0.190 0.904         2.04
## 4 0.200   0.590 0.755 0.494  0.273 0.807         2.11
## 5 0.00149 0.711 0.286 0.297  0.107 0.425         1.40

As a bonus from implementing the rowSums function for Spark dataframes,
sparklyr 1.5 now additionally affords restricted assist for the column-subsetting
operator on Spark dataframes.
For instance, all code snippets under will return some subset of columns from
the dataframe named sdf:

# choose columns `b` via `e`
sdf[2:5]

# choose columns `b` and `c`
sdf[c("b", "c")]

# drop the primary and third columns and return the remaining
sdf[c(-1, -3)]

Weighted-mean summarizer

Much like the 2 dplyr capabilities talked about above, the weighted.imply() summarizer is one other
helpful perform that has change into a part of the dplyr interface for Spark dataframes in sparklyr 1.5.
One can see it in motion by, for instance, evaluating the output from the next

with output from the equal operation on mtcars in R:

each of them ought to consider to the next:

##     cyl mpg_wm
##     
## 1     4   25.9
## 2     6   19.6
## 3     8   14.8

New additions to the `sdf_*` household of capabilities

sparklyr gives numerous comfort capabilities for working with Spark dataframes,
and all of them have names beginning with the sdf_ prefix.

On this part we’ll briefly point out 4 new additions
and present some instance eventualities through which these capabilities are helpful.

`sdf_expand_grid()`

Because the identify suggests, sdf_expand_grid() is solely the Spark equal of broaden.grid().
Fairly than working broaden.grid() in R and importing the ensuing R dataframe to Spark, one
can now run sdf_expand_grid(), which accepts each R vectors and Spark dataframes and helps
hints for broadcast hash joins. The instance under reveals sdf_expand_grid() making a
100-by-100-by-10-by-10 grid in Spark over 1000 Spark partitions, with broadcast hash be a part of hints
on variables with small cardinalities:

library(sparklyr)

sc  spark_connect(grasp = "native")

grid_sdf  sdf_expand_grid(
  sc,
  var1 = seq(100),
  var2 = seq(100),
  var3 = seq(10),
  var4 = seq(10),
  broadcast_vars = c(var3, var4),
  repartition = 1000
)

grid_sdf %>% sdf_nrow() %>% print()

## [1] 1e+06

`sdf_partition_sizes()`

As sparklyr consumer @sbottelli advised right here,
one factor that may be nice to have in sparklyr is an environment friendly technique to question partition sizes of a Spark dataframe.
In sparklyr 1.5, sdf_partition_sizes() does precisely that:

library(sparklyr)

sc  spark_connect(grasp = "native")

sdf_len(sc, 1000, repartition = 5) %>%
  sdf_partition_sizes() %>%
  print(row.names = FALSE)

##  partition_index partition_size
##                0            200
##                1            200
##                2            200
##                3            200
##                4            200

`sdf_unnest_longer()` and `sdf_unnest_wider()`

sdf_unnest_longer() and sdf_unnest_wider() are the equivalents of
tidyr::unnest_longer() and tidyr::unnest_wider() for Spark dataframes.
sdf_unnest_longer() expands all components in a struct column into a number of rows, and
sdf_unnest_wider() expands them into a number of columns. As illustrated with an instance
dataframe under,

library(sparklyr)

sc  spark_connect(grasp = "native")
sdf  copy_to(
  sc,
  tibble::tibble(
    id = seq(3),
    attribute = listing(
      listing(identify = "Alice", grade = "A"),
      listing(identify = "Bob", grade = "B"),
      listing(identify = "Carol", grade = "C")
    )
  )
)

sdf %>%
  sdf_unnest_longer(col = document, indices_to = "key", values_to = "worth") %>%
  print()

evaluates to

## # Supply: spark> [?? x 3]
##      id worth key
##     
## 1     1 A     grade
## 2     1 Alice identify
## 3     2 B     grade
## 4     2 Bob   identify
## 5     3 C     grade
## 6     3 Carol identify

whereas

sdf %>%
  sdf_unnest_wider(col = document) %>%
  print()

evaluates to

## # Supply: spark> [?? x 3]
##      id grade identify
##     
## 1     1 A     Alice
## 2     2 B     Bob
## 3     3 C     Carol

RDS-based serialization routines

Some readers should be questioning why a model new serialization format would have to be carried out in sparklyr in any respect.
Lengthy story quick, the reason being that RDS serialization is a strictly higher substitute for its CSV predecessor.
It possesses all fascinating attributes the CSV format has,
whereas avoiding a variety of disadvantages which can be frequent amongst text-based knowledge codecs.

On this part, we’ll briefly define why sparklyr ought to assist at the least one serialization format apart from arrow,
deep-dive into points with CSV-based serialization,
after which present how the brand new RDS-based serialization is free from these points.

Why `arrow` is just not for everybody?

To switch knowledge between Spark and R accurately and effectively, sparklyr should depend on some knowledge serialization
format that’s well-supported by each Spark and R.
Sadly, not many serialization codecs fulfill this requirement,
and among the many ones that do are text-based codecs corresponding to CSV and JSON,
and binary codecs corresponding to Apache Arrow, Protobuf, and as of latest, a small subset of RDS model 2.
Additional complicating the matter is the extra consideration that
sparklyr ought to assist at the least one serialization format whose implementation will be absolutely self-contained throughout the sparklyr code base,
i.e., such serialization shouldn’t depend upon any exterior R bundle or system library,
in order that it may accommodate customers who wish to use sparklyr however who don’t essentially have the required C++ compiler software chain and
different system dependencies for establishing R packages corresponding to arrow or
protolite.
Previous to sparklyr 1.5, CSV-based serialization was the default various to fallback to when customers would not have the arrow bundle put in or
when the kind of knowledge being transported from R to Spark is unsupported by the model of arrow out there.

Why is the CSV format not perfect?

There are at the least three causes to consider CSV format is just not your best option on the subject of exporting knowledge from R to Spark.

One cause is effectivity. For instance, a double-precision floating level quantity corresponding to .Machine$double.eps must
be expressed as "2.22044604925031e-16" in CSV format in an effort to not incur any lack of precision, thus taking over 20 bytes
somewhat than 8 bytes.

However extra vital than effectivity are correctness considerations. In a R dataframe, one can retailer each NA_real_ and
NaN in a column of floating level numbers. NA_real_ ought to ideally translate to null inside a Spark dataframe, whereas
NaN ought to proceed to be NaN when transported from R to Spark. Sadly, NA_real_ in R turns into indistinguishable
from NaN as soon as serialized in CSV format, as evident from a fast demo proven under:

##     x is_nan
## 1  NA  FALSE
## 2 NaN   TRUE

csv_file  "/tmp/knowledge.csv"
write.csv(original_df, file = csv_file, row.names = FALSE)
deserialized_df  learn.csv(csv_file)
deserialized_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()

##    x is_nan
## 1 NA  FALSE
## 2 NA  FALSE

One other correctness situation very a lot much like the one above was the truth that
"NA" and NA inside a string column of an R dataframe change into indistinguishable
as soon as serialized in CSV format, as accurately identified in
this Github situation
by @caewok and others.

RDS to the rescue!

RDS format is likely one of the most generally used binary codecs for serializing R objects.
It’s described in some element in chapter 1, part 8 of
this doc.
Amongst benefits of the RDS format are effectivity and accuracy: it has a fairly
environment friendly implementation in base R, and helps all R knowledge sorts.

Additionally value noticing is the truth that when an R dataframe containing solely knowledge sorts
with wise equivalents in Apache Spark (e.g., RAWSXP, LGLSXP, CHARSXP, REALSXP, and so on)
is saved utilizing RDS model 2,
(e.g., serialize(mtcars, connection = NULL, model = 2L, xdr = TRUE)),
solely a tiny subset of the RDS format will probably be concerned within the serialization course of,
and implementing deserialization routines in Scala able to decoding such a restricted
subset of RDS constructs is in reality a fairly easy and easy activity
(as proven in
right here
).

Final however not least, as a result of RDS is a binary format, it permits NA_character_, "NA",
NA_real_, and NaN to all be encoded in an unambiguous method, therefore permitting sparklyr
1.5 to keep away from all correctness points detailed above in non-arrow serialization use circumstances.

Different advantages of RDS serialization

Along with correctness ensures, RDS format additionally affords fairly a number of different benefits.

One benefit is in fact efficiency: for instance, importing a non-trivially-sized dataset
corresponding to nycflights13::flights from R to Spark utilizing the RDS format in sparklyr 1.5 is
roughly 40%-50% sooner in comparison with CSV-based serialization in sparklyr 1.4. The
present RDS-based implementation remains to be nowhere as quick as arrow-based serialization
although (arrow is about 3-4x sooner), so for performance-sensitive duties involving
heavy serialization, arrow ought to nonetheless be the best choice.

One other benefit is that with RDS serialization, sparklyr can import R dataframes containing
uncooked columns immediately into binary columns in Spark. Thus, use circumstances such because the one under
will work in sparklyr 1.5

Whereas most sparklyr customers in all probability received’t discover this functionality of importing binary columns
to Spark instantly helpful of their typical sparklyr::copy_to() or sparklyr::accumulate()
usages, it does play an important function in decreasing serialization overheads within the Spark-based
foreach parallel backend that
was first launched in sparklyr 1.2.
It is because Spark staff can immediately fetch the serialized R closures to be computed
from a binary Spark column as a substitute of extracting these serialized bytes from intermediate
representations corresponding to base64-encoded strings.
Equally, the R outcomes from executing employee closures will probably be immediately out there in RDS
format which will be effectively deserialized in R, somewhat than being delivered in different
much less environment friendly codecs.

Acknowledgement

In chronological order, we want to thank the next contributors for making their pull
requests a part of sparklyr 1.5:

We’d additionally like to specific our gratitude in direction of quite a few bug experiences and have requests for
sparklyr from a incredible open-source group.

Lastly, the writer of this weblog submit is indebted to
@javierluraschi,
@batpigandme,
and @skeydan for his or her beneficial editorial inputs.

If you happen to want to study extra about sparklyr, try sparklyr.ai,
spark.rstudio.com, and a number of the earlier launch posts corresponding to
sparklyr 1.4 and
sparklyr 1.3.

Thanks for studying!

Previous articleAuthor Palmyra X5 and X4 basis fashions at the moment are accessible in Amazon Bedrock

Next articleApple strikes to a fourth spherical of developer betas

higher dplyr interface, extra sdf_* capabilities, and RDS-based serialization routines

Higher dplyr interface

Stratified sampling

Row sums

Weighted-mean summarizer

New additions to the `sdf_*` household of capabilities

`sdf_expand_grid()`

`sdf_partition_sizes()`

`sdf_unnest_longer()` and `sdf_unnest_wider()`

RDS-based serialization routines

Why `arrow` is just not for everybody?

Why is the CSV format not perfect?

RDS to the rescue!

Different advantages of RDS serialization

Acknowledgement

An Implementation to Construct Dynamic AI Techniques with the Mannequin Context Protocol (MCP) for Actual-Time Useful resource and Instrument Integration

Microsoft AI Proposes BitNet Distillation (BitDistill): A Light-weight Pipeline that Delivers as much as 10x Reminiscence Financial savings and about 2.65x CPU Speedup

Weak-for-Robust (W4S): A Novel Reinforcement Studying Algorithm that Trains a weak Meta Agent to Design Agentic Workflows with Stronger LLMs

LEAVE A REPLY Cancel reply

Most Popular

Photo voltaic Beat Coal in US Electrical energy Combine for the First Time in Could

AURA Foresight Reaches International XPRIZE Wildfire Finals in Alaska

Methods to match the width of sheets in swiftUI to match the background?

Scientists Simply Found a Mobile Survival System That Was By no means Supposed To Exist – NanoApps Medical – Official web site

Recent Comments

ABOUT US

POPULAR POSTS

Photo voltaic Beat Coal in US Electrical energy Combine for the First Time in Could

AURA Foresight Reaches International XPRIZE Wildfire Finals in Alaska

Methods to match the width of sheets in swiftUI to match the background?

POPULAR CATEGORY

higher dplyr interface, extra sdf_* capabilities, and RDS-based serialization routines

Higher dplyr interface

Stratified sampling

Row sums

Weighted-mean summarizer

New additions to the sdf_* household of capabilities

sdf_expand_grid()

sdf_partition_sizes()

sdf_unnest_longer() and sdf_unnest_wider()

RDS-based serialization routines

Why arrow is just not for everybody?

Why is the CSV format not perfect?

RDS to the rescue!

Different advantages of RDS serialization

Acknowledgement

LEAVE A REPLY Cancel reply

Most Popular

Recent Comments

ABOUT US

POPULAR POSTS

POPULAR CATEGORY

New additions to the `sdf_*` household of capabilities

`sdf_expand_grid()`

`sdf_partition_sizes()`

`sdf_unnest_longer()` and `sdf_unnest_wider()`

Why `arrow` is just not for everybody?