sparklyr.sedona is now available as the sparklyr-based R interface for Apache Sedona.
To install sparklyr.sedona from GitHub using the remotes package, run
remotes::install_github(repo = "apache/incubator-sedona", subdir = "R/sparklyr.sedona")
In this blog post, we will provide a quick introduction to sparklyr.sedona,
outlining the motivation behind this sparklyr extension, and presenting some
example sparklyr.sedona use cases involving Spark spatial RDDs,
Spark dataframes, and visualizations.
Motivation for sparklyr.sedona
A suggestion from the mlverse survey results earlier this year mentioned the
need for up-to-date R interfaces for Spark-based GIS frameworks.
While looking into this suggestion, we learned about Apache Sedona, a
geospatial data system powered by Spark that is modern, efficient, and easy to
use. We also realized that while our friends from the Spark open-source
community had developed a sparklyr extension for GeoSpark, the predecessor of
Apache Sedona, there was no similar extension making more recent Sedona
functionalities easily accessible from R yet.
We therefore decided to work on sparklyr.sedona, which aims to bridge the gap
between Sedona and R.
The lay of the land
We hope you are ready for a quick tour through some of the RDD-based and
Spark-dataframe-based functionalities in sparklyr.sedona, and also, some
bedazzling visualizations derived from geospatial data in Spark.
In Apache Sedona, Spatial Resilient Distributed Datasets (SRDDs) are basic
building blocks of distributed spatial data encapsulating “vanilla” RDDs of
geometrical objects and indexes. SRDDs support low-level operations such as
Coordinate Reference System (CRS) transformations, spatial partitioning, and
spatial indexing. For example, with sparklyr.sedona, SRDD-based operations we
can perform include the following:
- Importing some external data source into a SRDD:
library(sparklyr)
library(sparklyr.sedona)
# Path to a local clone of the incubator-sedona repository
sedona_git_repo <- normalizePath("~/incubator-sedona")
data_dir <- file.path(sedona_git_repo, "core", "src", "test", "resources")
sc <- spark_connect(master = "local")
pt_rdd <- sedona_read_dsv_to_typed_rdd(
  sc,
  location = file.path(data_dir, "arealm.csv"),
  type = "point"
)
- Applying spatial partitioning to all data points:
sedona_apply_spatial_partitioner(pt_rdd, partitioner = "kdbtree")
- Building spatial index on each partition:
sedona_build_index(pt_rdd, type = "quadtree")
- Joining one spatial data set with another using “contain” or “overlap” as the join predicate:
polygon_rdd <- sedona_read_dsv_to_typed_rdd(
  sc,
  location = file.path(data_dir, "primaryroads-polygon.csv"),
  type = "polygon"
)
pts_per_region_rdd <- sedona_spatial_join_count_by_key(
  pt_rdd,
  polygon_rdd,
  join_type = "contain",
  partitioner = "kdbtree"
)
It is worth mentioning that sedona_spatial_join() will perform spatial
partitioning and indexing on the inputs using the partitioner and index_type
only if the inputs are not partitioned or indexed as specified already.
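As a minimal sketch of that behavior, assuming pt_rdd and polygon_rdd were created as in the snippets above, one can partition and index the points explicitly and then request the same partitioner and index type in the join. The name pairs_rdd is only illustrative, and the argument values are examples rather than recommendations.

```r
library(sparklyr)
library(sparklyr.sedona)

# Assumes `pt_rdd` and `polygon_rdd` already exist (see the snippets above).

# Partition and index the points explicitly up front.
sedona_apply_spatial_partitioner(pt_rdd, partitioner = "kdbtree")
sedona_build_index(pt_rdd, type = "quadtree")

# Request the same partitioner and index type in the join. Per the note
# above, inputs already partitioned and indexed as specified should not
# need to be processed again.
pairs_rdd <- sedona_spatial_join(
  pt_rdd,
  polygon_rdd,
  join_type = "contain",
  partitioner = "kdbtree",
  index_type = "quadtree"
)
```

Requesting a different partitioner or index type in the join would instead cause the inputs to be partitioned or indexed again.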
From the examples above, one can see that SRDDs are great for spatial operations requiring
fine-grained control, e.g., for ensuring a spatial join query is executed as efficiently
as possible with the right types of spatial partitioning and indexing.
Finally, we can try visualizing the join result above, using a choropleth map:
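A minimal sketch of such a call, assuming the same rendering arguments as the overlay example later in this post (the resolution, boundary, and base color are carried over from there, and pts_per_region_rdd is the join result computed above):

```r
library(sparklyr.sedona)

# Render the per-region point counts as a choropleth map.
# All argument values are taken from the overlay example further down,
# minus the `overlay` argument itself.
sedona_render_choropleth_map(
  pts_per_region_rdd,
  resolution_x = 1000,
  resolution_y = 600,
  output_location = tempfile("choropleth-map-"),
  boundary = c(-126.790180, -64.630926, 24.863836, 50.000),
  base_color = c(63, 127, 255)
)
```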
which gives us the following:

Wait, however one thing appears amiss. To make the visualization above look nicer, we will
overlay it with the contour of every polygonal area:
contours <- sedona_render_scatter_plot(
polygon_rdd,
resolution_x = 1000,
resolution_y = 600,
output_location = tempfile("scatter-plot-"),
boundary = c(-126.790180, -64.630926, 24.863836, 50.000),
base_color = c(255, 0, 0),
browse = FALSE
)
sedona_render_choropleth_map(
pts_per_region_rdd,
resolution_x = 1000,
resolution_y = 600,
output_location = tempfile("choropleth-map-"),
boundary = c(-126.790180, -64.630926, 24.863836, 50.000),
base_color = c(63, 127, 255),
overlay = contours
)
which gives us the following:

With some low-level spatial operations taken care of using the SRDD API and
the right spatial partitioning and indexing data structures, we can then
import the results from SRDDs to Spark dataframes. When working with spatial
objects within Spark dataframes, we can write high-level, declarative queries
on these objects using dplyr verbs in conjunction with Sedona spatial UDFs,
e.g., the following query tells us whether each of the 8 nearest polygons to
the query point contains that point, and also, the convex hull of each polygon.
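# Build the query point as a Sedona geometry object
tbl <- DBI::dbGetQuery(
  sc, "SELECT ST_GeomFromText(\"POINT(-66.3 18)\") AS `pt`"
)
pt <- tbl$pt[[1]]
# Find the 8 polygons nearest to the query point
knn_rdd <- sedona_knn_query(
  polygon_rdd, x = pt, k = 8, index_type = "rtree"
)
# Convert the result to a Spark dataframe and apply Sedona spatial UDFs
knn_sdf <- knn_rdd %>%
  sdf_register() %>%
  dplyr::mutate(
    contains_pt = ST_contains(geometry, ST_Point(-66.3, 18)),
    convex_hull = ST_ConvexHull(geometry)
  )
knn_sdf %>% print()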
# Source: spark<?> [?? x 3]
geometry contains_pt convex_hull
1
Acknowledgements
The author of this blog post would like to thank Jia Yu,
the creator of Apache Sedona, and Lorenz Walthert for
their suggestion to contribute sparklyr.sedona to the upstream
incubator-sedona repository. Jia has provided extensive code-review feedback
to ensure sparklyr.sedona complies with coding standards and best practices of
the Apache Sedona project, and has also been very helpful in the
instrumentation of CI workflows verifying sparklyr.sedona works as expected
with snapshot versions of Sedona libraries from development branches.
The author is also grateful to his colleague Sigrid Keydana
for valuable editorial suggestions on this blog post.
That’s all. Thanks for reading!
Photo by NASA on Unsplash
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution, please cite this work as
Li (2021, July 7). Posit AI Blog: sparklyr.sedona: A sparklyr extension for analyzing geospatial data. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-07-07-sparklyr-sedona/
BibTeX citation
@misc{sparklyr-sedona, author = {Li, Yitao}, title = {Posit AI Blog: sparklyr.sedona: A sparklyr extension for analyzing geospatial data}, url = {https://blogs.rstudio.com/tensorflow/posts/2021-07-07-sparklyr-sedona/}, year = {2021} }