GitHub - andy327/sparkling: Fluent, type-safe Spark DataFrame DSL for Scala.

Sparkling is a Scala library that wraps Apache Spark's DataFrame API with a fluent, type-safe DSL. It replaces stringly-typed column expressions and verbose agg() calls with composable builders for grouping, streaming group operations, and window functions, while staying close enough to Spark that the underlying DataFrame is always a .df away.

Getting Started

Add the following to your build.sbt:

libraryDependencies += "io.github.andy327" %% "sparkling" % "0.1.0"

Sparkling is published for Scala 2.12 and 2.13. Spark itself is a provided dependency — your project is expected to bring its own Spark runtime.
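Because Spark is a provided dependency, a build typically declares it next to Sparkling. The fragment below is a minimal sketch of that setup; the Spark module and version shown are assumptions, so substitute whatever your cluster or local runtime actually uses.

```scala
// build.sbt (sketch): Sparkling plus a Spark runtime you supply yourself
libraryDependencies ++= Seq(
  "io.github.andy327" %% "sparkling" % "0.1.0",
  // Assumed Spark version; pick the one matching your environment
  "org.apache.spark" %% "spark-sql" % "3.5.1" % Provided
)
```

With the Provided scope, Spark is on the compile classpath but is not bundled into your packaged artifact.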

Word Count

Here is a word count in Sparkling, along with the equivalent raw Spark code.

Sparkling

import com.sparkling.dsl._

val wordCounts = lines.frame
  .flatMap("line" -> "word")((s: String) => s.split("\\s+"))
  .groupBy("word") {
    _.count("n")
  }
  .orderBy("n", descending = true)

Raw Spark

import org.apache.spark.sql.functions._

val wordCounts = lines
  .select(explode(split(col("line"), "\\s+")).as("word"))
  .groupBy("word")
  .agg(count("*").as("n"))
  .orderBy(desc("n"))

The single import com.sparkling.dsl._ brings everything into scope. The .frame extension method lifts any Spark DataFrame into a Frame. From there, all operations stay in the Sparkling DSL.

More Examples

Basic Transformations

employees.frame
  .project("id", "name", "dept", "salary")         // keep only these columns
  .rename("dept" -> "department")                  // rename a column
  .filter("salary")((s: Int) => s >= 50000)        // typed predicate
  .map("salary" -> "salary") { (s: Int) => s * 2 } // typed transform

Grouping and Aggregation

groupBy returns a GroupedFrame builder. Chain aggregations and they all run in a single Spark pass.

employees.frame
  .groupBy("department") {
    _.count("headcount")
     .avg("salary" -> "avg_salary")
     .max("salary" -> "max_salary")
     .sum("salary" -> "total_salary")
  }
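To make the "single pass" idea concrete, here is a plain-Scala sketch with no Spark or Sparkling dependency. It folds over the rows once and updates all four statistics together. EmployeeRow, Stats, and SinglePassStats are hypothetical names used only for illustration; this is not how Sparkling is implemented internally.

```scala
// Hypothetical row type standing in for one employee record
final case class EmployeeRow(department: String, salary: Int)

// One accumulator carries every aggregate, so a single traversal is enough
final case class Stats(headcount: Long, total: Long, max: Int) {
  def add(salary: Int): Stats =
    Stats(headcount + 1, total + salary, math.max(max, salary))
  def avg: Double = if (headcount == 0) 0.0 else total.toDouble / headcount
}

object SinglePassStats {
  val empty: Stats = Stats(0L, 0L, Int.MinValue)

  // Visit each row exactly once, updating that department's accumulator
  def byDepartment(rows: Seq[EmployeeRow]): Map[String, Stats] =
    rows.foldLeft(Map.empty[String, Stats]) { (acc, row) =>
      acc.updated(row.department, acc.getOrElse(row.department, empty).add(row.salary))
    }
}
```

Count, sum, and max update as each row arrives, and the average is derived from count and sum at the end.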

For joins and unions:

// inner join on a shared key
employees.frame.join("dept_id", departments.frame)

// asymmetric keys
orders.frame.join("customer_id" -> "id", customers.frame, JoinType.Left)

// union two frames (pads missing columns with null)
currentMonth.frame ++ lastMonth.frame
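The null-padding behavior of the union can be pictured with a plain-Scala sketch over map-based rows. This is only an illustration of the semantics described in the comment above, under the assumption that columns are matched by name; UnionSketch and Row are hypothetical names, not part of Sparkling.

```scala
object UnionSketch {
  // A row is modeled as column name -> value
  type Row = Map[String, Any]

  // Union by column name; a column absent from a row is filled with null
  def unionPadded(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val all = left ++ right
    val columns = all.flatMap(_.keys).distinct
    all.map(row => columns.map(c => c -> row.getOrElse(c, null)).toMap)
  }
}
```

Every output row ends up with the same set of columns, which is what lets the two inputs have different schemas.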

Streaming Group Operations

streamBy processes each group as a typed iterator, enabling operations that require seeing the whole group at once — like deduplication, sessionization, or custom ranking. The group is optionally sorted before the iterator is handed to your function.

// keep only the first event per user, by timestamp
events.frame
  .streamBy("user_id") {
    _.sortBy("timestamp")
     .mapGroups("event_type" -> "first_event_type") { iter: Iterator[String] =>
       iter.take(1)
     }
  }
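The same "sort the group, then take the head of the iterator" logic can be tried on ordinary collections. The sketch below is plain Scala with hypothetical names (Event, FirstEventSketch) and only mirrors the shape of the pipeline above; it makes no claim about how streamBy executes on a cluster.

```scala
// Hypothetical event record for illustration
final case class Event(userId: String, timestamp: Long, eventType: String)

object FirstEventSketch {
  // Group by user, sort each group by timestamp, keep the first event type
  def firstEventTypes(events: Seq[Event]): Map[String, List[String]] =
    events.groupBy(_.userId).map { case (userId, group) =>
      val sortedTypes: Iterator[String] =
        group.sortBy(_.timestamp).iterator.map(_.eventType)
      userId -> sortedTypes.take(1).toList
    }
}
```

Because the function receives an iterator, take(1) stops after the first element rather than materializing the whole group.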

For stateful operations, mapStreamWithContext threads a per-group value through the iterator:

// assign sequential positions within each session
events.frame
  .streamBy("session_id") {
    _.sortBy("timestamp")
     .mapStreamWithContext("event_id" -> "position")(init = 0) {
       (counter, iter: Iterator[Long]) =>
         iter.scanLeft(counter)((n, _) => n + 1).drop(1)
     }
  }
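The scanLeft expression in that example is standard Scala, so its behavior can be checked in isolation. A minimal sketch, with PositionSketch as a hypothetical wrapper name:

```scala
object PositionSketch {
  // scanLeft emits the seed first, then one running value per element;
  // drop(1) discards the seed so each event gets exactly one position
  def positions(eventIds: Iterator[Long], init: Int = 0): List[Int] =
    eventIds.scanLeft(init)((n, _) => n + 1).drop(1).toList
}

// PositionSketch.positions(Iterator(901L, 902L, 903L)) == List(1, 2, 3)
```

The event IDs themselves are ignored; only their order matters, which is why the group is sorted by timestamp first.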

Window Functions

windowBy returns a WindowedFrame builder that accumulates multiple window operations and applies them all in one pass. Rows are never reduced.

import com.sparkling.frame.WindowBounds._

employees.frame
  .windowBy("department") {
    _.orderBy("salary")
     .rank("dept_rank")                          // rank within dept by salary
     .lag("salary" -> "prev_salary", offset = 1) // salary of the previous row
     .sum("salary" -> "running_total")           // running total (default bounds)
  }
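For readers less familiar with window semantics, the three operations can be sketched on a plain sorted list for a single partition. The names Windowed and WindowSketch are hypothetical. The running total here is row-based, which is an assumption: Spark's own default frame with an ORDER BY is range-based and treats tied rows as peers, so results can differ when salaries repeat.

```scala
// One output row per input row: window functions never reduce the row count
final case class Windowed(salary: Int, rank: Int, prevSalary: Option[Int], runningTotal: Long)

object WindowSketch {
  def window(salaries: Seq[Int]): Seq[Windowed] = {
    val sorted = salaries.sorted
    // Cumulative sums, dropping the initial 0 seed
    val running = sorted.scanLeft(0L)(_ + _).tail
    sorted.indices.map { i =>
      // SQL-style RANK: 1 + number of rows strictly before this value
      val rank = sorted.indexOf(sorted(i)) + 1
      // lag with offset 1: value of the preceding row, absent for the first row
      val prev = if (i == 0) None else Some(sorted(i - 1))
      Windowed(sorted(i), rank, prev, running(i))
    }
  }
}
```

Note that the first row has no previous salary; in a DataFrame that cell would be null.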

Custom bounds give you rolling windows or whole-partition aggregates:

import com.sparkling.frame.WindowBounds
import com.sparkling.frame.WindowBounds._

sales.frame
  .windowBy("region") {
    _.orderBy("sale_date")
     .sum("revenue" -> "rolling_7d",
       bounds = WindowBounds.rowsBetween(-6, currentRow))
     .sum("revenue" -> "region_total",
       bounds = WindowBounds.rangeBetween(unboundedPreceding, unboundedFollowing))
  }
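The two frame shapes can be illustrated on an in-memory sequence for one partition. This is a plain-Scala sketch with hypothetical names (BoundsSketch, rollingSum, partitionTotal), not Sparkling code. A row frame from -6 to the current row covers up to seven rows, and an unbounded frame gives every row the same partition-wide sum.

```scala
object BoundsSketch {
  // Row frame (-preceding, currentRow): the current row plus up to
  // `preceding` rows before it; early rows simply see fewer values
  def rollingSum(values: IndexedSeq[Long], preceding: Int): IndexedSeq[Long] =
    values.indices.map { i =>
      values.slice(math.max(0, i - preceding), i + 1).sum
    }

  // Unbounded preceding to unbounded following: the whole partition
  def partitionTotal(values: IndexedSeq[Long]): IndexedSeq[Long] = {
    val total = values.sum
    values.map(_ => total)
  }
}
```

A row-based frame counts rows, not calendar days, so a "7-day" rolling sum built this way assumes exactly one row per day.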

windowAll applies window functions over all rows as a single partition — useful for global rankings:

scores.frame
  .windowAll {
    _.orderBy("score", descending = true)
     .rowNumber("global_rank")
  }

Algebird Aggregators

Sparkling integrates with Twitter Algebird through GroupedFrame.aggregate. Any MonoidAggregator can be plugged directly into a groupBy pipeline. Kryo serialization of opaque buffer types is handled automatically.

import com.sparkling.algebird.Aggregators

// approximate top-10 items by frequency within each category
val topK = Aggregators.forSpaceSaver[String](capacity = 1000, k = 10)

events.frame
  .groupBy("category") {
    _.aggregate("item" -> "top_items")(topK)
  }

Custom aggregators can wrap any MonoidAggregator you build with Algebird combinators. If the buffer type is SQL-encodable, Sparkling uses the typed Spark Aggregator path directly; if not (e.g. Option[SpaceSaver[T]]), it falls back to Kryo-serialized binary buffers automatically.
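The reason a monoid-based aggregator fits a distributed groupBy is that partial buffers can be built independently and merged in any grouping. The sketch below shows that prepare, combine, present shape in plain Scala. MiniAggregator and TopKCounts are hand-rolled, hypothetical stand-ins and not Algebird's actual types, and the exact counting map is a simple substitute for an approximate structure like SpaceSaver.

```scala
// Stand-in for the prepare / combine / present shape (not Algebird's trait)
trait MiniAggregator[A, B, C] {
  def prepare(input: A): B
  def zero: B
  def plus(left: B, right: B): B
  def present(buffer: B): C

  // Build a partial buffer from one chunk of input
  def reduce(inputs: Iterable[A]): B =
    inputs.iterator.map(prepare).foldLeft(zero)(plus)
}

object TopKCounts {
  // Exact top-k by frequency, using a count map as the mergeable buffer
  def exactTopK(k: Int): MiniAggregator[String, Map[String, Long], List[(String, Long)]] =
    new MiniAggregator[String, Map[String, Long], List[(String, Long)]] {
      def prepare(input: String): Map[String, Long] = Map(input -> 1L)
      val zero: Map[String, Long] = Map.empty
      def plus(left: Map[String, Long], right: Map[String, Long]): Map[String, Long] =
        right.foldLeft(left) { case (acc, (key, n)) =>
          acc.updated(key, acc.getOrElse(key, 0L) + n)
        }
      // Highest count first; ties broken alphabetically for determinism
      def present(buffer: Map[String, Long]): List[(String, Long)] =
        buffer.toList.sortBy { case (key, n) => (-n, key) }.take(k)
    }
}
```

Merging two partial buffers and then presenting gives the same answer as aggregating all the input at once, which is the property that lets each partition work independently.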

Typed Record Mapping

mapRecord and flatMapRecord decode rows into case classes, apply a typed function, and re-encode the output — all without leaving the Frame API.

case class Employee(name: String, salary: Int)
case class Bonus(name: String, bonus: Int)

employees.frame
  .mapRecord[Employee, Bonus](("name", "salary") -> ("name", "bonus")) { emp =>
    Bonus(emp.name, (emp.salary * 0.1).toInt)
  }

Building and Testing

Sparkling supports Scala 2.12 and 2.13 and requires JDK 17. Spark dependencies are marked provided, so you will need a Spark environment at runtime.

# compile
sbt compile

# run tests
sbt test

# run the full CI check (lint + format + coverage)
sbt ci

Individual test suites can be run with sbt "testOnly com.sparkling.frame.FrameSpec".

Code formatting and import ordering are enforced by Scalafmt and Scalafix. To auto-format everything before committing:

sbt formatAll

License

Released under the MIT License.
