Grouping a Spark DataFrame and Creating JSON Lists in Scala

Prasad Khode
2 min read · Oct 9, 2024

In this post, we’ll explore how to group a Spark DataFrame by a specific column and create a list of JSON objects from other columns. This operation is useful when you want to combine data into a structured format for further analysis or export.

Problem Scenario

Let’s say we have the following DataFrame:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|A   |1   |x   |
|B   |2   |y   |
|A   |3   |z   |
|B   |4   |w   |
|A   |5   |v   |
+----+----+----+

We want to group by col1 and generate a list of JSON objects containing col2 and col3 for each unique value in col1.

Steps to Solve

Here’s how we can achieve this using Apache Spark and Scala:

  1. Import Required Libraries: We’ll start by importing the necessary Spark libraries.
  2. Create the DataFrame: For this example, we’ll create a simple DataFrame with sample data.
  3. Convert Columns into JSON: We’ll use Spark’s built-in to_json and struct functions to convert the columns col2 and col3 into JSON format.
  4. Group by and Aggregate: Finally, we’ll group the DataFrame by col1 and collect the JSON objects into a list.

Let’s dive into the implementation.

Code Implementation

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

// Step 1: Create a Spark session
val spark = SparkSession.builder().appName("GroupByJSON").getOrCreate()
import spark.implicits._

// Step 2: Define the sample data and create a DataFrame
val data = Seq(
  ("A", 1, "x"),
  ("B", 2, "y"),
  ("A", 3, "z"),
  ("B", 4, "w"),
  ("A", 5, "v")
)
val df = data.toDF("col1", "col2", "col3")

// Step 3: Create a JSON representation of col2 and col3
val dfWithJson = df.withColumn("json_column", to_json(struct($"col2", $"col3")))

// Step 4: Group by col1 and collect JSON objects into a list
val result = dfWithJson.groupBy("col1").agg(collect_list("json_column").as("json_list"))

// Step 5: Show the result
result.show(false)

  • We initialize a Spark session, which serves as the entry point for all Spark operations.
  • We create a sample DataFrame with three columns (col1, col2, and col3). You can replace this with any data you are working with.
  • We use struct to combine col2 and col3 into a single structured column, then use to_json to serialize that structure into a JSON string.
  • We group the data by col1 and use the collect_list function to aggregate the JSON strings for each group; a more compact, equivalent form is shown right after this list.
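
The intermediate withColumn step keeps the logic easy to follow, but the same result can be produced in a single chained expression. Here is a minimal equivalent sketch:

val resultCompact = df
  .groupBy("col1")
  .agg(collect_list(to_json(struct($"col2", $"col3"))).as("json_list"))

One caveat worth noting: collect_list makes no guarantee about the order of elements within each group, so if the order of the JSON objects matters, impose it explicitly (for example, by sorting within a window) before collecting.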

Output

Here’s what the final result looks like:

+----+---------------------------------------------------------------------+
|col1|json_list                                                            |
+----+---------------------------------------------------------------------+
|A   |[{"col2":1,"col3":"x"}, {"col2":3,"col3":"z"}, {"col2":5,"col3":"v"}]|
|B   |[{"col2":2,"col3":"y"}, {"col2":4,"col3":"w"}]                       |
+----+---------------------------------------------------------------------+

Each row now has a list of JSON objects that represent the combination of col2 and col3 for each unique value of col1.
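
If a downstream consumer expects a single JSON array string per group rather than a Spark array of JSON strings, a small variation handles that: collect the structs first, then serialize the whole array at once. A minimal sketch, assuming Spark 2.4+ (where to_json accepts arrays of structs):

// Collect the structs per group, then serialize the entire array as one string
val asJsonArray = df
  .groupBy("col1")
  .agg(to_json(collect_list(struct($"col2", $"col3"))).as("json_array"))

asJsonArray.show(false)
// json_array for group A is one string: [{"col2":1,"col3":"x"},{"col2":3,"col3":"z"},{"col2":5,"col3":"v"}]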

Use Cases

This technique can be applied in several scenarios:

  • Data transformation: Reshaping your data into a nested JSON structure can be useful when preparing data for APIs or exporting it to document-based databases like MongoDB (a short export sketch follows this list).
  • Data aggregation: Combining related data into lists simplifies reporting and analysis, especially when working with grouped data.
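
For the export case, the grouped result can be written straight out as JSON Lines, one record per group. A minimal sketch; the output path is a hypothetical placeholder:

// Each output record carries col1 and its json_list
result.write.mode("overwrite").json("/tmp/grouped_json")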

Conclusion

We’ve successfully demonstrated how to group a Spark DataFrame by one column and generate a list of JSON objects from other columns. This approach is efficient for transforming structured data into more flexible, readable formats.
