Adding External and Maven JARs to Spark Shell for Ad-Hoc Analysis
When performing ad-hoc data analysis using Apache Spark, you may encounter situations where you need additional libraries to process your data. These libraries could be external JARs available from Maven repositories or custom JARs that you’ve developed locally. In this post, we’ll walk through how to add both types of JARs to your Spark shell session.
Adding a Maven JAR
Let’s start with an example where we need to include a library that is available on Maven Central. We’ll use the MongoDB BSON library (org.mongodb:bson) as our example. The SBT dependency for this library looks like this:
// https://mvnrepository.com/artifact/org.mongodb/bson
libraryDependencies += "org.mongodb" % "bson" % "5.1.3"
To include this Maven JAR in your Spark shell session, you can use the --packages option. The format for specifying a Maven package is <groupId>:<artifactId>:<version>. Here’s how you can start the Spark shell with the org.mongodb:bson package included:
spark-shell --master spark://$(hostname -f):7077 \
--driver-memory 4G --executor-memory 4G \
--executor-cores 2 --total-executor-cores 4 \
--packages org.mongodb:bson:5.1.3
Once you execute this command, the Spark shell will resolve the dependency and download the required JAR. You’ll see output similar to this in your terminal:
:: loading settings :: url = jar:file:/e2deepde/spark-3.4.1-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/deuser/.ivy2/cache
The jars for the packages stored in: /home/deuser/.ivy2/jars
org.mongodb#bson added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0d4af84f-28c1-4692-937e-7f9b9b405559;1.0
confs: [default]
found org.mongodb#bson;5.1.3 in central
downloading https://repo1.maven.org/maven2/org/mongodb/bson/5.1.3/bson-5.1.3.jar ...
[SUCCESSFUL ] org.mongodb#bson;5.1.3!bson.jar (813ms)
:: resolution report :: resolve 903ms :: artifacts dl 816ms
:: modules in use:
org.mongodb#bson;5.1.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 1 | 1 | 0 || 1 | 1 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-0d4af84f-28c1-4692-937e-7f9b9b405559
confs: [default]
1 artifacts copied, 0 already retrieved (496kB/5ms)
This output confirms that the Spark shell has successfully downloaded the Maven artifact, and you can now use the BSON library within your Spark session.
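As a quick sanity check, you can import one of the library’s classes directly in the shell. Here is a minimal sketch using org.bson.Document, the document-builder class shipped in the bson artifact (the field values are just placeholders):

// Paste into the Spark shell: verifies the bson JAR is on the classpath
import org.bson.Document

val doc = new Document("name", "spark").append("version", "3.4.1")
println(doc.toJson()) // {"name": "spark", "version": "3.4.1"}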
Adding a Local JAR
If you have a JAR file that you’ve developed locally, or one that isn’t available on Maven, you can add it to your Spark shell using the --jars option. This option takes a comma-separated list of paths to your JAR files, and glob patterns are allowed. Quote the glob so that Spark, rather than your shell, expands it; an unquoted pattern matching several JARs would expand into space-separated arguments and break the command. Here’s an example:
spark-shell --master spark://$(hostname -f):7077 \
--driver-memory 4G --executor-memory 4G \
--executor-cores 2 --total-executor-cores 4 \
--jars "$HOME/local_jars/*.jar"
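Once the shell is up, you can confirm that the files were registered by listing the JARs known to the SparkContext (the shell creates sc for you):

// List every JAR that has been added to this Spark context
sc.listJars().foreach(println)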
You can also combine --packages and --jars in a single command, separating multiple entries in each with commas:
spark-shell --master spark://$(hostname -f):7077 \
--driver-memory 4G --executor-memory 4G \
--executor-cores 2 --total-executor-cores 4 \
--packages org.mongodb:bson:5.1.3,org.json:json:20240303 \
--jars "$HOME/local_jars/*.jar,$HOME/other_jars/*.jar"
This approach gives you the flexibility to include any JARs necessary for your analysis, whether they are external libraries or custom implementations.
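For example, once the shell from the combined command above is running, the org.json library pulled in via --packages is immediately usable. A minimal sketch, with placeholder values:

// Paste into the Spark shell: exercises the org.json artifact
import org.json.JSONObject

val obj = new JSONObject("""{"engine": "spark", "cores": 4}""")
println(obj.getInt("cores")) // prints 4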
Conclusion
Adding external or local JARs to your Spark shell session is straightforward and allows you to extend the capabilities of Spark for ad-hoc analysis. By using the --packages option for Maven artifacts and the --jars option for local files, you can seamlessly integrate additional libraries into your Spark environment, enabling more robust and flexible data processing.