Adding External and Maven JARs to Spark Shell for Ad-Hoc Analysis
When performing ad-hoc data analysis using Apache Spark, you may encounter situations where you need additional libraries to process your data. These libraries could be external JARs available from Maven repositories or custom JARs that you’ve developed locally. In this post, we’ll walk through how to add both types of JARs to your Spark shell session.
Adding a Maven JAR
Let’s start with an example where we need to include a library that is available on Maven Central. We’ll use the MongoDB BSON library (org.mongodb:bson) as our example. The SBT dependency for this library looks like this:
// https://mvnrepository.com/artifact/org.mongodb/bson
libraryDependencies += "org.mongodb" % "bson" % "5.1.3"
To include this Maven JAR in your Spark shell session, you can use the --packages option. The format for specifying a Maven package is <groupId>:<artifactId>:<version>. Here’s how you can start the Spark shell with the org.mongodb:bson package included:
spark-shell --master spark://$(hostname -f):7077 \
--driver-memory 4G --executor-memory 4G \
--executor-cores 2 --total-executor-cores 4 \
--packages org.mongodb:bson:5.1.3
Once you execute this command, the Spark shell will resolve the dependency and download the required JAR. You’ll see output similar to this in your terminal:
:: loading settings :: url = jar:file:/e2deepde/spark-3.4.1-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/deuser/.ivy2/cache
The jars for the packages stored in: /home/deuser/.ivy2/jars
org.mongodb#bson added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0d4af84f-28c1-4692-937e-7f9b9b405559;1.0
confs: [default]
found org.mongodb#bson;5.1.3 in central
downloading https://repo1.maven.org/maven2/org/mongodb/bson/5.1.3/bson-5.1.3.jar ...
[SUCCESSFUL ] org.mongodb#bson;5.1.3!bson.jar (813ms)
:: resolution report :: resolve 903ms :: artifacts dl 816ms
:: modules in use:
org.mongodb#bson;5.1.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 1 | 1 | 0 || 1 | 1 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-0d4af84f-28c1-4692-937e-7f9b9b405559
confs: [default]
1 artifacts copied, 0 already retrieved (496kB/5ms)
This output confirms that the Spark shell has successfully downloaded the Maven artifact, and you can now use the BSON library within your Spark session.
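As a quick sanity check, you can import one of the library’s classes directly in the shell. Here is a minimal sketch using org.bson.Document, the document-builder class shipped in the bson artifact (the field values are just placeholders):

// Paste into the Spark shell: verifies the bson JAR is on the classpath
import org.bson.Document

val doc = new Document("name", "spark").append("version", "3.4.1")
println(doc.toJson()) // {"name": "spark", "version": "3.4.1"}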
Adding a Local JAR
If you have a JAR file that you’ve developed locally, or one that isn’t available on Maven, you can add it to your Spark shell using the --jars option. This option takes a comma-separated list of paths to your JAR files, and glob patterns are allowed. Quote the glob so that Spark, rather than your shell, expands it; an unquoted pattern matching several JARs would expand into space-separated arguments and break the command. Here’s an example:
spark-shell --master spark://$(hostname -f):7077 \
--driver-memory 4G --executor-memory 4G \
--executor-cores 2 --total-executor-cores 4 \
--jars "$HOME/local_jars/*.jar"
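Once the shell is up, you can confirm that the files were registered by listing the JARs known to the SparkContext (the shell creates sc for you):

// List every JAR that has been added to this Spark context
sc.listJars().foreach(println)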
You can also combine --packages and --jars in a single command, separating multiple entries in each with commas:
spark-shell --master spark://$(hostname -f):7077 \
--driver-memory 4G --executor-memory 4G \
--executor-cores 2 --total-executor-cores 4 \
--packages org.mongodb:bson:5.1.3,org.json:json:20240303 \
--jars "$HOME/local_jars/*.jar,$HOME/other_jars/*.jar"
This approach gives you the flexibility to include any JARs necessary for your analysis, whether they are external libraries or custom implementations.
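For example, once the shell from the combined command above is running, the org.json library pulled in via --packages is immediately usable. A minimal sketch, with placeholder values:

// Paste into the Spark shell: exercises the org.json artifact
import org.json.JSONObject

val obj = new JSONObject("""{"engine": "spark", "cores": 4}""")
println(obj.getInt("cores")) // prints 4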
Conclusion
Adding external or local JARs to your Spark shell session is straightforward and allows you to extend the capabilities of Spark for ad-hoc analysis. By using the --packages option for Maven artifacts and the --jars option for local files, you can seamlessly integrate additional libraries into your Spark environment, enabling more robust and flexible data processing.