Apache Spark
- This article includes information originally written by Arpan Patel for the Anant GitHub repository and the DataStax Astra documentation
Overview¶
Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. This guide shows how to connect Apache Spark to your database and access your Astra DB tables using Scala in spark-shell.
Prerequisites¶
- You should have an Astra account
- You should have created an Astra database
- You should have an Astra token
- You should have downloaded your Secure Connect Bundle. Keep it as a .zip file; the connector reads the bundle directly, so there is no need to unpack it.
- Download and install the latest version of the Spark Cassandra Connector (SCC) that matches your Apache Spark and Scala versions from the Maven Central repository. To find the right version, check the SCC version compatibility matrix (a launch sketch using --packages follows this list).
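Alternatively, spark-shell can fetch the connector from Maven Central at launch with the --packages flag, so you do not have to manage the JAR yourself. A minimal sketch, assuming Spark 3.0.x with Scala 2.12; substitute the SCC coordinates that match your own versions:
# Pulls the Spark Cassandra Connector from Maven Central at startup.
# The Scala suffix (_2.12) and version (3.0.1) are assumptions; match
# them to your own Spark and Scala versions.
bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1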
Installation and Setup¶
These steps assume you will be running Apache Spark in local mode.
✅ Steps:¶
- Expand the downloaded Apache Spark package into a directory, and assign the directory name to $SPARK_HOME.
- Navigate to this directory using cd $SPARK_HOME.
- Append the following lines to the end of the file $SPARK_HOME/conf/spark-defaults.conf (a template is available in the $SPARK_HOME/conf directory), replacing the value (second column) of each of the first four lines with your own bundle path and credentials (a programmatic alternative is sketched after the sample output below):
spark.files $SECURE_CONNECT_BUNDLE_FILE_PATH/secure-connect-astraiscool.zip
spark.cassandra.connection.config.cloud.path secure-connect-astraiscool.zip
spark.cassandra.auth.username <<CLIENT ID>>
spark.cassandra.auth.password <<CLIENT SECRET>>
spark.dse.continuousPagingEnabled false
- Launch spark-shell and enter the following Scala commands:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
// Connectivity check: count the tables in the system_schema keyspace
spark.read.cassandraFormat("tables", "system_schema").load().count()
You should see output similar to the following (the app id and table count will vary):
$ bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
res0: Long = 25
scala> :quit
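Once the connectivity check passes, the same connector can read your own tables into DataFrames. Below is a minimal sketch for a standalone Scala application; the keyspace test and table movies are hypothetical placeholders, and the builder shows the programmatic equivalent of the spark-defaults.conf settings above:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

// Programmatic equivalent of the spark-defaults.conf settings above.
// The bundle path and credentials are placeholders for your own values.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.files", "/path/to/secure-connect-astraiscool.zip")
  .config("spark.cassandra.connection.config.cloud.path", "secure-connect-astraiscool.zip")
  .config("spark.cassandra.auth.username", "<<CLIENT ID>>")
  .config("spark.cassandra.auth.password", "<<CLIENT SECRET>>")
  .config("spark.dse.continuousPagingEnabled", "false")
  .getOrCreate()

// Read a table into a DataFrame. "movies" and "test" are hypothetical
// names; substitute your own table and keyspace.
val df = spark.read.cassandraFormat("movies", "test").load()
df.printSchema()
df.show(10)
In spark-shell the session already exists as spark, so only the last three lines are needed there.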