Apache Spark

  • This article includes information that was originally written by Arpan Patel on Anant Github and Astra DataStax


Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. Use Apache Spark to connect to your database and access your Astra DB tables using Scala in spark-shell.


Installation and Setup

These steps assume you are using Apache Spark in local mode. For help using Spark in cluster mode, click the chat button at the bottom of the screen.


  1. Expand the downloaded Apache Spark package into a directory, and assign the directory name to $SPARK_HOME.

  2. Navigate to this directory using cd $SPARK_HOME.

  3. Append the following lines to the end of $SPARK_HOME/conf/spark-defaults.conf (a template may be available in the $SPARK_HOME/conf directory), and replace the placeholder values in the second column with your own:

spark.cassandra.auth.username <<CLIENT ID>>
spark.cassandra.auth.password <<CLIENT SECRET>>
spark.dse.continuousPagingEnabled false

  4. Launch spark-shell and enter the following Scala commands:

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
spark.read.cassandraFormat("tables", "system_schema").load().count()

You should expect to see the following output:

$ bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java
Type in expressions to have them evaluated.
Type :help for more information.

scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._

scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._

scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
res0: Long = 25

scala> :quit
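
Once the count on system_schema.tables succeeds, a natural next step is to read one of your own tables. The following is a minimal sketch, not a definitive recipe. It assumes you are inside spark-shell (so the predefined spark session exists), that the Spark Cassandra Connector is on the classpath, and that the credentials from spark-defaults.conf are valid. The names my_keyspace and my_table are hypothetical placeholders; replace them with a keyspace and table from your own database.

```scala
// Run inside spark-shell, where `spark` is the predefined SparkSession.
import org.apache.spark.sql.cassandra._

// List the tables visible to this connection, reading the same
// system_schema.tables source that the count in the steps above uses.
val tables = spark.read
  .cassandraFormat("tables", "system_schema")
  .load()
  .select("keyspace_name", "table_name")

// Show up to 20 rows without truncating long names.
tables.show(20, false)

// Hypothetical names: replace with a keyspace and table from your own database.
val keyspace = "my_keyspace"
val table = "my_table"

// cassandraFormat takes the table name first, then the keyspace.
val df = spark.read.cassandraFormat(table, keyspace).load()

df.printSchema() // column names and types as mapped by the connector
df.show(10)      // first 10 rows
```

Listing the tables first is a cheap way to confirm which keyspace and table names exist before loading one. If the placeholder names are left unchanged, the final load is expected to fail because that table does not exist.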

Last update: 2023-10-13