
Apache Spark with Scala

What Are Scala and Apache Spark?

Scala is a high-level programming language that combines functional and object-oriented programming paradigms. Developed by Martin Odersky and released in 2003, Scala was designed to address limitations in Java while running on the Java Virtual Machine (JVM). By offering concise and expressive syntax, Scala allows developers to write efficient code that can effortlessly interact with Java libraries and frameworks. Its statically-typed nature, combined with powerful features like pattern matching and immutable collections, makes Scala particularly suited for big data applications.
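
As a quick taste of that style (the types and values below are invented purely for illustration), case classes, immutable collections, and pattern matching look like this:

object ScalaTaste {
  sealed trait Fruit
  case class Apple(weightGrams: Int) extends Fruit
  case class Banana(ripe: Boolean) extends Fruit

  // Pattern matching destructures each case class, with an optional guard
  def describe(f: Fruit): String = f match {
    case Apple(w) if w > 150 => s"a large apple ($w g)"
    case Apple(w)            => s"an apple ($w g)"
    case Banana(isRipe)      => if (isRipe) "a ripe banana" else "an unripe banana"
  }

  def main(args: Array[String]): Unit = {
    // List is immutable: map returns a new list instead of mutating this one
    val basket: List[Fruit] = List(Apple(200), Banana(ripe = true), Apple(120))
    basket.map(describe).foreach(println)
  }
}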

Apache Spark, initially developed by the AMPLab at the University of California, Berkeley, and released as an open-source project in 2010, is a distributed computing system focused on processing large-scale data swiftly. Spark's key innovation is its in-memory data processing capability, which significantly enhances the speed of data analytics over traditional disk-based frameworks. It provides a suite of libraries for diverse tasks, including Spark SQL, MLlib for machine learning, and Spark Streaming for real-time data processing.

Why Scala?

Scala is not the only language that can interact with the Spark API, even though Spark and Scala are often spoken of synonymously. Over the years, full or partial support has been added for various other languages.

Language | Support Added | Notes
Scala    | 2010 | Original language used in Spark development.
Java     | 2010 | Supported from the beginning alongside Scala.
Python   | 2013 | Introduced with Spark 0.7.0 (PySpark).
SQL      | 2014 | Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.
R        | 2015 | Added with Spark 1.4 (SparkR).
Kotlin   | 2019 | Unofficial support via third-party libraries.

Apache Spark is still primarily used with Scala for several reasons, despite the popularity of languages like Python:

Performance: Scala is the native language of Apache Spark, meaning that Spark's core engine and APIs are written in Scala. This ensures that operations conducted with Scala are more closely aligned with the internal workings of Spark, often resulting in more efficient execution compared to other languages, particularly for complex computations.

Type Safety and Static Typing: Scala, being statically typed, offers compile-time type safety. This allows developers to catch errors early in the development process, leading to more robust and maintainable code. This feature is particularly beneficial in big data applications, where the datasets and transformations can become quite complex.
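
As a rough sketch of what that looks like in practice (the Sale case class and its values are invented for this example), a typed Dataset lets the compiler catch a misspelled field before the job ever runs:

import org.apache.spark.sql.SparkSession

// Illustrative only: a typed record used to build a Dataset
case class Sale(id: Int, amount: Double)

object TypeSafetyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Type Safety Example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Dataset[Sale]: field names and types are known to the compiler
    val sales = Seq(Sale(1, 9.99), Sale(2, 4.50)).toDS()

    // Compiles: 'amount' is a real field of Sale
    val withTax = sales.map(s => s.amount * 1.2)

    // Would not compile: Sale has no field 'amout'
    // val typo = sales.map(s => s.amout * 1.2)

    withTax.show()
    spark.stop()
  }
}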

Functional Programming Paradigm: Scala's support for functional programming paradigms fits well with the distributed processing model of Spark. Functional programming concepts like immutability and higher-order functions provide a more natural and concise way to express data transformations, which are core operations in Spark.
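
To make that concrete, here is a minimal sketch (with made-up data): each transformation is a higher-order function that returns a new, immutable distributed collection rather than modifying anything in place:

import org.apache.spark.sql.SparkSession

object FunctionalStyleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Functional Style Example")
      .master("local[*]")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 10)

    // filter and map each return a new, immutable RDD; 'numbers' is untouched
    val evensSquared = numbers.filter(_ % 2 == 0).map(n => n * n)

    evensSquared.collect().foreach(println)
    spark.stop()
  }
}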

Integration with Java: Scala runs on the Java Virtual Machine (JVM) and has seamless interoperability with Java, which is essential for leveraging existing Java libraries and integrating Spark with other JVM-based enterprise applications.
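
A small illustration of that interoperability (plain Scala, no Spark needed): classes from the Java standard library can be used directly, without wrappers or bindings:

import java.time.LocalDate
import java.util.UUID

object JavaInteropExample {
  def main(args: Array[String]): Unit = {
    // Both types come straight from the JDK and are used like any Scala class
    val today: LocalDate = LocalDate.now()
    val runId: UUID = UUID.randomUUID()
    println(s"Run $runId started on $today")
  }
}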

Comprehensive API: While Spark offers APIs for multiple languages, the Scala API tends to be more comprehensive and is often the first to receive new features and updates. This is because many Spark developers and contributors are most familiar with Scala, leading to a more mature and well-supported API.

While Python is also popular due to its simplicity and extensive ecosystem of data science libraries, there can be performance trade-offs when using PySpark. Python operations may involve greater data serialization and communication overheads between the Spark JVM and Python runtime, potentially leading to slower execution for certain tasks. Despite these considerations, Python remains widely used with Spark, especially for teams focused on the rapid development of data applications and leveraging Python's rich data science tools.

Development

I'm using Ubuntu Linux on Windows WSL. Your installation instructions may vary.

Setting Up Your Environment

  1. Install Java Development Kit (JDK):

Spark runs on the Java Virtual Machine (JVM), so you'll need to install a JDK. Newer LTS releases of the JDK are available, but when working with Scala and Apache Spark it's crucial to consider the compatibility of these tools with the Java version you choose.

JDK version 17 is officially supported for both Spark and Scala.

sudo apt install openjdk-17-jdk
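
After installing, you can confirm which Java version is active (a quick sanity check, not part of the original steps):

java -version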
  2. Install Apache Spark:

Download the latest version of Apache Spark from the Apache Spark downloads page. Choose the pre-built version for Hadoop (e.g., "Pre-built for Apache Hadoop 3.3 and later"). Extract the downloaded archive and add the bin directory to your system’s PATH variable, allowing you to run Spark commands from the terminal.
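
For example, assuming the archive is named spark-3.4.0-bin-hadoop3.tgz and you want Spark under /opt (both are assumptions, so adjust to your setup):

tar -xzf spark-3.4.0-bin-hadoop3.tgz
sudo mv spark-3.4.0-bin-hadoop3 /opt/spark
echo 'export PATH=$PATH:/opt/spark/bin' >> ~/.bashrc
source ~/.bashrc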

  3. Install Scala:

Install Scala:

sudo apt install scala

and then SBT (the Scala Build Tool). Installing SBT is actually a bit more involved now; the method in this article worked for me: Medium: How to install Scala/SBT on Ubuntu.

  4. Install an IDE:

Common IDEs for Spark development include VS Code, IntelliJ IDEA, Jupyter Notebooks, Apache Zeppelin notebooks, Databricks notebooks, and PyCharm.

  5. Install Apache Maven or SBT (Scala Build Tool):

These tools are essential for managing dependencies and building your Scala projects. You can install Maven from Apache Maven’s website or SBT from SBT's official site.
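
If you prefer a package-manager install on Ubuntu, Maven is also available from the standard repositories (a convenience note, not from the original instructions):

sudo apt install maven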

Development Workflow with Spark and Scala

  1. Create a New Scala Project:

Open your IDE, create a new project, and select Scala with the build tool of your choice (Maven or SBT).

Configure the project SDK to use the JDK you installed.
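
For reference, a minimal build.sbt skeleton might look something like this (the project name and exact versions are assumptions; Spark 3.4.0 is published for Scala 2.12 and 2.13):

// build.sbt
name := "simple-spark-app"
version := "0.1.0"
scalaVersion := "2.12.18"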

  2. Add Spark Dependencies:

If you are using SBT, add the following dependencies to your build.sbt file:

// %% tells sbt to append the project's Scala binary version to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.4.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.0"

For Maven, you will include similar dependencies in the pom.xml.

  3. Develop Spark Applications:

Write your Scala code leveraging Spark's APIs. For example, you can start by creating a simple application that initializes a SparkSession and executes a basic transformation and action:

import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Simple Spark Application")
      .master("local[*]") // Run locally with many threads
      .getOrCreate()

    val data = Seq((1, "apple"), (2, "banana"), (3, "orange"))
    val df = spark.createDataFrame(data).toDF("id", "fruit")
    df.show()

    spark.stop()
  }
}
  4. Build and Run the Application:

Use your chosen build tool (e.g., running sbt run or using IntelliJ’s run configuration) to compile and execute your application.

Sample Data Fetch and Transformation

Here is a simple example where I fetch some data from an API endpoint that I've created, which returns a JSON response. I parse the JSON, turn it into a DataFrame, and then I can perform operations on it. Just like Pandas!

In this instance I am doing some simple math on two columns and outputting the result into a new column.

Main.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import scalaj.http._

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("API Data Processing")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Send GET request to the API
    val response: String = Http(
      "https://emzdc3lzy9.execute-api.us-east-1.amazonaws.com/dev/sample-data"
    ).asString.body

    // Load JSON data into DataFrame
    val jsonSeq = Seq(response)
    val jsonDS = spark.createDataset(jsonSeq)
    val df = spark.read.option("multiLine", true).json(jsonDS)

    // Display the DataFrame schema and data
    df.printSchema()
    df.show(truncate = false)

    // Cast the string columns to Double (or Integer, if appropriate)
    val dfWithNumeric = df
      .withColumn("clicks", col("clicks").cast("Double"))
      .withColumn("cost", col("cost").cast("Double"))

    // Now, calculate CPC safely
    val dfWithSafeCPC = dfWithNumeric.withColumn(
      "cpc",
      when(col("clicks") =!= 0, col("cost") / col("clicks")).otherwise(lit(0))
    )

    // Display the result
    dfWithSafeCPC.show()

    // Stop the SparkSession
    spark.stop()
  }
}
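
One note if you want to reproduce this: the GET request uses the scalaj-http library, which is not bundled with Spark, so its dependency also needs to be added to build.sbt (the version shown is an assumption; use whatever recent release resolves for your Scala version):

libraryDependencies += "org.scalaj" %% "scalaj-http" % "2.4.2"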

This successfully outputs the data. Great.

[info] +------+-----+--------------------+-----+---+
[info] |clicks| cost| date|sales|cpc|
[info] +------+-----+--------------------+-----+---+
[info] | 512.0|256.0|2024-08-01T00:00:...| 1024|0.5|
[info] | 478.0|239.0|2024-08-02T00:00:...| 956|0.5|
[info] | 525.0|262.5|2024-08-03T00:00:...| 1050|0.5|
[info] | 610.0|305.0|2024-08-04T00:00:...| 1220|0.5|
[info] | 450.0|225.0|2024-08-05T00:00:...| 900|0.5|
[info] | 490.0|245.0|2024-08-06T00:00:...| 980|0.5|
[info] | 530.0|265.0|2024-08-07T00:00:...| 1060|0.5|
[info] | 580.0|290.0|2024-08-08T00:00:...| 1160|0.5|
[info] | 475.0|237.5|2024-08-09T00:00:...| 950|0.5|
[info] | 500.0|250.0|2024-08-10T00:00:...| 1000|0.5|
[info] | 620.0|310.0|2024-08-11T00:00:...| 1240|0.5|
[info] | 460.0|230.0|2024-08-12T00:00:...| 920|0.5|
[info] | 515.0|257.5|2024-08-13T00:00:...| 1030|0.5|
[info] | 495.0|247.5|2024-08-14T00:00:...| 990|0.5|
[info] | 550.0|275.0|2024-08-15T00:00:...| 1100|0.5|
[info] | 600.0|300.0|2024-08-16T00:00:...| 1200|0.5|
[info] | 480.0|240.0|2024-08-17T00:00:...| 960|0.5|
[info] | 530.0|265.0|2024-08-18T00:00:...| 1060|0.5|
[info] | 465.0|232.5|2024-08-19T00:00:...| 930|0.5|
[info] | 510.0|255.0|2024-08-20T00:00:...| 1020|0.5|
[info] +------+-----+--------------------+-----+---+

Spark in AWS
