Now that you know what Spark is, we'll see how to set up and test a Spark development environment on Windows, Linux (Ubuntu), and macOS. Whichever of these common operating systems you are using, this article should give you what you need to start developing Spark applications.

What is a Spark development environment?

A development environment is an installation of Apache Spark and related components on your local computer that you can use to develop and test Spark applications before deploying them to a production environment.

Spark provides support for Python, Java, Scala, and R. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so all you need to run Spark is a Java installation. However, if you want to use the Python API (PySpark), you will also need a Python interpreter (version 2.7 or later), and if you want to use R (SparkR), you will need an R installation on your machine.
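For example, you can quickly confirm these prerequisites from a terminal before going any further (this assumes the tools are already on your PATH; skip the lines for languages you don't plan to use):

    java -version        # Spark needs a working JVM, typically Java 8 or later
    python --version     # only needed if you plan to use PySpark
    Rscript --version    # only needed if you plan to use SparkR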

What are the Spark setup options?

The options for getting started with Spark are:

  • Downloading and installing Apache Spark components individually on your laptop.
  • Downloading a pre-configured quick start VM or distribution.
  • Running a web-based version in Databricks Community Edition, a free cloud environment.

I'll explain each of these options below.

Installing Manually

Downloading Spark Locally

To download and run Spark locally, the first step is to make sure that you have Java installed on your machine, along with Python or R if you intend to use those APIs. Next, visit the project’s official download page, select the package type “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This will download a compressed TAR file, or tarball.
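As a rough sketch, once the download finishes you can unpack the tarball and launch a shell directly from the extracted directory (the exact file name depends on the Spark and Hadoop versions you selected; the one below is only an example):

    tar -xzf spark-2.4.8-bin-hadoop2.7.tgz     # unpack the tarball; your file name may differ
    cd spark-2.4.8-bin-hadoop2.7
    export SPARK_HOME=$(pwd)                   # optional: lets other tools find this installation
    export PATH=$SPARK_HOME/bin:$PATH          # optional: run pyspark/spark-shell from anywhere
    ./bin/pyspark                              # or ./bin/spark-shell for Scala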

Building Spark from Source

You can also build and configure Spark from source. Download a Spark source package from GitHub to get just the source, and follow the instructions in the README file to build it.
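As a minimal sketch, a source build typically looks like this; the Maven wrapper bundled with the repository handles the actual build, and the README describes the profiles and options available for your version:

    git clone https://github.com/apache/spark.git
    cd spark
    ./build/mvn -DskipTests clean package    # build Spark using the bundled Maven wrapper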

Installation/Configuration Steps

If you choose to install Spark manually, I suggest using Vagrant, which provides an isolated environment on top of your host OS and prevents the host OS from getting corrupted. The detailed steps are available on GitHub.
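For reference, the basic Vagrant workflow looks like this (the ubuntu/bionic64 box is just an example; the provisioning details live in the GitHub repository mentioned above):

    vagrant init ubuntu/bionic64    # create a Vagrantfile for an example Ubuntu box
    vagrant up                      # download the box and boot the isolated VM
    vagrant ssh                     # log into the VM, then install Java and Spark inside it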

Downloading the quick start VM or Distribution

You can download a quick start VM from Hortonworks or Cloudera. These are virtual machine images, so to use them you need to install VMware or Oracle VirtualBox. The images come pre-configured, so you don’t have to perform any additional installation or configuration. For full distributions you can choose Hortonworks, Cloudera, or MapR.

Running Spark in the Cloud

Databricks offers a free Community Edition of its cloud service as a learning environment. Sign up for Databricks Community Edition and follow the sign-up steps.

Testing the Installation

Once we have installed Apache Spark, we need to test the installation. Run the checks below, followed by the short smoke test after the list.

  • Java version: java -version
  • sbt version: sbt about
  • Hadoop (only if you installed it): hdfs version and hdfs dfs -ls .
  • Python version: python --version
  • Execute PySpark: type pyspark in the console and you will be dropped into the Spark shell for Python
    >>> sc
    >>> sc.version
    
  • Execute spark-shell: type spark-shell in the console and you will be dropped into the Spark shell for Scala
    scala> sc
    scala> sc.version
    
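Beyond checking sc.version, a quick smoke test is to run a tiny job; for example, from the PySpark shell:

    >>> sc.parallelize(range(100)).sum()    # distributes the numbers 0-99 and adds them up; should print 4950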

Summary

You now have a Spark development environment up and running on your computer.

So far we have set up and tested our Spark environment. In the next article we will expand on this process, building a simple but complete data cleaning application.