Apache Spark is a lightning-fast cluster computing system that overcomes the limitations of the earlier MapReduce model for large data sets. It is the framework of choice for data scientists and machine learning engineers working on big data problems. The Spark engine is written in Scala, widely considered the language of choice for scalable computing, but Apache Spark provides high-level APIs in Java, Scala, Python, and R. PySpark is a good entry point to leverage Spark's parallel and distributed processing.
Spark 3.0.1 is a maintenance release containing stability fixes. It is based on Spark 3.0.0, released on 18 June 2020 after passing the vote on 10 June 2020, and this release works well with Google Colab. However, some initialization is needed each time, for each notebook. If you prefer, you could instead install Ubuntu Linux natively or on Windows Subsystem for Linux: follow this link
Get a Google account. If you already have one, use it; if not, I can create one for you on the domain cnamliban.org (an education Google Workspace account), or you can create one of your own.
Then there is only one thing to do: go to https://colab.research.google.com/ and follow these few steps...
Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with
Zero configuration required
Free access to GPUs
Easy sharing
Whether you're a student, a data scientist, or an AI researcher, Colab can make your work easier. You can watch this introduction on YouTube (Introduction to Colab) to learn more, or just get started with these steps. We will use Python (a Jupyter notebook) on Colab to install Java and Spark, then PySpark, and we will be ready to learn and experiment with data mining analytics.
Here are the steps (use copy/paste to try them), and you can follow along with this lecture.
Note: "!" is used to run bash shell commands; it is not Python.
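The "!" prefix only works inside a notebook; in a plain Python script, the closest equivalent is the standard subprocess module. A minimal sketch:

```python
import subprocess

# equivalent of "!echo hello" in a notebook cell:
# run a shell command and capture its output
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
print(result.stdout.strip())  # hello
```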
# install Java (Colab is in fact a Debian-like Linux distro)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download spark 3.0.1
![ -f spark-3.0.1-bin-hadoop2.7.tgz ] || wget -q https://mirrors.netix.net/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# untar it if not already done
![ -d spark-3.0.1-bin-hadoop2.7 ] || tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
# define some environment variables directly with Python, using the os module
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"
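Setting variables through os.environ works because they are inherited by any process the notebook launches afterwards, including the JVM that Spark starts. A minimal check (MY_VAR is a made-up name, for illustration only):

```python
import os
import subprocess

# variables set via os.environ are inherited by child processes
os.environ["MY_VAR"] = "hello-spark"  # hypothetical variable, for illustration
out = subprocess.run(["printenv", "MY_VAR"], capture_output=True, text=True)
print(out.stdout.strip())  # hello-spark
```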
# we now need a tool to help Python find the Spark installation
# fortunately, there is a module for that
!python -m pip install -q findspark
# let's test
import findspark
findspark.init()
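What findspark.init() does is simple: it locates the Spark installation (via SPARK_HOME) and puts its Python libraries on sys.path so that pyspark becomes importable. A rough hand-written equivalent, assuming the SPARK_HOME path set above (the bundled py4j zip name varies by Spark release):

```python
import glob
import os
import sys

# what findspark.init() does, approximately:
spark_home = os.environ.get("SPARK_HOME", "/content/spark-3.0.1-bin-hadoop2.7")
# make "import pyspark" resolvable
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j (the Python-JVM bridge) ships inside Spark as a zip
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
    sys.path.insert(0, zip_path)
```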
from pyspark.sql import SparkSession
# create a Spark session
spark = SparkSession.builder.appName("BDML").master("local[*]").getOrCreate()
print(spark)
print(spark.sparkContext)
<pyspark.sql.session.SparkSession object at 0x7f928ddd3cc0>
<SparkContext master=local[*] appName=BDML>
spark.stop()