Apache Spark is a lightning-fast cluster computing system that overcomes the limitations of the earlier MapReduce model for large data sets. It is the framework of choice for data scientists and machine learning engineers working on big data problems. The Spark engine is written in Scala, widely considered the language of choice for scalable computing, but Apache Spark provides high-level APIs in Java, Scala, Python, and R. PySpark is a good entry point to leverage Spark's parallel and distributed processing.
Spark 3.0.1 is a maintenance release containing stability fixes. It is based on Spark 3.0.0, released on 18 June 2020 after passing the vote on 10 June 2020, and this release works well with Google Colab. However, some initialization is needed each time, for each notebook. If you prefer, you could instead install Ubuntu Linux natively or on Windows Subsystem for Linux: follow this link
Get a Google account. If you already have one, use it; if not, I can create one for you on the domain cnamliban.org (an education Google Workspace account), or you can create one of your own.
Then there is only one thing to do: go to https://colab.research.google.com/ and follow these few steps...
Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with
Zero configuration required
Free access to GPUs
Easy sharing
Whether you're a student, a data scientist, or an AI researcher, Colab can make your work easier. You can watch this introduction on YouTube (Introduction to Colab) to learn more, or just get started with these steps. We will use Python (a Jupyter notebook) on Colab to install Java and Spark, then PySpark, and we will be ready to learn and experiment with data mining analytics.
Here are the steps (use copy/paste to try them), and you can follow along with this lecture.
Note: "!" is used to run bash shell commands; it is not Python.
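The "!" prefix only works inside a notebook; in a plain Python script, the closest equivalent is the standard subprocess module. A minimal sketch:

```python
import subprocess

# equivalent of "!echo hello" in a notebook cell:
# run a shell command and capture its output
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
print(result.stdout.strip())  # hello
```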
# install Java (Colab is in fact a Debian-like Linux distro)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download spark 3.0.1
![ -f spark-3.0.1-bin-hadoop2.7.tgz ] || wget -q https://mirrors.netix.net/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# untar it if not already done
![ -d spark-3.0.1-bin-hadoop2.7 ] || tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
# define some environment variables directly with Python, using the os module
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"
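Setting variables through os.environ works because they are inherited by any process the notebook launches afterwards, including the JVM that Spark starts. A minimal check (MY_VAR is a made-up name, for illustration only):

```python
import os
import subprocess

# variables set via os.environ are inherited by child processes
os.environ["MY_VAR"] = "hello-spark"  # hypothetical variable, for illustration
out = subprocess.run(["printenv", "MY_VAR"], capture_output=True, text=True)
print(out.stdout.strip())  # hello-spark
```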
# we now need a tool to help Python find the Spark installation
# fortunately, there is a module for that
!python -m pip install -q findspark
# let's test
import findspark
findspark.init()
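What findspark.init() does is simple: it locates the Spark installation (via SPARK_HOME) and puts its Python libraries on sys.path so that pyspark becomes importable. A rough hand-written equivalent, assuming the SPARK_HOME path set above (the bundled py4j zip name varies by Spark release):

```python
import glob
import os
import sys

# what findspark.init() does, approximately:
spark_home = os.environ.get("SPARK_HOME", "/content/spark-3.0.1-bin-hadoop2.7")
# make "import pyspark" resolvable
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j (the Python-JVM bridge) ships inside Spark as a zip
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
    sys.path.insert(0, zip_path)
```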
from pyspark.sql import SparkSession
# create a Spark session
spark = SparkSession.builder.appName("BDML").master("local[*]").getOrCreate()
print(spark)
print(spark.sparkContext)
<pyspark.sql.session.SparkSession object at 0x7f928ddd3cc0>
<SparkContext master=local[*] appName=BDML>
spark.stop()