About the first two sessions of the workshop
Spark and its Python API, PySpark, help you perform data analysis at scale, enabling you to build more scalable analyses and pipelines. These activities start by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter (or Google Colab) to Spark to produce rich data visualizations. After that, you'll delve into the various Spark components and the overall Spark architecture.
You'll learn to work with Apache Spark and perform machine learning tasks (with a focus on text mining examples, but not only). You'll gather and query data using Spark SQL to overcome the challenges involved in reading large datasets. You'll use Resilient Distributed Datasets (RDDs) and the DataFrame API to work with Spark MLlib, and learn about the Pipeline API. Finally, we provide tips and tricks about open and free ('libre') environments for this kind of work: data analysis, machine learning, development, IDEs, and more.
By the end of this course, you will not only be familiar with data analytics but will also have learned to use PySpark to easily analyze large datasets at scale in organizations.