About the first two sessions of the workshop
Spark and its Python API, PySpark, help you perform data analysis at scale, enabling you to build more scalable analyses and pipelines. These activities start by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter (or Google Colab) to Spark to produce rich data visualizations. After that, you'll delve into the various Spark components and the overall Spark architecture.
You'll learn to work with Apache Spark and perform machine learning tasks (with a focus on text mining examples, but not only). You'll gather and query data using Spark SQL to overcome the challenges involved in reading large datasets. You'll use Resilient Distributed Datasets (RDDs) and the DataFrame API to work with Spark MLlib, and learn about the Pipeline API. Finally, we provide tips and tricks about open and free ('libre') environments for this kind of work: data analysis, machine learning, development, IDEs, and more.
By the end of this course, you will not only be familiar with data analytics but will also have learned to use PySpark to easily analyze large datasets at scale in organizations.