Course curriculum

  • 1

    Introduction

    • Why Big Data
    • Applications of PySpark
    • Introduction to Instructor
    • Introduction to Course
    • Projects Overview
  • 2

    Introduction to Hadoop, Spark Ecosystems and Architectures

    • Why Spark
    • Hadoop Ecosystem
    • Spark Architecture and Ecosystem
    • Databricks Sign-Up
    • Create Databricks Notebook
    • Download Spark and Dependencies
    • Java Setup on Windows
    • Python Setup on Windows
    • Spark Setup on Windows
    • Hadoop Setup on Windows
    • Running Spark on Windows
    • Java Download on Mac
    • Installing JDK on Mac
    • Setting Java Home on Mac
    • Java Check on Mac
    • Installing Python on Mac
    • Setup Spark on Mac
  • 3

    Spark RDDs

    • Spark RDDs
    • Creating Spark RDD
    • Running Spark Code Locally
    • RDD Map (Lambda)
    • RDD Map (Simple Function)
    • Quiz (Map)
    • Solution 1 (Map)
    • Solution 2 (Map)
    • RDD FlatMap
    • RDD Filter
    • Quiz (Filter)
    • Solution (Filter)
    • RDD Distinct
    • RDD GroupByKey
    • RDD ReduceByKey
    • Quiz (Word Count)
    • Solution (Word Count)
    • RDD (Count and CountByValue)
    • RDD (saveAsTextFile)
    • RDD (Partition)
    • Finding Average-1
    • Finding Average-2
    • Quiz (Average)
    • Solution (Average)
    • Finding Min and Max
    • Quiz (Min and Max)
    • Solution (Min and Max)
    • Project Overview
    • Total Students
    • Total Marks by Male and Female Student
    • Total Passed and Failed Students
    • Total Enrollments per Course
    • Total Marks per Course
    • Average Marks per Course
    • Finding Minimum and Maximum Marks
    • Average Age of Male and Female Students
  • 4

    Spark DFs

    • Introduction to Spark DFs
    • Creating Spark DFs
    • Spark Infer Schema
    • Spark Provide Schema
    • Create DF from RDD
    • Rectifying the Error
    • Select DF Columns
    • Spark DF withColumn
    • Spark DF withColumnRenamed and Alias
    • Spark DF Filter rows
    • Quiz (select, withColumn, filter)
    • Solution (select, withColumn, filter)
    • Spark DF (Count, Distinct, Duplicate)
    • Quiz (Distinct, Duplicate)
    • Solution (Distinct, Duplicate)
    • Spark DF (sort, orderBy)
    • Quiz (sort, orderBy)
    • Solution (sort, orderBy)
    • Spark DF (Group By)
    • Spark DF (Group By - Multiple Columns and Aggregations)
    • Spark DF (Group By - Visualization)
    • Spark DF (Group By - Filtering)
    • Quiz (Group By)
    • Solution (Group By)
    • Quiz (Word Count)
    • Solution (Word Count)
    • Spark DF (UDFs)
    • Quiz (UDFs)
    • Solution (UDFs)
    • Solution (Cache and Persist)
    • Spark DF (DF to RDD)
    • Spark DF (Spark SQL)
    • Spark DF (Write DF)
    • Project Overview
    • Project (Count and Select)
    • Project (Group By)
    • Project (Group By, Aggregations and Order By)
    • Project (Filtering)
    • Project (UDF and WithColumn)
    • Project (Write)
  • 5

    Collaborative filtering

    • Collaborative filtering
    • Utility Matrix
    • Explicit and Implicit Ratings
    • Expected Results
    • Dataset
    • Joining Dataframes
    • Train and Test Data
    • ALS model
    • Hyperparameter tuning and cross validation
    • Best model and evaluate predictions
    • Recommendations
  • 6

    Spark Streaming

    • Introduction to Spark Streaming
    • Spark Streaming with RDD
    • Spark Streaming Context
    • Spark Streaming Reading Data
    • Spark Streaming Cluster Restart
    • Spark Streaming RDD Transformations
    • Spark Streaming DF
    • Spark Streaming Display
    • Spark Streaming DF Aggregations
  • 7

    ETL Pipeline

    • Introduction to ETL
    • ETL Pipeline Flow
    • Dataset
    • Extracting Data
    • Transforming Data
    • Loading Data (Creating RDS-I)
    • Loading Data (Creating RDS-II)
    • RDS Networking
    • Downloading Postgres
    • Installing Postgres
    • Connect to RDS through pgAdmin
    • Loading Data
  • 8

    Project - Change Data Capture / Ongoing Replication

    • Introduction to Project
    • Project Architecture
    • Creating RDS MySQL Instance
    • Creating S3 Bucket
    • Creating DMS Source Endpoint
    • Creating DMS Destination Endpoint
    • Creating DMS Instance
    • MySQL Workbench
    • Connecting with RDS and Dumping Data
    • Querying RDS
    • DMS Full Load
    • DMS Ongoing Replication
    • Stopping Instances
    • Glue Job (Full Load)
    • Glue Job (Change Capture)
    • Glue Job (CDC)
    • Creating Lambda Function and Adding Trigger
    • Checking Trigger
    • Getting S3 file name in Lambda
    • Creating Glue Job
    • Adding Invoke for Glue Job
    • Testing Invoke
    • Writing Glue Shell Job
    • Full Load Pipeline
    • Change Data Capture Pipeline