Azure Databricks

What is Azure Databricks?

- Azure Databricks is a managed Apache Spark analytics service offered by Microsoft as part of the Microsoft Azure cloud platform.
- It provides a collaborative environment for data scientists, engineers, and analysts to work on big data and machine learning projects.

Key Features

- Fully managed Spark environment - Databricks handles cluster provisioning, configuration, and scaling, allowing users to focus on their work.
- Unified Analytics Platform - Supports multiple programming languages (Python, Scala, R, SQL) and integrates with various data sources and tools.
- Collaborative Workspace - Provides a notebook-based interface for interactive analysis, code development, and sharing.
- MLflow Integration - Enables end-to-end machine learning lifecycle management.
- Delta Lake Support - Provides a storage layer that adds reliability and performance to data lakes.

Key Use Cases

- Big Data Analytics - Leverage the power of Spark for large-scale data processing and analysis.

- Machine Learning and AI - Build, train, and deploy machine learning models at scale.

- Data Engineering - Build ETL (Extract, Transform, Load) pipelines and prepare data.

- Streaming and Real-time Analytics - Process and analyze real-time data streams.

Pricing and Deployment

- Azure Databricks is a fully managed service, so users don't need to manage the underlying infrastructure.

- Pricing is based on compute consumption, measured in Databricks Units (DBUs), and on storage used, with pay-as-you-go (on-demand) or pre-purchased (committed) options.

- Azure Databricks can be integrated with other Azure services, such as Azure Storage, Azure SQL Database, and Azure Cosmos DB.
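
As a rough illustration of the DBU-based pricing described above, the following Python sketch estimates a compute charge as DBUs consumed multiplied by a price per DBU. The function name and every number in it are hypothetical placeholders, not actual Azure Databricks rates; a real bill also depends on workload type, pricing tier, and charges from other Azure services.

```python
def estimate_dbu_cost(dbus_per_hour: float, hours: float, price_per_dbu: float) -> float:
    """Estimate a compute charge as DBUs consumed times the price per DBU."""
    if min(dbus_per_hour, hours, price_per_dbu) < 0:
        raise ValueError("inputs must be non-negative")
    dbus_consumed = dbus_per_hour * hours
    return dbus_consumed * price_per_dbu


# Hypothetical example: a cluster that consumes 2 DBUs per hour and runs for
# 10 hours, at an assumed (made-up) price of 0.50 per DBU.
print(estimate_dbu_cost(2.0, 10, 0.50))
```

Committed-use discounts could be modeled by passing a lower `price_per_dbu`; the structure of the calculation stays the same.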

Azure Databricks Course Curriculum

Module 1: Big Data Analytics

  • What is Big Data Analytics
  • Data Analytics Platform
    • Storage
    • Compute
  • Data Processing Paradigms
    • Monolithic Computing
    • Distributed Computing
  • Distributed Computing Frameworks
    • Hadoop MapReduce
    • Apache Spark
  • Distributed Storage
  • Big Data Analytics: Data Lakes
    • Tightly Coupled Data Lake
    • Loosely Coupled Data Lake

    Module 3: Core Databricks Concepts

    • Workspace
    • Notebooks
    • Library
    • Folder
    • Repos
    • Data
    • Compute
    • Workflows

    Module 5: Databricks - Internal Storage

    • Databricks File System (DBFS)

    Module 7: Storages - Azure Credentials

    • Account Access Key
    • Shared Access Signature Token
    • OAuth 2.0 Azure Service Principal

    Module 9: Databricks Utilities

    • File System Utility
    • Widgets Utility
    • Secrets Utility
    • Notebook Utility

    Module 11: CSV File Format

    • Reading Data
    • Reading Data from Multiple CSV Files
    • Writing Data

    Module 13: Excel File Format

    • Single Sheet Reading
    • Multiple Sheet Reading Using List object
    • Dynamically Reading Multiple Sheets

    Module 15: Libraries

    • Install Cluster Libraries
      • Maven Package
      • PyPI Package
      • CRAN Package

    Module 17: Databricks - Accessing Azure Data Lake

    • Account Access Key
    • Shared Access Signature Token
    • Mounting Azure Data Lake (Service Principal)

    Module 19: Notebook - Code Modularity

    • %run
    • dbutils.notebook.run()

    Module 21: Introduction to Delta Lake

    • Delta Lake Features
      • ACID transactions
      • Handling metadata
      • Streaming and batch workloads
      • Schema enforcement
      • Time travel
      • Upserts and deletes
    • Delta Lake Components
      • _delta_log (Transaction log)
      • Versioned Parquet files
    • Delta Lake Operations
      • Create Table
      • Upsert to a table
      • Read a table
      • Update a table
      • Delete from a table
      • Display table history
      • Time travel
      • Clean up snapshots with VACUUM
      • Delta Lake table history
      • Restore a Delta table to an earlier state
      • Vacuum unused data files

    Module 27: Databricks Integration With Azure Data Factory

    • Call a Notebook using Notebook Activity
    • SetVariable Activity
    • Trigger ADF Pipeline

    Module 2: Introduction to Azure Databricks

    • Introduction to Databricks
    • Azure Databricks Architecture
    • Azure Databricks Main Concepts

    Module 4: Types Of Clusters

    • All-Purpose Clusters
    • Job Clusters
    • Pools

    Module 6: Databricks - External Storage

    • Azure Blob Storage
    • Azure Data Lake Storage Gen2
    • Azure SQL Database
    • Azure Synapse Dedicated SQL Pool
    • Snowflake

    Module 8: Databricks Notebooks - Magic Commands

    • %python or %py
    • %r
    • %scala
    • %sql

    Module 10: Big Data File Formats

    • Row-Based File Formats
      • CSV, TSV, and Avro
    • Columnar File Formats
      • Parquet, Delta, and ORC

    Module 12: JSON File Format

    • Single Line JSON
    • Multi Line JSON
    • Complex Multi Line JSON
      • Arrays
      • Struct Fields
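
The "Complex Multi Line JSON" topic above deals with nested struct fields and arrays. The plain-Python sketch below (standard library only, not Spark code) shows the general idea of flattening such a document: struct fields become dotted column names, and each array element becomes its own row, loosely similar to an explode-style operation. The helper name `flatten` and the sample record are hypothetical and are not taken from the course material.

```python
import json


def flatten(record, prefix=""):
    """Flatten one nested record into a list of flat rows (dicts)."""
    rows = [{}]
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Struct field: flatten recursively with a dotted prefix.
            parts = flatten(value, prefix=f"{name}.")
        elif isinstance(value, list):
            # Array: produce one output row per element.
            parts = []
            for item in value:
                if isinstance(item, dict):
                    parts.extend(flatten(item, prefix=f"{name}."))
                else:
                    parts.append({name: item})
            if not parts:
                parts = [{name: None}]  # keep the row when the array is empty
        else:
            parts = [{name: value}]
        # Combine the rows built so far with the rows from this field.
        rows = [{**row, **part} for row in rows for part in parts]
    return rows


# Hypothetical multi-line JSON document with one struct and one array.
doc = json.loads("""
{
  "id": 1,
  "address": {"city": "Pune", "zip": "411001"},
  "phones": ["111", "222"]
}
""")

for row in flatten(doc):
    print(row)
```

The sample document yields two rows, one per phone number, each carrying the `id`, `address.city`, and `address.zip` values.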

    Module 14: XML File Format

    • Simple XML Files
    • Complex XML Files

    Module 16: Spark Structured Streaming

    • ReadStream
    • WriteStream
    • Output Modes
    • Triggers
      • Fixed Interval
      • One Time
      • Continuous
    • Managing Streams

    Module 18: Azure Databricks - Types of Loads

    • History Load
    • Incremental Load
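
The difference between the two load types above can be sketched in plain Python (standard library only, not Spark code): a history load copies every source row once, while an incremental load copies only the rows changed since the last recorded watermark. The `updated_at` column, the watermark approach, and the function names are assumptions made for this illustration; the course may implement these loads differently.

```python
def history_load(source_rows):
    """Full (history) load: take every source row and record the high-water mark."""
    target = list(source_rows)
    watermark = max((r["updated_at"] for r in target), default=None)
    return target, watermark


def incremental_load(source_rows, target, watermark):
    """Incremental load: append only rows newer than the previous watermark."""
    new_rows = [r for r in source_rows
                if watermark is None or r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return target + new_rows, new_watermark


# Hypothetical source data; updated_at holds ISO date strings, which sort correctly.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-02"},
]
target, wm = history_load(source)

# A new row arrives in the source; only that row is picked up by the next run.
source.append({"id": 3, "updated_at": "2024-01-05"})
target, wm = incremental_load(source, target, wm)
print(len(target), wm)
```

Running the incremental step again without new source rows leaves both the target and the watermark unchanged, which is the property that makes repeated runs safe in this sketch.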

    Module 20: Introduction to Spark SQL Module

    • Managed Tables (Internal Tables)
      • DataFrame API
      • Spark SQL API
    • Un-Managed Tables (External Tables)
      • DataFrame API
      • Spark SQL API
    • Temporary Views (Temporary Tables)
    • Global Temporary Views

    Module 22: Delta Lake - Slowly Changing Dimension

    • Type 1 Dimension
    • Type 2 Dimension
    • Type 3 Dimension
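
For orientation: a Type 1 dimension overwrites the old value, a Type 2 dimension keeps full history by adding a new row per change, and a Type 3 dimension keeps the previous value in an extra column. The plain-Python sketch below (standard library only, not Delta Lake or Spark code) illustrates the Type 2 logic. The column names `start_date`, `end_date`, and `is_current`, and the function name `apply_scd2`, are assumptions chosen for this example.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, today):
    """Return a new dimension list with SCD Type 2 history applied."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # tracked attributes unchanged: nothing to do
        if old is not None:
            # Expire the existing current row instead of overwriting it.
            old["is_current"] = False
            old["end_date"] = today
        new_row = {**new, "start_date": today, "end_date": None, "is_current": True}
        result.append(new_row)
        current[new[key]] = new_row
    return result


# Hypothetical dimension with one current row, plus one change and one new key.
dim = [{"id": 1, "city": "Pune", "start_date": date(2023, 1, 1),
        "end_date": None, "is_current": True}]
incoming = [{"id": 1, "city": "Delhi"}, {"id": 2, "city": "Goa"}]

updated = apply_scd2(dim, incoming, key="id", tracked=["city"], today=date(2024, 6, 1))
for row in updated:
    print(row["id"], row["city"], row["is_current"])
```

A Type 1 variant would instead update `city` in place on the matching row, and a Type 3 variant would move the old value into a column such as a hypothetical `previous_city` before overwriting.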

    Module 23: Databricks - Azure SQL Database

    • Reading Data With JDBC Driver
    • Writing Data With JDBC Driver

    Module 24: Databricks - Synapse Dedicated SQL Pool

    • Reading Data From Synapse Table
    • Writing Data To Synapse Table

    Module 25: Databricks - Snowflake

    • Reading Data From Snowflake Table
    • Writing Data To Snowflake Table

    Module 26: Delta Lake - Performance Optimization Techniques

    • OPTIMIZE a Table
    • Z-ORDER by Columns

    Module 28: Azure Key Vault Integration With Databricks

    • Create Secrets
    • Create Secret Scope

    Azure Databricks Regular Class Practice Sessions

    • Session1_Introduction to Big Data Analytics Platform
    • Session2_Big Data Analytics_Data Processing Paradigms (Compute)
    • Session3_Distributed Computing Frameworks_Apache Hadoop vs Apache Spark
    • Session4_Big Data Analytics_Distributed Storage_Key Takeaways
    • Session5_Big Data Analytics_Tightly and Loosely Coupled Data Lakes
    • Session6_Distributed Computing Cluster_Scalability
    • Session7_Introduction to Azure Databricks
    • Session8_Create Azure Databricks Workspace
    • Session9_Azure Databricks_Types of Clusters_Configurations
    • Session10_Creation of All-Purpose Cluster_Databricks Pools
    • Session11_Introduction to Databricks File System(DBFS)
    • Session12_Databricks File System(DBFS)_dbutils.fs Utility_%fs Magic Command
    • Session13_Databricks_Spark Data APIs_RDD_DataFrame_Dataset
    • Session14_Databricks_Different Ways of Creating DataFrame
    • Session15_Reading Data from Single CSV File_DataFrame API
    • Session16_Reading Data from Single CSV File_User Defined Schema
    • Session17_Reading Data from CSV Files_Data Parsing Modes
    • Session18_Reading Data from Single Line JSON File Format
    • Session19_Reading Data from Multi Line JSON with Explicit Schema
    • Session20_Reading Data from Multi Line Complex JSON File Format
    • Session21_Reading Data from Multiple Excel Sheets Dynamically
    • Session22_Databricks_Reading Data from XML File Format
    • Session23_Databricks_Batch Data Processing
    • Session24_Databricks_Batch Data Processing_Transformations
    • Session25_Databricks_Batch Data Processing_Narrow_Wide Transformations
    • Session26_Databricks_Data Merging_Joining Two DataFrames_Types of Joins
    • Session27_Databricks_Data Merging_Union_UnionAll_UnionByName
    • Session28_Databricks_Batch Data Processing_DataFrame Writer API_Save Modes
    • Session29_Databricks_Spark Structured Streaming API_Real-Time Processing
    • Session30_Databricks_Calling a Notebook from another Notebook using %run magic Command
    • Session31_Databricks_Calling a Notebook from another Notebook using run() Method
    • Session32_Databricks_Introduction to Spark SQL Module
    • Session33_Databricks_Spark SQL_Create Global Managed Tables_DataFrame API_SQL API
    • Session34_Databricks_Spark SQL_Create Global Un-Managed Tables_DataFrame API_SQL API
    • Session35_Spark SQL_Types of Views_Local_Global Temporary Views
    • Session36_Introduction to Delta Lake
    • Session37_Create Delta Lake Tables_Explore Components of Delta Lake
    • Session38_Databricks_Delta Lake_Time Travel Using Version and TimeStamp
    • Session39_Databricks_Delta Lake_Schema Validation_Enforcement
    • Session40_Databricks_Delta Lake_Schema Evolution Using mergeSchema Option
    • Session41_Databricks_Delta Lake_Updates_Deletes in Data Lake with Delta Lake
    • Session42_Databricks_Delta Lake_OPTIMIZE_ZORDER
    • Session43_Databricks_Delta Table_Vacuum Command
    • Session44_Databricks_Designing Workflow to Orchestrate Multiple Tasks
    • Session45_Databricks_Implementation of History Load_Incremental Load
    • Session46_Calling a Databricks Notebook from ADF Pipeline
    • Session47_Databricks_Reading_Writing Data To Azure Blob Storage_Account Access Key
    • Session48_Databricks_Reading_Writing Data To Azure Data Lake Gen2_Azure Service Principal
    • Session49_Databricks_Create Mount Point to Azure Blob Storage_Data Lake Storage Gen2
    • Session50_Databricks_Read_Write_Azure SQL Database
    • Session51_Create Snowflake Free Trial Account
    • Session52_Read and write data from Snowflake
    • Session53_Read and write data from Synapse Dedicated SQL Pool
    • Session54_Databricks_Introduction to Slowly Changing Dimension(SCD)
    • Session55_Databricks_Implementation of SCD Type 0 Dimension
    • Session56_Databricks_Implementation of SCD Type 1 Dimension
    • Session57_Databricks_Introduction to SCD Type 2 Dimension
    • Session58_Databricks_Implementation of SCD Type 2 Dimension
    • Session59_Databricks_Implementation of SCD Type 3 Dimension
    • Session60_Data Engineering_Medallion Project Architecture

    Azure Databricks_Assignments & Case Studies

    • ADB_Assignment1_Azure Databricks_Types of Clusters
    • ADB_Assignment2_Azure Databricks_Cluster_Pools
    • ADB_Assignment3_Azure Databricks_Compute_On-Demand vs Azure Spot VM Instances
    • ADB_Assignment4_Azure Databricks_Bigdata File formats
    • ADB_Assignment5_Reading Data from Multiple CSV Files With the Same Structure
    • ADB_Assignment1_Reading TSV Files_User Defined Schema
    • ADB_Assignment6_Apache Spark_Transformations_Actions
    • ADB_Assignment7_Create DataFrame Using Python Collection Objects_List_Tuple_Dictionary
    • ADB_Assignment8_Create DataFrame_Define Schema Programmatically Using StructType() & StructField()
    • ADB_Assignment9_Reading Single_Double_PIPE Delimited Files
    • ADB_Assignment10_Reading_Multiple_Different_Delimiter CSV Files
    • ADB_Assignment11_Spark Low Level APIs vs Structured APIs
    • ADB_Assignment12_Creation of Structured API_DataFrame
    • ADB_Assignment13_Creation of DataFrame_Schemas
    • ADB_Assignment14_Python Functions
    • ADB_Assignment15_Python Dictionaries_Functions_Widgets
    • ADB_Assignment16_Flatten Multi Line Complex JSON Files_Python User Defined Function
    • ADB_Assignment17_Flatten Arrays_Maps_explode()_explode_outer() Functions
    • ADB_Assignment18_Batch ETL Processing_Replace Nulls with Literals
    • ADB_Assignment19_Batch ETL Processing_GroupBy_Aggregation Processing
    • ADB_Assignment20_Batch ETL Processing_PySpark_Join Types
    • ADB_Assignment21_Batch ETL Processing_PySpark_Union_UnionAll
    • ADB_Assignment22_Batch ETL Processing_PySpark_Distinct_DropDuplicates Methods
    • ADB_Assignment23_Batch ETL Processing_GroupBy_Aggregation Processing
    • ADB_Assignment24_Create Workflow to Orchestrate Multiple Tasks
    • ADB_Assignment25_Implement Slowly Changing Dimension Type1 and Type3
    • ADB_Assignment26_Batch Processing_Data Processing Techniques_Python List Comprehension
    • ADB_Assignment27_Batch Processing_Sorting on Single Column_sort() method
    • ADB_Assignment28_Batch Processing_Sorting on Multiple Columns
    • ADB_Assignment29_Batch Processing_PySpark_Date Functions
    • ADB_Assignment30_Batch Processing_PySpark_Date Functions
    • ADB_Assignment31_Batch Processing_PySpark_Identify or Check Duplicates in DataFrame
    • ADB_Assignment32_Batch Processing_PySpark_Dropping Rows that Contain Null Values using dropna() & na.drop() Methods
    • ADB_Assignment33_Batch Processing_PySpark_Replacing Nulls with another Value Using fillna() Method_na.fill() Method
    • ADB_Assignment34_Batch Processing_Reading and Writing Data to Snowflake Cloud Data Platform
    • ADB_Assignment35_Delta Lake_Schema Validation_Enforcement
    • ADB_Assignment36_Delta Lake_Schema Evolution
    • ADB_Assignment37_Update_Delete Operations in data lake with Delta Lake
    • ADB_Assignment38_Auditing Data Changes with Operation History

