Azure Databricks

What is Azure Databricks?

- Azure Databricks is a managed Apache Spark analytics service offered by Microsoft as part of the Microsoft Azure cloud platform.
- It provides a collaborative environment for data scientists, engineers, and analysts to work on big data and machine learning projects.

Key Features

- Fully managed Spark environment - Databricks handles cluster provisioning, configuration, and scaling, allowing users to focus on their work.
- Unified Analytics Platform - Supports multiple programming languages (Python, Scala, R, SQL) and integrates with various data sources and tools.
- Collaborative Workspace - Provides a notebook-based interface for interactive analysis, code development, and sharing.
- MLflow Integration - Enables end-to-end machine learning lifecycle management.
- Delta Lake Support - Provides a storage layer that adds reliability and performance to data lakes.

Key Use Cases

- Big Data Analytics - Leverage the power of Spark for large-scale data processing and analysis.

- Machine Learning and AI - Build, train, and deploy machine learning models at scale.

- Data Engineering - Build ETL (Extract, Transform, Load) pipelines and prepare data for analysis.

- Streaming and Real-time Analytics - Process and analyze real-time data streams.

Pricing and Deployment

- Azure Databricks is a fully managed service, so users don't need to manage the underlying infrastructure.

- Pricing is based on compute consumption, measured in Databricks Units (DBUs), and on the storage used, with pay-as-you-go or pre-paid (committed) options.

- Azure Databricks can be integrated with other Azure services, such as Azure Storage, Azure SQL Database, and Azure Cosmos DB.
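The DBU-based pricing described above can be turned into a rough estimate. The following is a minimal sketch, not an official calculator: every rate in it is a placeholder, the `ClusterUsage` and `estimate_compute_cost` names are hypothetical, and it assumes the underlying Azure VMs are charged alongside the DBUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterUsage:
    """Hypothetical description of one cluster run (all rates are placeholders)."""
    nodes: int                 # number of VMs in the cluster (driver + workers)
    hours: float               # how long the cluster ran
    dbu_per_node_hour: float   # DBUs one VM consumes per hour (depends on VM size)
    dbu_price: float           # price per DBU (depends on tier and workload type)
    vm_price_per_hour: float   # assumed Azure VM price per node-hour


def estimate_compute_cost(usage: ClusterUsage) -> dict:
    """Split the estimated compute bill into its DBU part and its VM part."""
    node_hours = usage.nodes * usage.hours
    dbus = node_hours * usage.dbu_per_node_hour
    dbu_cost = dbus * usage.dbu_price
    vm_cost = node_hours * usage.vm_price_per_hour
    return {
        "dbus": dbus,
        "dbu_cost": round(dbu_cost, 2),
        "vm_cost": round(vm_cost, 2),
        "total": round(dbu_cost + vm_cost, 2),
    }


# Made-up numbers: 4 nodes for 2 hours.
example = estimate_compute_cost(
    ClusterUsage(nodes=4, hours=2.0, dbu_per_node_hour=0.75,
                 dbu_price=0.40, vm_price_per_hour=0.50)
)
```

Real rates vary by region, VM size, pricing tier, and workload type, so the placeholders should be replaced with figures from the current Azure price list.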

Azure Databricks Course Curriculum

Module 1: Big Data Analytics

  • What is Big Data Analytics
  • Data Analytics Platform
    • Storage
    • Compute
  • Data Processing Paradigms
    • Monolithic Computing
    • Distributed Computing
  • Distributed Computing Frameworks
    • Hadoop MapReduce
    • Apache Spark
  • Distributed Storage
  • Big Data Analytics: Data Lakes
    • Tightly Coupled Data Lake
    • Loosely Coupled Data Lake

    Module 3: Core Databricks Concepts

    • Workspace
    • Notebooks
    • Library
    • Folder
    • Repos
    • Data
    • Compute
    • Workflows

    Module 5: Databricks - Internal Storage

    • Databricks File System (DBFS)

    Module 7: Storages - Azure Credentials

    • Account Access Key
    • Shared Access Signature Token
    • OAuth2.0 Azure Service Principal
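The three credential types listed in this module map to different sets of Spark configuration keys for Azure Data Lake Storage Gen2. The sketch below only builds those key/value pairs as plain dictionaries so that they can be compared side by side; the helper function names are hypothetical, and the storage account, client ID, secret, and tenant values are placeholders.

```python
def abfss_host(storage_account: str) -> str:
    """Fully qualified ADLS Gen2 endpoint used as the suffix of each config key."""
    return f"{storage_account}.dfs.core.windows.net"


def account_key_conf(storage_account: str, access_key: str) -> dict:
    """Account access key: a single key grants full access to the account."""
    host = abfss_host(storage_account)
    return {f"fs.azure.account.key.{host}": access_key}


def sas_token_conf(storage_account: str, sas_token: str) -> dict:
    """Shared access signature: a scoped, time-limited token."""
    host = abfss_host(storage_account)
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }


def service_principal_conf(storage_account: str, client_id: str,
                           client_secret: str, tenant_id: str) -> dict:
    """OAuth 2.0 with a service principal (client credentials flow)."""
    host = abfss_host(storage_account)
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
```

In a notebook, each pair would typically be applied with `spark.conf.set(key, value)`, with the secret values read through `dbutils.secrets.get(scope=..., key=...)` rather than written into the code.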

    Module 9: Databricks Utilities

    • File System Utility
    • Widgets Utility
    • Secrets Utility
    • Notebook Utility

    Module 11: CSV File Format

    • Reading Data
    • Reading Data from Multiple CSV Files
    • Writing Data

    Module 13: Excel File Format

    • Single Sheet Reading
    • Multiple Sheet Reading Using List object
    • Dynamically Reading Multiple Sheets

    Module 15: Libraries

    • Install Cluster Libraries
      • Maven Package
      • PyPI Package
      • CRAN Package

    Module 17: Databricks - Accessing Azure Data Lake

    • Account Access Key
    • Shared Access Signature Token
    • Mounting Azure Data Lake (Service Principal)

    Module 19: Notebook - Code Modularity

    • %run
    • dbutils.notebook.run()

    Module 21: Introduction To Delta Lake

    • Delta Lake Features
      • ACID transactions
      • Handling metadata
      • Streaming and batch workloads
      • Schema enforcement
      • Time travel
      • Upserts and deletes
    • Delta Lake Components
      • _delta_log (transaction log)
      • Versioned Parquet files
    • Delta Lake Operations
      • Create Table
      • Upsert to a table
      • Read a table
      • Update a table
      • Delete from a table
      • Display table history
      • Time travel
      • Clean up snapshots with VACUUM
      • Delta Lake table history
      • Restore a Delta table to an earlier state
      • Vacuum unused data files
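The two components listed in this module, the transaction log and the versioned Parquet files, are what make time travel and VACUUM possible. The toy model below is plain Python, not the Delta Lake API: `ToyDeltaLog` and its methods are hypothetical names, and the file names are made up. It only illustrates the idea that a table version is the result of replaying "add" and "remove" actions up to a given commit.

```python
class ToyDeltaLog:
    """Toy in-memory stand-in for a _delta_log directory (not the Delta Lake API)."""

    def __init__(self):
        self.commits = []  # commits[v] holds the list of actions for version v

    def commit(self, add=(), remove=()):
        """Record one atomic commit and return its version number."""
        actions = [("remove", f) for f in remove] + [("add", f) for f in add]
        self.commits.append(actions)
        return len(self.commits) - 1

    def files_at(self, version=None):
        """Replay commits 0..version to find the data files that form the table."""
        if version is None:
            version = len(self.commits) - 1
        active = set()
        for actions in self.commits[: version + 1]:
            for kind, path in actions:
                if kind == "add":
                    active.add(path)
                else:
                    active.discard(path)
        return active

    def vacuum_candidates(self):
        """Files written earlier that the latest version no longer references."""
        ever_added = {p for actions in self.commits
                      for kind, p in actions if kind == "add"}
        return ever_added - self.files_at()


log = ToyDeltaLog()
v0 = log.commit(add=["part-0.parquet"])                             # create table
v1 = log.commit(add=["part-1.parquet"])                             # append
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # update rewrites a file
```

Reading `log.files_at(0)` corresponds to querying an earlier version. Note that the real VACUUM command also honors a retention period before deleting unreferenced files, and that removing those files ends time travel to the versions that needed them; the toy model ignores both points.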

    Module 27: Databricks Integration With Azure Data Factory

    • Call a Notebook using Notebook Activity
    • SetVariable Activity
    • Trigger ADF Pipeline

    Module 2: Introduction to Azure Databricks

    • Introduction to Databricks
    • Azure Databricks Architecture
    • Azure Databricks Main Concepts

    Module 4: Types Of Clusters

    • All-Purpose Clusters
    • Job Clusters
    • Pools

    Module 6: Databricks - External Storage

    • Azure Blob Storage
    • Azure Data Lake Storage Gen2
    • Azure SQL Database
    • Azure Synapse Dedicated SQL Pool
    • Snowflake

    Module 8: Databricks Notebooks - Magic Commands

    • %python or %py
    • %r
    • %scala
    • %sql

    Module 10: Big Data File Formats

    • Row-Based File Formats
      • CSV, TSV, and Avro
    • Columnar File Formats
      • Parquet, Delta, and ORC
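The difference between the two families can be illustrated without any file format at all. The sketch below is plain Python over made-up sample records, with hypothetical function names: it stores the same data once as a list of rows and once as a dictionary of columns, then computes the same aggregate both ways.

```python
# Made-up sample data in a row-based layout: each record is stored whole.
rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.5},
    {"id": 3, "country": "IN", "amount": 5.0},
]


def to_columnar(records):
    """Pivot row records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_based(records):
    """Has to visit every full record to pick out one field."""
    return sum(record["amount"] for record in records)


def total_amount_columnar(columns):
    """Reads only the single column the query needs."""
    return sum(columns["amount"])


columns = to_columnar(rows)
```

This is why columnar formats tend to suit analytical queries that read a few columns of many rows, while row-based formats tend to suit writing or reading whole records.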

    Module 12: JSON File Format

    • Single Line JSON
    • Multi Line JSON
    • Complex Multi Line JSON
      • Arrays
      • Struct Fields
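The shapes listed in this module differ in where a record begins and ends. The sketch below uses only Python's standard `json` module on made-up sample text, with hypothetical helper names, to show the difference and to flatten one struct field and one array. Spark's JSON reader expects the single-line shape by default and needs its `multiLine` option for the other one.

```python
import json

# Single-line JSON (JSON Lines): one complete object per line.
single_line = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Multi-line JSON: one document spread over several lines, with a nested
# struct ("address") and an array ("phones").
multi_line = """
[
  {
    "id": 1,
    "address": {"city": "Pune", "zip": "411001"},
    "phones": ["111", "222"]
  }
]
"""


def read_single_line(text):
    """Parse each non-empty line independently."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_multi_line(text):
    """Parse the whole text as one JSON document."""
    return json.loads(text)


def flatten(record):
    """Promote struct fields to top-level columns; emit one row per array element."""
    base = {"id": record["id"]}
    for key, value in record["address"].items():
        base[f"address_{key}"] = value
    return [{**base, "phone": phone} for phone in record["phones"]]


flat_rows = flatten(read_multi_line(multi_line)[0])
```

The `flatten` helper is tied to this sample's schema; in PySpark the same result is usually obtained by selecting struct fields with dot notation and applying `explode()` to array columns.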

    Module 14: XML File Format

    • Simple XML Files
    • Complex XML Files

    Module 16: Spark Structured Streaming

    • ReadStream
    • WriteStream
    • Output Modes
    • Triggers
      • Fixed Interval
      • One Time
      • Continuous
    • Managing Streams

    Module 18: Azure Databricks - Types of Loads

    • History Load
    • Incremental Load
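A common way to implement the two load types is a watermark: the history load copies everything once, and each incremental load copies only rows changed since the last recorded timestamp. The sketch below is plain Python over in-memory lists, not Spark code; the function names and the `modified_at` column are assumptions made for illustration.

```python
from datetime import datetime


def history_load(source_rows):
    """Initial full load: copy every source row and return the new watermark."""
    target = list(source_rows)
    watermark = max((r["modified_at"] for r in target), default=None)
    return target, watermark


def incremental_load(source_rows, target, watermark):
    """Append only the rows modified after the previous watermark."""
    new_rows = [r for r in source_rows
                if watermark is None or r["modified_at"] > watermark]
    target = target + new_rows
    if new_rows:
        watermark = max(r["modified_at"] for r in new_rows)
    return target, watermark


source = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 1, 2)},
]
target, watermark = history_load(source)

# A new row arrives in the source; only that row is picked up.
source.append({"id": 3, "modified_at": datetime(2024, 1, 3)})
target, watermark = incremental_load(source, target, watermark)
```

This version is append-only. If existing rows can change in the source, the incremental step would need an upsert (merge) into the target instead of a plain append.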

    Module 20: Introduction To Spark SQL Module

    • Managed Tables (Internal Tables)
      • DataFrame API
      • Spark SQL API
    • Unmanaged Tables (External Tables)
      • DataFrame API
      • Spark SQL API
    • Temporary Views (Temporary Tables)
    • Global Temporary Views

    Module 22: Delta Lake - Slowly Changing Dimension

    • Type1 Dimension
    • Type2 Dimension
    • Type3 Dimension
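The three dimension types differ in how much history they keep when an attribute changes. The sketch below is plain Python over lists of dictionaries rather than a Delta Lake `MERGE`; the function names and the `start_date`, `end_date`, `is_current`, and `previous_<column>` column names are assumed conventions, and the sample values are made up.

```python
from datetime import date


def scd_type1(dimension, key, new_attrs):
    """Type 1: overwrite the attributes in place, keeping no history."""
    for row in dimension:
        if row["key"] == key:
            row.update(new_attrs)
            return dimension
    dimension.append({"key": key, **new_attrs})
    return dimension


def scd_type2(dimension, key, new_attrs, effective):
    """Type 2: close the current row and append a new current row."""
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dimension  # nothing changed, keep the current row
            row["is_current"] = False
            row["end_date"] = effective
    dimension.append({"key": key, **new_attrs, "start_date": effective,
                      "end_date": None, "is_current": True})
    return dimension


def scd_type3(dimension, key, column, new_value):
    """Type 3: keep only the previous value, in a companion column."""
    for row in dimension:
        if row["key"] == key and row[column] != new_value:
            row[f"previous_{column}"] = row[column]
            row[column] = new_value
    return dimension


dim2 = [{"key": 1, "city": "Pune", "start_date": date(2023, 1, 1),
         "end_date": None, "is_current": True}]
scd_type2(dim2, 1, {"city": "Delhi"}, date(2024, 1, 1))
```

After the call, `dim2` holds two rows for key 1: the closed "Pune" row and the current "Delhi" row. Type 1 would hold only "Delhi", and Type 3 would hold one row with "Delhi" and `previous_city` set to "Pune".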

    Module 23: Databricks - Azure SQL Database

    • Reading Data With JDBC Driver
    • Writing Data With JDBC Driver
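Both reading and writing through JDBC use the same connection options. The sketch below only assembles those options as a dictionary; the function name is hypothetical, and the server, database, table, and credential values are placeholders.

```python
def azure_sql_jdbc_options(server, database, table, user, password):
    """Build the option dictionary for Spark's JDBC data source."""
    url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;loginTimeout=30"
    )
    return {
        "url": url,
        "dbtable": table,        # table (or schema-qualified table) to read or write
        "user": user,
        "password": password,    # in practice, read this from a secret scope
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


opts = azure_sql_jdbc_options("myserver", "sales", "dbo.orders", "etl_user", "***")
```

In a notebook these options would typically be passed as `spark.read.format("jdbc").options(**opts).load()` for a read, and as `df.write.format("jdbc").options(**opts).mode("append").save()` for a write.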

    Module 24: Databricks - Synapse Dedicated SQL Pool

    • Reading Data From Synapse Table
    • Writing Data To Synapse Table

    Module 25: Databricks - Snowflake

    • Reading Data From Snowflake Table
    • Writing Data To Snowflake Table

    Module 26: Delta Lake - Performance Optimization Techniques

    • OPTIMIZE a Table
    • Z-ORDER by Columns

    Module 28: Azure Key Vault Integration With Databricks

    • Create Secrets
    • Create Secret Scope

    Azure Databricks Regular Class Practice Sessions

    • Session1_Introduction to Big Data Analytics Platform
    • Session2_Big Data Analytics_Data Processing Paradigms (Compute)
    • Session3_Distributed Computing Frameworks_Apache Hadoop vs Apache Spark
    • Session4_Big Data Analytics_Distributed Storage_Key Takeaways
    • Session5_Big Data Analytics_Tightly and Loosely Coupled Data Lakes
    • Session6_Distributed Computing Cluster_Scalability
    • Session7_Introduction to Azure Databricks
    • Session8_Create Azure Databricks Workspace
    • Session9_Azure Databricks_Types of Clusters_Configurations
    • Session10_Creation of All-Purpose Cluster_Databricks Pools
    • Session11_Introduction to Databricks File System(DBFS)
    • Session12_Databricks File System(DBFS)_dbutils.fs Utility_%fs Magic Command
    • Session13_Databricks_Spark Data APIs_RDD_DataFrame_Dataset
    • Session14_Databricks_Different Ways of Creating DataFrame
    • Session15_Reading Data from Single CSV File_DataFrame API
    • Session16_Reading Data from Single CSV File_User Defined Schema
    • Session17_Reading Data from CSV Files_Data Parsing Modes
    • Session18_Reading Data from Single Line JSON File Format
    • Session19_Reading Data from Multi Line JSON with Explicit Schema
    • Session20_Reading Data from Multi Line Complex JSON File Format
    • Session21_Reading Data from Multiple Excel Sheets Dynamically
    • Session22_Databricks_Reading Data from XML File Format
    • Session23_Databricks_Batch Data Processing
    • Session24_Databricks_Batch Data Processing_Transformations
    • Session25_Databricks_Batch Data Processing_Narrow_Wide Transformations
    • Session26_Databricks_Data Merging_Joining Two DataFrames_Types of Joins
    • Session27_Databricks_Data Merging_Union_UnionAll_UnionByName
    • Session28_Databricks_Batch Data Processing_DataFrame Writer API_Save Modes
    • Session29_Databricks_Spark Structured Streaming API_Real-Time Processing
    • Session30_Databricks_Calling a Notebook from another Notebook using %run magic Command
    • Session31_Databricks_Calling a Notebook from another Notebook using run() Method
    • Session32_Databricks_Introduction to Spark SQL Module
    • Session33_Databricks_Spark SQL_Create Global Managed Tables_DataFrame API_SQL API
    • Session34_Databricks_Spark SQL_Create Global Un-Managed Tables_DataFrame API_SQL API
    • Session35_Spark SQL_Types of Views_Local_Global Temporary Views
    • Session36_Introduction to Delta Lake
    • Session37_Create Delta Lake Tables_Explore Components of Delta Lake
    • Session38_Databricks_Delta Lake_Time Travel Using Version and TimeStamp
    • Session39_Databricks_Delta Lake_Schema Validation_Enforcement
    • Session40_Databricks_Delta Lake_Schema Evolution Using mergeSchema Option
    • Session41_Databricks_Delta Lake_Updates_Deletes in Data Lake with Delta Lake
    • Session42_Databricks_Delta Lake_OPTIMIZE_ZORDER
    • Session43_Databricks_Delta Table_Vacuum Command
    • Session44_Databricks_Designing Workflow to Orchestrate Multiple Tasks
    • Session45_Databricks_Implementation of History Load_Incremental Load
    • Session46_Calling a Databricks Notebook from ADF Pipeline
    • Session47_Databricks_Reading_Writing Data To Azure Blob Storage_Account Access Key
    • Session48_Databricks_Reading_Writing Data To Azure Data Lake Gen2_Azure Service Principal
    • Session49_Databricks_Create Mount Point to Azure Blob Storage_Data Lake Storage Gen2
    • Session50_Databricks_Read_Write_Azure SQL Database
    • Session51_Create Snowflake Free Trial Account
    • Session52_Read and write data from Snowflake
    • Session53_Read and write data from Synapse Dedicated SQL Pool
    • Session54_Databricks_Introduction to Slowly Changing Dimension(SCD)
    • Session55_Databricks_Implementation of SCD Type 0 Dimension
    • Session56_Databricks_Implementation of SCD Type 1 Dimension
    • Session57_Databricks_Introduction to SCD Type 2 Dimension
    • Session58_Databricks_Implementation of SCD Type 2 Dimension
    • Session59_Databricks_Implementation of SCD Type 3 Dimension
    • Session60_Data Engineering_Medallion Project Architecture

    Azure Databricks_Assignments & Case Studies

    • ADB_Assignment1_Azure Databricks_Types of Clusters
    • ADB_Assignment2_Azure Databricks_Cluster_Pools
    • ADB_Assignment3_Azure Databricks_Compute_On-Demand vs Azure Spot VM Instances
    • ADB_Assignment4_Azure Databricks_Bigdata File formats
    • ADB_Assignment5_Reading Data from Multiple CSV Files With the Same Structure
    • ADB_Assignment1_Reading TSV Files_User Defined Schema
    • ADB_Assignment6_Apache Spark_Transformations_Actions
    • ADB_Assignment7_Create DataFrame Using Python Collection Objects_List_Tuple_Dictionary
    • ADB_Assignment8_Create DataFrame_Define Schema Programmatically Using StructType() & StructField()
    • ADB_Assignment9_Reading Single_Double_PIPE Delimited Files
    • ADB_Assignment10_Reading_Multiple_Different_Delimiter CSV Files
    • ADB_Assignment11_Spark Low Level APIs vs Structured APIs
    • ADB_Assignment12_Creation of Structured API_DataFrame
    • ADB_Assignment13_Creation of DataFrame_Schemas
    • ADB_Assignment14_Python Functions
    • ADB_Assignment15_Python Dictionaries_Functions_Widgets
    • ADB_Assignment16_Flatten Multi Line Complex JSON Files_Python User Defined Function
    • ADB_Assignment17_Flatten Arrays_Maps_explode()_explode_outer() Functions
    • ADB_Assignment18_Batch ETL Processing_Replace Nulls with Literals
    • ADB_Assignment19_Batch ETL Processing_GroupBy_Aggregation Processing
    • ADB_Assignment20_Batch ETL Processing_PySpark_Join Types
    • ADB_Assignment21_Batch ETL Processing_PySpark_Union_UnionAll
    • ADB_Assignment22_Batch ETL Processing_PySpark_Distinct_DropDuplicates Methods
    • ADB_Assignment23_Batch ETL Processing_GroupBy_Aggregation Processing
    • ADB_Assignment24_Create Workflow to Orchestrate Multiple Tasks
    • ADB_Assignment25_Implement Slowly Changing Dimension Type1 and Type3
    • ADB_Assignment26_Batch Processing_Data Processing Techniques_Python List Comprehension
    • ADB_Assignment27_Batch Processing_Sorting on Single Column_sort() method
    • ADB_Assignment28_Batch Processing_Sorting on Multiple Columns
    • ADB_Assignment29_Batch Processing_PySpark_Date Functions
    • ADB_Assignment30_Batch Processing_PySpark_Date Functions
    • ADB_Assignment31_Batch Processing_PySpark_Identify or Check Duplicates in DataFrame
    • ADB_Assignment32_Batch Processing_PySpark_Dropping Rows that Contain Null Values using dropna() & na.drop() Methods
    • ADB_Assignment33_Batch Processing_PySpark_Replacing Nulls with another Value Using fillna() Method_na.fill() Method
    • ADB_Assignment34_Batch Processing_Reading and Writing Data to Snowflake Cloud Data Platform
    • ADB_Assignment35_Delta Lake_Schema Validation_Enforcement
    • ADB_Assignment36_Delta Lake_Schema Evolution
    • ADB_Assignment37_Update_Delete Operations in data lake with Delta Lake
    • ADB_Assignment38_Auditing Data Changes with Operation History

    Train your teams on the theory and enable technical mastery of cloud computing courses essential to the enterprise such as security, compliance, and migration on AWS, Azure, and Google Cloud Platform.