Best Azure Databricks Online Training in Hyderabad

What is Azure Databricks?

- Azure Databricks is a managed Apache Spark analytics service offered by Microsoft as part of the Microsoft Azure cloud platform.
- It provides a collaborative environment for data scientists, engineers, and analysts to work on big data and machine learning projects.

Key Features

- Fully managed Spark environment - Databricks handles cluster provisioning, configuration, and scaling, allowing users to focus on their work.
- Unified Analytics Platform - Supports multiple programming languages (Python, Scala, R, SQL) and integrates with various data sources and tools.
- Collaborative Workspace - Provides a notebook-based interface for interactive analysis, code development, and sharing.
- MLflow Integration - Enables end-to-end machine learning lifecycle management.
- Delta Lake Support - Provides a storage layer that adds reliability and performance to data lakes.

Key Use Cases

- Big Data Analytics - Leverage the power of Spark for large-scale data processing and analysis.

- Machine Learning and AI - Build, train, and deploy machine learning models at scale.

- Data Engineering - Build ETL (Extract, Transform, Load) pipelines and prepare data for downstream use.

- Streaming and Real-time Analytics - Process and analyze real-time data streams.

Pricing and Deployment

- Azure Databricks is a fully managed service, so users don't need to manage the underlying infrastructure.

- Pricing is based on compute, measured in Databricks Units (DBUs), and on the storage used, with pay-as-you-go or pre-purchased (committed) options.

- Azure Databricks can be integrated with other Azure services, such as Azure Storage, Azure SQL Database, and Azure Cosmos DB.
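
To make the DBU-based pricing concrete, here is a minimal arithmetic sketch. Every figure in it (node count, DBUs per node-hour, price per DBU) is a hypothetical placeholder rather than a published rate, and the `ClusterUsage` and `dbu_cost` names are invented for illustration. It covers only the DBU part of the bill.

```python
from dataclasses import dataclass


@dataclass
class ClusterUsage:
    """Hypothetical usage record; none of these figures are real prices."""
    nodes: int                # number of nodes in the cluster
    dbu_per_node_hour: float  # DBUs one node consumes per hour (assumed)
    hours: float              # how long the cluster ran
    price_per_dbu: float      # currency units per DBU (assumed)


def dbu_cost(usage: ClusterUsage) -> float:
    """Return the DBU charge: total DBUs consumed times the price per DBU."""
    total_dbus = usage.nodes * usage.dbu_per_node_hour * usage.hours
    return total_dbus * usage.price_per_dbu


# 4 nodes x 0.75 DBU/hour x 10 hours = 30 DBUs, billed at 0.5 per DBU
example = ClusterUsage(nodes=4, dbu_per_node_hour=0.75, hours=10, price_per_dbu=0.5)
print(dbu_cost(example))  # 15.0
```

Because the cost is linear in nodes, hours, and rate, halving the run time or the cluster size halves the DBU charge.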

Azure Databricks Course Curriculum

Day 1:

What is Big Data Analytics

Data Analytics Platform

Storage
Compute

Data Processing Paradigms

Monolithic Computing
Distributed Computing

Day 2:

Distributed Computing Frameworks

Hadoop MapReduce
Apache Spark

Big Data Analytics: Data Lakes

Tightly Coupled Data Lake
Loosely Coupled Data Lake

Day 3:

Big Data File Formats

Row Storage Format
Columnar Storage Format

Scalability

Scale-Up (Vertical Scalability)
Scale-Out (Horizontal Scalability)

Day 4: Introduction to Azure Databricks

Core Databricks Concepts

Workspace
Notebooks
Library
Folder
Repos
Data
Compute
Workflows

Day 5: Introducing Spark Fundamentals

What is Apache Spark
Why Choose Apache Spark
What are the Spark use cases

Day 6: Spark Architecture

Spark Components

Spark Driver
SparkSession
Cluster manager
Spark Executors

Day 7: Create Databricks Workspace

Workspace Assets

Day 8: Creating Spark Cluster

All-Purpose Cluster

Single Node Cluster
Multi Node Cluster

Day 9: Databricks - Internal Storage

Databricks File System (DBFS)
Uploading Files to DBFS

Day 10: DBUTILS Module

Interaction with DBFS
%fs Magic Command

Day 11: Spark Data APIs

RDD (Resilient Distributed Dataset)
DataFrame
Dataset

Day 12: Create Data Frame

Using Python Collection
Converting RDD to DataFrame

Day 13: Reading CSV data with Apache Spark

Inferred Schema
Explicit Schema
Parsing Modes
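
The three parsing modes Spark's CSV reader accepts are PERMISSIVE, DROPMALFORMED, and FAILFAST. The sketch below imitates their behaviour in plain Python so the difference is visible without a cluster. It is a simplified illustration, not Spark's implementation: `read_csv` is a hypothetical helper, and "malformed" here only means a wrong field count.

```python
import csv
import io

MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def read_csv(text, expected_columns, mode="PERMISSIVE"):
    """Parse CSV text, handling rows whose field count is wrong.

    PERMISSIVE    -> keep the row, pad missing fields with None, drop extras
    DROPMALFORMED -> skip the row
    FAILFAST      -> raise on the first bad row
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) == expected_columns:
            rows.append(fields)
        elif mode == "PERMISSIVE":
            padded = fields[:expected_columns]
            padded += [None] * (expected_columns - len(padded))
            rows.append(padded)
        elif mode == "DROPMALFORMED":
            continue
        else:  # FAILFAST
            raise ValueError(f"malformed row: {fields}")
    return rows


# The second row is missing its third field.
sample = "1,alice,30\n2,bob\n3,carol,41\n"
print(read_csv(sample, 3))                   # bad row kept, padded with None
print(read_csv(sample, 3, "DROPMALFORMED"))  # bad row skipped
```

In PySpark itself the mode is passed as a reader option, for example `.option("mode", "FAILFAST")`.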

Day 14: Reading JSON data with Apache Spark

Single-Line JSON
Multi-Line JSON
Complex JSON
explode() Function
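
As a rough picture of what explode() does to an array column, the following plain-Python sketch turns each element of a list field into its own record. The `explode` helper and the `orders` data are invented for illustration. The real function operates on DataFrame columns, not on lists of dicts.

```python
def explode(records, column):
    """Return one output record per element of record[column].

    Records whose list is empty produce no output rows, which matches
    how Spark's explode() treats empty arrays.
    """
    out = []
    for record in records:
        for item in record[column]:
            row = dict(record)   # copy so the input is not modified
            row[column] = item   # replace the list with a single element
            out.append(row)
    return out


orders = [
    {"order_id": 1, "items": ["pen", "ink"]},
    {"order_id": 2, "items": []},
    {"order_id": 3, "items": ["paper"]},
]
exploded = explode(orders, "items")
for row in exploded:
    print(row)
```

Order 2 disappears from the result because its array is empty. Keeping such rows is what Spark's explode_outer() variant is for.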

Day 15: Reading XML Data with Apache Spark

Install spark-xml Library
User Defined Schema

DDL String Approach
StructType() with StructFields()

Day 16: Reading Excel File With Apache Spark

Single Sheet Reading
Multiple Sheet Reading Using List object

Day 17: Reading Excel File With Apache Spark

Multiple Excel Sheets with Same Structure
Multiple Excel Sheets with Different Structures

Day 18: Reading Parquet Data with Apache Spark

Uploading Parquet Data
Viewing the DataFrame
Viewing the Schema of the DataFrame
Limitations of Parquet Files
Schema Evolution

Day 19: Introduction to Delta Lake

Delta Lake Features
Delta Lake Components

Day 20: Delta Lake Features

DML Operations
Time Travel Operations

Day 21: Delta Lake Features

Schema Validation and Enforcement
Schema Evolution

Day 22: Access Data from Azure Blob Storage

Account Access Key
Windows Azure Storage Blob driver (WASB)
Read Operations
Write Operations

Day 23: Access Data from Azure Data Lake Gen2

Azure Service Principal
Azure Blob Filesystem driver (ABFS)
Read Operations
Write Operations

Day 24: Access Data from Azure Data Lake Gen2

Shared access signatures (SAS)
Azure Blob Filesystem driver (ABFS)
Read Operations
Write Operations

Day 25: Access Data from Azure SQL Database

Configure a connection to SQL server

Day 26: Access Data from Synapse Dedicated SQL Pool

Configure storage account access key
Read data from an Azure Synapse table
Write Data to Azure Synapse table

Day 27: Access Data from Snowflake

Reading Data
Writing Data

Day 28: Create Mount Point to Azure Cloud Storages

Azure Blob Storage
Azure Data Lake Storage

Day 29: Introduction to Spark SQL Module

Hive Metastore
Spark Catalog

Day 30: Spark SQL - Create Global Managed Tables

DataFrame API
SQL API

Day 31: Spark SQL - Create Global Un-Managed Tables

DataFrame API
SQL API

Day 32: Spark SQL - Create Views

Temporary Views
Global Temporary Views
DataFrame API
SQL API
Dropping Views

Day 33: Spark Batch Processing

Reading Batch Data
Writing Batch Data

Day 34: Spark Structured Streaming API

Reading Streaming Data
Writing Streaming Data
Checkpoint Location

Day 35: Spark Structured Streaming API - Output Modes

Append
Complete
Update

Day 36: Spark Structured Streaming API - Triggers

Unspecified Trigger (Default Behavior)
trigger(availableNow=True)
trigger(processingTime="n minutes")

Day 37: Spark Structured Streaming API

Data Processing
Joins
Aggregation

Day 38: Code Modularity of Notebooks

%run Magic Command

Day 39: dbutils.notebook Utility

run()
exit()

Day 40: Widgets - Types of Widgets

text
dropdown
multiselect
combobox

Day 41: Parameterization of Notebooks

History Load
Incremental Load

Day 42: Trigger Notebook from Data Factory Pipeline

Notebook Parameters

Day 43: Databricks Workflow

Orchestration of Tasks

Day 44: Databricks Workflow

Task Parameters
Job Trigger

Day 45: Delta Lake Implementation

SCD Type 0 Dimension

Day 46: Delta Lake Implementation

SCD Type 1 Dimension

Day 47: Delta Lake Implementation

SCD Type2 Dimension
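
A Type 2 slowly changing dimension keeps history: when an attribute changes, the current row is closed and a new current row is added. The sketch below shows that logic in plain Python over lists of dicts. The `apply_scd2` helper, the tracking column names (`start_date`, `end_date`, `is_current`), and the sample customer data are all assumptions made for illustration. On Delta Lake the same idea is usually expressed with a MERGE statement.

```python
from datetime import date

TRACKING = {"start_date", "end_date", "is_current"}


def apply_scd2(dimension, incoming, key, today):
    """Return a new dimension with SCD Type 2 changes applied.

    dimension: rows carrying the key, attributes, and the TRACKING columns
    incoming:  rows carrying only the key and attributes (latest snapshot)
    """
    result = [dict(row) for row in dimension]  # copy; input is left untouched
    current = {row[key]: row for row in result if row["is_current"]}

    for new in incoming:
        old = current.get(new[key])
        if old is not None:
            old_attrs = {k: v for k, v in old.items() if k not in TRACKING}
            if old_attrs == new:
                continue                 # nothing changed, keep the row as is
            old["end_date"] = today      # close the previous version
            old["is_current"] = False
        # new key, or changed attributes: open a fresh current version
        result.append({**new, "start_date": today, "end_date": None, "is_current": True})
    return result


dim = [{"customer_id": 1, "city": "Hyderabad",
        "start_date": date(2023, 1, 1), "end_date": None, "is_current": True}]
incoming = [{"customer_id": 1, "city": "Pune"}, {"customer_id": 2, "city": "Delhi"}]
updated = apply_scd2(dim, incoming, "customer_id", date(2024, 6, 1))
for row in updated:
    print(row)
```

Customer 1 ends up with two rows (the closed Hyderabad version and the current Pune version), while customer 2 is inserted as a new current row.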

Day 48: Delta Lake Implementation

SCD Type 3 Dimension

Day 49: PySpark Performance Optimization

cache()
persist()

Day 50: PySpark Performance Optimization

repartition()
coalesce()

Day 51: PySpark Performance Optimization

Column Predicate Pushdown
partitionBy()

Day 52: PySpark Performance Optimization

bucketBy()

Day 53: PySpark Performance Optimization

Broadcast Join

Day 54: Delta Lake - Performance Optimization

OPTIMIZE
ZORDER

Day 55: Delta Lake - Performance Optimization

Delta Cache

Day 56: Delta Lake - Performance Optimization

Liquid Clustering

Day 57: Delta Lake - Performance Optimization

Partitioning
Liquid Clustering

Day 58: Databricks Unity Catalog

Metastore
Catalog
Schema
Tables
Volumes
Views

Day 59: Databricks Unity Catalog

Managed Tables
External Tables

Day 60: Databricks Unity Catalog

Managed Volumes
External Volumes

Day 61: Databricks - Auto Loader

Auto Loader file detection modes

Directory Listing mode
File Notification mode

Schema Evolution with Auto Loader

Day 62: Delta Live Tables

Simple Declarative SQL & Python APIs
Automated Pipeline Creation
Data Quality Checks

Databricks with PySpark Assessments (@Home)

ADB_Assessment1_ADB_PySpark_Types of Operations_Transformations_Actions
ADB_Assessment2_PySpark_Transformations_select()_selectExpr()
ADB_Assessment3_PySpark_Transformations_Data Cleansing_filter()_where()
ADB_Assessment4_PySpark_Transformation_Identifying Duplicates_Remove Duplicates
ADB_Assessment5_PySpark_Transformation_Sorting Data_sort()_orderBy()
ADB_Assessment6_PySpark_Transformation_Single_Multi_Aggregation
ADB_Assessment7_PySpark_DataFrame_Renaming Columns using List Comprehension
ADB_Assessment8_PySpark_Introduction to Join_Types of Joins
ADB_Assessment9_PySpark_Implementation of Joins_Types of Joins
ADB_Assessment10_PySpark_Drop Rows that Contains Nulls_dropna()_na.drop()
ADB_Assessment11_PySpark_Fill Rows that Contains Nulls_fillna()_na.fill()
ADB_Assessment12_PySpark_Window Functions_rank()_dense_rank()_row_number()
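
The three window functions named in the last assessment differ only in how they treat ties. The plain-Python sketch below computes all three over a descending sort so the difference can be seen side by side. The `rank_rows` helper and the `scores` list are hypothetical, and the code imitates the SQL semantics without using the PySpark API.

```python
def rank_rows(values):
    """Return (value, row_number, rank, dense_rank) tuples for the values
    sorted in descending order, mirroring the three window functions."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    previous = object()  # sentinel that equals no real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position  # rank jumps to the row position after a tie
            dense += 1       # dense_rank only ever steps by one
            previous = value
        out.append((value, position, rank, dense))
    return out


scores = [90, 80, 90, 70]
for row in rank_rows(scores):
    print(row)
# (90, 1, 1, 1)
# (90, 2, 1, 1)
# (80, 3, 3, 2)
# (70, 4, 4, 3)
```

row_number() is always unique, rank() leaves a gap after the tied 90s (1, 1, 3), and dense_rank() does not (1, 1, 2).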
