Best Azure Data Engineering - Databricks in Hyderabad

An Azure Data Engineering - Databricks role covers the entire data lifecycle within the Azure ecosystem, from data ingestion through analysis and reporting. Engineers in this role design, implement, and maintain data pipelines, data warehouses, and data lake solutions using a range of Azure services. They also handle data transformation, security, and performance optimization.

Responsibilities:

Data Ingestion and Extraction:

Bringing data from various sources (structured, unstructured, real-time) into Azure.

Data Transformation and Cleaning:

Ensuring data quality and consistency through cleaning, transformation, and integration processes.

Data Storage:

Designing and implementing data storage solutions, including Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database.

Data Warehousing:

Building and maintaining data warehouses using Azure Synapse Analytics.

Data Pipeline Development:

Creating and managing automated data pipelines for efficient data movement and processing using Azure Data Factory or Azure Databricks.

Data Security and Compliance:

Implementing security measures (encryption, access control) and ensuring compliance with data privacy laws.

Performance Monitoring and Optimization:

Identifying and resolving performance bottlenecks in data systems.

Collaboration:

Working with data scientists, analysts, and business stakeholders to understand their needs and implement appropriate data solutions.

Azure Data Engineering - Azure Databricks

Week 1:

Introduction to Azure Databricks

Core Databricks Concepts
Workspace
Notebooks
Library
Folder
Repos
Data
Compute
Workflows

Introducing Spark Fundamentals

What is Apache Spark
Why Choose Apache Spark
What are the Spark use cases

Spark Architecture

Spark Components
Spark Driver
SparkSession
Cluster manager
Spark Executors

Create Databricks Workspace

Workspace Assets

Creating Spark Cluster

All-Purpose Cluster
Single Node Cluster
Multi Node Cluster

Databricks - Internal Storage

Databricks File System (DBFS)
Uploading Files to DBFS

DBUTILS Module

Interaction with DBFS
%fs Magic Command

Spark Data APIs

RDD (Resilient Distributed Dataset)
DataFrame
Dataset

Create DataFrame

Using Python Collection
Converting RDD to DataFrame

Week 2:

Reading CSV data with Apache Spark

Inferred Schema
Explicit Schema
Parsing Modes
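
Spark's CSV reader accepts a mode option with the values PERMISSIVE (the default), DROPMALFORMED, and FAILFAST. The sketch below imitates the three behaviours in plain Python, without Spark, purely to illustrate how they differ. The function parse_rows and the sample data are hypothetical, and the real reader's handling of malformed records is more detailed than this.

```python
import csv
import io

def parse_rows(text, mode="PERMISSIVE"):
    """Parse 'name,age' rows, treating a non-integer age as malformed."""
    rows = []
    for name, age in csv.reader(io.StringIO(text)):
        try:
            rows.append((name, int(age)))
        except ValueError:
            if mode == "PERMISSIVE":
                rows.append((name, None))  # keep the row, null the bad field
            elif mode == "DROPMALFORMED":
                continue                   # skip the bad row
            elif mode == "FAILFAST":
                raise                      # abort on the first bad row
            else:
                raise ValueError(f"unknown mode: {mode}")
    return rows

# Hypothetical sample: the second row has a non-numeric age.
sample = "alice,30\nbob,abc\ncarol,41\n"
permissive = parse_rows(sample)
dropped = parse_rows(sample, "DROPMALFORMED")
```

In this sketch PERMISSIVE keeps all three rows with a None age for the bad one, DROPMALFORMED returns two rows, and FAILFAST raises on the second row.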

Reading JSON data with Apache Spark

Single-line JSON
Multiline JSON
Complex JSON
explode() Function
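
As a rough illustration of what explode() does to an array column, the plain-Python sketch below emits one output row per array element. It is not the Spark implementation; the helper name, the sample orders, and the assumption that the array is never null are all illustrative.

```python
def explode(rows, column):
    """Yield one output row per element of row[column] (assumed non-null)."""
    for row in rows:
        for item in row[column]:
            out = dict(row)      # copy the other fields unchanged
            out[column] = item   # replace the array with a single element
            yield out

# Hypothetical nested records, as might come from a JSON file.
orders = [
    {"order_id": 1, "items": ["pen", "ink"]},
    {"order_id": 2, "items": []},
]
flat = list(explode(orders, "items"))
```

Order 2 has an empty array and so produces no output rows, which mirrors how explode() in Spark drops rows whose array is empty.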

Reading XML Data with Apache Spark

Install spark-xml Library
User Defined Schema
DDL String Approach
StructType() with StructField()

Reading Excel File With Apache Spark

Single Sheet Reading
Multiple Sheet Reading Using List object

Reading Excel File With Apache Spark

Multiple Excel Sheets with Same Structure
Multiple Excel Sheets with Different Structures

Week 3:

Introduction to Delta Lake

Delta Lake Features
Delta Lake Components

Delta Lake Features

DML Operations
Time Travel Operations

Delta Lake Features

Schema Validation and Enforcement
Schema Evolution

Introduction to Spark SQL Module

Hive Metastore
Spark Catalog

Spark SQL - Create Global Managed Tables

DataFrame API
SQL API

Spark SQL - Create Global Un-Managed Tables

DataFrame API
SQL API

Spark SQL - Create Views

Temporary Views
Global Temporary Views
DataFrame API
SQL API
Dropping Views

Week 4:

Access Data from Azure Blob Storage

Account Access Key
Windows Azure Storage Blob driver (WASB)
Read Operations
Write Operation

Access Data from Azure Data Lake Gen2

Azure Service Principal
Azure Blob Filesystem driver (ABFS)
Read Operations
Write Operation

Access Data from Azure Data Lake Gen2

Shared access signatures (SAS)
Azure Blob Filesystem driver (ABFS)
Read Operations
Write Operation

Access Data from Azure SQL Database

Configure a connection to SQL server

Access Data from Synapse Dedicated SQL Pool

Configure storage account access key
Read data from an Azure Synapse table
Write Data to Azure Synapse table

Access Data from Snowflake

Reading Data
Writing Data

Create Mount Point to Azure Cloud Storages

Azure Blob Storage
Azure Data Lake Storage

Week 5:

Spark Batch Processing

Reading Batch Data
Writing Batch Data

Spark Structured Streaming API

Reading Streaming Data
Write Streaming Data
Checkpoint Location

Code Modularity of Notebooks

%run Magic Command

dbutils.notebook Utility

run()
exit()

Widgets - Types of Widgets

text
dropdown
multiselect
combobox

Parameterization of Notebooks

History Load
Incremental Load
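
One common way to parameterize a notebook for history versus incremental loads is to pass a load type and a watermark, then filter the source accordingly. The plain-Python sketch below shows only that selection logic. The names select_rows, load_type, and updated_at are hypothetical; in a Databricks notebook the parameter values would typically come from widgets rather than function arguments.

```python
def select_rows(source, load_type, last_watermark=None):
    """Pick rows for a history (full) load or an incremental load."""
    if load_type == "history":
        return list(source)  # reload everything
    if load_type == "incremental":
        # Only rows changed after the previous run's watermark.
        return [r for r in source if r["updated_at"] > last_watermark]
    raise ValueError(f"unknown load_type: {load_type}")

# Hypothetical source rows; ISO date strings compare correctly as text.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-03-01"},
]
full = select_rows(source, "history")
delta = select_rows(source, "incremental", last_watermark="2024-02-01")
```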

Trigger Notebook from Data Factory Pipeline

Notebook Parameters

Databricks Workflow

Orchestration of Tasks

Databricks Workflow

Task Parameters
Job Trigger

Week 6:

Delta Lake Implementation

SCD Type 0 Dimension

Delta Lake Implementation

SCD Type 1 Dimension

Delta Lake Implementation

SCD Type 2 Dimension

Delta Lake Implementation

SCD Type 3 Dimension
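
To make the difference between two of the slowly changing dimension types concrete, here is a minimal plain-Python sketch of Type 1 (overwrite) and Type 2 (versioned history). It is an illustration only. On Delta Lake this logic is usually expressed as a MERGE, and the column names key, start_date, end_date, and is_current are assumptions made for the example.

```python
from datetime import date

def apply_scd1(dim, key, new_attrs):
    """Type 1: overwrite attributes in place, keeping no history."""
    for row in dim:
        if row["key"] == key:
            row.update(new_attrs)
            return
    dim.append({"key": key, **new_attrs})

def apply_scd2(dim, key, new_attrs, today):
    """Type 2: close the current row and append a new version."""
    for row in dim:
        if row["key"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return                    # nothing changed, keep current row
            row["is_current"] = False     # expire the old version
            row["end_date"] = today
    dim.append({"key": key, **new_attrs,
                "start_date": today, "end_date": None, "is_current": True})

# Hypothetical customer who moves city.
scd1_dim, scd2_dim = [], []
apply_scd1(scd1_dim, 1, {"city": "Hyderabad"})
apply_scd1(scd1_dim, 1, {"city": "Pune"})
apply_scd2(scd2_dim, 1, {"city": "Hyderabad"}, date(2024, 1, 1))
apply_scd2(scd2_dim, 1, {"city": "Pune"}, date(2024, 6, 1))
```

After these calls the Type 1 dimension holds a single row with the latest city, while the Type 2 dimension holds two rows: the expired Hyderabad version and the current Pune version.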

Databricks - Auto Loader

Auto Loader file detection modes
Directory Listing mode
File Notification mode
Schema Evolution with Auto Loader

Week 7:

Databricks Unity Catalog

Metastore
Catalog
Schema
Tables
Volumes
Views

Databricks Unity Catalog

Managed Tables
External Tables

Databricks Unity Catalog

Managed Volumes
External Volumes

Delta Live Tables

Simple Declarative SQL & Python APIs
Automated Pipeline Creation
Data Quality Checks

Azure Data Engineering - Azure Databricks Assessments

PySpark - Transformation

Identify Duplicate Records
Eliminate Duplicate Records
Dropping Rows with Nulls

PySpark - Transformation

Join and Types of Joins
Filling Nulls with Values Using fillna()

PySpark - Transformation

Join and Types of Joins

PySpark - Transformation

Types of Joins - Joins Pocket Guide

PySpark - Transformation

Merging DataFrames Using union() and unionByName()
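
The practical difference between union() and unionByName() is that the first matches columns by position and the second by name. The plain-Python sketch below mimics that behaviour on lists of tuples. It is not Spark code, and the helper names and sample rows are hypothetical.

```python
def union(cols_a, rows_a, cols_b, rows_b):
    """Positional union: the second input's column names are ignored."""
    if len(cols_a) != len(cols_b):
        raise ValueError("column counts differ")
    return cols_a, rows_a + rows_b

def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Name-based union: reorder the second input to match the first."""
    if set(cols_a) != set(cols_b):
        raise ValueError("column names differ")
    order = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [tuple(r[i] for i in order) for r in rows_b]

# Same columns, different order.
a_cols, a_rows = ["id", "name"], [(1, "asha")]
b_cols, b_rows = ["name", "id"], [("ravi", 2)]

_, by_position = union(a_cols, a_rows, b_cols, b_rows)
_, by_name = union_by_name(a_cols, a_rows, b_cols, b_rows)
```

Here the positional result puts "ravi" under id, while the name-based result aligns the second row as (2, "ravi").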

PySpark - Transformation

Calculating Business Aggregates
Single and Multi Aggregations

PySpark - Transformation

Window Functions
row_number()
rank()
dense_rank()
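
The three ranking functions differ only in how they treat ties. The plain-Python sketch below computes all three over a descending ordering so the results can be compared side by side. It illustrates the semantics and is not Spark code; rank_rows and the sample scores are hypothetical.

```python
def rank_rows(values):
    """Return (value, row_number, rank, dense_rank) in descending order."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that equals no real value
    for row_number, v in enumerate(ordered, start=1):
        if v != prev:
            rank = row_number  # rank skips ahead after ties
            dense += 1         # dense_rank never leaves gaps
            prev = v
        out.append((v, row_number, rank, dense))
    return out

scores = [90, 85, 90, 70]
ranked = rank_rows(scores)
```

With the tied 90s, the row numbers are 1 to 4, rank gives 1, 1, 3, 4, and the dense rank gives 1, 1, 2, 3.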

PySpark - Transformation

Window Functions
sum()
rank()
lag()

PySpark - Transformation

Pivoting Data
Unpivoting Data
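
Pivoting turns the distinct values of one column into separate columns, and unpivoting reverses that. The sketch below shows both on plain Python dictionaries with a sum aggregate. It is meant only to convey the reshaping, not Spark's API, and the helper names and sales rows are hypothetical.

```python
def pivot(rows, index, columns, values):
    """Long to wide: sum `values` for each (index, columns) pair."""
    wide = {}
    for r in rows:
        cell = wide.setdefault(r[index], {})
        cell[r[columns]] = cell.get(r[columns], 0) + r[values]
    return wide

def unpivot(wide, index, columns, values):
    """Wide to long: one row per (index, column) cell."""
    return [{index: k, columns: c, values: v}
            for k, cells in wide.items()
            for c, v in cells.items()]

# Hypothetical sales rows in long format.
sales = [
    {"region": "south", "quarter": "Q1", "amount": 10},
    {"region": "south", "quarter": "Q2", "amount": 5},
    {"region": "south", "quarter": "Q1", "amount": 2},
]
wide = pivot(sales, "region", "quarter", "amount")
long_again = unpivot(wide, "region", "quarter", "amount")
```

Note that unpivoting the result returns the aggregated rows, not the original three, because the pivot step already summed the two Q1 entries.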

Delta Lake

Vacuum Command

Spark Structured Streaming API - outputModes

Append
Complete
Update

Spark Structured Streaming API - Triggers

Unspecified Trigger (Default Behavior)
trigger(availableNow = True)
trigger(processingTime = "n minutes")

Spark Structured Streaming API

Data Processing
Joins
Aggregation

Databricks - COPY INTO SQL Command

Incremental Data Ingestion

Databricks - Auto Loader

Schema Inference
Schema Hints
Schema Location

Databricks - Auto Loader

Schema Evolution Modes

dbutils.notebook Utility

run()
exit()