• Welcome to CloudMonks
  • +91 96660 64406
  • info@thecloudmonks.com

Azure Data Engineering - Azure Databricks


An Azure Data Engineering - Databricks role covers the entire data lifecycle within the Azure ecosystem, from data ingestion to analysis and reporting. These engineers design, implement, and maintain data pipelines, data warehouses, and data lake solutions using a variety of Azure services. They also handle data transformation, security, and performance optimization.


Responsibilities:

Data Ingestion and Extraction:

Bringing data from various sources (structured, unstructured, real-time) into Azure.

Data Transformation and Cleaning:

Ensuring data quality and consistency through cleaning, transformation, and integration processes.

Data Storage:

Designing and implementing data storage solutions, including Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database.

Data Warehousing:

Building and maintaining data warehouses using Azure Synapse Analytics.

Data Pipeline Development:

Creating and managing automated data pipelines for efficient data movement and processing using Azure Data Factory or Azure Databricks.

Data Security and Compliance:

Implementing security measures (encryption, access control) and ensuring compliance with data privacy laws.

Performance Monitoring and Optimization:

Identifying and resolving performance bottlenecks in data systems.

Collaboration:

Working with data scientists, analysts, and business stakeholders to understand their needs and implement appropriate data solutions.



Azure Data Engineering - Azure Databricks



    Week 1:

  • Introduction to Azure Databricks
    • Core Databricks Concepts
    • Workspace
    • Notebooks
    • Library
    • Folder
    • Repos
    • Data
    • Compute
    • Workflows
  • Introducing Spark Fundamentals
    • What is Apache Spark
    • Why Choose Apache Spark
    • What are the Spark use cases
  • Spark Architecture
    • Spark Components
    • Spark Driver
    • SparkSession
    • Cluster manager
    • Spark Executors
  • Create Databricks Workspace
    • Workspace Assets
  • Creating Spark Cluster
    • All-Purpose Cluster
    • Single Node Cluster
    • Multi Node Cluster
  • Databricks - Internal Storage
    • Databricks File System (DBFS)
    • Uploading Files to DBFS
  • DBUTILS Module
    • Interaction with DBFS
    • %fs Magic Command
  • Spark Data APIs
    • RDD (Resilient Distributed Dataset)
    • DataFrame
    • Dataset
  • Create DataFrame
    • Using Python Collection
    • Converting RDD to DataFrame

    Week 2:

  • Reading CSV data with Apache Spark
    • Inferred Schema
    • Explicit Schema
    • Parsing Modes
  • Reading JSON data with Apache Spark
    • Single-Line JSON
    • Multi-Line JSON
    • Complex JSON
    • explode() Function
  • Reading XML Data with Apache Spark
    • Install spark-xml Library
    • User Defined Schema
    • DDL String Approach
    • StructType() with StructField()
  • Reading Excel File With Apache Spark
    • Single Sheet Reading
    • Multiple Sheet Reading Using a List Object
  • Reading Excel File With Apache Spark
    • Multiple Excel Sheets with Same Structure
    • Multiple Excel Sheets with Different Structures
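The "Parsing Modes" topic under CSV reading refers to how the reader treats malformed rows: Spark's CSV reader offers PERMISSIVE, DROPMALFORMED, and FAILFAST. The block below is a plain-Python sketch of the idea, not Spark itself. The two-column schema, the sample data, and the `parse_rows` helper are assumptions made for illustration, and the behavior is simplified compared with the real reader.

```python
import csv
import io

# Hypothetical schema for illustration: column name and the type to cast to.
SCHEMA = [("id", int), ("name", str)]


def parse_rows(text, mode="PERMISSIVE"):
    """Imitate the spirit of Spark's CSV parse modes on a small in-memory file."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        row = {}
        malformed = len(raw) != len(SCHEMA)
        for (name, cast), value in zip(SCHEMA, raw):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # a field that cannot be cast becomes null
                malformed = True
        for name, _ in SCHEMA:
            row.setdefault(name, None)  # fields missing from a short row are null
        if malformed:
            if mode == "FAILFAST":
                raise ValueError(f"malformed record: {raw}")
            if mode == "DROPMALFORMED":
                continue  # silently skip the bad row
            # PERMISSIVE: keep the row and record the original text
            row["_corrupt_record"] = ",".join(raw)
        rows.append(row)
    return rows


SAMPLE = "1,Asha\nabc,Ravi\n3,Meera\n"
permissive = parse_rows(SAMPLE, "PERMISSIVE")
dropped = parse_rows(SAMPLE, "DROPMALFORMED")
```

In PySpark the mode is normally selected with the reader's `mode` option, for example `spark.read.option("mode", "FAILFAST")`, and works best alongside an explicit schema.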

    Week 3:

  • Introduction to Delta Lake
    • Delta Lake Features
    • Delta Lake Components
  • Delta Lake Features
    • DML Operations
    • Time Travel Operations
  • Delta Lake Features
    • Schema Validation and Enforcement
    • Schema Evolution
  • Introduction to Spark SQL Module
    • Hive Metastore
    • Spark Catalog
  • Spark SQL - Create Global Managed Tables
    • DataFrame API
    • SQL API
  • Spark SQL - Create Global Unmanaged Tables
    • DataFrame API
    • SQL API
  • Spark SQL - Create Views
    • Temporary Views
    • Global Temporary Views
    • DataFrame API
    • SQL API
    • Dropping Views

    Week 4:

  • Access Data from Azure Blob Storage
    • Account Access Key
    • Windows Azure Storage Blob driver (WASB)
    • Read Operations
    • Write Operation
  • Access Data from Azure Data Lake Gen2
    • Azure Service Principal
    • Azure Blob Filesystem driver (ABFS)
    • Read Operations
    • Write Operation
  • Access Data from Azure Data Lake Gen2
    • Shared Access Signatures (SAS)
    • Azure Blob Filesystem driver (ABFS)
    • Read Operations
    • Write Operation
  • Access Data from Azure SQL Database
    • Configure a connection to SQL server
  • Access Data from Synapse Dedicated SQL Pool
    • Configure storage account access key
    • Read data from an Azure Synapse table
    • Write Data to Azure Synapse table
  • Access Data from Snowflake
    • Reading Data
    • Writing Data
  • Create Mount Points to Azure Cloud Storage
    • Azure Blob Storage
    • Azure Data Lake Storage
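The WASB and ABFS drivers listed this week address storage through URIs of a fixed shape, and access-key authentication is set through a Spark configuration key named after the storage account. The helpers below sketch those string formats in plain Python. The function names and the `examplestorage` / `raw` values are hypothetical, and no connection is made to Azure.

```python
def wasbs_uri(container, account, path=""):
    """Build a secure WASB URI for Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def abfss_uri(container, account, path=""):
    """Build a secure ABFS URI for Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def account_key_conf(account):
    """Spark configuration key used for access-key authentication over ABFS."""
    return f"fs.azure.account.key.{account}.dfs.core.windows.net"


# Hypothetical account and container names, for illustration only.
blob_path = wasbs_uri("raw", "examplestorage", "sales/2024.csv")
lake_path = abfss_uri("raw", "examplestorage", "/sales/2024.csv")
conf_key = account_key_conf("examplestorage")
```

In a Databricks notebook the key would typically be set with `spark.conf.set(conf_key, <secret value>)` before passing `lake_path` to `spark.read`. The secret itself is best read from a secret scope rather than written into the notebook.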

    Week 5:

  • Spark Batch Processing
    • Reading Batch Data
    • Writing Batch Data
  • Spark Structured Streaming API
    • Reading Streaming Data
    • Writing Streaming Data
    • Checkpoint Location
  • Code Modularity of Notebooks
    • %run Magic Command
  • dbutils.notebook Utility
    • run()
    • exit()
  • Widgets - Types of Widgets
    • text
    • dropdown
    • multiselect
    • combobox
  • Parameterization of Notebooks
    • History Load
    • Incremental Load
  • Trigger Notebook from Data Factory Pipeline
    • Notebook Parameters
  • Databricks Workflow
    • Orchestration of Tasks
  • Databricks Workflow
    • Task Parameters
    • Job Trigger
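One common way to cover both the History Load and the Incremental Load with a single parameterized notebook is a watermark: the notebook receives the last processed timestamp as a parameter and selects only newer rows. The plain-Python sketch below assumes that pattern. The `modified_at` column, the `select_increment` helper, and the sample rows are hypothetical, and the course may teach a different approach.

```python
from datetime import datetime


def select_increment(rows, last_watermark):
    """Return the rows changed after the watermark and the new watermark."""
    new_rows = [r for r in rows if r["modified_at"] > last_watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r["modified_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark


# Hypothetical source rows, for illustration only.
rows = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 2, 1)},
    {"id": 3, "modified_at": datetime(2024, 3, 1)},
]

# History load: the earliest possible watermark selects everything.
history, wm = select_increment(rows, datetime.min)

# Incremental load: only rows newer than the stored watermark are selected.
increment, wm2 = select_increment(rows, datetime(2024, 2, 1))
```

In a Databricks notebook the watermark would usually arrive through a widget, read with `dbutils.widgets.get`, and be supplied by Data Factory or a Workflow task parameter.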

    Week 6:

  • Delta Lake Implementation
    • SCD Type 0 Dimension
  • Delta Lake Implementation
    • SCD Type 1 Dimension
  • Delta Lake Implementation
    • SCD Type 2 Dimension
  • Delta Lake Implementation
    • SCD Type 3 Dimension
  • Databricks - Auto Loader
    • Auto Loader file detection modes
    • Directory Listing mode
    • File Notification mode
    • Schema Evolution with Auto Loader
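The SCD types listed this week differ in how much history a dimension keeps. Type 0 never changes a loaded row, Type 1 overwrites in place, Type 2 adds a new row version and expires the old one, and Type 3 keeps the previous value in an extra column. The plain-Python sketch below contrasts Type 1 and Type 2 on lists of dictionaries. The column names (`customer_id`, `city`, `start_date`, `end_date`, `is_current`) and both helper functions are assumptions for illustration. On Delta Lake the same logic is typically written with a MERGE statement.

```python
from datetime import date


def apply_scd1(dim, incoming, key="customer_id"):
    """Type 1: overwrite the existing row, so no history is kept."""
    by_key = {row[key]: dict(row) for row in dim}
    for rec in incoming:
        by_key[rec[key]] = dict(rec)
    return list(by_key.values())


def apply_scd2(dim, incoming, today, key="customer_id", tracked=("city",)):
    """Type 2: expire the current row and append a new version on change."""
    result = [dict(row) for row in dim]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for rec in incoming:
        old = current.get(rec[key])
        if old is not None and all(old[c] == rec[c] for c in tracked):
            continue  # no tracked attribute changed
        if old is not None:
            old["is_current"] = False
            old["end_date"] = today
        result.append({**rec, "start_date": today, "end_date": None, "is_current": True})
    return result


# Hypothetical dimension and incoming batch, for illustration only.
dim = [{"customer_id": 1, "city": "Pune", "start_date": date(2023, 1, 1),
        "end_date": None, "is_current": True}]
incoming = [{"customer_id": 1, "city": "Delhi"}, {"customer_id": 2, "city": "Goa"}]

scd2 = apply_scd2(dim, incoming, date(2024, 6, 1))
scd1 = apply_scd1([{"customer_id": 1, "city": "Pune"}], incoming)
```

After the Type 2 run, customer 1 has two rows (the expired Pune version and the current Delhi version), while the Type 1 result holds only the latest city per customer.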

    Week 7:

  • Databricks Unity Catalog
    • Metastore
    • Catalog
    • Schema
    • Tables
    • Volumes
    • Views
  • Databricks Unity Catalog
    • Managed Tables
    • External Tables
  • Databricks Unity Catalog
    • Managed Volumes
    • External Volumes
  • Delta Live Tables
    • Simple Declarative SQL & Python APIs
    • Automated Pipeline Creation
    • Data Quality Checks


    Azure Data Engineering - Azure Databricks Assessments

  • PySpark Transformations
    • Identify Duplicate Records
    • Eliminate Duplicate Records
    • Dropping Rows with Nulls
  • PySpark Transformations
    • Join and Types of Joins
    • Filling Nulls with Values Using fillna()
  • PySpark Transformations
    • Join and Types of Joins
  • PySpark Transformations
    • Types of Joins - Joins Pocket Guide
  • PySpark Transformations
    • Merging DataFrames Using union() and unionByName()
  • PySpark Transformations
    • Calculating Business Aggregates
    • Single and Multi Aggregations
  • PySpark Transformations
    • Window Functions
    • row_number()
    • rank()
    • dense_rank()
  • PySpark Transformations
    • Window Functions
    • sum()
    • rank()
    • lag()
  • PySpark Transformations
    • Pivoting Data
    • Unpivoting Data
  • Delta Lake
    • Vacuum Command
  • Spark Structured Streaming API - outputModes
    • Append
    • Complete
    • Update
  • Spark Structured Streaming API - Triggers
    • Unspecified Trigger (Default Behavior)
    • trigger(availableNow = True)
    • trigger(processingTime = "n minutes")
  • Spark Structured Streaming API
    • Data Processing
    • Joins
    • Aggregation
  • Databricks - COPY INTO SQL Command
    • Incremental Data Ingestion
  • Databricks - Auto Loader
    • Schema Inference
    • Schema Hints
    • Schema Location
  • Databricks - Auto Loader
    • Schema Evolution Modes
  • dbutils.notebook Utility
    • run()
    • exit()
  • PySpark Performance Optimization
    • cache()
    • persist()
  • PySpark Performance Optimization
    • repartition()
    • coalesce()
  • PySpark Performance Optimization
    • Column Predicate Pushdown
    • partitionBy()
  • PySpark Performance Optimization
    • bucketBy()
  • PySpark Performance Optimization
    • Broadcast Join
  • Delta Lake - Performance Optimization
    • OPTIMIZE
    • ZORDER
  • Delta Lake - Performance Optimization
    • Delta Cache
  • Delta Lake - Performance Optimization
    • Liquid Clustering
  • Delta Lake - Performance Optimization
    • Partitioning
    • Liquid Clustering
  • Unity Catalog
    • Create Catalog
    • Schema
    • Tables Using UI and SQL
  • Unity Catalog Metastore Storage Account Container
    • Read CSV Files
  • External Data Lake Storage Account
    • Storage Credentials
    • External Locations
    • Read CSV Files
  • Unity Catalog - Managed Tables
    • Managed Tables
    • Managed Storage Locations
  • Unity Catalog - Create External Tables
    • External Tables
  • Unity Catalog - Volumes
    • Create Managed Volume using Catalog Explorer UI
    • Create Managed Volume using SQL
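Among the assessment topics above, the ranking window functions row_number(), rank(), and dense_rank() are easy to confuse because they differ only in how they treat ties. The plain-Python sketch below reproduces the three numbering schemes over one descending ordering, without Spark. The `ranking` helper and the sample scores are hypothetical.

```python
def ranking(values):
    """Return (value, row_number, rank, dense_rank) tuples, ordered descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = 0
    dense = 0
    prev = object()  # sentinel that never equals a real value
    for position, value in enumerate(ordered, start=1):
        if value != prev:
            rank = position  # rank jumps to the row position, leaving gaps after ties
            dense += 1       # dense_rank advances by one, leaving no gaps
            prev = value
        out.append((value, position, rank, dense))
    return out


# Hypothetical scores with one tie, for illustration only.
scores = [90, 85, 90, 70]
table = ranking(scores)
```

For these scores the two tied 90s get row numbers 1 and 2 but share rank 1 and dense rank 1. The next value, 85, then gets rank 3 and dense rank 2. In PySpark the same numbering comes from applying the functions over a `Window` ordered by the score column.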

