Apache Spark Programming with Databricks

  • Describe the architecture and core components of Apache Spark
  • Implement data transformations using the DataFrame API
  • Optimise Spark queries for performance improvements
  • Apply partitioning strategies to manage large datasets efficiently
  • Use Structured Streaming to process real-time data
  • Implement Delta Lake to enhance data reliability and performance

Overview

Off the shelf (OTS)

This course provides an in-depth exploration of Apache Spark and Delta Lake on Databricks, focusing on the core architectural components of Spark, the DataFrame API, and Structured Streaming. Participants will learn how to read, transform, and aggregate data efficiently using Spark SQL and the DataFrame API. The course also covers user-defined functions (UDFs), query optimisation, partitioning strategies, and the advantages of Delta Lake for improving data pipelines. By the end of the course, learners will be able to execute streaming queries and understand how Delta Lake enhances real-time data processing.

Participants should have:
• Familiarity with Python and fundamental programming concepts, including data types, lists, dictionaries, variables, functions, loops, conditional statements, exception handling, accessing classes, and using third-party libraries.
• Basic knowledge of SQL, including writing queries using SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, and JOIN.

This course is designed for:
• Data engineers and data scientists looking to enhance their Spark programming skills.
• Developers who want to leverage Apache Spark and Delta Lake on Databricks.
• Professionals working with large-scale data processing and real-time analytics.

This course includes:
• Practical exercises using Apache Spark on Databricks.
• Hands-on labs to implement and optimise Spark queries.
• Guided projects focusing on real-time data processing with Structured Streaming and Delta Lake.

This course is not specifically aligned with an exam.

Delivery method
Face to face
Virtual

Course duration
14 hours

Competency level
Working
