Data Engineering – Complete Course Content (Beginner to Advanced)

Introduction to Data Engineering

  • Role of Data Engineer
  • Data Engineering vs Data Science vs Data Analytics
  • Modern data ecosystem overview
  • ETL vs ELT
  • Batch vs Streaming data
  • Data lake, Data warehouse, Data lakehouse concepts

Linux & Shell Basics (For Data Engineers)

  • Basic Linux commands
  • File system navigation
  • File manipulation
  • Shell scripting basics
  • Automating simple workflows

Python for Data Engineering

Core Python

  • Variables, data types, loops, functions
  • File handling
  • Exception handling

Python for Data Pipelines

  • Working with CSV, JSON, Excel
  • Reading/writing large datasets
  • Pandas for data processing
  • Boto3 / APIs (basic introduction for pipelines)

SQL for Data Engineering

  • SQL fundamentals
  • Joins, aggregations, subqueries
  • CTEs
  • Window functions
  • Query optimization basics
  • Writing analytical SQL queries
  • Stored procedures & functions (basics)

Data Modeling

  • OLTP vs OLAP
  • Star schema & Snowflake schema
  • Fact & dimension tables
  • Normalization & denormalization
  • Slowly Changing Dimensions (SCD Types 1 and 2)

ETL / ELT & Pipeline Concepts

  • What is a pipeline?
  • Data ingestion techniques
  • Data transformation layers
  • Batch pipelines
  • Incremental loads & change data capture (CDC)
  • Error handling & logging
  • Orchestration concepts

Big Data & Distributed Systems

  • Hadoop ecosystem overview
  • HDFS concepts
  • MapReduce basics
  • YARN & distributed computation
  • Why Spark largely replaced Hadoop MapReduce

Apache Spark (Core Skill for Data Engineers)

Spark Core

  • RDD concepts
  • Transformations & actions
  • Lazy evaluation
  • Partitioning

Spark SQL

  • DataFrames
  • Dataset API
  • SQL queries
  • Catalyst optimizer
  • UDFs

Spark for ETL

  • Reading/writing CSV, JSON, Parquet
  • Working with large datasets
  • Writing pipelines in PySpark

Cloud Data Engineering

Azure / AWS / GCP Overview

  • IAM concepts
  • Storage services
  • Compute services
  • Serverless basics

Cloud Storage

  • AWS S3 / Azure Blob Storage / Google Cloud Storage

Compute & Processing

  • AWS EMR / Azure Databricks / GCP Dataproc
  • Serverless compute (AWS Lambda / Azure Functions)

Cloud Data Warehouse

  • Snowflake
  • Amazon Redshift
  • Azure Synapse Analytics
  • Google BigQuery

Modern Data Pipeline Tools

  • Airflow (workflow orchestration)
  • Azure Data Factory (ADF) / AWS Glue
  • Databricks Workflows
  • Kafka basics (streaming ingestion)
  • Azure Event Hubs / Google Cloud Pub/Sub overview

Data Quality & Governance

  • Data validation techniques
  • Data quality checks (DQ rules)
  • Logging & monitoring
  • Metadata management
  • Lineage basics (Collibra/Apache Atlas overview)

DevOps for Data Engineering

  • Git & GitHub
  • CI/CD concepts
  • Docker basics
  • Deploying ETL pipelines
  • Environment management

Mini Projects + Capstone

  • ETL pipeline using SQL + Python
  • PySpark batch processing pipeline
  • Data ingestion from API to cloud storage
  • End-to-end pipeline using Airflow/ADF
  • Data warehouse modeling project
  • Resume building for Data Engineers
  • Interview preparation + coding round