ETL Pipeline Patterns

Move and transform data from source systems into a usable form, using batch, streaming, or hybrid pipelines.

[Figure: ETL pipeline patterns. Extract (sources: RDS, S3, APIs, streams) → Transform (Glue, Lambda, Spark: cleanse, enrich, aggregate) → Load (targets: Redshift, S3, DynamoDB). Batch: S3 → Glue job → Redshift (scheduled, high throughput). Streaming: Kinesis Data Streams → Lambda / KDA → DynamoDB / OpenSearch (real-time). Lambda Architecture = batch layer + speed layer + serving layer.]

Overview

ETL (Extract, Transform, Load) pipelines move data from operational systems into analytical stores. The key decision is latency tolerance: batch processing is cheaper and simpler, while streaming enables real-time decisions at the cost of more operational complexity.

Pipeline Types

Batch Processing

Scheduled jobs process large volumes of accumulated data. High throughput, cost-effective, tolerates latency.

  • AWS Glue (serverless Spark)
  • EMR (managed Hadoop/Spark)
  • S3 → Redshift COPY
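
To make the three stages concrete, here is a minimal sketch of a batch run in plain Python. It is not a Glue or EMR job: the sample data, the function names, and the in-memory list standing in for a warehouse table are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical accumulated export; in a real job this would be read from
# S3 or a database snapshot rather than an inline string.
RAW_ORDERS = """order_id,customer,amount,currency
1,alice,10.50,usd
2,bob,,usd
3,alice,4.50,USD
4,carol,7.00,usd
"""


def extract(text):
    """Extract: parse the CSV export into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse bad rows, normalize fields, aggregate per customer."""
    totals = defaultdict(float)
    for row in rows:
        if not row["amount"]:                 # cleanse: drop rows with no amount
            continue
        currency = row["currency"].upper()    # normalize currency codes
        totals[(row["customer"], currency)] += float(row["amount"])
    return [
        {"customer": customer, "currency": currency, "total": round(total, 2)}
        for (customer, currency), total in sorted(totals.items())
    ]


def load(records, target):
    """Load: append records to the target (a list standing in for a table)."""
    target.extend(records)
    return len(records)


def run_batch(text, target):
    """Run one scheduled batch: extract, then transform, then load."""
    return load(transform(extract(text)), target)
```

Calling `run_batch(RAW_ORDERS, warehouse)` with an empty list processes the whole export in one pass, which is the trade batch makes: results arrive only when the scheduled run completes, but each run handles a large volume cheaply.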

Real-time Streaming

Continuous processing of events as they arrive. Low latency, immediate insights.

  • Kinesis Data Streams
  • MSK (Kafka)
  • Kinesis Data Analytics (Flink), since renamed Amazon Managed Service for Apache Flink
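
As a rough illustration of processing events as they arrive, the sketch below counts events per key in fixed (tumbling) time windows and emits each window as soon as a later event closes it. It uses a plain Python generator rather than Kinesis or Flink; the function name, the `(timestamp, key)` event shape, and the assumption that events arrive in timestamp order are all illustrative.

```python
from collections import Counter


def tumbling_counts(events, window_seconds):
    """Yield (window_start, counts) each time a tumbling window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed to
    arrive in timestamp order. Late or out-of-order events are not handled.
    """
    current_start = None
    counts = Counter()
    for ts, key in events:
        start = ts - (ts % window_seconds)    # window this event falls into
        if current_start is None:
            current_start = start
        elif start != current_start:
            # An event from a later window arrived: emit the finished window.
            yield current_start, dict(counts)
            counts = Counter()
            current_start = start
        counts[key] += 1
    if current_start is not None:
        # Input ended: flush the final, possibly partial, window.
        yield current_start, dict(counts)
```

Because the generator consumes one event at a time, it works on an unbounded iterator as well as a list. Real stream processors add what this sketch leaves out, such as watermarks for late data, checkpointing, and parallelism across shards or partitions.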

Lambda Architecture

Hybrid: a batch layer for accuracy, a speed layer for low latency, and a serving layer that combines the results of both.

  • S3 (batch) + Kinesis (speed)
  • Redshift + DynamoDB serving
  • Athena for ad-hoc queries
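
The serving layer's job can be sketched as a merge of two views: totals precomputed by the batch layer, plus increments the speed layer has seen since the last batch run. The example below assumes both views are simple key-to-count dictionaries; the function names and the data are hypothetical, and a real deployment would read these views from stores such as Redshift and DynamoDB.

```python
def serving_view(batch_view, speed_view):
    """Combine batch totals with recent speed-layer increments.

    Neither input is modified; a new merged dict is returned.
    """
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


def query(key, batch_view, speed_view):
    """Answer a single-key query from both layers without a full merge."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Hypothetical state: the batch layer last ran overnight, and the speed
# layer has counted only the events that arrived since then.
batch_totals = {"alice": 100, "bob": 40}
recent_increments = {"alice": 5, "carol": 2}
```

With this state, `query("alice", batch_totals, recent_increments)` adds the overnight total to the recent increment. When the next batch run finishes, its output replaces `batch_totals` and the speed-layer increments it now covers are discarded, which is how the batch layer corrects any approximation made by the speed layer.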

Best Practices

References

  • AWS Glue Samples (GitHub): ETL jobs, crawlers, and data catalog configurations
  • Analytics Reference Architecture (GitHub): complete data lake with ETL, governance, and analytics
  • Building Data Lakes on AWS (AWS Whitepaper): ETL patterns, governance, and analytics on AWS