Pattern · Data pipelines
ETL Pipeline Patterns
Move and transform data from source systems into a usable form. Batch, streaming, or hybrid.
Overview
ETL (Extract, Transform, Load) pipelines move data from operational systems into analytical stores. The key decision is latency tolerance: batch processing is cheaper and simpler, while streaming enables near-real-time decisions at the cost of more operational complexity.
Pipeline Types
Batch Processing
Scheduled jobs process large volumes of accumulated data. High throughput, cost-effective, tolerates latency.
- AWS Glue (serverless Spark)
- EMR (managed Hadoop/Spark)
- S3 → Redshift COPY
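The batch flow above can be sketched as three small stages run over an accumulated file. This is a minimal illustration only: the CSV content, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse (in an AWS deployment the extract would read from S3 and the load would typically be a Redshift COPY).

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in a real batch job this would be a file in S3.
RAW_CSV = """order_id,amount,currency
1001,19.99,usd
1002,5.00,USD
1003,42.50,eur
"""


def extract(text):
    """Yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Normalize types: integer ids, amounts in cents, upper-case currency."""
    for row in rows:
        yield (
            int(row["order_id"]),
            round(float(row["amount"]) * 100),
            row["currency"].upper(),
        )


def load(conn, records):
    """Write the transformed records into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount_cents INTEGER, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_batch(conn, text):
    """Run extract -> transform -> load and return (row count, total cents)."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT COUNT(*), SUM(amount_cents) FROM orders"
    ).fetchone()


if __name__ == "__main__":
    print(run_batch(sqlite3.connect(":memory:"), RAW_CSV))
```

Because each stage is a generator, rows flow through one at a time, so the whole file never needs to fit in memory.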
Real-time Streaming
Continuous processing of events as they arrive. Low latency, immediate insights.
- Kinesis Data Streams
- MSK (Kafka)
- Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics)
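To make "continuous processing of events as they arrive" concrete, here is a plain-Python sketch of a tumbling-window count. It is an illustration under assumptions, not how any of the services above are implemented: the function name and the `(timestamp, key)` event shape are hypothetical, events are assumed to arrive in timestamp order, and late or out-of-order events are not handled. A stream processor such as Flink provides windowing like this, plus the late-data handling this sketch omits.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Yield (window_start, Counter) each time a window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed
    to arrive in timestamp order.
    """
    current_start = None
    counts = Counter()
    for ts, key in events:
        # Align the timestamp to the start of its fixed-size window.
        start = ts - (ts % window_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a new window: emit the closed one.
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Flush the final, possibly partial, window.
        yield current_start, counts
```

Results are emitted as soon as a window closes rather than after the whole input is read, which is the property that separates this from a scheduled batch job.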
Lambda Architecture
Hybrid: a batch layer for accuracy, a speed layer for low latency, and a serving layer that combines both. The name refers to the architecture pattern, not the AWS Lambda service.
- S3 (batch) + Kinesis (speed)
- Redshift + DynamoDB serving
- Athena for ad-hoc queries
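The way the serving layer combines the two views can be sketched as follows. This is a simplified in-memory model with hypothetical names (`ServingLayer`, `publish_batch`, `record_event`); it assumes each batch run covers every event currently held in the speed view, which a real system has to guarantee with timestamps or offsets.

```python
class ServingLayer:
    """Answer queries from a batch view plus a speed view (in-memory sketch)."""

    def __init__(self):
        self.batch_view = {}  # accurate totals as of the last batch run
        self.speed_view = {}  # increments seen since the last batch run

    def record_event(self, key, amount=1):
        """Speed layer: apply an event immediately."""
        self.speed_view[key] = self.speed_view.get(key, 0) + amount

    def publish_batch(self, totals):
        """Batch layer: replace the batch view with recomputed totals.

        Assumes the batch run covered every event currently in the
        speed view, so the speed view can be discarded.
        """
        self.batch_view = dict(totals)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: batch total plus recent increments."""
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)
```

Queries stay fresh between batch runs, and any error that accumulates in the speed view is wiped out the next time the batch layer publishes.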
Best Practices
- Validate early: catch bad data at ingestion, not at the warehouse
- Dead-letter queues: give failed records a landing zone for inspection
- Idempotent transforms: make jobs safe to re-run without double-counting
- Data lineage: track where every record came from (the AWS Glue Data Catalog holds the table and schema metadata this builds on)
- Encrypt everywhere: S3 SSE, Redshift encryption, KMS key rotation
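The first three practices can be combined in one small ingestion step. The sketch below is illustrative only: the field names, the `events` table, and the list used as a dead-letter queue are hypothetical, and SQLite stands in for the target store. Idempotency here comes from keying writes on `event_id`, so re-running the same batch overwrites rows instead of duplicating them.

```python
import sqlite3

REQUIRED_FIELDS = ("event_id", "user_id", "amount")  # hypothetical schema


def validate(record):
    """Return an error message, or None if the record is acceptable."""
    for field in REQUIRED_FIELDS:
        if field not in record:
            return f"missing field: {field}"
    amount = record["amount"]
    if not isinstance(amount, (int, float)) or amount < 0:
        return "amount must be a non-negative number"
    return None


def ingest(conn, records, dead_letters):
    """Validate at ingestion, divert failures, and upsert the rest.

    Returns (row count, total amount) for the target table.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_id TEXT PRIMARY KEY, user_id TEXT, amount REAL)"
    )
    for record in records:
        error = validate(record)
        if error:
            # Keep the failed record and the reason for later inspection.
            dead_letters.append({"record": record, "error": error})
            continue
        # Keyed upsert: replaying the same event does not double-count.
        conn.execute(
            "INSERT OR REPLACE INTO events VALUES (?, ?, ?)",
            (record["event_id"], record["user_id"], record["amount"]),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM events").fetchone()
```

Running `ingest` twice on the same batch leaves the table unchanged the second time, which is what makes retries after a partial failure safe.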