AWS Services Analysis & Opportunities

⚠️ NOTE: This document is OUTDATED and has been superseded by AWS_SPARK_SIMPLIFICATION_ANALYSIS.md

Current Status:

  • ✅ Athena and Glue Data Catalog are implemented in Terraform
  • ❌ Lambda, DynamoDB, and Aurora are intentionally not implemented (not needed)

For current architecture decisions, see: tasks/01_data_ingestion_transformation/AWS_SPARK_SIMPLIFICATION_ANALYSIS.md


Historical Context

This document was created during initial analysis based on the Ohpen Data Engineer job posting requirements. It has been superseded by a more focused analysis that recommends simplicity over complexity.

Current Implementation Status

✅ Implemented Services

  • S3: Primary storage (Bronze/Silver/Gold layers)
  • Glue: ETL jobs (Python Shell + Spark)
  • CloudWatch: Monitoring and alarms
  • Athena: SQL query engine (✅ Implemented)
  • Glue Data Catalog: Metadata catalog (✅ Implemented)
  • Step Functions: ETL orchestration (✅ Implemented)
  • EventBridge: Scheduled ETL runs (✅ Implemented)

❌ Intentionally Not Implemented

  • Lambda: Not needed (Step Functions handles orchestration)
  • DynamoDB: Not needed (a Silver layer scan is sufficient at the current scale)
  • Aurora: Not needed (S3/Athena sufficient for analytics)

Decision Rationale

See tasks/01_data_ingestion_transformation/AWS_SPARK_SIMPLIFICATION_ANALYSIS.md for detailed rationale on why we chose simplicity over additional AWS services.

Key Principle: Add complexity only when it provides proportional benefit. At the current scale (~500 MB/month), the implemented services are sufficient.


This document is kept for historical reference only.