AWS Services Analysis & Opportunities
⚠️ NOTE: This document is OUTDATED and has been superseded by AWS_SPARK_SIMPLIFICATION_ANALYSIS.md.
Current Status:
- ✅ Athena and Glue Data Catalog are implemented in Terraform
- ❌ Lambda, DynamoDB, Aurora are intentionally not implemented (not needed)
For current architecture decisions, see:
tasks/01_data_ingestion_transformation/AWS_SPARK_SIMPLIFICATION_ANALYSIS.md
Historical Context
This document was created during initial analysis based on the Ohpen Data Engineer job posting requirements. It has been superseded by a more focused analysis that recommends simplicity over complexity.
Current Implementation Status
✅ Implemented Services
- S3: Primary storage (Bronze/Silver/Gold layers)
- Glue: ETL jobs (Python Shell + Spark)
- CloudWatch: Monitoring and alarms
- Athena: SQL query engine (✅ Implemented)
- Glue Data Catalog: Metadata catalog (✅ Implemented)
- Step Functions: ETL orchestration (✅ Implemented)
- EventBridge: Scheduled ETL runs (✅ Implemented)
❌ Intentionally Not Implemented
- Lambda: Not needed (Step Functions handles orchestration)
- DynamoDB: Not needed (Silver layer scan sufficient for scale)
- Aurora: Not needed (S3/Athena sufficient for analytics)
Decision Rationale
See tasks/01_data_ingestion_transformation/AWS_SPARK_SIMPLIFICATION_ANALYSIS.md for detailed rationale on why we chose simplicity over additional AWS services.
Key Principle: Add complexity only when it provides proportional benefit. At the current scale (~500 MB/month), the implemented services are sufficient.
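To make the scale argument concrete, here is a rough back-of-the-envelope sketch. It takes the ~500 MB/month figure from this document and estimates what a full scan of the accumulated data would cost in Athena. The price per terabyte scanned is an assumption for illustration only (check current AWS pricing for your region), and the function names are hypothetical. The estimate ignores compression, partition pruning, and per-query minimums, all of which change the real number.

```python
# Rough estimate only; not a measurement from the actual pipeline.
MONTHLY_INGEST_MB = 500          # from this document: ~500 MB/month
ASSUMED_PRICE_PER_TB_USD = 5.0   # ASSUMPTION: illustrative Athena on-demand price


def accumulated_gb(months: int, monthly_mb: float = MONTHLY_INGEST_MB) -> float:
    """Total data accumulated after `months` of ingestion, in GB."""
    return months * monthly_mb / 1024


def full_scan_cost_usd(
    months: int, price_per_tb: float = ASSUMED_PRICE_PER_TB_USD
) -> float:
    """Cost of one query that scans everything accumulated so far."""
    return accumulated_gb(months) / 1024 * price_per_tb


if __name__ == "__main__":
    for months in (12, 36, 60):
        print(
            f"{months:>2} months: {accumulated_gb(months):6.2f} GB, "
            f"full scan ~ ${full_scan_cost_usd(months):.4f}"
        )
```

Under these assumptions, a year of data is under 6 GB and a full scan costs a few cents, which is consistent with the decision not to add DynamoDB or Aurora at this volume.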
This document is kept for historical reference only.