Cloud Data
Repository: landerox/cloud-landerox-data
This project serves as a centralized architectural guide and reference implementation for managing data movement and transformations on Google Cloud. It complements the Cloud Infra blueprint by providing the application-layer logic (Cloud Functions and Dataflow) that runs on the provisioned resources.
Key Goals
- Production-Grade Foundations: Implementing a complete Medallion Architecture (Bronze, Silver, Gold) to demonstrate structured data management.
- Lakehouse Approach: Combining the flexibility of a Raw Data Lake in Cloud Storage with the analytical power of BigQuery.
- Unified Processing: Establishing a single codebase for both real-time streaming and historical batch processing using the Kappa Architecture.
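To make the unified-processing goal concrete, here is a minimal sketch in plain Python of the idea behind a single codebase: one transform function is applied unchanged to a bounded historical list and to a generator standing in for a live stream. The names `to_silver`, `run`, and the event fields are hypothetical and not taken from the repository; in an Apache Beam pipeline the equivalent move would typically be swapping a bounded source for an unbounded one while keeping the transforms.

```python
from typing import Iterable, Iterator


def to_silver(event: dict) -> dict:
    """Hypothetical cleansing step shared by batch and streaming runs."""
    return {
        "user_id": str(event["user_id"]).strip().lower(),
        "amount_cents": round(float(event["amount"]) * 100),
    }


def run(source: Iterable[dict]) -> Iterator[dict]:
    """Apply the same logic to any source, bounded or unbounded."""
    for event in source:
        yield to_silver(event)


# Bounded "batch" input: a historical backfill held in memory.
history = [
    {"user_id": " Alice ", "amount": "12.5"},
    {"user_id": "BOB", "amount": 3},
]


def live() -> Iterator[dict]:
    """Stand-in for an unbounded stream; here it simply replays the history."""
    yield from history


batch_result = list(run(history))
stream_result = list(run(live()))
```

Because `run` only depends on iteration, the business logic in `to_silver` cannot drift between the two modes.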
Tech Stack
- Language: Python 3.12+ (managed with uv)
- Processing: Dataflow (Apache Beam), BigQuery
- Ingestion: Cloud Functions (Gen 2), Pub/Sub
- Storage: Google Cloud Storage (Object Store), BigQuery (Analytical)
- Infrastructure: Terraform (Schema & Resource management)
- Quality: pylint (10/10 mandatory), ruff, pyright
Architecture Highlights
- Medallion Layers:
  - Bronze: Event-driven ingestion capturing raw data into GCS.
  - Silver: Single Source of Truth, with data cleansed and enriched in Dataflow.
  - Gold: Curated, business-ready datasets optimized for BigQuery consumption.
- Kappa Pattern: Eliminates the need for separate batch and streaming codebases, ensuring consistent logic across all data velocities.
- Type-Driven Schema: Uses Python typing.NamedTuple for static validation and automatic BigQuery schema generation.
- Reliability: Implements robust retry logic and Dead Letter Queues (DLQ) to handle malformed records without halting pipelines.
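The type-driven schema and dead-letter ideas above can be sketched together using only the standard library. This is an illustration under stated assumptions, not the repository's implementation: the `Order` record, the `_BQ_TYPES` mapping, and both helper functions are hypothetical. In a Beam pipeline the valid/dead-letter split would more likely be expressed with tagged outputs.

```python
from typing import NamedTuple, get_type_hints

# Hypothetical mapping from Python types to BigQuery column types.
_BQ_TYPES = {str: "STRING", int: "INT64", float: "FLOAT64", bool: "BOOL"}


class Order(NamedTuple):
    """Hypothetical Silver-layer record; not taken from the repository."""

    order_id: str
    quantity: int
    amount: float


def bigquery_schema(record_type: type) -> list[dict[str, str]]:
    """Derive a BigQuery-style field list from a NamedTuple's annotations."""
    hints = get_type_hints(record_type)
    return [
        {"name": name, "type": _BQ_TYPES[hints[name]], "mode": "REQUIRED"}
        for name in record_type._fields
    ]


def parse_or_dead_letter(rows, record_type):
    """Coerce raw dicts into typed records; set failures aside instead of raising."""
    hints = get_type_hints(record_type)
    valid, dead_letter = [], []
    for row in rows:
        try:
            values = {name: hints[name](row[name]) for name in record_type._fields}
            valid.append(record_type(**values))
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original payload and the reason so it can be replayed later.
            dead_letter.append({"row": row, "error": repr(exc)})
    return valid, dead_letter


schema = bigquery_schema(Order)
valid, dead_letter = parse_or_dead_letter(
    [
        {"order_id": "a1", "quantity": "2", "amount": "9.5"},
        {"order_id": "a2", "quantity": "not-a-number", "amount": "1"},
        {"order_id": "a3", "amount": "4.0"},  # missing "quantity"
    ],
    Order,
)
```

Keeping the record definition as the single source for both parsing and schema generation means a new field only has to be declared once.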
Enterprise Considerations
While focused on the data layer, this blueprint integrates organizational standards to ensure production readiness:
- FinOps: Optimized worker configurations to control compute costs, and BigQuery partitioning/clustering to limit the data scanned per query.
- Security: Dedicated Service Accounts per component and secret management via GCP Secret Manager.
- Observability: Structured logging integrated with Cloud Logging and custom metrics for pipeline health.
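As an illustration of the structured-logging point, the sketch below emits one JSON object per log line with a `severity` field, a shape that Cloud Logging generally recognizes when it is written to standard output on managed runtimes such as Cloud Functions (Gen 2). The `JsonFormatter` class, the `build_logger` helper, and the `fields` convention for extra attributes are assumptions made for this example, not code from the repository.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with a `severity` field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Hypothetical convention: extra structured data travels in `record.fields`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout and nothing else."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("bronze_ingest")
log.info("file landed", extra={"fields": {"bucket": "example-bucket", "rows": 120}})
```

Attaching counts such as `rows` to each entry also gives a simple basis for log-based metrics on pipeline health.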