Modern Data Engineering

Evaluation of high-performance data transformation, orchestration, and storage formats for modern data lakehouses.

Orchestration, Transformation & Patterns

Tool | Status | Value
Apache Airflow | ADOPT | Industry standard for complex, enterprise-grade DAG orchestration.
dbt | ADOPT | The standard for modular, versioned, and tested SQL transformations.
Dagster | TRIAL | Asset-based orchestration focusing on software-defined data assets.
Airbyte | ADOPT | The open-source standard for EL (Extract-Load) pipelines.
Kappa Architecture | TRIAL | Stream-first architecture that treats all data as events, simplifying the stack by removing the redundant batch layer.
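
The Kappa idea of "all data as events, no separate batch layer" can be sketched in a few lines. This is a minimal illustration in plain Python, not a production design: the `Event` schema, the in-memory `log`, and the `totals_per_user` processor are all hypothetical stand-ins for a durable log (for example a Kafka topic) and a real stream processor.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One immutable record in the append-only log (hypothetical schema)."""
    offset: int
    user_id: str
    amount: float


def totals_per_user(events: Iterable[Event]) -> Dict[str, float]:
    """Stream processor: fold events into a materialized view, one at a time."""
    view: Dict[str, float] = defaultdict(float)
    for event in events:
        view[event.user_id] += event.amount
    return dict(view)


# The log is the single source of truth; there is no separate batch dataset.
log: List[Event] = [
    Event(0, "alice", 10.0),
    Event(1, "bob", 5.0),
    Event(2, "alice", 2.5),
]

# "Live" processing: consume the events in arrival order.
live_view = totals_per_user(log)

# "Reprocessing": replay the same log from offset 0 through the same code path,
# instead of maintaining a second, batch-only implementation.
replayed_view = totals_per_user(e for e in log if e.offset >= 0)
```

Because both views come from the same function over the same log, a logic change only has to be made once and is then applied by replaying from the start.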

Processing & Dataframes

Tool | Status | Value
Polars | ADOPT | High-performance Rust-powered DataFrames for efficient local processing.
Pandas | ADOPT | The standard for data manipulation and analysis in Python.

Compute & Database Engines

Engine | Status | Why
Google BigQuery | ADOPT | Fully managed, serverless enterprise data warehouse. My primary engine for analytics at scale.
Databricks | ASSESS | Unified analytics platform. Assessing for specific Spark-heavy workloads and Delta Lake integration.
DuckDB | ADOPT | In-process OLAP database for fast, local analytics and data profiling.
Apache Flink | ASSESS | Evaluating for complex, stateful stream processing with sub-second latency requirements.
Apache Spark (Dataproc) | ADOPT | The legacy powerhouse for massive batch processing and migration of on-prem Hadoop workloads to GCP.
ClickHouse | ASSESS | High-performance columnar database for real-time analytical queries and user-facing dashboards.
Neo4j | TRIAL | Specialized graph database for complex relationship analysis and network modeling.

Operational & Cloud Databases

Database | Status | Why
PostgreSQL | ADOPT | The world's most advanced open-source relational database. My default for structured data.
Snowflake | ADOPT | Cloud-native data warehouse with excellent separation of compute and storage.
Redis | ADOPT | High-performance in-memory data store. Essential for caching and real-time features.
MongoDB | TRIAL | Document-oriented NoSQL database. Trialing for specific flexible-schema use cases.
SQLAlchemy | ADOPT | The definitive Python SQL toolkit and Object Relational Mapper (ORM).

Data & Table Formats

Format | Status | Why
Apache Iceberg | ADOPT | The dominant open table format for cloud-native data lakehouses.
Apache Parquet | ADOPT | The universal columnar storage format for high-performance analytics.
Apache Avro | ADOPT | Row-oriented format. The gold standard for streaming data serialization (Kafka, Pub/Sub).
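
The row-oriented versus columnar distinction behind Avro and Parquet can be shown with plain Python structures. This is only a conceptual sketch with made-up sample data and a hypothetical `to_columns` helper. The real formats add schemas, binary encoding, compression, and (for Parquet) row groups, none of which are modeled here.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record is stored together, which suits
# streaming, where a consumer reads one complete message at a time.
rows: List[Dict[str, Any]] = [
    {"id": 1, "city": "Oslo", "temp_c": 4.0},
    {"id": 2, "city": "Lima", "temp_c": 20.0},
    {"id": 3, "city": "Pune", "temp_c": 27.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot records into a columnar layout: one list of values per field."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row access: fetch one whole record.
second_record = rows[1]

# Columnar access: an aggregate touches only the column it needs,
# which is why analytical scans favor this layout.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
```

Here `mean_temp` is computed from `columns["temp_c"]` alone, without reading `id` or `city`, while `second_record` returns every field of a single event in one lookup.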