architecting-data — Skillopedia

Data Architecture Purpose Guide architects and platform engineers through strategic data architecture decisions for modern cloud-native data platforms. When to Use This Skill Invoke this skill when: - Designing a new data platform or modernizing legacy systems - Choosing between data lake, data warehouse, or data lakehouse - Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables) - Evaluating centralized vs data mesh architecture - Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi) - Designing medallion architecture (bronze, silver, gold lay…

)\n```\n\n**Soda Core:**\n```yaml\nchecks for silver.customers:\n - row_count > 1000\n - missing_count(customer_id) = 0\n - duplicate_count(customer_id) = 0\n - invalid_percent(email) \u003c 1%\n```\n\n**dbt Tests:**\n```yaml\nmodels:\n - name: customers\n columns:\n - name: customer_id\n tests:\n - unique\n - not_null\n```\n\n## Access Control\n\n**1. Role-Based Access Control (RBAC):**\n```sql\nCREATE ROLE sales_analyst;\nGRANT SELECT ON gold.fact_sales TO ROLE sales_analyst;\nGRANT ROLE sales_analyst TO USER [email protected];\n```\n\n**2. Attribute-Based (Row-Level Security):**\n```sql\nCREATE ROW ACCESS POLICY customers_region_policy\nAS (region string) RETURNS BOOLEAN ->\n CASE\n WHEN IS_ROLE_IN_SESSION('US_ANALYST') AND region = 'US' THEN TRUE\n WHEN IS_ROLE_IN_SESSION('EU_ANALYST') AND region = 'EU' THEN TRUE\n ELSE FALSE\n END;\n```\n\n**3. Column-Level Security (Dynamic Data Masking):**\n```sql\nCREATE MASKING POLICY mask_email AS (val string) RETURNS string ->\n CASE\n WHEN IS_ROLE_IN_SESSION('PII_VIEWER') THEN val\n ELSE '***MASKED***'\n END;\n\nALTER TABLE gold.customers\n MODIFY COLUMN email SET MASKING POLICY mask_email;\n```\n\n## Data Quality in Medallion\n\n| Layer Transition | Quality Checks | Purpose |\n|-----------------|---------------|---------|\n| Bronze → Silver | Schema validation, type checks, deduplication | Ensure parseable and unique |\n| Silver → Gold | Business rules, referential integrity | Ensure business logic |\n| Gold | Anomaly detection, statistical checks | Ensure reasonable |\n\n## Best Practices\n\n1. **Automate quality checks:** Integrate into CI/CD\n2. **Monitor lineage:** Track data flow end-to-end\n3. **Catalog everything:** Make data discoverable\n4. **Enforce access control:** Least privilege principle\n5. **Track data quality:** Metrics and dashboards\n6. **Document data products:** Schema, SLAs, examples\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":3743,"content_sha256":"14bd517a7fd505729be392e04a9e89348d4a11e2a8960635dd5394e2c312adb6"},{"filename":"references/medallion-pattern.md","content":"# Medallion Architecture Pattern\n\nStandard pattern for organizing data in lakehouses. Three layers of increasing data quality.\n\n\n## Table of Contents\n\n- [Three Layers](#three-layers)\n - [Bronze Layer (Raw)](#bronze-layer-raw)\n - [Silver Layer (Cleaned & Conformed)](#silver-layer-cleaned-conformed)\n - [Gold Layer (Business-Level)](#gold-layer-business-level)\n- [Data Quality by Layer](#data-quality-by-layer)\n- [Medallion + Data Mesh](#medallion-data-mesh)\n- [Benefits](#benefits)\n\n## Three Layers\n\n### Bronze Layer (Raw)\n\n**Purpose:** Exact copy of source systems; immutable historical archive\n\n**Characteristics:**\n- Append-only (never modify or delete)\n- Full fidelity (all columns, all rows)\n- Format: Source format or Parquet\n- Retention: Forever (or years)\n\n**Example:**\n```sql\nCREATE TABLE bronze.raw_customers (\n _ingested_at TIMESTAMP,\n _source_file STRING,\n _raw_data STRING -- entire JSON/CSV blob\n);\n```\n\n### Silver Layer (Cleaned & Conformed)\n\n**Purpose:** Validated, deduplicated, typed data\n\n**Transformations:**\n- Parse JSON/XML\n- Type conversions (string → int, date)\n- Deduplication\n- Normalized naming\n\n**Example:**\n```sql\nCREATE TABLE silver.customers AS\nSELECT\n json_extract_scalar(_raw_data, '$.id') AS customer_id,\n json_extract_scalar(_raw_data, '$.name') AS customer_name,\n CAST(json_extract_scalar(_raw_data, '$.revenue') AS BIGINT) AS annual_revenue,\n DATE(json_extract_scalar(_raw_data, '$.created')) AS created_date\nFROM bronze.raw_customers\nQUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _ingested_at DESC) = 1;\n```\n\n**Data Quality Checks:**\n- Primary key uniqueness\n- Null checks on required fields\n- Referential integrity\n- Value range validation\n\n### Gold Layer (Business-Level)\n\n**Purpose:** Optimized for consumption (BI, ML, APIs)\n\n**Patterns:**\n- Star schemas (dimensional models)\n- Wide tables (ML features)\n- Aggregates (daily_sales_by_region)\n\n**Example:**\n```sql\nCREATE TABLE gold.fact_sales AS\nSELECT\n s.order_id,\n d.date_key,\n c.customer_key,\n p.product_key,\n s.quantity * s.unit_price AS gross_revenue,\n s.quantity * s.unit_price - s.discount_amount AS net_revenue,\n (s.quantity * s.unit_price - s.discount_amount) - (s.quantity * p.unit_cost) AS profit\nFROM silver.sales s\nJOIN gold.dim_date d ON s.order_date = d.date\nJOIN gold.dim_customer c ON s.customer_id = c.customer_id\nJOIN gold.dim_product p ON s.product_id = p.product_id;\n```\n\n## Data Quality by Layer\n\n| Layer | Quality Checks | Purpose |\n|-------|---------------|---------|\n| Bronze → Silver | Schema validation, type checks, deduplication | Ensure data is parseable and unique |\n| Silver → Gold | Business rule validation, referential integrity | Ensure data meets business logic |\n| Gold | Anomaly detection, statistical checks | Ensure data is reasonable |\n\n## Medallion + Data Mesh\n\nEach domain owns bronze-silver-gold for their data products:\n\n```\nSales Domain:\n bronze.sales_raw\n silver.sales_cleaned\n gold.sales_analytics (data product)\n\nMarketing Domain:\n bronze.marketing_raw\n silver.marketing_cleaned\n gold.marketing_analytics (data product)\n\nCross-Domain:\n gold.customer_360 (combines sales + marketing)\n```\n\n## Benefits\n\n- Clear separation of concerns\n- Reprocessing at any layer\n- Incremental quality improvement\n- Standard pattern across organization\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":3319,"content_sha256":"6a10247a550b37936483d14e74cb0ef666ff3b4ba58ffd18cae08f81ef521128"},{"filename":"references/modeling-approaches.md","content":"# Data Modeling Approaches\n\n## Table of Contents\n\n1. [Dimensional Modeling (Kimball)](#dimensional-modeling-kimball)\n2. [Normalized Modeling (Inmon)](#normalized-modeling-inmon)\n3. [Data Vault 2.0](#data-vault-20)\n4. [Wide Tables (Denormalized)](#wide-tables-denormalized)\n\n---\n\n## Dimensional Modeling (Kimball)\n\n### Overview\n\nOptimized for analytical queries and business intelligence. Organizes data into fact tables (metrics) and dimension tables (context).\n\n### Star Schema Design Process\n\n**Step 1: Identify Business Process**\n- Example: \"Analyze product sales performance\"\n- Focus on measurable business events\n\n**Step 2: Declare Grain**\n- Grain = level of detail in fact table\n- Examples: \"one row per order line item\" or \"one row per daily sales summary\"\n- **Critical:** Grain must be specific and consistent\n\n**Step 3: Identify Dimensions (Context)**\n- **Who:** Customer, Salesperson\n- **What:** Product, Promotion\n- **When:** Date, Time\n- **Where:** Store, Location\n- **How:** Payment Method\n- **Why:** Campaign, Channel\n\n**Step 4: Identify Facts (Measures)**\n- Quantity Sold\n- Unit Price\n- Discount Amount\n- Net Revenue\n- Cost\n- Profit\n\n### Star Schema Example\n\n```\nDimDate DimProduct DimCustomer\n- DateKey (PK) - ProductKey (PK) - CustomerKey (PK)\n- Date - ProductID - CustomerID\n- DayOfWeek - ProductName - CustomerName\n- Month - Category - Segment\n- Quarter - Brand - Country\n- Year - UnitCost - JoinDate\n \\ | /\n \\ | /\n \\ | /\n \\ | /\n \\ | /\n \\ | /\n FactSales (Fact Table)\n - SalesKey (PK, surrogate)\n - DateKey (FK)\n - ProductKey (FK)\n - CustomerKey (FK)\n - StoreKey (FK)\n - Quantity\n - UnitPrice\n - DiscountAmount\n - NetRevenue\n```\n\n### Slowly Changing Dimensions (SCD)\n\n**Type 1: Overwrite (No History)**\n```sql\n-- Update customer segment\nUPDATE DimCustomer\nSET Segment = 'Premium'\nWHERE CustomerID = 12345;\n\n-- Old value lost, no history\n```\n\n**Use When:** History doesn't matter (typo corrections, insignificant changes)\n\n---\n\n**Type 2: Add Row (Full History)**\n```sql\n-- Current record\nINSERT INTO DimCustomer (CustomerKey, CustomerID, Name, Segment, EffectiveDate, EndDate, IsCurrent)\nVALUES (98765, 12345, 'John Doe', 'Premium', '2025-01-01', '9999-12-31', TRUE);\n\n-- When segment changes\nUPDATE DimCustomer SET EndDate = '2025-06-30', IsCurrent = FALSE WHERE CustomerKey = 98765;\nINSERT INTO DimCustomer (CustomerKey, CustomerID, Name, Segment, EffectiveDate, EndDate, IsCurrent)\nVALUES (98766, 12345, 'John Doe', 'VIP', '2025-07-01', '9999-12-31', TRUE);\n```\n\n**Query Pattern:**\n```sql\n-- Current customers\nSELECT * FROM DimCustomer WHERE IsCurrent = TRUE;\n\n-- Historical customers at specific date\nSELECT * FROM DimCustomer\nWHERE '2025-03-15' BETWEEN EffectiveDate AND EndDate;\n```\n\n**Use When:** Full history is critical (customer address, product pricing, employee roles)\n\n---\n\n**Type 3: Add Column (Limited History)**\n```sql\nALTER TABLE DimCustomer\nADD COLUMN PreviousSegment VARCHAR(50);\n\n-- Track current and previous\nUPDATE DimCustomer\nSET PreviousSegment = Segment, Segment = 'VIP'\nWHERE CustomerID = 12345;\n```\n\n**Use When:** Need to compare current vs previous (not full history)\n\n---\n\n**Type 6: Hybrid (1+2+3)**\nCombines Type 2 (full history) with Type 1 (current value in all rows).\n\n```sql\nCREATE TABLE DimCustomer (\n CustomerKey BIGINT PRIMARY KEY,\n CustomerID BIGINT,\n Name VARCHAR(100),\n CurrentSegment VARCHAR(50), -- Type 1: Always current\n HistoricalSegment VARCHAR(50), -- Type 2: Historical value\n EffectiveDate DATE, -- Type 2\n EndDate DATE, -- Type 2\n IsCurrent BOOLEAN -- Type 2\n);\n\n-- All rows show current segment, but history preserved\n```\n\n**Use When:** Need both point-in-time history and current value for all historical rows\n\n### When to Use Dimensional Modeling\n\n**Ideal Scenarios:**\n- Business intelligence and reporting\n- Known query patterns (dashboards, KPIs)\n- Historical trend analysis\n- User-friendly for SQL analysts\n- BI tools (Tableau, Power BI, Looker)\n\n**Trade-Offs:**\n- ✅ Fast analytical queries (denormalized, few joins)\n- ✅ Intuitive for business users\n- ✅ Optimized for BI tools\n- ❌ Inflexible to schema changes\n- ❌ Denormalization creates redundancy\n- ❌ Requires upfront modeling effort\n\n---\n\n## Normalized Modeling (Inmon)\n\n### Overview\n\nEliminate redundancy and ensure data integrity through normalization (3NF, BCNF). Designed for transactional systems (OLTP).\n\n### Normal Forms\n\n**1st Normal Form (1NF):**\n- Atomic values (no repeating groups)\n- Each column contains single value\n\n**2nd Normal Form (2NF):**\n- 1NF + No partial dependencies\n- Non-key attributes depend on entire primary key\n\n**3rd Normal Form (3NF):**\n- 2NF + No transitive dependencies\n- Non-key attributes depend only on primary key (not other non-key attributes)\n\n**Boyce-Codd Normal Form (BCNF):**\n- Stricter version of 3NF\n- Every determinant is a candidate key\n\n### Normalized Example (3NF)\n\n```\nCustomers Table\n- CustomerID (PK)\n- CustomerName\n- Email\n- PhoneNumber\n\nOrders Table\n- OrderID (PK)\n- CustomerID (FK)\n- OrderDate\n- TotalAmount\n\nOrderLines Table\n- OrderLineID (PK)\n- OrderID (FK)\n- ProductID (FK)\n- Quantity\n- UnitPrice\n\nProducts Table\n- ProductID (PK)\n- ProductName\n- CategoryID (FK)\n- UnitCost\n\nCategories Table\n- CategoryID (PK)\n- CategoryName\n```\n\n### When to Use Normalized Modeling\n\n**Ideal Scenarios:**\n- Transactional systems (OLTP)\n- Data with frequent updates\n- Strong consistency requirements\n- Source systems feeding data warehouse\n\n**Trade-Offs:**\n- ✅ No redundancy (DRY principle)\n- ✅ Data integrity enforced\n- ✅ Flexible to changes\n- ❌ Complex joins slow analytical queries\n- ❌ Not intuitive for business users\n- ❌ Poor performance for BI\n\n---\n\n## Data Vault 2.0\n\n### Overview\n\nFlexible, auditable, scalable model for enterprise data warehouses. Designed for compliance, multi-source integration, and agile requirements.\n\n### Core Structures\n\n**Hubs: Unique Business Keys**\n```sql\nCREATE TABLE HubCustomer (\n CustomerHashKey BINARY(20) PRIMARY KEY, -- Hash of CustomerID\n CustomerID VARCHAR(50), -- Business key\n LoadDate TIMESTAMP,\n RecordSource VARCHAR(50)\n);\n\n-- Immutable (never deleted)\n-- No descriptive attributes\n```\n\n**Links: Relationships Between Hubs**\n```sql\nCREATE TABLE LinkOrder (\n OrderHashKey BINARY(20) PRIMARY KEY, -- Hash of OrderID\n CustomerHashKey BINARY(20), -- FK to HubCustomer\n ProductHashKey BINARY(20), -- FK to HubProduct\n OrderID VARCHAR(50), -- Business key\n LoadDate TIMESTAMP,\n RecordSource VARCHAR(50),\n FOREIGN KEY (CustomerHashKey) REFERENCES HubCustomer(CustomerHashKey),\n FOREIGN KEY (ProductHashKey) REFERENCES HubProduct(ProductHashKey)\n);\n\n-- Represents transactions or associations\n```\n\n**Satellites: Descriptive Attributes**\n```sql\nCREATE TABLE SatCustomer (\n CustomerHashKey BINARY(20), -- FK to HubCustomer\n LoadDate TIMESTAMP, -- Part of PK\n EndDate TIMESTAMP,\n CustomerName VARCHAR(100),\n Email VARCHAR(100),\n Phone VARCHAR(20),\n Address VARCHAR(200),\n RecordSource VARCHAR(50),\n PRIMARY KEY (CustomerHashKey, LoadDate),\n FOREIGN KEY (CustomerHashKey) REFERENCES HubCustomer(CustomerHashKey)\n);\n\n-- Temporal (tracks changes over time)\n-- Multiple satellites per hub (source-specific)\n```\n\n### Query Pattern (Join Hubs, Links, Satellites)\n\n```sql\n-- Reconstruct customer orders (current state)\nSELECT\n hc.CustomerID,\n sc.CustomerName,\n sc.Email,\n lo.OrderID,\n hp.ProductID,\n sp.ProductName\nFROM HubCustomer hc\nJOIN SatCustomer sc ON hc.CustomerHashKey = sc.CustomerHashKey\n AND sc.EndDate = '9999-12-31' -- Current record\nJOIN LinkOrder lo ON hc.CustomerHashKey = lo.CustomerHashKey\nJOIN HubProduct hp ON lo.ProductHashKey = hp.ProductHashKey\nJOIN SatProduct sp ON hp.ProductHashKey = sp.ProductHashKey\n AND sp.EndDate = '9999-12-31'; -- Current record\n```\n\n### When to Use Data Vault\n\n**Ideal Scenarios:**\n- Compliance requirements (full audit trail)\n- Multiple source systems with overlapping data\n- Agile warehousing (requirements change frequently)\n- Long-term historical archive\n\n**Trade-Offs:**\n- ✅ Highly flexible (easy to add sources)\n- ✅ Complete audit trail\n- ✅ Parallel loading (hubs, links, satellites independent)\n- ❌ Complex queries (many joins)\n- ❌ Requires data mart layer for BI\n- ❌ Storage overhead\n\n---\n\n## Wide Tables (Denormalized)\n\n### Overview\n\nSingle table with hundreds of pre-joined columns. Optimized for columnar storage and query performance.\n\n### Example: Customer Feature Table (ML)\n\n```sql\nCREATE TABLE CustomerFeatures (\n -- Identity\n customer_id BIGINT PRIMARY KEY,\n created_at DATE,\n\n -- Demographics\n age INT,\n gender VARCHAR(10),\n country VARCHAR(50),\n segment VARCHAR(50),\n\n -- Behavioral (Last 30 Days)\n total_purchases_last_30d INT,\n total_revenue_last_30d DECIMAL(10,2),\n avg_order_value_last_30d DECIMAL(10,2),\n days_since_last_purchase INT,\n\n -- Behavioral (Last 90 Days)\n total_purchases_last_90d INT,\n total_revenue_last_90d DECIMAL(10,2),\n\n -- Lifetime\n total_purchases_lifetime INT,\n total_revenue_lifetime DECIMAL(10,2),\n tenure_days INT,\n customer_lifetime_value DECIMAL(10,2),\n\n -- Product Affinity (100+ category columns)\n purchases_category_electronics INT,\n purchases_category_clothing INT,\n -- ... 100+ more category columns\n\n -- Predictions\n churn_risk_score DECIMAL(3,2),\n next_purchase_days INT,\n\n -- Metadata\n last_updated TIMESTAMP\n);\n```\n\n### Optimization Techniques\n\n**1. Columnar Storage (Parquet, ORC):**\n```python\n# Write as Parquet (columnar format)\ndf.write.parquet(\"s3://bucket/customer_features.parquet\")\n\n# Only read needed columns\nspark.read.parquet(\"s3://bucket/customer_features.parquet\") \\\n .select(\"customer_id\", \"churn_risk_score\", \"total_revenue_lifetime\")\n\n# Reads only 3 columns, not all 100+\n```\n\n**2. Partitioning:**\n```sql\nCREATE TABLE CustomerFeatures (\n customer_id BIGINT,\n segment VARCHAR(50),\n -- ... 100+ columns\n last_updated DATE\n)\nPARTITIONED BY (segment, last_updated);\n\n-- Prune partitions\nSELECT * FROM CustomerFeatures\nWHERE segment = 'VIP' AND last_updated = '2025-01-01';\n```\n\n**3. Clustering:**\n```sql\n-- Snowflake clustering\nALTER TABLE CustomerFeatures\nCLUSTER BY (customer_id, segment);\n\n-- Co-locate related data for faster joins/filters\n```\n\n**4. Materialized Views:**\n```sql\n-- Pre-compute aggregates\nCREATE MATERIALIZED VIEW CustomerSummary AS\nSELECT\n segment,\n COUNT(*) AS customer_count,\n AVG(total_revenue_lifetime) AS avg_ltv,\n AVG(churn_risk_score) AS avg_churn_risk\nFROM CustomerFeatures\nGROUP BY segment;\n\n-- Auto-refresh on base table changes\n```\n\n### When to Use Wide Tables\n\n**Ideal Scenarios:**\n- ML feature stores\n- Data science notebooks (exploratory analysis)\n- High-performance dashboards\n- Columnar databases (Snowflake, BigQuery, Redshift)\n\n**Trade-Offs:**\n- ✅ Fastest analytical queries (no joins)\n- ✅ Simple for users (one table)\n- ✅ Efficient in columnar storage (only read needed columns)\n- ❌ Significant redundancy\n- ❌ Large storage footprint\n- ❌ Update complexity (many columns to maintain)\n\n---\n\n## Modeling Approach Comparison\n\n| Factor | Dimensional | Normalized | Data Vault | Wide Tables |\n|--------|-------------|------------|------------|-------------|\n| **Primary Use Case** | BI/Reporting | OLTP | Compliance | ML/Data Science |\n| **Query Performance** | Fast | Slow | Slow | Fastest |\n| **Storage Efficiency** | Medium | Best | Medium | Worst |\n| **Update Complexity** | Medium | Low | Low | High |\n| **Historical Tracking** | SCD Types | Difficult | Excellent | Medium |\n| **Schema Flexibility** | Low | Medium | High | Low |\n| **Learning Curve** | Medium | Low | High | Low |\n| **Governance** | Medium | Medium | Excellent | Low |\n\n### Hybrid Approaches\n\n**1. Data Vault + Dimensional (Common Pattern):**\n```\nSource Systems\n → Data Vault (Integration Layer, full history, audit)\n → Dimensional (Presentation Layer, star schemas for BI)\n → BI Tools\n```\n\n**Benefits:** Flexibility of Data Vault + performance of dimensional models\n\n---\n\n**2. Dimensional + Wide Tables:**\n```\nDimensional Model (Star Schema)\n → Wide Tables (ML Feature Tables)\n → ML Models\n```\n\n**Benefits:** BI on dimensional, ML on denormalized features\n\n---\n\n**3. Normalized (OLTP) + Dimensional (OLAP):**\n```\nTransactional Database (3NF)\n → ETL/ELT\n → Data Warehouse (Dimensional)\n → BI Tools\n```\n\n**Benefits:** Classic separation of transactional and analytical workloads\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":13076,"content_sha256":"b9fa8c0b89096c60db04afd1741bb36874fb083d84b905a762d962e27d8bda6b"},{"filename":"references/modern-data-stack.md","content":"# Modern Data Stack Architecture\n\n\n## Table of Contents\n\n- [Standard Architecture Layers (2025)](#standard-architecture-layers-2025)\n- [Layer Selection Criteria](#layer-selection-criteria)\n - [Ingestion](#ingestion)\n - [Storage](#storage)\n - [Transformation](#transformation)\n - [Orchestration](#orchestration)\n- [Typical Workflow](#typical-workflow)\n- [Tool Stack Evolution](#tool-stack-evolution)\n\n## Standard Architecture Layers (2025)\n\n```\n┌─────────────────────────────────────────┐\n│ Data Sources │\n│ - Databases (PostgreSQL, MySQL, MongoDB)│\n│ - SaaS (Salesforce, Stripe, HubSpot) │\n│ - Events (Segment, Rudderstack) │\n│ - Files (S3, GCS, SFTP) │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Ingestion Layer │\n│ - Fivetran (ELT, pre-built connectors) │\n│ - Airbyte (open-source, custom) │\n│ - Kafka (streaming, event-driven) │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Storage Layer │\n│ - Snowflake (cloud data warehouse) │\n│ - Databricks (lakehouse, Spark) │\n│ - BigQuery (Google Cloud, serverless) │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Transformation Layer │\n│ - dbt (SQL-based, version controlled) │\n│ - Dataform (Google, SQL + Dataform) │\n│ - Spark (PySpark, large-scale) │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Orchestration Layer │\n│ - Airflow (Python DAGs, most popular) │\n│ - Dagster (asset-based, modern) │\n│ - Prefect (Python, dynamic workflows) │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Visualization/BI Layer │\n│ - Tableau (enterprise visualizations) │\n│ - Looker (Google, LookML modeling) │\n│ - Power BI (Microsoft ecosystem) │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Governance Layer (Cross-Cutting) │\n│ - DataHub (catalog, open-source) │\n│ - Great Expectations (data quality) │\n│ - OpenLineage (lineage tracking) │\n└─────────────────────────────────────────┘\n```\n\n## Layer Selection Criteria\n\n### Ingestion\n\n**Fivetran:**\n- ✅ Pre-built connectors (300+)\n- ✅ SaaS, low maintenance\n- ❌ Expensive ($1K-10K+/month)\n\n**Airbyte:**\n- ✅ Open-source, free\n- ✅ Custom connectors\n- ❌ More maintenance\n\n**Kafka:**\n- ✅ Real-time streaming\n- ✅ High throughput\n- ❌ Complex setup\n\n### Storage\n\n**Snowflake:**\n- Best for: BI/analytics\n- Strength: Zero-maintenance, strong governance\n- Cost: $$\n\n**Databricks:**\n- Best for: ML/data science\n- Strength: Unified analytics (BI + ML)\n- Cost: $\n\n**BigQuery:**\n- Best for: Google Cloud users\n- Strength: Serverless, pay-per-query\n- Cost: $\n\n### Transformation\n\n**dbt:**\n- SQL-based, analysts can write\n- Version control, testing\n- Cost: Free (open-source)\n\n**Spark:**\n- PySpark/Scala, data engineers\n- Large-scale processing\n- Cost: Compute-based\n\n### Orchestration\n\n**Airflow:**\n- Most mature (2014)\n- Large community\n- Complexity: High\n\n**Dagster:**\n- Asset-based (modern)\n- Strong testing\n- Complexity: Medium\n\n**Prefect:**\n- Simpler than Airflow\n- Dynamic workflows\n- Complexity: Low-Medium\n\n## Typical Workflow\n\n**Batch ELT Pipeline:**\n```\nPostgreSQL → Fivetran → Snowflake → dbt → Tableau\n (Extract/Load) (Transform) (Visualize)\n```\n\n**Streaming Pipeline:**\n```\nApp Events → Kafka → Flink → Iceberg → Trino → Dashboard\n (Stream) (Process) (Store) (Query)\n```\n\n**Hybrid Pipeline:**\n```\nBatch Sources → Fivetran → Lakehouse\n ↓\nStreaming Sources → Kafka → Lakehouse\n ↓\n dbt (Transform)\n ↓\n BI Tools\n```\n\n## Tool Stack Evolution\n\n**Phase 1 (Startup):**\n- BigQuery + Airbyte + dbt + Metabase\n- Cost: \u003c$1K/month\n- Team: 1-2 analysts\n\n**Phase 2 (Growth):**\n- Snowflake + Fivetran + dbt + Tableau + DataHub\n- Cost: $10K-50K/month\n- Team: 2-5 data engineers, 5-10 analysts\n\n**Phase 3 (Enterprise):**\n- Snowflake + Databricks + Fivetran + Kafka + dbt + Airflow + Tableau + Alation\n- Cost: $50K-500K/month\n- Team: 10+ data engineers, 20+ analysts\n\nSee [tool-recommendations.md](tool-recommendations.md) for detailed comparisons.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":5942,"content_sha256":"f43aa907ca5c6918d6e3d6bfe19c507dcdc054f902e01673692f4f2ccfdb21e0"},{"filename":"references/scenarios.md","content":"# Common Scenarios and Recommendations\n\n\n## Table of Contents\n\n- [Scenario 1: Startup Data Platform](#scenario-1-startup-data-platform)\n- [Scenario 2: Enterprise Modernization](#scenario-2-enterprise-modernization)\n- [Scenario 3: Data Mesh Assessment](#scenario-3-data-mesh-assessment)\n- [Scenario 4: Lakehouse Table Format Selection](#scenario-4-lakehouse-table-format-selection)\n- [Scenario 5: Streaming + Batch Architecture](#scenario-5-streaming-batch-architecture)\n- [Scenario 6: Cost Optimization](#scenario-6-cost-optimization)\n- [Decision Framework Summary](#decision-framework-summary)\n\n## Scenario 1: Startup Data Platform\n\n**Context:**\n- 50-person startup\n- Data sources: PostgreSQL, MongoDB, Stripe\n- Need: Analytics for CEO dashboard + ML experiments\n- Team: 1-2 analysts, no data engineers\n- Budget: \u003c$5K/month\n\n**Recommendation:**\n\n**Storage:** BigQuery or Snowflake (pay-per-use)\n- BigQuery if GCP-committed\n- Snowflake if multi-cloud flexibility desired\n\n**Ingestion:** Airbyte (open-source) or Fivetran (if budget)\n- Airbyte: Free, requires some ops\n- Fivetran: $1K-2K/month, zero maintenance\n\n**Transformation:** dbt Core (SQL-based)\n- Free, open-source\n- Version control with Git\n- Testing framework included\n\n**Orchestration:** dbt Cloud or Prefect Cloud\n- dbt Cloud: Simple, dbt-specific\n- Prefect Cloud: More flexible\n\n**Visualization:** Metabase (open-source) or Looker Studio (free)\n\n**Architecture:** Simple data warehouse\n- Start with dimensional models (star schemas)\n- Add lakehouse if ML becomes priority\n\n**Why:** Minimize complexity, leverage managed services, focus on business value.\n\n---\n\n## Scenario 2: Enterprise Modernization\n\n**Context:**\n- Legacy Oracle data warehouse\n- Want cloud migration\n- Support BI + ML workloads\n- Reduce costs\n- Team: 10+ data engineers\n\n**Recommendation:**\n\n**Storage:** Data Lakehouse (Databricks or Snowflake with Iceberg)\n- Databricks if ML-heavy\n- Snowflake if BI-primary\n\n**Migration Strategy:**\n1. Identify critical BI reports (keep on Oracle short-term)\n2. Build lakehouse for new analytics and ML\n3. Use CDC tools (Debezium, Fivetran) for continuous replication\n4. Gradually migrate Oracle tables to lakehouse\n5. Sunset Oracle after full migration\n\n**Architecture:** Medallion (bronze, silver, gold)\n- Bronze: Raw replication from Oracle\n- Silver: Cleaned, typed data\n- Gold: Star schemas + ML features\n\n**Cost Savings:** 60-80% with lake storage vs Oracle licensing\n\n**Timeline:** 12-18 months for full migration\n\n---\n\n## Scenario 3: Data Mesh Assessment\n\n**Context:**\n- 200-person company\n- Centralized data team: 5 people\n- Domains: Sales, Marketing, Product\n- Question: Should we adopt data mesh?\n\n**Assessment:**\n\n**6-Factor Scores (1-5):**\n1. Domain clarity: 4 (clear domains)\n2. Team maturity: 2 (analysts, but no data engineers in domains)\n3. Platform capability: 2 (some automation, but not self-serve)\n4. Governance maturity: 3 (centralized, but difficult to enforce)\n5. Scale need: 2 (5-person team not yet bottleneck)\n6. Organizational buy-in: 3 (interested, but skeptical)\n\n**Total Score:** 16/30\n\n**Recommendation:** NOT YET. Build foundation first.\n\n**Why:**\n- 200 people is too small for full data mesh (recommend >500)\n- Central team of 5 is not yet a bottleneck\n- Domain teams lack data engineering skills\n- Self-serve platform doesn't exist yet\n\n**Alternative Approach:**\n1. Invest in self-serve platform (infrastructure as code, CI/CD)\n2. Improve governance (catalog, lineage, quality)\n3. Train domain teams on data engineering\n4. Start with \"data product thinking\" (SLAs, documentation) without full decentralization\n5. Reconsider data mesh at 500+ people or when central team is clear bottleneck\n\n**Hybrid Option:**\n- Sales domain (most mature) could start owning data products\n- Marketing and Product remain centralized\n- Platform team provides shared infrastructure\n\n---\n\n## Scenario 4: Lakehouse Table Format Selection\n\n**Context:**\n- Building lakehouse on S3\n- Need Spark for ML\n- Need Trino for BI queries\n- Want to avoid vendor lock-in\n\n**Recommendation:** Apache Iceberg\n\n**Why:**\n- **Multi-engine support:** Both Spark and Trino have excellent Iceberg support\n- **Vendor-neutral:** Apache Foundation governance (not Databricks-led)\n- **Hidden partitioning:** Query without partition predicates\n- **Partition evolution:** Change partitioning without rewriting data\n- **Mature ecosystem:** Apache Polaris (catalog), Project Nessie (versioning)\n\n**Alternative:** Delta Lake if committed to Databricks, but less optimal for Trino\n\n**Implementation:**\n```sql\nCREATE TABLE catalog.db.sales (\n order_id BIGINT,\n customer_id BIGINT,\n order_date DATE,\n amount DECIMAL(10,2)\n)\nUSING iceberg\nPARTITIONED BY (days(order_date));\n```\n\n---\n\n## Scenario 5: Streaming + Batch Architecture\n\n**Context:**\n- Need both real-time dashboards and historical analytics\n- Data sources: Kafka (streaming), PostgreSQL (batch)\n- Team: Experienced data engineers\n\n**Recommendation:** Kappa Architecture with Medallion\n\n**Architecture:**\n```\nKafka (Streaming)\n → Flink (Stream Processing)\n → Bronze Layer (Iceberg tables)\n → Silver Layer (cleaned, 5-minute windows)\n → Gold Layer (BI dashboards + ML features)\n\nPostgreSQL (Batch)\n → Fivetran (CDC)\n → Bronze Layer\n → Silver Layer (daily batch)\n → Gold Layer\n```\n\n**Unified Gold Layer:** Merge streaming and batch data\n\n**Why:**\n- Single pipeline for both streaming and batch (Kappa)\n- Medallion architecture for quality layers\n- Iceberg supports both streaming writes (Flink) and batch reads (Trino)\n\n**Trade-Offs:**\n- Complexity: High (streaming infrastructure)\n- Latency: Real-time (milliseconds for streaming path)\n- Cost: Medium (compute for streaming)\n\n---\n\n## Scenario 6: Cost Optimization\n\n**Context:**\n- Currently on Snowflake\n- $50K/month bill\n- Mostly BI workloads\n- Want to reduce costs\n\n**Recommendation:** Migrate to Lakehouse\n\n**Strategy:**\n1. **Assess workloads:**\n - 80% of queries access last 30 days\n - 20% access historical data (>1 year)\n\n2. **Hot-Cold Architecture:**\n - Hot data (last 30 days): Snowflake\n - Cold data (>30 days): Iceberg on S3\n - Unified view: Snowflake External Tables + Iceberg\n\n3. **Incremental Migration:**\n - Week 1-2: Export historical data to S3 as Iceberg tables\n - Week 3-4: Update queries to use external tables for historical data\n - Week 5+: Keep only hot data in Snowflake\n\n**Cost Savings:**\n- Snowflake: $50K/month → $15K/month (70% reduction)\n- S3 + Trino: $5K/month\n- Total: $20K/month (60% savings)\n\n**Trade-Offs:**\n- Query performance: Slightly slower for historical queries\n- Complexity: Moderate (manage hybrid architecture)\n- Maintenance: Some ops work for Iceberg tables\n\n---\n\n## Decision Framework Summary\n\n| Scenario | Organization Size | Primary Workload | Recommendation |\n|----------|------------------|------------------|----------------|\n| Startup | \u003c50 | BI + Early ML | Data Warehouse (BigQuery, Snowflake) |\n| Growth | 50-500 | BI + ML | Data Lakehouse (Databricks, Iceberg) |\n| Enterprise | >500 | Mixed | Hybrid (Snowflake BI + Databricks ML) |\n| Data Mesh | >500 + Bottleneck | Decentralized | Domain-owned data products |\n| Cost-Sensitive | Any | Any | Lakehouse (60-80% cheaper) |\n| Streaming + Batch | Any | Real-time + Historical | Kappa + Medallion |\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":7299,"content_sha256":"18df14ded798038da2a687ff98a03c71cc4dd2f0a80731ba01b809adf4b9fc20"},{"filename":"references/storage-paradigms.md","content":"# Storage Paradigms Deep Dive\n\n## Table of Contents\n\n1. [Data Lake](#data-lake)\n2. [Data Warehouse](#data-warehouse)\n3. [Data Lakehouse](#data-lakehouse)\n4. [Architectural Patterns](#architectural-patterns)\n\n---\n\n## Data Lake\n\n### Definition\n\nCentralized repository storing raw data in native format at massive scale. Schema-on-read approach allows flexibility but requires discipline to avoid \"data swamp.\"\n\n### Core Characteristics\n\n- **Schema-on-read:** Structure applied when querying, not when storing\n- **Format-agnostic:** CSV, JSON, Parquet, Avro, images, videos, logs\n- **Cost-optimized:** Object storage (AWS S3, Google GCS, Azure ADLS)\n- **Scalability:** Petabyte+ capacity with linear scaling\n- **Immutability:** Write-once, read-many pattern\n\n### Three-Zone Architecture\n\n```\n┌─────────────────────────────────────────┐\n│ Raw Zone (Bronze) │\n│ - Exact copy of source data │\n│ - No transformations │\n│ - Immutable historical record │\n│ - Retention: Forever │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Cleaned Zone (Silver) │\n│ - Validated, deduplicated │\n│ - Type conversions │\n│ - Normalized formats │\n│ - Retention: 1-2 years │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Curated Zone (Gold) │\n│ - Business-level aggregates │\n│ - Ready for consumption │\n│ - Optimized for queries │\n│ - Retention: Business-defined │\n└─────────────────────────────────────────┘\n```\n\n### When to Use Data Lake\n\n**Ideal Scenarios:**\n- Diverse data sources (structured + semi-structured + unstructured)\n- Exploratory analytics with unknown use cases\n- ML/AI training data requiring raw, full-history data\n- Cost-sensitive workloads (storage is cheapest)\n- Long-term archival (years to decades)\n- Future-proof data retention\n\n**Anti-Patterns:**\n- Known BI/reporting requirements (warehouse is better)\n- Need for strong data quality guarantees\n- Users require SQL-first interface\n- No data engineering team to manage complexity\n\n### Trade-Offs\n\n**Advantages:**\n- Lowest storage cost ($0.02-0.03/GB/month)\n- Maximum schema flexibility\n- Store all data types\n- Future-proof (keep raw data for unforeseen use cases)\n- No upfront schema design required\n\n**Disadvantages:**\n- No ACID guarantees (eventual consistency)\n- Data quality issues common (\"garbage in, garbage out\")\n- Governance challenges (risk of data swamp)\n- Slower query performance vs warehouse\n- Requires data engineering expertise\n\n### Technologies\n\n**Storage:**\n- AWS S3 (Standard, Infrequent Access, Glacier tiers)\n- Google Cloud Storage (Standard, Nearline, Coldline)\n- Azure Data Lake Storage Gen2\n\n**File Formats:**\n- Parquet (columnar, best for analytics)\n- ORC (columnar, Hive-optimized)\n- Avro (row-based, schema evolution)\n- JSON, CSV (human-readable, not optimized)\n\n**Query Engines:**\n- Presto / Trino (distributed SQL)\n- AWS Athena (serverless Presto)\n- Apache Spark (batch and streaming)\n- Google Cloud Dataproc (managed Spark)\n\n---\n\n## Data Warehouse\n\n### Definition\n\nCentralized, structured repository optimized for analytical queries. Schema-on-write enforces data quality but reduces flexibility.\n\n### Core Characteristics\n\n- **Schema-on-write:** Structure enforced on ingestion\n- **Optimized for BI:** Indexing, partitioning, materialized views\n- **ACID transactions:** Consistency guaranteed\n- **Columnar storage:** Only read needed columns\n- **Query optimization:** Cost-based optimizer, statistics\n\n### Three-Layer Architecture\n\n```\n┌─────────────────────────────────────────┐\n│ Staging Layer │\n│ - Raw data ingestion │\n│ - Temporary storage │\n│ - Minimal transformations │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Integration Layer (EDW) │\n│ - Cleaned, conformed data │\n│ - Normalized or data vault │\n│ - Source of truth │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Presentation Layer (Data Marts) │\n│ - Star/snowflake schemas │\n│ - Department-specific views │\n│ - Optimized for reporting │\n└─────────────────────────────────────────┘\n```\n\n### When to Use Data Warehouse\n\n**Ideal Scenarios:**\n- Structured, relational data primarily\n- Known BI and reporting use cases\n- Strong governance and compliance requirements\n- Performance-critical dashboards\n- Financial reporting, regulatory compliance\n- Business users need SQL interface\n\n**Anti-Patterns:**\n- Highly unstructured data (images, videos, logs)\n- Exploratory analytics with unknown requirements\n- Rapid schema changes\n- Tight budget constraints\n\n### Trade-Offs\n\n**Advantages:**\n- Best query performance (sub-second for most queries)\n- Strong governance and data quality\n- ACID transactions\n- Mature tooling and ecosystem\n- User-friendly for analysts (SQL interface)\n\n**Disadvantages:**\n- Highest storage cost ($20-40/TB/month)\n- Schema inflexibility (migrations are painful)\n- Not ideal for unstructured data\n- Difficult to support ML workloads requiring raw data\n- Compute tightly coupled to storage (scaling complexity)\n\n### Technologies\n\n**Cloud Data Warehouses:**\n- Snowflake (multi-cloud, zero-maintenance)\n- Google BigQuery (serverless, pay-per-query)\n- AWS Redshift (AWS-native, columnar)\n- Azure Synapse Analytics (Azure-native, integrated)\n\n**On-Premises (Legacy):**\n- Teradata\n- Oracle Exadata\n- IBM Netezza\n- Vertica\n\n### Optimization Techniques\n\n**1. Partitioning:**\n- Time-based (daily, monthly) - most common\n- Geography (region, country)\n- Hash (even distribution)\n\n**2. Clustering:**\n- Co-locate related data on disk\n- Snowflake: Automatic micro-partitions + clustering keys\n- BigQuery: Clustering columns (max 4)\n\n**3. Materialized Views:**\n- Pre-compute expensive aggregations\n- Auto-refresh on base table changes\n- Trade storage cost for query performance\n\n**4. Incremental Loading:**\n- Only load new/changed data\n- Watermark column (e.g., `last_updated_timestamp`)\n- Reduces load time by 90%+\n\n---\n\n## Data Lakehouse\n\n### Definition\n\nHybrid architecture combining data lake cost efficiency and flexibility with data warehouse reliability and performance. Enabled by open table formats (Iceberg, Delta Lake, Hudi).\n\n### Core Characteristics\n\n- **Open table formats:** ACID transactions on object storage\n- **Schema enforcement:** Optional validation on write\n- **Time travel:** Query historical versions\n- **Unified platform:** BI and ML on same data\n- **Multi-engine support:** Spark, Trino, Flink, Presto\n- **Cost-effective:** Lake storage prices with warehouse features\n\n### Architecture\n\n```\n┌─────────────────────────────────────────┐\n│ Cloud Object Storage (S3, GCS, ADLS) │\n│ - Low-cost, scalable │\n│ - Parquet/ORC files │\n│ - Open formats (no proprietary lock-in)│\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Metadata Layer (Iceberg/Delta) │\n│ - Table schema, statistics │\n│ - Partition information │\n│ - Transaction log (ACID) │\n│ - Time travel snapshots │\n└─────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Compute Engines (Multi-Engine) │\n│ - Spark (batch, ML) │\n│ - Trino/Presto (SQL analytics) │\n│ - Flink (streaming) │\n│ - Dremio (semantic layer) │\n└─────────────────────────────────────────┘\n```\n\n### When to Use Data Lakehouse\n\n**Ideal Scenarios:**\n- Both BI and ML workloads (unified platform)\n- Cost optimization (60-80% cheaper than warehouse)\n- Multi-engine access (Spark, Trino, Flink)\n- Schema evolution requirements (frequent changes)\n- Streaming + batch processing\n- Avoid vendor lock-in (open formats)\n\n**Anti-Patterns:**\n- Only BI workloads (warehouse is simpler)\n- Very small scale (\u003c100GB data)\n- No data engineering expertise\n- Legacy tools incompatible with lakehouse\n\n### Trade-Offs\n\n**Advantages:**\n- Cost-effective (lake pricing, warehouse features)\n- Flexible (structured + semi-structured)\n- Multi-engine support (avoid vendor lock-in)\n- Open formats (Iceberg, Delta Lake)\n- Time travel and schema evolution\n- Unified BI + ML platform\n\n**Disadvantages:**\n- Newer technology (less mature than warehouse)\n- Performance not quite warehouse-level (improving)\n- Requires careful optimization (partitioning, clustering)\n- Steeper learning curve\n- Smaller ecosystem than traditional warehouses\n\n### Technologies\n\n**Table Formats:**\n- Apache Iceberg (vendor-neutral, multi-engine)\n- Delta Lake (Databricks-led, Spark-optimized)\n- Apache Hudi (upsert-optimized, streaming)\n\n**Lakehouse Platforms:**\n- Databricks (Delta Lake native)\n- Snowflake (Iceberg support added)\n- AWS Lake Formation (Iceberg + Hudi)\n- Dremio (Iceberg semantic layer)\n\n**Compute Engines:**\n- Apache Spark\n- Trino / Presto\n- Apache Flink\n- Dremio\n\n---\n\n## Architectural Patterns\n\n### Lambda Architecture (Batch + Speed Layers)\n\n```\nData Sources\n ↓\n┌────────────────────┬────────────────────┐\n│ Batch Layer │ Speed Layer │\n│ (Hourly/Daily) │ (Real-time) │\n│ - Historical data │ - Recent data │\n│ - Reprocessable │ - Low latency │\n│ - Spark batch │ - Kafka + Flink │\n└──────────┬─────────┴──────────┬─────────┘\n │ │\n ↓ ↓\n Batch View Real-time View\n │ │\n └──────────┬──────────┘\n ↓\n Serving Layer\n (Queries)\n```\n\n**When to Use:**\n- Need both historical accuracy and real-time insights\n- Batch reprocessing for data corrections\n- Trade-off between latency and correctness\n\n**Challenges:**\n- Complexity (two parallel pipelines)\n- Code duplication (batch and streaming logic)\n- Consistency challenges between layers\n\n### Kappa Architecture (Streaming-Only)\n\n```\nData Sources\n ↓\n Stream Processing\n (Kafka + Flink)\n ↓\n┌────────────────────┬────────────────────┐\n│ Serving Layer │ Archive Layer │\n│ (Last 30 days) │ (All history) │\n│ - Fast queries │ - S3/GCS │\n│ - Hot data │ - Reprocessable │\n└────────────────────┴────────────────────┘\n```\n\n**When to Use:**\n- Streaming-first organization\n- Simplified architecture (one pipeline)\n- Reprocessing via stream replay\n\n**Challenges:**\n- Stream processing for everything (including batch)\n- Requires mature streaming infrastructure\n- Reprocessing can be slow\n\n### Medallion Architecture (Databricks/Lakehouse Standard)\n\n```\nBronze Layer (Raw)\n- Exact copy of source\n- Append-only\n- All history retained\n ↓\nSilver Layer (Cleaned)\n- Validated, deduplicated\n- Type conversions\n- Slowly Changing Dimensions applied\n ↓\nGold Layer (Business-level)\n- Aggregates\n- Star schemas\n- Feature tables\n- Ready for BI/ML\n```\n\n**When to Use:**\n- Lakehouse platform (Databricks, Iceberg)\n- Clear data quality layers\n- Progressive data refinement\n\n**Benefits:**\n- Clear separation of concerns\n- Reprocessing at any layer\n- Incremental quality improvement\n\n### Comparison\n\n| Pattern | Complexity | Real-Time | Batch | Reprocessing |\n|---------|------------|-----------|-------|--------------|\n| Lambda | High | Excellent | Excellent | Batch layer |\n| Kappa | Medium | Excellent | Good | Stream replay |\n| Medallion | Low-Medium | Good | Excellent | Any layer |\n\n**Recommendation:**\n- Start with **Medallion** (simplest, most flexible)\n- Add **Kappa** if streaming-first\n- Avoid **Lambda** unless strong requirement for separate batch/speed layers\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":14247,"content_sha256":"9bbae7ea9392bcbfc09111c8b83fbc48f4c9578729e8bfb1092d025fe7ffc38a"},{"filename":"references/table-formats.md","content":"# Open Table Formats Comparison\n\nQuick comparison of Apache Iceberg, Delta Lake, and Apache Hudi for lakehouse architectures.\n\n## Feature Matrix\n\n| Feature | Apache Iceberg | Delta Lake | Apache Hudi |\n|---------|---------------|------------|-------------|\n| **ACID Transactions** | Yes (serializable) | Yes | Yes |\n| **Time Travel** | Yes (snapshot ID) | Yes (version number) | Yes (commit time) |\n| **Schema Evolution** | Excellent | Good | Good |\n| **Partition Evolution** | Yes (no rewrite) | No | Limited |\n| **Hidden Partitioning** | Yes | No | No |\n| **Multi-Engine Support** | Spark, Trino, Flink, Presto, Dremio | Spark (primary), Trino (limited) | Spark, Flink |\n| **Governance** | Apache Foundation | Databricks/Linux Foundation | Apache Foundation |\n| **CDC Support** | Good | Excellent | Excellent |\n| **Streaming** | Good | Excellent | Excellent |\n\n## Decision Matrix\n\n**Use Apache Iceberg when:**\n- Need multi-engine flexibility (Spark, Trino, Flink)\n- Want vendor neutrality (Apache Foundation)\n- Partition evolution required\n- Future-proof architecture (broadest adoption)\n\n**Use Delta Lake when:**\n- Committed to Databricks ecosystem\n- Primarily Spark-based workflows\n- Excellent streaming + CDC support needed\n- Unity Catalog integration desired\n\n**Use Apache Hudi when:**\n- CDC and upserts are primary workload\n- Streaming ingestion dominant\n- Need record-level updates (Merge-on-Read)\n- Cost-optimized incremental processing\n\n## Recommendation\n\n**Default:** Apache Iceberg (vendor-neutral, multi-engine, best long-term)\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":1541,"content_sha256":"938864dd203fc0218b53050d7993c3947b3ff3fbed4beeea70f7795cfc1ed3cd"},{"filename":"references/tool-recommendations.md","content":"# Tool Recommendations\n\n\n## Table of Contents\n\n- [Research-Validated Tools (Context7, December 2025)](#research-validated-tools-context7-december-2025)\n - [dbt (Data Build Tool)](#dbt-data-build-tool)\n - [Apache Iceberg](#apache-iceberg)\n - [Tool Comparison Matrix](#tool-comparison-matrix)\n- [Tool Stack Recommendations by Use Case](#tool-stack-recommendations-by-use-case)\n - [Startup (Cost-Optimized)](#startup-cost-optimized)\n - [Growth Company (Balanced)](#growth-company-balanced)\n - [Enterprise (Full Stack)](#enterprise-full-stack)\n- [Selection Criteria](#selection-criteria)\n - [Ingestion: Fivetran vs Airbyte vs Kafka](#ingestion-fivetran-vs-airbyte-vs-kafka)\n - [Storage: Snowflake vs Databricks vs BigQuery](#storage-snowflake-vs-databricks-vs-bigquery)\n - [Transformation: dbt vs Spark](#transformation-dbt-vs-spark)\n - [Orchestration: Airflow vs Dagster vs Prefect](#orchestration-airflow-vs-dagster-vs-prefect)\n\n## Research-Validated Tools (Context7, December 2025)\n\n### dbt (Data Build Tool)\n\n**Context7 Score:** 87.0 (Excellent)\n**Code Snippets:** 3,532+\n**Reputation:** High\n**Library ID:** /websites/getdbt\n\n**Use For:**\n- SQL-based transformations in data warehouse/lakehouse\n- Version control for data pipelines (Git integration)\n- Testing and documentation (built-in framework)\n- Lineage tracking (automatic DAG generation)\n\n**Key Features:**\n- Models as SQL `SELECT` statements\n- Built-in testing (unique, not_null, relationships, custom tests)\n- Incremental models (only process new data)\n- Seeds (CSV files as reference data)\n- Snapshots (track changes over time)\n- Documentation generation (automatic)\n- Multi-warehouse support (Snowflake, BigQuery, Redshift, Databricks)\n\n**Ecosystem Packages:**\n- `dbt-utils`: Macros for common patterns\n- `dbt-expectations`: Great Expectations-style tests\n- `dbt-databricks`: Databricks-specific optimizations\n- `dbt-project-evaluator`: Project quality checks\n\n**Why Recommended:**\n- Industry standard (majority of data teams use dbt)\n- Strong software engineering practices (Git, CI/CD, testing)\n- Extensive documentation and community support\n- Multi-warehouse support avoids lock-in\n\n**Getting Started:**\n```bash\npip install dbt-snowflake\ndbt init my_project\ncd my_project\ndbt run\ndbt test\n```\n\n---\n\n### Apache Iceberg\n\n**Context7 Score:** 79.7 (Strong)\n**Code Snippets:** 832+\n**Reputation:** High\n**Library ID:** /apache/iceberg\n\n**Use For:**\n- Open table format for data lakehouses\n- Multi-engine analytics (Spark, Trino, Flink, Presto)\n- ACID transactions on object storage\n- Schema evolution without rewrites\n- Time travel queries\n\n**Key Features:**\n- Hidden partitioning (no partition predicates in queries)\n- Partition evolution (change partitioning without rewriting data)\n- ACID transactions (serializable isolation)\n- Time travel (query historical snapshots)\n- Schema evolution (add/drop/rename columns safely)\n- Metadata management (efficient manifest files)\n- Multi-engine support (best-in-class)\n\n**Ecosystem:**\n- **Apache Polaris:** Open-source catalog (1,224 snippets, High reputation)\n- **Project Nessie:** Git-like versioning for data lakes (356 snippets, High reputation)\n- **Lakekeeper:** Rust-based Iceberg REST Catalog (301 snippets, Medium reputation)\n\n**Why Recommended:**\n- Vendor-neutral (Apache Foundation governance)\n- Broadest multi-engine support (avoid vendor lock-in)\n- Production-ready (Netflix, Apple, Adobe, Airbnb use in production)\n- Best partition evolution support\n\n**Getting Started:**\n```sql\n-- Create Iceberg table\nCREATE TABLE catalog.db.sales (\n order_id BIGINT,\n customer_id BIGINT,\n order_date DATE,\n amount DECIMAL(10,2)\n)\nUSING iceberg\nPARTITIONED BY (days(order_date));\n\n-- Time travel\nSELECT * FROM catalog.db.sales\nTIMESTAMP AS OF '2025-01-01 00:00:00';\n```\n\n---\n\n### Tool Comparison Matrix\n\n| Tool | Best For | Trust Score | Cost | Complexity |\n|------|----------|-------------|------|------------|\n| **Ingestion** |\n| Fivetran | Pre-built connectors, low maintenance | N/A | $$ | Low |\n| Airbyte | Custom connectors, cost-sensitive | N/A | $ (OSS) | Medium |\n| Kafka | Real-time streaming, event-driven | N/A | $ | High |\n| **Storage** |\n| Snowflake | BI/analytics, strong governance | N/A | $$ | Low |\n| Databricks | ML/data science, Spark-native | N/A | $ | Medium |\n| BigQuery | Google Cloud, serverless | N/A | $ | Low |\n| **Transformation** |\n| dbt | SQL-based, analysts, BI-focused | 87.0 | Free (OSS) | Low-Medium |\n| Spark | PySpark, large-scale, ML | N/A | $ | High |\n| **Orchestration** |\n| Airflow | Mature, large community | N/A | $ (OSS) | Medium-High |\n| Dagster | Asset-based, modern | N/A | $ (OSS) | Medium |\n| Prefect | Simple, dynamic workflows | N/A | $ (OSS) | Low-Medium |\n| **Table Format** |\n| Apache Iceberg | Multi-engine, vendor-neutral | 79.7 | Free (OSS) | Medium |\n| Delta Lake | Databricks, Spark-optimized | N/A | Free (OSS) | Medium |\n| Apache Hudi | CDC, frequent upserts | N/A | Free (OSS) | Medium-High |\n| **Catalog** |\n| DataHub | Open-source, REST API | N/A | Free (OSS) | Medium |\n| Alation | Enterprise, AI-powered | N/A | $$ | Low |\n| Collibra | Governance-focused | N/A | $$ | Medium |\n| **Quality** |\n| Great Expectations | Comprehensive, Python | N/A | Free (OSS) | Medium |\n| Soda Core | Simple, YAML-based | N/A | Free (OSS) | Low |\n| dbt Tests | Built-in, SQL-based | N/A | Free | Low |\n\n---\n\n## Tool Stack Recommendations by Use Case\n\n### Startup (Cost-Optimized)\n\n**Budget:** \u003c$5K/month\n**Team:** 1-2 analysts, no data engineers\n**Scale:** \u003c100GB data\n\n**Stack:**\n- **Storage:** BigQuery or Snowflake (pay-per-use)\n- **Ingestion:** Airbyte (open-source) or Fivetran (if budget)\n- **Transformation:** dbt Core (open-source)\n- **Orchestration:** dbt Cloud (managed, simple) or Prefect Cloud\n- **Visualization:** Metabase (open-source) or Looker Studio (free)\n- **Catalog:** DataHub (if needed later)\n\n**Why:** Minimize complexity, leverage managed services, focus on business value.\n\n---\n\n### Growth Company (Balanced)\n\n**Budget:** $10K-50K/month\n**Team:** 2-5 data engineers, 3-10 analysts\n**Scale:** 100GB-10TB data\n\n**Stack:**\n- **Storage:** Snowflake (BI-focused) OR Databricks (ML-focused)\n- **Ingestion:** Fivetran (convenience) + Kafka (real-time)\n- **Transformation:** dbt (primary) + Spark (heavy processing)\n- **Orchestration:** Airflow (Astronomer or AWS MWAA managed)\n- **Visualization:** Tableau or Looker\n- **Catalog:** DataHub (open-source)\n- **Quality:** Great Expectations + dbt tests\n- **Table Format:** Apache Iceberg (if lakehouse)\n\n**Why:** Balance convenience and cost; support BI and ML workloads.\n\n---\n\n### Enterprise (Full Stack)\n\n**Budget:** $50K-500K/month\n**Team:** 10+ data engineers, 20+ analysts\n**Scale:** >10TB data\n\n**Stack:**\n- **Storage:** Snowflake (BI) + Databricks (ML) hybrid\n- **Ingestion:** Fivetran + Custom Airflow DAGs + Kafka\n- **Transformation:** dbt + Spark (for heavy processing)\n- **Orchestration:** Airflow (self-hosted or managed)\n- **Visualization:** Tableau (enterprise license) + custom tools\n- **Catalog:** Alation or Collibra (enterprise features)\n- **Observability:** Monte Carlo (data observability) + Datadog\n- **Quality:** Great Expectations + Soda + dbt tests\n- **Table Format:** Apache Iceberg (vendor-neutral)\n- **Lineage:** OpenLineage + Marquez\n\n**Why:** Full-featured stack, mature governance, support for all workloads.\n\n---\n\n## Selection Criteria\n\n### Ingestion: Fivetran vs Airbyte vs Kafka\n\n**Fivetran:**\n- ✅ 300+ pre-built connectors\n- ✅ Lowest maintenance (SaaS, auto-schema change detection)\n- ✅ Best for standard sources (Salesforce, Stripe, PostgreSQL)\n- ❌ Expensive (starts $1K/month, scales with rows)\n- ❌ Proprietary (vendor lock-in)\n\n**Airbyte:**\n- ✅ Open-source (free)\n- ✅ 300+ connectors (community-maintained)\n- ✅ Custom connector SDK (build your own)\n- ✅ Self-hosted or cloud\n- ❌ More maintenance (self-hosted requires ops)\n- ❌ Less polished than Fivetran\n\n**Kafka:**\n- ✅ Real-time streaming (millisecond latency)\n- ✅ Event-driven architecture\n- ✅ Highest throughput (millions of events/sec)\n- ❌ High complexity (requires Kafka expertise)\n- ❌ Operational overhead (cluster management)\n- ❌ Not ideal for batch CDC\n\n**Recommendation:**\n- Standard sources + budget → Fivetran\n- Custom sources + cost-sensitive → Airbyte\n- Real-time streaming → Kafka\n- Hybrid: Fivetran (batch) + Kafka (real-time)\n\n---\n\n### Storage: Snowflake vs Databricks vs BigQuery\n\n**Snowflake:**\n- ✅ Best for BI/analytics (fastest SQL queries)\n- ✅ Strong governance (RBAC, masking, row-level security)\n- ✅ Multi-cloud (AWS, GCP, Azure)\n- ✅ Zero-maintenance (fully managed)\n- ❌ Expensive for large data volumes\n- ❌ Not ideal for ML workloads (limited Python support)\n\n**Databricks:**\n- ✅ Best for ML/data science (Spark-native, notebooks)\n- ✅ Unified analytics (BI + ML)\n- ✅ Lakehouse platform (Delta Lake)\n- ✅ Excellent for Spark workloads\n- ❌ Steeper learning curve\n- ❌ Not as SQL-optimized as Snowflake\n\n**BigQuery:**\n- ✅ Serverless (no cluster management)\n- ✅ Pay-per-query pricing (cost-effective for small workloads)\n- ✅ Google Cloud integration (GCS, Dataflow, Vertex AI)\n- ✅ Fast for large scans (petabyte-scale)\n- ❌ GCP lock-in\n- ❌ Limited multi-cloud support\n\n**Recommendation:**\n- BI-primary → Snowflake\n- ML-primary → Databricks\n- GCP-committed → BigQuery\n- Hybrid: Snowflake (BI) + Databricks (ML)\n\n---\n\n### Transformation: dbt vs Spark\n\n**dbt:**\n- ✅ SQL-based (analysts can write transformations)\n- ✅ Version control (Git integration)\n- ✅ Testing framework (built-in)\n- ✅ Documentation generation (automatic)\n- ✅ Incremental models (efficient)\n- ❌ SQL-only (no Python UDFs)\n- ❌ Not ideal for >100TB data\n\n**Spark:**\n- ✅ PySpark/Scala (complex logic, UDFs)\n- ✅ Large-scale processing (petabyte-scale)\n- ✅ ML integration (MLlib, feature engineering)\n- ✅ Flexible (batch and streaming)\n- ❌ Steep learning curve\n- ❌ Requires data engineers\n\n**Recommendation:**\n- Analysts writing transformations → dbt\n- Complex logic, large scale → Spark\n- Hybrid: dbt (80% of transformations) + Spark (20% heavy processing)\n\n---\n\n### Orchestration: Airflow vs Dagster vs Prefect\n\n**Airflow:**\n- ✅ Most mature (2014, battle-tested)\n- ✅ Largest community (extensive integrations)\n- ✅ Enterprise features (RBAC, audit logs)\n- ❌ Steep learning curve (DAG complexity)\n- ❌ Operational overhead (self-hosted)\n- ❌ Task-based (not asset-based)\n\n**Dagster:**\n- ✅ Asset-based (data pipelines as assets)\n- ✅ Strong testing (unit tests for pipelines)\n- ✅ Modern design (Python-native)\n- ✅ Better for data products\n- ❌ Smaller community (2019)\n- ❌ Less mature than Airflow\n\n**Prefect:**\n- ✅ Simpler than Airflow (dynamic workflows)\n- ✅ Python-native (Pythonic API)\n- ✅ Hybrid execution (local + cloud)\n- ❌ Smaller community (2018)\n- ❌ Less enterprise features\n\n**Recommendation:**\n- Mature, complex workflows → Airflow\n- Asset-oriented thinking → Dagster\n- Simplicity, dynamic workflows → Prefect\n- dbt-only → dbt Cloud (no external orchestrator needed)\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":11130,"content_sha256":"2b01fd358044ee6e0daa045c05bcbc22a968c84f65416f14c60668c8451c1721"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Data Architecture","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Purpose","type":"text"}]},{"type":"paragraph","content":[{"text":"Guide architects and platform engineers through strategic data architecture decisions for modern cloud-native data platforms.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When to Use This Skill","type":"text"}]},{"type":"paragraph","content":[{"text":"Invoke this skill when:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Designing a new data platform or modernizing legacy systems","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Choosing between data lake, data warehouse, or data lakehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Evaluating centralized vs data mesh architecture","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Designing medallion architecture (bronze, silver, gold layers)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Implementing data governance and cataloging","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Core Concepts","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"1. Storage Paradigms","type":"text"}]},{"type":"paragraph","content":[{"text":"Three primary patterns for analytical data storage:","type":"text"}]},{"type":"paragraph","content":[{"text":"Data Lake:","type":"text","marks":[{"type":"strong"}]},{"text":" Centralized repository for raw data at scale","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Schema-on-read, cost-optimized ($0.02-0.03/GB/month)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Diverse data sources, exploratory analytics, ML/AI training data","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Data Warehouse:","type":"text","marks":[{"type":"strong"}]},{"text":" Structured repository optimized for BI","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Schema-on-write, ACID transactions, fast queries","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Known BI requirements, strong governance needed","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Data Lakehouse:","type":"text","marks":[{"type":"strong"}]},{"text":" Hybrid combining lake flexibility with warehouse reliability","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Open table formats (Iceberg, Delta Lake), ACID on object storage","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Mixed BI + ML workloads, cost optimization (60-80% cheaper than warehouse)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Decision Framework:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"BI/Reporting only + Known queries → Data Warehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ML/AI primary + Raw data needed → Data Lake or Lakehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mixed BI + ML + Cost optimization → Data Lakehouse (recommended)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Exploratory/Unknown use cases → Data Lake","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For detailed comparison, see ","type":"text"},{"text":"references/storage-paradigms.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/storage-paradigms.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"2. Data Modeling Approaches","type":"text"}]},{"type":"paragraph","content":[{"text":"Four primary modeling patterns:","type":"text"}]},{"type":"paragraph","content":[{"text":"Dimensional (Kimball):","type":"text","marks":[{"type":"strong"}]},{"text":" Star/snowflake schemas for BI","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Known query patterns, BI dashboards, trend analysis","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Normalized (3NF):","type":"text","marks":[{"type":"strong"}]},{"text":" Eliminate redundancy for transactional systems","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: OLTP systems, frequent updates, strong consistency","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Data Vault 2.0:","type":"text","marks":[{"type":"strong"}]},{"text":" Flexible model with complete audit trail","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Compliance requirements, multiple sources, agile warehousing","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Wide Tables:","type":"text","marks":[{"type":"strong"}]},{"text":" Denormalized, optimized for columnar storage","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: ML feature stores, data science notebooks, high-performance dashboards","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Decision Framework:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Analytical (BI) + Known queries → Dimensional (Star Schema)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Transactional (OLTP) → Normalized (3NF)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compliance/Audit → Data Vault 2.0","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data Science/ML → Wide Tables","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For detailed patterns, see ","type":"text"},{"text":"references/modeling-approaches.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/modeling-approaches.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"3. Data Mesh Principles","type":"text"}]},{"type":"paragraph","content":[{"text":"Decentralized architecture for large organizations (>500 people).","type":"text"}]},{"type":"paragraph","content":[{"text":"Four Core Principles:","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Domain-oriented decentralization","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data as a product (SLAs, quality, documentation)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Self-serve data infrastructure","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Federated computational governance","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Readiness Assessment (Score 1-5 each):","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Domain clarity","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Team maturity","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Platform capability","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Governance maturity","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Scale need","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Organizational buy-in","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Scoring:","type":"text","marks":[{"type":"strong"}]},{"text":" 24-30: Strong candidate | 18-23: Hybrid | 12-17: Build foundation first | 6-11: Centralized","type":"text"}]},{"type":"paragraph","content":[{"text":"Red Flags:","type":"text","marks":[{"type":"strong"}]},{"text":" Small org (\u003c100 people), unclear domains, no platform team, weak governance","type":"text"}]},{"type":"paragraph","content":[{"text":"For full guide, see ","type":"text"},{"text":"references/data-mesh-guide.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/data-mesh-guide.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"4. Medallion Architecture","type":"text"}]},{"type":"paragraph","content":[{"text":"Standard lakehouse pattern: Bronze (raw) → Silver (cleaned) → Gold (business-level)","type":"text"}]},{"type":"paragraph","content":[{"text":"Bronze Layer:","type":"text","marks":[{"type":"strong"}]},{"text":" Exact copy of source data, immutable, append-only","type":"text"}]},{"type":"paragraph","content":[{"text":"Silver Layer:","type":"text","marks":[{"type":"strong"}]},{"text":" Validated, deduplicated, typed data","type":"text"}]},{"type":"paragraph","content":[{"text":"Gold Layer:","type":"text","marks":[{"type":"strong"}]},{"text":" Business logic, aggregates, dimensional models, ML features","type":"text"}]},{"type":"paragraph","content":[{"text":"Data Quality by Layer:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Bronze → Silver: Schema validation, type checks, deduplication","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Silver → Gold: Business rule validation, referential integrity","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Gold: Anomaly detection, statistical checks","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For patterns, see ","type":"text"},{"text":"references/medallion-pattern.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/medallion-pattern.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"5. Open Table Formats","type":"text"}]},{"type":"paragraph","content":[{"text":"Enable ACID transactions on data lakes:","type":"text"}]},{"type":"paragraph","content":[{"text":"Apache Iceberg:","type":"text","marks":[{"type":"strong"}]},{"text":" Multi-engine, vendor-neutral (Context7: 79.7 score)","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Avoid vendor lock-in, multi-engine flexibility","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Delta Lake:","type":"text","marks":[{"type":"strong"}]},{"text":" Databricks ecosystem, Spark-optimized","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: Committed to Databricks","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Apache Hudi:","type":"text","marks":[{"type":"strong"}]},{"text":" Optimized for CDC and frequent upserts","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use when: CDC-heavy workloads","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Recommendation:","type":"text","marks":[{"type":"strong"}]},{"text":" Apache Iceberg for new projects (vendor-neutral, broadest support)","type":"text"}]},{"type":"paragraph","content":[{"text":"For comparison, see ","type":"text"},{"text":"references/table-formats.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/table-formats.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"6. Modern Data Stack","type":"text"}]},{"type":"paragraph","content":[{"text":"Standard Layers:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Ingestion: Fivetran, Airbyte, Kafka","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage: Snowflake, Databricks, BigQuery","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Transformation: dbt (Context7: 87.0 score), Spark","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Orchestration: Airflow, Dagster, Prefect","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Visualization: Tableau, Looker, Power BI","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Governance: DataHub, Alation, Great Expectations","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Tool Selection:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Fivetran vs Airbyte: Pre-built connectors vs cost-sensitive","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Snowflake vs Databricks: BI-focused vs ML-focused","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"dbt vs Spark: SQL-based vs large-scale processing","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For detailed recommendations, see ","type":"text"},{"text":"references/tool-recommendations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/tool-recommendations.md","title":null}}]},{"text":" and ","type":"text"},{"text":"references/modern-data-stack.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/modern-data-stack.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"7. Data Governance","type":"text"}]},{"type":"paragraph","content":[{"text":"Data Catalog:","type":"text","marks":[{"type":"strong"}]},{"text":" Searchable inventory (DataHub, Alation, Collibra)","type":"text"}]},{"type":"paragraph","content":[{"text":"Data Lineage:","type":"text","marks":[{"type":"strong"}]},{"text":" Track data flow (OpenLineage, Marquez)","type":"text"}]},{"type":"paragraph","content":[{"text":"Data Quality:","type":"text","marks":[{"type":"strong"}]},{"text":" Validation and testing (Great Expectations, Soda, dbt tests)","type":"text"}]},{"type":"paragraph","content":[{"text":"Access Control:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"RBAC: Role-based (sales_analyst role)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ABAC: Attribute-based (row-level security)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Column-level: Dynamic data masking for PII","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For governance patterns, see ","type":"text"},{"text":"references/governance-patterns.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/governance-patterns.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Decision Frameworks","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Framework 1: Storage Paradigm Selection","type":"text"}]},{"type":"paragraph","content":[{"text":"Step 1: Identify Primary Use Case","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"BI/Reporting only → Data Warehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ML/AI primary → Data Lake or Lakehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mixed BI + ML → Data Lakehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Exploratory → Data Lake","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Step 2: Evaluate Budget","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"High budget, known queries → Data Warehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cost-sensitive, flexible → Data Lakehouse","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Recommendation by Org Size:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Startup (\u003c50): Data Warehouse (simplicity)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Growth (50-500): Data Lakehouse (balance)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Enterprise (>500): Hybrid or unified Lakehouse","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/decision-frameworks.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/decision-frameworks.md#storage-paradigm","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Framework 2: Data Modeling Approach","type":"text"}]},{"type":"paragraph","content":[{"text":"Decision Tree:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Analytical (BI) workload → Dimensional or Wide Tables","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Transactional (OLTP) → Normalized (3NF)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compliance/Audit → Data Vault 2.0","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data Science/ML → Wide Tables","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/decision-frameworks.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/decision-frameworks.md#modeling-approach","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Framework 3: Data Mesh Readiness","type":"text"}]},{"type":"paragraph","content":[{"text":"Use 6-factor assessment. Score interpretation:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"24-30: Proceed with data mesh","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"18-23: Hybrid approach","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"12-17: Build foundation first","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"6-11: Centralized","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/decision-frameworks.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/decision-frameworks.md#data-mesh-readiness","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Framework 4: Open Table Format Selection","type":"text"}]},{"type":"paragraph","content":[{"text":"Decision Tree:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Multi-engine flexibility → Apache Iceberg","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Databricks ecosystem → Delta Lake","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Frequent upserts/CDC → Apache Hudi","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Recommendation:","type":"text","marks":[{"type":"strong"}]},{"text":" Apache Iceberg for new projects","type":"text"}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/decision-frameworks.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/decision-frameworks.md#table-format","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Common Scenarios","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Startup Data Platform","type":"text"}]},{"type":"paragraph","content":[{"text":"Context:","type":"text","marks":[{"type":"strong"}]},{"text":" 50-person startup, PostgreSQL + MongoDB + Stripe","type":"text"}]},{"type":"paragraph","content":[{"text":"Recommendation:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage: BigQuery or Snowflake","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Ingestion: Airbyte or Fivetran","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Transformation: dbt","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Orchestration: dbt Cloud","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Architecture: Simple data warehouse","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/scenarios.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/scenarios.md#startup","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Enterprise Modernization","type":"text"}]},{"type":"paragraph","content":[{"text":"Context:","type":"text","marks":[{"type":"strong"}]},{"text":" Legacy Oracle warehouse, need cloud migration","type":"text"}]},{"type":"paragraph","content":[{"text":"Recommendation:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storage: Data Lakehouse (Databricks or Snowflake with Iceberg)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Strategy: Incremental migration with CDC","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Architecture: Medallion (bronze, silver, gold)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cost Savings: 60-80%","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/scenarios.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/scenarios.md#enterprise-modernization","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Data Mesh Assessment","type":"text"}]},{"type":"paragraph","content":[{"text":"Context:","type":"text","marks":[{"type":"strong"}]},{"text":" 200-person company, 5-person central data team","type":"text"}]},{"type":"paragraph","content":[{"text":"Recommendation:","type":"text","marks":[{"type":"strong"}]},{"text":" NOT YET. Build foundation first.","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Organization too small (\u003c500 recommended)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Central team not yet bottleneck","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Invest in self-serve platform and governance","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/scenarios.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/scenarios.md#data-mesh","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Tool Recommendations","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Research-Validated (Context7, December 2025)","type":"text"}]},{"type":"paragraph","content":[{"text":"dbt:","type":"text","marks":[{"type":"strong"}]},{"text":" Score 87.0, 3,532+ code snippets","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"SQL-based transformations, version control, testing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Industry standard for data transformation","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Apache Iceberg:","type":"text","marks":[{"type":"strong"}]},{"text":" Score 79.7, 832+ code snippets","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Open table format, multi-engine, vendor-neutral","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Production-ready (Netflix, Apple, Adobe)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Tool Stack by Use Case:","type":"text","marks":[{"type":"strong"}]}]},{"type":"paragraph","content":[{"text":"Startup:","type":"text","marks":[{"type":"strong"}]},{"text":" BigQuery + Airbyte + dbt + Metabase (\u003c$1K/month)","type":"text"}]},{"type":"paragraph","content":[{"text":"Growth:","type":"text","marks":[{"type":"strong"}]},{"text":" Snowflake + Fivetran + dbt + Airflow + Tableau ($10K-50K/month)","type":"text"}]},{"type":"paragraph","content":[{"text":"Enterprise:","type":"text","marks":[{"type":"strong"}]},{"text":" Snowflake + Databricks + Fivetran + Kafka + dbt + Airflow + Alation ($50K-500K/month)","type":"text"}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/tool-recommendations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/tool-recommendations.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Implementation Patterns","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pattern 1: Medallion Architecture","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Bronze: Raw ingestion\nCREATE TABLE bronze.raw_customers (_ingested_at TIMESTAMP, _raw_data STRING);\n\n-- Silver: Cleaned\nCREATE TABLE silver.customers AS\nSELECT json_extract(_raw_data, '$.id') AS customer_id, ...\nFROM bronze.raw_customers\nQUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _ingested_at DESC) = 1;\n\n-- Gold: Business-level\nCREATE TABLE gold.fact_sales AS\nSELECT s.order_id, d.date_key, c.customer_key, ...\nFROM silver.sales s\nJOIN gold.dim_date d ON s.order_date = d.date;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pattern 2: Apache Iceberg Table","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"CREATE TABLE catalog.db.sales (order_id BIGINT, amount DECIMAL(10,2))\nUSING iceberg\nPARTITIONED BY (days(order_date));\n\n-- Time travel\nSELECT * FROM catalog.db.sales TIMESTAMP AS OF '2025-01-01';","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pattern 3: dbt Transformation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- models/staging/stg_customers.sql\nWITH source AS (SELECT * FROM {{ source('raw', 'customers') }}),\ncleaned AS (\n SELECT customer_id, UPPER(customer_name) AS customer_name\n FROM source WHERE customer_id IS NOT NULL\n)\nSELECT * FROM cleaned","type":"text"}]},{"type":"paragraph","content":[{"text":"For complete examples, see ","type":"text"},{"text":"examples/","type":"text","marks":[{"type":"link","attrs":{"href":"examples/","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Best Practices","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Start simple:","type":"text","marks":[{"type":"strong"}]},{"text":" Avoid over-engineering; begin with warehouse or basic lakehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Invest in governance early:","type":"text","marks":[{"type":"strong"}]},{"text":" Catalog, lineage, quality from day one","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Medallion architecture:","type":"text","marks":[{"type":"strong"}]},{"text":" Use bronze-silver-gold for clear quality layers","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Open table formats:","type":"text","marks":[{"type":"strong"}]},{"text":" Prefer Iceberg or Delta Lake to avoid vendor lock-in","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Assess mesh readiness:","type":"text","marks":[{"type":"strong"}]},{"text":" Don't decentralize prematurely (\u003c500 people)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Automate quality:","type":"text","marks":[{"type":"strong"}]},{"text":" Integrate tests (Great Expectations, dbt) into CI/CD","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Monitor pipelines:","type":"text","marks":[{"type":"strong"}]},{"text":" Observability is critical (freshness, quality, health)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Document as code:","type":"text","marks":[{"type":"strong"}]},{"text":" Use dbt docs, DataHub, YAML for self-service","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Incremental loading:","type":"text","marks":[{"type":"strong"}]},{"text":" Only load new/changed data (watermark columns)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Business alignment:","type":"text","marks":[{"type":"strong"}]},{"text":" Align architecture to outcomes, not just technologies","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Anti-Patterns","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ Data swamp: Lake without governance or cataloging","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ Premature mesh: Mesh before organizational readiness","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ Tool sprawl: Too many tools without integration","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ No quality checks: \"Garbage in, garbage out\"","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ Centralized bottleneck: Single team in large org (>500 people)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ Vendor lock-in: Proprietary formats without migration path","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ No lineage: Can't answer \"where did this come from?\"","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"❌ Over-engineering: Complex architecture for simple use cases","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Integration with Other Skills","type":"text"}]},{"type":"paragraph","content":[{"text":"Direct Dependencies:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ingesting-data:","type":"text","marks":[{"type":"strong"}]},{"text":" ETL/ELT mechanics, Fivetran, Airbyte implementation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"data-transformation:","type":"text","marks":[{"type":"strong"}]},{"text":" dbt and Dataform detailed implementation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"streaming-data:","type":"text","marks":[{"type":"strong"}]},{"text":" Kafka, Flink for real-time pipelines","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Complementary:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"databases-relational:","type":"text","marks":[{"type":"strong"}]},{"text":" PostgreSQL, MySQL as source systems","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"databases-document:","type":"text","marks":[{"type":"strong"}]},{"text":" MongoDB, DynamoDB as sources","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ai-data-engineering:","type":"text","marks":[{"type":"strong"}]},{"text":" Feature stores, ML training pipelines","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"designing-distributed-systems:","type":"text","marks":[{"type":"strong"}]},{"text":" CAP theorem, consistency models","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"observability:","type":"text","marks":[{"type":"strong"}]},{"text":" Monitoring pipeline health, data quality metrics","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Downstream:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"visualizing-data:","type":"text","marks":[{"type":"strong"}]},{"text":" BI and dashboard patterns","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"sql-optimization:","type":"text","marks":[{"type":"strong"}]},{"text":" Query performance tuning","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Common Workflows:","type":"text","marks":[{"type":"strong"}]}]},{"type":"paragraph","content":[{"text":"End-to-End Analytics:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"data-architecture (warehouse) → ingesting-data (Fivetran) →\ndata-transformation (dbt) → visualizing-data (Tableau)","type":"text"}]},{"type":"paragraph","content":[{"text":"Data Platform for AI/ML:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"data-architecture (lakehouse) → ingesting-data (Kafka) →\ndata-transformation (dbt features) → ai-data-engineering (feature store)","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Further Reading","type":"text"}]},{"type":"paragraph","content":[{"text":"Reference Files:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"decision-frameworks.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/decision-frameworks.md","title":null}}]},{"text":" - All 4 decision frameworks in detail","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"storage-paradigms.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/storage-paradigms.md","title":null}}]},{"text":" - Lake vs warehouse vs lakehouse","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"modeling-approaches.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/modeling-approaches.md","title":null}}]},{"text":" - Dimensional, normalized, data vault, wide","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"data-mesh-guide.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/data-mesh-guide.md","title":null}}]},{"text":" - Data mesh principles and implementation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"medallion-pattern.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/medallion-pattern.md","title":null}}]},{"text":" - Bronze, silver, gold layers","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"table-formats.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/table-formats.md","title":null}}]},{"text":" - Iceberg, Delta Lake, Hudi comparison","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"tool-recommendations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/tool-recommendations.md","title":null}}]},{"text":" - Tool analysis and recommendations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"modern-data-stack.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/modern-data-stack.md","title":null}}]},{"text":" - Tool categories and selection","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"governance-patterns.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/governance-patterns.md","title":null}}]},{"text":" - Catalog, lineage, quality, access control","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scenarios.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/scenarios.md","title":null}}]},{"text":" - Startup, enterprise, data mesh scenarios","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Examples:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"examples/dbt-project/","type":"text","marks":[{"type":"link","attrs":{"href":"examples/dbt-project/","title":null}}]},{"text":" - dbt project with medallion architecture","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"External Resources:","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Apache Iceberg: https://iceberg.apache.org/","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"dbt Documentation: https://docs.getdbt.com/","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data Mesh (Zhamak Dehghani): https://www.datamesh-architecture.com/","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Databricks Medallion: https://www.databricks.com/glossary/medallion-architecture","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"architecting-data","author":"@skillopedia","source":{"stars":368,"repo_name":"ai-design-components","origin_url":"https://github.com/ancoleman/ai-design-components/blob/HEAD/skills/architecting-data/SKILL.md","repo_owner":"ancoleman","body_sha256":"7e7d41c055208a3c7e97c721f5c292e00b34f76cdf1b5754d85a450ba8aa51e7","cluster_key":"372356d3b76eea4836a45d9f134dc58dbb2db30f84385bcf8103de9d22e54d64","clean_bundle":{"format":"clean-skill-bundle-v1","source":"ancoleman/ai-design-components/skills/architecting-data/SKILL.md","attachments":[{"id":"439a7ec9-f929-57e6-9e3d-07bd917742ac","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/439a7ec9-f929-57e6-9e3d-07bd917742ac/attachment.md","path":"examples/dbt-project/README.md","size":925,"sha256":"a5d749e2c2186140ac7bf2c16ab5e58b854a1a580ed86893649daab095d2e01d","contentType":"text/markdown; charset=utf-8"},{"id":"d343b672-abe0-535f-9e7e-3053b1e1cba0","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/d343b672-abe0-535f-9e7e-3053b1e1cba0/attachment.sql","path":"examples/dbt-project/stg_customers.sql","size":380,"sha256":"d34440f6c7940c1c35b2f62d1992c3900dc1ab22be410330cfc86b2208d3a077","contentType":"application/sql"},{"id":"7e8b058f-91e5-50eb-a687-10139124defd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/7e8b058f-91e5-50eb-a687-10139124defd/attachment.yaml","path":"outputs.yaml","size":7730,"sha256":"af4b9dde91986da294b2e49bcc4012c6fdcf5208c84f9fe7cc885253e7e6437d","contentType":"application/yaml; charset=utf-8"},{"id":"4e9be782-563c-56b1-b447-9757dd889b0a","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/4e9be782-563c-56b1-b447-9757dd889b0a/attachment.md","path":"references/data-mesh-guide.md","size":3610,"sha256":"ab6e0584a01130862b841789fb15a057be865166794eccfd686b8b895bf447d0","contentType":"text/markdown; charset=utf-8"},{"id":"5db865c6-ebee-50a8-91f2-82b030820f34","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/5db865c6-ebee-50a8-91f2-82b030820f34/attachment.md","path":"references/decision-frameworks.md","size":14768,"sha256":"171e6b1d3bf4d0e6c8c5d4206b96ba8b2a7dfa3eb22ac58d9bf17de6fcce6ef0","contentType":"text/markdown; charset=utf-8"},{"id":"369e6568-8a5d-5a24-a0e2-3d05444146bd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/369e6568-8a5d-5a24-a0e2-3d05444146bd/attachment.md","path":"references/governance-patterns.md","size":3743,"sha256":"14bd517a7fd505729be392e04a9e89348d4a11e2a8960635dd5394e2c312adb6","contentType":"text/markdown; charset=utf-8"},{"id":"c472f48d-7467-5507-b5b6-9ac073f545d5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/c472f48d-7467-5507-b5b6-9ac073f545d5/attachment.md","path":"references/medallion-pattern.md","size":3319,"sha256":"6a10247a550b37936483d14e74cb0ef666ff3b4ba58ffd18cae08f81ef521128","contentType":"text/markdown; charset=utf-8"},{"id":"e5ecc598-893d-547a-9ad5-7d2ff63ddc69","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/e5ecc598-893d-547a-9ad5-7d2ff63ddc69/attachment.md","path":"references/modeling-approaches.md","size":13076,"sha256":"b9fa8c0b89096c60db04afd1741bb36874fb083d84b905a762d962e27d8bda6b","contentType":"text/markdown; charset=utf-8"},{"id":"9ee82957-84b7-5713-9202-6593bfaf8145","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9ee82957-84b7-5713-9202-6593bfaf8145/attachment.md","path":"references/modern-data-stack.md","size":5942,"sha256":"f43aa907ca5c6918d6e3d6bfe19c507dcdc054f902e01673692f4f2ccfdb21e0","contentType":"text/markdown; charset=utf-8"},{"id":"bbdff3b7-69cd-59bc-96db-fee33fb9d359","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/bbdff3b7-69cd-59bc-96db-fee33fb9d359/attachment.md","path":"references/scenarios.md","size":7299,"sha256":"18df14ded798038da2a687ff98a03c71cc4dd2f0a80731ba01b809adf4b9fc20","contentType":"text/markdown; charset=utf-8"},{"id":"c0cec370-cb23-5292-bbca-bb5a1ab771dc","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/c0cec370-cb23-5292-bbca-bb5a1ab771dc/attachment.md","path":"references/storage-paradigms.md","size":14247,"sha256":"9bbae7ea9392bcbfc09111c8b83fbc48f4c9578729e8bfb1092d025fe7ffc38a","contentType":"text/markdown; charset=utf-8"},{"id":"bb1fe8da-3050-5fe9-a64e-989844b6dc3e","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/bb1fe8da-3050-5fe9-a64e-989844b6dc3e/attachment.md","path":"references/table-formats.md","size":1541,"sha256":"938864dd203fc0218b53050d7993c3947b3ff3fbed4beeea70f7795cfc1ed3cd","contentType":"text/markdown; charset=utf-8"},{"id":"144683dc-1568-5156-af7d-eea11d0635cd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/144683dc-1568-5156-af7d-eea11d0635cd/attachment.md","path":"references/tool-recommendations.md","size":11130,"sha256":"2b01fd358044ee6e0daa045c05bcbc22a968c84f65416f14c60668c8451c1721","contentType":"text/markdown; charset=utf-8"}],"bundle_sha256":"ef3ee257fc3e22efa37502d4dfb3dba30c64602627cfac33ea49d4e0f8dc32d2","attachment_count":13,"text_attachments":13,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"skills/architecting-data/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"web-development","category_label":"Web"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"web-development","import_tag":"clean-skills-v1","description":"Strategic guidance for designing modern data platforms, covering storage paradigms (data lake, warehouse, lakehouse), modeling approaches (dimensional, normalized, data vault, wide tables), data mesh principles, and medallion architecture patterns. Use when architecting data platforms, choosing between centralized vs decentralized patterns, selecting table formats (Iceberg, Delta Lake), or designing data governance frameworks."}},"renderedAt":1782979678273}

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.