databricks-dbsql — Skillopedia

Databricks SQL (DBSQL) - Advanced Features Quick Reference | Feature | Key Syntax | Since | Reference | |---------|-----------|-------|-----------| | SQL Scripting | , , | DBR 16.3+ | references/sql-scripting.md | | Stored Procedures | , | DBR 17.0+ | references/sql-scripting.md | | Recursive CTEs | | DBR 17.0+ | references/sql-scripting.md | | Transactions | | Preview | references/sql-scripting.md | | Materialized Views | | Pro/Serverless | references/materialized-views-pipes.md | | Temp Tables | | All | references/materialized-views-pipes.md | | Pipe Syntax | operator | DBR 16.1+ | referenc…

, CAST(remote_orders.total_spend AS STRING),\n ' in segment ', local_profiles.segment),\n ARRAY('high_value', 'medium_value', 'low_value', 'at_risk')\n ) AS value_tier\nFROM remote_query(\n 'my_postgres',\n database => 'sales_db',\n query => 'SELECT customer_id, SUM(amount) as total_spend FROM orders GROUP BY customer_id'\n) remote_orders\nJOIN catalog.schema.customer_profiles local_profiles\n ON remote_orders.customer_id = local_profiles.customer_id;\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":43379,"content_sha256":"cb6d3f1b4075a99519be7acbe500efbc07390c1433a4fe73632c54daf61a4532"},{"filename":"references/best-practices.md","content":"# Data Modeling and DBSQL Best Practices\n\nComprehensive reference for data modeling patterns, DBSQL performance optimization, and operational best practices on the Databricks Lakehouse Platform.\n\n---\n\n## Data Modeling Best Practices\n\n### Star Schema vs Denormalization in the Lakehouse\n\nThe Databricks Lakehouse fully supports dimensional modeling. Star schemas translate well to Delta tables and often deliver superior performance compared to fully denormalized approaches.\n\n**Star Schema (Dimensional Modeling):**\n- Central fact table linked to multiple denormalized dimension tables\n- Optimizes for complex analytics and multi-dimensional aggregations\n- Provides intuitive business process mapping and scales well with SCDs\n- Supports up to ~10 filtering dimensions (5 tables x 2 clustering keys each)\n- Clear separation of concerns enables fine-grained governance\n\n**One Big Table (OBT):**\n- Single wide table with all attributes pre-joined\n- Eliminates joins, simpler governance (one table to manage)\n- Liquid Clustering limited to 1-4 keys, so effective filtering is limited to 1-3 dimensions\n- Full table scans become bottlenecks as data grows\n- Lacks structured business process mapping\n- Complicates fine-grained access controls and data quality checks\n\n**Key finding:** In benchmarks, dimensional models outperformed OBT (2.6s vs 3.5s) despite requiring joins, because fewer files needed to be scanned. However, with Liquid Clustering applied, OBT achieved >3x improvement (down to 1.13s). Both approaches achieve sub-500ms with automatic caching.\n\n**Recommended approach:** Use a hybrid medallion architecture:\n- Silver layer: OBT or Data Vault for rapid integration and cleansing\n- Gold layer: Star schema dimensional models as the curated, business-ready presentation layer for BI and reporting\n\n### When to Normalize vs Denormalize\n\n| Use Case | Approach |\n|---|---|\n| Gold layer for BI reporting | Star schema (denormalized dimensions, normalized facts) |\n| Silver layer data integration | Normalized or Data Vault |\n| Single-use IoT/logging analytics | OBT (filter by 1-3 dimensions) |\n| Multi-dimensional business analysis | Star schema |\n| Rapidly evolving schemas | OBT in Silver, star schema in Gold |\n| High-cardinality filtering (5+ dimensions) | Star schema with Liquid Clustering per table |\n\n**Rule of thumb:** Dimension tables should be highly denormalized (flatten many-to-one relationships within a single dimension table). Fact tables should remain normalized at the grain of the business event.\n\n### Kimball-Style Modeling in Databricks\n\nKimball dimensional modeling is the recommended approach for the Gold layer in the Lakehouse:\n\n1. **Identify the business process** (sales, orders, shipments)\n2. **Declare the grain** (one row per transaction, per day, etc.)\n3. **Choose dimensions** (who, what, where, when, why, how)\n4. **Identify facts** (measurable numeric values at the declared grain)\n\n**Databricks-specific implementation details:**\n- Use Unity Catalog for organizing dimensional models (catalog.schema.table)\n- Define PRIMARY KEY constraints on dimension surrogate keys\n- Define FOREIGN KEY constraints on fact table dimension keys for query optimization\n- Add COMMENT on all tables and columns for discoverability\n- Apply TAGS for governance (e.g., PII tagging) to enable downstream AI/BI capabilities\n- Use `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` on dimension keys to support Adaptive Query Execution\n\n**Key principle:** \"The better you model your data upfront, the more easily you can leverage AI on top of it out of the box.\" Proper schema design enables downstream AI/BI capabilities.\n\n### Fact Table Patterns\n\n**Design rules:**\n- Store quantitative, numeric measures at the most granular transactional level\n- Use DECIMAL instead of floating-point numbers for financial data\n- Include foreign keys referencing dimension tables\n- Include degenerate dimensions (source-system identifiers like order numbers)\n- Transactional fact tables are typically not updated or versioned\n- Cluster fact tables by foreign keys to frequently joined dimensions\n\n**Types of fact tables:**\n- **Transaction facts:** One row per event (most common)\n- **Periodic snapshot facts:** One row per entity per time period\n- **Accumulating snapshot facts:** One row per entity lifecycle, updated as milestones are reached\n\n**Fact table Liquid Clustering strategy:**\n```sql\nCREATE TABLE gold.sales.fact_orders (\n order_key BIGINT GENERATED ALWAYS AS IDENTITY,\n customer_key BIGINT NOT NULL,\n product_key BIGINT NOT NULL,\n date_key INT NOT NULL,\n order_amount DECIMAL(18,2),\n quantity INT,\n CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES gold.sales.dim_customer(customer_key),\n CONSTRAINT fk_product FOREIGN KEY (product_key) REFERENCES gold.sales.dim_product(product_key)\n)\nCLUSTER BY (date_key, customer_key);\n```\n\n### Dimension Table Patterns\n\n**Design rules:**\n- Use `GENERATED ALWAYS AS IDENTITY` or hash values for surrogate keys\n- Prefer integer surrogate keys over strings for join performance\n- Highly denormalize: flatten many-to-one relationships within a single dimension table\n- Support complex types: MAP for extensibility, STRUCT for nested attributes, ARRAY for multi-valued attributes\n- Avoid using ARRAY/MAP columns as filter predicates (they lack column-level statistics for data skipping)\n- Cluster dimension tables by primary key plus common filter columns\n\n**Dimension table example:**\n```sql\nCREATE TABLE gold.sales.dim_customer (\n customer_key BIGINT GENERATED ALWAYS AS IDENTITY,\n customer_id STRING NOT NULL COMMENT 'Natural key from source system',\n full_name STRING,\n email STRING,\n city STRING,\n state STRING,\n country STRING,\n segment STRING,\n effective_start_date TIMESTAMP,\n effective_end_date TIMESTAMP,\n is_current BOOLEAN,\n CONSTRAINT pk_customer PRIMARY KEY (customer_key)\n)\nCLUSTER BY (customer_key, segment)\nCOMMENT 'Customer dimension with SCD Type 2 history tracking';\n```\n\n### Slowly Changing Dimensions (SCD) Patterns\n\n**SCD Type 1 (Overwrite):**\n- In-place updates without tracking history\n- Use MERGE INTO with matched UPDATE\n- Suitable for corrections or attributes where history is not needed\n\n**SCD Type 2 (History Tracking):**\n- Version records with surrogate keys and metadata columns\n- Include `effective_start_date`, `effective_end_date`, and `is_current` columns\n- Use MERGE INTO for implementing SCD Type 2 logic in DBSQL\n\n**SCD Type 2 with MERGE:**\n```sql\nMERGE INTO gold.sales.dim_customer AS target\nUSING (\n SELECT * FROM silver.crm.customers_changes\n) AS source\nON target.customer_id = source.customer_id AND target.is_current = TRUE\nWHEN MATCHED AND (\n target.full_name != source.full_name OR\n target.city != source.city\n) THEN UPDATE SET\n effective_end_date = current_timestamp(),\n is_current = FALSE\nWHEN NOT MATCHED THEN INSERT (\n customer_id, full_name, email, city, state, country, segment,\n effective_start_date, effective_end_date, is_current\n) VALUES (\n source.customer_id, source.full_name, source.email,\n source.city, source.state, source.country, source.segment,\n current_timestamp(), NULL, TRUE\n);\n-- Then insert new versions for changed records in a second pass\n```\n\n**Delta Lake Time Travel** enables historical data access within configured log retention periods as a complementary feature to SCD.\n\n### Partitioning Strategies\n\n**Databricks recommends Liquid Clustering over traditional partitioning for all new tables.**\n\nTraditional partitioning rules of thumb (when needed):\n- Keep partition count under 10,000 (ideally under 5,000 distinct values)\n- Each partition should contain at least 1 GB of data\n- Partition by low-cardinality columns that are frequently used in WHERE clauses (e.g., date, region)\n- Works best for highly selective single-partition queries (e.g., filter on one day)\n\n**When traditional partitioning may still be appropriate:**\n- Very large tables (hundreds of terabytes) with a clear, stable partition key\n- Queries consistently filter on the same low-cardinality column\n- Data lifecycle management requires partition-level operations\n\n### Liquid Clustering vs Traditional Partitioning\n\n**Liquid Clustering is the default recommendation for all new Delta tables**, including streaming tables and materialized views. It replaces both partitioning and Z-ORDER.\n\n| Aspect | Liquid Clustering | Partitioning + Z-ORDER |\n|---|---|---|\n| Column flexibility | Change clustering keys anytime | Partition column fixed at creation |\n| Maintenance | Incremental, automatic with predictive optimization | Manual OPTIMIZE + Z-ORDER required |\n| Filter dimensions | Best with 1-4 clustering keys | One partition key + Z-ORDER columns |\n| Write overhead | Minimal (only unclustered ZCubes reorganized) | Z-ORDER reorganizes entire table/partition |\n| Best for | Most workloads, evolving access patterns | Very large tables with stable, low-cardinality filter |\n| Performance | 30-60% query speed improvement for variable queries | Better for single-partition lookup queries |\n\n**Liquid Clustering key selection best practices:**\n- Choose columns most frequently used in query filters and joins\n- Limit to 1-4 keys (fewer is better for smaller tables under 10 TB)\n- For fact tables: cluster by the most commonly filtered foreign keys\n- For dimension tables: cluster by primary key + common filter columns\n- Too many keys dilute data skipping benefits; for tables under 10 TB, 2 keys often outperform 4\n\n**Important:** Liquid Clustering is not compatible with partitioning or Z-ORDER on the same table.\n\n### Z-Ordering Considerations\n\nZ-ORDER is the legacy approach, now superseded by Liquid Clustering:\n\n- Z-ORDER reorganizes the entire table/partition during optimization (heavier writes)\n- Does not track ZCube IDs, so every OPTIMIZE re-sorts all data\n- Better suited for read-heavy workloads where write overhead is acceptable\n- For new tables, always prefer Liquid Clustering\n\n**Migration path:** When migrating existing partitioned + Z-ORDERed tables to Liquid Clustering:\n1. Drop the partition specification\n2. Enable Liquid Clustering with chosen keys\n3. Run OPTIMIZE to incrementally cluster data\n4. Allow predictive optimization to maintain layout going forward\n\n---\n\n## DBSQL Performance\n\n### Query Optimization Tips\n\n**Engine-level optimizations (automatic in DBSQL Serverless):**\n- **Predictive Query Execution (PQE):** Monitors tasks in real time, dynamically adjusts query execution to avoid skew, spills, and unnecessary work. Unlike Adaptive Query Execution (AQE) which re-plans only after a stage completes, PQE detects issues like data skew or memory spills as they occur and replans immediately.\n- **Photon Vectorized Shuffle:** Keeps data in compact columnar format, sorts within CPU cache, and uses vectorized instructions for 1.5x higher shuffle throughput. Best for CPU-bound workloads (large joins, wide aggregations).\n- **Low Shuffle Merge:** Optimized MERGE implementation that reduces shuffle overhead for most common workloads.\n\n**Manual optimization actions:**\n- Run `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` on dimension keys and frequently filtered columns to support AQE and data skipping\n- Set `'delta.dataSkippingStatsColumns'` table property to specify which columns collect statistics\n- Define PRIMARY KEY and FOREIGN KEY constraints to help the query optimizer\n- Use deterministic queries (avoid `NOW()`, `CURRENT_TIMESTAMP()` in filters) to benefit from query result caching\n- Prefer `CREATE OR REPLACE TABLE` over delete-then-create patterns\n- Use `DECIMAL` over `FLOAT`/`DOUBLE` for financial calculations\n\n**SQL writing tips for DBSQL:**\n- Filter early, aggregate late: push WHERE clauses as close to the source as possible\n- Prefer explicit column lists over SELECT *\n- Use CTEs for readability but be aware the optimizer may inline them\n- Avoid Python/Scala UDFs when native SQL functions exist (UDFs require serialization between Python and Spark, significantly slowing queries)\n- Use window functions instead of self-joins where possible\n- Leverage QUALIFY clause for row-level filtering after window functions\n\n### Warehouse Sizing Guidance\n\n**Databricks recommends serverless SQL warehouses for most workloads.** Serverless uses Intelligent Workload Management (IWM) to automatically manage query workloads.\n\n**Sizing strategy:**\n- Start with a single larger warehouse and let serverless features manage concurrency\n- Size down if needed rather than starting small and scaling up\n- If queries spill to disk, increase the cluster size\n\n**Scaling configuration:**\n- Low concurrency (1-2 queries): keep max_clusters low\n- Unpredictable spikes: set max_num_clusters high with target_utilization ~70%\n- For dashboards with variable/infrequent load: enable aggressive auto-scaling and auto-stopping\n\n**Serverless advantages:**\n- Start and scale up in seconds\n- Scale down earlier than non-serverless warehouses\n- Pay only when queries are running\n- 30-60 second cold start latency (savings from no idle time far outweigh this)\n- All 2025 optimizations (PQE, Photon Vectorized Shuffle) are automatically available\n\n### Caching Strategies\n\n**Query Result Cache:**\n- DBSQL caches results per-cluster for all queries\n- Cache is invalidated when underlying Delta data changes\n- To maximize cache hits, use deterministic queries (no `NOW()`, `RAND()`, etc.)\n- Both OBT and star schema achieve sub-500ms with automatic caching after first run\n\n**Delta Cache (Disk Cache):**\n- Automatically caches remote data on local SSD in columnar format\n- Accelerates data reads without manual configuration on serverless warehouses\n- Particularly effective for repeated scans of the same tables\n\n**Best practice:** Design dashboards and reports to use parameterized queries that hit the same underlying patterns, maximizing cache reuse.\n\n### Photon Engine Benefits\n\nPhoton is a vectorized query engine written in C++ that runs natively on Databricks:\n\n- Enabled by default on all DBSQL serverless warehouses\n- Processes data in columnar batches using CPU vector instructions (SIMD)\n- Excels at: large joins, wide aggregations, string processing, data shuffles\n- 2025 vectorized shuffle delivers 1.5x higher shuffle throughput\n- Combined with PQE, delivers up to 25% faster queries on top of existing 5x gains\n\n### Recent Performance Improvements (2025)\n\n| Improvement | Impact |\n|---|---|\n| Overall production workloads | Up to 40% faster (automatic, no tuning) |\n| Photon Vectorized Shuffle | 1.5x higher shuffle throughput |\n| PQE + Photon Vectorized Shuffle combined | Up to 25% faster on top of existing 5x gains |\n| Spatial SQL queries | Up to 17x faster (R-tree indexing, optimized spatial joins) |\n| AI functions | Up to 85x faster for large batch workloads |\n| End-to-end Unity Catalog latency | Up to 10x improvement |\n| 3-year cumulative improvement | 5x faster across customer workloads |\n\nAll improvements are live in DBSQL Serverless with nothing to enable.\n\n### Cost Optimization Patterns\n\n1. **Use serverless SQL warehouses:** Pay only when queries run, auto-scale and auto-stop\n2. **Enable predictive optimization:** Automatically runs OPTIMIZE and VACUUM on Unity Catalog managed tables\n3. **Right-size warehouses:** Start larger, scale down based on actual usage patterns\n4. **Avoid idle warehouses:** Use aggressive auto-stop for dashboards with infrequent load\n5. **Leverage caching:** Design deterministic queries to maximize result cache hits\n6. **Use Liquid Clustering:** Reduces scan volume, fewer DBUs consumed per query\n7. **Collect statistics:** `ANALYZE TABLE` enables better query plans, reducing wasted compute\n8. **Monitor with Query Profile:** Identify expensive operations, spills, and skew\n9. **Use materialized views** for frequently computed aggregations\n10. **Avoid UDFs:** Native functions are dramatically faster, no serialization overhead\n\n---\n\n## Delta Lake Optimization for DBSQL\n\n### OPTIMIZE, VACUUM, and ANALYZE\n\n**Recommended execution order:** OPTIMIZE -> VACUUM -> ANALYZE\n\n**OPTIMIZE:**\n- Compacts small files into larger ones (target 1 GB by default)\n- Run frequently on tables with many small files (especially after streaming writes)\n- Configurable target size via `delta.targetFileSize` table property\n- With Liquid Clustering: only reorganizes unclustered ZCubes (incremental)\n\n**VACUUM:**\n- Removes old files no longer in the transaction log\n- Reduces storage costs\n- Use compute-optimized instances (AWS C5, Azure F-series, GCP C2)\n- Default retention: 7 days (configurable via `delta.deletedFileRetentionDuration`)\n- Never set retention below the longest-running query duration\n\n**ANALYZE TABLE:**\n- Computes column-level statistics for query optimization\n- Run immediately after table overwrites or major data changes\n- Focus on columns used in WHERE clauses, JOINs, and GROUP BY\n\n**Predictive optimization (recommended):**\n\n> **Note:** On serverless SQL warehouses, `delta.enableOptimizeWrite` and `delta.autoOptimize.autoCompact` are managed automatically and cannot be set manually (they will raise `DELTA_UNKNOWN_CONFIGURATION`). The properties below apply only to classic compute. For serverless, simply enable predictive optimization at the catalog/schema level.\n\n```sql\n-- Classic compute only:\nALTER TABLE catalog.schema.table_name\nSET TBLPROPERTIES ('delta.enableOptimizeWrite' = 'true');\n-- For Unity Catalog managed tables, predictive optimization\n-- handles OPTIMIZE and VACUUM automatically\n```\n\n### File Size and Compaction\n\n- **Auto-compaction:** Combines small files within partitions automatically after writes\n- **Optimized writes:** Rebalances data via shuffle before writing to reduce small files\n- **Target file size:** Default 1 GB; adjust with `delta.targetFileSize` for specific workloads\n- For tables with many small files (streaming ingestion), schedule regular OPTIMIZE jobs\n\n### Table Properties for Performance\n\n> **Note:** `delta.enableOptimizeWrite` and `delta.autoOptimize.autoCompact` are only valid on classic compute. On serverless SQL warehouses, these are managed automatically and setting them raises `DELTA_UNKNOWN_CONFIGURATION`. The remaining properties work on both classic and serverless.\n\n```sql\n-- Classic compute only (serverless manages these automatically):\n-- 'delta.enableOptimizeWrite' = 'true',\n-- 'delta.autoOptimize.autoCompact' = 'true',\n\n-- Works on both classic and serverless:\nALTER TABLE catalog.schema.my_table SET TBLPROPERTIES (\n 'delta.columnMapping.mode' = 'name',\n 'delta.enableChangeDataFeed' = 'true',\n 'delta.deletedFileRetentionDuration' = '30 days',\n 'delta.dataSkippingStatsColumns' = 'col1,col2,col3'\n);\n```\n\n---\n\n## Unity Catalog Integration Patterns\n\n### Organization Best Practices\n\n- Use a three-level namespace: `catalog.schema.table`\n- Organize by environment (dev/staging/prod) at the catalog level\n- Organize by business domain at the schema level\n- Use managed tables (not external) to benefit from predictive optimization and enhanced governance\n\n### Governance Features for Data Modeling\n\n- **Primary/Foreign Key constraints:** Inform the query optimizer about table relationships\n- **Row filters and column masks:** Fine-grained access control at the table level\n- **Tags:** Apply governance tags (e.g., PII, sensitivity level) to tables and columns\n- **Comments:** Document all tables and columns for AI/BI discoverability\n- **Lineage tracking:** Automatic lineage for understanding data flow through the medallion architecture\n\n### Entity Relationship Visualization\n\nUnity Catalog renders entity relationship diagrams when primary and foreign key constraints are defined, providing visual documentation of the dimensional model.\n\n---\n\n## Monitoring and Observability\n\n- **Query Profile:** Analyze execution plans, identify bottlenecks, spills, and data skew\n- **Query History:** Track query performance trends over time\n- **Warehouse monitoring:** Track utilization, queue times, and scaling events\n- **System tables:** Query `system.billing`, `system.access`, and `system.query` for operational insights\n- **Alerts:** Set up SQL alerts for data quality checks and SLA monitoring\n\n---\n\n## Common Anti-Patterns to Avoid\n\n### Data Modeling Anti-Patterns\n\n1. **Skipping dimensional modeling in Gold layer:** OBTs are fine for Silver, but Gold should use star schemas for multi-dimensional analysis\n2. **Over-partitioning:** More than 5,000-10,000 partitions degrades performance; use Liquid Clustering instead\n3. **String surrogate keys:** Use integer IDENTITY columns for better join performance\n4. **Missing constraints:** Not defining PK/FK constraints deprives the optimizer of relationship information\n5. **Missing comments and tags:** Reduces discoverability for AI/BI tools and governance\n6. **Using FLOAT for financial data:** Use DECIMAL to avoid precision errors\n7. **Filtering on ARRAY/MAP columns:** These types lack column-level statistics for data skipping\n\n### Query and Performance Anti-Patterns\n\n1. **Delete-then-recreate tables:** Use `CREATE OR REPLACE TABLE` instead to preserve time travel and avoid reader interruptions\n2. **Python/Scala UDFs when native functions exist:** Serialization overhead dramatically slows queries\n3. **Not collecting statistics:** Missing `ANALYZE TABLE` leads to suboptimal query plans\n4. **Non-deterministic functions in cached queries:** `NOW()`, `RAND()` etc. prevent query result caching\n5. **Partitioning by wrong column:** Partitioning by a column not used in filters causes full scans\n6. **Too many Liquid Clustering keys:** For tables under 10 TB, 2 keys often outperform 4 keys\n7. **Manual OPTIMIZE/VACUUM without predictive optimization:** Enable predictive optimization for Unity Catalog managed tables\n\n### Operational Anti-Patterns\n\n1. **Idle warehouses:** Always enable auto-stop; use serverless for variable workloads\n2. **Under-sized warehouses:** Queries spilling to disk waste more DBUs than a larger warehouse\n3. **External tables when managed will do:** External tables miss predictive optimization and enhanced governance\n4. **Skipping VACUUM:** Unbounded file growth increases storage costs and slows metadata operations\n5. **Running VACUUM with too-short retention:** Can break long-running queries and time travel\n\n---\n\n## Quick Reference: SQL Patterns for AI Agents\n\nWhen generating SQL for Databricks, prefer these patterns:\n\n```sql\n-- Use CREATE OR REPLACE (not DROP + CREATE)\nCREATE OR REPLACE TABLE catalog.schema.my_table AS\nSELECT ...;\n\n-- Use MERGE for upserts (not DELETE + INSERT)\nMERGE INTO target USING source\nON target.key = source.key\nWHEN MATCHED THEN UPDATE SET ...\nWHEN NOT MATCHED THEN INSERT ...;\n\n-- Use QUALIFY for window function filtering (not subquery)\nSELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn\nFROM my_table\nQUALIFY rn = 1;\n\n-- Use DECIMAL for money\nSELECT CAST(amount AS DECIMAL(18,2)) AS revenue FROM orders;\n\n-- Collect statistics after loading\nANALYZE TABLE catalog.schema.my_table COMPUTE STATISTICS FOR ALL COLUMNS;\n\n-- Enable predictive optimization (classic compute only; serverless manages this automatically)\nALTER TABLE catalog.schema.my_table\nSET TBLPROPERTIES ('delta.enableOptimizeWrite' = 'true');\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":23174,"content_sha256":"fabd6fa7457fe06373fd9c26925dcd3f104cf4eead870c80cda50d9f51deebfe"},{"filename":"references/geospatial-collations.md","content":"# Geospatial SQL and Collations in Databricks SQL\n\n---\n\n## Part 1: Geospatial SQL\n\nDatabricks SQL provides comprehensive geospatial support through two function families: **H3 functions** for hexagonal grid indexing and **ST functions** for standard spatial operations. Together they enable high-performance geospatial analytics at scale.\n\n### Geospatial Data Types\n\n| Type | Description | Coordinate System | SRID Support |\n|------|-------------|-------------------|--------------|\n| `GEOMETRY` | Spatial objects using Euclidean coordinates (X, Y, optional Z) -- treats Earth as flat | Any projected CRS | 11,000+ SRIDs |\n| `GEOGRAPHY` | Geographic objects on Earth's surface using longitude/latitude | WGS 84 | SRID 4326 only |\n\n**When to use which:**\n- Use `GEOMETRY` for projected coordinate systems, Euclidean distance calculations, and when working with local/regional data in meters or feet.\n- Use `GEOGRAPHY` for global data using longitude/latitude coordinates and spherical distance calculations.\n\n### Supported Geometry Subtypes\n\nBoth `GEOMETRY` and `GEOGRAPHY` support: **Point**, **LineString**, **Polygon**, **MultiPoint**, **MultiLineString**, **MultiPolygon**, and **GeometryCollection**.\n\n### Format Support\n\n| Format | Description | Import Function | Export Function |\n|--------|-------------|-----------------|-----------------|\n| WKT | Well-Known Text | `ST_GeomFromWKT`, `ST_GeogFromWKT` | `ST_AsWKT`, `ST_AsText` |\n| WKB | Well-Known Binary | `ST_GeomFromWKB`, `ST_GeogFromWKB` | `ST_AsWKB`, `ST_AsBinary` |\n| EWKT | Extended WKT (includes SRID) | `ST_GeomFromEWKT`, `ST_GeogFromEWKT` | `ST_AsEWKT` |\n| EWKB | Extended WKB (includes SRID) | `ST_GeomFromEWKB` | `ST_AsEWKB` |\n| GeoJSON | JSON-based format | `ST_GeomFromGeoJSON`, `ST_GeogFromGeoJSON` | `ST_AsGeoJSON` |\n| Geohash | Hierarchical grid encoding | `ST_GeomFromGeoHash`, `ST_PointFromGeoHash` | `ST_GeoHash` |\n\n---\n\n### H3 Geospatial Functions\n\nH3 is Uber's hexagonal hierarchical spatial index. It divides the Earth into hexagonal cells at 16 resolutions (0-15). Available since Databricks Runtime 11.2 (H3 Java library 3.7.0). No separate installation required.\n\n#### H3 Import Functions (Coordinate/Geometry to H3)\n\n| Function | Description | Returns |\n|----------|-------------|---------|\n| `h3_longlatash3(lon, lat, resolution)` | Convert longitude/latitude to H3 cell ID | `BIGINT` |\n| `h3_longlatash3string(lon, lat, resolution)` | Convert longitude/latitude to H3 cell ID | `STRING` (hex) |\n| `h3_pointash3(geogExpr, resolution)` | Convert GEOGRAPHY point to H3 cell ID | `BIGINT` |\n| `h3_pointash3string(geogExpr, resolution)` | Convert GEOGRAPHY point to H3 cell ID | `STRING` (hex) |\n| `h3_polyfillash3(geogExpr, resolution)` | Fill polygon with contained H3 cells | `ARRAY\u003cBIGINT>` |\n| `h3_polyfillash3string(geogExpr, resolution)` | Fill polygon with contained H3 cells | `ARRAY\u003cSTRING>` |\n| `h3_coverash3(geogExpr, resolution)` | Cover geography with minimal set of H3 cells | `ARRAY\u003cBIGINT>` |\n| `h3_coverash3string(geogExpr, resolution)` | Cover geography with minimal set of H3 cells | `ARRAY\u003cSTRING>` |\n| `h3_tessellateaswkb(geogExpr, resolution)` | Tessellate geography using H3 cells | `ARRAY\u003cSTRUCT>` |\n| `h3_try_polyfillash3(geogExpr, resolution)` | Safe polyfill (returns NULL on error) | `ARRAY\u003cBIGINT>` |\n| `h3_try_polyfillash3string(geogExpr, resolution)` | Safe polyfill (returns NULL on error) | `ARRAY\u003cSTRING>` |\n| `h3_try_coverash3(geogExpr, resolution)` | Safe cover (returns NULL on error) | `ARRAY\u003cBIGINT>` |\n| `h3_try_coverash3string(geogExpr, resolution)` | Safe cover (returns NULL on error) | `ARRAY\u003cSTRING>` |\n| `h3_try_tessellateaswkb(geogExpr, resolution)` | Safe tessellate (returns NULL on error) | `ARRAY\u003cSTRUCT>` |\n\n#### H3 Export Functions (H3 to Geometry/Format)\n\n| Function | Description | Returns |\n|----------|-------------|---------|\n| `h3_boundaryaswkt(h3CellId)` | H3 cell boundary as WKT polygon | `STRING` |\n| `h3_boundaryaswkb(h3CellId)` | H3 cell boundary as WKB polygon | `BINARY` |\n| `h3_boundaryasgeojson(h3CellId)` | H3 cell boundary as GeoJSON | `STRING` |\n| `h3_centeraswkt(h3CellId)` | H3 cell center as WKT point | `STRING` |\n| `h3_centeraswkb(h3CellId)` | H3 cell center as WKB point | `BINARY` |\n| `h3_centerasgeojson(h3CellId)` | H3 cell center as GeoJSON point | `STRING` |\n\n#### H3 Conversion Functions\n\n| Function | Description |\n|----------|-------------|\n| `h3_h3tostring(h3CellId)` | Convert BIGINT cell ID to hex STRING |\n| `h3_stringtoh3(h3CellIdString)` | Convert hex STRING to BIGINT cell ID |\n\n#### H3 Hierarchy / Traversal Functions\n\n| Function | Description |\n|----------|-------------|\n| `h3_resolution(h3CellId)` | Get the resolution of a cell |\n| `h3_toparent(h3CellId, resolution)` | Get parent cell at coarser resolution |\n| `h3_tochildren(h3CellId, resolution)` | Get all child cells at finer resolution |\n| `h3_maxchild(h3CellId, resolution)` | Get child with maximum value |\n| `h3_minchild(h3CellId, resolution)` | Get child with minimum value |\n| `h3_ischildof(h3CellId1, h3CellId2)` | Test if cell1 is equal to or child of cell2 |\n\n#### H3 Distance / Neighbor Functions\n\n| Function | Description |\n|----------|-------------|\n| `h3_distance(h3CellId1, h3CellId2)` | Grid distance between two cells |\n| `h3_try_distance(h3CellId1, h3CellId2)` | Grid distance or NULL if undefined |\n| `h3_kring(h3CellId, k)` | All cells within grid distance k (filled disk) |\n| `h3_kringdistances(h3CellId, k)` | Cells within distance k with their distances |\n| `h3_hexring(h3CellId, k)` | Hollow ring of cells at exactly distance k |\n\n#### H3 Compaction Functions\n\n| Function | Description |\n|----------|-------------|\n| `h3_compact(h3CellIds)` | Compact array of cells to minimal representation |\n| `h3_uncompact(h3CellIds, resolution)` | Expand compacted cells to target resolution |\n\n#### H3 Validation Functions\n\n| Function | Description |\n|----------|-------------|\n| `h3_isvalid(expr)` | Check if BIGINT or STRING is valid H3 cell |\n| `h3_validate(h3CellId)` | Return cell ID if valid, error otherwise |\n| `h3_try_validate(h3CellId)` | Return cell ID if valid, NULL otherwise |\n| `h3_ispentagon(h3CellId)` | Check if cell is a pentagon (12 per resolution) |\n\n#### H3 Examples\n\n```sql\n-- Convert coordinates to H3 cell at resolution 9\nSELECT h3_longlatash3(-73.985428, 40.748817, 9) AS h3_cell;\n\n-- Index taxi trips by pickup location\nCREATE TABLE trips_h3 AS\nSELECT\n h3_longlatash3(pickup_longitude, pickup_latitude, 12) AS pickup_cell,\n h3_longlatash3(dropoff_longitude, dropoff_latitude, 12) AS dropoff_cell,\n *\nFROM taxi_trips;\n\n-- Fill zip code polygons with H3 cells for spatial indexing\nCREATE TABLE zipcode_h3 AS\nSELECT\n explode(h3_polyfillash3(geom_wkt, 12)) AS cell,\n zipcode, city, state\nFROM zipcodes;\n\n-- Find all trips picked up in a specific zip code using H3 join\nSELECT t.*\nFROM trips_h3 t\nINNER JOIN zipcode_h3 z ON t.pickup_cell = z.cell\nWHERE z.zipcode = '10001';\n\n-- Proximity search: find all H3 cells within 2 rings of a location\nSELECT explode(h3_kring(h3_longlatash3(-73.985, 40.748, 9), 2)) AS nearby_cell;\n\n-- Aggregate trip counts and get centroids for visualization\nSELECT\n dropoff_cell,\n h3_centerasgeojson(dropoff_cell):coordinates[0] AS lon,\n h3_centerasgeojson(dropoff_cell):coordinates[1] AS lat,\n count(*) AS trip_count\nFROM trips_h3\nGROUP BY dropoff_cell;\n\n-- Roll up to coarser resolution\nSELECT\n h3_toparent(pickup_cell, 7) AS parent_cell,\n count(*) AS trip_count\nFROM trips_h3\nGROUP BY h3_toparent(pickup_cell, 7);\n\n-- Compact a set of cells for efficient storage\nSELECT h3_compact(collect_set(cell)) AS compacted\nFROM zipcode_h3\nWHERE zipcode = '10001';\n```\n\n---\n\n### ST Geospatial Functions\n\nNative spatial SQL functions operating on `GEOMETRY` and `GEOGRAPHY` types. Requires Databricks Runtime 17.1+. Public Preview. Over 80 functions available.\n\n#### ST Import Functions (Create Geometry/Geography)\n\n| Function | Description | Output Type |\n|----------|-------------|-------------|\n| `ST_GeomFromText(wkt [, srid])` | Create GEOMETRY from WKT | `GEOMETRY` |\n| `ST_GeomFromWKT(wkt [, srid])` | Create GEOMETRY from WKT (alias) | `GEOMETRY` |\n| `ST_GeomFromWKB(wkb [, srid])` | Create GEOMETRY from WKB | `GEOMETRY` |\n| `ST_GeomFromEWKT(ewkt)` | Create GEOMETRY from Extended WKT | `GEOMETRY` |\n| `ST_GeomFromEWKB(ewkb)` | Create GEOMETRY from Extended WKB | `GEOMETRY` |\n| `ST_GeomFromGeoJSON(geojson)` | Create GEOMETRY(4326) from GeoJSON | `GEOMETRY` |\n| `ST_GeomFromGeoHash(geohash)` | Create polygon GEOMETRY from geohash | `GEOMETRY` |\n| `ST_GeogFromText(wkt)` | Create GEOGRAPHY(4326) from WKT | `GEOGRAPHY` |\n| `ST_GeogFromWKT(wkt)` | Create GEOGRAPHY(4326) from WKT | `GEOGRAPHY` |\n| `ST_GeogFromWKB(wkb)` | Create GEOGRAPHY(4326) from WKB | `GEOGRAPHY` |\n| `ST_GeogFromEWKT(ewkt)` | Create GEOGRAPHY from Extended WKT | `GEOGRAPHY` |\n| `ST_GeogFromGeoJSON(geojson)` | Create GEOGRAPHY(4326) from GeoJSON | `GEOGRAPHY` |\n| `ST_Point(x, y [, srid])` | Create point from coordinates | `GEOMETRY` |\n| `ST_PointFromGeoHash(geohash)` | Create point from geohash center | `GEOMETRY` |\n| `to_geometry(georepExpr)` | Auto-detect format and create GEOMETRY | `GEOMETRY` |\n| `to_geography(georepExpr)` | Auto-detect format and create GEOGRAPHY | `GEOGRAPHY` |\n| `try_to_geometry(georepExpr)` | Safe geometry creation (NULL on error) | `GEOMETRY` |\n| `try_to_geography(georepExpr)` | Safe geography creation (NULL on error) | `GEOGRAPHY` |\n\n#### ST Export Functions\n\n| Function | Description | Output |\n|----------|-------------|--------|\n| `ST_AsText(geo)` | Export as WKT | `STRING` |\n| `ST_AsWKT(geo)` | Export as WKT (alias) | `STRING` |\n| `ST_AsBinary(geo)` | Export as WKB | `BINARY` |\n| `ST_AsWKB(geo)` | Export as WKB (alias) | `BINARY` |\n| `ST_AsEWKT(geo)` | Export as Extended WKT | `STRING` |\n| `ST_AsEWKB(geo)` | Export as Extended WKB | `BINARY` |\n| `ST_AsGeoJSON(geo)` | Export as GeoJSON | `STRING` |\n| `ST_GeoHash(geo)` | Export as geohash string | `STRING` |\n\n#### ST Constructor Functions\n\n| Function | Description |\n|----------|-------------|\n| `ST_Point(x, y [, srid])` | Create a point geometry |\n| `ST_MakeLine(pointArray)` | Create linestring from array of points |\n| `ST_MakePolygon(outer [, innerArray])` | Create polygon from outer ring and optional holes |\n\n#### ST Accessor Functions\n\n| Function | Description | Returns |\n|----------|-------------|---------|\n| `ST_X(geo)` | X coordinate of a point | `DOUBLE` |\n| `ST_Y(geo)` | Y coordinate of a point | `DOUBLE` |\n| `ST_Z(geo)` | Z coordinate of a point | `DOUBLE` |\n| `ST_M(geo)` | M coordinate of a point | `DOUBLE` |\n| `ST_XMin(geo)` | Minimum X of bounding box | `DOUBLE` |\n| `ST_XMax(geo)` | Maximum X of bounding box | `DOUBLE` |\n| `ST_YMin(geo)` | Minimum Y of bounding box | `DOUBLE` |\n| `ST_YMax(geo)` | Maximum Y of bounding box | `DOUBLE` |\n| `ST_ZMin(geo)` | Minimum Z coordinate | `DOUBLE` |\n| `ST_ZMax(geo)` | Maximum Z coordinate | `DOUBLE` |\n| `ST_Dimension(geo)` | Topological dimension (0=point, 1=line, 2=polygon) | `INT` |\n| `ST_NDims(geo)` | Number of coordinate dimensions | `INT` |\n| `ST_NPoints(geo)` | Total number of points | `INT` |\n| `ST_NumGeometries(geo)` | Number of geometries in collection | `INT` |\n| `ST_NumInteriorRings(geo)` | Number of interior rings (polygon) | `INT` |\n| `ST_GeometryType(geo)` | Geometry type as string | `STRING` |\n| `ST_GeometryN(geo, n)` | N-th geometry (1-based) from collection | `GEOMETRY` |\n| `ST_PointN(geo, n)` | N-th point from linestring | `GEOMETRY` |\n| `ST_StartPoint(geo)` | First point of linestring | `GEOMETRY` |\n| `ST_EndPoint(geo)` | Last point of linestring | `GEOMETRY` |\n| `ST_ExteriorRing(geo)` | Outer ring of polygon | `GEOMETRY` |\n| `ST_InteriorRingN(geo, n)` | N-th interior ring of polygon | `GEOMETRY` |\n| `ST_Envelope(geo)` | Minimum bounding rectangle | `GEOMETRY` |\n| `ST_Envelope_Agg(geo)` | Aggregate: bounding box of all geometries | `GEOMETRY` |\n| `ST_Dump(geo)` | Explode multi-geometry into array of singles | `ARRAY` |\n| `ST_IsEmpty(geo)` | True if geometry has no points | `BOOLEAN` |\n\n#### ST Measurement Functions\n\n| Function | Description |\n|----------|-------------|\n| `ST_Area(geo)` | Area of a polygon (in CRS units) |\n| `ST_Length(geo)` | Length of a linestring (in CRS units) |\n| `ST_Perimeter(geo)` | Perimeter of a polygon (in CRS units) |\n| `ST_Distance(geo1, geo2)` | Cartesian distance between geometries |\n| `ST_DistanceSphere(geo1, geo2)` | Spherical distance in meters (fast, approximate) |\n| `ST_DistanceSpheroid(geo1, geo2)` | Geodesic distance in meters on WGS84 (accurate) |\n| `ST_Azimuth(geo1, geo2)` | North-based azimuth angle in radians |\n| `ST_ClosestPoint(geo1, geo2)` | Point on geo1 closest to geo2 |\n\n#### ST Topological Relationship Functions (Predicates)\n\n| Function | Description |\n|----------|-------------|\n| `ST_Contains(geo1, geo2)` | True if geo1 fully contains geo2 |\n| `ST_Within(geo1, geo2)` | True if geo1 is fully within geo2 (inverse of Contains) |\n| `ST_Intersects(geo1, geo2)` | True if geometries share any space |\n| `ST_Disjoint(geo1, geo2)` | True if geometries share no space |\n| `ST_Touches(geo1, geo2)` | True if boundaries touch but interiors do not |\n| `ST_Covers(geo1, geo2)` | True if geo1 covers geo2 (no point of geo2 is exterior) |\n| `ST_Equals(geo1, geo2)` | True if geometries are topologically equal |\n| `ST_DWithin(geo1, geo2, distance)` | True if geometries are within given distance |\n\n#### ST Overlay Functions (Set Operations)\n\n| Function | Description |\n|----------|-------------|\n| `ST_Intersection(geo1, geo2)` | Geometry of shared space |\n| `ST_Union(geo1, geo2)` | Geometry combining both inputs |\n| `ST_Union_Agg(geo)` | Aggregate: union of all geometries in column |\n| `ST_Difference(geo1, geo2)` | Geometry of geo1 minus geo2 |\n\n#### ST Processing Functions\n\n| Function | Description |\n|----------|-------------|\n| `ST_Buffer(geo, radius)` | Expand geometry by radius distance |\n| `ST_Centroid(geo)` | Center point of geometry |\n| `ST_ConvexHull(geo)` | Smallest convex polygon containing geometry |\n| `ST_ConcaveHull(geo, ratio [, allowHoles])` | Concave hull with length ratio |\n| `ST_Boundary(geo)` | Boundary of geometry (not available on all SQL Warehouse versions) |\n| `ST_Simplify(geo, tolerance)` | Simplify using Douglas-Peucker algorithm |\n\n#### ST Editor Functions\n\n| Function | Description |\n|----------|-------------|\n| `ST_AddPoint(linestring, point [, index])` | Add point to linestring |\n| `ST_RemovePoint(linestring, index)` | Remove point from linestring |\n| `ST_SetPoint(linestring, index, point)` | Replace point in linestring |\n| `ST_FlipCoordinates(geo)` | Swap X and Y coordinates |\n| `ST_Multi(geo)` | Convert single geometry to multi-geometry |\n| `ST_Reverse(geo)` | Reverse vertex order |\n\n#### ST Affine Transformation Functions\n\n| Function | Description |\n|----------|-------------|\n| `ST_Translate(geo, xOffset, yOffset [, zOffset])` | Move geometry by offset |\n| `ST_Scale(geo, xFactor, yFactor [, zFactor])` | Scale geometry by factors |\n| `ST_Rotate(geo, angle)` | Rotate geometry around origin (radians) |\n\n#### ST Spatial Reference System Functions\n\n| Function | Description |\n|----------|-------------|\n| `ST_SRID(geo)` | Get SRID of geometry |\n| `ST_SetSRID(geo, srid)` | Set SRID value (no reprojection) |\n| `ST_Transform(geo, targetSrid)` | Reproject to target coordinate system |\n\n#### ST Validation\n\n| Function | Description |\n|----------|-------------|\n| `ST_IsValid(geo)` | Check if geometry is OGC-valid |\n\n#### ST Practical Examples\n\n> **Note:** `GEOMETRY` and `GEOGRAPHY` column types in `CREATE TABLE` require serverless compute with DBR 17.1+. On SQL Warehouses that don't support these column types, use `STRING` columns with WKT representation and convert with `ST_GeomFromText()` / `ST_GeogFromText()` at query time.\n\n```sql\n-- Create a table with geometry columns (requires serverless DBR 17.1+)\nCREATE TABLE retail_stores (\n store_id INT,\n name STRING,\n location GEOMETRY\n);\n\nINSERT INTO retail_stores VALUES\n (1, 'Downtown Store', ST_Point(-73.9857, 40.7484, 4326)),\n (2, 'Midtown Store', ST_Point(-73.9787, 40.7614, 4326)),\n (3, 'Uptown Store', ST_Point(-73.9680, 40.7831, 4326));\n\n-- Create delivery zones as polygons\nCREATE TABLE delivery_zones (\n zone_id INT,\n zone_name STRING,\n boundary GEOMETRY\n);\n\nINSERT INTO delivery_zones VALUES\n (1, 'Zone A', ST_GeomFromText(\n 'POLYGON((-74.00 40.74, -73.97 40.74, -73.97 40.76, -74.00 40.76, -74.00 40.74))', 4326\n ));\n\n-- Point-in-polygon: find stores within a delivery zone\nSELECT s.name, z.zone_name\nFROM retail_stores s\nJOIN delivery_zones z\n ON ST_Contains(z.boundary, s.location);\n\n-- Distance calculation: find customers within 5km of a store\n-- Note: to_geography() expects STRING (WKT/GeoJSON) or BINARY (WKB) input, not GEOMETRY.\n-- Use ST_AsText() to convert GEOMETRY to WKT first.\nSELECT c.customer_id, c.name,\n ST_DistanceSphere(c.location, s.location) AS distance_meters\nFROM customers c\nCROSS JOIN retail_stores s\nWHERE s.store_id = 1\n AND ST_DWithin(\n ST_GeogFromText(ST_AsText(c.location)),\n ST_GeogFromText(ST_AsText(s.location)),\n 5000 -- 5km in meters\n );\n\n-- Buffer zone: create 1km buffer around a store (use projected CRS for meters)\nSELECT ST_Buffer(\n ST_Transform(location, 5070), -- project to NAD83/Albers (meters)\n 1000 -- 1000 meters\n) AS buffer_zone\nFROM retail_stores\nWHERE store_id = 1;\n\n-- Area calculation\nSELECT zone_name,\n ST_Area(ST_Transform(boundary, 5070)) AS area_sq_meters\nFROM delivery_zones;\n\n-- Union of overlapping zones\nSELECT ST_Union_Agg(boundary) AS combined_coverage\nFROM delivery_zones;\n\n-- Convert between formats\nSELECT\n ST_AsText(location) AS wkt,\n ST_AsGeoJSON(location) AS geojson,\n ST_GeoHash(location) AS geohash\nFROM retail_stores;\n\n-- Spatial join with BROADCAST hint for performance\nSELECT /*+ BROADCAST(zones) */\n c.customer_id, z.zone_name\nFROM customers c\nJOIN delivery_zones zones\n ON ST_Contains(zones.boundary, c.location);\n```\n\n### Combining H3 and ST Functions\n\n```sql\n-- Use H3 for fast pre-filtering, then ST for precise spatial operations\n-- Step 1: Index store locations with H3\nCREATE TABLE store_h3 AS\nSELECT store_id, name, location,\n h3_longlatash3(ST_X(location), ST_Y(location), 9) AS h3_cell\nFROM retail_stores;\n\n-- Step 2: Index customer locations with H3\nCREATE TABLE customer_h3 AS\nSELECT customer_id, name, location,\n h3_longlatash3(ST_X(location), ST_Y(location), 9) AS h3_cell\nFROM customers;\n\n-- Step 3: Fast proximity using H3 pre-filter + precise ST distance\nSELECT s.name AS store, c.name AS customer,\n ST_DistanceSphere(s.location, c.location) AS distance_m\nFROM store_h3 s\nJOIN customer_h3 c\n ON c.h3_cell IN (SELECT explode(h3_kring(s.h3_cell, 2)))\nWHERE ST_DistanceSphere(s.location, c.location) \u003c 2000;\n```\n\n### Spatial Join Performance\n\nDatabricks automatically optimizes spatial joins using built-in spatial indexing. Spatial predicates like `ST_Intersects`, `ST_Contains`, and `ST_Within` in JOIN conditions benefit from up to **17x performance improvement** compared to classic clusters. No code changes required -- the optimizer applies spatial indexing automatically.\n\n**Performance tips:**\n- Use `BROADCAST` hint when one side of the join is small enough to fit in memory.\n- Use projected coordinate systems (e.g., SRID 5070 in meters) for distance calculations to avoid expensive spheroid functions.\n- Combine H3 for coarse pre-filtering with ST for precise operations.\n- Use Delta Lake liquid clustering on H3 cell columns for optimized data layout.\n- Enable auto-optimization: `delta.autoOptimize.optimizeWrite` and `delta.autoOptimize.autoCompact`.\n\n---\n\n## Part 2: Collations\n\nCollations define rules for comparing and sorting strings. Databricks supports binary, case-insensitive, accent-insensitive, and locale-specific collations using the ICU library. Available from Databricks Runtime 16.1+.\n\n### Collation Types\n\n| Collation | Description | Behavior |\n|-----------|-------------|----------|\n| `UTF8_BINARY` | Default. Byte-by-byte comparison of UTF-8 encoding | `'A' \u003c 'Z' \u003c 'a'` -- binary order, case/accent sensitive |\n| `UTF8_LCASE` | Case-insensitive binary. Converts to lowercase then compares with UTF8_BINARY | `'A' == 'a'` but `'e' != 'e'` (accent sensitive) |\n| `UNICODE` | ICU root locale. Language-agnostic Unicode ordering | `'a' \u003c 'A' \u003c 'A' \u003c 'b'` -- groups similar characters |\n| Locale-specific | ICU locale-based (e.g., `DE`, `FR`, `JA`) | Language-aware sorting rules |\n\n### Collation Syntax\n\n```\n{ UTF8_BINARY | UTF8_LCASE | { UNICODE | locale } [ _ modifier [...] ] }\n```\n\nWhere `locale` is:\n```\nlanguage_code [ _ script_code ] [ _ country_code ]\n```\n\n- `language_code`: ISO 639-1 (e.g., `EN`, `DE`, `FR`, `JA`, `ZH`)\n- `script_code`: ISO 15924 (e.g., `Hant` for Traditional Chinese, `Latn` for Latin)\n- `country_code`: ISO 3166-1 (e.g., `US`, `DE`, `CAN`)\n\n### Collation Modifiers (DBR 16.2+)\n\n| Modifier | Description | Default |\n|----------|-------------|---------|\n| `CS` | Case-Sensitive: `'A' != 'a'` | Yes (default) |\n| `CI` | Case-Insensitive: `'A' == 'a'` | No |\n| `AS` | Accent-Sensitive: `'e' != 'e'` | Yes (default) |\n| `AI` | Accent-Insensitive: `'e' == 'e'` | No |\n| `RTRIM` | Trailing-space insensitive: `'Hello' == 'Hello '` | No |\n\nSpecify at most one from each pair (CS/CI, AS/AI) plus optional RTRIM. Order does not matter.\n\n### Locale Examples\n\n| Collation Name | Description |\n|----------------|-------------|\n| `UNICODE` | ICU root locale, language-agnostic |\n| `UNICODE_CI` | Unicode, case-insensitive |\n| `UNICODE_CI_AI` | Unicode, case and accent-insensitive |\n| `DE` | German sorting rules |\n| `DE_CI_AI` | German, case and accent-insensitive |\n| `FR_CAN` | French (Canada) |\n| `EN_US` | English (United States) |\n| `ZH_Hant_MAC` | Traditional Chinese (Macau) |\n| `SR` | Serbian (normalized from `SR_CYR_SRN_CS_AS`) |\n| `JA` | Japanese |\n| `EN_CS_AI` | English, case-sensitive, accent-insensitive |\n| `UTF8_LCASE_RTRIM` | Case-insensitive with trailing space trimming |\n\n### Collation Precedence\n\nFrom highest to lowest:\n\n1. **Explicit** -- Assigned via `COLLATE` expression\n2. **Implicit** -- Derived from column, field, or variable definition\n3. **Default** -- Applied to string literals and function results\n4. **None** -- When combining different implicit collations\n\nMixing two different **explicit** collations in the same expression produces an error.\n\n### Setting Collations at Different Levels\n\n#### Catalog Level (DBR 17.1+)\n\n```sql\n-- Create catalog with default collation\nCREATE CATALOG customer_cat\n DEFAULT COLLATION UNICODE_CI_AI;\n\n-- All schemas, tables, and string columns created in this catalog\n-- inherit UNICODE_CI_AI unless overridden\n```\n\n#### Schema Level (DBR 17.1+)\n\n```sql\n-- Create schema with default collation\nCREATE SCHEMA my_schema\n DEFAULT COLLATION UNICODE_CI;\n\n-- Change default collation for new objects (existing objects unchanged)\nALTER SCHEMA my_schema\n DEFAULT COLLATION UNICODE_CI_AI;\n```\n\n#### Table Level (DBR 16.3+)\n\n```sql\n-- Table-level default collation\nCREATE TABLE users (\n id INT,\n username STRING, -- inherits UNICODE_CI from table default\n email STRING, -- inherits UNICODE_CI from table default\n password_hash STRING COLLATE UTF8_BINARY -- explicit override\n) DEFAULT COLLATION UNICODE_CI;\n```\n\n#### Column Level (DBR 16.1+)\n\n```sql\n-- Column-level collation\nCREATE TABLE products (\n id INT,\n name STRING COLLATE UNICODE_CI,\n sku STRING COLLATE UTF8_BINARY,\n description STRING COLLATE UNICODE_CI_AI\n);\n\n-- Add column with collation\nALTER TABLE products\n ADD COLUMN category STRING COLLATE UNICODE_CI;\n\n-- Change column collation (requires DBR 17.2+; may not be available on all SQL Warehouse versions)\nALTER TABLE products\n ALTER COLUMN name SET COLLATION UNICODE_CI_AI;\n```\n\n#### Expression Level\n\n```sql\n-- Apply collation inline in a query\nSELECT *\nFROM products\nWHERE name COLLATE UNICODE_CI = 'laptop';\n\n-- Check the collation of an expression\nSELECT collation('test' COLLATE UNICODE_CI);\n-- Returns: UNICODE_CI\n```\n\n### Collation Inheritance Hierarchy\n\n```\nCatalog DEFAULT COLLATION\n -> Schema DEFAULT COLLATION (overrides catalog)\n -> Table DEFAULT COLLATION (overrides schema)\n -> Column COLLATE (overrides table)\n -> Expression COLLATE (overrides column)\n```\n\nIf no collation is specified at any level, `UTF8_BINARY` is used.\n\n### Collation-Aware String Functions\n\nMost string functions respect collations. Key collation-aware operations:\n\n| Function/Operator | Collation Behavior |\n|-------------------|-------------------|\n| `=`, `!=`, `\u003c`, `>`, `\u003c=`, `>=` | Comparison uses column/expression collation |\n| `LIKE` | Pattern matching respects collation |\n| `CONTAINS(str, substr)` | Substring search respects collation |\n| `STARTSWITH(str, prefix)` | Prefix match respects collation |\n| `ENDSWITH(str, suffix)` | Suffix match respects collation |\n| `IN (...)` | Membership test respects collation |\n| `BETWEEN` | Range comparison respects collation |\n| `ORDER BY` | Sorting respects collation |\n| `GROUP BY` | Grouping respects collation |\n| `DISTINCT` | Deduplication respects collation |\n| `REPLACE(str, old, new)` | Search respects collation |\n| `TRIM` / `LTRIM` / `RTRIM` | Trim characters respect collation |\n\n**Performance note:** `STARTSWITH` and `ENDSWITH` with `UTF8_LCASE` collation show up to **10x performance speedup** compared to equivalent `LOWER()` workarounds.\n\n### Utility Functions\n\n```sql\n-- Get collation of an expression\nSELECT collation(name) FROM products;\n\n-- List all supported collations\nSELECT * FROM collations();\n\n-- Test collation with COLLATE\nSELECT collation('hello' COLLATE DE_CI_AI);\n-- Returns: DE_CI_AI\n```\n\n### Practical Collation Examples\n\n#### Case-Insensitive Search\n\n```sql\n-- Using column collation (preferred - leverages indexes)\nCREATE TABLE users (\n id INT,\n username STRING COLLATE UTF8_LCASE,\n email STRING COLLATE UTF8_LCASE\n);\n\nINSERT INTO users VALUES\n (1, 'JohnDoe', '[email protected]'),\n (2, 'janedoe', '[email protected]');\n\n-- Case-insensitive match automatically\nSELECT * FROM users WHERE username = 'johndoe';\n-- Returns: JohnDoe\n\nSELECT * FROM users WHERE email = '[email protected]';\n-- Returns: [email protected]\n```\n\n#### Case-Insensitive Search with Expression Collation\n\n```sql\n-- Ad-hoc case-insensitive comparison on a UTF8_BINARY column\nSELECT * FROM products\nWHERE name COLLATE UNICODE_CI = 'MacBook Pro';\n-- Matches: macbook pro, MACBOOK PRO, MacBook Pro, etc.\n```\n\n#### Accent-Insensitive Search\n\n```sql\n-- Accent-insensitive matching\nCREATE TABLE cities (\n id INT,\n name STRING COLLATE UNICODE_CI_AI\n);\n\nINSERT INTO cities VALUES (1, 'Montreal'), (2, 'Montreal');\n\nSELECT * FROM cities WHERE name = 'Montreal';\n-- Returns both: Montreal and Montreal (treats e and e as equal)\n```\n\n#### Locale-Aware Sorting\n\n```sql\n-- German sorting (umlauts sort correctly)\nSELECT name\nFROM german_customers\nORDER BY name COLLATE DE;\n-- Sorts: Arzte before Bauer (A treated as A+e in German sorting)\n\n-- Swedish sorting (A, A, O sort after Z)\nSELECT name\nFROM swedish_customers\nORDER BY name COLLATE SV;\n```\n\n#### Trailing Space Handling\n\n```sql\n-- RTRIM modifier ignores trailing spaces\nSELECT 'Hello' COLLATE UTF8_BINARY_RTRIM = 'Hello ';\n-- Returns: true\n\nSELECT 'Hello' COLLATE UTF8_BINARY = 'Hello ';\n-- Returns: false\n```\n\n#### Catalog-Wide Case-Insensitive Setup\n\n```sql\n-- Create a catalog where everything is case-insensitive by default\nCREATE CATALOG app_data DEFAULT COLLATION UNICODE_CI;\n\nUSE CATALOG app_data;\nCREATE SCHEMA users_schema;\nUSE SCHEMA users_schema;\n\n-- All STRING columns automatically use UNICODE_CI\nCREATE TABLE accounts (\n id INT,\n username STRING, -- UNICODE_CI inherited from catalog\n email STRING -- UNICODE_CI inherited from catalog\n);\n\n-- Queries are automatically case-insensitive\nSELECT * FROM accounts WHERE username = 'admin';\n-- Matches: Admin, ADMIN, admin, aDmIn, etc.\n```\n\n### Limitations and Notes\n\n- `CHECK` constraints and generated column expressions require `UTF8_BINARY` default collation.\n- `hive_metastore` catalog tables do not support collation constraints.\n- `ALTER SCHEMA ... DEFAULT COLLATION` only affects newly created objects, not existing ones.\n- Mixing two different explicit collations in the same expression raises an error.\n- `UTF8_LCASE` is used internally for Databricks identifier resolution (catalog, schema, table, column names).\n- Databricks normalizes collation names by removing defaults (e.g., `SR_CYR_SRN_CS_AS` simplifies to `SR`).\n- Collation modifiers require Databricks Runtime 16.2+.\n- Catalog/Schema-level `DEFAULT COLLATION` requires Databricks Runtime 17.1+.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":28721,"content_sha256":"23579de0add0babf26cc98a326a4825e392d1ee9b2785d3b2fcb8d60c8f3453d"},{"filename":"references/materialized-views-pipes.md","content":"# Materialized Views, Temporary Tables/Views, and Pipe Syntax\n\n## 1. Materialized Views in Databricks SQL\n\n### Overview\n\nMaterialized views (MVs) are Unity Catalog-managed tables that physically store precomputed query results. Unlike standard views that recompute on every query, MVs cache results and update automatically -- either on a schedule, when upstream data changes, or on-demand.\n\nKey characteristics:\n- **Pre-computed storage**: Results are physically stored as Delta tables, reducing query latency\n- **Automatic updates**: Changes propagate from source tables via incremental or full refresh\n- **Serverless pipelines**: Each MV automatically creates a serverless pipeline for creation and refreshes\n- **Incremental refresh**: Can compute only changed data from source tables under certain conditions\n\n### Requirements\n\n- **Compute**: Unity Catalog-enabled **Serverless** SQL warehouse\n- **Region**: Serverless SQL warehouse support must be available in your region\n- **Permissions**:\n - Creator needs: `SELECT` on base tables, `USE CATALOG`, `USE SCHEMA`, `CREATE TABLE`, `CREATE MATERIALIZED VIEW`\n - Refresh needs: Ownership or `REFRESH` privilege; MV owner must retain `SELECT` on base tables\n - Query needs: `SELECT` on the MV, `USE CATALOG`, `USE SCHEMA`\n\n### CREATE MATERIALIZED VIEW Syntax\n\n```sql\n{ CREATE OR REPLACE MATERIALIZED VIEW | CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] }\n view_name\n [ column_list ]\n [ view_clauses ]\n AS query\n```\n\n**Column list** (optional):\n```sql\nCREATE MATERIALIZED VIEW mv_name (\n col1 INT NOT NULL,\n col2 STRING,\n col3 DOUBLE,\n CONSTRAINT pk PRIMARY KEY (col1)\n)\nAS SELECT ...\n```\n\n**View clauses** (optional):\n- `PARTITIONED BY (col1, col2)` -- partition by columns\n- `CLUSTER BY (col1, col2)` or `CLUSTER BY AUTO` -- liquid clustering (cannot combine with PARTITIONED BY)\n- `COMMENT 'description'` -- view description\n- `TBLPROPERTIES ('key' = 'value')` -- user-defined properties\n- `WITH ROW FILTER func ON (col1, col2)` -- row-level security\n- `MASK func` on columns -- column-level masking\n- `SCHEDULE` clause -- automatic refresh schedule\n- `TRIGGER ON UPDATE` clause -- event-driven refresh\n\n### Basic Examples\n\n```sql\n-- Simple materialized view\nCREATE MATERIALIZED VIEW catalog.schema.daily_sales\n COMMENT 'Daily sales aggregations'\nAS SELECT\n date,\n region,\n SUM(sales) AS total_sales,\n COUNT(*) AS num_transactions\nFROM catalog.schema.raw_sales\nGROUP BY date, region;\n\n-- MV with explicit columns, constraints, and clustering\nCREATE MATERIALIZED VIEW catalog.schema.customer_orders (\n customer_id INT NOT NULL,\n full_name STRING,\n order_count BIGINT,\n CONSTRAINT customer_pk PRIMARY KEY (customer_id)\n)\nCLUSTER BY AUTO\nCOMMENT 'Customer order counts'\nAS SELECT\n c.customer_id,\n c.full_name,\n COUNT(o.order_id) AS order_count\nFROM catalog.schema.customers c\nINNER JOIN catalog.schema.orders o ON c.customer_id = o.customer_id\nGROUP BY c.customer_id, c.full_name;\n```\n\n### Refresh Options\n\nMVs support four refresh strategies:\n\n#### 1. Manual Refresh\n\n```sql\n-- Synchronous (blocks until complete)\nREFRESH MATERIALIZED VIEW catalog.schema.daily_sales;\n\n-- Asynchronous (returns immediately)\nREFRESH MATERIALIZED VIEW catalog.schema.daily_sales ASYNC;\n```\n\n#### 2. Scheduled Refresh (SCHEDULE)\n\n```sql\n-- Every N hours/days/weeks\nCREATE OR REPLACE MATERIALIZED VIEW catalog.schema.hourly_metrics\n SCHEDULE EVERY 1 HOUR\nAS SELECT date_trunc('hour', event_time) AS hour, COUNT(*) AS events\nFROM catalog.schema.raw_events\nGROUP BY 1;\n\n-- Cron-based schedule\nCREATE OR REPLACE MATERIALIZED VIEW catalog.schema.nightly_report\n SCHEDULE CRON '0 0 2 * * ?' AT TIME ZONE 'America/New_York'\nAS SELECT * FROM catalog.schema.daily_aggregates;\n```\n\nValid intervals: 1-72 hours, 1-31 days, 1-8 weeks. A Databricks Job is automatically created for scheduled refreshes.\n\n#### 3. Event-Driven Refresh (TRIGGER ON UPDATE)\n\nAutomatically refreshes when upstream data changes:\n\n```sql\nCREATE OR REPLACE MATERIALIZED VIEW catalog.schema.customer_orders\n TRIGGER ON UPDATE\nAS SELECT c.customer_id, c.name, COUNT(o.order_id) AS order_count\nFROM catalog.schema.customers c\nJOIN catalog.schema.orders o ON c.customer_id = o.customer_id\nGROUP BY c.customer_id, c.name;\n\n-- With throttle to avoid excessive refreshes\nCREATE OR REPLACE MATERIALIZED VIEW catalog.schema.customer_orders\n TRIGGER ON UPDATE AT MOST EVERY INTERVAL 5 MINUTES\nAS SELECT c.customer_id, c.name, COUNT(o.order_id) AS order_count\nFROM catalog.schema.customers c\nJOIN catalog.schema.orders o ON c.customer_id = o.customer_id\nGROUP BY c.customer_id, c.name;\n```\n\nTrigger limitations:\n- Maximum **10 upstream source tables** and **30 upstream views**\n- Minimum **1-minute** interval (default)\n- Maximum **1,000** trigger-based MVs per workspace\n- Supports Delta tables, managed views, and streaming tables as sources\n- Does **not** support Delta Sharing shared tables\n\n#### 4. Job-Based Orchestration\n\nIntegrate refreshes into existing Databricks Jobs using SQL task types:\n\n```sql\n-- In a Databricks Job SQL task\nREFRESH MATERIALIZED VIEW catalog.schema.daily_sales_summary;\n```\n\n### Managing Schedules After Creation\n\n```sql\n-- Add a schedule to an existing MV\nALTER MATERIALIZED VIEW catalog.schema.my_mv ADD SCHEDULE EVERY 4 HOURS;\n\n-- Add trigger-based refresh\nALTER MATERIALIZED VIEW catalog.schema.my_mv ADD TRIGGER ON UPDATE;\n\n-- Change an existing schedule\nALTER MATERIALIZED VIEW catalog.schema.my_mv ALTER SCHEDULE EVERY 2 HOURS;\n\n-- Remove a schedule\nALTER MATERIALIZED VIEW catalog.schema.my_mv DROP SCHEDULE;\n```\n\n### Incremental vs Full Refresh\n\n| Aspect | Incremental Refresh | Full Refresh |\n|--------|-------------------|--------------|\n| What it does | Evaluates changes since last refresh, merges only new/modified records | Re-executes the entire defining query |\n| When used | When source tables support change tracking and query structure allows it | When incremental is not possible or not cost-effective |\n| Requirements | Delta source tables with row tracking and CDF enabled | No special requirements |\n| Cost | Lower (processes only deltas) | Higher (recomputes everything) |\n\nEnable row tracking on source tables for incremental refresh:\n\n```sql\nALTER TABLE catalog.schema.source_table\nSET TBLPROPERTIES (delta.enableRowTracking = true);\n```\n\nBy default, Databricks uses a cost model to choose between incremental and full refresh. Use `EXPLAIN CREATE MATERIALIZED VIEW` to verify the chosen refresh type.\n\n### Timeout Configuration\n\n```sql\n-- Set timeout before creating or refreshing\nSET STATEMENT_TIMEOUT = '6h';\nCREATE OR REFRESH MATERIALIZED VIEW catalog.schema.my_mv\n SCHEDULE EVERY 12 HOURS\nAS SELECT * FROM catalog.schema.large_source_table;\n```\n\nDefault timeout is **2 days** if no warehouse timeout is configured. After changing warehouse timeouts, re-run `CREATE OR REFRESH` to apply new settings.\n\n### Monitoring\n\n- **Catalog Explorer**: View refresh status, schema, permissions, lineage under the MV entry\n- **DESCRIBE EXTENDED**: Get schedule and configuration details\n- **Jobs & Pipelines UI**: Monitor the automatically created pipeline\n- **Pipelines API**: `GET /api/2.0/pipelines/{pipeline_id}` for programmatic access\n- **DESCRIBE EXTENDED AS JSON**: Get refresh information including last refresh time, type, status, and schedule (added October 2025)\n\n### Key Limitations\n\n- No identity columns or surrogate keys\n- Cannot read change data feeds (CDF) from materialized views\n- Time travel queries are not supported\n- `OPTIMIZE` and `VACUUM` commands are not supported (managed automatically)\n- **Null handling edge case**: `SUM()` on a nullable column returns **0** instead of `NULL` when all non-null values are removed\n- Non-column expressions in the defining query require explicit aliases\n- Underlying storage may contain upstream data not visible in the MV definition (required for incremental refresh)\n- Cannot rename the MV or change its owner via ALTER (must drop and recreate)\n- No data quality expectations support\n- AWS PrivateLink requires contacting Databricks support\n\n### DBSQL Materialized Views vs Pipeline (SDP/DLT) Materialized Views\n\n| Aspect | DBSQL Materialized Views | Pipeline (SDP/DLT) Materialized Views |\n|--------|-------------------------|--------------------------------------|\n| **Creation** | `CREATE MATERIALIZED VIEW` in SQL warehouse | Defined in pipeline source code (SQL or Python) |\n| **Pipeline type** | `MV/ST` (auto-created serverless pipeline) | `ETL` (explicitly defined pipeline) |\n| **Pipeline management** | Automatically created and managed | User-defined, full pipeline lifecycle control |\n| **Syntax** | Standard `CREATE MATERIALIZED VIEW` | `CREATE OR REFRESH MATERIALIZED VIEW` with `PRIVATE` option |\n| **Private MVs** | Not supported | `PRIVATE` keyword for pipeline-scoped views |\n| **Refresh trigger** | Schedule, trigger-on-update, manual, or job-based | Pipeline update (manual or scheduled) |\n| **Compute** | Serverless SQL warehouse (creation); serverless pipeline (refresh) | Pipeline compute (serverless or classic) |\n| **Data quality** | Not supported | Expectations supported |\n| **Best for** | Standalone MVs, BI dashboard acceleration, simple ETL | Complex multi-table pipelines, orchestrated transformations |\n\nBoth approaches ultimately use similar underlying mechanisms (serverless pipelines) and support incremental refresh. The key difference is in management: DBSQL MVs are self-contained with auto-managed pipelines, while pipeline MVs are part of a broader orchestrated data flow.\n\n### Best Practices\n\n1. **Choose the right refresh strategy**: `TRIGGER ON UPDATE` for near-real-time SLA; `SCHEDULE` for predictable cadences; manual or job-based for complex orchestration\n2. **Enable row tracking** on Delta source tables for cost-effective incremental refreshes\n3. **Use async refreshes** when refresh duration is long and downstream queries can tolerate slight staleness\n4. **Set explicit timeouts** when refresh duration is predictable to avoid runaway costs\n5. **Use `CLUSTER BY AUTO`** for automatic liquid clustering optimization\n6. **Apply row filters and column masks** at MV creation for security\n7. **Monitor refresh types** with `EXPLAIN CREATE MATERIALIZED VIEW` to verify incremental behavior\n\n---\n\n## 2. Temporary Tables and Temporary Views\n\n### Temporary Tables\n\nTemporary tables are session-scoped, physical Delta tables for intermediate data storage. They exist only within the session where they are created.\n\n#### Key Characteristics\n\n- **Session-scoped**: Only visible to the creating session; isolated from other users\n- **Physical storage**: Stored as Delta tables in an internal Unity Catalog location tied to the workspace\n- **Maximum lifetime**: 7 days from session creation, or until the session ends (whichever comes first)\n- **No catalog privileges needed**: Any user can create temporary tables without `CREATE TABLE` privileges\n- **Automatic cleanup**: Databricks reclaims storage automatically, even after unexpected disconnections\n- **Shared namespace**: Temporary tables share a namespace with temporary views; you cannot create both with the same name\n\n#### Syntax\n\n```sql\n-- Create with schema\nCREATE TEMPORARY TABLE temp_results (\n id INT,\n name STRING,\n score DOUBLE\n);\n\n-- Create from query (CTAS)\nCREATE TEMP TABLE temp_active_users\nAS SELECT user_id, username, last_login\nFROM catalog.schema.users\nWHERE last_login > current_date() - INTERVAL 30 DAYS;\n```\n\nNote: `CREATE OR REPLACE TEMP TABLE` is **not yet supported**. To replace, drop first.\n\n#### Supported Operations\n\n```sql\n-- INSERT\nINSERT INTO temp_results VALUES (1, 'Alice', 95.5);\nINSERT INTO temp_results SELECT * FROM catalog.schema.source WHERE score > 90;\n\n-- UPDATE\nUPDATE temp_results SET score = 100.0 WHERE name = 'Alice';\n\n-- MERGE\nMERGE INTO temp_results t\nUSING catalog.schema.new_scores s ON t.id = s.id\nWHEN MATCHED THEN UPDATE SET score = s.score\nWHEN NOT MATCHED THEN INSERT *;\n```\n\n#### Unsupported Operations\n\n- `DELETE FROM` (not supported)\n- `ALTER TABLE` (drop and recreate instead)\n- Shallow or deep cloning\n- Time travel\n- Streaming (foreachBatch)\n- DataFrame API access (SQL only)\n\n#### Use Cases\n\n1. **Exploratory analysis**: Store intermediate results while iterating on queries\n2. **Multi-step transformations**: Break complex transformations into readable steps\n3. **Query result reuse**: Compute once, reference multiple times in a session\n4. **Sandboxing**: Test transformations without affecting production tables\n\n#### Name Resolution\n\nWhen referencing a single-part table name, Databricks resolves in order:\n1. Temporary tables in the current session\n2. Permanent tables in the current schema\n\nTemporary tables with the same name as permanent tables **take precedence** within that session.\n\n### Temporary Views\n\nTemporary views are session-scoped, logical views that store a query definition (not data). They are recomputed on each access.\n\n#### Syntax\n\n```sql\n-- Create a temporary view\nCREATE TEMPORARY VIEW active_customers\nAS SELECT customer_id, name, email\nFROM catalog.schema.customers\nWHERE status = 'active';\n\n-- Replace an existing temporary view\nCREATE OR REPLACE TEMPORARY VIEW active_customers\nAS SELECT customer_id, name, email, phone\nFROM catalog.schema.customers\nWHERE status = 'active' AND last_order > current_date() - INTERVAL 90 DAYS;\n```\n\n#### Key Rules\n\n- Temporary view names **must not be qualified** (no catalog or schema prefix)\n- No special privileges required to create\n- Dropped automatically when the session ends\n- Cannot use `schema_binding` clauses\n- Support `COMMENT` and column comments\n\n#### Global Temporary Views (Databricks Runtime Only)\n\n```sql\n-- Only available in Databricks Runtime, NOT in Databricks SQL\nCREATE GLOBAL TEMPORARY VIEW global_summary\nAS SELECT region, SUM(revenue) AS total_revenue\nFROM catalog.schema.sales\nGROUP BY region;\n\n-- Must reference via global_temp schema\nSELECT * FROM global_temp.global_summary;\n```\n\nGlobal temporary views are stored in a system `global_temp` schema and are session-scoped. They are **not available in Databricks SQL** (only Databricks Runtime).\n\n### Temporary Tables vs Temporary Views\n\n| Aspect | Temporary Tables | Temporary Views |\n|--------|-----------------|-----------------|\n| **Storage** | Physical Delta table (stores data) | Logical (stores query definition only) |\n| **Compute on access** | No (data already materialized) | Yes (query re-executed each time) |\n| **DML support** | INSERT, UPDATE, MERGE | None (read-only definition) |\n| **Max lifetime** | 7 days or session end | Session end |\n| **CREATE OR REPLACE** | Not supported | Supported |\n| **Performance** | Faster for repeated reads (data cached) | Slower for repeated reads (recomputed) |\n| **Storage cost** | Uses cloud storage (auto-cleaned) | No storage cost |\n| **Shared namespace** | Yes (conflicts with temp views) | Yes (conflicts with temp tables) |\n| **When to use** | Large intermediate results, repeated access, DML needed | Simple query aliases, lightweight transformations |\n\n### Temporary Metric Views (Added September 2025)\n\n```sql\n-- Temporary metric views: session-scoped, dropped on session end\nCREATE TEMPORARY METRIC VIEW session_metrics\nAS SELECT ...;\n```\n\nAvailable in Databricks Runtime 17.2+ and Databricks SQL.\n\n---\n\n## 3. SQL Pipe Syntax\n\n### Overview\n\nPipe syntax (introduced February 2025) allows composing SQL queries as a top-down, left-to-right chain of operations using the `|>` operator. It eliminates deeply nested subqueries and makes SQL read like a DataFrame pipeline.\n\n**Requirements**: Databricks SQL or Databricks Runtime **16.2+**\n\n### Basic Syntax\n\n```sql\nFROM table_name\n|> pipe_operation_1\n|> pipe_operation_2\n|> pipe_operation_3;\n```\n\nAny query can start a pipeline. The most common pattern is `FROM table_name`, but any SELECT or subquery also works:\n\n```sql\n-- Start from a table\nFROM catalog.schema.sales |> WHERE region = 'US' |> SELECT product, amount;\n\n-- Start from a subquery\n(SELECT * FROM catalog.schema.sales WHERE year = 2025)\n|> AGGREGATE SUM(amount) AS total GROUP BY product\n|> ORDER BY total DESC;\n```\n\n### All Available Pipe Operators\n\n#### SELECT -- Project columns\n\n```sql\nFROM catalog.schema.employees\n|> SELECT employee_id, name, department, salary;\n```\n\nNote: `SELECT` in pipe syntax **must not contain aggregate functions**. Use `AGGREGATE` instead.\n\n#### EXTEND -- Add new columns\n\nAppends new columns to the existing result set (like PySpark's `withColumn`):\n\n```sql\nFROM catalog.schema.orders\n|> EXTEND quantity * unit_price AS line_total\n|> EXTEND line_total * 0.1 AS tax;\n```\n\nExpressions can reference columns created by preceding expressions in the same EXTEND.\n\n#### SET -- Modify existing columns\n\nOverrides existing column values (like PySpark's `withColumn` on existing columns):\n\n```sql\nFROM catalog.schema.products\n|> SET price = price * 1.1\n|> SET name = UPPER(name);\n```\n\nRaises `UNRESOLVED_COLUMN` if the column does not exist.\n\n#### DROP -- Remove columns\n\nRemoves columns (shorthand for `SELECT * EXCEPT`):\n\n```sql\nFROM catalog.schema.users\n|> DROP password_hash, internal_id, debug_flag;\n```\n\n#### WHERE -- Filter rows\n\n```sql\nFROM catalog.schema.transactions\n|> WHERE amount > 1000\n|> WHERE transaction_date >= '2025-01-01';\n```\n\n#### AGGREGATE -- Aggregation with optional GROUP BY\n\n```sql\n-- Full-table aggregation\nFROM catalog.schema.orders\n|> AGGREGATE\n COUNT(*) AS total_orders,\n SUM(amount) AS total_revenue,\n AVG(amount) AS avg_order_value;\n\n-- Grouped aggregation\nFROM catalog.schema.orders\n|> AGGREGATE\n SUM(amount) AS total_revenue,\n COUNT(*) AS order_count\n GROUP BY region, product_category;\n```\n\nIn pipe syntax, `AGGREGATE` replaces `SELECT ... GROUP BY`. Numeric values in GROUP BY reference input columns, not generated results.\n\n#### JOIN -- Combine relations\n\n```sql\nFROM catalog.schema.orders\n|> AS o\n|> LEFT JOIN catalog.schema.customers c ON o.customer_id = c.customer_id\n|> SELECT o.order_id, c.name, o.amount;\n```\n\nAll JOIN types are supported: `INNER JOIN`, `LEFT OUTER JOIN`, `RIGHT OUTER JOIN`, `FULL OUTER JOIN`, `CROSS JOIN`, `SEMI JOIN`, `ANTI JOIN`.\n\n#### ORDER BY -- Sort results\n\n```sql\nFROM catalog.schema.products\n|> ORDER BY price DESC, name ASC;\n```\n\n#### LIMIT and OFFSET -- Pagination\n\n```sql\nFROM catalog.schema.products\n|> ORDER BY price DESC\n|> LIMIT 10\n|> OFFSET 20;\n```\n\n#### AS -- Assign table alias\n\nNames the intermediate result for use in subsequent JOINs or self-references:\n\n```sql\nFROM catalog.schema.sales\n|> AS current_sales\n|> JOIN catalog.schema.targets t ON current_sales.region = t.region\n|> SELECT current_sales.region, current_sales.revenue, t.target;\n```\n\n#### Set Operators -- UNION, EXCEPT, INTERSECT\n\n```sql\nFROM catalog.schema.us_customers\n|> UNION ALL (SELECT * FROM catalog.schema.eu_customers)\n|> ORDER BY name;\n```\n\n#### TABLESAMPLE -- Sample rows\n\n```sql\n-- Sample by row count\nFROM catalog.schema.large_table\n|> TABLESAMPLE (1000 ROWS);\n\n-- Sample by percentage\nFROM catalog.schema.large_table\n|> TABLESAMPLE (10 PERCENT);\n```\n\n#### PIVOT -- Rows to columns\n\n```sql\nFROM catalog.schema.quarterly_sales\n|> PIVOT (\n SUM(revenue)\n FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')\n );\n```\n\n#### UNPIVOT -- Columns to rows\n\n```sql\nFROM catalog.schema.wide_metrics\n|> UNPIVOT (\n metric_value FOR metric_name IN (cpu_usage, memory_usage, disk_usage)\n );\n```\n\n### Practical Examples\n\n#### Example 1: Multi-step aggregation (replaces nested subqueries)\n\nTraditional SQL:\n```sql\nSELECT c_count, COUNT(*) AS custdist\nFROM (\n SELECT c_custkey, COUNT(o_orderkey) AS c_count\n FROM customer\n LEFT OUTER JOIN orders ON c_custkey = o_custkey\n AND o_comment NOT LIKE '%unusual%packages%'\n GROUP BY c_custkey\n) AS c_orders\nGROUP BY c_count\nORDER BY custdist DESC, c_count DESC;\n```\n\nPipe syntax:\n```sql\nFROM customer\n|> LEFT OUTER JOIN orders ON c_custkey = o_custkey\n AND o_comment NOT LIKE '%unusual%packages%'\n|> AGGREGATE COUNT(o_orderkey) AS c_count GROUP BY c_custkey\n|> AGGREGATE COUNT(*) AS custdist GROUP BY c_count\n|> ORDER BY custdist DESC, c_count DESC;\n```\n\n#### Example 2: Data exploration and profiling\n\n```sql\nFROM catalog.schema.raw_events\n|> WHERE event_date >= '2025-01-01'\n|> EXTEND YEAR(event_date) AS event_year, MONTH(event_date) AS event_month\n|> AGGREGATE\n COUNT(*) AS event_count,\n COUNT(DISTINCT user_id) AS unique_users,\n AVG(duration_seconds) AS avg_duration\n GROUP BY event_year, event_month\n|> ORDER BY event_year, event_month;\n```\n\n#### Example 3: Building a report step-by-step\n\n```sql\nFROM catalog.schema.orders\n|> AS o\n|> JOIN catalog.schema.products p ON o.product_id = p.product_id\n|> JOIN catalog.schema.customers c ON o.customer_id = c.customer_id\n|> WHERE o.order_date >= '2025-01-01'\n|> EXTEND o.quantity * p.unit_price AS line_total\n|> AGGREGATE\n SUM(line_total) AS total_revenue,\n COUNT(DISTINCT o.order_id) AS order_count\n GROUP BY c.region, p.category\n|> ORDER BY total_revenue DESC\n|> LIMIT 20;\n```\n\n#### Example 4: Debugging by commenting out tail operations\n\n```sql\nFROM catalog.schema.sales\n|> WHERE region = 'US'\n|> EXTEND amount * tax_rate AS tax_amount\n-- |> AGGREGATE SUM(tax_amount) AS total_tax GROUP BY state\n-- |> ORDER BY total_tax DESC\n;\n-- Comment out the last operations to inspect intermediate results\n```\n\n### Pipe Syntax vs Traditional SQL\n\n| Aspect | Traditional SQL | Pipe SQL |\n|--------|----------------|----------|\n| **Reading order** | Inside-out (subqueries first) | Top-down, left-to-right |\n| **Clause order** | Fixed: SELECT...FROM...WHERE...GROUP BY...ORDER BY | Any order, any number of times |\n| **Subquery nesting** | Required for multi-step aggregations | Eliminated via chaining |\n| **Column addition** | SELECT *, expr AS new_col | `EXTEND expr AS new_col` |\n| **Column removal** | SELECT with explicit column list or EXCEPT | `DROP col1, col2` |\n| **Column modification** | SELECT with expression replacing column | `SET col = new_expr` |\n| **Aggregation** | SELECT agg() ... GROUP BY | `AGGREGATE agg() GROUP BY` |\n| **Composability** | Limited; requires CTEs or subqueries | Native chaining |\n| **Interoperability** | Standard | Fully interoperable with traditional SQL |\n\n### When to Use Pipe Syntax\n\n**Use pipe syntax when:**\n- Multi-step aggregations would require nested subqueries\n- You want DataFrame-like readability in SQL\n- Building exploratory or iterative queries (easy to add/remove steps)\n- Complex transformations with many joins, filters, and projections\n\n**Use traditional SQL when:**\n- Simple queries that are already readable\n- Team is more familiar with standard SQL\n- Queries will be shared with tools that may not support pipe syntax\n\n### Performance Considerations\n\n- Pipe syntax is **syntactic sugar** -- it compiles to the same execution plan as traditional SQL\n- No performance difference between pipe and traditional syntax for equivalent queries\n- Best practice: Place data-reducing operations (`WHERE`, `DROP`, `SELECT`) early in the pipeline to minimize data flowing through subsequent operations\n- Use `TABLESAMPLE` during development to work with smaller datasets\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":23144,"content_sha256":"c19da7c7c10f5f9544e07ae55b47360f9d609c586f99a5c406c6ebfb851983c7"},{"filename":"references/sql-scripting.md","content":"# SQL Scripting, Stored Procedures, Recursive CTEs, and Transactions\n\n> Databricks SQL procedural extensions based on the SQL/PSM standard. Covers SQL scripting (compound statements, control flow, exception handling), stored procedures, recursive CTEs, and multi-statement transactions.\n\n---\n\n## Table of Contents\n\n- [SQL Scripting](#sql-scripting)\n - [Compound Statements (BEGIN...END)](#compound-statements-beginend)\n - [Variable Declaration (DECLARE)](#variable-declaration-declare)\n - [Variable Assignment (SET)](#variable-assignment-set)\n - [Control Flow](#control-flow)\n - [IF / ELSEIF / ELSE](#if--elseif--else)\n - [CASE Statement](#case-statement)\n - [WHILE Loop](#while-loop)\n - [FOR Loop](#for-loop)\n - [LOOP Statement](#loop-statement)\n - [REPEAT Statement](#repeat-statement)\n - [LEAVE and ITERATE](#leave-and-iterate)\n - [Exception Handling](#exception-handling)\n - [Condition Declaration](#condition-declaration)\n - [Handler Declaration](#handler-declaration)\n - [SIGNAL and RESIGNAL](#signal-and-resignal)\n - [EXECUTE IMMEDIATE (Dynamic SQL)](#execute-immediate-dynamic-sql)\n- [Stored Procedures](#stored-procedures)\n - [CREATE PROCEDURE](#create-procedure)\n - [CALL (Invoke a Procedure)](#call-invoke-a-procedure)\n - [DROP PROCEDURE](#drop-procedure)\n - [DESCRIBE PROCEDURE](#describe-procedure)\n - [SHOW PROCEDURES](#show-procedures)\n- [Recursive CTEs](#recursive-ctes)\n - [WITH RECURSIVE Syntax](#with-recursive-syntax)\n - [Anchor and Recursive Members](#anchor-and-recursive-members)\n - [MAX RECURSION LEVEL](#max-recursion-level)\n - [Use Cases and Examples](#use-cases-and-examples)\n - [Limitations](#limitations)\n- [Multi-Statement Transactions](#multi-statement-transactions)\n - [Overview and Current Status](#overview-and-current-status)\n - [SQL Scripting Atomic Blocks](#sql-scripting-atomic-blocks)\n - [Python Connector Transaction API](#python-connector-transaction-api)\n - [Isolation Levels](#isolation-levels)\n - [Write Conflicts and Concurrency](#write-conflicts-and-concurrency)\n - [Best Practices](#best-practices)\n\n---\n\n## SQL Scripting\n\n**Availability**: Databricks Runtime 16.3+ and Databricks SQL\n\nSQL scripting enables procedural logic using the SQL/PSM standard. Every SQL script starts with a compound statement block (`BEGIN...END`).\n\n### Compound Statements (BEGIN...END)\n\nA compound statement is the fundamental building block containing variable declarations, condition/handler declarations, and executable statements.\n\n**Syntax**:\n\n```sql\n[ label : ] BEGIN\n [ { declare_variable | declare_condition } ; [...] ]\n [ declare_handler ; [...] ]\n [ SQL_statement ; [...] ]\nEND [ label ]\n```\n\n**Key rules**:\n\n- Declarations must appear before executable statements\n- Variable declarations come before condition declarations, which come before handler declarations\n- Top-level compound statements cannot specify labels\n- `NOT ATOMIC` is the default and only behavior (no automatic rollback on failure)\n- In notebooks, the compound statement must be the sole statement in the cell\n\n**Supported statement types in body**:\n\n| Category | Statements |\n|----------|-----------|\n| DDL | ALTER, CREATE, DROP |\n| DCL | GRANT, REVOKE |\n| DML | INSERT, UPDATE, DELETE, MERGE |\n| Query | SELECT |\n| Assignment | SET |\n| Dynamic SQL | EXECUTE IMMEDIATE |\n| Control flow | IF, CASE, WHILE, FOR, LOOP, REPEAT, LEAVE, ITERATE |\n| Nesting | Nested BEGIN...END blocks |\n\n**Minimal example**:\n\n```sql\nBEGIN\n SELECT 'Hello, SQL Scripting!';\nEND;\n```\n\n### Variable Declaration (DECLARE)\n\n**Syntax**:\n\n```sql\nDECLARE variable_name [, ...] data_type [ DEFAULT default_expr ];\n```\n\n- Variables initialize to `NULL` if no `DEFAULT` is specified\n- Data type can be omitted when `DEFAULT` is provided (type inferred from expression)\n- Multiple variable names in a single `DECLARE` supported in Runtime 17.2+\n- Variables are scoped to their enclosing compound statement\n- Variable names resolve from the innermost scope outward; use labels to disambiguate\n\n**Examples**:\n\n```sql\nBEGIN\n DECLARE counter INT DEFAULT 0;\n DECLARE name STRING DEFAULT 'unknown';\n DECLARE x, y, z DOUBLE DEFAULT 0.0; -- Runtime 17.2+\n DECLARE inferred DEFAULT current_date(); -- type inferred as DATE\n\n SET counter = counter + 1;\n VALUES (counter, name);\nEND;\n```\n\n### Variable Assignment (SET)\n\n**Syntax**:\n\n```sql\nSET variable_name = expression;\nSET VAR variable_name = expression; -- explicit local variable\nSET (var1, var2, ...) = (expr1, expr2, ...); -- multi-assignment\n```\n\nUse `SET VAR` to explicitly target a local variable when a session variable with the same name exists.\n\n**Example**:\n\n```sql\nBEGIN\n DECLARE total INT DEFAULT 0;\n DECLARE label STRING;\n SET total = 100;\n SET label = 'final';\n VALUES (total, label);\nEND;\n```\n\n### Control Flow\n\n#### IF / ELSEIF / ELSE\n\nExecutes statements based on the first condition evaluating to `TRUE`.\n\n**Syntax**:\n\n```sql\nIF condition THEN\n { stmt ; } [...]\n[ ELSEIF condition THEN\n { stmt ; } [...] ] [...]\n[ ELSE\n { stmt ; } [...] ]\nEND IF;\n```\n\n**Example**:\n\n```sql\nBEGIN\n DECLARE score INT DEFAULT 85;\n DECLARE grade STRING;\n\n IF score >= 90 THEN\n SET grade = 'A';\n ELSEIF score >= 80 THEN\n SET grade = 'B';\n ELSEIF score >= 70 THEN\n SET grade = 'C';\n ELSE\n SET grade = 'F';\n END IF;\n\n VALUES (grade); -- Returns 'B'\nEND;\n```\n\n#### CASE Statement\n\nTwo forms: **simple CASE** (compare expression) and **searched CASE** (evaluate boolean conditions).\n\n**Simple CASE syntax**:\n\n```sql\nCASE expr\n WHEN opt1 THEN { stmt ; } [...]\n WHEN opt2 THEN { stmt ; } [...]\n [ ELSE { stmt ; } [...] ]\nEND CASE;\n```\n\n**Searched CASE syntax**:\n\n```sql\nCASE\n WHEN cond1 THEN { stmt ; } [...]\n WHEN cond2 THEN { stmt ; } [...]\n [ ELSE { stmt ; } [...] ]\nEND CASE;\n```\n\nOnly the first matching branch executes.\n\n**Example**:\n\n```sql\nBEGIN\n DECLARE status STRING DEFAULT 'active';\n\n CASE status\n WHEN 'active' THEN VALUES ('Processing');\n WHEN 'paused' THEN VALUES ('On hold');\n WHEN 'archived' THEN VALUES ('Read-only');\n ELSE VALUES ('Unknown status');\n END CASE;\nEND;\n```\n\n#### WHILE Loop\n\nRepeats while a condition is `TRUE`.\n\n**Syntax**:\n\n```sql\n[ label : ] WHILE condition DO\n { stmt ; } [...]\nEND WHILE [ label ];\n```\n\n**Example** -- sum odd numbers from 1 to 10:\n\n```sql\nBEGIN\n DECLARE total INT DEFAULT 0;\n DECLARE i INT DEFAULT 0;\n\n sum_odds: WHILE i \u003c 10 DO\n SET i = i + 1;\n IF i % 2 = 0 THEN\n ITERATE sum_odds; -- skip even numbers\n END IF;\n SET total = total + i;\n END WHILE sum_odds;\n\n VALUES (total); -- Returns 25\nEND;\n```\n\n#### FOR Loop\n\nIterates over query result rows.\n\n**Syntax**:\n\n```sql\n[ label : ] FOR [ variable_name AS ] query DO\n { stmt ; } [...]\nEND FOR [ label ];\n```\n\n- Use `variable_name` (not the label) to qualify column references from the cursor\n- For Delta tables, modifying the source during iteration does not affect cursor results\n- Loop may not fully execute the query if terminated early by `LEAVE` or an error\n\n**Example** -- process each row from a query:\n\n```sql\nBEGIN\n DECLARE total_revenue DOUBLE DEFAULT 0.0;\n\n process_orders: FOR row AS\n SELECT order_id, amount FROM orders WHERE status = 'completed'\n DO\n SET total_revenue = total_revenue + row.amount;\n IF total_revenue > 1000000 THEN\n LEAVE process_orders;\n END IF;\n END FOR process_orders;\n\n VALUES (total_revenue);\nEND;\n```\n\n#### LOOP Statement\n\nUnconditional loop; must use `LEAVE` to exit.\n\n**Syntax**:\n\n```sql\n[ label : ] LOOP\n { stmt ; } [...]\nEND LOOP [ label ];\n```\n\n**Example**:\n\n```sql\nBEGIN\n DECLARE counter INT DEFAULT 0;\n\n count_up: LOOP\n SET counter = counter + 1;\n IF counter >= 5 THEN\n LEAVE count_up;\n END IF;\n END LOOP count_up;\n\n VALUES (counter); -- Returns 5\nEND;\n```\n\n#### REPEAT Statement\n\nExecutes at least once, then repeats until condition is `TRUE`.\n\n**Syntax**:\n\n```sql\n[ label : ] REPEAT\n { stmt ; } [...]\n UNTIL condition\nEND REPEAT [ label ];\n```\n\n**Example**:\n\n```sql\nBEGIN\n DECLARE total INT DEFAULT 0;\n DECLARE i INT DEFAULT 0;\n\n sum_loop: REPEAT\n SET i = i + 1;\n IF i % 2 != 0 THEN\n SET total = total + i;\n END IF;\n UNTIL i >= 10\n END REPEAT sum_loop;\n\n VALUES (total); -- Returns 25\nEND;\n```\n\n#### LEAVE and ITERATE\n\n| Statement | Purpose | Equivalent |\n|-----------|---------|-----------|\n| `LEAVE label` | Exit the labeled loop or compound block | `BREAK` in other languages |\n| `ITERATE label` | Skip to the next iteration of the labeled loop | `CONTINUE` in other languages |\n\nBoth require a labeled loop to target.\n\n### Exception Handling\n\n#### Condition Declaration\n\nDefine named conditions for specific SQLSTATE codes.\n\n**Syntax**:\n\n```sql\nDECLARE condition_name CONDITION [ FOR SQLSTATE [ VALUE ] sqlstate ];\n```\n\n- `sqlstate` is a 5-character alphanumeric string (A-Z, 0-9, case-insensitive)\n- Cannot start with `'00'`, `'01'`, or `'XX'`\n- Defaults to `'45000'` if not specified\n\n**Example**:\n\n```sql\nBEGIN\n DECLARE divide_by_zero CONDITION FOR SQLSTATE '22012';\n -- Use in handler declarations below\nEND;\n```\n\n#### Handler Declaration\n\nCatch and handle exceptions within compound statements.\n\n**Syntax**:\n\n```sql\nDECLARE handler_type HANDLER FOR condition_value [, ...] handler_action;\n```\n\n| Parameter | Options | Description |\n|-----------|---------|-------------|\n| `handler_type` | `EXIT` | Exits the enclosing compound after handling |\n| `condition_value` | `SQLSTATE 'xxxxx'`, `condition_name`, `SQLEXCEPTION`, `NOT FOUND` | What to catch |\n| `handler_action` | Single statement or nested `BEGIN...END` | What to execute |\n\n- `SQLEXCEPTION` catches all error states (SQLSTATE class not `'00'` or `'01'`)\n- `NOT FOUND` catches `'02xxx'` states (no data found)\n- A handler cannot apply to statements in its own body\n\n**Example** -- catch division by zero:\n\n```sql\nBEGIN\n DECLARE result DOUBLE;\n DECLARE EXIT HANDLER FOR SQLSTATE '22012'\n BEGIN\n SET result = -1;\n END;\n\n SET result = 10 / 0; -- triggers handler\n VALUES (result); -- Returns -1\nEND;\n```\n\n**Example** -- generic exception handler:\n\n```sql\nBEGIN\n DECLARE error_msg STRING DEFAULT 'none';\n\n DECLARE EXIT HANDLER FOR SQLEXCEPTION\n BEGIN\n SET error_msg = 'An error occurred';\n INSERT INTO error_log (message, ts) VALUES (error_msg, current_timestamp());\n END;\n\n -- statements that might fail\n INSERT INTO target_table SELECT * FROM source_table;\nEND;\n```\n\n#### SIGNAL and RESIGNAL\n\nRaise or re-raise exceptions.\n\n**SIGNAL syntax**:\n\n```sql\nSIGNAL condition_name\n [ SET { MESSAGE_ARGUMENTS = argument_map | MESSAGE_TEXT = message_str } ];\n\nSIGNAL SQLSTATE [ VALUE ] sqlstate\n [ SET MESSAGE_TEXT = message_str ];\n```\n\n**RESIGNAL syntax** (use in handlers to preserve diagnostic stack):\n\n```sql\nRESIGNAL [ condition_name | SQLSTATE [ VALUE ] sqlstate ]\n [ SET { MESSAGE_ARGUMENTS = argument_map | MESSAGE_TEXT = message_str } ];\n```\n\n- Prefer `RESIGNAL` over `SIGNAL` inside handlers -- `RESIGNAL` preserves the diagnostic stack while `SIGNAL` clears it\n- `MESSAGE_ARGUMENTS` takes a `MAP\u003cSTRING, STRING>` literal\n\n**Example** -- validate input and raise custom error:\n\n```sql\nBEGIN\n DECLARE input_value INT DEFAULT 150;\n\n IF input_value > 100 THEN\n SIGNAL SQLSTATE '45000'\n SET MESSAGE_TEXT = 'Input value must be \u003c= 100';\n END IF;\n\n VALUES (input_value);\nEND;\n```\n\n**Example** -- using named conditions with MESSAGE_ARGUMENTS:\n\n```sql\nBEGIN\n DECLARE input INT DEFAULT 5;\n DECLARE arg_map MAP\u003cSTRING, STRING>;\n\n IF input > 4 THEN\n SET arg_map = map('errorMessage', 'Input must be \u003c= 4.');\n SIGNAL USER_RAISED_EXCEPTION\n SET MESSAGE_ARGUMENTS = arg_map;\n END IF;\nEND;\n```\n\n### EXECUTE IMMEDIATE (Dynamic SQL)\n\nExecute SQL statements constructed as strings at runtime.\n\n**Availability**: Runtime 14.3+; expression-based `sql_string` and nested execution from Runtime 17.3+.\n\n**Syntax**:\n\n```sql\nEXECUTE IMMEDIATE sql_string\n [ INTO var_name [, ...] ]\n [ USING { arg_expr [ AS ] [ alias ] } [, ...] ];\n```\n\n- `sql_string`: a constant expression producing a well-formed SQL statement\n- `INTO`: captures a single-row result into variables (returns `NULL` for zero rows; errors for multiple rows)\n- `USING`: binds values to positional (`?`) or named (`:param`) parameter markers (cannot mix styles)\n\n**Examples**:\n\n```sql\n-- Positional parameters\nEXECUTE IMMEDIATE 'SELECT SUM(c1) FROM VALUES(?), (?) AS t(c1)' USING 5, 6;\n\n-- Named parameters with INTO\nBEGIN\n DECLARE total INT;\n EXECUTE IMMEDIATE 'SELECT SUM(c1) FROM VALUES(:a), (:b) AS t(c1)'\n INTO total USING (5 AS a, 6 AS b);\n VALUES (total); -- Returns 11\nEND;\n\n-- Dynamic table operations\nBEGIN\n DECLARE table_name STRING DEFAULT 'my_catalog.my_schema.staging';\n EXECUTE IMMEDIATE 'TRUNCATE TABLE ' || table_name;\n EXECUTE IMMEDIATE 'INSERT INTO ' || table_name || ' SELECT * FROM source';\nEND;\n```\n\n---\n\n## Stored Procedures\n\n**Availability**: Public Preview -- Databricks Runtime 17.0+\n\nStored procedures persist SQL scripts in Unity Catalog and are invoked with `CALL`.\n\n### CREATE PROCEDURE\n\n**Syntax**:\n\n```sql\nCREATE [ OR REPLACE ] PROCEDURE [ IF NOT EXISTS ]\n procedure_name ( [ parameter [, ...] ] )\n characteristic [...]\n AS compound_statement\n```\n\n**Parameter definition**:\n\n```sql\n[ IN | OUT | INOUT ] parameter_name data_type\n [ DEFAULT default_expression ]\n [ COMMENT parameter_comment ]\n```\n\n| Parameter mode | Behavior |\n|---------------|----------|\n| `IN` (default) | Input-only; value passed into the procedure |\n| `OUT` | Output-only; initialized to `NULL`; final value returned on success |\n| `INOUT` | Input and output; accepts a value and returns the modified value on success |\n\n**Required characteristics**:\n\n| Characteristic | Description |\n|---------------|-------------|\n| `LANGUAGE SQL` | Specifies the implementation language |\n| `SQL SECURITY INVOKER` | Executes under the invoker's authority |\n\n**Optional characteristics**:\n\n| Characteristic | Description |\n|---------------|-------------|\n| `NOT DETERMINISTIC` | Procedure may return different results with identical inputs |\n| `MODIFIES SQL DATA` | Procedure modifies SQL data |\n| `COMMENT 'description'` | Human-readable description |\n| `DEFAULT COLLATION UTF8_BINARY` | Required when schema uses non-UTF8_BINARY collation (Runtime 17.1+) |\n\n**Rules**:\n\n- `OR REPLACE` and `IF NOT EXISTS` cannot be combined\n- Parameter names must be unique within the procedure\n- `DEFAULT` is not supported for `OUT` parameters\n- Once a parameter has a `DEFAULT`, all subsequent parameters must also have defaults\n- Default expressions cannot reference other parameters or contain subqueries\n- Body is validated syntactically at creation but semantically only at invocation\n\n**Example** -- ETL procedure with output parameters:\n\n```sql\nCREATE OR REPLACE PROCEDURE run_daily_etl(\n IN source_schema STRING,\n IN target_schema STRING,\n OUT rows_processed INT,\n OUT status STRING DEFAULT 'pending'\n)\nLANGUAGE SQL\nSQL SECURITY INVOKER\nCOMMENT 'Daily ETL pipeline for order processing'\nAS BEGIN\n DECLARE EXIT HANDLER FOR SQLEXCEPTION\n BEGIN\n SET status = 'failed';\n SET rows_processed = 0;\n END;\n\n -- Truncate and reload\n EXECUTE IMMEDIATE 'TRUNCATE TABLE ' || target_schema || '.orders_daily';\n\n EXECUTE IMMEDIATE\n 'INSERT INTO ' || target_schema || '.orders_daily '\n || 'SELECT * FROM ' || source_schema || '.orders '\n || 'WHERE order_date = current_date()';\n\n EXECUTE IMMEDIATE\n 'SELECT COUNT(*) FROM ' || target_schema || '.orders_daily'\n INTO rows_processed;\n\n SET status = 'success';\nEND;\n```\n\n### CALL (Invoke a Procedure)\n\n**Syntax**:\n\n```sql\nCALL procedure_name( [ argument [, ...] ] );\nCALL procedure_name( [ named_param => argument ] [, ...] );\n```\n\n**Rules**:\n\n- Supports up to 64 levels of nesting\n- For `IN` parameters: any expression castable to the parameter type, or `DEFAULT`\n- For `OUT`/`INOUT` parameters: must be a session variable or local variable\n- Arguments must match the data type of the parameter (use typed literals, e.g., `DATE'2025-01-01'`)\n- Fewer arguments allowed if remaining parameters have `DEFAULT` values\n- Not supported via ODBC\n\n**Example**:\n\n```sql\n-- Positional invocation\nDECLARE rows_out INT;\nDECLARE status_out STRING;\nCALL run_daily_etl('raw', 'silver', rows_out, status_out);\nSELECT rows_out, status_out;\n\n-- Named parameter invocation\nCALL run_daily_etl(\n target_schema => 'silver',\n source_schema => 'raw',\n rows_processed => rows_out,\n status => status_out\n);\n```\n\n### DROP PROCEDURE\n\n**Syntax**:\n\n```sql\nDROP PROCEDURE [ IF EXISTS ] procedure_name;\n```\n\n- Without `IF EXISTS`, dropping a non-existent procedure raises `ROUTINE_NOT_FOUND`\n- Requires `MANAGE` privilege, ownership of the procedure, or ownership of the containing schema/catalog/metastore\n\n**Example**:\n\n```sql\nDROP PROCEDURE IF EXISTS run_daily_etl;\n```\n\n### DESCRIBE PROCEDURE\n\n**Syntax**:\n\n```sql\n{ DESC | DESCRIBE } PROCEDURE [ EXTENDED ] procedure_name;\n```\n\n- Basic: returns procedure name and parameter list\n- `EXTENDED`: additionally returns owner, creation time, body, language, security type, determinism, data access, and configuration\n\n**Example**:\n\n```sql\nDESCRIBE PROCEDURE EXTENDED run_daily_etl;\n```\n\n### SHOW PROCEDURES\n\n**Syntax**:\n\n```sql\nSHOW PROCEDURES [ { FROM | IN } schema_name ];\n```\n\nReturns columns: `catalog`, `namespace`, `schema`, `procedure_name`.\n\n**Example**:\n\n```sql\nSHOW PROCEDURES IN my_catalog.my_schema;\n```\n\n---\n\n## Recursive CTEs\n\n**Availability**: Databricks Runtime 17.0+ and DBSQL 2025.20+\n\nRecursive CTEs enable self-referential queries for hierarchical data, graph traversal, and series generation.\n\n### WITH RECURSIVE Syntax\n\n```sql\nWITH RECURSIVE cte_name [ ( column_name [, ...] ) ]\n [ MAX RECURSION LEVEL max_level ] AS (\n base_case_query\n UNION ALL\n recursive_query\n )\nSELECT ... FROM cte_name;\n```\n\n### Anchor and Recursive Members\n\n| Component | Description |\n|-----------|-------------|\n| **Anchor (base case)** | Initial query providing seed rows; must NOT reference the CTE name |\n| **Recursive member** | References the CTE name; processes rows from the previous iteration |\n| **UNION ALL** | Combines anchor and recursive results (required) |\n\nThe recursive member reads rows produced by the previous iteration and generates new rows. Recursion terminates when the recursive member produces zero rows.\n\n### MAX RECURSION LEVEL\n\n```sql\nWITH RECURSIVE cte_name MAX RECURSION LEVEL 200 AS (...)\n```\n\n| Setting | Default | Description |\n|---------|---------|-------------|\n| Max recursion depth | 100 | Exceeding raises `RECURSION_LEVEL_LIMIT_EXCEEDED` |\n| Max result rows | 1,000,000 | Exceeding raises an error |\n| `LIMIT ALL` | N/A | Suspends the row limit (Runtime 17.2+) |\n\n### Use Cases and Examples\n\n**Generate a number series**:\n\n```sql\nWITH RECURSIVE numbers(n) AS (\n VALUES (1)\n UNION ALL\n SELECT n + 1 FROM numbers WHERE n \u003c 100\n)\nSELECT * FROM numbers;\n```\n\n**Organizational hierarchy traversal**:\n\n```sql\nWITH RECURSIVE org_tree AS (\n -- Anchor: start from the CEO\n SELECT employee_id, name, manager_id, name AS root_name, 0 AS depth\n FROM employees\n WHERE manager_id IS NULL\n\n UNION ALL\n\n -- Recursive: find direct reports\n SELECT e.employee_id, e.name, e.manager_id, t.root_name, t.depth + 1\n FROM employees e\n JOIN org_tree t ON e.manager_id = t.employee_id\n)\nSELECT * FROM org_tree ORDER BY depth, name;\n```\n\n**Graph traversal with cycle detection**:\n\n```sql\nWITH RECURSIVE search_graph(f, t, label, path, cycle) AS (\n -- Anchor: all edges as starting paths\n SELECT *, array(struct(g.f, g.t)), false\n FROM graph g\n\n UNION ALL\n\n -- Recursive: extend paths, detect cycles\n SELECT g.f, g.t, g.label,\n sg.path || array(struct(g.f, g.t)),\n array_contains(sg.path, struct(g.f, g.t))\n FROM graph g\n JOIN search_graph sg ON g.f = sg.t\n WHERE NOT sg.cycle\n)\nSELECT * FROM search_graph WHERE NOT cycle;\n```\n\n**String accumulation**:\n\n```sql\nWITH RECURSIVE r(col) AS (\n SELECT 'a'\n UNION ALL\n SELECT col || char(ascii(substr(col, -1)) + 1)\n FROM r\n WHERE length(col) \u003c 10\n)\nSELECT * FROM r;\n-- a, ab, abc, abcd, ..., abcdefghij\n```\n\n**Bill of Materials (BOM) explosion**:\n\n```sql\nWITH RECURSIVE bom AS (\n -- Anchor: top-level product\n SELECT part_id, component_id, quantity, 1 AS level\n FROM bill_of_materials\n WHERE part_id = 'PROD-001'\n\n UNION ALL\n\n -- Recursive: sub-components\n SELECT b.part_id, b.component_id, b.quantity * bom.quantity, bom.level + 1\n FROM bill_of_materials b\n JOIN bom ON b.part_id = bom.component_id\n)\nSELECT component_id, SUM(quantity) AS total_quantity, MAX(level) AS max_depth\nFROM bom\nGROUP BY component_id\nORDER BY total_quantity DESC;\n```\n\n### Limitations\n\n- Not supported in UPDATE, DELETE, or MERGE statements\n- Step (recursive) queries cannot include correlated column references to the CTE name\n- Random number generators may produce identical values across iterations\n- Default row limit of 1,000,000 rows (use `LIMIT ALL` in Runtime 17.2+ to override)\n- Default recursion depth of 100 (override with `MAX RECURSION LEVEL`)\n\n---\n\n## Multi-Statement Transactions\n\n### Overview and Current Status\n\nMulti-statement transactions (MST) allow grouping multiple SQL statements into atomic units that either succeed completely or fail completely.\n\n| Feature | Status | Notes |\n|---------|--------|-------|\n| Single-table transactions | GA | Delta Lake default; every DML statement is atomic |\n| Multi-statement transactions (SQL scripting) | Preview | `BEGIN ATOMIC...END` blocks |\n| Multi-statement transactions (Python connector) | Preview | `connection.autocommit = False` pattern |\n| Cross-table transactions | Preview | Atomic updates across multiple Delta tables |\n\n### SQL Scripting Atomic Blocks\n\nUse `BEGIN ATOMIC...END` to execute multiple statements as a single atomic unit:\n\n```sql\nBEGIN ATOMIC\n INSERT INTO customers (id, name) VALUES (1, 'Alice');\n INSERT INTO orders (id, customer_id, amount) VALUES (1, 1, 250.00);\n INSERT INTO audit_log (action, ts) VALUES ('new_customer_order', current_timestamp());\nEND;\n```\n\nIf any statement fails, all changes are rolled back.\n\n> **Note:** Tables used in `BEGIN ATOMIC` blocks must have the `catalogManaged` table feature enabled. Create tables with `TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported')`. Existing tables cannot be upgraded in place — they must be recreated with this property.\n\n### Python Connector Transaction API\n\nThe Databricks SQL Connector for Python provides explicit transaction control:\n\n```python\nfrom databricks import sql\n\nconnection = sql.connect(\n server_hostname=\"...\",\n http_path=\"...\",\n access_token=\"...\"\n)\n\n# Disable autocommit to start explicit transactions\nconnection.autocommit = False\ncursor = connection.cursor()\n\ntry:\n cursor.execute(\"INSERT INTO customers VALUES (1, 'Alice')\")\n cursor.execute(\"INSERT INTO orders VALUES (1, 1, 100.00)\")\n cursor.execute(\"INSERT INTO shipments VALUES (1, 1, 'pending')\")\n connection.commit() # All three succeed atomically\nexcept Exception:\n connection.rollback() # All three discarded\nfinally:\n connection.autocommit = True\n```\n\n**Key API methods**:\n\n| Method | Description |\n|--------|-------------|\n| `connection.autocommit = False` | Start explicit transaction mode |\n| `connection.commit()` | Commit the current transaction |\n| `connection.rollback()` | Discard all changes in the current transaction |\n| `connection.get_transaction_isolation()` | Returns current isolation level |\n| `connection.set_transaction_isolation(level)` | Sets isolation level |\n\n**Error handling**:\n\n- `sql.TransactionError` raised when committing without an active transaction\n- Cannot change `autocommit` while a transaction is active\n- `rollback()` is a safe no-op when no transaction is active\n\n### Isolation Levels\n\nDatabricks uses **Snapshot Isolation** (mapped to `REPEATABLE_READ` in standard SQL terminology).\n\n| Level | Description | Default |\n|-------|-------------|---------|\n| `WriteSerializable` | Only writes are serializable; concurrent writes may reorder | Yes (table default) |\n| `Serializable` | Both reads and writes are serializable; strictest isolation | No |\n| `REPEATABLE_READ` | Snapshot isolation for connector-level transactions | Connector default |\n\n**Setting isolation at table level**:\n\n```sql\nALTER TABLE my_table\nSET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable');\n```\n\n**Setting isolation in Python connector**:\n\n```python\nfrom databricks.sql import TRANSACTION_ISOLATION_LEVEL_REPEATABLE_READ\n\nconnection.set_transaction_isolation(TRANSACTION_ISOLATION_LEVEL_REPEATABLE_READ)\n# Only REPEATABLE_READ is supported; others raise NotSupportedError\n```\n\n**Snapshot isolation behavior**:\n\n- **Repeatable reads**: Data read within a transaction remains consistent\n- **Atomic commits**: Changes are invisible to other connections until committed\n- **Write conflicts**: Concurrent writes to the same table cause conflicts\n- **Cross-table writes**: Concurrent writes to different tables can succeed\n\n### Write Conflicts and Concurrency\n\n**Row-level concurrency** (Runtime 14.2+) reduces conflicts for tables with deletion vectors or liquid clustering:\n\n| Operation | WriteSerializable | Serializable |\n|-----------|------------------|--------------|\n| INSERT vs INSERT | No conflict | No conflict |\n| UPDATE/DELETE/MERGE vs same | No conflict (different rows) | May conflict |\n| OPTIMIZE vs concurrent DML | Conflict only with ZORDER BY | May conflict |\n\n**Common conflict exceptions**:\n\n| Exception | Cause |\n|-----------|-------|\n| `ConcurrentAppendException` | Concurrent append to the same partition |\n| `ConcurrentDeleteReadException` | Concurrent delete of files being read |\n| `MetadataChangedException` | Concurrent ALTER TABLE or schema change |\n| `ProtocolChangedException` | Protocol version upgrade during write |\n\n### Best Practices\n\n1. **Keep transactions short** to minimize conflict windows\n2. **Always wrap in try/except/finally** with rollback on errors\n3. **Restore autocommit** in the `finally` block\n4. **Use partition pruning** in MERGE conditions to reduce conflict scope\n5. **Enable row-level concurrency** (deletion vectors + liquid clustering) for high-concurrency workloads\n6. **Prefer single-statement MERGE** over multi-statement transactions when updating a single table\n7. **Commit and restart** transactions to see changes made by other connections\n\n---\n\n## Runtime Version Reference\n\n| Feature | Minimum Runtime | Status |\n|---------|----------------|--------|\n| SQL Scripting (compound statements, control flow) | 16.3 | GA |\n| Stored Procedures (CREATE/CALL/DROP PROCEDURE) | 17.0 | Public Preview |\n| Recursive CTEs (WITH RECURSIVE) | 17.0 / DBSQL 2025.20 | GA |\n| Multi-variable DECLARE | 17.2 | GA |\n| EXECUTE IMMEDIATE (basic) | 14.3 | GA |\n| EXECUTE IMMEDIATE (expressions, nested) | 17.3 | GA |\n| Recursive CTE LIMIT ALL | 17.2 | GA |\n| Multi-statement Transactions | Varies | Preview |\n| Row-level Concurrency | 14.2 | GA |\n\n---\n\n## Quick Reference Card\n\n### SQL Scripting Skeleton\n\n```sql\nBEGIN\n -- 1. Declarations\n DECLARE var1 INT DEFAULT 0;\n DECLARE var2 STRING;\n DECLARE my_error CONDITION FOR SQLSTATE '45000';\n DECLARE EXIT HANDLER FOR SQLEXCEPTION\n BEGIN\n -- error handling logic\n END;\n\n -- 2. Logic\n IF var1 > 0 THEN\n SET var2 = 'positive';\n ELSE\n SET var2 = 'non-positive';\n END IF;\n\n -- 3. Output\n VALUES (var1, var2);\nEND;\n```\n\n### Stored Procedure Skeleton\n\n```sql\nCREATE OR REPLACE PROCEDURE my_schema.my_proc(\n IN input_param STRING,\n OUT output_param INT\n)\nLANGUAGE SQL\nSQL SECURITY INVOKER\nCOMMENT 'Description of what this procedure does'\nAS BEGIN\n DECLARE EXIT HANDLER FOR SQLEXCEPTION\n SET output_param = -1;\n\n -- procedure body\n SET output_param = (SELECT COUNT(*) FROM my_table WHERE col = input_param);\nEND;\n\n-- Invoke\nDECLARE result INT;\nCALL my_schema.my_proc('value', result);\nSELECT result;\n```\n\n### Recursive CTE Skeleton\n\n```sql\nWITH RECURSIVE cte_name (col1, col2) MAX RECURSION LEVEL 50 AS (\n -- Anchor\n SELECT seed_col1, seed_col2\n FROM base_table\n WHERE condition\n\n UNION ALL\n\n -- Recursive step\n SELECT derived_col1, derived_col2\n FROM source_table s\n JOIN cte_name c ON s.parent = c.col1\n)\nSELECT * FROM cte_name;\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":28320,"content_sha256":"7275ea7a6cf953fb8ea846240792e5353d4b0128bc789ea984789135d9caf737"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Databricks SQL (DBSQL) - Advanced Features","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quick Reference","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Feature","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Key Syntax","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Since","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Reference","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SQL Scripting","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"BEGIN...END","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"DECLARE","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"IF/WHILE/FOR","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 16.3+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/sql-scripting.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/sql-scripting.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Stored Procedures","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"CREATE PROCEDURE","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"CALL","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 17.0+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/sql-scripting.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/sql-scripting.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Recursive CTEs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"WITH RECURSIVE","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 17.0+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/sql-scripting.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/sql-scripting.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Transactions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"BEGIN ATOMIC...END","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Preview","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/sql-scripting.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/sql-scripting.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Materialized Views","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"CREATE MATERIALIZED VIEW","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Pro/Serverless","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/materialized-views-pipes.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/materialized-views-pipes.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Temp Tables","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"CREATE TEMPORARY TABLE","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"All","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/materialized-views-pipes.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/materialized-views-pipes.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Pipe Syntax","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"|>","type":"text","marks":[{"type":"code_inline"}]},{"text":" operator","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 16.1+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/materialized-views-pipes.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/materialized-views-pipes.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Geospatial (H3)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"h3_longlatash3()","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"h3_polyfillash3()","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 11.2+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/geospatial-collations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/geospatial-collations.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Geospatial (ST)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"ST_Point()","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"ST_Contains()","type":"text","marks":[{"type":"code_inline"}]},{"text":", 80+ funcs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 16.0+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/geospatial-collations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/geospatial-collations.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Collations","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"COLLATE","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"UTF8_LCASE","type":"text","marks":[{"type":"code_inline"}]},{"text":", locale-aware","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 16.1+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/geospatial-collations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/geospatial-collations.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"AI Functions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"ai_query()","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"ai_classify()","type":"text","marks":[{"type":"code_inline"}]},{"text":", 11+ funcs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBR 15.1+","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/ai-functions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/ai-functions.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http_request","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http_request(conn, ...)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Pro/Serverless","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/ai-functions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/ai-functions.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"remote_query","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SELECT * FROM remote_query(...)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Pro/Serverless","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/ai-functions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/ai-functions.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"read_files","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SELECT * FROM read_files(...)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"All","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/ai-functions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/ai-functions.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Data Modeling","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Star schema, Liquid Clustering","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"All","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/best-practices.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/best-practices.md","title":null}}]}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Common Patterns","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"SQL Scripting - Procedural ETL","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"BEGIN\n DECLARE v_count INT;\n DECLARE v_status STRING DEFAULT 'pending';\n\n SET v_count = (SELECT COUNT(*) FROM catalog.schema.raw_orders WHERE status = 'new');\n\n IF v_count > 0 THEN\n INSERT INTO catalog.schema.processed_orders\n SELECT *, current_timestamp() AS processed_at\n FROM catalog.schema.raw_orders\n WHERE status = 'new';\n\n SET v_status = 'completed';\n ELSE\n SET v_status = 'skipped';\n END IF;\n\n SELECT v_status AS result, v_count AS rows_processed;\nEND","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Stored Procedure with Error Handling","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"CREATE OR REPLACE PROCEDURE catalog.schema.upsert_customers(\n IN p_source STRING,\n OUT p_rows_affected INT\n)\nLANGUAGE SQL\nSQL SECURITY INVOKER\nBEGIN\n DECLARE EXIT HANDLER FOR SQLEXCEPTION\n BEGIN\n SET p_rows_affected = -1;\n SIGNAL SQLSTATE '45000'\n SET MESSAGE_TEXT = concat('Upsert failed for source: ', p_source);\n END;\n\n MERGE INTO catalog.schema.dim_customer AS t\n USING (SELECT * FROM identifier(p_source)) AS s\n ON t.customer_id = s.customer_id\n WHEN MATCHED THEN UPDATE SET *\n WHEN NOT MATCHED THEN INSERT *;\n\n SET p_rows_affected = (SELECT COUNT(*) FROM identifier(p_source));\nEND;\n\n-- Invoke:\nCALL catalog.schema.upsert_customers('catalog.schema.staging_customers', ?);","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Materialized View with Scheduled Refresh","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.daily_revenue\n CLUSTER BY (order_date)\n SCHEDULE EVERY 1 HOUR\n COMMENT 'Hourly-refreshed daily revenue by region'\nAS SELECT\n order_date,\n region,\n SUM(amount) AS total_revenue,\n COUNT(DISTINCT customer_id) AS unique_customers\nFROM catalog.schema.fact_orders\nJOIN catalog.schema.dim_store USING (store_id)\nGROUP BY order_date, region;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pipe Syntax - Readable Transformations","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Traditional SQL rewritten with pipe syntax\nFROM catalog.schema.fact_orders\n |> WHERE order_date >= current_date() - INTERVAL 30 DAYS\n |> AGGREGATE SUM(amount) AS total, COUNT(*) AS cnt GROUP BY region, product_category\n |> WHERE total > 10000\n |> ORDER BY total DESC\n |> LIMIT 20;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"AI Functions - Enrich Data with LLMs","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Classify support tickets\nSELECT\n ticket_id,\n description,\n ai_classify(description, ARRAY('billing', 'technical', 'account', 'feature_request')) AS category,\n ai_analyze_sentiment(description) AS sentiment\nFROM catalog.schema.support_tickets\nLIMIT 100;\n\n-- Extract entities from text\nSELECT\n doc_id,\n ai_extract(content, ARRAY('person_name', 'company', 'dollar_amount')) AS entities\nFROM catalog.schema.contracts;\n\n-- General-purpose AI query with structured output\nSELECT ai_query(\n 'databricks-meta-llama-3-3-70b-instruct',\n concat('Summarize this customer feedback in JSON with keys: topic, sentiment, action_items. Feedback: ', feedback),\n returnType => 'STRUCT\u003ctopic STRING, sentiment STRING, action_items ARRAY\u003cSTRING>>'\n) AS analysis\nFROM catalog.schema.customer_feedback\nLIMIT 50;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Geospatial - Proximity Search with H3","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Find stores within 5km of each customer using H3 indexing\nWITH customer_h3 AS (\n SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell\n FROM catalog.schema.customers\n),\nstore_h3 AS (\n SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell\n FROM catalog.schema.stores\n)\nSELECT\n c.customer_id,\n s.store_id,\n ST_Distance(\n ST_Point(c.longitude, c.latitude),\n ST_Point(s.longitude, s.latitude)\n ) AS distance_m\nFROM customer_h3 c\nJOIN store_h3 s ON h3_ischildof(c.h3_cell, h3_toparent(s.h3_cell, 5))\nWHERE ST_Distance(\n ST_Point(c.longitude, c.latitude),\n ST_Point(s.longitude, s.latitude)\n) \u003c 5000;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Collation - Case-Insensitive Search","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Create table with case-insensitive collation\nCREATE TABLE catalog.schema.products (\n product_id BIGINT GENERATED ALWAYS AS IDENTITY,\n name STRING COLLATE UTF8_LCASE,\n category STRING COLLATE UTF8_LCASE,\n price DECIMAL(10, 2)\n);\n\n-- Queries automatically case-insensitive (no LOWER() needed)\nSELECT * FROM catalog.schema.products\nWHERE name = 'MacBook Pro'; -- matches 'macbook pro', 'MACBOOK PRO', etc.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"http_request - Call External APIs","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Set up connection first (one-time)\nCREATE CONNECTION my_api_conn\n TYPE HTTP\n OPTIONS (host 'https://api.example.com', bearer_token secret('scope', 'token'));\n\n-- Call API from SQL\nSELECT\n order_id,\n http_request(\n conn => 'my_api_conn',\n method => 'POST',\n path => '/v1/validate',\n json => to_json(named_struct('order_id', order_id, 'amount', amount))\n ).text AS api_response\nFROM catalog.schema.orders\nWHERE needs_validation = true;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"read_files - Ingest Raw Files","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Read JSON files from a Volume with schema hints\nSELECT *\nFROM read_files(\n '/Volumes/catalog/schema/raw/events/',\n format => 'json',\n schemaHints => 'event_id STRING, timestamp TIMESTAMP, payload MAP\u003cSTRING, STRING>',\n pathGlobFilter => '*.json',\n recursiveFileLookup => true\n);\n\n-- Read CSV with options\nSELECT *\nFROM read_files(\n '/Volumes/catalog/schema/raw/sales/',\n format => 'csv',\n header => true,\n delimiter => '|',\n dateFormat => 'yyyy-MM-dd',\n schema => 'sale_id INT, sale_date DATE, amount DECIMAL(10,2), store STRING'\n);","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Recursive CTE - Hierarchy Traversal","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"WITH RECURSIVE org_chart AS (\n -- Anchor: top-level managers\n SELECT employee_id, name, manager_id, 0 AS depth, ARRAY(name) AS path\n FROM catalog.schema.employees\n WHERE manager_id IS NULL\n\n UNION ALL\n\n -- Recursive: direct reports\n SELECT e.employee_id, e.name, e.manager_id, o.depth + 1, array_append(o.path, e.name)\n FROM catalog.schema.employees e\n JOIN org_chart o ON e.manager_id = o.employee_id\n WHERE o.depth \u003c 10 -- safety limit\n)\nSELECT * FROM org_chart ORDER BY depth, name;","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"remote_query - Federated Queries","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"sql"},"content":[{"text":"-- Query PostgreSQL via Lakehouse Federation\nSELECT *\nFROM remote_query(\n 'my_postgres_connection',\n database => 'my_database',\n query => 'SELECT customer_id, email, created_at FROM customers WHERE active = true'\n);","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Reference Files","type":"text"}]},{"type":"paragraph","content":[{"text":"Load these for detailed syntax, full parameter lists, and advanced patterns:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"File","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Contents","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"When to Read","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/sql-scripting.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/sql-scripting.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SQL Scripting, Stored Procedures, Recursive CTEs, Transactions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"User needs procedural SQL, error handling, loops, dynamic SQL","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/materialized-views-pipes.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/materialized-views-pipes.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Materialized Views, Temp Tables/Views, Pipe Syntax","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"User needs MVs, refresh scheduling, temp objects, pipe operator","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/geospatial-collations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/geospatial-collations.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"39 H3 functions, 80+ ST functions, Collation types and hierarchy","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"User needs spatial analysis, H3 indexing, case/accent handling","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/ai-functions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/ai-functions.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"13 AI functions, http_request, remote_query, read_files (all options)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"User needs AI enrichment, API calls, federation, file ingestion","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/best-practices.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/best-practices.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Data modeling, performance, Liquid Clustering, anti-patterns","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"User needs architecture guidance, optimization, or modeling advice","type":"text"}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Key Guidelines","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Always use Serverless SQL warehouses","type":"text","marks":[{"type":"strong"}]},{"text":" for AI functions, MVs, and http_request","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text","marks":[{"type":"strong"}]},{"text":"LIMIT","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" during development","type":"text","marks":[{"type":"strong"}]},{"text":" with AI functions to control costs","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Prefer Liquid Clustering over partitioning","type":"text","marks":[{"type":"strong"}]},{"text":" for new tables (1-4 keys max)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text","marks":[{"type":"strong"}]},{"text":"CLUSTER BY AUTO","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" when unsure about clustering keys","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Star schema in Gold layer","type":"text","marks":[{"type":"strong"}]},{"text":" for BI; OBT acceptable in Silver","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Define PK/FK constraints","type":"text","marks":[{"type":"strong"}]},{"text":" on dimensional models for query optimization","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text","marks":[{"type":"strong"}]},{"text":"COLLATE UTF8_LCASE","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" for user-facing string columns that need case-insensitive search","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Test SQL via CLI","type":"text","marks":[{"type":"strong"}]},{"text":" (","type":"text"},{"text":"databricks experimental aitools tools query","type":"text","marks":[{"type":"code_inline"}]},{"text":") or notebooks before deploying. If ","type":"text"},{"text":"--warehouse","type":"text","marks":[{"type":"code_inline"}]},{"text":" is rejected on your CLI version, set ","type":"text"},{"text":"DATABRICKS_WAREHOUSE_ID","type":"text","marks":[{"type":"code_inline"}]},{"text":" in the environment instead.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"databricks-dbsql","author":"@skillopedia","source":{"stars":140,"repo_name":"databricks-agent-skills","origin_url":"https://github.com/databricks/databricks-agent-skills/blob/HEAD/experimental/databricks-dbsql/SKILL.md","repo_owner":"databricks","body_sha256":"6ab2c30dde2a733430b1313d5b342be27fe97afc152dc03ac2c81aeeebc8d821","cluster_key":"b6964a2f941a7fd0e8845b846f6c532009260bd1ec6d4b40facadcd93bdbf412","clean_bundle":{"format":"clean-skill-bundle-v1","source":"databricks/databricks-agent-skills/experimental/databricks-dbsql/SKILL.md","attachments":[{"id":"2a03ea96-6f7c-5855-b725-decd0879d0bc","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/2a03ea96-6f7c-5855-b725-decd0879d0bc/attachment.yaml","path":"agents/openai.yaml","size":638,"sha256":"5a3ea80e8f020d1223ffc3eae2a9d46678650405344ca82e18cd6a708d3a5c37","contentType":"application/yaml; charset=utf-8"},{"id":"0fc0a7cc-d4fa-5556-89fc-c0c4513449fd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0fc0a7cc-d4fa-5556-89fc-c0c4513449fd/attachment.png","path":"assets/databricks.png","size":15366,"sha256":"06873c31ebda49d6ec5a730749c3dd1d0a140495f235cb1c0256cb2a3d1de850","contentType":"image/png"},{"id":"1640e281-6d8e-59af-a816-63d251b26d8b","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/1640e281-6d8e-59af-a816-63d251b26d8b/attachment.svg","path":"assets/databricks.svg","size":582,"sha256":"00e528e4d5e665dbb5784e3d028b74ecf74405e5ff5a9c75a4d8661f05bda91c","contentType":"image/svg+xml"},{"id":"608b4c00-cbce-5dc0-a47e-ba2aa541d9d9","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/608b4c00-cbce-5dc0-a47e-ba2aa541d9d9/attachment.md","path":"references/ai-functions.md","size":43379,"sha256":"cb6d3f1b4075a99519be7acbe500efbc07390c1433a4fe73632c54daf61a4532","contentType":"text/markdown; charset=utf-8"},{"id":"93f8ba53-b267-5050-bc88-397d3224555c","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/93f8ba53-b267-5050-bc88-397d3224555c/attachment.md","path":"references/best-practices.md","size":23174,"sha256":"fabd6fa7457fe06373fd9c26925dcd3f104cf4eead870c80cda50d9f51deebfe","contentType":"text/markdown; charset=utf-8"},{"id":"b709de32-2349-526e-b095-7229619a9012","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b709de32-2349-526e-b095-7229619a9012/attachment.md","path":"references/geospatial-collations.md","size":28721,"sha256":"23579de0add0babf26cc98a326a4825e392d1ee9b2785d3b2fcb8d60c8f3453d","contentType":"text/markdown; charset=utf-8"},{"id":"10a54039-41de-597a-9a73-6a0e6eb492af","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/10a54039-41de-597a-9a73-6a0e6eb492af/attachment.md","path":"references/materialized-views-pipes.md","size":23144,"sha256":"c19da7c7c10f5f9544e07ae55b47360f9d609c586f99a5c406c6ebfb851983c7","contentType":"text/markdown; charset=utf-8"},{"id":"778674df-8712-5e32-ae77-7c3e762fe9ff","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/778674df-8712-5e32-ae77-7c3e762fe9ff/attachment.md","path":"references/sql-scripting.md","size":28320,"sha256":"7275ea7a6cf953fb8ea846240792e5353d4b0128bc789ea984789135d9caf737","contentType":"text/markdown; charset=utf-8"}],"bundle_sha256":"0567a09b49e93a12d4d5b83d5e1cbb9c3984faacbf8713d07087e066d939a540","attachment_count":8,"text_attachments":6,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":2,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"experimental/databricks-dbsql/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"data-analytics","category_label":"Data"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"data-analytics","import_tag":"clean-skills-v1","description":"Databricks SQL (DBSQL) advanced features and SQL warehouse capabilities. This skill MUST be invoked when the user mentions: \"DBSQL\", \"Databricks SQL\", \"SQL warehouse\", \"SQL scripting\", \"stored procedure\", \"CALL procedure\", \"materialized view\", \"CREATE MATERIALIZED VIEW\", \"pipe syntax\", \"|>\", \"geospatial\", \"H3\", \"ST_\", \"spatial SQL\", \"collation\", \"COLLATE\", \"ai_query\", \"ai_classify\", \"ai_extract\", \"ai_gen\", \"AI function\", \"http_request\", \"remote_query\", \"read_files\", \"Lakehouse Federation\", \"recursive CTE\", \"WITH RECURSIVE\", \"multi-statement transaction\", \"temp table\", \"temporary view\", \"pipe operator\". SHOULD also invoke when the user asks about SQL best practices, data modeling patterns, or advanced SQL features on Databricks."}},"renderedAt":1782987098690}

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.