Author: sainath

  • Snowflake SQL Tutorial: Master MERGE ALL BY NAME in 2025

    Snowflake SQL Tutorial: Master MERGE ALL BY NAME in 2025

    Revolutionary SQL Features That Transform Data Engineering

    In 2025, Snowflake has introduced groundbreaking improvements that fundamentally change how data engineers write queries. This Snowflake SQL tutorial covers the latest features including MERGE ALL BY NAME, UNION BY NAME, and Cortex AISQL. Whether you’re learning Snowflake SQL or optimizing existing code, this tutorial demonstrates how these enhancements eliminate tedious column mapping, reduce errors, and dramatically simplify complex data operations.

    The star feature? MERGE ALL BY NAME, announced on September 29, 2025, automatically matches columns by name, eliminating the need to manually map every column when upserting data. This Snowflake SQL tutorial will show you how this single feature can transform a 50-line MERGE statement into just 5 lines.

    But that’s not all. This SQL tutorial also covers:

    • UNION BY NAME for flexible data combining
    • Cortex AISQL for AI-powered SQL functions
    • Enhanced PIVOT/UNPIVOT with aliasing
    • Snowflake Scripting UDFs for procedural SQL
    • Lambda expressions in higher-order functions

    For data engineers, these improvements mean less boilerplate code, fewer errors, and more time focused on solving business problems rather than wrestling with SQL syntax.

    UNION BY NAME combining tables with different schemas and column orders flexibly

    Snowflake Scripting UDF showing procedural logic with conditionals and loops



    Snowflake SQL Tutorial: MERGE ALL BY NAME Feature

    This Snowflake SQL tutorial begins with the most impactful feature of 2025…

    Announced on September 29, 2025, MERGE ALL BY NAME is arguably the most impactful SQL improvement Snowflake has released this year. This feature automatically matches columns between source and target tables based on column names rather than positions.

    The SQL Problem MERGE ALL BY NAME Solves

    Traditionally, writing a MERGE statement required manually listing and mapping each column:

    Productivity comparison showing OLD manual MERGE versus NEW automatic MERGE ALL BY NAME

    sql

    -- OLD WAY: Manual column mapping (tedious and error-prone)
    MERGE INTO customer_target t
    USING customer_updates s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
      UPDATE SET
        t.first_name = s.first_name,
        t.last_name = s.last_name,
        t.email = s.email,
        t.phone = s.phone,
        t.address = s.address,
        t.city = s.city,
        t.state = s.state,
        t.zip_code = s.zip_code,
        t.country = s.country,
        t.updated_date = s.updated_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, first_name, last_name, email, phone, 
              address, city, state, zip_code, country, updated_date)
      VALUES (s.customer_id, s.first_name, s.last_name, s.email, 
              s.phone, s.address, s.city, s.state, s.zip_code, 
              s.country, s.updated_date);

    This approach suffers from multiple pain points:

    • Manual mapping for every single column
    • High risk of typos and mismatches
    • Difficult maintenance when schemas evolve
    • Time-consuming for tables with many columns

    The Snowflake SQL Solution: MERGE ALL BY NAME

    With MERGE ALL BY NAME, the same operation becomes elegantly simple:

    sql

    -- NEW WAY: Automatic column matching (clean and reliable)
    MERGE INTO customer_target
    USING customer_updates
    ON customer_target.customer_id = customer_updates.customer_id
    WHEN MATCHED THEN
      UPDATE ALL BY NAME
    WHEN NOT MATCHED THEN
      INSERT ALL BY NAME;

    That’s it! Two short clauses replace more than 20 lines of manual column mapping.

    How MERGE ALL BY NAME Works

    Snowflake MERGE ALL BY NAME automatically matching columns by name regardless of position

    The magic happens through intelligent column name matching:

    1. Snowflake analyzes both target and source tables
    2. It identifies columns with matching names
    3. It automatically maps columns regardless of position
    4. It handles different column orders seamlessly
    5. It executes the MERGE with proper type conversion

    Importantly, MERGE ALL BY NAME works even when:

    • Columns are in different orders
    • Column names use different casing (unquoted identifiers are case-insensitive in Snowflake)
    • Target and source were created by different systems, as long as they expose the same column names

    Requirements for MERGE ALL BY NAME

    For this feature to work correctly:

    • Target and source must have the same number of matching columns
    • Column names must be identical (case-insensitive)
    • Data types must be compatible (Snowflake handles automatic casting)

    However, column order doesn’t matter:

    sql

    -- This works perfectly!
    CREATE TABLE target (
      id INT,
      name VARCHAR,
      email VARCHAR,
      created_date DATE
    );
    
    CREATE TABLE source (
      created_date DATE,  -- Different order
      email VARCHAR,       -- Different order
      id INT,             -- Different order
      name VARCHAR        -- Different order
    );
    
    MERGE INTO target
    USING source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE ALL BY NAME
    WHEN NOT MATCHED THEN INSERT ALL BY NAME;

    Snowflake intelligently matches id with id, name with name, etc., regardless of position.

    Real-World Use Case: Slowly Changing Dimensions

    Consider implementing a Type 1 SCD (Slowly Changing Dimension) for product data:

    sql

    -- Product dimension table
    CREATE OR REPLACE TABLE dim_product (
      product_id INT PRIMARY KEY,
      product_name VARCHAR,
      category VARCHAR,
      price DECIMAL(10,2),
      description VARCHAR,
      supplier_id INT,
      last_updated TIMESTAMP
    );
    
    -- Daily product updates from source system
    CREATE OR REPLACE TABLE product_updates (
      product_id INT,
      description VARCHAR,  -- Different column order
      price DECIMAL(10,2),
      product_name VARCHAR,
      category VARCHAR,
      supplier_id INT,
      last_updated TIMESTAMP
    );
    
    -- SCD Type 1: Upsert with MERGE ALL BY NAME
    MERGE INTO dim_product
    USING product_updates
    ON dim_product.product_id = product_updates.product_id
    WHEN MATCHED THEN
      UPDATE ALL BY NAME
    WHEN NOT MATCHED THEN
      INSERT ALL BY NAME;

    This handles:

    • Updating existing products with latest information
    • Inserting new products automatically
    • Different column orders between systems
    • All columns without manual mapping

    Benefits of MERGE ALL BY NAME

    Data engineers report significant advantages:

    Time Savings:

    • 90% less code for MERGE statements
    • 5 minutes instead of 30 minutes to write complex merges
    • Faster schema evolution without code changes

    Error Reduction:

    • Zero typos from manual column mapping
    • No mismatched columns from copy-paste errors
    • Automatic validation by Snowflake

    Maintenance Simplification:

    • Schema changes don’t require code updates (see the sketch after this list)
    • New columns automatically included
    • Removed columns handled gracefully

    Code Readability:

    • Clear intent from simple syntax
    • Easy review in code reviews
    • Self-documenting logic
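
    To see the maintenance benefit in action, here is a minimal sketch of schema evolution. It assumes a hypothetical loyalty_tier column is added to both tables from the earlier example; the existing statement then picks up the new column without any edits.

    sql

    -- Hypothetical schema change: the same column is added to target and source
    ALTER TABLE customer_target ADD COLUMN loyalty_tier VARCHAR;
    ALTER TABLE customer_updates ADD COLUMN loyalty_tier VARCHAR;

    -- The existing statement needs no changes; the new column is matched by name
    MERGE INTO customer_target
    USING customer_updates
    ON customer_target.customer_id = customer_updates.customer_id
    WHEN MATCHED THEN UPDATE ALL BY NAME
    WHEN NOT MATCHED THEN INSERT ALL BY NAME;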

    Snowflake SQL UNION BY NAME: Flexible Data Combining

    This section of our Snowflake SQL tutorial explores UNION BY NAME. Introduced at Snowflake Summit 2025, UNION BY NAME revolutionizes how we combine datasets from different sources by matching columns by name rather than by position.

    The Traditional UNION Problem

    For years, SQL developers struggled with UNION ALL’s rigid requirements:

    sql

    -- TRADITIONAL UNION ALL: Matches columns strictly by position
    SELECT id, name, department
    FROM employees
    UNION ALL
    SELECT emp_id, emp_name, dept  -- Names are ignored; only position and count matter
    FROM contingent_workers;

    This approach is fragile because:

    • Columns are matched by position, not name
    • A column count mismatch raises an error; a different column order silently misaligns data
    • Adding columns breaks existing queries
    • Schema evolution requires constant maintenance

    UNION BY NAME Solution

    With UNION BY NAME, column matching happens by name:

    sql

    -- NEW: UNION BY NAME matches columns by name
    CREATE TABLE employees (
      id INT,
      name VARCHAR,
      department VARCHAR,
      role VARCHAR
    );
    
    CREATE TABLE contingent_workers (
      id INT,
      name VARCHAR,
      department VARCHAR
      -- Note: No 'role' column
    );
    
    SELECT * FROM employees
    UNION ALL BY NAME
    SELECT * FROM contingent_workers;
    
    -- Result: Combines by name, fills missing 'role' with NULL

    Output:

    ID | NAME    | DEPARTMENT | ROLE
    ---+---------+------------+--------
    1  | Alice   | Sales      | Manager
    2  | Bob     | IT         | Developer
    3  | Charlie | Sales      | NULL
    4  | Diana   | IT         | NULL

    Key behaviors:

    • Columns matched by name, not position
    • Missing columns filled with NULL
    • Extra columns included automatically
    • Order doesn’t matter

    Use Cases for UNION BY NAME

    This feature excels in several scenarios:

    Merging Legacy and Modern Systems:

    sql

    -- Legacy system with old column names
    SELECT 
      cust_id AS customer_id,
      cust_name AS name,
      phone_num AS phone
    FROM legacy_customers
    
    UNION ALL BY NAME
    
    -- Modern system with new column names
    SELECT
      customer_id,
      name,
      phone,
      email  -- New column not in legacy
    FROM modern_customers;

    Combining Data from Multiple Regions:

    sql

    -- Different regions have different optional fields
    SELECT * FROM us_sales        -- Has 'state' column
    UNION ALL BY NAME
    SELECT * FROM eu_sales        -- Has 'country' column
    UNION ALL BY NAME
    SELECT * FROM asia_sales;     -- Has 'region' column

    Incremental Schema Evolution:

    sql

    -- Historical data without new fields
    SELECT * FROM sales_2023
    
    UNION ALL BY NAME
    
    -- Current data with additional tracking
    SELECT * FROM sales_2024      -- Added 'source_channel' column
    
    UNION ALL BY NAME
    
    SELECT * FROM sales_2025;     -- Added 'attribution_id' column

    Performance Considerations

    While powerful, UNION BY NAME has slight overhead:

    When to use UNION BY NAME:

    • Schemas differ across sources
    • Evolution happens frequently
    • Maintainability matters more than marginal performance

    When to use traditional UNION ALL:

    • Schemas are identical and stable
    • Maximum performance is critical
    • Large-scale production queries with billions of rows

    Best practice: Use UNION BY NAME for data integration and ELT pipelines where flexibility outweighs marginal performance costs.


    Cortex AISQL: AI-Powered SQL Functions

    Announced on June 2, 2025, Cortex AISQL brings powerful AI capabilities directly into Snowflake’s SQL engine, enabling AI pipelines with familiar SQL commands.

    Revolutionary AI Functions

    Cortex AISQL introduces three groundbreaking SQL functions:

    AI_FILTER: Intelligent Data Filtering

    Filter data using natural language questions instead of complex WHERE clauses:

    sql

    -- Traditional approach: Complex WHERE clause
    SELECT *
    FROM customer_reviews
    WHERE (
      LOWER(review_text) LIKE '%excellent%' OR
      LOWER(review_text) LIKE '%amazing%' OR
      LOWER(review_text) LIKE '%outstanding%' OR
      LOWER(review_text) LIKE '%fantastic%'
    ) AND (
      sentiment_score > 0.7
    );
    
    -- AI_FILTER approach: Natural language
    SELECT *
    FROM customer_reviews
    WHERE AI_FILTER(review_text, 'Is this a positive review praising the product?');

    Use cases:

    • Filtering images by content (“Does this image contain a person?”)
    • Classifying text by intent (“Is this a complaint?”)
    • Quality control (“Is this product photo high quality?”)

    AI_CLASSIFY: Intelligent Classification

    Classify text or images into user-defined categories:

    sql

    -- Classify customer support tickets automatically
    SELECT 
      ticket_id,
      subject,
      AI_CLASSIFY(
        description,
        ['Technical Issue', 'Billing Question', 'Feature Request', 
         'Bug Report', 'Account Access']
      ) AS ticket_category
    FROM support_tickets;
    
    -- Multi-label classification
    SELECT
      product_id,
      AI_CLASSIFY(
        product_description,
        ['Electronics', 'Clothing', 'Home & Garden', 'Sports'],
        'multi_label'
      ) AS categories
    FROM products;

    Advantages:

    • No training required
    • Plain-language category definitions
    • Single or multi-label classification
    • Works on text and images

    AI_AGG: Intelligent Aggregation

    Aggregate text columns and extract insights across multiple rows:

    sql

    -- Traditional: Difficult to get insights from text
    SELECT 
      product_id,
      LISTAGG(review_text, ' | ')  -- Just concatenates
    FROM reviews
    GROUP BY product_id;
    
    -- AI_AGG: Extract meaningful insights
    SELECT
      product_id,
      AI_AGG(
        review_text,
        'Summarize the common themes in these reviews, highlighting both positive and negative feedback'
      ) AS review_summary
    FROM reviews
    GROUP BY product_id;

    Key benefit: Not subject to context window limitations—can process unlimited rows.

    Cortex AISQL Real-World Example

    Complete pipeline for analyzing customer feedback:

    Real-world Cortex AISQL pipeline filtering, classifying, and aggregating customer feedback

    sql

    -- Step 1: Filter relevant feedback
    CREATE OR REPLACE TABLE relevant_feedback AS
    SELECT *
    FROM customer_feedback
    WHERE AI_FILTER(feedback_text, 'Is this feedback about product quality or features?');
    
    -- Step 2: Classify feedback by category
    CREATE OR REPLACE TABLE categorized_feedback AS
    SELECT
      feedback_id,
      customer_id,
      AI_CLASSIFY(
        feedback_text,
        ['Product Quality', 'Feature Request', 'User Experience', 
         'Performance', 'Pricing']
      ) AS feedback_category,
      feedback_text
    FROM relevant_feedback;
    
    -- Step 3: Aggregate insights by category
    SELECT
      feedback_category,
      COUNT(*) AS feedback_count,
      AI_AGG(
        feedback_text,
        'Summarize the key points from this feedback, identifying the top 3 issues or requests mentioned'
      ) AS category_insights
    FROM categorized_feedback
    GROUP BY feedback_category;

    This replaces:

    • Hours of manual review
    • Complex NLP pipelines with external tools
    • Expensive ML model training and deployment

    Enhanced PIVOT and UNPIVOT with Aliases

    Snowflake 2025 adds aliasing capabilities to PIVOT and UNPIVOT operations, improving readability and flexibility.

    PIVOT with Column Aliases

    Now you can specify aliases for pivot column names:

    sql

    -- Sample data: Monthly sales by product
    CREATE OR REPLACE TABLE monthly_sales (
      product VARCHAR,
      month VARCHAR,
      sales_amount DECIMAL(10,2)
    );
    
    INSERT INTO monthly_sales VALUES
      ('Laptop', 'Jan', 50000),
      ('Laptop', 'Feb', 55000),
      ('Laptop', 'Mar', 60000),
      ('Phone', 'Jan', 30000),
      ('Phone', 'Feb', 35000),
      ('Phone', 'Mar', 40000);
    
    -- PIVOT with aliases for readable column names
    SELECT *
    FROM monthly_sales
    PIVOT (
      SUM(sales_amount)
      FOR month IN ('Jan', 'Feb', 'Mar')
    ) AS pivot_alias (
      product,
      january_sales,      -- Custom alias instead of 'Jan'
      february_sales,     -- Custom alias instead of 'Feb'
      march_sales         -- Custom alias instead of 'Mar'
    );

    Output:

    PRODUCT | JANUARY_SALES | FEBRUARY_SALES | MARCH_SALES
    --------+---------------+----------------+-------------
    Laptop  | 50000         | 55000          | 60000
    Phone   | 30000         | 35000          | 40000

    Benefits:

    • Readable column names
    • Business-friendly output
    • Easier downstream consumption
    • Better documentation

    UNPIVOT with Aliases

    Similarly, UNPIVOT now supports aliases:

    sql

    -- Unpivot with custom column names
    SELECT *
    FROM pivot_sales_data
    UNPIVOT (
      monthly_amount
      FOR sales_month IN (q1_sales, q2_sales, q3_sales, q4_sales)
    ) AS unpivot_alias (
      product_name,
      quarter,
      amount
    );

    Snowflake Scripting UDFs: Procedural SQL

    A major enhancement in 2025 allows creating SQL UDFs with Snowflake Scripting procedural language.

    Traditional UDF Limitations

    Before, SQL UDFs were limited to single expressions:

    sql

    -- Simple UDF: No procedural logic allowed
    CREATE FUNCTION calculate_discount(price FLOAT, discount_pct FLOAT)
    RETURNS FLOAT
    AS
    $$
      price * (1 - discount_pct / 100)
    $$;

    New: Snowflake Scripting UDFs

    Now you can include loops, conditionals, and complex logic:

    sql

    CREATE OR REPLACE FUNCTION calculate_tiered_commission(
      sales_amount FLOAT
    )
    RETURNS FLOAT
    LANGUAGE SQL
    AS
    $$
    DECLARE
      commission FLOAT;
    BEGIN
      -- Tiered commission logic
      IF (sales_amount < 10000) THEN
        commission := sales_amount * 0.05;  -- 5%
      ELSEIF (sales_amount < 50000) THEN
        commission := (10000 * 0.05) + ((sales_amount - 10000) * 0.08);  -- 8%
      ELSE
        commission := (10000 * 0.05) + (40000 * 0.08) + ((sales_amount - 50000) * 0.10);  -- 10%
      END IF;
      
      RETURN commission;
    END;
    $$;
    
    -- Use in SELECT statement
    SELECT
      salesperson,
      sales_amount,
      calculate_tiered_commission(sales_amount) AS commission
    FROM sales_data;

    Key advantages:

    • Called in SELECT statements (unlike stored procedures)
    • Complex business logic encapsulated
    • Reusable across queries
    • Better than stored procedures for inline calculations

    Real-World Example: Dynamic Pricing

    sql

    CREATE OR REPLACE FUNCTION calculate_dynamic_price(
      base_price FLOAT,
      inventory_level INT,
      demand_score FLOAT,
      competitor_price FLOAT
    )
    RETURNS FLOAT
    LANGUAGE SQL
    AS
    $$
    DECLARE
      adjusted_price FLOAT;
      inventory_factor FLOAT;
      demand_factor FLOAT;
    BEGIN
      -- Calculate inventory factor
      IF (inventory_level < 10) THEN
        inventory_factor := 1.15;  -- Low inventory: +15%
      ELSEIF (inventory_level > 100) THEN
        inventory_factor := 0.90;  -- High inventory: -10%
      ELSE
        inventory_factor := 1.0;
      END IF;
      
      -- Calculate demand factor
      IF (demand_score > 0.8) THEN
        demand_factor := 1.10;     -- High demand: +10%
      ELSEIF (demand_score < 0.3) THEN
        demand_factor := 0.95;     -- Low demand: -5%
      ELSE
        demand_factor := 1.0;
      END IF;
      
      -- Calculate adjusted price
      adjusted_price := base_price * inventory_factor * demand_factor;
      
      -- Price floor: Don't go below 80% of competitor
      IF (adjusted_price < competitor_price * 0.8) THEN
        adjusted_price := competitor_price * 0.8;
      END IF;
      
      -- Price ceiling: Don't exceed 120% of competitor
      IF (adjusted_price > competitor_price * 1.2) THEN
        adjusted_price := competitor_price * 1.2;
      END IF;
      
      RETURN ROUND(adjusted_price, 2);
    END;
    $$;
    
    -- Apply dynamic pricing across catalog
    SELECT
      product_id,
      product_name,
      base_price,
      calculate_dynamic_price(
        base_price,
        inventory_level,
        demand_score,
        competitor_price
      ) AS optimized_price
    FROM products;

    Lambda Expressions with Table Column References

    Snowflake 2025 enhances higher-order functions by allowing table column references in lambda expressions.

    Lambda expressions in Snowflake referencing both array elements and table columns

    What Are Higher-Order Functions?

    Higher-order functions operate on arrays using lambda functions:

    • FILTER: Filter array elements
    • TRANSFORM (MAP): Transform each element
    • REDUCE: Aggregate an array into a single value

    New Capability: Column References

    Previously, lambda expressions couldn’t reference table columns:

    sql

    -- OLD: Limited to array elements only
    SELECT FILTER(
      price_array,
      x -> x > 100  -- Can only use array elements
    )
    FROM products;

    Now you can reference table columns:

    sql

    -- NEW: Reference table columns in lambda
    CREATE TABLE products (
      product_id INT,
      product_name VARCHAR,
      prices ARRAY,
      discount_threshold FLOAT
    );
    
    -- Use table column 'discount_threshold' in lambda
    SELECT
      product_id,
      product_name,
      FILTER(
        prices,
        p -> p > discount_threshold  -- References table column!
      ) AS prices_above_threshold
    FROM products;

    Real-World Use Case: Dynamic Filtering

    sql

    -- Inventory table with multiple warehouse locations
    CREATE TABLE inventory (
      product_id INT,
      warehouse_locations ARRAY,
      min_stock_level INT,
      stock_levels ARRAY
    );
    
    -- Filter warehouses where stock is below minimum
    SELECT
      product_id,
      FILTER(
        warehouse_locations,
        (loc, idx) -> stock_levels[idx] < min_stock_level
      ) AS understocked_warehouses,
      FILTER(
        stock_levels,
        level -> level < min_stock_level
      ) AS low_stock_amounts
    FROM inventory;

    Complex Example: Price Optimization

    sql

    -- Apply dynamic discounts based on product-specific rules
    CREATE TABLE product_pricing (
      product_id INT,
      base_prices ARRAY,
      competitor_prices ARRAY,
      max_discount_pct FLOAT,
      margin_threshold FLOAT
    );
    
    SELECT
      product_id,
      TRANSFORM(
        base_prices,
        (price, idx) -> 
          CASE
            -- Don't discount if already below competitor
            WHEN price <= competitor_prices[idx] * 0.95 THEN price
            -- Apply discount but respect margin threshold
            WHEN price * (1 - max_discount_pct / 100) >= margin_threshold 
              THEN price * (1 - max_discount_pct / 100)
            -- Use margin threshold as floor
            ELSE margin_threshold
          END
      ) AS optimized_prices
    FROM product_pricing;

    Additional SQL Improvements in 2025

    Beyond the major features, Snowflake 2025 includes numerous enhancements:

    Enhanced SEARCH Function Modes

    New search modes for more precise text matching:

    PHRASE Mode: Match exact phrases with token order

    sql

    SELECT *
    FROM documents
    WHERE SEARCH(content, 'data engineering best practices', SEARCH_MODE => 'PHRASE');

    AND Mode: All tokens must be present

    sql

    SELECT *
    FROM articles
    WHERE SEARCH(title, 'snowflake performance optimization', SEARCH_MODE => 'AND');

    OR Mode: Any token matches (existing, now explicit)

    sql

    SELECT *
    FROM blogs
    WHERE SEARCH(content, 'sql python scala', SEARCH_MODE => 'OR');

    Increased VARCHAR and BINARY Limits

    Maximum lengths significantly increased:

    • VARCHAR: Now 128 MB (previously 16 MB)
    • VARIANT, ARRAY, OBJECT: Now 128 MB
    • BINARY, GEOGRAPHY, GEOMETRY: Now 64 MB

    This enables:

    • Storing large JSON documents
    • Processing big text blobs
    • Handling complex geographic shapes
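
    As a rough sketch of how this looks in practice (the table and column names below are hypothetical, and the larger limits must be available on your account), a VARCHAR column declared without an explicit length defaults to the maximum allowed size, so existing DDL benefits from the higher ceiling automatically:

    sql

    -- Hypothetical table for large semi-structured documents
    CREATE OR REPLACE TABLE large_documents (
      doc_id INT,
      raw_json VARCHAR,   -- no declared length: uses the maximum size (now up to 128 MB)
      payload VARIANT     -- VARIANT values can also grow up to 128 MB
    );

    -- Inspect how large the stored text values actually are
    SELECT doc_id, LENGTH(raw_json) AS char_count
    FROM large_documents
    ORDER BY char_count DESC;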

    Schema-Level Replication for Failover

    Selective replication for databases in failover groups:

    sql

    -- Replicate only specific schemas
    ALTER DATABASE production_db
    SET REPLICABLE_WITH_FAILOVER_GROUPS = TRUE;
    
    ALTER SCHEMA production_db.critical_schema
    SET REPLICABLE_WITH_FAILOVER_GROUPS = TRUE;
    
    -- Other schemas not replicated, reducing costs

    XML Format Support (General Availability)

    Native XML support for semi-structured data:

    sql

    -- Load XML files
    COPY INTO xml_data
    FROM @my_stage/data.xml
    FILE_FORMAT = (TYPE = 'XML');
    
    -- Query XML with familiar functions
    SELECT
      xml_data:customer:@id::STRING AS customer_id,
      xml_data:customer:name::STRING AS customer_name
    FROM xml_data;

    Best Practices for Snowflake SQL 2025

    This Snowflake SQL tutorial wouldn’t be complete without best practices…

    To maximize the benefits of these improvements:

    When to Use MERGE ALL BY NAME

    Use it when:

    • Tables have 5+ columns to map
    • Schemas evolve frequently
    • Column order varies across systems
    • Maintenance is a priority

    Avoid it when:

    • Fine control needed over specific columns
    • Conditional updates require different logic per column
    • Performance is absolutely critical (marginal difference)

    When to Use UNION BY NAME

    Use it when:

    • Combining data from multiple sources with varying schemas
    • Schema evolution happens regularly
    • Missing columns should be NULL-filled
    • Flexibility outweighs performance

    Avoid it when:

    • Schemas are identical and stable
    • Maximum performance is required
    • Large-scale production queries (billions of rows)

    Cortex AISQL Performance Tips

    Optimize AI function usage:

    • Filter data first before applying AI functions
    • Batch similar operations together
    • Use WHERE clauses to limit rows processed
    • Cache results when possible

    Example optimization:

    sql

    -- POOR: AI function on entire table
    SELECT AI_CLASSIFY(text, categories) FROM large_table;
    
    -- BETTER: Filter first, then classify
    SELECT AI_CLASSIFY(text, categories)
    FROM large_table
    WHERE date >= CURRENT_DATE - 7  -- Only recent data
    AND text IS NOT NULL
    AND LENGTH(text) > 50;  -- Only substantial text

    Snowflake Scripting UDF Guidelines

    Best practices:

    • Keep UDFs deterministic when possible
    • Test thoroughly with edge cases
    • Document complex logic with comments
    • Consider performance for frequently-called functions
    • Use instead of stored procedures when called in SELECT

    Migration Guide: Adopting 2025 Features

    For teams transitioning to these new features:

    Migration roadmap for adopting Snowflake SQL 2025 improvements in four phases

    Phase 1: Assess Current Code

    Identify candidates for improvement:

    sql

    -- Find MERGE statements that could use ALL BY NAME
    SELECT query_text
    FROM snowflake.account_usage.query_history
    WHERE query_text ILIKE '%MERGE INTO%'
    AND query_text ILIKE '%UPDATE SET%'
    AND query_text LIKE '%=%'  -- Has manual mapping
    AND start_time >= DATEADD(month, -3, CURRENT_TIMESTAMP());

    Phase 2: Test in Development

    Create test cases:

    1. Copy production MERGE to dev
    2. Rewrite using ALL BY NAME
    3. Compare results with original (see the comparison sketch below)
    4. Benchmark performance differences
    5. Review with team
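
    For step 3, a minimal comparison sketch using MINUS can confirm that the rewritten statement produces identical results. The customer_target_old and customer_target_new tables below are hypothetical copies of the target built by the legacy MERGE and the ALL BY NAME version, respectively:

    sql

    -- Rows produced by the legacy MERGE but missing from the ALL BY NAME version
    SELECT * FROM customer_target_old
    MINUS
    SELECT * FROM customer_target_new;

    -- Rows produced by the ALL BY NAME version but missing from the legacy MERGE
    SELECT * FROM customer_target_new
    MINUS
    SELECT * FROM customer_target_old;

    -- Both queries returning zero rows means the two approaches agree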

    Phase 3: Gradual Rollout

    Prioritize by impact:

    1. Start with non-critical pipelines
    2. Monitor for issues
    3. Expand to production incrementally
    4. Update documentation
    5. Train team on new syntax

    Phase 4: Standardize

    Update coding standards:

    • Prefer MERGE ALL BY NAME for new code
    • Refactor existing MERGE when touched
    • Document exceptions where old syntax preferred
    • Include in code reviews

    Troubleshooting Common Issues

    When adopting new features, watch for these issues:

    MERGE ALL BY NAME Not Working

    Problem: “Column count mismatch”

    Solution: Ensure exact column name matches:

    sql

    -- Check column names match
    SELECT column_name 
    FROM information_schema.columns 
    WHERE table_name = 'TARGET_TABLE'
    MINUS
    SELECT column_name 
    FROM information_schema.columns 
    WHERE table_name = 'SOURCE_TABLE';
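
    The query above only surfaces columns that exist in the target but not the source. Running the reverse direction catches extra source columns as well:

    sql

    -- Reverse check: columns present in the source but missing from the target
    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = 'SOURCE_TABLE'
    MINUS
    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = 'TARGET_TABLE';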

    UNION BY NAME NULL Handling

    Problem: Unexpected NULLs in results

    Solution: Remember missing columns become NULL:

    sql

    -- Make NULLs explicit if needed
    SELECT
      COALESCE(column_name, 'DEFAULT_VALUE') AS column_name,
      ...
    FROM table1
    UNION ALL BY NAME
    SELECT * FROM table2;

    Cortex AISQL Performance

    Problem: AI functions running slowly

    Solution: Filter data before AI processing:

    sql

    -- Reduce data volume first
    WITH filtered AS (
      SELECT * FROM large_table
      WHERE conditions_to_reduce_rows
    )
    SELECT AI_CLASSIFY(text, categories)
    FROM filtered;

    Future SQL Improvements on Snowflake Roadmap

    Based on community feedback and Snowflake’s direction, expect these future enhancements:

    2026 Predicted Features:

    • More AI functions in Cortex AISQL
    • Enhanced MERGE with more flexible conditions
    • Additional higher-order functions
    • Improved query optimization for new syntax
    • Extended lambda capabilities

    Community Requests:

    • MERGE NOT MATCHED BY SOURCE (like SQL Server)
    • More flexible PIVOT syntax
    • Additional string manipulation functions
    • Graph query capabilities

    Snowflake SQL 2025 improvements overview showing all major features and enhancements

    Conclusion: Embracing Modern SQL in Snowflake

    This Snowflake SQL tutorial has covered the revolutionary 2025 improvements, which represent a significant leap forward in data engineering productivity. MERGE ALL BY NAME alone can save data engineers hours per week by eliminating tedious column mapping.

    The key benefits:

    • Less boilerplate code
    • Fewer errors from typos
    • Easier maintenance as schemas evolve
    • More time for valuable work

    For data engineers, these features mean spending less time fighting SQL syntax and more time solving business problems. The tools are more intelligent, the syntax more intuitive, and the results more reliable.

    Start today by identifying one MERGE statement you can simplify with ALL BY NAME. Experience the difference these modern SQL features make in your daily work.

    The future of SQL is here—and it’s dramatically simpler.


    Key Takeaways

    • MERGE ALL BY NAME automatically matches columns by name, eliminating manual mapping
    • Announced September 29, 2025, this feature reduces MERGE statements from 50+ lines to 5 lines
    • UNION BY NAME combines data from sources with different column orders and schemas
    • Cortex AISQL brings AI-powered functions (AI_FILTER, AI_CLASSIFY, AI_AGG) directly into SQL for filtering, classifying, and summarizing data
  • Snowflake Optima: 15x Faster Queries at Zero Cost

    Snowflake Optima: 15x Faster Queries at Zero Cost

    Revolutionary Performance Without Lifting a Finger

    On October 8, 2025, Snowflake unveiled Snowflake Optima—a groundbreaking optimization engine that fundamentally changes how data warehouses handle performance. Unlike traditional optimization that requires manual tuning, configuration, and ongoing maintenance, Snowflake Optima analyzes your workload patterns in real-time and automatically implements optimizations that deliver dramatically faster queries.

    Here’s what makes this revolutionary:

    • 15x performance improvements in real-world customer workloads
    • Zero additional cost—no extra compute or storage charges
    • Zero configuration—no knobs to turn, no indexes to manage
    • Zero maintenance—continuous automatic optimization in the background

    For example, an automotive customer experienced queries dropping from 17.36 seconds to just 1.17 seconds after Snowflake Optima automatically kicked in. That’s a 15x acceleration without changing a single line of code or adjusting any settings.

    Moreover, this isn’t just about faster queries—it’s about effortless performance. Snowflake Optima represents a paradigm shift where speed is simply an outcome of using Snowflake, not a goal that requires constant engineering effort.


    What is Snowflake Optima?

    Snowflake Optima is an intelligent optimization engine built directly into the Snowflake platform that continuously analyzes SQL workload patterns and automatically implements the most effective performance strategies. Specifically, it eliminates the traditional burden of manual query tuning, index management, and performance monitoring.

    The Core Innovation of Optima:

    Traditionally, database optimization requires:

    • First, DBAs analyzing slow queries
    • Second, determining which indexes to create
    • Third, managing index storage and maintenance
    • Fourth, monitoring for performance degradation
    • Finally, repeating this cycle continuously

    With Optima, however, all of this happens automatically. Instead of requiring human intervention, Snowflake Optima:

    • Continuously monitors your workload patterns
    • Automatically identifies optimization opportunities
    • Intelligently creates hidden indexes when beneficial
    • Seamlessly maintains and updates optimizations
    • Transparently improves performance without user action

    Key Principles Behind Snowflake Optima

    Fundamentally, Snowflake Optima operates on three design principles:

    Performance First: Every query should run as fast as possible without requiring optimization expertise

    Simplicity Always: Zero configuration, zero maintenance, zero complexity

    Cost Efficiency: No additional charges for compute, storage, or the optimization service itself


    Snowflake Optima Indexing: The Breakthrough Feature

    At the heart of Snowflake Optima is Optima Indexing—an intelligent feature built on top of Snowflake’s Search Optimization Service. However, unlike traditional search optimization that requires manual configuration, Optima Indexing works completely automatically.

    How Snowflake Optima Indexing Works

    Specifically, Snowflake Optima Indexing continuously analyzes your SQL workloads to detect patterns and opportunities. When it identifies repetitive operations—such as frequent point-lookup queries on specific tables—it automatically generates hidden indexes designed to accelerate exactly those workload patterns.

    For instance:

    1. First, Optima monitors queries running on your Gen2 warehouses
    2. Then, it identifies recurring point-lookup queries with high selectivity
    3. Next, it analyzes whether an index would provide significant benefit
    4. Subsequently, it automatically creates a search index if worthwhile
    5. Finally, it maintains the index as data and workloads evolve

    Importantly, these indexes operate on a best-effort basis, meaning Snowflake manages them intelligently based on actual usage patterns and performance benefits. Unlike manually created indexes, they appear and disappear as workload patterns change, ensuring optimization remains relevant.

    Real-World Snowflake Optima Performance Gains

    Let’s examine actual customer results to understand Snowflake Optima’s impact:

    Snowflake Optima use cases across e-commerce, finance, manufacturing, and SaaS industries

    Case Study: Automotive Manufacturing Company

    Before Snowflake Optima:

    • Average query time: 17.36 seconds
    • Partition pruning rate: Only 30% of micro-partitions skipped
    • Warehouse efficiency: Moderate resource utilization
    • User experience: Slow dashboards, delayed analytics

    Before and after Snowflake Optima showing 15x query performance improvement

    After Snowflake Optima:

    • Average query time: 1.17 seconds (15x faster)
    • Partition pruning rate: 96% of micro-partitions skipped
    • Warehouse efficiency: Reduced resource contention
    • User experience: Lightning-fast dashboards, real-time insights

    Notably, the improvement wasn’t limited to the directly optimized queries. Because Snowflake Optima reduced resource contention on the warehouse, even queries that weren’t directly accelerated saw a 46% improvement in runtime—almost 2x faster.

    Furthermore, average job runtime on the entire warehouse improved from 2.63 seconds to 1.15 seconds—more than 2x faster overall.

    The Magic of Micro-Partition Pruning

    To understand Snowflake Optima’s power, you need to understand micro-partition pruning:

    Snowflake Optima micro-partition pruning improving from 30% to 96% efficiency

    Snowflake stores data in compressed micro-partitions (typically 50-500 MB). When you run a query, Snowflake first determines which micro-partitions contain relevant data through partition pruning.

    Without Snowflake Optima:

    • Snowflake uses table metadata (min/max values, distinct counts)
    • Typically prunes 30-50% of irrelevant partitions
    • Remaining partitions must still be scanned

    With Snowflake Optima:

    • Additionally uses hidden search indexes
    • Dramatically increases pruning rate to 90-96%
    • Significantly reduces data scanning requirements

    For example, in the automotive case study:

    • Total micro-partitions: 10,389
    • Pruned by metadata alone: 2,046 (20%)
    • Additional pruning by Snowflake Optima: 8,343 (80%)
    • Final pruning rate: 96%
    • Execution time: Dropped to just 636 milliseconds

    Snowflake Optima vs. Traditional Optimization

    Let’s compare Snowflake Optima against traditional database optimization approaches:

    Traditional manual optimization versus Snowflake Optima automatic optimization comparison

    Traditional Search Optimization Service

    Before Snowflake Optima, Snowflake offered the Search Optimization Service (SOS) that required manual configuration:

    Requirements:

    • DBAs must identify which tables benefit
    • Administrators must analyze query patterns
    • Teams must determine which columns to index
    • Organizations must weigh cost versus benefit manually
    • Users must monitor effectiveness continuously

    Challenges:

    • End users running queries aren’t responsible for costs
    • Query users don’t have knowledge to implement optimizations
    • Administrators aren’t familiar with every new workload
    • Teams lack time to analyze and optimize continuously

    Snowflake Optima: The Automatic Alternative

    With Snowflake Optima, however:

    Snowflake Optima delivers zero additional cost for automatic performance optimization

    Requirements:

    • Zero—it’s automatically enabled on Gen2 warehouses

    Configuration:

    • Zero—no settings, no knobs, no parameters

    Maintenance:

    • Zero—fully automatic in the background

    Cost Analysis:

    • Zero—no additional charges whatsoever

    Monitoring:

    • Optional—visibility provided but not required

    In other words, Snowflake Optima eliminates every burden associated with traditional optimization while delivering superior results.


    Technical Requirements for Snowflake Optima

    Currently, Snowflake Optima has specific technical requirements:

    Generation 2 Warehouses Only

    Snowflake Optima requires Generation 2 warehouses for automatic optimization

    Snowflake Optima is exclusively available on Snowflake Generation 2 (Gen2) standard warehouses. Therefore, ensure your infrastructure meets this requirement before expecting Optima benefits.

    To check your warehouse generation:

    sql

    SHOW WAREHOUSES;
    -- Look for TYPE column: STANDARD warehouses on Gen2

    If needed, migrate to Gen2 warehouses through Snowflake’s upgrade process.

    Best-Effort Optimization Model

    Unlike manually applied search optimization that guarantees index creation, Snowflake Optima operates on a best-effort basis:

    What this means:

    • Optima creates indexes when it determines they’re beneficial
    • Indexes may appear and disappear as workloads evolve
    • Optimization adapts to changing query patterns
    • Performance improves automatically but variably

    When to use manual search optimization instead:

    For specialized workloads requiring guaranteed performance—such as:

    • Cybersecurity threat detection (near-instantaneous response required)
    • Fraud prevention systems (consistent sub-second queries needed)
    • Real-time trading platforms (predictable latency essential)
    • Emergency response systems (reliability non-negotiable)

    In these cases, manually applying search optimization provides consistent index freshness and predictable performance characteristics.


    Monitoring Optima Performance

    Transparency is crucial for understanding optimization effectiveness. Fortunately, Snowflake provides comprehensive monitoring capabilities through the Query Profile tab in Snowsight.

    Snowflake Optima monitoring dashboard showing query performance insights and pruning statistics

    Query Insights Pane

    The Query Insights pane displays detected optimization insights for each query:

    What you’ll see:

    • Each type of insight detected for a query
    • Every instance of that insight type
    • Explicit notation when “Snowflake Optima used”
    • Details about which optimizations were applied

    To access:

    1. Navigate to Query History in Snowsight
    2. Select a query to examine
    3. Open the Query Profile tab
    4. Review the Query Insights pane

    When Snowflake Optima has optimized a query, you’ll see “Snowflake Optima used” clearly indicated with specifics about the optimization applied.

    Statistics Pane: Pruning Metrics

    The Statistics pane quantifies Snowflake Optima’s impact through partition pruning metrics:

    Key metric: “Partitions pruned by Snowflake Optima”

    What it shows:

    • Number of partitions skipped during query execution
    • Percentage of total partitions pruned
    • Improvement in data scanning efficiency
    • Direct correlation to performance gains

    For example:

    • Total partitions: 10,389
    • Pruned by Snowflake Optima: 8,343 (80%)
    • Total pruning rate: 96%
    • Result: 15x faster query execution

    This metric directly correlates to:

    • Faster query completion times
    • Reduced compute costs
    • Lower resource contention
    • Better overall warehouse efficiency
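
    If you prefer SQL over the Snowsight UI, a rough approximation of overall pruning rates is available from the ACCOUNT_USAGE.QUERY_HISTORY view. Keep in mind that PARTITIONS_SCANNED and PARTITIONS_TOTAL reflect pruning from all sources combined and do not isolate Snowflake Optima's contribution the way the Query Profile does:

    sql

    -- Approximate pruning rates for recent queries (all pruning sources combined)
    SELECT
      query_id,
      warehouse_name,
      total_elapsed_time / 1000 AS elapsed_seconds,
      partitions_total,
      partitions_scanned,
      ROUND(100 * (1 - partitions_scanned / NULLIF(partitions_total, 0)), 1) AS pruning_pct
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
      AND partitions_total > 0
    ORDER BY pruning_pct DESC
    LIMIT 50;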

    Use Cases

    Let’s explore specific scenarios where Optima delivers exceptional value:

    Use Case 1: E-Commerce Analytics

    A large retail chain analyzes customer behavior across e-commerce and in-store platforms.

    Challenge:

    • Billions of rows across multiple tables
    • Frequent point-lookups on customer IDs
    • Filter-heavy queries on product SKUs
    • Time-sensitive queries on timestamps

    Before Optima:

    • Dashboard queries: 8-12 seconds average
    • Ad-hoc analysis: Extremely slow
    • User experience: Frustrated analysts
    • Business impact: Delayed decision-making

    With Snowflake Optima:

    • Dashboard queries: Under 1 second
    • Ad-hoc analysis: Lightning fast
    • User experience: Delighted analysts
    • Business impact: Real-time insights driving revenue

    Result: 10x performance improvement enabling real-time personalization and dynamic pricing strategies.

    Use Case 2: Financial Services Risk Analysis

    A global bank runs complex risk calculations across portfolio data.

    Challenge:

    • Massive datasets with billions of transactions
    • Regulatory requirements for rapid risk assessment
    • Recurring queries on account numbers and counterparties
    • Performance critical for compliance

    Before Snowflake Optima:

    • Risk calculations: 15-20 minutes
    • Compliance reporting: Hours to complete
    • Warehouse costs: High due to long-running queries
    • Regulatory risk: Potential delays

    With Snowflake Optima:

    • Risk calculations: 2-3 minutes
    • Compliance reporting: Real-time available
    • Warehouse costs: 40% reduction through efficiency
    • Regulatory risk: Eliminated through speed

    Result: 8x faster risk assessment ensuring regulatory compliance and enabling more sophisticated risk modeling.

    Use Case 3: IoT Sensor Data Analysis

    A manufacturing company analyzes sensor data from factory equipment.

    Challenge:

    • High-frequency sensor readings (millions per hour)
    • Point-lookups on specific machine IDs
    • Time-series queries for anomaly detection
    • Real-time requirements for predictive maintenance

    Before Snowflake Optima:

    • Anomaly detection: 30-45 seconds
    • Predictive models: Slow to train
    • Alert latency: Minutes behind real-time
    • Maintenance: Reactive not predictive

    With Snowflake Optima:

    • Anomaly detection: 2-3 seconds
    • Predictive models: Faster training cycles
    • Alert latency: Near real-time
    • Maintenance: Truly predictive

    Result: 12x performance improvement enabling proactive maintenance preventing $2M+ in equipment failures annually.

    Use Case 4: SaaS Application Backend

    A B2B SaaS platform powers customer-facing dashboards from Snowflake.

    Challenge:

    • Customer-specific queries with high selectivity
    • User-facing performance requirements (sub-second)
    • Variable workload patterns across customers
    • Cost efficiency critical for SaaS margins

    Before Snowflake Optima:

    • Dashboard load times: 5-8 seconds
    • User satisfaction: Low (performance complaints)
    • Warehouse scaling: Expensive to meet demand
    • Competitive position: Disadvantage

    With Snowflake Optima:

    • Dashboard load times: Under 1 second
    • User satisfaction: High (no complaints)
    • Warehouse scaling: Optimized automatically
    • Competitive position: Performance advantage

    Result: 7x performance improvement improving customer retention by 23% and reducing churn.


    Cost Implications of Snowflake Optima

    One of the most compelling aspects of Snowflake Optima is its cost structure: there isn’t one.

    Zero Additional Costs

    Snowflake Optima comes at no additional charge beyond your standard Snowflake costs:

    Zero Compute Costs:

    • Index creation: Free (uses Snowflake background serverless)
    • Index maintenance: Free (automatic background processes)
    • Query optimization: Free (integrated into query execution)

    Free Storage Allocation:

    • Index storage: Free (managed by Snowflake internally)
    • Overhead: Free (no impact on your storage bill)

    No Service Fees Applied:

    • Feature access: Free (included in Snowflake platform)
    • Monitoring: Free (built into Snowsight)

    In contrast, manually applied Search Optimization Service does incur costs:

    • Compute: For building and maintaining indexes
    • Storage: For the search access path structures
    • Ongoing: Continuous maintenance charges

    Therefore, Snowflake Optima delivers automatic performance improvements without expanding your budget or requiring cost-benefit analysis.

    Indirect Cost Savings

    Beyond zero direct costs, Snowflake Optima generates indirect savings:

    Reduced compute consumption:

    • Faster queries complete in less time
    • Fewer credits consumed per query
    • Better efficiency across all workloads

    Lower warehouse scaling needs:

    • Optimized queries reduce resource contention
    • Smaller warehouses can handle more load
    • Fewer multi-cluster warehouse scale-outs needed

    Decreased engineering overhead:

    • No DBA time spent on optimization
    • No analyst time troubleshooting slow queries
    • No DevOps time managing indexes

    Improved ROI:

    • Faster insights drive better decisions
    • Better performance improves user adoption
    • Lower costs increase profitability

    For example, the automotive customer saw:

    • 56% reduction in query execution time
    • 40% decrease in overall warehouse utilization
    • Estimated $50K annual savings on a single workload
    • Zero engineering hours invested in optimization

    Snowflake Optima Best Practices

    While Snowflake Optima requires zero configuration, following these best practices maximizes its effectiveness:

    Best Practice 1: Migrate to Gen2 Warehouses

    Ensure you’re running on Generation 2 warehouses:

    sql

    -- Check current warehouse generation
    SHOW WAREHOUSES;
    
    -- Contact Snowflake support to upgrade if needed

    Why this matters:

    • Snowflake Optima only works on Gen2 warehouses
    • Gen2 includes numerous other performance improvements
    • Migration is typically seamless with Snowflake support
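
    Depending on your edition and region, Gen2 may also be self-service rather than a support request. The statement below is only a sketch based on the warehouse RESOURCE_CONSTRAINT property; treat the property name and value as assumptions to verify against current Snowflake documentation before running it:

    sql

    -- Sketch only: switch an existing warehouse to Gen2 (verify the property for your account)
    ALTER WAREHOUSE my_wh SET RESOURCE_CONSTRAINT = STANDARD_GEN_2;

    -- Confirm the change
    SHOW WAREHOUSES LIKE 'my_wh';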

    Best Practice 2: Monitor Optima Impact

    Regularly review Query Profile insights to understand Snowflake Optima’s impact:

    Steps:

    1. Navigate to Query History in Snowsight
    2. Filter for your most important queries
    3. Check Query Insights pane for “Snowflake Optima used”
    4. Review partition pruning statistics
    5. Document performance improvements

    Why this matters:

    • Visibility into automatic optimizations
    • Evidence of value for stakeholders
    • Understanding of workload patterns

    Best Practice 3: Complement with Manual Optimization for Critical Workloads

    For mission-critical queries requiring guaranteed performance:

    sql

    -- Apply manual search optimization
    ALTER TABLE critical_table ADD SEARCH OPTIMIZATION 
    ON EQUALITY(customer_id, transaction_date);

    When to use:

    • Cybersecurity threat detection
    • Fraud prevention systems
    • Real-time trading platforms
    • Emergency response systems

    Why this matters:

    • Guaranteed index freshness
    • Predictable performance characteristics
    • Consistent sub-second response times

    Best Practice 4: Maintain Query Quality

    Even with Snowflake Optima, write efficient queries:

    Good practices:

    • Selective filters (WHERE clauses that filter significantly)
    • Appropriate data types (exact matches vs. wildcards)
    • Proper joins (avoid unnecessary cross joins)
    • Result limiting (use LIMIT when appropriate)

    Why this matters:

    • Snowflake Optima amplifies good query design
    • Poor queries may not benefit from optimization
    • Best results come from combining both

    Best Practice 5: Understand Workload Characteristics

    Know which query patterns benefit most from Snowflake Optima:

    Optimal for:

    • Point-lookup queries (WHERE id = ‘specific_value’)
    • Highly selective filters (returns small percentage of rows)
    • Recurring patterns (same query structure repeatedly)
    • Large tables (billions of rows)

    Less optimal for:

    • Full table scans (no WHERE clauses)
    • Low selectivity (returns most rows)
    • One-off queries (never repeated)
    • Small tables (already fast)

    Why this matters:

    • Realistic expectations for performance gains
    • Better understanding of when Optima helps
    • Strategic planning for workload design
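
    As a concrete illustration of these patterns (the orders table and its columns are hypothetical), the first query below is the kind of recurring, highly selective lookup that Optima Indexing targets, while the second touches most partitions regardless of any hidden index:

    sql

    -- Point lookup on a large table: a strong candidate for automatic indexing
    SELECT order_id, status, updated_at
    FROM orders
    WHERE customer_id = 'CUST-000042'   -- highly selective filter
      AND order_date >= '2025-01-01';

    -- Broad aggregation: scans most partitions, so hidden indexes add little
    SELECT status, COUNT(*) AS order_count
    FROM orders
    GROUP BY status;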

    Snowflake Optima and the Future of Performance

    Snowflake Optima represents more than just a technical feature—it’s a strategic vision for the future of data warehouse performance.

    The Philosophy Behind Snowflake Optima

    Traditionally, database performance required trade-offs:

    • Performance OR simplicity (fast databases were complex)
    • Automation OR control (automatic features lacked flexibility)
    • Cost OR speed (faster performance cost more money)

    Snowflake Optima eliminates these trade-offs:

    • Performance AND simplicity (fast without complexity)
    • Automation AND intelligence (smart automatic decisions)
    • Cost efficiency AND speed (faster at no extra cost)

    The Virtuous Cycle of Intelligence

    Snowflake Optima creates a self-improving system:

    Snowflake Optima continuous learning cycle for automatic performance improvement

    1. Optima monitors workload patterns continuously
    2. Patterns inform optimization decisions intelligently
    3. Optimizations improve performance automatically
    4. Performance enables more complex workloads
    5. New workloads provide more data for learning
    6. Cycle repeats, continuously improving

    This means your data warehouse becomes smarter over time, learning from usage patterns and continuously improving without human intervention.

    What’s Next for Snowflake Optima

    Based on Snowflake’s roadmap and industry trends, expect these future developments:

    Short-term (2025-2026):

    • Expanded query types benefiting from Snowflake Optima
    • Additional optimization strategies beyond indexing
    • Enhanced monitoring and explainability features
    • Support for additional warehouse configurations

    Medium-term (2026-2027):

    • Cross-query optimization (learning from related queries)
    • Workload-specific optimization profiles
    • Predictive optimization (anticipating future needs)
    • Integration with other Snowflake intelligent features

    Future vision of Snowflake Optima evolving into AI-powered autonomous optimization

    Long-term (2027+):

    • AI-powered optimization using machine learning
    • Autonomous database management capabilities
    • Self-healing performance issues automatically
    • Cognitive optimization understanding business context

    Getting Started with Snowflake Optima

    The beauty of Snowflake Optima is that getting started requires virtually no effort:

    Step 1: Verify Gen2 Warehouses

    Check if you’re running Generation 2 warehouses:

    sql

    SHOW WAREHOUSES;

    Look for:

    • TYPE column: Should show STANDARD
    • Generation: Contact Snowflake if unsure

    If needed:

    • Contact Snowflake support for Gen2 upgrade
    • Migration is typically seamless and fast

    Step 2: Run Your Normal Workloads

    Simply continue running your existing queries:

    No configuration needed:

    • Snowflake Optima monitors automatically
    • Optimizations apply in the background
    • Performance improves without intervention

    No changes required:

    • Keep existing query patterns
    • Maintain current warehouse configurations
    • Continue normal operations

    Step 3: Monitor the Impact

    After a few days or weeks, review the results:

    In Snowsight:

    1. Go to Query History
    2. Select queries to examine
    3. Open Query Profile tab
    4. Look for “Snowflake Optima used”
    5. Review partition pruning statistics

    Key metrics:

    • Query duration improvements
    • Partition pruning percentages
    • Warehouse efficiency gains

    Step 4: Share the Success

    Document and communicate Snowflake Optima benefits:

    For stakeholders:

    • Performance improvements (X times faster)
    • Cost savings (reduced compute consumption)
    • User satisfaction (faster dashboards, better experience)

    For technical teams:

    • Pruning statistics (data scanning reduction)
    • Workload patterns (which queries optimized)
    • Best practices (maximizing Optima effectiveness)

    Snowflake Optima FAQs

    What is Snowflake Optima?

    Snowflake Optima is an intelligent optimization engine that automatically analyzes SQL workload patterns and implements performance optimizations without requiring configuration or maintenance. It delivers dramatically faster queries at zero additional cost.

    How much does Snowflake Optima cost?

    Zero. Snowflake Optima comes at no additional charge beyond your standard Snowflake costs. There are no compute charges, storage charges, or service charges for using Snowflake Optima.

    What are the requirements for Snowflake Optima?

    Snowflake Optima requires Generation 2 (Gen2) standard warehouses. It’s automatically enabled on qualifying warehouses without any configuration needed.

    How does Snowflake Optima compare to manual Search Optimization Service?

    Snowflake Optima operates automatically without configuration and at zero cost, while manual Search Optimization Service requires configuration and incurs compute and storage charges. For most workloads, Snowflake Optima is the better choice. However, mission-critical workloads requiring guaranteed performance may still benefit from manual optimization.

    How do I monitor Snowflake Optima performance?

    Use the Query Profile tab in Snowsight to monitor Snowflake Optima. The Query Insights pane shows when Snowflake Optima was used, and the Statistics pane displays partition pruning metrics showing performance impact.

    Can I disable Snowflake Optima?

    No, Snowflake Optima cannot be disabled on Gen2 warehouses. However, it operates on a best-effort basis and only creates optimizations when beneficial, so there’s no downside to having it active.

    What types of queries benefit from Snowflake Optima?

    Snowflake Optima is most effective for point-lookup queries with highly selective filters on large tables, especially recurring query patterns. Queries returning small percentages of rows see the biggest improvements.


    Conclusion: The Dawn of Effortless Performance

    Snowflake Optima marks a fundamental shift in how organizations approach database performance. For decades, achieving fast query performance required dedicated DBAs, constant tuning, and careful optimization. With Snowflake Optima, however, speed is simply an outcome of using Snowflake.

    The results speak for themselves:

    • 15x performance improvements in real-world workloads
    • Zero additional cost or configuration required
    • Zero maintenance burden on teams
    • Continuous improvement as workloads evolve

    More importantly, Snowflake Optima represents a strategic advantage for organizations managing complex data operations. By removing the burden of manual optimization, your team can focus on deriving insights rather than tuning infrastructure.

    The self-adapting nature of Snowflake Optima means your data warehouse becomes smarter over time, learning from usage patterns and continuously improving without human intervention. This creates a virtuous cycle where performance naturally improves as your workloads evolve and grow.

    Snowflake Optima streamlines optimization for data engineers, saving countless hours. Analysts benefit from accelerated insights and smoother user experiences. Meanwhile, executives see improved ROI — all without added investment.

    The future of database performance isn’t about smarter DBAs or better optimization tools—it’s about intelligent systems that optimize themselves. Optima is that future, available today.

    Are you ready to experience effortless performance?


    Key Takeaways

    • Snowflake Optima delivers automatic query optimization without configuration or cost
    • Announced October 8, 2025, currently available on Gen2 standard warehouses
    • Real customers achieve 15x performance improvements automatically
    • Optima Indexing continuously monitors workloads and creates hidden indexes intelligently
    • Zero additional charges for compute, storage, or the optimization service
    • Partition pruning improvements from 30% to 96% drive dramatic speed increases
    • Best-effort optimization adapts to changing workload patterns automatically
    • Monitoring available through Query Profile tab in Snowsight
    • Mission-critical workloads can still use manual search optimization for guaranteed performance
    • Future roadmap includes AI-powered optimization and autonomous database management
  • Open Semantic Interchange: Solving AI’s $1T Problem

    Open Semantic Interchange: Solving AI’s $1T Problem

    Breaking: Tech Giants Unite to Solve AI’s Biggest Bottleneck

    On September 23, 2025, something unprecedented happened in the data industry. Snowflake announced the Open Semantic Interchange (OSI) on its official blog: a groundbreaking initiative led by Snowflake, Salesforce, BlackRock, and dbt Labs to solve AI’s biggest problem. More than 15 technology companies agreed to give away their data secrets—collaboratively creating the Open Semantic Interchange as an open, vendor-neutral standard for how business data is defined across all platforms.

    This isn’t just another tech announcement. It’s the industry admitting that the emperor has no clothes.

    For decades, every software vendor has defined business metrics differently. Your data warehouse calls it “revenue.” Your BI tool calls it “total sales.” Your CRM calls it “booking amount.” Your AI model? It has no idea they’re the same thing.

    This semantic chaos has created what VentureBeat calls the $1 trillion AI problem—the massive hidden cost of data preparation, reconciliation, and the manual labor required before any AI project can begin.

    Enter the Open Semantic Interchange (OSI)—a groundbreaking initiative that could become as fundamental to AI as SQL was to databases or HTTP was to the web.


    What is Open Semantic Interchange (OSI)? Understanding the Semantic Standard

    Open Semantic Interchange is an open-source initiative that creates a universal, vendor-neutral specification for defining and sharing semantic metadata across data platforms, BI tools, and AI applications.

    The Simple Explanation of Open Semantic Interchange

    Think of OSI as a Rosetta Stone for business data. Just as the ancient Rosetta Stone allowed scholars to translate between Egyptian hieroglyphics, Greek, and Demotic script, OSI allows different software systems to understand each other’s data definitions.

    When your data warehouse, BI dashboard, and AI model all speak the same semantic language, magic happens:

    • No more weeks reconciling conflicting definitions
    • No more “which revenue number is correct?”
    • No more AI models trained on misunderstood data
    • No more rebuilding logic across every tool
    Hand-drawn flow showing single semantic definition distributed consistently to all platforms

    Open Semantic Interchange Technical Definition

    OSI provides a standardized specification for semantic models that includes:

    Business Metrics: Calculations, aggregations, and KPIs (revenue, customer lifetime value, churn rate)

    Dimensions: Attributes for slicing data (time, geography, product category)

    Hierarchies: Relationships between data elements (country → state → city)

    Business Rules: Logic and constraints governing data interpretation

    Context & Metadata: Descriptions, ownership, lineage, and governance policies

    Built on familiar formats like YAML and compatible with RDF and OWL, this specification stands out by being tailored specifically for modern analytics and AI workloads.


    The $1 Trillion Problem: Why Open Semantic Interchange Matters Now

    The Hidden Tax: Why Semantic Interchange is Critical for AI Projects

    Every AI initiative begins the same way. Data scientists don’t start building models—they start reconciling data.

    Week 1-2: “Wait, why are there three different revenue numbers?”

    Week 3-4: “Which customer definition should we use?”

    Week 5-6: “These date fields don’t match across systems.”

    Week 7-8: “We need to rebuild this logic because BI and ML define margins differently.”

    Data fragmentation problem that Open Semantic Interchange solves across platforms

    According to industry research, data preparation consumes 60-80% of data science time. For enterprises spending millions on AI, this represents a staggering hidden cost.

    Real-World Horror Stories Without Semantic Interchange

    Fortune 500 Retailer: Spent 9 months building a customer lifetime value model. When deployment came, marketing and finance disagreed on the “customer” definition. Project scrapped.

    Global Bank: Built fraud detection across 12 regions. Each region’s “transaction” definition differed. Model accuracy varied 35% between regions due to semantic inconsistency.

    Healthcare System: Created patient risk models using EHR data. Clinical teams rejected the model because “readmission” calculations didn’t match their operational definitions.

    These aren’t edge cases—they’re the norm. The lack of semantic standards is silently killing AI ROI across every industry.

    Why Open Semantic Interchange Now? The AI Inflection Point

    Generative AI has accelerated the crisis. When you ask ChatGPT or Claude to “analyze Q3 revenue by region,” the AI needs to understand:

    • What “revenue” means in your business
    • How “regions” are defined
    • Which “Q3” you’re referring to
    • What calculations to apply

    Without semantic standards, AI agents give inconsistent, untrustworthy answers. As enterprises move from AI pilots to production at scale, semantic fragmentation has become the primary blocker to AI adoption.


    The Founding Coalition: Who’s Behind OSI

    OSI isn’t a single-vendor initiative—rather it’s an unprecedented collaboration across the data ecosystem.

    Coalition of 17 companies collaborating on Open Semantic Interchange standard

    Companies Leading the Open Semantic Interchange Initiative

    Snowflake: The AI Data Cloud company spearheading the initiative, contributing engineering resources and governance infrastructure

    Salesforce (Tableau): Co-leading with Snowflake, bringing BI perspective and Tableau’s semantic layer expertise

    dbt Labs: Contributing the dbt Semantic Layer framework as a foundational technology

    BlackRock: Representing financial services with the Aladdin platform, ensuring real-world enterprise requirements

    RelationalAI: Bringing knowledge graph and reasoning capabilities for complex semantic relationships

    Launch Partners (17 Total)

    BI & Analytics: ThoughtSpot, Sigma, Hex, Omni

    Data Governance: Alation, Atlan, Select Star

    AI & ML: Mistral AI, Elementum AI

    Industry Solutions: Blue Yonder, Honeydew, Cube

    This coalition represents competitors agreeing to open-source their competitive advantage for the greater good of the industry.

    Why Competitors Are Collaborating on Semantic Interchange

    As Christian Kleinerman, EVP Product at Snowflake, explains: “The biggest barrier our customers face when it comes to ROI from AI isn’t a competitor—it’s data fragmentation.”

    Indeed, this observation highlights a critical industry truth. Rather than competing against other vendors, organizations are actually fighting against their own internal data inconsistencies. Moreover, this fragmentation costs enterprises millions annually in lost productivity and delayed AI initiatives.

    Similarly, Southard Jones, CPO at Tableau, emphasizes the collaborative nature: “This initiative is transformative because it’s not about one company owning the standard—it’s about the industry coming together.”

    In other words, the traditional competitive dynamics are being reimagined. Instead of proprietary lock-in strategies, the industry is choosing open collaboration. Consequently, this shift benefits everyone—vendors, enterprises, and end users alike.

    Ryan Segar, CPO at dbt Labs: “Data and analytics engineers will now be able to work with the confidence that their work will be leverageable across the data ecosystem.”

    The message is clear: Standardization isn’t a commoditizer—it’s a catalyst. Like USB-C didn’t hurt device makers, OSI won’t hurt data platforms. It shifts competition from data definitions to innovation in user experience and AI capabilities.


    How Open Semantic Interchange (OSI) Works: Technical Deep Dive

    The Open Semantic Interchange Specification Structure

    OSI defines semantic models in a structured, machine-readable format. Here’s what a simplified OSI specification covers:

    Metrics Definition:

    • Name, description, and business owner
    • Calculation formula with explicit dependencies
    • Aggregation rules (sum, average, count distinct)
    • Filters and conditions
    • Temporal considerations (point-in-time vs. accumulated)

    Dimension Definition:

    • Attribute names and data types
    • Valid values and constraints
    • Hierarchical relationships
    • Display formatting rules

    Relationships:

    • How metrics relate to dimensions
    • Join logic and cardinality
    • Foreign key relationships
    • Temporal alignment

    Governance Metadata:

    • Data lineage and source systems
    • Ownership and stewardship
    • Access policies and sensitivity
    • Certification status and quality scores
    • Version history and change logs
    Open Semantic Interchange architecture showing semantic layer connecting data to applications

    Open Semantic Interchange Technology Stack

    Format: YAML (human-readable, version-control friendly)

    Compilation: Engines that translate OSI specs into platform-specific code (SQL, Python, APIs)

    Transport: REST APIs and file-based exchange

    Validation: Schema validation and semantic correctness checking

    Extension: Plugin architecture for domain-specific semantics

    Integration Patterns

    Organizations can adopt OSI through multiple approaches:

    Native Integration: Platforms like Snowflake directly support OSI specifications

    Translation Layer: Tools convert between proprietary formats and OSI

    Dual-Write: Systems maintain both proprietary and OSI formats

    Federation: Central OSI registry with distributed consumption


    Real-World Use Cases: Open Semantic Interchange in Action

    Hand-drawn journey map showing analyst problem solved through OSI implementation

    Use Case 1: Open Semantic Interchange for Multi-Cloud Analytics

    Challenge: A global retailer runs analytics on Snowflake but visualizations in Tableau, with data science in Databricks. Each platform defined “sales” differently.

    Before OSI:

    • Data team spent 40 hours/month reconciling definitions
    • Business users saw conflicting dashboards
    • ML models trained on inconsistent logic
    • Trust in analytics eroded
    Hand-drawn before and after comparison showing data chaos versus OSI harmony

    With OSI:

    • Single OSI specification defines “sales” once
    • All platforms consume the same semantic model
    • Dashboards, notebooks, and AI agents align
    • Data team focuses on new insights, not reconciliation

    Impact: 90% reduction in semantic reconciliation time, 35% increase in analytics trust scores

    Use Case 2: Semantic Interchange for M&A Integration

    Challenge: A financial services company acquired three competitors, each with distinct data definitions for “customer,” “account,” and “portfolio value.”

    Before OSI:

    • 18-month integration timeline
    • $12M spent on data mapping consultants
    • Incomplete semantic alignment at launch
    • Ongoing reconciliation needed

    With OSI:

    • Each company publishes OSI specifications
    • Automated mapping identifies overlaps and conflicts
    • Human review focuses only on genuine business rule differences
    • Integration completed in 6 months

    Impact: 67% faster integration, 75% lower consulting costs, complete semantic alignment

    Use Case 3: Open Semantic Interchange Improves AI Agent Trust

    Challenge: An insurance company deployed AI agents for claims processing. Agents gave inconsistent answers because “claim amount,” “deductible,” and “coverage” had multiple definitions.

    Before OSI:

    • Customer service agents stopped using AI tools
    • 45% of AI answers flagged as incorrect
    • Manual verification required for all AI outputs
    • AI initiative considered a failure

    With OSI:

    • All insurance concepts defined in OSI specification
    • AI agents query consistent semantic layer
    • Answers align with operational systems
    • Audit trails show which definitions were used

    Impact: 92% accuracy rate, 70% reduction in manual verification, AI adoption rate increased to 85%

    Use Case 4: Semantic Interchange for Regulatory Compliance

    Challenge: A bank needed consistent risk reporting across Basel III, IFRS 9, and CECL requirements. Each framework defined “exposure,” “risk-weighted assets,” and “provisions” slightly differently.

    Before OSI:

    • Separate data pipelines for each framework
    • Manual reconciliation of differences
    • Audit findings on inconsistent definitions
    • High cost of compliance

    With OSI:

    • Regulatory definitions captured in domain-specific OSI extensions
    • Single data pipeline with multiple semantic views
    • Automated reconciliation and variance reporting
    • Full audit trail of definition changes

    Impact: 60% lower compliance reporting costs, zero audit findings, 80% faster regulatory change implementation


    Industry Impact by Vertical

    Hand-drawn grid showing OSI impact across finance, healthcare, retail, and manufacturing

    Financial Services

    Primary Benefit: Regulatory compliance and cross-platform consistency

    Key Use Cases:

    • Risk reporting across frameworks (Basel, IFRS, CECL)
    • Trading analytics with market data integration
    • Customer 360 across wealth, retail, and commercial banking
    • Fraud detection with consistent entity definitions

    Early Adopter: BlackRock’s Aladdin platform, which already unifies investment management with common data language

    Healthcare & Life Sciences

    Primary Benefit: Clinical and operational data alignment

    Key Use Cases:

    • Patient outcomes research across EHR systems
    • Claims analytics with consistent diagnosis coding
    • Drug safety surveillance with adverse event definitions
    • Population health with social determinants integration

    Impact: Enables federated analytics while respecting patient privacy

    Retail & E-Commerce

    Primary Benefit: Omnichannel consistency and supply chain alignment

    Key Use Cases:

    • Customer lifetime value across channels (online, mobile, in-store)
    • Inventory optimization with consistent product hierarchies
    • Marketing attribution with unified conversion definitions
    • Supply chain analytics with vendor data integration

    Impact: True omnichannel understanding of customer behavior

    Manufacturing

    Primary Benefit: OT/IT convergence and supply chain interoperability

    Key Use Cases:

    • Predictive maintenance with consistent failure definitions
    • Quality analytics across plants and suppliers
    • Supply chain visibility with partner data
    • Energy consumption with sustainability metrics

    Impact: End-to-end visibility from raw materials to customer delivery


    Open Semantic Interchange Implementation Roadmap

    Hand-drawn roadmap showing OSI growth from 2025 seedling to 2028 mature ecosystem

    Phase 1: Foundation (Q4 2025 – Q1 2026)

    Goals:

    • Publish initial OSI specification v1.0
    • Release reference implementations
    • Launch community forum and GitHub repository
    • Establish governance structure

    Deliverables:

    • Core specification for metrics, dimensions, relationships
    • YAML schema and validation tools
    • Sample semantic models for common use cases
    • Developer documentation and tutorials

    Phase 2: Ecosystem Adoption (Q2-Q4 2026)

    Goals:

    • Native support in major data platforms
    • Translation tools for legacy systems
    • Domain-specific extensions (finance, healthcare, retail)
    • Growing library of shared semantic models

    Milestones:

    • 50+ platforms supporting OSI
    • 100+ published semantic models
    • Enterprise adoption case studies
    • Certification program for OSI compliance

    Phase 3: Industry Standard (2027+)

    Goals:

    • Recognition as de facto standard
    • International standards body adoption
    • Regulatory recognition in key industries
    • Continuous evolution through community

    Vision:

    • OSI as fundamental as SQL for databases
    • Semantic models as reusable as open-source libraries
    • Cross-industry semantic model marketplace
    • AI agents natively understanding OSI specifications

    Open Semantic Interchange Benefits for Different Stakeholders

    Data Engineers

    Before OSI:

    • Rebuild semantic logic for each new tool
    • Debug definition mismatches
    • Manual data reconciliation pipelines

    With OSI:

    • Define business logic once
    • Automatic propagation to all tools
    • Focus on data quality, not definition mapping

    Time Savings: 40-60% reduction in pipeline development time

    Data Analysts

    Before OSI:

    • Verify metric definitions before trusting reports
    • Recreate calculations in each BI tool
    • Reconcile conflicting dashboards

    With OSI:

    • Trust that all tools use same definitions
    • Self-service analytics with confidence
    • Focus on insights, not validation

    Productivity Gain: 3x increase in analysis output

    Open Semantic Interchange Benefits for Data Scientists

    Before OSI:

    • Spend weeks understanding data semantics
    • Build custom feature engineering for each project
    • Models fail in production due to definition drift

    With OSI:

    • Leverage pre-defined semantic features
    • Reuse feature engineering logic
    • Production models aligned with business systems

    Impact: 5-10x faster model development

    How Semantic Interchange Empowers Business Users

    Before OSI:

    • Receive conflicting reports from different teams
    • Unsure which numbers to trust
    • Can’t ask AI agents confidently

    With OSI:

    • Consistent numbers across all reports
    • Trust AI-generated insights
    • Self-service analytics without IT

    Trust Increase: 50-70% higher confidence in data-driven decisions

    Open Semantic Interchange Value for IT Leadership

    Before OSI:

    • Vendor lock-in through proprietary semantics
    • High cost of platform switching
    • Difficult to evaluate best-of-breed tools

    With OSI:

    • Freedom to choose best tools for each use case
    • Lower switching costs and negotiating leverage
    • Faster time-to-value for new platforms

    Strategic Flexibility: 60% reduction in platform lock-in risk


    Challenges and Considerations

    Challenge 1: Organizational Change for Semantic Interchange

    Issue: OSI requires organizations to agree on single source of truth definitions—politically challenging when different departments define metrics differently.

    Solution:

    • Start with uncontroversial definitions
    • Use OSI to make conflicts visible and force resolution
    • Establish data governance councils
    • Frame as risk reduction, not turf battle

    Challenge 2: Integrating Legacy Systems with Semantic Interchange

    Issue: Older systems may lack APIs or semantic metadata capabilities.

    Solution:

    • Build translation layers
    • Gradually migrate legacy definitions to OSI
    • Focus on high-value use cases first
    • Use OSI for new systems, translate for old

    Challenge 3: Specification Evolution

    Issue: Business definitions change—how does OSI handle versioning and migration?

    Solution:

    • Built-in versioning in OSI specification
    • Deprecation policies and timelines
    • Automated impact analysis tools
    • Backward compatibility guidelines

    Challenge 4: Domain-Specific Complexity

    Issue: Some industries have extremely complex semantic models (e.g., derivatives trading, clinical research).

    Solution:

    • Domain-specific OSI extensions
    • Industry working groups
    • Pluggable architecture for specialized needs
    • Start simple, expand complexity gradually

    Challenge 5: Governance and Ownership

    Issue: Who owns the semantic definitions? Who can change them?

    Solution:

    • Clear ownership model in OSI metadata
    • Approval workflows for definition changes
    • Audit trails and change logs
    • Role-based access control

    How Open Semantic Interchange Shifts the Competitive Landscape

    Before OSI: The Walled Garden Era

    Vendors competed by locking in data semantics. Moving from Platform A to Platform B meant rebuilding all your business logic.

    This created:

    • High switching costs
    • Vendor power imbalance
    • Slow innovation (vendors focused on lock-in, not features)
    • Customer resentment

    After OSI: The Innovation Era

    With semantic portability, vendors must compete on:

    • User experience and interface design
    • AI capabilities and intelligence
    • Performance and scalability
    • Integration breadth and ease
    • Support and services

    Southard Jones (Tableau): “Standardization isn’t a commoditizer—it’s a catalyst. Think of it like a standard electrical outlet: the outlet itself isn’t the innovation, it’s what you plug into it.”

    This shift benefits customers through:

    • Better products (vendors focus on innovation)
    • Lower costs (competition increases)
    • Flexibility (easy to switch or multi-source)
    • Faster AI adoption (semantic consistency enables trust)

    How to Get Started with Open Semantic Interchange (OSI)

    For Enterprises

    Step 1: Assess Current State (1-2 weeks)

    • Inventory your data platforms and BI tools
    • Document how metrics are currently defined
    • Identify semantic conflicts and pain points
    • Estimate time spent on definition reconciliation

    Step 2: Pilot Use Case (1-2 months)

    • Choose a high-impact but manageable scope (e.g., revenue metrics)
    • Define OSI specification for selected metrics
    • Implement in 2-3 key tools
    • Measure impact on reconciliation time and trust

    Step 3: Expand Gradually (6-12 months)

    • Add more metrics and dimensions
    • Integrate additional platforms
    • Establish governance processes
    • Train teams on OSI practices

    Step 4: Operationalize (Ongoing)

    • Make Open Semantic Interchange part of standard data modeling
    • Integrate into data governance framework
    • Participate in community to influence roadmap
    • Share learnings and semantic models

    For Technology Vendors

    Kickoff Phase: Evaluate Strategic Fit (Immediate)

    • Review the Open Semantic Interchange specification
    • Assess compatibility with your platform
    • Identify required engineering work
    • Estimate go-to-market impact

    Next: Join the Initiative (Q4 2025)

    • Become an Open Semantic Interchange partner
    • Participate in working groups
    • Contribute to specification development
    • Collaborate on reference implementations

    Strengthen the core: Implement Support (2026)

    • Add OSI import/export capabilities
    • Provide migration tools from proprietary formats
    • Update documentation and training
    • Certify OSI compliance

    Finally: Differentiate (Ongoing)

    • Build value-added services on top of OSI
    • Focus innovation on user experience
    • Lead with interoperability messaging
    • Partner with ecosystem for joint solutions

    The Future: What’s Next for Open Semantic Interchange

    2025-2026: Specification & Early Adoption

    • Initial specification published (Q4 2025)
    • Reference implementations released
    • Major vendors announce support
    • First enterprise pilot programs
    • Community formation and governance

    2027-2028: Mainstream Adoption

    • OSI becomes default for new projects
    • Translation tools for legacy systems mature
    • Domain-specific extensions proliferate
    • Marketplace for shared semantic models emerges
    • Analyst recognition as emerging standard

    2029-2030: Industry Standard Status

    • International standards body adoption
    • Regulatory recognition in financial services
    • Built into enterprise procurement requirements
    • University curricula include Open Semantic Interchange
    • Semantic models as common as APIs

    Long-Term Vision

    The Semantic Web Realized: Open Semantic Interchange could finally deliver on the promise of the Semantic Web—not through abstract ontologies, but through practical, business-focused semantic standards.

    AI Agent Economy: When AI agents understand semantics consistently, they can collaborate across organizational boundaries, creating a true agentic AI ecosystem.

    Hand-drawn future vision of collaborative AI agent ecosystem powered by OSI

    Data Product Marketplace: Open Semantic Interchange enables data products with embedded semantics, making them immediately usable without integration work.

    Cross-Industry Innovation: Semantic models from one industry (e.g., supply chain optimization) could be adapted to others (e.g., healthcare logistics) through shared Open Semantic Interchange definitions.


    Conclusion: The Rosetta Stone Moment for AI

    The launch of Open Semantic Interchange marks a watershed moment in the data industry. For the first time, fierce competitors have set aside proprietary advantages to solve a problem that affects everyone: semantic fragmentation.

    However, this isn’t just about technical standards—rather, it’s about unlocking a trillion dollars in trapped AI value.

    Specifically, when every platform speaks the same semantic language, AI can finally deliver on its promise:

    • First, trustworthy insights that business users believe
    • Second, fast time-to-value without months of data prep
    • Third, flexible tool choices without vendor lock-in
    • Finally, scalable AI adoption across the enterprise

    Importantly, the biggest winners will be organizations that adopt early. While others struggle with semantic reconciliation, early adopters will be deploying AI agents, building sophisticated analytics, and making data-driven decisions with confidence.

    Ultimately, the question isn’t whether Open Semantic Interchange will become the standard—instead, it’s how quickly you’ll adopt it to stay competitive.

    The revolution has begun. Indeed, the Rosetta Stone for business data is here.

    So, are you ready to speak the universal language of AI?


  • Synapse to Fabric: Your ADX Migration Guide 2025

    Synapse to Fabric: Your ADX Migration Guide 2025

    The clock is ticking for Azure Synapse Data Explorer (ADX). With its retirement announced, a strategic Synapse to Fabric migration is now a critical task for data teams. This move to Microsoft Fabric’s Real-Time Analytics and its Eventhouse database unlocks a unified, AI-powered experience, and this guide will show you how.

    This guide will walk you through the entire process, from planning to execution, complete with practical examples and KQL code snippets to ensure a smooth transition.

    Why This is Happening: The Drive Behind the Synapse to Fabric Migration

    Microsoft’s vision is clear: a single, integrated platform for all data and analytics workloads. This Synapse to Fabric migration is a direct result of that vision. While powerful, Azure Synapse Analytics was built from a collection of distinct services. Microsoft Fabric breaks down these silos, offering a unified SaaS experience where data engineering, data science, and business intelligence coexist seamlessly.

    A 'before and after' architecture diagram comparing the separate services of Azure Data Explorer with the integrated Microsoft Fabric Eventhouse solution for real-time analytics.

    Eventhouse is the next evolution of the Kusto engine that powered ADX, now deeply integrated within the Fabric ecosystem. It’s built for high-performance querying on streaming, semi-structured data—making it the natural successor for your ADX workloads.

    Key Benefits of Migrating to Fabric Eventhouse:

    • OneLake Integration: Your data lives in OneLake, a single, tenant-wide data lake, eliminating data duplication and movement.
    • Unified Experience: Switch from data ingestion to query to Power BI reporting within a single UI.
    • Enhanced T-SQL Support: Query your Eventhouse data using both KQL and a more robust T-SQL surface area (a quick example follows this list).
    • AI-Powered Future: Tap into the power of Copilot and other AI capabilities inherent to the Fabric platform.
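
    To give a feel for that second point, here is a minimal sketch of a T-SQL query run against an Eventhouse’s SQL endpoint. It assumes a table named YourTableName, matching the placeholder used in the migration steps below, and the T-SQL surface covers only a subset of the full language, so treat it as illustrative:

    -- A simple T-SQL query over an Eventhouse table (table name is a placeholder)
    SELECT TOP 10 *
    FROM YourTableName;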

    Phase 1: Assess and Plan Your Migration

    Before you move a single byte of data, you need a clear inventory of your current ADX environment.

    A hand-drawn flowchart infographic detailing the three key steps for a Synapse to Fabric migration: Assess & Plan, Migrate Data, and Update Reports.
    1. Document Your Clusters: List all your ADX clusters, databases, and tables.
    2. Analyze Ingestion Pipelines: Identify all data sources. Are you using Event Hubs, IoT Hubs, or custom scripts?
    3. Map Downstream Consumers: Who and what consumes this data? Document all Power BI reports, dashboards, Grafana instances, and applications that query ADX.
    4. Export Your Schema: You’ll need the schema for every table and function. Use the .show and .get commands in the ADX query editor to script your objects.

    Example: Scripting a Table Schema

    Run this KQL command in your Azure Data Explorer query window to get the creation command for a specific table.

    .get table YourTableName schema as csl

    This will output the .create table command with all columns, data types, and folder/docstring properties. Save these scripts for each table. Do the same for your functions using .show function YourFunctionName.

    Phase 2: The Migration – Data and Schema

    With your plan in place, it’s time to create your new home in Fabric and move your data.

    Step 1: Create a KQL Database and Eventhouse in Fabric

    1. Navigate to your Microsoft Fabric workspace.
    2. Select the Real-Time Analytics experience.
    3. Create a new KQL Database.
    4. Within your KQL Database, Fabric automatically provisions an Eventhouse. This is your primary database for analysis. You can also create “KQL Querysets” which are like saved query collections.

    Step 2: Recreate Your Schema

    Using the scripts you exported in Phase 1, run the .create table and .create function commands in your new Fabric KQL Database query window.

    Step 3: Migrate Your Data

    For historical data, the most effective method is exporting from ADX to Parquet format in Azure Data Lake Storage (ADLS) Gen2 and then ingesting into Fabric.

    Example: One-Time Data Ingestion with a Fabric Pipeline

    1. Export from ADX: Use the .export command in ADX to push your historical table data to a container in ADLS Gen2 (see the code snippet after this list).
    2. Ingest into Fabric: In your Fabric workspace, create a new Data Pipeline.
    3. Use the Copy data activity.
      • Source: Connect to your ADLS Gen2 account and point to the exported Parquet files.
      • Destination: Select “Workspace” and choose your KQL Database and target table.
    4. Run the pipeline. Fabric will handle the ingestion into your Eventhouse table with optimized performance.
    .export async to parquet (
        h@"abfss://your-container@your-storage-account.dfs.core.windows.net/path/to/export"
    )
    <|
    YourTableName

    For ongoing data streams, you will re-point your Event Hubs or IoT Hubs from your old ADX cluster to your new Fabric Eventstream or KQL Database connection string.

    Phase 3: Update Queries and Reports

    Most of your KQL queries will work in Fabric without modification. The primary task here is updating connection strings in your downstream tools.

    Connecting Power BI to Fabric Eventhouse:

    This is where the integration shines.

    1. Open Power BI Desktop.
    2. Click Get Data.
    3. Search for the KQL Database connector.
    4. Instead of a cluster URI, you’ll see a simple dialog to select your Fabric workspace and the specific KQL Database.
    5. Select DirectQuery for real-time analysis.

    Your existing Power BI data models and DAX measures should work seamlessly once the connection is updated.

    Example: Updating an Application Connection

    If you have an application using the ADX SDK, you will need to update the connection string.

    • Old ADX Connection String: https://youradxcluster.kusto.windows.net
    • New Fabric KQL DB Connection String: https://your-fabric-workspace.kusto.fabric.microsoft.com

    You can find the exact query URI in the Fabric portal on your KQL Database’s details page.

    Embracing the Future

    Completing your Synapse to Fabric migration is more than a technical task—it’s a strategic step into the future of data analytics. By consolidating your workloads, you reduce complexity, unlock powerful new AI capabilities, and empower your team with a truly unified platform. Start planning today to ensure you’re ahead of the curve.

    Further Reading & Official Resources

    For those looking to dive deeper, here are the official Microsoft documents and resources to guide your migration and learning journey:

    1. Official Microsoft Documentation: Migrate to Real-Time Analytics in Fabric
    2. Microsoft Fabric Real-Time Analytics Overview
    3. Quickstart: Create a KQL Database
    4. Get data into a KQL database
    5. OneLake, the OneDrive for Data
    6. Microsoft Fabric Community Forum
  • AI Data Agent Guide 2025: Snowflake Cortex Tutorial

    AI Data Agent Guide 2025: Snowflake Cortex Tutorial

    The world of data analytics is changing. For years, accessing insights required writing complex SQL queries. However, the industry is now shifting towards a more intuitive, conversational approach. At the forefront of this revolution is agentic AI—intelligent systems that can understand human language, reason, plan, and automate complex tasks.

    Snowflake is leading this charge by transforming its platform into an intelligent and conversational AI Data Cloud. With the recent introduction of Snowflake Cortex Agents, they have provided a powerful tool for developers and data teams to build their own custom AI assistants.

    This guide will walk you through, step-by-step, how to build your very first AI data agent. You will learn how to create an agent that can answer complex questions by pulling information from both your database tables and your unstructured documents, all using simple, natural language.

    What is a Snowflake Cortex Agent and Why Does it Matter?

    First and foremost, a Snowflake Cortex Agent is an AI-powered assistant that you can build on top of your own data. Think of it as a chatbot that has expert knowledge of your business. It understands your data landscape and can perform complex analytical tasks based on simple, conversational prompts.

    This is a game-changer for several reasons:

    • It Democratizes Data: Business users no longer need to know SQL. Instead, they can ask questions like, “What were our top-selling products in the last quarter?” and get immediate, accurate answers.
    • It Automates Analysis: Consequently, data teams are freed from writing repetitive, ad-hoc queries. They can now focus on more strategic initiatives while the agent handles routine data exploration.
    • It Provides Unified Insights: Most importantly, a Cortex Agent can synthesize information from multiple sources. It can query your structured sales data from a table and cross-reference it with strategic goals mentioned in a PDF document, all in a single response.

    The Blueprint: How a Cortex Agent Works

    Under the hood, a Cortex Agent uses a simple yet powerful workflow to answer your questions. It orchestrates several of Snowflake’s Cortex AI features to deliver a comprehensive answer.

    Whiteboard-style flowchart showing how a Snowflake Cortex Agent works by using Cortex Analyst for SQL and Cortex Search for documents to provide an answer.
    1. Planning: The agent first analyzes your natural language question to understand your intent. It figures out what information you need and where it might be located.
    2. Tool Use: Next, it intelligently chooses the right tool for the job. If it needs to query structured data, it uses Cortex Analyst to generate and run SQL. If it needs to find information in your documents, it uses Cortex Search.
    3. Reflection: Finally, after gathering the data, the agent evaluates the results. It might ask for clarification, refine its approach, or synthesize the information into a clear, concise answer before presenting it to you.

    Step-by-Step Tutorial: Building a Sales Analysis Agent

    Now, let’s get hands-on. We will build a simple yet powerful sales analysis agent. This agent will be able to answer questions about sales figures from a table and also reference goals from a quarterly business review (QBR) document.

    Hand-drawn illustration of preparing data for Snowflake, showing a database and a document being placed into a container with the Snowflake logo.

    Prerequisites

    • A Snowflake account with ACCOUNTADMIN privileges.
    • A warehouse to run the queries.

    Step 1: Prepare Your Data

    First, we need some data to work with. Let’s create two simple tables for sales and products, and then upload a sample PDF document.

    Run the following SQL in a Snowflake worksheet:

    -- Create our database and schema
    CREATE DATABASE IF NOT EXISTS AGENT_DEMO;
    CREATE SCHEMA IF NOT EXISTS AGENT_DEMO.SALES;
    USE SCHEMA AGENT_DEMO.SALES;
    
    -- Create a products table
    CREATE OR REPLACE TABLE PRODUCTS (
        product_id INT,
        product_name VARCHAR,
        category VARCHAR
    );
    
    INSERT INTO PRODUCTS (product_id, product_name, category) VALUES
    (101, 'Quantum Laptop', 'Electronics'),
    (102, 'Nebula Smartphone', 'Electronics'),
    (103, 'Stardust Keyboard', 'Accessories');
    
    -- Create a sales table
    CREATE OR REPLACE TABLE SALES (
        sale_id INT,
        product_id INT,
        sale_date DATE,
        sale_amount DECIMAL(10, 2)
    );
    
    INSERT INTO SALES (sale_id, product_id, sale_date, sale_amount) VALUES
    (1, 101, '2025-09-01', 1200.00),
    (2, 102, '2025-09-05', 800.00),
    (3, 101, '2025-09-15', 1250.00),
    (4, 103, '2025-09-20', 150.00);
    
    -- Create a stage for our unstructured documents
    CREATE OR REPLACE STAGE qbr_documents;

    Now, create a simple text file named QBR_Report_Q3.txt on your local machine with the following content and upload it to the qbr_documents stage using the Snowsight UI (or from the command line, as shown after the file contents).

    Quarterly Business Review – Q3 2025 Summary

    Our primary strategic goal for Q3 was to drive the adoption of our new flagship product, the ‘Quantum Laptop’. We aimed for a sales target of over $2,000 for this product. Secondary goals included expanding our market share in the accessories category.
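
    If you prefer the command line to the Snowsight UI, a PUT command from SnowSQL can upload the same file to the stage; the local path below is just a placeholder:

    PUT file:///path/to/QBR_Report_Q3.txt @qbr_documents AUTO_COMPRESS = FALSE;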

    Step 2: Create the Semantic Model

    Next, we need to teach the agent about our structured data. We do this by creating a Semantic Model. This is a YAML file that defines our tables, columns, and how they relate to each other.

    # semantic_model.yaml
    model:
      name: sales_insights_model
      tables:
        - name: SALES
          columns:
            - name: sale_id
              type: INT
            - name: product_id
              type: INT
            - name: sale_date
              type: DATE
            - name: sale_amount
              type: DECIMAL
        - name: PRODUCTS
          columns:
            - name: product_id
              type: INT
            - name: product_name
              type: VARCHAR
            - name: category
              type: VARCHAR
      joins:
        - from: SALES
          to: PRODUCTS
          on: SALES.product_id = PRODUCTS.product_id

    Save this as semantic_model.yaml and upload it to the @qbr_documents stage.

    Step 3: Create the Cortex Search Service

    Now, let’s make our PDF document searchable. We create a Cortex Search Service on the stage where we uploaded our file.

    CREATE OR REPLACE CORTEX SEARCH SERVICE sales_qbr_service
        ON @qbr_documents
        TARGET_LAG = '0 seconds'
        WAREHOUSE = 'COMPUTE_WH';

    Step 4: Combine Them into a Cortex Agent

    With all the pieces in place, we can now create our agent. This single SQL statement brings together our semantic model (for SQL queries) and our search service (for document queries).

    CREATE OR REPLACE CORTEX AGENT sales_agent
        MODEL = 'mistral-large',
        CORTEX_SEARCH_SERVICES = [sales_qbr_service],
        SEMANTIC_MODELS = ['@qbr_documents/semantic_model.yaml'];

    Step 5: Ask Your Agent Questions!

    The agent is now ready! You can interact with it using the CALL command. Let’s try a few questions.

    A hand-drawn sketch of a computer screen showing a user asking questions to a Snowflake Cortex Agent and receiving instant, insightful answers.

    First up: A simple structured data query.

    CALL sales_agent('What were our total sales?');

    Next: A more complex query involving joins.

    CALL sales_agent('Which product had the highest revenue?');

    Then comes: A question for our unstructured document.

    CALL sales_agent('Summarize our strategic goals from the latest QBR report.');
    

    Finally, the magic: a question that combines both.

    CALL sales_agent('Did we meet our sales target for the Quantum Laptop as mentioned in the QBR?');

    This final query demonstrates the true power of a Snowflake Cortex Agent. It will first query the SALES and PRODUCTS tables to calculate the total sales for the “Quantum Laptop.” Then, it will use Cortex Search to find the sales target mentioned in the QBR document. Finally, it will compare the two and give you a complete, synthesized answer.

    Conclusion: The Future is Conversational

    You have just built a powerful AI data agent in a matter of minutes. This is a fundamental shift in how we interact with data. By combining natural language processing with the power to query both structured and unstructured data, Snowflake Cortex Agents are paving the way for a future where data-driven insights are accessible to everyone in an organization.

    As Snowflake continues to innovate with features like Adaptive Compute and Gen-2 Warehouses, running these AI workloads will only become faster and more efficient. The era of conversational analytics has arrived, and it’s built on the Snowflake AI Data Cloud.



  • Mastering Real-Time ETL with Google Cloud Dataflow: A Comprehensive Tutorial

    Mastering Real-Time ETL with Google Cloud Dataflow: A Comprehensive Tutorial

    In the fast-paced world of data engineering, mastering real-time ETL with Google Cloud Dataflow is a game-changer for businesses needing instant insights. Extract, Transform, Load (ETL) processes are evolving from batch to real-time, and Google Cloud Dataflow stands out as a powerful, serverless solution for building streaming data pipelines. This tutorial dives into how Dataflow enables efficient, scalable data processing, its integration with other Google Cloud Platform (GCP) services, and practical steps to get started in 2025.

    Whether you’re processing live IoT data, monitoring user activity, or analyzing financial transactions, Dataflow’s ability to handle real-time streams makes it a top choice. Let’s explore its benefits, setup process, and a hands-on example to help you master real-time ETL with Google Cloud Dataflow.

    Why Choose Google Cloud Dataflow for Real-Time ETL?

    Google Cloud Dataflow offers a unified platform for batch and streaming data processing, powered by the Apache Beam SDK. Its serverless nature eliminates the need to manage infrastructure, allowing you to focus on pipeline logic.

    Hand-drawn illustration depicting the serverless architecture of Google Cloud Dataflow for efficient real-time ETL processing.

    Key benefits include:

    • Serverless Architecture: Automatically scales resources based on workload, reducing operational overhead and costs.
    • Seamless GCP Integration: Works effortlessly with BigQuery, Pub/Sub, Cloud Storage, and Data Studio, creating an end-to-end data ecosystem.
    • Real-Time Processing: Handles continuous data streams with low latency, ideal for time-sensitive applications.
    • Flexibility: Supports multiple languages (Java, Python) and custom transformations via Apache Beam.

    For businesses in 2025, where real-time analytics drive decisions, Dataflow’s ability to process millions of events per second positions it as a leader in cloud-based ETL solutions.

    Setting Up Google Cloud Dataflow

    Before building pipelines, set up your GCP environment:

    1. Create a GCP Project: Go to the Google Cloud Console and create a new project.
    2. Enable Dataflow API: Navigate to APIs & Services > Library, search for “Dataflow API,” and enable it.
    3. Install SDK: Use the Cloud SDK or install the Apache Beam SDK:
    pip install apache-beam[gcp]

    4. Authenticate: Run gcloud auth login and set your project with gcloud config set project PROJECT_ID.

    This setup ensures you’re ready to deploy and manage real-time ETL with Google Cloud Dataflow pipelines.

    Building a Real-Time Streaming Pipeline

    Let’s create a simple pipeline to process real-time data from Google Cloud Pub/Sub, transform it, and load it into BigQuery. This example streams simulated sensor data and calculates average values.

    Hand-drawn diagram of a real-time ETL pipeline using Google Cloud Dataflow, from Pub/Sub to BigQuery
    Step-by-Step Code Example
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms.combiners import MeanCombineFn
    from apache_beam.transforms.window import FixedWindows

    class DataflowOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_argument('--input_subscription', default='projects/your-project/subscriptions/your-subscription')
            parser.add_argument('--output_table', default='your-project:dataset.table')

    def run():
        options = DataflowOptions()
        # Pub/Sub sources require streaming mode
        options.view_as(StandardOptions).streaming = True

        with beam.Pipeline(options=options) as p:
            # Read raw messages from Pub/Sub and decode the JSON payload
            data = (p
                    | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(subscription=options.input_subscription)
                    | 'Decode JSON' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
                    )

            # Transform: key by sensor, apply 60-second fixed windows, and average per sensor
            averages = (data
                        | 'Key by Sensor' >> beam.Map(lambda record: (record['sensor_id'], float(record['value'])))
                        | 'Fixed Windows' >> beam.WindowInto(FixedWindows(60))
                        | 'Compute Average' >> beam.CombinePerKey(MeanCombineFn())
                        )

            # Format each (sensor_id, average) pair as a row and write to BigQuery
            (averages
             | 'To BigQuery Rows' >> beam.Map(lambda kv: {'sensor_id': kv[0], 'average_value': kv[1]})
             | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                 options.output_table,
                 schema='sensor_id:STRING,average_value:FLOAT',
                 write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
             )

    if __name__ == '__main__':
        run()
    How It Works
    • Input: Subscribes to a Pub/Sub subscription streaming JSON data (e.g., {"sensor_id": "S1", "value": 25.5}).
    • Transform: Keys each record by sensor ID, applies 60-second fixed windows, and computes the average value per sensor in each window.
    • Output: Loads the per-sensor averages into a BigQuery table for real-time analysis.

    Run this pipeline with:

    python your_script.py --project=your-project --job_name=real-time-etl --runner=DataflowRunner --region=us-central1 --temp_location=gs://your-bucket/temp --setup_file=./setup.py

    This example showcases real-time ETL with Google Cloud Dataflow’s power to process and store data instantly.
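
    Once data starts flowing, you can spot-check the pipeline output directly in BigQuery. The table reference below assumes the placeholder used in the pipeline options above:

    -- Recent per-sensor averages written by the pipeline
    SELECT sensor_id, average_value
    FROM `your-project.dataset.table`
    ORDER BY sensor_id
    LIMIT 20;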

    Integrating with Other GCP Services

    Dataflow shines with its ecosystem integration:

    Hand-drawn overview of Google Cloud Dataflow's integrations with GCP services like Pub/Sub and BigQuery for real-time ETL
    • Pub/Sub: Ideal for ingesting real-time event streams from IoT devices or web applications.
    • Cloud Storage: Use as a staging area for intermediate data or backups.
    • BigQuery: Enables SQL-based analytics on processed data.
    • Data Studio: Visualize results in dashboards for stakeholders.

    For instance, connect Pub/Sub to stream live user clicks, transform them with Dataflow, and visualize trends in Data Studio—all within minutes.

    Best Practices for Real-Time ETL with Dataflow

    • Optimize Resources: Use autoscaling and monitor CPU/memory usage in the Dataflow monitoring UI.
    • Handle Errors: Implement dead-letter queues in Pub/Sub for failed messages.
    • Security: Enable IAM roles and encrypt data with Cloud KMS.
    • Testing: Test pipelines locally with DirectRunner before deploying.

    These practices ensure robust, scalable real-time ETL with Google Cloud Dataflow pipelines.

    Benefits in 2025 and Beyond

    As of October 2025, Dataflow’s serverless model aligns with the growing demand for cost-efficient, scalable solutions. Its integration with AI/ML services like Vertex AI for predictive analytics further enhances its value. Companies leveraging real-time ETL report up to 40% faster decision-making, according to recent industry trends.


    Conclusion

    Mastering real-time ETL with Google Cloud Dataflow unlocks the potential of streaming data pipelines. Its serverless design, GCP integration, and flexibility make it ideal for modern data challenges. Start with the example above, experiment with your data, and scale as needed.

  • Star Schema vs Snowflake Schema: Key Differences & Use Cases

    Star Schema vs Snowflake Schema: Key Differences & Use Cases

    In the realm of data warehousing, choosing the right schema design is crucial for efficient data management, querying, and analysis. Two of the most popular multidimensional schemas are the star schema and the snowflake schema. These schemas organize data into fact tables (containing measurable metrics) and dimension tables (providing context like who, what, when, and where). Understanding star schema vs snowflake schema helps data engineers, analysts, and architects build scalable systems that support business intelligence (BI) tools and advanced analytics.

    This comprehensive guide delves into their structures, pros, cons, when to use each, real-world examples, and which one dominates in modern data practices as of 2025. We’ll also include visual illustrations to make concepts clearer, along with references to authoritative sources for deeper reading.

    What is a Star Schema?

    A star schema is a denormalized data model resembling a star, with a central fact table surrounded by dimension tables. The fact table holds quantitative data (e.g., sales amounts, quantities) and foreign keys linking to dimensions. Dimension tables store descriptive attributes (e.g., product names, customer details) and are not further normalized.

    Hand-drawn star schema diagram for data warehousing

    Advantages of Star Schema:

    • Simplicity and Ease of Use: Fewer tables mean simpler queries with minimal joins, making it intuitive for end-users and BI tools like Tableau or Power BI.
    • Faster Query Performance: Denormalization reduces join operations, leading to quicker aggregations and reports, especially on large datasets.
    • Better for Reporting: Ideal for OLAP (Online Analytical Processing) where speed is prioritized over storage efficiency.

    Disadvantages of Star Schema:

    • Data Redundancy: Denormalization can lead to duplicated data in dimension tables, increasing storage needs and risking inconsistencies during updates.
    • Limited Flexibility for Complex Hierarchies: It struggles with intricate relationships, such as multi-level product categories.

    In practice, star schemas are favored in environments where query speed trumps everything else. For instance, in a retail data warehouse, the fact table might record daily sales metrics, while dimensions cover products, customers, stores, and dates. This setup allows quick answers to questions like “What were the total sales by product category last quarter?”
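
    To make that concrete, here is an illustrative query against a hypothetical retail star schema; the fact_sales, dim_product, and dim_date tables and their columns are assumptions for this sketch, not a specific system:

    sql

    -- Total sales by product category for a quarter: one join per dimension
    SELECT p.product_category,
           SUM(f.sale_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d ON f.date_key = d.date_key
    WHERE d.year = 2025
      AND d.quarter = 3
    GROUP BY p.product_category
    ORDER BY total_sales DESC;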

    What is a Snowflake Schema?

    A snowflake schema is an extension of the star schema but with normalized dimension tables. Here, dimensions are broken down into sub-dimension tables to eliminate redundancy, creating a structure that branches out like a snowflake. The fact table remains central, but dimensions are hierarchical and normalized to third normal form (3NF).

    Hand-drawn snowflake schema diagram for data warehousing

    Advantages of Snowflake Schema:

    • Storage Efficiency: Normalization reduces data duplication, saving disk space—crucial for massive datasets in cloud environments like AWS or Snowflake (the data warehouse platform).
    • Improved Data Integrity: By minimizing redundancy, updates are easier and less error-prone, maintaining consistency across the warehouse.
    • Handles Complex Relationships: Better suited for detailed hierarchies, such as product categories subdivided into brands, suppliers, and regions.

    Disadvantages of Snowflake Schema:

    • Slower Query Performance: More joins are required, which can slow down queries on large volumes of data.
    • Increased Complexity: The normalized structure is harder to understand and maintain, potentially complicating BI tool integrations.

    For example, in the same retail scenario, a snowflake schema might normalize the product dimension into separate tables for products, categories, and suppliers. This allows precise queries like “Sales by supplier region” without redundant storage, but at the cost of additional joins.
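
    The same retail question in a snowflake schema takes extra hops through the normalized sub-dimensions. Again, the table and column names are illustrative:

    sql

    -- Sales by supplier region: the product dimension is normalized into sub-dimensions
    SELECT r.region_name,
           SUM(f.sale_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_supplier s ON p.supplier_key = s.supplier_key
    JOIN dim_region r ON s.region_key = r.region_key
    GROUP BY r.region_name
    ORDER BY total_sales DESC;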

    Key Differences Between Star Schema and Snowflake Schema

    To highlight star schema vs snowflake schema, here’s a comparison table:

    Aspect          | Star Schema                                      | Snowflake Schema
    Normalization   | Denormalized (1NF or 2NF)                        | Normalized (3NF)
    Structure       | Central fact table with direct dimension tables  | Fact table with hierarchical sub-dimensions
    Joins           | Fewer joins, faster queries                      | More joins, potentially slower
    Storage         | Higher due to redundancy                         | Lower, more efficient
    Complexity      | Simple and user-friendly                         | More complex, better for integrity
    Query Speed     | High                                             | Moderate to low
    Data Redundancy | High                                             | Low

    These differences stem from their design philosophies: star focuses on performance, while snowflake emphasizes efficiency and accuracy.

    When to Use Star Schema vs Snowflake Schema

    • Use Star Schema When:
      • Speed is critical (e.g., real-time dashboards).
      • Data models are simple without deep hierarchies.
      • Storage cost isn’t a concern with cheap cloud options.
      • Example: An e-commerce firm uses star schema for rapid sales trend analysis.
    • Use Snowflake Schema When:
      • Storage optimization is key for massive datasets.
      • Complex hierarchies exist (e.g., supply chain layers).
      • Data integrity is paramount during updates.
      • Example: A healthcare provider uses snowflake to manage patient and provider hierarchies.

    Hybrid approaches exist, but pure star schemas are often preferred for their balance of simplicity and performance.

    Which is Used Most in 2025?

    As of 2025, the star schema remains the most commonly used in data warehousing. Its simplicity aligns with the rise of self-service BI tools and cloud platforms like Snowflake and BigQuery, where query optimization mitigates some denormalization drawbacks. Surveys and industry reports indicate that over 70% of data warehouses favor star schemas for their performance advantages, especially in agile environments. Snowflake schemas, while efficient, are more niche—used in about 20-30% of cases where normalization is essential, such as regulated industries like finance or healthcare.

    However, with advancements in columnar storage and indexing, the performance gap is narrowing, making snowflake viable for more use cases.

    Solid Examples in Action

    Consider a healthcare analytics warehouse:

    • Star Schema Example: Fact table tracks patient visits (metrics: visit count, cost). Dimensions: Patient (ID, name, age), Doctor (ID, specialty), Date (year, month), Location (hospital, city). Queries like “Average cost per doctor specialty in 2024” run swiftly with simple joins.
    • Snowflake Schema Example: Normalize the Doctor dimension into Doctor (ID, name), Specialty (ID, type, department), and Department (ID, head). This reduces redundancy if specialties change often, but requires extra joins for the same query.
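
    As a rough illustration of that normalization, the snowflake variant of the Doctor dimension could be declared along these lines; the column names are assumptions for illustration, not a prescribed design.

    -- Star schema: one wide, denormalized dimension
    CREATE TABLE dim_doctor_star (
        doctor_id   INT,
        doctor_name VARCHAR,
        specialty   VARCHAR,
        department  VARCHAR
    );

    -- Snowflake schema: the same attributes normalized into a hierarchy
    CREATE TABLE dim_department (
        department_id   INT,
        department_name VARCHAR,
        department_head VARCHAR
    );

    CREATE TABLE dim_specialty (
        specialty_id   INT,
        specialty_type VARCHAR,
        department_id  INT  -- links to dim_department
    );

    CREATE TABLE dim_doctor (
        doctor_id    INT,
        doctor_name  VARCHAR,
        specialty_id INT  -- links to dim_specialty
    );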

    In a financial reporting system, star might aggregate transaction data quickly for dashboards, while snowflake ensures normalized account hierarchies for compliance audits.

    Best Practices and References

    To implement effectively:

    • Start with business requirements: Prioritize speed or efficiency?
    • Use tools like dbt or ERwin for modeling.
    • Test performance with sample data.


    In conclusion, while both the star schema and the snowflake schema serve data warehousing well, star’s dominance in 2025 underscores the value of simplicity in a fast-paced data landscape. Choose based on your workload—performance for star, efficiency for snowflake—and watch your analytics thrive.

  • Mastering Python Data Pipelines: Extract from APIs & Databases, Load to S3 & Snowflake

    Mastering Python Data Pipelines: Extract from APIs & Databases, Load to S3 & Snowflake

    Introduction to Data Pipelines in Python

    In today’s data-driven world, creating robust data pipeline solutions is essential for businesses to handle large volumes of information efficiently. Whether you’re pulling data from RESTful APIs or external databases, the goal is to extract, transform, and load (ETL) it reliably. This guide walks you through building data pipelines using Python that fetch data from multiple sources, store it in Amazon S3 for scalable storage, and load it into Snowflake for advanced analytics.

    By leveraging Python’s powerful libraries like requests for APIs, sqlalchemy for databases, boto3 for S3, and the Snowflake connector, you can automate these processes. This approach ensures data integrity, scalability, and cost-effectiveness, making it ideal for data engineers and developers.

    Why Use Python for Data Pipelines?

    Python stands out due to its simplicity, extensive ecosystem, and community support. Key benefits include:

    • Ease of Integration: Seamlessly connect to APIs, databases, S3, and Snowflake.
    • Scalability: Handle large datasets with libraries like Pandas for transformations.
    • Automation: Use schedulers like Airflow or cron jobs to run pipelines periodically.
    • Cost-Effective: Open-source tools reduce overhead compared to proprietary ETL software.

    If you’re dealing with real-time data ingestion or batch processing, Python’s flexibility makes it a top choice for modern data pipelines.

    Step 1: Extracting Data from APIs

    Extracting data from APIs is a common starting point in data pipelines. We’ll use the requests library to fetch JSON data from a public API, such as a weather service or GitHub API.

    First, install the necessary packages:

    pip install requests pandas

    Here’s a sample Python script to extract data from an API:

    import requests
    import pandas as pd
    
    def extract_from_api(api_url):
        try:
            response = requests.get(api_url)
            response.raise_for_status()  # Raise error for bad status codes
            data = response.json()
            # Assuming the data is in a list under 'results' key
            df = pd.DataFrame(data.get('results', []))
            print(f"Extracted {len(df)} records from API.")
            return df
        except requests.exceptions.RequestException as e:
            print(f"API extraction error: {e}")
            return pd.DataFrame()
    
    # Example usage
    api_url = "https://api.example.com/data"  # Replace with your API endpoint
    api_data = extract_from_api(api_url)

    This function handles errors gracefully and converts the API response into a Pandas DataFrame for easy manipulation in your Python data pipelines.

    Step 2: Extracting Data from External Databases

    For external databases like MySQL, PostgreSQL, or Oracle, use sqlalchemy to connect and query data. This is crucial for data pipelines involving legacy systems or third-party DBs.

    Install the required libraries:

    pip install sqlalchemy pandas mysql-connector-python  # Adjust driver for your DB

    Sample code to extract from a MySQL database:

    from sqlalchemy import create_engine
    import pandas as pd
    
    def extract_from_db(db_url, query):
        try:
            engine = create_engine(db_url)
            df = pd.read_sql_query(query, engine)
            print(f"Extracted {len(df)} records from database.")
            return df
        except Exception as e:
            print(f"Database extraction error: {e}")
            return pd.DataFrame()
    
    # Example usage
    db_url = "mysql+mysqlconnector://user:password@host:port/dbname"  # Replace with your credentials
    query = "SELECT * FROM your_table WHERE date > '2023-01-01'"
    db_data = extract_from_db(db_url, query)

    This method ensures secure connections and efficient data retrieval, forming a solid foundation for your pipelines in Python.

    Step 3: Transforming Data (Optional ETL Step)

    Before loading, transform the data using Pandas. For instance, merge API and DB data, clean duplicates, or apply calculations.

    # Assuming api_data and db_data are DataFrames
    merged_data = pd.merge(api_data, db_data, on='common_column', how='inner')
    merged_data.drop_duplicates(inplace=True)
    merged_data['new_column'] = merged_data['value1'] + merged_data['value2']

    This step in data pipelines ensures data quality and relevance.

    Step 4: Loading Data to Amazon S3

    Amazon S3 provides durable, scalable storage for your extracted data. Use boto3 to upload files.

    Install boto3:

    pip install boto3

    Code example:

    import boto3
    import io
    
    def load_to_s3(df, bucket_name, file_key, aws_access_key, aws_secret_key):
        try:
            s3_client = boto3.client('s3', aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)
            csv_buffer = io.StringIO()
            df.to_csv(csv_buffer, index=False)
            s3_client.put_object(Bucket=bucket_name, Key=file_key, Body=csv_buffer.getvalue())
            print(f"Data loaded to S3: {bucket_name}/{file_key}")
        except Exception as e:
            print(f"S3 upload error: {e}")
    
    # Example usage
    bucket = "your-s3-bucket"
    key = "data/processed_data.csv"
    load_to_s3(merged_data, bucket, key, "your_access_key", "your_secret_key")  # Use environment variables for security

    Storing in S3 acts as an intermediate layer in data pipelines, enabling versioning and easy access.

    Step 5: Loading Data into Snowflake

    Finally, load the data from S3 into Snowflake for querying and analytics. Use the Snowflake Python connector.

    Install the connector:

    pip install snowflake-connector-python pandas

    Sample Script:

    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas

    def load_to_snowflake(df, snowflake_account, user, password, warehouse, db, schema, table):
        conn = None
        try:
            conn = snowflake.connector.connect(
                user=user,
                password=password,
                account=snowflake_account,
                warehouse=warehouse,
                database=db,
                schema=schema
            )
            # write_pandas bulk-loads the DataFrame through an internal stage;
            # auto_create_table builds the table from the DataFrame's dtypes if it doesn't exist yet
            success, _, nrows, _ = write_pandas(conn, df, table, auto_create_table=True)
            if success:
                print(f"Loaded {nrows} rows into Snowflake table: {table}")
        except Exception as e:
            print(f"Snowflake load error: {e}")
        finally:
            if conn:
                conn.close()

    # Example usage
    load_to_snowflake(merged_data, "your-account", "user", "password", "warehouse", "db", "schema", "your_table")

    For larger datasets, use Snowflake’s COPY INTO command with S3 stages for better performance in your Python data pipelines.
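
    As a hedged sketch of that bulk path, you would create an external stage over the S3 bucket and then COPY INTO the target table (run it directly in Snowflake or through cur.execute() from Python). The stage name, bucket path, keys, and table name below are placeholders, and in production a STORAGE INTEGRATION is preferable to inline credentials.

    -- One-time setup: an external stage pointing at the S3 bucket (placeholder names and keys)
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://your-s3-bucket/data/'
      CREDENTIALS = (AWS_KEY_ID = 'your_access_key' AWS_SECRET_KEY = 'your_secret_key')
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Bulk load: Snowflake reads the staged files in parallel, far faster than row-by-row inserts
    COPY INTO your_table
    FROM @my_s3_stage/processed_data.csv
    ON_ERROR = 'CONTINUE';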

    Best Practices for Data Pipelines in Python

    • Error Handling: Always include try-except blocks to prevent pipeline failures.
    • Security: Use environment variables or AWS Secrets Manager for credentials.
    • Scheduling: Integrate with Apache Airflow or AWS Lambda for automated runs.
    • Monitoring: Log activities and use tools like Datadog for pipeline health.
    • Scalability: For big data, consider PySpark or Dask instead of Pandas.

    Conclusion

    Building Python data pipelines from APIs and databases to S3 and Snowflake streamlines your ETL workflows, enabling faster insights. With the code examples provided, you can start implementing these pipelines today. If you’re optimizing for cloud efficiency, this setup reduces costs while boosting performance.


  • Revolutionizing Finance: A Deep Dive into Snowflake’s Cortex AI

    Revolutionizing Finance: A Deep Dive into Snowflake’s Cortex AI

    The financial services industry is in the midst of a technological revolution. At the heart of this change lies Artificial Intelligence. Consequently, financial institutions are constantly seeking new ways to innovate and enhance security. They also want to deliver personalized customer experiences. However, they face a significant hurdle: navigating fragmented data while adhering to strict compliance and governance requirements. To solve this, Snowflake has introduced Cortex AI for Financial Services, a groundbreaking suite of tools designed to unlock the full potential of AI in the sector.

    What is Snowflake Cortex AI for Financial Services?

    First and foremost, Snowflake Cortex AI is a comprehensive suite of AI capabilities. It empowers financial organizations to unify their data and securely deploy AI models, applications, and agents. By bringing AI directly to the data, Snowflake eliminates the need to move sensitive information. As a result, security and governance are greatly enhanced. This approach allows institutions to leverage their own proprietary data alongside third-party sources and cutting-edge large language models (LLMs). Ultimately, this helps them automate complex tasks and derive faster, more accurate insights.

    Key Capabilities Driving the Transformation

    Cortex AI for Financial Services is built on a foundation of powerful features. These are specifically designed to accelerate AI adoption within the financial industry.

    • Snowflake Data Science Agent: This AI-powered coding assistant automates many time-consuming tasks for data scientists. For instance, it handles data cleaning, feature engineering, and model prototyping. This, in turn, accelerates the development of crucial workflows like risk modeling and fraud detection.
    • Cortex AISQL: With its AI-powered functions, Cortex AISQL allows users to process and analyze unstructured data at scale. This includes market research, earnings call transcripts, and transaction details. Therefore, it transforms workflows in customer service, investment analytics, and claims processing; a short SQL sketch follows this list.
    • Snowflake Intelligence: Furthermore, this feature provides business users with an intuitive, conversational interface. They can query both structured and unstructured data using natural language. This “democratization” of data access means even non-technical users can gain valuable insights without writing complex SQL.
    • Managed Model Context Protocol (MCP) Server: The MCP Server is a true game-changer. It securely connects proprietary data with third-party data from partners like FactSet and MSCI. In addition, it provides a standardized method for LLMs to integrate with data APIs, which eliminates the need for custom work and speeds up the delivery of AI applications.
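
    To give a flavor of this, here is a minimal, hedged sketch using Snowflake’s documented Cortex functions SENTIMENT and COMPLETE; the earnings_call_transcripts table, its columns, and the chosen model are assumptions for illustration, and model availability varies by region.

    -- Score the sentiment of each earnings-call transcript (returns a value between -1 and 1)
    SELECT
        company_name,
        call_date,
        SNOWFLAKE.CORTEX.SENTIMENT(transcript_text) AS sentiment_score
    FROM earnings_call_transcripts
    WHERE call_date >= '2025-01-01';

    -- Summarize a single transcript with an LLM through Cortex COMPLETE
    SELECT SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',
        'Summarize the key risks discussed in this earnings call: ' || transcript_text
    ) AS risk_summary
    FROM earnings_call_transcripts
    LIMIT 1;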

    Use Cases: Putting Cortex AI to Work in Finance

    The practical applications of Snowflake Cortex AI in the financial services industry are vast and transformative:

    • Fraud Detection and Prevention: By training models on historical transaction data, institutions can identify suspicious patterns in real time. Consequently, this proactive approach helps minimize losses and protect customers.
    • Credit Risk Analysis: Cortex Analyst, a key feature, can analyze vast amounts of transaction data to assess credit risk. By building a semantic model, for example, financial institutions can enable more accurate and nuanced risk assessments.
    • Algorithmic Trading Support: While not a trading platform itself, Snowflake’s infrastructure supports algorithmic strategies. Specifically, Cortex AI provides powerful tools for data analysis, pattern identification, and model development.
    • Enhanced Customer Service: Moreover, AI agents powered by Cortex AI can create sophisticated customer support systems. These agents can analyze customer data to provide personalized assistance and automate tasks, leading to improved satisfaction.
    • Market and Investment Analysis: Cortex AI can also analyze a wide range of data sources, from earnings calls to market news. This provides real-time insights that are crucial for making informed and timely investment decisions.

    The Benefits of a Unified AI and Data Strategy

    By adopting Snowflake Cortex AI, financial institutions can realize a multitude of benefits:

    • Enhanced Security and Governance: By bringing AI to the data, sensitive financial information remains within Snowflake’s secure and governed environment.
    • Faster Innovation: Automating data science tasks allows for the rapid development and deployment of new products.
    • Democratization of Data: Natural language interfaces empower more users to access and analyze data directly.
    • Reduced Operational Costs: Finally, the automation of complex tasks leads to significant cost savings and increased efficiency.

    Getting Started with Snowflake Cortex AI

    For institutions ready to begin their AI journey, the path is clear. The Snowflake Quickstarts offer a wealth of tutorials and guides. These resources provide step-by-step instructions on how to set up the environment, create models, and build powerful applications.

    The Future of Finance is Here

    In conclusion, Snowflake Cortex AI for Financial Services represents a pivotal moment for the industry. By providing a secure, scalable, and unified platform, Snowflake is empowering financial institutions to seize the opportunities of tomorrow. The ability to seamlessly integrate data with the latest AI technology will undoubtedly be a key differentiator in the competitive landscape of finance.



  • Snowflake Data Science Agent: Automate ML Workflows 2025

    Snowflake Data Science Agent: Automate ML Workflows 2025

    The 60–80% Problem Killing Data Science Productivity

    Data science productivity is being crushed by the 60–80% problem. Despite powerful platforms like Snowflake and cutting-edge ML tools, data scientists still spend the majority of their time—60 to 80 percent—on repetitive tasks like data cleaning, feature engineering, and environment setup. This bottleneck is stalling innovation and delaying insights that drive business value.

    Data scientists spend 60-80% time on repetitive tasks vs strategic work

    A typical ML project timeline looks like this:

    • Weeks 1-2: Finding datasets, setting up environments, searching for similar projects
    • Weeks 3-4: Data preprocessing, exploratory analysis, feature engineering
    • Weeks 5-6: Model selection, hyperparameter tuning, training
    • Weeks 7-8: Evaluation, documentation, deployment preparation

    Only after this two-month slog do data scientists reach the interesting work: interpreting results and driving business impact.

    What if you could compress weeks of foundational ML work into under an hour?

    Enter the Snowflake Data Science Agent, announced at Snowflake Summit 2025 on June 3. This agentic AI companion automates routine ML development tasks, transforming how organizations build and deploy machine learning models.


    What is Snowflake Data Science Agent?

    Snowflake Data Science Agent is an autonomous AI assistant that automates the entire ML development lifecycle within the Snowflake environment. Currently in private preview with general availability expected in late 2025, it represents a fundamental shift in how data scientists work.

    Natural language prompt converting to production-ready ML code

    The Core Innovation

    Rather than manually coding each step of an ML pipeline, data scientists describe their objective in natural language. The agent then:

    Understands Context: Analyzes available datasets, business requirements, and project goals

    Plans Strategy: Breaks down the ML problem into logical, executable steps

    Generates Code: Creates production-quality Python code for each pipeline component

    Executes Workflows: Runs the pipeline directly in Snowflake Notebooks with full observability

    Iterates Intelligently: Refines approaches based on results and user feedback

    Powered by Claude AI

    The Data Science Agent leverages Anthropic’s Claude large language model, running securely within Snowflake’s perimeter. This integration ensures that proprietary data never leaves the governed Snowflake environment while providing state-of-the-art reasoning capabilities.


    How Data Science Agent Transforms ML Workflows

    Traditional ML Pipeline vs. Agent-Assisted Pipeline

    Traditional Approach (4-8 Weeks):

    1. Manual dataset discovery and access setup (3-5 days)
    2. Exploratory data analysis with custom scripts (5-7 days)
    3. Data preprocessing and quality checks (7-10 days)
    4. Feature engineering experiments (5-7 days)
    5. Model selection and baseline training (3-5 days)
    6. Hyperparameter tuning iterations (7-10 days)
    7. Model evaluation and documentation (5-7 days)
    8. Deployment preparation and handoff (3-5 days)

    Agent-Assisted Approach (1-2 Days):

    1. Natural language project description (15 minutes)
    2. Agent generates complete pipeline (30-60 minutes)
    3. Review and customize generated code (2-3 hours)
    4. Execute and evaluate results (1-2 hours)
    5. Iterate with follow-up prompts (30 minutes per iteration)
    6. Production deployment (1-2 hours)

    The agent doesn’t eliminate human expertise—it amplifies it. Data scientists focus on problem formulation, result interpretation, and business strategy rather than boilerplate code.


    Key Capabilities and Features

    1. Automated Data Preparation

    The agent handles the most time-consuming aspects of data science:

    Data Profiling: Automatically analyzes distributions, identifies missing values, detects outliers, and assesses data quality

    Smart Preprocessing: Generates appropriate transformations based on data characteristics—normalization, encoding, imputation, scaling

    Feature Engineering: Creates relevant features using domain knowledge embedded in the model, including polynomial features, interaction terms, and temporal aggregations

    Data Validation: Implements checks to ensure data quality throughout the pipeline

    2. Intelligent Model Selection

    Rather than manually testing dozens of algorithms, the agent:

    Evaluates Problem Type: Classifies tasks as regression, classification, clustering, or time series

    Considers Constraints: Factors in dataset size, feature types, and performance requirements

    Recommends Algorithms: Suggests appropriate models with justification for each recommendation

    Implements Ensemble Methods: Combines multiple models when beneficial for accuracy

    3. Automated Hyperparameter Tuning

    The agent configures and executes optimization strategies:

    Grid Search: Systematic exploration of parameter spaces for small parameter sets

    Random Search: Efficient sampling for high-dimensional parameter spaces

    Bayesian Optimization: Intelligent search using previous results to guide exploration

    Early Stopping: Prevents overfitting and saves compute resources

    4. Production-Ready Code Generation

    Generated pipelines aren’t just prototypes—they’re production-quality:

    Modular Architecture: Clean, reusable functions with clear separation of concerns

    Error Handling: Robust exception handling and logging

    Documentation: Inline comments and docstrings explaining logic

    Version Control Ready: Structured for Git workflows and collaboration

    Snowflake Native: Optimized for Snowflake’s distributed computing environment

    5. Explainability and Transparency

    Understanding model behavior is crucial for trust and compliance:

    Feature Importance: Identifies which variables drive predictions

    SHAP Values: Explains individual predictions with Shapley values

    Model Diagnostics: Generates confusion matrices, ROC curves, and performance metrics

    Audit Trails: Logs all decisions, code changes, and model versions


    Real-World Use Cases

    Financial Services: Fraud Detection

    Challenge: A bank needs to detect fraudulent transactions in real-time with minimal false positives.

    Traditional Approach: Data science team spends 6 weeks building and tuning models, requiring deep SQL expertise, feature engineering knowledge, and model optimization skills.

    Data Science Agent use cases across finance, healthcare, retail, manufacturing

    With Data Science Agent:

    • Prompt: “Build a fraud detection model using transaction history, customer profiles, and merchant data. Optimize for 99% precision while maintaining 85% recall.”
    • Result: Agent generates a complete pipeline with ensemble methods, class imbalance handling, and real-time scoring infrastructure in under 2 hours
    • Impact: Model deployed 95% faster, freeing the team to work on sophisticated fraud pattern analysis

    Healthcare: Patient Risk Stratification

    Challenge: A hospital system wants to identify high-risk patients for proactive intervention.

    Traditional Approach: Clinical data analysts spend 8 weeks wrangling EHR data, building features from medical histories, and validating models against clinical outcomes.

    With Data Science Agent:

    • Prompt: “Create a patient risk stratification model using diagnoses, medications, lab results, and demographics. Focus on interpretability for clinical adoption.”
    • Result: Agent produces an explainable model with clinically meaningful features, SHAP explanations for each prediction, and validation against established risk scores
    • Impact: Clinicians trust the model due to transparency, leading to 40% adoption rate vs. typical 15%

    Retail: Customer Lifetime Value Prediction

    Challenge: An e-commerce company needs to predict customer lifetime value to optimize marketing spend.

    Traditional Approach: Marketing analytics team collaborates with data scientists for 5 weeks, iterating on feature definitions and model approaches.

    With Data Science Agent:

    • Prompt: “Predict 12-month customer lifetime value using purchase history, browsing behavior, and demographic data. Segment customers into high/medium/low value tiers.”
    • Result: Agent delivers a complete CLV model with customer segmentation, propensity scores, and a dashboard for marketing teams
    • Impact: Marketing ROI improves 32% through better targeting, model built 90% faster

    Manufacturing: Predictive Maintenance

    Challenge: A manufacturer wants to predict equipment failures before they occur to minimize downtime.

    Traditional Approach: Engineers and data scientists spend 7 weeks analyzing sensor data, building time-series features, and testing various forecasting approaches.

    With Data Science Agent:

    • Prompt: “Build a predictive maintenance model using sensor telemetry, maintenance logs, and operational data. Predict failures 24-48 hours in advance.”
    • Result: Agent creates a time-series model with automated feature extraction from streaming data, anomaly detection, and failure prediction
    • Impact: Unplanned downtime reduced by 28%, maintenance costs decreased by 19%

    Technical Architecture

    Integration with Snowflake Ecosystem

    The Data Science Agent operates natively within Snowflake’s architecture:

    Snowflake Data Science Agent architecture and ecosystem integration

    Snowflake Notebooks: All code generation and execution happens in collaborative notebooks

    Snowpark Python: Leverages Snowflake’s Python runtime for distributed computing

    Data Governance: Inherits existing row-level security, masking, and access controls

    Cortex AI Suite: Integrates with Cortex Analyst, Search, and AISQL for comprehensive AI capabilities

    ML Jobs: Automates model training, scheduling, and monitoring at scale

    How It Works: Behind the Scenes

    When a data scientist provides a natural language prompt:

    🔍 Step 1: Understanding

    • Claude analyzes the request, identifying ML task type, success metrics, and constraints
    • Agent queries Snowflake’s catalog to discover relevant tables and understand schema

    🧠 Step 2: Planning

    • Generates a step-by-step execution plan covering data prep, modeling, and evaluation
    • Identifies required Snowflake features and libraries

    💻 Step 3: Code Generation

    • Creates executable Python code for each pipeline stage
    • Includes data validation, error handling, and logging

    🚀 Step 4: Execution

    • Runs generated code in Snowflake Notebooks with full visibility
    • Provides real-time progress updates and intermediate results

    📊 Step 5: Evaluation

    • Generates comprehensive model diagnostics and performance metrics
    • Recommends next steps based on results

    🔁 Step 6: Iteration

    • Accepts follow-up prompts to refine the model
    • Tracks changes and maintains version history

    Best Practices for Using Data Science Agent

    1. Write Clear, Specific Prompts

    Poor Prompt: “Build a model for sales”

    Good Prompt: “Create a weekly sales forecasting model for retail stores using historical sales, promotions, weather, and holidays. Optimize for MAPE under 10%. Include confidence intervals.”

    The more context you provide, the better the agent performs.

    2. Start with Business Context

    Begin prompts with the business problem and success criteria:

    • What decision will this model inform?
    • What accuracy is acceptable?
    • What are the cost/benefit tradeoffs?
    • Are there regulatory requirements?

    3. Iterate Incrementally

    Don’t expect perfection on the first generation. Use follow-up prompts:

    • “Add feature importance analysis”
    • “Try a gradient boosting approach”
    • “Optimize for faster inference time”
    • “Add cross-validation with 5 folds”

    4. Review Generated Code

    While the agent produces high-quality code, always review:

    • Data preprocessing logic for business rule compliance
    • Feature engineering for domain appropriateness
    • Model selection justification
    • Performance metrics alignment with business goals

    5. Establish Governance Guardrails

    Define organizational standards:

    • Required documentation templates
    • Mandatory model validation steps
    • Approved algorithm lists for regulated industries
    • Data privacy and security checks

    6. Combine Agent Automation with Human Expertise

    Use the agent for:

    • Rapid prototyping and baseline models
    • Automated preprocessing and feature engineering
    • Hyperparameter tuning and model selection
    • Code documentation and testing

    Retain human control for:

    • Problem formulation and success criteria
    • Business logic validation
    • Ethical considerations and bias assessment
    • Strategic decision-making on model deployment

    Measuring ROI: The Business Impact

    Organizations adopting Data Science Agent report significant benefits:

    Time-to-Production Acceleration

    Before Agent: Average 8-12 weeks from concept to production model

    With Agent: Average 1-2 weeks from concept to production model

    Impact: 5-10x faster model development cycles

    Productivity Multiplication

    Before Agent: 2-3 models per data scientist per quarter

    With Agent: 8-12 models per data scientist per quarter

    Impact: 4x increase in model output, enabling more AI use cases

    Quality Improvements

    Before Agent: 40-60% of models reach production (many abandoned due to insufficient ROI)

    With Agent: 70-85% of models reach production (faster iteration enables more refinement)

    Impact: Higher model quality through rapid experimentation

    Cost Optimization

    Before Agent: $150K-200K average cost per model (personnel time, infrastructure)

    With Agent: $40K-60K average cost per model

    Impact: 70% reduction in model development costs

    Democratization of ML

    Before Agent: Only senior data scientists can build production models

    With Agent: Junior analysts and citizen data scientists can create sophisticated models

    Impact: 3-5x expansion of AI capability across organization


    Limitations and Considerations

    While powerful, Data Science Agent has important constraints:

    Current Limitations

    Preview Status: Still in private preview; features and capabilities evolving

    Scope Boundaries: Optimized for structured data ML; deep learning and computer vision require different approaches

    Domain Knowledge: Agent lacks specific industry expertise; users must validate business logic

    Complex Custom Logic: Highly specialized algorithms may require manual implementation

    Important Considerations

    Data Quality Dependency: Agent’s output quality directly correlates with input data quality—garbage in, garbage out still applies

    Computational Costs: Automated hyperparameter tuning can consume significant compute resources

    Over-Reliance Risk: Organizations must maintain ML expertise; agents augment, not replace, human judgment

    Regulatory Compliance: In highly regulated industries, additional human review and validation required

    Bias and Fairness: Automated feature engineering may perpetuate existing biases; fairness testing essential


    The Future of Data Science Agent

    Based on Snowflake’s roadmap and industry trends, expect these developments:

    Future of autonomous ML operations with Snowflake Data Science Agent

    Short-Term (2025-2026)

    General Availability: Broader access as private preview graduates to GA

    Expanded Model Types: Support for time series, recommendation systems, and anomaly detection

    AutoML Enhancements: More sophisticated algorithm selection and ensemble methods

    Deeper Integration: Tighter coupling with Snowflake ML Jobs and model registry

    Medium-Term (2026-2027)

    Multi-Modal Learning: Support for unstructured data (images, text, audio) alongside structured data

    Federated Learning: Distributed model training across data clean rooms

    Automated Monitoring: Self-healing models that detect drift and retrain automatically

    Natural Language Insights: Plain English explanations of model behavior for business users

    Long-Term Vision (2027+)

    Autonomous ML Operations: End-to-end model lifecycle management with minimal human intervention

    Cross-Domain Transfer Learning: Agents that leverage learnings across industries and use cases

    Collaborative Multi-Agent Systems: Specialized agents working together on complex problems

    Causal ML Integration: Moving beyond correlation to causal inference and counterfactual analysis


    Getting Started with Data Science Agent

    Prerequisites

    To leverage Data Science Agent, you need:

    Snowflake Account: Enterprise edition or higher with Cortex AI enabled

    Data Foundation: Structured data in Snowflake tables or views

    Clear Use Case: Well-defined business problem with success metrics

    User Permissions: Access to Snowflake Notebooks and Cortex features

    Request Access

    Data Science Agent is currently in private preview:

    1. Contact your Snowflake account team to express interest
    2. Complete the preview application with use case details
    3. Participate in onboarding and training sessions
    4. Join the preview community for best practices sharing

    Pilot Project Selection

    Choose an initial use case with these characteristics:

    High Business Value: Clear ROI and stakeholder interest

    Data Availability: Clean, accessible data in Snowflake

    Reasonable Complexity: Not trivial, but not your most difficult problem

    Failure Tolerance: Low risk if the model needs iteration

    Measurement Clarity: Easy to quantify success

    Success Metrics

    Track these KPIs to measure Data Science Agent impact:

    • Time from concept to production model
    • Number of models per data scientist per quarter
    • Percentage of models reaching production
    • Model performance metrics vs. baseline
    • Cost per model developed
    • Data scientist satisfaction scores

    Snowflake Data Science Agent vs. Competitors

    How It Compares

    Databricks AutoML:

    • Snowflake’s advantage: Tighter integration with governed data, no data movement
    • Trade-off: Databricks offers more deep learning capabilities

    Google Cloud AutoML:

    • Snowflake’s advantage: Runs on your data warehouse, no egress costs
    • Trade-off: Google has a broader pre-trained model library

    Amazon SageMaker Autopilot:

    • Snowflake’s advantage: Simpler for SQL-first organizations
    • Trade-off: AWS has more deployment flexibility

    H2O.ai Driverless AI:

    • Snowflake’s advantage: Native Snowflake integration, better governance
    • Trade-off: H2O specializes in AutoML with more tuning options

    Why Choose Snowflake Data Science Agent?

    Data Gravity: Build ML where your data lives—no movement, no copies, no security risks

    Unified Platform: Single environment for data engineering, analytics, and ML

    Enterprise Governance: Leverage existing security, compliance, and access controls

    Ecosystem Integration: Works seamlessly with BI tools, notebooks, and applications

    Scalability: Automatic compute scaling without infrastructure management


    Conclusion: The Data Science Revolution Begins Now

    The Snowflake Data Science Agent represents more than a productivity tool—it’s a fundamental reimagining of how organizations build machine learning solutions. By automating the 60-80% of work that consumes data scientists’ time, it unleashes their potential to solve harder problems, explore more use cases, and deliver greater business impact.

    The transformation is already beginning. Organizations in the private preview report 5-10x faster model development, 4x increases in productivity, and democratization of ML capabilities across their teams. As Data Science Agent reaches general availability in late 2025, these benefits will scale across the entire Snowflake ecosystem.

    The question isn’t whether to adopt AI-assisted data science—it’s how quickly you can implement it to stay competitive.

    For data leaders, the opportunity is clear: accelerate AI initiatives, multiply data science team output, and tackle the backlog of use cases that were previously too expensive or time-consuming to address.

    For data scientists, the promise is equally compelling: spend less time on repetitive tasks and more time on creative problem-solving, strategic thinking, and high-impact analysis.

    The future of data science is agentic. The future of data science is here.


    Key Takeaways

    • Snowflake Data Science Agent automates 60-80% of routine ML development work
    • Announced June 3, 2025, at Snowflake Summit; currently in private preview
    • Powered by Anthropic’s Claude AI running securely within Snowflake
    • Transforms weeks of ML pipeline work into hours through natural language interaction
    • Generates production-quality code for data prep, modeling, tuning, and evaluation
    • Organizations report 5-10x faster model development and 4x productivity gains
    • Use cases span financial services, healthcare, retail, manufacturing, and more
    • Maintains Snowflake’s enterprise governance, security, and compliance controls
    • Best used for structured data ML; human expertise still essential for strategy
    • Expected general availability in late 2025 with continued capability expansion