The world of data analytics is changing. For years, accessing insights required writing complex SQL queries. However, the industry is now shifting towards a more intuitive, conversational approach. At the forefront of this revolution is agentic AI—intelligent systems that can understand human language, reason, plan, and automate complex tasks.
Snowflake is leading this charge by transforming its platform into an intelligent and conversational AI Data Cloud. With the recent introduction of Snowflake Cortex Agents, they have provided a powerful tool for developers and data teams to build their own custom AI assistants.
This guide will walk you through, step-by-step, how to build your very first AI data agent. You will learn how to create an agent that can answer complex questions by pulling information from both your database tables and your unstructured documents, all using simple, natural language.
What is a Snowflake Cortex Agent and Why Does it Matter?
First and foremost, a Snowflake Cortex Agent is an AI-powered assistant that you can build on top of your own data. Think of it as a chatbot that has expert knowledge of your business. It understands your data landscape and can perform complex analytical tasks based on simple, conversational prompts.
This is a game-changer for several reasons:
It Democratizes Data: Business users no longer need to know SQL. Instead, they can ask questions like, “What were our top-selling products in the last quarter?” and get immediate, accurate answers.
It Automates Analysis: Consequently, data teams are freed from writing repetitive, ad-hoc queries. They can now focus on more strategic initiatives while the agent handles routine data exploration.
It Provides Unified Insights: Most importantly, a Cortex Agent can synthesize information from multiple sources. It can query your structured sales data from a table and cross-reference it with strategic goals mentioned in a PDF document, all in a single response.
The Blueprint: How a Cortex Agent Works
Under the hood, a Cortex Agent uses a simple yet powerful workflow to answer your questions. It orchestrates several of Snowflake’s Cortex AI features to deliver a comprehensive answer.
Planning: The agent first analyzes your natural language question to understand your intent. It figures out what information you need and where it might be located.
Tool Use: Next, it intelligently chooses the right tool for the job. If it needs to query structured data, it uses Cortex Analyst to generate and run SQL. If it needs to find information in your documents, it uses Cortex Search.
Reflection: Finally, after gathering the data, the agent evaluates the results. It might ask for clarification, refine its approach, or synthesize the information into a clear, concise answer before presenting it to you.
Step-by-Step Tutorial: Building a Sales Analysis Agent
Now, let’s get hands-on. We will build a simple yet powerful sales analysis agent. This agent will be able to answer questions about sales figures from a table and also reference goals from a quarterly business review (QBR) document.
Prerequisites
A Snowflake account with ACCOUNTADMIN privileges.
A warehouse to run the queries.
Step 1: Prepare Your Data
First, we need some data to work with. Let’s create two simple tables for sales and products, and then upload a sample text document.
Run the following SQL in a Snowflake worksheet:
-- Create our database and schema
CREATE DATABASE IF NOT EXISTS AGENT_DEMO;
CREATE SCHEMA IF NOT EXISTS AGENT_DEMO.SALES;
USE SCHEMA AGENT_DEMO.SALES;
-- Create a products table
CREATE OR REPLACE TABLE PRODUCTS (
    product_id INT,
    product_name VARCHAR,
    category VARCHAR
);
INSERT INTO PRODUCTS (product_id, product_name, category) VALUES
(101, 'Quantum Laptop', 'Electronics'),
(102, 'Nebula Smartphone', 'Electronics'),
(103, 'Stardust Keyboard', 'Accessories');
-- Create a sales table
CREATE OR REPLACE TABLE SALES (
    sale_id INT,
    product_id INT,
    sale_date DATE,
    sale_amount DECIMAL(10, 2)
);
INSERT INTO SALES (sale_id, product_id, sale_date, sale_amount) VALUES
(1, 101, '2025-09-01', 1200.00),
(2, 102, '2025-09-05', 800.00),
(3, 101, '2025-09-15', 1250.00),
(4, 103, '2025-09-20', 150.00);
-- Create a stage for our unstructured documents
CREATE OR REPLACE STAGE qbr_documents;
Now, create a simple text file named QBR_Report_Q3.txt on your local machine with the following content and upload it to the qbr_documents stage using the Snowsight UI.
Quarterly Business Review – Q3 2025 Summary
Our primary strategic goal for Q3 was to drive the adoption of our new flagship product, the ‘Quantum Laptop’. We aimed for a sales target of over $2,000 for this product. Secondary goals included expanding our market share in the accessories category.
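If you prefer the command line to the Snowsight UI, the upload can also be done from SnowSQL with the PUT command; the local path below assumes the file is in your current directory, so adjust it as needed:
-- Run from SnowSQL; PUT cannot be executed in a Snowsight worksheet
PUT file://QBR_Report_Q3.txt @qbr_documents AUTO_COMPRESS = FALSE;
-- Confirm the file landed in the stage
LIST @qbr_documents;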
Step 2: Create a Semantic Model
Next, we need to teach the agent about our structured data. We do this by creating a Semantic Model. This is a YAML file that defines our tables, columns, and how they relate to each other.
# semantic_model.yaml
model:
  name: sales_insights_model
  tables:
    - name: SALES
      columns:
        - name: sale_id
          type: INT
        - name: product_id
          type: INT
        - name: sale_date
          type: DATE
        - name: sale_amount
          type: DECIMAL
    - name: PRODUCTS
      columns:
        - name: product_id
          type: INT
        - name: product_name
          type: VARCHAR
        - name: category
          type: VARCHAR
  joins:
    - from: SALES
      to: PRODUCTS
      on: SALES.product_id = PRODUCTS.product_id
Save this as semantic_model.yaml and upload it to the @qbr_documents stage.
Step 3: Create a Cortex Search Service
Now, let’s make our QBR document searchable. We create a Cortex Search Service on the stage where we uploaded our file.
CREATE OR REPLACE CORTEX SEARCH SERVICE sales_qbr_service
ON @qbr_documents
TARGET_LAG = '0 seconds'
WAREHOUSE = 'COMPUTE_WH';
Step 4: Combine Them into a Cortex Agent
With all the pieces in place, we can now create our agent. This single SQL statement brings together our semantic model (for SQL queries) and our search service (for document queries).
CREATE OR REPLACE CORTEX AGENT sales_agent
MODEL = 'mistral-large',
CORTEX_SEARCH_SERVICES = [sales_qbr_service],
SEMANTIC_MODELS = ['@qbr_documents/semantic_model.yaml'];
Step 5: Ask Your Agent Questions!
The agent is now ready! You can interact with it using the CALL command. Let’s try a few questions.
First up: A simple structured data query.
CALL sales_agent('What were our total sales?');
Next: A more complex query involving joins.
CALL sales_agent('Which product had the highest revenue?');
Then comes: A question for our unstructured document.
CALL sales_agent('Summarize our strategic goals from the latest QBR report.');
Finally, the magic: a question that combines both.
CALL sales_agent('Did we meet our sales target for the Quantum Laptop as mentioned in the QBR?');
This final query demonstrates the true power of a Snowflake Cortex Agent. It will first query the SALES and PRODUCTS tables to calculate the total sales for the “Quantum Laptop.” Then, it will use Cortex Search to find the sales target mentioned in the QBR document. Finally, it will compare the two and give you a complete, synthesized answer.
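For a sense of what happens under the hood, here is roughly the kind of SQL Cortex Analyst could generate for the structured half of that question; the exact query the agent produces may differ:
-- Total sales for the Quantum Laptop from our sample tables
SELECT p.product_name,
       SUM(s.sale_amount) AS total_sales
FROM SALES s
JOIN PRODUCTS p
  ON s.product_id = p.product_id
WHERE p.product_name = 'Quantum Laptop'
GROUP BY p.product_name;
With the sample data above this returns 2,450.00, which the agent can then compare against the $2,000 target that Cortex Search retrieves from the QBR document.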
Conclusion: The Future is Conversational
You have just built a powerful AI data agent in a matter of minutes. This is a fundamental shift in how we interact with data. By combining natural language processing with the power to query both structured and unstructured data, Snowflake Cortex Agents are paving the way for a future where data-driven insights are accessible to everyone in an organization.
As Snowflake continues to innovate with features like Adaptive Compute and Gen-2 Warehouses, running these AI workloads will only become faster and more efficient. The era of conversational analytics has arrived, and it’s built on the Snowflake AI Data Cloud.
In the realm of data warehousing, choosing the right schema design is crucial for efficient data management, querying, and analysis. Two of the most popular multidimensional schemas are the star schema and the snowflake schema. These schemas organize data into fact tables (containing measurable metrics) and dimension tables (providing context like who, what, when, and where). Understanding star schema vs snowflake schema helps data engineers, analysts, and architects build scalable systems that support business intelligence (BI) tools and advanced analytics.
This comprehensive guide delves into their structures, pros, cons, when to use each, real-world examples, and which one dominates in modern data practices as of 2025.
What is a Star Schema?
A star schema is a denormalized data model resembling a star, with a central fact table surrounded by dimension tables. The fact table holds quantitative data (e.g., sales amounts, quantities) and foreign keys linking to dimensions. Dimension tables store descriptive attributes (e.g., product names, customer details) and are not further normalized.
Advantages of Star Schema:
Simplicity and Ease of Use: Fewer tables mean simpler queries with minimal joins, making it intuitive for end-users and BI tools like Tableau or Power BI.
Faster Query Performance: Denormalization reduces join operations, leading to quicker aggregations and reports, especially on large datasets.
Better for Reporting: Ideal for OLAP (Online Analytical Processing) where speed is prioritized over storage efficiency.
Disadvantages of Star Schema:
Data Redundancy: Denormalization can lead to duplicated data in dimension tables, increasing storage needs and risking inconsistencies during updates.
Limited Flexibility for Complex Hierarchies: It struggles with intricate relationships, such as multi-level product categories.
In practice, star schemas are favored in environments where query speed trumps everything else. For instance, in a retail data warehouse, the fact table might record daily sales metrics, while dimensions cover products, customers, stores, and dates. This setup allows quick answers to questions like “What were the total sales by product category last quarter?”
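To make that concrete, here is an illustrative star-schema query; the fact and dimension table names (fact_sales, dim_product, dim_date) are hypothetical:
-- Total sales by product category for Q3 2025: one fact table, two direct dimension joins
SELECT d.category,
       SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_product d ON f.product_key = d.product_key
JOIN dim_date   dt ON f.date_key   = dt.date_key
WHERE dt.year = 2025
  AND dt.quarter = 3
GROUP BY d.category;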
What is a Snowflake Schema?
A snowflake schema is an extension of the star schema but with normalized dimension tables. Here, dimensions are broken down into sub-dimension tables to eliminate redundancy, creating a structure that branches out like a snowflake. The fact table remains central, but dimensions are hierarchical and normalized to third normal form (3NF).
Advantages of Snowflake Schema:
Storage Efficiency: Normalization reduces data duplication, saving disk space—crucial for massive datasets in cloud environments like AWS or Snowflake (the data warehouse platform).
Improved Data Integrity: By minimizing redundancy, updates are easier and less error-prone, maintaining consistency across the warehouse.
Handles Complex Relationships: Better suited for detailed hierarchies, such as product categories subdivided into brands, suppliers, and regions.
Disadvantages of Snowflake Schema:
Slower Query Performance: More joins are required, which can slow down queries on large volumes of data.
Increased Complexity: The normalized structure is harder to understand and maintain, potentially complicating BI tool integrations.
For example, in the same retail scenario, a snowflake schema might normalize the product dimension into separate tables for products, categories, and suppliers. This allows precise queries like “Sales by supplier region” without redundant storage, but at the cost of additional joins.
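The same retail question in a snowflake schema needs extra hops because the product dimension is normalized; again, the table names (dim_product, dim_supplier, dim_region) are hypothetical:
-- Sales by supplier region: the normalized hierarchy adds two more joins
SELECT r.region_name,
       SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_supplier s ON p.supplier_key = s.supplier_key
JOIN dim_region   r ON s.region_key   = r.region_key
GROUP BY r.region_name;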
Key Differences Between Star Schema and Snowflake Schema
To highlight star schema vs snowflake schema, here’s a comparison table:
Aspect | Star Schema | Snowflake Schema
Normalization | Denormalized (1NF or 2NF) | Normalized (3NF)
Structure | Central fact table with direct dimension tables | Fact table with hierarchical sub-dimensions
Joins | Fewer joins, faster queries | More joins, potentially slower
Storage | Higher due to redundancy | Lower, more efficient
Complexity | Simple and user-friendly | More complex, better for integrity
Query Speed | High | Moderate to low
Data Redundancy | High | Low
These differences stem from their design philosophies: star focuses on performance, while snowflake emphasizes efficiency and accuracy.
When to Use Star Schema vs Snowflake Schema
Use Star Schema When:
Speed is critical (e.g., real-time dashboards).
Data models are simple without deep hierarchies.
Storage cost isn’t a concern with cheap cloud options.
Example: An e-commerce firm uses star schema for rapid sales trend analysis.
Use Snowflake Schema When:
Data integrity and minimal redundancy are priorities.
Dimensions contain deep hierarchies (e.g., categories, brands, suppliers).
Storage efficiency matters for very large dimension tables.
Example: A healthcare provider uses snowflake to manage patient and provider hierarchies.
Hybrid approaches exist, but pure star schemas are often preferred for balance.
Which is Used Most in 2025?
As of 2025, the star schema remains the most commonly used in data warehousing. Its simplicity aligns with the rise of self-service BI tools and cloud platforms like Snowflake and BigQuery, where query optimization mitigates some denormalization drawbacks. Surveys and industry reports indicate that over 70% of data warehouses favor star schemas for their performance advantages, especially in agile environments. Snowflake schemas, while efficient, are more niche—used in about 20-30% of cases where normalization is essential, such as regulated industries like finance or healthcare.
However, with advancements in columnar storage and indexing, the performance gap is narrowing, making snowflake viable for more use cases.
Solid Examples in Action
Consider a healthcare analytics warehouse:
Star Schema Example: Fact table tracks patient visits (metrics: visit count, cost). Dimensions: Patient (ID, name, age), Doctor (ID, specialty), Date (year, month), Location (hospital, city). Queries like “Average cost per doctor specialty in 2024” run swiftly with simple joins.
Snowflake Schema Example: Normalize the Doctor dimension into Doctor (ID, name), Specialty (ID, type, department), and Department (ID, head). This reduces redundancy if specialties change often, but requires extra joins for the same query.
In a financial reporting system, star might aggregate transaction data quickly for dashboards, while snowflake ensures normalized account hierarchies for compliance audits.
Best Practices
To implement effectively:
Start with business requirements: Prioritize speed or efficiency?
In conclusion, while star schema vs snowflake schema both serve data warehousing, star’s dominance in 2025 underscores the value of simplicity in a fast-paced data landscape. Choose based on your workload—performance for star, efficiency for snowflake—and watch your analytics thrive.
The financial services industry is in the midst of a technological revolution. At the heart of this change lies Artificial Intelligence. Consequently, financial institutions are constantly seeking new ways to innovate and enhance security. They also want to deliver personalized customer experiences. However, they face a significant hurdle: navigating fragmented data while adhering to strict compliance and governance requirements. To solve this, Snowflake has introduced Cortex AI for Financial Services, a groundbreaking suite of tools designed to unlock the full potential of AI in the sector.
What is Snowflake Cortex AI for Financial Services?
First and foremost, Snowflake Cortex AI is a comprehensive suite of AI capabilities. It empowers financial organizations to unify their data and securely deploy AI models, applications, and agents. By bringing AI directly to the data, Snowflake eliminates the need to move sensitive information. As a result, security and governance are greatly enhanced. This approach allows institutions to leverage their own proprietary data alongside third-party sources and cutting-edge large language models (LLMs). Ultimately, this helps them automate complex tasks and derive faster, more accurate insights.
Key Capabilities Driving the Transformation
Cortex AI for Financial Services is built on a foundation of powerful features. These are specifically designed to accelerate AI adoption within the financial industry.
Snowflake Data Science Agent: This AI-powered coding assistant automates many time-consuming tasks for data scientists. For instance, it handles data cleaning, feature engineering, and model prototyping. This, in turn, accelerates the development of crucial workflows like risk modeling and fraud detection.
Cortex AISQL: With its AI-powered functions, Cortex AISQL allows users to process and analyze unstructured data at scale. This includes market research, earnings call transcripts, and transaction details. Therefore, it transforms workflows in customer service, investment analytics, and claims processing. A short SQL sketch after this list shows what such functions look like in practice.
Snowflake Intelligence: Furthermore, this feature provides business users with an intuitive, conversational interface. They can query both structured and unstructured data using natural language. This “democratization” of data access means even non-technical users can gain valuable insights without writing complex SQL.
Managed Model Context Protocol (MCP) Server: The MCP Server is a true game-changer. It securely connects proprietary data with third-party data from partners like FactSet and MSCI. In addition, it provides a standardized method for LLMs to integrate with data APIs, which eliminates the need for custom work and speeds up the delivery of AI applications.
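To give a flavor of the AISQL item above, here is a minimal sketch using Snowflake’s Cortex LLM functions; the earnings_call_transcripts table and its columns are hypothetical:
-- Score sentiment, summarize, and extract a specific answer from each transcript
SELECT ticker,
       SNOWFLAKE.CORTEX.SENTIMENT(transcript_text) AS sentiment_score,
       SNOWFLAKE.CORTEX.SUMMARIZE(transcript_text) AS call_summary,
       SNOWFLAKE.CORTEX.EXTRACT_ANSWER(transcript_text,
           'What revenue guidance was given for next quarter?') AS guidance
FROM earnings_call_transcripts
WHERE call_date >= '2025-01-01';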
Use Cases: Putting Cortex AI to Work in Finance
The practical applications of Snowflake Cortex AI in the financial services industry are vast and transformative:
Fraud Detection and Prevention: By training models on historical transaction data, institutions can identify suspicious patterns in real time. Consequently, this proactive approach helps minimize losses and protect customers.
Credit Risk Analysis: Cortex Analyst, a key feature, can analyze vast amounts of transaction data to assess credit risk. By building a semantic model, for example, financial institutions can enable more accurate and nuanced risk assessments.
Algorithmic Trading Support: While not a trading platform itself, Snowflake’s infrastructure supports algorithmic strategies. Specifically, Cortex AI provides powerful tools for data analysis, pattern identification, and model development.
Enhanced Customer Service: Moreover, AI agents powered by Cortex AI can create sophisticated customer support systems. These agents can analyze customer data to provide personalized assistance and automate tasks, leading to improved satisfaction.
Market and Investment Analysis: Cortex AI can also analyze a wide range of data sources, from earnings calls to market news. This provides real-time insights that are crucial for making informed and timely investment decisions.
The Benefits of a Unified AI and Data Strategy
By adopting Snowflake Cortex AI, financial institutions can realize a multitude of benefits:
Enhanced Security and Governance: By bringing AI to the data, sensitive financial information remains within Snowflake’s secure and governed environment.
Faster Innovation: Automating data science tasks allows for the rapid development and deployment of new products.
Democratization of Data: Natural language interfaces empower more users to access and analyze data directly.
Reduced Operational Costs: Finally, the automation of complex tasks leads to significant cost savings and increased efficiency.
Getting Started with Snowflake Cortex AI
For institutions ready to begin their AI journey, the path is clear. The Snowflake Quickstarts offer a wealth of tutorials and guides. These resources provide step-by-step instructions on how to set up the environment, create models, and build powerful applications.
The Future of Finance is Here
In conclusion, Snowflake Cortex AI for Financial Services represents a pivotal moment for the industry. By providing a secure, scalable, and unified platform, Snowflake is empowering financial institutions to seize the opportunities of tomorrow. The ability to seamlessly integrate data with the latest AI technology will undoubtedly be a key differentiator in the competitive landscape of finance.
The 60–80% Problem Killing Data Science Productivity
Data science productivity is being crushed by the 60–80% problem. Despite powerful platforms like Snowflake and cutting-edge ML tools, data scientists still spend the majority of their time—60 to 80 percent—on repetitive tasks like data cleaning, feature engineering, and environment setup. This bottleneck is stalling innovation and delaying insights that drive business value.
A typical ML project timeline looks like this:
Weeks 1-2: Finding datasets, setting up environments, searching for similar projects
Weeks 3-4: Data preprocessing, exploratory analysis, feature engineering
Weeks 5-6: Model selection, hyperparameter tuning, training
Only after this two-month slog do data scientists reach the interesting work: interpreting results and driving business impact.
What if you could compress weeks of foundational ML work into under an hour?
Enter the Snowflake Data Science Agent, announced at Snowflake Summit 2025 on June 3. This agentic AI companion automates routine ML development tasks, transforming how organizations build and deploy machine learning models.
Snowflake Data Science Agent is an autonomous AI assistant that automates the entire ML development lifecycle within the Snowflake environment. Currently in private preview with general availability expected in late 2025, it represents a fundamental shift in how data scientists work.
The Core Innovation
Rather than manually coding each step of an ML pipeline, data scientists describe their objective in natural language. The agent then:
Understands Context: Analyzes available datasets, business requirements, and project goals
Plans Strategy: Breaks down the ML problem into logical, executable steps
Generates Code: Creates production-quality Python code for each pipeline component
Executes Workflows: Runs the pipeline directly in Snowflake Notebooks with full observability
Iterates Intelligently: Refines approaches based on results and user feedback
Powered by Claude AI
The Data Science Agent leverages Anthropic’s Claude large language model, running securely within Snowflake’s perimeter. This integration ensures that proprietary data never leaves the governed Snowflake environment while providing state-of-the-art reasoning capabilities.
How Data Science Agent Transforms ML Workflows
Traditional ML Pipeline vs. Agent-Assisted Pipeline
Traditional Approach (4-8 Weeks):
Manual dataset discovery and access setup (3-5 days)
Exploratory data analysis with custom scripts (5-7 days)
Data preprocessing and quality checks (7-10 days)
Feature engineering experiments (5-7 days)
Model selection and baseline training (3-5 days)
Hyperparameter tuning iterations (7-10 days)
Model evaluation and documentation (5-7 days)
Deployment preparation and handoff (3-5 days)
Agent-Assisted Approach (1-2 Days):
Natural language project description (15 minutes)
Agent generates complete pipeline (30-60 minutes)
Review and customize generated code (2-3 hours)
Execute and evaluate results (1-2 hours)
Iterate with follow-up prompts (30 minutes per iteration)
Production deployment (1-2 hours)
The agent doesn’t eliminate human expertise—it amplifies it. Data scientists focus on problem formulation, result interpretation, and business strategy rather than boilerplate code.
Key Capabilities and Features
1. Automated Data Preparation
The agent handles the most time-consuming aspects of data science:
Data Profiling: Automatically analyzes distributions, identifies missing values, detects outliers, and assesses data quality
Smart Preprocessing: Generates appropriate transformations based on data characteristics—normalization, encoding, imputation, scaling
Feature Engineering: Creates relevant features using domain knowledge embedded in the model, including polynomial features, interaction terms, and temporal aggregations
Data Validation: Implements checks to ensure data quality throughout the pipeline
2. Intelligent Model Selection
Rather than manually testing dozens of algorithms, the agent:
Evaluates Problem Type: Classifies tasks as regression, classification, clustering, or time series
Considers Constraints: Factors in dataset size, feature types, and performance requirements
Recommends Algorithms: Suggests appropriate models with justification for each recommendation
Implements Ensemble Methods: Combines multiple models when beneficial for accuracy
3. Automated Hyperparameter Tuning
The agent configures and executes optimization strategies:
Grid Search: Systematic exploration of parameter spaces for small parameter sets
Random Search: Efficient sampling for high-dimensional parameter spaces
Bayesian Optimization: Intelligent search using previous results to guide exploration
Early Stopping: Prevents overfitting and saves compute resources
4. Production-Ready Code Generation
Generated pipelines aren’t just prototypes—they’re production-quality:
Modular Architecture: Clean, reusable functions with clear separation of concerns
Error Handling: Robust exception handling and logging
Documentation: Inline comments and docstrings explaining logic
Version Control Ready: Structured for Git workflows and collaboration
Snowflake Native: Optimized for Snowflake’s distributed computing environment
5. Explainability and Transparency
Understanding model behavior is crucial for trust and compliance:
Feature Importance: Identifies which variables drive predictions
SHAP Values: Explains individual predictions with Shapley values
Model Diagnostics: Generates confusion matrices, ROC curves, and performance metrics
Audit Trails: Logs all decisions, code changes, and model versions
Real-World Use Cases
Financial Services: Fraud Detection
Challenge: A bank needs to detect fraudulent transactions in real-time with minimal false positives.
Traditional Approach: Data science team spends 6 weeks building and tuning models, requiring deep SQL expertise, feature engineering knowledge, and model optimization skills.
With Data Science Agent:
Prompt: “Build a fraud detection model using transaction history, customer profiles, and merchant data. Optimize for 99% precision while maintaining 85% recall.”
Result: Agent generates a complete pipeline with ensemble methods, class imbalance handling, and real-time scoring infrastructure in under 2 hours
Impact: Model deployed 95% faster, freeing the team to work on sophisticated fraud pattern analysis
Healthcare: Patient Risk Stratification
Challenge: A hospital system wants to identify high-risk patients for proactive intervention.
Traditional Approach: Clinical data analysts spend 8 weeks wrangling EHR data, building features from medical histories, and validating models against clinical outcomes.
With Data Science Agent:
Prompt: “Create a patient risk stratification model using diagnoses, medications, lab results, and demographics. Focus on interpretability for clinical adoption.”
Result: Agent produces an explainable model with clinically meaningful features, SHAP explanations for each prediction, and validation against established risk scores
Impact: Clinicians trust the model due to transparency, leading to 40% adoption rate vs. typical 15%
Retail: Customer Lifetime Value Prediction
Challenge: An e-commerce company needs to predict customer lifetime value to optimize marketing spend.
Traditional Approach: Marketing analytics team collaborates with data scientists for 5 weeks, iterating on feature definitions and model approaches.
With Data Science Agent:
Prompt: “Predict 12-month customer lifetime value using purchase history, browsing behavior, and demographic data. Segment customers into high/medium/low value tiers.”
Result: Agent delivers a complete CLV model with customer segmentation, propensity scores, and a dashboard for marketing teams
Impact: Marketing ROI improves 32% through better targeting, model built 90% faster
Manufacturing: Predictive Maintenance
Challenge: A manufacturer wants to predict equipment failures before they occur to minimize downtime.
Traditional Approach: Engineers and data scientists spend 7 weeks analyzing sensor data, building time-series features, and testing various forecasting approaches.
With Data Science Agent:
Prompt: “Build a predictive maintenance model using sensor telemetry, maintenance logs, and operational data. Predict failures 24-48 hours in advance.”
Result: Agent creates a time-series model with automated feature extraction from streaming data, anomaly detection, and failure prediction
Impact: Unplanned downtime reduced by 28%, maintenance costs decreased by 19%
Technical Architecture
Integration with Snowflake Ecosystem
The Data Science Agent operates natively within Snowflake’s architecture:
Snowflake Notebooks: All code generation and execution happens in collaborative notebooks
Snowpark Python: Leverages Snowflake’s Python runtime for distributed computing
Data Governance: Inherits existing row-level security, masking, and access controls
Cortex AI Suite: Integrates with Cortex Analyst, Search, and AISQL for comprehensive AI capabilities
ML Jobs: Automates model training, scheduling, and monitoring at scale
How It Works: Behind the Scenes
When a data scientist provides a natural language prompt:
🔍Step 1: Understanding
Claude analyzes the request, identifying ML task type, success metrics, and constraints
Agent queries Snowflake’s catalog to discover relevant tables and understand schema
🧠Step 2: Planning
Generates a step-by-step execution plan covering data prep, modeling, and evaluation
Identifies required Snowflake features and libraries
💻 Step 3: Code Generation
Creates executable Python code for each pipeline stage
Includes data validation, error handling, and logging
🚀 Step 4: Execution
Runs generated code in Snowflake Notebooks with full visibility
Provides real-time progress updates and intermediate results
📊 Step 5: Evaluation
Generates comprehensive model diagnostics and performance metrics
Recommends next steps based on results
🔁 Step 6: Iteration
Accepts follow-up prompts to refine the model
Tracks changes and maintains version history
Best Practices for Using Data Science Agent
1. Write Clear, Specific Prompts
Poor Prompt: “Build a model for sales”
Good Prompt: “Create a weekly sales forecasting model for retail stores using historical sales, promotions, weather, and holidays. Optimize for MAPE under 10%. Include confidence intervals.”
The more context you provide, the better the agent performs.
2. Start with Business Context
Begin prompts with the business problem and success criteria:
What decision will this model inform?
What accuracy is acceptable?
What are the cost/benefit tradeoffs?
Are there regulatory requirements?
3. Iterate Incrementally
Don’t expect perfection on the first generation. Use follow-up prompts:
“Add feature importance analysis”
“Try a gradient boosting approach”
“Optimize for faster inference time”
“Add cross-validation with 5 folds”
4. Review Generated Code
While the agent produces high-quality code, always review:
Data preprocessing logic for business rule compliance
Feature engineering for domain appropriateness
Model selection justification
Performance metrics alignment with business goals
5. Establish Governance Guardrails
Define organizational standards:
Required documentation templates
Mandatory model validation steps
Approved algorithm lists for regulated industries
Data privacy and security checks
6. Combine Agent Automation with Human Expertise
Use the agent for:
Rapid prototyping and baseline models
Automated preprocessing and feature engineering
Hyperparameter tuning and model selection
Code documentation and testing
Retain human control for:
Problem formulation and success criteria
Business logic validation
Ethical considerations and bias assessment
Strategic decision-making on model deployment
Measuring ROI: The Business Impact
Organizations adopting Data Science Agent report significant benefits:
Time-to-Production Acceleration
Before Agent: Average 8-12 weeks from concept to production model
With Agent: Average 1-2 weeks from concept to production model
Impact: 5-10x faster model development cycles
Productivity Multiplication
Before Agent: 2-3 models per data scientist per quarter
With Agent: 8-12 models per data scientist per quarter
Impact: 4x increase in model output, enabling more AI use cases
Quality Improvements
Before Agent: 40-60% of models reach production (many abandoned due to insufficient ROI)
With Agent: 70-85% of models reach production (faster iteration enables more refinement)
Impact: Higher model quality through rapid experimentation
Cost Optimization
Before Agent: $150K-200K average cost per model (personnel time, infrastructure)
With Agent: $40K-60K average cost per model
Impact: 70% reduction in model development costs
Democratization of ML
Before Agent: Only senior data scientists can build production models
With Agent: Junior analysts and citizen data scientists can create sophisticated models
Impact: 3-5x expansion of AI capability across organization
Limitations and Considerations
While powerful, Data Science Agent has important constraints:
Current Limitations
Preview Status: Still in private preview; features and capabilities evolving
Scope Boundaries: Optimized for structured data ML; deep learning and computer vision require different approaches
Domain Knowledge: Agent lacks specific industry expertise; users must validate business logic
Complex Custom Logic: Highly specialized algorithms may require manual implementation
Important Considerations
Data Quality Dependency: Agent’s output quality directly correlates with input data quality—garbage in, garbage out still applies
Computational Costs: Automated hyperparameter tuning can consume significant compute resources
Over-Reliance Risk: Organizations must maintain ML expertise; agents augment, not replace, human judgment
Regulatory Compliance: In highly regulated industries, additional human review and validation required
Bias and Fairness: Automated feature engineering may perpetuate existing biases; fairness testing essential
The Future of Data Science Agent
Based on Snowflake’s roadmap and industry trends, expect these developments:
Short-Term (2025-2026)
General Availability: Broader access as private preview graduates to GA
Expanded Model Types: Support for time series, recommendation systems, and anomaly detection
AutoML Enhancements: More sophisticated algorithm selection and ensemble methods
Deeper Integration: Tighter coupling with Snowflake ML Jobs and model registry
Medium-Term (2026-2027)
Multi-Modal Learning: Support for unstructured data (images, text, audio) alongside structured data
Federated Learning: Distributed model training across data clean rooms
Automated Monitoring: Self-healing models that detect drift and retrain automatically
Natural Language Insights: Plain English explanations of model behavior for business users
Long-Term Vision (2027+)
Autonomous ML Operations: End-to-end model lifecycle management with minimal human intervention
Cross-Domain Transfer Learning: Agents that leverage learnings across industries and use cases
Collaborative Multi-Agent Systems: Specialized agents working together on complex problems
Causal ML Integration: Moving beyond correlation to causal inference and counterfactual analysis
Getting Started with Data Science Agent
Prerequisites
To leverage Data Science Agent, you need:
Snowflake Account: Enterprise edition or higher with Cortex AI enabled
Data Foundation: Structured data in Snowflake tables or views
Clear Use Case: Well-defined business problem with success metrics
User Permissions: Access to Snowflake Notebooks and Cortex features
Request Access
Data Science Agent is currently in private preview:
Contact your Snowflake account team to express interest
Complete the preview application with use case details
Participate in onboarding and training sessions
Join the preview community for best practices sharing
Pilot Project Selection
Choose an initial use case with these characteristics:
High Business Value: Clear ROI and stakeholder interest
Data Availability: Clean, accessible data in Snowflake
Reasonable Complexity: Not trivial, but not your most difficult problem
Failure Tolerance: Low risk if the model needs iteration
Measurement Clarity: Easy to quantify success
Success Metrics
Track these KPIs to measure Data Science Agent impact:
Time from concept to production model
Number of models per data scientist per quarter
Percentage of models reaching production
Model performance metrics vs. baseline
Cost per model developed
Data scientist satisfaction scores
Snowflake Data Science Agent vs. Competitors
How It Compares
Databricks AutoML:
Advantage: Tighter integration with governed data, no data movement
Trade-off: Databricks offers more deep learning capabilities
Google Cloud AutoML:
Advantage: Runs on your data warehouse, no egress costs
Trade-off: Google has broader pre-trained model library
H2O:
Trade-off: H2O specializes in AutoML with more tuning options
Why Choose Snowflake Data Science Agent?
Data Gravity: Build ML where your data lives—no movement, no copies, no security risks
Unified Platform: Single environment for data engineering, analytics, and ML
Enterprise Governance: Leverage existing security, compliance, and access controls
Ecosystem Integration: Works seamlessly with BI tools, notebooks, and applications
Scalability: Automatic compute scaling without infrastructure management
Conclusion: The Data Science Revolution Begins Now
The Snowflake Data Science Agent represents more than a productivity tool—it’s a fundamental reimagining of how organizations build machine learning solutions. By automating the 60-80% of work that consumes data scientists’ time, it unleashes their potential to solve harder problems, explore more use cases, and deliver greater business impact.
The transformation is already beginning. Organizations in the private preview report 5-10x faster model development, 4x increases in productivity, and democratization of ML capabilities across their teams. As Data Science Agent reaches general availability in late 2025, these benefits will scale across the entire Snowflake ecosystem.
The question isn’t whether to adopt AI-assisted data science—it’s how quickly you can implement it to stay competitive.
For data leaders, the opportunity is clear: accelerate AI initiatives, multiply data science team output, and tackle the backlog of use cases that were previously too expensive or time-consuming to address.
For data scientists, the promise is equally compelling: spend less time on repetitive tasks and more time on creative problem-solving, strategic thinking, and high-impact analysis.
The future of data science is agentic. The future of data science is here.
Key Takeaways
Snowflake Data Science Agent automates 60-80% of routine ML development work
Introduction: The Dawn of Context-Aware AI in Enterprise Data
Enterprise AI is experiencing a fundamental shift in October 2025. Organizations are no longer satisfied with isolated AI tools that operate in silos. Instead, they’re demanding intelligent systems that understand context, access governed data securely, and orchestrate complex workflows across multiple platforms.
Enter the Snowflake MCP Server—a groundbreaking managed service announced on October 2, 2025, that bridges the gap between AI agents and enterprise data ecosystems. By implementing the Model Context Protocol (MCP), Snowflake has created a standardized pathway for AI agents to interact with both proprietary company data and premium third-party datasets, all while maintaining enterprise-grade security and governance.
This comprehensive guide explores how the Snowflake MCP Server is reshaping enterprise AI, what makes it different from traditional integrations, and how organizations can leverage this technology to build next-generation intelligent applications.
What is the Model Context Protocol (MCP)?
Before diving into Snowflake’s implementation, it’s essential to understand the Model Context Protocol itself.
The Problem MCP Solves
Historically, connecting AI agents to enterprise systems has been a fragmented nightmare. Each integration required custom development work, creating a web of point-to-point connections that were difficult to maintain, scale, and secure. Data teams spent weeks building bespoke integrations instead of focusing on innovation.
The Model Context Protocol emerged as an industry solution to this chaos. Developed by Anthropic and rapidly adopted across the AI ecosystem, MCP provides a standardized interface for AI agents to connect with data sources, APIs, and services.
Think of MCP as a universal adapter for AI agents—similar to how USB-C standardized device connections, MCP standardizes how AI systems interact with enterprise data platforms.
Key Benefits of MCP
Interoperability: AI agents from different vendors can access the same data sources using a common protocol
Security: Centralized governance and access controls rather than scattered custom integrations
Speed to Market: Reduces integration time from weeks to hours
Vendor Flexibility: Organizations aren’t locked into proprietary ecosystems
Snowflake MCP Server: Architecture and Core Components
The Snowflake MCP Server represents a fully managed service that acts as a bridge between external AI agents and the Snowflake AI Data Cloud. Currently in public preview, it offers a sophisticated yet streamlined approach to agentic AI implementation.
How the Architecture Works
At its core, the Snowflake MCP Server connects three critical layers:
Layer 1: External AI Agents and Platforms The server integrates with leading AI platforms including Anthropic Claude, Salesforce Agentforce, Cursor, CrewAI, Devin by Cognition, UiPath, Windsurf, Amazon Bedrock AgentCore, and more. This broad compatibility ensures organizations can use their preferred AI tools without vendor lock-in.
Layer 2: Snowflake Cortex AI Services Within Snowflake, the MCP Server provides access to powerful Cortex capabilities:
Cortex Analyst for querying structured data using semantic models
Cortex Search for retrieving insights from unstructured documents
Cortex AISQL for AI-powered extraction and transcription
Data Science Agent for automated ML workflows
Layer 3: Data Sources This includes both proprietary organizational data stored in Snowflake and premium third-party datasets from partners like MSCI, Nasdaq eVestment, FactSet, The Associated Press, CB Insights, and Deutsche Börse.
The Managed Service Advantage
Unlike traditional integrations that require infrastructure deployment and ongoing maintenance, the Snowflake MCP Server operates as a fully managed service. Organizations configure access through YAML files, define security policies, and the Snowflake platform handles all the operational complexity—from scaling to security patches.
Cortex AI for Financial Services: The First Industry-Specific Implementation
Snowflake launched the MCP Server alongside Cortex AI for Financial Services, demonstrating the practical power of this architecture with industry-specific capabilities.
Why Financial Services First?
The financial services industry faces unique challenges that make it an ideal testing ground for agentic AI:
Data Fragmentation: Financial institutions operate with data scattered across trading systems, risk platforms, customer databases, and market data providers
Regulatory Requirements: Strict compliance and audit requirements demand transparent, governed data access
Real-Time Decisioning: Investment decisions, fraud detection, and customer service require instant access to both structured and unstructured data
Third-Party Dependencies: Financial analysis requires combining proprietary data with market research, news feeds, and regulatory filings
Key Use Cases Enabled
Investment Analytics: AI agents can analyze portfolio performance by combining internal holdings data from Snowflake with real-time market data from Nasdaq, research reports from FactSet, and breaking news from The Associated Press—all through natural language queries.
Claims Management: Insurance companies can process claims by having AI agents retrieve policy documents (unstructured), claims history (structured), and fraud pattern analysis—orchestrating across Cortex Search and Cortex Analyst automatically.
Customer Service: Financial advisors can query “What’s the risk profile of client portfolios exposed to European tech stocks?” and receive comprehensive answers that pull from multiple data sources, with full audit trails maintained.
Regulatory Compliance: Compliance teams can ask questions about exposure limits, trading patterns, or risk concentrations, and AI agents will navigate the appropriate data sources while respecting role-based access controls.
Technical Deep Dive: How to Implement Snowflake MCP Server
For data engineers and architects planning implementations, understanding the technical setup is crucial.
Configuration Basics
The Snowflake MCP Server uses YAML configuration files to define available services and access controls. Here’s what a typical configuration includes:
Service Definitions: Specify which Cortex Analyst semantic models, Cortex Search services, and other tools should be exposed to AI agents
Security Policies: Define SQL statement permissions to control what operations agents can perform (SELECT, INSERT, UPDATE, etc.)
Connection Parameters: Configure authentication methods including OAuth, personal access tokens, or service accounts
Tool Descriptions: Provide clear, descriptive text for each exposed service to help AI agents select the appropriate tool for each task
Integration with AI Platforms
Connecting external platforms to the Snowflake MCP Server follows a standardized pattern:
For platforms like Claude Desktop or Cursor, developers add the Snowflake MCP Server to their configuration file, specifying the connection details and authentication credentials. The MCP client then automatically discovers available tools and makes them accessible to the AI agent.
For custom applications using frameworks like CrewAI or LangChain, developers leverage MCP client libraries to establish connections programmatically, enabling sophisticated multi-agent workflows.
Security and Governance
One of the most compelling aspects of the Snowflake MCP Server is that it maintains all existing Snowflake security controls:
Data Never Leaves Snowflake: Unlike traditional API integrations that extract data, processing happens within Snowflake’s secure perimeter
Audit Logging: All agent interactions are logged for compliance and monitoring
Role-Based Access: Agents operate under defined Snowflake roles with specific privileges
Agentic AI Workflows: From Theory to Practice
Understanding agentic AI workflows is essential to appreciating the Snowflake MCP Server’s value proposition.
What Makes AI “Agentic”?
Traditional AI systems respond to single prompts with single responses. Agentic AI systems, by contrast, can:
Plan Multi-Step Tasks: Break complex requests into sequential subtasks
Use Tools Dynamically: Select and invoke appropriate tools based on the task at hand
Reflect and Iterate: Evaluate results and adjust their approach
Maintain Context: Remember previous interactions within a session
How Snowflake Enables Agentic Workflows
The Snowflake MCP Server enables true agentic behavior through Cortex Agents, which orchestrate across both structured and unstructured data sources.
Example Workflow: Market Analysis Query
When a user asks, “How has our semiconductor portfolio performed compared to industry trends this quarter, and what are analysts saying about the sector?”
The agent plans a multi-step approach:
Query Cortex Analyst to retrieve portfolio holdings and performance metrics (structured data)
Search Cortex Search for analyst reports and news articles about semiconductors (unstructured data)
Cross-reference findings with third-party market data from partners like MSCI
Synthesize a comprehensive response with citations
Each step respects data governance policies, and the entire workflow completes within seconds, a task that would traditionally take analysts hours or days to finish.
Open Semantic Interchange: The Missing Piece of the AI Puzzle
While the Snowflake MCP Server solves the connection problem, the Open Semantic Interchange (OSI) initiative addresses an equally critical challenge: semantic consistency.
The Semantic Fragmentation Problem
Enterprise organizations typically define the same business metrics differently across systems. “Revenue” might include different line items in the data warehouse versus the BI tool versus the AI model. This semantic fragmentation undermines trust in AI insights and creates the “$1 trillion AI problem”—the massive cost of data preparation and reconciliation.
How OSI Complements MCP
Announced on September 23, 2025, alongside the MCP Server development, OSI is an open-source initiative led by Snowflake, Salesforce, BlackRock, and dbt Labs. It creates a vendor-neutral specification for semantic metadata—essentially a universal language for business concepts.
When combined with MCP, OSI ensures that AI agents not only can access data (via MCP) but also understand what that data means (via OSI). A query about “quarterly revenue” will use the same definition whether the agent is accessing Snowflake, Tableau, or a custom ML model.
Industry Impact: Who Benefits from Snowflake MCP Server?
While initially focused on financial services, the Snowflake MCP Server has broad applicability across industries.
Healthcare and Life Sciences
Clinical Research: Combine patient data (structured EHR) with medical literature (unstructured documents) for drug discovery
Population Health: Analyze claims data alongside social determinants of health from third-party sources
Regulatory Submissions: AI agents can compile submission packages by accessing clinical trial data, adverse event reports, and regulatory guidance documents
Retail and E-Commerce
Customer Intelligence: Merge transaction data with customer service transcripts and social media sentiment
Supply Chain Optimization: Agents can analyze inventory levels, supplier performance data, and market demand signals from external sources
Personalization: Create hyper-personalized shopping experiences by combining browsing behavior, purchase history, and trend data
Manufacturing
Predictive Maintenance: Combine sensor data from IoT devices with maintenance logs and parts inventory
Quality Control: Analyze production metrics alongside inspection reports and supplier certifications
Supply Chain Resilience: Monitor supplier health by combining internal order data with external financial and news data
Implementation Best Practices
For organizations planning to implement the Snowflake MCP Server, following best practices ensures success.
Start with Clear Use Cases
Begin with specific, high-value use cases rather than attempting a broad rollout. Identify workflows where combining structured and unstructured data creates measurable business value.
Invest in Semantic Modeling
The quality of Cortex Analyst responses depends heavily on well-defined semantic models. Invest time in creating comprehensive semantic layers using tools like dbt or directly in Snowflake.
Establish Governance Early
Define clear policies about which data sources agents can access, what operations they can perform, and how results should be logged and audited.
Design for Explainability
Configure agents to provide citations and reasoning for their responses. This transparency builds user trust and satisfies regulatory requirements.
Monitor and Iterate
Implement monitoring to track agent performance, query patterns, and user satisfaction. Use these insights to refine configurations and expand capabilities.
Challenges and Considerations
While powerful, the Snowflake MCP Server introduces considerations that organizations must address.
Cost Management
AI agent queries can consume significant compute resources, especially when orchestrating across multiple data sources. Implement query optimization, caching strategies, and resource monitoring to control costs.
Data Quality Dependencies
Agents are only as good as the data they access. Poor data quality, incomplete semantic models, or inconsistent definitions will produce unreliable results.
Skills Gap
Successfully implementing agentic AI requires skills in data engineering, AI/ML, and domain expertise. Organizations may need to invest in training or hire specialized talent.
Privacy and Compliance
While Snowflake provides robust security controls, organizations must ensure that agent behaviors comply with privacy regulations like GDPR, especially when combining internal and external data sources.
The Future of Snowflake MCP Server
Based on current trends and Snowflake’s product roadmap announcements, several developments are likely:
Expanded Industry Packs
Following financial services, expect industry-specific Cortex AI suites for healthcare, retail, manufacturing, and public sector with pre-configured connectors and semantic models.
Enhanced Multi-Agent Orchestration
Future versions will likely support more sophisticated agent crews that can collaborate on complex tasks, with specialized agents for different domains working together.
Deeper OSI Integration
As the Open Semantic Interchange standard matures, expect tighter integration that makes semantic consistency automatic rather than requiring manual configuration.
Real-Time Streaming
Current implementations focus on batch and interactive queries. Future versions may incorporate streaming data sources for real-time agent responses.
Conclusion: Embracing the Agentic AI Revolution
The Snowflake MCP Server represents a pivotal moment in enterprise AI evolution. By standardizing how AI agents access data through the Model Context Protocol, Snowflake has removed one of the primary barriers to agentic AI adoption—integration complexity.
Combined with powerful Cortex AI capabilities and participation in the Open Semantic Interchange initiative, Snowflake is positioning itself at the center of the agentic AI ecosystem. Organizations that embrace this architecture now will gain significant competitive advantages in speed, flexibility, and AI-driven decision-making.
The question is no longer whether to adopt agentic AI, but how quickly you can implement it effectively. With the Snowflake MCP Server now in public preview, the opportunity to lead in your industry is here.
Ready to get started? Explore the Snowflake MCP Server documentation, identify your highest-value use cases, and join the growing community of organizations building the future of intelligent, context-aware enterprise applications.
Key Takeaways
The Snowflake MCP Server launched October 2, 2025, as a managed service connecting AI agents to enterprise data
Model Context Protocol provides a standardized interface for agentic AI integrations
Cortex AI for Financial Services demonstrates practical applications with industry-specific capabilities
Organizations can connect platforms like Anthropic Claude, Salesforce Agentforce, and Cursor to Snowflake data
The Open Semantic Interchange initiative ensures AI agents understand data semantics consistently
Security and governance controls remain intact with all processing happening within Snowflake
Early adoption provides competitive advantages in AI-driven decision-making
Snowflake is renowned for its incredible performance, but as data scales into terabytes and petabytes, no platform is immune to a slow-running query. For a data engineer, mastering Snowflake query optimization is the difference between building an efficient, cost-effective data platform and one that burns through credits and frustrates users.
In 2025, the principles of query optimization remain the same, but the tools and techniques have matured. It’s no longer just about warehouse size; it’s about understanding the query execution plan, leveraging micro-partitions, and writing smarter SQL.
This guide will walk you through the essential strategies and best practices for Snowflake query optimization, moving from the foundational tools to advanced, real-world techniques.
The Golden Rule: Always Start with the Query Profile
Before you change a single line of code, your first and most important step is to analyze the Query Profile. This tool is your window into how Snowflake’s cloud services layer executes your query.
To access it, go to the History tab in the Snowsight UI, find your query, and click on the Query ID.
What to look for:
Operator Time: Which steps in the execution plan are taking the most time? Is it a TableScan, a Join, or a Sort operation?
Partition Pruning: How many partitions is the TableScan reading versus the total partitions in the table? If it’s scanning a high percentage, your pruning is ineffective.
Spilling: Is Snowflake spilling data to local or remote storage? Spilling to remote storage is a major performance killer.
For a detailed walkthrough, check out Snowflake’s official documentation on Using Query Profile.
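If you prefer to pull the same operator-level statistics programmatically, the GET_QUERY_OPERATOR_STATS table function is a useful complement to the UI. A minimal sketch, run right after the query you want to inspect:
-- Operator-level statistics for the most recent query in this session
SELECT
    operator_id,
    operator_type,                -- e.g. TableScan, Join, Sort
    operator_statistics,          -- row counts, pruning and spilling details
    execution_time_breakdown      -- where the time was actually spent
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))
ORDER BY operator_id;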
1. Right-Size Your Warehouse (But It’s Not a Silver Bullet)
It’s tempting to throw a larger warehouse at a slow query, but this isn’t always the right or most cost-effective answer.
Scale Up (Increase Size): Do this when a single, complex query is slow due to heavy computations like large joins, sorts, or aggregations. A larger warehouse provides more memory and local SSD, which can prevent spilling.
Scale Out (Add Clusters): Use a multi-cluster warehouse for high-concurrency scenarios (e.g., a BI dashboard). This won’t make a single query faster, but it will handle more queries at once.
Best Practice: Don’t use a larger warehouse to compensate for a poorly written query. Always try to optimize the query and data structure first.
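A rough sketch of the two scaling options (the warehouse name is illustrative):
-- Scale UP: more memory and local SSD for a single heavy query
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale OUT: a multi-cluster warehouse absorbs concurrency spikes
-- (multi-cluster warehouses require Enterprise Edition or higher)
ALTER WAREHOUSE analytics_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';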
2. Master Clustering for Effective Pruning
This is the most impactful technique for optimizing queries on large tables. A Clustering Key reorganizes your data into co-located micro-partitions based on the key you define. This allows Snowflake to prune (ignore) massive amounts of data that aren’t relevant to your query’s WHERE clause.
When to Cluster:
On tables larger than a terabyte.
When you frequently filter or join on a high-cardinality column (e.g., user_id, event_timestamp).
Example:
-- A query on an un-clustered 100TB table
SELECT * FROM event_logs WHERE user_id = 'a1b2-c3d4';
-- This might scan 50% of the table's partitions.
-- Cluster the table
ALTER TABLE event_logs CLUSTER BY (user_id);
-- After reclustering, the same query might scan less than 0.01% of partitions.
SELECT * FROM event_logs WHERE user_id = 'a1b2-c3d4';
The Credit Karma engineering blog has an excellent real-world case study on how they used clustering to dramatically reduce costs.
3. Avoid Spilling at All Costs
Spilling occurs when Snowflake runs out of memory for an operation and has to write intermediate results to storage. Spilling to local SSD is okay, but spilling to remote storage is a performance disaster.
How to Fix It:
Increase Warehouse Size: This is the most direct solution, as it provides more memory.
Optimize the Query: Reduce the data being processed. Filter early in your query (using WHERE clauses) and select only the columns you need.
Use APPROX_COUNT_DISTINCT: For large-scale distinct counts, using an approximation function can be orders of magnitude faster and use less memory than COUNT(DISTINCT ...).
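For example, on a large event table (names are illustrative), the approximate version avoids building a huge hash table in memory:
-- Exact distinct count: memory-hungry and prone to spilling at scale
SELECT COUNT(DISTINCT user_id) AS exact_users FROM event_logs;

-- Approximate distinct count (HyperLogLog): small relative error,
-- far less memory, and typically much faster on very large tables
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users FROM event_logs;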
4. Write Smarter SQL
Sometimes, the best optimization is simply writing a better query.
Filter Early, Join Late: Apply your WHERE clause filters to your largest tables before you join them. You can do this with a subquery or a Common Table Expression (CTE), as sketched after this list.
Reduce Data Movement: In joins, join on columns with the same data type. If you join a STRING to a NUMBER, Snowflake has to cast the data on the fly, which can slow things down.
Leverage Specific Functions: Use optimized functions like MATCH_RECOGNIZE for sequential pattern matching or FLATTEN for parsing semi-structured data instead of writing complex, self-joining SQL.
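A small sketch of the filter-early pattern with a CTE (table and column names are illustrative):
-- Filter the large fact table first so the join processes far fewer rows
WITH recent_events AS (
    SELECT user_id, event_type, event_timestamp
    FROM event_logs
    WHERE event_timestamp >= DATEADD(day, -7, CURRENT_TIMESTAMP)
)
SELECT
    u.user_name,
    e.event_type,
    e.event_timestamp
FROM recent_events e
JOIN users u
  ON u.user_id = e.user_id;   -- both sides share the same data type, so no casting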
5. Use Materialized Views for Repetitive Queries
If you have a complex query that runs frequently on data that doesn’t change often, a Materialized View can be a game-changer. It pre-computes and stores the query result, so subsequent queries are just reading the stored results, which is incredibly fast.
Best For:
BI dashboards that query complex aggregations.
Queries on large, slowly changing datasets.
The Snowflake Community offers great primers and discussions on when to best use this feature.
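A minimal sketch, assuming a large sales table that dashboards aggregate repeatedly (note that materialized views require Enterprise Edition and can reference only a single table):
-- Pre-compute the aggregation; Snowflake keeps the result up to date automatically
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT
    sale_date,
    category,
    SUM(sale_amount) AS total_sales,
    COUNT(order_id)  AS order_count
FROM sales
GROUP BY sale_date, category;

-- Dashboards now read the stored result instead of re-aggregating the base table
SELECT * FROM daily_sales_mv WHERE sale_date >= '2025-01-01';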
Conclusion
Snowflake query optimization in 2025 is a multi-faceted discipline. It starts with a deep analysis of the Query Profile and extends to making intelligent choices about your warehouse sizing, data clustering, and SQL patterns. By moving beyond brute force and adopting these strategic techniques, you can build a Snowflake environment that is not only lightning-fast but also highly cost-effective.
Snowflake MERGE statements are powerful tools for upserting data, but poor optimization can lead to massive performance bottlenecks. If your MERGE queries are taking hours instead of minutes, you’re not alone. In this comprehensive guide, we’ll explore five advanced techniques to optimize Snowflake MERGE queries and achieve up to 10x performance improvements.
Before diving into optimization techniques, it’s crucial to understand why MERGE queries often become performance bottlenecks. Snowflake’s MERGE operation combines INSERT, UPDATE, and DELETE logic into a single statement, which involves scanning both source and target tables, matching records, and applying changes.
The primary performance challenges include:
Full table scans on large target tables
Inefficient join conditions between source and target
Poor micro-partition pruning
Lack of proper clustering on merge keys
Excessive data movement across compute nodes
Technique 1: Leverage Clustering Keys for MERGE Operations
Clustering keys are Snowflake’s secret weapon for optimizing MERGE queries. By defining clustering keys on your merge columns, you enable aggressive micro-partition pruning, dramatically reducing the data scanned during operations.
Implementation Strategy
-- Define clustering key on the primary merge column
ALTER TABLE target_table
CLUSTER BY (customer_id, transaction_date);
-- Verify clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('target_table',
'(customer_id, transaction_date)');
Clustering keys work by organizing data within micro-partitions based on specified columns. When Snowflake processes a MERGE query, it uses clustering metadata to skip entire micro-partitions that don’t contain matching keys. You can learn more about clustering keys in the official Snowflake documentation.
Best Practices for Clustering
Choose high-cardinality columns that appear in MERGE JOIN conditions
Limit clustering keys to 3-4 columns maximum for optimal performance
Monitor clustering depth regularly using SYSTEM$CLUSTERING_DEPTH
Consider reclustering if depth exceeds 4-5 levels
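For example, clustering depth for the merge key can be checked like this (table and column names are illustrative):
-- Average depth of overlapping micro-partitions for the clustering key;
-- a steadily increasing value suggests the table would benefit from reclustering
SELECT SYSTEM$CLUSTERING_DEPTH('target_table', '(customer_id, transaction_date)');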
Pro Tip: Clustering incurs automatic maintenance costs. Use it strategically on tables with frequent MERGE operations and clear access patterns.
Technique 2: Optimize MERGE Predicates with Selective Filtering
One of the most effective ways to optimize Snowflake MERGE performance is by adding selective predicates that reduce the data set before the merge operation begins. This technique, called predicate pushdown optimization, allows Snowflake to prune unnecessary data early in query execution.
Basic vs Optimized MERGE
-- UNOPTIMIZED: Scans entire target table
MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN INSERT (id, status) VALUES (s.id, s.status);
-- OPTIMIZED: Adds selective predicates
MERGE INTO target_table t
USING (
SELECT * FROM source_table
WHERE update_date >= CURRENT_DATE - 7
) s
ON t.id = s.id
AND t.region = s.region
AND t.update_date >= CURRENT_DATE - 7
WHEN MATCHED THEN UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN INSERT (id, status, region) VALUES (s.id, s.status, s.region);
The optimized version adds three critical improvements: it filters the source data to only recent records, adds partition-aligned predicates (the region column), and applies a matching filter to the target table.
Predicate Selection Guidelines
| Predicate Type  | Performance Impact | Use Case                                         |
|-----------------|--------------------|--------------------------------------------------|
| Date Range      | High               | Incremental loads with time-based partitioning   |
| Partition Key   | Very High          | Multi-tenant or geographically distributed data  |
| Status Flag     | Medium             | Processing only changed or active records        |
| Existence Check | High               | Skipping already processed data                  |
Technique 3: Exploit Micro-Partition Pruning
Snowflake stores data in immutable micro-partitions (each holding roughly 50–500 MB of uncompressed data). Understanding how to leverage micro-partition metadata is essential for MERGE optimization.
Micro-Partition Pruning Strategies
Snowflake maintains metadata for each micro-partition including min/max values, distinct counts, and null counts for all columns. By structuring your MERGE conditions to align with this metadata, you enable aggressive pruning.
-- Check table storage metrics (TABLE_STORAGE_METRICS is a view, not a table function)
SELECT table_name, active_bytes, time_travel_bytes, failsafe_bytes
FROM INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
WHERE table_name = 'TARGET_TABLE'
  AND active_bytes > 0;
-- Optimized MERGE with partition-aligned predicates
MERGE INTO sales_fact t
USING (
SELECT
transaction_id,
customer_id,
sale_date,
amount
FROM staging_sales
WHERE sale_date BETWEEN '2025-01-01' AND '2025-01-31'
AND customer_id IS NOT NULL
) s
ON t.transaction_id = s.transaction_id
AND t.sale_date = s.sale_date
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.transaction_id, s.customer_id, s.sale_date, s.amount);
Maximizing Pruning Efficiency
Always include clustering key columns in MERGE ON conditions
Use equality predicates when possible (more effective than ranges)
Avoid function transformations on join columns (prevents metadata usage)
Leverage Snowflake’s automatic clustering for large tables
Warning: Using functions like UPPER(), TRIM(), or CAST() on merge key columns disables micro-partition pruning. Apply transformations in the source subquery instead.
Technique 4: Implement Incremental MERGE Patterns
Rather than processing entire tables, implement incremental MERGE patterns that only handle changed data. This approach combines multiple optimization techniques for maximum performance.
Change Data Capture (CDC) MERGE Pattern
-- Step 1: Create a stream on the staging table
-- (the METADATA$ columns below are only exposed when querying a stream)
CREATE OR REPLACE STREAM staging_changes ON TABLE staging_table;

-- Step 2: Create a change tracking view over the stream
CREATE OR REPLACE VIEW recent_changes AS
SELECT
    s.*,
    METADATA$ACTION   AS cdc_action,
    METADATA$ISUPDATE AS is_update,
    METADATA$ROW_ID   AS row_identifier
FROM staging_changes s
WHERE METADATA$ACTION = 'INSERT';   -- new rows plus the insert side of updates

-- Step 3: Execute incremental MERGE
MERGE INTO dimension_table t
USING recent_changes s
ON t.business_key = s.business_key
WHEN MATCHED AND s.is_update = TRUE
THEN UPDATE SET
t.attribute1 = s.attribute1,
t.attribute2 = s.attribute2,
t.last_modified = s.update_timestamp
WHEN NOT MATCHED
THEN INSERT (business_key, attribute1, attribute2, created_date)
VALUES (s.business_key, s.attribute1, s.attribute2, s.update_timestamp);
Batch Processing Strategy
For very large datasets, implement batch processing with partition-aware MERGE. Learn more about data pipeline best practices in Snowflake.
-- Create processing batches
CREATE OR REPLACE TABLE merge_batches AS
SELECT DISTINCT
DATE_TRUNC('day', event_date) as partition_date,
MOD(ABS(HASH(customer_id)), 10) as batch_number
FROM source_data
WHERE processed_flag = FALSE;
-- Process in batches (use stored procedure for actual implementation)
MERGE INTO target_table t
USING (
SELECT * FROM source_data
WHERE DATE_TRUNC('day', event_date) = '2025-01-15'
AND MOD(ABS(HASH(customer_id)), 10) = 0
) s
ON t.customer_id = s.customer_id
AND t.event_date = s.event_date
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.event_date, s.amount);
Technique 5: Optimize Warehouse Sizing and Query Profile
Proper warehouse configuration can dramatically impact MERGE performance. Understanding the relationship between data volume, complexity, and compute resources is crucial.
-- Get query ID for recent MERGE
SELECT query_id, query_text, execution_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%MERGE INTO target_table%'
ORDER BY start_time DESC
LIMIT 1;
-- Review that query's execution details from the current session's history
SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION())
WHERE query_id = 'your-query-id-here';
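One practical pattern (a sketch; the warehouse name and sizes are illustrative) is to give MERGE-heavy pipelines their own warehouse and resize it only around the big runs:
-- Dedicated warehouse for MERGE/ELT workloads, isolated from BI traffic
CREATE WAREHOUSE IF NOT EXISTS merge_wh
  WITH WAREHOUSE_SIZE = 'MEDIUM'
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE;

-- Temporarily scale up for a large backfill MERGE, then scale back down
ALTER WAREHOUSE merge_wh SET WAREHOUSE_SIZE = 'XLARGE';
-- ... run the heavy MERGE here ...
ALTER WAREHOUSE merge_wh SET WAREHOUSE_SIZE = 'MEDIUM';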
Performance Monitoring Queries
-- Monitor MERGE performance over time
SELECT
DATE_TRUNC('hour', start_time) as hour,
COUNT(*) as merge_count,
AVG(execution_time)/1000 as avg_seconds,
SUM(bytes_scanned)/(1024*1024*1024) as total_gb_scanned,
AVG(credits_used_cloud_services) as avg_credits
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%MERGE INTO%'
AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
GROUP BY 1
ORDER BY 1 DESC;
Real-World Performance Comparison
To demonstrate the impact of these techniques, here’s a real-world comparison of MERGE performance optimizations on a 50 million row table:
| Optimization Applied       | Execution Time | Data Scanned | Cost Reduction |
|----------------------------|----------------|--------------|----------------|
| Baseline (no optimization) | 45 minutes     | 2.5 TB       | –              |
| + Clustering Keys          | 18 minutes     | 850 GB       | 60%            |
| + Selective Predicates     | 8 minutes      | 320 GB       | 82%            |
| + Incremental Pattern      | 4 minutes      | 180 GB       | 91%            |
| + Optimized Warehouse      | 2.5 minutes    | 180 GB       | 94%            |
Common Pitfalls to Avoid
Even with optimization techniques, several common mistakes can sabotage MERGE performance:
1. Over-Clustering
Using too many clustering keys or clustering on low-cardinality columns creates overhead without benefits. Stick to 3-4 high-cardinality columns that align with your MERGE patterns.
2. Ignoring Data Skew
Uneven data distribution causes some micro-partitions to be much larger than others, leading to processing bottlenecks. Monitor and address skew with better partitioning strategies.
3. Full Table MERGE Without Filters
Always apply predicates to limit the scope of MERGE operations. Even on small tables, unnecessary full scans waste resources.
4. Improper Transaction Sizing
Very large single transactions can timeout or consume excessive resources. Break large MERGE operations into manageable batches.
Monitoring and Continuous Optimization
MERGE optimization is not a one-time activity. Implement continuous monitoring to maintain performance as data volumes grow:
-- Create monitoring dashboard query
CREATE OR REPLACE VIEW merge_performance_dashboard AS
SELECT
DATE_TRUNC('day', start_time) as execution_date,
REGEXP_SUBSTR(query_text, 'MERGE INTO (\\w+)', 1, 1, 'e') as target_table,
COUNT(*) as execution_count,
AVG(execution_time)/1000 as avg_execution_seconds,
MAX(execution_time)/1000 as max_execution_seconds,
AVG(bytes_scanned)/(1024*1024*1024) as avg_gb_scanned,
SUM(credits_used_cloud_services) as total_credits
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_type = 'MERGE'
AND start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP)
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC;
Conclusion and Next Steps
Optimizing Snowflake MERGE queries requires a multi-faceted approach combining clustering keys, selective predicates, micro-partition pruning, incremental patterns, and proper warehouse sizing. By implementing these five advanced techniques, you can achieve 10x or greater performance improvements while reducing costs significantly.
Key Takeaways
Define clustering keys on merge columns for aggressive pruning
Add selective predicates to reduce data scanned before merging
Leverage micro-partition metadata with partition-aligned conditions
Implement incremental MERGE patterns using CDC or batch processing
Right-size warehouses and monitor performance continuously
Start by analyzing your current MERGE queries using Query Profile, identify the biggest bottlenecks, and apply these techniques incrementally. Monitor the impact and iterate based on your specific data patterns and workload characteristics.
In today’s data-driven world, providing access to data is undoubtedly crucial. However, what happens when that data contains sensitive Personally Identifiable Information (PII) like emails, phone numbers, or credit card details? Clearly, you can’t just grant open access. For this reason, dynamic data masking in Snowflake becomes an essential tool for modern data governance.
Specifically, Dynamic Data Masking allows you to protect sensitive data by masking it in real-time within your query results, based on the user’s role. Crucially, the underlying data in the table remains unchanged; instead, only the query result is masked. As a result, your data analysts can run queries on a production table without ever seeing the raw sensitive information.
With that in mind, this guide will walk you through the practical steps of creating and applying a masking policy in Snowflake to protect your PII.
The Scenario: Protecting Customer PII
Imagine we have a customers table with the following columns, containing sensitive information:
| CUSTOMER_ID | FULL_NAME  | EMAIL                | PHONE_NUMBER |
|-------------|------------|----------------------|--------------|
| 101         | Jane Doe   | jane.doe@email.com   | 555-123-4567 |
| 102         | John Smith | john.smith@email.com | 555-987-6543 |
Our goal is to create a policy where:
Users with the ANALYST_ROLE see a masked version of the email and phone number.
Users with a privileged PII_ACCESS_ROLE can see the real, unmasked data.
Step 1: Create the Masking Policy
First, we define the rules of how the data should be masked. A masking policy is a schema-level object that uses a CASE statement to apply conditional logic.
This policy will check the user’s current role. If their role is PII_ACCESS_ROLE, it will show the original value. For all other roles, it will show a masked version.
SQL Code to Create the Policy:
-- Create a masking policy for email addresses
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() = 'PII_ACCESS_ROLE' THEN val
ELSE '***-MASKED-***'
END;
-- Create a masking policy for phone numbers
CREATE OR REPLACE MASKING POLICY phone_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() = 'PII_ACCESS_ROLE' THEN val
ELSE REGEXP_REPLACE(val, '^.{8}', '********') -- Masks the first 8 characters
END;
The email_mask policy is simple: it shows the real value for the privileged role and a fixed string for everyone else.
The phone_mask policy is slightly more advanced, using a regular expression to replace the first 8 characters of the phone number with asterisks, showing only the last part of the number.
Step 2: Apply the Masking Policy to Your Table
Once the policy is created, you need to apply it to the specific columns in your table that you want to protect. You use the ALTER TABLE command to do this.
SQL Code to Apply the Policy:
-- Apply the email_mask policy to the EMAIL column
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
-- Apply the phone_mask policy to the PHONE_NUMBER column
ALTER TABLE customers MODIFY COLUMN phone_number SET MASKING POLICY phone_mask;
That’s it! The policy is now active.
Step 3: Test the Policy with Different Roles
Now, let’s test our setup. We will run the same SELECT query as two different users with two different roles.
Test 1: Querying as a user with ANALYST_ROLE
USE ROLE ANALYST_ROLE;
SELECT * FROM customers;
Result (Data is Masked):
| CUSTOMER_ID | FULL_NAME  | EMAIL          | PHONE_NUMBER |
|-------------|------------|----------------|--------------|
| 101         | Jane Doe   | ***-MASKED-*** | ********4567 |
| 102         | John Smith | ***-MASKED-*** | ********6543 |
Test 2: Querying as a user with the privileged PII_ACCESS_ROLE
USE ROLE PII_ACCESS_ROLE;
SELECT * FROM customers;
Result (Data is Unmasked):
| CUSTOMER_ID | FULL_NAME  | EMAIL                | PHONE_NUMBER |
|-------------|------------|----------------------|--------------|
| 101         | Jane Doe   | jane.doe@email.com   | 555-123-4567 |
| 102         | John Smith | john.smith@email.com | 555-987-6543 |
As you can see, the same query on the same table produces different results based on the user’s role. The masking happens dynamically at query time, and the underlying data is never changed.
Conclusion: Security and Analytics in Harmony
Dynamic Data Masking is undoubtedly a powerful feature that allows you to democratize data access without compromising on security. Specifically, by implementing masking policies, you can provide broad access to your tables for analytics while at the same time ensuring that sensitive PII is only visible to the specific roles that have a legitimate need to see it. Ultimately, this is a fundamental component of building a secure and well-governed data platform in Snowflake.
In the final part of our Snowflake guide, we move beyond the technical implementation and into one of the most powerful strategic advantages of the platform: governance and secure data sharing. So far, we’ve covered the architecture, learned how to load data, and explored how to query it. Now, we’ll learn how to control, secure, and share that data.
Strong data governance isn’t just about locking data down; it’s about enabling secure access to the right data for the right people at the right time. Snowflake’s approach to this is built on two core pillars: robust, role-based access control and a revolutionary feature called Secure Data Sharing.
Pillar 1: Governance with Role-Based Access Control (RBAC)
In Snowflake, you never grant permissions directly to a user. Instead, all permissions are granted to Roles, and roles are then granted to users. This is a highly scalable and manageable way to control access to your data.
How RBAC Works
Objects: These are the things you want to secure, like databases, schemas, tables, and warehouses.
Privileges: These are the actions that can be performed on objects, such as SELECT, INSERT, CREATE, etc.
Roles: Roles are a collection of privileges. You can create roles for different functions, like ANALYST_ROLE, DEVELOPER_ROLE, or BI_TOOL_ROLE.
Users: Users are granted one or more roles, which in turn gives them the privileges of those roles.
Best Practice: Create a hierarchy of custom roles. For example, you might have a base READ_ONLY role that can select from tables, and an ANALYST role that inherits all the privileges of the READ_ONLY role plus additional permissions. This makes managing permissions much simpler as your organization grows.
Example Code:
-- 1. Create a new role
CREATE ROLE data_analyst;
-- 2. Grant privileges to the role
GRANT USAGE ON DATABASE my_prod_db TO ROLE data_analyst;
GRANT USAGE ON SCHEMA my_prod_db.analytics TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA my_prod_db.analytics TO ROLE data_analyst;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE data_analyst;
-- 3. Grant the role to a user
GRANT ROLE data_analyst TO USER jane_doe;
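And a short sketch of the role-hierarchy pattern mentioned in the best practice above (role names are illustrative):
-- Base role with read-only access
CREATE ROLE read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA my_prod_db.analytics TO ROLE read_only;

-- The analyst role inherits everything read_only can do, plus its own privileges
CREATE ROLE analyst;
GRANT ROLE read_only TO ROLE analyst;

-- Roll custom roles up to SYSADMIN so administrators retain full visibility
GRANT ROLE analyst TO ROLE SYSADMIN;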
Pillar 2: The Revolution of Secure Data Sharing
This is arguably one of Snowflake’s most innovative features and a key differentiator. Traditionally, if you wanted to share data with another company or a different department, you had to set up a painful and insecure ETL process. This involved creating data extracts (like CSV files), transferring them via FTP or other methods, and having the consumer load them into their own system. This process is slow, expensive, and creates stale, unsecure copies of your data.
Snowflake Secure Data Sharing eliminates this entire process. It allows you to provide live, read-only access to your data to any other Snowflake account without ever moving or copying the data.
How Secure Data Sharing Works
The Provider: You (the “provider”) create a Share object. A share is a named object that contains a set of privileges on your databases and tables.
Granting Access: You grant access to specific tables or views to your share.
The Consumer: You add a “consumer” Snowflake account to the share. The consumer can then “mount” this share as a read-only database in their own Snowflake account.
The Magic: The consumer is querying your data live in your account, but they are using their own virtual warehouse (their own compute) to do so. The data never leaves your ownership or your secure environment. There are no ETL processes, no data copies, and no additional storage costs.
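To make this concrete, here is a minimal sketch of the provider-side and consumer-side commands (share, database, and account names are illustrative):
-- Provider side: create a share and grant read access to specific objects
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE my_prod_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA my_prod_db.analytics TO SHARE sales_share;
GRANT SELECT ON TABLE my_prod_db.analytics.daily_sales TO SHARE sales_share;

-- Provider side: add the consumer's Snowflake account
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;

-- Consumer side: mount the share as a read-only database
CREATE DATABASE shared_sales FROM SHARE provider_account.sales_share;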
Use Cases:
Data Monetization: Companies in the Snowflake Marketplace sell access to their datasets using this feature.
Business Partnerships: Securely share data with your suppliers, partners, or customers.
Internal Departments: Share data between different business units without creating multiple copies and ETL pipelines.
Conclusion: The End of Data Silos
By combining a robust Role-Based Access Control system with the game-changing capabilities of Secure Data Sharing, Snowflake provides a comprehensive platform for modern data governance. This approach not only secures your data but also enables seamless and secure collaboration, breaking down the data silos that have plagued businesses for decades.
This concludes our four-part guide to Snowflake. You’ve gone from understanding the fundamental architecture to loading, querying, and now governing and sharing your data. You now have a complete picture of why Snowflake is a leader in the cloud data platform space.
In Part 1 of our guide, we explored Snowflake’s unique architecture, and in Part 2, we learned how to load data. Now comes the most important part: turning that raw data into valuable insights. The primary way we do this is by querying data in Snowflake.
While Snowflake uses standard SQL that will feel familiar to anyone with a database background, it also has powerful extensions and features that set it apart. This guide will cover the fundamentals of querying, how to handle semi-structured data like JSON, and introduce two of Snowflake’s most celebrated features: Zero-Copy Cloning and Time Travel.
The Workhorse: The Snowflake Worksheet
The primary interface for running queries in Snowflake is the Worksheet. It’s a clean, web-based environment where you can write and execute SQL, view results, and analyze query performance.
When you run a query, you are using the compute resources of your selected Virtual Warehouse. Remember, you can have different warehouses for different tasks, ensuring that your complex analytical queries don’t slow down other operations.
Standard SQL: Your Bread and Butter
At its core, querying data in Snowflake involves standard ANSI SQL. All the commands you’re familiar with work exactly as you’d expect.
-- A standard SQL query to find top-selling products by category
SELECT
category,
product_name,
SUM(sale_amount) as total_sales,
COUNT(order_id) as number_of_orders
FROM
sales
WHERE
sale_date >= '2025-01-01'
GROUP BY
1, 2
ORDER BY
total_sales DESC;
Beyond Columns: Querying Semi-Structured Data (JSON)
One of Snowflake’s most powerful features is its native ability to handle semi-structured data. You can load an entire JSON object into a single column with the VARIANT data type and query it directly using a simple, SQL-like syntax.
Let’s say we have a table raw_logs with a VARIANT column named log_payload containing JSON like the following (values are illustrative):
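{
  "event_type": "user_login",
  "timestamp": "2025-01-15 08:30:00",
  "user_details": {
    "user_id": "u-12345",
    "device_type": "mobile"
  }
}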
You can easily extract values from this JSON in your SQL query.
Example Code:
SELECT
log_payload:event_type::STRING AS event,
log_payload:user_details.user_id::STRING AS user_id,
log_payload:user_details.device_type::STRING AS device,
log_payload:timestamp::TIMESTAMP_NTZ AS event_timestamp
FROM
raw_logs
WHERE
event = 'user_login'
AND device = 'mobile';
: is used to traverse the JSON object.
. is used for dot notation to access nested elements.
:: is used to cast the VARIANT value to a specific data type (like STRING or TIMESTAMP).
This flexibility allows you to build powerful pipelines without needing a rigid, predefined schema for all your data.
Game-Changer #1: Zero-Copy Cloning
Imagine you need to create a full copy of your 50TB production database to give your development team a safe environment to test in. In a traditional system, this would be a slow, expensive process that duplicates 50TB of storage.
In Snowflake, this is instantaneous and free (from a storage perspective). Zero-Copy Cloning creates a clone of a table, schema, or entire database by simply copying its metadata.
How it Works: The clone points to the same underlying data micro-partitions as the original. No data is actually moved or duplicated. When you modify the clone, Snowflake automatically creates new micro-partitions for the changed data, leaving the original untouched.
Use Case: Instantly create full-scale development, testing, and QA environments without incurring extra storage costs or waiting hours for data to be copied.
Example Code:
-- This command instantly creates a full copy of your production database
CREATE DATABASE my_dev_db CLONE my_production_db;
Game-Changer #2: Time Travel
Have you ever accidentally run an UPDATE or DELETE statement without a WHERE clause? In most systems, this would mean a frantic call to the DBA to restore from a backup.
With Snowflake Time Travel, you can instantly query data as it existed in the past. The default retention period is one day, and it can be extended up to 90 days on Enterprise Edition and higher.
How it Works: Snowflake’s storage architecture is immutable. When you change data, it simply creates new micro-partitions and retains the old ones. Time Travel allows you to query the data using those older, historical micro-partitions.
Use Cases:
Instantly recover from accidental data modification.
Analyze how data has changed over a specific period.
Run A/B tests by comparing results before and after a change.
Example Code:
-- Query the table as it existed 5 minutes ago
SELECT *
FROM my_table AT(OFFSET => -60 * 5);
-- Or, restore an accidentally dropped table
UNDROP TABLE my_accidentally_dropped_table;
Conclusion for Part 3
You’ve now moved beyond just loading data and into the world of powerful analytics and data management. You’ve learned that:
Querying in Snowflake uses standard SQL via Worksheets.
You can seamlessly query JSON and other semi-structured data using the VARIANT type.
Zero-Copy Cloning provides instant, cost-effective data environments.
Time Travel acts as an “undo” button for your data, providing incredible data protection.
In Part 4, the final part of our guide, we will cover “Snowflake Governance & Sharing,” where we’ll explore roles, access control, and the revolutionary Data Sharing feature.