Tag: snowflake

  • AI Data Agent Guide 2025: Snowflake Cortex Tutorial

    AI Data Agent Guide 2025: Snowflake Cortex Tutorial

    The world of data analytics is changing. For years, accessing insights required writing complex SQL queries. However, the industry is now shifting towards a more intuitive, conversational approach. At the forefront of this revolution is agentic AI—intelligent systems that can understand human language, reason, plan, and automate complex tasks.

    Snowflake is leading this charge by transforming its platform into an intelligent and conversational AI Data Cloud. With the recent introduction of Cortex Agents, Snowflake has given developers and data teams a powerful tool for building their own custom AI assistants.

    This guide will walk you through, step-by-step, how to build your very first AI data agent. You will learn how to create an agent that can answer complex questions by pulling information from both your database tables and your unstructured documents, all using simple, natural language.

    What is a Snowflake Cortex Agent and Why Does it Matter?

    First and foremost, a Snowflake Cortex Agent is an AI-powered assistant that you can build on top of your own data. Think of it as a chatbot that has expert knowledge of your business. It understands your data landscape and can perform complex analytical tasks based on simple, conversational prompts.

    This is a game-changer for several reasons:

    • It Democratizes Data: Business users no longer need to know SQL. Instead, they can ask questions like, “What were our top-selling products in the last quarter?” and get immediate, accurate answers.
    • It Automates Analysis: Consequently, data teams are freed from writing repetitive, ad-hoc queries. They can now focus on more strategic initiatives while the agent handles routine data exploration.
    • It Provides Unified Insights: Most importantly, a Cortex Agent can synthesize information from multiple sources. It can query your structured sales data from a table and cross-reference it with strategic goals mentioned in a PDF document, all in a single response.

    The Blueprint: How a Cortex Agent Works

    Under the hood, a Cortex Agent uses a simple yet powerful workflow to answer your questions. It orchestrates several of Snowflake’s Cortex AI features to deliver a comprehensive answer.

    Whiteboard-style flowchart showing how a Snowflake Cortex Agent works by using Cortex Analyst for SQL and Cortex Search for documents to provide an answer.
    1. Planning: The agent first analyzes your natural language question to understand your intent. It figures out what information you need and where it might be located.
    2. Tool Use: Next, it intelligently chooses the right tool for the job. If it needs to query structured data, it uses Cortex Analyst to generate and run SQL. If it needs to find information in your documents, it uses Cortex Search.
    3. Reflection: Finally, after gathering the data, the agent evaluates the results. It might ask for clarification, refine its approach, or synthesize the information into a clear, concise answer before presenting it to you.

    Step-by-Step Tutorial: Building a Sales Analysis Agent

    Now, let’s get hands-on. We will build a simple yet powerful sales analysis agent. This agent will be able to answer questions about sales figures from a table and also reference goals from a quarterly business review (QBR) document.

    Hand-drawn illustration of preparing data for Snowflake, showing a database and a document being placed into a container with the Snowflake logo.

    Prerequisites

    • A Snowflake account with ACCOUNTADMIN privileges.
    • A warehouse to run the queries.

    Step 1: Prepare Your Data

    First, we need some data to work with. Let’s create two simple tables for sales and products, and then upload a sample QBR text document.

    Run the following SQL in a Snowflake worksheet:

    -- Create our database and schema
    CREATE DATABASE IF NOT EXISTS AGENT_DEMO;
    CREATE SCHEMA IF NOT EXISTS AGENT_DEMO.SALES;
    USE SCHEMA AGENT_DEMO.SALES;
    
    -- Create a products table
    CREATE OR REPLACE TABLE PRODUCTS (
        product_id INT,
        product_name VARCHAR,
        category VARCHAR
    );
    
    INSERT INTO PRODUCTS (product_id, product_name, category) VALUES
    (101, 'Quantum Laptop', 'Electronics'),
    (102, 'Nebula Smartphone', 'Electronics'),
    (103, 'Stardust Keyboard', 'Accessories');
    
    -- Create a sales table
    CREATE OR REPLACE TABLE SALES (
        sale_id INT,
        product_id INT,
        sale_date DATE,
        sale_amount DECIMAL(10, 2)
    );
    
    INSERT INTO SALES (sale_id, product_id, sale_date, sale_amount) VALUES
    (1, 101, '2025-09-01', 1200.00),
    (2, 102, '2025-09-05', 800.00),
    (3, 101, '2025-09-15', 1250.00),
    (4, 103, '2025-09-20', 150.00);
    
    -- Create a stage for our unstructured documents
    CREATE OR REPLACE STAGE qbr_documents;

    Now, create a simple text file named QBR_Report_Q3.txt on your local machine with the following content and upload it to the qbr_documents stage using the Snowsight UI.

    Quarterly Business Review – Q3 2025 Summary

    Our primary strategic goal for Q3 was to drive the adoption of our new flagship product, the ‘Quantum Laptop’. We aimed for a sales target of over $2,000 for this product. Secondary goals included expanding our market share in the accessories category.
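
    If you prefer the command line to the Snowsight UI, you can stage the file with a PUT command from SnowSQL instead. This is a minimal sketch; the local file path is a placeholder for wherever you saved the file.

    -- Run from SnowSQL; adjust the local path to your machine
    PUT file:///path/to/QBR_Report_Q3.txt @qbr_documents AUTO_COMPRESS = FALSE;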

    Step 2: Create the Semantic Model

    Next, we need to teach the agent about our structured data. We do this by creating a Semantic Model. This is a YAML file that defines our tables, columns, and how they relate to each other.

    # semantic_model.yaml
    model:
      name: sales_insights_model
      tables:
        - name: SALES
          columns:
            - name: sale_id
              type: INT
            - name: product_id
              type: INT
            - name: sale_date
              type: DATE
            - name: sale_amount
              type: DECIMAL
        - name: PRODUCTS
          columns:
            - name: product_id
              type: INT
            - name: product_name
              type: VARCHAR
            - name: category
              type: VARCHAR
      joins:
        - from: SALES
          to: PRODUCTS
          on: SALES.product_id = PRODUCTS.product_id

    Save this as semantic_model.yaml and upload it to the @qbr_documents stage.

    Step 3: Create the Cortex Search Service

    Now, let’s make our QBR document searchable. We create a Cortex Search Service on the stage where we uploaded our file.

    CREATE OR REPLACE CORTEX SEARCH SERVICE sales_qbr_service
        ON @qbr_documents
        TARGET_LAG = '0 seconds'
        WAREHOUSE = 'COMPUTE_WH';

    Step 4: Combine Them into a Cortex Agent

    With all the pieces in place, we can now create our agent. This single SQL statement brings together our semantic model (for SQL queries) and our search service (for document queries).

    CREATE OR REPLACE CORTEX AGENT sales_agent
        MODEL = 'mistral-large',
        CORTEX_SEARCH_SERVICES = [sales_qbr_service],
        SEMANTIC_MODELS = ['@qbr_documents/semantic_model.yaml'];

    Step 5: Ask Your Agent Questions!

    The agent is now ready! You can interact with it using the CALL command. Let’s try a few questions.

    A hand-drawn sketch of a computer screen showing a user asking questions to a Snowflake Cortex Agent and receiving instant, insightful answers.

    First up: A simple structured data query.

    CALL sales_agent('What were our total sales?');

    Next: A more complex query involving joins.

    CALL sales_agent('Which product had the highest revenue?');

    Then comes: A question for our unstructured document.

    CALL sales_agent('Summarize our strategic goals from the latest QBR report.');
    

    Finally, the magic: a question that combines both.

    CALL sales_agent('Did we meet our sales target for the Quantum Laptop as mentioned in the QBR?');

    This final query demonstrates the true power of a Snowflake Cortex Agent. It will first query the SALES and PRODUCTS tables to calculate the total sales for the “Quantum Laptop.” Then, it will use Cortex Search to find the sales target mentioned in the QBR document. Finally, it will compare the two and give you a complete, synthesized answer.
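
    For reference, the structured half of that answer boils down to a query like the one below. This is a hand-written approximation of what Cortex Analyst would generate from the semantic model, not the agent’s literal output.

    -- Approximate SQL for the structured half of the question
    SELECT p.product_name, SUM(s.sale_amount) AS total_sales
    FROM SALES s
    JOIN PRODUCTS p ON s.product_id = p.product_id
    WHERE p.product_name = 'Quantum Laptop'
    GROUP BY p.product_name;
    -- Returns 2,450.00, which the agent compares against the $2,000 target from the QBR document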

    Conclusion: The Future is Conversational

    You have just built a powerful AI data agent in a matter of minutes. This is a fundamental shift in how we interact with data. By combining natural language processing with the power to query both structured and unstructured data, Snowflake Cortex Agents are paving the way for a future where data-driven insights are accessible to everyone in an organization.

    As Snowflake continues to innovate with features like Adaptive Compute and Gen-2 Warehouses, running these AI workloads will only become faster and more efficient. The era of conversational analytics has arrived, and it’s built on the Snowflake AI Data Cloud.



  • Star Schema vs Snowflake Schema: Key Differences & Use Cases

    Star Schema vs Snowflake Schema: Key Differences & Use Cases

    In the realm of data warehousing, choosing the right schema design is crucial for efficient data management, querying, and analysis. Two of the most popular multidimensional schemas are the star schema and the snowflake schema. These schemas organize data into fact tables (containing measurable metrics) and dimension tables (providing context like who, what, when, and where). Understanding star schema vs snowflake schema helps data engineers, analysts, and architects build scalable systems that support business intelligence (BI) tools and advanced analytics.

    This comprehensive guide delves into their structures, pros, cons, when to use each, real-world examples, and which one dominates in modern data practices as of 2025. We’ll also include visual illustrations to make concepts clearer, along with references to authoritative sources for deeper reading.

    What is a Star Schema?

    A star schema is a denormalized data model resembling a star, with a central fact table surrounded by dimension tables. The fact table holds quantitative data (e.g., sales amounts, quantities) and foreign keys linking to dimensions. Dimension tables store descriptive attributes (e.g., product names, customer details) and are not further normalized.

    Hand-drawn star schema diagram for data warehousing
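
    To make the structure concrete, here is a minimal, hypothetical DDL sketch for a retail star schema like the one described later in this article. Table and column names are illustrative only.

    -- Dimension tables hold descriptive attributes, kept denormalized
    CREATE TABLE dim_product (
        product_key INT,
        product_name VARCHAR,
        category VARCHAR,       -- category stored inline rather than in its own table
        supplier_name VARCHAR   -- supplier stored inline as well
    );
    
    CREATE TABLE dim_date (
        date_key INT,
        full_date DATE,
        quarter VARCHAR
    );
    
    -- The fact table holds metrics plus foreign keys to each dimension
    CREATE TABLE fact_sales (
        product_key INT,
        date_key INT,
        store_key INT,          -- points to a dim_store table, omitted here for brevity
        sale_amount DECIMAL(10, 2),
        quantity INT
    );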

    Advantages of Star Schema:

    • Simplicity and Ease of Use: Fewer tables mean simpler queries with minimal joins, making it intuitive for end-users and BI tools like Tableau or Power BI.
    • Faster Query Performance: Denormalization reduces join operations, leading to quicker aggregations and reports, especially on large datasets.
    • Better for Reporting: Ideal for OLAP (Online Analytical Processing) where speed is prioritized over storage efficiency.

    Disadvantages of Star Schema:

    • Data Redundancy: Denormalization can lead to duplicated data in dimension tables, increasing storage needs and risking inconsistencies during updates.
    • Limited Flexibility for Complex Hierarchies: It struggles with intricate relationships, such as multi-level product categories.

    In practice, star schemas are favored in environments where query speed trumps everything else. For instance, in a retail data warehouse, the fact table might record daily sales metrics, while dimensions cover products, customers, stores, and dates. This setup allows quick answers to questions like “What were the total sales by product category last quarter?”
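
    Against the hypothetical tables sketched above, that quarterly question becomes a short query with only two joins:

    -- Total sales by product category last quarter (illustrative star schema)
    SELECT p.category, SUM(f.sale_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d ON f.date_key = d.date_key
    WHERE d.quarter = '2025-Q2'
    GROUP BY p.category;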

    What is a Snowflake Schema?

    A snowflake schema is an extension of the star schema but with normalized dimension tables. Here, dimensions are broken down into sub-dimension tables to eliminate redundancy, creating a structure that branches out like a snowflake. The fact table remains central, but dimensions are hierarchical and normalized to third normal form (3NF).

    Hand-drawn snowflake schema diagram for data warehousing

    Advantages of Snowflake Schema:

    • Storage Efficiency: Normalization reduces data duplication, saving disk space—crucial for massive datasets in cloud environments like AWS or Snowflake (the data warehouse platform).
    • Improved Data Integrity: By minimizing redundancy, updates are easier and less error-prone, maintaining consistency across the warehouse.
    • Handles Complex Relationships: Better suited for detailed hierarchies, such as product categories subdivided into brands, suppliers, and regions.

    Disadvantages of Snowflake Schema:

    • Slower Query Performance: More joins are required, which can slow down queries on large volumes of data.
    • Increased Complexity: The normalized structure is harder to understand and maintain, potentially complicating BI tool integrations.

    For example, in the same retail scenario, a snowflake schema might normalize the product dimension into separate tables for products, categories, and suppliers. This allows precise queries like “Sales by supplier region” without redundant storage, but at the cost of additional joins.
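
    Continuing the illustrative example, a snowflaked product dimension moves suppliers and their regions into separate tables, so the “sales by supplier region” question needs a longer join chain:

    -- Sales by supplier region (illustrative snowflake schema)
    SELECT r.region_name, SUM(f.sale_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_supplier s ON p.supplier_key = s.supplier_key
    JOIN dim_region r ON s.region_key = r.region_key
    GROUP BY r.region_name;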

    Key Differences Between Star Schema and Snowflake Schema

    To highlight star schema vs snowflake schema, here’s a comparison table:

    Aspect | Star Schema | Snowflake Schema
    Normalization | Denormalized (1NF or 2NF) | Normalized (3NF)
    Structure | Central fact table with direct dimension tables | Fact table with hierarchical sub-dimensions
    Joins | Fewer joins, faster queries | More joins, potentially slower
    Storage | Higher due to redundancy | Lower, more efficient
    Complexity | Simple and user-friendly | More complex, better for integrity
    Query Speed | High | Moderate to low
    Data Redundancy | High | Low

    These differences stem from their design philosophies: star focuses on performance, while snowflake emphasizes efficiency and accuracy.

    When to Use Star Schema vs Snowflake Schema

    • Use Star Schema When:
      • Speed is critical (e.g., real-time dashboards).
      • Data models are simple without deep hierarchies.
      • Storage cost isn’t a concern with cheap cloud options.
      • Example: An e-commerce firm uses star schema for rapid sales trend analysis.
    • Use Snowflake Schema When:
      • Storage optimization is key for massive datasets.
      • Complex hierarchies exist (e.g., supply chain layers).
      • Data integrity is paramount during updates.
      • Example: A healthcare provider uses snowflake to manage patient and provider hierarchies.

    Hybrid approaches exist, but pure star schemas are often preferred for balance.

    Which is Used Most in 2025?

    As of 2025, the star schema remains the most commonly used in data warehousing. Its simplicity aligns with the rise of self-service BI tools and cloud platforms like Snowflake and BigQuery, where query optimization mitigates some denormalization drawbacks. Surveys and industry reports indicate that over 70% of data warehouses favor star schemas for their performance advantages, especially in agile environments. Snowflake schemas, while efficient, are more niche—used in about 20-30% of cases where normalization is essential, such as regulated industries like finance or healthcare.

    However, with advancements in columnar storage and indexing, the performance gap is narrowing, making snowflake viable for more use cases.

    Solid Examples in Action

    Consider a healthcare analytics warehouse:

    • Star Schema Example: Fact table tracks patient visits (metrics: visit count, cost). Dimensions: Patient (ID, name, age), Doctor (ID, specialty), Date (year, month), Location (hospital, city). Queries like “Average cost per doctor specialty in 2024” run swiftly with simple joins.
    • Snowflake Schema Example: Normalize the Doctor dimension into Doctor (ID, name), Specialty (ID, type, department), and Department (ID, head). This reduces redundancy if specialties change often, but requires extra joins for the same query.

    In a financial reporting system, star might aggregate transaction data quickly for dashboards, while snowflake ensures normalized account hierarchies for compliance audits.

    Best Practices and References

    To implement effectively:

    • Start with business requirements: Prioritize speed or efficiency?
    • Use tools like dbt or ERwin for modeling.
    • Test performance with sample data.


    In conclusion, while star schema vs snowflake schema both serve data warehousing, star’s dominance in 2025 underscores the value of simplicity in a fast-paced data landscape. Choose based on your workload—performance for star, efficiency for snowflake—and watch your analytics thrive.

  • Mastering Python Data Pipelines: Extract from APIs & Databases, Load to S3 & Snowflake

    Mastering Python Data Pipelines: Extract from APIs & Databases, Load to S3 & Snowflake

    Introduction to Data Pipelines in Python

    In today’s data-driven world, building robust data pipeline solutions is essential for businesses to handle large volumes of information efficiently. Whether you’re pulling data from RESTful APIs or external databases, the goal is to extract, transform, and load (ETL) it reliably. This guide walks you through building data pipelines using Python that fetch data from multiple sources, store it in Amazon S3 for scalable storage, and load it into Snowflake for advanced analytics.

    By leveraging Python’s powerful libraries like requests for APIs, sqlalchemy for databases, boto3 for S3, and the Snowflake connector, you can automate these processes. This approach ensures data integrity, scalability, and cost-effectiveness, making it ideal for data engineers and developers.

    Why Use Python for Data Pipelines?

    Python stands out due to its simplicity, extensive ecosystem, and community support. Key benefits include:

    • Ease of Integration: Seamlessly connect to APIs, databases, S3, and Snowflake.
    • Scalability: Handle large datasets with libraries like Pandas for transformations.
    • Automation: Use schedulers like Airflow or cron jobs to run pipelines periodically.
    • Cost-Effective: Open-source tools reduce overhead compared to proprietary ETL software.

    If you’re dealing with real-time data ingestion or batch processing, Python’s flexibility makes it a top choice for modern data pipelines.

    Step 1: Extracting Data from APIs

    Extracting data from APIs is a common starting point in data pipelines. We’ll use the requests library to fetch JSON data from a public API, such as a weather service or GitHub API.

    First, install the necessary packages:

    pip install requests pandas

    Here’s a sample Python script to extract data from an API:

    import requests
    import pandas as pd
    
    def extract_from_api(api_url):
        try:
            response = requests.get(api_url)
            response.raise_for_status()  # Raise error for bad status codes
            data = response.json()
            # Assuming the data is in a list under 'results' key
            df = pd.DataFrame(data.get('results', []))
            print(f"Extracted {len(df)} records from API.")
            return df
        except requests.exceptions.RequestException as e:
            print(f"API extraction error: {e}")
            return pd.DataFrame()
    
    # Example usage
    api_url = "https://api.example.com/data"  # Replace with your API endpoint
    api_data = extract_from_api(api_url)

    This function handles errors gracefully and converts the API response into a Pandas DataFrame for easy manipulation in your Python data pipelines.

    Step 2: Extracting Data from External Databases

    For external databases like MySQL, PostgreSQL, or Oracle, use sqlalchemy to connect and query data. This is crucial for data pipelines involving legacy systems or third-party DBs.

    Install the required libraries:

    pip install sqlalchemy pandas mysql-connector-python  # Adjust driver for your DB

    Sample code to extract from a MySQL database:

    from sqlalchemy import create_engine
    import pandas as pd
    
    def extract_from_db(db_url, query):
        try:
            engine = create_engine(db_url)
            df = pd.read_sql_query(query, engine)
            print(f"Extracted {len(df)} records from database.")
            return df
        except Exception as e:
            print(f"Database extraction error: {e}")
            return pd.DataFrame()
    
    # Example usage
    db_url = "mysql+mysqlconnector://user:password@host:port/dbname"  # Replace with your credentials
    query = "SELECT * FROM your_table WHERE date > '2023-01-01'"
    db_data = extract_from_db(db_url, query)

    This method ensures secure connections and efficient data retrieval, forming a solid foundation for your pipelines in Python.

    Step 3: Transforming Data (Optional ETL Step)

    Before loading, transform the data using Pandas. For instance, merge API and DB data, clean duplicates, or apply calculations.

    # Assuming api_data and db_data are DataFrames
    merged_data = pd.merge(api_data, db_data, on='common_column', how='inner')
    merged_data.drop_duplicates(inplace=True)
    merged_data['new_column'] = merged_data['value1'] + merged_data['value2']

    This step in data pipelines ensures data quality and relevance.

    Step 4: Loading Data to Amazon S3

    Amazon S3 provides durable, scalable storage for your extracted data. Use boto3 to upload files.

    Install boto3:

    pip install boto3

    Code example:

    import boto3
    import io
    
    def load_to_s3(df, bucket_name, file_key, aws_access_key, aws_secret_key):
        try:
            s3_client = boto3.client('s3', aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)
            csv_buffer = io.StringIO()
            df.to_csv(csv_buffer, index=False)
            s3_client.put_object(Bucket=bucket_name, Key=file_key, Body=csv_buffer.getvalue())
            print(f"Data loaded to S3: {bucket_name}/{file_key}")
        except Exception as e:
            print(f"S3 upload error: {e}")
    
    # Example usage
    bucket = "your-s3-bucket"
    key = "data/processed_data.csv"
    load_to_s3(merged_data, bucket, key, "your_access_key", "your_secret_key")  # Use environment variables for security

    Storing in S3 acts as an intermediate layer in data pipelines, enabling versioning and easy access.

    Step 5: Loading Data into Snowflake

    Finally, load the data from S3 into Snowflake for querying and analytics. Use the Snowflake Python connector.

    Install the connector:

    pip install snowflake-connector-python pandas

    Sample Script:

    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas
    import pandas as pd
    
    def load_to_snowflake(df, snowflake_account, user, password, warehouse, db, schema, table):
        conn = None
        try:
            conn = snowflake.connector.connect(
                user=user,
                password=password,
                account=snowflake_account,
                warehouse=warehouse,
                database=db,
                schema=schema
            )
            cur = conn.cursor()
            # Create table if not exists (simplified; adjust columns to match your DataFrame)
            cur.execute(f"CREATE TABLE IF NOT EXISTS {table} (col1 VARCHAR, col2 INT)")
            # Load the DataFrame with write_pandas (pandas' to_sql needs a SQLAlchemy engine,
            # not a raw connector connection); use COPY INTO with a stage for very large loads
            success, num_chunks, num_rows, _ = write_pandas(conn, df, table_name=table, quote_identifiers=False)
            print(f"Data loaded to Snowflake table: {table} ({num_rows} rows)")
        except Exception as e:
            print(f"Snowflake load error: {e}")
        finally:
            if conn:
                conn.close()
    
    # Example usage
    load_to_snowflake(merged_data, "your-account", "user", "password", "warehouse", "db", "schema", "your_table")

    For larger datasets, use Snowflake’s COPY INTO command with S3 stages for better performance in your Python data pipelines.
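
    As a rough sketch of that approach, the CSV written to S3 in Step 4 can be bulk-loaded through an external stage. The stage name, credentials, and file format options below are placeholders; for production, a storage integration is the more secure way to grant S3 access.

    -- Hypothetical external stage over the bucket used in Step 4
    CREATE OR REPLACE STAGE my_s3_stage
        URL = 's3://your-s3-bucket/data/'
        CREDENTIALS = (AWS_KEY_ID = 'your_access_key' AWS_SECRET_KEY = 'your_secret_key');
    
    -- Bulk-load the staged CSV into the target table
    COPY INTO your_table
    FROM @my_s3_stage/processed_data.csv
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);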

    Best Practices for Data Pipelines in Python

    • Error Handling: Always include try-except blocks to prevent pipeline failures.
    • Security: Use environment variables or AWS Secrets Manager for credentials.
    • Scheduling: Integrate with Apache Airflow or AWS Lambda for automated runs.
    • Monitoring: Log activities and use tools like Datadog for pipeline health.
    • Scalability: For big data, consider PySpark or Dask instead of Pandas.

    Conclusion

    Building Python data pipelines from APIs and databases to S3 and Snowflake streamlines your ETL workflows, enabling faster insights. With the code examples provided, you can start implementing these pipelines today. If you’re optimizing for cloud efficiency, this setup reduces costs while boosting performance.


  • Revolutionizing Finance: A Deep Dive into Snowflake’s Cortex AI

    Revolutionizing Finance: A Deep Dive into Snowflake’s Cortex AI

    The financial services industry is in the midst of a technological revolution. At the heart of this change lies Artificial Intelligence. Consequently, financial institutions are constantly seeking new ways to innovate and enhance security. They also want to deliver personalized customer experiences. However, they face a significant hurdle: navigating fragmented data while adhering to strict compliance and governance requirements. To solve this, Snowflake has introduced Cortex AI for Financial Services, a groundbreaking suite of tools designed to unlock the full potential of AI in the sector.

    What is Snowflake Cortex AI for Financial Services?

    First and foremost, Snowflake Cortex AI is a comprehensive suite of AI capabilities. It empowers financial organizations to unify their data and securely deploy AI models, applications, and agents. By bringing AI directly to the data, Snowflake eliminates the need to move sensitive information. As a result, security and governance are greatly enhanced. This approach allows institutions to leverage their own proprietary data alongside third-party sources and cutting-edge large language models (LLMs). Ultimately, this helps them automate complex tasks and derive faster, more accurate insights.

    Key Capabilities Driving the Transformation

    Cortex AI for Financial Services is built on a foundation of powerful features. These are specifically designed to accelerate AI adoption within the financial industry.

    • Snowflake Data Science Agent: This AI-powered coding assistant automates many time-consuming tasks for data scientists. For instance, it handles data cleaning, feature engineering, and model prototyping. This, in turn, accelerates the development of crucial workflows like risk modeling and fraud detection.
    • Cortex AISQL: With its AI-powered functions, Cortex AISQL allows users to process and analyze unstructured data at scale. This includes market research, earnings call transcripts, and transaction details. Therefore, it transforms workflows in customer service, investment analytics, and claims processing.
    • Snowflake Intelligence: Furthermore, this feature provides business users with an intuitive, conversational interface. They can query both structured and unstructured data using natural language. This “democratization” of data access means even non-technical users can gain valuable insights without writing complex SQL.
    • Managed Model Context Protocol (MCP) Server: The MCP Server is a true game-changer. It securely connects proprietary data with third-party data from partners like FactSet and MSCI. In addition, it provides a standardized method for LLMs to integrate with data APIs, which eliminates the need for custom work and speeds up the delivery of AI applications.

    Use Cases: Putting Cortex AI to Work in Finance

    The practical applications of Snowflake Cortex AI in the financial services industry are vast and transformative:

    • Fraud Detection and Prevention: By training models on historical transaction data, institutions can identify suspicious patterns in real time. Consequently, this proactive approach helps minimize losses and protect customers.
    • Credit Risk Analysis: Cortex Analyst, a key feature, can analyze vast amounts of transaction data to assess credit risk. By building a semantic model, for example, financial institutions can enable more accurate and nuanced risk assessments.
    • Algorithmic Trading Support: While not a trading platform itself, Snowflake’s infrastructure supports algorithmic strategies. Specifically, Cortex AI provides powerful tools for data analysis, pattern identification, and model development.
    • Enhanced Customer Service: Moreover, AI agents powered by Cortex AI can create sophisticated customer support systems. These agents can analyze customer data to provide personalized assistance and automate tasks, leading to improved satisfaction.
    • Market and Investment Analysis: Cortex AI can also analyze a wide range of data sources, from earnings calls to market news. This provides real-time insights that are crucial for making informed and timely investment decisions.

    The Benefits of a Unified AI and Data Strategy

    By adopting Snowflake Cortex AI, financial institutions can realize a multitude of benefits:

    • Enhanced Security and Governance: By bringing AI to the data, sensitive financial information remains within Snowflake’s secure and governed environment.
    • Faster Innovation: Automating data science tasks allows for the rapid development and deployment of new products.
    • Democratization of Data: Natural language interfaces empower more users to access and analyze data directly.
    • Reduced Operational Costs: Finally, the automation of complex tasks leads to significant cost savings and increased efficiency.

    Getting Started with Snowflake Cortex AI

    For institutions ready to begin their AI journey, the path is clear. The Snowflake Quickstarts offer a wealth of tutorials and guides. These resources provide step-by-step instructions on how to set up the environment, create models, and build powerful applications.

    The Future of Finance is Here

    In conclusion, Snowflake Cortex AI for Financial Services represents a pivotal moment for the industry. By providing a secure, scalable, and unified platform, Snowflake is empowering financial institutions to seize the opportunities of tomorrow. The ability to seamlessly integrate data with the latest AI technology will undoubtedly be a key differentiator in the competitive landscape of finance.



  • Snowflake Data Science Agent: Automate ML Workflows 2025

    Snowflake Data Science Agent: Automate ML Workflows 2025

    The 60–80% Problem Killing Data Science Productivity

    Data science productivity is being crushed by the 60–80% problem. Despite powerful platforms like Snowflake and cutting-edge ML tools, data scientists still spend the majority of their time—60 to 80 percent—on repetitive tasks like data cleaning, feature engineering, and environment setup. This bottleneck is stalling innovation and delaying insights that drive business value.

    Data scientists spend 60-80% time on repetitive tasks vs strategic work

    A typical ML project timeline looks like this:

    • Weeks 1-2: Finding datasets, setting up environments, searching for similar projects
    • Weeks 3-4: Data preprocessing, exploratory analysis, feature engineering
    • Weeks 5-6: Model selection, hyperparameter tuning, training
    • Weeks 7-8: Evaluation, documentation, deployment preparation

    Only after this two-month slog do data scientists reach the interesting work: interpreting results and driving business impact.

    What if you could compress weeks of foundational ML work into under an hour?

    Enter the Snowflake Data Science Agent, announced at Snowflake Summit 2025 on June 3. This agentic AI companion automates routine ML development tasks, transforming how organizations build and deploy machine learning models.


    What is Snowflake Data Science Agent?

    Snowflake Data Science Agent is an autonomous AI assistant that automates the entire ML development lifecycle within the Snowflake environment. Currently in private preview with general availability expected in late 2025, it represents a fundamental shift in how data scientists work.

    Natural language prompt converting to production-ready ML code

    The Core Innovation

    Rather than manually coding each step of an ML pipeline, data scientists describe their objective in natural language. The agent then:

    Understands Context: Analyzes available datasets, business requirements, and project goals

    Plans Strategy: Breaks down the ML problem into logical, executable steps

    Generates Code: Creates production-quality Python code for each pipeline component

    Executes Workflows: Runs the pipeline directly in Snowflake Notebooks with full observability

    Iterates Intelligently: Refines approaches based on results and user feedback

    Powered by Claude AI

    The Data Science Agent leverages Anthropic’s Claude large language model, running securely within Snowflake’s perimeter. This integration ensures that proprietary data never leaves the governed Snowflake environment while providing state-of-the-art reasoning capabilities.


    How Data Science Agent Transforms ML Workflows

    Traditional ML Pipeline vs. Agent-Assisted Pipeline

    Traditional Approach (4-8 Weeks):

    1. Manual dataset discovery and access setup (3-5 days)
    2. Exploratory data analysis with custom scripts (5-7 days)
    3. Data preprocessing and quality checks (7-10 days)
    4. Feature engineering experiments (5-7 days)
    5. Model selection and baseline training (3-5 days)
    6. Hyperparameter tuning iterations (7-10 days)
    7. Model evaluation and documentation (5-7 days)
    8. Deployment preparation and handoff (3-5 days)

    Agent-Assisted Approach (1-2 Days):

    1. Natural language project description (15 minutes)
    2. Agent generates complete pipeline (30-60 minutes)
    3. Review and customize generated code (2-3 hours)
    4. Execute and evaluate results (1-2 hours)
    5. Iterate with follow-up prompts (30 minutes per iteration)
    6. Production deployment (1-2 hours)

    The agent doesn’t eliminate human expertise—it amplifies it. Data scientists focus on problem formulation, result interpretation, and business strategy rather than boilerplate code.


    Key Capabilities and Features

    1. Automated Data Preparation

    The agent handles the most time-consuming aspects of data science:

    Data Profiling: Automatically analyzes distributions, identifies missing values, detects outliers, and assesses data quality

    Smart Preprocessing: Generates appropriate transformations based on data characteristics—normalization, encoding, imputation, scaling

    Feature Engineering: Creates relevant features using domain knowledge embedded in the model, including polynomial features, interaction terms, and temporal aggregations

    Data Validation: Implements checks to ensure data quality throughout the pipeline

    2. Intelligent Model Selection

    Rather than manually testing dozens of algorithms, the agent:

    Evaluates Problem Type: Classifies tasks as regression, classification, clustering, or time series

    Considers Constraints: Factors in dataset size, feature types, and performance requirements

    Recommends Algorithms: Suggests appropriate models with justification for each recommendation

    Implements Ensemble Methods: Combines multiple models when beneficial for accuracy

    3. Automated Hyperparameter Tuning

    The agent configures and executes optimization strategies:

    Grid Search: Systematic exploration of parameter spaces for small parameter sets

    Random Search: Efficient sampling for high-dimensional parameter spaces

    Bayesian Optimization: Intelligent search using previous results to guide exploration

    Early Stopping: Prevents overfitting and saves compute resources

    4. Production-Ready Code Generation

    Generated pipelines aren’t just prototypes—they’re production-quality:

    Modular Architecture: Clean, reusable functions with clear separation of concerns

    Error Handling: Robust exception handling and logging

    Documentation: Inline comments and docstrings explaining logic

    Version Control Ready: Structured for Git workflows and collaboration

    Snowflake Native: Optimized for Snowflake’s distributed computing environment

    5. Explainability and Transparency

    Understanding model behavior is crucial for trust and compliance:

    Feature Importance: Identifies which variables drive predictions

    SHAP Values: Explains individual predictions with Shapley values

    Model Diagnostics: Generates confusion matrices, ROC curves, and performance metrics

    Audit Trails: Logs all decisions, code changes, and model versions


    Real-World Use Cases

    Financial Services: Fraud Detection

    Challenge: A bank needs to detect fraudulent transactions in real-time with minimal false positives.

    Traditional Approach: Data science team spends 6 weeks building and tuning models, requiring deep SQL expertise, feature engineering knowledge, and model optimization skills.

    Data Science Agent use cases across finance, healthcare, retail, manufacturing

    With Data Science Agent:

    • Prompt: “Build a fraud detection model using transaction history, customer profiles, and merchant data. Optimize for 99% precision while maintaining 85% recall.”
    • Result: Agent generates a complete pipeline with ensemble methods, class imbalance handling, and real-time scoring infrastructure in under 2 hours
    • Impact: Model deployed 95% faster, freeing the team to work on sophisticated fraud pattern analysis

    Healthcare: Patient Risk Stratification

    Challenge: A hospital system wants to identify high-risk patients for proactive intervention.

    Traditional Approach: Clinical data analysts spend 8 weeks wrangling EHR data, building features from medical histories, and validating models against clinical outcomes.

    With Data Science Agent:

    • Prompt: “Create a patient risk stratification model using diagnoses, medications, lab results, and demographics. Focus on interpretability for clinical adoption.”
    • Result: Agent produces an explainable model with clinically meaningful features, SHAP explanations for each prediction, and validation against established risk scores
    • Impact: Clinicians trust the model due to transparency, leading to 40% adoption rate vs. typical 15%

    Retail: Customer Lifetime Value Prediction

    Challenge: An e-commerce company needs to predict customer lifetime value to optimize marketing spend.

    Traditional Approach: Marketing analytics team collaborates with data scientists for 5 weeks, iterating on feature definitions and model approaches.

    With Data Science Agent:

    • Prompt: “Predict 12-month customer lifetime value using purchase history, browsing behavior, and demographic data. Segment customers into high/medium/low value tiers.”
    • Result: Agent delivers a complete CLV model with customer segmentation, propensity scores, and a dashboard for marketing teams
    • Impact: Marketing ROI improves 32% through better targeting, model built 90% faster

    Manufacturing: Predictive Maintenance

    Challenge: A manufacturer wants to predict equipment failures before they occur to minimize downtime.

    Traditional Approach: Engineers and data scientists spend 7 weeks analyzing sensor data, building time-series features, and testing various forecasting approaches.

    With Data Science Agent:

    • Prompt: “Build a predictive maintenance model using sensor telemetry, maintenance logs, and operational data. Predict failures 24-48 hours in advance.”
    • Result: Agent creates a time-series model with automated feature extraction from streaming data, anomaly detection, and failure prediction
    • Impact: Unplanned downtime reduced by 28%, maintenance costs decreased by 19%

    Technical Architecture

    Integration with Snowflake Ecosystem

    The Data Science Agent operates natively within Snowflake’s architecture:

    Snowflake Data Science Agent architecture and ecosystem integration

    Snowflake Notebooks: All code generation and execution happens in collaborative notebooks

    Snowpark Python: Leverages Snowflake’s Python runtime for distributed computing

    Data Governance: Inherits existing row-level security, masking, and access controls

    Cortex AI Suite: Integrates with Cortex Analyst, Search, and AISQL for comprehensive AI capabilities

    ML Jobs: Automates model training, scheduling, and monitoring at scale

    How It Works: Behind the Scenes

    When a data scientist provides a natural language prompt:

    🔍 Step 1: Understanding

    • Claude analyzes the request, identifying ML task type, success metrics, and constraints
    • Agent queries Snowflake’s catalog to discover relevant tables and understand schema

    🧠 Step 2: Planning

    • Generates a step-by-step execution plan covering data prep, modeling, and evaluation
    • Identifies required Snowflake features and libraries

    💻 Step 3: Code Generation

    • Creates executable Python code for each pipeline stage
    • Includes data validation, error handling, and logging

    🚀 Step 4: Execution

    • Runs generated code in Snowflake Notebooks with full visibility
    • Provides real-time progress updates and intermediate results

    📊 Step 5: Evaluation

    • Generates comprehensive model diagnostics and performance metrics
    • Recommends next steps based on results

    🔁 Step 6: Iteration

    • Accepts follow-up prompts to refine the model
    • Tracks changes and maintains version history

    Best Practices for Using Data Science Agent

    1. Write Clear, Specific Prompts

    Poor Prompt: “Build a model for sales”

    Good Prompt: “Create a weekly sales forecasting model for retail stores using historical sales, promotions, weather, and holidays. Optimize for MAPE under 10%. Include confidence intervals.”

    The more context you provide, the better the agent performs.

    2. Start with Business Context

    Begin prompts with the business problem and success criteria:

    • What decision will this model inform?
    • What accuracy is acceptable?
    • What are the cost/benefit tradeoffs?
    • Are there regulatory requirements?

    3. Iterate Incrementally

    Don’t expect perfection on the first generation. Use follow-up prompts:

    • “Add feature importance analysis”
    • “Try a gradient boosting approach”
    • “Optimize for faster inference time”
    • “Add cross-validation with 5 folds”

    4. Review Generated Code

    While the agent produces high-quality code, always review:

    • Data preprocessing logic for business rule compliance
    • Feature engineering for domain appropriateness
    • Model selection justification
    • Performance metrics alignment with business goals

    5. Establish Governance Guardrails

    Define organizational standards:

    • Required documentation templates
    • Mandatory model validation steps
    • Approved algorithm lists for regulated industries
    • Data privacy and security checks

    6. Combine Agent Automation with Human Expertise

    Use the agent for:

    • Rapid prototyping and baseline models
    • Automated preprocessing and feature engineering
    • Hyperparameter tuning and model selection
    • Code documentation and testing

    Retain human control for:

    • Problem formulation and success criteria
    • Business logic validation
    • Ethical considerations and bias assessment
    • Strategic decision-making on model deployment

    Measuring ROI: The Business Impact

    Organizations adopting Data Science Agent report significant benefits:

    Time-to-Production Acceleration

    Before Agent: Average 8-12 weeks from concept to production model

    With Agent: Average 1-2 weeks from concept to production model

    Impact: 5-10x faster model development cycles

    Productivity Multiplication

    Before Agent: 2-3 models per data scientist per quarter

    With Agent: 8-12 models per data scientist per quarter

    Impact: 4x increase in model output, enabling more AI use cases

    Quality Improvements

    Before Agent: 40-60% of models reach production (many abandoned due to insufficient ROI)

    With Agent: 70-85% of models reach production (faster iteration enables more refinement)

    Impact: Higher model quality through rapid experimentation

    Cost Optimization

    Before Agent: $150K-200K average cost per model (personnel time, infrastructure)

    With Agent: $40K-60K average cost per model

    Impact: 70% reduction in model development costs

    Democratization of ML

    Before Agent: Only senior data scientists can build production models

    With Agent: Junior analysts and citizen data scientists can create sophisticated models

    Impact: 3-5x expansion of AI capability across organization


    Limitations and Considerations

    While powerful, Data Science Agent has important constraints:

    Current Limitations

    Preview Status: Still in private preview; features and capabilities evolving

    Scope Boundaries: Optimized for structured data ML; deep learning and computer vision require different approaches

    Domain Knowledge: Agent lacks specific industry expertise; users must validate business logic

    Complex Custom Logic: Highly specialized algorithms may require manual implementation

    Important Considerations

    Data Quality Dependency: Agent’s output quality directly correlates with input data quality—garbage in, garbage out still applies

    Computational Costs: Automated hyperparameter tuning can consume significant compute resources

    Over-Reliance Risk: Organizations must maintain ML expertise; agents augment, not replace, human judgment

    Regulatory Compliance: In highly regulated industries, additional human review and validation required

    Bias and Fairness: Automated feature engineering may perpetuate existing biases; fairness testing essential


    The Future of Data Science Agent

    Based on Snowflake’s roadmap and industry trends, expect these developments:

    Future of autonomous ML operations with Snowflake Data Science Agent

    Short-Term (2025-2026)

    General Availability: Broader access as private preview graduates to GA

    Expanded Model Types: Support for time series, recommendation systems, and anomaly detection

    AutoML Enhancements: More sophisticated algorithm selection and ensemble methods

    Deeper Integration: Tighter coupling with Snowflake ML Jobs and model registry

    Medium-Term (2026-2027)

    Multi-Modal Learning: Support for unstructured data (images, text, audio) alongside structured data

    Federated Learning: Distributed model training across data clean rooms

    Automated Monitoring: Self-healing models that detect drift and retrain automatically

    Natural Language Insights: Plain English explanations of model behavior for business users

    Long-Term Vision (2027+)

    Autonomous ML Operations: End-to-end model lifecycle management with minimal human intervention

    Cross-Domain Transfer Learning: Agents that leverage learnings across industries and use cases

    Collaborative Multi-Agent Systems: Specialized agents working together on complex problems

    Causal ML Integration: Moving beyond correlation to causal inference and counterfactual analysis


    Getting Started with Data Science Agent

    Prerequisites

    To leverage Data Science Agent, you need:

    Snowflake Account: Enterprise edition or higher with Cortex AI enabled

    Data Foundation: Structured data in Snowflake tables or views

    Clear Use Case: Well-defined business problem with success metrics

    User Permissions: Access to Snowflake Notebooks and Cortex features

    Request Access

    Data Science Agent is currently in private preview:

    1. Contact your Snowflake account team to express interest
    2. Complete the preview application with use case details
    3. Participate in onboarding and training sessions
    4. Join the preview community for best practices sharing

    Pilot Project Selection

    Choose an initial use case with these characteristics:

    High Business Value: Clear ROI and stakeholder interest

    Data Availability: Clean, accessible data in Snowflake

    Reasonable Complexity: Not trivial, but not your most difficult problem

    Failure Tolerance: Low risk if the model needs iteration

    Measurement Clarity: Easy to quantify success

    Success Metrics

    Track these KPIs to measure Data Science Agent impact:

    • Time from concept to production model
    • Number of models per data scientist per quarter
    • Percentage of models reaching production
    • Model performance metrics vs. baseline
    • Cost per model developed
    • Data scientist satisfaction scores

    Snowflake Data Science Agent vs. Competitors

    How It Compares

    Databricks AutoML:

    • Advantage: Tighter integration with governed data, no data movement
    • Trade-off: Databricks offers more deep learning capabilities

    Google Cloud AutoML:

    • Advantage: Runs on your data warehouse, no egress costs
    • Trade-off: Google has broader pre-trained model library

    Amazon SageMaker Autopilot:

    • Advantage: Simpler for SQL-first organizations
    • Trade-off: AWS has more deployment flexibility

    H2O.ai Driverless AI:

    • Advantage: Native Snowflake integration, better governance
    • Trade-off: H2O specializes in AutoML with more tuning options

    Why Choose Snowflake Data Science Agent?

    Data Gravity: Build ML where your data lives—no movement, no copies, no security risks

    Unified Platform: Single environment for data engineering, analytics, and ML

    Enterprise Governance: Leverage existing security, compliance, and access controls

    Ecosystem Integration: Works seamlessly with BI tools, notebooks, and applications

    Scalability: Automatic compute scaling without infrastructure management


    Conclusion: The Data Science Revolution Begins Now

    The Snowflake Data Science Agent represents more than a productivity tool—it’s a fundamental reimagining of how organizations build machine learning solutions. By automating the 60-80% of work that consumes data scientists’ time, it unleashes their potential to solve harder problems, explore more use cases, and deliver greater business impact.

    The transformation is already beginning. Organizations in the private preview report 5-10x faster model development, 4x increases in productivity, and democratization of ML capabilities across their teams. As Data Science Agent reaches general availability in late 2025, these benefits will scale across the entire Snowflake ecosystem.

    The question isn’t whether to adopt AI-assisted data science—it’s how quickly you can implement it to stay competitive.

    For data leaders, the opportunity is clear: accelerate AI initiatives, multiply data science team output, and tackle the backlog of use cases that were previously too expensive or time-consuming to address.

    For data scientists, the promise is equally compelling: spend less time on repetitive tasks and more time on creative problem-solving, strategic thinking, and high-impact analysis.

    The future of data science is agentic. The future of data science is here.


    Key Takeaways

    • Snowflake Data Science Agent automates 60-80% of routine ML development work
    • Announced June 3, 2025, at Snowflake Summit; currently in private preview
    • Powered by Anthropic’s Claude AI running securely within Snowflake
    • Transforms weeks of ML pipeline work into hours through natural language interaction
    • Generates production-quality code for data prep, modeling, tuning, and evaluation
    • Organizations report 5-10x faster model development and 4x productivity gains
    • Use cases span financial services, healthcare, retail, manufacturing, and more
    • Maintains Snowflake’s enterprise governance, security, and compliance controls
    • Best used for structured data ML; human expertise still essential for strategy
    • Expected general availability in late 2025 with continued capability expansion

  • Enterprise AI 2025: Snowflake MCP Links Agents to Data

    Enterprise AI 2025: Snowflake MCP Links Agents to Data

    Introduction: The Dawn of Context-Aware AI in Enterprise Data

    Enterprise AI is experiencing a fundamental shift in October 2025. Organizations are no longer satisfied with isolated AI tools that operate in silos. Instead, they’re demanding intelligent systems that understand context, access governed data securely, and orchestrate complex workflows across multiple platforms.

    Enter the Snowflake MCP Server—a groundbreaking managed service announced on October 2, 2025, that bridges the gap between AI agents and enterprise data ecosystems. By implementing the Model Context Protocol (MCP), Snowflake has created a standardized pathway for AI agents to interact with both proprietary company data and premium third-party datasets, all while maintaining enterprise-grade security and governance.

    This comprehensive guide explores how the Snowflake MCP Server is reshaping enterprise AI, what makes it different from traditional integrations, and how organizations can leverage this technology to build next-generation intelligent applications.


    What is the Model Context Protocol (MCP)?

    Before diving into Snowflake’s implementation, it’s essential to understand the Model Context Protocol itself.

    The Problem MCP Solves

    Historically, connecting AI agents to enterprise systems has been a fragmented nightmare. Each integration required custom development work, creating a web of point-to-point connections that were difficult to maintain, scale, and secure. Data teams spent weeks building bespoke integrations instead of focusing on innovation.

    Model Context Protocol architecture diagram showing AI agent connections

    The Model Context Protocol emerged as an industry solution to this chaos. Developed by Anthropic and rapidly adopted across the AI ecosystem, MCP provides a standardized interface for AI agents to connect with data sources, APIs, and services.

    Think of MCP as a universal adapter for AI agents—similar to how USB-C standardized device connections, MCP standardizes how AI systems interact with enterprise data platforms.

    Key Benefits of MCP

    Interoperability: AI agents from different vendors can access the same data sources using a common protocol

    Security: Centralized governance and access controls rather than scattered custom integrations

    Speed to Market: Reduces integration time from weeks to hours

    Vendor Flexibility: Organizations aren’t locked into proprietary ecosystems


    Snowflake MCP Server: Architecture and Core Components

    The Snowflake MCP Server represents a fully managed service that acts as a bridge between external AI agents and the Snowflake AI Data Cloud. Currently in public preview, it offers a sophisticated yet streamlined approach to agentic AI implementation.

    Snowflake MCP Server three-layer architecture with AI platforms, Cortex services, and data sources

    How the Architecture Works

    At its core, the Snowflake MCP Server connects three critical layers:

    Layer 1: External AI Agents and Platforms The server integrates with leading AI platforms including Anthropic Claude, Salesforce Agentforce, Cursor, CrewAI, Devin by Cognition, UiPath, Windsurf, Amazon Bedrock AgentCore, and more. This broad compatibility ensures organizations can use their preferred AI tools without vendor lock-in.

    Layer 2: Snowflake Cortex AI Services Within Snowflake, the MCP Server provides access to powerful Cortex capabilities:

    • Cortex Analyst for querying structured data using semantic models
    • Cortex Search for retrieving insights from unstructured documents
    • Cortex AISQL for AI-powered extraction and transcription
    • Data Science Agent for automated ML workflows

    Layer 3: Data Sources This includes both proprietary organizational data stored in Snowflake and premium third-party datasets from partners like MSCI, Nasdaq eVestment, FactSet, The Associated Press, CB Insights, and Deutsche Börse.

    The Managed Service Advantage

    Unlike traditional integrations that require infrastructure deployment and ongoing maintenance, the Snowflake MCP Server operates as a fully managed service. Organizations configure access through YAML files, define security policies, and the Snowflake platform handles all the operational complexity—from scaling to security patches.


    Cortex AI for Financial Services: The First Industry-Specific Implementation

    Snowflake launched the MCP Server alongside Cortex AI for Financial Services, demonstrating the practical power of this architecture with industry-specific capabilities.

    AI-powered financial analytics using Snowflake Cortex AI for investment decisions

    Why Financial Services First?

    The financial services industry faces unique challenges that make it an ideal testing ground for agentic AI:

    Data Fragmentation: Financial institutions operate with data scattered across trading systems, risk platforms, customer databases, and market data providers

    Regulatory Requirements: Strict compliance and audit requirements demand transparent, governed data access

    Real-Time Decisioning: Investment decisions, fraud detection, and customer service require instant access to both structured and unstructured data

    Third-Party Dependencies: Financial analysis requires combining proprietary data with market research, news feeds, and regulatory filings

    Key Use Cases Enabled

    Investment Analytics: AI agents can analyze portfolio performance by combining internal holdings data from Snowflake with real-time market data from Nasdaq, research reports from FactSet, and breaking news from The Associated Press—all through natural language queries.

    Claims Management: Insurance companies can process claims by having AI agents retrieve policy documents (unstructured), claims history (structured), and fraud pattern analysis—orchestrating across Cortex Search and Cortex Analyst automatically.

    Customer Service: Financial advisors can query “What’s the risk profile of client portfolios exposed to European tech stocks?” and receive comprehensive answers that pull from multiple data sources, with full audit trails maintained.

    Regulatory Compliance: Compliance teams can ask questions about exposure limits, trading patterns, or risk concentrations, and AI agents will navigate the appropriate data sources while respecting role-based access controls.


    Technical Deep Dive: How to Implement Snowflake MCP Server

    For data engineers and architects planning implementations, understanding the technical setup is crucial.

    Snowflake MCP Server configuration interface showing service definitions

    Configuration Basics

    The Snowflake MCP Server uses YAML configuration files to define available services and access controls. Here’s what a typical configuration includes:

    Service Definitions: Specify which Cortex Analyst semantic models, Cortex Search services, and other tools should be exposed to AI agents

    Security Policies: Define SQL statement permissions to control what operations agents can perform (SELECT, INSERT, UPDATE, etc.)

    Connection Parameters: Configure authentication methods including OAuth, personal access tokens, or service accounts

    Tool Descriptions: Provide clear, descriptive text for each exposed service to help AI agents select the appropriate tool for each task

    Integration with AI Platforms

    Connecting external platforms to the Snowflake MCP Server follows a standardized pattern:

    For platforms like Claude Desktop or Cursor, developers add the Snowflake MCP Server to their configuration file, specifying the connection details and authentication credentials. The MCP client then automatically discovers available tools and makes them accessible to the AI agent.

    For custom applications using frameworks like CrewAI or LangChain, developers leverage MCP client libraries to establish connections programmatically, enabling sophisticated multi-agent workflows.

    Security and Governance

    One of the most compelling aspects of the Snowflake MCP Server is that it maintains all existing Snowflake security controls:

    Enterprise-grade security architecture for Snowflake MCP Server AI agents

    Data Never Leaves Snowflake: Unlike traditional API integrations that extract data, processing happens within Snowflake’s secure perimeter

    Row-Level Security: Existing row-level security policies automatically apply to agent queries

    Audit Logging: All agent interactions are logged for compliance and monitoring

    Role-Based Access: Agents operate under defined Snowflake roles with specific privileges
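
    To make this concrete, here is a minimal sketch of how ordinary Snowflake governance objects would also govern agent-issued queries, since agents run under standard Snowflake roles; the table, schema, role, and policy names are hypothetical.

    -- A row access policy that limits which rows a non-privileged role can see
    CREATE OR REPLACE ROW ACCESS POLICY region_policy
      AS (region STRING) RETURNS BOOLEAN ->
        CURRENT_ROLE() = 'GLOBAL_ANALYST_ROLE' OR region = 'EMEA';

    ALTER TABLE portfolio_holdings ADD ROW ACCESS POLICY region_policy ON (region);

    -- The MCP Server connects under a dedicated role with narrowly scoped privileges,
    -- so every agent query inherits these controls automatically
    CREATE ROLE IF NOT EXISTS mcp_agent_role;
    GRANT USAGE ON DATABASE finance_db TO ROLE mcp_agent_role;
    GRANT USAGE ON SCHEMA finance_db.analytics TO ROLE mcp_agent_role;
    GRANT SELECT ON ALL TABLES IN SCHEMA finance_db.analytics TO ROLE mcp_agent_role;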


    Agentic AI Workflows: From Theory to Practice

    Understanding agentic AI workflows is essential to appreciating the Snowflake MCP Server’s value proposition.

    What Makes AI “Agentic”?

    Traditional AI systems respond to single prompts with single responses. Agentic AI systems, by contrast, can:

    Plan Multi-Step Tasks: Break complex requests into sequential subtasks

    Use Tools Dynamically: Select and invoke appropriate tools based on the task at hand

    Reflect and Iterate: Evaluate results and adjust their approach

    Maintain Context: Remember previous interactions within a session

    How Snowflake Enables Agentic Workflows

    The Snowflake MCP Server enables true agentic behavior through Cortex Agents, which orchestrate across both structured and unstructured data sources.

    Example Workflow: Market Analysis Query

    When a user asks, “How has our semiconductor portfolio performed compared to industry trends this quarter, and what are analysts saying about the sector?”

    The agent plans a multi-step approach:

    1. Query Cortex Analyst to retrieve portfolio holdings and performance metrics (structured data)
    2. Search Cortex Search for analyst reports and news articles about semiconductors (unstructured data)
    3. Cross-reference findings with third-party market data from partners like MSCI
    4. Synthesize a comprehensive response with citations

    Each step respects data governance policies, and the entire workflow completes within seconds—a task that would traditionally take multiple analysts hours or days.


    Open Semantic Interchange: The Missing Piece of the AI Puzzle

    While the Snowflake MCP Server solves the connection problem, the Open Semantic Interchange (OSI) initiative addresses an equally critical challenge: semantic consistency.

    Open Semantic Interchange creating universal semantic data standards

    The Semantic Fragmentation Problem

    Enterprise organizations typically define the same business metrics differently across systems. “Revenue” might include different line items in the data warehouse versus the BI tool versus the AI model. This semantic fragmentation undermines trust in AI insights and creates the “$1 trillion AI problem”—the massive cost of data preparation and reconciliation.

    How OSI Complements MCP

    Announced on September 23, 2025, alongside the MCP Server development, OSI is an open-source initiative led by Snowflake, Salesforce, BlackRock, and dbt Labs. It creates a vendor-neutral specification for semantic metadata—essentially a universal language for business concepts.

    When combined with MCP, OSI ensures that AI agents not only can access data (via MCP) but also understand what that data means (via OSI). A query about “quarterly revenue” will use the same definition whether the agent is accessing Snowflake, Tableau, or a custom ML model.


    Industry Impact: Who Benefits from Snowflake MCP Server?

    While initially focused on financial services, the Snowflake MCP Server has broad applicability across industries.

    Healthcare and Life Sciences

    Clinical Research: Combine patient data (structured EHR) with medical literature (unstructured documents) for drug discovery

    Population Health: Analyze claims data alongside social determinants of health from third-party sources

    Regulatory Submissions: AI agents can compile submission packages by accessing clinical trial data, adverse event reports, and regulatory guidance documents

    Retail and E-Commerce

    Customer Intelligence: Merge transaction data with customer service transcripts and social media sentiment

    Supply Chain Optimization: Agents can analyze inventory levels, supplier performance data, and market demand signals from external sources

    Personalization: Create hyper-personalized shopping experiences by combining browsing behavior, purchase history, and trend data

    Manufacturing

    Predictive Maintenance: Combine sensor data from IoT devices with maintenance logs and parts inventory

    Quality Control: Analyze production metrics alongside inspection reports and supplier certifications

    Supply Chain Resilience: Monitor supplier health by combining internal order data with external financial and news data


    Implementation Best Practices

    For organizations planning to implement the Snowflake MCP Server, following best practices ensures success.

    Start with Clear Use Cases

    Begin with specific, high-value use cases rather than attempting a broad rollout. Identify workflows where combining structured and unstructured data creates measurable business value.

    Invest in Semantic Modeling

    The quality of Cortex Analyst responses depends heavily on well-defined semantic models. Invest time in creating comprehensive semantic layers using tools like dbt or directly in Snowflake.

    Establish Governance Early

    Define clear policies about which data sources agents can access, what operations they can perform, and how results should be logged and audited.

    Design for Explainability

    Configure agents to provide citations and reasoning for their responses. This transparency builds user trust and satisfies regulatory requirements.

    Monitor and Iterate

    Implement monitoring to track agent performance, query patterns, and user satisfaction. Use these insights to refine configurations and expand capabilities.


    Challenges and Considerations

    While powerful, the Snowflake MCP Server introduces considerations that organizations must address.

    Cost Management

    AI agent queries can consume significant compute resources, especially when orchestrating across multiple data sources. Implement query optimization, caching strategies, and resource monitoring to control costs.

    Data Quality Dependencies

    Agents are only as good as the data they access. Poor data quality, incomplete semantic models, or inconsistent definitions will produce unreliable results.

    Skills Gap

    Successfully implementing agentic AI requires skills in data engineering, AI/ML, and domain expertise. Organizations may need to invest in training or hire specialized talent.

    Privacy and Compliance

    While Snowflake provides robust security controls, organizations must ensure that agent behaviors comply with privacy regulations like GDPR, especially when combining internal and external data sources.


    The Future of Snowflake MCP Server

    Based on current trends and Snowflake’s product roadmap announcements, several developments are likely:

    Future of enterprise AI with collaborative agentic systems powered by Snowflake

    Expanded Industry Packs

    Following financial services, expect industry-specific Cortex AI suites for healthcare, retail, manufacturing, and public sector with pre-configured connectors and semantic models.

    Enhanced Multi-Agent Orchestration

    Future versions will likely support more sophisticated agent crews that can collaborate on complex tasks, with specialized agents for different domains working together.

    Deeper OSI Integration

    As the Open Semantic Interchange standard matures, expect tighter integration that makes semantic consistency automatic rather than requiring manual configuration.

    Real-Time Streaming

    Current implementations focus on batch and interactive queries. Future versions may incorporate streaming data sources for real-time agent responses.


    Conclusion: Embracing the Agentic AI Revolution

    The Snowflake MCP Server represents a pivotal moment in enterprise AI evolution. By standardizing how AI agents access data through the Model Context Protocol, Snowflake has removed one of the primary barriers to agentic AI adoption—integration complexity.

    Combined with powerful Cortex AI capabilities and participation in the Open Semantic Interchange initiative, Snowflake is positioning itself at the center of the agentic AI ecosystem. Organizations that embrace this architecture now will gain significant competitive advantages in speed, flexibility, and AI-driven decision-making.

    The question is no longer whether to adopt agentic AI, but how quickly you can implement it effectively. With the Snowflake MCP Server now in public preview, the opportunity to lead in your industry is here.

    Ready to get started? Explore the Snowflake MCP Server documentation, identify your highest-value use cases, and join the growing community of organizations building the future of intelligent, context-aware enterprise applications.


    Key Takeaways

    • The Snowflake MCP Server launched October 2, 2025, as a managed service connecting AI agents to enterprise data
    • Model Context Protocol provides a standardized interface for agentic AI integrations
    • Cortex AI for Financial Services demonstrates practical applications with industry-specific capabilities
    • Organizations can connect platforms like Anthropic Claude, Salesforce Agentforce, and Cursor to Snowflake data
    • The Open Semantic Interchange initiative ensures AI agents understand data semantics consistently
    • Security and governance controls remain intact with all processing happening within Snowflake
    • Early adoption provides competitive advantages in AI-driven decision-making
  • Snowflake Query Optimization in 2025

    Snowflake Query Optimization in 2025

    Snowflake is renowned for its incredible performance, but as data scales into terabytes and petabytes, no platform is immune to a slow-running query. For a data engineer, mastering Snowflake query optimization is the difference between building an efficient, cost-effective data platform and one that burns through credits and frustrates users.

    In 2025, the principles of query optimization remain the same, but the tools and techniques have matured. It’s no longer just about warehouse size; it’s about understanding the query execution plan, leveraging micro-partitions, and writing smarter SQL.

    This guide will walk you through the essential strategies and best practices for Snowflake query optimization, moving from the foundational tools to advanced, real-world techniques.

    The Golden Rule: Always Start with the Query Profile

    Before you change a single line of code, your first and most important step is to analyze the Query Profile. This tool is your window into how Snowflake’s cloud services layer executes your query.

    To access it, go to the History tab in the Snowsight UI, find your query, and click on the Query ID.

    What to look for:

    • Operator Time: Which steps in the execution plan are taking the most time? Is it a TableScan, a Join, or a Sort operation?
    • Partition Pruning: How many partitions is the TableScan reading versus the total partitions in the table? If it’s scanning a high percentage, your pruning is ineffective.
    • Spilling: Is Snowflake spilling data to local or remote storage? Spilling to remote storage is a major performance killer.

    For a detailed walkthrough, check out Snowflake’s official documentation on Using Query Profile.
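
    If you prefer to find candidates programmatically, the account usage views expose the same signals. Here is a minimal sketch that assumes access to the SNOWFLAKE.ACCOUNT_USAGE share (its data lags by up to ~45 minutes):

    -- Surface the heaviest queries of the last 7 days, with pruning and spilling signals
    SELECT
      query_id,
      warehouse_name,
      total_elapsed_time / 1000 AS elapsed_seconds,
      partitions_scanned,
      partitions_total,                   -- a high scanned/total ratio means poor pruning
      bytes_spilled_to_local_storage,
      bytes_spilled_to_remote_storage     -- any remote spilling is a red flag
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
      AND execution_status = 'SUCCESS'
    ORDER BY total_elapsed_time DESC
    LIMIT 20;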

    1. Right-Size Your Warehouse (But It’s Not a Silver Bullet)

    It’s tempting to throw a larger warehouse at a slow query, but this isn’t always the right or most cost-effective answer.

    • Scale Up (Increase Size): Do this when a single, complex query is slow due to heavy computations like large joins, sorts, or aggregations. A larger warehouse provides more memory and local SSD, which can prevent spilling.
    • Scale Out (Add Clusters): Use a multi-cluster warehouse for high-concurrency scenarios (e.g., a BI dashboard). This won’t make a single query faster, but it will handle more queries at once.

    Best Practice: Don’t use a larger warehouse to compensate for a poorly written query. Always try to optimize the query and data structure first.
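
    Both adjustments are single statements. A quick sketch (warehouse names are illustrative; multi-cluster warehouses require Enterprise Edition or higher):

    -- Scale up for a heavy transformation window, then size back down afterwards
    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

    -- Scale out for concurrency: auto-scale between 1 and 4 clusters
    ALTER WAREHOUSE bi_wh SET
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'STANDARD';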

    2. Master Clustering for Effective Pruning

    This is the most impactful technique for optimizing queries on large tables. A Clustering Key reorganizes your data into co-located micro-partitions based on the key you define. This allows Snowflake to prune (ignore) massive amounts of data that aren’t relevant to your query’s WHERE clause.

    When to Cluster:

    • On tables larger than a terabyte.
    • When you frequently filter or join on a high-cardinality column (e.g., user_id, event_timestamp).

    Example:

    SQL

    -- A query on an un-clustered 100TB table
    SELECT * FROM event_logs WHERE user_id = 'a1b2-c3d4';
    -- This might scan 50% of the table's partitions.
    
    -- Cluster the table
    ALTER TABLE event_logs CLUSTER BY (user_id);
    
    -- After reclustering, the same query might scan less than 0.01% of partitions.
    SELECT * FROM event_logs WHERE user_id = 'a1b2-c3d4';
    

    The Credit Karma engineering blog has an excellent real-world case study on how they used clustering to dramatically reduce costs.

    3. Avoid Spilling at All Costs

    Spilling occurs when Snowflake runs out of memory for an operation and has to write intermediate results to storage. Spilling to local SSD is okay, but spilling to remote storage is a performance disaster.

    How to Fix It:

    1. Increase Warehouse Size: This is the most direct solution, as it provides more memory.
    2. Optimize the Query: Reduce the data being processed. Filter early in your query (using WHERE clauses) and select only the columns you need.
    3. Use APPROX_COUNT_DISTINCT: For large-scale distinct counts, using an approximation function can be orders of magnitude faster and use less memory than COUNT(DISTINCT ...).
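
    As a quick illustration of point 3, swap the exact distinct count for its approximate counterpart (illustrative table and column names):

    -- Exact distinct count: memory-hungry on very large tables and prone to spilling
    SELECT COUNT(DISTINCT user_id) FROM event_logs;

    -- HyperLogLog approximation: typically within ~2% of the true value, far less memory
    SELECT APPROX_COUNT_DISTINCT(user_id) FROM event_logs;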

    4. Write Smarter SQL

    Sometimes, the best optimization is simply writing a better query.

    • Filter Early, Join Late: Apply your WHERE clause filters to your largest tables before you join them. You can do this with a subquery or a Common Table Expression (CTE); see the sketch after this list.
    • Reduce Data Movement: In joins, join on columns with the same data type. If you join a STRING to a NUMBER, Snowflake has to cast the data on the fly, which can slow things down.
    • Leverage Specific Functions: Use optimized functions like MATCH_RECOGNIZE for sequential pattern matching or FLATTEN for parsing semi-structured data instead of writing complex, self-joining SQL.
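
    A minimal sketch of the filter-early pattern with a CTE (table and column names are hypothetical):

    -- Reduce the large table first, then join the smaller result
    WITH recent_orders AS (
      SELECT order_id, customer_id, order_total
      FROM orders
      WHERE order_date >= DATEADD(day, -30, CURRENT_DATE)  -- filter before the join
    )
    SELECT c.customer_id, c.region, SUM(o.order_total) AS total_30d
    FROM recent_orders o
    JOIN customers c
      ON c.customer_id = o.customer_id                     -- matching data types on both sides
    GROUP BY c.customer_id, c.region;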

    5. Use Materialized Views for Repetitive Queries

    If you have a complex query that runs frequently on data that doesn’t change often, a Materialized View can be a game-changer. It pre-computes and stores the query result, so subsequent queries are just reading the stored results, which is incredibly fast.

    Best For:

    • BI dashboards that query complex aggregations.
    • Queries on large, slowly changing datasets.

    The Snowflake Community offers great primers and discussions on when to best use this feature.
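
    A minimal sketch of such a view (illustrative names; note that materialized views are an Enterprise Edition feature and support only a single source table with a limited set of aggregate functions):

    -- Pre-compute a heavy daily aggregation once; subsequent reads hit the stored result
    CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT
      sale_date,
      region,
      SUM(amount) AS total_sales,
      COUNT(*) AS order_count
    FROM sales
    GROUP BY sale_date, region;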

    Conclusion

    Snowflake query optimization in 2025 is a multi-faceted discipline. It starts with a deep analysis of the Query Profile and extends to making intelligent choices about your warehouse sizing, data clustering, and SQL patterns. By moving beyond brute force and adopting these strategic techniques, you can build a Snowflake environment that is not only lightning-fast but also highly cost-effective.

  • 5 Advanced Techniques for Optimizing Snowflake MERGE Queries

    5 Advanced Techniques for Optimizing Snowflake MERGE Queries

    Snowflake MERGE statements are powerful tools for upserting data, but poor optimization can lead to massive performance bottlenecks. If your MERGE queries are taking hours instead of minutes, you’re not alone. In this comprehensive guide, we’ll explore five advanced techniques to optimize Snowflake MERGE queries and achieve up to 10x performance improvements.

    Understanding Snowflake MERGE Performance Challenges

    Before diving into optimization techniques, it’s crucial to understand why MERGE queries often become performance bottlenecks. Snowflake’s MERGE operation combines INSERT, UPDATE, and DELETE logic into a single statement, which involves scanning both source and target tables, matching records, and applying changes.

    The primary performance challenges include:

    • Full table scans on large target tables
    • Inefficient join conditions between source and target
    • Poor micro-partition pruning
    • Lack of proper clustering on merge keys
    • Excessive data movement across compute nodes

    Technique 1: Leverage Clustering Keys for MERGE Operations

    Clustering keys are Snowflake’s secret weapon for optimizing MERGE queries. By defining clustering keys on your merge columns, you enable aggressive micro-partition pruning, dramatically reducing the data scanned during operations.

    Visual representation of Snowflake clustering keys organizing data for optimal query performance

    Implementation Strategy

    -- Define clustering key on the primary merge column
    ALTER TABLE target_table 
    CLUSTER BY (customer_id, transaction_date);
    
    -- Verify clustering quality
    SELECT SYSTEM$CLUSTERING_INFORMATION('target_table', 
      '(customer_id, transaction_date)');
    

    Clustering keys work by co-locating rows with similar key values in the same micro-partitions. When Snowflake processes a MERGE query, it uses clustering metadata to skip entire micro-partitions that don’t contain matching keys. You can learn more about clustering keys in the official Snowflake documentation.

    Best Practices for Clustering

    • Choose high-cardinality columns that appear in MERGE JOIN conditions
    • Limit clustering keys to 3-4 columns maximum for optimal performance
    • Monitor clustering depth regularly using SYSTEM$CLUSTERING_DEPTH
    • Consider reclustering if depth exceeds 4-5 levels

    Pro Tip: Clustering incurs automatic maintenance costs. Use it strategically on tables with frequent MERGE operations and clear access patterns.
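
    To keep an eye on that maintenance trade-off, you can spot-check clustering depth on the merge keys (lower depth is better):

    -- Average overlap depth for the clustered columns
    SELECT SYSTEM$CLUSTERING_DEPTH('target_table', '(customer_id, transaction_date)');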

    Technique 2: Optimize MERGE Predicates with Selective Filtering

    One of the most effective ways to optimize Snowflake MERGE performance is by adding selective predicates that reduce the data set before the merge operation begins. This technique, called predicate pushdown optimization, allows Snowflake to prune unnecessary data early in query execution.

    Basic vs Optimized MERGE

    -- UNOPTIMIZED: Scans entire target table
    MERGE INTO target_table t
    USING source_table s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (id, status) VALUES (s.id, s.status);
    
    -- OPTIMIZED: Adds selective predicates
    MERGE INTO target_table t
    USING (
      SELECT * FROM source_table 
      WHERE update_date >= CURRENT_DATE - 7
    ) s
    ON t.id = s.id 
       AND t.region = s.region
       AND t.update_date >= CURRENT_DATE - 7
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (id, status, region) VALUES (s.id, s.status, s.region);
    

    The optimized version adds three critical improvements: it filters the source data to only recent records, adds partition-aligned predicates (the region column), and applies a matching filter to the target table.

    Predicate Selection Guidelines

    Predicate Type   | Performance Impact | Use Case
    Date Range       | High               | Incremental loads with time-based partitioning
    Partition Key    | Very High          | Multi-tenant or geographically distributed data
    Status Flag      | Medium             | Processing only changed or active records
    Existence Check  | High               | Skipping already processed data

    Technique 3: Exploit Micro-Partition Pruning

    Snowflake stores data in immutable micro-partitions (each holding roughly 50-500 MB of uncompressed data). Understanding how to leverage micro-partition metadata is essential for MERGE optimization.

    Snowflake data architecture diagram illustrating micro-partition structure

    Micro-Partition Pruning Strategies

    Snowflake maintains metadata for each micro-partition including min/max values, distinct counts, and null counts for all columns. By structuring your MERGE conditions to align with this metadata, you enable aggressive pruning.

    -- Check table storage footprint (TABLE_STORAGE_METRICS is a view, not a table function)
    SELECT table_name, active_bytes, time_travel_bytes
    FROM INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
    WHERE table_name = 'TARGET_TABLE'
      AND active_bytes > 0;
    
    -- Inspect micro-partition counts and clustering quality on the merge keys
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(transaction_id, sale_date)');
    
    -- Optimized MERGE with partition-aligned predicates
    MERGE INTO sales_fact t
    USING (
      SELECT 
        transaction_id,
        customer_id,
        sale_date,
        amount
      FROM staging_sales
      WHERE sale_date BETWEEN '2025-01-01' AND '2025-01-31'
        AND customer_id IS NOT NULL
    ) s
    ON t.transaction_id = s.transaction_id
       AND t.sale_date = s.sale_date
    WHEN MATCHED THEN UPDATE SET amount = s.amount
    WHEN NOT MATCHED THEN INSERT VALUES (s.transaction_id, s.customer_id, s.sale_date, s.amount);
    

    Maximizing Pruning Efficiency

    • Always include clustering key columns in MERGE ON conditions
    • Use equality predicates when possible (more effective than ranges)
    • Avoid function transformations on join columns (prevents metadata usage)
    • Leverage Snowflake’s automatic clustering for large tables

    Warning: Using functions like UPPER(), TRIM(), or CAST() on merge key columns disables micro-partition pruning. Apply transformations in the source subquery instead.
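
    For example, a normalization like UPPER() belongs in the source subquery, not in the ON clause (illustrative table names):

    -- Avoid: a function on the target column disables pruning
    -- ON UPPER(t.customer_code) = UPPER(s.customer_code)
    
    -- Prefer: normalize inside the source subquery and compare against the raw target column
    MERGE INTO customer_dim t
    USING (
      SELECT UPPER(customer_code) AS customer_code, status
      FROM staging_customers
    ) s
    ON t.customer_code = s.customer_code
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (customer_code, status) VALUES (s.customer_code, s.status);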

    Technique 4: Implement Incremental MERGE Patterns

    Rather than processing entire tables, implement incremental MERGE patterns that only handle changed data. This approach combines multiple optimization techniques for maximum performance.

    Change Data Capture (CDC) MERGE Pattern

    -- Step 1: Create a stream on the staging table to capture changes
    CREATE OR REPLACE STREAM staging_changes ON TABLE staging_table;
    
    -- Step 2: Create a change tracking view over the stream
    -- (METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID are stream metadata columns;
    --  updates surface as DELETE + INSERT pairs with METADATA$ISUPDATE = TRUE)
    CREATE OR REPLACE VIEW recent_changes AS
    SELECT 
      s.*,
      METADATA$ACTION as cdc_action,
      METADATA$ISUPDATE as is_update,
      METADATA$ROW_ID as row_identifier
    FROM staging_changes s
    WHERE METADATA$ACTION = 'INSERT';
    
    -- Step 3: Execute incremental MERGE
    MERGE INTO dimension_table t
    USING recent_changes s
    ON t.business_key = s.business_key
    WHEN MATCHED AND s.is_update = TRUE
      THEN UPDATE SET 
        t.attribute1 = s.attribute1,
        t.attribute2 = s.attribute2,
        t.last_modified = s.update_timestamp
    WHEN NOT MATCHED 
      THEN INSERT (business_key, attribute1, attribute2, created_date)
      VALUES (s.business_key, s.attribute1, s.attribute2, s.update_timestamp);
    

    Batch Processing Strategy

    For very large datasets, implement batch processing with partition-aware MERGE. Learn more about data pipeline best practices in Snowflake.

    -- Create processing batches
    CREATE OR REPLACE TABLE merge_batches AS
    SELECT DISTINCT
      DATE_TRUNC('day', event_date) as partition_date,
      MOD(ABS(HASH(customer_id)), 10) as batch_number
    FROM source_data
    WHERE processed_flag = FALSE;
    
    -- Process in batches (use stored procedure for actual implementation)
    MERGE INTO target_table t
    USING (
      SELECT * FROM source_data
      WHERE DATE_TRUNC('day', event_date) = '2025-01-15'
        AND MOD(ABS(HASH(customer_id)), 10) = 0
    ) s
    ON t.customer_id = s.customer_id 
       AND t.event_date = s.event_date
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.event_date, s.amount);
    

    Technique 5: Optimize Warehouse Sizing and Query Profile

    Proper warehouse configuration can dramatically impact MERGE performance. Understanding the relationship between data volume, complexity, and compute resources is crucial.

    Warehouse Sizing Guidelines for MERGE

    Data Volume         | Recommended Size | Expected Performance
    Less than 1M rows   | X-Small to Small | Less than 30 seconds
    1M – 10M rows       | Small to Medium  | 1-5 minutes
    10M – 100M rows     | Medium to Large  | 5-15 minutes
    More than 100M rows | Large to X-Large | 15-60 minutes

    Query Profile Analysis

    Always analyze your MERGE queries using Snowflake’s Query Profile to identify bottlenecks:

    -- Get query ID for recent MERGE
    SELECT query_id, query_text, execution_time
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_text ILIKE '%MERGE INTO target_table%'
    ORDER BY start_time DESC
    LIMIT 1;
    
    -- Analyze operator-level statistics for that query (a programmatic Query Profile)
    SELECT *
    FROM TABLE(GET_QUERY_OPERATOR_STATS('your-query-id-here'));
    

    Performance Monitoring Queries

    -- Monitor MERGE performance over time
    SELECT 
      DATE_TRUNC('hour', start_time) as hour,
      COUNT(*) as merge_count,
      AVG(execution_time)/1000 as avg_seconds,
      SUM(bytes_scanned)/(1024*1024*1024) as total_gb_scanned,
      AVG(credits_used_cloud_services) as avg_credits
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_text ILIKE '%MERGE INTO%'
      AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
    GROUP BY 1
    ORDER BY 1 DESC;
    

    Real-World Performance Comparison

    To demonstrate the impact of these techniques, here’s a real-world comparison of MERGE performance optimizations on a 50 million row table:

    Snowflake query performance metrics dashboard showing execution time improvements
    Optimization Applied       | Execution Time | Data Scanned | Cost Reduction
    Baseline (no optimization) | 45 minutes     | 2.5 TB       | N/A
    + Clustering Keys          | 18 minutes     | 850 GB       | 60%
    + Selective Predicates     | 8 minutes      | 320 GB       | 82%
    + Incremental Pattern      | 4 minutes      | 180 GB       | 91%
    + Optimized Warehouse      | 2.5 minutes    | 180 GB       | 94%

    Common Pitfalls to Avoid

    Even with optimization techniques, several common mistakes can sabotage MERGE performance:

    1. Over-Clustering

    Using too many clustering keys or clustering on low-cardinality columns creates overhead without benefits. Stick to 3-4 high-cardinality columns that align with your MERGE patterns.

    2. Ignoring Data Skew

    Uneven data distribution causes some micro-partitions to be much larger than others, leading to processing bottlenecks. Monitor and address skew with better partitioning strategies.

    3. Full Table MERGE Without Filters

    Always apply predicates to limit the scope of MERGE operations. Even on small tables, unnecessary full scans waste resources.

    4. Improper Transaction Sizing

    Very large single transactions can timeout or consume excessive resources. Break large MERGE operations into manageable batches.

    Monitoring and Continuous Optimization

    MERGE optimization is not a one-time activity. Implement continuous monitoring to maintain performance as data volumes grow:

    -- Create monitoring dashboard query
    CREATE OR REPLACE VIEW merge_performance_dashboard AS
    SELECT 
      DATE_TRUNC('day', start_time) as execution_date,
      REGEXP_SUBSTR(query_text, 'MERGE INTO (\\w+)', 1, 1, 'e') as target_table,
      COUNT(*) as execution_count,
      AVG(execution_time)/1000 as avg_execution_seconds,
      MAX(execution_time)/1000 as max_execution_seconds,
      AVG(bytes_scanned)/(1024*1024*1024) as avg_gb_scanned,
      SUM(credits_used_cloud_services) as total_credits
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_type = 'MERGE'
      AND start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP)
    GROUP BY 1, 2
    ORDER BY 1 DESC, 3 DESC;
    

    Conclusion and Next Steps

    Optimizing Snowflake MERGE queries requires a multi-faceted approach combining clustering keys, selective predicates, micro-partition pruning, incremental patterns, and proper warehouse sizing. By implementing these five advanced techniques, you can achieve 10x or greater performance improvements while reducing costs significantly.

    Key Takeaways

    • Define clustering keys on merge columns for aggressive pruning
    • Add selective predicates to reduce data scanned before merging
    • Leverage micro-partition metadata with partition-aligned conditions
    • Implement incremental MERGE patterns using CDC or batch processing
    • Right-size warehouses and monitor performance continuously

    Start by analyzing your current MERGE queries using Query Profile, identify the biggest bottlenecks, and apply these techniques incrementally. Monitor the impact and iterate based on your specific data patterns and workload characteristics.

    For more Snowflake optimization techniques, check out the official Snowflake performance optimization guide and explore Snowflake Community discussions for real-world insights.

  • How to Implement Dynamic Data Masking in Snowflake

    How to Implement Dynamic Data Masking in Snowflake

    In today’s data-driven world, providing access to data is undoubtedly crucial. However, what happens when that data contains sensitive Personally Identifiable Information (PII) like emails, phone numbers, or credit card details? Clearly, you can’t just grant open access. For this reason, dynamic data masking in Snowflake becomes an essential tool for modern data governance.

    Specifically, Dynamic Data Masking allows you to protect sensitive data by masking it in real-time within your query results, based on the user’s role. Crucially, the underlying data in the table remains unchanged; instead, only the query result is masked. As a result, your data analysts can run queries on a production table without ever seeing the raw sensitive information.

    With that in mind, this guide will walk you through the practical steps of creating and applying a masking policy in Snowflake to protect your PII.

    The Scenario: Protecting Customer PII

    Imagine we have a customers table with the following columns, containing sensitive information:

    CUSTOMER_ID | FULL_NAME  | EMAIL                | PHONE_NUMBER
    101         | Jane Doe   | jane.doe@email.com   | 555-123-4567
    102         | John Smith | john.smith@email.com | 555-987-6543

    Our goal is to create a policy where:

    • Users with the ANALYST_ROLE see a masked version of the email and phone number.
    • Users with a privileged PII_ACCESS_ROLE can see the real, unmasked data.

    Step 1: Create the Masking Policy

    First, we define the rules of how the data should be masked. A masking policy is a schema-level object that uses a CASE statement to apply conditional logic.

    This policy will check the user’s current role. If their role is PII_ACCESS_ROLE, it will show the original value. For all other roles, it will show a masked version.

    SQL Code to Create the Policy:

    -- Create a masking policy for email addresses
    CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
      CASE
        WHEN CURRENT_ROLE() = 'PII_ACCESS_ROLE' THEN val
        ELSE '***-MASKED-***'
      END;
    
    -- Create a masking policy for phone numbers
    CREATE OR REPLACE MASKING POLICY phone_mask AS (val STRING) RETURNS STRING ->
      CASE
        WHEN CURRENT_ROLE() = 'PII_ACCESS_ROLE' THEN val
        ELSE REGEXP_REPLACE(val, '^.{8}', '********') -- Masks the first 8 characters
      END;
    
    
    • The email_mask policy is simple: it shows the real value for the privileged role and a fixed string for everyone else.
    • The phone_mask policy is slightly more advanced, using a regular expression to replace the first 8 characters of the phone number with asterisks, showing only the last part of the number.

    Step 2: Apply the Masking Policy to Your Table

    Once the policy is created, you need to apply it to the specific columns in your table that you want to protect. You use the ALTER TABLE command to do this.

    SQL Code to Apply the Policy:

    -- Apply the email_mask policy to the EMAIL column
    ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
    
    -- Apply the phone_mask policy to the PHONE_NUMBER column
    ALTER TABLE customers MODIFY COLUMN phone_number SET MASKING POLICY phone_mask;
    

    That’s it! The policy is now active.

    Step 3: Test the Policy with Different Roles

    Now, let’s test our setup. We will run the same SELECT query as two different users with two different roles.

    Test 1: Querying as a user with ANALYST_ROLE

    USE ROLE ANALYST_ROLE;
    SELECT * FROM customers;
    

    Result (Data is Masked):

    CUSTOMER_ID | FULL_NAME  | EMAIL          | PHONE_NUMBER
    101         | Jane Doe   | ***-MASKED-*** | ********4567
    102         | John Smith | ***-MASKED-*** | ********6543

    Test 2: Querying as a user with the privileged PII_ACCESS_ROLE

    USE ROLE PII_ACCESS_ROLE;
    SELECT * FROM customers;
    

    Result (Data is Unmasked):

    CUSTOMER_ID | FULL_NAME  | EMAIL                | PHONE_NUMBER
    101         | Jane Doe   | jane.doe@email.com   | 555-123-4567
    102         | John Smith | john.smith@email.com | 555-987-6543

    As you can see, the same query on the same table produces different results based on the user’s role. The masking happens dynamically at query time, and the underlying data is never changed.

    Conclusion: Security and Analytics in Harmony

    Dynamic Data Masking is undoubtedly a powerful feature that allows you to democratize data access without compromising on security. Specifically, by implementing masking policies, you can provide broad access to your tables for analytics while at the same time ensuring that sensitive PII is only visible to the specific roles that have a legitimate need to see it. Ultimately, this is a fundamental component of building a secure and well-governed data platform in Snowflake.

  • Snowflake Data Sharing and Governance

    Snowflake Data Sharing and Governance

     In the final part of our Snowflake guide, we move beyond the technical implementation and into one of the most powerful strategic advantages of the platform: governance and secure data sharing. So far, we’ve covered the architecture, learned how to load data, and explored how to query it. Now, we’ll learn how to control, secure, and share that data.

    Strong data governance isn’t just about locking data down; it’s about enabling secure access to the right data for the right people at the right time. Snowflake’s approach to this is built on two core pillars: robust, role-based access control and a revolutionary feature called Secure Data Sharing.

    Pillar 1: Governance with Role-Based Access Control (RBAC)

    In Snowflake, you never grant permissions directly to a user. Instead, all permissions are granted to Roles, and roles are then granted to users. This is a highly scalable and manageable way to control access to your data.

    How RBAC Works

    1. Objects: These are the things you want to secure, like databases, schemas, tables, and warehouses.
    2. Privileges: These are the actions that can be performed on objects, such as SELECT, INSERT, and CREATE.
    3. Roles: Roles are a collection of privileges. You can create roles for different functions, like ANALYST_ROLE, DEVELOPER_ROLE, or BI_TOOL_ROLE.
    4. Users: Users are granted one or more roles, which in turn gives them the privileges of those roles.

    Best Practice: Create a hierarchy of custom roles. For example, you might have a base READ_ONLY role that can select from tables, and an ANALYST role that inherits all the privileges of the READ_ONLY role plus additional permissions. This makes managing permissions much simpler as your organization grows.

    Example Code:

    -- 1. Create a new role
    CREATE ROLE data_analyst;
    
    -- 2. Grant privileges to the role
    GRANT USAGE ON DATABASE my_prod_db TO ROLE data_analyst;
    GRANT USAGE ON SCHEMA my_prod_db.analytics TO ROLE data_analyst;
    GRANT SELECT ON ALL TABLES IN SCHEMA my_prod_db.analytics TO ROLE data_analyst;
    GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE data_analyst;
    
    -- 3. Grant the role to a user
    GRANT ROLE data_analyst TO USER jane_doe;
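
    To implement the hierarchy described in the best practice above, grant one role to another so its privileges are inherited (role and schema names are illustrative):

    -- A base read-only role
    CREATE ROLE read_only;
    GRANT USAGE ON DATABASE my_prod_db TO ROLE read_only;
    GRANT USAGE ON SCHEMA my_prod_db.analytics TO ROLE read_only;
    GRANT SELECT ON ALL TABLES IN SCHEMA my_prod_db.analytics TO ROLE read_only;
    
    -- ANALYST inherits everything READ_ONLY can do, plus extra privileges
    CREATE ROLE analyst;
    GRANT ROLE read_only TO ROLE analyst;
    GRANT CREATE TABLE ON SCHEMA my_prod_db.sandbox TO ROLE analyst;  -- hypothetical sandbox schema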
    

    Pillar 2: The Revolution of Secure Data Sharing

    This is arguably one of Snowflake’s most innovative features and a key differentiator. Traditionally, if you wanted to share data with another company or a different department, you had to set up a painful and insecure ETL process. This involved creating data extracts (like CSV files), transferring them via FTP or other methods, and having the consumer load them into their own system. This process is slow, expensive, and creates stale, unsecure copies of your data.

    Snowflake Secure Data Sharing eliminates this entire process. It allows you to provide live, read-only access to your data to any other Snowflake account without ever moving or copying the data.

    How Secure Data Sharing Works

    1. The Provider: You (the “provider”) create a Share object. A share is a named object that contains a set of privileges on your databases and tables.
    2. Granting Access: You grant access to specific tables or views to your share.
    3. The Consumer: You add a “consumer” Snowflake account to the share. The consumer can then “mount” this share as a read-only database in their own Snowflake account.

    The Magic: The consumer is querying your data live in your account, but they are using their own virtual warehouse (their own compute) to do so. The data never leaves your ownership or your secure environment. There are no ETL processes, no data copies, and no additional storage costs.
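
    In Snowflake terms, the whole exchange is a handful of statements. A minimal sketch (database, share, and account identifiers are hypothetical):

    -- Provider side: create a share and expose specific objects to it
    CREATE SHARE sales_share;
    GRANT USAGE ON DATABASE my_prod_db TO SHARE sales_share;
    GRANT USAGE ON SCHEMA my_prod_db.analytics TO SHARE sales_share;
    GRANT SELECT ON TABLE my_prod_db.analytics.daily_sales TO SHARE sales_share;
    
    -- Provider side: add the consumer's Snowflake account to the share
    ALTER SHARE sales_share ADD ACCOUNTS = partner_org.partner_account;
    
    -- Consumer side: mount the share as a read-only database
    CREATE DATABASE shared_sales FROM SHARE provider_org.provider_account.sales_share;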

    Use Cases:

    • Data Monetization: Companies in the Snowflake Marketplace sell access to their datasets using this feature.
    • Business Partnerships: Securely share data with your suppliers, partners, or customers.
    • Internal Departments: Share data between different business units without creating multiple copies and ETL pipelines.

    Conclusion: The End of Data Silos

    By combining a robust Role-Based Access Control system with the game-changing capabilities of Secure Data Sharing, Snowflake provides a comprehensive platform for modern data governance. This approach not only secures your data but also enables seamless and secure collaboration, breaking down the data silos that have plagued businesses for decades.

    This concludes our four-part guide to Snowflake. You’ve gone from understanding the fundamental architecture to loading, querying, and now governing and sharing your data. You now have a complete picture of why Snowflake is a leader in the cloud data platform space.