Snowflake’s Unique Aggregation Functions You Need to Know

A hand-drawn snowflake symbolizes Snowflake Aggregations, featuring a bar graph at the center, a checklist on the left, and a magnifying glass examining dots on the right, representing data analysis and organization.

When you think of aggregation functions in SQL, SUM(), COUNT(), and AVG() likely come to mind first. These are the workhorses of data analysis, undoubtedly. However, Snowflake, a titan in the data cloud, offers a treasure trove of specialized, unique aggregation functions that often fly under the radar. These functions aren’t just novelties; they are powerful tools that can simplify complex analytical problems and provide insights you might otherwise struggle to extract.

Let’s dive into some of Snowflake’s most potent, yet often overlooked, aggregation capabilities.

1. APPROX_TOP_K (and APPROX_TOP_K_ARRAY): Finding the Most Frequent Items Efficiently

Imagine you have billions of customer transactions and you need to quickly identify the top 10 most purchased products, or the top 5 most active users. A GROUP BY and ORDER BY on such a massive dataset can be resource-intensive. This is where APPROX_TOP_K shines.

Hand-drawn image of three orange circles labeled “Top 3” above a pile of gray circles, representing Snowflake Aggregations. An arrow points down, showing the orange circles being placed at the top of the pile.

This function provides an approximate list of the most frequent values in an expression. While not 100% precise (hence “approximate”), it offers a significantly faster and more resource-efficient way to get high-confidence results, especially on very large datasets.

Example Use Case: Top Products by Sales

Let’s use some sample sales data.

-- Create some sample sales data
CREATE OR REPLACE TABLE sales_data (
    sale_id INT,
    product_name VARCHAR(50),
    customer_id INT
);

INSERT INTO sales_data VALUES
(1, 'Laptop', 101),
(2, 'Mouse', 102),
(3, 'Laptop', 103),
(4, 'Keyboard', 101),
(5, 'Mouse', 104),
(6, 'Laptop', 105),
(7, 'Monitor', 101),
(8, 'Laptop', 102),
(9, 'Mouse', 103),
(10, 'External SSD', 106);

-- Find the top 3 most frequently sold products using APPROX_TOP_K_ARRAY
SELECT APPROX_TOP_K_ARRAY(product_name, 3) AS top_3_products
FROM sales_data;

-- Expected Output:
-- [
--   { "VALUE": "Laptop", "COUNT": 4 },
--   { "VALUE": "Mouse", "COUNT": 3 },
--   { "VALUE": "Keyboard", "COUNT": 1 }
-- ]

APPROX_TOP_K returns a single JSON object, while APPROX_TOP_K_ARRAY returns an array of JSON objects, which is often more convenient for downstream processing.

2. MODE(): Identifying the Most Common Value Directly

Often, you need to find the value that appears most frequently within a group. While you could achieve this with GROUP BY, COUNT(), and QUALIFY ROW_NUMBER(), Snowflake simplifies it with a dedicated MODE() function.

Example Use Case: Most Common Payment Method by Region

Imagine you want to know which payment method is most popular in each sales region.

-- Sample transaction data
CREATE OR REPLACE TABLE transactions (
    transaction_id INT,
    region VARCHAR(50),
    payment_method VARCHAR(50)
);

INSERT INTO transactions VALUES
(1, 'North', 'Credit Card'),
(2, 'North', 'Credit Card'),
(3, 'North', 'PayPal'),
(4, 'South', 'Cash'),
(5, 'South', 'Cash'),
(6, 'South', 'Credit Card'),
(7, 'East', 'Credit Card'),
(8, 'East', 'PayPal'),
(9, 'East', 'PayPal');

-- Find the mode of payment_method for each region
SELECT
    region,
    MODE(payment_method) AS most_common_payment_method
FROM
    transactions
GROUP BY
    region;

-- Expected Output:
-- REGION | MOST_COMMON_PAYMENT_METHOD
-- -------|--------------------------
-- North  | Credit Card
-- South  | Cash
-- East   | PayPal

The MODE() function cleanly returns the most frequent non-NULL value. If there’s a tie, it can return any one of the tied values.

3. COLLECT_LIST() and COLLECT_SET(): Aggregating Values into Arrays

These functions are incredibly powerful for denormalization or when you need to gather all related items into a single, iterable structure within a column.

COLLECT_LIST(): Returns an array of all input values, including duplicates, in an arbitrary order.

• COLLECT_SET(): Returns an array of all distinct input values, also in an arbitrary order.

Example Use Case: Customer Purchase History

You want to see all products a customer has ever purchased, aggregated into a single list.

-- Using the sales_data from above
-- Aggregate all products purchased by each customer
SELECT
    customer_id,
    COLLECT_LIST(product_name) AS all_products_purchased,
    COLLECT_SET(product_name) AS distinct_products_purchased
FROM
    sales_data
GROUP BY
    customer_id
ORDER BY customer_id;

-- Expected Output (order of items in array may vary):
-- CUSTOMER_ID | ALL_PRODUCTS_PURCHASED | DISTINCT_PRODUCTS_PURCHASED
-- ------------|------------------------|---------------------------
-- 101         | ["Laptop", "Keyboard", "Monitor"] | ["Laptop", "Keyboard", "Monitor"]
-- 102         | ["Mouse", "Laptop"]    | ["Mouse", "Laptop"]
-- 103         | ["Laptop", "Mouse"]    | ["Laptop", "Mouse"]
-- 104         | ["Mouse"]              | ["Mouse"]
-- 105         | ["Laptop"]             | ["Laptop"]
-- 106         | ["External SSD"]       | ["External SSD"]

These functions are game-changers for building semi-structured data points or preparing data for machine learning features.

4. SKEW() and KURTOSIS(): Advanced Statistical Insights

For data scientists and advanced analysts, understanding the shape of a data distribution is crucial. SKEW() and KURTOSIS() provide direct measures of this.

• SKEW(): Measures the asymmetry of the probability distribution of a real-valued random variable about its mean. A negative skew indicates the tail is on the left, a positive skew on the right.

• KURTOSIS(): Measures the “tailedness” of the probability distribution. High kurtosis means more extreme outliers (heavier tails), while low kurtosis means lighter tails.

Example Use Case: Analyzing Price Distribution

-- Sample product prices
CREATE OR REPLACE TABLE product_prices (
    product_id INT,
    price_usd DECIMAL(10, 2)
);

INSERT INTO product_prices VALUES
(1, 10.00), (2, 12.50), (3, 11.00), (4, 100.00), (5, 9.50),
(6, 11.20), (7, 10.80), (8, 9.90), (9, 13.00), (10, 10.50);

-- Calculate skewness and kurtosis for product prices
SELECT
    SKEW(price_usd) AS price_skewness,
    KURTOSIS(price_usd) AS price_kurtosis
FROM
    product_prices;

-- Expected Output (values will vary based on data):
-- PRICE_SKEWNESS | PRICE_KURTOSIS
-- ---------------|----------------
-- 2.658...       | 6.946...

This clearly shows a positive skew (the price of 100.00 is pulling the average up) and high kurtosis due to that outlier.

Conclusion: Unlock Deeper Insights with Snowflake Unique Aggregations

While the common aggregation functions are essential, mastering these Snowflake unique aggregations can elevate your analytical capabilities significantly. They empower you to solve complex problems more efficiently, prepare data for advanced use cases, and derive insights that might otherwise remain hidden. Don’t let these powerful tools gather dust; integrate them into your data analysis toolkit today.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *