Category: Databricks

Databricks lakehouse platform tutorials: Delta Lake, Apache Spark, Unity Catalog, and MLflow for data engineering.

  • Build a Databricks AI Agent with GPT-5

    The age of AI chatbots is evolving into the era of AI doers. Instead of just answering questions, modern AI can now execute tasks, interact with systems, and solve multi-step problems. At the forefront of this revolution on the Databricks platform is the Mosaic AI Agent Framework.

    This guide will walk you through building your first Databricks AI Agent—a powerful assistant that can understand natural language, inspect your data, and execute Spark SQL queries for you, all powered by the latest GPT-5 model.

    What is a Databricks AI Agent?

    A Databricks AI Agent is an autonomous system you create using the Mosaic AI Agent Framework. It leverages a powerful Large Language Model (LLM) as its “brain” to reason and make decisions. You equip this brain with a set of “tools” (custom Python functions) that allow it to interact with the Databricks environment.

    Diagram: a user prompt flows to the LLM “brain” (GPT-5), which calls its tools (Python functions) to interact with the Spark engine.

    The agent works in a loop (a minimal code sketch follows the list):

    1. Reason: Based on your goal, the LLM decides which tool is needed.
    2. Act: The agent executes the chosen Python function.
    3. Observe: It analyzes the result of that function.
    4. Repeat: It continues this process until it has achieved the final objective.
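
    Here is a minimal sketch of that loop in plain Python. The names TOOLS, llm_decide, and run_agent are illustrative stand-ins for the mental model, not part of the Mosaic AI Agent Framework API:

    # Minimal, framework-agnostic sketch of the reason-act-observe loop
    TOOLS = {}  # maps a tool name to a plain Python function

    def llm_decide(goal: str, history: list) -> dict:
        """Stand-in for the LLM call: returns either a tool call or a final answer."""
        # A real agent would call a model serving endpoint here.
        return {"finished": True, "answer": f"(no tools needed for: {goal})"}

    def run_agent(goal: str, max_steps: int = 5) -> str:
        history = []
        for _ in range(max_steps):
            decision = llm_decide(goal, history)        # 1. Reason
            if decision.get("finished"):
                return decision["answer"]
            tool = TOOLS[decision["tool"]]
            observation = tool(**decision["args"])      # 2. Act
            history.append(observation)                 # 3. Observe
        return "Stopped: step limit reached."           # 4. Repeat until done or a limit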

    Our Project: The “Data Analyst” Agent

    We will build an agent whose goal is to answer data questions from a non-technical user. To do this, it will need two primary tools:

    • A tool to get the schema of a table (get_table_schema).
    • A tool to execute a Spark SQL query and return the result (run_spark_sql).

    Let’s start building in a Databricks Notebook.

    Step 1: Setting Up Your Tools (Python Functions)

    An agent’s capabilities are defined by its tools. In Databricks, these are simply Python functions. Let’s define the two functions our agent needs to do its job.

    # Tool #1: A function to get the DDL schema of a table
    def get_table_schema(table_name: str) -> str:
        """
        Returns the DDL schema for a given Spark table name.
        This helps the agent understand the table structure before writing a query.
        """
        try:
            ddl_result = spark.sql(f"SHOW CREATE TABLE {table_name}").first()[0]
            return ddl_result
        except Exception as e:
            return f"Error: Could not retrieve schema for table {table_name}. Reason: {e}"
    
    # Tool #2: A function to execute a Spark SQL query and return the result as a string
    def run_spark_sql(query: str) -> str:
        """
        Executes a Spark SQL query and returns the result.
        This is the agent's primary tool for interacting with data.
        """
        try:
            result_df = spark.sql(query)
            # Convert the result to a string the LLM can read; cap the rows so a
            # large result set does not overflow the model's context window
            return result_df.limit(100).toPandas().to_string()
        except Exception as e:
            return f"Error: Failed to execute query. Reason: {e}"

    Step 2: Assembling Your Databricks AI Agent

    With our tools defined, we can now use the Mosaic AI Agent Framework to create our agent. This involves importing the Agent class, providing our tools, and selecting an LLM from Model Serving.

    For this example, we’ll use the newly available openai/gpt-5 model endpoint.

    from databricks_agents import Agent
    
    # Define the instructions for the agent's "brain"
    # This prompt guides the agent on how to behave and use its tools
    agent_instructions = """
    You are a world-class data analyst. Your goal is to answer user questions by querying data in Spark.
    
    Here is your plan:
    1.  First, you MUST use the `get_table_schema` tool to understand the columns of the table the user mentions. Do not guess column names.
    2.  After you have the schema, formulate a Spark SQL query to answer the user's question.
    3.  Execute the query using the `run_spark_sql` tool.
    4.  Finally, analyze the result from the query and provide a clear, natural language answer to the user. Do not just return the raw data table. Summarize the findings.
    """
    
    # Create the agent instance
    data_analyst_agent = Agent(
        model="endpoints:/openai-gpt-5", # Using a Databricks Model Serving endpoint for GPT-5
        tools=[get_table_schema, run_spark_sql],
        instructions=agent_instructions
    )
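
    If you are not sure whether a GPT-5 serving endpoint is available in your workspace, one optional check (a sketch assuming the endpoint is named openai-gpt-5, matching the agent definition above) is to look it up with the MLflow deployments client:

    # Optional: confirm the serving endpoint exists before running the agent.
    # The endpoint name "openai-gpt-5" is an assumption; use your workspace's name.
    from mlflow.deployments import get_deploy_client

    deploy_client = get_deploy_client("databricks")
    print(deploy_client.get_endpoint(endpoint="openai-gpt-5"))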

    Step 3: Interacting with Your Agent

    Your Databricks AI Agent is now ready. You can interact with it using the .run() method, providing your question as the input.

    Let’s use the samples.nyctaxi.trips table from the built-in Databricks sample datasets.

    # Let's ask our new agent a question
    user_question = "What were the average trip distances for trips paid with cash vs. credit card? Use the samples.nyctaxi.trips table."
    
    # Run the agent and get the final answer
    final_answer = data_analyst_agent.run(user_question)
    
    print(final_answer)

    What Happens Behind the Scenes:

    1. Reason: The agent reads your prompt. It knows it needs to find average trip distances from the samples.nyctaxi.trips table but first needs the schema. It decides to use the get_table_schema tool.
    2. Act: It calls get_table_schema('samples.nyctaxi.trips').
    3. Observe: It receives the table schema and sees columns like trip_distance and payment_type.
    4. Reason: Now it has the schema. It formulates a Spark SQL query like SELECT payment_type, AVG(trip_distance) FROM samples.nyctaxi.trips GROUP BY payment_type (a runnable version appears after this list). It decides to use the run_spark_sql tool.
    5. Act: It calls run_spark_sql(...) with the generated query.
    6. Observe: It receives the query result as a string (e.g., a small table showing payment types and average distances).
    7. Reason: It has the data. Its final instruction is to summarize the findings.
    8. Final Answer: It generates and returns a human-readable response like: “Based on the data, the average trip distance for trips paid with a credit card was 2.95 miles, while cash-paid trips had an average distance of 2.78 miles.”
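
    For reference, the query from step 4 can be run directly through the run_spark_sql tool. The column alias below is added only for readability and is not something the agent is guaranteed to produce:

    # A runnable version of the query the agent is expected to generate in step 4
    example_query = """
        SELECT payment_type, AVG(trip_distance) AS avg_trip_distance
        FROM samples.nyctaxi.trips
        GROUP BY payment_type
    """
    print(run_spark_sql(example_query))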

    Conclusion: Your Gateway to Autonomous Data Tasks

    Congratulations! You’ve just built a functional Databricks AI Agent. This simple text-to-SQL prototype is just the beginning. By creating more sophisticated tools, you can build agents that perform data quality checks, manage ETL pipelines, or even automate MLOps workflows, all through natural language commands on the Databricks platform.