Build AI assistants that survive contact with reality

Create test queries that cover a wide surface area and let you build robust products

Published: August 4th 2025

The process of developing robust LLM applications is inherently iterative. It involves creating test cases, assessing performance, and refining the system based on these observations.

To create a robust system, we want to test it across a large surface area to help uncover a range of failure modes. Diversity matters: we should stress different parts of the system, not just repeat similar feature paths.

To create a diverse and realistic set of queries to test our system against, we start by defining the key dimensions of our system.

Dimensions

A dimension is a way to categorise different parts of a user query. Each dimension represents one axis of variation.

For example, in a real estate assistant, useful dimensions might include:

Feature: what task the user wants to perform (e.g. property search, scheduling)
Client Persona: the type of client being served (e.g. first time buyer, investor)
Scenario Type: how clearly the user expresses their intent (e.g. well-specified, ambiguous)

Our application can have many useful dimensions.

To start testing our system, we should come up with at least 3 dimensions. We should choose dimensions that reflect realistic queries and along which our system is likely to fail.

Once we have defined our dimensions, we can create permutations of them by picking one value from each dimension, and then use these permutations to generate queries for our system.

For example, permutations of dimensions for the real estate assistant might look like:

Property Search, First Time Buyer, Well Specified
Scheduling, Investor, Ambiguous

Queries we generate from these two permutations might look like:

"Find 3 bedroom homes under $600K near downtown that allow pets"
"Book a good time for my property viewing in San Antonio"

We want to come up with at least 20 dimension combinations, so that we can use them to generate up to 100 diverse and realistic queries to start testing our system.

Having 100 queries gives us confidence that we have enough range to start uncovering common failure modes in our system.
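
The step of turning dimensions into permutations can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed implementation: the dictionary keys and the build_combinations helper are names made up for this example, and the values are just the two examples per dimension given for the real estate assistant, so it yields 8 combinations rather than the 20 we are aiming for.

```python
from itertools import product

# Dimension values taken from the real estate examples in this article.
# The key names are illustrative assumptions, not a required schema.
dimensions = {
    "feature": ["Property Search", "Scheduling"],
    "client_persona": ["First Time Buyer", "Investor"],
    "scenario_type": ["Well Specified", "Ambiguous"],
}


def build_combinations(dims):
    """Return every way of picking one value per dimension, as dicts."""
    names = list(dims)
    return [dict(zip(names, values)) for values in product(*dims.values())]


combinations = build_combinations(dimensions)

# Print each combination on one line, e.g. "Scheduling, Investor, Ambiguous".
for combo in combinations:
    print(", ".join(combo.values()))
```

With only two values per dimension the full product is small. Adding more values to each dimension, or adding a fourth dimension, quickly pushes the number of combinations past 20.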

Here are some examples of what dimensions might look like in other systems.

Travel Booking Assistant Dimensions

Task Type: Find Flight, Find Hotel, Activity Recommendation, General Enquiry
Traveler Profile: Budget, Business, Family, Luxury
Date Flexibility: Exact Dates, Flexible Dates, Open Ended

Ecommerce Chatbot Dimensions

Intent: Search Product, Compare Prices, Check Availability, Track Order, Return Request
Customer Segment: Bargain Hunter, Brand-Loyal, Gift Shopper, Bulk Buyer
Urgency: Immediate Need, Planned Purchase, Browsing
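
To connect these dimensions back to the target of 20 combinations and up to 100 queries, here is one possible sketch using the travel booking dimensions. It samples 20 distinct combinations and builds one instruction per combination asking for 5 queries. The helper names, the prompt wording, and the choice of 5 queries per combination are all assumptions made for illustration. The resulting instructions could be passed to an LLM or used as a brief for writing queries by hand.

```python
import random
from itertools import product

# Dimension values copied from the travel booking assistant example.
travel_dimensions = {
    "Task Type": ["Find Flight", "Find Hotel", "Activity Recommendation", "General Enquiry"],
    "Traveler Profile": ["Budget", "Business", "Family", "Luxury"],
    "Date Flexibility": ["Exact Dates", "Flexible Dates", "Open Ended"],
}


def sample_combinations(dims, k, seed=0):
    """Pick k distinct combinations at random (hypothetical helper)."""
    names = list(dims)
    all_combos = [dict(zip(names, values)) for values in product(*dims.values())]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(all_combos, k)


def build_prompt(combo, queries_per_combo):
    """Turn one combination into a query-writing instruction (wording is an assumption)."""
    details = "\n".join(f"- {name}: {value}" for name, value in combo.items())
    return (
        f"Write {queries_per_combo} realistic user queries for a travel booking "
        f"assistant that match these dimension values:\n{details}"
    )


combos = sample_combinations(travel_dimensions, k=20)
prompts = [build_prompt(c, queries_per_combo=5) for c in combos]

print(len(prompts), "prompts requesting", len(prompts) * 5, "queries in total")
print(prompts[0])
```

Sampling without replacement keeps the 20 combinations distinct, so the queries spread across the surface area instead of clustering on a few paths. The generated queries should still be read by a person to check that they are realistic.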