AI apps aren't deterministic — and that changes everything
Most software runs the same way every time. LLM-powered apps don't.
Published: July 23, 2025
You can give them the same input twice… and get different answers.
That's because LLMs are probabilistic — at each step they sample the next token from a distribution of likely candidates, rather than always picking the same one.
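A minimal sketch of that idea, using a toy hand-written distribution in place of a real model's output (the tokens and probabilities here are made up for illustration):

```python
import random

# Hypothetical next-token distribution for the prompt
# "The capital of France is..." -- not from a real model.
next_tokens = ["Paris", "Paris, of course", "the city of Paris"]
weights = [0.7, 0.2, 0.1]

# Two calls with the exact same "input" can sample different continuations.
first = random.choices(next_tokens, weights=weights, k=1)[0]
second = random.choices(next_tokens, weights=weights, k=1)[0]
print(first, second)  # may differ from run to run
```

Real models sample like this at every token, so small differences compound across a whole response.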
This is new for software teams.
We're used to fixed inputs, fixed outputs. Predictable systems.
So how do we build trustworthy products on top of unpredictable models?
We use something called evals.
Evals are like unit tests for LLM apps — they help us check that everything is working as expected.
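In the simplest form, an eval is just an assertion about a model response. A sketch, with `run_chatbot` as a hypothetical stand-in for the real LLM call:

```python
def run_chatbot(question: str) -> str:
    # Stub for the real LLM call -- hypothetical, for illustration only.
    return "You can reset your password from the account settings page."

def eval_contains(response: str, required_phrases: list[str]) -> bool:
    """Pass if the response mentions every required phrase (case-insensitive)."""
    return all(p.lower() in response.lower() for p in required_phrases)

response = run_chatbot("How do I reset my password?")
print(eval_contains(response, ["account settings"]))  # True if the check passes
```

Because the response itself is nondeterministic, the check targets properties of the answer (does it mention the right things?) rather than an exact string match.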
Let's say we're testing a chatbot.
First, we collect questions users are likely to ask.
We feed them into the bot, and look at the responses — not just the final answer, but every step along the way.
We compare each output to a set of expected answers.
If something's wrong, we tweak the prompts or system until it passes.
Once we have a full eval set, we can test every time we change something — the model, the context, the flow.
And as we get real user feedback, we keep adding edge cases to the evals.
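The loop above can be sketched as a small regression suite: a growing list of (question, expected-property) pairs, run against the system on every change. Everything here — the eval set, the stubbed chatbot, the pass criterion — is hypothetical and simplified:

```python
# Hypothetical eval set: (question, phrases the answer should contain).
EVAL_SET = [
    ("How do I reset my password?", ["account settings"]),
    ("What are your support hours?", ["9am", "5pm"]),
]

def run_chatbot(question: str) -> str:
    # Stub standing in for the real LLM call.
    answers = {
        "How do I reset my password?": "Go to account settings and click 'Reset password'.",
        "What are your support hours?": "We're available 9am to 5pm, Monday to Friday.",
    }
    return answers.get(question, "")

def run_evals() -> float:
    """Run every case and return the pass rate."""
    passed = 0
    for question, phrases in EVAL_SET:
        response = run_chatbot(question)
        if all(p.lower() in response.lower() for p in phrases):
            passed += 1
    return passed / len(EVAL_SET)

score = run_evals()
print(f"Eval pass rate: {score:.0%}")
```

New edge cases from user feedback become new entries in `EVAL_SET`, so the suite grows alongside the product.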
Over time, this gives us an LLM system we can trust.
We have more confidence shipping to production.
Evals help turn chaos into confidence.