NL to SQL Chatbot Evals: Ideas for Testing Your Bot

Published: July 25th 2025

If you're building an NL to SQL chatbot and need some evals to make sure it's working the way you want, here are some ideas.

To start with, you need a bunch of sample questions you would ask your chatbot (e.g., "how many products did we sell last month"), and the answer you expect for each question.

So you now have a large list of question-answer pairs.

You write a script to run these questions against your chatbot. Your bot will convert each question into an SQL statement (that's the job of an NL to SQL bot!).
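
A minimal sketch of what such a script could look like. The names `EvalCase`, `collect_queries`, and `fake_bot` are made up for illustration, and `fake_bot` is only a stand-in for however you actually call your chatbot.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class EvalCase:
    question: str         # what you would ask the chatbot
    expected_answer: Any  # the reference answer you recorded for it


def collect_queries(
    cases: List[EvalCase], bot: Callable[[str], str]
) -> List[Tuple[EvalCase, str]]:
    """Ask the bot every question and keep the generated SQL next to its case."""
    return [(case, bot(case.question)) for case in cases]


def fake_bot(question: str) -> str:
    # Hypothetical stub: a real bot would call your model or API here.
    return "SELECT COUNT(*) FROM sales WHERE sold_at >= '2025-06-01'"


cases = [EvalCase("how many products did we sell last month", 42)]
generated = collect_queries(cases, fake_bot)

for case, sql in generated:
    print(case.question, "->", sql)
```

Keeping each generated query paired with its case means every later eval can work from the same list.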

You can then use some evals similar to the ones below to help give you confidence your bot is working as expected.

Reference Free Evals (soft checks)

  • is the generated SQL query valid (i.e., does it actually run without errors)
  • check whether certain keywords show up in the query, e.g., a product name you expect to see
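
Both soft checks can be sketched in a few lines. This assumes a SQLite database purely for illustration, and the `sales` table plus the helper names `is_valid_sql` and `missing_keywords` are invented for the example. It uses `EXPLAIN` so SQLite compiles the statement against the schema without actually running it.

```python
import sqlite3
from typing import List


def is_valid_sql(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if SQLite can compile the statement against the schema."""
    try:
        # EXPLAIN makes SQLite parse and plan the query without executing it.
        conn.execute(f"EXPLAIN {sql}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False


def missing_keywords(sql: str, keywords: List[str]) -> List[str]:
    """Return the expected keywords that do not appear in the query."""
    lowered = sql.lower()
    return [k for k in keywords if k.lower() not in lowered]


# Hypothetical schema, only here so the checks have something to run against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, sold_at TEXT)")

good = "SELECT SUM(quantity) FROM sales WHERE product = 'Widget'"
bad = "SELECT SUM(quantity) FROM orders"  # table does not exist

print(is_valid_sql(conn, good))
print(is_valid_sql(conn, bad))
print(missing_keywords(good, ["Widget", "sold_at"]))
```

The keyword check is a plain substring match, so it is deliberately loose. It tells you something is probably off, not that the query is right.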

Reference Based Metrics (hard checks)

  • compare it to a reference expected query (this can get complicated though, because syntactically different SQL queries can return the same results)

  • run the query, and compare the result data with your reference answers, e.g., check that the returned values and number of records line up with what you expect. Interestingly, you can also set these up like integration tests, where you populate the db with some data to start with, run the query, compare the results, and then delete the data again at the end of the test.

  • you could convert the query into a JSON entity mapping, and compare it with an expected entity mapping. Helps to overcome the issue of SQL being able to have syntactically different statements for the same results

  • do some stricter checks to make sure the different parts of the query (SELECT, WHERE, ORDER BY, GROUP BY) contain the information you are expecting
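
As a rough sketch of the "run the query and compare results" idea, set up like an integration test: seed a throwaway database, run the generated query, compare rows, then discard the data. SQLite in memory is an assumption made to keep the example self-contained, and the `sales` table, seed rows, and function names are all hypothetical.

```python
import sqlite3
from collections import Counter
from typing import Iterable, Tuple


def results_match(conn, generated_sql: str, expected_rows: Iterable[Tuple]) -> bool:
    """Compare returned rows with the reference rows, ignoring row order."""
    actual = [tuple(row) for row in conn.execute(generated_sql).fetchall()]
    expected = [tuple(row) for row in expected_rows]
    # Counter compares as a multiset, so duplicates count but ordering does not.
    return Counter(actual) == Counter(expected)


def run_case(generated_sql: str, expected_rows: Iterable[Tuple]) -> bool:
    conn = sqlite3.connect(":memory:")  # fresh db per test
    try:
        # Populate the db with known data before running the query.
        conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [("Widget", 3), ("Gadget", 5), ("Widget", 2)],
        )
        return results_match(conn, generated_sql, expected_rows)
    finally:
        conn.close()  # closing an in-memory db deletes the data


expected = [("Widget", 5), ("Gadget", 5)]

# Two syntactically different queries that should return the same rows.
query_a = "SELECT product, SUM(quantity) FROM sales GROUP BY product"
query_b = (
    "SELECT s.product, SUM(s.quantity) AS total "
    "FROM sales AS s GROUP BY s.product ORDER BY total DESC"
)

print(run_case(query_a, expected))
print(run_case(query_b, expected))
```

Comparing results instead of query text sidesteps the problem of different-looking SQL producing the same answer. If the question implies an ordering (e.g. "top 5 products"), you would want an order-sensitive comparison instead.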


I have not tried any of these out yet. They are some ideas I am getting while going through the AI Evals For Engineers & PMs course.

A good first check, I think, would just be running all your generated queries against the db to at least check their validity. That sounds like an easy first win.