7 pitfalls to avoid to 10x RAG evaluation pipelines
Published: August 4th 2025
Watched a great video from Skylar Payne about pitfalls to avoid when building out AI and RAG pipelines
The insights come from his work with multiple AI teams over the last year, so they're all very practical and timely
The overarching principle is that evaluating your AI and RAG pipelines is super important, because AI failures are subtle, hard to track down, and can lead to a lot of 3am debug sessions
What does it look like on the ground?
→ Most people do not "eval" anything
→ Few people evaluate generation
→ Even fewer evaluate retrieval
→ And even fewer evaluate indexing
Here are 7 tips from the field for 10x-ing your RAG pipelines
#1 → Have you even indexed the docs?
If not, you won't be able to answer the user's question well, no matter how good your model is. It's telling that this is his point number one, and it lines up with a recent post from Armand Ruiz about this being a major issue (link to that post in comments). Without the right docs, the rest of the pipeline becomes irrelevant
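A minimal sketch of what such a coverage check could look like, assuming you can list document IDs from both your source of truth and your index (all names and file names below are illustrative):

```python
# Illustrative sketch: verify that every document you expect to serve actually
# made it into the index before worrying about anything downstream.
def check_index_coverage(expected_doc_ids: set[str], indexed_doc_ids: set[str]) -> None:
    missing = expected_doc_ids - indexed_doc_ids
    coverage = 1 - len(missing) / max(len(expected_doc_ids), 1)
    print(f"Index coverage: {coverage:.1%}")
    for doc_id in sorted(missing):
        print(f"  MISSING: {doc_id}")

# Hypothetical inputs, e.g. from your docs repo and your vector store
check_index_coverage(
    expected_doc_ids={"faq.md", "pricing.pdf", "setup-guide.pdf"},
    indexed_doc_ids={"faq.md", "pricing.pdf"},
)
```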
#2 → Information Extraction
When pulling info out of docs, PDFs, tables and charts, are you actually extracting it correctly? It's usually an AI doing this step as well, so make sure you are checking it too before indexing the extracted info
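One lightweight way to catch extraction problems is to spot-check against a few hand-picked "gold" snippets that must survive extraction verbatim. A rough sketch, with placeholder snippets and a hypothetical helper:

```python
# Illustrative sketch: fraction of gold snippets found in the extracted text.
def extraction_recall(extracted_text: str, gold_snippets: list[str]) -> float:
    text = extracted_text.lower()
    hits = sum(snippet.lower() in text for snippet in gold_snippets)
    return hits / len(gold_snippets)

# Placeholder gold snippets and extraction output
gold = ["step 3: restart the service", "maximum file size is 25 mb"]
extracted = "...Step 3: restart the service and wait for the health check..."
print(f"Extraction recall: {extraction_recall(extracted, gold):.0%}")
```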
#3 → Chunk Quality
People are usually chunking too small, a habit left over from the days of 8K context windows. Imagine chunking a multi-step procedure with each step as a separate chunk. Big mess, and bad results! Investigate your chunk sizes. Probably make them bigger. Look at your biggest and smallest chunks. Look at your most retrieved chunks. If they are retrieved nearly all the time, they could be candidates for pulling out into the system prompt. Look at never-retrieved chunks. Maybe they can be eliminated, or maybe they contain errors introduced during extraction
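A simple audit along those lines could look like this sketch, assuming you have the chunk texts plus retrieval counts from your query logs (the function and data here are illustrative, not any particular library's API):

```python
from collections import Counter

# Illustrative sketch: surface size outliers, always-retrieved chunks and
# never-retrieved chunks from your own chunk store and retrieval logs.
def chunk_report(chunks: dict[str, str], retrieval_counts: Counter, n_queries: int) -> None:
    sizes = sorted((len(text), cid) for cid, text in chunks.items())
    print("Smallest chunks (chars, id):", sizes[:3])
    print("Largest chunks  (chars, id):", sizes[-3:])
    hot = [cid for cid in chunks if retrieval_counts[cid] / n_queries > 0.9]
    cold = [cid for cid in chunks if retrieval_counts[cid] == 0]
    print("Retrieved in >90% of queries (system-prompt candidates):", hot)
    print("Never retrieved (check for extraction errors, or drop):", cold)

# Hypothetical example data
chunk_report(
    chunks={"c1": "Step 1: open settings...", "c2": "Pricing tiers...", "c3": "Glossary..."},
    retrieval_counts=Counter({"c1": 95, "c2": 12}),
    n_queries=100,
)
```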
#4 → Query Rejection
Instead of trying to answer ambiguous questions like "It doesn't work" with incorrect information that ends up frustrating the user, learn to reject queries or ask for more information. Include ambiguous questions in your eval datasets to make sure you are asking for clarifications when expected
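A sketch of what such an eval case might look like, with `run_pipeline` standing in for your own RAG entry point and a deliberately crude clarification check (a judge model or human labels would be better):

```python
# Illustrative eval set mixing answerable and ambiguous queries
eval_cases = [
    {"query": "How do I rotate my API key?", "expected": "answer"},
    {"query": "It doesn't work", "expected": "clarify"},
]

def asked_for_clarification(response: str) -> bool:
    # Crude placeholder check: treat a trailing question mark as a clarification
    return response.strip().endswith("?")

def clarification_accuracy(run_pipeline) -> float:
    correct = 0
    for case in eval_cases:
        behaviour = "clarify" if asked_for_clarification(run_pipeline(case["query"])) else "answer"
        correct += behaviour == case["expected"]
    return correct / len(eval_cases)

# Dummy pipeline that always asks a follow-up question
print(clarification_accuracy(lambda q: "Could you tell me which feature isn't working?"))
```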
#5 → Retrieval Sufficiency
Check the docs being retrieved to answer each question, and make sure there is actually enough information in them to answer it. If not, think about how you can fix this
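The simplest baseline is a keyword check per eval question, though many teams use an LLM-as-judge here instead. A sketch with hand-labelled `required_facts` (all values illustrative):

```python
# Illustrative sketch: fraction of required facts present in the retrieved context
def retrieval_sufficiency(retrieved_chunks: list[str], required_facts: list[str]) -> float:
    context = " ".join(retrieved_chunks).lower()
    present = sum(fact.lower() in context for fact in required_facts)
    return present / len(required_facts)

score = retrieval_sufficiency(
    retrieved_chunks=["Refunds are processed within 14 days of the original purchase."],
    required_facts=["refund", "14 days"],
)
print(f"Sufficiency: {score:.0%}")  # anything well below 100% points at a retrieval gap
```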
#6 → Hallucinations
Give citations in responses so it is possible to check where the information is coming from. Do extra checks to validate that these citations actually exist, and do semantic validation checks to make sure they actually contain relevant information
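A sketch of both checks, assuming citations appear as [chunk_id] markers in the response; the "semantic" part here is only a shared-keyword heuristic standing in for a proper embedding or judge-model check:

```python
import re

# Illustrative sketch: flag citations that don't exist in the index, and cited
# chunks that share no content with the sentence citing them.
def validate_citations(response: str, chunks: dict[str, str]) -> list[str]:
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", response):
        for cid in re.findall(r"\[([\w-]+)\]", sentence):
            if cid not in chunks:
                problems.append(f"[{cid}] cited but not present in the index")
                continue
            sent_words = {w.lower() for w in re.findall(r"\w{5,}", sentence)}
            chunk_words = {w.lower() for w in re.findall(r"\w{5,}", chunks[cid])}
            if not sent_words & chunk_words:
                problems.append(f"[{cid}] exists but shares no content with the claim")
    return problems

# Hypothetical response and index
print(validate_citations(
    "Refunds take 14 days [c1]. Shipping is free worldwide [c9].",
    chunks={"c1": "Refunds are processed within 14 days."},
))
```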
#7 → Dynamic Data
Make sure you have good processes in place for keeping your data sources updated with the latest information, and ideally be able to time travel through the data so you can see how it changed over time
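A toy sketch of versioned ingestion that makes this kind of time travel possible (in-memory only, with all names and values illustrative):

```python
from datetime import datetime, timezone

# Illustrative version store: doc_id -> list of (ingested_at, text)
versions: dict[str, list[tuple[datetime, str]]] = {}

def ingest(doc_id: str, text: str, at: datetime) -> None:
    """Store a new version of a document with its ingestion timestamp."""
    versions.setdefault(doc_id, []).append((at, text))

def as_of(doc_id: str, when: datetime):
    """Return the latest version of `doc_id` ingested on or before `when`."""
    candidates = [(ts, text) for ts, text in versions.get(doc_id, []) if ts <= when]
    return max(candidates)[1] if candidates else None

# Hypothetical example: two versions of a pricing page
ingest("pricing.md", "Pro plan: $20/month", at=datetime(2025, 1, 1, tzinfo=timezone.utc))
ingest("pricing.md", "Pro plan: $25/month", at=datetime(2025, 6, 1, tzinfo=timezone.utc))
print(as_of("pricing.md", datetime(2025, 3, 1, tzinfo=timezone.utc)))  # returns the older version
```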