Why I Stopped Building "RAG Chatbots" and Started Measuring Them

Most RAG demos stop at answering questions, but that doesn't prove they're reliable. In this post, I explain why I shifted my focus from building another RAG chatbot to creating a measurable evaluation pipeline. I'll cover the motivation, the engineering challenges, and the roadmap for building a RAG system that can be benchmarked, compared, and continuously improved.

Everyone seems to have a Retrieval-Augmented Generation (RAG) project these days.

Most demos follow the same pattern:

Upload some PDFs → ask a question → receive a surprisingly decent answer.

It's impressive for about five minutes.

Then reality arrives.

How do you know whether the answer is actually correct?

How do you know the model didn't ignore the right document?

How do you compare two retrieval strategies without relying on gut feeling?

Those questions pushed me toward a different kind of project: not another chatbot, but a system that can measure whether a RAG pipeline is actually getting better.

The Problem with Most RAG Demos

A working demo is not the same thing as a reliable system.

If changing the chunk size from 500 to 800 suddenly produces different answers, is that an improvement?

If increasing top_k returns more context, did accuracy improve, or did we just introduce more noise?

Without evaluation, every optimization becomes guesswork.

That's a dangerous place to be, especially when RAG is being used for documentation search, internal knowledge bases, or customer support.
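As a rough illustration of what replacing gut feeling with a number could look like, here is a minimal Python sketch that scores two retrieval configurations against the same labeled questions. The metric choices (hit rate at k and mean reciprocal rank), the function names, and the tiny hand-written dataset are all assumptions for illustration, not the project's actual code or results.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of questions whose labeled document appears in the top-k results."""
    hits = sum(1 for q, gold in relevant.items() if gold in retrieved.get(q, [])[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], relevant: Dict[str, str]) -> float:
    """Average of 1/rank of the labeled document (0 when it was not retrieved)."""
    total = 0.0
    for q, gold in relevant.items():
        ranked = retrieved.get(q, [])
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(relevant)


# Hypothetical labels: one known-relevant document per question.
relevant = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c"}

# Hypothetical ranked results from two configurations (made-up toy data).
baseline = {"q1": ["doc_a", "doc_x"], "q2": ["doc_x", "doc_y"], "q3": ["doc_y", "doc_c"]}
candidate = {"q1": ["doc_a", "doc_x"], "q2": ["doc_b", "doc_y"], "q3": ["doc_x", "doc_y", "doc_c"]}

for name, run in [("baseline", baseline), ("candidate", candidate)]:
    print(name, hit_rate_at_k(run, relevant, k=2), mean_reciprocal_rank(run, relevant))
```

In this toy data the two runs tie on hit rate at k=2 but differ on mean reciprocal rank and on hit rate at k=3, which is one reason a single metric is rarely enough to call a change an improvement.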

What I'm Building

I'm putting together a portfolio project that focuses as much on evaluation as it does on retrieval.

The project includes:

  • Document ingestion with configurable chunking strategies
  • Local vector search
  • A simple RAG pipeline
  • Automated evaluation using multiple quality metrics
  • LLM-as-a-judge scoring for source attribution
  • Tracing for retrieval and generation latency
  • Before/after comparisons between different retrieval approaches

Instead of asking "Does it work?", I want to answer:

  • How well does it work?
  • What changed?
  • Why did it improve?

Evaluation Is the Feature

One realization changed how I think about RAG systems.

The retrieval pipeline is only half the problem.

The harder problem is proving that your changes are actually making things better.

That's why I'm treating evaluation as a first-class feature instead of something added at the end.

Each experiment will produce measurable results that can be compared over time.

Not opinions.

Not screenshots.

Not cherry-picked examples.

Actual metrics.
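One way to make results comparable over time is to store each run as a structured record and compute differences between records. The sketch below is only an assumption about how that might look: the ExperimentResult class, its field names, and the metric values are hypothetical placeholders, not measurements from this project.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class ExperimentResult:
    """One evaluation run: what was configured and what was measured."""
    name: str
    config: Dict[str, int]
    metrics: Dict[str, float]


def to_json_line(result: ExperimentResult) -> str:
    """Serialize a run so it can be appended to a results log (one JSON object per line)."""
    return json.dumps(asdict(result), sort_keys=True)


def diff_metrics(before: ExperimentResult, after: ExperimentResult) -> Dict[str, float]:
    """Per-metric change (after minus before) for metrics present in both runs."""
    shared = before.metrics.keys() & after.metrics.keys()
    return {m: round(after.metrics[m] - before.metrics[m], 4) for m in sorted(shared)}


# Placeholder numbers for illustration only; they are not real results.
run_a = ExperimentResult("chunk-500", {"chunk_size": 500, "top_k": 4},
                         {"hit_rate": 0.60, "latency_ms": 120.0})
run_b = ExperimentResult("chunk-800", {"chunk_size": 800, "top_k": 4},
                         {"hit_rate": 0.70, "latency_ms": 150.0})

print(to_json_line(run_a))
print(diff_metrics(run_a, run_b))
```

Keeping the configuration next to the metrics means a later comparison can show what changed, not just that a number moved.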

The Experiments

Rather than building one "perfect" pipeline, I'll be experimenting with small, measurable improvements.

Some of the areas I plan to explore include:

  • Different chunk sizes and overlaps
  • Alternative chunking strategies
  • Retrieval parameter tuning
  • Cross-encoder reranking
  • Context quality improvements
  • Better citation and attribution
  • Latency versus answer quality trade-offs
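To make the first item concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes chunk size and overlap are measured in characters, which is a simplification chosen for this example (the post does not say whether the project counts characters or tokens), and chunk_text is a hypothetical helper name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters, each sharing
    `overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
print(chunk_text("abcdefghij", chunk_size=4, overlap=0))
```

Because size and overlap are plain parameters, each combination can be treated as its own experiment and evaluated against the same question set.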

The goal isn't to maximize every metric.

The goal is to understand the trade-offs behind each design decision.

Engineering Beyond the Model

One thing I've learned is that LLM applications aren't only about prompts.

A useful system also needs observability.

When a response is slow, I want to know whether retrieval or generation caused the delay.

When an answer is incorrect, I want to know which documents were retrieved.

When a pipeline changes, I want to compare today's results with last week's instead of relying on memory.

Those engineering details often matter more than squeezing another few percentage points from the model itself.
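The latency and retrieved-document questions above can be sketched with nothing more than the standard library. This is a simplified stand-in for real tracing, not an OpenTelemetry example: the timed helper, the retrieve and generate stubs, and the shape of the returned trace are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


def retrieve(question: str) -> List[str]:
    """Hypothetical stand-in for a vector search call."""
    time.sleep(0.01)
    return ["doc_a"]


def generate(question: str, docs: List[str]) -> str:
    """Hypothetical stand-in for an LLM call."""
    time.sleep(0.02)
    return f"Answer based on {docs[0]}"


def answer_with_trace(question: str) -> dict:
    """Run both stages and keep the evidence: timings and retrieved documents."""
    timings: Dict[str, float] = {}
    with timed("retrieval", timings):
        docs = retrieve(question)
    with timed("generation", timings):
        answer = generate(question, docs)
    return {"answer": answer, "retrieved": docs, "timings_s": timings}


print(answer_with_trace("What is the refund policy?"))
```

Returning the retrieved documents alongside the per-stage timings means a slow or wrong answer can be inspected after the fact rather than reproduced by hand.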

What's Coming Next

This article is the beginning of a series documenting the project from the ground up.

Upcoming posts will cover topics such as:

  • Building a configurable ingestion pipeline
  • Creating an evaluation dataset from real documents
  • Measuring answer quality with automated metrics
  • Using LLMs as judges for source attribution
  • Adding tracing with OpenTelemetry
  • Improving retrieval with rerankers
  • Comparing pipeline versions using measurable results

I'll be sharing both the successes and the experiments that don't move the needle. Sometimes learning what doesn't improve a system is just as valuable.

Final Thoughts

It's easy to build something that looks impressive during a demo.

It's much harder to build something you can confidently improve over time.

That's the direction I'm taking.

If a future version of my RAG pipeline performs better, I don't want to say, "It feels more accurate."

I want to point to the data and say, "Here's the evidence."

That's a much stronger story for engineers, for hiring managers, and ultimately for the users who depend on these systems.


I'm documenting this project as I build it. If you're interested in practical LLM engineering, RAG evaluation, and building systems that are measurable instead of merely functional, stay tuned. There are plenty of experiments ahead.
 

4 min read
Jul 04, 2026
By Dheer Gupta