Building an LLM Evaluation Framework in ASP.NET Core

AI quality degrades silently when prompts, models, retrieval strategies, or providers change. This guide shows how to build an automated LLM evaluation pipeline in ASP.NET Core with benchmark datasets, multi-model testing, RAG evaluation, regression detection, CI/CD integration, and quality dashboards.

Imagine your AI application suddenly starts producing worse answers.

Customers begin reporting inaccurate responses.

Support tickets increase.

User satisfaction drops.

What changed?

Was it:

  • The model?
  • The prompt?
  • The retrieval strategy?
  • The provider?
  • The temperature setting?
  • The knowledge base?

Without evaluation, you don't know.

Many teams invest heavily in building AI features but spend very little effort measuring their quality. As a result, changes are often deployed based on intuition rather than evidence.

Shipping AI without evaluation is like deploying an API without tests.

In this article, we'll build a production-ready evaluation pipeline in ASP.NET Core 9 that measures AI quality, detects regressions, compares models, and generates reports before changes reach production.


Why AI Needs Evaluation

Traditional software is deterministic.

Given the same input, the code typically produces the same output.

Input
  ↓
Code
  ↓
Expected Output

Testing is straightforward because expected outcomes are well-defined.

Large Language Models are different.

Input
  ↓
LLM
  ↓
Many Possible Outputs

Multiple responses may be valid.

Some may be better than others.

Traditional assertions such as:

Assert.Equal(expected, actual);

rarely work for AI systems.

Instead, we evaluate quality using metrics, benchmarks, scoring systems, and comparisons.

The goal is not perfect outputs.

The goal is measurable quality.
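
Because several outputs can be valid, one way to turn "measurable quality" into a test is to sample the model several times and assert on a pass rate rather than on an exact string. The sketch below assumes a hypothetical `generate` delegate that wraps your model call and an `isAcceptable` predicate that you define; neither comes from a specific library.

```csharp
using System;

public static class StochasticCheck
{
    // Calls the generator `runs` times and returns the fraction of outputs
    // that the predicate accepts (0.0 to 1.0).
    public static double PassRate(Func<string> generate, Func<string, bool> isAcceptable, int runs)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

        var passed = 0;
        for (var i = 0; i < runs; i++)
        {
            if (isAcceptable(generate())) passed++;
        }
        return (double)passed / runs;
    }
}
```

A test can then assert that the pass rate stays above a chosen threshold (for example 0.9) instead of comparing strings.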


What We'll Build

Our evaluation architecture looks like this:

              Test Dataset
                    │
            Evaluation Runner
                    │
         Prompt → LLM → Response
                    │
          Automated Evaluation
                    │
       ┌────────────┼────────────┐
       │            │            │
  Correctness   Relevance  Groundedness
       │            │            │
       └────────────┼────────────┘
                    │
            Evaluation Report
                    │
            ASP.NET Dashboard

This pipeline enables automated quality measurement before deployment.


Project Structure

A practical repository structure might look like:

AspNetCoreLLMEvaluation

├── Evaluations
│     ├── AccuracyEvaluator
│     ├── RelevanceEvaluator
│     └── HallucinationEvaluator
│
├── Dataset
│
├── Services
│
├── Reports
│
├── Dashboard
│
└── README

Each evaluator focuses on a specific quality dimension.


Creating an Evaluation Dataset

One of the biggest mistakes teams make is testing with random prompts.

Random prompts produce random conclusions.

Instead, create a benchmark dataset.

Each test case should include:

{
  "question": "What is the invoice due date policy?",
  "expectedAnswer": "Invoices are due within 30 days.",
  "referenceDocuments": [
    "BillingPolicy.pdf"
  ],
  "evaluationCriteria": [
    "Accuracy",
    "Groundedness"
  ]
}

A benchmark dataset provides:

  • Consistency
  • Repeatability
  • Historical comparison
  • Regression detection

Without a benchmark, improvement cannot be measured.
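
As a minimal sketch of loading such a dataset, the code below uses System.Text.Json and assumes the benchmark file is a JSON array of objects shaped like the test case above. The `EvaluationCase` record and `BenchmarkDataset` class are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical shape for one benchmark entry; properties mirror the JSON fields.
public sealed record EvaluationCase(
    string Question,
    string ExpectedAnswer,
    IReadOnlyList<string> ReferenceDocuments,
    IReadOnlyList<string> EvaluationCriteria);

public static class BenchmarkDataset
{
    // Web defaults map camelCase JSON names onto PascalCase properties.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    // Parses a JSON array of test cases and rejects entries without
    // a question or an expected answer.
    public static IReadOnlyList<EvaluationCase> Parse(string json)
    {
        var cases = JsonSerializer.Deserialize<List<EvaluationCase>>(json, Options)
                    ?? throw new InvalidOperationException("Dataset is empty.");

        foreach (var c in cases)
        {
            if (string.IsNullOrWhiteSpace(c.Question) || string.IsNullOrWhiteSpace(c.ExpectedAnswer))
                throw new InvalidOperationException("Each case needs a question and an expected answer.");
        }
        return cases;
    }
}
```

Validating on load keeps malformed cases from silently skewing scores later in the run.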


Metrics That Matter

Evaluation requires meaningful metrics.

Not all metrics are equally valuable.


Correctness

Did the model provide an accurate answer?

Example:

Question:

What is the capital of France?

Correct answer:

Paris

Correctness measures factual accuracy.
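
One simple, automatable proxy for correctness is token-level F1 between the expected and the actual answer. This is a sketch of that technique only; lexical overlap is an assumption made here for illustration, and it penalizes long but correct answers, so many teams combine it with other scorers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CorrectnessScorer
{
    // Lowercases the text and splits it into alphanumeric tokens.
    private static List<string> Tokenize(string text) =>
        new string(text.ToLowerInvariant()
                .Select(ch => char.IsLetterOrDigit(ch) ? ch : ' ')
                .ToArray())
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .ToList();

    // Token-level F1 between expected and actual answers, in the range 0..1.
    public static double TokenF1(string expected, string actual)
    {
        var expectedTokens = Tokenize(expected);
        var actualTokens = Tokenize(actual);
        if (expectedTokens.Count == 0 || actualTokens.Count == 0) return 0.0;

        // Count overlapping tokens, respecting how often each one appears.
        var remaining = expectedTokens.GroupBy(t => t).ToDictionary(g => g.Key, g => g.Count());
        var overlap = 0;
        foreach (var token in actualTokens)
        {
            if (remaining.TryGetValue(token, out var count) && count > 0)
            {
                remaining[token] = count - 1;
                overlap++;
            }
        }
        if (overlap == 0) return 0.0;

        var precision = (double)overlap / actualTokens.Count;
        var recall = (double)overlap / expectedTokens.Count;
        return 2 * precision * recall / (precision + recall);
    }
}
```

For the example above, "paris." scores 1.0 against "Paris", while "Lyon" scores 0.0.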


Groundedness

Was the answer supported by retrieved context?

Grounded answers are backed by evidence.

Ungrounded answers often indicate hallucinations.

Example:

Source Document
      ↓
Retrieved Context
      ↓
Generated Answer

If a claim in the answer cannot be traced back to the retrieved source material, the groundedness score should drop.
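
A deliberately naive way to approximate groundedness is to check, sentence by sentence, how many of the answer's words also appear in the retrieved context. The sketch below assumes lexical overlap is good enough for a first signal; the 0.7 default threshold is an arbitrary illustrative value, and production systems often use an LLM judge or an entailment model instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroundednessScorer
{
    private static readonly char[] SentenceEnds = { '.', '!', '?' };

    // Lowercases the text and returns its distinct alphanumeric tokens.
    private static HashSet<string> Tokens(string text) =>
        new string(text.ToLowerInvariant()
                .Select(ch => char.IsLetterOrDigit(ch) ? ch : ' ')
                .ToArray())
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .ToHashSet();

    // Returns the fraction of answer sentences whose tokens mostly appear
    // in the retrieved context (naive split on '.', '!' and '?').
    public static double Score(string answer, string context, double minOverlap = 0.7)
    {
        var contextTokens = Tokens(context);
        var sentences = answer
            .Split(SentenceEnds, StringSplitOptions.RemoveEmptyEntries)
            .Select(sentence => Tokens(sentence))
            .Where(tokens => tokens.Count > 0)
            .ToList();
        if (sentences.Count == 0) return 0.0;

        var supported = sentences.Count(tokens =>
            (double)tokens.Count(t => contextTokens.Contains(t)) / tokens.Count >= minOverlap);
        return (double)supported / sentences.Count;
    }
}
```

A score below 1.0 points at sentences worth inspecting for hallucinated content.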


Relevance

Did the model answer the actual question?

Users frequently receive responses that are technically correct but irrelevant.

Example:

Question:

How do I reset my password?

Response:

Our company was founded in 2018.

Correct information.

Wrong answer.

Relevance matters.


Completeness

Did the answer omit important information?

Example:

Question:
How do I request a refund?

Incomplete answer:

Contact support.

Complete answer:

Contact support, provide your order number,
and submit the request within 30 days.

Completeness measures coverage.
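
Coverage can be approximated by listing the points an answer must mention and counting how many appear. The `requiredPoints` list below is a hypothetical addition to the test case (it is not part of the dataset format shown earlier), and substring matching is a simplifying assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompletenessScorer
{
    // Fraction of required points that appear (case-insensitively) in the answer.
    public static double Coverage(string answer, IReadOnlyCollection<string> requiredPoints)
    {
        if (requiredPoints.Count == 0) return 1.0;

        var hits = requiredPoints.Count(point =>
            answer.Contains(point, StringComparison.OrdinalIgnoreCase));
        return (double)hits / requiredPoints.Count;
    }
}
```

With the required points "contact support", "order number", and "30 days", the incomplete answer above covers one of three, and the complete answer covers all three.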


Latency

Users care about response quality.

They also care about speed.

Track:

  • Average response time
  • P95 latency
  • P99 latency

Example:

Average: 1.8s

P95: 3.4s

P99: 5.2s
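
Percentiles can be computed from the recorded latencies of an evaluation run. The sketch below uses the nearest-rank definition, which is one of several common percentile definitions; the `LatencyStats` name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(IEnumerable<double> samples, double p)
    {
        if (p <= 0 || p > 100) throw new ArgumentOutOfRangeException(nameof(p));

        var sorted = samples.OrderBy(x => x).ToArray();
        if (sorted.Length == 0) throw new ArgumentException("No samples.", nameof(samples));

        var rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }
}
```

Averages hide slow outliers, which is why P95 and P99 are tracked alongside the mean.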


Token Usage

Monitor:

  • Prompt Tokens
  • Completion Tokens
  • Total Tokens

Example:

Prompt Tokens: 820

Completion Tokens: 210

Total Tokens: 1030

Token growth often indicates hidden cost increases.


Cost

Every token has a price.

Track:

  • Cost per request
  • Cost per evaluation run
  • Monthly projection

Optimization decisions should consider quality and cost together.
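
Token counts translate into cost with simple arithmetic. In the sketch below, the per-million-token prices are placeholders supplied by the caller, not real provider rates, and the type names are hypothetical.

```csharp
using System;

public sealed record TokenUsage(int PromptTokens, int CompletionTokens)
{
    public int TotalTokens => PromptTokens + CompletionTokens;
}

// Prices are caller-supplied placeholders; substitute your provider's current rates.
public sealed record ModelPricing(decimal PromptPricePerMillion, decimal CompletionPricePerMillion);

public static class CostCalculator
{
    // Cost of one request, pricing prompt and completion tokens separately.
    public static decimal CostPerRequest(TokenUsage usage, ModelPricing pricing) =>
        (usage.PromptTokens * pricing.PromptPricePerMillion
         + usage.CompletionTokens * pricing.CompletionPricePerMillion) / 1_000_000m;

    // Straight-line projection from an average request cost and daily volume.
    public static decimal MonthlyProjection(
        decimal averageCostPerRequest, long requestsPerDay, int daysPerMonth = 30) =>
        averageCostPerRequest * requestsPerDay * daysPerMonth;
}
```

With the token counts from the example above (820 prompt, 210 completion) and assumed prices of 1.00 and 2.00 per million tokens, one request costs 0.00124, and 10,000 requests a day project to 372 per 30-day month.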


Running Batch Evaluations

Testing a single prompt tells you almost nothing.

Production teams evaluate at scale.

100 Prompts
      ↓
Run Automatically
      ↓
Compare Results
      ↓
Generate Report

Batch evaluations reveal patterns that individual tests miss.

Benefits include:

  • Regression detection
  • Model comparison
  • Prompt optimization
  • Cost analysis

Automated execution should be part of every evaluation pipeline.
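
A batch run can be sketched as a loop that sends each benchmark question to the model, scores the answer, and aggregates the results. The `askModel` and `score` delegates are assumptions standing in for your provider client and your chosen scorer; the sketch runs cases sequentially for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public sealed record BatchCase(string Question, string ExpectedAnswer);
public sealed record CaseResult(string Question, string Answer, double Score, TimeSpan Latency);
public sealed record BatchSummary(int Total, double AverageScore, double PassRate);

public static class BatchEvaluator
{
    // Sends every case to the model delegate, scores the answer, and records latency.
    public static async Task<IReadOnlyList<CaseResult>> RunAsync(
        IEnumerable<BatchCase> cases,
        Func<string, Task<string>> askModel,
        Func<string, string, double> score)
    {
        var results = new List<CaseResult>();
        foreach (var c in cases)
        {
            var timer = Stopwatch.StartNew();
            var answer = await askModel(c.Question);
            timer.Stop();
            results.Add(new CaseResult(c.Question, answer, score(c.ExpectedAnswer, answer), timer.Elapsed));
        }
        return results;
    }

    // Aggregates per-case scores into an average and a pass rate.
    public static BatchSummary Summarize(IReadOnlyList<CaseResult> results, double passThreshold)
    {
        if (results.Count == 0) return new BatchSummary(0, 0.0, 0.0);

        return new BatchSummary(
            results.Count,
            results.Average(r => r.Score),
            (double)results.Count(r => r.Score >= passThreshold) / results.Count);
    }
}
```

Keeping the per-case results, not just the summary, makes it possible to inspect exactly which prompts failed.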


Prompt Regression Testing

Prompt changes should never be deployed blindly.

Before changing prompts:

  1. Run the benchmark.
  2. Record baseline results.
  3. Apply changes.
  4. Re-run evaluations.
  5. Compare scores.

Example:

Prompt V1

Accuracy: 91%
Hallucination Rate: 4%

Prompt V2

Accuracy: 86%
Hallucination Rate: 9%

Despite sounding better, Prompt V2 performs worse.

Benchmarks prevent accidental quality degradation.
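
The comparison step can be automated by diffing the candidate run against the stored baseline. This is a minimal sketch with hypothetical type names; the tolerance of one percentage point is an illustrative default meant to absorb run-to-run noise.

```csharp
using System;
using System.Collections.Generic;

// Metrics are expressed as fractions, e.g. 0.91 for 91%.
public sealed record RunMetrics(double Accuracy, double HallucinationRate);

public static class RegressionDetector
{
    // Returns the names of metrics where the candidate is worse than
    // the baseline by more than the tolerance.
    public static IReadOnlyList<string> FindRegressions(
        RunMetrics baseline, RunMetrics candidate, double tolerance = 0.01)
    {
        var regressions = new List<string>();

        if (baseline.Accuracy - candidate.Accuracy > tolerance)
            regressions.Add(nameof(RunMetrics.Accuracy));

        if (candidate.HallucinationRate - baseline.HallucinationRate > tolerance)
            regressions.Add(nameof(RunMetrics.HallucinationRate));

        return regressions;
    }
}
```

For Prompt V1 and V2 above, both Accuracy and HallucinationRate are reported as regressions.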


Model Comparison

The same benchmark can evaluate multiple models.

Example comparison:

Model           Accuracy   Hallucination Rate   Avg Latency
GPT-4o          92%        3%                   1.8s
Claude Sonnet   91%        2%                   2.1s
Gemini          89%        5%                   1.7s
Llama           83%        8%                   0.9s

This allows teams to make evidence-based decisions.

The best model is not always the most expensive one.

The best model is the one that satisfies business requirements.
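
"Satisfies business requirements" can be expressed as hard constraints plus a tie-breaker. The sketch below filters on accuracy and hallucination limits and then prefers the lowest latency; both the constraint set and the tie-breaker are assumptions, and a real selection would likely include cost as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ModelResult(
    string Model, double Accuracy, double HallucinationRate, double AvgLatencySeconds);

public static class ModelSelector
{
    // Returns the fastest model that meets both quality limits,
    // or null when no model qualifies.
    public static ModelResult? PickFastestQualifying(
        IEnumerable<ModelResult> results, double minAccuracy, double maxHallucinationRate) =>
        results
            .Where(r => r.Accuracy >= minAccuracy && r.HallucinationRate <= maxHallucinationRate)
            .OrderBy(r => r.AvgLatencySeconds)
            .FirstOrDefault();
}
```

Using the figures from the comparison above, requiring at least 90% accuracy and at most 3% hallucinations selects GPT-4o, while relaxing the limits to 80% and 10% selects Llama.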


RAG Evaluation

Many AI systems use Retrieval-Augmented Generation (RAG).

Failures can occur in retrieval or generation.

Without separate evaluation of each stage, retrieval failures and generation failures are easily mistaken for one another.

Measure:

  • Retrieval quality: Did the correct documents get retrieved?
  • Context precision: How much retrieved content was useful?
  • Missing context: Was critical information omitted?
  • Duplicate chunks: Was context wasted on redundant information?

Example:

Question
      ↓
Retriever
      ↓
Chunks
      ↓
LLM
      ↓
Answer

Evaluate retrieval separately from generation.

Otherwise root-cause analysis becomes difficult.
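
Retrieval can be scored on its own by comparing the retrieved document IDs with the reference documents listed in the test case. The sketch below uses simple set-based definitions, which is an assumption: recall captures missing context, precision captures how much of the retrieved context was useful, and the duplicate count captures wasted context. Some evaluation frameworks use rank-aware variants instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record RetrievalMetrics(double Recall, double ContextPrecision, int DuplicateChunks);

public static class RetrievalEvaluator
{
    public static RetrievalMetrics Evaluate(
        IReadOnlyList<string> retrievedDocIds, IReadOnlyCollection<string> relevantDocIds)
    {
        var relevant = new HashSet<string>(relevantDocIds, StringComparer.OrdinalIgnoreCase);
        var distinctRetrieved = new HashSet<string>(retrievedDocIds, StringComparer.OrdinalIgnoreCase);

        var relevantRetrieved = distinctRetrieved.Count(id => relevant.Contains(id));

        // Recall: share of the relevant documents that were retrieved.
        var recall = relevant.Count == 0 ? 1.0 : (double)relevantRetrieved / relevant.Count;

        // Precision: share of the distinct retrieved documents that are relevant.
        var precision = distinctRetrieved.Count == 0
            ? 0.0
            : (double)relevantRetrieved / distinctRetrieved.Count;

        // Duplicates: retrieved entries that repeat an earlier document ID.
        var duplicates = retrievedDocIds.Count - distinctRetrieved.Count;

        return new RetrievalMetrics(recall, precision, duplicates);
    }
}
```

Low recall with a wrong answer points at the retriever; high recall with a wrong answer points at generation.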


Human Feedback

Automated evaluation is powerful.

Human judgment remains essential.

Collect feedback such as:

👍 Helpful

👎 Incorrect

⭐ Rating (1-5)

Store feedback for future benchmark creation.

Over time, production feedback becomes one of the most valuable evaluation datasets available.


Evaluation Reports

Evaluation data should be converted into actionable reports.

Example report:

Overall Score: 90%

Hallucination Rate: 3.2%

Average Cost: $0.0021

Average Latency: 1.9s

Best Model: GPT-4o

Worst Prompt: CustomerSupportV3

Reports make AI quality visible and measurable.
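
As one possible rendering, the sketch below turns the summary values into a Markdown table. The `ReportData` shape is hypothetical, and values are formatted with the invariant culture so reports look the same on every build agent.

```csharp
using System;
using System.Globalization;
using System.Text;

// Rates are fractions (0.90 means 90%); cost is per request.
public sealed record ReportData(
    double OverallScore, double HallucinationRate, decimal AverageCost,
    double AverageLatencySeconds, string BestModel, string WorstPrompt);

public static class MarkdownReport
{
    public static string Render(ReportData data)
    {
        var inv = CultureInfo.InvariantCulture;
        var score = (data.OverallScore * 100).ToString("0.#", inv);
        var hallucination = (data.HallucinationRate * 100).ToString("0.#", inv);
        var cost = data.AverageCost.ToString("0.####", inv);
        var latency = data.AverageLatencySeconds.ToString("0.#", inv);

        var sb = new StringBuilder();
        sb.AppendLine("# Evaluation Report");
        sb.AppendLine();
        sb.AppendLine("| Metric | Value |");
        sb.AppendLine("| --- | --- |");
        sb.AppendLine("| Overall Score | " + score + "% |");
        sb.AppendLine("| Hallucination Rate | " + hallucination + "% |");
        sb.AppendLine("| Average Cost | $" + cost + " |");
        sb.AppendLine("| Average Latency | " + latency + "s |");
        sb.AppendLine("| Best Model | " + data.BestModel + " |");
        sb.AppendLine("| Worst Prompt | " + data.WorstPrompt + " |");
        return sb.ToString();
    }
}
```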


CI/CD Integration

Evaluation should be part of deployment pipelines.

Example workflow:

Build
  ↓
Run Evaluations
  ↓
Generate Report
  ↓
Quality Gate
  ↓
Deploy

Fail deployments when:

  • Accuracy drops below threshold
  • Hallucination rate increases
  • Cost exceeds budget
  • Latency degrades significantly

Treat AI quality like code quality.
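
A quality gate can be a small function that compares run metrics with thresholds and maps the result to a process exit code, since CI systems generally treat a non-zero exit code as a failed step. The threshold names and shapes below are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed record GateThresholds(
    double MinAccuracy, double MaxHallucinationRate,
    decimal MaxCostPerRequest, double MaxP95LatencySeconds);

public sealed record GateInput(
    double Accuracy, double HallucinationRate,
    decimal CostPerRequest, double P95LatencySeconds);

public static class QualityGate
{
    // Lists every threshold the run violates; an empty list means the gate passes.
    public static IReadOnlyList<string> Violations(GateInput run, GateThresholds limits)
    {
        var violations = new List<string>();
        if (run.Accuracy < limits.MinAccuracy) violations.Add("Accuracy below threshold");
        if (run.HallucinationRate > limits.MaxHallucinationRate) violations.Add("Hallucination rate above threshold");
        if (run.CostPerRequest > limits.MaxCostPerRequest) violations.Add("Cost exceeds budget");
        if (run.P95LatencySeconds > limits.MaxP95LatencySeconds) violations.Add("P95 latency above threshold");
        return violations;
    }

    // 0 lets the pipeline continue to deploy; 1 fails the step.
    public static int ExitCode(IReadOnlyList<string> violations) => violations.Count == 0 ? 0 : 1;
}
```

A console entry point in the pipeline could print the violations and return `ExitCode(...)` from `Main`.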


Observability Integration

Evaluation data becomes significantly more valuable when combined with observability.

Track:

  • Evaluation history
  • Deployment versions
  • Prompt versions
  • Model versions
  • Cost trends
  • Latency trends
  • Quality scores

Correlating evaluation results with deployments helps identify when regressions were introduced.

A dashboard might display:

Deployment
      ↓
Quality Score
      ↓
Latency Trend
      ↓
Cost Trend

This provides long-term visibility into AI system health.
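
One way to make evaluation results correlatable with deployments is to record them as tagged metrics using the built-in System.Diagnostics.Metrics API, which metrics exporters can subscribe to by meter name. The meter name, instrument names, and tag keys below are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class EvaluationMetrics : IDisposable
{
    // Hypothetical meter name; an exporter would subscribe to it by this name.
    private readonly Meter _meter = new("LlmEvaluation");
    private readonly Histogram<double> _qualityScore;
    private readonly Histogram<double> _latencyMs;

    public EvaluationMetrics()
    {
        _qualityScore = _meter.CreateHistogram<double>("evaluation.quality_score");
        _latencyMs = _meter.CreateHistogram<double>("evaluation.latency", unit: "ms");
    }

    // Records one evaluation result, tagged so it can be sliced by
    // deployment, prompt version, and model.
    public void Record(double qualityScore, double latencyMs,
        string deployment, string promptVersion, string model)
    {
        var deploymentTag = new KeyValuePair<string, object?>("deployment", deployment);
        var promptTag = new KeyValuePair<string, object?>("prompt_version", promptVersion);
        var modelTag = new KeyValuePair<string, object?>("model", model);

        _qualityScore.Record(qualityScore, deploymentTag, promptTag, modelTag);
        _latencyMs.Record(latencyMs, deploymentTag, promptTag, modelTag);
    }

    public void Dispose() => _meter.Dispose();
}
```

Tagging every measurement with the deployment and prompt version is what lets a dashboard show when a quality drop was introduced.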


Common Mistakes

Avoid these anti-patterns:

❌ Testing with five prompts

❌ No benchmark dataset

❌ Changing prompts without evaluation

❌ Measuring only latency

❌ Ignoring hallucinations

❌ No regression testing

❌ Relying solely on intuition

❌ Evaluating generation but not retrieval

Successful AI teams measure quality continuously.


Repository Features

Our ASP.NET Core evaluation platform includes:

  • ASP.NET Core 9
  • Evaluation Pipeline
  • JSON Benchmark Datasets
  • Prompt Comparison Engine
  • Multi-Model Support
  • RAG Evaluation
  • HTML Reports
  • Markdown Reports
  • OpenTelemetry Integration
  • GitHub Actions Workflow
  • Docker Compose Support
  • Dashboard Visualization

This provides a complete foundation for AI quality assurance.


Recommended Screenshots

Include screenshots for:

  1. Evaluation dashboard
  2. Benchmark dataset viewer
  3. Model comparison table
  4. CI/CD pipeline execution
  5. Evaluation reports
  6. Hallucination analysis
  7. Trend charts across deployments
  8. OpenTelemetry traces

These visuals help readers understand how evaluation operates in production.


Conclusion

Building AI features is only half the challenge.

Maintaining quality as prompts, models, retrieval strategies, and business requirements evolve is where mature AI engineering begins.

A robust evaluation pipeline transforms AI development from guesswork into an evidence-driven discipline. By measuring correctness, groundedness, relevance, latency, cost, and hallucinations, teams can confidently deploy improvements while preventing regressions.

The organizations that succeed with AI will not be those that deploy the most models. They will be the ones that continuously measure and improve quality.

In the next article, we'll explore Prompt Versioning and A/B Testing to safely evolve AI behavior without breaking production.

Jan 24, 2026
By Dheer Gupta