Building an LLM Evaluation Framework in ASP.NET Core

AI quality degrades silently when prompts, models, retrieval strategies, or providers change. This guide shows how to build an automated LLM evaluation pipeline in ASP.NET Core with benchmark datasets, multi-model testing, RAG evaluation, regression detection, CI/CD integration, and quality dashboards.

Imagine your AI application suddenly starts producing worse answers.

Customers begin reporting inaccurate responses.

Support tickets increase.

User satisfaction drops.

What changed?

Was it:

The model?
The prompt?
The retrieval strategy?
The provider?
The temperature setting?
The knowledge base?

Without evaluation, you don't know.

Many teams invest heavily in building AI features but spend very little effort measuring their quality. As a result, changes are often deployed based on intuition rather than evidence.

Shipping AI without evaluation is like deploying an API without tests.

In this article, we'll build a production-ready evaluation pipeline in ASP.NET Core 9 that measures AI quality, detects regressions, compares models, and generates reports before changes reach production.

Why AI Needs Evaluation

Traditional software is largely deterministic.

Given the same input, the code produces the same output.

Input
  ↓
Code
  ↓
Expected Output

Testing is straightforward because expected outcomes are well-defined.

Large Language Models are different.

Input
  ↓
LLM
  ↓
Many Possible Outputs

Multiple responses may be valid.

Some may be better than others.

Traditional assertions such as:

Assert.Equal(expected, actual);

rarely work for AI systems.

Instead, we evaluate quality using metrics, benchmarks, scoring systems, and comparisons.

The goal is not perfect outputs.

The goal is measurable quality.

What We'll Build

Our evaluation architecture looks like this:

              Test Dataset
                    │
            Evaluation Runner
                    │
         Prompt → LLM → Response
                    │
          Automated Evaluation
        ┌───────────┼───────────┐
        │           │           │
   Correctness  Relevance  Groundedness
        │           │           │
        └───────────┼───────────┘
                    │
            Evaluation Report
                    │
            ASP.NET Dashboard

This pipeline enables automated quality measurement before deployment.

Project Structure

A practical repository structure might look like:

AspNetCoreLLMEvaluation

├── Evaluations
│     ├── AccuracyEvaluator
│     ├── RelevanceEvaluator
│     └── HallucinationEvaluator
│
├── Dataset
│
├── Services
│
├── Reports
│
├── Dashboard
│
└── README

Each evaluator focuses on a specific quality dimension.

Creating an Evaluation Dataset

One of the biggest mistakes teams make is testing with random prompts.

Random prompts produce random conclusions.

Instead, create a benchmark dataset.

Each test case should include:

{
  "question": "What is the invoice due date policy?",
  "expectedAnswer": "Invoices are due within 30 days.",
  "referenceDocuments": [
    "BillingPolicy.pdf"
  ],
  "evaluationCriteria": [
    "Accuracy",
    "Groundedness"
  ]
}

A benchmark dataset provides:

Consistency
Repeatability
Historical comparison
Regression detection

Without a benchmark, improvement cannot be measured.
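
As a rough sketch, a dataset file holding an array of test cases shaped like the JSON above could be loaded with System.Text.Json. The `EvaluationCase` record and `BenchmarkDataset` class are hypothetical names chosen for this illustration, not part of any library.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical record mirroring the fields of the JSON test case.
public sealed record EvaluationCase(
    string Question,
    string ExpectedAnswer,
    IReadOnlyList<string> ReferenceDocuments,
    IReadOnlyList<string> EvaluationCriteria);

public static class BenchmarkDataset
{
    // Web defaults match camelCase JSON names to PascalCase properties.
    private static readonly JsonSerializerOptions Options =
        new(JsonSerializerDefaults.Web);

    // Parses a JSON array of test cases; a JSON null yields an empty list.
    public static IReadOnlyList<EvaluationCase> Parse(string json) =>
        JsonSerializer.Deserialize<List<EvaluationCase>>(json, Options)
        ?? new List<EvaluationCase>();
}
```

In a real project the JSON would typically come from `File.ReadAllText` on a file under the Dataset folder, and the file would be versioned alongside the prompts it tests.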

Metrics That Matter

Evaluation requires meaningful metrics.

Not all metrics are equally valuable.

Correctness

Did the model provide an accurate answer?

Example:

Question:

What is the capital of France?

Correct answer:

Paris

Correctness measures factual accuracy.
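
One simple way to put a number on correctness is token-overlap F1 between the expected and actual answers. This is a crude lexical proxy that I am assuming here for illustration; the article does not prescribe a scoring method, and `CorrectnessScorer` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CorrectnessScorer
{
    // Lowercases the text and splits it into alphanumeric tokens.
    private static List<string> Tokenize(string text) =>
        new string(text.ToLowerInvariant()
                .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
                .ToArray())
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .ToList();

    // Token-overlap F1 between expected and actual answers, in [0, 1].
    public static double F1(string expected, string actual)
    {
        var expectedTokens = Tokenize(expected);
        var actualTokens = Tokenize(actual);
        if (expectedTokens.Count == 0 || actualTokens.Count == 0) return 0.0;

        // Count shared tokens, respecting how often each one occurs.
        var remaining = expectedTokens
            .GroupBy(t => t)
            .ToDictionary(g => g.Key, g => g.Count());
        var overlap = 0;
        foreach (var token in actualTokens)
        {
            if (remaining.TryGetValue(token, out var count) && count > 0)
            {
                remaining[token] = count - 1;
                overlap++;
            }
        }
        if (overlap == 0) return 0.0;

        var precision = (double)overlap / actualTokens.Count;
        var recall = (double)overlap / expectedTokens.Count;
        return 2 * precision * recall / (precision + recall);
    }
}
```

Note the trade-off: a verbose but correct answer such as "The capital of France is Paris." scores low against the expected answer "Paris" because precision drops. A lexical score like this works best alongside other signals rather than as the only measure.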

Groundedness

Was the answer supported by retrieved context?

Grounded answers are backed by evidence.

Ungrounded answers often indicate hallucinations.

Example:

Source Document
      ↓
Retrieved Context
      ↓
Generated Answer

If information cannot be traced back to source material, confidence should decrease.
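
A minimal sketch of a groundedness check, assuming a purely lexical approach: an answer sentence counts as supported when most of its words appear in the retrieved context. The `GroundednessScorer` name and the 0.6 support threshold are illustrative assumptions, not values from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroundednessScorer
{
    // Lowercases the text and returns its distinct alphanumeric words.
    private static HashSet<string> Words(string text) =>
        new string(text.ToLowerInvariant()
                .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
                .ToArray())
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .ToHashSet();

    // Fraction of answer sentences whose words mostly appear in the context.
    public static double Score(string answer, string context, double minSupport = 0.6)
    {
        var contextWords = Words(context);
        var sentences = answer
            .Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => Words(s))
            .Where(w => w.Count > 0)
            .ToList();
        if (sentences.Count == 0) return 0.0;

        var supported = sentences.Count(w =>
            (double)w.Count(word => contextWords.Contains(word)) / w.Count >= minSupport);
        return (double)supported / sentences.Count;
    }
}
```

Word overlap cannot tell a paraphrase from a fabrication, so a low score here is a prompt to look closer rather than proof of a hallucination.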

Relevance

Did the model answer the actual question?

Users frequently receive responses that are technically correct but irrelevant.

Example:

Question:

How do I reset my password?

Response:

Our company was founded in 2018.

Correct information.

Wrong answer.

Relevance matters.

Completeness

Did the answer omit important information?

Example:

Question:
How do I request a refund?

Incomplete answer:

Contact support.

Complete answer:

Contact support, provide your order number,
and submit the request within 30 days.

Completeness measures coverage.

Latency

Users care about response quality.

They also care about speed.

Track:

Average response time
P95 latency
P99 latency

Example:

Average: 1.8s

P95: 3.4s

P99: 5.2s
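
Percentiles can be computed from the recorded latencies in several ways. The sketch below uses the nearest-rank method, which is one common convention and an assumption on my part; `LatencyStats` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(IReadOnlyCollection<double> samples, double p)
    {
        if (samples.Count == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));
        if (p <= 0 || p > 100)
            throw new ArgumentOutOfRangeException(nameof(p), "p must be in (0, 100].");

        var sorted = samples.OrderBy(s => s).ToArray();
        // Multiply before dividing to avoid rounding drift in the rank.
        var rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[rank - 1];
    }
}
```

With only a handful of samples, P95 and P99 collapse onto the slowest request, which is another reason to evaluate on a benchmark of meaningful size.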

Token Usage

Monitor:

Prompt Tokens
Completion Tokens
Total Tokens

Example:

Prompt Tokens: 820

Completion Tokens: 210

Total Tokens: 1030

Token growth often indicates hidden cost increases.

Cost

Every token has a price.

Track:

Cost Per Request

Cost Per Evaluation Run

Monthly Projection

Optimization decisions should consider quality and cost together.
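
Cost per request follows directly from the token counts and the provider's prices. The sketch below assumes prices are quoted per million tokens; the `ModelPricing` record and any price values you pass in are placeholders, since real prices come from your provider's price list.

```csharp
// Hypothetical pricing record: prices per one million tokens.
public sealed record ModelPricing(decimal InputPerMillion, decimal OutputPerMillion);

public static class CostCalculator
{
    // Cost of a single request from its prompt and completion token counts.
    public static decimal CostPerRequest(
        int promptTokens, int completionTokens, ModelPricing pricing) =>
        promptTokens * pricing.InputPerMillion / 1_000_000m
        + completionTokens * pricing.OutputPerMillion / 1_000_000m;

    // Straight-line projection that assumes constant traffic and constant cost.
    public static decimal MonthlyProjection(
        decimal costPerRequest, long requestsPerDay, int days = 30) =>
        costPerRequest * requestsPerDay * days;
}
```

Using `decimal` rather than `double` keeps small per-request amounts exact when they are multiplied up to monthly totals.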

Running Batch Evaluations

Testing a single prompt tells you almost nothing.

Production teams evaluate at scale.

100 Prompts
      ↓
Run Automatically
      ↓
Compare Results
      ↓
Generate Report

Batch evaluations reveal patterns that individual tests miss.

Benefits include:

Regression detection
Model comparison
Prompt optimization
Cost analysis

Automated execution should be part of every evaluation pipeline.
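
A batch runner can be quite small. In this sketch the model call and the scoring function are passed in as delegates, so `askModel` stands in for whatever client your application uses; all type names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public sealed record BatchCase(string Question, string ExpectedAnswer);

public sealed record CaseResult(
    string Question, string Answer, double Score, TimeSpan Latency);

public static class BatchRunner
{
    // Runs every case through the model, scoring and timing each response.
    public static async Task<IReadOnlyList<CaseResult>> RunAsync(
        IEnumerable<BatchCase> cases,
        Func<string, Task<string>> askModel,       // assumed wrapper around the LLM client
        Func<string, string, double> score)        // (expected, actual) -> score in [0, 1]
    {
        var results = new List<CaseResult>();
        foreach (var testCase in cases)
        {
            var stopwatch = Stopwatch.StartNew();
            var answer = await askModel(testCase.Question);
            stopwatch.Stop();

            results.Add(new CaseResult(
                testCase.Question,
                answer,
                score(testCase.ExpectedAnswer, answer),
                stopwatch.Elapsed));
        }
        return results;
    }
}
```

The cases run one at a time so that latency measurements are not skewed by concurrent requests. Running them in parallel would shorten the run but changes what the latency numbers mean.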

Prompt Regression Testing

Prompt changes should never be deployed blindly.

When changing a prompt, follow this sequence:

Run the benchmark.
Record baseline results.
Apply changes.
Re-run evaluations.
Compare scores.

Example:

Prompt V1

Accuracy: 91%
Hallucination Rate: 4%

Prompt V2

Accuracy: 86%
Hallucination Rate: 9%

Prompt V2 may read better, but it performs worse: accuracy falls by 5 points and the hallucination rate more than doubles.

Benchmarks prevent accidental quality degradation.
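
The comparison step can be automated with a small check that flags any metric that got worse by more than a tolerance. This is a sketch; the `RunSummary` record and the default tolerance of one percentage point are assumptions chosen for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical summary of one benchmark run; rates are fractions in [0, 1].
public sealed record RunSummary(string Name, double Accuracy, double HallucinationRate);

public static class RegressionDetector
{
    // Returns one message per metric that worsened by more than the tolerance.
    public static IReadOnlyList<string> Compare(
        RunSummary baseline, RunSummary candidate, double tolerance = 0.01)
    {
        var regressions = new List<string>();

        if (baseline.Accuracy - candidate.Accuracy > tolerance)
        {
            regressions.Add(
                $"{candidate.Name}: accuracy fell from {baseline.Accuracy:P0} to {candidate.Accuracy:P0}");
        }

        if (candidate.HallucinationRate - baseline.HallucinationRate > tolerance)
        {
            regressions.Add(
                $"{candidate.Name}: hallucination rate rose from {baseline.HallucinationRate:P0} to {candidate.HallucinationRate:P0}");
        }

        return regressions;
    }
}
```

Applied to the example figures, comparing Prompt V1 (0.91, 0.04) with Prompt V2 (0.86, 0.09) reports both metrics as regressions. The tolerance exists because LLM output varies between runs, so tiny differences should not fail a change.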

Model Comparison

The same benchmark can evaluate multiple models.

Example comparison:

Model           Accuracy   Hallucination Rate   Avg Latency
GPT-4o          92%        3%                   1.8s
Claude Sonnet   91%        2%                   2.1s
Gemini          89%        5%                   1.7s
Llama           83%        8%                   0.9s

This allows teams to make evidence-based decisions.

The best model is not always the most expensive one.

The best model is the one that satisfies business requirements.
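
One way to turn "satisfies business requirements" into code is to filter models by minimum quality and then pick among the survivors. The sketch below picks the fastest qualifying model; the selection rule, the thresholds, and the type names are all assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-model benchmark result; rates are fractions in [0, 1].
public sealed record ModelResult(
    string Model, double Accuracy, double HallucinationRate, double AvgLatencySeconds);

public static class ModelSelector
{
    // Returns the fastest model that meets both quality requirements,
    // or null when no model qualifies.
    public static ModelResult? PickFastestQualifying(
        IEnumerable<ModelResult> results,
        double minAccuracy,
        double maxHallucinationRate) =>
        results
            .Where(r => r.Accuracy >= minAccuracy
                        && r.HallucinationRate <= maxHallucinationRate)
            .OrderBy(r => r.AvgLatencySeconds)
            .FirstOrDefault();
}
```

With the example figures above, requiring at least 90% accuracy and at most 3% hallucinations leaves two candidates, and the faster of the two is selected. Loosening the requirements changes the winner, which is the point: the requirements decide, not the leaderboard.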

RAG Evaluation

Many AI systems use Retrieval-Augmented Generation (RAG).

Failures can occur in retrieval or in generation.

Without separate measurement, the two are easily confused.

Measure:

Retrieval Quality

Did the correct documents get retrieved?

Context Precision

How much retrieved content was useful?

Missing Context

Was critical information omitted?

Duplicate Chunks

Was context wasted on redundant information?

Example:

Question
      ↓
Retriever
      ↓
Chunks
      ↓
LLM
      ↓
Answer

Evaluate retrieval separately from generation.

Otherwise root-cause analysis becomes difficult.
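
The retrieval-side measures can be approximated with simple set arithmetic over document IDs and chunk text. This sketch uses unranked recall and precision, which is a simplification I am assuming here; `RetrievalMetrics` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RetrievalMetrics
{
    // Share of expected documents that were actually retrieved.
    public static double Recall(
        IEnumerable<string> expectedIds, IEnumerable<string> retrievedIds)
    {
        var expected = expectedIds.ToHashSet();
        if (expected.Count == 0) return 1.0;
        var retrieved = retrievedIds.ToHashSet();
        return (double)expected.Count(id => retrieved.Contains(id)) / expected.Count;
    }

    // Share of retrieved documents that were expected (simple, unranked precision).
    public static double Precision(
        IEnumerable<string> expectedIds, IEnumerable<string> retrievedIds)
    {
        var retrieved = retrievedIds.ToList();
        if (retrieved.Count == 0) return 0.0;
        var expected = expectedIds.ToHashSet();
        return (double)retrieved.Count(id => expected.Contains(id)) / retrieved.Count;
    }

    // Share of chunks that repeat another chunk after trimming and case-folding.
    public static double DuplicateRatio(IEnumerable<string> chunks)
    {
        var all = chunks.Select(c => c.Trim().ToLowerInvariant()).ToList();
        if (all.Count == 0) return 0.0;
        return 1.0 - (double)all.Distinct().Count() / all.Count;
    }
}
```

Low recall points at missing context, low precision at noisy retrieval, and a high duplicate ratio at wasted context window. Because none of these involve the LLM, they can be run cheaply on every change to the retriever.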

Human Feedback

Automated evaluation is powerful.

Human judgment remains essential.

Collect feedback such as:

👍 Helpful

👎 Incorrect

⭐ Rating (1-5)

Store feedback for future benchmark creation.

Over time, production feedback becomes one of the most valuable evaluation datasets available.

Evaluation Reports

Evaluation data should be converted into actionable reports.

Example report:

Overall Score: 90%

Hallucination Rate: 3.2%

Average Cost: $0.0021

Average Latency: 1.9s

Best Model: GPT-4o

Worst Prompt: CustomerSupportV3

Reports make AI quality visible and measurable.
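
A Markdown report needs little more than a string builder. The sketch below renders a few of the fields from the example report; the `ReportData` record and the table layout are assumptions, and a real report would likely carry more detail.

```csharp
using System;
using System.Text;

// Hypothetical report input; rates are fractions in [0, 1].
public sealed record ReportData(
    double OverallScore,
    double HallucinationRate,
    decimal AverageCost,
    double AverageLatencySeconds);

public static class MarkdownReport
{
    // Renders the summary as a Markdown table with culture-invariant numbers.
    public static string Render(ReportData data)
    {
        var sb = new StringBuilder();
        sb.AppendLine("# Evaluation Report");
        sb.AppendLine();
        sb.AppendLine("| Metric | Value |");
        sb.AppendLine("| --- | --- |");
        sb.AppendLine(FormattableString.Invariant(
            $"| Overall Score | {data.OverallScore * 100:F1}% |"));
        sb.AppendLine(FormattableString.Invariant(
            $"| Hallucination Rate | {data.HallucinationRate * 100:F1}% |"));
        sb.AppendLine(FormattableString.Invariant(
            $"| Average Cost | ${data.AverageCost:F4} |"));
        sb.AppendLine(FormattableString.Invariant(
            $"| Average Latency | {data.AverageLatencySeconds:F1}s |"));
        return sb.ToString();
    }
}
```

Formatting with the invariant culture keeps reports identical whether they are generated on a developer machine or on a build agent with different regional settings.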

CI/CD Integration

Evaluation should be part of deployment pipelines.

Example workflow:

Build
  ↓
Run Evaluations
  ↓
Generate Report
  ↓
Quality Gate
  ↓
Deploy

Fail deployments when:

Accuracy drops below the agreed threshold
Hallucination rate rises above the agreed limit
Cost exceeds budget
P95 latency exceeds the agreed limit

Treat AI quality like code quality.
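
A quality gate can be a plain function that compares the run's metrics with thresholds and maps the outcome to a process exit code, which is how most CI systems decide whether a job passed. The records and threshold fields below are hypothetical.

```csharp
using System.Collections.Generic;

public sealed record GateMetrics(
    double Accuracy, double HallucinationRate, decimal CostPerRun, double P95LatencySeconds);

public sealed record GateThresholds(
    double MinAccuracy, double MaxHallucinationRate, decimal MaxCostPerRun, double MaxP95LatencySeconds);

public static class QualityGate
{
    // Returns one message per violated threshold; an empty list means the gate passes.
    public static IReadOnlyList<string> Evaluate(GateMetrics metrics, GateThresholds thresholds)
    {
        var failures = new List<string>();
        if (metrics.Accuracy < thresholds.MinAccuracy)
            failures.Add("Accuracy is below the threshold.");
        if (metrics.HallucinationRate > thresholds.MaxHallucinationRate)
            failures.Add("Hallucination rate is above the limit.");
        if (metrics.CostPerRun > thresholds.MaxCostPerRun)
            failures.Add("Cost exceeds the budget.");
        if (metrics.P95LatencySeconds > thresholds.MaxP95LatencySeconds)
            failures.Add("P95 latency is above the limit.");
        return failures;
    }

    // 0 lets the pipeline continue; any non-zero code fails the job.
    public static int ExitCode(IReadOnlyList<string> failures) =>
        failures.Count == 0 ? 0 : 1;
}
```

A console entry point would print the failure messages and return `QualityGate.ExitCode(failures)` from `Main`, so the deploy step only runs when the list is empty.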

Observability Integration

Evaluation data becomes significantly more valuable when combined with observability.

Track:

Evaluation history
Deployment versions
Prompt versions
Model versions
Cost trends
Latency trends
Quality scores

Correlating evaluation results with deployments helps identify when regressions were introduced.

A dashboard might display:

Deployment
      ↓
Quality Score
      ↓
Latency Trend
      ↓
Cost Trend

This provides long-term visibility into AI system health.
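
On the .NET side, evaluation results can be published through `System.Diagnostics.Metrics`, which OpenTelemetry for .NET can collect when it is configured to listen to the meter by name. The meter name, instrument names, and tag keys in this sketch are assumptions, not an established convention.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class EvaluationTelemetry : IDisposable
{
    // Hypothetical meter name; a collector must subscribe to the same name.
    public const string MeterName = "LlmEvaluation";

    private readonly Meter _meter = new(MeterName);
    private readonly Histogram<double> _qualityScore;
    private readonly Histogram<double> _latencySeconds;

    public EvaluationTelemetry()
    {
        _qualityScore = _meter.CreateHistogram<double>("evaluation.quality_score");
        _latencySeconds = _meter.CreateHistogram<double>("evaluation.latency", unit: "s");
    }

    // Records one evaluation result, tagged with the versions that produced it.
    public void Record(
        double qualityScore, double latencySeconds, string promptVersion, string modelVersion)
    {
        var tags = new TagList
        {
            { "prompt.version", promptVersion },
            { "model.version", modelVersion },
        };
        _qualityScore.Record(qualityScore, tags);
        _latencySeconds.Record(latencySeconds, tags);
    }

    public void Dispose() => _meter.Dispose();
}
```

Tagging every measurement with prompt and model versions is what makes it possible to line up a drop in quality with the deployment that caused it.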

Common Mistakes

Avoid these anti-patterns:

❌ Testing with five prompts

❌ No benchmark dataset

❌ Changing prompts without evaluation

❌ Measuring only latency

❌ Ignoring hallucinations

❌ No regression testing

❌ Relying solely on intuition

❌ Evaluating generation but not retrieval

Successful AI teams measure quality continuously.

Repository Features

Our ASP.NET Core evaluation platform includes:

ASP.NET Core 9
Evaluation Pipeline
JSON Benchmark Datasets
Prompt Comparison Engine
Multi-Model Support
RAG Evaluation
HTML Reports
Markdown Reports
OpenTelemetry Integration
GitHub Actions Workflow
Docker Compose Support
Dashboard Visualization

This provides a complete foundation for AI quality assurance.

Recommended Screenshots

Include screenshots for:

Evaluation dashboard
Benchmark dataset viewer
Model comparison table
CI/CD pipeline execution
Evaluation reports
Hallucination analysis
Trend charts across deployments
OpenTelemetry traces

These visuals help readers understand how evaluation operates in production.

Conclusion

Building AI features is only half the challenge.

Maintaining quality as prompts, models, retrieval strategies, and business requirements evolve is where mature AI engineering begins.

A robust evaluation pipeline transforms AI development from guesswork into an evidence-driven discipline. By measuring correctness, groundedness, relevance, latency, cost, and hallucinations, teams can confidently deploy improvements while preventing regressions.

The organizations that succeed with AI will not be those that deploy the most models. They will be the ones that continuously measure and improve quality.

In the next article, we'll explore Prompt Versioning and A/B Testing to safely evolve AI behavior without breaking production.

Jan 24, 2026

By Dheer Gupta
