I'm always excited to take on new projects and collaborate with innovative minds.
Most RAG tutorials stop after embedding documents and running similarity search. This guide covers the engineering side of Retrieval-Augmented Generation with ASP.NET Core, PostgreSQL, pgvector, metadata filtering, chunking strategies, retrieval optimization, embedding versioning, observability, security, and cost management.
Most Retrieval-Augmented Generation (RAG) tutorials stop after embedding a PDF and querying it.
That is enough to demonstrate the concept.
It is not enough to build a production system.
Real-world RAG applications must handle document updates, metadata filtering, retrieval quality, chunking strategies, embedding versioning, observability, security, cost management, and performance optimization.
Without those pieces, a RAG system quickly becomes expensive, inaccurate, and difficult to maintain.
In this article, we'll build a production-ready RAG architecture using ASP.NET Core, PostgreSQL, pgvector, OpenAI embeddings, and GPT-4o.
More importantly, we'll focus on the engineering decisions that separate a proof of concept from a system you can confidently deploy.
Our architecture looks like this:
Documents
│
Chunking Pipeline
│
Generate Embeddings
│
PostgreSQL + pgvector Storage
│
Semantic Similarity Search
│
Prompt Construction Layer
│
GPT-4o Response
│
ASP.NET Core API
The workflow is straightforward:
Simple in theory.
The challenge lies in making each step reliable and scalable.
Most readers already understand the basics, so we'll keep this brief.
Retrieval-Augmented Generation combines:
Instead of relying only on model training data, the model receives relevant information retrieved from your own documents.
This enables AI applications to answer questions about:
The primary benefit is not intelligence.
The primary benefit is access to private and current information.
Many vector database options exist.
Common choices include:
Let's compare them.
| Solution | Pros | Cons |
|---|---|---|
| Pinecone | Managed service, easy scaling | Additional infrastructure |
| Azure AI Search | Strong Azure integration | Higher complexity |
| Weaviate | Feature-rich | Operational overhead |
| Chroma | Developer-friendly | Limited production maturity |
| PostgreSQL + pgvector | Familiar, affordable, simple | Requires database management |
For many organizations, PostgreSQL is already part of the infrastructure stack.
Adding pgvector allows vector search without introducing another specialized database.
Benefits include:
For internal enterprise systems, PostgreSQL is often the most practical starting point.
Chunking has a larger impact on answer quality than many developers realize.
Poor chunking creates poor retrieval.
Poor retrieval creates poor answers.
No model can fix missing context.
Example:
Chunk 1: Tokens 1-500
Chunk 2: Tokens 501-1000
Chunk 3: Tokens 1001-1500
Advantages:
Disadvantages:
Fixed chunking is acceptable for simple datasets but rarely optimal.
Recursive chunking respects document structure.
Example:
Section
↓
Paragraph
↓
Sentence
Advantages:
This is often the preferred default strategy.
Technical documentation often uses:
Chunking should preserve those structures.
Breaking a table across multiple chunks often destroys its meaning.
Markdown-aware chunking maintains logical boundaries.
Code should never be treated like normal text.
Consider:
public class OrderService
{
}
Splitting a method in half creates unusable context.
Better strategies include:
Code-aware chunking significantly improves retrieval for developer-focused AI systems.
OpenAI currently provides several embedding options.
For most applications, the decision comes down to:
Advantages:
Best for:
Advantages:
Tradeoffs:
Best for:
The right choice depends on your accuracy requirements and budget.
Many tutorials store only content and embeddings.
That approach becomes limiting very quickly.
A production system should store metadata such as:
DocumentId
Title
Source
Section
Tags
CreatedAt
EmbeddingVersion
Metadata enables:
Example:
A user asks:
Show HR vacation policies.
The retrieval system can filter:
Department = HR
before semantic search begins.
This improves both accuracy and performance.
Retrieval follows a predictable sequence.
User Question
↓
Generate Embedding
↓
Vector Search
↓
Top K Results
↓
Prompt Builder
↓
GPT-4o
↓
Final Answer
Each step contributes to answer quality.
Failures earlier in the pipeline propagate downstream.
A common misconception is that better models solve retrieval problems.
They do not.
If retrieval is poor, GPT-4o receives poor context.
The result remains poor.
Retrieval quality often matters more than model selection.
Top K determines how many chunks are retrieved.
Examples:
Top 3
Top 5
Top 10
Top 20
Small values:
Large values:
Most systems perform well between Top 5 and Top 10.
Testing is essential.
Every retrieved chunk should not automatically be accepted.
A similarity threshold helps remove irrelevant matches.
Without thresholds:
Question:
Vacation policy
Retrieved:
Database migration guide
Clearly not useful.
Thresholds improve precision and reduce prompt pollution.
Documents often contain repeated content.
Examples:
Duplicate chunks waste context window space.
Deduplication should occur before prompt construction.
Sending every retrieved chunk to GPT-4o is a mistake.
Large prompts increase:
The goal is not maximum context.
The goal is relevant context.
Prompt building deserves its own layer.
Do not concatenate raw chunks and hope for the best.
A strong prompt includes:
Example structure:
System Instructions
Retrieved Context
User Question
Answer Requirements
This provides consistency and improves answer quality.
Responses should reference sources whenever possible.
Example:
According to the Employee Handbook
[Source: HR-Handbook.pdf, Section 3.2]
Benefits:
Users should know where information originated.
Documents change.
Your embeddings must evolve with them.
Avoid rebuilding everything.
Instead:
Document Updated
↓
Identify Changed Chunks
↓
Re-Embed Changed Chunks
↓
Update Index
This reduces cost and processing time significantly.
Embedding models evolve.
Store:
EmbeddingVersion
for every vector.
Benefits:
Without versioning, upgrades become painful.
Production RAG systems require optimization.
Frequently repeated queries can reuse embeddings.
Benefits:
Generate embeddings in batches whenever possible.
This reduces API overhead.
Document processing should run in background jobs.
Uploading a file should not block for minutes.
Use hosted services or queue workers.
Benefits:
Database connections are expensive.
Configure pooling appropriately.
This becomes increasingly important as query volume grows.
RAG introduces new attack surfaces.
Check:
Never trust uploaded content.
Retrieved content may contain instructions such as:
Ignore previous instructions.
Treat retrieved content as data.
Not instructions.
Embeddings may reveal information about underlying documents.
Never expose raw vectors through public APIs.
For multi-tenant systems, PostgreSQL Row-Level Security provides an additional layer of protection.
Users should only retrieve content they are authorized to access.
Good observability dramatically improves troubleshooting.
Track:
Embedding Latency
Retrieval Latency
Prompt Token Count
Completion Token Count
Retrieved Document IDs
Similarity Scores
These metrics help answer questions such as:
Without metrics, debugging becomes guesswork.
These issues appear repeatedly in production reviews.
Retrieval becomes extremely coarse.
Large chunks reduce retrieval precision.
Filtering and traceability become difficult.
Irrelevant content pollutes prompts.
Higher cost and lower answer quality.
Document updates become operationally painful.
Users cannot verify answers.
The reference implementation includes:
The goal is not simply to answer questions.
The goal is to build a maintainable knowledge retrieval platform.
Include the following visuals:
Show:
Demonstrate:
POST /api/chat
and
POST /api/documents/upload
Visualize the complete retrieval pipeline.
Show:
Question
↓
Embedding
↓
Search
↓
Context
↓
GPT-4o
Display citations and retrieved sources.
Visual evidence makes the architecture easier to understand.
Building a RAG system is not about storing vectors.
It's about building a retrieval pipeline that remains accurate, observable, maintainable, and cost-effective as data grows.
Chunking, metadata, retrieval quality, prompt construction, versioning, security, and observability have a greater impact on long-term success than the vector database itself.
A production-ready RAG system treats retrieval as an engineering discipline rather than a feature.
RAG is only one piece of the puzzle. Our application can now answer questions using private data, but it still relies on a single LLM provider. In the next article, we'll build an AI Gateway that allows us to switch between OpenAI, Azure OpenAI, Claude, and Ollama without changing our application code.
Your email address will not be published. Required fields are marked *