Building a Production-Ready RAG System in ASP.NET Core with PostgreSQL and pgvector

Most RAG tutorials stop after embedding documents and running similarity search. This guide covers the engineering side of Retrieval-Augmented Generation with ASP.NET Core, PostgreSQL, pgvector, metadata filtering, chunking strategies, retrieval optimization, embedding versioning, observability, security, and cost management.

Most Retrieval-Augmented Generation (RAG) tutorials stop after embedding a PDF and querying it.

That is enough to demonstrate the concept.

It is not enough to build a production system.

Real-world RAG applications must handle document updates, metadata filtering, retrieval quality, chunking strategies, embedding versioning, observability, security, cost management, and performance optimization.

Without those pieces, a RAG system quickly becomes expensive, inaccurate, and difficult to maintain.

In this article, we'll build a production-ready RAG architecture using ASP.NET Core, PostgreSQL, pgvector, OpenAI embeddings, and GPT-4o.

More importantly, we'll focus on the engineering decisions that separate a proof of concept from a system you can confidently deploy.

What We'll Build

Our architecture looks like this:

                Documents

                     │

             Chunking Pipeline

                     │

           Generate Embeddings

                     │

      PostgreSQL + pgvector Storage

                     │

        Semantic Similarity Search

                     │

        Prompt Construction Layer

                     │

              GPT-4o Response

                     │

              ASP.NET Core API

The workflow is straightforward:

Documents are uploaded.
Documents are split into chunks.
Chunks are converted into embeddings.
Embeddings are stored in PostgreSQL using pgvector.
User questions are embedded.
Similar chunks are retrieved.
A prompt is constructed using retrieved context.
GPT-4o generates a grounded response.

Simple in theory.

The challenge lies in making each step reliable and scalable.

What is RAG?

Most readers already understand the basics, so we'll keep this brief.

Retrieval-Augmented Generation combines:

A retrieval system
A large language model

Instead of relying only on model training data, the model receives relevant information retrieved from your own documents.

This enables AI applications to answer questions about:

Internal documentation
Product specifications
Policies
Contracts
Knowledge bases
Customer support content

The primary benefit is not intelligence.

The primary benefit is access to private and current information.

Why PostgreSQL + pgvector?

Many vector database options exist.

Common choices include:

Pinecone
Azure AI Search
Weaviate
Chroma
PostgreSQL + pgvector

Let's compare them.

Solution	Pros	Cons
Pinecone	Managed service, easy scaling	Additional infrastructure
Azure AI Search	Strong Azure integration	Higher complexity
Weaviate	Feature-rich	Operational overhead
Chroma	Developer-friendly	Limited production maturity
PostgreSQL + pgvector	Familiar, affordable, simple	Requires database management

For many organizations, PostgreSQL is already part of the infrastructure stack.

Adding pgvector allows vector search without introducing another specialized database.

Benefits include:

Lower operational complexity
Existing backup strategies
Existing monitoring
Existing security controls
Reduced vendor lock-in

For internal enterprise systems, PostgreSQL is often the most practical starting point.

Designing the Chunking Strategy

Chunking has a larger impact on answer quality than many developers realize.

Poor chunking creates poor retrieval.

Poor retrieval creates poor answers.

No model can fix missing context.

Fixed-Size Chunking

Example:

Chunk 1: Tokens 1-500
Chunk 2: Tokens 501-1000
Chunk 3: Tokens 1001-1500

Advantages:

Simple
Fast
Easy to implement

Disadvantages:

Splits concepts arbitrarily
Breaks context boundaries

Fixed chunking is acceptable for simple datasets but rarely optimal.

Recursive Chunking

Recursive chunking respects document structure.

Example:

Section

↓

Paragraph

↓

Sentence

Advantages:

Better context preservation
More natural retrieval
Improved answer quality

This is often the preferred default strategy.

Markdown-Aware Chunking

Technical documentation often uses:

Headings
Lists
Tables
Code blocks

Chunking should preserve those structures.

Breaking a table across multiple chunks often destroys its meaning.

Markdown-aware chunking maintains logical boundaries.

Code-Aware Chunking

Code should never be treated like normal text.

Consider:

public class OrderService
{
}

Splitting a method in half creates unusable context.

Better strategies include:

Per class
Per method
Per file

Code-aware chunking significantly improves retrieval for developer-focused AI systems.

Choosing Embedding Models

OpenAI currently provides several embedding options.

For most applications, the decision comes down to:

text-embedding-3-small
text-embedding-3-large

text-embedding-3-small

Advantages:

Lower cost
Faster generation
Smaller storage footprint

Best for:

Internal tools
Large document volumes
Cost-sensitive applications

text-embedding-3-large

Advantages:

Higher semantic accuracy
Better retrieval quality

Tradeoffs:

Increased cost
Larger vectors

Best for:

Mission-critical search
High-value enterprise knowledge systems

The right choice depends on your accuracy requirements and budget.

Storing Metadata

Many tutorials store only content and embeddings.

That approach becomes limiting very quickly.

A production system should store metadata such as:

DocumentId
Title
Source
Section
Tags
CreatedAt
EmbeddingVersion

Metadata enables:

Filtering
Auditing
Traceability
Source attribution
Multi-tenant isolation

Example:

A user asks:

Show HR vacation policies.

The retrieval system can filter:

Department = HR

before semantic search begins.

This improves both accuracy and performance.

Semantic Search Flow

Retrieval follows a predictable sequence.

User Question

↓

Generate Embedding

↓

Vector Search

↓

Top K Results

↓

Prompt Builder

↓

GPT-4o

↓

Final Answer

Each step contributes to answer quality.

Failures earlier in the pipeline propagate downstream.

Retrieval Quality Matters More Than Model Size

A common misconception is that better models solve retrieval problems.

They do not.

If retrieval is poor, GPT-4o receives poor context.

The result remains poor.

Retrieval quality often matters more than model selection.

Top K Tuning

Top K determines how many chunks are retrieved.

Examples:

Top 3
Top 5
Top 10
Top 20

Small values:

Lower token costs
Faster responses

Large values:

More context
Higher costs
Increased noise

Most systems perform well between Top 5 and Top 10.

Testing is essential.

Similarity Thresholds

Every retrieved chunk should not automatically be accepted.

A similarity threshold helps remove irrelevant matches.

Without thresholds:

Question:
Vacation policy

Retrieved:
Database migration guide

Clearly not useful.

Thresholds improve precision and reduce prompt pollution.

Duplicate Chunk Removal

Documents often contain repeated content.

Examples:

Headers
Footers
Legal notices

Duplicate chunks waste context window space.

Deduplication should occur before prompt construction.

Context Window Management

Sending every retrieved chunk to GPT-4o is a mistake.

Large prompts increase:

Cost
Latency
Noise

The goal is not maximum context.

The goal is relevant context.

Prompt Construction

Prompt building deserves its own layer.

Do not concatenate raw chunks and hope for the best.

A strong prompt includes:

System instructions
Retrieved context
User question
Citation requirements

Example structure:

System Instructions

Retrieved Context

User Question

Answer Requirements

This provides consistency and improves answer quality.

Citations Increase Trust

Responses should reference sources whenever possible.

Example:

According to the Employee Handbook
[Source: HR-Handbook.pdf, Section 3.2]

Benefits:

Verifiability
Transparency
Easier debugging
Increased user trust

Users should know where information originated.

Handling Document Updates

Documents change.

Your embeddings must evolve with them.

Avoid rebuilding everything.

Instead:

Document Updated

↓

Identify Changed Chunks

↓

Re-Embed Changed Chunks

↓

Update Index

This reduces cost and processing time significantly.

Embedding Versioning

Embedding models evolve.

Store:

EmbeddingVersion

for every vector.

Benefits:

Controlled migrations
Incremental upgrades
Rollback capability

Without versioning, upgrades become painful.

Performance Optimization

Production RAG systems require optimization.

Cache Embeddings

Frequently repeated queries can reuse embeddings.

Benefits:

Lower costs
Faster responses

Batch Embedding Requests

Generate embeddings in batches whenever possible.

This reduces API overhead.

Async Ingestion

Document processing should run in background jobs.

Uploading a file should not block for minutes.

Background Indexing

Use hosted services or queue workers.

Benefits:

Better scalability
Better reliability

Connection Pooling

Database connections are expensive.

Configure pooling appropriately.

This becomes increasingly important as query volume grows.

Security Considerations

RAG introduces new attack surfaces.

Validate Uploaded Documents

Check:

File type
Size
Content

Never trust uploaded content.

Prevent Prompt Injection

Retrieved content may contain instructions such as:

Ignore previous instructions.

Treat retrieved content as data.

Not instructions.

Protect Embeddings

Embeddings may reveal information about underlying documents.

Never expose raw vectors through public APIs.

Row-Level Security

For multi-tenant systems, PostgreSQL Row-Level Security provides an additional layer of protection.

Users should only retrieve content they are authorized to access.

Logging and Observability

Good observability dramatically improves troubleshooting.

Track:

Embedding Latency
Retrieval Latency
Prompt Token Count
Completion Token Count
Retrieved Document IDs
Similarity Scores

These metrics help answer questions such as:

Why was this answer slow?
Why was this answer incorrect?
Which documents were retrieved?
Why did costs increase?

Without metrics, debugging becomes guesswork.

Common Mistakes

These issues appear repeatedly in production reviews.

One Embedding for an Entire PDF

Retrieval becomes extremely coarse.

Huge Chunks (5,000+ Tokens)

Large chunks reduce retrieval precision.

Ignoring Metadata

Filtering and traceability become difficult.

No Similarity Threshold

Irrelevant content pollutes prompts.

Sending Every Chunk to GPT

Higher cost and lower answer quality.

No Re-Indexing Strategy

Document updates become operationally painful.

No Source Attribution

Users cannot verify answers.

Repository Features

The reference implementation includes:

ASP.NET Core Web API
PostgreSQL + pgvector
Docker Compose
OpenAI Embeddings
GPT-4o Integration
Swagger
Background Document Indexing
Structured Logging
Sample Documents
Unit Tests

The goal is not simply to answer questions.

The goal is to build a maintainable knowledge retrieval platform.

Suggested Screenshots

Include the following visuals:

pgvector Table Structure

Show:

Documents
Chunks
Embeddings

Swagger Endpoint

Demonstrate:

POST /api/chat

and

POST /api/documents/upload

Architecture Diagram

Visualize the complete retrieval pipeline.

Retrieval Flow

Show:

Question
↓
Embedding
↓
Search
↓
Context
↓
GPT-4o

Example Response

Display citations and retrieved sources.

Visual evidence makes the architecture easier to understand.

Conclusion

Building a RAG system is not about storing vectors.

It's about building a retrieval pipeline that remains accurate, observable, maintainable, and cost-effective as data grows.

Chunking, metadata, retrieval quality, prompt construction, versioning, security, and observability have a greater impact on long-term success than the vector database itself.

A production-ready RAG system treats retrieval as an engineering discipline rather than a feature.

RAG is only one piece of the puzzle. Our application can now answer questions using private data, but it still relies on a single LLM provider. In the next article, we'll build an AI Gateway that allows us to switch between OpenAI, Azure OpenAI, Claude, and Ollama without changing our application code.

7 min read

Nov 16, 2024

By Dheer Gupta

Your email address will not be published. Required fields are marked *

Comment

Name

Website

Save my name, email, and website in this browser for the next time I comment.

Building a Production-Ready RAG System in ASP.NET Core with PostgreSQL and pgvector

What We'll Build

What is RAG?

Why PostgreSQL + pgvector?

Designing the Chunking Strategy

Fixed-Size Chunking

Recursive Chunking

Markdown-Aware Chunking

Code-Aware Chunking

Choosing Embedding Models

text-embedding-3-small

text-embedding-3-large

Storing Metadata

Semantic Search Flow

Retrieval Quality Matters More Than Model Size

Top K Tuning

Similarity Thresholds

Duplicate Chunk Removal

Context Window Management

Prompt Construction

Citations Increase Trust

Handling Document Updates

Embedding Versioning

Performance Optimization

Cache Embeddings

Batch Embedding Requests

Async Ingestion

Background Indexing

Connection Pooling

Security Considerations

Validate Uploaded Documents

Prevent Prompt Injection

Protect Embeddings

Row-Level Security

Logging and Observability

Common Mistakes

One Embedding for an Entire PDF

Huge Chunks (5,000+ Tokens)

Ignoring Metadata

No Similarity Threshold

Sending Every Chunk to GPT

No Re-Indexing Strategy

No Source Attribution

Repository Features

Suggested Screenshots

pgvector Table Structure

Swagger Endpoint

Architecture Diagram

Retrieval Flow

Example Response

Conclusion

Leave a comment