I'm always excited to take on new projects and collaborate with innovative minds.

Social Links

Building a Production-Ready RAG System in ASP.NET Core with PostgreSQL and pgvector

Most RAG tutorials stop after embedding documents and running similarity search. This guide covers the engineering side of Retrieval-Augmented Generation with ASP.NET Core, PostgreSQL, pgvector, metadata filtering, chunking strategies, retrieval optimization, embedding versioning, observability, security, and cost management.

Most Retrieval-Augmented Generation (RAG) tutorials stop after embedding a PDF and querying it.

That is enough to demonstrate the concept.

It is not enough to build a production system.

Real-world RAG applications must handle document updates, metadata filtering, retrieval quality, chunking strategies, embedding versioning, observability, security, cost management, and performance optimization.

Without those pieces, a RAG system quickly becomes expensive, inaccurate, and difficult to maintain.

In this article, we'll build a production-ready RAG architecture using ASP.NET Core, PostgreSQL, pgvector, OpenAI embeddings, and GPT-4o.

More importantly, we'll focus on the engineering decisions that separate a proof of concept from a system you can confidently deploy.


What We'll Build

Our architecture looks like this:

                Documents

                     │

             Chunking Pipeline

                     │

           Generate Embeddings

                     │

      PostgreSQL + pgvector Storage

                     │

        Semantic Similarity Search

                     │

        Prompt Construction Layer

                     │

              GPT-4o Response

                     │

              ASP.NET Core API

The workflow is straightforward:

  1. Documents are uploaded.
  2. Documents are split into chunks.
  3. Chunks are converted into embeddings.
  4. Embeddings are stored in PostgreSQL using pgvector.
  5. User questions are embedded.
  6. Similar chunks are retrieved.
  7. A prompt is constructed using retrieved context.
  8. GPT-4o generates a grounded response.

Simple in theory.

The challenge lies in making each step reliable and scalable.


What is RAG?

Most readers already understand the basics, so we'll keep this brief.

Retrieval-Augmented Generation combines:

  • A retrieval system
  • A large language model

Instead of relying only on model training data, the model receives relevant information retrieved from your own documents.

This enables AI applications to answer questions about:

  • Internal documentation
  • Product specifications
  • Policies
  • Contracts
  • Knowledge bases
  • Customer support content

The primary benefit is not intelligence.

The primary benefit is access to private and current information.


Why PostgreSQL + pgvector?

Many vector database options exist.

Common choices include:

  • Pinecone
  • Azure AI Search
  • Weaviate
  • Chroma
  • PostgreSQL + pgvector

Let's compare them.

SolutionProsCons
PineconeManaged service, easy scalingAdditional infrastructure
Azure AI SearchStrong Azure integrationHigher complexity
WeaviateFeature-richOperational overhead
ChromaDeveloper-friendlyLimited production maturity
PostgreSQL + pgvectorFamiliar, affordable, simpleRequires database management

For many organizations, PostgreSQL is already part of the infrastructure stack.

Adding pgvector allows vector search without introducing another specialized database.

Benefits include:

  • Lower operational complexity
  • Existing backup strategies
  • Existing monitoring
  • Existing security controls
  • Reduced vendor lock-in

For internal enterprise systems, PostgreSQL is often the most practical starting point.


Designing the Chunking Strategy

Chunking has a larger impact on answer quality than many developers realize.

Poor chunking creates poor retrieval.

Poor retrieval creates poor answers.

No model can fix missing context.


Fixed-Size Chunking

Example:

Chunk 1: Tokens 1-500
Chunk 2: Tokens 501-1000
Chunk 3: Tokens 1001-1500

Advantages:

  • Simple
  • Fast
  • Easy to implement

Disadvantages:

  • Splits concepts arbitrarily
  • Breaks context boundaries

Fixed chunking is acceptable for simple datasets but rarely optimal.


Recursive Chunking

Recursive chunking respects document structure.

Example:

Section

↓

Paragraph

↓

Sentence

Advantages:

  • Better context preservation
  • More natural retrieval
  • Improved answer quality

This is often the preferred default strategy.


Markdown-Aware Chunking

Technical documentation often uses:

  • Headings
  • Lists
  • Tables
  • Code blocks

Chunking should preserve those structures.

Breaking a table across multiple chunks often destroys its meaning.

Markdown-aware chunking maintains logical boundaries.


Code-Aware Chunking

Code should never be treated like normal text.

Consider:

public class OrderService
{
}

Splitting a method in half creates unusable context.

Better strategies include:

  • Per class
  • Per method
  • Per file

Code-aware chunking significantly improves retrieval for developer-focused AI systems.


Choosing Embedding Models

OpenAI currently provides several embedding options.

For most applications, the decision comes down to:

  • text-embedding-3-small
  • text-embedding-3-large

text-embedding-3-small

Advantages:

  • Lower cost
  • Faster generation
  • Smaller storage footprint

Best for:

  • Internal tools
  • Large document volumes
  • Cost-sensitive applications

text-embedding-3-large

Advantages:

  • Higher semantic accuracy
  • Better retrieval quality

Tradeoffs:

  • Increased cost
  • Larger vectors

Best for:

  • Mission-critical search
  • High-value enterprise knowledge systems

The right choice depends on your accuracy requirements and budget.


Storing Metadata

Many tutorials store only content and embeddings.

That approach becomes limiting very quickly.

A production system should store metadata such as:

DocumentId
Title
Source
Section
Tags
CreatedAt
EmbeddingVersion

Metadata enables:

  • Filtering
  • Auditing
  • Traceability
  • Source attribution
  • Multi-tenant isolation

Example:

A user asks:

Show HR vacation policies.

The retrieval system can filter:

Department = HR

before semantic search begins.

This improves both accuracy and performance.


Semantic Search Flow

Retrieval follows a predictable sequence.

User Question

↓

Generate Embedding

↓

Vector Search

↓

Top K Results

↓

Prompt Builder

↓

GPT-4o

↓

Final Answer

Each step contributes to answer quality.

Failures earlier in the pipeline propagate downstream.


Retrieval Quality Matters More Than Model Size

A common misconception is that better models solve retrieval problems.

They do not.

If retrieval is poor, GPT-4o receives poor context.

The result remains poor.

Retrieval quality often matters more than model selection.


Top K Tuning

Top K determines how many chunks are retrieved.

Examples:

Top 3
Top 5
Top 10
Top 20

Small values:

  • Lower token costs
  • Faster responses

Large values:

  • More context
  • Higher costs
  • Increased noise

Most systems perform well between Top 5 and Top 10.

Testing is essential.


Similarity Thresholds

Every retrieved chunk should not automatically be accepted.

A similarity threshold helps remove irrelevant matches.

Without thresholds:

Question:
Vacation policy

Retrieved:
Database migration guide

Clearly not useful.

Thresholds improve precision and reduce prompt pollution.


Duplicate Chunk Removal

Documents often contain repeated content.

Examples:

  • Headers
  • Footers
  • Legal notices

Duplicate chunks waste context window space.

Deduplication should occur before prompt construction.


Context Window Management

Sending every retrieved chunk to GPT-4o is a mistake.

Large prompts increase:

  • Cost
  • Latency
  • Noise

The goal is not maximum context.

The goal is relevant context.


Prompt Construction

Prompt building deserves its own layer.

Do not concatenate raw chunks and hope for the best.

A strong prompt includes:

  • System instructions
  • Retrieved context
  • User question
  • Citation requirements

Example structure:

System Instructions

Retrieved Context

User Question

Answer Requirements

This provides consistency and improves answer quality.


Citations Increase Trust

Responses should reference sources whenever possible.

Example:

According to the Employee Handbook
[Source: HR-Handbook.pdf, Section 3.2]

Benefits:

  • Verifiability
  • Transparency
  • Easier debugging
  • Increased user trust

Users should know where information originated.


Handling Document Updates

Documents change.

Your embeddings must evolve with them.

Avoid rebuilding everything.

Instead:

Document Updated

↓

Identify Changed Chunks

↓

Re-Embed Changed Chunks

↓

Update Index

This reduces cost and processing time significantly.


Embedding Versioning

Embedding models evolve.

Store:

EmbeddingVersion

for every vector.

Benefits:

  • Controlled migrations
  • Incremental upgrades
  • Rollback capability

Without versioning, upgrades become painful.


Performance Optimization

Production RAG systems require optimization.


Cache Embeddings

Frequently repeated queries can reuse embeddings.

Benefits:

  • Lower costs
  • Faster responses

Batch Embedding Requests

Generate embeddings in batches whenever possible.

This reduces API overhead.


Async Ingestion

Document processing should run in background jobs.

Uploading a file should not block for minutes.


Background Indexing

Use hosted services or queue workers.

Benefits:

  • Better scalability
  • Better reliability

Connection Pooling

Database connections are expensive.

Configure pooling appropriately.

This becomes increasingly important as query volume grows.


Security Considerations

RAG introduces new attack surfaces.


Validate Uploaded Documents

Check:

  • File type
  • Size
  • Content

Never trust uploaded content.


Prevent Prompt Injection

Retrieved content may contain instructions such as:

Ignore previous instructions.

Treat retrieved content as data.

Not instructions.


Protect Embeddings

Embeddings may reveal information about underlying documents.

Never expose raw vectors through public APIs.


Row-Level Security

For multi-tenant systems, PostgreSQL Row-Level Security provides an additional layer of protection.

Users should only retrieve content they are authorized to access.


Logging and Observability

Good observability dramatically improves troubleshooting.

Track:

Embedding Latency
Retrieval Latency
Prompt Token Count
Completion Token Count
Retrieved Document IDs
Similarity Scores

These metrics help answer questions such as:

  • Why was this answer slow?
  • Why was this answer incorrect?
  • Which documents were retrieved?
  • Why did costs increase?

Without metrics, debugging becomes guesswork.


Common Mistakes

These issues appear repeatedly in production reviews.

One Embedding for an Entire PDF

Retrieval becomes extremely coarse.


Huge Chunks (5,000+ Tokens)

Large chunks reduce retrieval precision.


Ignoring Metadata

Filtering and traceability become difficult.


No Similarity Threshold

Irrelevant content pollutes prompts.


Sending Every Chunk to GPT

Higher cost and lower answer quality.


No Re-Indexing Strategy

Document updates become operationally painful.


No Source Attribution

Users cannot verify answers.


Repository Features

The reference implementation includes:

  • ASP.NET Core Web API
  • PostgreSQL + pgvector
  • Docker Compose
  • OpenAI Embeddings
  • GPT-4o Integration
  • Swagger
  • Background Document Indexing
  • Structured Logging
  • Sample Documents
  • Unit Tests

The goal is not simply to answer questions.

The goal is to build a maintainable knowledge retrieval platform.


Suggested Screenshots

Include the following visuals:

pgvector Table Structure

Show:

  • Documents
  • Chunks
  • Embeddings

Swagger Endpoint

Demonstrate:

POST /api/chat

and

POST /api/documents/upload

Architecture Diagram

Visualize the complete retrieval pipeline.


Retrieval Flow

Show:

Question
↓
Embedding
↓
Search
↓
Context
↓
GPT-4o

Example Response

Display citations and retrieved sources.

Visual evidence makes the architecture easier to understand.


Conclusion

Building a RAG system is not about storing vectors.

It's about building a retrieval pipeline that remains accurate, observable, maintainable, and cost-effective as data grows.

Chunking, metadata, retrieval quality, prompt construction, versioning, security, and observability have a greater impact on long-term success than the vector database itself.

A production-ready RAG system treats retrieval as an engineering discipline rather than a feature.

RAG is only one piece of the puzzle. Our application can now answer questions using private data, but it still relies on a single LLM provider. In the next article, we'll build an AI Gateway that allows us to switch between OpenAI, Azure OpenAI, Claude, and Ollama without changing our application code.

7 min read
Nov 16, 2024
By Dheer Gupta
Share

Leave a comment

Your email address will not be published. Required fields are marked *