Building Persistent AI Memory in ASP.NET Core with Redis, PostgreSQL, and pgvector

Most LLMs don't actually remember previous conversations. Learn how to build production-ready AI memory in ASP.NET Core using Redis, PostgreSQL, pgvector, OpenAI embeddings, semantic retrieval, conversation summarization, and prompt orchestration to create scalable and cost-effective AI applications.

A common misconception is that large language models remember previous conversations.

They don't.

Every request is stateless unless you provide context.

When ChatGPT appears to remember something, there is usually an application layer supplying relevant information to the model.

Many tutorials solve this by appending the entire conversation history to every request.

That works for demos.

It does not work for production systems.

As conversations grow, token costs increase, response times slow down, context windows fill up, and irrelevant information begins to pollute prompts.

In this article, we'll design a scalable AI memory architecture using ASP.NET Core, Redis, PostgreSQL, pgvector, and OpenAI embeddings.

Rather than replaying everything that has ever happened, we'll teach our application how to remember what matters.

What We'll Build

Our architecture combines multiple memory systems.

Each memory type serves a different purpose.

               User

                 │

          Current Message

                 │

        Memory Orchestrator

       ┌─────────┴──────────┐

       │                    │

 Short-Term Memory     Long-Term Memory

       │                    │

 Redis              PostgreSQL + pgvector

       │                    │

 Conversation      Semantic Retrieval

       └─────────┬──────────┘

                 │

          Prompt Builder

                 │

             GPT-4o

Instead of relying entirely on chat history, the system retrieves only the information relevant to the current conversation.

This dramatically improves scalability.

Why Chat History Doesn't Scale

The most common memory implementation looks like this:

Message 1
Message 2
Message 3
Message 4
...
Message 500

Every request sends everything.

Initially this seems harmless.

Over time it becomes expensive.

Context Window Limits

Every model has a maximum context window.

As conversations grow:

More Messages

↓

More Tokens

↓

Less Available Space

↓

Lost Context

Eventually older information must be removed.

The application must decide what remains.

Token Costs Increase

Consider a simple example.

Conversation Length	Tokens Sent
10 Messages	2,000
50 Messages	12,000
100 Messages	25,000
500 Messages	120,000+

The cost growth is not linear from a business perspective.

Every new message requires resending previous messages.

You repeatedly pay for the same information.

Slower Responses

Larger prompts require:

More processing
More tokenization
More network transfer

Response times increase even when the question itself is simple.

Duplicate Information

Conversations often contain repeated details.

Example:

I use .NET.

I use PostgreSQL.

I build AI systems.

Sending the same information hundreds of times is wasteful.

Irrelevant Context

Not every message remains important forever.

A discussion about lunch from three months ago is unlikely to improve a technical support conversation today.

Memory systems should prioritize relevance over completeness.

Types of Memory

Production AI systems typically use multiple memory types.

Each solves a different problem.

Short-Term Memory

Short-term memory contains recent interactions.

Example:

Last 10 Messages

These messages preserve conversational flow.

Questions like:

What did you mean by that?

depend heavily on recent context.

Recent conversation belongs in fast storage.

Redis is an excellent choice.

Long-Term Memory

Long-term memory stores durable facts.

Examples:

User prefers .NET

Uses PostgreSQL

Works on AI products

These facts remain useful across many conversations.

Long-term memory survives sessions.

Semantic Memory

Semantic memory is where retrieval becomes powerful.

Instead of replaying everything:

Conversation History

↓

Retrieve Relevant Memory

↓

Use Only What Matters

The system searches memory using embeddings.

Relevant information is retrieved based on meaning rather than keywords.

This is often the most valuable memory layer.

Memory Architecture

When a new message arrives, several systems collaborate.

User Message

↓

Recent Messages

↓

Semantic Search

↓

Conversation Summary

↓

Prompt Builder

↓

LLM

This approach produces better results than relying exclusively on chat history.

Each memory source contributes different context.

Why Multiple Memory Sources Work Better

Recent messages provide conversational continuity.

Semantic memories provide relevant historical information.

Conversation summaries preserve older context.

User profiles store durable preferences.

No single memory type solves every problem.

Combining them creates a more robust system.

Conversation Summarization

One of the most effective optimizations is summarization.

Instead of storing hundreds of messages:

Conversation #14

↓

Summary

↓

Store Summary

↓

Discard Old Messages

Example summary:

User is building an ASP.NET Core AI platform.

Uses PostgreSQL and Redis.

Currently implementing RAG and Function Calling.

The summary preserves meaning while dramatically reducing token usage.

Memory Storage Strategy

Different memory types belong in different storage systems.

Memory Type	Storage
Recent Chat	Redis
Conversation Summary	PostgreSQL
Semantic Memory	pgvector
User Preferences	SQL

Each storage system is optimized for a specific workload.

Redis

Ideal for:

Fast access
Temporary conversation state
Session storage

Tradeoff:

Data is generally short-lived.

PostgreSQL

Ideal for:

Durable summaries
User profiles
Conversation metadata

Tradeoff:

Slightly slower than Redis.

pgvector

Ideal for:

Similarity search
Semantic retrieval
Memory relevance ranking

Tradeoff:

Additional embedding generation cost.

Memory Retrieval

Before calling GPT-4o, the application retrieves context.

The process typically looks like this:

Load Recent Messages

↓

Retrieve Similar Memories

↓

Load User Profile

↓

Build Prompt

↓

Call Model

A common mistake is retrieving everything.

The goal is relevance, not volume.

Prompt Construction

Prompt structure matters.

A recommended order is:

System Prompt

↓

Conversation Summary

↓

Relevant Memories

↓

Recent Messages

↓

Current Question

Why this order?

The model receives:

Rules first.
Historical context second.
Recent context third.
The current task last.

This helps the model prioritize information correctly.

Memory Expiration

Not all information deserves permanent storage.

Examples:

Data Type	Retention
Session Data	Expire
User Preferences	Keep
Temporary Tasks	Delete
Semantic Knowledge	Retain

Without expiration policies, memory systems become cluttered.

Eventually retrieval quality declines.

Privacy Considerations

Memory introduces responsibility.

The more information we store, the more carefully we must manage it.

Encrypt Sensitive Data

Personal information should be encrypted at rest.

Especially:

Names
Addresses
Financial information

Support Memory Deletion

Users should be able to remove stored memories.

This is increasingly important for compliance requirements.

Separate Personal Data

Avoid combining:

Personal Data

+

Semantic Memory

Store them separately whenever possible.

Be Careful with Embeddings

Embeddings can contain representations of sensitive information.

Not every piece of data should be embedded.

Especially:

Credentials
Secrets
Financial records

Performance Optimizations

Memory systems can become expensive without optimization.

Cache Recent Messages

Redis provides fast retrieval for active conversations.

Batch Embedding Generation

Generate embeddings in batches whenever possible.

This reduces API overhead.

Background Summarization

Summaries should not block user interactions.

Generate them asynchronously.

Async Writes

Memory storage operations should not delay responses.

Use asynchronous persistence.

Lazy Retrieval

Only retrieve semantic memories when needed.

Not every request requires vector search.

Logging and Observability

Memory systems should be measurable.

Track:

Memory Retrieval Time

Retrieved Memory Count

Summary Duration

Prompt Size

Token Savings

These metrics help answer questions such as:

Why was retrieval slow?
Why did token usage increase?
Why was a memory not found?
How effective are summaries?

Without observability, memory quality becomes difficult to improve.

Common Mistakes

These issues appear frequently in production systems.

Sending the Entire Conversation Every Time

This increases cost and latency.

Never Summarizing

Long conversations become unsustainable.

No Memory Expiration

Storage grows endlessly.

Retrieval quality suffers.

Mixing Business Data with Chat History

Creates maintenance and security problems.

Ignoring Retrieval Relevance

More memories does not mean better answers.

Embedding Every Single Message

Many messages have little long-term value.

Embedding everything increases costs unnecessarily.

Repository Features

The reference implementation includes:

ASP.NET Core Web API
Redis Memory Store
PostgreSQL
pgvector
OpenAI Embeddings
Semantic Memory Retrieval
Automatic Conversation Summarization
Swagger
Docker Compose
Structured Logging

The goal is not simply to store conversations.

The goal is to build a memory system that remains useful, efficient, and affordable as conversations grow.

Suggested Screenshots

Include the following visuals.

Memory Architecture Diagram

Show the complete memory orchestration workflow.

Redis Keys

Display conversation storage examples.

pgvector Tables

Show semantic memory storage.

Prompt Composition Flow

Visualize:

Summary

↓

Memories

↓

Recent Messages

↓

Question

Token Usage Comparison

Show:

Before Summarization

After Summarization

This visual often communicates the value of memory optimization immediately.

Conclusion

Building AI memory is not about storing more conversations.

It's about storing the right information in the right place and retrieving it at the right time.

Short-term memory, long-term memory, semantic retrieval, and conversation summaries each solve different problems. Together they create an experience that feels intelligent without overwhelming the model with unnecessary context.

By combining Redis, PostgreSQL, pgvector, embeddings, and summarization strategies, we can build AI systems that scale far beyond simple chat history.

Our AI can now remember conversations efficiently, but operating an AI system in production requires visibility into prompts, latency, token usage, and failures.

In the next article, we'll build an observability layer for AI applications in ASP.NET Core.

7 min read

May 10, 2025

By Dheer Gupta

Your email address will not be published. Required fields are marked *

Comment

Name

Website

Save my name, email, and website in this browser for the next time I comment.

Building Persistent AI Memory in ASP.NET Core with Redis, PostgreSQL, and pgvector

What We'll Build

Why Chat History Doesn't Scale

Context Window Limits

Token Costs Increase

Slower Responses

Duplicate Information

Irrelevant Context

Types of Memory

Short-Term Memory

Long-Term Memory

Semantic Memory

Memory Architecture

Why Multiple Memory Sources Work Better

Conversation Summarization

Memory Storage Strategy

Redis

PostgreSQL

pgvector

Memory Retrieval

Prompt Construction

Memory Expiration

Privacy Considerations

Encrypt Sensitive Data

Support Memory Deletion

Separate Personal Data

Be Careful with Embeddings

Performance Optimizations

Cache Recent Messages

Batch Embedding Generation

Background Summarization

Async Writes

Lazy Retrieval

Logging and Observability

Common Mistakes

Sending the Entire Conversation Every Time

Never Summarizing

No Memory Expiration

Mixing Business Data with Chat History

Ignoring Retrieval Relevance

Embedding Every Single Message

Repository Features

Suggested Screenshots

Memory Architecture Diagram

Redis Keys

pgvector Tables

Prompt Composition Flow

Token Usage Comparison

Conclusion

Leave a comment