I'm always excited to take on new projects and collaborate with innovative minds.

Social Links

Building Persistent AI Memory in ASP.NET Core with Redis, PostgreSQL, and pgvector

Most LLMs don't actually remember previous conversations. Learn how to build production-ready AI memory in ASP.NET Core using Redis, PostgreSQL, pgvector, OpenAI embeddings, semantic retrieval, conversation summarization, and prompt orchestration to create scalable and cost-effective AI applications.

 

A common misconception is that large language models remember previous conversations.

They don't.

Every request is stateless unless you provide context.

When ChatGPT appears to remember something, there is usually an application layer supplying relevant information to the model.

Many tutorials solve this by appending the entire conversation history to every request.

That works for demos.

It does not work for production systems.

As conversations grow, token costs increase, response times slow down, context windows fill up, and irrelevant information begins to pollute prompts.

In this article, we'll design a scalable AI memory architecture using ASP.NET Core, Redis, PostgreSQL, pgvector, and OpenAI embeddings.

Rather than replaying everything that has ever happened, we'll teach our application how to remember what matters.


What We'll Build

Our architecture combines multiple memory systems.

Each memory type serves a different purpose.

               User

                 │

          Current Message

                 │

        Memory Orchestrator

       ┌─────────┴──────────┐

       │                    │

 Short-Term Memory     Long-Term Memory

       │                    │

 Redis              PostgreSQL + pgvector

       │                    │

 Conversation      Semantic Retrieval

       └─────────┬──────────┘

                 │

          Prompt Builder

                 │

             GPT-4o

Instead of relying entirely on chat history, the system retrieves only the information relevant to the current conversation.

This dramatically improves scalability.


Why Chat History Doesn't Scale

The most common memory implementation looks like this:

Message 1
Message 2
Message 3
Message 4
...
Message 500

Every request sends everything.

Initially this seems harmless.

Over time it becomes expensive.


Context Window Limits

Every model has a maximum context window.

As conversations grow:

More Messages

↓

More Tokens

↓

Less Available Space

↓

Lost Context

Eventually older information must be removed.

The application must decide what remains.


Token Costs Increase

Consider a simple example.

Conversation LengthTokens Sent
10 Messages2,000
50 Messages12,000
100 Messages25,000
500 Messages120,000+

The cost growth is not linear from a business perspective.

Every new message requires resending previous messages.

You repeatedly pay for the same information.


Slower Responses

Larger prompts require:

  • More processing
  • More tokenization
  • More network transfer

Response times increase even when the question itself is simple.


Duplicate Information

Conversations often contain repeated details.

Example:

I use .NET.

I use PostgreSQL.

I build AI systems.

Sending the same information hundreds of times is wasteful.


Irrelevant Context

Not every message remains important forever.

A discussion about lunch from three months ago is unlikely to improve a technical support conversation today.

Memory systems should prioritize relevance over completeness.


Types of Memory

Production AI systems typically use multiple memory types.

Each solves a different problem.


Short-Term Memory

Short-term memory contains recent interactions.

Example:

Last 10 Messages

These messages preserve conversational flow.

Questions like:

What did you mean by that?

depend heavily on recent context.

Recent conversation belongs in fast storage.

Redis is an excellent choice.


Long-Term Memory

Long-term memory stores durable facts.

Examples:

User prefers .NET

Uses PostgreSQL

Works on AI products

These facts remain useful across many conversations.

Long-term memory survives sessions.


Semantic Memory

Semantic memory is where retrieval becomes powerful.

Instead of replaying everything:

Conversation History

↓

Retrieve Relevant Memory

↓

Use Only What Matters

The system searches memory using embeddings.

Relevant information is retrieved based on meaning rather than keywords.

This is often the most valuable memory layer.


Memory Architecture

When a new message arrives, several systems collaborate.

User Message

↓

Recent Messages

↓

Semantic Search

↓

Conversation Summary

↓

Prompt Builder

↓

LLM

This approach produces better results than relying exclusively on chat history.

Each memory source contributes different context.


Why Multiple Memory Sources Work Better

Recent messages provide conversational continuity.

Semantic memories provide relevant historical information.

Conversation summaries preserve older context.

User profiles store durable preferences.

No single memory type solves every problem.

Combining them creates a more robust system.


Conversation Summarization

One of the most effective optimizations is summarization.

Instead of storing hundreds of messages:

Conversation #14

↓

Summary

↓

Store Summary

↓

Discard Old Messages

Example summary:

User is building an ASP.NET Core AI platform.

Uses PostgreSQL and Redis.

Currently implementing RAG and Function Calling.

The summary preserves meaning while dramatically reducing token usage.


Memory Storage Strategy

Different memory types belong in different storage systems.

Memory TypeStorage
Recent ChatRedis
Conversation SummaryPostgreSQL
Semantic Memorypgvector
User PreferencesSQL

Each storage system is optimized for a specific workload.


Redis

Ideal for:

  • Fast access
  • Temporary conversation state
  • Session storage

Tradeoff:

Data is generally short-lived.


PostgreSQL

Ideal for:

  • Durable summaries
  • User profiles
  • Conversation metadata

Tradeoff:

Slightly slower than Redis.


pgvector

Ideal for:

  • Similarity search
  • Semantic retrieval
  • Memory relevance ranking

Tradeoff:

Additional embedding generation cost.


Memory Retrieval

Before calling GPT-4o, the application retrieves context.

The process typically looks like this:

Load Recent Messages

↓

Retrieve Similar Memories

↓

Load User Profile

↓

Build Prompt

↓

Call Model

A common mistake is retrieving everything.

The goal is relevance, not volume.


Prompt Construction

Prompt structure matters.

A recommended order is:

System Prompt

↓

Conversation Summary

↓

Relevant Memories

↓

Recent Messages

↓

Current Question

Why this order?

The model receives:

  1. Rules first.
  2. Historical context second.
  3. Recent context third.
  4. The current task last.

This helps the model prioritize information correctly.


Memory Expiration

Not all information deserves permanent storage.

Examples:

Data TypeRetention
Session DataExpire
User PreferencesKeep
Temporary TasksDelete
Semantic KnowledgeRetain

Without expiration policies, memory systems become cluttered.

Eventually retrieval quality declines.


Privacy Considerations

Memory introduces responsibility.

The more information we store, the more carefully we must manage it.


Encrypt Sensitive Data

Personal information should be encrypted at rest.

Especially:

  • Names
  • Addresses
  • Financial information

Support Memory Deletion

Users should be able to remove stored memories.

This is increasingly important for compliance requirements.


Separate Personal Data

Avoid combining:

Personal Data

+

Semantic Memory

Store them separately whenever possible.


Be Careful with Embeddings

Embeddings can contain representations of sensitive information.

Not every piece of data should be embedded.

Especially:

  • Credentials
  • Secrets
  • Financial records

Performance Optimizations

Memory systems can become expensive without optimization.


Cache Recent Messages

Redis provides fast retrieval for active conversations.


Batch Embedding Generation

Generate embeddings in batches whenever possible.

This reduces API overhead.


Background Summarization

Summaries should not block user interactions.

Generate them asynchronously.


Async Writes

Memory storage operations should not delay responses.

Use asynchronous persistence.


Lazy Retrieval

Only retrieve semantic memories when needed.

Not every request requires vector search.


Logging and Observability

Memory systems should be measurable.

Track:

Memory Retrieval Time

Retrieved Memory Count

Summary Duration

Prompt Size

Token Savings

These metrics help answer questions such as:

  • Why was retrieval slow?
  • Why did token usage increase?
  • Why was a memory not found?
  • How effective are summaries?

Without observability, memory quality becomes difficult to improve.


Common Mistakes

These issues appear frequently in production systems.


Sending the Entire Conversation Every Time

This increases cost and latency.


Never Summarizing

Long conversations become unsustainable.


No Memory Expiration

Storage grows endlessly.

Retrieval quality suffers.


Mixing Business Data with Chat History

Creates maintenance and security problems.


Ignoring Retrieval Relevance

More memories does not mean better answers.


Embedding Every Single Message

Many messages have little long-term value.

Embedding everything increases costs unnecessarily.


Repository Features

The reference implementation includes:

  • ASP.NET Core Web API
  • Redis Memory Store
  • PostgreSQL
  • pgvector
  • OpenAI Embeddings
  • Semantic Memory Retrieval
  • Automatic Conversation Summarization
  • Swagger
  • Docker Compose
  • Structured Logging

The goal is not simply to store conversations.

The goal is to build a memory system that remains useful, efficient, and affordable as conversations grow.


Suggested Screenshots

Include the following visuals.

Memory Architecture Diagram

Show the complete memory orchestration workflow.


Redis Keys

Display conversation storage examples.


pgvector Tables

Show semantic memory storage.


Prompt Composition Flow

Visualize:

Summary

↓

Memories

↓

Recent Messages

↓

Question

Token Usage Comparison

Show:

Before Summarization

vs

After Summarization

This visual often communicates the value of memory optimization immediately.


Conclusion

Building AI memory is not about storing more conversations.

It's about storing the right information in the right place and retrieving it at the right time.

Short-term memory, long-term memory, semantic retrieval, and conversation summaries each solve different problems. Together they create an experience that feels intelligent without overwhelming the model with unnecessary context.

By combining Redis, PostgreSQL, pgvector, embeddings, and summarization strategies, we can build AI systems that scale far beyond simple chat history.

Our AI can now remember conversations efficiently, but operating an AI system in production requires visibility into prompts, latency, token usage, and failures.

In the next article, we'll build an observability layer for AI applications in ASP.NET Core.

7 min read
May 10, 2025
By Dheer Gupta
Share

Leave a comment

Your email address will not be published. Required fields are marked *