I'm always excited to take on new projects and collaborate with innovative minds.
Most LLMs don't actually remember previous conversations. Learn how to build production-ready AI memory in ASP.NET Core using Redis, PostgreSQL, pgvector, OpenAI embeddings, semantic retrieval, conversation summarization, and prompt orchestration to create scalable and cost-effective AI applications.
A common misconception is that large language models remember previous conversations.
They don't.
Every request is stateless unless you provide context.
When ChatGPT appears to remember something, there is usually an application layer supplying relevant information to the model.
Many tutorials solve this by appending the entire conversation history to every request.
That works for demos.
It does not work for production systems.
As conversations grow, token costs increase, response times slow down, context windows fill up, and irrelevant information begins to pollute prompts.
In this article, we'll design a scalable AI memory architecture using ASP.NET Core, Redis, PostgreSQL, pgvector, and OpenAI embeddings.
Rather than replaying everything that has ever happened, we'll teach our application how to remember what matters.
Our architecture combines multiple memory systems.
Each memory type serves a different purpose.
User
│
Current Message
│
Memory Orchestrator
┌─────────┴──────────┐
│ │
Short-Term Memory Long-Term Memory
│ │
Redis PostgreSQL + pgvector
│ │
Conversation Semantic Retrieval
└─────────┬──────────┘
│
Prompt Builder
│
GPT-4o
Instead of relying entirely on chat history, the system retrieves only the information relevant to the current conversation.
This dramatically improves scalability.
The most common memory implementation looks like this:
Message 1
Message 2
Message 3
Message 4
...
Message 500
Every request sends everything.
Initially this seems harmless.
Over time it becomes expensive.
Every model has a maximum context window.
As conversations grow:
More Messages
↓
More Tokens
↓
Less Available Space
↓
Lost Context
Eventually older information must be removed.
The application must decide what remains.
Consider a simple example.
| Conversation Length | Tokens Sent |
|---|---|
| 10 Messages | 2,000 |
| 50 Messages | 12,000 |
| 100 Messages | 25,000 |
| 500 Messages | 120,000+ |
The cost growth is not linear from a business perspective.
Every new message requires resending previous messages.
You repeatedly pay for the same information.
Larger prompts require:
Response times increase even when the question itself is simple.
Conversations often contain repeated details.
Example:
I use .NET.
I use PostgreSQL.
I build AI systems.
Sending the same information hundreds of times is wasteful.
Not every message remains important forever.
A discussion about lunch from three months ago is unlikely to improve a technical support conversation today.
Memory systems should prioritize relevance over completeness.
Production AI systems typically use multiple memory types.
Each solves a different problem.
Short-term memory contains recent interactions.
Example:
Last 10 Messages
These messages preserve conversational flow.
Questions like:
What did you mean by that?
depend heavily on recent context.
Recent conversation belongs in fast storage.
Redis is an excellent choice.
Long-term memory stores durable facts.
Examples:
User prefers .NET
Uses PostgreSQL
Works on AI products
These facts remain useful across many conversations.
Long-term memory survives sessions.
Semantic memory is where retrieval becomes powerful.
Instead of replaying everything:
Conversation History
↓
Retrieve Relevant Memory
↓
Use Only What Matters
The system searches memory using embeddings.
Relevant information is retrieved based on meaning rather than keywords.
This is often the most valuable memory layer.
When a new message arrives, several systems collaborate.
User Message
↓
Recent Messages
↓
Semantic Search
↓
Conversation Summary
↓
Prompt Builder
↓
LLM
This approach produces better results than relying exclusively on chat history.
Each memory source contributes different context.
Recent messages provide conversational continuity.
Semantic memories provide relevant historical information.
Conversation summaries preserve older context.
User profiles store durable preferences.
No single memory type solves every problem.
Combining them creates a more robust system.
One of the most effective optimizations is summarization.
Instead of storing hundreds of messages:
Conversation #14
↓
Summary
↓
Store Summary
↓
Discard Old Messages
Example summary:
User is building an ASP.NET Core AI platform.
Uses PostgreSQL and Redis.
Currently implementing RAG and Function Calling.
The summary preserves meaning while dramatically reducing token usage.
Different memory types belong in different storage systems.
| Memory Type | Storage |
|---|---|
| Recent Chat | Redis |
| Conversation Summary | PostgreSQL |
| Semantic Memory | pgvector |
| User Preferences | SQL |
Each storage system is optimized for a specific workload.
Ideal for:
Tradeoff:
Data is generally short-lived.
Ideal for:
Tradeoff:
Slightly slower than Redis.
Ideal for:
Tradeoff:
Additional embedding generation cost.
Before calling GPT-4o, the application retrieves context.
The process typically looks like this:
Load Recent Messages
↓
Retrieve Similar Memories
↓
Load User Profile
↓
Build Prompt
↓
Call Model
A common mistake is retrieving everything.
The goal is relevance, not volume.
Prompt structure matters.
A recommended order is:
System Prompt
↓
Conversation Summary
↓
Relevant Memories
↓
Recent Messages
↓
Current Question
Why this order?
The model receives:
This helps the model prioritize information correctly.
Not all information deserves permanent storage.
Examples:
| Data Type | Retention |
|---|---|
| Session Data | Expire |
| User Preferences | Keep |
| Temporary Tasks | Delete |
| Semantic Knowledge | Retain |
Without expiration policies, memory systems become cluttered.
Eventually retrieval quality declines.
Memory introduces responsibility.
The more information we store, the more carefully we must manage it.
Personal information should be encrypted at rest.
Especially:
Users should be able to remove stored memories.
This is increasingly important for compliance requirements.
Avoid combining:
Personal Data
+
Semantic Memory
Store them separately whenever possible.
Embeddings can contain representations of sensitive information.
Not every piece of data should be embedded.
Especially:
Memory systems can become expensive without optimization.
Redis provides fast retrieval for active conversations.
Generate embeddings in batches whenever possible.
This reduces API overhead.
Summaries should not block user interactions.
Generate them asynchronously.
Memory storage operations should not delay responses.
Use asynchronous persistence.
Only retrieve semantic memories when needed.
Not every request requires vector search.
Memory systems should be measurable.
Track:
Memory Retrieval Time
Retrieved Memory Count
Summary Duration
Prompt Size
Token Savings
These metrics help answer questions such as:
Without observability, memory quality becomes difficult to improve.
These issues appear frequently in production systems.
This increases cost and latency.
Long conversations become unsustainable.
Storage grows endlessly.
Retrieval quality suffers.
Creates maintenance and security problems.
More memories does not mean better answers.
Many messages have little long-term value.
Embedding everything increases costs unnecessarily.
The reference implementation includes:
The goal is not simply to store conversations.
The goal is to build a memory system that remains useful, efficient, and affordable as conversations grow.
Include the following visuals.
Show the complete memory orchestration workflow.
Display conversation storage examples.
Show semantic memory storage.
Visualize:
Summary
↓
Memories
↓
Recent Messages
↓
Question
Show:
Before Summarization
vs
After Summarization
This visual often communicates the value of memory optimization immediately.
Building AI memory is not about storing more conversations.
It's about storing the right information in the right place and retrieving it at the right time.
Short-term memory, long-term memory, semantic retrieval, and conversation summaries each solve different problems. Together they create an experience that feels intelligent without overwhelming the model with unnecessary context.
By combining Redis, PostgreSQL, pgvector, embeddings, and summarization strategies, we can build AI systems that scale far beyond simple chat history.
Our AI can now remember conversations efficiently, but operating an AI system in production requires visibility into prompts, latency, token usage, and failures.
In the next article, we'll build an observability layer for AI applications in ASP.NET Core.
Your email address will not be published. Required fields are marked *