Most AI-backed APIs wait for the entire model response before returning data. In this article, you'll learn how to stream OpenAI responses token by token using Server-Sent Events (SSE) in ASP.NET Core, creating a real-time ChatGPT-like experience with proper cancellation, logging, error handling, and performance optimizations.
Most OpenAI tutorials wait for the model to finish generating before returning a response.
The API receives a prompt, sends it to OpenAI, waits several seconds, and finally returns the completed answer.
Technically it works.
From a user experience perspective, it feels slow.
Users stare at a loading spinner wondering whether anything is happening.
ChatGPT doesn't work that way.
Responses appear token by token as they are generated. The total generation time may be identical, but the application feels significantly faster because users see progress immediately.
That difference matters.
In this article, we'll build the same streaming experience in ASP.NET Core using Server-Sent Events (SSE).
By the end, you'll have a production-ready streaming API that supports cancellation, error handling, logging, and real-time token delivery.
Our architecture looks like this:
Browser / React / Angular
│
EventSource (SSE)
│
ASP.NET Core API
│
OpenAI Streaming API
│
GPT-4o
The browser opens a persistent connection to the API.
The API streams tokens as they arrive from OpenAI.
The UI updates immediately instead of waiting for the complete response.
Many developers focus only on total response time.
Users care about perceived response time.
Consider these two scenarios.
Without streaming:
User clicks Send
Wait 8 seconds
Response appears
With streaming:
User clicks Send
200ms later:
"H"
300ms later:
"He"
400ms later:
"Hel"
...
Response continues streaming
The second experience feels dramatically faster despite taking the same total time.
Streaming gives users immediate feedback and visible progress.
This is why virtually every modern AI application streams responses.
Before implementing streaming, we need to choose a transport mechanism.
Several options exist.
Polling repeatedly asks the server whether new data is available.
Client
↓
Request
No Data
Request
No Data
Request
New Data
Most requests return nothing, and new tokens sit on the server until the next poll, so polling should generally be avoided for AI streaming.
WebSockets provide full duplex communication.
Client ↔ Server
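Advantages: bidirectional, low-latency messaging over a single connection.
Disadvantages: connection management and reconnection logic that a one-way stream does not need.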
WebSockets are excellent for chat systems where both client and server actively send messages.
For one-way token streaming, they are often unnecessary.
SignalR builds on WebSockets and fallback transports.
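Advantages: automatic transport negotiation and a higher-level hub programming model.
Disadvantages: a dedicated client library and more infrastructure than one-way token streaming requires.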
SignalR is ideal for collaborative applications, dashboards, multiplayer experiences, and complex real-time systems.
Server-Sent Events are designed specifically for server-to-client streaming.
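Benefits: it runs over plain HTTP, browsers support it natively through EventSource, and reconnection is built in.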
Communication is one-way.
That's exactly what we need.
The browser sends a prompt.
The server streams tokens back.
No additional complexity required.
Instead of returning JSON, we'll return a stream.
Endpoint:
GET /api/chat/stream
Content Type:
text/event-stream
This tells the browser to keep the connection open and process incoming events continuously.
Several headers are important.
Response.Headers.Append("Content-Type", "text/event-stream");
You should also disable caching:
Response.Headers.Append("Cache-Control", "no-cache");
You will often see the connection marked keep-alive as well. HTTP/1.1 connections are persistent by default and HTTP/2 does not use this header, so treat it as optional:
Response.Headers.Append("Connection", "keep-alive");
These headers ensure the browser treats the response as a live stream rather than a traditional HTTP response.
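The headers only declare the stream; each message still has to follow the SSE wire format, where an event is one or more `data:` lines terminated by a blank line. A minimal sketch of that framing follows. The `SseFormatter` class and its `Format` method are hypothetical names introduced for illustration, not part of ASP.NET Core.

```csharp
using System;
using System.Text;

// Hypothetical helper (not a framework type): frames a payload as one SSE event.
public static class SseFormatter
{
    public static string Format(string data, string? eventName = null)
    {
        var sb = new StringBuilder();

        // Optional event name; unnamed events reach EventSource.onmessage.
        if (!string.IsNullOrEmpty(eventName))
            sb.Append("event: ").Append(eventName).Append('\n');

        // Normalize line endings, then emit one "data:" line per payload line
        // so a token containing a newline cannot end the event early.
        foreach (var line in data.Replace("\r\n", "\n").Replace('\r', '\n').Split('\n'))
            sb.Append("data: ").Append(line).Append('\n');

        sb.Append('\n'); // the blank line terminates the event
        return sb.ToString();
    }
}
```

For example, `SseFormatter.Format("Hello")` yields the text `data: Hello` followed by a blank line, which is what the browser's EventSource parser expects.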
Traditional implementations wait for the entire completion.
Streaming works differently.
OpenAI sends chunks as they become available.
Instead of:
Complete Response
You receive:
H
e
l
l
o
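one chunk at a time. In practice a chunk is usually a word or word fragment rather than a single character.
As each chunk arrives, the API writes it to the response and flushes it, so the client receives updates in real time.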
Many developers stream data but forget to flush.
Without flushing:
Tokens buffered
Nothing reaches browser
Response appears all at once
Which defeats the entire purpose.
After writing each token:
await Response.Body.FlushAsync(cancellationToken);
This pushes data immediately to the client.
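To make the write-then-flush loop concrete, here is a minimal sketch that works against any `Stream`. The `SseWriter` class is a hypothetical name, and the sketch assumes each token is a single line of text; tokens containing newlines would need one `data:` line per line.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: writes each token as an SSE event and flushes right away.
public static class SseWriter
{
    public static async Task WriteTokensAsync(
        Stream body,
        IAsyncEnumerable<string> tokens,
        CancellationToken cancellationToken)
    {
        await foreach (var token in tokens.WithCancellation(cancellationToken))
        {
            // Assumes single-line tokens (see the note above).
            byte[] frame = Encoding.UTF8.GetBytes($"data: {token}\n\n");
            await body.WriteAsync(frame, cancellationToken);

            // Without this flush the frame may sit in a buffer
            // until the response ends.
            await body.FlushAsync(cancellationToken);
        }
    }
}
```

In a controller action this could be called as `await SseWriter.WriteTokensAsync(Response.Body, tokens, HttpContext.RequestAborted);` once the headers are set.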
Cancellation support is not optional.
It's one of the most important parts of a production streaming API.
Imagine a user who closes the tab or navigates away while a response is still streaming.
Without cancellation handling:
User disconnected
OpenAI keeps generating
You keep paying
The server continues consuming tokens nobody will ever see.
Always pass CancellationToken through every layer.
Example:
public async Task StreamAsync(string prompt, CancellationToken cancellationToken)
When the client disconnects, the token is cancelled and every layer that observes it, including the upstream OpenAI request, can stop promptly.
ASP.NET Core automatically exposes request cancellation.
HttpContext.RequestAborted
This token is triggered when the client disconnects or the request is aborted.
Pass it through to the OpenAI client call.
Once cancellation occurs:
Stop streaming
Stop generating
Release resources
This reduces unnecessary token costs and improves scalability.
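The effect of passing the token through can be sketched without any HTTP at all. In the example below, `FakeModel` is a hypothetical stand-in for the provider call, and a local `CancellationTokenSource` plays the role of `HttpContext.RequestAborted`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a provider that emits one token per word.
public static class FakeModel
{
    public static async IAsyncEnumerable<string> GenerateAsync(
        string prompt,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var word in prompt.Split(' '))
        {
            // Task.Delay throws OperationCanceledException once the token is cancelled.
            await Task.Delay(50, cancellationToken);
            yield return word;
        }
    }
}

public static class CancellationDemo
{
    // Returns how many tokens were delivered before the simulated disconnect.
    public static async Task<int> StreamUntilCancelledAsync(string prompt, int cancelAfterTokens)
    {
        // Plays the role of HttpContext.RequestAborted in a real endpoint.
        using var requestAborted = new CancellationTokenSource();
        int delivered = 0;
        try
        {
            await foreach (var token in FakeModel.GenerateAsync(prompt, requestAborted.Token))
            {
                delivered++;
                if (delivered == cancelAfterTokens)
                    requestAborted.Cancel(); // simulate the browser tab closing
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on disconnect: stop quietly, nobody is listening.
        }
        return delivered;
    }
}
```

Because the generator observes the same token, it stops at its next delay instead of producing the remaining words.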
Traditional APIs can return a standard error response.
Streaming APIs are different.
The response may already have started when a failure occurs, so the status code and headers can no longer be changed.
Consider:
429 Too Many Requests
Request exceeded timeout
Connection interrupted
In these situations, avoid exposing raw provider exceptions.
The client should receive a meaningful message while logs capture technical details.
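One way to do that is to send a final named event carrying a generic message and keep the exception on the server. The sketch below assumes hypothetical event names (`done`, `stream-error`) and a caller-supplied `logError` callback. Named events are not delivered to `onmessage`, so the client would subscribe with `addEventListener`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: streams tokens and converts mid-stream failures
// into a generic SSE event instead of leaking provider details.
public static class SafeStreamer
{
    public static async Task StreamAsync(
        IAsyncEnumerable<string> tokens,
        TextWriter output,
        Action<Exception> logError)
    {
        try
        {
            await foreach (var token in tokens)
            {
                await output.WriteAsync($"data: {token}\n\n");
                await output.FlushAsync();
            }
            // Tell the client the stream finished normally.
            await output.WriteAsync("event: done\ndata: end\n\n");
            await output.FlushAsync();
        }
        catch (OperationCanceledException)
        {
            // Client went away; there is nobody left to notify.
        }
        catch (Exception ex)
        {
            logError(ex); // full technical details stay server-side
            await output.WriteAsync(
                "event: stream-error\ndata: The response was interrupted. Please try again.\n\n");
            await output.FlushAsync();
        }
    }
}
```

The name `stream-error` is deliberate: an event named plain `error` would also trigger the EventSource connection error handler.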
One common mistake is logging every streamed token.
Don't.
This creates excessive log volume, higher storage costs, and privacy risks.
Instead, log one summary per stream:
Request Id
Duration
Completion Status
Prompt Tokens
Completion Tokens
Total Tokens
These metrics provide operational visibility without overwhelming your logging system.
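As a rough illustration, the fields above fit in a small record that is written once when the stream ends. `StreamSummary` and `ToLogLine` are hypothetical names, and the token counts are assumed to come from the provider's usage data.

```csharp
using System;

// Hypothetical summary type: one instance per finished stream.
public sealed record StreamSummary(
    string RequestId,
    TimeSpan Duration,
    string CompletionStatus,
    int PromptTokens,
    int CompletionTokens)
{
    public int TotalTokens => PromptTokens + CompletionTokens;

    // Renders one log line; no generated text is included.
    public string ToLogLine() =>
        $"stream {RequestId} {CompletionStatus} in {(long)Duration.TotalMilliseconds} ms " +
        $"(prompt={PromptTokens}, completion={CompletionTokens}, total={TotalTokens})";
}
```

With `ILogger`, the same fields could be passed as structured properties in a single `LogInformation` call rather than formatted into a string.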
Streaming performance depends on many small decisions.
Buffered responses delay token delivery.
Streaming should bypass buffering whenever possible.
Each token or chunk should be flushed immediately.
This improves perceived responsiveness.
Repeated string concatenation can become expensive because each concatenation allocates a new string.
Prefer a StringBuilder when you need to accumulate the full response.
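A minimal sketch of that accumulation, assuming you want the complete text after streaming (for example to store it). `ResponseAccumulator` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: collects streamed tokens into one string.
public static class ResponseAccumulator
{
    public static async Task<string> CollectAsync(IAsyncEnumerable<string> tokens)
    {
        // One growing buffer instead of a new string per token.
        var buffer = new StringBuilder();
        await foreach (var token in tokens)
            buffer.Append(token);
        return buffer.ToString();
    }
}
```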
Streaming is fundamentally I/O bound.
Avoid blocking calls.
Use async operations throughout the entire pipeline.
For provider abstractions, async streams often provide a clean design.
Example:
IAsyncEnumerable<string>
This aligns naturally with token streaming scenarios.
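One possible shape for such an abstraction is sketched below. `IChatStreamProvider` and `EchoStreamProvider` are hypothetical names; the echo implementation stands in for a real OpenAI-backed one and is mainly useful for tests.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction: callers see tokens, not provider SDK types.
public interface IChatStreamProvider
{
    IAsyncEnumerable<string> StreamAsync(
        string prompt,
        CancellationToken cancellationToken = default);
}

// Test double that echoes the prompt back one word at a time.
public sealed class EchoStreamProvider : IChatStreamProvider
{
    public async IAsyncEnumerable<string> StreamAsync(
        string prompt,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var word in prompt.Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Yield(); // keep the iterator genuinely asynchronous
            yield return word + " ";
        }
    }
}
```

The endpoint can then depend on the interface only, which keeps the SSE code unchanged if the provider is swapped.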
Streaming endpoints deserve the same protection as any other AI endpoint.
Cap prompt length to prevent abuse and excessive token consumption, for example at a maximum of 5,000 characters or whatever limit fits your application.
One user should not be able to open hundreds of active streams.
Protect resources with concurrency limits.
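A per-user cap can be sketched with a concurrent counter. `StreamLimiter` is a hypothetical class, and the user id is assumed to come from your authentication layer. The endpoint would call `TryAcquire` before streaming and `Release` in a `finally` block.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-memory limiter: caps simultaneous streams per user.
public sealed class StreamLimiter
{
    private readonly ConcurrentDictionary<string, int> _active = new();
    private readonly int _maxPerUser;

    public StreamLimiter(int maxPerUser) => _maxPerUser = maxPerUser;

    // Returns false when the user already has the maximum number of streams open.
    public bool TryAcquire(string userId)
    {
        while (true)
        {
            int current = _active.GetOrAdd(userId, 0);
            if (current >= _maxPerUser) return false;

            // Compare-and-swap; retry if another request changed the count.
            if (_active.TryUpdate(userId, current + 1, current)) return true;
        }
    }

    // Call when a stream ends, whether it completed, failed, or was cancelled.
    public void Release(string userId) =>
        _active.AddOrUpdate(userId, 0, (_, count) => Math.Max(0, count - 1));
}
```

This sketch keeps counts in process memory and never evicts idle users, so a multi-instance deployment would need a shared store instead.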
Never allow streams to remain active indefinitely.
Set a reasonable upper bound, such as 30, 60, or 120 seconds, depending on workload requirements.
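One common way to enforce such a limit is a linked cancellation token that fires on either client disconnect or timeout. The sketch below assumes a hypothetical `StreamTimeout` helper and hypothetical status strings. The `stream` delegate represents whatever code performs the streaming.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs a stream under a maximum duration.
public static class StreamTimeout
{
    public static async Task<string> RunAsync(
        Func<CancellationToken, Task> stream,
        CancellationToken requestAborted,
        TimeSpan maxDuration)
    {
        // Cancels when the client disconnects OR the time limit elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(requestAborted);
        cts.CancelAfter(maxDuration);

        try
        {
            await stream(cts.Token);
            return "completed";
        }
        catch (OperationCanceledException) when (requestAborted.IsCancellationRequested)
        {
            return "client-disconnected";
        }
        catch (OperationCanceledException)
        {
            return "timed-out";
        }
    }
}
```

Distinguishing the two cancellation causes is useful for the completion status recorded in logs.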
Streaming endpoints are expensive.
Protect them.
ASP.NET Core rate limiting can help prevent abuse and accidental overload.
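A minimal configuration sketch using the built-in rate limiting middleware (available since .NET 7) might look like the following. The policy name `chat-stream`, the limit of 10 requests per minute, and partitioning by client IP are assumptions chosen for illustration, and the endpoint handler is a placeholder for the real streaming handler.

```csharp
using System;
using System.Threading.RateLimiting;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    // Reject with 429 instead of the default 503.
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;

    // One fixed-window bucket per client IP (assumed partition key).
    options.AddPolicy("chat-stream", httpContext =>
        RateLimitPartition.GetFixedWindowLimiter(
            partitionKey: httpContext.Connection.RemoteIpAddress?.ToString() ?? "unknown",
            factory: _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = 10,                 // assumed limit
                Window = TimeSpan.FromMinutes(1), // assumed window
                QueueLimit = 0                    // reject rather than queue
            }));
});

var app = builder.Build();
app.UseRateLimiter();

// Placeholder handler; the real one would stream SSE events.
app.MapGet("/api/chat/stream", () => Results.Ok())
   .RequireRateLimiting("chat-stream");

app.Run();
```

Behind a reverse proxy the remote IP is usually the proxy's address, so an authenticated user id is often a better partition key.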
Consuming SSE is surprisingly simple.
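// EventSource only issues GET requests, so the prompt travels in the query string
// (the "prompt" parameter name is an assumption about the API).
const prompt = "Explain Server-Sent Events";
const eventSource = new EventSource(
  "/api/chat/stream?prompt=" + encodeURIComponent(prompt));
eventSource.onmessage = (event) => {
  console.log(event.data); // one chunk of generated text
};
eventSource.onerror = () => {
  // Close explicitly: otherwise EventSource reconnects automatically
  // and would resend the prompt.
  eventSource.close();
};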
Every incoming event contains a new chunk of generated text.
Append it to the UI and the response appears progressively.
No polling.
No WebSockets.
No additional libraries.
Just native browser functionality.
These issues appear frequently in production code reviews.
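Forgetting to flush: you lose the primary benefit of streaming.
Using WebSockets or SignalR for one-way streaming: additional complexity without meaningful benefit.
Ignoring cancellation: leads to wasted compute and unnecessary token costs.
Missing timeouts: creates resource leaks and hanging requests.
Logging every token: increases costs and introduces privacy risks.
Blocking calls in the pipeline: reduces scalability and throughput.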
A clean structure keeps streaming concerns isolated.
AspNetCoreAIStreaming
│
├── Controllers
├── Services
├── Streaming
├── Models
├── Middleware
├── Frontend Demo
├── Docker
└── README
This separation makes the solution easier to maintain as streaming features grow.
Streaming transforms the user experience of AI applications.
The model isn't generating answers any faster, but users perceive the application as significantly more responsive because results appear immediately.
By combining ASP.NET Core, Server-Sent Events, cancellation handling, logging, timeouts, and proper error management, you can build a streaming API that feels much closer to ChatGPT than a traditional request-response implementation.
Streaming improves the user experience, but our AI still has no knowledge of our own data.
In the next article, we'll solve that by introducing Retrieval-Augmented Generation (RAG) with PostgreSQL and pgvector.