Streaming AI Responses in ASP.NET Core Using Server-Sent Events

Most AI integrations wait for the entire response before returning data. In this article, you'll learn how to stream OpenAI responses token-by-token using Server-Sent Events (SSE) in ASP.NET Core, creating a real-time ChatGPT-like experience with proper cancellation, logging, error handling, and performance optimizations.

Most OpenAI tutorials wait for the model to finish generating before returning a response.

The API receives a prompt, sends it to OpenAI, waits several seconds, and finally returns the completed answer.

Technically it works.

From a user experience perspective, it feels slow.

Users stare at a loading spinner wondering whether anything is happening.

ChatGPT doesn't work that way.

Responses appear token by token as they are generated. The total generation time may be identical, but the application feels significantly faster because users see progress immediately.

That difference matters.

In this article, we'll build the same streaming experience in ASP.NET Core using Server-Sent Events (SSE).

By the end, you'll have the building blocks of a production-ready streaming API: cancellation, error handling, logging, and real-time token delivery.


What We'll Build

Our architecture looks like this:

Browser / React / Angular
        │
 EventSource (SSE)
        │
 ASP.NET Core API
        │
 OpenAI Streaming API
        │
 GPT-4o

The browser opens a persistent connection to the API.

The API streams tokens as they arrive from OpenAI.

The UI updates immediately instead of waiting for the complete response.


Why Streaming Matters

Many developers focus only on total response time.

Users care about perceived response time.

Consider these two scenarios.

Traditional Response

User clicks Send

Wait 8 seconds

Response appears

Streaming Response

User clicks Send

200ms later:
"H"

300ms later:
"He"

400ms later:
"Hel"

...

Response continues streaming

The second experience feels dramatically faster despite taking the same total time.

Streaming provides several benefits:

  • Better perceived performance
  • Lower abandonment rates
  • Improved responsiveness
  • Progressive rendering
  • More engaging user experience

This is why most chat-style AI applications stream their responses.


Why Server-Sent Events?

Before implementing streaming, we need to choose a transport mechanism.

Several options exist.

Polling

Polling repeatedly asks the server whether new data is available.

Client → Request → No Data
Client → Request → No Data
Client → Request → New Data

Problems:

  • Excessive requests
  • Increased latency
  • Wasted resources
  • Poor scalability

Polling should generally be avoided for AI streaming.


WebSockets

WebSockets provide full duplex communication.

Client ↔ Server

Advantages:

  • Bidirectional communication
  • Low latency
  • Real-time applications

Disadvantages:

  • More complex infrastructure
  • Connection management
  • Additional scaling considerations

WebSockets are excellent for chat systems where both client and server actively send messages.

For one-way token streaming, they are often unnecessary.


SignalR

SignalR builds on WebSockets and falls back to Server-Sent Events or long polling when WebSockets are unavailable.

Advantages:

  • Connection management
  • Group messaging
  • Automatic reconnects

Disadvantages:

  • Additional abstraction layer
  • More moving parts

SignalR is ideal for collaborative applications, dashboards, multiplayer experiences, and complex real-time systems.


Why SSE Wins Here

Server-Sent Events are designed specifically for server-to-client streaming.

Benefits:

  • Simple HTTP connection
  • Native browser support
  • Lightweight
  • Easy to implement
  • Perfect for AI token streaming

Communication is one-way.

That's exactly what we need.

The browser sends a prompt.

The server streams tokens back.

No additional complexity required.


Creating a Streaming Endpoint

Instead of returning JSON, we'll return a stream.

Endpoint:

GET /api/chat/stream

Content Type:

text/event-stream

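This tells the browser to keep the connection open and process incoming events continuously.

Because the browser's EventSource API only issues GET requests, the prompt has to travel in the query string, or be submitted beforehand through a separate request.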


Understanding SSE Response Headers

Several headers are important.

Response.Headers.Append(
    "Content-Type",
    "text/event-stream");

You should also disable caching:

Response.Headers.Append(
    "Cache-Control",
    "no-cache");

And keep the connection alive:

Response.Headers.Append(
    "Connection",
    "keep-alive");

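These headers ensure the browser treats the response as a live stream rather than a traditional HTTP response.

The Connection header only has meaning under HTTP/1.1, where connections are already persistent by default. HTTP/2 forbids connection-specific headers, so treat this one as optional.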


Reading Streaming Tokens from OpenAI

Traditional implementations wait for the entire completion.

Streaming works differently.

OpenAI sends chunks as they become available.

Instead of:

Complete Response

You receive:

H
e
l
l
o

one chunk at a time. Real chunks are usually word fragments rather than single letters, but the flow is the same.

As each chunk arrives:

  1. Read the token.
  2. Write it to the response stream.
  3. Flush immediately.

The client receives updates in real time.
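
The write step depends on each chunk being framed as an SSE event. Here is a minimal sketch of that framing, assuming a hypothetical helper named SseFormatter (it is not an ASP.NET Core type). Each line of the payload gets its own "data:" field, and an empty line ends the event.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of ASP.NET Core): frames text as one SSE event.
public static class SseFormatter
{
    public static string Format(string data, string? eventName = null)
    {
        var sb = new StringBuilder();

        // An optional "event:" field lets the client tell event types apart.
        if (!string.IsNullOrEmpty(eventName))
        {
            sb.Append("event: ").Append(eventName).Append('\n');
        }

        // A payload containing newlines is split into several "data:" lines;
        // a raw blank line inside the payload would otherwise end the event early.
        foreach (var line in data.Replace("\r\n", "\n").Split('\n'))
        {
            sb.Append("data: ").Append(line).Append('\n');
        }

        // The empty line terminates the event.
        sb.Append('\n');
        return sb.ToString();
    }
}
```

The client strips only one space after "data:", so a token that begins with a space (common for word tokens) keeps it.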


Flushing Is Critical

Many developers stream data but forget to flush.

Without flushing:

Tokens buffered

Nothing reaches browser

Response appears all at once

Which defeats the entire purpose.

After writing each token:

await Response.Body.FlushAsync(
    cancellationToken);

This pushes data immediately to the client.


Cancellation Tokens

Cancellation support is not optional.

It's one of the most important parts of a production streaming API.

Imagine a user:

  • Closes the browser
  • Navigates away
  • Refreshes the page

Without cancellation handling:

User disconnected

OpenAI keeps generating

You keep paying

The server continues consuming tokens nobody will ever see.

Always pass CancellationToken through every layer.

Example:

public async Task StreamAsync(
    string prompt,
    CancellationToken cancellationToken)

When the client disconnects, the token is cancelled, and every layer that observes it can stop work, including the outbound request to OpenAI.
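
Here is a minimal sketch of what passing the token through every layer can look like with an async iterator. DemoTokenStream and its fixed token list are hypothetical stand-ins for a real provider call. The point is that the producer observes the same token the consumer cancels.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a streaming provider call.
public static class DemoTokenStream
{
    public static async IAsyncEnumerable<string> GenerateAsync(
        IReadOnlyList<string> tokens,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var token in tokens)
        {
            // Stop producing as soon as the caller has cancelled.
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Yield(); // stands in for awaiting the network
            yield return token;
        }
    }

    // Consumes the stream and cancels after `cancelAfter` tokens,
    // simulating a client that disconnects mid-response.
    public static async Task<int> ConsumeAsync(IReadOnlyList<string> tokens, int cancelAfter)
    {
        using var cts = new CancellationTokenSource();
        var delivered = 0;
        try
        {
            await foreach (var token in GenerateAsync(tokens, cts.Token))
            {
                delivered++;
                if (delivered == cancelAfter) cts.Cancel();
            }
        }
        catch (OperationCanceledException)
        {
            // Expected when the stream is cancelled part-way through.
        }
        return delivered;
    }
}
```

In an endpoint, the token passed to GenerateAsync would be the request's own cancellation token rather than a locally created source.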


Handling Client Disconnects

ASP.NET Core automatically exposes request cancellation.

HttpContext.RequestAborted

This token is triggered when:

  • Browser closes
  • Network drops
  • User navigates away

Pass it to the call that streams from OpenAI.

Once cancellation occurs:

Stop streaming

Stop generating

Release resources

This reduces unnecessary token costs and improves scalability.
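
The pieces covered so far (headers, write, flush, and RequestAborted) can be combined into one minimal API endpoint. This is a sketch under assumptions: it targets a .NET 6 or later web project with implicit usings, the `prompt` query parameter name is an assumption, and FakeModelAsync is a hypothetical placeholder for the real OpenAI streaming call.

```csharp
using System.Runtime.CompilerServices;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// "prompt" binds from the query string: /api/chat/stream?prompt=...
app.MapGet("/api/chat/stream", async (string prompt, HttpContext context) =>
{
    // Cancelled when the browser closes, navigates away, or the network drops.
    var cancellationToken = context.RequestAborted;

    context.Response.Headers.Append("Content-Type", "text/event-stream");
    context.Response.Headers.Append("Cache-Control", "no-cache");

    try
    {
        await foreach (var token in FakeModelAsync(prompt, cancellationToken))
        {
            // One SSE event per chunk, flushed immediately.
            await context.Response.WriteAsync($"data: {token}\n\n", cancellationToken);
            await context.Response.Body.FlushAsync(cancellationToken);
        }
    }
    catch (OperationCanceledException)
    {
        // The client disconnected; there is nobody left to write to.
    }
});

app.Run();

// Hypothetical placeholder for the provider call: echoes the prompt word by word.
static async IAsyncEnumerable<string> FakeModelAsync(
    string prompt,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    foreach (var word in prompt.Split(' '))
    {
        await Task.Delay(100, cancellationToken); // simulated generation latency
        yield return word + " ";
    }
}
```

Swapping FakeModelAsync for a real client keeps the endpoint unchanged, as long as the replacement accepts the same cancellation token.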


Error Handling During Streaming

Traditional APIs can return a standard error response.

Streaming APIs are different.

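The connection may already be active when a failure occurs. Once the first bytes have been flushed, the 200 status code and headers are already on the wire, so the failure can no longer be reported through the HTTP status.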

Consider:

  • Rate limit: 429 Too Many Requests
  • Timeout: request exceeded timeout
  • Network failure: connection interrupted

In these situations:

  • Log the error
  • Send a final SSE event if possible
  • Close the stream gracefully

Avoid exposing raw provider exceptions.

The client should receive a meaningful message while logs capture technical details.
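
One way to keep provider details out of the stream is to map exceptions to fixed, user-safe messages before writing the final event. The sketch below is illustrative only. StreamErrors, the message texts, and the `stream-error` event name are all assumptions. That name is chosen so it does not collide with the browser's built-in `error` event.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Hypothetical helper: turns a failure into text that is safe to show to a user.
public static class StreamErrors
{
    public static string ToClientMessage(Exception ex) => ex switch
    {
        // Provider rate limit (HTTP 429).
        HttpRequestException { StatusCode: HttpStatusCode.TooManyRequests }
            => "The service is busy. Please try again shortly.",

        // Timeouts, including HttpClient timeouts surfaced as a cancelled task.
        TimeoutException or OperationCanceledException { InnerException: TimeoutException }
            => "The request timed out before the response finished.",

        // Any other HTTP-level failure.
        HttpRequestException
            => "The connection to the AI provider was interrupted.",

        _ => "Something went wrong while generating the response."
    };

    // Final SSE event; the raw exception message is deliberately not included.
    public static string ToFinalEvent(Exception ex) =>
        $"event: stream-error\ndata: {ToClientMessage(ex)}\n\n";
}
```

The full exception still belongs in the server log. Only the mapped message goes to the client.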


Logging

One common mistake is logging every streamed token.

Don't.

This creates:

  • Massive log volume
  • Increased storage costs
  • Privacy concerns

Instead log:

Request Id
Duration
Completion Status
Prompt Tokens
Completion Tokens
Total Tokens

These metrics provide operational visibility without overwhelming your logging system.
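
As a rough illustration, the fields above fit in one small record that is written once per request, after the stream ends. StreamSummary and CompletionStatus are hypothetical names, and the token counts are assumed to come from whatever usage data the provider reports.

```csharp
using System;

public enum CompletionStatus { Completed, Cancelled, Failed }

// Hypothetical per-request summary: one log entry per stream, never one per token.
public sealed record StreamSummary(
    string RequestId,
    TimeSpan Duration,
    CompletionStatus Status,
    int PromptTokens,
    int CompletionTokens)
{
    public int TotalTokens => PromptTokens + CompletionTokens;

    // Flat key=value line; no prompt text or generated content is included.
    public string ToLogLine() =>
        $"RequestId={RequestId} DurationMs={(long)Duration.TotalMilliseconds} " +
        $"Status={Status} PromptTokens={PromptTokens} " +
        $"CompletionTokens={CompletionTokens} TotalTokens={TotalTokens}";
}
```

With a structured logger, the same properties would normally be passed as named values rather than as a preformatted string.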


Performance Tips

Streaming performance depends on many small decisions.

Disable Response Buffering

Buffered responses delay token delivery.

Streaming should bypass buffering whenever possible.
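
In ASP.NET Core, one place to ask for this is the response body feature. The extension method below is a sketch, and PrepareForSse is a hypothetical name. It only affects buffering inside the app, so a proxy in front of it may still buffer.

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Http.Features;

// Hypothetical extension method; call it before writing the first event.
public static class StreamingResponseExtensions
{
    public static void PrepareForSse(this HttpResponse response)
    {
        response.ContentType = "text/event-stream";

        // Ask the server and any buffering middleware to stop buffering the body.
        // The feature may be absent in some hosting setups, hence the null check.
        response.HttpContext.Features
            .Get<IHttpResponseBodyFeature>()?
            .DisableBuffering();
    }
}
```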


Flush Frequently

Each token or chunk should be flushed immediately.

This improves perceived responsiveness.


Avoid Excessive String Allocations

Repeated string concatenation can become expensive.

Prefer efficient buffering strategies.
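
If the server needs the full response after streaming (for example, to count characters for the request summary), a StringBuilder avoids creating a new string for every chunk. ResponseAccumulator is a hypothetical name used for illustration.

```csharp
using System.Text;

// Hypothetical helper: collects streamed chunks without allocating a new string per chunk.
public sealed class ResponseAccumulator
{
    private readonly StringBuilder _buffer = new StringBuilder();

    // Number of characters collected so far.
    public int Length => _buffer.Length;

    public void Append(string chunk) => _buffer.Append(chunk);

    // Materializes the full text once, when it is actually needed.
    public override string ToString() => _buffer.ToString();
}
```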


Use Async APIs

Streaming is fundamentally I/O bound.

Avoid blocking calls.

Use async operations throughout the entire pipeline.


Consider Async Streams

For provider abstractions, async streams often provide a clean design.

Example:

IAsyncEnumerable<string>

This aligns naturally with token streaming scenarios.
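
A provider abstraction along these lines might look like the sketch below. ITokenStreamProvider and EchoTokenProvider are hypothetical names. The echo implementation stands in for a real OpenAI-backed provider, so the consuming code can be exercised without network access.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction: any provider that can yield text chunks for a prompt.
public interface ITokenStreamProvider
{
    IAsyncEnumerable<string> StreamAsync(string prompt, CancellationToken cancellationToken);
}

// Test double: yields the prompt back one word at a time.
public sealed class EchoTokenProvider : ITokenStreamProvider
{
    public async IAsyncEnumerable<string> StreamAsync(
        string prompt,
        [EnumeratorCancellation] CancellationToken cancellationToken)
    {
        foreach (var word in prompt.Split(' '))
        {
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Yield(); // stands in for awaiting the network
            yield return word;
        }
    }
}

public static class TokenStreamDemo
{
    // Consumers depend only on the interface, not on a specific provider.
    public static async Task<List<string>> CollectAsync(
        ITokenStreamProvider provider,
        string prompt,
        CancellationToken cancellationToken = default)
    {
        var tokens = new List<string>();
        await foreach (var token in provider.StreamAsync(prompt, cancellationToken))
        {
            tokens.Add(token);
        }
        return tokens;
    }
}
```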


Security Considerations

Streaming endpoints deserve the same protection as any other AI endpoint.

Limit Prompt Size

Prevent abuse and excessive token consumption.

Example:

Maximum 5,000 characters

or whatever limit fits your application.
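
A check like this can run before any provider call is made. The sketch below uses the 5,000-character figure from the example. PromptValidator and the error texts are illustrative assumptions.

```csharp
// Hypothetical validator; the limit mirrors the example figure above.
public static class PromptValidator
{
    public const int MaxPromptLength = 5000;

    public static bool TryValidate(string? prompt, out string error)
    {
        if (string.IsNullOrWhiteSpace(prompt))
        {
            error = "Prompt is required.";
            return false;
        }

        if (prompt.Length > MaxPromptLength)
        {
            error = $"Prompt exceeds {MaxPromptLength} characters.";
            return false;
        }

        error = string.Empty;
        return true;
    }
}
```

Rejecting the request here means a normal error status can still be returned, because no streaming has started yet.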


Limit Concurrent Streams

One user should not be able to open hundreds of active streams.

Protect resources with concurrency limits.
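
A minimal in-memory sketch of a per-user limit follows. StreamConcurrencyLimiter is a hypothetical class, and how a user is identified is left open. It only counts streams within a single process, so a multi-instance deployment would need shared state instead.

```csharp
using System.Collections.Generic;

// Hypothetical limiter: caps the number of simultaneously open streams per user.
public sealed class StreamConcurrencyLimiter
{
    private readonly int _maxPerUser;
    private readonly Dictionary<string, int> _active = new Dictionary<string, int>();
    private readonly object _gate = new object();

    public StreamConcurrencyLimiter(int maxPerUser) => _maxPerUser = maxPerUser;

    // Returns false when the user already holds the maximum number of streams.
    public bool TryAcquire(string userId)
    {
        lock (_gate)
        {
            _active.TryGetValue(userId, out var count);
            if (count >= _maxPerUser) return false;
            _active[userId] = count + 1;
            return true;
        }
    }

    // Call when a stream ends, whether it completed, failed, or was cancelled.
    public void Release(string userId)
    {
        lock (_gate)
        {
            if (!_active.TryGetValue(userId, out var count)) return;
            if (count <= 1) _active.Remove(userId);
            else _active[userId] = count - 1;
        }
    }
}
```

Pairing TryAcquire with Release in a try/finally keeps the count accurate when a stream ends abnormally.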


Configure Timeouts

Never allow streams to remain active indefinitely.

Set reasonable limits.

Examples:

30 seconds
60 seconds
120 seconds

depending on workload requirements.
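
One way to enforce such a limit is to link the request's cancellation token to a timer, so the stream ends on whichever comes first: client disconnect or timeout. StreamTimeouts is a hypothetical helper name. The CancellationTokenSource calls are standard .NET.

```csharp
using System;
using System.Threading;

// Hypothetical helper: combines "client disconnected" with "stream ran too long".
public static class StreamTimeouts
{
    public static CancellationTokenSource CreateLinked(
        CancellationToken requestAborted,
        TimeSpan maxDuration)
    {
        // Cancelled when the request is aborted...
        var cts = CancellationTokenSource.CreateLinkedTokenSource(requestAborted);

        // ...or when the maximum stream duration elapses.
        cts.CancelAfter(maxDuration);
        return cts;
    }
}
```

The returned source should be disposed when the stream ends, and its Token is the one to pass down to the provider call.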


Rate Limiting

Streaming endpoints are expensive.

Protect them.

ASP.NET Core rate limiting can help prevent abuse and accidental overload.
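
A minimal sketch using the built-in rate limiting middleware, assuming .NET 7 or later. The policy name `chat-stream` and the numbers are placeholders, and the endpoint body is a stub. A named fixed-window policy like this one is shared by all callers, so a per-user limit would need a partitioned policy instead.

```csharp
using Microsoft.AspNetCore.RateLimiting;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    // Rejected requests receive 429 instead of the default 503.
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;

    // Placeholder policy: at most 10 stream requests per minute, no queuing.
    options.AddFixedWindowLimiter("chat-stream", limiter =>
    {
        limiter.PermitLimit = 10;
        limiter.Window = TimeSpan.FromMinutes(1);
        limiter.QueueLimit = 0;
    });
});

var app = builder.Build();

app.UseRateLimiter();

// Stub endpoint; the real handler would stream SSE events.
app.MapGet("/api/chat/stream", () => Results.Text("placeholder"))
   .RequireRateLimiting("chat-stream");

app.Run();
```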


Frontend Example

Consuming SSE is surprisingly simple.

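// EventSource only supports GET, so the prompt travels in the query string.
// The "prompt" parameter name is an assumption; match it to your endpoint.
const prompt = "Explain Server-Sent Events";
const eventSource =
    new EventSource(
        "/api/chat/stream?prompt=" + encodeURIComponent(prompt));

// Fires once per SSE event; event.data holds the next chunk of text.
eventSource.onmessage = (event) => {
    console.log(event.data);
};

// EventSource reconnects automatically after an error or a closed stream.
// Closing here stops the same prompt from being re-sent on reconnect.
eventSource.onerror = () => {
    eventSource.close();
};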

Every incoming event contains a new chunk of generated text.

Append it to the UI and the response appears progressively.

No polling.

No WebSockets.

No additional libraries.

Just native browser functionality.


Common Mistakes

These issues appear frequently in production code reviews.

Waiting for the Entire Response

You lose the primary benefit of streaming.


Using WebSockets When SSE Is Sufficient

Additional complexity without meaningful benefit.


Forgetting CancellationToken

Leads to wasted compute and unnecessary token costs.


No Timeout Configuration

Creates resource leaks and hanging requests.


Logging Streamed Content

Increases costs and introduces privacy risks.


Blocking Threads

Reduces scalability and throughput.


Suggested Repository Structure

A clean structure keeps streaming concerns isolated.

AspNetCoreAIStreaming
│
├── Controllers
├── Services
├── Streaming
├── Models
├── Middleware
├── Frontend Demo
├── Docker
└── README

This separation makes the solution easier to maintain as streaming features grow.


Recommended Screenshots

Include the following screenshots in the article.

Chat UI

Show tokens appearing one at a time.

Browser Network Tab

Display:

Content-Type:
text/event-stream

Console Logs

Show:

Request Id
Duration
Completion Status

Architecture Diagram

Illustrate:

Browser
 ↓
SSE
 ↓
ASP.NET Core
 ↓
OpenAI

Visuals significantly increase reader engagement and help explain the flow.


Conclusion

Streaming transforms the user experience of AI applications.

The model isn't generating answers any faster, but users perceive the application as significantly more responsive because results appear immediately.

By combining ASP.NET Core, Server-Sent Events, cancellation handling, logging, timeouts, and proper error management, you can build a streaming API that feels much closer to ChatGPT than a traditional request-response implementation.

Streaming improves the user experience, but our AI still has no knowledge of our own data.

In the next article, we'll solve that by introducing Retrieval-Augmented Generation (RAG) with PostgreSQL and pgvector.

Sep 07, 2024
By Dheer Gupta