Streaming - NeMo Guardrails

Streaming Overview

The NeMo Guardrails server supports streaming responses using Server-Sent Events (SSE). When streaming is enabled, the server sends partial message deltas as they are generated, allowing for real-time response display.

Enabling Streaming

To enable streaming, set "stream": true in the request body:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true,
    "guardrails": {"config_id": "my-config"}
  }'

Streaming with Output Rails

When output rails are configured, you need to enable streaming support in your guardrails configuration:

config.yml

rails:
  output:
    flows:
      - check hallucination
      - check sensitive data
    streaming:
      enabled: true
      chunk_size: 200
      context_size: 50
      stream_first: true

Configuration Options

enabled (boolean, required, default: false)

Enables streaming mode for output rails.

chunk_size (integer, default: 200)

The number of tokens in each processing chunk, that is, the size of the token block to which output rails are applied.

context_size (integer, default: 50)

The number of tokens carried over from the previous chunk so that rails have context across chunk boundaries.

stream_first (boolean, default: true)

If true, token chunks are streamed to the client immediately, before output rails are applied. If false, chunks are buffered and streamed only after they pass the output rails.

Streaming Response Format

Streaming responses use Server-Sent Events (SSE) format. Each chunk is sent as a data: line:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
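
The data: lines above can also be read without an SDK. The following is a minimal sketch, assuming each event arrives as a single data: line and the stream ends with the [DONE] sentinel as shown; iter_sse_payloads is a hypothetical helper name, not part of NeMo Guardrails.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        yield json.loads(payload)


# Sample lines shaped like the stream above (fields trimmed for brevity).
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

payloads = list(iter_sse_payloads(sample))
contents = [p["choices"][0]["delta"].get("content") for p in payloads]
print(contents)
```

In a real client the lines would come from the HTTP response body rather than a list; the parsing logic stays the same.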

Chunk Structure

id (string)

Unique identifier for the streaming response.

object (string)

Always "chat.completion.chunk".

created (integer)

Unix timestamp, in seconds.

model (string)

The model being used.

choices (array)

choices[].index (integer)

The choice index (always 0).

choices[].delta (object)

choices[].delta.content (string)

The content delta (token chunk).

choices[].delta.role (string)

Present only in the first chunk, always "assistant".

choices[].finish_reason (string | null)

Null during streaming; set to "stop", "length", or "content_filter" in the final chunk.
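
The fields above can be folded into a single message once the stream ends. This is a sketch under the assumptions stated in this section (role only in the first chunk, finish_reason only in the last); assemble_message is a hypothetical helper, and the chunks are plain dicts rather than SDK objects.

```python
from typing import Iterable, Optional


def assemble_message(chunks: Iterable[dict]) -> dict:
    """Fold streamed chunk dicts into one message dict."""
    role: Optional[str] = None
    parts = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        role = delta.get("role", role)      # set once, by the first chunk
        if delta.get("content"):
            parts.append(delta["content"])  # accumulate content deltas in order
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
    return {"role": role, "content": "".join(parts), "finish_reason": finish_reason}


chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " there"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

message = assemble_message(chunks)
print(message)
```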

Error Handling in Streaming

If an error occurs during streaming, an error chunk is sent:

{
  "error": {
    "message": "LLM call failed",
    "type": "server_error",
    "code": "llm_error"
  }
}

The stream is then terminated with data: [DONE].

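The OpenAI Python client can consume the stream. Wrap the loop in try/except to catch failures:

from openai import OpenAI

# Point the client at the guardrails server instead of the OpenAI API.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

try:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
        extra_body={"guardrails": {"config_id": "my-config"}}
    )

    for chunk in stream:
        # Skip chunks without choices or without a content delta.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
    print()
except Exception as e:
    print(f"Streaming error: {e}")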
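
When reading the stream without the OpenAI client, each parsed payload can be checked for the error object shown above. A minimal sketch, assuming the error chunk is a JSON object with a top-level "error" key as in the example; StreamError and raise_on_error are hypothetical names introduced here for illustration.

```python
import json


class StreamError(RuntimeError):
    """Hypothetical exception carrying the fields of an error chunk."""

    def __init__(self, message: str, type_: str, code: str):
        super().__init__(message)
        self.type = type_
        self.code = code


def raise_on_error(payload: dict) -> dict:
    """Return the payload unchanged, or raise if it is an error chunk."""
    error = payload.get("error")
    if error is not None:
        raise StreamError(
            error.get("message", "unknown error"),
            error.get("type", ""),
            error.get("code", ""),
        )
    return payload


error_chunk = json.loads(
    '{"error": {"message": "LLM call failed", "type": "server_error", "code": "llm_error"}}'
)

try:
    raise_on_error(error_chunk)
except StreamError as exc:
    caught = (str(exc), exc.type, exc.code)
    print(f"Stream failed: {caught}")
```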

Streaming with Rails Applied

When output rails are enabled, the streaming behavior depends on the configuration:

Stream-First Mode (Default)

With stream_first: true, tokens are streamed immediately and output rails are applied in parallel:

LLM generates tokens
Tokens are immediately streamed to client
Output rails process chunks in parallel
If rails detect an issue, streaming is aborted with an ABORT event

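import asyncio
from nemoguardrails import RailsConfig, LLMRails

rails = LLMRails(RailsConfig.from_path("config"))

async def main():
    async for chunk in rails.stream_async(
        messages=[{"role": "user", "content": "Hello"}]
    ):
        # Check for an ABORT event in the streamed chunk
        if '{"event": "ABORT"' in chunk:
            print("\nStreaming aborted by rails")
            break
        print(chunk, end="")

asyncio.run(main())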

Buffer-First Mode

With stream_first: false, chunks are buffered and only streamed after passing rails:

LLM generates tokens
Tokens are buffered into chunks
Output rails process each chunk
Only approved chunks are streamed to client
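
The difference between the two modes can be seen in a toy simulation. This is an illustrative sketch only, not the library's implementation: toy_rail and run_stream are hypothetical names, chunks carry no context overlap, and the stream is assumed to stop at the first rejected chunk in buffer-first mode.

```python
from typing import Callable, List


def toy_rail(chunk: List[str]) -> bool:
    """Stand-in output rail: reject any chunk containing the token 'secret'."""
    return "secret" not in chunk


def run_stream(
    tokens: List[str],
    chunk_size: int,
    stream_first: bool,
    rail: Callable[[List[str]], bool] = toy_rail,
) -> List[str]:
    """Return the sequence of events a client would observe."""
    events: List[str] = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        if stream_first:
            events.extend(chunk)        # client sees tokens before the check
            if not rail(chunk):
                events.append("ABORT")  # rail failure ends the stream
                break
        else:
            if not rail(chunk):
                break                   # rejected chunk is never sent (assumption)
            events.extend(chunk)        # only approved chunks reach the client
    return events


tokens = ["the", "code", "is", "secret", "do", "not", "share"]
stream_first_events = run_stream(tokens, chunk_size=2, stream_first=True)
buffer_first_events = run_stream(tokens, chunk_size=2, stream_first=False)
print(stream_first_events)
print(buffer_first_events)
```

In the stream-first run the offending token reaches the client before the abort; in the buffer-first run it never does, at the cost of waiting for each chunk to be checked.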

Performance Considerations

Chunk Size

Larger chunk sizes:

Fewer rail checks
Lower total rail processing overhead
Higher time-to-first-token (when stream_first is false)

Smaller chunk sizes:

More frequent rail checks
Higher total rail processing overhead
Lower time-to-first-token (when stream_first is false)
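
As a rough illustration of this trade-off, the number of rail invocations can be estimated as the response length divided by chunk_size, rounded up. The sketch assumes one rail pass per chunk and ignores context overlap; rail_check_count is a hypothetical helper, and the 1000-token response is an arbitrary example.

```python
import math


def rail_check_count(total_tokens: int, chunk_size: int) -> int:
    """Estimate rail invocations, assuming one pass per chunk."""
    return math.ceil(total_tokens / chunk_size)


for size in (50, 200, 400):
    print(f"chunk_size={size}: {rail_check_count(1000, size)} rail checks")
```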

Context Size

The context_size parameter ensures continuity between chunks:

rails:
  output:
    streaming:
      chunk_size: 200
      context_size: 50  # Last 50 tokens from previous chunk

This helps rails detect issues that span chunk boundaries.
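
The overlap can be pictured as a sliding window over the token stream. The sketch below is an assumption about how the two parameters interact (each chunk holds chunk_size new tokens and is checked together with the context_size tokens before it), not the library's actual buffering code; rail_windows is a hypothetical helper.

```python
from typing import Iterator, List, Tuple


def rail_windows(
    tokens: List[str], chunk_size: int, context_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (context, chunk) pairs: each chunk plus the tail of the text before it."""
    for start in range(0, len(tokens), chunk_size):
        context = tokens[max(0, start - context_size):start]
        chunk = tokens[start:start + chunk_size]
        yield context, chunk


tokens = [f"t{i}" for i in range(10)]
windows = list(rail_windows(tokens, chunk_size=4, context_size=2))
for context, chunk in windows:
    print(context, "+", chunk)
```

With these toy values, a phrase split across t3 and t4 is still seen whole by the rail, because t2 and t3 are passed as context for the second chunk.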

Advanced Streaming Example

import asyncio
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("config")
rails = LLMRails(config)

async def stream_with_metadata():
    """Stream with metadata and error handling."""
    messages = [{"role": "user", "content": "Tell me a story"}]
    
    full_response = ""
    chunk_count = 0
    
    try:
        async for chunk in rails.stream_async(
            messages=messages,
            include_metadata=True
        ):
            # Check for abort
            if isinstance(chunk, dict) and chunk.get("event") == "ABORT":
                print(f"\n\nAborted: {chunk.get('data')}")
                break
            
            # Handle string chunks
            if isinstance(chunk, str):
                print(chunk, end="")
                full_response += chunk
                chunk_count += 1
            
            # Handle metadata chunks
            elif isinstance(chunk, dict):
                if "metadata" in chunk:
                    print(f"\n[Metadata: {chunk['metadata']}]")
    
    except Exception as e:
        print(f"\n\nStreaming error: {e}")
    
    print(f"\n\nReceived {chunk_count} chunks")
    print(f"Total length: {len(full_response)} characters")

# Run the coroutine from a regular script
asyncio.run(stream_with_metadata())

Troubleshooting

StreamingNotSupportedError

If you get this error, enable streaming in your config:

config.yml

rails:
  output:
    streaming:
      enabled: true

Slow Streaming

If streaming is slow:

Increase chunk_size to reduce rail processing overhead
Use stream_first: true to stream immediately
Optimize your output rail flows

Incomplete Responses

If responses are cut off:

Check for ABORT events in the stream
Review output rail logs
Adjust the max_tokens parameter
