Documentation Index
Fetch the complete documentation index at: https://mintlify.com/NVIDIA-NeMo/Guardrails/llms.txt
Use this file to discover all available pages before exploring further.
Streaming Overview
The NeMo Guardrails server supports streaming responses using Server-Sent Events (SSE). When streaming is enabled, the server sends partial message deltas as they are generated, allowing for real-time response display.Enabling Streaming
To enable streaming, setstream: true in your request:
Streaming with Output Rails
When output rails are configured, you need to enable streaming support in your guardrails configuration:config.yml
Configuration Options
Enables streaming mode for output rails.
The number of tokens in each processing chunk. This is the size of the token block on which output rails are applied.
The number of tokens carried over from the previous chunk to provide context for continuity in processing.
If true, token chunks are streamed immediately before output rails are applied. If false, chunks are buffered and streamed only after rails check.
Streaming Response Format
Streaming responses use Server-Sent Events (SSE) format. Each chunk is sent as adata: line:
Chunk Structure
Unique identifier for the streaming response.
Always “chat.completion.chunk”.
Unix timestamp.
The model being used.
The choice index (always 0).
Null during streaming, set to “stop”, “length”, or “content_filter” in the final chunk.
Error Handling in Streaming
If an error occurs during streaming, an error chunk is sent:data: [DONE].
Streaming with Rails Applied
When output rails are enabled, the streaming behavior depends on the configuration:Stream-First Mode (Default)
Withstream_first: true, tokens are streamed immediately and output rails are applied in parallel:
- LLM generates tokens
- Tokens are immediately streamed to client
- Output rails process chunks in parallel
- If rails detect an issue, streaming is aborted with an ABORT event
Buffer-First Mode
Withstream_first: false, chunks are buffered and only streamed after passing rails:
- LLM generates tokens
- Tokens are buffered into chunks
- Output rails process each chunk
- Only approved chunks are streamed to client
Performance Considerations
Chunk Size
Larger chunk sizes:- Reduce the number of rail checks
- Lower latency for rail processing
- Higher time-to-first-token
- More frequent rail checks
- Higher rail processing overhead
- Lower time-to-first-token
Context Size
Thecontext_size parameter ensures continuity between chunks:
Advanced Streaming Example
Troubleshooting
StreamingNotSupportedError
If you get this error, enable streaming in your config:config.yml
Slow Streaming
If streaming is slow:- Increase
chunk_sizeto reduce rail processing overhead - Use
stream_first: trueto stream immediately - Optimize your output rail flows
Incomplete Responses
If responses are cut off:- Check for ABORT events in the stream
- Review output rail logs
- Adjust
max_tokensparameter