Hallucination detection helps identify when your bot generates responses that are inconsistent or potentially fabricated.

Overview

The hallucination detection guardrail uses a self-consistency approach:
  1. Generates multiple responses to the same prompt
  2. Compares responses for agreement
  3. Flags potential hallucinations when responses diverge
This is particularly useful for:
  • Detecting fabricated information
  • Identifying low-confidence responses
  • Improving reliability in critical applications
  • Providing warnings about uncertain answers
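The self-consistency idea can be sketched in a few lines of Python. This is a toy illustration only: `generate` is a hypothetical stand-in for the configured LLM (made deterministic here so the example runs), and the agreement test uses simple word overlap with an arbitrary 0.5 threshold, whereas the real guardrail asks an LLM to judge agreement.

```python
def generate(prompt: str, temperature: float = 1.0) -> str:
    # Hypothetical stand-in for the configured LLM; always returns the
    # same answer so the example is deterministic.
    return "The capital of France is Paris."

def looks_hallucinated(prompt: str, answer: str, num_extra: int = 2) -> bool:
    # Step 1: generate extra responses to the same prompt.
    extras = [generate(prompt, temperature=1.0) for _ in range(num_extra)]
    # Steps 2-3: a toy agreement check based on shared-word overlap
    # (the real guardrail asks an LLM to judge agreement instead).
    answer_words = set(answer.lower().split())
    agreeing = sum(
        len(answer_words & set(e.lower().split())) / max(len(answer_words), 1) > 0.5
        for e in extras
    )
    # Flag a potential hallucination when the extra responses diverge.
    return agreeing < num_extra
```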

Quick Start

1. Enable hallucination detection

Add the hallucination checking flow to output rails:
config.yml
rails:
  output:
    flows:
      - self check hallucination
2. Activate per response

Enable hallucination detection for specific responses:
flows.co
flow answer general question
  user ask general question
  
  # Enable hallucination detection
  $check_hallucination = True
  
  bot provide response
3. Configure your LLM

Ensure you’re using an OpenAI model or one that supports the n parameter:
config.yml
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct

How It Works

The hallucination detector:
  1. Takes the original prompt used to generate the bot’s response
  2. Generates 2 additional responses with temperature=1.0
  3. Uses an LLM to check if all responses agree
  4. Returns True if hallucination detected, False otherwise
# From actions.py
HALLUCINATION_NUM_EXTRA_RESPONSES = 2
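Put together, the four steps can be sketched as follows. The `llm` argument is a hypothetical callable standing in for the configured model (the real action renders its prompts through the LLM task manager rather than building strings inline):

```python
def self_check_hallucination(prompt, bot_response, llm, num_extra=2):
    # Step 2: generate extra completions at temperature 1.0 (a single
    # request with n=num_extra when the engine supports it).
    extra = llm(prompt, n=num_extra, temperature=1.0)
    # Step 3: ask the LLM whether the extra responses support the original.
    check = (
        f"Statement: {bot_response}\n"
        f"Other responses: {'. '.join(extra)}\n"
        'Do the other responses support the statement? Answer "yes" or "no".'
    )
    verdict = llm(check, n=1, temperature=0.0)[0]
    # Step 4: True means a potential hallucination was detected.
    return verdict.strip().lower().startswith("no")
```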

Configuration

Basic Configuration

config.yml
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct

rails:
  output:
    flows:
      - self check hallucination

Blocking Mode

Block responses when hallucinations are detected:
flows.co
flow answer with hallucination check
  user ask general question
  
  # Enable hallucination detection (blocking)
  $check_hallucination = True
  
  bot provide response
  # Response is automatically blocked if hallucination detected

Warning Mode

Provide a warning instead of blocking:
config.yml
rails:
  output:
    flows:
      - hallucination warning
flows.co
flow answer with hallucination warning
  user ask general question
  
  # Enable hallucination warning
  $hallucination_warning = True
  
  bot provide response
  # Warning is appended if hallucination detected

Two Detection Modes

1. Blocking Mode (self check hallucination)

Blocks the response entirely if hallucination is detected:
flows.co
flow self check hallucination
  if $check_hallucination == True
    $is_hallucination = await SelfCheckHallucinationAction()
    $check_hallucination = False
    
    if $is_hallucination
      if $system.config.enable_rails_exceptions
        send SelfCheckHallucinationRailException(message="Hallucination detected")
      else
        bot inform answer unknown
      abort

2. Warning Mode (hallucination warning)

Adds a disclaimer to potentially hallucinated responses:
flows.co
flow hallucination warning
  bot said something
  if $hallucination_warning == True
    $is_hallucination = await SelfCheckHallucinationAction()
    $hallucination_warning = False
    
    if $is_hallucination
      bot inform answer prone to hallucination
Warning messages:
  • “The previous answer is prone to hallucination and may not be accurate. Please double check the answer using additional sources.”
  • “The above response may have been hallucinated, and should be independently verified.”

LLM Requirements

Hallucination detection is optimized for OpenAI models. Other LLMs may not work correctly.
Required features:
  • Support for the n parameter (to generate multiple completions)
  • Beam search or similar multi-completion capability
Supported:
  • OpenAI models (GPT-3.5, GPT-4)
  • Models with compatible n parameter
Not supported:
  • Most non-OpenAI models
  • Models without multi-completion support
If your model doesn’t support the n parameter, hallucination detection will return False (no hallucination detected) and log a warning.
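That fail-open behavior can be illustrated with a small wrapper. `check_fn` is a hypothetical callable wrapping the self-consistency check; the point is only that an engine which rejects multi-completion requests causes the check to be skipped with a warning rather than an error:

```python
import logging

logger = logging.getLogger(__name__)

def checked_or_skipped(check_fn, prompt, response):
    # Illustrative fallback: if the engine rejects the multi-completion
    # request, skip the check and report no hallucination, as described above.
    try:
        return check_fn(prompt, response)
    except TypeError:
        logger.warning("Current LLM engine does not support generating "
                       "multiple completions; skipping hallucination check.")
        return False
```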

Context Requirements

The hallucination detector needs:
  • $bot_message - The bot’s response to check
  • $_last_bot_prompt - The original prompt (automatically tracked)
bot_response = context.get("bot_message")
last_bot_prompt_string = context.get("_last_bot_prompt")
If either is missing, the detector returns False.
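The guard can be sketched like this (`detector` is a hypothetical callable performing the actual self-consistency check):

```python
def self_check(context: dict, detector) -> bool:
    # detector is a hypothetical callable(prompt, response) -> bool.
    bot_response = context.get("bot_message")
    last_bot_prompt = context.get("_last_bot_prompt")
    if bot_response is None or last_bot_prompt is None:
        # Either value missing: nothing to re-check, so report no hallucination.
        return False
    return detector(last_bot_prompt, bot_response)
```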

Behavior

With Rails Exceptions

config.yml
rails:
  config:
    enable_rails_exceptions: true
Raises SelfCheckHallucinationRailException when hallucination is detected in blocking mode.

Without Rails Exceptions

In blocking mode: Bot says “I don’t know the answer to that” and aborts. In warning mode: Bot adds a disclaimer about potential hallucination.

Activating Detection

Blocking Mode

Set $check_hallucination = True:
flows.co
flow user ask general question
  user ask general question
  $check_hallucination = True
  bot provide response

Warning Mode

Set $hallucination_warning = True:
flows.co
flow user ask general question
  user ask general question
  $hallucination_warning = True
  bot provide response

Custom Flows

Create custom hallucination handling:
flows.co
flow my hallucination handler
  """Custom hallucination detection with logging."""
  bot said something
  
  if $check_hallucination == True
    $check_hallucination = False
    $is_hallucination = await SelfCheckHallucinationAction()
    
    if $is_hallucination
      log "Hallucination detected in response: {{$bot_message}}"
      
      # Provide a more helpful response
      bot say "I'm not entirely confident in that answer. Let me rephrase:"
      $bot_message = execute generate_alternative_response()
      bot $bot_message

Agreement Checking

The detector prompts the LLM to determine agreement:
prompt = llm_task_manager.render_task_prompt(
    task=Task.SELF_CHECK_HALLUCINATION,
    context={
        "statement": bot_response,
        "paragraph": ". ".join(extra_responses),
    },
)
Customize in prompts.yml:
prompts.yml
task_prompts:
  - task: self_check_hallucination
    content: |
      Statement: {{ statement }}
      
      Other responses: {{ paragraph }}
      
      Do the other responses support the statement?
      Answer "yes" if they agree, "no" if they disagree.

Performance Considerations

Hallucination detection is expensive:
  • Generates 2 extra responses (with n=2)
  • Makes an additional LLM call for agreement checking
  • Significantly increases latency and cost
Best practices:
  1. Use selectively for important responses
  2. Consider using warning mode instead of blocking
  3. Only enable for general knowledge questions (not factual RAG responses)
  4. Monitor API costs carefully
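As a back-of-the-envelope check, assuming the two extra completions fit into a single n=2 request, each checked response triples the number of LLM requests:

```python
# Per checked response:
original_call = 1        # the response being checked
extra_completions = 1    # one request with n=2, temperature=1.0
agreement_check = 1      # the yes/no self-consistency judgment
total_calls = original_call + extra_completions + agreement_check
print(f"{total_calls} LLM requests instead of 1")
```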

Temperature Settings

Extra responses use high temperature:
temperature=1.0  # For diverse responses
Agreement check uses low temperature:
temperature=config.lowest_temperature  # For consistency

Implementation Details

The hallucination flows are defined in:
  • /nemoguardrails/library/hallucination/flows.co
  • /nemoguardrails/library/hallucination/actions.py
Actions:
  • SelfCheckHallucinationAction - Performs self-consistency check

Use Cases

Good use cases:
  • General knowledge questions
  • Creative or opinion-based responses
  • Uncertain or ambiguous queries
  • Non-critical information
Poor use cases:
  • RAG-based factual responses (use fact checking instead)
  • Time-sensitive information
  • Deterministic computations
  • Simple lookups

Alternative: BERT Score

The code includes a TODO for BERT Score-based consistency:
# TODO: Implement BERT-Score based consistency method
# See details: https://arxiv.org/abs/2303.08896
This would provide an alternative to LLM-based agreement checking.
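To convey the idea without the LLM judge, here is a toy stand-in that scores consistency by surface similarity. It is not BERTScore: a real implementation would compare contextual embeddings as in the SelfCheckGPT paper linked above, and the 0.5 threshold here is arbitrary.

```python
from difflib import SequenceMatcher

def consistency_score(statement: str, extra_responses: list) -> float:
    # Toy stand-in for BERTScore: average surface similarity between the
    # statement and each re-sampled response.
    sims = [SequenceMatcher(None, statement.lower(), e.lower()).ratio()
            for e in extra_responses]
    return sum(sims) / len(sims)

def is_inconsistent(statement: str, extra_responses: list,
                    threshold: float = 0.5) -> bool:
    # Low similarity means the statement is not reliably reproduced when
    # re-sampled, i.e. a possible hallucination.
    return consistency_score(statement, extra_responses) < threshold
```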

See Also