PII (Personally Identifiable Information) detection helps protect user privacy by identifying and optionally masking sensitive data.
Overview
The sensitive data detection guardrail uses Microsoft Presidio to:
- Detect PII in user inputs, bot outputs, and retrieved documents
- Mask or block detected sensitive information
- Support custom entity recognizers
- Configure different rules for input, output, and retrieval
Supported entity types:
- PERSON (names)
- EMAIL_ADDRESS
- PHONE_NUMBER
- CREDIT_CARD
- US_SSN (Social Security Numbers)
- LOCATION
- IP_ADDRESS
- IBAN_CODE
- And many more…
Quick Start
Install dependencies
Install Presidio and spaCy:
pip install presidio-analyzer presidio-anonymizer
pip install spacy
python -m spacy download en_core_web_lg
Configure PII detection
Define which entities to detect:
rails:
  config:
    sensitive_data_detection:
      input:
        score_threshold: 0.4
        entities:
          - PERSON
          - EMAIL_ADDRESS
          - PHONE_NUMBER
          - CREDIT_CARD
          - US_SSN
Enable detection flows
Choose between detection (blocking) or masking:
rails:
  input:
    flows:
      - detect sensitive data on input
      # OR
      - mask sensitive data on input
Detection vs Masking
Detection (Blocking)
Blocks requests containing PII:
rails:
  input:
    flows:
      - detect sensitive data on input
  output:
    flows:
      - detect sensitive data on output
When PII is found, the bot responds with “I don’t know the answer to that” and aborts.
Masking (Redaction)
Replaces PII with placeholder text:
rails:
  input:
    flows:
      - mask sensitive data on input
  output:
    flows:
      - mask sensitive data on output
Example:
Input: "My email is john@example.com"
Masked: "My email is <EMAIL_ADDRESS>"
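The difference between the two modes can be illustrated with a small, self-contained sketch. It does not call Presidio or NeMo Guardrails: a single email regex stands in for the analyzer, and the names `find_pii`, `detect`, and `mask` are hypothetical helpers chosen for this example.

```python
import re
from typing import List, Tuple

# Stand-in detector: one simple email regex instead of Presidio's analyzer.
# (Assumption for illustration only; real detection covers many entity types.)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

Span = Tuple[int, int, str]  # (start, end, entity_type)


def find_pii(text: str) -> List[Span]:
    """Return the character span of every match, tagged with an entity type."""
    return [(m.start(), m.end(), "EMAIL_ADDRESS") for m in EMAIL_RE.finditer(text)]


def detect(text: str) -> bool:
    """Detection mode: report whether any PII is present (the caller then blocks)."""
    return bool(find_pii(text))


def mask(text: str) -> str:
    """Masking mode: replace each span with a <ENTITY_TYPE> placeholder."""
    # Replace from right to left so that earlier offsets stay valid.
    for start, end, entity in sorted(find_pii(text), reverse=True):
        text = text[:start] + f"<{entity}>" + text[end:]
    return text


message = "My email is john@example.com"
print(detect(message))  # True
print(mask(message))    # My email is <EMAIL_ADDRESS>
```

Detection only answers yes or no, so the conversation stops; masking keeps the conversation going with the sensitive span removed.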
Configuration
Complete Configuration
colang_version: "2.x"
models:
  - type: main
    engine: openai
    model: gpt-4o-mini
rails:
  config:
    sensitive_data_detection:
      # Input configuration
      input:
        score_threshold: 0.4
        entities:
          - PERSON
          - EMAIL_ADDRESS
          - PHONE_NUMBER
          - CREDIT_CARD
          - US_SSN
          - LOCATION
      # Output configuration
      output:
        score_threshold: 0.4
        entities:
          - PERSON
          - EMAIL_ADDRESS
          - PHONE_NUMBER
          - CREDIT_CARD
          - US_SSN
          - LOCATION
      # Retrieval configuration (for RAG systems)
      retrieval:
        score_threshold: 0.4
        entities:
          - PERSON
          - CREDIT_CARD
          - US_SSN
  input:
    flows:
      - mask sensitive data on input
  output:
    flows:
      - mask sensitive data on output
Score Threshold
The score_threshold controls detection sensitivity:
- 0.0: Accept every match, however weak (many false positives)
- 0.4: Balanced (recommended default)
- 1.0: Only the most confident matches (may miss some PII)
sensitive_data_detection:
  input:
    score_threshold: 0.4  # Adjust based on your needs
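To make the effect of the threshold concrete, here is a minimal sketch of score-based filtering. The `Finding` class, the `filter_by_threshold` helper, and the scores are invented for illustration, and the sketch assumes that findings scoring at or above the threshold are kept.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    entity_type: str
    text: str
    score: float  # analyzer confidence between 0.0 and 1.0


def filter_by_threshold(findings: List[Finding], score_threshold: float) -> List[Finding]:
    """Keep only findings whose confidence is at least the threshold."""
    return [f for f in findings if f.score >= score_threshold]


# Hypothetical analyzer output for one message (scores are made up).
findings = [
    Finding("CREDIT_CARD", "4111 1111 1111 1111", 1.0),
    Finding("PERSON", "Jordan", 0.85),
    Finding("PHONE_NUMBER", "555-0100", 0.4),
    Finding("DATE_TIME", "May", 0.1),
]

for threshold in (0.0, 0.4, 1.0):
    kept = [f.entity_type for f in filter_by_threshold(findings, threshold)]
    print(threshold, kept)
```

Raising the threshold trades recall for precision: the weak `DATE_TIME` guess disappears at 0.4, and only the high-confidence card number survives at 1.0.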
Separate Configurations
Configure different rules for input, output, and retrieval:
sensitive_data_detection:
  # Strict for user inputs
  input:
    score_threshold: 0.3
    entities:
      - PERSON
      - EMAIL_ADDRESS
      - PHONE_NUMBER
      - CREDIT_CARD
      - US_SSN
  # Moderate for bot outputs
  output:
    score_threshold: 0.5
    entities:
      - CREDIT_CARD
      - US_SSN
  # Very strict for retrieved documents
  retrieval:
    score_threshold: 0.2
    entities:
      - CREDIT_CARD
      - US_SSN
      - IBAN_CODE
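The practical consequence of per-source rules is that the same finding can be acted on in one place and ignored in another. The sketch below mirrors the YAML above as a plain dictionary; `CONFIG` and `is_flagged` are hypothetical names, and the logic is an illustration of the idea rather than the library's implementation.

```python
from typing import Dict, List, TypedDict


class SourceConfig(TypedDict):
    score_threshold: float
    entities: List[str]


# Mirrors the YAML above as a plain dictionary.
CONFIG: Dict[str, SourceConfig] = {
    "input": {
        "score_threshold": 0.3,
        "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD", "US_SSN"],
    },
    "output": {"score_threshold": 0.5, "entities": ["CREDIT_CARD", "US_SSN"]},
    "retrieval": {"score_threshold": 0.2, "entities": ["CREDIT_CARD", "US_SSN", "IBAN_CODE"]},
}


def is_flagged(source: str, entity_type: str, score: float) -> bool:
    """Return True if a finding counts as PII under the rules for this source."""
    rules = CONFIG[source]  # raises KeyError for an unknown source
    return entity_type in rules["entities"] and score >= rules["score_threshold"]


# A PERSON finding matters on input but is not configured for output.
print(is_flagged("input", "PERSON", 0.45))
print(is_flagged("output", "PERSON", 0.45))
# A low-confidence US_SSN finding only clears the retrieval threshold.
print(is_flagged("retrieval", "US_SSN", 0.25))
print(is_flagged("input", "US_SSN", 0.25))
```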
Available Flows
Input Rails
Detect (Block):
rails:
  input:
    flows:
      - detect sensitive data on input
Mask (Redact):
rails:
  input:
    flows:
      - mask sensitive data on input
Output Rails
Detect (Block):
rails:
  output:
    flows:
      - detect sensitive data on output
Mask (Redact):
rails:
  output:
    flows:
      - mask sensitive data on output
Retrieval Rails
Detect (Block):
rails:
  retrieval:
    flows:
      - detect sensitive data on retrieval
Mask (Redact):
rails:
  retrieval:
    flows:
      - mask sensitive data on retrieval
Custom Entity Recognizers
Add custom patterns for domain-specific PII:
rails:
  config:
    sensitive_data_detection:
      recognizers:
        - name: "EMPLOYEE_ID"
          supported_language: "en"
          patterns:
            - name: "employee_id_pattern"
              regex: "EMP-[0-9]{6}"
              score: 0.8
        - name: "INTERNAL_IP"
          supported_language: "en"
          patterns:
            - name: "internal_ip_pattern"
              regex: "10\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"
              score: 0.9
      input:
        score_threshold: 0.4
        entities:
          - PERSON
          - EMAIL_ADDRESS
          - EMPLOYEE_ID  # Custom entity
          - INTERNAL_IP  # Custom entity
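Before adding a pattern to the configuration, it can help to try it against sample text. The sketch below uses Python's built-in `re` module as a rough stand-in for the recognizer's regex engine, which is an assumption; the sample strings are invented.

```python
import re

# Same patterns as in the YAML above. In a double-quoted YAML string, "\\."
# is an escaped backslash followed by a dot, so the regex engine receives \.
EMPLOYEE_ID = re.compile(r"EMP-[0-9]{6}")
INTERNAL_IP = re.compile(r"10\.\d{1,3}\.\d{1,3}\.\d{1,3}")

samples = [
    "Badge EMP-123456 was issued today",
    "Badge EMP-12345 is too short",
    "Server at 10.0.12.7 is down",
    "Public host 8.8.8.8 is fine",
]

for text in samples:
    print(bool(EMPLOYEE_ID.search(text)), bool(INTERNAL_IP.search(text)), text)

# Without word boundaries the pattern also matches inside a longer token.
print(EMPLOYEE_ID.search("EMP-1234567").group())      # EMP-123456
print(re.search(r"\bEMP-[0-9]{6}\b", "EMP-1234567"))  # None
```

If partial matches like the last case are unwanted, anchoring the pattern with `\b` on both sides is one option.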
Supported Entities
Presidio supports many built-in entity types:
Personal Information:
- PERSON
- EMAIL_ADDRESS
- PHONE_NUMBER
- LOCATION
- DATE_TIME
- URL
Financial:
- CREDIT_CARD
- IBAN_CODE
- CRYPTO
Identification:
- US_SSN
- US_PASSPORT
- US_DRIVER_LICENSE
- UK_NHS
- SG_NRIC_FIN
Technical:
- IP_ADDRESS
Medical:
- MEDICAL_LICENSE
See Presidio documentation for the complete list.
Custom Flows
Create custom PII handling:
flow my pii handler
  """Custom PII detection with logging."""
  $has_pii = await DetectSensitiveDataAction(source="input", text=$user_message)
  if $has_pii
    log "PII detected in user message"
    bot say "Please don't share personal information. How else can I help?"
    abort
Actions
Two actions are available:
DetectSensitiveDataAction
Returns True if PII is detected:
$has_pii = await DetectSensitiveDataAction(
  source="input",  # "input", "output", or "retrieval"
  text=$user_message
)
MaskSensitiveDataAction
Returns masked text:
$masked_text = await MaskSensitiveDataAction(
  source="input",
  text=$user_message
)
Integration with RAG
Mask PII in retrieved documents:
flow rag with pii masking
  user ask question
  # Retrieve documents
  $relevant_chunks = execute retrieve_documents()
  # Mask PII in retrieved content
  $relevant_chunks = await MaskSensitiveDataAction(
    source="retrieval",
    text=$relevant_chunks
  )
  # Generate response
  bot provide response with context
Dependencies
PII detection requires additional packages that must be installed separately.
# Install Presidio
pip install presidio-analyzer presidio-anonymizer
# Install spaCy and language model
pip install spacy
python -m spacy download en_core_web_lg
If these are not installed, you’ll see:
ImportError: Could not import presidio, please install it with
`pip install presidio-analyzer presidio-anonymizer`.
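To catch a missing package before the first request fails, a small pre-flight check can be run at startup. This is a sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names are the import names of the packages installed above (the spaCy model installs as an importable package named `en_core_web_lg`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the packages installed above.
REQUIRED_MODULES = ("presidio_analyzer", "presidio_anonymizer", "spacy", "en_core_web_lg")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All PII detection dependencies are importable.")
```

`find_spec` only locates a top-level module without importing it, so the check itself does not pay the cost of loading the spaCy model.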
PII detection adds latency:
- spaCy model loading takes time on first run
- Each detection requires NLP processing
- Consider caching results when possible
Optimization tips:
- Only enable for necessary sources (input/output/retrieval)
- Limit entities to those actually needed
- Adjust score threshold to reduce false positives
- Use masking instead of detection when appropriate
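The caching suggestion above can be sketched with `functools.lru_cache`. The detector here is a stand-in regex rather than a real analyzer call, and `_expensive_detect` and `detect_cached` are hypothetical names; the point is only that repeated identical inputs skip the expensive path.

```python
import re
from functools import lru_cache

CALLS = 0  # counts how often the expensive path actually runs

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _expensive_detect(text: str) -> bool:
    """Stand-in for a real analyzer call (hypothetical)."""
    global CALLS
    CALLS += 1
    return bool(EMAIL_RE.search(text))


@lru_cache(maxsize=1024)
def detect_cached(text: str) -> bool:
    """Memoize results per exact input string, with a bounded cache size."""
    return _expensive_detect(text)


detect_cached("Contact me at jane@example.com")
detect_cached("What is the weather today?")
detect_cached("Contact me at jane@example.com")  # served from the cache

print(CALLS)                       # 2
print(detect_cached.cache_info())  # hits=1, misses=2
```

A cache keyed on raw text keeps that text in memory, so the cache size and lifetime deserve the same care as any other store of user data.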
Implementation Details
The PII detection flows are defined in:
/nemoguardrails/library/sensitive_data_detection/flows.co
/nemoguardrails/library/sensitive_data_detection/actions.py
Actions:
- DetectSensitiveDataAction - Returns a boolean indicating whether PII is present
- MaskSensitiveDataAction - Returns the text with detected PII replaced by placeholders
Best Practices
- Start with detection - Use blocking mode first to understand what PII appears
- Tune threshold - Adjust based on false positive/negative rates
- Use appropriate entities - Only detect PII relevant to your domain
- Different rules per source - Input/output/retrieval may need different configurations
- Test thoroughly - Verify detection works for your specific use cases
- Consider compliance - Ensure your PII handling meets regulatory requirements (GDPR, CCPA, etc.)