distilroberta-base-rejection-v1

Maintained by: protectai


  • Parameter Count: 82.1M
  • License: Apache 2.0
  • Base Model: DistilRoBERTa-base
  • Accuracy: 98.87%
  • F1 Score: 0.9537
  • Papers: Do-Not-Answer Dataset, Predicting Prompt Refusal

What is distilroberta-base-rejection-v1?

This is a specialized model fine-tuned by ProtectAI.com for detecting rejection responses in Large Language Models (LLMs). It's designed to identify when an AI system refuses to respond due to content moderation concerns, classifying outputs into normal (0) or rejection (1) categories. The model demonstrates exceptional performance with 98.87% accuracy and 95.37% F1 score.
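The normal (0) / rejection (1) mapping described above is standard two-class classification over the model's output logits. A minimal sketch of that post-processing step; the label names and the logit values below are illustrative assumptions, not read from the model's configuration:

```python
import math

def classify(logits, id2label={0: "NORMAL", 1: "REJECTION"}):
    """Map a pair of raw logits to a (label, confidence) prediction via softmax."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Illustrative logits only: a refusal-like response scoring high on class 1.
label, score = classify([-2.1, 3.4])
```

In practice these logits come from the model's classification head; the helper only shows how a score pair becomes a label.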

Implementation Details

Built on the DistilRoBERTa architecture, this model was trained on a carefully curated dataset combining rejection samples from multiple LLMs with normal RLHF responses. Training used a 90-10 split between normal outputs and rejections, the Adam optimizer, and a linear learning-rate schedule.

  • Training batch size: 16 with 3 epochs
  • Learning rate: 2e-05 with 500 warmup steps
  • Supports both PyTorch and ONNX runtime implementations
  • Maximum sequence length: 512 tokens
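The settings above map directly onto a standard transformers text-classification pipeline. A minimal sketch, assuming the model is published on the Hugging Face Hub as ProtectAI/distilroberta-base-rejection-v1 (verify the exact repo id before use):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_id = "ProtectAI/distilroberta-base-rejection-v1"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Truncate inputs to the model's 512-token maximum sequence length.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

result = classifier("Sorry, but I can't help with that request.")
```

`result` is a list of `{"label": ..., "score": ...}` dicts; the label string for each class is defined by the model's config.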

Core Capabilities

  • Binary classification of AI responses (normal/rejection)
  • High precision (92.79%) and recall (98.10%)
  • Case-sensitive analysis
  • Integrated content moderation support
  • Compatible with LLM Guard's NoRefusal Scanner
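As a sanity check, the reported F1 score is consistent with the precision and recall figures above, since F1 is their harmonic mean:

```python
precision, recall = 0.9279, 0.9810  # figures reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# f1 ≈ 0.9537, matching the reported score
```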

Frequently Asked Questions

Q: What makes this model unique?

The model's specialization in detecting AI rejection responses with high accuracy makes it valuable for content moderation and safety systems. Its lightweight architecture (82.1M parameters) and excellent performance metrics make it both efficient and reliable.

Q: What are the recommended use cases?

The model is ideal for content moderation systems, automated response validation, and safety monitoring in AI applications. It's particularly useful in scenarios where detecting AI system refusals is crucial for maintaining ethical boundaries and content guidelines.
