distilroberta-base-rejection-v1
Property | Value |
---|---|
Parameter Count | 82.1M |
License | Apache 2.0 |
Base Model | DistilRoBERTa-base |
Accuracy | 98.87% |
F1 Score | 0.9537 |
Papers | Do-Not-Answer Dataset, Predicting Prompt Refusal |
What is distilroberta-base-rejection-v1?
This model, fine-tuned by ProtectAI.com, detects rejection responses from Large Language Models (LLMs). It identifies when an AI system refuses to respond because of content moderation concerns, classifying each output as normal (0) or rejection (1). It reports 98.87% accuracy and an F1 score of 0.9537.
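As a minimal sketch of how the normal (0) / rejection (1) classification might be used from Python, the snippet below wraps the Hugging Face `transformers` text-classification pipeline. The Hub ID `ProtectAI/distilroberta-base-rejection-v1`, the label names `REJECTION` and `LABEL_1`, and the helper names `load_classifier` and `is_rejection` are assumptions made for illustration; check the published model configuration before relying on them.

```python
from typing import Any, Dict

# Assumption: the Hugging Face Hub ID below; the text names only the model and its publisher.
MODEL_ID = "ProtectAI/distilroberta-base-rejection-v1"

# Assumption: the pipeline reports class 1 as "REJECTION"; "LABEL_1" covers the
# default naming used when a checkpoint carries no custom label names.
REJECTION_LABELS = {"REJECTION", "LABEL_1"}


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline, truncating inputs to 512 tokens."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model=model_id,
        truncation=True,
        max_length=512,
    )


def is_rejection(prediction: Dict[str, Any], threshold: float = 0.5) -> bool:
    """Return True when a prediction is the rejection class with enough confidence."""
    label = str(prediction["label"]).upper()
    return label in REJECTION_LABELS and float(prediction["score"]) >= threshold


# Example (requires `transformers`, a backend such as PyTorch, and network access
# on first run to download the weights):
# classifier = load_classifier()
# print(is_rejection(classifier("Sorry, but I can't assist with that.")[0]))
```

Keeping the decision logic in a small pure function separates the thresholding choice from model loading, so the threshold can be tuned without touching the pipeline.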
Implementation Details
Built on the DistilRoBERTa architecture, the model was trained on a dataset that combines rejection samples from multiple LLMs with normal RLHF responses. The training data was split 90-10 between normal outputs and rejections, and training used the Adam optimizer with a linear learning rate schedule.
- Batch size: 16, trained for 3 epochs
- Learning rate: 2e-05 with 500 warmup steps
- Available for both PyTorch and ONNX Runtime
- Maximum sequence length: 512 tokens
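The learning rate settings listed above (2e-05 peak, 500 warmup steps, linear schedule) can be illustrated with a small pure-Python sketch. It assumes the common shape of a linear schedule: a linear ramp from zero during warmup, followed by a linear decay to zero at the end of training. The function name `linear_schedule_lr` and the total step count of 5000 are hypothetical, since the source does not state the dataset size or total number of steps.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 2e-05,
    warmup_steps: int = 500,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Hypothetical run length, chosen only to show the shape of the schedule.
TOTAL_STEPS = 5000
for s in (0, 250, 500, 2750, 5000):
    print(s, linear_schedule_lr(s, TOTAL_STEPS))
```

With these assumed values the rate peaks at step 500 and returns to zero at step 5000; the real schedule depends on the actual number of training steps.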
Core Capabilities
- Binary classification of AI responses (normal/rejection)
- High precision (92.79%) and recall (98.10%)
- Case-sensitive analysis
- Integrated content moderation support
- Compatible with LLM Guard's NoRefusal Scanner
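The precision and recall figures above are consistent with the reported F1 score, since F1 is the harmonic mean of the two. A short check, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed for the model.
f1 = f1_score(0.9279, 0.9810)
print(f"{f1:.4f}")  # prints 0.9537
```

Because recall is higher than precision, the model misses few rejections but occasionally flags a normal response as one.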
Frequently Asked Questions
Q: What makes this model unique?
The model specializes in one narrow task, detecting AI rejection responses, which makes it useful for content moderation and safety systems. At 82.1M parameters it is lightweight, and it reports 98.87% accuracy with an F1 score of 0.9537.
Q: What are the recommended use cases?
The model suits content moderation systems, automated response validation, and safety monitoring in AI applications. It is most useful wherever detecting an AI system's refusal matters for enforcing ethical boundaries and content guidelines.
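For the automated response validation use case, one possible shape is a helper that runs a batch of LLM responses through a classifier and reports the share that were refusals. This is only a sketch: `refusal_report` is a hypothetical name, the classifier is passed in as a callable so any backend can be used, and the `REJECTION` label name is an assumption about how the model's class 1 is reported.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Prediction = Dict[str, object]


def refusal_report(
    responses: Sequence[str],
    classify: Callable[[List[str]], List[Prediction]],
    rejection_label: str = "REJECTION",  # assumed name for class 1
) -> Tuple[float, List[int]]:
    """Return the refusal rate and the indices of responses flagged as refusals."""
    if not responses:
        return 0.0, []
    predictions = classify(list(responses))
    flagged = [
        i
        for i, prediction in enumerate(predictions)
        if str(prediction["label"]).upper() == rejection_label
    ]
    return len(flagged) / len(responses), flagged
```

A Hugging Face text-classification pipeline called with a list of strings returns a list of `{"label": ..., "score": ...}` dictionaries, so it could be passed directly as `classify`; a stub function works equally well for testing.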