distilroberta-base-rejection-v1

Property	Value
Parameter Count	82.1M
License	Apache 2.0
Base Model	DistilRoBERTa-base
Accuracy	98.87%
F1 Score	0.9537
Papers	Do-Not-Answer Dataset, Predicting Prompt Refusal

What is distilroberta-base-rejection-v1?

This is a specialized model fine-tuned by ProtectAI.com for detecting rejection responses from Large Language Models (LLMs). It identifies when an AI system refuses to respond due to content moderation concerns, classifying each output as normal (0) or rejection (1). The model reports 98.87% accuracy and an F1 score of 0.9537.
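The classification described above can be sketched with the Hugging Face `transformers` text-classification pipeline. This is a minimal sketch, not the authors' reference code: the Hub ID in `MODEL_ID` and the label strings in `REJECTION_LABELS` are assumptions that should be checked against the published model card and the checkpoint's config.

```python
# Assumption: Hugging Face Hub ID for this model; it is not stated in the text above.
MODEL_ID = "ProtectAI/distilroberta-base-rejection-v1"

# Assumption: label strings the checkpoint may emit for class 1 (rejection).
REJECTION_LABELS = {"REJECTION", "LABEL_1", "1"}


def build_classifier():
    """Create a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the pure helper below works without transformers installed.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model=MODEL_ID,
        truncation=True,   # cut inputs that exceed the model's limit
        max_length=512,    # maximum sequence length listed for this model
    )


def to_class_id(prediction: dict) -> int:
    """Map one pipeline prediction to 0 (normal) or 1 (rejection)."""
    return 1 if str(prediction["label"]).upper() in REJECTION_LABELS else 0


def demo() -> None:
    """Classify one example response; requires network access and transformers."""
    classifier = build_classifier()
    prediction = classifier("Sorry, but I can't assist with that.")[0]
    print(prediction, "->", to_class_id(prediction))
```

Calling `demo()` runs the model on a single refusal-style sentence. `to_class_id` is kept separate so the label mapping can be adjusted in one place if the checkpoint uses different label names.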

Implementation Details

Built on the DistilRoBERTa architecture, this model was trained on a dataset that combines rejection samples from multiple LLMs with normal responses from RLHF datasets. The training data used a 90/10 split between normal outputs and rejections, and the model was optimized with Adam and a linear learning rate schedule.

Training batch size: 16, for 3 epochs
Learning rate: 2e-05 with 500 warmup steps
Available for both PyTorch and ONNX Runtime
Maximum sequence length: 512 tokens
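To make the learning rate settings above concrete, here is a minimal sketch of a linear schedule with warmup. It assumes the common pattern of a linear ramp from zero to the peak rate over the warmup steps, followed by a linear decay to zero; the source only says "linear learning rate scheduling", so the exact shape is an assumption. The `total_steps` value is hypothetical, since the total number of training steps is not given.

```python
def linear_schedule_lr(
    step: int,
    peak_lr: float = 2e-5,      # learning rate listed above
    warmup_steps: int = 500,    # warmup steps listed above
    total_steps: int = 10_000,  # hypothetical: not stated in the source
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 250, 500, 5_250, 10_000):
        print(s, linear_schedule_lr(s))
```

Under these assumptions the rate reaches its 2e-05 peak at step 500 and falls to zero at the final step.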

Core Capabilities

Binary classification of AI responses (normal/rejection)
Precision of 92.79% and recall of 98.10%
Case-sensitive analysis
Integrated content moderation support
Compatible with LLM Guard's NoRefusal Scanner
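The reported F1 score can be checked against the precision and recall listed above, since F1 is the harmonic mean of the two. The short calculation below uses only the figures from this page; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.9279  # 92.79%, as listed above
recall = 0.9810     # 98.10%, as listed above

print(round(f1_score(precision, recall), 4))  # 0.9537
```

The result matches the F1 score in the property table. Because roughly nine in ten training examples are normal outputs, F1 on the rejection class says more about the model than accuracy alone.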

Frequently Asked Questions

Q: What makes this model unique?

The model specializes in one narrow task, detecting AI rejection responses, and reports 98.87% accuracy on it. At 82.1M parameters it is also lightweight, which makes it practical to run as an additional check in content moderation and safety systems.

Q: What are the recommended use cases?

The model suits content moderation systems, automated response validation, and safety monitoring in AI applications. It is particularly useful where a system needs to detect that an AI has refused a request, for example to confirm that ethical boundaries and content guidelines are being enforced.
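As an illustration of automated response validation, the sketch below flags responses that a classifier marked as rejections and computes the overall refusal rate. It assumes predictions in the `{"label": ..., "score": ...}` shape returned by a Hugging Face text-classification pipeline; the label strings, the 0.5 threshold, and the function name are hypothetical choices, not values from the source.

```python
# Assumption: label strings that denote class 1 (rejection).
REJECTION_LABELS = {"REJECTION", "LABEL_1", "1"}


def flag_refusals(responses, predictions, threshold=0.5):
    """Return (flagged_responses, refusal_rate) for a batch of LLM outputs.

    A response is flagged when its predicted label is a rejection and the
    classifier's confidence in that label is at least `threshold`.
    """
    if len(responses) != len(predictions):
        raise ValueError("responses and predictions must have the same length")

    flagged = [
        text
        for text, pred in zip(responses, predictions)
        if str(pred["label"]).upper() in REJECTION_LABELS
        and float(pred["score"]) >= threshold
    ]
    rate = len(flagged) / len(responses) if responses else 0.0
    return flagged, rate


if __name__ == "__main__":
    # Hand-written example predictions, not real model output.
    texts = ["Here is the summary.", "I can't help with that request."]
    preds = [
        {"label": "NORMAL", "score": 0.99},
        {"label": "REJECTION", "score": 0.97},
    ]
    print(flag_refusals(texts, preds))
```

A monitoring job could track the returned rate over time, while a validation step could retry or escalate the flagged responses.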