# RootSignals-Judge-Llama-70B
| Property | Value |
|---|---|
| Base Model | Llama-3.3-70B-Instruct |
| Model Type | Text-only decoder transformer |
| Training Hardware | LUMI-G / AMD Radeon Instinct™ MI250X (384 GPUs) |
| Model URL | https://huggingface.co/root-signals/RootSignals-Judge-Llama-70B |
## What is RootSignals-Judge-Llama-70B?
RootSignals-Judge-Llama-70B is a large language model specialized for evaluation tasks and hallucination detection. Post-trained from Llama-3.3-70B-Instruct, it was optimized on high-quality, human-annotated datasets of pairwise preference judgments and instruction-following examples. The model achieves state-of-the-art performance in hallucination detection, surpassing even GPT-4 on certain benchmarks while operating at a fraction of the cost.
## Implementation Details
The model was trained with Direct Preference Optimization (DPO) using the IPO loss for 3 epochs, in bfloat16 mixed precision across 384 GPUs. It supports context lengths of up to 32k tokens and can be deployed locally with frameworks such as SGLang or vLLM. The model weights are also available in FP8 format to enable cost-effective research and commercial applications.
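As a minimal local-serving sketch with vLLM (the flag values below are illustrative assumptions; adjust parallelism and memory settings to your hardware):

```shell
# Serve the model behind an OpenAI-compatible API with vLLM.
# --max-model-len matches the 32k-token context window; the
# tensor-parallel degree depends on your available GPUs.
vllm serve root-signals/RootSignals-Judge-Llama-70B \
  --max-model-len 32768 \
  --tensor-parallel-size 8
```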
- Achieves an 86.3% pass@1 rate on the HaluBench test set
- Outperforms major closed-source models in hallucination detection
- Supports complex, user-defined scoring rubrics
- Provides detailed, structured justifications for evaluations
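A context-grounded hallucination check like the one scored on HaluBench can be sketched as a prompt-and-parse step. The prompt template and the PASS/FAIL reply convention below are illustrative assumptions, not the model's documented schema:

```python
# Sketch: building a context-grounded hallucination check for a judge model.
# The prompt wording and the PASS/FAIL convention are assumptions for
# illustration, not the model's documented output format.

def build_judge_prompt(context: str, answer: str) -> str:
    """Compose an evaluation prompt asking the judge to verify that the
    answer is fully supported by the retrieved context."""
    return (
        "You are an evaluator. Given the context and the answer, decide "
        "whether every claim in the answer is supported by the context.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Reply with PASS if fully supported, otherwise FAIL, followed by "
        "a short justification."
    )

def parse_verdict(reply: str) -> bool:
    """Interpret the judge's reply; True means no hallucination detected."""
    first = reply.strip().split(None, 1)[0].upper().rstrip(".,:")
    return first == "PASS"

prompt = build_judge_prompt("Paris is the capital of France.",
                            "The capital of France is Paris.")
print(parse_verdict("PASS - the answer restates the context."))  # True
```

In a real pipeline, `prompt` would be sent to the locally served judge and the reply fed to `parse_verdict`.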
## Core Capabilities
- Context-grounded hallucination detection for RAG applications
- Pairwise preference judgments with strong evaluation capabilities
- Custom evaluation metric implementation through specific rubrics
- Best-of-N decisions for inference-time search tasks
- Local deployment support for privacy-focused applications
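The Best-of-N capability above can be sketched as a simple selection loop; `score_with_judge` is a hypothetical stand-in for a call that sends a candidate (plus the rubric) to the judge and returns its score:

```python
# Sketch: Best-of-N selection with a judge model at inference time.
# The scoring callable is a hypothetical placeholder for an actual
# request to the judge; the toy scorer below is purely illustrative.

from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              score_with_judge: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_with_judge)

# Usage with a toy scorer (longer answers score higher, illustrative only):
picked = best_of_n(["short", "a longer draft answer"], len)
print(picked)  # "a longer draft answer"
```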
## Frequently Asked Questions
Q: What makes this model unique?
A: The model stands out for its exceptional performance in hallucination detection and evaluation tasks, achieving better results than GPT-4 and other leading models while being more cost-effective. It provides detailed reasoning for its decisions and supports custom evaluation rubrics.
Q: What are the recommended use cases?
A: The model is ideal for RAG system evaluation, detecting hallucinations in generated content, performing pairwise comparisons between different model outputs, and implementing custom evaluation metrics. It's particularly valuable in settings requiring local deployment or handling long-context inputs up to 32k tokens.
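The pairwise-comparison use case can be sketched the same way: build a comparison prompt and parse the judge's choice. The template and the single-letter A/B reply convention are illustrative assumptions, not the model's documented output format:

```python
# Sketch: pairwise preference judging between two candidate answers.
# The prompt wording and the A/B reply convention are assumptions
# for illustration only.

def build_pairwise_prompt(instruction: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer better follows the instruction."""
    return (
        "You are an evaluator. Compare the two answers to the instruction "
        "below and reply with 'A' or 'B' for the better one, followed by "
        "a short justification.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )

def parse_preference(reply: str) -> str:
    """Extract the judge's choice ('A' or 'B') from its reply."""
    first = reply.strip().split(None, 1)[0].upper().strip("'\".,:")
    if first in ("A", "B"):
        return first
    raise ValueError(f"unrecognized verdict: {reply!r}")
```

The strict parse (raising on anything other than A or B) makes malformed judge replies fail loudly rather than silently biasing the comparison.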