# RootSignals-Judge-Llama-70B
| Property | Value |
|---|---|
| Base Model | Llama-3.3-70B-Instruct |
| Model Type | Text-only decoder transformer |
| Training Hardware | LUMI-G / AMD Radeon Instinct™ MI250X (384 GPUs) |
| Model URL | https://huggingface.co/root-signals/RootSignals-Judge-Llama-70B |
## What is RootSignals-Judge-Llama-70B?
RootSignals-Judge-Llama-70B is a large language model specialized for evaluation tasks and hallucination detection. Post-trained from Llama-3.3-70B-Instruct, it was optimized on high-quality, human-annotated datasets of pairwise preference judgments and instruction-following examples. The model achieves state-of-the-art performance in hallucination detection, surpassing even GPT-4 on certain benchmarks while operating at a fraction of the cost.
## Implementation Details
The model was trained with Direct Preference Optimization (DPO) using the IPO loss for 3 epochs, in bfloat16 mixed precision across 384 GPUs. It supports context lengths of up to 32k tokens and can be deployed locally with frameworks such as SGLang or vLLM. The model weights are also available in FP8 format to enable cost-effective research and commercial applications.
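As a minimal local-serving sketch with vLLM (the flag values below are illustrative assumptions; adjust parallelism and memory settings to your hardware):

```shell
# Serve the model behind an OpenAI-compatible API with vLLM.
# --max-model-len matches the 32k-token context window; the
# tensor-parallel degree depends on your available GPUs.
vllm serve root-signals/RootSignals-Judge-Llama-70B \
  --max-model-len 32768 \
  --tensor-parallel-size 8
```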
- Achieves an 86.3% pass@1 rate on the HaluBench test set
- Outperforms major closed-source models in hallucination detection
- Supports complex, user-defined scoring rubrics
- Provides detailed, structured justifications for evaluations
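A context-grounded hallucination check like the one scored on HaluBench can be sketched as a prompt-and-parse step. The prompt template and the PASS/FAIL reply convention below are illustrative assumptions, not the model's documented schema:

```python
# Sketch: building a context-grounded hallucination check for a judge model.
# The prompt wording and the PASS/FAIL convention are assumptions for
# illustration, not the model's documented output format.

def build_judge_prompt(context: str, answer: str) -> str:
    """Compose an evaluation prompt asking the judge to verify that the
    answer is fully supported by the retrieved context."""
    return (
        "You are an evaluator. Given the context and the answer, decide "
        "whether every claim in the answer is supported by the context.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Reply with PASS if fully supported, otherwise FAIL, followed by "
        "a short justification."
    )

def parse_verdict(reply: str) -> bool:
    """Interpret the judge's reply; True means no hallucination detected."""
    first = reply.strip().split(None, 1)[0].upper().rstrip(".,:")
    return first == "PASS"

prompt = build_judge_prompt("Paris is the capital of France.",
                            "The capital of France is Paris.")
print(parse_verdict("PASS - the answer restates the context."))  # True
```

In a real pipeline, `prompt` would be sent to the locally served judge and the reply fed to `parse_verdict`.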
## Core Capabilities
- Context-grounded hallucination detection for RAG applications
- Pairwise preference judgments with strong evaluation capabilities
- Custom evaluation metric implementation through specific rubrics
- Best-of-N decisions for inference-time search tasks
- Local deployment support for privacy-focused applications
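The Best-of-N capability above can be sketched as a simple selection loop; `score_with_judge` is a hypothetical stand-in for a call that sends a candidate (plus the rubric) to the judge and returns its score:

```python
# Sketch: Best-of-N selection with a judge model at inference time.
# The scoring callable is a hypothetical placeholder for an actual
# request to the judge; the toy scorer below is purely illustrative.

from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              score_with_judge: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_with_judge)

# Usage with a toy scorer (longer answers score higher, illustrative only):
picked = best_of_n(["short", "a longer draft answer"], len)
print(picked)  # "a longer draft answer"
```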
## Frequently Asked Questions
Q: What makes this model unique?
A: The model stands out for its exceptional performance in hallucination detection and evaluation tasks, achieving better results than GPT-4 and other leading models while being more cost-effective. It provides detailed reasoning for its decisions and supports custom evaluation rubrics.
Q: What are the recommended use cases?
A: The model is ideal for RAG system evaluation, detecting hallucinations in generated content, performing pairwise comparisons between different model outputs, and implementing custom evaluation metrics. It's particularly valuable in settings requiring local deployment or handling long-context inputs up to 32k tokens.
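The pairwise-comparison use case can be sketched the same way: build a comparison prompt and parse the judge's choice. The template and the single-letter A/B reply convention are illustrative assumptions, not the model's documented output format:

```python
# Sketch: pairwise preference judging between two candidate answers.
# The prompt wording and the A/B reply convention are assumptions
# for illustration only.

def build_pairwise_prompt(instruction: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer better follows the instruction."""
    return (
        "You are an evaluator. Compare the two answers to the instruction "
        "below and reply with 'A' or 'B' for the better one, followed by "
        "a short justification.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )

def parse_preference(reply: str) -> str:
    """Extract the judge's choice ('A' or 'B') from its reply."""
    first = reply.strip().split(None, 1)[0].upper().strip("'\".,:")
    if first in ("A", "B"):
        return first
    raise ValueError(f"unrecognized verdict: {reply!r}")
```

The strict parse (raising on anything other than A or B) makes malformed judge replies fail loudly rather than silently biasing the comparison.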