RootSignals-Judge-Llama-70B

A powerful 70B parameter judge model fine-tuned from Llama-3.3, specializing in hallucination detection and instruction following evaluation with SOTA performance on multiple benchmarks.

| Property | Value |
|---|---|
| Base Model | Llama-3.3-70B-Instruct |
| Model Type | Text-Only Decoder Transformer |
| Training Hardware | LUMI-G / AMD Instinct™ MI250X (384 GPUs) |
| Model URL | https://huggingface.co/root-signals/RootSignals-Judge-Llama-70B |

What is RootSignals-Judge-Llama-70B?

RootSignals-Judge-Llama-70B is a specialized large language model designed specifically for evaluation tasks and hallucination detection. Post-trained from Llama-3.3-70B-Instruct, this model has been optimized using high-quality, human-annotated datasets focusing on pairwise preference judgments and instruction following capabilities. The model achieves state-of-the-art performance in hallucination detection, surpassing even GPT-4 on certain benchmarks while operating at a fraction of the cost.

Implementation Details

The model was trained with DPO using the IPO loss for 3 epochs in bfloat16 mixed precision on 384 GPUs. It supports context lengths of up to 32k tokens and can be deployed locally with frameworks such as SGLang or vLLM. The model weights are available in FP8 format to enable cost-effective research and commercial applications.
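
Because vLLM and SGLang both expose an OpenAI-compatible HTTP API, a locally served copy of the model can be queried with a standard chat client. The sketch below is illustrative: `build_judge_messages` is a hypothetical helper, and the pass/fail prompt wording is an assumption, not the model's official template.

```python
# Hedged sketch: querying a locally served judge model through an
# OpenAI-compatible endpoint (e.g. `vllm serve root-signals/RootSignals-Judge-Llama-70B`).
# The prompt format here is illustrative, not the model's documented template.

def build_judge_messages(context: str, answer: str) -> list[dict]:
    """Build a chat request asking the judge whether an answer is grounded in its context."""
    system = (
        "You are an evaluation model. Decide whether the ANSWER is fully "
        "supported by the CONTEXT. Reply PASS or FAIL with a short justification."
    )
    user = f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Example usage (requires a running server and the `openai` package):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# resp = client.chat.completions.create(
#     model="root-signals/RootSignals-Judge-Llama-70B",
#     messages=build_judge_messages("Paris is the capital of France.",
#                                   "The capital of France is Paris."),
#     temperature=0.0,
# )
# print(resp.choices[0].message.content)
```

Temperature 0 is the usual choice for judge calls, since evaluation verdicts should be as deterministic as possible.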

  • Achieves an 86.3% pass@1 rate on the HaluBench test set
  • Outperforms major closed-source models in hallucination detection
  • Supports complex, user-defined scoring rubrics
  • Provides detailed, structured justifications for evaluations
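
Since the model accepts user-defined scoring rubrics, the rubric typically has to be rendered into the judge prompt. The helper below is a minimal sketch of one way to do that; `build_rubric_prompt` and the surrounding prompt wording are assumptions for illustration, not an official interface.

```python
# Hedged sketch: rendering a custom 1-5 scoring rubric into a judge prompt.
# Both the function and the prompt layout are illustrative assumptions.

def build_rubric_prompt(rubric: dict[int, str], question: str, response: str) -> str:
    """Render a user-defined score->description rubric into a single judge prompt."""
    rubric_lines = [f"{score}: {desc}" for score, desc in sorted(rubric.items())]
    return (
        "Score the RESPONSE to the QUESTION using the rubric below. "
        "Return the score and a brief justification.\n\n"
        "RUBRIC:\n" + "\n".join(rubric_lines) +
        f"\n\nQUESTION:\n{question}\n\nRESPONSE:\n{response}"
    )

prompt = build_rubric_prompt(
    {1: "Irrelevant or hallucinated", 3: "Partially grounded", 5: "Fully grounded and complete"},
    question="What is the boiling point of water at sea level?",
    response="100 degrees Celsius.",
)
```

Sorting the rubric keys keeps the score scale in a predictable order regardless of how the caller's dictionary was built.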

Core Capabilities

  • Context-grounded hallucination detection for RAG applications
  • Pairwise preference judgments with strong evaluation capabilities
  • Custom evaluation metric implementation through specific rubrics
  • Best-of-N decisions for inference-time search tasks
  • Local deployment support for privacy-focused applications
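
For the best-of-N use case, the judge scores each candidate output and the highest-scoring one is kept. A minimal selection loop might look like the following; `best_of_n` and the stand-in scorer are hypothetical, and a real deployment would replace `score_fn` with a call to the judge model.

```python
# Hedged sketch of best-of-N selection driven by judge scores.
# `score_fn` stands in for a judge-model call; here it is any callable
# mapping a candidate string to a numeric score.

def best_of_n(candidates: list[str], score_fn) -> str:
    """Return the candidate with the highest score; ties go to the earliest candidate."""
    best, best_score = candidates[0], score_fn(candidates[0])
    for cand in candidates[1:]:
        s = score_fn(cand)
        if s > best_score:
            best, best_score = cand, s
    return best

# With a stand-in scorer (string length) instead of a real judge call:
winner = best_of_n(["a", "abc", "ab"], score_fn=len)  # picks "abc"
```

Breaking ties toward the earliest candidate makes the selection deterministic when several outputs receive the same score.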

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its exceptional performance in hallucination detection and evaluation tasks, achieving better results than GPT-4 and other leading models while being more cost-effective. It provides detailed reasoning for its decisions and supports custom evaluation rubrics.

Q: What are the recommended use cases?

The model is ideal for RAG system evaluation, detecting hallucinations in generated content, performing pairwise comparisons between different model outputs, and implementing custom evaluation metrics. It's particularly valuable in settings requiring local deployment or handling long-context inputs up to 32k tokens.
