RootSignals-Judge-Llama-70B

Maintained By
root-signals

Property           Value
Base Model         Llama-3.3-70B-Instruct
Model Type         Text-Only Decoder Transformer
Training Hardware  LUMI-G / AMD Radeon Instinct™ MI250X (384 GPUs)
Model URL          https://huggingface.co/root-signals/RootSignals-Judge-Llama-70B

What is RootSignals-Judge-Llama-70B?

RootSignals-Judge-Llama-70B is a large language model specialized for evaluation tasks and hallucination detection. Post-trained from Llama-3.3-70B-Instruct, it has been optimized on high-quality, human-annotated datasets covering pairwise preference judgments and instruction-following. The model achieves state-of-the-art performance in hallucination detection, surpassing even GPT-4 on certain benchmarks while operating at a fraction of the cost.

Implementation Details

The model was trained with Direct Preference Optimization (DPO) using the IPO loss for 3 epochs in bfloat16 mixed precision on 384 GPUs. It supports context lengths of up to 32k tokens and can be deployed locally with frameworks such as SGLang or vLLM; a minimal serving sketch follows the list below. The model weights are released in FP8 format to enable cost-effective research and commercial applications.

  • Achieves an 86.3% pass@1 rate on the HaluBench test set
  • Outperforms major closed-source models in hallucination detection
  • Supports complex, user-defined scoring rubrics
  • Provides detailed, structured justifications for evaluations
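As a concrete starting point, the snippet below is a minimal sketch of querying a locally deployed copy of the model through vLLM's OpenAI-compatible server. The base URL, API key placeholder, serve flags, and sampling settings are illustrative assumptions, not an official recipe from the model card.

```python
# Minimal sketch: querying a local vLLM deployment of the judge.
# Assumes an OpenAI-compatible server has been started first, e.g.:
#   vllm serve root-signals/RootSignals-Judge-Llama-70B --max-model-len 32768
# The base URL, API key placeholder, and sampling settings are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="root-signals/RootSignals-Judge-Llama-70B",
    temperature=0.0,  # deterministic output is usually preferable for judging
    messages=[{"role": "user", "content": "Briefly describe your evaluation capabilities."}],
)
print(response.choices[0].message.content)
```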

Core Capabilities

  • Context-grounded hallucination detection for RAG applications (see the sketch after this list)
  • Pairwise preference judgments with strong evaluation capabilities
  • Custom evaluation metric implementation through specific rubrics
  • Best-of-N decisions for inference-time search tasks
  • Local deployment support for privacy-focused applications
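To illustrate the first capability, the sketch below wraps a context-grounded hallucination check for a RAG pipeline, reusing the `client` from the serving sketch above. The model card does not publish an exact prompt schema, so the template, the PASS/FAIL verdict format, and the `check_hallucination` helper are assumptions for illustration.

```python
# Hypothetical prompt template for context-grounded hallucination detection;
# the card does not specify the model's expected input format.
EVAL_TEMPLATE = """You are an evaluation model. Decide whether the answer below
is fully supported by the given context.

Context:
{context}

Answer:
{answer}

Reply with a verdict, PASS or FAIL, followed by a short justification."""

def check_hallucination(client, context: str, answer: str) -> str:
    """Return the judge's verdict and justification for one context/answer pair."""
    prompt = EVAL_TEMPLATE.format(context=context, answer=answer)
    response = client.chat.completions.create(
        model="root-signals/RootSignals-Judge-Llama-70B",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The same pattern extends to user-defined rubrics: swap the template body for the rubric text and the output schema your pipeline expects.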

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its exceptional performance in hallucination detection and evaluation tasks, achieving better results than GPT-4 and other leading models while being more cost-effective. It provides detailed reasoning for its decisions and supports custom evaluation rubrics.

Q: What are the recommended use cases?

The model is ideal for RAG system evaluation, detecting hallucinations in generated content, performing pairwise comparisons between different model outputs, and implementing custom evaluation metrics. It's particularly valuable in settings requiring local deployment or handling long-context inputs up to 32k tokens.
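For the pairwise-comparison and Best-of-N use cases, one possible pattern is to number the candidate outputs and ask the judge for the winning index, as in the hedged sketch below; the prompt wording and the `best_of_n` helper are illustrative, not part of the model's documented interface.

```python
# Illustrative Best-of-N selection (pairwise comparison is the N=2 case).
# Candidate generation is assumed to happen elsewhere; the judge only ranks.
def best_of_n(client, question: str, candidates: list[str]) -> int:
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate answers:\n{numbered}\n\n"
        "Reply with only the index of the best answer."
    )
    response = client.chat.completions.create(
        model="root-signals/RootSignals-Judge-Llama-70B",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```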
