DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

neuralmagic

An optimized 70B-parameter LLM using FP8 quantization, achieving 99.8% accuracy recovery while halving model size and improving inference speed by up to 3x.

Model Type: LlamaForCausalLM
Parameter Count: 70B
Quantization: FP8 Dynamic
Release Date: 2/1/2025
Developer: Neural Magic
Model URL: huggingface.co/neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

What is DeepSeek-R1-Distill-Llama-70B-FP8-dynamic?

This is a quantized version of the DeepSeek-R1-Distill-Llama-70B model, optimized using FP8 dynamic quantization to reduce model size while maintaining performance. The model achieves remarkable efficiency by reducing the number of bits per parameter from 16 to 8, resulting in approximately 50% reduction in disk size and GPU memory requirements.
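The memory arithmetic behind the 50% claim can be checked directly. A minimal sketch (the 70B parameter count is rounded, and real checkpoints add small overheads for embeddings and quantization scales):

```python
# Back-of-the-envelope memory footprint for a 70B-parameter model.
# BF16/FP16 stores 2 bytes per parameter; FP8 stores 1 byte.
num_params = 70e9

bf16_gb = num_params * 2 / 1e9  # 16-bit weights
fp8_gb = num_params * 1 / 1e9   # 8-bit weights

print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of original)")
# -> BF16: 140 GB, FP8: 70 GB (50% of original)
```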

Implementation Details

The model employs sophisticated quantization techniques, specifically targeting the linear operators within transformer blocks. It uses symmetric per-channel quantization for weights and symmetric per-token quantization for activations, implemented through LLM Compressor.
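The per-token dynamic scheme for activations can be illustrated with a minimal pure-Python sketch. The helper names are ours, not LLM Compressor's API, and real FP8 E4M3 rounding is non-uniform rather than the integer rounding used here; the point is the per-token scale computed at runtime:

```python
# Simplified symmetric per-token dynamic quantization.
# FP8 E4M3 represents magnitudes up to 448; a per-token scale maps
# each token's activation vector into that range at inference time.
FP8_E4M3_MAX = 448.0

def quantize_per_token(token_activations):
    """Quantize one token's activation vector with a symmetric scale."""
    amax = max(abs(x) for x in token_activations)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # Plain rounding is a simplification of FP8's non-uniform grid.
    quantized = [
        max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(x / scale)))
        for x in token_activations
    ]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

acts = [0.5, -2.0, 3.5, 0.0]
q, s = quantize_per_token(acts)
print(dequantize(q, s))  # these values round-trip exactly
```

Because the scale is recomputed per token ("dynamic"), no calibration data is needed for activations, unlike static FP8 schemes.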

  • Achieves up to 1.4x speedup in single-stream deployment
  • Up to 3.0x speedup in multi-stream asynchronous deployment
  • Maintains 99.8% accuracy compared to the original model on OpenLLM V1 benchmarks
  • Successfully deploys using vLLM backend for efficient inference
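Serving the checkpoint with vLLM can be sketched as follows. The model name comes from the table above; tensor_parallel_size and the sampling settings are illustrative and depend on your hardware (a 70B FP8 model still needs roughly 70 GB of weights across your GPUs):

```python
from vllm import LLM, SamplingParams

# Load the FP8-dynamic checkpoint; vLLM's FP8 support handles the
# quantized weights. tensor_parallel_size is hardware-dependent.
llm = LLM(
    model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain FP8 dynamic quantization briefly."], params)
print(outputs[0].outputs[0].text)
```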

Core Capabilities

  • Strong performance in reasoning tasks (76.49% average score)
  • Excellent coding capabilities (81% pass@1 on HumanEval)
  • Robust performance across various benchmarks including MATH-500 (95.14%) and GSM8K (93.03%)
  • Efficient scaling across different GPU configurations (A6000, A100, H100)

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its exceptional balance between efficiency and performance, using FP8 dynamic quantization to significantly reduce resource requirements while maintaining over 99% of the original model's accuracy across most benchmarks.

Q: What are the recommended use cases?

The model excels in various scenarios including instruction following, multi-turn chat, code completion, and large-scale text processing. It's particularly effective in deployment scenarios where resource optimization is crucial while maintaining high performance standards.
