Meta-Llama-3.1-405B-Instruct-AWQ-INT4

Maintained by: hugging-quants

| Property | Value |
| --- | --- |
| Parameter Count | 405 billion |
| Quantization | 4-bit AWQ |
| Languages Supported | 8 (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) |
| License | Llama 3.1 |
| VRAM Required | ~203 GiB |

What is Meta-Llama-3.1-405B-Instruct-AWQ-INT4?

This model is a community-maintained quantized version of Meta's largest Llama 3.1 language model, compressed from FP16 to INT4 precision with AutoAWQ. It retains the capabilities of the original 405B-parameter model while cutting the weight memory footprint to roughly a quarter of the FP16 original.

Implementation Details

The quantized weights use GEMM kernels with zero-point quantization and a group size of 128; like the base model, it targets multilingual dialogue use cases. Loading the weights alone takes approximately 203 GiB of VRAM, so a multi-GPU node is required in practice (see the estimate below).
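The ~203 GiB figure is roughly what back-of-envelope arithmetic predicts. The sketch below is not from the model card: it assumes one FP16 scale and one INT4 zero-point per 128-weight group, the published Llama 3.1 405B dimensions, and that the embedding table and LM head stay in FP16.

```python
GiB = 2**30

total_params = 405e9           # total parameter count
hidden, vocab = 16384, 128256  # Llama 3.1 405B dimensions (assumed here)

# Embedding table + LM head assumed to remain in FP16 (2 bytes/param).
fp16_params = 2 * vocab * hidden
fp16_bytes = fp16_params * 2

# Remaining weights packed at 4 bits (0.5 bytes per parameter).
q_params = total_params - fp16_params
q_bytes = q_params * 0.5

# Group-size-128 overhead: one FP16 scale (16 bits) plus one INT4
# zero-point (4 bits) per 128 packed weights (512 bits) -> 20/512 extra.
overhead = q_bytes * 20 / 512

total_gib = (fp16_bytes + q_bytes + overhead) / GiB
print(f"~{total_gib:.0f} GiB")  # ~202 GiB, in line with the quoted ~203 GiB
```

Runtime buffers, activations, and the KV cache come on top of this, which is why real deployments budget noticeably more than the load-time figure.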

  • Supports multiple inference frameworks including Transformers, AutoAWQ, and Text Generation Inference (TGI); see the loading sketch after this list
  • Implements 4-bit precision with AWQ quantization
  • Features optimized performance through Marlin kernels in TGI
  • Includes comprehensive chat template support
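For direct inference with Transformers, a minimal sketch looks like the following. The repo id is inferred from the maintainer and model name above; `device_map="auto"` simply shards the weights across whatever GPUs are visible, and the prompt text is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ compute dtype
    device_map="auto",          # shard ~203 GiB of weights across all GPUs
    low_cpu_mem_usage=True,
)

# The chat template shipped with the model formats the conversation.
messages = [{"role": "user", "content": "Explain AWQ quantization briefly."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```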

Core Capabilities

  • Multilingual understanding and generation across 8 languages
  • Optimized for dialogue and conversational tasks
  • High-performance instruction following
  • Efficient memory usage through quantization
  • Compatible with multiple deployment options (TGI, vLLM, direct inference); a vLLM sketch follows this list
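As one concrete deployment path, here is a hedged vLLM sketch. The tensor-parallel degree of 8 is an assumption about the host, not a requirement from the model card; vLLM can usually detect the AWQ config from the checkpoint, so the explicit `quantization` flag is a belt-and-braces hint.

```python
from vllm import LLM, SamplingParams

# Assumes a node whose aggregate VRAM covers the ~203 GiB of weights.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
    quantization="awq",      # usually auto-detected from the model config
    tensor_parallel_size=8,  # assumed GPU count; adjust to your hardware
    max_model_len=4096,      # cap context length to limit KV-cache memory
)

prompts = ["Translate to German: 'Quantization trades precision for memory.'"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```

For an instruct model you would normally route prompts through the chat template (e.g. vLLM's chat interface) rather than passing raw strings; the raw prompt above just keeps the sketch short.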

Frequently Asked Questions

Q: What makes this model unique?

It stands out as a successfully quantized version of one of the largest language models available: the capabilities of a 405B-parameter model in roughly a quarter of the FP16 memory footprint, with performance maintained across all 8 supported languages.

Q: What are the recommended use cases?

The model is particularly well-suited for multilingual dialogue applications, complex instruction following, and scenarios requiring advanced language understanding where hardware constraints make running the full FP16 model impractical.
