# Mistral-Nemo-Instruct-FP8-2407
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Context Window | 128k tokens |
| Model Type | Instruction-tuned LLM |
| Architecture | 40-layer Transformer with GQA |
| Model URL | HuggingFace |
## What is Mistral-Nemo-Instruct-FP8-2407?
Mistral-Nemo-Instruct-FP8-2407 is an FP8-quantized, instruction-tuned language model developed jointly by Mistral AI and NVIDIA. Built on Mistral-Nemo-Base-2407, it targets efficient LLM deployment: the FP8 weights reduce memory and compute requirements while the model retains strong multilingual capabilities, and it is designed as a drop-in replacement for Mistral 7B with improved performance.
## Implementation Details
The model is a 40-layer transformer with a model dimension of 5,120 and 32 attention heads, using grouped-query attention (8 KV heads) to shrink the KV cache. It applies SwiGLU activations and rotary position embeddings with theta = 1M, uses a vocabulary of approximately 128k tokens, and supports a 128k-token context window. A minimal sketch of the grouped-query attention pattern appears after the feature list below.
- Multilingual and code-focused training data
- FP8 quantization for efficient deployment
- GQA (Grouped-Query Attention) implementation
- 128k context window support
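
To make the GQA numbers concrete, here is a minimal PyTorch sketch of the attention pattern described above: 32 query heads share 8 KV heads, so each group of 4 query heads attends with the same key/value projections. The head dimension of 128 is an assumed value not stated in this card, and causal masking is omitted for brevity.

```python
import torch

# Grouped-query attention sketch using the card's head counts:
# 32 query heads, 8 KV heads -> each KV head serves 4 query heads.
# head_dim = 128 is an assumption; masking is omitted for brevity.
n_heads, n_kv_heads, head_dim = 32, 8, 128
batch, seq_len = 1, 16

q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V along the head axis so every query head has a matching KV head.
group = n_heads // n_kv_heads  # 4 query heads per KV head
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = scores.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([1, 32, 16, 128])
```

The memory saving comes from the KV cache: only 8 heads of keys and values are stored per layer instead of 32, while query expressiveness is preserved.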
## Core Capabilities
- Strong benchmark performance: 83.5% on HellaSwag and 76.8% on Winogrande (0-shot)
- Multilingual proficiency across 8+ languages, with multilingual MMLU scores ranging from 59% to 65%
- Efficient inference through FP8 quantization
- Compatible with the vLLM library for deployment (see the sketch below)
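
As a rough illustration of the vLLM compatibility noted above, the sketch below loads the FP8 checkpoint for offline inference. The Hugging Face repo id mirrors the model's name and is an assumption (substitute the actual path), and `max_model_len` is capped below the full 128k window to fit common GPU memory budgets.

```python
# Hedged vLLM deployment sketch; repo id and settings are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-FP8-2407",  # assumed repo id
    max_model_len=32768,  # cap the 128k window to fit typical GPU memory
)
params = SamplingParams(temperature=0.3, max_tokens=256)
outputs = llm.generate(
    ["Explain grouped-query attention in two sentences."], params
)
print(outputs[0].outputs[0].text)
```

A low sampling temperature such as 0.3 follows Mistral's guidance for the Nemo family, which is tuned for smaller temperatures than earlier Mistral models.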
## Frequently Asked Questions
### Q: What makes this model unique?
The model stands out for its combination of efficient FP8 quantization, a 128k context window, and strong multilingual capabilities, all while maintaining competitive benchmark performance. Its Apache 2.0 license also makes it accessible for commercial use.
### Q: What are the recommended use cases?
The model is well-suited for multilingual applications, general text generation, and instruction-following tasks. Its large context window makes it particularly useful for processing lengthy documents, while its efficient quantization enables deployment in resource-constrained environments.
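
To illustrate the long-document use case, here is a sketch that feeds an entire file into a single chat turn, relying on the 128k window instead of a chunking or retrieval pipeline. The input file name is hypothetical, and the repo id is the same assumption as above.

```python
# Long-context sketch: summarize a whole document in one prompt.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-FP8-2407",  # assumed repo id
    max_model_len=131072,  # allow inputs up to the 128k-token window
)
with open("quarterly_report.txt") as f:  # hypothetical input document
    document = f.read()

messages = [{
    "role": "user",
    "content": f"Summarize the key findings of this report:\n\n{document}",
}]
out = llm.chat(messages, SamplingParams(temperature=0.3, max_tokens=512))
print(out[0].outputs[0].text)
```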