# Mistral-Nemo-Instruct-FP8-2407
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Context Window | 128k tokens |
| Model Type | Instruction-tuned LLM |
| Architecture | 40-layer Transformer with GQA |
| Model URL | HuggingFace |
## What is Mistral-Nemo-Instruct-FP8-2407?
Mistral-Nemo-Instruct-FP8-2407 is an FP8-quantized, instruction-tuned language model developed jointly by Mistral AI and NVIDIA. Built on Mistral-Nemo-Base-2407, it targets efficient LLM deployment: the FP8 weights reduce memory and compute requirements while the model retains strong multilingual capabilities, and it is designed as a drop-in replacement for Mistral 7B with improved performance.
## Implementation Details
The model is a 40-layer transformer with a model dimension of 5,120 and 32 attention heads, using grouped-query attention (8 KV heads) to shrink the KV cache. It applies SwiGLU activations and rotary position embeddings with theta = 1M, uses a vocabulary of approximately 128k tokens, and supports a 128k-token context window. A minimal sketch of the grouped-query attention pattern appears after the feature list below.
- Multilingual and code-focused training data
- FP8 quantization for efficient deployment
- GQA (Grouped-Query Attention) implementation
- 128k context window support
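
To make the GQA numbers concrete, here is a minimal PyTorch sketch of the attention pattern described above: 32 query heads share 8 KV heads, so each group of 4 query heads attends with the same key/value projections. The head dimension of 128 is an assumed value not stated in this card, and causal masking is omitted for brevity.

```python
import torch

# Grouped-query attention sketch using the card's head counts:
# 32 query heads, 8 KV heads -> each KV head serves 4 query heads.
# head_dim = 128 is an assumption; masking is omitted for brevity.
n_heads, n_kv_heads, head_dim = 32, 8, 128
batch, seq_len = 1, 16

q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V along the head axis so every query head has a matching KV head.
group = n_heads // n_kv_heads  # 4 query heads per KV head
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = scores.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([1, 32, 16, 128])
```

The memory saving comes from the KV cache: only 8 heads of keys and values are stored per layer instead of 32, while query expressiveness is preserved.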
## Core Capabilities
- Strong benchmark performance: 83.5% on HellaSwag and 76.8% on Winogrande (0-shot)
- Multilingual proficiency across 8+ languages, with multilingual MMLU scores ranging from 59% to 65%
- Efficient inference through FP8 quantization
- Compatible with the vLLM library for deployment (see the sketch below)
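
As a rough illustration of the vLLM compatibility noted above, the sketch below loads the FP8 checkpoint for offline inference. The Hugging Face repo id mirrors the model's name and is an assumption (substitute the actual path), and `max_model_len` is capped below the full 128k window to fit common GPU memory budgets.

```python
# Hedged vLLM deployment sketch; repo id and settings are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-FP8-2407",  # assumed repo id
    max_model_len=32768,  # cap the 128k window to fit typical GPU memory
)
params = SamplingParams(temperature=0.3, max_tokens=256)
outputs = llm.generate(
    ["Explain grouped-query attention in two sentences."], params
)
print(outputs[0].outputs[0].text)
```

A low sampling temperature such as 0.3 follows Mistral's guidance for the Nemo family, which is tuned for smaller temperatures than earlier Mistral models.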
## Frequently Asked Questions
### Q: What makes this model unique?
The model stands out for its combination of efficient FP8 quantization, a 128k context window, and strong multilingual capabilities, all while maintaining competitive benchmark performance. Its Apache 2.0 license also makes it accessible for commercial use.
### Q: What are the recommended use cases?
The model is well-suited for multilingual applications, general text generation, and instruction-following tasks. Its large context window makes it particularly useful for processing lengthy documents, while its efficient quantization enables deployment in resource-constrained environments.
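
To illustrate the long-document use case, here is a sketch that feeds an entire file into a single chat turn, relying on the 128k window instead of a chunking or retrieval pipeline. The input file name is hypothetical, and the repo id is the same assumption as above.

```python
# Long-context sketch: summarize a whole document in one prompt.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-FP8-2407",  # assumed repo id
    max_model_len=131072,  # allow inputs up to the 128k-token window
)
with open("quarterly_report.txt") as f:  # hypothetical input document
    document = f.read()

messages = [{
    "role": "user",
    "content": f"Summarize the key findings of this report:\n\n{document}",
}]
out = llm.chat(messages, SamplingParams(temperature=0.3, max_tokens=512))
print(out[0].outputs[0].text)
```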