IBM Granite 3.2 8B Instruct GGUF
| Property | Value |
|---|---|
| Base Model | IBM Granite 3.2 8B |
| Quantization Range | 2.84 GB to 8.68 GB |
| Original Source | huggingface.co/ibm-granite/granite-3.2-8b-instruct |
| Format | GGUF (llama.cpp compatible) |
What is ibm-granite_granite-3.2-8b-instruct-GGUF?
This is a comprehensive collection of GGUF quantizations of IBM's Granite 3.2 8B instruction-tuned language model. The repository offers 23 different quantization variants, ranging from extremely high quality (Q8_0) to highly compressed (IQ2_M), enabling users to choose the optimal balance between model size and performance for their specific hardware constraints.
Implementation Details
The model uses Granite's role-marker prompt format, with explicit system, user, and assistant turns. All quantizations were created using llama.cpp's imatrix option, which calibrates the quantization against sample text to better preserve quality at low bit widths (a command-line sketch follows the feature list below).
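Granite 3.x chat models delimit turns with `<|start_of_role|>` and `<|end_of_role|>` markers. A representative template follows; the placeholders `{system_prompt}` and `{prompt}` are illustrative, and the authoritative format is the chat template embedded in the GGUF metadata:

```text
<|start_of_role|>system<|end_of_role|>{system_prompt}<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{prompt}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
```

Key features of the repository: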
- Multiple quantization types (Q8_0 through IQ2_M)
- Special variants with Q8_0 embedding weights for enhanced quality
- Online repacking support for ARM and AVX CPU inference
- Compatible with LM Studio and other llama.cpp-based projects
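As a sketch of how such imatrix quantizations are produced (the calibration data actually used for this repository is not specified here, and the file names are illustrative), llama.cpp's `llama-imatrix` and `llama-quantize` tools are invoked roughly as follows:

```shell
# Compute an importance matrix from a calibration text file
./llama-imatrix -m granite-3.2-8b-instruct-f16.gguf -f calibration.txt -o imatrix.dat

# Quantize the full-precision model to Q4_K_M, guided by the importance matrix
./llama-quantize --imatrix imatrix.dat \
    granite-3.2-8b-instruct-f16.gguf \
    granite-3.2-8b-instruct-Q4_K_M.gguf Q4_K_M
```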
Core Capabilities
- File sizes from 2.84 GB to 8.68 GB, allowing flexible deployment
- Optimized performance across hardware architectures
- Enhanced-quality variants that keep embedding weights at Q8_0
- Support for both CPU and GPU inference (illustrated below)
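A minimal llama.cpp invocation for each mode (the file name and flag values here are assumptions; adjust `-ngl` to however many layers fit in your VRAM):

```shell
# CPU inference with 8 threads
./llama-cli -m granite-3.2-8b-instruct-Q4_K_M.gguf -t 8 \
    -p "Explain GGUF in one sentence." -n 128

# GPU inference: offload all layers (requires a CUDA, ROCm, or Metal build)
./llama-cli -m granite-3.2-8b-instruct-Q4_K_M.gguf -ngl 99 \
    -p "Explain GGUF in one sentence." -n 128
```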
Frequently Asked Questions
Q: What makes this model unique?
The model offers an exceptionally wide range of quantization options, including both traditional K-quants and newer I-quants, making it highly adaptable to different hardware configurations and performance requirements. The implementation includes special variants with Q8_0 embedding weights for enhanced quality in critical model components.
Q: What are the recommended use cases?
For maximum quality, use the Q6_K_L or Q8_0 variants. Q4_K_M is the recommended default, balancing size and performance. For limited-RAM scenarios, the I-quants (e.g., IQ3_M, IQ4_XS) offer better quality per gigabyte, especially on modern GPUs running cuBLAS (NVIDIA) or rocBLAS (AMD).
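Once you have chosen a variant, a single file can be fetched with the Hugging Face CLI. This is a sketch in which `<namespace>` stands for the repository owner's account and the include pattern matches whichever quant you picked:

```shell
# Download only the Q4_K_M file into the current directory
huggingface-cli download <namespace>/ibm-granite_granite-3.2-8b-instruct-GGUF \
    --include "*Q4_K_M.gguf" --local-dir ./
```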