phi-4-GGUF

Original Model: Microsoft/phi-4
Quantization Types: 25+ variants
Size Range: 4.49GB – 58.64GB
Author: bartowski
Model Link: Hugging Face Repository

What is phi-4-GGUF?

phi-4-GGUF is a comprehensive collection of quantized versions of Microsoft's phi-4 language model, optimized for different hardware configurations and memory constraints. Built with llama.cpp's quantization tooling, the collection spans a wide range of compression levels, each offering a different trade-off between model size and output quality.

Implementation Details

The quantizations use llama.cpp's imatrix (importance matrix) calibration option, and the model requires a specific prompt format. The collection ranges from full F32 weights (58.64GB) down to the highly compressed IQ2_XS variant (4.49GB). Notably, certain variants (Q3_K_XL, Q4_K_L) keep the embedding and output weights at Q8_0 rather than the default quantization, trading a little extra size for improved quality.
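
As an illustration, a single quant can be fetched from the repository with the huggingface_hub client. This is a minimal sketch: the filename below follows bartowski's usual "<model>-<quant>.gguf" naming convention but is an assumption, so verify it against the repository's actual file listing.

```python
# Minimal sketch: download one quant from the repo with huggingface_hub.
# The filename is assumed from bartowski's usual naming scheme --
# check the repository's file list before relying on it.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/phi-4-GGUF",
    filename="phi-4-Q4_K_M.gguf",  # assumed name; pick the quant that fits your RAM/VRAM
    local_dir=".",
)
print(f"Downloaded to {model_path}")
```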

  • Supports multiple quantization methods: Q8_0, Q6_K, Q5_K, Q4_K, Q3_K, IQ4, IQ3, IQ2
  • Implements online repacking for ARM and AVX CPU inference
  • Uses state-of-the-art (SOTA) quantization techniques at lower bit levels to keep output usable
  • Features tokenizer improvements from the unsloth team
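
To put these pieces together, here is a minimal sketch of running one of these quants locally via the llama-cpp-python bindings. The model path, context size, and generation settings are illustrative placeholders, not values prescribed by the repository; the chat template is typically picked up from the GGUF metadata.

```python
# Minimal sketch: load a downloaded phi-4 quant with llama-cpp-python.
# All paths and parameters here are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./phi-4-Q4_K_M.gguf",  # assumed local filename
    n_ctx=4096,       # context window; adjust to your memory budget
    n_gpu_layers=-1,  # offload all layers to GPU; set 0 for CPU-only inference
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GGUF quantization in one sentence."},
    ],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```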

Core Capabilities

  • Flexible deployment options for various hardware configurations
  • High-quality preservation in recommended variants (Q6_K_L, Q5_K_M)
  • Optimized performance for both CPU and GPU inference
  • Special handling for embed/output weights in specific variants

Frequently Asked Questions

Q: What makes this model unique?

This implementation offers an exceptionally wide range of quantization options, letting users find the right balance between model size, performance, and quality for their specific hardware constraints. The inclusion of both traditional K-quants and newer I-quants provides flexibility across different acceleration backends.

Q: What are the recommended use cases?

For the best quality, choose the Q6_K_L or Q5_K_M variants. For systems with limited RAM, Q4_K_M offers a good balance between size and quality. At the low end, the IQ2 variants remain surprisingly usable despite their very small footprint.
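
As a rough way to operationalize these recommendations, the sketch below maps a memory budget to a suggested quant from this collection. Only the IQ2_XS figure comes from the size range above; the Q6_K_L, Q5_K_M, and Q4_K_M sizes are approximate estimates for a model of phi-4's scale, so treat the whole table as illustrative and check the actual file sizes on the repository.

```python
# Illustrative heuristic only: map a memory budget (GB) to a suggested quant.
# Sizes for Q6_K_L / Q5_K_M / Q4_K_M are approximations; only the IQ2_XS
# figure is taken from the size range quoted above.
QUANT_SIZES_GB = [
    ("Q6_K_L", 12.3),   # approximate
    ("Q5_K_M", 10.6),   # approximate
    ("Q4_K_M", 9.1),    # approximate
    ("IQ2_XS", 4.49),   # from the size range above
]

def suggest_quant(budget_gb: float, headroom_gb: float = 1.2) -> str | None:
    """Return the highest-quality quant whose file fits the budget,
    leaving headroom_gb for KV cache and runtime overhead."""
    for name, size in QUANT_SIZES_GB:  # ordered best quality first
        if size + headroom_gb <= budget_gb:
            return name
    return None  # nothing fits; consider CPU offload or a smaller model

print(suggest_quant(12.0))  # -> "Q5_K_M" under these assumed sizes
```

The headroom parameter matters in practice: the GGUF file size is only the weights, and actual memory use grows with context length due to the KV cache.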

🍰 Interested in building your own agents?
PromptLayer provides Hugging Face integration tools to manage and monitor prompts with your whole team. Get started here.