Phi-3-mini-4k-instruct-onnx


Developer: Microsoft
License: MIT
Paper: AWQ Paper
Context Length: 4K tokens

What is Phi-3-mini-4k-instruct-onnx?

Phi-3-mini-4k-instruct-onnx is an optimized ONNX conversion of Microsoft's Phi-3 Mini model, designed for high-performance inference across multiple hardware platforms. This version supports a 4K-token context window and has been optimized with several quantization techniques, including AWQ (Activation-aware Weight Quantization), for efficient deployment.

Implementation Details

The model ships in multiple optimized configurations: INT4 for DirectML (Windows), FP16 for CUDA (NVIDIA), INT4 for CUDA, and INT4 for CPU and mobile. It integrates with ONNX Runtime's generate() API for straightforward use (see the sketch after the list below) and has shown substantial performance gains: up to 5x faster than PyTorch for the FP16 CUDA configuration and up to 10x faster for INT4 CUDA.

  • Supports multiple hardware platforms including AMD, Intel, and NVIDIA GPUs
  • Implements AWQ quantization for optimal performance-accuracy trade-off
  • Provides specialized versions for different compute environments
  • Features DirectML support for Windows devices
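
To make the generate() API flow concrete, here is a minimal streaming-generation sketch using the onnxruntime-genai Python package, following the loop Microsoft documents for its Phi-3 ONNX releases. The package's API surface has changed across versions (newer releases replace params.input_ids and compute_logits() with generator.append_tokens()), and the model path is a placeholder, so treat this as a sketch rather than a drop-in script.

```python
# pip install onnxruntime-genai  (or onnxruntime-genai-cuda for the CUDA variants)
import onnxruntime_genai as og

# Placeholder path: point this at whichever downloaded variant matches your hardware.
model = og.Model("Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Phi-3 instruct models expect this chat format.
prompt = "<|user|>\nExplain ONNX Runtime in one sentence.<|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)
params.input_ids = tokenizer.encode(prompt)  # older API; newer versions use generator.append_tokens()

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()        # folded into generate_next_token() in newer releases
    generator.generate_next_token()
    # Decode and print each new token as it is produced.
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```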

Core Capabilities

  • High-performance text generation with 4K context window
  • Cross-platform deployment support
  • Hardware-specific optimizations
  • Significant speedup compared to standard implementations
  • Balanced accuracy-performance trade-offs through quantization
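
Because each hardware-specific variant lives in its own subfolder of the Hugging Face repository, you can fetch only the one you need. The sketch below uses huggingface_hub's snapshot_download; the subfolder pattern mirrors the repo layout but is illustrative here and should be checked against the current repository.

```python
from huggingface_hub import snapshot_download

# Download only the CPU/mobile INT4 variant; the allow_patterns entry is
# illustrative and should be verified against the repo's folder names.
snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32/*"],
    local_dir="Phi-3-mini-4k-instruct-onnx",
)
```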

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its extensive optimization for ONNX Runtime, supporting multiple hardware platforms while maintaining high performance through specialized quantization techniques. It's particularly notable for achieving up to 10x speedup compared to PyTorch implementations.

Q: What are the recommended use cases?

The model is ideal for deployment scenarios requiring efficient inference across different hardware platforms. It's particularly well-suited for applications needing fast text generation while maintaining reasonable accuracy, especially in resource-constrained environments like mobile devices or when operating with limited GPU memory.
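
Whatever the deployment target, prompts for the instruct variants should follow Phi-3's chat format. The user message below is only a placeholder:

```
<|user|>
Summarize this support ticket in one sentence: {ticket_text}<|end|>
<|assistant|>
```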

🍰 Interested in building your own agents?
PromptLayer provides Hugging Face integration tools to manage and monitor prompts with your whole team. Get started here.