Phi-3-mini-128k-instruct-onnx

Maintained by: microsoft

Developer: Microsoft
License: MIT
Paper: AWQ Paper
Format: ONNX

What is Phi-3-mini-128k-instruct-onnx?

Phi-3-mini-128k-instruct-onnx is an optimized ONNX version of Microsoft's Phi-3 Mini model, designed for efficient inference across a range of hardware platforms. This variant supports a 128K-token context length and uses quantization to reduce memory use and latency while preserving accuracy.

Implementation Details

The model comes in multiple optimized configurations: int4 DML for Windows GPUs, fp16 CUDA for NVIDIA GPUs, int4 CUDA for NVIDIA GPUs, and int4 variants for CPU and mobile deployment. It leverages Activation Aware Quantization (AWQ) for the int4 versions, preserving accuracy while significantly reducing model size.

  • Supports DirectML for cross-platform GPU acceleration
  • Offers up to 9x faster inference compared to PyTorch on CUDA
  • Implements AWQ quantization for efficient compression
  • Compatible with Windows, Linux, and Mac platforms
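The idea behind the int4 compression can be pictured with a toy group-wise quantizer in plain Python: each group of weights shares one floating-point scale and each weight is stored as a 4-bit integer in [-8, 7]. This sketch picks scales by simple max-abs; AWQ additionally chooses scales using activation statistics, which is omitted here for illustration.

```python
def quantize_int4(weights, group_size=4):
    """Quantize a flat list of weights to int4 codes with one scale per group."""
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Max-abs scaling so the largest weight in the group maps near +/-7.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4(codes, scales, group_size=4):
    """Reconstruct approximate weights from int4 codes and per-group scales."""
    return [codes[i] * scales[i // group_size] for i in range(len(codes))]

w = [0.12, -0.53, 0.31, 0.07, 1.4, -0.8, 0.2, 0.05]
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print(q)      # 4-bit codes (stored compactly in the real format)
print(w_hat)  # reconstruction, close to the original weights
```

In the real model the codes are packed two per byte and the matrices are large, but the accuracy/size trade-off works the same way: quantization error is bounded per group by the chosen scale.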

Core Capabilities

  • High-performance text generation with 128K context window
  • Optimized for both CPU and GPU inference
  • Support for multiple precision formats (int4, fp16)
  • Cross-platform compatibility via ONNX Runtime
  • Enhanced instruction following and safety measures

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimized ONNX implementation that delivers up to 5x faster performance in FP16 and 9x faster in INT4 compared to PyTorch, while maintaining high accuracy through careful quantization techniques.

Q: What are the recommended use cases?

The model is ideal for applications requiring efficient text generation across various hardware platforms, particularly where deployment efficiency and long context handling are crucial. It's especially suitable for Windows-based applications leveraging DirectML for GPU acceleration.
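In practice, inference with the ONNX variants typically goes through the onnxruntime-genai package. The sketch below is a minimal example assuming that package and a locally downloaded model folder (the exact API surface varies by package version, so treat the `og.*` calls as indicative); the `build_prompt` helper follows Phi-3's instruct chat template.

```python
def build_prompt(user_message: str) -> str:
    # Phi-3 instruct chat template: user turn, end marker, then assistant turn.
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>"

def generate(model_dir: str, user_message: str, max_length: int = 512) -> str:
    # Assumes the onnxruntime-genai package is installed and model_dir points
    # at one of the downloaded variants (e.g. the int4 DML or CUDA folder).
    # API names follow its published Phi-3 examples and may differ by version.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode(build_prompt(user_message))

    generator = og.Generator(model, params)
    tokens = []
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        tokens.append(generator.get_next_tokens()[0])
    return tokenizer.decode(tokens)
```

The same script works across the DML, CUDA, and CPU builds; only the model folder (and the installed onnxruntime-genai flavor) changes.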
