Llama-3.2-3B-Instruct-ONNX

onnx-community

Optimized ONNX version of Meta's Llama-3.2-3B-Instruct model for accelerated inference, supporting both CPU and GPU execution with int4 quantization and offering up to a 39x speedup over PyTorch on A100 GPUs.

| Property | Value |
|---|---|
| License | MIT + Llama 3.2 Community License |
| Supported Languages | 8 (en, de, fr, it, pt, hi, es, th) |
| Framework | ONNX Runtime |
| Development | ONNX Runtime, Microsoft |

What is Llama-3.2-3B-Instruct-ONNX?

Llama-3.2-3B-Instruct-ONNX is an optimized version of Meta's Llama-3.2-3B-Instruct model, converted to ONNX format for enhanced inference performance. This implementation offers significant speed improvements, running up to 39x faster than compiled PyTorch on A100 GPUs and up to 1.25x faster than llama.cpp on standard CPU configurations.

Implementation Details

The model has been optimized for both CPU and GPU deployment using int4 round-to-nearest (RTN) quantization. It supports multiple hardware configurations, including A100 GPUs, standard CPU environments, and AMD processors.

  • Supports both CPU and GPU execution with int4 quantization
  • Optimized for DirectX 12-capable GPUs with 4GB+ VRAM
  • CUDA support for NVIDIA GPUs with Compute Capability >= 7.0
  • Comprehensive multilingual support across 8 languages
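To make the quantization scheme concrete, here is a minimal sketch of per-column symmetric int4 RTN quantization using NumPy. This is an illustration of the general round-to-nearest technique, not the actual ONNX Runtime quantizer, which additionally supports block-wise scales and zero-points; the function names are hypothetical.

```python
import numpy as np

def rtn_int4_quantize(w: np.ndarray):
    """Per-column symmetric round-to-nearest (RTN) int4 quantization.

    Illustrative sketch only: real quantizers typically work block-wise
    and may use asymmetric (zero-point) schemes.
    """
    # Signed int4 covers [-8, 7]; dividing by 7 maps the largest
    # absolute weight in each column onto the top positive code.
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero columns
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def rtn_int4_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int4 codes."""
    return q.astype(np.float32) * scale

# Round-trip a random weight matrix: the per-element error of RTN is
# bounded by half the quantization step (scale / 2).
w = np.random.randn(16, 4).astype(np.float32)
q, scale = rtn_int4_quantize(w)
w_hat = rtn_int4_dequantize(q, scale)
```

At 4 bits per weight (plus a small per-column scale), this cuts weight storage to roughly a quarter of fp16, which is what makes a 3B-parameter model practical on 4GB-class GPUs.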

Core Capabilities

  • High-performance text generation and instruction following
  • Efficient inference across various hardware platforms
  • Integrated with the ONNX Runtime Generate() API for easy deployment
  • Supports both mobile and server-side deployment
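An instruct model like this expects its input formatted with the Llama 3.x chat template before generation. The sketch below builds such a prompt by hand; the helper name is hypothetical, and in practice the `onnxruntime-genai` tokenizer can apply the model's bundled chat template and run the generation loop for you.

```python
def build_llama3_prompt(messages):
    """Format chat messages with the Llama 3.x chat template.

    Hypothetical helper for illustration; the special tokens
    (<|begin_of_text|>, <|start_header_id|>, <|eot_id|>) follow the
    Llama 3 prompt format.
    """
    prompt = "<|begin_of_text|>"
    for msg in messages:
        prompt += (
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Open an assistant turn so the model generates the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

p = build_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

With the prompt built, deployment amounts to loading the ONNX model folder with the `onnxruntime-genai` package and feeding the tokenized prompt to its generator.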

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimized performance through ONNX Runtime integration, offering significant speed improvements over standard implementations while maintaining model quality through careful quantization.

Q: What are the recommended use cases?

The model is ideal for applications requiring efficient text generation and instruction following, particularly in production environments where performance optimization is crucial. It's especially suitable for deployment across various hardware configurations, from mobile devices to high-performance servers.
