OmniAudio-2.6B

NexaAIDev

OmniAudio-2.6B is a compact audio-language model that combines Gemma-2-2b and Whisper turbo for on-device text and audio processing at up to 66 tokens/sec.

Property	Value
Parameter Count	2.6 Billion
Model Type	Audio-Language Model
Architecture	Gemma-2-2b + Whisper turbo + Custom Projector
Resource Requirements	1.30GB RAM, 1.60GB Storage (Q4_K_M version)
Model URL	https://huggingface.co/NexaAIDev/OmniAudio-2.6B
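
As a rough sanity check on the storage figure in the table, the on-disk size of a quantized model can be estimated from the parameter count and the average number of bits stored per weight. The sketch below assumes roughly 4.85 bits per weight for Q4_K_M; that value is an assumption for illustration, not a number taken from this page, and the real file also carries metadata and mixed-precision tensors.

```python
def quantized_size_gb(param_count: float, bits_per_weight: float) -> float:
    """Approximate model size in GB (10^9 bytes) for a given bit width."""
    return param_count * bits_per_weight / 8 / 1e9

PARAMS = 2.6e9       # parameter count from the property table
ASSUMED_BPW = 4.85   # assumption: approximate average bits per weight for Q4_K_M

size_q4 = quantized_size_gb(PARAMS, ASSUMED_BPW)
size_fp16 = quantized_size_gb(PARAMS, 16)

print(f"Q4_K_M estimate: {size_q4:.2f} GB")   # prints 1.58 GB
print(f"FP16 estimate:   {size_fp16:.2f} GB")  # prints 5.20 GB
```

Under that assumption the estimate lands close to the 1.60GB storage figure listed for the Q4_K_M version.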

What is OmniAudio-2.6B?

OmniAudio-2.6B is an audio-language model designed for efficient on-device deployment. It handles text and audio input in a single architecture and reaches speeds of up to 66 tokens/second on consumer hardware. The model integrates Gemma-2-2b, Whisper turbo, and a custom projector module, so audio-text processing runs locally, keeps data on the device, and does not require internet connectivity.
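
The role of a projector between an audio encoder and a language model can be illustrated with a minimal sketch. The page does not describe the projector's actual design, so the single linear layer, the tiny dimensions, and the function names below are all assumptions chosen for illustration: audio frames are mapped into the language model's embedding width and then concatenated with text token embeddings into one input sequence.

```python
import random

def make_projector(audio_dim, text_dim, seed=0):
    """Create a randomly initialized linear layer (weight matrix and bias)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(audio_dim)]
              for _ in range(text_dim)]
    bias = [0.0] * text_dim
    return weight, bias

def project(frames, weight, bias):
    """Map each audio frame (length audio_dim) to a vector of length text_dim."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]

def build_input_sequence(audio_frames, text_embeddings, weight, bias):
    """Concatenate projected audio frames with text token embeddings."""
    return project(audio_frames, weight, bias) + list(text_embeddings)

# Toy sizes for illustration only; real encoder and LLM widths are far larger.
AUDIO_DIM, TEXT_DIM = 4, 6
weight, bias = make_projector(AUDIO_DIM, TEXT_DIM)
audio_frames = [[0.1 * (i + j) for j in range(AUDIO_DIM)] for i in range(3)]
text_embeddings = [[0.0] * TEXT_DIM for _ in range(2)]

sequence = build_input_sequence(audio_frames, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # prints: 5 6
```

Once audio and text share one embedding space, the language model can attend over both in a single pass, which is the general idea behind handling them in one architecture.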

Implementation Details

The model is trained in three stages: pretraining on the MLS English 10k transcription dataset, supervised fine-tuning on synthetic datasets, and Direct Preference Optimization (DPO) using the GPT-4o API as a reference. Rather than chaining a separate ASR system and an LLM sequentially, the architecture unifies both capabilities in one model, which keeps latency and resource overhead low.
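
For readers unfamiliar with the third stage, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not the authors' training code. The `ref_*` arguments follow the usual DPO meaning of log-probabilities under a frozen reference model; how GPT-4o outputs are turned into preferred and rejected responses is not detailed on this page, and `beta=0.1` is an assumed default.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The policy has moved toward the chosen response relative to the reference,
# so the loss drops below log(2), its value when policy and reference agree.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
print(improved < neutral)  # prints: True
```

The loss decreases as the policy raises the likelihood of the preferred response and lowers that of the rejected one, relative to the reference model.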

Unified audio-text processing architecture
Special <|transcribe|> token for task differentiation
Optimized for edge deployment with minimal resource requirements
5.5x to 10.3x faster than larger models
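
The task-differentiation idea in the list above can be sketched as a small prompt builder. Only the `<|transcribe|>` token comes from this page; the `<audio>` placeholder, the prompt layout, and the `build_prompt` helper are hypothetical and do not reflect the model's documented input format.

```python
TRANSCRIBE_TOKEN = "<|transcribe|>"   # special token named on this page
AUDIO_PLACEHOLDER = "<audio>"         # hypothetical marker for audio embeddings

def build_prompt(task: str, user_text: str = "") -> str:
    """Build a prompt string that tells the model which task to perform.

    With the transcribe token present, the model is expected to emit a
    transcript; without it, the audio is treated as a query to answer.
    """
    if task == "transcribe":
        return f"{AUDIO_PLACEHOLDER}{TRANSCRIBE_TOKEN}"
    if task == "chat":
        return f"{AUDIO_PLACEHOLDER}{user_text}"
    raise ValueError(f"unknown task: {task!r}")

print(build_prompt("transcribe"))                  # <audio><|transcribe|>
print(build_prompt("chat", "Summarize this call"))  # <audio>Summarize this call
```

A single marker token lets one set of weights serve both transcription and open-ended responses without a separate ASR stage.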

Core Capabilities

Offline voice query processing
Interactive voice conversations with context understanding
Creative content generation from voice inputs
Meeting recording summarization
Voice tone modification and professional communication enhancement

Frequently Asked Questions

Q: What makes this model unique?

OmniAudio-2.6B processes both audio and text inputs in a single unified model that is optimized for on-device deployment. It delivers up to 66 tokens/second on consumer hardware while requiring only 1.30GB RAM (Q4_K_M version).

Q: What are the recommended use cases?

Recommended use cases are offline voice processing, conversational AI, creative content generation, meeting summarization, and voice tone modification. The model is particularly suitable for edge devices where privacy, speed, and resource efficiency are crucial.