OmniAudio-2.6B
| Property | Value |
|---|---|
| Parameter Count | 2.6 billion |
| Model Type | Audio-Language Model |
| Architecture | Gemma-2-2b + Whisper turbo + Custom Projector |
| Resource Requirements | 1.30GB RAM, 1.60GB storage (Q4_K_M version) |
| Model URL | https://huggingface.co/NexaAIDev/OmniAudio-2.6B |
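As a rough sanity check on the figures above, the storage requirement can be converted into an average number of bits per parameter. This is only a sketch under two assumptions the source does not state: that 1GB means 10^9 bytes, and that the 1.60GB Q4_K_M file covers all 2.6 billion parameters. The helper names are illustrative.

```python
# Back-of-the-envelope arithmetic on the published figures.
# Assumption: 1 GB = 10**9 bytes, and the 1.60 GB file holds all 2.6B parameters.
PARAMS = 2.6e9          # parameter count from the table
STORAGE_BYTES = 1.60e9  # Q4_K_M storage from the table


def bits_per_parameter(storage_bytes: float, params: float) -> float:
    """Average number of bits stored per parameter."""
    return storage_bytes * 8 / params


def size_gb(params: float, bits: float) -> float:
    """Storage in GB for `params` weights at `bits` bits each."""
    return params * bits / 8 / 1e9


q4_bits = bits_per_parameter(STORAGE_BYTES, PARAMS)
fp16_gb = size_gb(PARAMS, 16)

print(f"Q4_K_M: about {q4_bits:.2f} bits per parameter")
print(f"Same weights at 16 bits each: about {fp16_gb:.1f} GB")
```

Under these assumptions the quantized file averages a little under 5 bits per parameter, compared with about 5.2GB if every weight were stored at 16 bits.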
What is OmniAudio-2.6B?
OmniAudio-2.6B is an audio-language model designed for efficient on-device deployment. It handles both text and audio input in a single architecture and reaches speeds of up to 66 tokens/second on consumer hardware. The model integrates Gemma-2-2b, Whisper turbo, and a custom projector module, enabling secure, responsive audio-text processing without an internet connection.
Implementation Details
The model is trained in three stages: pretraining on the MLS English 10k transcription dataset, supervised fine-tuning on synthetic datasets, and Direct Preference Optimization (DPO) using the GPT-4o API as a reference. Rather than chaining a separate ASR model and LLM sequentially, the architecture unifies both capabilities in one model, which keeps latency and resource overhead low.
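To make the third training stage more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is the generic textbook formulation, not the authors' training code. The source does not say how preference pairs were built or which `beta` was used, so the log-probabilities and `beta=0.1` below are made-up illustrative values. In the formula, "reference" means a frozen copy of the model being trained.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy
    and under the frozen reference model. `beta` is an assumed value.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen answer: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline, improved)
```

The loss falls as the policy assigns relatively more probability to the preferred answer than the reference does, which is the behavior this stage optimizes for.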
- Unified audio-text processing architecture
- Special <|transcribe|> token for task differentiation
- Optimized for edge deployment with minimal resource requirements
- 5.5x to 10.3x faster than larger models
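The unified design described above can be pictured as follows: audio-encoder frames pass through a projector into the language model's embedding space and are then concatenated with text embeddings into one input sequence. The sketch below is a toy illustration in plain Python, not the model's actual code. The dimensions, weights, function names, and the placement of the `<|transcribe|>` token in the prompt are all assumptions made for illustration.

```python
TRANSCRIBE_TOKEN = "<|transcribe|>"  # task token named in the list above


def project(frame, weights):
    """Toy linear projector: map one audio frame into the LLM embedding space."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def build_sequence(audio_frames, text_embeddings, weights):
    """Concatenate projected audio frames and text embeddings into one sequence."""
    audio_part = [project(frame, weights) for frame in audio_frames]
    return audio_part + text_embeddings


def build_prompt(text: str = "", transcribe: bool = False) -> str:
    """Hypothetical prompt builder: prefix the task token for transcription."""
    return TRANSCRIBE_TOKEN + text if transcribe else text


# Toy sizes, chosen only to keep the example readable.
AUDIO_DIM, LLM_DIM = 4, 3
weights = [[0.1] * AUDIO_DIM for _ in range(LLM_DIM)]

audio_frames = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
text_embeddings = [[0.5, 0.5, 0.5]]

sequence = build_sequence(audio_frames, text_embeddings, weights)
print(len(sequence), "positions, each of width", len(sequence[0]))
print(build_prompt(transcribe=True))
```

Because the language model sees a single sequence, no separate speech-to-text pass is needed before generation, which is the source of the latency advantage over a chained ASR-then-LLM setup.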
Core Capabilities
- Offline voice query processing
- Interactive voice conversations with context understanding
- Creative content generation from voice inputs
- Meeting recording summarization
- Voice tone modification and professional communication enhancement
Frequently Asked Questions
Q: What makes this model unique?
OmniAudio-2.6B stands out for a unified architecture that processes both audio and text inputs in a single model optimized for on-device deployment. It delivers up to 66 tokens/second on consumer hardware while requiring only 1.30GB of RAM (Q4_K_M version).
Q: What are the recommended use cases?
The model is well suited to offline voice processing, conversational AI, creative content generation, meeting summarization, and voice tone modification. It is a particularly good fit for edge devices where privacy, speed, and resource efficiency are crucial.