# Shuka v1

| Property | Value |
|---|---|
| Author | sarvamai |
| Model Type | Audio Language Model |
| Architecture | Encoder-Decoder with Projector |
| Model URL | HuggingFace |
## What is shuka_v1?
Shuka v1 is an audio language model built for understanding spoken Indic languages. It combines the Saaras v1 audio encoder with Meta's Llama3-8B-Instruct decoder, connected through a lightweight projector of roughly 60M parameters. The model is notably data-efficient, trained on less than 100 hours of audio.
## Implementation Details
Training fine-tunes only the projector weights, keeping both the encoder and the decoder frozen. Although trained primarily on English and Hindi data, the model shows strong zero-shot performance across other Indic languages.
- Efficient training methodology using only projector fine-tuning
- Integration with popular libraries like transformers and librosa
- Support for bfloat16 precision
- 16kHz audio sampling rate requirement
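The 16 kHz requirement above means audio recorded at other sample rates must be resampled before inference. In practice `librosa.load(path, sr=16000)` handles this with a proper anti-aliasing filter; the sketch below just illustrates the requirement with plain numpy linear interpolation (function name and approach are illustrative, not part of the model's API).

```python
import numpy as np

TARGET_SR = 16000  # Shuka v1 expects 16 kHz mono audio


def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Linearly resample a mono waveform to 16 kHz.

    A minimal illustration only: librosa.load(path, sr=16000) does this
    with higher-quality filtering and is what you would use in practice.
    """
    if orig_sr == TARGET_SR:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * TARGET_SR))
    old_times = np.arange(len(audio)) / orig_sr
    new_times = np.arange(n_out) / TARGET_SR
    return np.interp(new_times, old_times, audio)
```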
## Core Capabilities
- Native audio understanding in 11+ Indic languages
- Zero-shot performance in Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu
- Natural and informative responses to audio queries
- Efficient processing with minimal parameter tuning
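A minimal usage sketch of an audio query, assuming the remote-code `transformers` pipeline interface; the model id `sarvamai/shuka_v1`, the `<|audio|>` placeholder, and the `turns` payload format are assumptions based on common audio-LM conventions, not confirmed by this card.

```python
def ask_shuka(wav_path: str, system_prompt: str = "Respond naturally and informatively.") -> str:
    """Send an audio question to Shuka v1 and return the generated answer.

    Hedged sketch: model id, payload keys, and audio placeholder token are
    assumptions; check the model card for the exact interface. Running this
    downloads the full model weights.
    """
    import librosa
    import torch
    from transformers import pipeline

    pipe = pipeline(
        model="sarvamai/shuka_v1",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # the card notes bfloat16 support
    )
    # Model requires 16 kHz input; librosa resamples on load.
    audio, sr = librosa.load(wav_path, sr=16000)
    turns = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "<|audio|>"},  # assumed audio placeholder
    ]
    return pipe(
        {"audio": audio, "turns": turns, "sampling_rate": sr},
        max_new_tokens=256,
    )
```

The heavy model load lives inside the function so importing this module stays cheap; a real application would construct the pipeline once and reuse it across calls.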
## Frequently Asked Questions
Q: What makes this model unique?
A: Shuka v1 can understand multiple Indic languages without explicit training on them, a result of its architecture pairing the Saaras v1 encoder with the Llama3-8B-Instruct decoder through a small trainable projector.
Q: What are the recommended use cases?
A: The model is well suited to audio question answering in Indic languages, multilingual audio understanding, and applications that take spoken input in South Asian languages.