# Shuka v1

| Property | Value |
|---|---|
| Author | sarvamai |
| Model Type | Audio Language Model |
| Architecture | Encoder-Decoder with Projector |
| Model URL | HuggingFace |
## What is shuka_v1?
Shuka v1 is an audio language model built for understanding spoken Indic languages. It combines the Saaras v1 audio encoder with Meta's Llama3-8B-Instruct decoder, connected through a lightweight projector of roughly 60M parameters. The model is notably data-efficient, trained on less than 100 hours of audio.
## Implementation Details
Training fine-tunes only the projector weights, keeping both the encoder and the decoder frozen. Although trained primarily on English and Hindi data, the model shows strong zero-shot performance across other Indic languages.
- Efficient training methodology using only projector fine-tuning
- Integration with popular libraries like transformers and librosa
- Support for bfloat16 precision
- 16kHz audio sampling rate requirement
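The 16 kHz requirement above means audio recorded at other sample rates must be resampled before inference. In practice `librosa.load(path, sr=16000)` handles this with a proper anti-aliasing filter; the sketch below just illustrates the requirement with plain numpy linear interpolation (function name and approach are illustrative, not part of the model's API).

```python
import numpy as np

TARGET_SR = 16000  # Shuka v1 expects 16 kHz mono audio


def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Linearly resample a mono waveform to 16 kHz.

    A minimal illustration only: librosa.load(path, sr=16000) does this
    with higher-quality filtering and is what you would use in practice.
    """
    if orig_sr == TARGET_SR:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * TARGET_SR))
    old_times = np.arange(len(audio)) / orig_sr
    new_times = np.arange(n_out) / TARGET_SR
    return np.interp(new_times, old_times, audio)
```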
## Core Capabilities
- Native audio understanding in 11+ Indic languages
- Zero-shot performance in Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu
- Natural and informative responses to audio queries
- Efficient processing with minimal parameter tuning
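A minimal usage sketch of an audio query, assuming the remote-code `transformers` pipeline interface; the model id `sarvamai/shuka_v1`, the `<|audio|>` placeholder, and the `turns` payload format are assumptions based on common audio-LM conventions, not confirmed by this card.

```python
def ask_shuka(wav_path: str, system_prompt: str = "Respond naturally and informatively.") -> str:
    """Send an audio question to Shuka v1 and return the generated answer.

    Hedged sketch: model id, payload keys, and audio placeholder token are
    assumptions; check the model card for the exact interface. Running this
    downloads the full model weights.
    """
    import librosa
    import torch
    from transformers import pipeline

    pipe = pipeline(
        model="sarvamai/shuka_v1",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # the card notes bfloat16 support
    )
    # Model requires 16 kHz input; librosa resamples on load.
    audio, sr = librosa.load(wav_path, sr=16000)
    turns = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "<|audio|>"},  # assumed audio placeholder
    ]
    return pipe(
        {"audio": audio, "turns": turns, "sampling_rate": sr},
        max_new_tokens=256,
    )
```

The heavy model load lives inside the function so importing this module stays cheap; a real application would construct the pipeline once and reuse it across calls.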
## Frequently Asked Questions
Q: What makes this model unique?
A: Shuka v1 can understand multiple Indic languages without explicit training on them, a result of its architecture pairing the Saaras v1 encoder with the Llama3-8B-Instruct decoder through a small trainable projector.
Q: What are the recommended use cases?
A: The model is well suited to audio question answering in Indic languages, multilingual audio understanding, and applications that take spoken input in South Asian languages.