SALMONN

Maintained By
tsinghua-ee

SALMONN: Speech Audio Language Music Open Neural Network

PropertyValue
LicenseApache 2.0
FrameworkPyTorch
PaperResearch Paper
Versions7B and 13B parameters

What is SALMONN?

SALMONN represents a significant advancement in multimodal AI, developed through collaboration between Tsinghua University and ByteDance. It's designed to process and understand various types of audio inputs, including speech, music, and environmental sounds, effectively giving language models "ears" for comprehensive auditory understanding.

Implementation Details

The model employs a sophisticated architecture combining multiple components: a Whisper speech encoder, a BEATs audio encoder, and a window-level Q-Former for fusion. The system uses a LoRA adaptor to align the augmented LLM input space with its output space, enabling seamless processing of varied audio inputs.

  • Window-level Q-Former for audio token fusion
  • Integration with Whisper and BEATs encoders
  • LoRA adaptation for input-output alignment
  • Support for multiple languages and audio types

Core Capabilities

  • Multilingual speech recognition and translation
  • Audio-speech co-reasoning
  • Music captioning and understanding
  • Audio-based storytelling
  • Environmental sound recognition
  • Cross-modal emergent abilities

Frequently Asked Questions

Q: What makes this model unique?

SALMONN stands out for its ability to process multiple types of audio inputs simultaneously, going beyond traditional speech-only or audio-event-only models. It demonstrates emergent capabilities in cross-modal understanding and can follow both textual and spoken commands.

Q: What are the recommended use cases?

The model is ideal for applications requiring comprehensive audio understanding, including automatic speech recognition, audio captioning, music analysis, and audio-based storytelling. It's particularly useful in scenarios requiring multilingual capabilities and complex audio-speech reasoning.

🍰 Interesting in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.