# TANGO: Text to Audio using iNstruction-Guided diffusiOn
| Property | Value |
|---|---|
| License | CC-BY-NC-SA-4.0 |
| Language | English |
| Training Data | AudioCaps |
| Primary Task | Text-to-Audio Generation |
## What is TANGO?

TANGO is a latent diffusion model for generating high-quality audio from text descriptions. It uses the frozen, instruction-tuned Flan-T5 language model as its text encoder and a UNet-based diffusion model to generate audio in a latent space. The model produces a wide variety of sounds, including human vocalizations, animal noises, natural phenomena, and artificial sound effects.
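As a quick illustration, the snippet below loads a checkpoint and generates a clip from a text prompt. It is a minimal sketch assuming the `tango` package from the declare-lab/tango repository and the `declare-lab/tango` checkpoint name; adapt both to the checkpoint you actually use.

```python
import soundfile as sf
from tango import Tango  # assumed: tango package from the declare-lab/tango repo

# Downloads the checkpoint on first use and caches it for later runs.
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)  # array of audio samples at 16 kHz
sf.write("cheering.wav", audio, samplerate=16000)
```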
## Implementation Details

The architecture pairs the Flan-T5 text encoder with a UNet-based latent diffusion model. The model was initially pre-trained on TangoPromptBank and then fine-tuned on AudioCaps, achieving state-of-the-art results in text-to-audio generation.
- Uses Flan-T5 as a frozen text encoder
- Implements UNet-based diffusion architecture
- Supports a variable number of sampling steps (100 by default; 200 recommended for higher quality; see the example after this list)
- Enables batch processing of multiple prompts
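For the quality/latency trade-off noted above, a hedged sketch (again assuming the `tango` package) raises the `steps` argument from its default of 100 to 200, trading longer run-time for better audio:

```python
from tango import Tango  # assumed: tango package from the declare-lab/tango repo

tango = Tango("declare-lab/tango")

# More diffusion steps: higher-quality audio at the cost of longer run-time.
audio = tango.generate("Rolling thunder with lightning strikes", steps=200)
```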
## Core Capabilities
- Generation of realistic audio from text descriptions
- Support for diverse sound categories including natural, artificial, and animal sounds
- Batch processing of multiple prompts (see the sketch after this list)
- Adjustable quality settings through sampling steps
- Outputs audio at a 16 kHz sampling rate
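For the batch-processing capability above, a sketch assuming the `tango` package's `generate_for_batch` method might look like this; the `samples` argument is assumed to control how many candidate clips are generated per prompt:

```python
from tango import Tango  # assumed: tango package from the declare-lab/tango repo

tango = Tango("declare-lab/tango")

prompts = [
    "A car engine revving",
    "A dog barking with some clicking in the background",
    "Water flowing and trickling",
]
# Assumed: returns the generated audio arrays for all prompts,
# with two candidate samples per prompt.
audios = tango.generate_for_batch(prompts, samples=2)
```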
## Frequently Asked Questions
**Q: What makes this model unique?**

TANGO achieves state-of-the-art performance in text-to-audio generation by combining a frozen, instruction-tuned LLM with a latent diffusion model. It has demonstrated superior results on both objective and subjective metrics compared to prior models.
**Q: What are the recommended use cases?**

The model is well suited to generating sound effects, ambient sounds, and natural noises from text descriptions. However, because it was trained on the relatively small AudioCaps dataset, it may struggle with concepts under-represented in the training data, such as singing, and with fine-grained control over specific sound attributes.