# TANGO: Text to Audio using iNstruction-Guided diffusiOn
| Property | Value |
|---|---|
| License | CC-BY-NC-SA-4.0 |
| Language | English |
| Primary Dataset | AudioCaps |
| Pipeline | Text-to-Audio |
## What is TANGO?
TANGO is a latent diffusion model that generates realistic audio from text descriptions. It uses the frozen, instruction-tuned Flan-T5 language model as its text encoder and a UNet-based diffusion model for audio generation. The model produces a wide range of sounds, including human vocalizations, animal noises, natural phenomena, and artificial sound effects.
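The basic workflow is a single text prompt in, a 16 kHz waveform out. Below is a minimal usage sketch, assuming the `Tango` wrapper class from the companion declare-lab/tango GitHub repository and the `declare-lab/tango` checkpoint on the Hugging Face Hub:

```python
import soundfile as sf
from tango import Tango  # wrapper class shipped with the declare-lab/tango repository

# Downloads the checkpoint from the Hugging Face Hub and builds the pipeline.
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)  # waveform sampled at 16 kHz
sf.write("cheering.wav", audio, samplerate=16000)
```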
## Implementation Details
TANGO is trained in two stages: pre-training on TangoPromptBank followed by fine-tuning on AudioCaps, with which it achieves state-of-the-art performance in text-to-audio generation. The implementation supports a configurable number of sampling steps (default 100; 200 recommended for higher quality) and batch processing of multiple prompts, as illustrated in the sketches below.
- Utilizes Flan-T5 as a frozen text encoder
- Implements UNet-based latent diffusion
- Supports batch processing for multiple prompts
- Generates 16 kHz audio output
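The quality/speed trade-off mentioned above is controlled by the number of denoising steps. A short sketch, reusing the `tango` pipeline from the first example and assuming `generate` accepts a `steps` argument as in the repository's examples:

```python
# Default: 100 denoising steps.
audio_fast = tango.generate("Rolling thunder in the distance")

# Recommended for higher quality: 200 steps, at roughly twice the cost.
audio_hq = tango.generate("Rolling thunder in the distance", steps=200)
```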
## Core Capabilities
- Generation of realistic human and animal sounds
- Natural environmental sound synthesis
- Artificial sound effect creation
- Batch processing of multiple text prompts (see the sketch after this list)
- Adjustable generation quality via sampling steps
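For batched generation, a hedged sketch assuming the `generate_for_batch` helper from the declare-lab/tango repository, and that it returns one waveform per prompt when `samples=1`:

```python
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")

prompts = [
    "A car engine revving",
    "A dog barking while birds chirp",
    "Water trickling down a stream",
]

# Generate one 16 kHz sample per prompt in a single batch.
audios = tango.generate_for_batch(prompts, samples=1)
for i, audio in enumerate(audios):
    sf.write(f"prompt_{i}.wav", audio, samplerate=16000)
```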
## Frequently Asked Questions
**Q: What makes this model unique?**
TANGO achieves state-of-the-art text-to-audio generation by pairing a frozen, instruction-tuned LLM (Flan-T5) as the text encoder with a latent diffusion model. It produces higher-quality audio than existing alternatives on both objective and subjective metrics.
**Q: What are the recommended use cases?**
The model is well suited to generating audio effects from textual descriptions, particularly in content creation, sound design, and prototyping. Note, however, that because of its training data it may struggle with complex concepts not represented in the AudioCaps dataset.