# TANGO: Text to Audio using iNstruction-Guided diffusiOn
| Property | Value |
|---|---|
| License | CC-BY-NC-SA-4.0 |
| Language | English |
| Primary Dataset | AudioCaps |
| Pipeline | Text-to-Audio |
## What is TANGO?
TANGO is a latent diffusion model that generates realistic audio from text descriptions. It uses the frozen, instruction-tuned Flan-T5 language model as its text encoder and a UNet-based diffusion model for audio generation. The model produces a wide range of sounds, including human vocalizations, animal noises, natural phenomena, and artificial sound effects.
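The basic workflow is a single text prompt in, a 16 kHz waveform out. Below is a minimal usage sketch, assuming the `Tango` wrapper class from the companion declare-lab/tango GitHub repository and the `declare-lab/tango` checkpoint on the Hugging Face Hub:

```python
import soundfile as sf
from tango import Tango  # wrapper class shipped with the declare-lab/tango repository

# Downloads the checkpoint from the Hugging Face Hub and builds the pipeline.
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)  # waveform sampled at 16 kHz
sf.write("cheering.wav", audio, samplerate=16000)
```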
## Implementation Details
TANGO is trained in two stages: pre-training on TangoPromptBank followed by fine-tuning on AudioCaps, with which it achieves state-of-the-art performance in text-to-audio generation. The implementation supports a configurable number of sampling steps (default 100; 200 recommended for higher quality) and batch processing of multiple prompts, as illustrated in the sketches below.
- Utilizes Flan-T5 as a frozen text encoder
- Implements UNet-based latent diffusion
- Supports batch processing for multiple prompts
- Generates 16 kHz audio output
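The quality/speed trade-off mentioned above is controlled by the number of denoising steps. A short sketch, reusing the `tango` pipeline from the first example and assuming `generate` accepts a `steps` argument as in the repository's examples:

```python
# Default: 100 denoising steps.
audio_fast = tango.generate("Rolling thunder in the distance")

# Recommended for higher quality: 200 steps, at roughly twice the cost.
audio_hq = tango.generate("Rolling thunder in the distance", steps=200)
```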
## Core Capabilities
- Generation of realistic human and animal sounds
- Natural environmental sound synthesis
- Artificial sound effect creation
- Batch processing of multiple text prompts (see the sketch after this list)
- Adjustable generation quality via sampling steps
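For batched generation, a hedged sketch assuming the `generate_for_batch` helper from the declare-lab/tango repository, and that it returns one waveform per prompt when `samples=1`:

```python
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")

prompts = [
    "A car engine revving",
    "A dog barking while birds chirp",
    "Water trickling down a stream",
]

# Generate one 16 kHz sample per prompt in a single batch.
audios = tango.generate_for_batch(prompts, samples=1)
for i, audio in enumerate(audios):
    sf.write(f"prompt_{i}.wav", audio, samplerate=16000)
```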
## Frequently Asked Questions
**Q: What makes this model unique?**
TANGO achieves state-of-the-art text-to-audio generation by pairing a frozen, instruction-tuned LLM (Flan-T5) as the text encoder with a latent diffusion model. It produces higher-quality audio than existing alternatives on both objective and subjective metrics.
**Q: What are the recommended use cases?**
The model is well suited to generating audio effects from textual descriptions, particularly in content creation, sound design, and prototyping. Note, however, that because of its training data it may struggle with complex concepts not represented in the AudioCaps dataset.