# TANGO: Text to Audio using iNstruction-Guided diffusiOn
| Property | Value |
|---|---|
| License | CC-BY-NC-SA-4.0 |
| Language | English |
| Training Data | AudioCaps |
| Primary Task | Text-to-Audio Generation |
## What is TANGO?

TANGO is a latent diffusion model for generating high-quality audio from text descriptions. It uses the frozen, instruction-tuned Flan-T5 language model as its text encoder and a UNet-based diffusion model to generate audio in a latent space. The model produces a wide variety of sounds, including human vocalizations, animal noises, natural phenomena, and artificial sound effects.
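As a quick illustration, the snippet below loads a checkpoint and generates a clip from a text prompt. It is a minimal sketch assuming the `tango` package from the declare-lab/tango repository and the `declare-lab/tango` checkpoint name; adapt both to the checkpoint you actually use.

```python
import soundfile as sf
from tango import Tango  # assumed: tango package from the declare-lab/tango repo

# Downloads the checkpoint on first use and caches it for later runs.
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)  # array of audio samples at 16 kHz
sf.write("cheering.wav", audio, samplerate=16000)
```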
## Implementation Details

The architecture pairs the Flan-T5 text encoder with a UNet-based latent diffusion model. The model was initially pre-trained on TangoPromptBank and then fine-tuned on AudioCaps, achieving state-of-the-art results in text-to-audio generation.
- Uses Flan-T5 as a frozen text encoder
- Implements UNet-based diffusion architecture
- Supports a variable number of sampling steps (100 by default; 200 recommended for higher quality; see the example after this list)
- Enables batch processing of multiple prompts
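For the quality/latency trade-off noted above, a hedged sketch (again assuming the `tango` package) raises the `steps` argument from its default of 100 to 200, trading longer run-time for better audio:

```python
from tango import Tango  # assumed: tango package from the declare-lab/tango repo

tango = Tango("declare-lab/tango")

# More diffusion steps: higher-quality audio at the cost of longer run-time.
audio = tango.generate("Rolling thunder with lightning strikes", steps=200)
```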
## Core Capabilities
- Generation of realistic audio from text descriptions
- Support for diverse sound categories including natural, artificial, and animal sounds
- Batch processing of multiple prompts (see the sketch after this list)
- Adjustable quality settings through sampling steps
- Outputs audio at a 16 kHz sampling rate
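For the batch-processing capability above, a sketch assuming the `tango` package's `generate_for_batch` method might look like this; the `samples` argument is assumed to control how many candidate clips are generated per prompt:

```python
from tango import Tango  # assumed: tango package from the declare-lab/tango repo

tango = Tango("declare-lab/tango")

prompts = [
    "A car engine revving",
    "A dog barking with some clicking in the background",
    "Water flowing and trickling",
]
# Assumed: returns the generated audio arrays for all prompts,
# with two candidate samples per prompt.
audios = tango.generate_for_batch(prompts, samples=2)
```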
## Frequently Asked Questions
**Q: What makes this model unique?**

TANGO achieves state-of-the-art performance in text-to-audio generation by combining a frozen, instruction-tuned LLM with a latent diffusion model. It has demonstrated superior results on both objective and subjective metrics compared to prior models.
**Q: What are the recommended use cases?**

The model is well suited to generating sound effects, ambient sounds, and natural noises from text descriptions. However, because it was trained on the relatively small AudioCaps dataset, it may struggle with concepts under-represented in the training data, such as singing, and with fine-grained control over specific sound attributes.