
AudioGen-Medium

  • Model Size: 1.5B parameters
  • License: CC-BY-NC-4.0
  • Author: Facebook
  • Paper: AudioGen Paper

What is AudioGen-Medium?

AudioGen-Medium is an autoregressive transformer language model designed specifically for text-to-audio generation. Developed by Facebook, this 1.5B-parameter model generates general sound from text descriptions by operating on discrete representations learned from raw waveforms with an EnCodec tokenizer.

Implementation Details

The model operates at 16 kHz using an EnCodec tokenizer with 4 codebooks sampled at 50 Hz, with a delay pattern applied between codebooks. This design keeps generation fast while maintaining high-quality output, requiring only 50 auto-regressive steps per second of audio; the sketch after the list below walks through that arithmetic.

  • Utilizes MusicGen architecture principles
  • Implements 4-codebook EnCodec tokenization
  • Operates at 16kHz sampling rate
  • 50Hz sampling frequency for codebooks
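To make the generation budget concrete, here is a small sketch of the arithmetic implied by the figures above. It is a minimal illustration, not part of the model's own code; the constant and function names are hypothetical and simply restate the model card's numbers.

```python
# Back-of-the-envelope generation budget for the EnCodec setup described above.
# All names are illustrative; the constants restate the model card's figures.
SAMPLE_RATE_HZ = 16_000   # raw waveform sample rate
NUM_CODEBOOKS = 4         # parallel EnCodec codebooks
FRAME_RATE_HZ = 50        # codebook frames per second of audio

def decoding_steps(seconds: float) -> int:
    """Auto-regressive steps needed for a clip of the given length.

    With the delay pattern, the 4 codebooks advance together, so the step
    count follows the 50 Hz frame rate rather than 4 x 50 Hz.
    """
    return round(seconds * FRAME_RATE_HZ)

def total_audio_tokens(seconds: float) -> int:
    """Discrete tokens produced across all codebooks for the clip."""
    return decoding_steps(seconds) * NUM_CODEBOOKS

print(SAMPLE_RATE_HZ // FRAME_RATE_HZ)  # 320 waveform samples per codebook frame
print(decoding_steps(5.0))              # 250 steps for 5 seconds of audio
print(total_audio_tokens(5.0))          # 1000 tokens across the 4 codebooks
```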

Core Capabilities

  • Text-to-audio generation
  • General sound synthesis
  • Efficient audio generation with reduced computational requirements
  • Support for variable duration outputs (see the usage sketch after this list)
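
As a concrete illustration of the text-to-audio and variable-duration capabilities above, here is a minimal usage sketch assuming Meta's audiocraft library is installed; the prompts and the 5-second duration are placeholder choices, not recommendations.

```python
# Minimal text-to-audio sketch with audiocraft's AudioGen API (assumed installed).
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load the pretrained 1.5B checkpoint.
model = AudioGen.get_pretrained("facebook/audiogen-medium")

# Variable-duration output: request 5 seconds of audio per prompt.
model.set_generation_params(duration=5)

descriptions = [
    "dog barking in the distance",
    "sirens of an emergency vehicle passing by",
    "footsteps echoing in a corridor",
]
wavs = model.generate(descriptions)  # one waveform tensor per description

for idx, wav in enumerate(wavs):
    # Writes {idx}.wav at the model's native 16 kHz sample rate,
    # applying audio_write's loudness normalization.
    audio_write(f"{idx}", wav.cpu(), model.sample_rate, strategy="loudness")
```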

Frequently Asked Questions

Q: What makes this model unique?

AudioGen-Medium's distinctive feature is its efficient decoding scheme: the delay pattern across the 4 EnCodec codebooks keeps generation at roughly 50 auto-regressive steps per second of audio rather than one step per codebook token, so it runs faster than comparable autoregressive audio generators while maintaining similar output quality.

Q: What are the recommended use cases?

The model is ideal for generating various audio content from text descriptions, including environmental sounds, animal noises, and mechanical sounds. It's particularly useful for content creators, sound designers, and developers working on audio-based applications.
