sepformer-wham-enhancement

speechbrain

SepFormer speech enhancement model trained on the WHAM! dataset. It reports 14.35 dB SI-SNR and 3.07 PESQ, and is designed for denoising 8 kHz audio with a transformer architecture.

Property: Value
License: Apache 2.0
Framework: PyTorch (SpeechBrain)
Paper: Attention is All You Need in Speech Separation
Performance: 14.35 dB SI-SNR, 3.07 PESQ

What is sepformer-wham-enhancement?

sepformer-wham-enhancement is a speech enhancement (denoising) model based on the SepFormer (Separation Transformer) architecture. It is trained on the WHAM! dataset and removes environmental background noise from speech signals sampled at 8 kHz. It applies a transformer architecture, originally proposed for speech separation, to single-channel enhancement.

Implementation Details

Built with the SpeechBrain toolkit, the model uses a transformer-based architecture for speech enhancement. It processes audio at an 8 kHz sampling rate and runs on PyTorch. It was trained and evaluated on WHAM!, which adds recorded environmental noise to the WSJ0-Mix dataset.

  • Transformer-based SepFormer architecture, originally designed for speech separation and applied here to enhancement
  • Trained on the WHAM! dataset at an 8 kHz sampling rate
  • Implemented using SpeechBrain framework
  • Easy-to-use inference API for audio file processing
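The inference path described above can be sketched as follows. This is a minimal sketch, not the authors' reference script: it assumes SpeechBrain 1.0 or later (earlier releases expose the same class under `speechbrain.pretrained`), and the helper name `enhance_file`, the `savedir` path, and the file names are illustrative choices rather than part of the model card.

```python
MODEL_SOURCE = "speechbrain/sepformer-wham-enhancement"
SAMPLE_RATE = 8000  # the model operates on 8 kHz audio


def enhance_file(noisy_path, output_path, device="cpu"):
    """Denoise one audio file and write the result as an 8 kHz WAV.

    `device` may be "cpu" or "cuda". The heavy imports live inside the
    function so that this module can be imported without them installed.
    """
    import torchaudio
    # Assumed import path for SpeechBrain >= 1.0.
    from speechbrain.inference.separation import SepformerSeparation

    # Downloads the pretrained weights on first use and caches them
    # in `savedir` (directory name chosen for this sketch).
    model = SepformerSeparation.from_hparams(
        source=MODEL_SOURCE,
        savedir="pretrained_models/sepformer-wham-enhancement",
        run_opts={"device": device},
    )

    # Output shape is [batch, time, n_sources]; an enhancement model
    # produces a single source, so index 0 is the denoised speech.
    est_sources = model.separate_file(path=noisy_path)
    enhanced = est_sources[:, :, 0].detach().cpu()

    torchaudio.save(output_path, enhanced, SAMPLE_RATE)
    return output_path
```

A call such as `enhance_file("noisy.wav", "enhanced.wav", device="cuda")` would run the model on a GPU; the input is expected to be 8 kHz speech.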

Core Capabilities

  • Speech enhancement reaching 14.35 dB SI-SNR and 3.07 PESQ on WHAM!
  • Removal of environmental background noise from speech
  • Support for both CPU and GPU inference
  • Simple integration with existing audio processing pipelines
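To make the SI-SNR figure concrete, here is a small sketch of the commonly used scale-invariant signal-to-noise ratio. It follows the standard definition (project the zero-mean estimate onto the zero-mean reference, then compare target energy to residual energy); it is not necessarily the exact evaluation code behind the reported 14.35 dB, and the function name `si_snr` and the `eps` guard are choices made for this example.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]

    # Whatever is left after the projection counts as noise.
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))
```

For example, with `reference = [1.0, -1.0, 1.0, -1.0]` and `estimate = [1.1, -0.9, 0.9, -1.1]`, the residual carries one hundredth of the target energy, so the score is approximately 20 dB. Multiplying the estimate by any nonzero gain leaves the score essentially unchanged, which is what "scale-invariant" refers to.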

Frequently Asked Questions

Q: What makes this model unique?

The model applies the SepFormer architecture, originally proposed for speech separation, to single-channel speech enhancement on WHAM! at 8 kHz. It reports 14.35 dB SI-SNR and 3.07 PESQ on that dataset, which makes it a reasonable choice for denoising narrowband speech.

Q: What are the recommended use cases?

The model is suited to cleaning up noisy speech recordings, particularly those affected by environmental background noise. Because it operates on 8 kHz audio, it fits applications such as telephony systems, voice messaging, and legacy audio restoration.
