wav2vec2-large-960h

facebook

Wav2vec2-large-960h is Facebook's speech recognition model, pretrained and fine-tuned on 960 hours of Librispeech data. The wav2vec 2.0 paper reports a best word error rate (WER) of 1.8/3.3 on the Librispeech clean/other test sets when all labeled data is used.

Property     Value
License      Apache 2.0
Author       Facebook
Paper        View Research Paper
Downloads    83,319

What is wav2vec2-large-960h?

Wav2vec2-large-960h is a speech recognition model developed by Facebook AI. It first learns representations from speech audio alone and is then fine-tuned on transcribed speech. This checkpoint was pretrained and fine-tuned on 960 hours of Librispeech data and expects speech audio sampled at 16kHz.

Implementation Details

During pretraining, the model masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are learned jointly. The model is implemented in PyTorch and can be loaded through the Transformers library.
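
The contrastive task can be illustrated with a small, dependency-free sketch. It assumes a cosine-similarity score and a softmax over one true quantized target plus several distractors; the function names, the plain-list vectors, and the default temperature are illustrative choices, not the authors' implementation. Masking, quantization, and any auxiliary training terms are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among all candidates.

    context:     context vector at a masked time step (hypothetical input)
    target:      the true quantized latent for that step
    distractors: quantized latents taken from other time steps
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_denominator - logits[0]
```

The loss is small when the context vector is closer to the true target than to any distractor, and grows when a distractor scores higher.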

  • The paper reports 1.8/3.3 WER on the clean/other test sets when using all labeled Librispeech data
  • The paper also reports 4.8/8.2 WER with just 10 minutes of labeled data, after pre-training on 53k hours of unlabeled audio
  • Requires 16kHz audio input sampling rate
  • Supports batch processing and GPU acceleration
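
The WER figures above are word-level edit distances divided by the number of reference words, usually quoted as percentages. A minimal sketch of that calculation follows; the function name `word_error_rate` is hypothetical, and it splits on whitespace without any text normalization, which real evaluations typically add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Multiply the result by 100 to compare it with figures such as 1.8 or 3.3.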

Core Capabilities

  • Automatic Speech Recognition (ASR) with state-of-the-art accuracy
  • Efficient performance with minimal labeled data requirements
  • Evaluated on both the clean and the more challenging "other" Librispeech test sets
  • Real-time transcription capabilities
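
A minimal sketch of batch transcription through the Transformers library is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the ID `facebook/wav2vec2-large-960h` (an assumption not stated on this page), that `torch` and `transformers` are installed, and that every waveform is already sampled at 16kHz. The `transcribe` helper is hypothetical.

```python
MODEL_ID = "facebook/wav2vec2-large-960h"  # assumed Hub ID for this checkpoint
SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def transcribe(waveforms, device=None):
    """Transcribe a batch of 1-D float waveforms sampled at 16 kHz."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device).eval()

    # padding=True pads shorter clips so the batch forms one tensor.
    inputs = processor(
        waveforms, sampling_rate=SAMPLE_RATE, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits

    # Greedy CTC decoding: take the most likely token at each frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)
```

In practice the processor and model would be loaded once and reused across calls; they are loaded inside the function here only to keep the sketch self-contained.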

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to learn from raw audio and achieve state-of-the-art results with minimal labeled data sets it apart. It can match or exceed the performance of semi-supervised methods while being conceptually simpler.

Q: What are the recommended use cases?

The model is ideal for speech recognition tasks, particularly when working with English language audio. It's especially valuable in scenarios with limited labeled data availability and can be used for both clean and noisy audio environments.
