granite-speech-3.2-8b

ibm-granite

IBM's 8B-parameter speech-language model for automatic speech recognition (ASR) and automatic speech translation (AST). It transcribes English speech and translates it into seven target languages. The model pairs a conformer-based speech encoder with a Granite language model.

Property	Value
Developer	IBM
Release Date	April 2nd, 2025
License	Apache 2.0
Primary Tasks	ASR and AST
Model Size	8B parameters
Training Infrastructure	32 NVIDIA H100 GPUs

What is granite-speech-3.2-8b?

granite-speech-3.2-8b is IBM's speech-language model for automatic speech recognition (ASR) and automatic speech translation (AST). It builds on granite-3.2-8b-instruct, which was adapted to speech input through modality alignment training on diverse public corpora.

Implementation Details

The model has three main components: a speech encoder with 10 conformer blocks, a speech-text modality adapter, and the base granite-3.2-8b-instruct language model. The speech encoder is trained with Connectionist Temporal Classification (CTC) and uses block attention, while the modality adapter is a 2-layer window query transformer that downsamples the encoder output in time.
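The temporal downsampling performed by the modality adapter can be illustrated with a minimal NumPy sketch. It assumes a simplified, single-layer, single-head version of windowed query attention: the frame sequence is cut into fixed-size windows, and a small set of query vectors attends to each window to produce a shorter sequence. The function name, the random inputs, and the settings (`window_size=10`, one query per window) are illustrative assumptions, not the model's actual configuration or code.

```python
import numpy as np


def window_query_downsample(frames, queries, window_size):
    """Summarize each window of frames with a few query vectors.

    frames:  (T, d) array of encoder outputs (hypothetical input).
    queries: (Q, d) array of query vectors (learned in a real model, random here).
    Returns an array of shape (num_windows * Q, d).
    """
    num_frames, dim = frames.shape
    # Zero-pad so the sequence splits into whole windows.
    # A real implementation would mask the padded positions instead.
    pad = (-num_frames) % window_size
    if pad:
        frames = np.vstack([frames, np.zeros((pad, dim))])
    windows = frames.reshape(-1, window_size, dim)  # (W, window_size, d)

    # Scaled dot-product attention scores of every query against every
    # frame in its window: shape (W, Q, window_size).
    scores = np.einsum("qd,wsd->wqs", queries, windows) / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output vector is an attention-weighted average of its window.
    pooled = np.einsum("wqs,wsd->wqd", weights, windows)  # (W, Q, d)
    return pooled.reshape(-1, dim)


rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 16))  # 100 frames, 16 features (toy sizes)
queries = rng.standard_normal((1, 16))   # one query per window (assumption)
pooled = window_query_downsample(frames, queries, window_size=10)
print(frames.shape, "->", pooled.shape)
```

With these toy settings, 100 input frames become 10 output vectors, so the language model sees a sequence one tenth as long as the encoder output.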

Speech encoder with 1024 hidden dimensions and 8 attention heads
Temporal downsampling factor of 10x for efficient processing
LoRA adapters with rank=64 for query and value projections
Context length of up to 128k tokens
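To give a sense of why rank-64 LoRA adapters are cheap to train, the sketch below counts the parameters a LoRA update adds to one projection matrix. It uses the standard LoRA factorization, in which the weight update is written as the product of two low-rank matrices. The 4096 x 4096 projection size is a hypothetical value chosen for illustration; this document does not state the model's projection dimensions.

```python
def lora_added_params(d_in: int, d_out: int, rank: int = 64) -> int:
    """Trainable parameters LoRA adds to a single (d_out x d_in) projection.

    LoRA learns the update as B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical projection size, for illustration only.
d_in = d_out = 4096
full = d_in * d_out
added = lora_added_params(d_in, d_out, rank=64)
print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {added:,} parameters ({added / full:.1%} of the full matrix)")
```

Under that assumed size, the adapter trains only a few percent of the parameters of the projection it modifies, while the original weights stay fixed.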

Core Capabilities

English speech recognition with state-of-the-art accuracy
Speech translation to French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin
Trained on over 60,000 hours of diverse speech data
Optimized for enterprise applications
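The two task types and the target languages listed above can be captured in a small helper that builds the user instruction for a request. This is a sketch only: the function name, the instruction wording, and the `<|audio|>` placeholder marking where the audio is attached are assumptions that should be verified against IBM's model card before use.

```python
# Target languages for speech translation, as listed in this document.
TRANSLATION_TARGETS = (
    "French", "Spanish", "Italian", "German",
    "Portuguese", "Japanese", "Mandarin",
)

# Assumed placeholder for the audio input; check the model card.
AUDIO_PLACEHOLDER = "<|audio|>"


def build_instruction(task, target_language=None):
    """Return a user instruction for ASR ("asr") or AST ("ast")."""
    if task == "asr":
        return f"{AUDIO_PLACEHOLDER}Transcribe the speech into written text."
    if task == "ast":
        # Reject languages outside the documented translation targets.
        if target_language not in TRANSLATION_TARGETS:
            raise ValueError(f"Unsupported target language: {target_language!r}")
        return f"{AUDIO_PLACEHOLDER}Translate the speech into {target_language}."
    raise ValueError(f"Unknown task: {task!r}")


print(build_instruction("asr"))
print(build_instruction("ast", "German"))
```

Validating the target language up front keeps unsupported requests from reaching the model, where they would likely produce unreliable output rather than a clear error.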

Frequently Asked Questions

Q: What makes this model unique?

The model combines speech and language processing in a single, relatively compact 8B-parameter model. It keeps processing efficient by downsampling the speech encoder's output 10x in time before it reaches the language model, and it adapts the base language model to speech through a modality adapter and rank-64 LoRA adapters.

Q: What are the recommended use cases?

The model is designed for enterprise applications that need speech processing, particularly English speech-to-text transcription and translation of English speech into the supported target languages. It is not recommended for text-only tasks, where the standard Granite language models are more appropriate.