granite-speech-3.2-8b

ibm-granite

IBM's 8B-parameter speech-language model for automatic speech recognition (ASR) and automatic speech translation (AST). It transcribes English speech and translates it into several major languages. It pairs a conformer-based speech encoder with a Granite language model.

Property: Value
Developer: IBM
Release Date: April 2, 2025
License: Apache 2.0
Primary Tasks: ASR and AST
Model Size: 8B parameters
Training Infrastructure: 32 NVIDIA H100 GPUs

What is granite-speech-3.2-8b?

Granite-speech-3.2-8b is IBM's state-of-the-art speech language model designed specifically for automatic speech recognition (ASR) and automatic speech translation (AST). Built on the foundation of granite-3.2-8b-instruct, this model has been specially adapted for speech processing through modality alignment training on diverse public corpora.

Implementation Details

The model has three main components: a speech encoder with 10 conformer blocks, a speech-text modality adapter, and the base granite-3.2-8b-instruct language model. The speech encoder is trained with connectionist temporal classification (CTC) and uses block attention, while the modality adapter uses a 2-layer window query transformer to downsample the encoder output in time.

  • Speech encoder with 1024 hidden dimensions and 8 attention heads
  • Temporal downsampling factor of 10x for efficient processing
  • LoRA adapters with rank=64 for query and value projections
  • 128k context length capability
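
To make the figures above concrete, here is a minimal sketch of the arithmetic behind the 10x temporal downsampling and the rank-64 LoRA adapters. The 100 Hz input frame rate and the 4096-wide projection are illustrative assumptions, not values stated on this page, and both function names are hypothetical.

```python
import math


def acoustic_embeddings(duration_s: float,
                        frame_rate_hz: float = 100.0,  # assumed input frame rate
                        downsample: int = 10) -> int:
    """Estimate how many acoustic embeddings reach the language model."""
    # Round to the nearest whole frame to avoid floating-point drift.
    frames = int(round(duration_s * frame_rate_hz))
    # Every `downsample` frames collapse into one embedding;
    # a trailing partial group still produces one embedding.
    return math.ceil(frames / downsample)


def lora_params(d_in: int, d_out: int, rank: int = 64) -> int:
    """Trainable parameters LoRA adds to a single d_in x d_out projection."""
    # LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # 30 seconds of audio under the assumed 100 Hz frame rate.
    print(acoustic_embeddings(30.0))
    # One hypothetical 4096 x 4096 projection with a rank-64 adapter.
    print(lora_params(4096, 4096))
```

Under these assumptions, 30 seconds of audio yields 300 embeddings, and a rank-64 adapter on a square 4096-wide projection adds 524,288 trainable parameters against 16,777,216 in the full matrix. The real projection shapes may differ, so treat the counts as an order-of-magnitude guide.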

Core Capabilities

  • English speech recognition with state-of-the-art accuracy
  • Speech translation to French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin
  • Trained on over 60,000 hours of diverse speech data
  • Optimized for enterprise applications

Frequently Asked Questions

Q: What makes this model unique?

The model combines speech and language processing in a relatively compact 8B-parameter model. Temporal downsampling of the acoustic features and a dedicated modality adapter keep it efficient while maintaining high performance.

Q: What are the recommended use cases?

The model is specifically designed for enterprise applications requiring speech processing, particularly English speech-to-text transcription and translation to major languages. It's not recommended for text-only tasks, where the standard Granite language models would be more appropriate.
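
The recommendation above could be expressed as a small routing helper. This is only a sketch: the function name `route_request` and the request shape are hypothetical, and using granite-3.2-8b-instruct as the text-only fallback is an assumption. Only the task split and the translation targets come from the capabilities listed earlier.

```python
from typing import Optional

SPEECH_MODEL = "granite-speech-3.2-8b"
TEXT_MODEL = "granite-3.2-8b-instruct"  # assumed fallback for text-only work

# Translation targets listed under Core Capabilities.
AST_TARGETS = {"French", "Spanish", "Italian", "German",
               "Portuguese", "Japanese", "Mandarin"}


def route_request(has_audio: bool, target_language: Optional[str] = None) -> str:
    """Pick a model name for a request (hypothetical helper)."""
    if not has_audio:
        # Text-only tasks are better served by a standard Granite language model.
        return TEXT_MODEL
    # No target language (or English) means plain transcription (ASR);
    # any other target must be one of the listed translation languages (AST).
    if target_language is not None and target_language not in AST_TARGETS | {"English"}:
        raise ValueError(f"unsupported target language: {target_language}")
    return SPEECH_MODEL
```

For example, `route_request(True, "French")` selects the speech model for translation, while `route_request(False)` falls back to the text model.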
