language-perceiver

deepmind

Perceiver IO language model that processes raw UTF-8 bytes using cross-attention with a fixed set of latent vectors, achieving a GLUE score of 81.8. Combines efficient processing with flexible output generation.

Developer: DeepMind
Architecture: Perceiver IO
Training Data: English Wikipedia (30%) + C4 (70%)
Paper: Perceiver IO: A General Architecture for Structured Inputs & Outputs
GLUE Score: 81.8

What is language-perceiver?

Language Perceiver is a transformer-based model that applies the Perceiver IO architecture to language. Unlike traditional transformers, it processes inputs through cross-attention with a fixed number of latent vectors, so the cost of the deep self-attention stack is independent of input length. The model works directly on raw UTF-8 bytes rather than tokenized text, eliminating the need for a pre-trained tokenizer or a fixed vocabulary.
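Byte-level input preparation can be sketched in a few lines. This is a minimal illustration, not the model's actual preprocessing: the special-token IDs and the offset of 6 below are hypothetical (the real checkpoint's tokenizer reserves its own IDs, which may differ).

```python
SPECIAL_TOKENS = 6   # hypothetical: slots for [PAD], [BOS], [EOS], [MASK], [CLS], [SEP]
PAD_ID = 0           # hypothetical padding ID
MAX_LEN = 2048       # maximum sequence length in bytes

def encode(text: str, max_len: int = MAX_LEN) -> list[int]:
    """Map text to integer IDs: raw UTF-8 bytes shifted past the special tokens.

    No vocabulary lookup or subword merging is needed -- any Unicode
    string maps deterministically to a byte sequence.
    """
    ids = [b + SPECIAL_TOKENS for b in text.encode("utf-8")]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))  # right-pad to fixed length

ids = encode("Perceiver IO")
```

Because the mapping is just UTF-8 plus an offset, the same scheme handles any language or script without a tokenizer retraining step.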

Implementation Details

The architecture performs self-attention only on a small set of latent vectors (256 or 512); the inputs participate only in cross-attention. This keeps the cost of the deep self-attention stack constant while allowing inputs of arbitrary length. Decoder queries provide flexible output generation, including per-position predictions for masked language modeling.

  • Direct processing of UTF-8 bytes without tokenization
  • Cross-attention mechanism with latent vectors
  • Flexible decoder queries for output generation
  • Maximum sequence length of 2048 bytes
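The latent bottleneck described above can be sketched with plain scaled dot-product attention. This is an illustrative NumPy toy (single head, no projections, arbitrary sizes), not the model's implementation; it only shows where each cost term comes from.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
M, N, D = 2048, 256, 64          # input bytes, latents, channels (illustrative)
inputs = rng.normal(size=(M, D))
latents = rng.normal(size=(N, D))

# Encoder: latents cross-attend to the inputs -- cost O(M*N), linear in M.
latents = attention(latents, inputs, inputs)

# Processor: self-attention among latents only -- cost O(N^2), independent of M.
latents = attention(latents, latents, latents)
```

Doubling the input length M doubles only the cross-attention cost; the repeated self-attention blocks, which dominate depth, are untouched.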

Core Capabilities

  • Masked Language Modeling (MLM)
  • Feature extraction for downstream tasks
  • Flexible input processing across modalities
  • Efficient handling of long sequences
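The MLM objective over bytes can be illustrated as span masking. The mask ID and offset below are hypothetical placeholders (the real checkpoint defines its own special tokens); the sketch only shows the shape of the task: replace a contiguous byte span and train the decoder to reconstruct it.

```python
MASK_ID = 3          # hypothetical mask-token ID
OFFSET = 6           # hypothetical shift past special tokens

def mask_span(ids: list[int], start: int, length: int):
    """Replace a contiguous span of byte IDs with the mask token.

    Returns the masked inputs and the original IDs at the masked
    positions, which serve as reconstruction targets.
    """
    masked = list(ids)
    masked[start:start + length] = [MASK_ID] * length
    return masked, ids[start:start + length]

ids = [b + OFFSET for b in b"Perceiver processes raw bytes."]
inputs, targets = mask_span(ids, 10, 9)   # mask the word "processes"
```

Because targets are bytes rather than subwords, the model must learn spelling as well as semantics, which is what makes tokenizer-free pre-training feasible.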

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process raw bytes instead of tokens, combined with compute costs that grow only linearly with input size (the self-attention stack's cost is fixed), makes it highly efficient and flexible. It can handle multiple modalities and does not require a pre-trained tokenizer.

Q: What are the recommended use cases?

While the base model excels at masked language modeling, it's primarily designed for fine-tuning on specific downstream tasks. It's particularly useful for applications requiring efficient processing of long sequences or working with multiple modalities.
