pixel-base

Maintained By
Team-PIXEL

PIXEL (Pixel-based Encoder of Language)

| Property | Value |
| --- | --- |
| Parameters | 86M (encoder) |
| License | Apache 2.0 |
| Paper | Language Modelling with Pixels |
| Training Data | Wikipedia + BookCorpus (3.2B words) |

What is pixel-base?

PIXEL is a language model that processes text as rendered images rather than as token sequences, removing the need for a fixed vocabulary. Built on the ViT-MAE architecture, it consists of a text renderer, a Vision Transformer encoder, and a decoder for masked image reconstruction.

Implementation Details

The model processes text through three main stages: First, it renders text as images. Then, it linearly projects image patches to obtain embeddings, with 25% of patches being masked. Finally, a Vision Transformer encoder processes the unmasked patches, while a lightweight decoder with 8 transformer layers reconstructs the masked regions.
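The patch-and-mask arithmetic in the stage description above can be sketched in a few lines of plain Python. The 16×16 patch size is the standard ViT default and an assumption here, as are the toy image dimensions; the real renderer produces a long, fixed-height image strip:

```python
import random

PATCH = 16                   # assumed ViT-style patch size (16x16 pixels)
IMG_H, IMG_W = 16, 16 * 32   # toy rendered text strip: 16 px high, 32 patches wide
MASK_RATIO = 0.25            # 25% of patches are masked, per the description above

num_patches = (IMG_H // PATCH) * (IMG_W // PATCH)
num_masked = int(num_patches * MASK_RATIO)

# Randomly choose which patch indices the encoder never sees.
masked = set(random.sample(range(num_patches), num_masked))
visible = [i for i in range(num_patches) if i not in masked]

# The encoder processes only the visible patches; the lightweight
# decoder reconstructs the masked ones from the encoder's output.
print(num_patches, num_masked, len(visible))  # 32 8 24
```

The encoder therefore only ever attends over 75% of the rendered patches during pretraining, which is what makes the masked-reconstruction objective non-trivial.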

  • 86M parameter encoder architecture
  • Decoder with 512 hidden size and 8 transformer layers
  • Built on Vision Transformer (ViT) technology
  • Processes rendered text images instead of using traditional tokenization
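As a rough sanity check on the sizes listed above, the common back-of-envelope formula for transformer parameters (≈12·h² per layer: ~4·h² for the attention projections plus ~8·h² for a 4×-wide feed-forward block, ignoring biases, layer norms, and embeddings) can be applied to the 8-layer, 512-hidden decoder. The 768-hidden, 12-layer encoder figures below are assumed ViT-base dimensions, not stated in this card:

```python
def approx_transformer_params(hidden: int, layers: int) -> int:
    # ~4*h^2 for Q/K/V/output projections + ~8*h^2 for a 4x-wide FFN,
    # ignoring biases, layer norms, patch embeddings, and heads.
    return layers * 12 * hidden * hidden

decoder = approx_transformer_params(hidden=512, layers=8)   # ~25M
encoder = approx_transformer_params(hidden=768, layers=12)  # ~85M, near the 86M figure
print(decoder, encoder)
```

The encoder estimate landing near the quoted 86M is consistent with a ViT-base-sized backbone; the exact count depends on terms the approximation drops.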

Core Capabilities

  • Language-agnostic processing through rendered text
  • Pixel-level text reconstruction
  • Flexible downstream task adaptation
  • Support for any written language that can be rendered digitally

Frequently Asked Questions

Q: What makes this model unique?

PIXEL's uniqueness lies in processing text as rendered images, which eliminates the need for traditional tokenization and enables potential support for any written language that can be digitally rendered.

Q: What are the recommended use cases?

The model is primarily intended for fine-tuning on downstream NLP tasks. It can be used either as an 86M parameter encoder with task-specific classification heads or as a pixel-level generative language model when retaining the decoder.
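A task-specific classification head of the kind described here is, at bottom, a linear layer over the encoder's pooled output. The dependency-free sketch below illustrates that shape; the 768-dim embedding width is an assumed ViT-base dimension, and the random weights are stand-ins for what fine-tuning would learn:

```python
import random

HIDDEN = 768      # assumed ViT-base encoder width
NUM_LABELS = 3    # e.g. a hypothetical 3-way classification task

random.seed(0)
# Stand-in for a fine-tuned head: one weight row and one bias per label.
weights = [[random.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

def classify(pooled_embedding):
    """Map a pooled encoder output to label logits and return the argmax."""
    logits = [
        sum(w * x for w, x in zip(row, pooled_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return max(range(NUM_LABELS), key=logits.__getitem__)

# A stand-in embedding; in practice this would come from the PIXEL encoder.
fake_embedding = [random.gauss(0, 1) for _ in range(HIDDEN)]
label = classify(fake_embedding)
print(label)  # one of 0, 1, 2
```

Retaining the pretrained decoder instead of replacing it with such a head is what turns the model back into a pixel-level generative language model.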
