fuyu-8b

adept

Fuyu-8B: a 9.41B-parameter multimodal decoder-only transformer from Adept AI. It handles image-text tasks at arbitrary image resolutions and is designed for fast inference.

Property: Value
Parameter Count: 9.41B parameters
Model Type: Decoder-only multimodal transformer
License: CC-BY-NC 4.0
Tensor Type: BF16
Author: Adept AI

What is fuyu-8b?

Fuyu-8B is a multimodal model from Adept AI that takes images and text as input and generates text. Unlike most multimodal architectures, it uses a plain decoder-only transformer with no separate image encoder, which simplifies the design and makes it easier to scale.

Implementation Details

The architecture is deliberately simple: image patches are linearly projected directly into the first transformer layer, taking the place of the usual token-embedding lookup. Patches are fed in raster-scan order (left to right, top to bottom), and a special image-newline character marks the end of each row of patches, which is what lets the model handle arbitrary image resolutions.

  • Supports variable image resolutions without separate training stages for different resolutions
  • Uses a vanilla decoder-only transformer architecture
  • Processes image patches in raster-scan order, without image-specific position embeddings
  • Reported benchmark scores: 74.2 on VQAv2, 60.6 on OKVQA
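
The patch handling described above can be illustrated with a small NumPy sketch. This is not Adept's implementation: the patch size of 30 pixels, the zero padding, the bias-free projection, and the names `image_to_patch_rows`, `embed_image`, `weight`, and `newline_embedding` are all assumptions chosen for illustration. It only shows the idea of cutting an image into patches, projecting each patch linearly, and appending a newline marker after every row.

```python
import numpy as np

PATCH = 30  # assumed patch side length in pixels (illustrative only)


def image_to_patch_rows(image: np.ndarray, patch: int = PATCH) -> list[list[np.ndarray]]:
    """Split an (H, W, C) image into rows of flattened patches in raster-scan order."""
    h, w, _ = image.shape
    # Zero-pad the bottom and right edges so both sides are multiples of the patch size.
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))

    rows = []
    for top in range(0, padded.shape[0], patch):          # top to bottom
        row = []
        for left in range(0, padded.shape[1], patch):     # left to right
            row.append(padded[top:top + patch, left:left + patch].reshape(-1))
        rows.append(row)
    return rows


def embed_image(image: np.ndarray, weight: np.ndarray,
                newline_embedding: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Linearly project each patch and append a newline embedding after each row."""
    sequence = []
    for row in image_to_patch_rows(image, patch):
        for flat_patch in row:
            sequence.append(flat_patch @ weight)   # hypothetical bias-free projection
        sequence.append(newline_embedding)         # stands in for the image-newline token
    return np.stack(sequence)


rng = np.random.default_rng(0)
d_model = 16  # toy hidden size, far smaller than the real model
image = rng.random((65, 100, 3), dtype=np.float32)
weight = rng.standard_normal((PATCH * PATCH * 3, d_model)).astype(np.float32)
newline = np.zeros(d_model, dtype=np.float32)

tokens = embed_image(image, weight, newline)
print(tokens.shape)  # (15, 16): 3 rows x (4 patches + 1 newline)
```

Because the sequence length simply grows with the number of patches, nothing in this scheme fixes the input resolution; the newline marker is what tells the model where each row of patches ends.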

Core Capabilities

  • Image-to-text generation and captioning
  • Visual question-answering
  • UI-based question handling
  • Fine-grained image localization
  • Graph and diagram interpretation

Frequently Asked Questions

Q: What makes this model unique?

What sets the model apart is its simplified architecture, which removes the separate image encoder while maintaining strong benchmark performance. It accepts images of arbitrary resolution, and Adept reports response times under 100 ms for large images, which makes it well suited to latency-sensitive, real-world applications.

Q: What are the recommended use cases?

The released base model is intended primarily for research. Suitable use cases include computer-control applications, digital agents, and general multimodal research. The base model must be fine-tuned before it is used for specific tasks such as verbose captioning or multimodal chat.
