# Magma-8B
| Property | Value |
|---|---|
| Developer | Microsoft Research |
| License | MIT License |
| Architecture | LLaMA-3 backbone with CLIP-ConvNeXt-XXLarge vision encoder |
| Training Infrastructure | Azure ML (H100s and MI300s) |
## What is Magma-8B?
Magma-8B is Microsoft Research's foundation model for multimodal AI agents, designed to bridge virtual and physical environments. It integrates vision, language, and action capabilities, making it suited to tasks ranging from UI navigation to robotic manipulation and gaming environments.
## Implementation Details
Built on Meta's LLaMA-3 backbone with a CLIP-ConvNeXt-XXLarge vision encoder, Magma-8B uses the Set-of-Mark and Trace-of-Mark techniques for spatial-temporal understanding. The model was trained with bf16 mixed precision at a batch size of 1024 and can process images at up to 1024x1024 resolution. A minimal loading sketch follows the list below.
- Leverages unlabeled video data for improved spatial-temporal grounding
- Supports a maximum sequence length of 4096 tokens
- Trained on diverse datasets including LLaVA-Next, Epic-Kitchens, and Open-X-Embodiment
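The sketch below loads the model through the Hugging Face `transformers` remote-code interface. The `microsoft/Magma-8B` checkpoint name matches the public release, but the processor keyword arguments, input file, and prompt format are illustrative assumptions; consult the official model card for the exact usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Magma-8B"  # Hugging Face checkpoint name

# trust_remote_code is needed because Magma ships custom model/processor code.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # matches the bf16 training precision
    trust_remote_code=True,
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("ui_screenshot.png").convert("RGB")  # hypothetical input, up to 1024x1024
prompt = "<image>\nWhat should I click to open the settings menu?"

# ASSUMPTION: the keyword names (images=, texts=) follow the common processor
# pattern; the custom Magma processor may expect different arguments.
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
# Cast floating-point tensors (pixel values) to bf16 to match the model dtype.
inputs = {k: v.to(torch.bfloat16) if v.is_floating_point() else v
          for k, v in inputs.items()}

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```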
## Core Capabilities
- UI Navigation and Grounding (see the Set-of-Mark sketch after this list)
- Robotic Manipulation Control
- Video Understanding and Planning
- Spatial Reasoning and Image Analysis
- Goal-driven Visual Planning
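As a concrete illustration of the Set-of-Mark idea mentioned under Implementation Details, the sketch below overlays numbered marks on candidate UI elements so a model's answer can be grounded to a mark index instead of raw pixel coordinates. This is an illustrative preprocessing sketch, not Magma's actual annotation pipeline; the box coordinates and file names are hypothetical.

```python
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image,
                  boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw numbered Set-of-Mark labels on candidate UI elements.

    boxes holds (left, top, right, bottom) pixel coordinates of elements
    found by some upstream UI parser (hypothetical here).
    """
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (left, top, right, bottom) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=2)
        draw.text((left + 4, top + 4), str(idx), fill="red")
    return marked

# Mark two candidate buttons on a screenshot, then prompt the model with
# "Which mark should be clicked to open Settings?" so the answer is a
# mark index rather than raw pixel coordinates.
screenshot = Image.open("screenshot.png")  # hypothetical file
marked = overlay_marks(screenshot, [(40, 100, 160, 140), (40, 160, 160, 200)])
marked.save("screenshot_marked.png")
```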
## Frequently Asked Questions
**Q: What makes this model unique?**
Magma-8B is the first foundation model designed specifically for multimodal AI agents, capable of handling both virtual (UI) and physical (robotic) interactions. Its Set-of-Mark and Trace-of-Mark training approach is designed to strengthen spatial understanding and action planning.
**Q: What are the recommended use cases?**
The model is intended primarily for research use in English, specifically for UI navigation, robotic manipulation, image and video understanding, and spatial reasoning tasks. It is best suited to controlled research environments with appropriate safety measures in place.