Magma-8B

Maintained By
microsoft

Property                  Value
Developer                 Microsoft Research
License                   MIT License
Architecture              LLaMA-3 backbone with CLIP-ConvNeXt-XXLarge vision encoder
Training Infrastructure   Azure ML (H100s and MI300s)

What is Magma-8B?

Magma-8B is a foundation model from Microsoft Research for multimodal AI agents, designed to operate in both virtual and physical environments. It integrates vision, language, and action capabilities, which suits it to tasks ranging from UI navigation to robotic manipulation and gaming environments.

Implementation Details

Magma-8B is built on Meta's LLaMA-3 language model and uses CLIP-ConvNeXt-XXLarge for vision encoding. For spatial-temporal understanding it relies on two techniques: Set-of-Mark, which labels actionable regions in an image, and Trace-of-Mark, which tracks how those marks move across video frames. The model was trained in bf16 mixed precision with a batch size of 1024 and can process images up to 1024x1024 resolution.
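
To make the idea behind Set-of-Mark and Trace-of-Mark more concrete, here is a minimal Python sketch. It is not Magma's implementation: the function names, the choice of box centers as mark positions, and the text format are all assumptions made for illustration. It only shows the general pattern of numbering candidate regions in one frame and then collecting each mark's positions across frames.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
Point = Tuple[float, float]      # (x, y) in pixels


def set_of_mark(boxes: List[Box]) -> Dict[int, Point]:
    """Assign a numeric mark (1..N) to each candidate box.

    Placing the mark at the box center is an assumption of this sketch.
    """
    marks: Dict[int, Point] = {}
    for mark_id, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        marks[mark_id] = ((x0 + x1) / 2, (y0 + y1) / 2)
    return marks


def trace_of_mark(frames: List[Dict[int, Point]]) -> Dict[int, List[Point]]:
    """Collect, for every mark, its positions over consecutive frames."""
    traces: Dict[int, List[Point]] = {}
    for frame in frames:
        for mark_id, point in frame.items():
            traces.setdefault(mark_id, []).append(point)
    return traces


def describe_marks(marks: Dict[int, Point]) -> str:
    """Render marks as text that a prompt could refer to (hypothetical format)."""
    return "\n".join(
        f"Mark {mark_id}: ({x:.0f}, {y:.0f})"
        for mark_id, (x, y) in sorted(marks.items())
    )


if __name__ == "__main__":
    frame_marks = set_of_mark([(0, 0, 10, 20), (100, 50, 200, 150)])
    print(describe_marks(frame_marks))
```

In this sketch a mark is just an integer tied to a location, so a model's output can name "Mark 2" instead of raw coordinates, and a trace is the ordered list of locations that one mark occupies over time.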

  • Leverages unlabeled video data for improved spatial-temporal grounding
  • Supports a maximum sequence length of 4096 tokens
  • Trained across diverse datasets including LLaVA-Next, Epic-Kitchen, and Open-X-Embodiment
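
The two stated limits (1024x1024 images and 4096 tokens) can be checked before a request is sent. The following is a hedged sketch: the helper names are hypothetical, the number of tokens an image consumes is not given in this description and must be supplied by the caller, and the model's own preprocessing may already handle resizing.

```python
from typing import Tuple

MAX_IMAGE_SIDE = 1024  # image limit stated in the description above
MAX_SEQ_LEN = 4096     # sequence-length limit stated in the description above


def fit_within(width: int, height: int,
               max_side: int = MAX_IMAGE_SIDE) -> Tuple[int, int]:
    """Return a size no larger than max_side on either side.

    The aspect ratio is preserved and images are never upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def fits_context(prompt_tokens: int, image_tokens: int, max_new_tokens: int,
                 max_len: int = MAX_SEQ_LEN) -> bool:
    """Check that prompt, image, and generated tokens fit in the context.

    image_tokens is an assumption supplied by the caller.
    """
    return prompt_tokens + image_tokens + max_new_tokens <= max_len


if __name__ == "__main__":
    print(fit_within(2048, 1024))
    print(fits_context(prompt_tokens=3000, image_tokens=1000, max_new_tokens=96))
```

Downscaling rather than cropping keeps the whole screen or scene visible, which matters for grounding tasks where the target may sit near an edge.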

Core Capabilities

  • UI Navigation and Grounding
  • Robotic Manipulation Control
  • Video Understanding and Planning
  • Spatial Reasoning and Image Analysis
  • Goal-driven Visual Planning

Frequently Asked Questions

Q: What makes this model unique?

Microsoft describes Magma-8B as the first foundation model designed specifically for multimodal AI agents that interact with both virtual and physical environments. Its combination of a LLaMA-3 backbone, a CLIP-ConvNeXt-XXLarge vision encoder, and training with Set-of-Mark and Trace-of-Mark targets spatial understanding and action planning.

Q: What are the recommended use cases?

The model is intended primarily for research use in English, specifically for UI navigation, robotic manipulation, image and video understanding, and spatial reasoning. It is best suited to controlled research environments with proper safety measures in place.
