# AlphaMaze-v0.2-1.5B
| Property | Value |
|---|---|
| Base Model | DeepSeek-R1-Distill-Qwen-1.5B |
| Parameters | 1.5B |
| Training Method | SFT + GRPO |
| Paper | arXiv:2502.14669 |
| Model Hub | Hugging Face |
## What is AlphaMaze-v0.2-1.5B?
AlphaMaze-v0.2-1.5B is an innovative language model developed by Menlo Research, specifically designed to enhance visual reasoning capabilities in LLMs through maze-solving tasks. Unlike traditional approaches that rely on image generation, AlphaMaze focuses on pure text-based spatial reasoning, demonstrating that models can develop robust visual thinking abilities without generating actual images.
## Implementation Details
The model implements a two-stage training approach: Supervised Fine-Tuning (SFT) using LLaMA-Factory, followed by Group Relative Policy Optimization (GRPO). Training was conducted on multiple datasets, including Maze-Reasoning-v0.1 (420k samples) and Maze-Reasoning-GRPO-v0.1 (180k samples).
- Utilizes Flash Attention 2 for efficient processing
- Trained on 6×A6000 GPUs, achieving near-zero training loss
- Implements specialized maze tokens for spatial representation
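To illustrate the idea of specialized maze tokens, the sketch below serializes a small maze grid into a flat token string. The token names (`<|row-col|>`, `<|..._wall|>`, `<|origin|>`, `<|target|>`) are illustrative placeholders, not the model's actual vocabulary:

```python
# Hypothetical sketch of a text-based maze encoding. Each cell emits a
# coordinate token, one token per wall, and markers for start/goal cells.
WALL_NAMES = ("up", "down", "left", "right")

def maze_to_tokens(walls, origin, target):
    """walls: dict mapping (row, col) -> set of wall directions for that cell.
    Returns a single string of pseudo-tokens describing the maze."""
    tokens = []
    for (r, c), cell_walls in sorted(walls.items()):
        tokens.append(f"<|{r}-{c}|>")
        for direction in WALL_NAMES:
            if direction in cell_walls:
                tokens.append(f"<|{direction}_wall|>")
        if (r, c) == origin:
            tokens.append("<|origin|>")
        if (r, c) == target:
            tokens.append("<|target|>")
    return "".join(tokens)
```

Encoding walls as discrete tokens rather than free-form prose gives the model an unambiguous, compact spatial representation to reason over.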
## Core Capabilities
- Text-based maze navigation and solving
- Strategic path planning and dead-end detection
- Spatial relationship understanding from pure text descriptions
- Mental mapping of maze structures
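For context, the navigation task itself can be solved classically with breadth-first search; a sketch like the one below (hypothetical data format, not the model's interface) is a useful baseline when benchmarking the model's path planning:

```python
# Classical BFS baseline for the maze-navigation task. walls[(r, c)] is the
# set of blocked directions for that cell; cells absent from `walls` are
# outside the maze. Returns the shortest move sequence, or None if unreachable.
from collections import deque

def solve_maze(walls, origin, target):
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue = deque([(origin, [])])
    seen = {origin}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == target:
            return path
        for name, (dr, dc) in moves.items():
            nxt = (r + dr, c + dc)
            # Skip blocked directions, out-of-maze cells, and visited cells.
            if name in walls.get((r, c), set()) or nxt not in walls or nxt in seen:
                continue
            seen.add(nxt)
            queue.append((nxt, path + [name]))
    return None
```

Unlike BFS, the model must plan from a purely textual description without an explicit search procedure, which is what makes the task a probe of spatial reasoning.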
## Frequently Asked Questions
**Q: What makes this model unique?**
AlphaMaze-v0.2-1.5B stands out for its ability to develop visual reasoning capabilities purely through text, without requiring image generation. This approach demonstrates that language models can develop sophisticated spatial understanding through carefully designed text-based training.
**Q: What are the recommended use cases?**
The model is particularly suited for tasks requiring spatial reasoning, path planning, and navigation in text-based environments. It can be used for research in AI spatial cognition, educational applications in problem-solving, and as a benchmark for testing LLM visual reasoning capabilities.