CogACT-Small

Maintained By: CogACT

Property    Value
Developer   Microsoft Research Asia
License     MIT
Paper       CogACT: A Foundational Vision-Language-Action Model
GitHub      microsoft/CogACT

What is CogACT-Small?

CogACT-Small is a Vision-Language-Action (VLA) model designed for robotic manipulation tasks. It combines visual understanding, language processing, and action prediction in a componentized architecture: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14 serve as the vision backbone, Llama-2 handles language understanding, and a DiT-Small diffusion transformer acts as the action module that predicts robot actions.

Implementation Details

The model architecture consists of three main components working together: a vision backbone, a language model, and an action model. Given an RGB image and a text instruction, the model predicts a sequence of 16 normalized robot actions, each represented as a 7-DoF end-effector delta (x, y, z, roll, pitch, yaw, gripper). It was trained on a subset of the Open X-Embodiment dataset and can be deployed with approximately 30 GB of memory in fp32 precision (see the usage sketch after the list below).

  • Vision Processing: Combines DINOv2 and SigLIP for robust visual feature extraction
  • Language Processing: Utilizes Llama-2 for instruction understanding
  • Action Prediction: Implements DiT-Small for precise robot control
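A rough usage sketch is shown below. It follows the general pattern of the example in the microsoft/CogACT repository, but the exact API (`load_vla`, `predict_action`), argument names, and values such as `action_model_type="DiT-S"` and the `unnorm_key` are reproduced as assumptions here and should be verified against the repository before use.

```python
# Hedged sketch: API names (load_vla, predict_action) and arguments follow the
# pattern shown in the microsoft/CogACT repository and may differ in detail.
import torch
from PIL import Image
from vla import load_vla  # provided by the CogACT codebase

model = load_vla(
    "CogACT/CogACT-Small",          # checkpoint for the Small variant
    load_for_training=False,
    action_model_type="DiT-S",      # DiT-Small action head (assumption for this variant)
    future_action_window_size=15,   # 1 current + 15 future steps = 16 actions
)
model.to("cuda:0").eval()

image = Image.open("observation.png")           # current RGB observation (illustrative path)
instruction = "move the sponge near the apple"  # natural-language task instruction

# Returns 16 x 7-DoF normalized action deltas; unnorm_key selects the dataset
# statistics used for un-normalization (e.g. the Open-X subset matching the robot setup).
actions, _ = model.predict_action(
    image,
    instruction,
    unnorm_key="fractal20220817_data",
    cfg_scale=1.5,       # classifier-free guidance scale
    use_ddim=True,       # DDIM sampling for the diffusion action head
    num_ddim_steps=10,
)
```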

Core Capabilities

  • Zero-shot robot control for setups seen in Open-X pretraining
  • Fine-tuning capability with minimal demonstration data
  • Adaptive Action Ensemble support for action refinement (see the sketch after this list)
  • Configurable sampling with DDIM support
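The Adaptive Action Ensemble described in the CogACT paper combines overlapping predictions for the same timestep from successive inference calls, weighting each by its similarity to the newest prediction. Below is a minimal, self-contained sketch of that idea; the function name, the default `alpha`, and the exact weighting scheme are illustrative assumptions rather than the repository's implementation.

```python
# Hedged sketch of a similarity-weighted action ensemble in the spirit of CogACT's
# Adaptive Action Ensemble; details are illustrative assumptions.
import numpy as np

def adaptive_action_ensemble(current, history, alpha=0.1):
    """current: newest 7-DoF prediction for this timestep.
    history: earlier predictions of the same timestep from previous inference calls.
    Returns a cosine-similarity-weighted average of all available predictions."""
    candidates = [current] + list(history)
    sims = np.array([
        float(np.dot(current, a) / (np.linalg.norm(current) * np.linalg.norm(a) + 1e-8))
        for a in candidates
    ])
    weights = np.exp(alpha * sims)   # more similar predictions get larger weights
    weights /= weights.sum()
    return np.sum([w * a for w, a in zip(weights, candidates)], axis=0)

# Example: ensemble the newest prediction with two earlier ones for the same step.
a_now = np.random.randn(7)
ensembled = adaptive_action_ensemble(a_now, [np.random.randn(7), np.random.randn(7)])
```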

Frequently Asked Questions

Q: What makes this model unique?

CogACT-Small's uniqueness lies in its specialized diffusion-based action module, which is conditioned on the VLM's output rather than discretizing actions into tokens as previous approaches do. This makes it more effective for precise, continuous robotic control.
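To illustrate the idea of conditioning an action head on a VLM feature (instead of emitting quantized action tokens), here is a toy PyTorch sketch. The dimensions, layer choices, and conditioning scheme are illustrative assumptions only and do not reproduce the actual CogACT architecture.

```python
# Conceptual toy only: a diffusion-style action denoiser conditioned on a single
# VLM output embedding; not the CogACT implementation.
import torch
import torch.nn as nn

class ToyActionDenoiser(nn.Module):
    def __init__(self, act_dim=7, horizon=16, cond_dim=4096, width=384):
        super().__init__()
        self.act_in = nn.Linear(act_dim, width)    # embed noisy action sequence
        self.cond_in = nn.Linear(cond_dim, width)  # embed VLM output feature
        self.time_in = nn.Linear(1, width)         # embed diffusion timestep
        block = nn.TransformerEncoderLayer(width, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=4)
        self.out = nn.Linear(width, act_dim)       # predict the noise to remove

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, act_dim), t: (B, 1), cond: (B, cond_dim)
        x = self.act_in(noisy_actions)
        ctx = (self.cond_in(cond) + self.time_in(t)).unsqueeze(1)
        x = self.blocks(torch.cat([ctx, x], dim=1))  # condition via a prepended token
        return self.out(x[:, 1:])                    # drop the conditioning slot

denoiser = ToyActionDenoiser()
noise_pred = denoiser(torch.randn(2, 16, 7), torch.rand(2, 1), torch.randn(2, 4096))
```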

Q: What are the recommended use cases?

The model is ideal for robotic manipulation tasks where visual input and natural-language instructions must be translated into precise robot actions. It is particularly effective for tasks similar to those in the Open-X pretraining set and can be fine-tuned for new setups with a small amount of demonstration data.
