CogACT-Small

Maintained By: CogACT

Property    Value
Developer   Microsoft Research Asia
License     MIT
Paper       CogACT: A Foundational Vision-Language-Action Model
GitHub      microsoft/CogACT

What is CogACT-Small?

CogACT-Small is a Vision-Language-Action (VLA) model designed for robotic manipulation tasks. It combines visual understanding, language processing, and action prediction in a componentized architecture: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14 serve as the vision backbone, Llama-2 handles language understanding, and a DiT-Small diffusion transformer acts as the action module that predicts robot actions.

Implementation Details

The model architecture consists of three main components working together: a vision backbone, a language model, and an action model. Given an RGB image and a text instruction, the model predicts a sequence of 16 normalized robot actions, each represented as a 7-DoF end-effector delta (x, y, z, roll, pitch, yaw, gripper). It was trained on a subset of the Open X-Embodiment dataset and can be deployed with approximately 30 GB of memory in fp32 precision (see the usage sketch after the list below).

  • Vision Processing: Combines DINOv2 and SigLIP for robust visual feature extraction
  • Language Processing: Utilizes Llama-2 for instruction understanding
  • Action Prediction: Implements DiT-Small for precise robot control
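A rough usage sketch is shown below. It follows the general pattern of the example in the microsoft/CogACT repository, but the exact API (`load_vla`, `predict_action`), argument names, and values such as `action_model_type="DiT-S"` and the `unnorm_key` are reproduced as assumptions here and should be verified against the repository before use.

```python
# Hedged sketch: API names (load_vla, predict_action) and arguments follow the
# pattern shown in the microsoft/CogACT repository and may differ in detail.
import torch
from PIL import Image
from vla import load_vla  # provided by the CogACT codebase

model = load_vla(
    "CogACT/CogACT-Small",          # checkpoint for the Small variant
    load_for_training=False,
    action_model_type="DiT-S",      # DiT-Small action head (assumption for this variant)
    future_action_window_size=15,   # 1 current + 15 future steps = 16 actions
)
model.to("cuda:0").eval()

image = Image.open("observation.png")           # current RGB observation (illustrative path)
instruction = "move the sponge near the apple"  # natural-language task instruction

# Returns 16 x 7-DoF normalized action deltas; unnorm_key selects the dataset
# statistics used for un-normalization (e.g. the Open-X subset matching the robot setup).
actions, _ = model.predict_action(
    image,
    instruction,
    unnorm_key="fractal20220817_data",
    cfg_scale=1.5,       # classifier-free guidance scale
    use_ddim=True,       # DDIM sampling for the diffusion action head
    num_ddim_steps=10,
)
```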

Core Capabilities

  • Zero-shot robot control for setups seen in Open-X pretraining
  • Fine-tuning capability with minimal demonstration data
  • Adaptive Action Ensemble support for action refinement (see the sketch after this list)
  • Configurable sampling with DDIM support
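The Adaptive Action Ensemble described in the CogACT paper combines overlapping predictions for the same timestep from successive inference calls, weighting each by its similarity to the newest prediction. Below is a minimal, self-contained sketch of that idea; the function name, the default `alpha`, and the exact weighting scheme are illustrative assumptions rather than the repository's implementation.

```python
# Hedged sketch of a similarity-weighted action ensemble in the spirit of CogACT's
# Adaptive Action Ensemble; details are illustrative assumptions.
import numpy as np

def adaptive_action_ensemble(current, history, alpha=0.1):
    """current: newest 7-DoF prediction for this timestep.
    history: earlier predictions of the same timestep from previous inference calls.
    Returns a cosine-similarity-weighted average of all available predictions."""
    candidates = [current] + list(history)
    sims = np.array([
        float(np.dot(current, a) / (np.linalg.norm(current) * np.linalg.norm(a) + 1e-8))
        for a in candidates
    ])
    weights = np.exp(alpha * sims)   # more similar predictions get larger weights
    weights /= weights.sum()
    return np.sum([w * a for w, a in zip(weights, candidates)], axis=0)

# Example: ensemble the newest prediction with two earlier ones for the same step.
a_now = np.random.randn(7)
ensembled = adaptive_action_ensemble(a_now, [np.random.randn(7), np.random.randn(7)])
```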

Frequently Asked Questions

Q: What makes this model unique?

CogACT-Small's uniqueness lies in its specialized diffusion-based action module, which is conditioned on the VLM's output rather than discretizing actions into tokens as previous approaches do. This makes it more effective for precise, continuous robotic control.
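To illustrate the idea of conditioning an action head on a VLM feature (instead of emitting quantized action tokens), here is a toy PyTorch sketch. The dimensions, layer choices, and conditioning scheme are illustrative assumptions only and do not reproduce the actual CogACT architecture.

```python
# Conceptual toy only: a diffusion-style action denoiser conditioned on a single
# VLM output embedding; not the CogACT implementation.
import torch
import torch.nn as nn

class ToyActionDenoiser(nn.Module):
    def __init__(self, act_dim=7, horizon=16, cond_dim=4096, width=384):
        super().__init__()
        self.act_in = nn.Linear(act_dim, width)    # embed noisy action sequence
        self.cond_in = nn.Linear(cond_dim, width)  # embed VLM output feature
        self.time_in = nn.Linear(1, width)         # embed diffusion timestep
        block = nn.TransformerEncoderLayer(width, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=4)
        self.out = nn.Linear(width, act_dim)       # predict the noise to remove

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, act_dim), t: (B, 1), cond: (B, cond_dim)
        x = self.act_in(noisy_actions)
        ctx = (self.cond_in(cond) + self.time_in(t)).unsqueeze(1)
        x = self.blocks(torch.cat([ctx, x], dim=1))  # condition via a prepended token
        return self.out(x[:, 1:])                    # drop the conditioning slot

denoiser = ToyActionDenoiser()
noise_pred = denoiser(torch.randn(2, 16, 7), torch.rand(2, 1), torch.randn(2, 4096))
```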

Q: What are the recommended use cases?

The model is ideal for robotic manipulation tasks where visual input and natural-language instructions must be translated into precise robot actions. It is particularly effective for tasks similar to those in the Open-X pretraining set and can be fine-tuned for new setups with a small amount of demonstration data.
