RDT-1B

Maintained by: robotics-diffusion-transformer

Property       Value
License        MIT
Paper          arXiv:2410.07864
Developer      TSAIL group at Tsinghua University
Architecture   Diffusion Policy with Transformers

What is RDT-1B?

RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer for robotic control. Pre-trained on over 1 million multi-robot episodes, it is a vision-language-action model that combines visual input from up to three camera views with natural language instructions to predict robot actions.

Implementation Details

The model pairs a siglip-so400m-patch14-384 vision encoder with a t5-v1_1-xxl language encoder. Given the encoded observations and instruction, it predicts a chunk of 64 consecutive robot actions and supports single-arm, dual-arm, joint-based, and end-effector-based control; a minimal inference sketch follows the list below.

  • Unified action space supporting multiple robot configurations
  • Multi-view visual processing capability
  • Flexible control frequency adaptation
  • Support for both position and velocity-based control

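The sketch below follows the usage example published with the official RDT repository and model card (a `create_model` helper in `scripts/agilex_model.py` and a `step` method that returns an action chunk). Argument names, the language-embedding workflow, and the expected input shapes may differ between repo versions, so treat this as an illustration under those assumptions rather than a drop-in script.

```python
import torch
from PIL import Image

# Assumes the RoboticsDiffusionTransformer repo is cloned, its dependencies installed,
# and that you run from the repo root so `scripts.agilex_model` is importable.
# Helper and argument names follow the repo's published example; verify against your checkout.
from scripts.agilex_model import create_model

# Camera order expected by the pre-trained model: exterior, right wrist, left wrist
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']

config = {
    'episode_len': 1000,            # maximum episode length
    'state_dim': 14,                # proprioception dimension (dual-arm joint positions here)
    'chunk_size': 64,               # RDT predicts 64 consecutive actions per call
    'camera_names': CAMERA_NAMES,
}

policy = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained='robotics-diffusion-transformer/rdt-1b',
    pretrained_vision_encoder_name_or_path='google/siglip-so400m-patch14-384',
    control_frequency=25,           # Hz of the target robot controller
)

# Instructions are pre-encoded with t5-v1_1-xxl (see the repo's language-encoding script);
# the path below is a placeholder for wherever you saved the embedding.
text_embeds = torch.load('path/to/instruction_embedding.pt')['embeddings']

# Placeholders: the two most recent frames from each camera (6 PIL images in total)
# and the current robot state; replace with real observations on hardware.
images = [Image.new('RGB', (384, 384)) for _ in range(2 * len(CAMERA_NAMES))]
proprio = torch.zeros(config['state_dim'])

# Returns a chunk of 64 actions to execute at the configured control frequency
actions = policy.step(proprio=proprio, images=images, text_embeds=text_embeds)
```
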
Core Capabilities

  • Multi-robot episode processing
  • Natural language instruction interpretation
  • Real-time action prediction
  • Wheeled locomotion support
  • Cross-platform compatibility

Frequently Asked Questions

Q: What makes this model unique?

RDT-1B stands out for its ability to handle multiple robot configurations and control paradigms within a single model, combined with its sophisticated vision-language processing capabilities and extensive pre-training on diverse robotics datasets.

Q: What are the recommended use cases?

The model is ideal for robotic manipulation tasks where visual feedback and natural language instructions guide the robot's actions. It's particularly well-suited for scenarios involving mobile manipulators, whether single-arm or dual-arm configurations, and can handle both position and velocity-based control schemes.
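To show how the 64-step action chunks map onto a position- or velocity-controlled robot, here is a rough closed-loop sketch that drains one predicted chunk at a fixed control frequency before querying the policy again. The `policy`, `get_observation`, and `send_joint_command` names are placeholders for whatever interfaces your robot stack provides; they are not part of the RDT codebase.

```python
import time

CONTROL_HZ = 25          # should match the control_frequency the policy was created with
CHUNK_SIZE = 64          # actions returned per policy call

def run_episode(policy, get_observation, send_joint_command, max_steps=1000):
    """Execute predicted action chunks in a closed loop.

    `policy.step(...)` is assumed to return an iterable of CHUNK_SIZE actions;
    `get_observation()` and `send_joint_command(action)` are placeholders for
    your own camera/proprioception readers and low-level controller.
    """
    period = 1.0 / CONTROL_HZ
    steps = 0
    while steps < max_steps:
        proprio, images, text_embeds = get_observation()
        actions = policy.step(proprio=proprio, images=images, text_embeds=text_embeds)
        for action in actions:                 # drain the chunk before re-planning
            t0 = time.monotonic()
            send_joint_command(action)         # position or velocity targets, per your setup
            steps += 1
            if steps >= max_steps:
                break
            # keep the loop close to the target control frequency
            time.sleep(max(0.0, period - (time.monotonic() - t0)))
```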
