RDT-1B
| Property | Value |
| --- | --- |
| License | MIT |
| Paper | arXiv:2410.07864 |
| Developer | TSAIL group at Tsinghua University |
| Architecture | Diffusion Policy with Transformers |
What is RDT-1B?
RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer for robotic control. Pre-trained on over 1 million multi-robot episodes, it is a vision-language-action model that combines visual input from up to three camera views with natural language instructions to predict robot actions.
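Because the policy is diffusion-based, actions are produced by iteratively denoising a noisy action chunk conditioned on the encoded observations. The sketch below is a generic DDPM-style sampling loop for illustration only; the `denoiser` callable, the noise schedule, and the action width are placeholder assumptions, not the released RDT-1B inference code.

```python
# Generic DDPM-style sampling of a 64-step action chunk (illustrative sketch;
# `denoiser`, the schedule, and `action_dim` are placeholder assumptions).
import torch

def sample_action_chunk(denoiser, cond, horizon=64, action_dim=128, steps=100):
    """Iteratively denoise a (1, horizon, action_dim) chunk of future actions.

    `denoiser(x, t, cond)` is any network that predicts the noise added to the
    chunk `x` at timestep `t`, conditioned on fused vision-language features.
    """
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, action_dim)             # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)      # predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_cumprod[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # ancestral sampling noise
        else:
            x = mean                                    # final step is deterministic
    return x                                            # denoised chunk of future actions
```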
Implementation Details
The model uses siglip-so400m-patch14-384 as its vision encoder and t5-v1_1-xxl as its language encoder. It predicts chunks of 64 consecutive robot actions and supports several control paradigms, including single-arm, dual-arm, joint-based, and end-effector-based control; a minimal encoding sketch follows the feature list below.
- Unified action space supporting multiple robot configurations
- Multi-view visual processing capability
- Flexible control frequency adaptation
- Support for both position and velocity-based control
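For context, both encoders named above are available through the `transformers` library. The sketch below only shows how the multi-view images and the instruction could be turned into conditioning tokens; the file names and instruction text are placeholders, and RDT-1B's actual fusion and action-decoding pipeline is not reproduced here.

```python
# Hedged sketch: encode up to three camera views and one instruction with the
# encoders listed on the model card. Paths and prompt are placeholders.
import torch
from PIL import Image
from transformers import AutoTokenizer, SiglipImageProcessor, SiglipVisionModel, T5EncoderModel

# Vision encoder: one forward pass over the stacked camera views.
image_processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
vision_encoder = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

# Language encoder: T5-v1.1-XXL, encoder only.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

views = [Image.open(p) for p in ["exterior.png", "left_wrist.png", "right_wrist.png"]]
pixel_values = image_processor(images=views, return_tensors="pt").pixel_values

with torch.no_grad():
    image_tokens = vision_encoder(pixel_values=pixel_values).last_hidden_state   # (3, patches, dim)
    text_inputs = tokenizer("fold the towel and place it in the basket", return_tensors="pt")
    text_tokens = text_encoder(**text_inputs).last_hidden_state                  # (1, seq_len, dim)

# These token sequences would condition the diffusion transformer that
# denoises the 64-step action chunk.
```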
Core Capabilities
- Multi-robot episode processing
- Natural language instruction interpretation
- Real-time action prediction
- Wheeled locomotion support
- Cross-platform compatibility
Frequently Asked Questions
Q: What makes this model unique?
RDT-1B stands out for its ability to handle multiple robot configurations and control paradigms within a single model, combined with its sophisticated vision-language processing capabilities and extensive pre-training on diverse robotics datasets.
Q: What are the recommended use cases?
The model is ideal for robotic manipulation tasks where visual feedback and natural language instructions guide the robot's actions. It is particularly well suited to mobile manipulators in single-arm or dual-arm configurations, and it can handle both position- and velocity-based control schemes.
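To make the "multiple robot configurations" point concrete, the sketch below shows one way a unified, fixed-width action vector can absorb different embodiments by masking unused dimensions. The slot names, their ordering, and the 128-dimension width are illustrative assumptions, not the exact layout defined in the RDT paper.

```python
# Purely illustrative: embed robot-specific actions into one fixed-width
# vector plus a validity mask, so a single model covers many platforms.
import numpy as np

UNIFIED_DIM = 128  # assumed width of the unified action vector

# Hypothetical slot assignment for demonstration only.
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),
    "right_gripper": slice(7, 8),
    "left_arm_joint_pos": slice(8, 15),
    "left_gripper": slice(15, 16),
    "base_velocity": slice(16, 19),
}

def to_unified(raw: dict) -> tuple:
    """Embed a robot-specific action dict into the unified vector and a mask."""
    vec = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in raw.items():
        sl = SLOTS[name]
        vec[sl] = values
        mask[sl] = True      # only these dimensions carry a real command
    return vec, mask

# A single-arm robot simply leaves the left-arm and base slots masked out.
vec, mask = to_unified({
    "right_arm_joint_pos": np.zeros(7),
    "right_gripper": np.array([1.0]),
})
print(vec.shape, int(mask.sum()))  # (128,) 8
```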