rtdetr_r50vd

PekingU

Real-time object detection transformer that reaches 53.1% AP on COCO at 108 FPS on a T4 GPU. It combines DETR-style accuracy with YOLO-like speed using 43M parameters.

Property	Value
Parameter Count	43M parameters
License	Apache-2.0
Paper	DETRs Beat YOLOs on Real-time Object Detection
Performance	53.1% AP on COCO, 108 FPS on T4 GPU

What is rtdetr_r50vd?

RT-DETR (Real-Time Detection Transformer) is an object detection model that aims to close the gap between DETR's accuracy and YOLO's speed. Developed by researchers at Peking University and Baidu, it is described by its authors as the first real-time end-to-end object detector: it removes the need for Non-Maximum Suppression (NMS) while maintaining high accuracy. The rtdetr_r50vd variant uses a ResNet-50-vd backbone.

Implementation Details

The model pairs an efficient hybrid encoder with uncertainty-minimal query selection, which chooses the initial decoder queries from the encoder's output features. The encoder processes multi-scale features through two components: Attention-based Intra-scale Feature Interaction (AIFI) and CNN-based Cross-scale Feature Fusion (CCFF). Input images are resized to 640x640 pixels and normalized with fixed parameters before inference.
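As an illustration of how the preprocessing and the model fit together at inference time, here is a minimal sketch. It assumes that the Hugging Face `transformers` library (with its `RTDetrImageProcessor` and `RTDetrForObjectDetection` classes), PyTorch, and Pillow are installed, and that the checkpoint is published under the id `PekingU/rtdetr_r50vd`; that id is inferred from the model and organization names above rather than stated in this card. The helper names `load_detector`, `detect`, and `format_detection` are hypothetical.

```python
DEFAULT_CHECKPOINT = "PekingU/rtdetr_r50vd"  # assumed hub id, inferred from this card


def load_detector(checkpoint=DEFAULT_CHECKPOINT):
    """Load the image processor and model (downloads weights on first use)."""
    # Imported lazily so the pure-Python helper below works without these packages.
    from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

    processor = RTDetrImageProcessor.from_pretrained(checkpoint)
    model = RTDetrForObjectDetection.from_pretrained(checkpoint)
    model.eval()
    return processor, model


def detect(image, processor, model, threshold=0.5):
    """Run detection on one PIL image; returns (label, score, [x1, y1, x2, y2]) tuples."""
    import torch

    # The processor takes care of resizing and pixel scaling for the model.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Map boxes back to the original image size; no NMS step is involved.
    target_sizes = torch.tensor([[image.height, image.width]])
    result = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return [
        (model.config.id2label[label.item()], score.item(), box.tolist())
        for score, label, box in zip(result["scores"], result["labels"], result["boxes"])
    ]


def format_detection(label, score, box):
    """Render one detection as a short human-readable line."""
    return f"{label} {score:.2f} {[round(v, 1) for v in box]}"
```

With Pillow available, calling `processor, model = load_detector()` and then `detect(Image.open("street.jpg"), processor, model)` would return the detections above the score threshold; the file name is a placeholder.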

Trained on the COCO 2017 dataset (118k training images)
Supports flexible speed tuning by adjusting the number of decoder layers, without retraining
Achieves 53.1% AP on the COCO validation set
Operates at 108 FPS on a T4 GPU
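To put the throughput figure in context, the short calculation below converts 108 FPS into a per-image latency and estimates how many video streams one model instance could keep up with. It is a back-of-the-envelope sketch: the function names are hypothetical, the 30 FPS stream rate is an assumed example, and the estimate ignores video decoding, data transfer, and batching effects.

```python
def latency_ms(fps: float) -> float:
    """Average time per image in milliseconds at a given throughput."""
    return 1000.0 / fps


def max_streams(model_fps: float, stream_fps: float) -> int:
    """Whole number of streams whose combined frame rate fits the model's throughput."""
    return int(model_fps // stream_fps)


MODEL_FPS = 108    # reported throughput on a T4 GPU
STREAM_FPS = 30    # assumed camera frame rate, for illustration only

print(f"{latency_ms(MODEL_FPS):.2f} ms per image")                      # 9.26 ms per image
print(f"{max_streams(MODEL_FPS, STREAM_FPS)} streams at {STREAM_FPS} FPS")  # 3 streams at 30 FPS
```

At roughly 9 ms per image, the model leaves most of a 33 ms frame budget free for the rest of a 30 FPS pipeline.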

Core Capabilities

Real-time object detection at 53.1% AP on COCO
End-to-end detection without NMS post-processing
Multi-scale feature processing
Flexible speed-accuracy trade-off
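To show what detection without NMS looks like in practice, the sketch below turns raw per-query outputs directly into final detections with a score threshold and a box conversion, with no overlap-based suppression step. It is a simplified DETR-style illustration in plain Python, not the reference post-processing: it assumes each query yields a list of class logits and a normalized (cx, cy, w, h) box, and all function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cxcywh_to_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )


def postprocess(logits, boxes, image_size, threshold=0.5):
    """Keep each query's best class if its score passes the threshold.

    logits: one list of class logits per query
    boxes: one normalized (cx, cy, w, h) tuple per query
    image_size: (width, height) of the original image in pixels
    """
    width, height = image_size
    detections = []
    for query_logits, box in zip(logits, boxes):
        scores = [sigmoid(v) for v in query_logits]
        class_id = max(range(len(scores)), key=scores.__getitem__)
        if scores[class_id] >= threshold:
            detections.append(
                (class_id, scores[class_id], cxcywh_to_xyxy(box, width, height))
            )
    # Sort by confidence; each query contributes at most one box, so no NMS is applied.
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections


# Two toy queries: the first is confident about class 0, the second is not confident.
toy_logits = [[2.0, -2.0], [-3.0, -1.0]]
toy_boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
for class_id, score, xyxy in postprocess(toy_logits, toy_boxes, (640, 480)):
    print(class_id, round(score, 3), [round(v, 1) for v in xyxy])
```

Because the model is trained so that each object is claimed by one query, thresholding alone is enough here, which is what removes the NMS stage and its tuning parameters from the pipeline.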

Frequently Asked Questions

Q: What makes this model unique?

RT-DETR combines transformer-based detection with real-time performance, outperforming comparable YOLO models in both speed and accuracy while eliminating the need for NMS. According to the paper, it is about 21 times faster than DINO-R50 while also achieving higher accuracy.

Q: What are the recommended use cases?

The model is ideal for real-time object detection applications requiring both speed and accuracy, such as surveillance systems, autonomous driving, and real-time video analysis. Its flexible architecture allows for deployment in various scenarios with different speed-accuracy requirements.