Florence-2-base

microsoft

Microsoft's Florence-2-base is a 0.23B-parameter vision foundation model that handles tasks such as captioning, object detection, and OCR through prompts, with strong zero-shot performance.

Property	Value
Parameter Count	0.23B
License	MIT
Paper	arXiv:2311.06242
Architecture	Transformer-based Vision Foundation Model

What is Florence-2-base?

Florence-2-base is a compact vision foundation model developed by Microsoft that uses a prompt-based approach to handle a range of vision and vision-language tasks. It was trained on FLD-5B, a dataset of 5.4 billion annotations across 126 million images, and targets multi-task visual understanding with a single set of weights.

Implementation Details

The model implements a sequence-to-sequence architecture that can be used zero-shot or fine-tuned for a specific task. It runs on PyTorch through the Transformers library and supports float16 precision to reduce memory use during inference.

Leverages prompt-based task specification
Processes both image and text inputs
Supports batch processing and GPU acceleration
Implements beam search for generation tasks
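
The points above can be sketched as a small inference helper. This is a minimal sketch, not the official implementation: it assumes the checkpoint is published on the Hugging Face Hub as microsoft/Florence-2-base, that it loads through AutoModelForCausalLM and AutoProcessor with trust_remote_code=True, and that the processor exposes post_process_generation. The function names build_prompt, load_florence, and run_task are hypothetical, and the exact loading arguments may differ between Transformers versions.

```python
def build_prompt(task_prompt, text_input=None):
    """Return the prompt string: a task token, optionally followed by text."""
    if not (task_prompt.startswith("<") and task_prompt.endswith(">")):
        raise ValueError("task_prompt should be a token such as '<OD>'")
    return task_prompt if text_input is None else task_prompt + text_input


def load_florence(model_id="microsoft/Florence-2-base"):
    """Load model and processor (assumed Hub id; downloads weights on first use)."""
    # Heavy imports stay local so build_prompt can be used without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # float16 only on GPU; fall back to float32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    # trust_remote_code is assumed to be needed for the Hub-hosted model code.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor, device, dtype


def run_task(model, processor, device, dtype, image, task_prompt,
             text_input=None, num_beams=3, max_new_tokens=1024):
    """Run one prompt-specified task on a PIL image and parse the output."""
    prompt = build_prompt(task_prompt, text_input)
    # The processor accepts both the text prompt and the image.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,  # beam search rather than sampling
        do_sample=False,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Converts location tokens in the raw text into structured results.
    return processor.post_process_generation(
        text, task=task_prompt, image_size=(image.width, image.height)
    )
```

A typical call would load the model once with load_florence(), open an image with PIL, and pass it to run_task together with a task token such as "<OD>". Loading is kept separate from inference so the weights are not reloaded for every image.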

Core Capabilities

Image Captioning (with multiple detail levels)
Object Detection (mAP 34.7 on COCO val2017)
Dense Region Captioning
OCR with Region Detection
Phrase Grounding
Visual Question Answering
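
Each capability is selected by a task token in the prompt. The mapping below is a hedged sketch: the token strings are assumptions based on the upstream model card and should be checked against it, and the dictionary keys and the pair_detections helper are hypothetical names introduced for illustration. Visual Question Answering is left out because this sketch does not assume a prompt token for it. The detection result layout (a dict keyed by the task token, holding "bboxes" and "labels") is likewise an assumption about the post-processed output.

```python
# Assumed task tokens; verify against the upstream model card before use.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}


def pair_detections(parsed, task="<OD>"):
    """Pair each label with its (x1, y1, x2, y2) box from a parsed result."""
    result = parsed[task]
    if len(result["labels"]) != len(result["bboxes"]):
        raise ValueError("labels and bboxes have different lengths")
    return [(label, tuple(box)) for label, box in zip(result["labels"], result["bboxes"])]


# Illustrative input only; these values are made up, not model output.
example = {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["car"]}}
detections = pair_detections(example)
```

Keeping the tokens in one table makes it easy to switch tasks without changing the inference code.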

Frequently Asked Questions

Q: What makes this model unique?

Florence-2-base stands out for its ability to handle multiple vision tasks through simple prompts without task-specific fine-tuning, achieving strong zero-shot performance despite its relatively small size of 0.23B parameters.
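
To put the parameter count in perspective, the arithmetic below gives a rough estimate of the memory needed for the weights alone. It assumes 2 bytes per parameter in float16 and 4 bytes in float32, and it ignores activations, the generation cache, and framework overhead, so real usage will be higher. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 0.23e9  # parameter count stated for Florence-2-base

fp16_gb = weight_memory_gb(PARAMS, 2)  # about 0.46 GB
fp32_gb = weight_memory_gb(PARAMS, 4)  # about 0.92 GB
print(f"float16: {fp16_gb:.2f} GB, float32: {fp32_gb:.2f} GB")
```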

Q: What are the recommended use cases?

The model is well suited to image captioning (133.0 CIDEr on COCO), object detection, OCR, and visual grounding. It is a particularly good fit for applications that need several vision capabilities in a single model.