Florence-2-base-ft
| Property | Value |
|---|---|
| Parameter Count | 0.23B |
| License | MIT |
| Paper | Florence-2 Paper |
| Model Type | Vision-Language Model |
What is Florence-2-base-ft?
Florence-2-base-ft is a fine-tuned version of the Florence-2 base model, designed as a versatile vision foundation model that handles multiple vision and vision-language tasks through a prompt-based approach. This 0.23B-parameter model has been fine-tuned on a collection of downstream tasks to improve performance across a range of applications.
Implementation Details
The model uses a sequence-to-sequence architecture and was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images. It is implemented with Hugging Face's transformers library and supports both zero-shot use and further task-specific fine-tuning.
- Trained with float16 precision
- Supports CUDA acceleration
- Requires minimal prompt engineering for different tasks
- Integrates seamlessly with the HuggingFace ecosystem
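A minimal loading sketch illustrating the points above is shown below. It assumes the public Hugging Face repo id microsoft/Florence-2-base-ft and follows the usage pattern from the official model card: float16 on a CUDA device when one is available, a float32 fallback on CPU, and trust_remote_code=True because the model ships custom modeling and processing code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base-ft"  # assumed repo id
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 needs a GPU

# The model relies on custom code shipped in the repo, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```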
Core Capabilities
- Image Captioning (Multiple detail levels)
- Object Detection with bounding boxes
- Dense Region Captioning
- OCR with and without region detection
- Caption to Phrase Grounding
- Region Proposal Generation
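Each of these tasks is selected with a special task token in the prompt. The sketch below follows the generate-and-post-process pattern from the official model card, reusing the model, processor, device, and dtype from the loading sketch above; run_task is a hypothetical helper name, and the image URL is a placeholder. For grounding-style tasks, the phrase to locate is appended after the task token.

```python
import requests
from PIL import Image

def run_task(task_token, image, extra_text=""):
    # Task tokens include <CAPTION>, <OD>, <OCR>, <OCR_WITH_REGION>,
    # <DENSE_REGION_CAPTION>, <REGION_PROPOSAL>, and
    # <CAPTION_TO_PHRASE_GROUNDING>; extra_text carries the phrase
    # for grounding-style tasks.
    prompt = task_token + extra_text
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses the raw output into task-specific
    # structures, e.g. bounding boxes and labels for <OD>.
    return processor.post_process_generation(
        raw, task=task_token, image_size=(image.width, image.height)
    )

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)
print(run_task("<OD>", image))  # {'<OD>': {'bboxes': [...], 'labels': [...]}}
print(run_task("<CAPTION_TO_PHRASE_GROUNDING>", image, "a red car"))
```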
Frequently Asked Questions
Q: What makes this model unique?
Florence-2-base-ft stands out for its ability to handle multiple vision tasks through simple prompts without requiring task-specific fine-tuning. Despite its relatively small size (0.23B parameters), it achieves competitive performance across various benchmarks.
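For example, the three captioning modes differ only in their task token; reusing the hypothetical run_task helper from the sketch above:

```python
for task in ("<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>"):
    print(task, run_task(task, image))
```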
Q: What are the recommended use cases?
The model excels in scenarios requiring multi-task vision capabilities, including image captioning (CIDEr score of 140.0 on COCO Caption), object detection (41.4 mAP on COCO Det.), and visual question answering (79.7% accuracy on VQAv2). It's particularly suitable for applications needing integrated vision-language capabilities.