Florence-2-base-ft
| Property | Value |
|---|---|
| Parameter Count | 0.23B |
| License | MIT |
| Paper | Florence-2 Paper |
| Model Type | Vision-Language Model |
What is Florence-2-base-ft?
Florence-2-base-ft is a fine-tuned version of the Florence-2 base model, designed as a versatile vision foundation model that handles multiple vision and vision-language tasks through a prompt-based approach. This 0.23B-parameter model has been fine-tuned on a collection of downstream tasks to improve performance across a range of applications.
Implementation Details
The model uses a sequence-to-sequence architecture and was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images. It is implemented with Hugging Face's transformers library and supports both zero-shot use and further task-specific fine-tuning.
- Trained with float16 precision
- Supports CUDA acceleration
- Requires minimal prompt engineering for different tasks
- Integrates seamlessly with the HuggingFace ecosystem
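A minimal loading sketch illustrating the points above is shown below. It assumes the public Hugging Face repo id microsoft/Florence-2-base-ft and follows the usage pattern from the official model card: float16 on a CUDA device when one is available, a float32 fallback on CPU, and trust_remote_code=True because the model ships custom modeling and processing code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base-ft"  # assumed repo id
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 needs a GPU

# The model relies on custom code shipped in the repo, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```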
Core Capabilities
- Image Captioning (Multiple detail levels)
- Object Detection with bounding boxes
- Dense Region Captioning
- OCR with and without region detection
- Caption to Phrase Grounding
- Region Proposal Generation
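Each of these tasks is selected with a special task token in the prompt. The sketch below follows the generate-and-post-process pattern from the official model card, reusing the model, processor, device, and dtype from the loading sketch above; run_task is a hypothetical helper name, and the image URL is a placeholder. For grounding-style tasks, the phrase to locate is appended after the task token.

```python
import requests
from PIL import Image

def run_task(task_token, image, extra_text=""):
    # Task tokens include <CAPTION>, <OD>, <OCR>, <OCR_WITH_REGION>,
    # <DENSE_REGION_CAPTION>, <REGION_PROPOSAL>, and
    # <CAPTION_TO_PHRASE_GROUNDING>; extra_text carries the phrase
    # for grounding-style tasks.
    prompt = task_token + extra_text
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses the raw output into task-specific
    # structures, e.g. bounding boxes and labels for <OD>.
    return processor.post_process_generation(
        raw, task=task_token, image_size=(image.width, image.height)
    )

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)
print(run_task("<OD>", image))  # {'<OD>': {'bboxes': [...], 'labels': [...]}}
print(run_task("<CAPTION_TO_PHRASE_GROUNDING>", image, "a red car"))
```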
Frequently Asked Questions
Q: What makes this model unique?
Florence-2-base-ft stands out for its ability to handle multiple vision tasks through simple prompts without requiring task-specific fine-tuning. Despite its relatively small size (0.23B parameters), it achieves competitive performance across various benchmarks.
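For example, the three captioning modes differ only in their task token; reusing the hypothetical run_task helper from the sketch above:

```python
for task in ("<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>"):
    print(task, run_task(task, image))
```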
Q: What are the recommended use cases?
The model excels in scenarios requiring multi-task vision capabilities, including image captioning (CIDEr score of 140.0 on COCO Caption), object detection (41.4 mAP on COCO Det.), and visual question answering (79.7% accuracy on VQAv2). It's particularly suitable for applications needing integrated vision-language capabilities.