Florence-2-large-no-flash-attn

Maintained By
multimodalart

| Property | Value |
|---|---|
| Parameter Count | 0.77B |
| License | MIT |
| Paper | Florence-2 Paper |
| Architecture | Vision Foundation Model |

What is Florence-2-large-no-flash-attn?

Florence-2-large-no-flash-attn is a modified version of Microsoft's Florence-2 vision foundation model, adapted to run without the flash-attention mechanism. The large-scale model (0.77B parameters) handles a wide range of vision and vision-language tasks through a prompt-based approach, and was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images.

Implementation Details

The model implements a sequence-to-sequence architecture optimized for both zero-shot and fine-tuned use. It is built on the PyTorch framework and runs in float16 precision on CUDA devices, falling back to float32 on CPU. The modification removes the flash-attention dependency while preserving core functionality, though possibly at some cost in inference speed.

  • Supports multiple vision tasks through simple prompt engineering
  • Processes both image and text inputs simultaneously
  • Implements efficient attention mechanisms without flash-attention requirement
  • Achieves strong performance metrics across various benchmarks
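The loading pattern described above can be sketched with Hugging Face `transformers`. This is a minimal sketch, assuming `torch` and `transformers` are installed and that the repository id is `multimodalart/Florence-2-large-no-flash-attn`; the helper names are illustrative, not part of the model's API:

```python
def pick_precision(cuda_available: bool) -> str:
    # Mirrors the card's precision policy: float16 on CUDA devices, float32 on CPU.
    return "float16" if cuda_available else "float32"


def load_florence2(model_id: str = "multimodalart/Florence-2-large-no-flash-attn"):
    """Load model and processor with the precision/device fallback above."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = getattr(torch, pick_precision(device == "cuda"))
    # trust_remote_code=True is needed: Florence-2 ships custom modeling code,
    # and this fork of that code drops the flash-attention requirement.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor, device
```

Because the precision choice is isolated in `pick_precision`, the CPU fallback can be exercised without a GPU present.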

Core Capabilities

  • Image Captioning (COCO CIDEr score: 135.6)
  • Object Detection (COCO val2017 mAP: 37.5)
  • Dense Region Captioning
  • OCR and Text Recognition
  • Visual Question Answering
  • Phrase Grounding

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for handling multiple vision tasks without task-specific fine-tuning, while maintaining high performance without the flash-attention mechanism. It achieves strong results with far fewer parameters (0.77B) than competitors such as PaLI (17B) or Flamingo (80B).

Q: What are the recommended use cases?

The model excels in various computer vision tasks including image captioning, object detection, OCR, and visual question answering. It's particularly suitable for applications requiring multiple vision capabilities in a single model, especially when flash-attention isn't available in the deployment environment.
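All of these use cases follow one inference pattern: encode the image plus a task prompt, generate, then post-process. A sketch under the assumption that `model` and `processor` were loaded with `trust_remote_code=True` (so the processor exposes Florence-2's custom `post_process_generation` helper); the function name is illustrative:

```python
def run_task(model, processor, image, task: str = "<CAPTION>", device: str = "cpu"):
    """Run one Florence-2 task on a PIL image and return the parsed result."""
    inputs = processor(text=task, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses task-specific output, e.g. boxes for <OD>.
    return processor.post_process_generation(raw, task=task, image_size=image.size)
```

Swapping only the `task` token switches the same call between captioning, detection, OCR, and grounding.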
