Florence-2-large-no-flash-attn

Maintained By
multimodalart

| Property | Value |
|---|---|
| Parameter Count | 0.77B |
| License | MIT |
| Paper | Florence-2 Paper |
| Architecture | Vision Foundation Model |

What is Florence-2-large-no-flash-attn?

Florence-2-large-no-flash-attn is a modified version of Microsoft's Florence-2 vision foundation model, adapted to run without the flash-attention mechanism. The large-scale model (0.77B parameters) handles a wide range of vision and vision-language tasks through a prompt-based approach, and was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images.

Implementation Details

The model implements a sequence-to-sequence architecture optimized for both zero-shot and fine-tuned use. It is built on the PyTorch framework and runs in float16 precision on CUDA devices, falling back to float32 on CPU. The modification removes the flash-attention dependency while preserving core functionality, though possibly at some cost in inference speed.

  • Supports multiple vision tasks through simple prompt engineering
  • Processes both image and text inputs simultaneously
  • Implements efficient attention mechanisms without flash-attention requirement
  • Achieves strong performance metrics across various benchmarks
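The loading pattern described above can be sketched with Hugging Face `transformers`. This is a minimal sketch, assuming `torch` and `transformers` are installed and that the repository id is `multimodalart/Florence-2-large-no-flash-attn`; the helper names are illustrative, not part of the model's API:

```python
def pick_precision(cuda_available: bool) -> str:
    # Mirrors the card's precision policy: float16 on CUDA devices, float32 on CPU.
    return "float16" if cuda_available else "float32"


def load_florence2(model_id: str = "multimodalart/Florence-2-large-no-flash-attn"):
    """Load model and processor with the precision/device fallback above."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = getattr(torch, pick_precision(device == "cuda"))
    # trust_remote_code=True is needed: Florence-2 ships custom modeling code,
    # and this fork of that code drops the flash-attention requirement.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor, device
```

Because the precision choice is isolated in `pick_precision`, the CPU fallback can be exercised without a GPU present.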

Core Capabilities

  • Image Captioning (COCO CIDEr score: 135.6)
  • Object Detection (COCO val2017 mAP: 37.5)
  • Dense Region Captioning
  • OCR and Text Recognition
  • Visual Question Answering
  • Phrase Grounding

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for handling multiple vision tasks without task-specific fine-tuning, while maintaining high performance without the flash-attention mechanism. It achieves strong results with far fewer parameters (0.77B) than competitors such as PaLI (17B) or Flamingo (80B).

Q: What are the recommended use cases?

The model excels in various computer vision tasks including image captioning, object detection, OCR, and visual question answering. It's particularly suitable for applications requiring multiple vision capabilities in a single model, especially when flash-attention isn't available in the deployment environment.
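All of these use cases follow one inference pattern: encode the image plus a task prompt, generate, then post-process. A sketch under the assumption that `model` and `processor` were loaded with `trust_remote_code=True` (so the processor exposes Florence-2's custom `post_process_generation` helper); the function name is illustrative:

```python
def run_task(model, processor, image, task: str = "<CAPTION>", device: str = "cpu"):
    """Run one Florence-2 task on a PIL image and return the parsed result."""
    inputs = processor(text=task, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses task-specific output, e.g. boxes for <OD>.
    return processor.post_process_generation(raw, task=task, image_size=image.size)
```

Swapping only the `task` token switches the same call between captioning, detection, OCR, and grounding.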
