CogFlorence-2.2-Large

by thwri

An advanced image-to-text model based on Florence-2-large, fine-tuned on 40K images from the Ejafa/ye-pop dataset with CogVLM2-generated captions. 823M parameters, FP16 precision.

  • Parameter Count: 823M
  • Model Type: Image-to-Text
  • License: MIT
  • Precision: FP16
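The parameter count and precision together set the rough memory footprint of the weights. A quick back-of-the-envelope check (823M parameters at 2 bytes each in FP16):

```python
# Rough memory footprint of the model weights alone (excludes
# activations, KV caches, and framework overhead).
PARAMS = 823_000_000          # 823M parameters
BYTES_PER_PARAM_FP16 = 2      # FP16 stores each weight in 2 bytes

def weight_memory_gib(params: int, bytes_per_param: int) -> float:
    """Return the approximate weight memory in gibibytes."""
    return params * bytes_per_param / 1024**3

print(f"{weight_memory_gib(PARAMS, BYTES_PER_PARAM_FP16):.2f} GiB")  # ≈ 1.53 GiB
```

So the FP16 weights occupy roughly 1.5 GiB, which is why the model fits comfortably on a single consumer GPU.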

What is CogFlorence-2.2-Large?

CogFlorence-2.2-Large is an advanced image-to-text model that builds upon Microsoft's Florence-2-large architecture. This model has been specifically fine-tuned on a carefully curated dataset of 40,000 images from Ejafa/ye-pop, with captions generated using the powerful THUDM/cogvlm2-llama3-chat-19B model and refined with google/gemma-2-9b.
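A minimal inference sketch, assuming the model is published on the Hugging Face Hub as "thwri/CogFlorence-2.2-Large" and follows the standard Florence-2 task-prompt interface (Florence-2 ships custom modeling code, so `trust_remote_code=True` is required; the example image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # FP16 assumes a GPU

model = AutoModelForCausalLM.from_pretrained(
    "thwri/CogFlorence-2.2-Large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "thwri/CogFlorence-2.2-Large", trust_remote_code=True
)

task = "<MORE_DETAILED_CAPTION>"  # Florence-2 captioning task token
image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)

generated = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
text = processor.batch_decode(generated, skip_special_tokens=False)[0]
caption = processor.post_process_generation(text, task=task, image_size=image.size)
print(caption[task])
```

The `<MORE_DETAILED_CAPTION>` task prompt requests the long-form captions this fine-tune was trained to produce; other Florence-2 task tokens remain available through the same interface.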

Implementation Details

The model was trained with the vision encoder frozen to maintain stability. Training used a batch size of 64 with gradient accumulation over 16 steps (an effective batch size of 1024), and an AdamW optimizer with a polynomial learning-rate scheduler. The learning rate was set to 5.12e-05 over 8.36 epochs.
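The stated hyperparameters can be sketched as follows. The decay power of the polynomial scheduler is not given in the card; `power=1.0` (linear decay, a common default) is an assumption here:

```python
# Sketch of the stated training setup: effective batch size and a
# polynomial learning-rate decay. power=1.0 (linear) is an assumption.
BATCH_SIZE = 64
GRAD_ACCUM_STEPS = 16
BASE_LR = 5.12e-05

effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS  # 1024 samples per optimizer step

def polynomial_lr(step: int, total_steps: int, base_lr: float = BASE_LR,
                  end_lr: float = 0.0, power: float = 1.0) -> float:
    """Polynomial decay from base_lr to end_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr

print(effective_batch)            # 1024
print(polynomial_lr(0, 1000))     # 5.12e-05 at the start
print(polynomial_lr(1000, 1000))  # 0.0 at the end
```

Gradient accumulation lets the optimizer see a large effective batch (64 × 16 = 1024) without needing the memory for 1024 samples at once.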

  • Frozen vision encoder architecture
  • Optimized training parameters for stability
  • Post-processed captions for enhanced clarity
  • Efficient FP16 precision implementation

Core Capabilities

  • Detailed image caption generation
  • High-quality visual understanding
  • Efficient processing with reduced precision
  • Robust handling of diverse image types

Frequently Asked Questions

Q: What makes this model unique?

This model combines the powerful Florence-2-large architecture with carefully curated training data and sophisticated caption generation using CogVLM2, making it particularly effective for detailed image description tasks.

Q: What are the recommended use cases?

The model excels in generating detailed, context-aware image captions, making it ideal for content description, accessibility applications, and automated image cataloging systems.
