CogFlorence-2.2-Large

by thwri

An advanced image-to-text model based on Florence-2-large, fine-tuned on 40K images from the Ejafa/ye-pop dataset with CogVLM2-generated captions. 823M parameters, FP16 precision.

  • Parameter Count: 823M
  • Model Type: Image-to-Text
  • License: MIT
  • Precision: FP16
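The parameter count and precision together set the rough memory footprint of the weights. A quick back-of-the-envelope check (823M parameters at 2 bytes each in FP16):

```python
# Rough memory footprint of the model weights alone (excludes
# activations, KV caches, and framework overhead).
PARAMS = 823_000_000          # 823M parameters
BYTES_PER_PARAM_FP16 = 2      # FP16 stores each weight in 2 bytes

def weight_memory_gib(params: int, bytes_per_param: int) -> float:
    """Return the approximate weight memory in gibibytes."""
    return params * bytes_per_param / 1024**3

print(f"{weight_memory_gib(PARAMS, BYTES_PER_PARAM_FP16):.2f} GiB")  # ≈ 1.53 GiB
```

So the FP16 weights occupy roughly 1.5 GiB, which is why the model fits comfortably on a single consumer GPU.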

What is CogFlorence-2.2-Large?

CogFlorence-2.2-Large is an advanced image-to-text model that builds upon Microsoft's Florence-2-large architecture. This model has been specifically fine-tuned on a carefully curated dataset of 40,000 images from Ejafa/ye-pop, with captions generated using the powerful THUDM/cogvlm2-llama3-chat-19B model and refined with google/gemma-2-9b.
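A minimal inference sketch, assuming the model is published on the Hugging Face Hub as "thwri/CogFlorence-2.2-Large" and follows the standard Florence-2 task-prompt interface (Florence-2 ships custom modeling code, so `trust_remote_code=True` is required; the example image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # FP16 assumes a GPU

model = AutoModelForCausalLM.from_pretrained(
    "thwri/CogFlorence-2.2-Large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "thwri/CogFlorence-2.2-Large", trust_remote_code=True
)

task = "<MORE_DETAILED_CAPTION>"  # Florence-2 captioning task token
image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)

generated = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
text = processor.batch_decode(generated, skip_special_tokens=False)[0]
caption = processor.post_process_generation(text, task=task, image_size=image.size)
print(caption[task])
```

The `<MORE_DETAILED_CAPTION>` task prompt requests the long-form captions this fine-tune was trained to produce; other Florence-2 task tokens remain available through the same interface.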

Implementation Details

The model was trained with the vision encoder frozen to maintain stability. Training used a batch size of 64 with gradient accumulation over 16 steps (an effective batch size of 1024), and an AdamW optimizer with a polynomial learning-rate scheduler. The learning rate was set to 5.12e-05 over 8.36 epochs.
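The stated hyperparameters can be sketched as follows. The decay power of the polynomial scheduler is not given in the card; `power=1.0` (linear decay, a common default) is an assumption here:

```python
# Sketch of the stated training setup: effective batch size and a
# polynomial learning-rate decay. power=1.0 (linear) is an assumption.
BATCH_SIZE = 64
GRAD_ACCUM_STEPS = 16
BASE_LR = 5.12e-05

effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS  # 1024 samples per optimizer step

def polynomial_lr(step: int, total_steps: int, base_lr: float = BASE_LR,
                  end_lr: float = 0.0, power: float = 1.0) -> float:
    """Polynomial decay from base_lr to end_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr

print(effective_batch)            # 1024
print(polynomial_lr(0, 1000))     # 5.12e-05 at the start
print(polynomial_lr(1000, 1000))  # 0.0 at the end
```

Gradient accumulation lets the optimizer see a large effective batch (64 × 16 = 1024) without needing the memory for 1024 samples at once.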

  • Frozen vision encoder architecture
  • Optimized training parameters for stability
  • Post-processed captions for enhanced clarity
  • Efficient FP16 precision implementation

Core Capabilities

  • Detailed image caption generation
  • High-quality visual understanding
  • Efficient processing with reduced precision
  • Robust handling of diverse image types

Frequently Asked Questions

Q: What makes this model unique?

This model combines the powerful Florence-2-large architecture with carefully curated training data and sophisticated caption generation using CogVLM2, making it particularly effective for detailed image description tasks.

Q: What are the recommended use cases?

The model excels in generating detailed, context-aware image captions, making it ideal for content description, accessibility applications, and automated image cataloging systems.
