Swin-GPorTuguese-2

Property	Value
Parameter Count	240M
Model Type	Vision Encoder Decoder
Primary Language	Brazilian Portuguese
Training Dataset	Flickr30K Portuguese
Base Models	Swin Transformer + GPT2-small-portuguese

What is Swin-GPorTuguese-2?

Swin-GPorTuguese-2 is a vision-language model that generates image captions in Brazilian Portuguese. It pairs a Swin Transformer visual encoder pre-trained on ImageNet-1k with a GPT-2 Portuguese language decoder: the encoder turns an image into visual features, and the decoder produces a natural-language description conditioned on them.

Implementation Details

The encoder is a Swin Transformer base model with patch size 4 and window size 7 that processes images at 224x224 resolution. The decoder is pierreguillou's GPT2-small-portuguese model, which supports sequences of up to 1024 tokens. The model reports a CIDEr-D score of 64.71 and a BLEU@4 score of 23.15.

Swin encoder pre-trained on ImageNet-1k
Fine-tuned on a translated Portuguese version of Flickr30K
Supports 224x224 image resolution
Implements vision encoder-decoder architecture
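
To make the encoder numbers concrete, the following sketch works out the token grid that patch size 4 and window size 7 produce on a 224x224 input. It assumes the standard Swin layout of four stages with 2x patch merging between them. The stage count and the function name `swin_token_grid` are illustrative assumptions, not values read from this model's configuration.

```python
def swin_token_grid(image_size=224, patch_size=4, window_size=7, num_stages=4):
    """Return a (side, tokens, windows) tuple per stage for a square input.

    Assumption: each stage after the first halves the grid side via patch
    merging, as in the standard Swin Transformer design.
    """
    side = image_size // patch_size  # patches per side after patch embedding
    stages = []
    for _ in range(num_stages):
        tokens = side * side
        windows = (side // window_size) ** 2  # non-overlapping attention windows
        stages.append((side, tokens, windows))
        side //= 2  # patch merging halves the grid resolution
    return stages


for stage, (side, tokens, windows) in enumerate(swin_token_grid(), start=1):
    print(f"stage {stage}: {side}x{side} grid, {tokens} tokens, {windows} windows")
```

Under these assumptions the first stage holds a 56x56 grid split into 64 windows of 49 tokens each, and the final stage holds a single 7x7 window, that is, 49 visual tokens for the decoder to attend to.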

Core Capabilities

Generate natural Brazilian Portuguese captions for images
Process images at 224x224 resolution
Reach a METEOR score of 44.36 and a ROUGE-L score of 39.39
Support batch inference
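
The capabilities above can be sketched as a small batch-captioning routine built on the Hugging Face `transformers` library. The hub ID `laicsiifes/swin-gportuguese-2`, the helper names `batched` and `caption_images`, and the generation settings (`max_length`, `num_beams`) are assumptions for illustration; check the model's hub page for the exact ID and recommended settings. The heavy dependencies (`torch`, `Pillow`, `transformers`) are imported inside the function so that the batching helper can be used on its own.

```python
from typing import Iterator, List, Sequence

# Assumed hub ID; verify it against the model's hub page before use.
MODEL_ID = "laicsiifes/swin-gportuguese-2"


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def caption_images(paths: Sequence[str], batch_size: int = 8,
                   max_length: int = 25, num_beams: int = 3) -> List[str]:
    """Caption image files in batches (downloads the model on first call)."""
    import torch
    from PIL import Image
    from transformers import (AutoImageProcessor, AutoTokenizer,
                              VisionEncoderDecoderModel)

    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID).eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    processor = AutoImageProcessor.from_pretrained(MODEL_ID)

    captions: List[str] = []
    for chunk in batched(paths, batch_size):
        images = [Image.open(p).convert("RGB") for p in chunk]
        # The processor resizes and normalizes images to the encoder's input size.
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        with torch.no_grad():
            ids = model.generate(pixel_values, max_length=max_length,
                                 num_beams=num_beams)
        captions.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return captions
```

A call such as `caption_images(["foto1.jpg", "foto2.jpg"])` (hypothetical file names) would return one Portuguese caption string per image. Captions are short, so `max_length` can stay far below the decoder's 1024-token limit.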

Frequently Asked Questions

Q: What makes this model unique?

This model targets Brazilian Portuguese image captioning, pairing a Swin Transformer encoder with a Portuguese GPT-2 decoder. It is one of the few models trained specifically for Portuguese image captioning, and its reported scores are competitive with those of similar architectures.

Q: What are the recommended use cases?

The model suits applications that need Portuguese image descriptions, such as accessibility tools, content management systems, and educational resources. It fits best where captions must be accurate and read naturally in Portuguese.