Swin-GPorTuguese-2
Property | Value |
---|---|
Parameter Count | 240M |
Model Type | Vision Encoder Decoder |
Primary Language | Brazilian Portuguese |
Training Dataset | Flickr30K Portuguese |
Base Models | Swin Transformer + GPT2-small-portuguese |
What is Swin-GPorTuguese-2?
Swin-GPorTuguese-2 is a vision-language model that generates image captions in Brazilian Portuguese. It pairs a Swin Transformer visual encoder pre-trained on ImageNet-1k with a GPT-2 Portuguese language decoder: the encoder turns an image into visual features, and the decoder generates the caption from those features.
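As a rough illustration of how such an encoder-decoder pair can be assembled before fine-tuning, the sketch below uses the Hugging Face `transformers` class `VisionEncoderDecoderModel`. The two checkpoint IDs and the helper name `build_captioner` are assumptions made for this example, and the snippet is not the authors' training code.

```python
# Assumed Hub checkpoint IDs; they are not given verbatim in this document.
ENCODER_ID = "microsoft/swin-base-patch4-window7-224"
DECODER_ID = "pierreguillou/gpt2-small-portuguese"


def build_captioner(encoder_id=ENCODER_ID, decoder_id=DECODER_ID):
    """Pair a pre-trained vision encoder with a pre-trained text decoder.

    Hypothetical helper: returns an un-fine-tuned model and its tokenizer.
    """
    # Imported lazily so that importing this module does not require the
    # heavy dependencies or a network connection.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    # Loads both checkpoints and adds cross-attention layers to the decoder.
    # Those new layers start untrained, so captions are meaningless until
    # the combined model is fine-tuned on image-caption pairs.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # GPT-2 tokenizers ship without a padding token; reusing the
    # end-of-text token is a common convention, assumed here.
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.eos_token_id
    return model, tokenizer
```

Calling `build_captioner()` downloads both checkpoints, so it needs network access on first use.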
Implementation Details
The encoder is a Swin Transformer base model with patch size 4 and window size 7 that takes 224x224 images. The decoder is pierreguillou's GPT2-small-portuguese model, which supports sequences of up to 1024 tokens. Reported scores include a CIDEr-D of 64.71 and a BLEU@4 of 23.15.
- Encoder pre-trained on ImageNet-1k for visual understanding
- Fine-tuned on the translated Flickr30K Portuguese dataset
- Supports 224x224 image resolution
- Implements vision encoder-decoder architecture
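The patch and window sizes above determine how many visual tokens the encoder produces. The sketch below works through that arithmetic. The four-stage layout, the 2x patch merging between stages, and the base channel width of 128 come from the published Swin-B configuration and are assumptions here, since this document does not state them. The function name `swin_stage_shapes` is hypothetical.

```python
IMAGE_SIZE = 224   # input resolution stated above
PATCH_SIZE = 4     # patch size stated above
WINDOW_SIZE = 7    # attention window size stated above
BASE_DIM = 128     # assumption: Swin-B channel width at stage 1
NUM_STAGES = 4     # assumption: standard four-stage Swin layout


def swin_stage_shapes(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE,
                      window_size=WINDOW_SIZE, base_dim=BASE_DIM,
                      num_stages=NUM_STAGES):
    """Return the token grid, window count, and width of each stage."""
    side = image_size // patch_size  # 224 / 4 = 56 patches per side
    stages = []
    for stage in range(num_stages):
        stages.append({
            "stage": stage + 1,
            "grid": (side, side),
            "tokens": side * side,
            # Non-overlapping attention windows of window_size x window_size.
            "windows": (side // window_size) ** 2,
            "dim": base_dim * 2 ** stage,
        })
        side //= 2  # patch merging halves each spatial dimension
    return stages


for row in swin_stage_shapes():
    print(row)
```

Under these assumptions the final stage is a 7x7 grid, so the decoder cross-attends over 49 visual tokens of width 1024 per image.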
Core Capabilities
- Generate natural Brazilian Portuguese captions for images
- Process images at 224x224 resolution
- Achieve a METEOR of 44.36 and a ROUGE-L of 39.39
- Run inference on single images or batches
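A minimal batch-captioning sketch with the Hugging Face `transformers` library might look like the following. The Hub ID in `MODEL_ID`, the helper names `batched` and `caption_images`, and the default batch size and generation length are all assumptions for illustration. The sketch also assumes the repository ships an image processor and tokenizer alongside the weights.

```python
from typing import Iterator, List, Sequence

# Assumed Hub ID; this document does not state where the weights are hosted.
MODEL_ID = "laicsiifes/swin-gportuguese-2"


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def caption_images(paths: Sequence[str], batch_size: int = 8,
                   max_new_tokens: int = 25,
                   model_id: str = MODEL_ID) -> List[str]:
    """Return one caption per image path (hypothetical helper)."""
    # Imported lazily so that batched() works without these packages.
    import torch
    from PIL import Image
    from transformers import (AutoImageProcessor, AutoTokenizer,
                              VisionEncoderDecoderModel)

    model = VisionEncoderDecoderModel.from_pretrained(model_id).eval()
    processor = AutoImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    captions: List[str] = []
    for batch in batched(paths, batch_size):
        images = [Image.open(path).convert("RGB") for path in batch]
        # The processor resizes and normalizes images to the encoder's input.
        pixel_values = processor(images=images,
                                 return_tensors="pt").pixel_values
        with torch.no_grad():
            ids = model.generate(pixel_values=pixel_values,
                                 max_new_tokens=max_new_tokens)
        captions.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return captions
```

Usage would be along the lines of `caption_images(["foto1.jpg", "foto2.jpg"])`, where the file names are placeholders.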
Frequently Asked Questions
Q: What makes this model unique?
The model is trained specifically for image captioning in Brazilian Portuguese, a task for which few models exist. It combines a pre-trained Swin Transformer encoder with a Portuguese GPT-2 decoder and shows performance competitive with similar architectures.
Q: What are the recommended use cases?
The model suits applications that need accurate, natural-sounding Portuguese image descriptions, such as accessibility tools, content management systems, and educational resources.