vit-rugpt2-image-captioning

tuman

Russian image captioning model combining a ViT encoder and a ruGPT2 decoder, trained on a Russian translation of the COCO2014 dataset. Presented as the first image captioning model for the Russian language.

Property	Value
Model Type	Vision Encoder-Decoder
Architecture	ViT + ruGPT2
Primary Language	Russian
BLEU Score	8.672

What is vit-rugpt2-image-captioning?

vit-rugpt2-image-captioning is an image captioning model that generates captions in Russian. It pairs a Vision Transformer (ViT) encoder, which turns the image into a sequence of embeddings, with a Russian GPT-2 decoder, which generates the caption text from those embeddings. The model was trained on a Russian translation of the COCO2014 dataset and is presented as the first image captioning model dedicated to Russian.

Implementation Details

The model uses google/vit-base-patch16-224-in21k as the encoder and sberbank-ai/rugpt3large_based_on_gpt2 as the decoder (a GPT-2-based checkpoint, as its name indicates, which is where the "ruGPT2" in the model name comes from). It achieves a BLEU score of 8.672, with n-gram precisions of 30.567 for unigrams, 7.895 for bigrams, and 3.261 for trigrams.
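To make the relationship between the n-gram precisions and the BLEU score concrete, here is a minimal sketch of the usual BLEU combination: a brevity penalty multiplied by the geometric mean of the n-gram precisions. The function name `bleu_from_precisions` is hypothetical. Standard BLEU normally also includes a 4-gram precision, and neither that value nor the brevity penalty is listed on this page, so the result computed from the three reported precisions is not expected to reproduce 8.672.

```python
import math
from typing import Sequence


def bleu_from_precisions(precisions_pct: Sequence[float],
                         brevity_penalty: float = 1.0) -> float:
    """Combine n-gram precisions (in percent) into a BLEU-style score.

    Uses uniform weights: BP * exp(mean(log p_n)), i.e. the brevity
    penalty times the geometric mean of the precisions.
    """
    if not precisions_pct or any(p <= 0 for p in precisions_pct):
        # A zero precision drives the geometric mean to zero.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions_pct) / len(precisions_pct)
    return brevity_penalty * math.exp(log_mean)


# Precisions reported on this page (unigram, bigram, trigram only).
reported = [30.567, 7.895, 3.261]

# Brevity penalty of 1.0 is an assumption; the page does not report it.
partial_score = bleu_from_precisions(reported)
print(f"Geometric mean of the three reported precisions: {partial_score:.3f}")
```

The gap between this partial figure and the reported 8.672 is consistent with a lower 4-gram precision, a brevity penalty below 1, or both, but the page gives no way to tell which.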

Uses a transformer-based architecture for both vision and text processing
Supports batch processing of images
Generates captions with beam search, with configurable parameters such as beam count and maximum length
Compatible with the Hugging Face transformers library
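The points above (batch processing, beam search, transformers compatibility) can be sketched as follows. This is a minimal sketch, not the author's reference code. The Hub id `tuman/vit-rugpt2-image-captioning` is an assumption assembled from the author and model names on this page, and the beam search values are illustrative, not settings taken from the model card. Heavy imports are deferred into the functions so the module can be imported without downloading anything.

```python
from typing import Dict, List

# Assumed Hub id, built from the author and model names on this page.
MODEL_ID = "tuman/vit-rugpt2-image-captioning"

# Illustrative beam search settings (assumption, not from the model card).
DEFAULT_GEN_KWARGS: Dict[str, int] = {"max_length": 16, "num_beams": 4}


def merged_generation_kwargs(**overrides: int) -> Dict[str, int]:
    """Return the default generation settings with caller overrides applied."""
    return {**DEFAULT_GEN_KWARGS, **overrides}


def load_captioner(model_id: str = MODEL_ID):
    """Load the model, image processor and tokenizer (downloads weights)."""
    from transformers import (AutoTokenizer, ViTImageProcessor,
                              VisionEncoderDecoderModel)

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    processor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model.eval()
    return model, processor, tokenizer


def caption_images(paths: List[str], model, processor, tokenizer,
                   **overrides: int) -> List[str]:
    """Caption a batch of image files in one generate() call."""
    import torch
    from PIL import Image

    # The encoder expects three-channel input, so convert everything to RGB.
    images = [Image.open(p).convert("RGB") for p in paths]
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(model.device)

    with torch.no_grad():
        output_ids = model.generate(pixel_values,
                                    **merged_generation_kwargs(**overrides))

    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [c.strip() for c in captions]
```

A call would look like `caption_images(["photo.jpg"], *load_captioner(), num_beams=8)`, where `photo.jpg` is a placeholder path. Raising `num_beams` widens the search at the cost of slower generation.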

Core Capabilities

Russian language image caption generation
Support for RGB image processing
Beam search optimization for better caption quality
Easy integration through transformers pipeline
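For the pipeline route mentioned in the last item, a minimal sketch might look like the following. It assumes a transformers version that provides the `image-to-text` pipeline task and, again, that the model is published under the Hub id `tuman/vit-rugpt2-image-captioning` (inferred from this page, not confirmed by it). The helper names `build_captioning_pipeline` and `extract_captions` are hypothetical.

```python
from typing import Dict, List

# Assumed Hub id, built from the author and model names on this page.
MODEL_ID = "tuman/vit-rugpt2-image-captioning"


def build_captioning_pipeline(model_id: str = MODEL_ID):
    """Create an image-to-text pipeline (downloads weights on first use)."""
    from transformers import pipeline

    return pipeline("image-to-text", model=model_id)


def extract_captions(outputs: List[Dict[str, str]]) -> List[str]:
    """Pull caption strings out of the pipeline result for a single image.

    For one image the pipeline returns a list of dicts, each with a
    "generated_text" key.
    """
    return [item["generated_text"].strip() for item in outputs]
```

Usage would be along the lines of `extract_captions(build_captioning_pipeline()("photo.jpg"))`, with `photo.jpg` standing in for a real image path or URL.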

Frequently Asked Questions

Q: What makes this model unique?

It is presented as the first image captioning model trained specifically to produce Russian-language output, addressing the lack of captioning models for languages other than English.

Q: What are the recommended use cases?

The model is suited to automated image description in Russian content management systems, accessibility applications such as alternative text for images, and content cataloging where Russian-language output is required.