vit-rugpt2-image-captioning

tuman

Russian image captioning model that combines a ViT encoder with a ruGPT2 decoder, trained on a Russian translation of the COCO2014 dataset. Presented as the first image captioning model for the Russian language.

Property: Value
Model Type: Vision Encoder-Decoder
Architecture: ViT + ruGPT2
Primary Language: Russian
BLEU Score: 8.672

What is vit-rugpt2-image-captioning?

vit-rugpt2-image-captioning is an image captioning model designed for the Russian language. It pairs a Vision Transformer (ViT) encoder, which turns an image into a sequence of patch embeddings, with a Russian GPT-2 decoder, which generates the caption text from those embeddings. The model was trained on a Russian-translated version of the COCO2014 dataset and is presented as the first dedicated image captioning model for Russian.

Implementation Details

The model uses google/vit-base-patch16-224-in21k as the encoder and sberbank-ai/rugpt3large_based_on_gpt2 as the decoder. It achieves a BLEU score of 8.672, with n-gram precisions of 30.567 for unigrams, 7.895 for bigrams, and 3.261 for trigrams. Precision drops as n grows because a longer word sequence in the generated caption is less likely to appear verbatim in a reference caption.
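
The page does not say which BLEU implementation or tokenization produced these figures, so the following is only a minimal sketch of the clipped n-gram precision that BLEU is built on, applied to a made-up caption pair. The sentences and the helper names (ngrams, modified_precision) are illustrative assumptions, not part of the model or its evaluation code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision (in percent) of one candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return 100.0 * clipped / total


# Toy example (hypothetical captions): one word differs between the two.
candidate = "кошка сидит на диване".split()
reference = "кошка лежит на диване".split()

precisions = [modified_precision(candidate, reference, n) for n in (1, 2, 3)]
for n, p in zip((1, 2, 3), precisions):
    print(f"{n}-gram precision: {p:.1f}")
# 1-gram precision: 75.0
# 2-gram precision: 33.3
# 3-gram precision: 0.0
```

A single differing word breaks every bigram and trigram that contains it, which is the same effect that makes the reported trigram precision much lower than the unigram precision. The full BLEU score additionally combines the precisions with a geometric mean and a brevity penalty, which this sketch leaves out.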

  • Uses a transformer architecture for both the vision encoder and the text decoder
  • Supports batch processing of images
  • Implements beam search with configurable parameters
  • Compatible with Hugging Face's transformers library
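
The points above can be sketched with the transformers library roughly as follows. This is a hedged example, not the author's reference code: the Hub id "tuman/vit-rugpt2-image-captioning" is inferred from the author and model names on this page, and the beam-search settings (max_length=16, num_beams=4) are illustrative values rather than documented defaults.

```python
# Assumption: Hub id inferred from the author and model names on this page.
MODEL_ID = "tuman/vit-rugpt2-image-captioning"

# Illustrative beam-search settings (not taken from the page); tune for your data.
GEN_KWARGS = {"max_length": 16, "num_beams": 4}


def caption_images(image_paths, model_id=MODEL_ID, gen_kwargs=None):
    """Return one caption per image path. Loads (and on first use downloads) the model."""
    # Heavy dependencies are imported lazily so this module imports without them.
    import torch
    from PIL import Image
    from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = VisionEncoderDecoderModel.from_pretrained(model_id).to(device)
    processor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The encoder expects RGB input, so grayscale or RGBA images are converted first.
    images = [Image.open(path).convert("RGB") for path in image_paths]

    # All images are preprocessed into a single batch tensor.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values.to(device)

    # Beam search is controlled through the generation keyword arguments.
    with torch.no_grad():
        output_ids = model.generate(pixel_values, **(gen_kwargs or GEN_KWARGS))

    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

Calling caption_images(["photo.jpg"]) with a local image (the file name here is hypothetical) should return a list containing one Russian caption. Raising num_beams widens the search at the cost of slower generation, and the model is reloaded on every call in this sketch, so a real application would load it once and reuse it.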

Core Capabilities

  • Russian language image caption generation
  • Support for RGB image processing
  • Beam search decoding to improve caption quality
  • Easy integration through transformers pipeline

Frequently Asked Questions

Q: What makes this model unique?

It is presented as the first image captioning model trained specifically to produce Russian output, which addresses the shortage of captioning models for languages other than English.

Q: What are the recommended use cases?

The model is suited to automated image description in Russian content management systems, accessibility applications, and content cataloging where Russian-language output is required.
