ToriiGate-v0.4-7B

Minthy

ToriiGate-v0.4-7B is a specialized vision-language model for anime/artwork captioning, built on Qwen2-VL with 900k+ training samples and advanced character recognition capabilities.

Property	Value
Model Size	7 Billion Parameters
Base Architecture	Qwen2-VL
Model Type	Vision-Language Model (VLM)
Author	Minthy
Model URL	https://huggingface.co/Minthy/ToriiGate-v0.4-7B

What is ToriiGate-v0.4-7B?

ToriiGate-v0.4-7B is a vision-language model for captioning anime pictures, digital artwork, and other images. It is built on Qwen2-VL and fine-tuned on more than 900,000 artwork-caption pairs, with a focus on understanding complex scenes, cultural concepts, and character interactions in artistic content.

Implementation Details

The model supports several captioning modes: structured output (JSON or Markdown), pre-defined caption variants, long and short descriptions, and bounding box detection. To improve accuracy, a request can be grounded with booru tags, character lists, and trait descriptions.
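
As a rough illustration of how a captioning mode and optional grounding could be combined into one request, here is a minimal Python sketch. The function name, the instruction wording, and the `<tags>`, `<characters>`, and `<traits>` wrappers are all hypothetical; they are not the model's documented prompt format, which should be taken from the model page.

```python
from typing import Mapping, Optional, Sequence


def build_caption_prompt(
    mode: str = "long",
    tags: Optional[Sequence[str]] = None,
    characters: Optional[Sequence[str]] = None,
    traits: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a captioning request with optional grounding blocks.

    The instruction texts and the tag-style wrappers are assumptions made
    for this sketch, not the model's real prompt templates.
    """
    instructions = {
        "long": "Describe the picture in detail.",
        "short": "Describe the picture briefly.",
        "json": "Describe the picture as a JSON object.",
    }
    if mode not in instructions:
        raise ValueError(f"unknown mode: {mode!r}")

    parts = [instructions[mode]]
    if tags:  # booru tags as a comma-separated list
        parts.append("<tags>" + ", ".join(tags) + "</tags>")
    if characters:  # names the caption is allowed to use
        parts.append("<characters>" + ", ".join(characters) + "</characters>")
    if traits:  # one "name: description" line per character
        lines = [f"{name}: {desc}" for name, desc in traits.items()]
        parts.append("<traits>\n" + "\n".join(lines) + "\n</traits>")
    return "\n".join(parts)


prompt = build_caption_prompt(
    mode="json",
    tags=["1girl", "outdoors", "torii"],
    characters=["hakurei reimu"],
)
print(prompt)
```

Keeping each grounding source optional means the same helper covers ungrounded captioning as well as requests that supply only tags or only character names.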

Built on Qwen2-VL architecture with 7B parameters
Trained on 900k+ anime/artwork samples
Supports multiple output formats and grounding methods
Available in quantized variants at 8, 6, and 4 bits per weight (8bpw, 6bpw, 4bpw)
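
To get a feel for what the quantization options mean in practice, the weights-only size can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes a nominal 7 billion parameters; the exact parameter count, the KV cache, activations, and any layers kept at higher precision are not included, so real memory use will be higher.

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10**9 bytes): params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal "7B"; an assumption, not the exact parameter count

for bpw in (8, 6, 4):
    print(f"{bpw}bpw: ~{approx_weight_size_gb(N_PARAMS, bpw):.2f} GB")
# prints:
# 8bpw: ~7.00 GB
# 6bpw: ~5.25 GB
# 4bpw: ~3.50 GB
```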

Core Capabilities

Advanced anime and digital art understanding
Accurate character name recognition and usage
Structured output generation with character-specific details
Scene composition and atmosphere description
Support for multiple captioning styles and formats
Caption review and correction functionality
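
When the structured JSON mode is used downstream, the reply usually has to be parsed defensively, since a language model may wrap the object in extra text. The following sketch pulls the first JSON object out of a reply using only the Python standard library. The field names in the sample reply are hypothetical and do not reflect the model's actual output schema.

```python
import json


def extract_json(reply: str) -> dict:
    """Return the first JSON object embedded in a model reply.

    Tries every "{" in turn, so leading prose or stray braces are skipped.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


# Hypothetical reply; the keys are made up for illustration.
sample_reply = (
    'Here is the caption: {"character_1": "a girl in a shrine maiden outfit", '
    '"background": "a red torii gate at dusk"} End of caption.'
)
caption = extract_json(sample_reply)
print(caption["background"])  # prints: a red torii gate at dusk
```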

Frequently Asked Questions

Q: What makes this model unique?

ToriiGate-v0.4-7B is currently the only small open-source VLM that effectively handles multiple character names and provides structured output designed specifically for anime and artwork content. Its flexible grounding system and multiple captioning modes make it suitable for a range of use cases.

Q: What are the recommended use cases?

The model is ideal for automated artwork captioning, dataset creation for AI training, detailed scene analysis of anime/digital art, and generating structured descriptions for character-centric images. It's particularly useful when accurate character recognition and detailed scene description are required.
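
For the dataset-creation use case, a common convention is to store each caption in a text file next to its image. The sketch below shows one way to do that with a pluggable captioning function. `write_caption_sidecars` and `placeholder_caption` are hypothetical names, and the placeholder stands in for whatever code actually calls the model.

```python
from pathlib import Path
from typing import Callable, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def write_caption_sidecars(
    image_dir: Path,
    caption_fn: Callable[[Path], str],
    overwrite: bool = False,
) -> List[Path]:
    """Write <image stem>.txt beside each image; return the files written."""
    written = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files, including existing captions
        caption_path = image_path.with_suffix(".txt")
        if caption_path.exists() and not overwrite:
            continue  # keep captions that already exist or were hand-edited
        caption = caption_fn(image_path).strip()
        caption_path.write_text(caption + "\n", encoding="utf-8")
        written.append(caption_path)
    return written


def placeholder_caption(image_path: Path) -> str:
    """Stand-in for a real model call; returns a dummy caption."""
    return f"caption for {image_path.name}"
```

Calling `write_caption_sidecars(Path("dataset"), placeholder_caption)` would caption every image in a `dataset` folder that does not yet have a `.txt` file. Skipping existing captions by default makes interrupted runs resumable and protects manual corrections.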