WD ViT-Large Tagger v3
| Property | Value |
|---|---|
| Parameter Count | 315M |
| Model Type | Vision Transformer |
| License | Apache 2.0 |
| Tensor Type | F32 |
| Framework | timm, ONNX, Safetensors |
What is wd-vit-large-tagger-v3?
WD ViT-Large Tagger v3 is a state-of-the-art image tagging model built on the Vision Transformer architecture. Trained on the extensive Danbooru dataset, it specializes in identifying and tagging anime and manga-style images with high precision. The model represents a significant upgrade from its predecessors, featuring enhanced compatibility with the timm library and improved batch processing capabilities.
Implementation Details
The model was trained using the JAX-CV framework with TPU support from the TRC program. It covers images from the Danbooru dataset up to ID 7220105 and uses an ID-based training-validation split: images whose IDs fall in the modulo range 0000-0899 form the training set, while those in the modulo range 0950-0999 form the validation set.
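The modulo-based split above can be sketched as a small helper. Note the modulus value of 1000 is an assumption for illustration; the card states only the bucket ranges:

```python
def danbooru_split(image_id: int, modulus: int = 1000) -> str:
    """Assign an image to a split by its Danbooru ID (modulus is assumed)."""
    bucket = image_id % modulus
    if bucket <= 899:
        return "train"
    if 950 <= bucket <= 999:
        return "val"
    return "unused"  # the 900-949 bucket is held out entirely
```

Buckets 900-949 fall in neither range, leaving a gap between the training and validation sets.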
- Achieves F1 score of 0.4674 at threshold 0.2606
- Supports batch inference in ONNX format
- Requires onnxruntime >= 1.17.0
- Compatible with timm library for easy integration
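A minimal sketch of batched ONNX inference might look like the following. The 448-pixel input size, white padding, BGR channel order, and NHWC layout are assumptions based on common conventions for WD taggers, not specifics confirmed by this card; the model path is a placeholder:

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad an HxWx3 uint8 image onto a square canvas (white fill assumed)."""
    h, w = img.shape[:2]
    side = max(h, w)
    canvas = np.full((side, side, 3), fill, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize, kept dependency-free for this sketch."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[ys][:, xs]

def preprocess(img: np.ndarray, size: int = 448) -> np.ndarray:
    """RGB uint8 image -> float32 BGR array (size/layout are assumptions)."""
    x = resize_nearest(pad_to_square(img), size).astype(np.float32)
    return x[:, :, ::-1]  # RGB -> BGR

def run_batch(model_path: str, batch: np.ndarray) -> np.ndarray:
    """Batched ONNX inference; needs onnxruntime >= 1.17.0 per the card."""
    import onnxruntime as ort  # lazy import keeps the sketch importable
    session = ort.InferenceSession(model_path,
                                   providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: batch})[0]
```

Preprocessed images can be stacked with `np.stack` into a single batch before calling `run_batch`, which is where the >= 1.17.0 onnxruntime requirement comes in.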
Core Capabilities
- Comprehensive tagging support for ratings, characters, and general tags
- Filtered training on high-quality data (10+ general tags per image)
- Tag coverage for items with 600+ image examples
- Updated tag database through February 2024
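The model emits a per-tag confidence score, and the F1 figure quoted above corresponds to a cutoff of 0.2606. A sketch of applying that threshold, using a hypothetical tag vocabulary (the real model ships its full tag list separately):

```python
import numpy as np

# Hypothetical placeholder vocabulary for illustration only.
TAGS = ["1girl", "solo", "smile", "outdoors"]

def scores_to_tags(scores, threshold: float = 0.2606):
    """Keep tags whose score clears the threshold, highest score first."""
    scores = np.asarray(scores, dtype=np.float64)
    keep = [(TAGS[i], float(s)) for i, s in enumerate(scores)
            if s >= threshold]
    return sorted(keep, key=lambda pair: pair[1], reverse=True)
```

For example, `scores_to_tags([0.91, 0.30, 0.12, 0.40])` drops `smile` (0.12 is below the cutoff) and returns the remaining tags in descending score order. Raising the threshold trades recall for precision.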
Frequently Asked Questions
Q: What makes this model unique?
This model combines the power of Vision Transformers with extensive anime/manga domain knowledge, offering improved batch processing and broader framework compatibility compared to previous versions. Its training on a carefully curated dataset ensures high-quality tag predictions.
Q: What are the recommended use cases?
The model is ideal for automated tagging of anime and manga-style images, content organization, and database management. It's particularly useful for large-scale image classification tasks where accurate tag prediction is crucial.