wd-vit-tagger-v3

SmilingWolf

Vision Transformer-based image tagger with 94.6M parameters, trained on the Danbooru dataset for multi-label classification of anime/manga artwork.

Property	Value
Parameter Count	94.6M
License	Apache-2.0
Framework	timm, ONNX, Safetensors
Tensor Type	F32

What is wd-vit-tagger-v3?

WD ViT Tagger v3 is a Vision Transformer model for multi-label image classification, built specifically for tagging anime and manga artwork. Developed by SmilingWolf as a successor to earlier WD tagger releases, it reports an F1 score of 0.4402 at a decision threshold of 0.2614.
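As a rough illustration of how the reported threshold might be applied at inference time, the sketch below keeps every tag whose score exceeds 0.2614. It assumes the model outputs have already been converted to per-tag probabilities; the helper name `select_tags`, the tag names, and the scores are made up for the example.

```python
import numpy as np

def select_tags(probs, tag_names, threshold=0.2614):
    """Return (tag, probability) pairs above the threshold, highest score first."""
    probs = np.asarray(probs, dtype=np.float32)
    keep = np.flatnonzero(probs > threshold)   # indices of tags that pass
    order = keep[np.argsort(-probs[keep])]     # sort the survivors by score, descending
    return [(tag_names[i], float(probs[i])) for i in order]

# Hypothetical scores for four hypothetical tags.
tag_names = ["1girl", "solo", "outdoors", "hat"]
probs = [0.98, 0.61, 0.12, 0.27]

selected = select_tags(probs, tag_names)
print([tag for tag, _ in selected])
```

Raising the threshold trades recall for precision, so 0.2614 is best treated as a starting point rather than a fixed setting.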

Implementation Details

The model was trained with the JAX-CV framework on TPUs provided by the TRC (TPU Research Cloud) program. It processes Danbooru images and predicts three kinds of tags: ratings, characters, and general tags. The training set consists of images whose ID modulo 1000 falls in the range 0000-0899; validation uses images whose ID modulo 1000 falls in the range 0950-0999.
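The ID-based split described above can be sketched in a few lines. This is an illustration only: it assumes the modulus is 1000 (which the four-digit ranges imply), and the label "unused" for the 0900-0949 range is a placeholder, since the text does not say what happens to those images. The function name `assign_split` is hypothetical.

```python
def assign_split(image_id: int) -> str:
    """Map a Danbooru image ID to a dataset split using its last three digits."""
    bucket = image_id % 1000
    if bucket <= 899:
        return "train"        # 0000-0899
    if bucket >= 950:
        return "validation"   # 0950-0999
    return "unused"           # 0900-0949: not described in the text (placeholder label)

print(assign_split(7654321), assign_split(1000950), assign_split(12900))
```

Splitting on the ID rather than at random keeps the assignment deterministic, so the same image always lands in the same split when the dataset is rebuilt.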

Trained on images with at least 10 general tags
Tags with fewer than 600 images were filtered out
Implements tag frequency-based loss scaling
Compatible with both timm and ONNX runtimes
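The two filtering rules in the list above (at least 10 general tags per image, at least 600 images per tag) could be applied along the following lines. This is a minimal sketch, not the author's pipeline: the `posts` structure, the function name `filter_dataset`, and the order of the two steps (images first, then tags) are assumptions.

```python
from collections import Counter

def filter_dataset(posts, min_general_tags=10, min_images_per_tag=600):
    """posts maps image_id -> list of general tags.

    Returns the images that pass the per-image rule and the tag vocabulary
    that passes the per-tag rule.
    """
    # Rule 1: keep images that carry enough general tags.
    kept = {image_id: tags for image_id, tags in posts.items()
            if len(tags) >= min_general_tags}
    # Rule 2: keep tags that appear on enough of the kept images.
    counts = Counter(tag for tags in kept.values() for tag in set(tags))
    vocab = {tag for tag, n in counts.items() if n >= min_images_per_tag}
    return kept, vocab

# Toy data with lowered limits so the effect is visible.
posts = {1: ["a", "b", "c"], 2: ["a", "b"], 3: ["a"]}
kept, vocab = filter_dataset(posts, min_general_tags=2, min_images_per_tag=2)
print(sorted(kept), sorted(vocab))
```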

Core Capabilities

Multi-label image classification
Batch inference with the ONNX export
Handles ratings, character identification, and general tag assignment
Improved class imbalance handling through frequency-based loss scaling
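The text does not give the formula behind the frequency-based loss scaling, so the following is only one plausible way to express the idea: rare tags receive a larger weight in a binary cross-entropy loss than common ones. The inverse-frequency weighting, the `power` parameter, and both function names are assumptions made for illustration, and NumPy is used here instead of the JAX training code.

```python
import numpy as np

def frequency_weights(tag_counts, power=0.5):
    """Per-tag weights that grow as a tag gets rarer (assumed scheme)."""
    counts = np.asarray(tag_counts, dtype=np.float64)
    weights = (counts.max() / counts) ** power
    return weights / weights.mean()   # normalise so the average weight is 1

def weighted_bce(probs, targets, weights, eps=1e-7):
    """Binary cross-entropy over (batch, tags), scaled per tag by weights."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=np.float64)
    per_tag = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float((per_tag * weights).mean())

# Hypothetical image counts for three tags, from very common to rare.
tag_counts = [240000, 2400, 600]
weights = frequency_weights(tag_counts)

probs = np.array([[0.9, 0.2, 0.1], [0.8, 0.7, 0.4]])
targets = np.array([[1, 0, 0], [1, 1, 1]])
loss = weighted_bce(probs, targets, weights)
print(weights, loss)
```

With `power=0.5` the rarest tag here is weighted 20 times more than the most common one; setting `power=0` recovers the unweighted loss.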

Frequently Asked Questions

Q: What makes this model unique?

The model pairs a Vision Transformer backbone with training specific to anime and manga artwork. Compared with previous versions, it is described as achieving higher F1 scores and handling class imbalance better, the latter through tag frequency-based loss scaling.

Q: What are the recommended use cases?

The model is suited to automated tagging of anime and manga artwork, content organization, and image database management systems that need fine-grained tags for artwork.