Vokan

Maintained by: ShoukanLabs

Property             Value
Author               ShoukanLabs
License              MIT
Base Architecture    StyleTTS2
Training Resources   300h on H100, 600h on 3090

What is Vokan?

Vokan is a fine-tuned StyleTTS2 model built for expressive zero-shot text-to-speech synthesis. It was trained on 672 speakers drawn from AniSpeech, VCTK, and LibriTTS-R, totaling over 6 days' worth of audio.
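To put the dataset figures in perspective, the short sketch below converts them into total hours and an average amount of audio per speaker. It uses only the two numbers stated above and treats "over 6 days" as a lower bound. The helper name `dataset_scale` is hypothetical, and the result is a mean only: how the audio is actually distributed across speakers is not stated.

```python
# Figures from the description above; "over 6 days" is treated as a lower bound.
SPEAKERS = 672
AUDIO_DAYS_MIN = 6


def dataset_scale(speakers: int, audio_days: float) -> dict:
    """Return total audio hours and mean audio per speaker in minutes.

    Hypothetical helper for illustration; not part of Vokan or StyleTTS2.
    """
    total_hours = audio_days * 24
    minutes_per_speaker = total_hours * 60 / speakers
    return {
        "total_hours": total_hours,
        "minutes_per_speaker": minutes_per_speaker,
    }


scale = dataset_scale(SPEAKERS, AUDIO_DAYS_MIN)
print(f"at least {scale['total_hours']:.0f} h of audio in total")
print(f"at least {scale['minutes_per_speaker']:.1f} min per speaker on average")
```

Under these assumptions the dataset holds at least 144 hours of audio, or roughly 13 minutes per speaker on average.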

Implementation Details

Vokan builds on the StyleTTS2 architecture. Training took 300 hours on an NVIDIA H100 GPU and a further 600 hours on an NVIDIA 3090.

  • Trained on multiple datasets for enhanced diversity
  • Optimized for zero-shot performance
  • Incorporates various accents and speaking styles
  • Designed as a robust base model for further fine-tuning
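The training figures quoted in this section can be combined into a rough compute budget. The sketch below sums the stated hours and converts them to wall-clock days. It assumes one GPU per run and that the runs happened one after the other, neither of which is confirmed here. H100 hours and 3090 hours are not equivalent in throughput, so the total is a count of device-hours, not a normalized compute measure. The function names are hypothetical.

```python
# Device-hours as stated in the text. One GPU per run is an assumption.
TRAINING_RUNS = {"H100": 300, "3090": 600}


def total_device_hours(runs: dict) -> int:
    """Sum the hours across all devices (no throughput normalization)."""
    return sum(runs.values())


def wall_clock_days(runs: dict) -> float:
    """Days of wall-clock time if the runs were sequential (an assumption)."""
    return total_device_hours(runs) / 24


print(f"{total_device_hours(TRAINING_RUNS)} device-hours")
print(f"{wall_clock_days(TRAINING_RUNS):.1f} days if run back to back")
```

With the stated figures this comes to 900 device-hours, or 37.5 days of continuous training under the sequential assumption.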

Core Capabilities

  • High-quality zero-shot speech synthesis
  • Diverse accent and speaker characteristic reproduction
  • Expressive and natural-sounding output
  • Versatile base for custom fine-tuning

Frequently Asked Questions

Q: What makes this model unique?

Vokan stands out for its training on a diverse set of speakers and its focus on expressiveness in zero-shot synthesis. It uses less training data than the original StyleTTS2 and relies on carefully curated, diverse data rather than sheer volume for its performance.

Q: What are the recommended use cases?

The model suits applications that need expressive text-to-speech synthesis, especially where authentic, natural-sounding output matters. It also works well as a base model for further fine-tuning.
