Lyric Alignment Model
| Property | Value |
|---|---|
| License | CC-BY-NC-4.0 |
| Paper | CTC-Segmentation Paper |
| Language | Vietnamese |
| Framework | PyTorch |
What is lyric-alignment?
The lyric-alignment model aligns Vietnamese song lyrics with the audio segments in which they are sung. It is built on the wav2vec2 architecture and uses CTC-Segmentation to map each lyric to its time position in a music recording. The model reaches an IoU (Intersection over Union) score of 0.632 on the public leaderboard.
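To make the IoU figure concrete, here is a minimal sketch of how an interval IoU can be computed for predicted versus reference word timings. The function names and the simple averaging over words are assumptions for illustration; the leaderboard's exact scoring protocol is not described here.

```python
def interval_iou(pred, truth):
    """IoU of two (start, end) time intervals given in seconds."""
    # Length of the overlap, clamped at zero for disjoint intervals.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, truths):
    """Average IoU over paired predicted and reference intervals (assumed scoring)."""
    pairs = list(zip(preds, truths))
    if not pairs:
        return 0.0
    return sum(interval_iou(p, t) for p, t in pairs) / len(pairs)
```

For example, a predicted word span of 0.0 to 2.0 s against a reference span of 1.0 to 3.0 s overlaps for 1 s out of a 3 s union, giving an IoU of 1/3.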
Implementation Details
Alignment proceeds in three steps: estimating frame-wise label probabilities, building a trellis matrix that holds the probability of each label being aligned at each time step, and finding the most likely path through that trellis. The model is fine-tuned on 1,500 hours of audio, and lyrics are converted to their spoken form to improve accuracy.
- Based on the wav2vec2-large-vi-vlsp2020 model
- Trained on 13,000 hours of pre-training data plus 1,500 hours of fine-tuning data
- Implements CTC-Segmentation for precise temporal alignment
- Includes special handling for English words and numbers in Vietnamese context
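The three-step process described above can be sketched in plain Python. This is a simplified Viterbi forced alignment over CTC outputs, not the CTC-Segmentation implementation the model uses; the function names, the blank id of 0, and the 20 ms frame duration (the usual wav2vec2 stride at 16 kHz) are assumptions. Step 1, the frame-wise log-probabilities, is taken as input, since it would come from the acoustic model.

```python
NEG_INF = float("-inf")


def ctc_forced_align(log_probs, tokens, blank=0):
    """Align `tokens` to frames with Viterbi search over a CTC trellis.

    log_probs: T frames, each a list of per-label log-probabilities (step 1).
    tokens: label ids of the transcript, without blanks.
    Returns one (token, first_frame, last_frame) tuple per token.
    """
    # Interleave blanks: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)

    # Step 2: trellis[t][s] is the best log-score of a path in state s at frame t.
    trellis = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    trellis[0][0] = log_probs[0][ext[0]]
    if S > 1:
        trellis[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            candidates = [(trellis[t - 1][s], s)]  # stay in the same state
            if s >= 1:
                candidates.append((trellis[t - 1][s - 1], s - 1))  # advance
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append((trellis[t - 1][s - 2], s - 2))  # skip a blank
            best, prev = max(candidates)
            trellis[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev

    # Step 3: backtrack from the better of the two valid final states.
    s = S - 1
    if S > 1 and trellis[T - 1][S - 2] > trellis[T - 1][S - 1]:
        s = S - 2
    if trellis[T - 1][s] == NEG_INF:
        raise ValueError("transcript does not fit into the available frames")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    # Token i occupies the frames whose state is 2 * i + 1.
    spans = []
    for i, tok in enumerate(tokens):
        frames = [t for t, state in enumerate(path) if state == 2 * i + 1]
        spans.append((tok, frames[0], frames[-1]))
    return spans


def span_to_seconds(span, frame_sec=0.02):
    """Convert a frame span to seconds; 20 ms per frame is an assumption."""
    tok, first, last = span
    return tok, first * frame_sec, (last + 1) * frame_sec
```

Word-level timings would then follow by merging the spans of the characters that belong to each word.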
Core Capabilities
- Precise word-level temporal alignment in audio
- Handles mixed Vietnamese-English lyrics
- Supports number format and special character conversion
- Achieves a word error rate (WER) of 0.2267 on the Zalo public dataset
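For reference, WER is the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below shows that standard definition; the function name and whitespace tokenization are assumptions, and the text normalization applied before scoring on the Zalo dataset is not specified here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A WER of 0.2267 therefore corresponds to roughly 23 word errors per 100 reference words.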
Frequently Asked Questions
Q: What makes this model unique?
The model combines the wav2vec2 architecture with CTC-Segmentation for Vietnamese lyrics and adds dedicated handling for mixed-language content and special characters. Converting written text to its spoken form lets it align lyrics that contain numbers, symbols, or English words, which are common in real songs.
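A rough illustration of written-to-spoken conversion is sketched below. It is not the model's actual normalization: the lookup tables are hypothetical, and numbers are read digit by digit, which is a simplification of how Vietnamese numbers are normally spoken.

```python
import re
import unicodedata

# Hypothetical lookup tables, for illustration only.
DIGITS = {
    "0": "không", "1": "một", "2": "hai", "3": "ba", "4": "bốn",
    "5": "năm", "6": "sáu", "7": "bảy", "8": "tám", "9": "chín",
}
SYMBOLS = {"%": "phần trăm", "&": "và"}


def to_spoken_form(text):
    """Replace digits and a few symbols with words, then strip punctuation."""
    # Compose diacritics so accented letters count as word characters.
    text = unicodedata.normalize("NFC", text)
    # Simplification: spell out each digit separately.
    text = re.sub(
        r"[0-9]+",
        lambda m: " " + " ".join(DIGITS[d] for d in m.group()) + " ",
        text,
    )
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, f" {spoken} ")
    # Drop any remaining punctuation and normalize case and whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.lower().split())
```

A fuller converter would read whole numbers ("24" as "hai mươi bốn") and map English words to a pronunciation the acoustic model can match; both are outside this sketch.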
Q: What are the recommended use cases?
The model is ideal for karaoke applications, music analysis, and automatic subtitle generation for Vietnamese songs. It's particularly useful when precise temporal alignment between lyrics and audio is required.