Lyric Alignment Model

Property	Value
License	CC-BY-NC-4.0
Paper	CTC-Segmentation Paper
Language	Vietnamese
Framework	PyTorch

What is lyric-alignment?

The lyric-alignment model aligns Vietnamese song lyrics with their corresponding audio segments. Built on the wav2vec2 architecture, it uses CTC-Segmentation to map lyrics to their temporal positions in music recordings. The model achieves an IoU (Intersection over Union) score of 0.632 on the public leaderboard.
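To make the IoU figure concrete, here is a minimal sketch of how an interval IoU can be computed for word timings. The exact scoring rule used by the leaderboard is not stated here, so the per-word averaging and the function names `interval_iou` and `mean_iou` are assumptions for illustration only.

```python
def interval_iou(pred, truth):
    """IoU of two (start, end) time intervals given in the same unit."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, truths):
    """Average IoU over paired word intervals (assumed aggregation)."""
    scores = [interval_iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a predicted word spanning 0 to 2 seconds against a reference spanning 1 to 3 seconds overlaps for 1 second out of a 3 second union, giving an IoU of 1/3.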

Implementation Details

The model aligns lyrics in three steps: it estimates frame-wise label probabilities, builds a trellis matrix holding the probability of each label being aligned at each time step, and finds the most likely path through that trellis. It is fine-tuned on 1,500 hours of audio data and converts written text to its spoken form for better accuracy.

Based on wav2vec2-large-vi-vlsp2020 architecture
Trained on 13k hours of pre-training data + 1,500 hours of fine-tuning data
Implements CTC-Segmentation for precise temporal alignment
Includes special handling for English words and numbers in Vietnamese context
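The three-step process described in this section can be sketched with NumPy as follows. This is a simplified forced-alignment recipe, not the model's actual code or the full CTC-Segmentation algorithm: the names `build_trellis` and `backtrack`, the blank index of 0, and the rule that staying on a token emits the blank label are all assumptions. The input `log_probs` stands in for step one, the frame-wise label log-probabilities produced by the acoustic model.

```python
import numpy as np


def build_trellis(log_probs, tokens, blank=0):
    """Step 2: trellis[t, j] is the best log-probability of having
    emitted the first j tokens after consuming t frames."""
    num_frames, num_tokens = log_probs.shape[0], len(tokens)
    trellis = np.full((num_frames + 1, num_tokens + 1), -np.inf)
    trellis[0, 0] = 0.0
    for t in range(num_frames):
        # Stay: emit blank, number of consumed tokens is unchanged.
        stay = trellis[t, :] + log_probs[t, blank]
        # Advance: emit the next token in the transcript.
        advance = np.full(num_tokens + 1, -np.inf)
        advance[1:] = trellis[t, :-1] + log_probs[t, tokens]
        trellis[t + 1] = np.maximum(stay, advance)
    return trellis


def backtrack(trellis, log_probs, tokens, blank=0):
    """Step 3: walk back from the final cell and return a list of
    (token_index, frame_index) pairs marking where each token is emitted."""
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1
    path = []
    while j > 0:
        if t == 0:
            raise ValueError("not enough frames to align all tokens")
        stayed = trellis[t - 1, j] + log_probs[t - 1, blank]
        advanced = trellis[t - 1, j - 1] + log_probs[t - 1, tokens[j - 1]]
        if advanced >= stayed:
            path.append((j - 1, t - 1))
            j -= 1
        t -= 1
    return path[::-1]
```

Multiplying each returned frame index by the model's frame duration would give a timestamp in seconds; the frame duration depends on the acoustic model and is not specified here.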

Core Capabilities

Precise word-level temporal alignment in audio
Handles mixed Vietnamese-English lyrics
Supports number format and special character conversion
Achieves 0.2267 WER on Zalo public dataset
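For readers unfamiliar with the WER figure above, the following is a minimal sketch of the standard word error rate: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions; the evaluation script behind the reported number is not described here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

With a four-word reference and a hypothesis that substitutes one word and drops another, the distance is 2 and the WER is 0.5.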

Frequently Asked Questions

Q: What makes this model unique?

The model combines the wav2vec2 architecture with CTC-Segmentation for Vietnamese lyrics, and it handles mixed-language content and special characters. Its ability to convert written text to spoken form makes it particularly effective in real-world applications.

Q: What are the recommended use cases?

The model is ideal for karaoke applications, music analysis, and automatic subtitle generation for Vietnamese songs. It's particularly useful when precise temporal alignment between lyrics and audio is required.
