Phi-4-mm-inst-asr-turkish

Property	Value
Base Model	microsoft/Phi-4-multimodal-instruct
Training Data	600-hour Turkish audio dataset
Author	ysdede
Model Link	Hugging Face

What is Phi-4-mm-inst-asr-turkish?

Phi-4-mm-inst-asr-turkish is a specialized fine-tuned version of Microsoft's Phi-4-multimodal-instruct model, specifically optimized for Turkish speech recognition. The model was trained on a substantial 600-hour Turkish audio dataset for one epoch, achieving significant improvements in speech recognition accuracy.

Implementation Details

The model employs a fine-tuning approach using the prompt "Transcribe the Turkish audio". It demonstrates remarkable improvement in performance metrics, with the Word Error Rate (WER) reducing from 127.29 to 47.57 and Character Error Rate (CER) improving from 78.22 to 20.52. The training loss showed significant improvement, decreasing from 1.423 to 0.176.

Learning rate: 1e-05
Batch size: 4 (training), 8 (evaluation)
Optimizer: AdamW with betas=(0.9,0.95)
Linear learning rate scheduler with 5000 warmup steps
Native AMP mixed precision training

Core Capabilities

Specialized Turkish speech recognition
Improved accuracy with source language specification
Reduced hallucination rates
Significant WER and CER improvements

Frequently Asked Questions

Q: What makes this model unique?

The model's specialization in Turkish speech recognition and its significant performance improvements make it stand out. The reduction in WER by nearly 63% demonstrates its effectiveness for Turkish ASR tasks.

Q: What are the recommended use cases?

The model is specifically designed for Turkish speech transcription tasks. It performs best when the source language is specified during inference, making it ideal for applications requiring Turkish audio transcription.