# vit_spectrogram
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Model | google/vit-base-patch16-224-in21k |
| Framework | TensorFlow 2.4.0 |
| Training Precision | Mixed Float16 |
## What is vit_spectrogram?
vit_spectrogram is a Vision Transformer fine-tuned to classify Mel spectrograms of speech as either 'Male' or 'Female'. Built on Google's ViT architecture (google/vit-base-patch16-224-in21k), it reaches a validation accuracy of 93.66%; its reported top-3 accuracy of 1.0 is trivially guaranteed, since the task has only two classes.
## Implementation Details
The model uses the Vision Transformer architecture with the AdamWeightDecay optimizer and a polynomial-decay learning-rate schedule starting at 3e-5. It is trained in mixed precision (float16) to reduce memory use and speed up training. Key hyperparameters:
- Learning-rate decay steps: 3032
- Adam beta values: β1=0.9, β2=0.999
- Weight decay rate: 0.01
- Initial loss scale: 32768.0 (used for mixed-precision loss scaling)
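The schedule can be written out explicitly. The sketch below assumes a linear decay (power = 1.0) down to an end rate of zero, which matches common TensorFlow/transformers defaults; the card itself does not state the end rate or power:

```python
def polynomial_decay_lr(step: int,
                        initial_lr: float = 3e-5,
                        decay_steps: int = 3032,
                        end_lr: float = 0.0,
                        power: float = 1.0) -> float:
    """Learning rate after `step` optimizer updates.

    Mirrors the behavior of tf.keras.optimizers.schedules.PolynomialDecay:
    the rate decays from `initial_lr` to `end_lr` over `decay_steps`
    updates, then stays at `end_lr`.
    """
    progress = min(step, decay_steps) / decay_steps
    return (initial_lr - end_lr) * (1.0 - progress) ** power + end_lr
```

Under these assumptions the rate starts at 3e-5, is 1.5e-5 halfway through the 3032 decay steps, and is 0 from step 3032 onward.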
## Core Capabilities
- Binary gender classification from spectrograms
- High accuracy (93.66% validation)
- Top-3 accuracy of 1.0 (trivially perfect, as there are only two classes)
- Efficient training with mixed precision
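The "initial loss scale" listed above is the mechanism that makes mixed-precision training work: very small gradients underflow to zero in float16, so the loss is multiplied by a large constant before backpropagation and the gradients are divided by it afterwards. A minimal numpy illustration (the 1e-8 gradient magnitude is an arbitrary example, not a value from this model):

```python
import numpy as np

SCALE = 32768.0  # the card's initial loss scale

tiny_grad = 1e-8  # below float16's smallest subnormal (~6e-8)

# Cast straight to float16: the value underflows to exactly zero.
unscaled = np.float16(tiny_grad)

# Scale first, cast to float16, then unscale in float32:
scaled = np.float16(tiny_grad * SCALE)  # now representable in float16
recovered = np.float32(scaled) / SCALE  # close to 1e-8 again

print(unscaled)   # 0.0
print(recovered)
```

In practice frameworks such as TensorFlow adjust this scale dynamically during training; 32768.0 is only the starting value.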
## Frequently Asked Questions
**Q: What makes this model unique?**
This model applies Vision Transformer technology to audio by treating Mel spectrograms as images, achieving high accuracy on binary gender classification while keeping training efficient through mixed precision.
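The preprocessing path implied here — raw audio converted to a Mel spectrogram that the ViT then treats as an image — can be sketched with plain numpy. The sample rate, frame, hop, and filter parameters below are illustrative assumptions; the card does not specify how its spectrograms were generated, and a library such as librosa would normally be used instead:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=224):
    """Log-power Mel spectrogram; n_mels=224 hints at ViT's 224-pixel input."""
    # Frame the signal, window each frame, take the power spectrum.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Apply the mel filterbank and convert to a log (dB-like) scale.
    mel = mel_filterbank(n_mels, n_fft, sr) @ power.T
    return 10.0 * np.log10(np.maximum(mel, 1e-10))
```

The resulting (n_mels × n_frames) array would still need to be resized to the 224×224 input the base ViT expects, and normalized the same way as the training data.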
**Q: What are the recommended use cases?**
The model is specifically designed for gender classification from audio spectrograms, making it suitable for applications in voice analysis, automated audio processing systems, and gender-based audio content organization.