paraphrase-filipino-mpnet-base-v2
Property | Value |
---|---|
Author | meedan |
Model Type | Sentence Transformer |
Vector Dimension | 768 |
Base Architecture | XLM-RoBERTa |
HuggingFace | Link |
What is paraphrase-filipino-mpnet-base-v2?
paraphrase-filipino-mpnet-base-v2 is a sentence transformer model for Filipino/Tagalog. It was trained with a teacher-student (knowledge distillation) approach, with paraphrase-mpnet-base-v2 as the teacher and paraphrase-multilingual-mpnet-base-v2 as the student. The model maps sentences and paragraphs to 768-dimensional dense vectors, which can be used for NLP tasks such as semantic search and clustering.
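A minimal usage sketch follows, assuming the model is published on the Hugging Face Hub under the identifier `meedan/paraphrase-filipino-mpnet-base-v2` (inferred from the Author and model name in the table above, not confirmed by this page) and that the `sentence-transformers` package is installed. The helper names `load_model` and `embed_sentences` are hypothetical.

```python
from typing import Sequence

# Assumed Hub identifier, built from the Author and model name listed above.
MODEL_ID = "meedan/paraphrase-filipino-mpnet-base-v2"


def load_model(model_id: str = MODEL_ID):
    """Load the model with sentence-transformers (downloads weights on first use)."""
    # Imported lazily so the rest of this sketch works without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id)


def embed_sentences(sentences: Sequence[str], model=None):
    """Return one embedding per sentence from any object exposing encode()."""
    if model is None:
        model = load_model()
    return model.encode(list(sentences))
```

Calling `embed_sentences(["Magandang umaga.", "Good morning."])` should return two vectors of length 768 if the vector dimension listed in the table holds.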
Implementation Details
The model was trained for 2 epochs, with a batch size of 64, on filtered English-Tagalog and English-Filipino parallel data from OPUS. The data was filtered with the Compact Language Detector v3 (CLD3) to ensure data quality. The architecture is based on XLM-R and uses mean pooling over token embeddings to produce sentence embeddings.
- Trained using MSELoss with AdamW optimizer
- Learning rate: 2e-05, with 10,000 warmup steps
- Maximum sequence length: 128 tokens
- Produces token-level embeddings, which are mean-pooled into sentence-level embeddings
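The two mechanisms named above, mean pooling and an MSE training objective, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' training code: the function names are hypothetical, and the way the loss combines the English and Filipino terms (a simple sum) is an assumption modelled on common multilingual distillation setups, where the student's embeddings of both the English sentence and its translation are pulled towards the teacher's embedding of the English sentence.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: Vector, student_en: Vector, student_tl: Vector) -> float:
    """Assumed objective: both student embeddings target the teacher's English embedding."""
    return mse(student_en, teacher_en) + mse(student_tl, teacher_en)
```

Minimising this loss moves the Filipino sentence and its English counterpart to the same point in the teacher's embedding space, which is what makes the resulting embeddings comparable across the two languages.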
Core Capabilities
- Semantic similarity computation between Filipino texts
- Cross-lingual embeddings for Filipino and English
- Achieves 0.75 correlation on Filipino STS tasks
- Maintains 0.80 correlation on English STS evaluation
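The page does not say which correlation coefficient or which STS dataset the figures above refer to. STS evaluations commonly report the Spearman rank correlation between the cosine similarity of each sentence pair and its human-annotated score; the sketch below shows that computation under that assumption. All function names are hypothetical, and the embeddings are expected to come from the model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sts_correlation(pairs: Sequence[Tuple[Vector, Vector]], gold: Sequence[float]) -> float:
    """Correlate predicted cosine similarities with gold similarity scores."""
    predicted = [cosine(a, b) for a, b in pairs]
    return spearman(predicted, gold)
```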
Frequently Asked Questions
Q: What makes this model unique?
This model is tuned specifically for Filipino while maintaining strong performance on English text. It transfers the sentence-embedding space of the MPNet-based teacher (paraphrase-mpnet-base-v2) to an XLM-R-based multilingual student, using carefully filtered parallel data.
Q: What are the recommended use cases?
The model is suited to semantic search, text clustering, similarity detection, and cross-lingual tasks involving Filipino and English content. It is particularly useful for applications that need semantic understanding of Filipino text.
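As an illustration of the semantic search use case, the sketch below ranks a corpus against a query by normalising each embedding to unit length and taking dot products. It assumes the query and corpus embeddings have already been produced by the model; the function names are hypothetical and the vectors in the example are toy values, not real model output.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query: Vector, corpus: Sequence[Vector], k: int = 3) -> List[Tuple[int, float]]:
    """Return (corpus_index, score) for the k most similar corpus vectors."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        # The dot product of unit vectors equals their cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((idx, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors standing in for 768-dimensional embeddings.
hits = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
```

Because Filipino and English sentences share one embedding space, the same ranking would apply to a Filipino query against an English corpus, or the reverse.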