paraphrase-filipino-mpnet-base-v2

Maintained By
meedan

Property            Value
Author              meedan
Model Type          Sentence Transformer
Vector Dimension    768
Base Architecture   XLM-RoBERTa
HuggingFace         Link

What is paraphrase-filipino-mpnet-base-v2?

paraphrase-filipino-mpnet-base-v2 is a sentence transformer model for Filipino/Tagalog text. It was trained with a teacher-student (knowledge distillation) approach in which paraphrase-mpnet-base-v2 served as the teacher and paraphrase-multilingual-mpnet-base-v2 as the student. The model maps sentences and paragraphs to 768-dimensional dense vectors, which can be used for NLP tasks such as semantic search and clustering.
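The teacher-student idea can be sketched as a loss over one translation pair. This is a minimal, dependency-free illustration that assumes the standard multilingual knowledge-distillation recipe: the student's embeddings of both the English sentence and its Tagalog translation are pulled toward the teacher's embedding of the English sentence, using the MSE loss listed under Implementation Details. The function names and the toy vectors are hypothetical, and the actual training code is not shown in this page.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Sequence[float],
    student_en: Sequence[float],
    student_tl: Sequence[float],
) -> float:
    """Loss for one (English, Tagalog) pair.

    teacher_en: teacher embedding of the English sentence (the fixed target)
    student_en: student embedding of the same English sentence
    student_tl: student embedding of the Tagalog translation

    Assumption: both student outputs are regressed onto the teacher's
    English embedding, so translations end up close to each other.
    """
    return mse(student_en, teacher_en) + mse(student_tl, teacher_en)
```

When the student reproduces the teacher's vector for both languages the loss is zero; any deviation in either language increases it.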

Implementation Details

The model was trained for 2 epochs on English-Tagalog and English-Filipino parallel data from OPUS, with a batch size of 64. The parallel data was filtered using Compact Language Detector v3 (CLD3) to improve data quality. The architecture is based on XLM-R and uses mean pooling to produce sentence embeddings.

  • Trained using MSELoss with AdamW optimizer
  • Learning rate: 2e-05 with warmup steps: 10000
  • Maximum sequence length: 128 tokens
  • Produces word-level (token) embeddings that are mean-pooled into a single sentence-level embedding
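Mean pooling, as mentioned above, averages the token embeddings of a sentence while skipping padding positions. The following is a small sketch in plain Python rather than the library's own implementation; the function name `mean_pool` and the list-based inputs are illustrative assumptions.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average token vectors where attention_mask is 1 (real tokens).

    Positions with mask 0 are padding and do not contribute.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask value is required per token")
    n_real = sum(1 for keep in attention_mask if keep)
    if n_real == 0:
        raise ValueError("at least one non-padding token is required")

    dim = len(token_embeddings[0])
    pooled = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                pooled[i] += value
    return [value / n_real for value in pooled]
```

For this model each token vector would have 768 entries, and the pooled result is the 768-dimensional sentence embedding.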

Core Capabilities

  • Semantic similarity computation between Filipino texts
  • Cross-lingual embeddings for Filipino and English
  • Achieves 0.75 correlation on Filipino STS tasks
  • Maintains 0.80 correlation on English STS evaluation

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Filipino language processing while maintaining strong performance on English text. It transfers the sentence embeddings of an MPNet-based teacher (paraphrase-mpnet-base-v2) to a multilingual XLM-R-based student, using filtered English-Tagalog and English-Filipino parallel data.

Q: What are the recommended use cases?

The model is ideal for semantic search applications, text clustering, similarity detection, and cross-lingual tasks involving Filipino and English content. It's particularly useful for applications requiring semantic understanding of Filipino text.
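A semantic search over a small corpus might look like the sketch below. It assumes the sentence-transformers package is installed and that the model is published on the Hugging Face Hub under the id `meedan/paraphrase-filipino-mpnet-base-v2`; that id is inferred from the author and model name on this page, not taken from the link. The helper names `embed`, `top_k`, and `search` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Assumed Hub id, built from the author and model name shown on this page.
MODEL_ID = "meedan/paraphrase-filipino-mpnet-base-v2"


def embed(sentences: Sequence[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Encode sentences into dense vectors (768 dimensions for this model).

    The import is local so the ranking helpers below work without the
    library. The model is reloaded on every call, which keeps the sketch
    short; a real application would load it once and reuse it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(list(sentences)).tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def search(query: str, corpus: Sequence[str], k: int = 3) -> List[Tuple[str, float]]:
    """Embed the query and corpus together, then rank the corpus by similarity."""
    vectors = embed([query] + list(corpus))
    return [(corpus[i], score) for i, score in top_k(vectors[0], vectors[1:], k)]
```

Calling `search("Good morning, everyone.", ["Magandang umaga sa inyong lahat.", "Umuulan ngayon."], k=1)` would download the model on first use and return the corpus sentence the model scores as closest to the English query, which exercises the cross-lingual use case described above.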
