bert-base-turkish-cased-mean-nli-stsb-tr

Property	Value
License	Apache 2.0
Language	Turkish
Vector Dimension	768
Downloads	330,121

What is bert-base-turkish-cased-mean-nli-stsb-tr?

This is a Turkish sentence-embedding model built on the BERT architecture and designed for semantic similarity tasks. It maps Turkish sentences and paragraphs to 768-dimensional dense vectors, which makes it well suited to clustering and semantic search. It was trained on machine-translated Turkish versions of the NLI and STS-B datasets.

Implementation Details

The model follows the sentence-transformers architecture with a mean pooling strategy, and it can be used through either the sentence-transformers library or HuggingFace Transformers. Reported correlation scores exceed 0.83 across several evaluation metrics, including the cosine, Euclidean, and Manhattan similarity measures.
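As a rough illustration of what a mean pooling strategy does, the sketch below averages per-token vectors into one sentence vector while skipping padded positions. It is not the model's own code: the function name `mean_pool`, the toy 4-dimensional vectors (the real model uses 768), and the NumPy implementation are all assumptions made for the example.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts over the embedding dimension.
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input (hypothetical values): 2 sentences, 3 token slots, 4 dimensions.
tokens = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [9.0, 9.0, 9.0, 9.0]],
    [[2.0, 0.0, 2.0, 0.0], [4.0, 2.0, 0.0, 2.0], [0.0, 4.0, 4.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])  # last slot of sentence 0 is padding

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors)
```

Weighting by the attention mask matters because padded slots would otherwise drag the average toward whatever values the padding positions happen to hold.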

Trained using specialized NLI and STS-B training scripts
Batch size of 16 with the AdamW optimizer
Uses WarmupLinear scheduler with 144 warmup steps
Maximum sequence length of 75 tokens
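To make the warmup setting above concrete, here is a minimal sketch of a linear warmup followed by linear decay, which is the usual meaning of a "WarmupLinear" schedule. Only the 144 warmup steps come from the list above; the total step count of 1440 and the function name `warmup_linear_factor` are hypothetical, and the function returns a multiplier for the peak learning rate rather than a learning rate itself.

```python
def warmup_linear_factor(step, warmup_steps=144, total_steps=1440):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays 1 -> 0.

    total_steps=1440 is a placeholder; the source does not state the real value.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Multiplier at a few points in the (hypothetical) training run.
for step in (0, 72, 144, 792, 1440):
    print(step, warmup_linear_factor(step))
```

The effective learning rate at a given step would be this factor times the peak learning rate chosen for AdamW.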

Core Capabilities

Sentence and paragraph embedding generation
Semantic similarity computation
Text clustering
Semantic search operations
Turkish-language coverage learned from machine-translated training data
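The similarity and search capabilities listed above reduce to comparing embedding vectors. The sketch below ranks a small corpus against a query by cosine similarity. The 3-dimensional vectors stand in for real 768-dimensional embeddings, and the helper names `cosine_scores` and `top_k` are made up for this example.

```python
import numpy as np

def cosine_scores(query, corpus):
    """Cosine similarity between one query vector and each corpus row."""
    query = np.asarray(query, dtype=np.float64)
    corpus = np.asarray(corpus, dtype=np.float64)
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return c @ q

def top_k(query, corpus, k=2):
    """Return (corpus_index, score) pairs for the k most similar rows."""
    scores = cosine_scores(query, corpus)
    order = np.argsort(-scores)[:k]  # sort by descending similarity
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical embeddings; in practice these would come from the model.
corpus_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
query_vector = np.array([2.0, 0.0, 0.0])

results = top_k(query_vector, corpus_vectors, k=2)
print(results)
```

Because cosine similarity normalizes vector length, the query `[2.0, 0.0, 0.0]` matches the first corpus row exactly even though their magnitudes differ.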

Frequently Asked Questions

Q: What makes this model unique?

This model is optimized specifically for Turkish, combining NLI and STS-B training data, and it reports a cosine_pearson score of 0.834 on the test set. That makes it useful for Turkish natural language processing tasks that require semantic understanding.
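For readers unfamiliar with the metric, a cosine_pearson score is commonly computed as the Pearson correlation between the cosine similarities a model predicts for sentence pairs and the human-assigned gold scores. The sketch below assumes that definition; the four predicted and gold values are invented for illustration and are not results from this model.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: predicted cosine similarities and gold scores on a 0-5 scale.
predicted_cosine = [0.10, 0.35, 0.62, 0.90]
gold_scores = [0.5, 1.5, 3.0, 4.8]

score = pearson(predicted_cosine, gold_scores)
print(round(score, 3))
```

Pearson correlation is unaffected by linear rescaling, so the gold scores do not need to be mapped onto the cosine range before comparison.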

Q: What are the recommended use cases?

The model suits applications that need semantic similarity matching in Turkish text, including document clustering, semantic search engines, text classification, and information retrieval systems. It is most useful where the task depends on the semantic relationships between Turkish sentences or paragraphs.