ko-sbert-multitask
| Property | Value |
|---|---|
| Author | jhgan |
| Model Type | Sentence Transformer |
| Embedding Dimensions | 768 |
| Paper | KorNLI and KorSTS Paper |
What is ko-sbert-multitask?
ko-sbert-multitask is a specialized Korean language model designed for generating semantic sentence embeddings. It's built on the SBERT architecture and trained through a multi-task learning approach using both the KorSTS and KorNLI datasets. The model converts Korean sentences into 768-dimensional dense vectors, making it particularly effective for semantic search, clustering, and similarity analysis tasks.
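Generating these embeddings takes only a few lines with the sentence-transformers library. The sketch below assumes the model is published on the Hugging Face Hub under the id jhgan/ko-sbert-multitask (the author and model name above); the example sentences are purely illustrative.

```python
# Minimal usage sketch; assumes the Hub id "jhgan/ko-sbert-multitask".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jhgan/ko-sbert-multitask")

sentences = [
    "한국어 문장 임베딩을 생성합니다.",            # "Generate Korean sentence embeddings."
    "이 모델은 의미 검색에 활용할 수 있습니다.",    # "This model can be used for semantic search."
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one 768-dimensional dense vector per sentence
```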
Implementation Details
The model uses a BERT-based architecture with mean pooling and was trained with multiple loss functions, including MultipleNegativesRankingLoss and CosineSimilarityLoss (a training sketch follows the list below). On the KorSTS evaluation set it reports a cosine Pearson correlation of 84.13 and a Spearman correlation of 84.71.
- Trained with AdamW optimizer (learning rate: 2e-05)
- Uses a linear warm-up schedule with 360 warm-up steps
- Maximum sequence length: 128 tokens
- Implements a mean pooling strategy for sentence embeddings
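As a rough illustration of this setup, here is a sketch of a multi-task fine-tuning loop in sentence-transformers. The backbone checkpoint (klue/bert-base) and the tiny in-line datasets are placeholder assumptions, not the author's actual training script; only the loss functions, learning rate, and warm-up steps mirror the values listed above.

```python
# Hypothetical multi-task fine-tuning sketch (not the author's script).
# NLI pairs -> MultipleNegativesRankingLoss, STS pairs -> CosineSimilarityLoss.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed Korean BERT backbone; sentence-transformers adds mean pooling on top.
model = SentenceTransformer("klue/bert-base")

# Placeholder data: real training would use the KorNLI / KorSTS files.
nli_examples = [InputExample(texts=["비가 내리고 있다.", "날씨가 궂다."])]                 # entailment pair
sts_examples = [InputExample(texts=["고양이가 잔다.", "고양이가 자고 있다."], label=0.9)]  # similarity in [0, 1]

nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)

train_objectives = [
    (nli_loader, losses.MultipleNegativesRankingLoss(model)),
    (sts_loader, losses.CosineSimilarityLoss(model)),
]

# AdamW with lr 2e-05 and 360 linear warm-up steps, as noted in the list above.
model.fit(
    train_objectives=train_objectives,
    epochs=1,
    warmup_steps=360,
    optimizer_params={"lr": 2e-05},
)
```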
Core Capabilities
- Semantic similarity computation between Korean sentences (see the example after this list)
- Dense vector representation for Korean text
- Supports clustering and information retrieval
- Efficient sentence embedding generation
- High performance on semantic textual similarity tasks
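In practice, similarity scoring reduces to a cosine similarity between the embeddings. A minimal sketch, again assuming the jhgan/ko-sbert-multitask Hub id and illustrative sentences:

```python
# Cosine-similarity sketch; Hub id and sentences are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jhgan/ko-sbert-multitask")

sent_a = "오늘 날씨가 정말 좋다."      # "The weather is really nice today."
sent_b = "하늘이 맑고 화창하다."        # "The sky is clear and sunny."
sent_c = "주식 시장이 크게 하락했다."    # "The stock market fell sharply."

emb = model.encode([sent_a, sent_b, sent_c], convert_to_tensor=True)

# Related sentences (a, b) should score higher than unrelated ones (a, c).
print(util.cos_sim(emb[0], emb[1]).item())
print(util.cos_sim(emb[0], emb[2]).item())
```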
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its specialized focus on Korean language understanding and its multi-task training approach, which combines the KorSTS and KorNLI datasets. Its strong scores on semantic textual similarity make it particularly valuable for Korean NLP applications.
Q: What are the recommended use cases?
The model is ideal for tasks such as semantic search in Korean documents, sentence clustering, document similarity analysis, and information retrieval applications. It's particularly effective when you need to compare or analyze the semantic meaning of Korean text passages.
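For the semantic search use case, sentence-transformers ships a utility that ranks a pre-encoded corpus against a query embedding. A short sketch, assuming the same Hub id; the corpus and query are made up for illustration:

```python
# Semantic-search sketch over a tiny illustrative Korean corpus.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jhgan/ko-sbert-multitask")

corpus = [
    "이 계약서는 임대차 조건을 명시한다.",          # lease terms
    "환불은 구매 후 7일 이내에만 가능하다.",         # refund policy
    "서울의 평균 기온이 꾸준히 상승하고 있다.",       # climate statistics
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "환불은 언제까지 신청할 수 있나요?"          # "Until when can I request a refund?"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query and keep the top 2.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```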