sentence-bert-base-ja-mean-tokens-v2
Property | Value |
---|---|
Parameter Count | 111M |
License | CC-BY-SA-4.0 |
Author | sonoisa |
Base Model | cl-tohoku/bert-base-japanese-whole-word-masking |
What is sentence-bert-base-ja-mean-tokens-v2?
This is an improved Japanese Sentence-BERT model trained with MultipleNegativesRankingLoss for higher-quality sentence embeddings. Built on cl-tohoku/bert-base-japanese-whole-word-masking, this v2 model scores roughly 1.5-2 points higher in accuracy than its predecessor on the author's private evaluation datasets.
Implementation Details
The model is implemented in PyTorch and requires fugashi and ipadic for Japanese tokenization at inference time. It uses mean pooling over token embeddings to produce sentence embeddings and supports batch processing for efficient computation; a minimal inference sketch follows the list below.
- Improved loss function using MultipleNegativesRankingLoss
- Built on BERT-base Japanese whole word masking
- Supports batch processing with customizable batch sizes
- Implements efficient mean pooling strategy
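The sketch below shows one way to compute mean-pooled sentence embeddings with the transformers library. The Hugging Face model id, the example sentences, and the batch size are illustrative choices, not taken from the original card, and fugashi and ipadic must be installed for the tokenizer to work.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "sonoisa/sentence-bert-base-ja-mean-tokens-v2"  # Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # needs fugashi + ipadic installed
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def mean_pooling(token_embeddings, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

@torch.no_grad()
def encode(sentences, batch_size=8):
    # Encode sentences in batches and return one embedding per sentence.
    embeddings = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        output = model(**batch)
        embeddings.append(mean_pooling(output.last_hidden_state, batch["attention_mask"]))
    return torch.cat(embeddings, dim=0)

sentences = ["吾輩は猫である。", "本日は晴天なり。"]
print(encode(sentences).shape)  # e.g. torch.Size([2, 768])
```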
Core Capabilities
- Japanese sentence embedding generation
- Semantic similarity comparison (see the cosine similarity sketch below)
- Feature extraction for Japanese text
- Efficient batch processing of sentences
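For semantic similarity, the embeddings can be compared with cosine similarity. This short sketch reuses the encode() helper from the implementation sketch above; the query and candidate sentences are made up for illustration.

```python
import torch.nn.functional as F

query = "今日はいい天気ですね。"
candidates = ["本日は晴天なり。", "吾輩は猫である。"]

# Embed the query together with the candidates, then score by cosine similarity.
embeddings = encode([query] + candidates)
scores = F.cosine_similarity(embeddings[0:1], embeddings[1:], dim=1)

for sentence, score in zip(candidates, scores.tolist()):
    print(f"{score:.3f}\t{sentence}")
```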
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its improved training approach using MultipleNegativesRankingLoss and its specific optimization for Japanese language processing, showing measurable improvements over the v1 model.
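The author's training data and hyperparameters are not published, so the sketch below only illustrates what fine-tuning with MultipleNegativesRankingLoss looks like in the sentence-transformers library. The (anchor, positive) pairs are hypothetical, and it assumes the checkpoint can be loaded directly with SentenceTransformer.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (anchor, positive) pairs; the author's actual training data is not public.
train_examples = [
    InputExample(texts=["猫が好きです。", "私は猫が大好きだ。"]),
    InputExample(texts=["今日は雨が降っている。", "外は雨模様だ。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("sonoisa/sentence-bert-base-ja-mean-tokens-v2")

# MultipleNegativesRankingLoss treats the other positives in a batch as negatives,
# so larger batch sizes generally give a stronger training signal.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```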
Q: What are the recommended use cases?
The model is ideal for Japanese text processing tasks including semantic similarity comparison, document classification, and feature extraction for downstream NLP tasks.
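As one example of feature extraction for a downstream task, the mean-pooled embeddings can feed a lightweight classifier. The sketch below uses scikit-learn's LogisticRegression with hypothetical labeled sentences and reuses the encode() helper from the implementation sketch.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled examples; the sentence embeddings serve as fixed features.
texts = ["この映画は最高だった。", "二度と見たくない作品だ。"]
labels = [1, 0]  # 1 = positive, 0 = negative

features = encode(texts).numpy()
clf = LogisticRegression().fit(features, labels)
print(clf.predict(encode(["とても面白かった。"]).numpy()))
```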