gbert-large-paraphrase-euclidean

Maintained By: deutsche-telekom

License: MIT
Base Model: deepset/gbert-large
Embedding Dimension: 1024
Language: German

What is gbert-large-paraphrase-euclidean?

This is a specialized German language model for sentence-similarity tasks, built on the sentence-transformers framework. It maps German text to 1024-dimensional dense vector representations and is optimized in particular for few-shot text classification with SetFit. The model uses Euclidean distance as its similarity metric and was fine-tuned with BatchHardSoftMarginTripletLoss.
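
Loading the model and scoring a sentence pair follows the standard sentence-transformers API. A minimal sketch (the model ID comes from this card; the example sentences are illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deutsche-telekom/gbert-large-paraphrase-euclidean")

sentences = [
    "Das Wetter ist heute schön.",
    "Heute haben wir schönes Wetter.",
]
embeddings = model.encode(sentences)  # numpy array of shape (2, 1024)

# The model was tuned with Euclidean distance, so score similarity with it:
distance = np.linalg.norm(embeddings[0] - embeddings[1])
print(f"Euclidean distance: {distance:.4f}")  # smaller distance = more similar
```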

Implementation Details

The model is trained on a filtered version of the deutsche-telekom/ger-backtrans-paraphrase dataset, keeping only sentences of at least 15 characters and at most 30 tokens. Training used a learning rate of 5.55e-06 over 7 epochs with a batch size of 68.

  • Utilizes BatchHardSoftMarginTripletLoss with Euclidean distance (see the training sketch after this list)
  • Built on deepset/gbert-large architecture
  • Optimized for German language processing
  • Trains on paraphrase data filtered by the length criteria above
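
The following sketch reconstructs that training setup with the legacy sentence-transformers fit() API. Only the hyperparameters and the loss come from this card; the dataset handling and paraphrase groups are illustrative assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
from sentence_transformers.datasets import SentenceLabelDataset

model = SentenceTransformer("deepset/gbert-large")

# Illustrative data: each paraphrase group shares a label so that batch-hard
# mining can form triplets (anchor, hardest positive, hardest negative).
examples = [
    InputExample(texts=["Satz A, Variante 1"], label=0),
    InputExample(texts=["Satz A, Variante 2"], label=0),
    InputExample(texts=["Satz B, Variante 1"], label=1),
    InputExample(texts=["Satz B, Variante 2"], label=1),
]
train_data = SentenceLabelDataset(examples)  # keeps several examples per label in a batch
train_dataloader = DataLoader(train_data, batch_size=68)  # batch size per card

# Euclidean distance is this loss's default distance metric.
train_loss = losses.BatchHardSoftMarginTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=7,                          # per card
    optimizer_params={"lr": 5.55e-6},  # per card
)
```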

Core Capabilities

  • Sentence and paragraph embedding generation
  • Few-shot text classification with SetFit (see the sketch after this list)
  • Paraphrase detection and similarity scoring
  • German language text processing
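
Since the model is intended as a SetFit body, few-shot classification can be sketched with the setfit library's SetFitTrainer (renamed Trainer in newer setfit releases). The tiny dataset and labels below are illustrative:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

model = SetFitModel.from_pretrained(
    "deutsche-telekom/gbert-large-paraphrase-euclidean"
)

# A handful of labeled German examples is often enough for SetFit.
train_ds = Dataset.from_dict({
    "text": [
        "Die Lieferung kam pünktlich an.",
        "Der Kundenservice war sehr hilfreich.",
        "Das Paket war beschädigt.",
        "Ich warte seit Wochen auf eine Antwort.",
    ],
    "label": [1, 1, 0, 0],  # 1 = positive, 0 = negative (illustrative)
})

trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

preds = model.predict(["Der Support hat mein Problem schnell gelöst."])
print(preds)  # expected: the positive class
```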

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its German-only specialization and its use of Euclidean distance as the similarity metric, outperforming multilingual alternatives and even its base BERT model in few-shot (SetFit) scenarios.

Q: What are the recommended use cases?

The model is particularly well-suited for German text classification tasks with limited training data, semantic similarity comparisons, and paraphrase detection in German language content.
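
For paraphrase detection across a pool of sentences, the generic paraphrase_mining utility from sentence-transformers can be applied; note that this helper ranks pairs by cosine similarity rather than the Euclidean distance the model was tuned with. A sketch with illustrative sentences:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("deutsche-telekom/gbert-large-paraphrase-euclidean")

sentences = [
    "Der Zug hat zwanzig Minuten Verspätung.",
    "Die Bahn kommt zwanzig Minuten später an.",
    "Morgen scheint die Sonne.",
]

# Returns (score, i, j) triples, highest-scoring pairs first.
pairs = util.paraphrase_mining(model, sentences)
for score, i, j in pairs:
    print(f"{score:.3f}  {sentences[i]}  <->  {sentences[j]}")
```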
