electra-small-nordic

jonfd

An ELECTRA-Small model pre-trained on 14.82B tokens of Icelandic, Norwegian, Swedish, and Danish text, with a 96K-token vocabulary, intended for Nordic NLP tasks.

Property	Value
Author	jonfd
Model Type	ELECTRA Small
Languages	Icelandic, Norwegian, Swedish, Danish
Training Data	14.82B tokens
Vocabulary Size	96,105 tokens
Model URL	Hugging Face

What is electra-small-nordic?

electra-small-nordic is an ELECTRA model pre-trained on four Nordic languages: Icelandic, Norwegian, Swedish, and Danish. Its training data totals 14.82B tokens, split equally across the four languages. The model uses the small ELECTRA configuration, which keeps compute and memory requirements low and makes it practical to fine-tune and deploy for Nordic language tasks.

Implementation Details

The model was trained on a collection of Nordic language corpora: the Icelandic Gigaword Corpus (IGC), the Icelandic Common Crawl Corpus (IC3), the Icelandic Crawled Corpus (ICC), and the Icelandic, Norwegian, Swedish, and Danish portions of the Multilingual Colossal Clean Crawled Corpus (mC4). The training process involved:

1 million training steps with a batch size of 256
WordPiece tokenizer with a vocabulary of 96,105 tokens
Equal distribution of training data across four Nordic languages
Document-level deduplication and filtering for quality assurance
Training conducted on Google's TPU Research Cloud (TRC)
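
The deduplication and filtering step listed above can be illustrated with a short sketch. This is not the pipeline used to build the training data: the whitespace and case normalization, the SHA-1 hashing, and the `min_chars` threshold are all assumptions chosen for the example.

```python
import hashlib
import re


def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", doc).strip().lower()


def dedup_and_filter(docs, min_chars=20):
    """Keep the first copy of each document and drop very short ones.

    min_chars is a hypothetical quality threshold, not a value from the model card.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # too short to be useful training text
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same document (after normalization) was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "Þetta er dæmi um íslenskan texta.",
    "Þetta  er dæmi um íslenskan texta.",  # duplicate up to whitespace
    "Detta är ett exempel på svensk text.",
    "Hej",  # below the length threshold
]
print(len(dedup_and_filter(docs)))  # prints 2
```

Hashing whole documents only catches exact copies after normalization. Catching near-duplicates would need a different technique, such as MinHash, which this sketch does not attempt.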

Core Capabilities

Specialized processing of Nordic languages (Icelandic, Norwegian, Swedish, Danish)
Efficient representation learning with ELECTRA's discriminative pre-training
Balanced multi-language support across Nordic languages
Optimized for downstream Nordic NLP tasks
Compact model size while maintaining strong performance
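
ELECTRA's discriminative pre-training objective is replaced token detection: a small generator swaps some input tokens for plausible alternatives, and the discriminator predicts, for every position, whether the token is original or replaced. The toy sketch below only shows how those per-token labels are derived. The word-level tokens and the `rtd_labels` helper are hypothetical, and no real generator or model is involved.

```python
def rtd_labels(original, corrupted):
    """Return 1 where a token differs from the original (replaced), else 0.

    A position where the generator happens to reproduce the original
    token counts as original, so it gets label 0.
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


# Swedish example: the generator replaced "springer" with "sover".
original = ["hunden", "springer", "i", "parken"]
corrupted = ["hunden", "sover", "i", "parken"]

labels = rtd_labels(original, corrupted)
print(labels)  # prints [0, 1, 0, 0]
```

Because every position contributes a label, the discriminator receives a training signal from the whole sequence rather than only from the masked positions, which is the usual explanation for ELECTRA's sample efficiency.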

Frequently Asked Questions

Q: What makes this model unique?

The model is pre-trained specifically on four Nordic languages, with the training data balanced equally among them, which makes it a good fit for regional NLP applications. The small ELECTRA configuration keeps it efficient to fine-tune and run.

Q: What are the recommended use cases?

The model is well suited to fine-tuning on downstream tasks in the four languages, such as text classification and named entity recognition. It is particularly useful when a single compact model has to handle several Nordic languages.
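
For a text classification use case, a starting point with the Hugging Face `transformers` library might look like the sketch below. The repository ID `jonfd/electra-small-nordic` is inferred from the author and model name and should be treated as an assumption. The helper names and the `num_labels` default are hypothetical. The classification head is freshly initialized, so predictions are meaningless until the model has been fine-tuned on labeled data.

```python
MODEL_ID = "jonfd/electra-small-nordic"  # assumed Hub repository ID


def load_classifier(num_labels: int = 2):
    """Load the tokenizer and the encoder with a new, untrained classification head."""
    # Imported lazily so that defining this function does not require the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model


def predict(texts, tokenizer, model):
    """Return the highest-scoring class index for each input text."""
    import torch

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits  # shape: (len(texts), num_labels)
    return logits.argmax(dim=-1).tolist()
```

Calling `load_classifier(num_labels=3)` downloads the weights on first use. The returned pair can then be passed to `predict` with a list of sentences in any of the four languages, or handed to a fine-tuning loop.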