electra-small-nordic

Maintained By
jonfd

Property          Value
Author            jonfd
Model Type        ELECTRA Small
Languages         Icelandic, Norwegian, Swedish, Danish
Training Data     14.82B tokens
Vocabulary Size   96,105 tokens
Model URL         HuggingFace

What is electra-small-nordic?

electra-small-nordic is an ELECTRA model pre-trained for four Nordic languages: Icelandic, Norwegian, Swedish, and Danish. It was trained on 14.82B tokens distributed equally across the four languages. The model uses the ELECTRA Small configuration, which keeps it light enough for practical applications while targeting strong performance on Nordic language tasks.

Implementation Details

The model was trained using a carefully curated collection of Nordic language corpora, including the Icelandic Gigaword Corpus (IGC), Icelandic Common Crawl Corpus (IC3), Icelandic Crawled Corpus (ICC), and selected Nordic language content from the Multilingual Colossal Clean Crawled Corpus (mC4). The training process involved:

  • 1 million training steps with a batch size of 256
  • WordPiece tokenizer with a vocabulary of 96,105 tokens
  • Equal distribution of training data across four Nordic languages
  • Document-level deduplication and filtering for quality assurance
  • Training conducted on Google's TPU Research Cloud (TRC)
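
The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted on this page and assumes that "equal distribution" means an exact quarter of the tokens per language. The sequence length is not stated here, so the sketch counts sequences processed rather than tokens per step. The function names are illustrative.

```python
# Figures quoted in the model card.
TOTAL_TOKENS = 14_820_000_000  # 14.82B training tokens
LANGUAGES = ["Icelandic", "Norwegian", "Swedish", "Danish"]
TRAINING_STEPS = 1_000_000
BATCH_SIZE = 256


def tokens_per_language(total_tokens, languages):
    """Split the corpus evenly across languages (assumes an exact equal split)."""
    share = total_tokens // len(languages)
    return {lang: share for lang in languages}


def sequences_seen(steps, batch_size):
    """Number of training sequences processed over the whole run."""
    return steps * batch_size


shares = tokens_per_language(TOTAL_TOKENS, LANGUAGES)
for lang, n_tokens in shares.items():
    print(f"{lang}: {n_tokens / 1e9:.3f}B tokens")
print(f"Sequences processed: {sequences_seen(TRAINING_STEPS, BATCH_SIZE):,}")
```

Under these assumptions each language contributes roughly 3.7B tokens, and the run processes 256 million sequences in total.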

Core Capabilities

  • Specialized processing of Nordic languages (Icelandic, Norwegian, Swedish, Danish)
  • Efficient representation learning with ELECTRA's discriminative pre-training
  • Balanced multi-language support across Nordic languages
  • Optimized for downstream Nordic NLP tasks
  • Compact model size while maintaining strong performance
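
To make "discriminative pre-training" concrete: ELECTRA trains a discriminator to decide, for every token, whether it is the original or a replacement proposed by a small generator network. The toy sketch below only shows how the per-token targets are derived. It is not this model's training code, and the sentence and the `rtd_labels` helper are illustrative.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 where a token was replaced and 0 where it matches the original."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Illustrative input: the generator has swapped "cooked" for "ate".
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Because every position yields a training signal, not only the masked ones, this objective tends to be sample-efficient, which is part of why small ELECTRA configurations are practical.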

Frequently Asked Questions

Q: What makes this model unique?

The model is pre-trained specifically for Nordic languages, with training data balanced across Icelandic, Norwegian, Swedish, and Danish, which suits it to regional NLP applications. Its ELECTRA Small configuration makes it an efficient option for Nordic language processing tasks.

Q: What are the recommended use cases?

The model is well-suited for various Nordic language processing tasks, including text classification, named entity recognition, and other downstream NLP tasks specific to Nordic languages. It's particularly valuable for applications requiring efficient processing of multiple Nordic languages simultaneously.
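
As a starting point for a task such as named entity recognition, the sketch below prepares the model for token classification with the Hugging Face `transformers` library. The Hub identifier `jonfd/electra-small-nordic` is an assumption inferred from the author and model name on this page, and the label set and helper names are hypothetical. The classification head is newly initialized, so the model needs fine-tuning on labeled data before its predictions mean anything.

```python
MODEL_ID = "jonfd/electra-small-nordic"  # assumed Hub identifier, verify before use


def build_label_maps(labels):
    """Build the id<->label mappings that transformers configs expect."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_token_classification(labels, model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a fresh token-classification head."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical label set for a small NER task.
NER_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
id2label, label2id = build_label_maps(NER_LABELS)
print(id2label)
```

Calling `load_for_token_classification(NER_LABELS)` downloads the weights and returns a tokenizer and model that can be passed to a standard fine-tuning loop. For text classification, `AutoModelForSequenceClassification` can be substituted in the same way.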
