alephbert-base

Maintained By: onlplab

AlephBERT Base

Property          Value
License           Apache-2.0
Paper             BERT Paper
Primary Language  Hebrew
Training Data     OSCAR, Wikipedia, Twitter

What is alephbert-base?

AlephBERT is a state-of-the-art pretrained language model for Hebrew, built on Google's BERT architecture. It was trained on 20 million sentences from OSCAR, 3 million sentences from Wikipedia, and 70 million sentences from Twitter, approximately 17.7GB of text in total.
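To make the corpus composition concrete, the short sketch below sums the sentence counts quoted above and prints each source's share. The counts come from the text; the variable names are only illustrative.

```python
# Sentence counts in millions, as quoted in the paragraph above.
SENTENCES_M = {"OSCAR": 20, "Wikipedia": 3, "Twitter": 70}

total = sum(SENTENCES_M.values())  # 93 million sentences overall
shares = {name: count / total for name, count in SENTENCES_M.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# OSCAR: 21.5%
# Wikipedia: 3.2%
# Twitter: 75.3%
```

By sentence count, roughly three quarters of the training data is Twitter text.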

Implementation Details

The model was trained on a DGX machine with 8 V100 GPUs using the Hugging Face training procedure. To make training more efficient, the data was divided into four sections by token length. Each section was trained for 10 epochs: 5 epochs at a learning rate of 1e-4, followed by 5 epochs at 1e-5. Total training time was 8 days.

  • Optimized using Masked Language Model loss
  • Four-section training strategy based on token length (32, 64, 128, 512 tokens)
  • Implementation available through HuggingFace's transformers library
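The four-section strategy described above can be sketched in plain Python. This is an illustration, not the authors' training code: it assumes the listed values (32, 64, 128, 512) are exclusive upper bounds on token count, and that each section runs both learning-rate phases before the next section starts. The function names are hypothetical.

```python
from bisect import bisect_right

# Assumption: the listed values are exclusive upper bounds of the four sections.
SECTION_BOUNDS = [32, 64, 128, 512]

# Two phases per section, as described: 5 epochs at 1e-4, then 5 epochs at 1e-5.
PHASES = [(5, 1e-4), (5, 1e-5)]


def section_for(num_tokens):
    """Return the section index (0-3) for a sentence, or None if it is too long."""
    idx = bisect_right(SECTION_BOUNDS, num_tokens)
    return idx if idx < len(SECTION_BOUNDS) else None


def split_into_sections(token_counts):
    """Group sentence indices by section; sentences of 512+ tokens are dropped."""
    sections = [[] for _ in SECTION_BOUNDS]
    for i, n in enumerate(token_counts):
        idx = section_for(n)
        if idx is not None:
            sections[idx].append(i)
    return sections


def training_plan():
    """List (section, epochs, learning_rate) steps; the ordering is an assumption."""
    return [(s, epochs, lr)
            for s in range(len(SECTION_BOUNDS))
            for epochs, lr in PHASES]


for section, epochs, lr in training_plan():
    print(f"section {section}: {epochs} epochs at lr={lr}")
```

Grouping sentences of similar length keeps padding low within a batch, which is one plausible reason for the reported efficiency gain.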

Core Capabilities

  • Hebrew text understanding and processing
  • Masked language modeling
  • Transfer learning for downstream Hebrew NLP tasks
  • Handles various text lengths up to 512 tokens
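As a minimal sketch of the masked language modeling capability listed above, the code below masks one word of a sentence and asks the model for candidates through the transformers `fill-mask` pipeline. The Hub id `onlplab/alephbert-base` is assumed from the maintainer and model name on this page, and `[MASK]` is assumed to be the mask token, as in standard BERT. The helper names are hypothetical.

```python
MODEL_ID = "onlplab/alephbert-base"  # assumed Hub id (maintainer/name from this page)
MASK_TOKEN = "[MASK]"  # standard BERT mask token; assumed unchanged for this model


def mask_word(sentence, word):
    """Replace the first whitespace-delimited occurrence of `word` with the mask token."""
    tokens = sentence.split()
    if word not in tokens:
        raise ValueError(f"{word!r} not found in sentence")
    tokens[tokens.index(word)] = MASK_TOKEN
    return " ".join(tokens)


def predict_masked(sentence, top_k=5):
    """Return (token, score) candidates for the masked position.

    transformers is imported lazily so mask_word works without it installed;
    the first call downloads the model weights from the Hub.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=top_k)]
```

A call might look like `predict_masked(mask_word("ירושלים היא עיר הבירה של ישראל", "הבירה"))`, which masks the fourth word of the sentence. The sentence is only an example input.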

Frequently Asked Questions

Q: What makes this model unique?

AlephBERT is specifically optimized for Hebrew language processing, trained on a diverse and extensive Hebrew corpus including social media content, making it particularly robust for modern Hebrew text analysis.

Q: What are the recommended use cases?

The model is suited to Hebrew text processing tasks such as text classification, named entity recognition, and other NLP applications that depend on Hebrew language context. Because it was trained on contemporary sources like Twitter, it is a particularly good fit for applications involving modern Hebrew text.
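For the downstream tasks named above, a hedged starting point is to load the pretrained weights under a task-specific head using the transformers Auto classes. The sketch assumes the Hub id `onlplab/alephbert-base`; the function names, task keys, and label lists are illustrative, and the new head is randomly initialized, so it still needs fine-tuning on labeled data.

```python
MODEL_ID = "onlplab/alephbert-base"  # assumed Hub id (maintainer/name from this page)

# Map illustrative task keys to the transformers Auto class that adds the head.
TASK_CLASSES = {
    "classification": "AutoModelForSequenceClassification",
    "ner": "AutoModelForTokenClassification",
}


def build_label_maps(labels):
    """Build the id2label / label2id dictionaries stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_for_finetuning(task, labels):
    """Return (tokenizer, model) with an untrained head for the given task."""
    if task not in TASK_CLASSES:
        raise ValueError(f"unsupported task: {task!r}")

    # Imported lazily; the first call downloads the weights from the Hub.
    import transformers

    id2label, label2id = build_label_maps(labels)
    tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
    model_cls = getattr(transformers, TASK_CLASSES[task])
    model = model_cls.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

For example, `load_for_finetuning("classification", ["negative", "positive"])` would prepare a two-class sentiment model, and `load_for_finetuning("ner", ["O", "B-PER", "I-PER"])` a token-level tagger. Both label sets are hypothetical.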
