hindi-bert

Maintained by: monsoon-nlp

  • Parameter Count: 14.7M
  • Model Type: ELECTRA
  • Framework: PyTorch, TensorFlow
  • Downloads: 3,233
  • Tensor Type: F32

What is hindi-bert?

hindi-bert is an ELECTRA-based language model trained specifically for Hindi natural language processing tasks. Developed by monsoon-nlp, it is one of the first dedicated Hindi language models built on Google Research's ELECTRA architecture. The model was trained on a corpus combining Hindi CommonCrawl (deduplicated by OSCAR) and a recent Hindi Wikipedia dump.
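
If you just want to load the checkpoint and inspect its outputs, a minimal sketch with the Hugging Face transformers library looks like the following; the monsoon-nlp/hindi-bert model ID refers to the repository on the Hugging Face Hub, and the example sentence is purely illustrative.

```python
# Minimal usage sketch with the Hugging Face transformers library
# (pip install transformers torch). "monsoon-nlp/hindi-bert" is the
# model ID on the Hugging Face Hub; the sample sentence is illustrative.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/hindi-bert")
model = AutoModel.from_pretrained("monsoon-nlp/hindi-bert")

# Encode a short Hindi sentence ("Hindi is a language") and run a forward pass.
inputs = tokenizer("हिंदी एक भाषा है", return_tensors="pt")
outputs = model(**inputs)

# The discriminator's contextual embeddings, one vector per token.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```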

Implementation Details

The model uses the ELECTRA architecture with 14.7M parameters and ships with both PyTorch and TensorFlow weights. It features a custom vocabulary built with HuggingFace Tokenizers and includes the discriminator and generator components typical of ELECTRA models. Pretraining data is organized into TFRecords, and training is configurable for both GPU and TPU setups.

  • Custom vocabulary implementation with adjustable size
  • Supports model conversion between PyTorch and TensorFlow
  • Flexible training configuration through configure_pretraining.py
  • Compatible with the SimpleTransformers and ktrain fine-tuning frameworks (see the sketch below)
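
As an illustration of the SimpleTransformers compatibility noted above, the sketch below fine-tunes hindi-bert as a binary classifier. The toy DataFrame, label scheme, and use_cuda setting are assumptions made for the example, not part of the model card.

```python
# A sketch of fine-tuning through SimpleTransformers (pip install
# simpletransformers). The toy DataFrame and binary labels are
# illustrative assumptions, not part of the model card.
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy training data: Hindi movie-review sentences with sentiment labels
# (1 = "this film was wonderful"; 0 = "this film was very bad").
train_df = pd.DataFrame(
    [["यह फिल्म शानदार थी", 1], ["यह फिल्म बहुत खराब थी", 0]],
    columns=["text", "labels"],
)

model = ClassificationModel(
    "electra",                 # model_type for ELECTRA checkpoints
    "monsoon-nlp/hindi-bert",  # weights pulled from the Hugging Face Hub
    num_labels=2,
    use_cuda=False,            # flip to True if a GPU is available
)
model.train_model(train_df)

predictions, raw_outputs = model.predict(["फिल्म अच्छी थी"])  # "the film was good"
print(predictions)
```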

Core Capabilities

  • News Classification: Performance comparable to Multilingual BERT on BBC Hindi news classification
  • Sentiment Analysis: Effective on Hindi movie reviews
  • Question Answering: Supports the MLQA dataset
  • Feature Extraction: Produces contextual embeddings for downstream Hindi NLP tasks (see the pooling sketch below)
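
For the feature-extraction use case, one common approach is to mean-pool the model's last hidden states into fixed-size sentence embeddings. Mean pooling is an assumption made here for illustration, not something the model card prescribes.

```python
# Feature-extraction sketch: mean-pooling hindi-bert's last hidden states
# into fixed-size sentence embeddings. Mean pooling is one common choice,
# assumed here for illustration rather than prescribed by the model card.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/hindi-bert")
model = AutoModel.from_pretrained("monsoon-nlp/hindi-bert")

# Two example sentences: "I like this book" / "the weather is nice today".
sentences = ["मुझे यह किताब पसंद है", "आज मौसम अच्छा है"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Zero out padding positions before averaging over the sequence dimension.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```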

Frequently Asked Questions

Q: What makes this model unique?

This model is one of the first ELECTRA models trained specifically for Hindi, offering a lighter alternative to large multilingual models while maintaining competitive performance on a range of NLP tasks.

Q: What are the recommended use cases?

The model is particularly effective for Hindi text classification, sentiment analysis, and question answering. For broader coverage or stronger downstream performance, the author recommends Google's MuRIL model, or sberbank-ai/mGPT for causal language modeling.
