IndicBERTv2-MLM-only

Property	Value
Parameters	278M
License	MIT
Languages Supported	26
Training Objective	Masked Language Modeling (MLM)

What is IndicBERTv2-MLM-only?

IndicBERTv2-MLM-only is a multilingual language model for Indian languages, developed by AI4Bharat. It is trained on IndicCorp v2 and supports 26 languages, covering Indian languages and English. The model uses a vanilla BERT architecture with Masked Language Modeling (MLM) as its sole training objective.
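To make the MLM objective concrete, here is a minimal sketch of BERT-style token masking in plain Python. It assumes the masking recipe from the original BERT setup (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). The source does not state which rates were used for this model, so the defaults below are illustrative assumptions, and `mask_tokens` is a hypothetical helper name.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Corrupt a token list for MLM training and return (corrupted, labels).

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a loss would be computed only on selected positions.
    The 15% / 80-10-10 rates are the standard BERT values, assumed here.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token          # 80%: replace with mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with random token
        # remaining 10%: keep the original token in place
    return corrupted, labels
```

During training, the model sees the corrupted sequence and is asked to recover the original token at every position where the label is not None.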

Implementation Details

The model is built on the Transformer architecture and implemented in PyTorch. It has 278M parameters and is trained for fill-mask tasks. It handles text in multiple scripts, including Devanagari, Bengali, Malayalam, and Tamil, so a single model covers a wide range of Indian language input.

Trained on the IndicCorp v2 dataset
Supports 26 different languages and their respective scripts
Implements standard BERT architecture with MLM objective
Available through Hugging Face's model hub
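
Since the model is trained for fill-mask tasks and distributed through the Hugging Face model hub, usage could look roughly like the sketch below. It assumes the Hub identifier is `ai4bharat/IndicBERTv2-MLM-only` and that the `transformers` library with a PyTorch backend is installed; the helper names and the `{mask}` placeholder convention are hypothetical, introduced only for illustration.

```python
MODEL_ID = "ai4bharat/IndicBERTv2-MLM-only"  # assumed Hub identifier


def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the single '{mask}' placeholder with the tokenizer's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def top_predictions(template: str, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    The import is done lazily so the helper above can be used without
    transformers installed; calling this function downloads the model.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    text = build_masked_sentence(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]


# Example call (Hindi: "The capital of India is {mask}."):
#   top_predictions("भारत की राजधानी {mask} है।")
```

Reading the mask token from the tokenizer, rather than hard-coding `[MASK]`, keeps the sketch valid even if the model's tokenizer uses a different mask string.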

Core Capabilities

Masked Language Modeling for 26 different languages
Zero-shot cross-lingual transfer
Support for multiple Indian scripts and writing systems
Fine-tuning capabilities for various downstream tasks
Inference endpoints available for production deployment

Frequently Asked Questions

Q: What makes this model unique?

The model's main strength is its coverage of Indian languages and scripts: it supports 26 languages in a single 278M-parameter model designed specifically for Indian languages, which makes it applicable across a range of Indian language processing tasks.

Q: What are the recommended use cases?

The model is well suited to masked word prediction and language understanding across Indian languages. Because it is trained with an MLM objective, it fills in masked tokens rather than generating free-form text continuations. It can be fine-tuned for downstream tasks including NER, paraphrase detection, question answering, and sentiment analysis.
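
As a rough illustration of preparing the model for one such downstream task (sentiment analysis), the sketch below attaches a classification head using the Hugging Face `transformers` auto classes. The Hub identifier `ai4bharat/IndicBERTv2-MLM-only`, the label names, and both helper functions are assumptions for illustration; the classification head is newly initialized and would still need to be trained on labeled data.

```python
MODEL_ID = "ai4bharat/IndicBERTv2-MLM-only"  # assumed Hub identifier


def make_label_maps(labels):
    """Build the id2label / label2id dictionaries stored in the model config."""
    unique = sorted(set(labels))
    id2label = dict(enumerate(unique))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(labels):
    """Load the tokenizer and the encoder with an untrained classification head.

    Lazy import: calling this function requires transformers and downloads
    the model weights.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = make_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Example call with hypothetical sentiment labels:
#   tokenizer, model = load_classifier(["negative", "neutral", "positive"])
```

The returned model can then be passed to a standard training loop; other tasks such as NER or question answering would use the corresponding auto class instead.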