muril-base-cased

Maintained by: google

MuRIL: Multilingual Representations for Indian Languages

  • Model Type: BERT-based Multilingual
  • Developer: Google
  • Languages Supported: 17 Indian languages + English
  • Paper: arXiv:2103.10730
  • Training Data: Wikipedia, Common Crawl, PMINDIA, Dakshina

What is muril-base-cased?

MuRIL (Multilingual Representations for Indian Languages) is a BERT-based model designed specifically for the linguistic diversity of Indian languages. It was pre-trained on 17 Indian languages and their transliterated versions, making it well suited for processing Indian-language content in both native scripts and romanized forms.
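As a quick orientation, the model can be loaded through the Hugging Face `transformers` library under the `google/muril-base-cased` checkpoint. This is a minimal sketch (it assumes `transformers` with a PyTorch backend and network access to download the weights):

```python
# Minimal sketch: load MuRIL and encode a Hindi sentence.
# Assumes the `transformers` library (PyTorch backend) and access
# to the google/muril-base-cased checkpoint on the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

inputs = tokenizer("भारत एक विशाल देश है", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size=768)
```

The base architecture produces one 768-dimensional contextual vector per subword token, which downstream task heads (classification, tagging, QA) consume.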

Implementation Details

The model employs a BERT base architecture trained from scratch on a dataset drawn from the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora. Training incorporates both monolingual and parallel data, including translated and transliterated text pairs. Key technical specifications: 1M training steps, a batch size of 4096, and a maximum sequence length of 512.

  • Uses exponent-smoothed upsampling (exponent 0.3) to boost low-resource language performance
  • Implements whole-word masking with up to 80 masked-token predictions per sequence
  • Incorporates both native script and transliterated training data
  • Supports cross-lingual transfer learning

Core Capabilities

  • Strong performance on tasks like PANX, UDPOS, XNLI, and TyDiQA
  • Significantly improved results on transliterated text compared to mBERT
  • Effective cross-lingual understanding across Indian languages
  • Superior performance in zero-shot learning scenarios
  • Handles both formal and transliterated text effectively

Frequently Asked Questions

Q: What makes this model unique?

MuRIL stands out for its specialized focus on Indian languages and their transliterated versions, achieving significant improvements over mBERT across various NLP tasks. It's particularly notable for handling both native scripts and romanized text effectively.
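One way to see this dual-script coverage directly is through the tokenizer: both Devanagari and romanized Hindi are segmented into subwords from the same vocabulary. A small sketch, assuming the `transformers` library and the `google/muril-base-cased` checkpoint:

```python
# Tokenize the same phrase in native script and in romanized form.
# Assumes `transformers` and access to the google/muril-base-cased checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

native = tokenizer.tokenize("नमस्ते दुनिया")       # Devanagari script
romanized = tokenizer.tokenize("namaste duniya")  # transliterated form

print(native)
print(romanized)
```

Because transliterated text was included during pre-training, the romanized form is not treated as out-of-vocabulary noise, which is what degrades mBERT on the same inputs.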

Q: What are the recommended use cases?

The model is ideal for various NLP tasks in Indian languages, including named entity recognition, part-of-speech tagging, question answering, and natural language inference. It's particularly effective when dealing with mixed-script content and cross-lingual applications.
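For mixed-script and cross-lingual use cases, a common pattern is to mean-pool MuRIL's hidden states into sentence embeddings and compare them. This is a sketch, not a tuned retrieval setup; it assumes `transformers` with PyTorch and download access to the checkpoint, and the example sentences are illustrative:

```python
# Sketch: mean-pooled MuRIL embeddings for a cross-script similarity check.
# Assumes `transformers` + `torch` and the google/muril-base-cased checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")
model.eval()

def embed(text):
    """Return a mean-pooled sentence embedding, masking out padding."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a = embed("मुझे किताबें पढ़ना पसंद है")           # Hindi, native script
b = embed("mujhe kitaabein padhna pasand hai")  # same sentence, romanized
sim = torch.nn.functional.cosine_similarity(a, b).item()
print(f"cosine similarity: {sim:.3f}")
```

For production retrieval or NER, fine-tuning a task head on labeled data will outperform raw pooled embeddings, but this pattern is a quick way to sanity-check cross-script behavior.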
