muril-base-cased

google

BERT-based multilingual model pre-trained on 17 Indian languages, optimized for both native and transliterated text processing, achieving strong cross-lingual performance.

  • Model Type: BERT-based Multilingual
  • Developer: Google
  • Languages Supported: 17 Indian languages + English
  • Paper: arXiv:2103.10730
  • Training Data: Wikipedia, Common Crawl, PMINDIA, Dakshina

What is muril-base-cased?

MuRIL (Multilingual Representations for Indian Languages) is a sophisticated BERT-based model specifically designed to handle the linguistic diversity of Indian languages. It has been pre-trained on 17 Indian languages and their transliterated versions, making it uniquely suited for processing Indian language content in both native scripts and romanized forms.
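A minimal sketch of loading the model through the Hugging Face transformers library and mean-pooling its token outputs into sentence embeddings (the pooling strategy and example sentences are illustrative choices, not prescribed by the model card; the first run downloads the weights):

```python
# Sketch: sentence embeddings from muril-base-cased via transformers.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

# MuRIL handles both native-script and romanized (transliterated) text,
# so the same Hindi sentence is shown in Devanagari and in Latin script.
sentences = ["यह एक उदाहरण है", "yah ek udaharan hai"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one 768-dim vector per sentence,
# ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)
```

Because both scripts were seen during pre-training, the two embeddings land close together in the vector space, which is what makes MuRIL useful for mixed-script retrieval and matching.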

Implementation Details

The model employs a BERT base architecture trained from scratch on a dataset drawn from the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora. Training incorporates both monolingual and parallel data, including translated and transliterated text pairs. Notable technical specifications include training for 1M steps with a batch size of 4096 and a maximum sequence length of 512.

  • Uses modified upsampling with 0.3 exponent value for better low-resource language performance
  • Implements whole word masking with up to 80 predictions
  • Incorporates both native script and transliterated training data
  • Supports cross-lingual transfer learning
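
The exponent-smoothed upsampling in the first bullet can be sketched in plain Python. Only the 0.3 exponent comes from the training recipe; the corpus sizes below are hypothetical, chosen to show how smoothing lifts a low-resource language's sampling share:

```python
# Sketch of exponent-smoothed language sampling: p_i proportional to n_i ** 0.3.
# Corpus sizes are illustrative, not MuRIL's actual data statistics.

def sampling_probs(corpus_sizes, exponent=0.3):
    """Return per-language sampling probabilities p_i ∝ n_i ** exponent."""
    weights = {lang: n ** exponent for lang, n in corpus_sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

sizes = {"hindi": 1_000_000, "sanskrit": 10_000}  # hypothetical token counts
probs = sampling_probs(sizes)
# Proportional sampling would give Sanskrit ~1% of batches;
# the 0.3 exponent lifts it to roughly 20%.
print(round(probs["sanskrit"], 3))
```

With an exponent of 1.0 the function reduces to plain proportional sampling, so the exponent directly controls how aggressively low-resource languages are upsampled.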

Core Capabilities

  • Strong performance on tasks like PANX, UDPOS, XNLI, and TyDiQA
  • Significantly improved results on transliterated text compared to mBERT
  • Effective cross-lingual understanding across Indian languages
  • Superior performance in zero-shot learning scenarios
  • Handles both formal and transliterated text effectively

Frequently Asked Questions

Q: What makes this model unique?

MuRIL stands out for its specialized focus on Indian languages and their transliterated versions, achieving significant improvements over mBERT across various NLP tasks. It's particularly notable for handling both native scripts and romanized text effectively.

Q: What are the recommended use cases?

The model is ideal for various NLP tasks in Indian languages, including named entity recognition, part-of-speech tagging, question answering, and natural language inference. It's particularly effective when dealing with mixed-script content and cross-lingual applications.
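
Since the released checkpoint is a masked language model, a quick way to try it on such tasks is the transformers fill-mask pipeline; the Hindi prompt below is an illustrative example, and downstream tasks like NER or QA would require fine-tuning a task head on top of this checkpoint:

```python
# Sketch: probing muril-base-cased with the fill-mask pipeline
# (downloads the model on first run).
from transformers import pipeline

fill = pipeline("fill-mask", model="google/muril-base-cased")

# Each prediction is a dict with "token_str", "sequence", and "score".
preds = fill("भारत की राजधानी [MASK] है।")  # "The capital of India is [MASK]."
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```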
