IndicBARTSS

Maintained By
ai4bharat

Property        Value
Developer       AI4Bharat
Training Data   452M sentences, 9B tokens
Languages       12 (11 Indian + English)
License         MIT
Paper           arXiv:2109.02903

What is IndicBARTSS?

IndicBARTSS is a multilingual sequence-to-sequence model pre-trained for 11 Indian languages and English. It is based on the mBART architecture and covers all 12 languages in a single model that is smaller than mBART and mT5-base.

Implementation Details

The model is pre-trained with a text-infilling objective similar to mBART's and is used through the Hugging Face Transformers library. Its tokenizer keeps each language in its own script, so no script mapping to or from Devanagari is needed. The model can be loaded with either the AutoModelForSeq2SeqLM or the MBartForConditionalGeneration class.

  • Supports Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, Telugu, and English
  • Trained on the IndicCorp dataset (452 million sentences, 9 billion tokens)
  • Smaller model size compared to mBART and mT5-base
  • Native script support for all languages
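The loading step described above can be sketched as follows. This is a minimal sketch, not the maintainers' reference code: the Hub id `ai4bharat/IndicBARTSS`, the tokenizer options, the two-letter language codes, and the `Sentence </s> <2xx>` input layout are assumptions taken from the upstream model card and should be checked against it. The helper names `format_source` and `load_indicbartss` are hypothetical.

```python
MODEL_ID = "ai4bharat/IndicBARTSS"  # assumed Hugging Face Hub id

# Assumed language codes, in the same order as the language list above.
SUPPORTED_CODES = ("as", "bn", "gu", "hi", "mr", "or", "pa", "kn", "ml", "ta", "te", "en")


def format_source(sentence: str, lang_code: str) -> str:
    """Build an encoder input in the assumed 'Sentence </s> <2xx>' layout."""
    if lang_code not in SUPPORTED_CODES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    return f"{sentence.strip()} </s> <2{lang_code}>"


def load_indicbartss(model_id: str = MODEL_ID):
    """Download and return (tokenizer, model).

    transformers is imported lazily so the string helper above can be used
    without the library installed.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Tokenizer options assumed from the upstream card: keep case and accents,
    # and use the slow (SentencePiece-based) tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, do_lower_case=False, use_fast=False, keep_accents=True
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

A typical call would be `tokenizer, model = load_indicbartss()` followed by `tokenizer(format_source("I am a boy", "en"), add_special_tokens=False, return_tensors="pt")`; passing `add_special_tokens=False` avoids adding a second end-of-sentence marker to text that already contains one.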

Core Capabilities

  • Machine Translation between supported languages
  • Text Summarization
  • Question Generation
  • Text Infilling and Masked Language Modeling
  • Natural Language Generation tasks
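Of the capabilities listed, text infilling is the one the pre-trained checkpoint can do without fine-tuning, so it makes a convenient smoke test. The sketch below assumes a tokenizer and model already loaded through Transformers (for example with AutoModelForSeq2SeqLM), that the mask token is `[MASK]`, and that the decoder is started with the `<2xx>` language tag; these details come from the upstream model card and are not confirmed here. `build_infilling_input` and `infill` are hypothetical helper names.

```python
MASK_TOKEN = "[MASK]"  # assumed mask token of the IndicBARTSS tokenizer


def build_infilling_input(text: str, lang_code: str, blank: str = "___") -> str:
    """Replace a blank marker with the mask token and append the language tag."""
    return f"{text.replace(blank, MASK_TOKEN)} </s> <2{lang_code}>"


def infill(tokenizer, model, text: str, lang_code: str, max_length: int = 20) -> str:
    """Ask the pre-trained model to fill the masked span and return the decoded text."""
    source = build_infilling_input(text, lang_code)
    input_ids = tokenizer(
        source, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    output_ids = model.generate(
        input_ids,
        num_beams=4,
        max_length=max_length,
        early_stopping=True,
        pad_token_id=tokenizer.convert_tokens_to_ids("<pad>"),
        bos_token_id=tokenizer.convert_tokens_to_ids("<s>"),
        eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),
        # Start decoding with the tag of the language to generate in (assumption).
        decoder_start_token_id=tokenizer.convert_tokens_to_ids(f"<2{lang_code}>"),
    )
    return tokenizer.decode(
        output_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
```

For example, `infill(tokenizer, model, "I am ___", "en")` should return the sentence with the blank filled in. Translation, summarization, and question generation generally need task-specific fine-tuning before the output is useful.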

Frequently Asked Questions

Q: What makes this model unique?

IndicBARTSS stands out for its focus on Indian languages, its native script support, and an architecture that needs fewer computational resources than mBART and mT5-base while maintaining high performance. It also supports languages that are not covered by mBART50 and mT5.

Q: What are the recommended use cases?

The model is well suited to building natural language generation applications in Indian languages, including machine translation, text summarization, and question generation. It is a particularly good fit for organizations that work with several Indian languages and need a single, efficient model.
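Use cases such as translation require fine-tuning on paired data. One way to compute a training loss for a single sentence pair is sketched below. It assumes the source is formatted as `Sentence </s> <2xx>` and the target as `<2yy> Sentence </s>`, as described on the upstream model card, and that a tokenizer and model have already been loaded through Transformers. `make_training_pair` and `training_loss` are hypothetical helper names, and a real training loop would add batching, padding, and an optimizer.

```python
def make_training_pair(src_text: str, src_code: str, tgt_text: str, tgt_code: str):
    """Return (source, target) strings in the assumed IndicBARTSS layout."""
    source = f"{src_text} </s> <2{src_code}>"
    target = f"<2{tgt_code}> {tgt_text} </s>"
    return source, target


def training_loss(tokenizer, model, src_text, src_code, tgt_text, tgt_code):
    """Compute the teacher-forced cross-entropy loss for one sentence pair."""
    source, target = make_training_pair(src_text, src_code, tgt_text, tgt_code)
    input_ids = tokenizer(
        source, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    target_ids = tokenizer(
        target, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    # The decoder reads the target without its last token and is trained to
    # predict the target shifted left by one position.
    outputs = model(
        input_ids=input_ids,
        decoder_input_ids=target_ids[:, :-1],
        labels=target_ids[:, 1:],
    )
    return outputs.loss
```

Calling `training_loss(...).backward()` inside an optimizer step would give the simplest possible fine-tuning loop for an English-to-Hindi pair.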
