bert-fa-base-uncased

Maintained By
HooshvareLab

ParsBERT (v2.0)

License: Apache-2.0
Paper: arXiv:2005.12515
Training Corpus Size: 3.9M documents, 73M sentences, 1.3B words
Language: Persian

What is bert-fa-base-uncased?

ParsBERT is a state-of-the-art monolingual language model specifically designed for Persian language understanding. Based on Google's BERT architecture, this model has been pre-trained on a diverse collection of Persian corpora including scientific texts, novels, news articles, and various other sources. The model represents a significant advancement in Persian natural language processing, offering superior performance across multiple downstream tasks.
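
A minimal loading sketch using the Hugging Face Transformers library (assuming the Hub ID HooshvareLab/bert-fa-base-uncased and a recent transformers/PyTorch install; the sample sentence is illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the ParsBERT tokenizer and encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")

# Encode a short Persian sentence ("The Persian language is beautiful.")
# and extract the contextual embeddings.
inputs = tokenizer("زبان فارسی زیباست", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```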

Implementation Details

The model follows Google's BERT base architecture and is available for both PyTorch and TensorFlow 2.0 through the Hugging Face Transformers library. Its pre-training pipeline features extensive pre-processing that combines POS tagging and WordPiece segmentation, tailored to the characteristics of Persian.

  • Comprehensive training on diverse Persian corpora, including Persian Wikipedia dumps and MirasText
  • Masked language modeling accuracy of 68.66%
  • Next sentence prediction accuracy of 100%
  • Supports both masked language modeling and next sentence prediction (see the fill-mask sketch after this list)
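
As a rough illustration of the pre-trained masked language modeling head, a fill-mask sketch (checkpoint ID as above; the example sentence is illustrative, not a reported result):

```python
from transformers import pipeline

# The fill-mask pipeline runs the pre-trained MLM head; [MASK] is BERT's mask token.
fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-fa-base-uncased")

# "Tehran is the [MASK] of Iran." -- print the top candidate tokens and scores.
for prediction in fill_mask("تهران [MASK] ایران است."):
    print(prediction["token_str"], round(prediction["score"], 4))
```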

Core Capabilities

  • Sentiment Analysis: up to 92.42% accuracy on binary classification tasks
  • Text Classification: up to 97.44% accuracy on news classification
  • Named Entity Recognition: up to 99.84% on the ARMAN dataset
  • Supports fine-tuning for various downstream tasks (a fine-tuning sketch follows this list)
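
A hedged fine-tuning sketch for binary sentiment classification. The two-example toy dataset, label scheme, and hyperparameters below are placeholders for demonstration, not the authors' training recipe:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "HooshvareLab/bert-fa-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Two labels for binary sentiment (1 = positive, 0 = negative);
# a fresh classification head is initialized on top of ParsBERT.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy corpus standing in for a real labeled Persian sentiment dataset:
# "this film was great" / "this film was very bad".
data = Dataset.from_dict({
    "text": ["این فیلم عالی بود", "این فیلم خیلی بد بود"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="parsbert-sentiment-demo",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to="none",
    ),
    train_dataset=data,
)
trainer.train()
```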

Frequently Asked Questions

Q: What makes this model unique?

ParsBERT stands out for its extensive pre-training on Persian-specific content and its superior performance over multilingual BERT and other models on Persian language tasks. At release, it set state-of-the-art results across multiple Persian NLP benchmarks.

Q: What are the recommended use cases?

The model is particularly well-suited for Persian language processing tasks including sentiment analysis, text classification, and named entity recognition. It's designed to be fine-tuned for specific downstream tasks while maintaining strong base performance.
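
For example, a fine-tuned variant can be served through a token-classification pipeline. The checkpoint ID below is illustrative, not confirmed by this card; check the HooshvareLab Hub page for the actual fine-tuned NER models:

```python
from transformers import pipeline

# Illustrative only: assumes a fine-tuned NER variant published alongside
# ParsBERT; verify the exact checkpoint ID on the HooshvareLab Hub page.
ner = pipeline(
    "ner",
    model="HooshvareLab/bert-fa-base-uncased-ner-arman",
    aggregation_strategy="simple",
)

# "Tehran is the capital of Iran." -- print the detected entity spans.
for entity in ner("تهران پایتخت ایران است."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```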
