phobert-base-v2

by vinai

PhoBERT-base-v2 is a state-of-the-art Vietnamese language model with 135M parameters, trained on 140GB of Vietnamese text and optimized for downstream NLP tasks such as POS tagging, NER, and natural language inference.

  • Parameter Count: 135M
  • Architecture: RoBERTa Base
  • Maximum Length: 256 tokens
  • License: AGPL-3.0
  • Training Data: 140GB (Wikipedia, News, OSCAR-2301)

What is PhoBERT-base-v2?

PhoBERT-base-v2 is an advanced Vietnamese language model that builds upon the success of the original PhoBERT architecture. Built on RoBERTa, which refines BERT's pre-training procedure, this model represents a significant step forward in Vietnamese natural language processing. It is trained on an extensive 140GB dataset, combining 20GB of Wikipedia and news texts with 120GB from OSCAR-2301, making it one of the most comprehensively trained Vietnamese language models available.

Implementation Details

The model implements a RoBERTa-based architecture with 135M parameters, designed specifically for Vietnamese language understanding. It requires word-segmented input and integrates seamlessly with the Hugging Face transformers library. The model supports both PyTorch and TensorFlow 2.0+ implementations.

  • Pre-trained on a massive 140GB Vietnamese text corpus
  • Implements RoBERTa's optimized training approach
  • Supports maximum sequence length of 256 tokens
  • Requires specialized Vietnamese word segmentation preprocessing
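The points above can be sketched in code. This is a minimal feature-extraction example using the Hugging Face transformers library, following the standard AutoModel/AutoTokenizer pattern; the sample sentence is illustrative, and the input is assumed to be already word-segmented (multi-syllable words joined with underscores):

```python
# Minimal sketch: load PhoBERT-base-v2 via Hugging Face transformers
# and extract contextual features for one pre-segmented sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("vinai/phobert-base-v2")

# PhoBERT expects word-segmented input: multi-syllable Vietnamese words
# are joined with underscores (e.g. "nghiên_cứu_viên").
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    outputs = model(input_ids)

# last_hidden_state has shape (batch, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```

Inputs longer than the 256-token maximum must be truncated or split before encoding.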

Core Capabilities

  • Part-of-speech tagging with state-of-the-art accuracy
  • Dependency parsing for Vietnamese text
  • Named-entity recognition
  • Natural language inference
  • Fill-mask prediction tasks
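The fill-mask capability can be exercised directly with the transformers pipeline API. A hedged sketch, assuming the default RoBERTa-style mask token `<mask>` and an illustrative word-segmented prompt:

```python
# Sketch: fill-mask prediction with PhoBERT-base-v2.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="vinai/phobert-base-v2")

# Word-segmented prompt; "<mask>" marks the position to predict.
results = fill_mask("Hà_Nội là thủ_đô của <mask> .")

for r in results:
    print(r["token_str"], round(r["score"], 4))
```

Each result carries a candidate token and a softmax probability; the list is sorted from most to least likely.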

Frequently Asked Questions

Q: What makes this model unique?

PhoBERT-base-v2 stands out for its extensive training on Vietnamese-specific data and its optimization using RoBERTa's approach. It's specifically designed for Vietnamese language processing and achieves state-of-the-art performance across multiple NLP tasks.

Q: What are the recommended use cases?

The model is ideal for Vietnamese language processing tasks including part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. It requires word-segmented input, and it's recommended to use the RDRSegmenter from VnCoreNLP for preprocessing raw text.
