phobert-base-v2

by vinai

PhoBERT-base-v2 is a state-of-the-art Vietnamese language model with 135M parameters, trained on 140GB of Vietnamese text and optimized for downstream NLP tasks such as POS tagging, NER, and natural language inference.

  • Parameter Count: 135M
  • Architecture: RoBERTa Base
  • Maximum Length: 256 tokens
  • License: AGPL-3.0
  • Training Data: 140GB (Wikipedia, News, OSCAR-2301)

What is PhoBERT-base-v2?

PhoBERT-base-v2 is an advanced Vietnamese language model that builds upon the success of the original PhoBERT architecture. Built on RoBERTa, which refines BERT's pre-training procedure, this model represents a significant step forward in Vietnamese natural language processing. It is trained on an extensive 140GB dataset, combining 20GB of Wikipedia and news texts with 120GB from OSCAR-2301, making it one of the most comprehensively trained Vietnamese language models available.

Implementation Details

The model implements a RoBERTa-based architecture with 135M parameters, designed specifically for Vietnamese language understanding. It requires word-segmented input and integrates seamlessly with the Hugging Face transformers library. The model supports both PyTorch and TensorFlow 2.0+ implementations.

  • Pre-trained on a massive 140GB Vietnamese text corpus
  • Implements RoBERTa's optimized training approach
  • Supports maximum sequence length of 256 tokens
  • Requires specialized Vietnamese word segmentation preprocessing
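The points above can be sketched in code. This is a minimal feature-extraction example using the Hugging Face transformers library, following the standard AutoModel/AutoTokenizer pattern; the sample sentence is illustrative, and the input is assumed to be already word-segmented (multi-syllable words joined with underscores):

```python
# Minimal sketch: load PhoBERT-base-v2 via Hugging Face transformers
# and extract contextual features for one pre-segmented sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("vinai/phobert-base-v2")

# PhoBERT expects word-segmented input: multi-syllable Vietnamese words
# are joined with underscores (e.g. "nghiên_cứu_viên").
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    outputs = model(input_ids)

# last_hidden_state has shape (batch, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```

Inputs longer than the 256-token maximum must be truncated or split before encoding.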

Core Capabilities

  • Part-of-speech tagging with state-of-the-art accuracy
  • Dependency parsing for Vietnamese text
  • Named-entity recognition
  • Natural language inference
  • Fill-mask prediction tasks
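The fill-mask capability can be exercised directly with the transformers pipeline API. A hedged sketch, assuming the default RoBERTa-style mask token `<mask>` and an illustrative word-segmented prompt:

```python
# Sketch: fill-mask prediction with PhoBERT-base-v2.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="vinai/phobert-base-v2")

# Word-segmented prompt; "<mask>" marks the position to predict.
results = fill_mask("Hà_Nội là thủ_đô của <mask> .")

for r in results:
    print(r["token_str"], round(r["score"], 4))
```

Each result carries a candidate token and a softmax probability; the list is sorted from most to least likely.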

Frequently Asked Questions

Q: What makes this model unique?

PhoBERT-base-v2 stands out for its extensive training on Vietnamese-specific data and its optimization using RoBERTa's approach. It's specifically designed for Vietnamese language processing and achieves state-of-the-art performance across multiple NLP tasks.

Q: What are the recommended use cases?

The model is ideal for Vietnamese language processing tasks including part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. It requires word-segmented input, and it's recommended to use the RDRSegmenter from VnCoreNLP for preprocessing raw text.
