CamemBERTav2-base
| Property | Value |
|---|---|
| Parameter Count | 111M |
| License | MIT |
| Language | French |
| Paper | View Paper |
| Training Data | 275B tokens |
What is camembertav2-base?
CamemBERTav2-base is an advanced French language model and a significant evolution over its predecessor. Built on the DeBERTaV2 architecture, it was trained on 275B tokens of French text combining OSCAR dumps, scientific documents from HALvest, and the French Wikipedia.
Implementation Details
The model introduces several technical improvements over its predecessor (a minimal loading sketch follows the list):
- Extended context window of 1024 tokens
- New WordPiece tokenizer with a 32,768-token vocabulary
- Improved number handling and emoji support
- Trained using Replaced Token Detection (RTD) with 20% mask rate
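As a quick orientation, the sketch below loads the model with the Hugging Face `transformers` library and extracts contextual embeddings. The repository id `almanach/camembertav2-base` is assumed here and should be checked against the actual model page.

```python
# Minimal sketch: load CamemBERTav2-base and extract contextual embeddings.
# Assumes the Hugging Face repo id "almanach/camembertav2-base"; adjust if needed.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "almanach/camembertav2-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "Le camembert est un fromage normand à pâte molle."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One hidden-state vector per sub-word token (hidden size 768 for the base model).
print(outputs.last_hidden_state.shape)
```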
Core Capabilities
- State-of-the-art performance in POS tagging (97.71%)
- Superior NER capabilities (93.40% on FTB-NER)
- Excellent performance on XNLI (84.82%)
- Advanced question answering capabilities (83.04% F1 on FQuAD)
- Enhanced medical NER performance (73.98%)
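The figures above come from fine-tuned variants of the model. As an illustrative sketch (not the authors' exact evaluation setup), the base checkpoint can be combined with a token-classification head through `transformers`; the label set below is a placeholder rather than the FTB-NER tag set.

```python
# Illustrative sketch: attach a token-classification (NER) head to the base model.
# The label list is a placeholder; replace it with your dataset's tag set.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "almanach/camembertav2-base"  # assumed repo id
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized; fine-tune on labelled French
# NER data (e.g. with the Trainer API) before using the model for predictions.
```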
Frequently Asked Questions
Q: What makes this model unique?
CamemBERTav2 stands out due to its massive training dataset (275B tokens vs previous 32B), improved tokenizer design, and state-of-the-art performance across multiple French NLP tasks. It's particularly notable for its balanced performance across both general and specialized domains.
Q: What are the recommended use cases?
The model excels in various NLP tasks including POS tagging, named entity recognition, text classification, and question answering. It's particularly well-suited for both general French language processing and specialized domains like medical text analysis.
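As a hedged illustration of the question-answering use case, the sketch below attaches a span-prediction head to the base checkpoint (repo id assumed as above). The head is untrained here, so a real application would fine-tune it on a French QA dataset such as FQuAD before relying on its predictions.

```python
# Illustrative sketch: extractive question answering with a span-prediction head.
# The QA head is randomly initialized; scores like the FQuAD F1 quoted above
# require fine-tuning on a French QA dataset first.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "almanach/camembertav2-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

question = "Quelle est la capitale de la France ?"
context = "Paris est la capitale de la France."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# After fine-tuning, the argmax of the start/end logits delimits the answer span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```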