CamemBERTav2-base
| Property | Value |
|---|---|
| Parameter Count | 111M |
| License | MIT |
| Language | French |
| Paper | View Paper |
| Training Data | 275B tokens |
What is camembertav2-base?
CamemBERTav2-base is an advanced French language model and a significant evolution over its predecessor. Built on the DeBERTaV2 architecture, it was trained on 275B tokens of French text combining OSCAR dumps, scientific documents from HALvest, and the French Wikipedia.
Implementation Details
The model introduces several technical improvements over its predecessor (a minimal loading sketch follows the list):
- Extended context window of 1024 tokens
- New WordPiece tokenizer with a 32,768-token vocabulary
- Improved number handling and emoji support
- Trained using Replaced Token Detection (RTD) with 20% mask rate
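As a quick orientation, the sketch below loads the model with the Hugging Face `transformers` library and extracts contextual embeddings. The repository id `almanach/camembertav2-base` is assumed here and should be checked against the actual model page.

```python
# Minimal sketch: load CamemBERTav2-base and extract contextual embeddings.
# Assumes the Hugging Face repo id "almanach/camembertav2-base"; adjust if needed.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "almanach/camembertav2-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "Le camembert est un fromage normand à pâte molle."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One hidden-state vector per sub-word token (hidden size 768 for the base model).
print(outputs.last_hidden_state.shape)
```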
Core Capabilities
- State-of-the-art performance in POS tagging (97.71%)
- Superior NER capabilities (93.40% on FTB-NER)
- Excellent performance on XNLI (84.82%)
- Advanced question answering capabilities (83.04% F1 on FQuAD)
- Enhanced medical NER performance (73.98%)
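The figures above come from fine-tuned variants of the model. As an illustrative sketch (not the authors' exact evaluation setup), the base checkpoint can be combined with a token-classification head through `transformers`; the label set below is a placeholder rather than the FTB-NER tag set.

```python
# Illustrative sketch: attach a token-classification (NER) head to the base model.
# The label list is a placeholder; replace it with your dataset's tag set.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "almanach/camembertav2-base"  # assumed repo id
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized; fine-tune on labelled French
# NER data (e.g. with the Trainer API) before using the model for predictions.
```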
Frequently Asked Questions
Q: What makes this model unique?
CamemBERTav2 stands out due to its massive training dataset (275B tokens vs previous 32B), improved tokenizer design, and state-of-the-art performance across multiple French NLP tasks. It's particularly notable for its balanced performance across both general and specialized domains.
Q: What are the recommended use cases?
The model excels in various NLP tasks including POS tagging, named entity recognition, text classification, and question answering. It's particularly well-suited for both general French language processing and specialized domains like medical text analysis.
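As a hedged illustration of the question-answering use case, the sketch below attaches a span-prediction head to the base checkpoint (repo id assumed as above). The head is untrained here, so a real application would fine-tune it on a French QA dataset such as FQuAD before relying on its predictions.

```python
# Illustrative sketch: extractive question answering with a span-prediction head.
# The QA head is randomly initialized; scores like the FQuAD F1 quoted above
# require fine-tuning on a French QA dataset first.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "almanach/camembertav2-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

question = "Quelle est la capitale de la France ?"
context = "Paris est la capitale de la France."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# After fine-tuning, the argmax of the start/end logits delimits the answer span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```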