camembertav2-base

Maintained by: almanach

Property          Value
Parameter Count   111M
License           MIT
Paper             View Paper
Architecture      DebertaV2
Training Data     275B tokens

What is camembertav2-base?

CamemBERTav2-base is a French language model built on the DebertaV2 architecture. It was trained on 275B tokens of French text, which makes it one of the most extensively trained French language models available.

Implementation Details

The model is pretrained with the Replaced Token Detection (RTD) objective at a 20% mask rate, and training ran on 32 H100 GPUs. It uses a newly designed WordPiece tokenizer with a 32,768-token vocabulary and an extended context window of 1024 tokens.
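To make the RTD objective concrete, here is a minimal sketch of how a training example could be built: about 20% of positions are replaced, and the model's job is to label each token as original (0) or replaced (1). This is an illustration only. In ELECTRA-style RTD the replacements come from a small generator network; the random sampler below is a stand-in for it, and the function name `make_rtd_example` and the toy vocabulary are hypothetical.

```python
import random

def make_rtd_example(tokens, vocab, mask_rate=0.2, seed=0):
    """Corrupt a token sequence and return (corrupted, labels).

    labels[i] is 1 where the token was replaced and 0 where it is original.
    """
    rng = random.Random(seed)
    n_replace = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_replace))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i in positions:
            # Stand-in for the generator: pick a random *different* token.
            # (A real generator may also propose the original token, in
            # which case the position is labeled as original.)
            candidates = [v for v in vocab if v != tok]
            corrupted.append(rng.choice(candidates))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = ["le", "chat", "dort", "sur", "le", "canapé", "du", "salon", "ce", "soir"]
vocab = ["le", "chat", "chien", "dort", "mange", "sur", "canapé", "du", "salon", "ce", "soir"]
corrupted, labels = make_rtd_example(tokens, vocab)
print(list(zip(corrupted, labels)))
```

With ten tokens and a 20% rate, two positions are replaced; the discriminator is then trained to predict `labels` from `corrupted`.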

  • Utilizes a combination of OSCAR, HALvest, and French Wikipedia datasets
  • Implements improved number handling with two-digit token splitting
  • Supports emojis and special characters including newline and tab
  • Compatible with DebertaV2TokenizerFast from the transformers library
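
The two-digit number handling mentioned above can be pictured with a small pre-tokenization sketch. It assumes digits are chunked left to right into groups of at most two; the real tokenizer's exact rule (direction, handling of separators) is not specified here, and `split_numbers` is a hypothetical helper, not part of the transformers library.

```python
import re

def split_numbers(text, width=2):
    """Insert spaces so every run of digits is cut into chunks of `width`."""
    def chunk(match):
        digits = match.group(0)
        # Assumption: chunks are taken left to right.
        return " ".join(digits[i:i + width] for i in range(0, len(digits), width))
    return re.sub(r"[0-9]+", chunk, text)

print(split_numbers("En 2024"))  # En 20 24
print(split_numbers("12345"))    # 12 34 5
```

Splitting this way keeps the set of numeric tokens small (at most 110 distinct one- and two-digit pieces) instead of spending vocabulary entries on arbitrary long numbers.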

Core Capabilities

  • Achieves state-of-the-art performance on multiple French NLP tasks
  • Excels in POS tagging (97.71% accuracy)
  • Strong performance in Named Entity Recognition (93.40% on FTB-NER)
  • Superior results in question answering (83.04% F1 score on FQuAD)
  • Effective for text classification and natural language inference tasks
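
For readers unfamiliar with the question answering metric, the sketch below shows a SQuAD-style token-overlap F1 between a predicted answer and a reference answer. It assumes the FQuAD score quoted above is computed in this style; official evaluation scripts also normalize punctuation and articles, which this simplified, hypothetical `token_f1` function omits.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("la tour Eiffel", "tour Eiffel"))
```

In this example the prediction contains both reference tokens plus one extra, so precision is 2/3, recall is 1, and F1 is 0.8. A dataset-level score averages this value over all questions.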

Frequently Asked Questions

Q: What makes this model unique?

CamemBERTav2-base stands out for its much larger training dataset (275B tokens, up from the previous 32B), its redesigned tokenizer, and its state-of-the-art results across several French NLP tasks.

Q: What are the recommended use cases?

The model is suited to French language tasks including POS tagging, dependency parsing, named entity recognition, text classification, and question answering. It fits both academic and industrial applications that require robust French language understanding.
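
As a starting point for these use cases, the following sketch loads the encoder with the transformers library and extracts contextual embeddings. The Hub identifier `almanach/camembertav2-base` is an assumption pieced together from the maintainer and model names on this page, and `load_encoder` and `embed` are hypothetical helper names. For a downstream task you would typically swap `AutoModel` for a task-specific class and fine-tune.

```python
MODEL_ID = "almanach/camembertav2-base"  # assumed Hub ID, verify before use

def load_encoder(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and base encoder."""
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    return tokenizer, model

def embed(sentence, tokenizer, model, max_length=1024):
    """Return per-token hidden states with shape (1, seq_len, hidden_size)."""
    import torch
    inputs = tokenizer(sentence, return_tensors="pt",
                       truncation=True, max_length=max_length)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state

if __name__ == "__main__":
    tok, enc = load_encoder()
    states = embed("Le camembert est un fromage normand.", tok, enc)
    print(states.shape)
```

The `max_length=1024` default mirrors the context window stated above; `AutoTokenizer` is expected to resolve to the `DebertaV2TokenizerFast` class the page mentions.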
