# en_core_web_trf
| Property | Value |
|---|---|
| License | MIT |
| Author | Explosion AI |
| Base Architecture | RoBERTa-base |
| spaCy Compatibility | ≥3.7.2, <3.8.0 |
## What is en_core_web_trf?
en_core_web_trf is spaCy's transformer-based English pipeline, built on the RoBERTa-base architecture. It is the most accurate of spaCy's English pipelines, trading throughput and memory footprint for state-of-the-art accuracy across tagging, parsing, and named entity recognition.
## Implementation Details
The model is built on the RoBERTa-base transformer architecture with byte-level BPE tokenization and a vocabulary of 50,265 tokens. The pipeline comprises transformer, tagger, parser, attribute ruler, lemmatizer, and named entity recognizer components.
- Transformer Configuration: 768-dimensional embeddings, with text processed in spans using a 144-token window
- Named Entity Recognition F-score: 90.19%
- Part-of-Speech Tagging Accuracy: 98.13%
- Dependency Parsing (LAS): 93.91%
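Assuming the pipeline is installed (`pip install spacy` plus `python -m spacy download en_core_web_trf`), the annotations above can be pulled out of a processed `Doc` as in this minimal sketch; the `extract_entities` helper is illustrative, not part of spaCy's API:

```python
def extract_entities(nlp, text):
    """Run `text` through a loaded spaCy pipeline and collect
    (entity text, entity label) pairs from the resulting Doc."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Usage, once the model has been downloaded:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")
#   extract_entities(nlp, "Apple is buying a U.K. startup for $1 billion.")
```

The same `doc` object also exposes the tagger and parser output via `token.pos_`, `token.dep_`, and `token.lemma_` on each token.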
## Core Capabilities
- Named Entity Recognition with 18 entity types
- Part-of-Speech Tagging with 50+ tag classes
- Dependency Parsing with 45 dependency labels
- Sentence Boundary Detection (90.11% F-score)
- Lemmatization and Attribute Assignment
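The 18 entity types follow the OntoNotes 5 annotation scheme. As a quick reference, the label set is sketched below (descriptions paraphrased from the OntoNotes guidelines; worth verifying against `nlp.get_pipe("ner").labels` on your installed copy):

```python
# OntoNotes 5 entity labels used by the pipeline's NER component.
ONTONOTES_ENTITY_TYPES = [
    "PERSON",       # People, including fictional
    "NORP",         # Nationalities, religious and political groups
    "FAC",          # Buildings, airports, highways, bridges
    "ORG",          # Companies, agencies, institutions
    "GPE",          # Countries, cities, states
    "LOC",          # Non-GPE locations: mountain ranges, bodies of water
    "PRODUCT",      # Objects, vehicles, foods (not services)
    "EVENT",        # Named hurricanes, battles, wars, sports events
    "WORK_OF_ART",  # Titles of books, songs, etc.
    "LAW",          # Named documents made into laws
    "LANGUAGE",     # Any named language
    "DATE",         # Absolute or relative dates or periods
    "TIME",         # Times smaller than a day
    "PERCENT",      # Percentages, including "%"
    "MONEY",        # Monetary values, including unit
    "QUANTITY",     # Measurements, e.g. weight or distance
    "ORDINAL",      # "first", "second", etc.
    "CARDINAL",     # Numerals not covered by another type
]

assert len(ONTONOTES_ENTITY_TYPES) == 18
```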
## Frequently Asked Questions
Q: What makes this model unique?
A: This model stands out for its exceptional accuracy across multiple NLP tasks, particularly in POS tagging (98.13%) and NER (90.19% F-score). It's built on the robust RoBERTa architecture and trained on high-quality datasets including OntoNotes 5.
Q: What are the recommended use cases?
A: The model excels in production environments requiring high-accuracy language understanding, including document analysis, information extraction, and text analytics. It's particularly suitable for applications needing precise entity recognition, syntactic analysis, or detailed linguistic annotation.
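For production throughput, spaCy's `nlp.pipe` streams texts through the transformer in batches rather than one document at a time. A sketch of that pattern (the `annotate` helper and the batch size are illustrative choices, not spaCy defaults):

```python
def annotate(nlp, texts, batch_size=32):
    """Stream texts through a loaded spaCy pipeline in batches and
    yield a lightweight dict of annotations per document."""
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield {
            "entities": [(ent.text, ent.label_) for ent in doc.ents],
            "tokens": [(tok.text, tok.pos_) for tok in doc],
        }

# Usage, once the model has been downloaded:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")
#   results = list(annotate(nlp, ["First document.", "Second document."]))
```

Batching matters more for transformer pipelines than for the smaller CNN models, since the GPU is used most efficiently when spans from many documents are processed together.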