# en_core_web_lg
| Property | Value |
|---|---|
| License | MIT |
| Vector Dimensions | 300 |
| Vocabulary Size | 514,157 words |
| Author | Explosion |
## What is en_core_web_lg?
en_core_web_lg is a comprehensive English language processing model developed by Explosion for the spaCy library, optimized for CPU performance. It is a large-scale model that combines multiple NLP capabilities, including tokenization, part-of-speech tagging, dependency parsing, and named entity recognition (NER).
## Implementation Details
The model implements a pipeline architecture with the components tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, and ner. It ships 514,157 unique word vectors with 300 dimensions, trained on a diverse corpus including Wikipedia, OSCAR, OpenSubtitles, and WMT News Crawl. Reported accuracy figures include:
- Token Classification Accuracy: 99.86%
- Part-of-Speech Tagging Accuracy: 97.35%
- Named Entity Recognition F-Score: 85.43%
- Dependency Parsing (UAS): 92.08%
## Core Capabilities
- Advanced Named Entity Recognition with 18 entity types
- High-accuracy Part-of-Speech tagging with 50+ tags
- Dependency parsing with 45 dependency labels
- Sentence boundary detection with 90.71% F-score
- Large vocabulary with comprehensive word vectors
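The word vectors are what make semantic comparison possible: two words or documents are "similar" when the cosine of the angle between their vectors is close to 1. A self-contained sketch of that math, using toy 3-dimensional vectors in place of the model's 300-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for 300-dim embeddings
# (values are made up for illustration):
cat = [0.9, 0.1, 0.3]
kitten = [0.85, 0.15, 0.35]
car = [0.1, 0.9, 0.2]

print(cosine_similarity(cat, kitten))  # close to 1.0
print(cosine_similarity(cat, car))     # noticeably lower
```

The same computation, applied to the model's real vectors, underlies spaCy's `similarity` methods.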
## Frequently Asked Questions
**Q: What makes this model unique?**
The model stands out for its comprehensive feature set and large vocabulary, making it ideal for production environments requiring accurate English language processing. Its CPU optimization allows for efficient processing without requiring specialized hardware.
**Q: What are the recommended use cases?**
This model is particularly well-suited to applications that need detailed text analysis: information extraction, content categorization, semantic analysis, and other NLP tasks that rely on word vectors. It is a good fit for production environments where accuracy is crucial.