# en_core_web_lg
| Property | Value |
|---|---|
| License | MIT |
| Vector Dimensions | 300 |
| Vocabulary Size | 514,157 words |
| Author | Explosion |
## What is en_core_web_lg?
en_core_web_lg is a comprehensive English language pipeline for the spaCy library, developed by Explosion and optimized for CPU usage. This large model combines static word vectors with a full set of NLP components, achieving high accuracy across a range of linguistic tasks.
## Implementation Details
The model implements a pipeline architecture with seven core components: tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, and ner (named entity recognition). It ships with 514,157 unique word vectors of 300 dimensions, trained on a diverse corpus including OSCAR 2109, Wikipedia, OpenSubtitles, and WMT News Crawl. Reported evaluation metrics:
- Token Classification Accuracy: 99.86%
- Part-of-speech Tagging Accuracy: 97.35%
- Named Entity Recognition F-Score: 85.43%
- Dependency Parsing Accuracy (UAS): 92.08%
## Core Capabilities
- Advanced Named Entity Recognition with 18 entity types
- Comprehensive part-of-speech tagging with 50+ tag categories
- Dependency parsing with 45 label types
- Sentence segmentation with 90.71% F-score
- Lemmatization and rule-based token attribute mapping (attribute_ruler)
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's uniqueness lies in its combination of extensive vocabulary coverage (514K words), high accuracy across multiple NLP tasks, and CPU optimization, making it suitable for production environments without requiring GPU resources.
**Q: What are the recommended use cases?**
This model is ideal for production applications that require comprehensive English language processing, including text analysis, information extraction, content categorization, and linguistic annotation. It is particularly well suited to workloads that depend on accurate named entity recognition, dependency parsing, and part-of-speech tagging.