en_core_web_lg

Maintained By: spaCy

Property            Value
License             MIT
Vector Dimensions   300
Vocabulary Size     514,157 words
Author              Explosion

What is en_core_web_lg?

en_core_web_lg is a large English pipeline for the spaCy library, published by Explosion and optimized for CPU. It combines several NLP capabilities in one package: tokenization, part-of-speech tagging, dependency parsing, and named entity recognition (NER).
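As a rough illustration of these capabilities, the sketch below loads the pipeline and collects per-token annotations and entities. It assumes spaCy and the en_core_web_lg package are installed; the helper names load_pipeline, token_rows, entity_rows, and demo are made up for this example.

```python
from typing import List, Tuple


def load_pipeline(name: str = "en_core_web_lg"):
    """Load the packaged pipeline; the import is deferred because spaCy is heavy."""
    # Assumes: pip install spacy && python -m spacy download en_core_web_lg
    import spacy

    return spacy.load(name)


def token_rows(doc) -> List[Tuple[str, str, str, str]]:
    """Return (text, coarse POS tag, dependency label, head text) per token."""
    return [(tok.text, tok.pos_, tok.dep_, tok.head.text) for tok in doc]


def entity_rows(doc) -> List[Tuple[str, str]]:
    """Return (entity text, entity label) for each named entity in the doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo() -> None:
    """Hypothetical entry point: annotate one sentence and print the results."""
    nlp = load_pipeline()
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
    for row in token_rows(doc):
        print(row)
    print(entity_rows(doc))
```

Calling demo() prints one tuple per token followed by the list of detected entities; the exact tags and labels depend on the installed model version.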

Implementation Details

The pipeline consists of the components tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, and ner. It ships 514,157 unique word vectors with 300 dimensions each; the vectors were trained on a mixed corpus of Wikipedia, OSCAR, OpenSubtitles, and WMT News Crawl. Reported evaluation scores are:

  • Tokenization Accuracy: 99.86%
  • Part-of-Speech Tagging Accuracy: 97.35%
  • Named Entity Recognition F-Score: 85.43%
  • Dependency Parsing, Unlabeled Attachment Score (UAS): 92.08%
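The component list, vector table size, and scores above can be read back from a loaded pipeline. The following is a minimal sketch, assuming a spaCy v3 pipeline object; the function name describe_pipeline and the score keys in SCORE_KEYS are assumptions for illustration, and any key missing from the metadata comes back as None.

```python
from typing import Any, Dict

# Score keys as they are assumed to appear in spaCy v3 pipeline metadata.
SCORE_KEYS = ("token_acc", "tag_acc", "ents_f", "dep_uas", "sents_f")


def describe_pipeline(nlp) -> Dict[str, Any]:
    """Summarize components, vector table shape, and stored evaluation scores."""
    n_vectors, n_dims = nlp.vocab.vectors.shape  # (rows, dimensions)
    performance = nlp.meta.get("performance", {})
    return {
        "active_components": list(nlp.pipe_names),
        "disabled_components": list(nlp.disabled),
        "vectors": n_vectors,
        "dimensions": n_dims,
        "scores": {key: performance.get(key) for key in SCORE_KEYS},
    }
```

With the real model this would be called as describe_pipeline(spacy.load("en_core_web_lg")). Some components may be shipped disabled by default, in which case they show up under disabled_components rather than active_components.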

Core Capabilities

  • Named entity recognition with 18 entity types
  • Part-of-speech tagging with 50+ tags
  • Dependency parsing with 45 dependency labels
  • Sentence boundary detection with 90.71% F-score
  • 514,157 word vectors with 300 dimensions
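The label counts listed above can be checked against an installed model. This is a small sketch, assuming a loaded spaCy pipeline passed in as nlp; label_inventory is a hypothetical helper name, and the counts it yields may differ between model versions.

```python
from typing import Dict, Tuple


def label_inventory(
    nlp, components: Tuple[str, ...] = ("tagger", "parser", "ner")
) -> Dict[str, Tuple[str, ...]]:
    """Map each requested component that is present to its sorted label set."""
    inventory: Dict[str, Tuple[str, ...]] = {}
    for name in components:
        if nlp.has_pipe(name):  # skip components this pipeline does not have
            inventory[name] = tuple(sorted(nlp.get_pipe(name).labels))
    return inventory
```

For example, len(label_inventory(nlp)["ner"]) gives the number of entity types the installed model predicts.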

Frequently Asked Questions

Q: What makes this model unique?

The model pairs a full annotation pipeline with a large vector table (514,157 vectors, 300 dimensions), which suits production systems that need accurate English language processing. Because it is optimized for CPU, it runs efficiently without specialized hardware.
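On CPU, throughput usually improves when texts are processed in batches and unneeded components are switched off. The sketch below shows one way to do that for entity extraction only. It assumes the ner component does not depend on the components named in UNUSED_FOR_NER, which is worth verifying for the installed model version; extract_entities and UNUSED_FOR_NER are names invented for this example.

```python
from typing import Iterable, List, Tuple

# Components that NER is assumed not to depend on in this pipeline.
UNUSED_FOR_NER = ("tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer")


def extract_entities(
    nlp, texts: Iterable[str], batch_size: int = 64
) -> List[List[Tuple[str, str]]]:
    """Return (entity text, label) pairs for each input text."""
    to_disable = [name for name in UNUSED_FOR_NER if name in nlp.pipe_names]
    # select_pipes restores the disabled components when the block exits.
    with nlp.select_pipes(disable=to_disable):
        # nlp.pipe processes texts in batches instead of one call per text.
        return [
            [(ent.text, ent.label_) for ent in doc.ents]
            for doc in nlp.pipe(texts, batch_size=batch_size)
        ]
```

The batch size of 64 is an arbitrary starting point; a suitable value depends on text length and available memory.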

Q: What are the recommended use cases?

This model suits applications that need detailed text analysis: information extraction, content categorization, semantic analysis, and other NLP tasks that rely on word vectors. It is a good fit for production environments where accuracy is the priority.
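For the vector-based use cases, one simple pattern is ranking candidate texts by their similarity to a query. This is a minimal sketch, assuming nlp is the loaded pipeline and relying on spaCy's Doc.similarity, which by default compares averaged word vectors; rank_by_similarity is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def rank_by_similarity(
    nlp, query: str, candidates: Iterable[str]
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from most to least similar."""
    query_doc = nlp(query)
    scored = [
        (text, float(query_doc.similarity(nlp(text)))) for text in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Averaged vectors ignore word order, so this works best for short, topical comparisons and should be treated as a baseline, not a full semantic search method.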
