en_core_web_sm
Property | Value |
---|---|
License | MIT |
Author | Explosion AI |
spaCy Version | >=3.7.2,<3.8.0 |
Token Accuracy | 99.86% |
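The version range in the table can be checked before loading the model. The following is a minimal sketch, assuming plain `major.minor.patch` version strings. `parse_version` and `is_compatible` are hypothetical helper names, not part of spaCy, and a dedicated library such as `packaging` would handle pre-release tags more rigorously.

```python
import re

# Bounds copied from the "spaCy Version" row: >=3.7.2,<3.8.0
LOWER, UPPER = "3.7.2", "3.8.0"


def parse_version(text):
    """Convert '3.7.4' to (3, 7, 4), reading only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad a short string such as '3.8' to (3, 8, 0)
    return tuple(parts)


def is_compatible(installed, lower=LOWER, upper=UPPER):
    """Return True when lower <= installed < upper (hypothetical helper)."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)
```

With spaCy installed, `is_compatible(spacy.__version__)` reports whether the installed release falls inside the supported range.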
What is en_core_web_sm?
en_core_web_sm is a lightweight English pipeline for spaCy, optimized for CPU usage and developed by Explosion AI. It covers part-of-speech tagging, dependency parsing, lemmatization, sentence segmentation, and named entity recognition while maintaining a small footprint.
Implementation Details
The pipeline comprises seven components: tok2vec, tagger, parser, senter, ner, attribute_ruler, and lemmatizer. The senter component ships disabled by default, since the parser already sets sentence boundaries. Training data and resources include OntoNotes 5, ClearNLP, and WordNet 3.0. Reported evaluation scores:
- Named Entity Recognition (NER) with 84.56% F-score
- Part-of-speech tagging with 97.25% accuracy
- Dependency parsing with 91.75% unlabeled attachment score
- Sentence segmentation with 90.59% F-score
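The component list can be inspected on a loaded pipeline. The following is a minimal sketch, assuming spaCy and the model are installed (for example via `python -m spacy download en_core_web_sm`). `describe_pipeline` and `load_with_senter` are hypothetical helper names. In the packaged model, `senter` is typically listed as disabled because the parser already provides sentence boundaries.

```python
def describe_pipeline(nlp):
    """Return the active and disabled component names of a loaded pipeline."""
    return {"active": list(nlp.pipe_names), "disabled": list(nlp.disabled)}


def load_with_senter(name="en_core_web_sm"):
    """Load the model with the parser swapped for the lighter sentence segmenter."""
    import spacy  # imported lazily so describe_pipeline works on any pipeline object

    # Pattern from spaCy's documentation: exclude the parser, then enable senter.
    nlp = spacy.load(name, exclude=["parser"])
    if "senter" in nlp.disabled:
        nlp.enable_pipe("senter")
    return nlp
```

Calling `describe_pipeline(spacy.load("en_core_web_sm"))` shows which of the seven components actually run by default. `load_with_senter()` is an option when only sentence boundaries are needed and the full dependency parse is not.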
Core Capabilities
- 18 distinct NER categories including PERSON, ORG, DATE, and more
- Comprehensive token classification with 50+ POS tags
- 44 dependency parsing labels for detailed syntactic analysis
- High-accuracy sentence boundary detection
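The capabilities above map onto attributes of a processed document. The following is a minimal sketch, assuming `doc` is the result of calling a loaded pipeline on a string. `summarize` and `label_counts` are hypothetical helper names. The label counts returned for the real model are not asserted here and should be checked against the figures listed above.

```python
def summarize(doc):
    """Collect entities, tags, dependency arcs, and sentences from a spaCy Doc."""
    return {
        # Named entities with labels such as PERSON, ORG, DATE
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        # Coarse (pos_) and fine-grained (tag_) part-of-speech tags per token
        "tags": [(tok.text, tok.pos_, tok.tag_) for tok in doc],
        # Dependency label and syntactic head per token
        "deps": [(tok.text, tok.dep_, tok.head.text) for tok in doc],
        # Sentence spans; requires the parser or senter to have run
        "sentences": [sent.text for sent in doc.sents],
    }


def label_counts(nlp, components=("ner", "tagger", "parser")):
    """Count the labels each trained component can predict."""
    return {name: len(nlp.get_pipe(name).labels) for name in components}
```

Typical use, with the model installed, is `summarize(nlp("Apple is opening an office in London."))` after `nlp = spacy.load("en_core_web_sm")`.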
Frequently Asked Questions
Q: What makes this model unique?
The model's strength lies in its balanced performance across multiple NLP tasks while maintaining a small footprint, making it ideal for CPU-based applications requiring quick processing.
Q: What are the recommended use cases?
This model is particularly well-suited for production environments where computational resources are limited but reliable English language processing is still required, including named entity recognition, POS tagging, and dependency parsing.
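For resource-constrained deployments, two common spaCy techniques are batching texts through `nlp.pipe` and disabling components the application does not read. The following is a minimal sketch, assuming only entities are needed. `iter_entities` and `load_ner_only` are hypothetical helper names, and the list of disabled components is an assumption to verify against the downstream code.

```python
def iter_entities(nlp, texts, batch_size=64):
    """Stream (text_index, entity_text, label) triples, batching with nlp.pipe."""
    for index, doc in enumerate(nlp.pipe(texts, batch_size=batch_size)):
        for ent in doc.ents:
            yield index, ent.text, ent.label_


def load_ner_only(name="en_core_web_sm"):
    """Load the model with components unrelated to entity recognition disabled."""
    import spacy  # imported lazily so iter_entities works on any pipeline object

    # Assumption: downstream code reads only doc.ents, so tags, parses,
    # and lemmas are not needed. tok2vec is left enabled to stay safe.
    return spacy.load(
        name, disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]
    )
```

Because `iter_entities` is a generator, results can be consumed as they are produced instead of holding every processed document in memory.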