# llm-data-textbook-quality-fasttext-classifier-v2
| Property | Value |
|---|---|
| License | MIT |
| Paper Reference | Textbooks Are All You Need |
| Language | English |
| Framework | FastText |
## What is llm-data-textbook-quality-fasttext-classifier-v2?
This is a specialized classifier designed to evaluate the educational value of text content, inspired by the "Textbooks Are All You Need" research. It employs a three-tier classification system (High/Mid/Low) to assess text quality, particularly useful for LLM training data curation. The model can process over 2000 examples per second on CPU, making it suitable for real-time filtering during model training.
## Implementation Details
The classifier uses the FastText architecture and scores text on a continuous scale from 0 to 2, where High quality corresponds to the top 25% of educational value, Mid quality to the middle 50% (the 25th to 75th percentile), and Low quality to the bottom 25%. This design allows rapid, on-the-fly scoring of text data, making the model particularly valuable for large-scale data-filtering operations.
- Processes text at 2000+ examples per second on CPU
- Uses FastText architecture for efficient classification
- Provides granular scoring between 0-2 for educational value
- Supports real-time text evaluation during model training
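To make the 0-2 scale concrete, here is a minimal sketch of how the three class probabilities can be collapsed into a single score. The label strings `__label__High`, `__label__Mid`, and `__label__Low` are an assumption based on FastText conventions; check the released model for the exact names.

```python
# Weight each tier by its value on the 0-2 scale (label names assumed).
LABEL_WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}

def educational_score(labels, probs):
    """Collapse per-label probabilities into a single 0-2 educational-value score."""
    return sum(LABEL_WEIGHTS.get(label, 0.0) * p for label, p in zip(labels, probs))

# A text predicted 70% High, 20% Mid, 10% Low scores 0.7*2 + 0.2*1 = 1.6.
score = educational_score(
    ("__label__High", "__label__Mid", "__label__Low"), (0.7, 0.2, 0.1)
)
```

Thresholding this score (rather than taking only the argmax label) preserves the granularity the model card describes.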
## Core Capabilities
- High-speed text quality assessment
- Three-tier classification system
- CPU-based processing without GPU requirement
- Effective for filtering training data for LLMs
- Benchmark testing across various datasets
## Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its ability to rapidly assess educational value in text while running efficiently on CPU hardware. Its three-tier classification system provides more nuanced evaluation than binary classifiers, and its proven effectiveness across various datasets makes it particularly valuable for LLM training data curation.
Q: What are the recommended use cases?
The primary use cases include filtering training data for language models, evaluating educational content quality, and processing large-scale text datasets for educational value assessment. It's particularly useful for organizations building educational AI models or curating high-quality training datasets.
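As a sketch of the data-curation use case, the snippet below keeps only documents whose weighted score clears a threshold. The `predict` callable is passed in (any function returning `(labels, probabilities)` in FastText's style), so the filter can be exercised without the model file; the label names, model path, and default threshold are assumptions, not values from the model card.

```python
def filter_high_quality(texts, predict, threshold=1.0):
    """Keep texts whose weighted educational score meets the threshold.

    `predict` is any callable mapping text -> (labels, probabilities),
    mirroring fastText's predict output. The label names below are
    assumed (__label__High / __label__Mid / __label__Low).
    """
    weights = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}
    kept = []
    for text in texts:
        labels, probs = predict(text)
        score = sum(weights.get(l, 0.0) * p for l, p in zip(labels, probs))
        if score >= threshold:
            kept.append(text)
    return kept

# With the real model this might look like (path is hypothetical):
#   import fasttext
#   model = fasttext.load_model("model.bin")
#   kept = filter_high_quality(corpus, lambda t: model.predict(t, k=3))
```

Because the classifier runs at 2,000+ examples per second on CPU, a loop like this can sit directly in a data-loading pipeline.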