TopicClassifier-NoURL
| Property | Value |
|---|---|
| Base Model | gte-base-en-v1.5 |
| Parameters | 140M |
| Training Data | 1M + 100K annotated documents |
| Paper | arXiv:2502.10341 |
What is TopicClassifier-NoURL?
TopicClassifier-NoURL is a text classifier that assigns web content to one of 24 topics using only the page text, without relying on URL information. It is built on the gte-base-en-v1.5 architecture and fine-tuned on a large set of documents annotated by two language models, Llama-3.1-8B and Llama-3.1-405B-FP8.
Implementation Details
The model is trained in two stages: first on 1M documents annotated by Llama-3.1-8B, then on 100K documents annotated by Llama-3.1-405B-FP8. At inference time it can use memory-efficient attention and unpadded inputs, which avoid spending computation on padding tokens.
- Built on gte-base-en-v1.5 architecture
- Supports bfloat16 for efficient inference
- Compatible with xformers for memory-efficient attention
- Implements sequence classification with softmax probability distribution
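As a rough illustration of the points above, the sketch below shows how raw classifier logits become a probability distribution via softmax, plus one possible way to load the model with the Hugging Face `transformers` library. Several things here are assumptions rather than details from this document: the helper names (`softmax`, `top_topic`, `load_classifier`), the three-label example (the real model emits 24 logits per document), and the loading options `unpad_inputs` and `use_memory_efficient_attention`, which are taken to be configuration flags of the gte-base-en-v1.5 remote code. The Hub repository ID is not stated here, so it is left as a parameter.

```python
import math


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    # Subtract the largest logit first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_topic(logits, labels):
    """Return (label, probability) for the highest-scoring topic."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def load_classifier(model_id, efficient=False):
    """Hypothetical loader; `model_id` is the Hub repository ID (not given here).

    Imports are local so the pure-Python helpers above work without torch.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    kwargs = {"trust_remote_code": True}
    if efficient:
        # Assumed flags of the gte-base-en-v1.5 remote code; they are
        # expected to need a GPU and an installed xformers package.
        kwargs.update(
            unpad_inputs=True,
            use_memory_efficient_attention=True,
            torch_dtype=torch.bfloat16,
        )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, **kwargs)
    return tokenizer, model.eval()


# Hypothetical three-label example; the real model has 24 topics.
example_labels = ["Tech", "Business", "Health"]
example_logits = [2.0, 0.5, -1.0]
label, prob = top_topic(example_logits, example_labels)
print(label, round(prob, 3))
```

With a loaded model, the logits would typically come from `model(**tokenizer(texts, return_tensors="pt")).logits`, and calling `.softmax(dim=-1)` on that tensor performs the same computation as the `softmax` helper.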
Core Capabilities
- Classifies content into 24 diverse categories such as Tech, Business, and Health
- Processes raw text without URL dependencies
- Provides probability distributions across all categories
- Supports efficient batch processing
- Optimized for production environments with memory-efficient options
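Batch processing can be sketched as follows. This is a minimal, dependency-free example, not the model's own API: `batched`, `classify_in_batches`, and the stand-in `dummy_classify_batch` are hypothetical names, and the dummy simply returns a uniform distribution over 24 topics where a real pipeline would call the model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def classify_in_batches(texts, classify_batch, batch_size=32):
    """Run `classify_batch` on each chunk of `texts` and flatten the results."""
    results = []
    for chunk in batched(texts, batch_size):
        results.extend(classify_batch(chunk))
    return results


def dummy_classify_batch(chunk):
    """Stand-in for the real model: one uniform 24-way distribution per text."""
    return [[1.0 / 24] * 24 for _ in chunk]


docs = [f"document {i}" for i in range(70)]
all_probs = classify_in_batches(docs, dummy_classify_batch, batch_size=32)
batch_sizes = [len(b) for b in batched(docs, 32)]
print(batch_sizes, len(all_probs))
```

Keeping the batching logic separate from the classifier call makes it easy to tune `batch_size` to the available memory without touching the model code.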
Frequently Asked Questions
Q: What makes this model unique?
The model classifies content from text alone, without URL information, and covers 24 topics with a relatively compact 140M-parameter architecture, which makes it well suited to content organization tasks. Its two-stage training on Llama-annotated documents is intended to make its predictions robust.
Q: What are the recommended use cases?
The model is ideal for content organization systems, recommendation engines, content filtering, and automated content categorization pipelines. It's particularly useful in scenarios where URL information is unavailable or unreliable.
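For filtering or routing pipelines, one simple pattern is to accept the top topic only when its probability clears a threshold. The sketch below assumes a probability list like the one the classifier produces; the `route` function, the 0.5 threshold, the `"unsorted"` fallback, and the three-label example are all illustrative choices, not part of the model.

```python
def route(probs, labels, threshold=0.5, fallback="unsorted"):
    """Return the top label if its probability meets `threshold`, else `fallback`."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else fallback


# Hypothetical three-label example; the real model has 24 topics.
labels = ["Tech", "Business", "Health"]
confident = route([0.80, 0.15, 0.05], labels)
uncertain = route([0.40, 0.35, 0.25], labels)
print(confident, uncertain)
```

Documents that fall into the fallback bucket can be reviewed manually or handled by a separate rule, depending on how costly a misclassification is for the application.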