TopicClassifier-NoURL

Property	Value
Base Model	gte-base-en-v1.5
Parameters	140M
Training Data	1M + 100K annotated documents
Paper	arXiv:2502.10341

What is TopicClassifier-NoURL?

TopicClassifier-NoURL is a text classifier that assigns web content to one of 24 topics using only the page text, without relying on URL information. It is fine-tuned from gte-base-en-v1.5 on documents annotated by two language models, Llama-3.1-8B and Llama-3.1-405B-FP8.

Implementation Details

The model is trained in two stages: it is first trained on 1M documents annotated by Llama-3.1-8B, then refined on 100K documents annotated by Llama-3.1-405B-FP8. For inference, it can use memory-efficient attention and unpadded inputs, which avoid spending computation on padding tokens.

Built on gte-base-en-v1.5 architecture
Supports bfloat16 for efficient inference
Compatible with xformers for memory-efficient attention
Performs sequence classification; a softmax over the output logits gives a probability distribution over the 24 topics
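
As a minimal sketch of the softmax step, the snippet below turns a vector of 24 raw scores into probabilities and picks the highest-ranked topics. The logit values and the topic_N label names are made up for illustration; they are not the model's real outputs or category names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: Sequence[float], labels: Sequence[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for a 24-way classifier; the labels are placeholders.
logits = [0.0] * 24
logits[5] = 4.0
logits[11] = 2.5
labels = [f"topic_{i}" for i in range(24)]

probs = softmax(logits)
best = top_k(probs, labels, k=2)
print(best)
```

In practice the logits would come from the classifier's output layer, and the label list from the model's own configuration.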

Core Capabilities

Classifies content into 24 topic categories, such as Tech, Business, and Health
Processes raw text without URL dependencies
Provides probability distributions across all categories
Supports efficient batch processing
Offers memory-efficient inference options (bfloat16, xformers attention) for production environments
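
The batch-processing capability could be used along the following lines. This is a sketch under stated assumptions, not the model's official usage: chunked, classify_in_batches, make_predict_fn, and uniform_stub are hypothetical helper names, and the Hugging Face repository id is not given here, so it is left as a required model_id argument. The sketch also assumes the checkpoint loads through AutoModelForSequenceClassification with trust_remote_code=True.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(
    texts: Sequence[str],
    predict_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Run predict_fn batch by batch; return one probability vector per text."""
    results: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        results.extend(predict_fn(batch))
    return results


def make_predict_fn(model_id: str) -> Callable[[Sequence[str]], List[List[float]]]:
    """Build a predict_fn backed by Hugging Face transformers.

    Assumption: the checkpoint loads via AutoModelForSequenceClassification
    with trust_remote_code=True. Imports are local so the helpers above
    work without torch installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    model.eval()

    def predict(batch: Sequence[str]) -> List[List[float]]:
        inputs = tokenizer(list(batch), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Cast to float32 before softmax so the probabilities are not rounded to bfloat16.
        return logits.float().softmax(dim=-1).tolist()

    return predict


def uniform_stub(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in predictor: a uniform distribution over 24 topics."""
    return [[1.0 / 24] * 24 for _ in batch]


docs = [f"document {i}" for i in range(70)]
all_probs = classify_in_batches(docs, uniform_stub, batch_size=32)
```

Swapping uniform_stub for make_predict_fn(model_id) would run the real classifier. Keeping the predictor as a plain callable makes the batching logic testable without downloading any weights.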

Frequently Asked Questions

Q: What makes this model unique?

The model classifies content without URL information, covers 24 topics, and is built on a compact 140M-parameter architecture, which makes it well suited to content organization tasks. Its two-stage training relies on Llama annotations, with the second stage drawing on the larger Llama-3.1-405B-FP8 model.

Q: What are the recommended use cases?

The model is ideal for content organization systems, recommendation engines, content filtering, and automated content categorization pipelines. It's particularly useful in scenarios where URL information is unavailable or unreliable.
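
For a filtering or categorization pipeline, one common pattern is to accept the top topic only when its probability clears a threshold and to set aside uncertain documents. The sketch below illustrates that idea; the route function, the 0.5 threshold, and the placeholder label names are assumptions for illustration, not part of the model.

```python
from typing import Optional, Sequence


def route(
    probs: Sequence[float],
    labels: Sequence[str],
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Return the most likely label if it is confident enough, else None."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best_index] >= min_confidence:
        return labels[best_index]
    return None  # caller can send the document to manual review or a fallback


labels = ["topic_a", "topic_b", "topic_c"]  # placeholder names
confident = route([0.1, 0.8, 0.1], labels)
uncertain = route([0.4, 0.35, 0.25], labels)
print(confident, uncertain)
```

The threshold would need tuning on held-out data for each use case, since a suitable value depends on how costly a wrong category is.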