FormatClassifier-NoURL

Maintained By
WebOrganizer

Property         Value
Base Model       gte-base-en-v1.5
Parameters       140M
Training Data    1.1M annotated documents
Paper            arXiv:2502.10341

What is FormatClassifier-NoURL?

FormatClassifier-NoURL is a text classifier that categorizes web content into 24 formats using only the page text, without any URL information. It is fine-tuned from gte-base-en-v1.5 in a two-stage training process on annotations produced by Llama-3.1 models.

Implementation Details

The model has 140M parameters and supports efficient inference through xformers. Its training data totals 1.1M documents, 1M from first-stage training and 100K from second-stage training, all annotated by Llama-3.1 models.

  • Two-stage training process using Llama-3.1-8B and Llama-3.1-405B-FP8 annotations
  • Supports memory-efficient attention and unpadded inputs
  • Compatible with bfloat16 data type for optimized performance
  • Provides probability distribution across 24 format categories
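
A minimal loading sketch for the options listed above. It assumes the checkpoint is published on the Hugging Face Hub as `WebOrganizer/FormatClassifier-NoURL` and that its remote code accepts the `use_memory_efficient_attention` and `unpad_inputs` options documented for gte-base-en-v1.5. Neither the repository id nor these keyword names appear in this card, so check them against the model repository before relying on them. The helper names `load_kwargs` and `load_classifier` are hypothetical.

```python
def load_kwargs(memory_efficient: bool) -> dict:
    """Build keyword arguments for from_pretrained.

    The option names are assumptions taken from the gte-base-en-v1.5
    remote code, not from this card.
    """
    kwargs = {
        "trust_remote_code": True,  # assumed: the gte architecture ships as custom code
        "use_memory_efficient_attention": memory_efficient,
    }
    if memory_efficient:
        kwargs["unpad_inputs"] = True  # assumed flag: skip work on padding tokens
    return kwargs


def load_classifier(repo_id: str = "WebOrganizer/FormatClassifier-NoURL",
                    memory_efficient: bool = False):
    """Load tokenizer and model; repo_id is an assumed Hub identifier."""
    # Imported lazily so load_kwargs works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    kwargs = load_kwargs(memory_efficient)
    if memory_efficient:
        # bfloat16 is the data type the card lists for optimized performance.
        kwargs["torch_dtype"] = torch.bfloat16

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSequenceClassification.from_pretrained(repo_id, **kwargs)
    if memory_efficient:
        model = model.to("cuda")  # xformers attention kernels expect a GPU
    return tokenizer, model.eval()
```

Under these assumptions, `load_classifier()` takes the plain attention path, while `load_classifier(memory_efficient=True)` would additionally need xformers installed and a CUDA device.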

Core Capabilities

  • Classifies content into categories ranging from Academic Writing to User Reviews
  • Processes text input without URL dependencies
  • Generates probability distributions for all possible format categories
  • Supports efficient inference with memory optimization options
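
To illustrate what a probability distribution over the 24 categories looks like, here is a minimal sketch that turns raw classifier scores (logits) into probabilities with a softmax and ranks them. The logits are invented for illustration, and categories are referred to by placeholder index, since this card names only two of them (Academic Writing and User Reviews) and does not give their order. That the model's outputs are normalized with a softmax is an assumption, though it is the usual convention for single-label classifiers.

```python
import math

NUM_FORMATS = 24  # number of format categories stated in this card


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=3):
    """Return the k most likely (category_index, probability) pairs."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits: two categories stand out, the rest are flat.
logits = [0.0] * NUM_FORMATS
logits[5] = 4.0
logits[17] = 2.5

probs = softmax(logits)
best = top_k(probs, k=2)
for index, prob in best:
    print(f"category {index}: {prob:.3f}")
```

Keeping the full distribution rather than only the top label lets downstream code apply its own confidence threshold or handle pages that sit between two formats.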

Frequently Asked Questions

Q: What makes this model unique?

The model classifies web content from text alone, without URL information. Together with its coverage of 24 format categories and its memory-efficient inference options, this makes it well suited to content organization tasks.

Q: What are the recommended use cases?

This model is ideal for content management systems, web crawlers, and data organization tools that need to categorize web content based purely on text. It's particularly useful when URL information is unavailable or unreliable.
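
As a rough sketch of the data organization use case, the snippet below groups documents by their most likely format and sets aside low-confidence predictions. Everything here is hypothetical: `group_by_format`, the `min_confidence` threshold, the `-1` bucket for uncertain documents, and `fake_classify`, which stands in for the real model and returns an invented 24-way distribution.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence


def group_by_format(texts: Iterable[str],
                    classify: Callable[[str], Sequence[float]],
                    min_confidence: float = 0.5) -> Dict[int, List[str]]:
    """Bucket texts by top category index; uncertain texts go under -1."""
    groups = defaultdict(list)
    for text in texts:
        probs = classify(text)
        best = max(range(len(probs)), key=probs.__getitem__)
        key = best if probs[best] >= min_confidence else -1
        groups[key].append(text)
    return dict(groups)


def fake_classify(text: str) -> List[float]:
    """Stand-in for the real model (placeholder indices, invented scores)."""
    if "review" in text.lower():
        probs = [0.01] * 24
        probs[23] = 0.77  # confident prediction for one placeholder category
        return probs
    return [1.0 / 24] * 24  # no signal: uniform distribution


buckets = group_by_format(["Great product review", "misc text"], fake_classify)
print(buckets)
```

Passing the classifier in as a callable keeps the grouping logic independent of how the model is loaded or batched.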
