finerweb-quality-classifier

Maintained By
TurkuNLP

FinerWeb Quality Classifier

Property        Value
Base Model      microsoft/deberta-v3-base
License         Apache 2.0
Paper           arXiv:2501.07314
Training Data   328,472 lines from 20,000 documents

What is finerweb-quality-classifier?

The FinerWeb Quality Classifier is a specialized DeBERTa-v3-based model developed by the University of Turku for assessing the quality of web text at a line-by-line level. It's designed to distinguish between high-quality (Clean) content and various categories of low-quality text, outputting a quality score between 0 and 1 for each input line.

Implementation Details

The model was trained in bfloat16 precision with a learning rate of 1e-5 and a batch size of 16. Training used early stopping with a patience of 5 epochs and a cross-entropy loss with label smoothing of 0.1. The training data was labeled with GPT-4, and the labels were then refined with OpenAI's o1-preview model.
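To make the label smoothing setting concrete, here is a minimal pure-Python sketch of a label-smoothed cross-entropy for a single example. It assumes the common formulation in which the smoothing mass is spread uniformly over all classes; the function names and the three-class logits are illustrative and are not taken from the authors' training code.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution.

    Assumed formulation: the true class gets weight (1 - smoothing)
    plus smoothing / K, every other class gets smoothing / K,
    where K is the number of classes.
    """
    probs = softmax(logits)
    k = len(logits)
    loss = 0.0
    for i, p in enumerate(probs):
        weight = smoothing / k
        if i == target:
            weight += 1.0 - smoothing
        loss -= weight * math.log(p)
    return loss


# Illustrative logits for a made-up three-class problem.
example_logits = [10.0, 0.0, 0.0]
hard_loss = smoothed_cross_entropy(example_logits, target=0, smoothing=0.0)
soft_loss = smoothed_cross_entropy(example_logits, target=0, smoothing=0.1)
```

With smoothing enabled, a very confident prediction is still charged a small loss, which discourages the classifier from pushing its logits to extremes.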

  • Achieves 0.81 micro-F1 and 0.66 macro-F1 scores
  • Strong performance on Clean class identification (F1: 0.90)
  • Trained on a dataset dominated by the Clean class (86.24% of lines)
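The gap between the micro-F1 and macro-F1 figures above is what one would expect when one class dominates the data. The sketch below computes both metrics on a small made-up label set; the class name "Other" and the toy predictions are assumptions for illustration only and do not come from the model's evaluation data.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def micro_f1(y_true, y_pred):
    """Micro-F1; for single-label classification this equals accuracy."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Macro-F1: unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: 8 Clean lines and 2 lines of a hypothetical "Other" class.
y_true = ["Clean"] * 8 + ["Other"] * 2
y_pred = ["Clean"] * 9 + ["Other"] * 1

micro = micro_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
```

Micro-F1 counts every line equally, so the majority class drives it, while macro-F1 gives each class the same weight, so weaker results on rare classes pull it down.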

Core Capabilities

  • Line-level quality assessment of English web text
  • Separation of Clean lines from low-quality content
  • Produces interpretable quality scores (0-1 range)
  • Optimized for web content filtering and dataset curation
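As a rough illustration of line-level filtering, the sketch below keeps only the lines whose quality score meets a threshold. It is a sketch under stated assumptions rather than the model's official usage: it assumes the quality score is the softmax probability of the Clean class, the `clean_index` default and the 0.5 threshold are hypothetical, and `toy_score_line` is a stand-in for a real call to the classifier.

```python
import math
from typing import Callable, List, Sequence, Tuple


def clean_probability(logits: Sequence[float], clean_index: int = 0) -> float:
    """Softmax probability of the Clean class (clean_index is an assumption)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[clean_index] / sum(exps)


def filter_lines(
    text: str,
    score_line: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score each non-empty line and keep those at or above the threshold."""
    kept = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines without scoring them
        score = score_line(line)
        if score >= threshold:
            kept.append((line, score))
    return kept


def toy_score_line(line: str) -> float:
    """Stand-in scorer, NOT the real model: penalizes cookie banners."""
    return 0.1 if "cookies" in line.lower() else 0.9


page = (
    "The river usually freezes in winter.\n"
    "We use cookies. Accept all\n"
    "\n"
    "Ferries run along the coast in summer."
)
kept = filter_lines(page, toy_score_line, threshold=0.5)
```

In practice `score_line` would wrap the classifier itself, and the threshold would be tuned on held-out data for the dataset being curated.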

Frequently Asked Questions

Q: What makes this model unique?

This model focuses on assessing web text quality at the level of individual lines, and it was trained on 328,472 lines labeled with LLMs. Its high F1 score on Clean content (0.90) makes it particularly useful for data filtering tasks.

Q: What are the recommended use cases?

The model is ideal for filtering and quality assessment of English web text, particularly in dataset curation for language model training. It's specifically designed for line-by-line analysis but should not be used for non-English content or highly specialized domains.
