preselect-fasttext-classifier

Maintained By
hkust-nlp

PreSelect FastText Classifier

Property             Value
Author               HKUST-NLP
Paper                Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Classification Type  Binary Classification
Labels               __label__1 (positive), __label__0 (negative)

What is preselect-fasttext-classifier?

The preselect-fasttext-classifier is a binary classification model that identifies and filters high-quality data in pretraining corpora. It is the classifier behind the PreSelect-100B dataset, where it is applied with a 10% selection threshold for data quality. Because it is built on the FastText architecture, it scores text cheaply, which makes it practical for large-scale data curation.
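As a rough illustration of threshold-based selection, the sketch below keeps the highest-scoring fraction of a set of documents. It assumes the 10% threshold means "keep the top 10% of documents by classifier score", which is one plausible reading and not something this page states. The function name `select_top_fraction` is hypothetical, and the scores are taken as already computed.

```python
from typing import List, Sequence


def select_top_fraction(scores: Sequence[float], fraction: float = 0.10) -> List[int]:
    """Return the indices of the highest-scoring documents.

    Assumption: `fraction` is the share of documents to keep (0.10 = top 10%).
    At least one document is kept whenever the input is non-empty.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    n_keep = max(1, round(len(scores) * fraction))
    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in their original document order.
    return sorted(ranked[:n_keep])
```

Selecting by rank needs the scores for the whole corpus (or a sample of it) up front. A fixed score cutoff avoids that and can be applied in a streaming pass, at the cost of not knowing in advance how much data will survive.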

Implementation Details

The model filters data with a FastText classifier. It can be integrated into existing Python data processing pipelines and supports both JSON and Parquet file formats. The implementation also supports parallel processing with a configurable number of tasks.

  • Binary classification with positive (__label__1) and negative (__label__0) labels
  • Supports batch processing through LocalPipelineExecutor
  • Flexible input/output handling with JsonlReader and JsonlWriter
  • Configurable threshold settings for classification
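The labels and threshold listed above can be exercised directly with the `fasttext` Python package, without the pipeline classes. The following is a minimal sketch, not the authors' pipeline: the helper names, the model path passed to `load_classifier`, and the default threshold of 0.5 are all assumptions made for illustration.

```python
from typing import Any

POSITIVE_LABEL = "__label__1"  # positive label named on this page


def load_classifier(model_path: str) -> Any:
    """Load a FastText model from disk (model_path is a placeholder)."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(model_path)


def positive_score(model: Any, text: str) -> float:
    """Return the probability the model assigns to the positive label."""
    # fastText predicts one line at a time, so collapse newlines and
    # other whitespace runs into single spaces first.
    cleaned = " ".join(text.split())
    # k=-1 asks for every label, so the positive one is always present.
    labels, probs = model.predict(cleaned, k=-1)
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def keep_document(model: Any, text: str, threshold: float = 0.5) -> bool:
    """Keep a document when its positive score reaches the threshold.

    Assumption: 0.5 is only a placeholder; tune it for your corpus.
    """
    return positive_score(model, text) >= threshold
```

A call would look like `keep_document(load_classifier("classifier.bin"), doc_text)`, where `classifier.bin` stands in for wherever the downloaded model file lives.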

Core Capabilities

  • High-throughput data quality assessment
  • Efficient binary classification of text data
  • Scalable processing for large datasets
  • Integration with standard data processing pipelines
  • Support for multiple input formats

Frequently Asked Questions

Q: What makes this model unique?

This model is designed specifically to identify high-quality training data, and it was instrumental in creating the PreSelect-100B dataset. Its FastText architecture keeps processing efficient while still classifying text effectively.

Q: What are the recommended use cases?

The model is ideal for data curation tasks, particularly when processing large-scale text corpora for machine learning model training. It's especially useful for filtering and selecting high-quality training data from larger, potentially noisy datasets.
