# PreSelect FastText Classifier
| Property | Value |
|---|---|
| Author | HKUST-NLP |
| Paper | Predictive Data Selection: The Data That Predicts Is the Data That Teaches |
| Classification Type | Binary classification |
| Labels | `__label__1` (positive), `__label__0` (negative) |
## What is preselect-fasttext-classifier?
The preselect-fasttext-classifier is a binary classification model built to identify high-quality documents in pretraining corpora. It is the model behind the PreSelect-100B dataset, which was constructed by keeping roughly the top 10% of data ranked by the classifier's quality score. Because it uses the lightweight FastText architecture, the classifier scores text quickly, making it well suited to large-scale data curation.
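To make the scoring step concrete, here is a minimal sketch of loading the classifier with the `fasttext` Python package, scoring documents by the probability of the positive label, and keeping the top 10%. The model filename and the document list are illustrative assumptions, not values from the source.

```python
import fasttext
import numpy as np

# Hypothetical local path to the downloaded classifier checkpoint.
model = fasttext.load_model("preselect_classifier.bin")

def quality_score(text: str) -> float:
    """Probability of the positive (__label__1) class for one document."""
    # fastText's predict() processes one line at a time, so strip newlines.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__1", 0.0)

docs = ["First candidate document ...", "Second candidate document ..."]
scores = np.array([quality_score(d) for d in docs])

# Keep the highest-scoring 10% (score at or above the 90th percentile).
cutoff = np.percentile(scores, 90)
selected = [d for d, s in zip(docs, scores) if s >= cutoff]
```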
## Implementation Details
The model implements a straightforward yet effective approach to data filtering on top of the FastText architecture. It integrates into existing Python data-processing pipelines, reads both JSONL and Parquet inputs, and includes built-in support for parallel processing with a configurable task count. The main integration points are listed below, followed by a pipeline sketch:
- Binary classification with positive (`__label__1`) and negative (`__label__0`) labels
- Supports batch processing through `LocalPipelineExecutor`
- Flexible input/output handling with `JsonlReader` and `JsonlWriter`
- Configurable score threshold for classification
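A minimal sketch of such a pipeline built from datatrove's components is shown below. The folder paths, checkpoint filename, task count, and the 0.5 score threshold are assumptions for illustration; verify `FastTextClassifierFilter`'s `keep_labels` format (label name without the `__label__` prefix, paired with a minimum score) against your datatrove version.

```python
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        # Read raw documents; text_key names the JSON field that holds the text.
        JsonlReader("input_folder/", text_key="text", glob_pattern="*.jsonl"),
        # Keep documents the classifier labels "1" (positive) with score >= 0.5.
        FastTextClassifierFilter("checkpoint.bin", keep_labels=[("1", 0.5)]),
        # Write surviving documents back out as uncompressed JSONL.
        JsonlWriter("output_folder/", compression=None),
    ],
    tasks=4,  # split the input files across 4 parallel tasks
)
executor.run()
```

The `tasks` argument is how the executor parallelizes: each task processes a disjoint shard of the input files.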
## Core Capabilities
- High-throughput data quality assessment
- Efficient binary classification of text data
- Scalable processing for large datasets
- Integration with standard data processing pipelines
- Support for multiple input formats (JSONL and Parquet; see the reader swap below)
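For Parquet input, only the reader in the pipeline above needs to change; a hedged sketch, with the folder path and glob pattern as placeholder assumptions:

```python
from datatrove.pipeline.readers import ParquetReader

# Drop-in replacement for JsonlReader in the pipeline above.
reader = ParquetReader("input_folder/", text_key="text", glob_pattern="*.parquet")
```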
## Frequently Asked Questions
### Q: What makes this model unique?
This model is purpose-built for identifying high-quality training data and was instrumental in creating the PreSelect-100B dataset. Its FastText architecture keeps inference fast enough for corpus-scale filtering while remaining an effective quality classifier.
### Q: What are the recommended use cases?
The model is ideal for data curation tasks, particularly when processing large-scale text corpora for machine learning model training. It's especially useful for filtering and selecting high-quality training data from larger, potentially noisy datasets.