paraphraser-bart-large

stanford-oval

A BART-based paraphrasing model trained on the ParaBank 2 dataset, specializing in high-quality sentence-level paraphrasing with controllable diversity.

  • Base Model: facebook/bart-large
  • Paper: AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data
  • Author: stanford-oval
  • Training Data: ParaBank 2 (5M sentence pairs)

What is paraphraser-bart-large?

paraphraser-bart-large is a specialized language model designed for high-quality sentence paraphrasing, developed by Stanford researchers. Built on the BART-large architecture, this model was trained on a carefully curated subset of the ParaBank 2 dataset, consisting of 5 million high-quality sentence pairs derived from English-Czech translations.

Implementation Details

The model utilizes a fine-tuned version of facebook/bart-large, trained for 4 epochs on cleaned ParaBank 2 data. The training process employs token-level cross-entropy loss and uses mini-batches of 1280 examples, with sentences grouped by length for optimal training efficiency.

  • Trained on back-translated Czech-English pairs for grammatical accuracy
  • Uses cleaned dataset removing URLs and excessive special characters
  • Implements efficient batch processing with length-based grouping
  • Supports controllable generation with top_p and temperature parameters
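The length-based grouping mentioned above can be sketched as follows. This is a simplified illustration of the general technique, not the authors' actual training code; the helper name `length_grouped_batches` is hypothetical:

```python
def length_grouped_batches(sentences: list[str], batch_size: int) -> list[list[str]]:
    """Sort sentences by word count so each mini-batch contains
    similar-length examples, minimizing wasted padding tokens."""
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Short sentences end up batched together, long ones likewise.
sentences = ["a b c d", "a", "a b", "a b c"]
print(length_grouped_batches(sentences, 2))
```

Because padding is applied per batch, grouping by length reduces the number of pad tokens processed and speeds up training.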

Core Capabilities

  • Sentence-level paraphrasing with high grammatical accuracy
  • Controllable diversity through temperature parameter (0-1)
  • Maintains semantic meaning while varying expression
  • Optimized for single-sentence transformation
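A minimal inference sketch using the Hugging Face transformers library, assuming the model is published on the Hub as `stanford-oval/paraphraser-bart-large`. The `paraphrase_kwargs` helper is a hypothetical convenience wrapping the sampling settings recommended on this card:

```python
def paraphrase_kwargs(temperature: float = 0.7, top_p: float = 0.9) -> dict:
    """Build the sampling settings recommended on this card:
    top_p=0.9, temperature in (0, 1] (higher => more diverse output)."""
    if not 0.0 < temperature <= 1.0:
        raise ValueError("temperature should be in (0, 1]")
    return {"do_sample": True, "top_p": top_p, "temperature": temperature,
            "max_length": 128}

if __name__ == "__main__":
    # Requires `pip install transformers torch` and access to the Hub.
    from transformers import BartForConditionalGeneration, BartTokenizer

    model_id = "stanford-oval/paraphraser-bart-large"
    tokenizer = BartTokenizer.from_pretrained(model_id)
    model = BartForConditionalGeneration.from_pretrained(model_id)

    inputs = tokenizer("The weather is nice today.", return_tensors="pt")
    outputs = model.generate(**inputs, **paraphrase_kwargs(temperature=0.8))
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Sampling (rather than greedy decoding) is what makes the diversity controllable: raising the temperature flattens the token distribution and yields freer rewordings.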

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its focus on high-quality, grammatically correct paraphrasing, achieved through its unique training approach using back-translated data and careful dataset curation. The ability to control output diversity through temperature settings makes it highly versatile for different use cases.

Q: What are the recommended use cases?

The model is best suited for sentence-level paraphrasing tasks. For optimal results, use top_p=0.9 and adjust temperature between 0 and 1 (higher values produce more diverse paraphrases). When working with paragraphs, it is recommended to split them into individual sentences first.
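The paragraph-splitting step can be as simple as a regex on sentence-final punctuation. This is a naive sketch; a library such as nltk or spacy handles abbreviations and edge cases more robustly:

```python
import re

def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph on whitespace that follows ., !, or ?."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

paragraph = ("The model works on single sentences. "
             "Split paragraphs first! Then paraphrase each piece.")
for sentence in split_sentences(paragraph):
    # Each sentence would be fed to the paraphraser individually.
    print(sentence)
```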
