# quest-corruption-7b-s375-v3-GRPO
| Property | Value |
|---|---|
| Model Size | 7B parameters |
| Author | Quest-AI |
| Model URL | HuggingFace Repository |
| Training Infrastructure | 8x H200 GPUs (Deepshard) |
## What is quest-corruption-7b-s375-v3-GRPO?
quest-corruption-7b-s375-v3-GRPO is a specialized language model for text repair and corruption handling. It was trained with a pseudo "fill in the middle" approach on text corrupted by randomized UTF-8 character substitution, then further tuned with two custom GRPO (Group Relative Policy Optimization) reward functions to improve its handling of XML-styled content.
## Implementation Details
The model employs a distinctive prompt format: corrupted text containing UTF-8 substitutions, followed by objective specifications in Claude-style XML tags. The implementation is designed to work with PyQt GUI tools and focuses on synthesizing rejected (lower-quality) preference data from pre-existing SFT (Supervised Fine-Tuning) data.
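The exact prompt template is not published. A minimal sketch of the structure described above, with hypothetical tag names (`<corrupted_text>`, `<objective>` are illustrative, not confirmed by the model card):

```python
# Hypothetical prompt assembly for the text-repair task.
# Tag names are illustrative; the card does not publish the template.
def build_prompt(corrupted_text: str, objective: str) -> str:
    return (
        "<corrupted_text>\n"
        f"{corrupted_text}\n"
        "</corrupted_text>\n"
        "<objective>\n"
        f"{objective}\n"
        "</objective>"
    )

prompt = build_prompt("Th∅ qu∈ck brown f⊕x", "Repair the corrupted span.")
```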
- Custom pseudo "fill in the middle" training methodology
- Specialized handling of varying corruption rates
- Integration with XML-style formatting
- Optimized through GRPO reward functions
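The original corruption script is not released; the randomized UTF-8 substitution with a varying corruption rate might look something like this sketch (the substitution range and rate are assumptions):

```python
import random

def corrupt(text: str, rate: float = 0.15, seed=None) -> str:
    """Replace roughly `rate` of the characters with random UTF-8
    code points. Illustrates the style of corruption the model is
    trained to repair; not the original training script."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            # Draw from a printable non-ASCII block (math operators).
            out.append(chr(rng.randint(0x2200, 0x22FF)))
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The quick brown fox jumps over the lazy dog.", rate=0.2, seed=0))
```

Varying `rate` per sample reproduces the "varying corruption rates" behavior listed above.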
## Core Capabilities
- Text repair and reconstruction
- UTF8 character substitution handling
- Preference data synthesis
- XML-aware content processing
- Integration with GUI tooling
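The two custom GRPO reward functions are not published. As a hedged sketch, one plausible XML-aware reward scores well-formedness, and GRPO then compares rewards within each group of sampled completions (function names here are hypothetical):

```python
import xml.etree.ElementTree as ET

def xml_wellformed_reward(completion: str) -> float:
    """Return 1.0 if the completion is well-formed XML, else 0.0.
    A hypothetical stand-in for the card's unpublished rewards."""
    try:
        # Wrap in a synthetic root so fragments with sibling tags parse.
        ET.fromstring(f"<root>{completion}</root>")
        return 1.0
    except ET.ParseError:
        return 0.0

def group_advantages(rewards):
    """GRPO-style group-relative advantage: reward minus group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```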
## Frequently Asked Questions

### Q: What makes this model unique?
Its distinguishing feature is repairing text corrupted by UTF-8 character substitution while remaining aware of XML styling. It is also designed to generate lower-quality, subtly incoherent completions that can be used as rejected samples when training reward models.
### Q: What are the recommended use cases?
The primary use case is synthesizing rejected or lower quality preference data from pre-existing SFT data. This is particularly valuable for training reward models and developing generalized preferences from subtly incoherent base model completions.
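The synthesis workflow above can be sketched as pairing each original SFT response (chosen) with a model-generated degraded response (rejected); the record format below is a common preference-data convention, not one specified by the card:

```python
# Hypothetical sketch of preference-pair construction for reward-model
# training: the SFT answer is "chosen", the degraded one is "rejected".
def make_preference_pair(prompt: str, sft_response: str, degraded: str) -> dict:
    return {"prompt": prompt, "chosen": sft_response, "rejected": degraded}

pair = make_preference_pair(
    "Repair this text.",
    "The quick brown fox.",   # clean SFT response
    "The qu∈ck br⊕wn fox.",   # subtly incoherent model completion
)
```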