SciLitLLM1.5-14B
Property | Value |
---|---|
Base Model | Qwen2.5-14B |
Parameters | 14 Billion |
Developer | Uni-SMART |
Model Hub | Hugging Face |
Paper | arXiv:2408.15545 |
What is SciLitLLM1.5-14B?
SciLitLLM1.5-14B is a specialized large language model designed specifically for scientific literature understanding. Built upon the Qwen2.5-14B architecture, it implements a hybrid approach combining continual pre-training (CPT) and supervised fine-tuning (SFT) to enhance its capabilities in processing scientific content.
Implementation Details
The model employs a sophisticated pipeline that addresses two primary challenges in scientific text processing: high-quality CPT corpora construction and diverse SFT instruction generation. The implementation includes advanced PDF text extraction, content error correction, quality filtering, and synthetic instruction creation mechanisms.
- Hybrid training strategy combining CPT and SFT
- Advanced PDF processing and text extraction capabilities
- Quality-focused content filtering system
- Synthetic instruction generation for enhanced performance
Core Capabilities
- Superior performance on scientific literature benchmarks (4.0% improvement on SciAssess)
- Enhanced scientific content understanding and analysis
- Outperforms larger models like Llama3.1 and Qwen2.5-70B on SciRIFF
- Efficient processing of academic and research materials
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its specialized focus on scientific literature understanding, achieved through a novel hybrid training approach that combines continuous pre-training with supervised fine-tuning. It demonstrates superior performance compared to larger models while maintaining efficiency.
Q: What are the recommended use cases?
SciLitLLM1.5-14B is particularly suited for scientific literature analysis, research paper understanding, academic content summarization, and technical document processing. It's ideal for researchers, academics, and professionals working with scientific content.