# SmolVLM-Instruct-DPO
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Models | SmolLM2-1.7B-Instruct, SigLIP-so400m |
| Training Data | HuggingFaceH4/rlaif-v_formatted |
| Language | English |
## What is SmolVLM-Instruct-DPO?
SmolVLM-Instruct-DPO is a compact multimodal model for efficient processing of combined image and text inputs. Built on the Idefics3 architecture, it pairs the SigLIP-so400m vision encoder with the SmolLM2-1.7B-Instruct language model and is further tuned with Direct Preference Optimization (DPO).
## Implementation Details
The model is fine-tuned with PEFT (Parameter-Efficient Fine-Tuning) and leverages Flash Attention 2 for faster attention on CUDA devices. It processes inputs through a dedicated processor that handles interleaved sequences of images and text; a loading sketch follows the list below.
- Supports bfloat16 precision for efficient computation
- Implements chat template formatting for structured inputs
- Utilizes LoRA for parameter-efficient adaptation
- Compatible with PyTorch framework
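A minimal loading sketch under the assumptions above. The repo id and adapter path are placeholders (the actual hub location is not stated in this card), and the Flash Attention 2 flag assumes `flash-attn` is installed on a CUDA machine:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "SmolVLM-Instruct-DPO"  # placeholder: substitute the actual hub repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # bfloat16 precision, as listed above
    attn_implementation="flash_attention_2",  # Flash Attention 2 on CUDA devices
).to("cuda")

# If the DPO weights are distributed as a LoRA adapter rather than merged
# into the base checkpoint, they can be attached with PEFT instead:
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, "path/to/dpo-lora-adapter")  # hypothetical path
```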
## Core Capabilities
- Image and text sequence processing
- Visual question answering (see the usage sketch after this list)
- Image captioning
- Multi-image storytelling
- Pure language model functionality
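As an illustration of the chat-template input format and visual question answering, here is a hedged usage sketch that reuses the `model` and `processor` objects from the loading example above; the image URL and question are illustrative placeholders:

```python
import requests
from PIL import Image

# Placeholder image; any RGB image works with the processor.
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Structured chat input: one image slot followed by the question text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# apply_chat_template renders the structured messages into the model's prompt format.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```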
## Frequently Asked Questions
**Q: What makes this model unique?**
SmolVLM-Instruct-DPO combines multimodal processing with a compact footprint, making it suitable for on-device applications. The DPO training further improves its performance on preference-aligned tasks.
**Q: What are the recommended use cases?**
The model excels in tasks requiring visual and textual understanding, including image description, visual QA, and creating narratives from multiple images. It's particularly suitable for applications where computational efficiency is crucial.