# SmolVLM-Instruct-DPO
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Models | SmolLM2-1.7B-Instruct, SigLIP-so400m |
| Training Data | HuggingFaceH4/rlaif-v_formatted |
| Language | English |
## What is SmolVLM-Instruct-DPO?
SmolVLM-Instruct-DPO is a compact multimodal model for efficient processing of combined image and text inputs. Built on the Idefics3 architecture, it pairs the SigLIP-so400m vision encoder with the SmolLM2-1.7B-Instruct language model and is further tuned with Direct Preference Optimization (DPO).
## Implementation Details
The model is fine-tuned with PEFT (Parameter-Efficient Fine-Tuning) and leverages Flash Attention 2 for faster attention on CUDA devices. It processes inputs through a dedicated processor that handles interleaved sequences of images and text; a loading sketch follows the list below.
- Supports bfloat16 precision for efficient computation
- Implements chat template formatting for structured inputs
- Utilizes LoRA for parameter-efficient adaptation
- Compatible with PyTorch framework
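A minimal loading sketch under the assumptions above. The repo id and adapter path are placeholders (the actual hub location is not stated in this card), and the Flash Attention 2 flag assumes `flash-attn` is installed on a CUDA machine:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "SmolVLM-Instruct-DPO"  # placeholder: substitute the actual hub repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # bfloat16 precision, as listed above
    attn_implementation="flash_attention_2",  # Flash Attention 2 on CUDA devices
).to("cuda")

# If the DPO weights are distributed as a LoRA adapter rather than merged
# into the base checkpoint, they can be attached with PEFT instead:
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, "path/to/dpo-lora-adapter")  # hypothetical path
```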
## Core Capabilities
- Image and text sequence processing
- Visual question answering (see the usage sketch after this list)
- Image captioning
- Multi-image storytelling
- Pure language model functionality
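As an illustration of the chat-template input format and visual question answering, here is a hedged usage sketch that reuses the `model` and `processor` objects from the loading example above; the image URL and question are illustrative placeholders:

```python
import requests
from PIL import Image

# Placeholder image; any RGB image works with the processor.
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Structured chat input: one image slot followed by the question text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# apply_chat_template renders the structured messages into the model's prompt format.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```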
## Frequently Asked Questions
**Q: What makes this model unique?**
SmolVLM-Instruct-DPO combines multimodal processing with a compact footprint, making it suitable for on-device applications. The DPO training further improves its performance on preference-aligned tasks.
**Q: What are the recommended use cases?**
The model excels in tasks requiring visual and textual understanding, including image description, visual QA, and creating narratives from multiple images. It's particularly suitable for applications where computational efficiency is crucial.