SmolVLM-Instruct-DPO

Maintained By
HuggingFaceTB

License: Apache 2.0
Base Models: SmolLM2-1.7B-Instruct, SigLIP-so400m
Training Data: HuggingFaceH4/rlaif-v_formatted
Language: English

What is SmolVLM-Instruct-DPO?

SmolVLM-Instruct-DPO is a compact multimodal model designed for efficient processing of combined image and text inputs. Built on the Idefics3 architecture, it pairs the SmolLM2-1.7B-Instruct language backbone with the SigLIP-so400m vision encoder, and is further optimized through Direct Preference Optimization (DPO) training.

Implementation Details

The model is implemented using PEFT (Parameter-Efficient Fine-Tuning) and leverages Flash Attention 2 for enhanced performance on CUDA devices. It processes inputs through a specialized processor that handles interleaved sequences of images and text; a loading sketch follows the list below.

  • Supports bfloat16 precision for efficient computation
  • Implements chat template formatting for structured inputs
  • Utilizes LoRA for parameter-efficient adaptation
  • Compatible with PyTorch framework
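
A minimal loading sketch with bfloat16 precision and Flash Attention 2, using the standard transformers API. The repo id "HuggingFaceTB/SmolVLM-Instruct-DPO" is an assumption here and should be verified against the actual checkpoint name; the LoRA adapter id in the comment is hypothetical.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed repo id -- verify against the actual Hugging Face checkpoint name.
MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct-DPO"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    # Flash Attention 2 requires a CUDA device and the flash-attn package;
    # fall back to the default eager implementation otherwise.
    attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# If the DPO weights ship as a separate LoRA adapter rather than merged
# weights, they could instead be attached with PEFT (hypothetical adapter id):
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, "<dpo-lora-adapter-id>")
```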

Core Capabilities

  • Image and text sequence processing
  • Visual question answering
  • Image captioning
  • Multi-image storytelling
  • Pure language model functionality
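
As an illustration of the visual question answering capability and the chat template formatting mentioned above, the following sketch reuses the `processor`, `model`, and `DEVICE` from the loading example; the image path is a placeholder.

```python
from PIL import Image

# Placeholder path -- any RGB image works here.
image = Image.open("example.jpg").convert("RGB")

# Structured chat input: interleaved image and text content parts.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```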

Frequently Asked Questions

Q: What makes this model unique?

SmolVLM-Instruct-DPO stands out for its efficient architecture that enables multimodal processing while maintaining a compact size, making it suitable for on-device applications. The DPO training enhances its performance on preference-aligned tasks.

Q: What are the recommended use cases?

The model excels in tasks requiring visual and textual understanding, including image description, visual QA, and creating narratives from multiple images. It's particularly suitable for applications where computational efficiency is crucial.
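
For the multi-image narrative use case, a hedged sketch under the same assumptions: one {"type": "image"} placeholder per image, with the images passed to the processor in the same order. The file paths are placeholders.

```python
from PIL import Image

# Placeholder paths -- substitute your own images.
images = [Image.open("scene_1.jpg").convert("RGB"),
          Image.open("scene_2.jpg").convert("RGB")]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Write a short story that connects these two scenes."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=300)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```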
