SmolVLM-Instruct-DPO

SmolVLM-Instruct-DPO

HuggingFaceTB

Compact multimodal model combining vision and language capabilities, optimized with DPO training. Efficient architecture for image+text tasks with Apache 2.0 license.

PropertyValue
LicenseApache 2.0
Base ModelsSmolLM2-1.7B-Instruct, SigLIP-so400m
Training DataHuggingFaceH4/rlaif-v_formatted
LanguageEnglish

What is SmolVLM-Instruct-DPO?

SmolVLM-Instruct-DPO is a compact multimodal model designed for efficient processing of combined image and text inputs. Built on the Idefics3 architecture, it represents a significant advancement in lightweight multimodal processing, optimized through Direct Preference Optimization (DPO) training.

Implementation Details

The model is implemented using PEFT (Parameter-Efficient Fine-Tuning) and leverages Flash Attention 2 for enhanced performance on CUDA devices. It processes inputs through a specialized processor that can handle interleaved sequences of images and text.

  • Supports bfloat16 precision for efficient computation
  • Implements chat template formatting for structured inputs
  • Utilizes LoRA for parameter-efficient adaptation
  • Compatible with PyTorch framework

Core Capabilities

  • Image and text sequence processing
  • Visual question answering
  • Image captioning
  • Multi-image storytelling
  • Pure language model functionality

Frequently Asked Questions

Q: What makes this model unique?

SmolVLM-Instruct-DPO stands out for its efficient architecture that enables multimodal processing while maintaining a compact size, making it suitable for on-device applications. The DPO training enhances its performance on preference-aligned tasks.

Q: What are the recommended use cases?

The model excels in tasks requiring visual and textual understanding, including image description, visual QA, and creating narratives from multiple images. It's particularly suitable for applications where computational efficiency is crucial.

Related Models

Socials
PromptLayer
Company
All services online
Location IconPromptLayer is located in the heart of New York City
PromptLayer © 2026