Ovis1.5-Llama3-8B

AIDC-AI

Ovis1.5-Llama3-8B is an open-source multimodal LLM combining SigLip-400M for vision and Llama3-8B for text, offering strong performance on visual-language tasks.

Property: Value
Model Type: Multimodal LLM
Vision Model: SigLip-400M
Language Model: Llama3-8B-Instruct
License: Apache 2.0
Paper: arXiv:2405.20797

What is Ovis1.5-Llama3-8B?

Ovis1.5-Llama3-8B is a Multimodal Large Language Model (MLLM) that combines vision and language capabilities through structural embedding alignment. It pairs SigLip-400M for visual processing with Llama3-8B for language understanding, and reports scores of 60.7% on MMTBench-VAL and 78.2% on MMBench-EN-V1.1.
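To make "structural embedding alignment" more concrete, here is a minimal sketch of the idea as the Ovis paper (arXiv:2405.20797) describes it: text tokens index a text embedding table, and each image patch is mapped to a probability distribution over a learnable visual embedding table, so its embedding is a probability-weighted sum of that table's rows. The function names, table sizes, and values below are toy assumptions for illustration, not the model's actual code.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_embedding(token_id, text_table):
    """A text token is a hard (one-hot) lookup into the text embedding table."""
    return text_table[token_id]

def visual_embedding(patch_logits, visual_table):
    """An image patch is a soft lookup: a probability-weighted sum of table rows."""
    probs = softmax(patch_logits)
    dim = len(visual_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, visual_table))
        for d in range(dim)
    ]

# Toy visual vocabulary of 3 entries with 2-dimensional embeddings (hypothetical values).
visual_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits give equal weight to every row, so the result is the mean of the rows.
uniform_patch = visual_embedding([0.0, 0.0, 0.0], visual_table)

# A strongly peaked distribution behaves almost like a hard, text-style lookup.
peaked_patch = visual_embedding([50.0, 0.0, 0.0], visual_table)
```

In this sketch both modalities are produced by the same kind of operation, a lookup into an embedding table, which is the structural sense in which the two embedding strategies are aligned.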

Implementation Details

The model implements a novel architecture for aligning visual and textual embeddings structurally. It's fully open-source, providing access to training datasets, code, and model weights for complete transparency and reproducibility.

  • Integrated SigLip-400M vision transformer for image processing
  • Llama3-8B-Instruct foundation for language understanding
  • 8192-token multimodal context length
  • Supports bfloat16 precision for efficient inference
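The settings listed above could be passed at load time roughly as follows. This is a hedged sketch, not confirmed usage: the repository id `AIDC-AI/Ovis1.5-Llama3-8B` and the `multimodal_max_length` keyword are assumptions (the latter would be handled by the model's own remote code rather than by Transformers itself), and `build_load_kwargs` and `load_model` are hypothetical helper names. `AutoModelForCausalLM.from_pretrained`, `torch_dtype`, and `trust_remote_code` are standard Hugging Face Transformers API.

```python
# Assumed Hugging Face repository id; not stated in this page.
MODEL_ID = "AIDC-AI/Ovis1.5-Llama3-8B"

def build_load_kwargs(dtype):
    """Collect the loading options that mirror the bullet list above."""
    return {
        "torch_dtype": dtype,            # bfloat16 for efficient inference
        "multimodal_max_length": 8192,   # assumed keyword for the multimodal context length
        "trust_remote_code": True,       # the model ships custom modeling code
    }

def load_model(model_id=MODEL_ID):
    """Load the model; downloads the full weights, so it is not called here."""
    # Imported lazily so the rest of this sketch runs without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    kwargs = build_load_kwargs(torch.bfloat16)
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)

# Inspect the options without triggering a download.
preview = build_load_kwargs("bfloat16")
```

Calling `load_model()` would fetch the 8B-parameter checkpoint, so a GPU with enough memory for bfloat16 weights is the practical assumption here.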

Core Capabilities

  • Strong performance on visual-language tasks (78.2% on MMBench-EN-V1.1)
  • Robust mathematical reasoning (65.7% on MathVista-Mini)
  • Advanced OCR capabilities (score of 743 on OCRBench)
  • Strong diagram understanding and visual reasoning (82.5% on AI2D)

Frequently Asked Questions

Q: What makes this model unique?

Ovis1.5-Llama3-8B stands out for its structural embedding alignment approach and for being fully open-source, including its training datasets, which many competing models do not release. It reports strong results across multiple benchmarks, such as 78.2% on MMBench-EN-V1.1, while maintaining full transparency.

Q: What are the recommended use cases?

The model excels in multimodal tasks including visual question-answering, image understanding, mathematical reasoning with visual context, and OCR applications. It's particularly suited for applications requiring both visual and textual understanding with high accuracy.
