Sa2VA-26B

ByteDance

Sa2VA-26B is a multimodal LLM that combines SAM2 and LLaVA capabilities for advanced image/video understanding, segmentation, and QA tasks at 26B parameters.

Property          Value
Base Model        InternVL2.5-26B
Language Model    internlm2_5-20b-chat
MMBench Score     85.8
Model Hub         ByteDance/Sa2VA-26B

What is Sa2VA-26B?

Sa2VA-26B is a multimodal large language model that combines SAM2 with LLaVA to enable dense, grounded understanding of both images and videos. It is the largest model in the Sa2VA family, built on the InternVL2.5 architecture with 26 billion parameters.

Implementation Details

The model uses a hybrid architecture that integrates visual understanding with language processing. It takes InternVL2.5-26B as its base model and uses internlm2_5-20b-chat as its language model.

  • Supports both image and video processing
  • Implements flash attention for improved performance
  • Can be distributed across multiple GPUs
  • Provides BFloat16 precision support
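The features listed above (BFloat16, flash attention, multi-GPU placement) usually show up as load-time options. The following is a minimal sketch, not the authors' documented procedure. It assumes the checkpoint loads through Hugging Face `transformers` with `trust_remote_code=True`, and that its remote code accepts a `use_flash_attn` keyword as InternVL-style checkpoints do. The helper names (`estimate_weight_memory_gb`, `build_load_kwargs`, `load_model`) are hypothetical. `device_map="auto"` is a standard `transformers` option that requires `accelerate`; whether it splits this particular model cleanly is untested here.

```python
# Sketch only: loading Sa2VA-26B in BFloat16 across the visible GPUs.
# Assumptions (not stated on this page): the checkpoint loads via
# transformers' AutoModel with trust_remote_code=True, and its remote
# code accepts a `use_flash_attn` keyword.

MODEL_ID = "ByteDance/Sa2VA-26B"  # Model Hub ID given on this page


def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough size of the weights alone; BFloat16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


def build_load_kwargs(use_flash_attn: bool = True) -> dict:
    """Collect keyword arguments for from_pretrained (dtype is added at load time)."""
    return {
        "low_cpu_mem_usage": True,
        "device_map": "auto",              # let accelerate place layers on GPUs
        "use_flash_attn": use_flash_attn,  # assumed remote-code keyword
        "trust_remote_code": True,
    }


def load_model(model_id: str = MODEL_ID):
    """Download and load the model; needs torch, transformers, accelerate, and GPUs."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, **build_load_kwargs()
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer


if __name__ == "__main__":
    # 26 billion parameters at 2 bytes each, before activations and caches.
    print(f"~{estimate_weight_memory_gb(26e9):.0f} GB of weights in BFloat16")
```

The estimate covers weights only, which is why a model of this size generally needs more than one GPU or a single very large one. Call `load_model()` only on a machine with enough GPU memory.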

Core Capabilities

  • Question answering on visual content, with a score of 85.8 on MMBench
  • Dense object segmentation for both images and videos
  • Visual prompt understanding and processing
  • Strong referring segmentation results on RefCOCO, RefCOCO+, and RefCOCOg (82.9/79.3/81.2)
  • Video object segmentation with a DAVIS score of 78.6
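To show how question answering and segmentation can share one interface, here is a hedged sketch of a request builder. It assumes the model's remote code exposes a `predict_forward` method that takes keys such as `image`, `video`, `text`, `past_text`, `mask_prompts`, and `tokenizer`, and returns a dict with `prediction` and, for segmentation prompts, `prediction_masks`. These names are assumptions to check against the model card on the hub. `build_request` and `run` are hypothetical helpers.

```python
from typing import Any, Optional

IMAGE_TOKEN = "<image>"  # assumed placeholder marking where the visual input goes


def build_request(
    prompt: str,
    tokenizer: Any,
    image: Optional[Any] = None,
    video_frames: Optional[list] = None,
) -> dict:
    """Build the keyword arguments for one image OR one video query."""
    if (image is None) == (video_frames is None):
        raise ValueError("pass exactly one of image or video_frames")
    request = {
        "text": IMAGE_TOKEN + prompt,
        "past_text": "",       # no earlier conversation turns
        "mask_prompts": None,  # no visual prompts in this sketch
        "tokenizer": tokenizer,
    }
    if image is not None:
        request["image"] = image
    else:
        request["video"] = video_frames
    return request


def run(model: Any, request: dict):
    """Call the assumed predict_forward method and unpack text and masks."""
    result = model.predict_forward(**request)
    # Masks are only expected when the prompt asks for segmentation.
    return result["prediction"], result.get("prediction_masks")
```

A question such as "Please describe the image." and an instruction such as "Please segment the person in the image." would go through the same call; only the second is expected to return masks.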

Frequently Asked Questions

Q: What makes this model unique?

Sa2VA-26B stands out for combining dense visual understanding with language capabilities in a single model. It offers question answering alongside segmentation and visual prompt understanding, capabilities that general-purpose multimodal models such as Qwen2-VL and InternVL2.5 do not provide.

Q: What are the recommended use cases?

The model excels in image and video analysis tasks including detailed object segmentation, visual question answering, and dense visual understanding. It's particularly suitable for applications requiring both visual analysis and natural language interaction.
