VARCO-VISION-14B

Maintained By: NCSOFT

Parameter Count: 15.2B
Base Architecture: LLaVA-OneVision
License: CC BY-NC 4.0
Languages: English, Korean
Paper: Research Paper

What is VARCO-VISION-14B?

VARCO-VISION-14B is an advanced Vision-Language Model developed by NC Research's Multimodal Generation Team. Built on the Qwen2.5-14B-Instruct foundation with google/siglip-so400m-patch14-384 as its vision encoder, the model handles both English and Korean, making it particularly valuable for bilingual multimodal applications.

Implementation Details

The model architecture follows LLaVA-OneVision and was trained in four distinct phases, culminating in a preference optimization stage. The released weights use the BF16 tensor type, and the model supports FlashAttention-2 for faster inference; a minimal loading sketch follows the feature list below.

  • Specialized token system for grounding, OCR, and object referencing
  • Support for single image and text input combinations
  • Advanced bounding box detection capabilities
  • Integrated OCR functionality for text recognition in images
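
As a rough illustration of the settings mentioned above, the sketch below loads the model with the Hugging Face transformers LLaVA-OneVision classes in BF16 with FlashAttention-2 enabled. The repo id NCSOFT/VARCO-VISION-14B and the availability of the flash-attn package are assumptions; consult the official model card for the exact recipe.

```python
# Minimal loading sketch (assumes the repo id "NCSOFT/VARCO-VISION-14B",
# a transformers version with LLaVA-OneVision support, and flash-attn installed).
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "NCSOFT/VARCO-VISION-14B"  # assumed Hugging Face repo id

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # BF16 weights, as noted above
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```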

Core Capabilities

  • Multimodal understanding and generation in English and Korean
  • Object grounding with precise bounding box coordinates
  • Text recognition and extraction from images
  • Visual question answering with location-specific responses
  • Detailed image description with object localization
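
To make the capabilities listed above concrete, here is a hedged single-image question-answering sketch. It reuses the model and processor from the loading example and follows the generic LLaVA-OneVision chat-template format; the image URL is a placeholder, and the exact prompt conventions (including the special grounding/OCR tokens) should be taken from the official model card.

```python
# Hedged single-image VQA sketch; reuses `model` and `processor` from the
# loading example above. The image URL is a placeholder.
import requests
import torch
from PIL import Image

image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the objects in this image and where they are."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated answer.
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```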

Frequently Asked Questions

Q: What makes this model unique?

VARCO-VISION-14B stands out for its bilingual capabilities and comprehensive feature set, including grounding, OCR, and referring capabilities, all while maintaining performance levels comparable to proprietary models. Its specialized token system allows for precise object localization and text recognition tasks.

Q: What are the recommended use cases?

The model is ideal for applications requiring bilingual image understanding, detailed object detection, text extraction from images, and visual question answering. It's particularly suitable for applications needing precise object localization and text recognition in both English and Korean contexts.
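
For localization use cases, grounded responses carry bounding-box coordinates that downstream code usually needs in pixel space. The helper below is a hypothetical post-processing sketch: it assumes boxes appear in the generated text as four comma-separated normalized coordinates (x1, y1, x2, y2), which may not match the model's exact output tags; check the model card for the real format.

```python
# Hypothetical post-processing helper: pull (x1, y1, x2, y2) groups of
# normalized coordinates out of a generated string and rescale them to pixels.
# The actual tag format emitted by VARCO-VISION may differ; adjust the regex.
import re

_BOX_RE = re.compile(r"(\d*\.\d+)\s*,\s*(\d*\.\d+)\s*,\s*(\d*\.\d+)\s*,\s*(\d*\.\d+)")

def extract_boxes(text):
    """Return all 4-float coordinate groups found in the generated text."""
    return [tuple(float(v) for v in match) for match in _BOX_RE.findall(text)]

def to_pixels(box, width, height):
    """Convert a normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

# Example: boxes = [to_pixels(b, image.width, image.height) for b in extract_boxes(answer)]
```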
