nanoLLaVA

qnguyen3

An efficient 1B-parameter vision-language model optimized for edge devices. It combines the Quyen-SE-v0.1 LLM with a SigLIP vision encoder and delivers strong performance on VQA tasks.

  • Parameter Count: 1.05B
  • Model Type: Vision-Language Model
  • License: Apache-2.0
  • Tensor Type: BF16

What is nanoLLaVA?

nanoLLaVA is a compact yet powerful vision-language model designed specifically for edge device deployment. Built on the foundation of Quyen-SE-v0.1 (Qwen1.5-0.5B) as its base LLM and utilizing google/siglip-so400m-patch14-384 as its vision encoder, this model achieves impressive performance despite its relatively small size of 1.05B parameters.

Implementation Details

The model follows the ChatML standard for prompt formatting and can be run with the Hugging Face transformers library. It supports both CPU and CUDA inference through PyTorch.

  • Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
  • Vision Encoder: google/siglip-so400m-patch14-384
  • Tensor Format: BF16
  • Comprehensive multimodal understanding capabilities
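As a sketch of the ChatML convention the model follows, the helper below assembles a prompt string for a visual question. The system message, the `<image>` placeholder, and the helper name are illustrative assumptions, not taken from the official model card; consult the model's documentation for the authoritative template.

```python
# Minimal sketch of a ChatML-formatted prompt for a vision-language model
# such as nanoLLaVA. The <image> placeholder marks where the vision
# encoder's image tokens would be spliced in (an assumption here).

def build_chatml_prompt(question: str,
                        system: str = "Answer the question carefully.") -> str:
    """Wrap a user question and an image placeholder in ChatML markup."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n<image>\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("What objects are on the table?")
print(prompt)
```

The trailing `<|im_start|>assistant` turn is left open so that generation continues from the assistant's role, which is the usual ChatML inference pattern.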

Core Capabilities

  • VQA v2 Score: 70.84
  • TextVQA Performance: 46.71
  • ScienceQA Accuracy: 58.97
  • POPE Score: 84.1
  • MMMU Test Performance: 28.6
  • GQA Score: 54.79

Frequently Asked Questions

Q: What makes this model unique?

nanoLLaVA stands out for its efficient design that enables deployment on edge devices while maintaining strong performance across various vision-language tasks. Its compact size of 1.05B parameters makes it particularly suitable for resource-constrained environments.

Q: What are the recommended use cases?

The model is ideal for applications requiring visual question answering, image description, and general vision-language understanding tasks on edge devices. It's particularly effective for scenarios where computational resources are limited but reliable multimodal understanding is necessary.
