AKI-4B-phi-3.5-mini

Maintained by Sony

Property          Value
Author            Sony
Vision Encoder    google/siglip-so400m-patch14-384
Language Model    microsoft/Phi-3.5-mini-instruct
License           CC-BY-NC 4.0
Paper             arXiv:2503.02597

What is AKI-4B-phi-3.5-mini?

AKI-4B-phi-3.5-mini is a multimodal foundation model built around modality-mutual attention (MMA), a mechanism for improving vision-language alignment. The model unlocks the causal attention in the language model so that information can flow in both directions between the image and text modalities, and it does so without additional parameters or increased training time.
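The core mechanism can be pictured as a change to the attention mask. The sketch below is an illustrative reading of that description, not the paper's exact formulation: text tokens keep their usual causal view, while the image-token prefix is additionally allowed to attend to the text that follows it.

    import torch

    def modality_mutual_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
        """Illustrative attention mask (True = attention allowed): causal overall,
        with image tokens additionally permitted to attend to later text tokens."""
        n = num_image_tokens + num_text_tokens
        mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # standard causal mask
        mask[:num_image_tokens, num_image_tokens:] = True       # unlock image -> text attention
        return mask

    # Example: 4 image tokens followed by 6 text tokens.
    print(modality_mutual_mask(4, 6).int())

Because only the mask changes, no new parameters are introduced, which matches the zero-overhead claim above.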

Implementation Details

The model architecture combines three key components: a SigLIP vision encoder for image processing, a Perceiver Resampler connecting vision and language, and Phi-3.5-mini for language processing. It was pretrained on large-scale image-text datasets including BLIP3-KALE and BLIP3-OCR-200M, and fine-tuned on multiple task-specific datasets.

  • Advanced vision-language alignment through modality-mutual attention
  • Zero-parameter overhead implementation
  • Extensive training on diverse multimodal datasets
  • Support for various vision-language tasks
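In rough terms, the three components described above are wired as in the following sketch; the class and method names are placeholders, not the released implementation:

    import torch
    import torch.nn as nn

    class AKISketch(nn.Module):
        """Illustrative wiring of the SigLIP encoder, Perceiver Resampler, and Phi-3.5-mini LM."""

        def __init__(self, vision_encoder: nn.Module, resampler: nn.Module,
                     language_model: nn.Module):
            super().__init__()
            self.vision_encoder = vision_encoder   # SigLIP image tower
            self.resampler = resampler             # Perceiver Resampler: patch features -> fixed visual tokens
            self.language_model = language_model   # Phi-3.5-mini with MMA-style masking

        def forward(self, pixel_values: torch.Tensor,
                    text_embeds: torch.Tensor) -> torch.Tensor:
            patch_features = self.vision_encoder(pixel_values)   # (B, num_patches, d_vis)
            visual_tokens = self.resampler(patch_features)       # (B, num_queries, d_lm)
            # Visual tokens are placed ahead of the text embeddings in the input sequence.
            sequence = torch.cat([visual_tokens, text_embeds], dim=1)
            return self.language_model(sequence)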

Core Capabilities

  • Visual question answering (VQAv2, GQA)
  • Optical character recognition understanding (OCRVQA)
  • Scientific reasoning (ScienceQA)
  • Visual reference resolution (RefCOCO series)
  • General visual grounding (Visual Genome)

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its modality-mutual attention mechanism, which allows bidirectional information flow between the vision and language components without requiring additional parameters. This leads to significant gains across multiple benchmarks, reaching up to 29.5% improvement on certain tasks.

Q: What are the recommended use cases?

The model is particularly well-suited for chat-based interactions involving image analysis, visual question answering, and tasks requiring detailed understanding of visual content. It performs optimally when used with the specified chat format and can handle a wide range of vision-language tasks.
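The exact chat template ships with the released code; as a purely hypothetical illustration, a Phi-3.5-style prompt with an image placeholder could be assembled like this (the <image> token and template literals are assumptions, not the official format):

    # Hypothetical prompt construction in a Phi-3.5-style chat format.
    # The actual image placeholder and template are defined by the AKI release.
    def build_prompt(question: str) -> str:
        return (
            "<|user|>\n"
            "<image>\n"              # assumed placeholder for the visual tokens
            f"{question}<|end|>\n"
            "<|assistant|>\n"
        )

    print(build_prompt("What objects are on the table?"))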
