MoshiVis
| Property | Value |
|---|---|
| License | CC-BY-4.0 |
| Author | Kyutai |
| Paper | arXiv:2503.15633 |
| Total Parameters | ~7.6B (7B Moshi + 400M PaliGemma2 + 200M new) |
What is moshika-vis-pytorch-bf16?
MoshiVis is a multimodal model that extends the Moshi conversational agent with visual understanding. It combines a frozen Moshi backbone (~7B parameters) with the PaliGemma2 vision encoder (~400M parameters) and adds only ~200M newly trained parameters, enabling conversation about images while preserving Moshi's low-latency interaction.
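As a quick orientation, the sketch below shows one way the released checkpoint could be fetched from the Hugging Face Hub. The repository id `kyutai/moshika-vis-pytorch-bf16` is assumed from the model name above; consult the model card for the authoritative id and file layout.

```python
# Illustrative only: fetch the MoshiVis weights from the Hugging Face Hub.
# The repo id below is assumed from the model name, not confirmed here.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/moshika-vis-pytorch-bf16")
print(f"Checkpoint downloaded to: {local_dir}")
```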
Implementation Details
The model employs a cross-attention mechanism to inject visual information from the image encoder into the language model; the released weights are implemented in PyTorch with bfloat16 precision. It was trained on a single DGX node with 8 H100 GPUs, using public datasets including DOCCI, PixMo, and DocVQA. A minimal illustrative sketch of the cross-attention pattern follows the list below.
- Efficient architecture with minimal additional parameters
- Cross-attention mechanism for visual-language integration
- Multiple backend support: PyTorch, Rust, and MLX
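The following PyTorch sketch illustrates the general pattern of a gated cross-attention adapter as described above: a small trainable block attends from the frozen language-model stream to image features and adds the result back through a learned gate. The module name, dimensions, and gating formulation are illustrative assumptions, not the actual MoshiVis implementation.

```python
import torch
import torch.nn as nn

class GatedVisualCrossAttention(nn.Module):
    """Illustrative adapter: trainable cross-attention from text/speech tokens
    (queries) to image features (keys/values), merged into the frozen stream
    through a learned gate. Dimensions and gating are assumptions."""

    def __init__(self, d_model: int = 4096, d_visual: int = 1152, n_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_visual, vdim=d_visual, batch_first=True,
        )
        self.norm = nn.LayerNorm(d_model)
        # Gate initialised at zero so the frozen backbone's behaviour is
        # unchanged at the start of training; visual influence is learned.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # hidden:      (batch, seq_len, d_model)   frozen LM activations
        # image_feats: (batch, n_tokens, d_visual) frozen vision-encoder output
        attended, _ = self.attn(self.norm(hidden), image_feats, image_feats)
        return hidden + torch.tanh(self.gate) * attended


# Minimal usage (float32 here for simplicity; the released weights are bfloat16).
block = GatedVisualCrossAttention()
hidden = torch.randn(1, 16, 4096)
image_feats = torch.randn(1, 256, 1152)
out = block(hidden, image_feats)
print(out.shape)  # torch.Size([1, 16, 4096])
```

Only adapter blocks like this would carry trainable parameters; the Moshi backbone and the vision encoder stay frozen, which is how the total of new parameters stays around ~200M.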
Core Capabilities
- Natural conversation with visual context understanding
- Low-latency interactions despite multimodal capabilities
- Casual conversations, basic facts, and advice
- Image recognition and discussion
- Adaptable to different visual domains
Frequently Asked Questions
Q: What makes this model unique?
MoshiVis uniquely combines visual perception with conversational abilities while maintaining efficiency through minimal parameter addition to the base model. Its architecture allows for low-latency interactions despite handling multiple modalities.
Q: What are the recommended use cases?
The model is best suited for research in conversational AI with visual understanding, as well as casual conversation, basic fact-finding, and roleplay. It is not recommended for professional advice or for impersonating real people.