moshika-vis-pytorch-bf16

kyutai

MoshiVis is a perceptually augmented multimodal model that combines vision, speech, and text capabilities. It is built on the Moshi backbone with a PaliGemma2 vision encoder and has ~7.6B total parameters.

Property          Value
License           CC-BY-4.0
Author            Kyutai
Paper             arXiv:2503.15633
Total Parameters  ~7.6B (7B Moshi + 400M PaliGemma2 + 200M new)

What is moshika-vis-pytorch-bf16?

MoshiVis is a multimodal model that extends the Moshi conversational agent with visual understanding. It pairs a frozen Moshi backbone (~7B parameters) with a PaliGemma2 vision encoder (~400M parameters) and adds only ~200M trainable parameters, which lets it discuss images in conversation while keeping latency low.
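To make the parameter budget concrete, the short sketch below adds up the approximate counts quoted on this page and estimates the size of the weights in bfloat16 (2 bytes per parameter). The dictionary keys and function names are hypothetical, chosen only for illustration, and the memory figure covers weights alone, not activations or caches.

```python
# Approximate parameter counts quoted on this page, in billions.
# The key names are illustrative, not identifiers from the MoshiVis codebase.
PARAMS_B = {
    "moshi_backbone": 7.0,
    "paligemma2_encoder": 0.4,
    "new_parameters": 0.2,
}

BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def total_params_b(parts):
    """Sum of all components, in billions of parameters."""
    return sum(parts.values())


def share(parts, name):
    """Fraction of the total contributed by one component."""
    return parts[name] / total_params_b(parts)


def weight_memory_gb(params_b, bytes_per_param=BYTES_PER_BF16):
    """Weights-only size in decimal gigabytes (1e9 params * bytes / 1e9)."""
    return params_b * bytes_per_param


if __name__ == "__main__":
    total = total_params_b(PARAMS_B)
    print(f"total: ~{total:.1f}B parameters")
    print(f"new parameters: ~{share(PARAMS_B, 'new_parameters'):.1%} of total")
    print(f"bf16 weights: ~{weight_memory_gb(total):.1f} GB")
```

Under these rounded figures, the newly added parameters are under 3% of the model, which is the sense in which the extension is small relative to the backbone.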

Implementation Details

The model uses a cross-attention mechanism to infuse visual information into the language model and is implemented in PyTorch with bfloat16 precision. It was trained on a single DGX node with 8 H100 GPUs on public datasets including DOCCI, PixMo, and DocVQA.

  • Efficient architecture with minimal additional parameters
  • Cross-attention mechanism for visual-language integration
  • Multiple backend support: PyTorch, Rust, and MLX
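The cross-attention idea mentioned above can be illustrated with a minimal, dependency-free sketch: token vectors from the language model act as queries, and image-patch vectors from the vision encoder act as keys and values. This is generic single-head scaled dot-product attention, not the MoshiVis implementation. It omits the learned projection matrices, and the function names and the plain residual add in `fuse` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per token (language-model side)
    keys, values: one vector per image patch (vision-encoder side)
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this token to every image patch, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the patch value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(out_dim)
        ]
        outputs.append(attended)
    return outputs


def fuse(hidden, attended):
    """Add the visual signal to the token states (assumed residual add)."""
    return [
        [h + a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(hidden, attended)
    ]


if __name__ == "__main__":
    tokens = [[1.0, 0.0]]               # one token state
    patches = [[1.0, 0.0], [0.0, 1.0]]  # two image patches
    visual = cross_attention(tokens, patches, patches)
    print(fuse(tokens, visual))
```

Because the language model only reads the image through this extra pathway, the backbone weights can stay frozen while the added attention parameters are trained.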

Core Capabilities

  • Natural conversation with visual context understanding
  • Low-latency interactions despite multimodal capabilities
  • Casual conversations, basic facts, and advice
  • Image recognition and discussion
  • Adaptable to different visual domains

Frequently Asked Questions

Q: What makes this model unique?

MoshiVis combines visual perception with conversational abilities while adding only a small number of parameters to the base model. This design keeps interactions low-latency even though the model handles multiple modalities.

Q: What are the recommended use cases?

The model is best suited for research purposes in conversational AI with visual understanding, casual conversations, basic fact-finding, and roleplaying scenarios. However, it's not recommended for professional advice or impersonation purposes.
