moshika-vis-pytorch-bf16

Maintained By
kyutai

MoshiVis

Property          Value
License           CC-BY-4.0
Author            Kyutai
Paper             arXiv:2503.15633
Total Parameters  ~7.6B (7B Moshi + 400M PaliGemma2 + 200M new)

What is moshika-vis-pytorch-bf16?

MoshiVis is a multimodal model that extends the Moshi conversational agent with visual understanding. It is built on a frozen Moshi backbone (~7B parameters) and a PaliGemma2 vision encoder (~400M parameters), and adds only ~200M trainable parameters so that the model can discuss images while keeping latency low.
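The parameter figures quoted above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the approximate counts from this page; the dictionary keys are hypothetical labels, and the memory estimate assumes 2 bytes per bfloat16 weight and covers weights only (no activations or caches).

```python
# Approximate parameter counts quoted on this page (labels are illustrative).
PARAMS = {
    "moshi_backbone": 7_000_000_000,    # frozen
    "paligemma2_encoder": 400_000_000,  # vision encoder
    "new_modules": 200_000_000,         # newly added, trainable
}
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits

total = sum(PARAMS.values())
new_share = PARAMS["new_modules"] / total
weights_gb = total * BYTES_PER_BF16 / 1e9  # decimal gigabytes, weights only

print(f"total parameters: {total / 1e9:.1f}B")
print(f"share of new parameters: {new_share:.1%}")
print(f"approx. bf16 weight size: {weights_gb:.1f} GB")
```

Under these assumptions the newly added modules make up under 3% of the total, which is the sense in which the architecture adds few parameters.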

Implementation Details

The model uses a cross-attention mechanism to inject visual information from the vision encoder into the language model. This checkpoint is the PyTorch release in bfloat16 precision. It was trained on a single DGX node with 8 H100 GPUs, using public datasets including DOCCI, PixMo, and DocVQA.

  • Efficient architecture with minimal additional parameters
  • Cross-attention mechanism for visual-language integration
  • Multiple backend support: PyTorch, Rust, and MLX
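To make the cross-attention idea concrete, here is a minimal PyTorch sketch of a layer in which text-side hidden states attend to image tokens. It is not the MoshiVis implementation: the class name, the single-head design, the dimensions, and the zero-initialised tanh gate are all assumptions chosen for illustration. The gate is one common way to add a new branch to a frozen backbone without changing its behaviour at initialisation.

```python
import math

import torch
from torch import nn


class GatedCrossAttention(nn.Module):
    """Illustrative single-head cross-attention with a residual gate.

    Hypothetical module, not taken from the MoshiVis code base.
    """

    def __init__(self, d_model: int, d_image: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_image, d_model, bias=False)
        self.v_proj = nn.Linear(d_image, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # Gate starts at zero, so the layer is an identity map at initialisation.
        self.gate = nn.Parameter(torch.zeros(1))
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(hidden)        # (batch, seq_len, d_model)
        k = self.k_proj(image_tokens)  # (batch, n_image_tokens, d_model)
        v = self.v_proj(image_tokens)  # (batch, n_image_tokens, d_model)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        weights = torch.softmax(scores, dim=-1)  # attention over image tokens
        attended = self.out_proj(torch.matmul(weights, v))
        # Residual connection: visual information is added to the text stream.
        return hidden + torch.tanh(self.gate) * attended


torch.manual_seed(0)
layer = GatedCrossAttention(d_model=64, d_image=32)
hidden = torch.randn(2, 5, 64)  # stand-in for language-model hidden states
image = torch.randn(2, 9, 32)   # stand-in for vision-encoder outputs
out = layer(hidden, image)
print(out.shape)
```

In this sketch only the projections and the gate would need training, while the backbone producing `hidden` could stay frozen. A module like this can be cast with `layer.to(torch.bfloat16)` to match a bfloat16 checkpoint, provided the inputs use the same dtype.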

Core Capabilities

  • Natural conversation with visual context understanding
  • Low-latency interactions despite multimodal capabilities
  • Casual conversations, basic facts, and advice
  • Image recognition and discussion
  • Adaptable to different visual domains

Frequently Asked Questions

Q: What makes this model unique?

MoshiVis adds visual perception to an existing conversational model while adding only a small number of parameters (~200M on top of a frozen ~7B backbone). This keeps interactions low-latency even though the model handles multiple modalities.

Q: What are the recommended use cases?

The model is best suited for research on conversational AI with visual understanding, casual conversations, basic fact-finding, and roleplaying scenarios. It is not recommended for professional advice or for impersonation.
