MM-Embed
| Property | Value |
|---|---|
| License | CC-BY-NC-4.0 |
| Base Architecture | LLaVA v1.6 Mistral-7B |
| Paper | arXiv:2411.02571 |
| Maximum Token Length | 32,768 tokens |
What is MM-Embed?
MM-Embed is a multimodal retrieval model from NVIDIA that extends NV-Embed-v1 to handle both text and image inputs. It achieves state-of-the-art results on the UniIR multimodal retrieval benchmark with an average score of 52.7, surpassing the previous best of 48.9, and raises text retrieval accuracy to 60.3 across the 15 retrieval tasks of the MTEB benchmark.
Implementation Details
The model is built on a decoder-only transformer architecture, using the LLaVA v1.6 Mistral-7B foundation. Its training combines modality-aware hard negative mining, which improves multimodal retrieval accuracy, with a continual text-to-text fine-tuning stage. Key characteristics (a usage sketch follows the list below):
- Supports both text and image inputs in various formats
- Employs PyTorch as the runtime engine
- Optimized for NVIDIA Hopper architecture
- Handles maximum token length of 32,768
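The checkpoint is published on Hugging Face as `nvidia/MM-Embed` and loads through the `transformers` remote-code path. The snippet below is a minimal sketch rather than the official usage: it assumes the repository exposes an `encode`-style helper (similar to NV-Embed-v1) that accepts text queries and image-plus-text candidates and returns embeddings; the exact method name, argument format, and dimensions should be verified against the model card.

```python
# Minimal loading sketch; method names marked ASSUMPTION are illustrative only.
import torch
from PIL import Image
from transformers import AutoModel

# trust_remote_code is required because the embedding head is defined in the model repo.
model = AutoModel.from_pretrained(
    "nvidia/MM-Embed",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda").eval()

# Hypothetical inputs: a text query and an image candidate.
query = "a red sports car parked on a city street"
image = Image.open("car.jpg")

with torch.no_grad():
    # ASSUMPTION: an encode() helper akin to NV-Embed-v1; the real signature
    # (e.g. separate text/image arguments or a processor step) may differ.
    q_emb = model.encode([query], is_query=True)
    d_emb = model.encode([{"img": image, "txt": ""}], is_query=False)

print(q_emb.shape, d_emb.shape)  # (1, hidden_dim) each, if the assumptions hold
```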
Core Capabilities
- Multimodal retrieval with state-of-the-art accuracy
- Enhanced text-to-text retrieval performance
- Flexible input handling for both text and images
- Support for high-dimensional embedding generation
- Integrated instruction-based querying system
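To make instruction-based querying and embedding retrieval concrete, the sketch below shows the pattern typically used with embedding models of this family: prepend a task instruction to the query, embed query and candidates, and rank candidates by cosine similarity. The instruction wording and the placeholder embeddings are illustrative assumptions, not the documented API; in practice the embeddings would come from the encode step shown above.

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, cand_embs: torch.Tensor, top_k: int = 5):
    """Rank candidate embeddings against one query embedding by cosine similarity."""
    q = F.normalize(query_emb, dim=-1)   # (1, d)
    c = F.normalize(cand_embs, dim=-1)   # (n, d)
    scores = q @ c.T                     # (1, n) cosine similarities
    top = torch.topk(scores.squeeze(0), k=min(top_k, c.shape[0]))
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Illustrative instruction-prefixed query (wording is hypothetical, not from the model card).
instruction = "Instruct: Retrieve an image that matches the given caption.\nQuery: "
query_text = instruction + "a red sports car parked on a city street"

# Placeholder embeddings for demonstration; real ones come from model.encode(...).
query_emb = torch.randn(1, 4096)
cand_embs = torch.randn(100, 4096)
print(rank_candidates(query_emb, cand_embs))
```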
Frequently Asked Questions
Q: What makes this model unique?
MM-Embed stands out for its superior performance in multimodal retrieval tasks and its ability to maintain high accuracy in both text-to-text and multimodal scenarios. Its implementation of modality-aware hard negative mining and continual fine-tuning strategies represents a significant advancement in the field.
Q: What are the recommended use cases?
The model is ideal for research applications requiring multimodal retrieval, including text-image search systems, document retrieval, and question-answering systems. However, due to its CC-BY-NC-4.0 license, it should not be used for commercial purposes. For commercial applications, NVIDIA recommends using NeMo Retriever Microservices (NIMs).