MM-Embed
| Property | Value |
|---|---|
| License | CC-BY-NC-4.0 |
| Base Architecture | LLaVA v1.6 Mistral-7B |
| Paper | arXiv:2411.02571 |
| Maximum Token Length | 32,768 tokens |
What is MM-Embed?
MM-Embed is a multimodal retrieval model from NVIDIA that extends NV-Embed-v1 to handle both text and image inputs. It achieves state-of-the-art results on the UniIR multimodal retrieval benchmark with an average score of 52.7, surpassing the previous best of 48.9, and raises text retrieval accuracy to 60.3 across the 15 retrieval tasks of the MTEB benchmark.
Implementation Details
The model is built on a decoder-only transformer architecture, using the LLaVA v1.6 Mistral-7B foundation. Its training combines modality-aware hard negative mining, which improves multimodal retrieval accuracy, with a continual text-to-text fine-tuning stage. Key characteristics (a usage sketch follows the list below):
- Supports both text and image inputs in various formats
- Employs PyTorch as the runtime engine
- Optimized for NVIDIA Hopper architecture
- Handles maximum token length of 32,768
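The checkpoint is published on Hugging Face as `nvidia/MM-Embed` and loads through the `transformers` remote-code path. The snippet below is a minimal sketch rather than the official usage: it assumes the repository exposes an `encode`-style helper (similar to NV-Embed-v1) that accepts text queries and image-plus-text candidates and returns embeddings; the exact method name, argument format, and dimensions should be verified against the model card.

```python
# Minimal loading sketch; method names marked ASSUMPTION are illustrative only.
import torch
from PIL import Image
from transformers import AutoModel

# trust_remote_code is required because the embedding head is defined in the model repo.
model = AutoModel.from_pretrained(
    "nvidia/MM-Embed",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda").eval()

# Hypothetical inputs: a text query and an image candidate.
query = "a red sports car parked on a city street"
image = Image.open("car.jpg")

with torch.no_grad():
    # ASSUMPTION: an encode() helper akin to NV-Embed-v1; the real signature
    # (e.g. separate text/image arguments or a processor step) may differ.
    q_emb = model.encode([query], is_query=True)
    d_emb = model.encode([{"img": image, "txt": ""}], is_query=False)

print(q_emb.shape, d_emb.shape)  # (1, hidden_dim) each, if the assumptions hold
```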
Core Capabilities
- Multimodal retrieval with state-of-the-art accuracy
- Enhanced text-to-text retrieval performance
- Flexible input handling for both text and images
- Support for high-dimensional embedding generation
- Integrated instruction-based querying system
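To make instruction-based querying and embedding retrieval concrete, the sketch below shows the pattern typically used with embedding models of this family: prepend a task instruction to the query, embed query and candidates, and rank candidates by cosine similarity. The instruction wording and the placeholder embeddings are illustrative assumptions, not the documented API; in practice the embeddings would come from the encode step shown above.

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, cand_embs: torch.Tensor, top_k: int = 5):
    """Rank candidate embeddings against one query embedding by cosine similarity."""
    q = F.normalize(query_emb, dim=-1)   # (1, d)
    c = F.normalize(cand_embs, dim=-1)   # (n, d)
    scores = q @ c.T                     # (1, n) cosine similarities
    top = torch.topk(scores.squeeze(0), k=min(top_k, c.shape[0]))
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Illustrative instruction-prefixed query (wording is hypothetical, not from the model card).
instruction = "Instruct: Retrieve an image that matches the given caption.\nQuery: "
query_text = instruction + "a red sports car parked on a city street"

# Placeholder embeddings for demonstration; real ones come from model.encode(...).
query_emb = torch.randn(1, 4096)
cand_embs = torch.randn(100, 4096)
print(rank_candidates(query_emb, cand_embs))
```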
Frequently Asked Questions
Q: What makes this model unique?
MM-Embed stands out for its superior performance in multimodal retrieval tasks and its ability to maintain high accuracy in both text-to-text and multimodal scenarios. Its implementation of modality-aware hard negative mining and continual fine-tuning strategies represents a significant advancement in the field.
Q: What are the recommended use cases?
The model is ideal for research applications requiring multimodal retrieval, including text-image search systems, document retrieval, and question-answering systems. However, due to its CC-BY-NC-4.0 license, it should not be used for commercial purposes. For commercial applications, NVIDIA recommends using NeMo Retriever Microservices (NIMs).