msmarco-spanish-mt5-base-v1
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Model | google/mt5-base |
| Training Dataset | unicamp-dl/mMARCO |
| Primary Paper | doc2query Paper |
What is msmarco-spanish-mt5-base-v1?
msmarco-spanish-mt5-base-v1 is a doc2query model based on the mT5 architecture, designed specifically for Spanish-language document expansion and query generation. The model was fine-tuned for 66,000 training steps on 500,000 training pairs from the Spanish portion of the mMARCO dataset (a machine-translated version of MS MARCO), making it well suited to improving search relevance and to generating training data for retrieval models.
Implementation Details
The model uses a text-to-text generation approach and supports both beam search and top-k/top-p sampling. It accepts input text up to 320 word pieces and generates queries up to 64 word pieces long. The two generation methods serve different purposes: beam search produces high-quality, focused queries, while sampling produces more diverse ones (see the sketch after the list below).
- Beam search implementation with 5 beams and no-repeat ngram size of 2
- Top-k/Top-p sampling with p=0.95 and k=10
- Supports multiple query generation per input text
- Built on the mT5-base architecture (google/mt5-base)
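A minimal generation sketch using the Hugging Face transformers library is shown below. The model ID doc2query/msmarco-spanish-mt5-base-v1 and the example passage are assumptions for illustration, not confirmed by this card; the generation parameters mirror the settings listed above.

```python
from transformers import T5Tokenizer, MT5ForConditionalGeneration

# Assumed Hugging Face model ID; adjust if the model is hosted elsewhere.
model_name = "doc2query/msmarco-spanish-mt5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Example passage (illustrative).
text = "Madrid es la capital de España y su ciudad más poblada."

# Truncate the input to the model's 320 word-piece limit.
inputs = tokenizer(text, max_length=320, truncation=True, return_tensors="pt")

# Beam search: high-quality, focused queries.
beam_outputs = model.generate(
    **inputs,
    max_length=64,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
)

# Top-k/top-p sampling: more diverse queries.
sampled_outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,
    top_k=10,
    top_p=0.95,
    num_return_sequences=5,
)

for output in list(beam_outputs) + list(sampled_outputs):
    print(tokenizer.decode(output, skip_special_tokens=True))
```

In practice, the beam-search and sampled queries are often concatenated: the former stay close to the passage wording, while the latter surface alternative phrasings.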
Core Capabilities
- Document Expansion: Generates 20-40 queries per paragraph and appends them to the document text for enhanced BM25 indexing (see the sketch after this list)
- Training Data Generation: Creates (query, text) pairs for training dense embedding models
- Lexical Gap Bridging: Generates synonyms and alternative phrasings
- Multi-query Generation: Supports both deterministic and non-deterministic query generation
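As a concrete illustration of the expansion step, here is a minimal, hypothetical helper. The `generate_queries` callable is an assumption that stands in for the generation code shown earlier; it is not part of the model's API.

```python
from typing import Callable, List

def expand_document(
    text: str,
    generate_queries: Callable[[str, int], List[str]],
    num_queries: int = 20,
) -> str:
    """Append generated queries to a passage before BM25 indexing."""
    queries = generate_queries(text, num_queries)
    # The appended queries add likely search terms, bridging the lexical gap
    # between how a document is written and how users phrase their queries.
    return text + " " + " ".join(queries)
```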
Frequently Asked Questions
Q: What makes this model unique?
This model specializes in Spanish-language document expansion, combining the multilingual mT5 architecture with the doc2query methodology. Its ability to generate both precise and diverse queries makes it valuable for improving search systems and for generating training data.
Q: What are the recommended use cases?
The model is ideal for enhancing Spanish language search systems through document expansion, generating training data for embedding models, and improving information retrieval systems. It's particularly effective when integrated with standard BM25 indexes like Elasticsearch or OpenSearch.
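A hedged sketch of that integration follows, using the official elasticsearch Python client. The index name, field names, and the reuse of `expand_document` and `generate_queries` from the sketches above are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch instance (illustrative endpoint).
es = Elasticsearch("http://localhost:9200")

passage = "Madrid es la capital de España y su ciudad más poblada."
# expand_document / generate_queries come from the sketches above (hypothetical).
expanded = expand_document(passage, generate_queries)

# Store the original text alongside the query-expanded field.
es.index(index="docs", document={"text": passage, "expanded_text": expanded})

# BM25 search over the expanded field now matches query phrasings
# that may never appear verbatim in the original passage.
resp = es.search(index="docs", query={"match": {"expanded_text": "capital de España"}})
print(resp["hits"]["hits"])
```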