msmarco-spanish-mt5-base-v1
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Model | google/mt5-base |
| Training Dataset | unicamp-dl/mMARCO |
| Primary Paper | doc2query Paper |
What is msmarco-spanish-mt5-base-v1?
msmarco-spanish-mt5-base-v1 is a doc2query model based on the mT5 architecture, designed specifically for Spanish-language document expansion and query generation. The model was fine-tuned for 66,000 training steps on 500,000 training pairs from the Spanish portion of the mMARCO dataset (a machine-translated version of MS MARCO), making it well suited to improving search relevance and to generating training data for retrieval models.
Implementation Details
The model uses a text-to-text generation approach and supports both beam search and top-k/top-p sampling. It accepts input text up to 320 word pieces and generates queries up to 64 word pieces long. The two generation methods serve different purposes: beam search produces high-quality, focused queries, while sampling produces more diverse ones (see the sketch after the list below).
- Beam search implementation with 5 beams and no-repeat ngram size of 2
- Top-k/Top-p sampling with p=0.95 and k=10
- Supports multiple query generation per input text
- Built on the mT5-base architecture (google/mt5-base)
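A minimal generation sketch using the Hugging Face transformers library is shown below. The model ID doc2query/msmarco-spanish-mt5-base-v1 and the example passage are assumptions for illustration, not confirmed by this card; the generation parameters mirror the settings listed above.

```python
from transformers import T5Tokenizer, MT5ForConditionalGeneration

# Assumed Hugging Face model ID; adjust if the model is hosted elsewhere.
model_name = "doc2query/msmarco-spanish-mt5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Example passage (illustrative).
text = "Madrid es la capital de España y su ciudad más poblada."

# Truncate the input to the model's 320 word-piece limit.
inputs = tokenizer(text, max_length=320, truncation=True, return_tensors="pt")

# Beam search: high-quality, focused queries.
beam_outputs = model.generate(
    **inputs,
    max_length=64,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
)

# Top-k/top-p sampling: more diverse queries.
sampled_outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,
    top_k=10,
    top_p=0.95,
    num_return_sequences=5,
)

for output in list(beam_outputs) + list(sampled_outputs):
    print(tokenizer.decode(output, skip_special_tokens=True))
```

In practice, the beam-search and sampled queries are often concatenated: the former stay close to the passage wording, while the latter surface alternative phrasings.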
Core Capabilities
- Document Expansion: Generates 20-40 queries per paragraph and appends them to the document text for enhanced BM25 indexing (see the sketch after this list)
- Training Data Generation: Creates (query, text) pairs for training dense embedding models
- Lexical Gap Bridging: Generates synonyms and alternative phrasings
- Multi-query Generation: Supports both deterministic and non-deterministic query generation
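As a concrete illustration of the expansion step, here is a minimal, hypothetical helper. The `generate_queries` callable is an assumption that stands in for the generation code shown earlier; it is not part of the model's API.

```python
from typing import Callable, List

def expand_document(
    text: str,
    generate_queries: Callable[[str, int], List[str]],
    num_queries: int = 20,
) -> str:
    """Append generated queries to a passage before BM25 indexing."""
    queries = generate_queries(text, num_queries)
    # The appended queries add likely search terms, bridging the lexical gap
    # between how a document is written and how users phrase their queries.
    return text + " " + " ".join(queries)
```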
Frequently Asked Questions
Q: What makes this model unique?
This model specializes in Spanish-language document expansion, combining the multilingual mT5 architecture with the doc2query methodology. Its ability to generate both precise and diverse queries makes it valuable for improving search systems and for generating training data.
Q: What are the recommended use cases?
The model is ideal for enhancing Spanish language search systems through document expansion, generating training data for embedding models, and improving information retrieval systems. It's particularly effective when integrated with standard BM25 indexes like Elasticsearch or OpenSearch.
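A hedged sketch of that integration follows, using the official elasticsearch Python client. The index name, field names, and the reuse of `expand_document` and `generate_queries` from the sketches above are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch instance (illustrative endpoint).
es = Elasticsearch("http://localhost:9200")

passage = "Madrid es la capital de España y su ciudad más poblada."
# expand_document / generate_queries come from the sketches above (hypothetical).
expanded = expand_document(passage, generate_queries)

# Store the original text alongside the query-expanded field.
es.index(index="docs", document={"text": passage, "expanded_text": expanded})

# BM25 search over the expanded field now matches query phrasings
# that may never appear verbatim in the original passage.
resp = es.search(index="docs", query={"match": {"expanded_text": "capital de España"}})
print(resp["hits"]["hits"])
```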