msmarco-portuguese-mt5-base-v1
| Property | Value |
|---|---|
| Model Type | doc2query / mT5-based |
| Training Data | MS MARCO Portuguese Dataset |
| Training Steps | 66k (4 epochs on 500k pairs) |
| Model Hub | Hugging Face |
What is msmarco-portuguese-mt5-base-v1?
This is a specialized doc2query model built on the mT5 architecture, designed specifically for Portuguese document expansion and query generation. It was trained on the Portuguese MS MARCO dataset to generate relevant queries from document passages, improving search relevance and supplying training data for other NLP tasks.
Implementation Details
The model implements a sequence-to-sequence architecture based on mT5, fine-tuned for 66,000 training steps (4 epochs) on 500,000 Portuguese MS MARCO training pairs. It accepts inputs of up to 320 word pieces and generates queries of up to 64 word pieces.
- Supports both beam search and top-k/top-p sampling for query generation
- Implements input truncation to 320 word pieces
- Generates multiple queries per input text
- Optimized for Portuguese language processing
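The generation modes above can be sketched with the Hugging Face `transformers` API. This is a minimal sketch, not the official usage snippet: the repo id `doc2query/msmarco-portuguese-mt5-base-v1` and the specific `top_k`/`top_p` values are assumptions to adjust for your setup.

```python
def generate_queries(text: str, num_queries: int = 5, use_sampling: bool = True):
    """Generate search queries for a Portuguese passage with the doc2query model.

    Sketch only: the Hub repo id below is an assumption, not confirmed by
    this card. Requires: pip install transformers sentencepiece torch
    """
    # Lazy import so the heavy dependency loads only when queries are generated.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = "doc2query/msmarco-portuguese-mt5-base-v1"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Truncate the input to the model's 320 word-piece limit.
    inputs = tokenizer(text, max_length=320, truncation=True, return_tensors="pt")

    if use_sampling:
        # Top-k/top-p sampling produces more diverse queries.
        outputs = model.generate(
            **inputs, max_length=64, do_sample=True,
            top_k=10, top_p=0.95, num_return_sequences=num_queries)
    else:
        # Beam search produces more literal, higher-precision queries.
        outputs = model.generate(
            **inputs, max_length=64, num_beams=num_queries,
            num_return_sequences=num_queries)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Usage (downloads the model weights on first call):
#   for q in generate_queries("O Python é uma linguagem de programação."):
#       print(q)
```

Sampling is the usual choice for document expansion (diversity matters), while beam search suits cases where a few precise queries are preferred.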
Core Capabilities
- Document Expansion: Generates 20-40 queries per paragraph for enhanced BM25 indexing
- Query Generation: Creates diverse search queries from input text
- Training Data Generation: Produces (query, text) pairs for embedding model training
- Lexical Gap Bridging: Generates synonyms and alternative phrasings
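The document-expansion capability can be sketched as a simple pre-indexing step: generated queries are appended to each passage so BM25 can match their terms. The queries below are hard-coded stand-ins for model output; in practice each passage would be expanded with 20-40 generated queries.

```python
def expand_passage(passage: str, generated_queries: list[str]) -> str:
    """Append generated queries to a passage so BM25 can match their vocabulary."""
    return passage + " " + " ".join(generated_queries)

passage = "O Brasil é o maior país da América do Sul."
queries = [  # stand-in for the model's generated queries
    "qual é o maior país da américa do sul",
    "tamanho do brasil",
]
print(expand_passage(passage, queries))
```

Because the generated queries include synonyms and alternative phrasings, the expanded text also narrows the lexical gap between user queries and document vocabulary.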
Frequently Asked Questions
Q: What makes this model unique?
This model specializes in Portuguese language document expansion and query generation, utilizing both beam search and sampling techniques to generate diverse, high-quality queries. It's particularly effective for improving search relevance and generating training data for other NLP models.
Q: What are the recommended use cases?
The model is ideal for enhancing search systems through document expansion, generating training data for embedding models, and improving information retrieval systems. It can be integrated with Elasticsearch, OpenSearch, or Lucene for better search results.
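One way the search-engine integration might look is to index the expanded text in a separate field, keeping the original passage clean for display. This is a hedged sketch: the field names (`text`, `expanded_text`) and index name are illustrative, not part of this model's documentation.

```python
def build_index_doc(passage: str, generated_queries: list[str]) -> dict:
    """Build a document body for Elasticsearch/OpenSearch indexing.

    The original text is kept for display; the expanded field carries the
    doc2query output so generated queries do not pollute result snippets.
    """
    return {
        "text": passage,
        "expanded_text": passage + " " + " ".join(generated_queries),
    }

doc = build_index_doc(
    "O Brasil é o maior país da América do Sul.",
    ["qual é o maior país da américa do sul"],
)
# With the official client, this body could then be indexed via something like:
#   es.index(index="passages", document=doc)
print(doc["expanded_text"])
```

At query time, searches would target `expanded_text` (or both fields with boosting), while snippets are rendered from `text`.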