msmarco-portuguese-mt5-base-v1

doc2query

Portuguese doc2query model based on mT5 for document expansion and query generation. Generates multiple queries from text to improve search relevance and training data generation.

Property          Value
Model Type        doc2query / mT5-based
Training Data     MS MARCO Portuguese Dataset
Training Steps    66k (4 epochs on 500k pairs)
Model Hub         Hugging Face

What is msmarco-portuguese-mt5-base-v1?

This is a doc2query model built on the mT5 architecture and designed for Portuguese-language document expansion and query generation. It was trained on the Portuguese MS MARCO dataset to generate relevant queries from document passages. The generated queries can be used to improve search and to create training data for other NLP models.

Implementation Details

The model is a sequence-to-sequence model based on mT5, fine-tuned for 66,000 training steps (4 epochs) on 500,000 training pairs from MS MARCO. Input text is truncated to 320 word pieces, and generated queries are up to 64 word pieces long.

  • Supports both beam search and top-k/top-p sampling for query generation
  • Implements input truncation to 320 word pieces
  • Generates multiple queries per input text
  • Fine-tuned on Portuguese text
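
The two decoding modes listed above can be sketched with the Hugging Face `transformers` API. This is a minimal sketch under stated assumptions, not the authors' reference code: the hub ID `doc2query/msmarco-portuguese-mt5-base-v1` and the values chosen for `num_beams`, `no_repeat_ngram_size`, `top_k`, and `top_p` are illustrative assumptions. Only the 320 and 64 word-piece limits come from the description above.

```python
from functools import lru_cache

MODEL_ID = "doc2query/msmarco-portuguese-mt5-base-v1"  # assumed hub ID
MAX_INPUT_TOKENS = 320   # input truncation limit stated above
MAX_OUTPUT_TOKENS = 64   # output length limit stated above


def decoding_kwargs(strategy, num_queries=5):
    """Return keyword arguments for `model.generate` for one decoding mode."""
    common = {"max_length": MAX_OUTPUT_TOKENS,
              "num_return_sequences": num_queries}
    if strategy == "beam":
        # Beam search: deterministic. num_beams must be at least as large
        # as num_return_sequences. The values here are illustrative.
        return {**common, "num_beams": max(num_queries, 5),
                "no_repeat_ngram_size": 2, "early_stopping": True}
    if strategy == "sample":
        # Top-k/top-p sampling: stochastic. The values here are illustrative.
        return {**common, "do_sample": True, "top_k": 10, "top_p": 0.95}
    raise ValueError(f"unknown strategy: {strategy!r}")


@lru_cache(maxsize=1)
def load_model():
    """Load tokenizer and model once (downloads weights on first call)."""
    # Imported lazily so decoding_kwargs works without these packages.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    return tokenizer, model


def generate_queries(text, strategy="sample", num_queries=5):
    """Generate `num_queries` queries for one passage."""
    import torch
    tokenizer, model = load_model()
    # Truncate the passage to the input limit before generation.
    inputs = tokenizer(text, max_length=MAX_INPUT_TOKENS, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 **decoding_kwargs(strategy, num_queries))
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
```

Calling `generate_queries(passage, strategy="beam")` or `generate_queries(passage, strategy="sample")` switches between the two modes. Sampling results vary between runs.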

Core Capabilities

  • Document Expansion: Generate 20-40 queries per paragraph and index them together with the paragraph in a BM25 index
  • Query Generation: Creates diverse search queries from input text
  • Training Data Generation: Produces (query, text) pairs for embedding model training
  • Lexical Gap Bridging: Generated queries can contain synonyms and alternative phrasings, so lexical search can match terms that do not appear in the original text
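
The document expansion and training data uses above can be illustrated with two small helpers. This is a sketch under assumptions: the function names are hypothetical, and removing duplicate queries before appending them is a design choice made here, not something the model card prescribes. The queries would normally come from the model; any list of strings works.

```python
def unique_queries(queries):
    """Drop empty queries and case/whitespace-insensitive duplicates."""
    seen = set()
    unique = []
    for query in queries:
        key = " ".join(query.lower().split())  # normalise case and spacing
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    return unique


def expand_document(text, queries):
    """Append generated queries to a passage before lexical (BM25) indexing."""
    extra = " ".join(unique_queries(queries))
    return f"{text} {extra}" if extra else text


def training_pairs(text, queries):
    """Build (query, text) pairs for training an embedding model."""
    return [(query, text) for query in unique_queries(queries)]
```

The expanded string is what gets indexed, while the original passage is what gets shown to the user. Keeping the two separate avoids displaying generated queries in search results.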

Frequently Asked Questions

Q: What makes this model unique?

This model specializes in Portuguese-language document expansion and query generation. It supports both beam search and sampling, so the output can be tuned for either consistency or diversity. Its main uses are improving search relevance and generating training data for other NLP models.

Q: What are the recommended use cases?

The model is suited to enhancing search systems through document expansion, generating training data for embedding models, and improving information retrieval systems. The expanded documents can be indexed with Elasticsearch, OpenSearch, or Lucene.
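
As one possible way to get expanded documents into Elasticsearch or OpenSearch, the sketch below builds a request body for the `_bulk` endpoint, which expects newline-delimited JSON with an action line followed by a source line for each document. The index name and the field names `text` and `expanded_queries` are hypothetical. Storing the generated queries in a separate field, rather than appending them to the text, is an assumption made here for illustration.

```python
import json


def bulk_payload(index_name, docs):
    """Build an NDJSON body for the `_bulk` endpoint.

    `docs` is an iterable of (doc_id, text, queries) tuples, where
    `queries` is a list of generated query strings.
    """
    lines = []
    for doc_id, text, queries in docs:
        # Action line: index this document under the given ID.
        lines.append(json.dumps({"index": {"_index": index_name,
                                           "_id": doc_id}}))
        # Source line: original text plus the generated queries.
        lines.append(json.dumps({"text": text,
                                 "expanded_queries": " ".join(queries)},
                                ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"
```

The returned string can be sent as the body of a `POST` request to `/_bulk` with the content type `application/x-ndjson`. A search would then query both fields.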
