msmarco-portuguese-mt5-base-v1

Maintained By: doc2query


Property         Value
Model Type       doc2query / mT5-based
Training Data    MS MARCO Portuguese Dataset
Training Steps   66k (4 epochs on 500k pairs)
Model Hub        Hugging Face

What is msmarco-portuguese-mt5-base-v1?

This is a doc2query model built on the mT5 architecture for Portuguese-language document expansion and query generation. It was trained on the Portuguese MS MARCO dataset to generate queries that a given passage could answer. These queries can be appended to documents to improve lexical search, or paired with their source passages as training data for other NLP models.

Implementation Details

The model implements a sequence-to-sequence architecture based on mT5, fine-tuned for 66,000 training steps using 500,000 training pairs from MS MARCO. It processes input text up to 320 word pieces and generates outputs up to 64 word pieces in length.

  • Supports both beam search and top-k/top-p sampling for query generation
  • Implements input truncation to 320 word pieces
  • Generates multiple queries per input text
  • Optimized for Portuguese language processing
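
The two decoding modes listed above can be sketched with the Hugging Face transformers library. This is a minimal illustration, not the maintainers' reference code: the Hub ID is assumed from the maintainer and model name shown on this page, and the decoding values (5 queries, top_k=10, top_p=0.95, no_repeat_ngram_size=2) are illustrative choices. Only the 320 and 64 word-piece limits come from the description above.

```python
# Assumed Hub ID, inferred from the maintainer and model name on this page.
MODEL_NAME = "doc2query/msmarco-portuguese-mt5-base-v1"
MAX_INPUT_TOKENS = 320   # input truncation limit stated above
MAX_OUTPUT_TOKENS = 64   # output length limit stated above


def beam_search_kwargs(num_queries=5):
    """Deterministic decoding; tends to give more focused, less varied queries."""
    return {
        "max_length": MAX_OUTPUT_TOKENS,
        "num_beams": num_queries,            # must be >= num_return_sequences
        "no_repeat_ngram_size": 2,           # illustrative value
        "num_return_sequences": num_queries,
        "early_stopping": True,
    }


def sampling_kwargs(num_queries=5, top_k=10, top_p=0.95):
    """Top-k / top-p sampling; tends to give more varied queries."""
    return {
        "max_length": MAX_OUTPUT_TOKENS,
        "do_sample": True,
        "top_k": top_k,                      # illustrative value
        "top_p": top_p,                      # illustrative value
        "num_return_sequences": num_queries,
    }


def generate_queries(text, generation_kwargs, model_name=MODEL_NAME):
    """Generate queries for one passage (downloads the model on first use)."""
    # Imported lazily so the helpers above work without these packages installed.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Loaded per call to keep the sketch short; cache both objects in real use.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate the passage to the model's input limit.
    inputs = tokenizer(
        text, max_length=MAX_INPUT_TOKENS, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model.generate(**inputs, **generation_kwargs)
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
```

Calling `generate_queries(passage, sampling_kwargs())` or `generate_queries(passage, beam_search_kwargs())` would return a list of query strings, assuming `torch` and `transformers` (and probably `sentencepiece` for the tokenizer) are installed.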

Core Capabilities

  • Document Expansion: Generates 20-40 queries per paragraph for enhanced BM25 indexing
  • Query Generation: Creates diverse search queries from input text
  • Training Data Generation: Produces (query, text) pairs for embedding model training
  • Lexical Gap Bridging: Generates synonyms and alternative phrasings
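
Document expansion and training-pair generation both reduce to simple post-processing of the generated queries. The sketch below assumes the queries are already available as a list of strings. The function names are hypothetical, the example queries are hand-written stand-ins rather than real model output, and de-duplication is a design choice (keeping repeats would instead raise term frequency in BM25).

```python
def expand_document(passage, queries):
    """Append de-duplicated queries to the passage for lexical (BM25) indexing."""
    seen = set()
    unique = []
    for query in queries:
        cleaned = query.strip()
        key = cleaned.lower()  # case-insensitive duplicate check
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        return passage
    return passage + " " + " ".join(unique)


def make_training_pairs(passage, queries):
    """Build (query, text) pairs, skipping empty queries."""
    return [(query.strip(), passage) for query in queries if query.strip()]


# Hand-written stand-ins for generated queries (not real model output).
passage = "Lisboa é a capital de Portugal."
queries = ["qual é a capital de portugal", "onde fica lisboa"]

expanded = expand_document(passage, queries)
pairs = make_training_pairs(passage, queries)
```

Here `expanded` is the text to index in place of the original passage, and `pairs` can be fed to an embedding-model training loop.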

Frequently Asked Questions

Q: What makes this model unique?

This model specializes in Portuguese-language document expansion and query generation. It supports both beam search and top-k/top-p sampling, so the same passage can yield several distinct queries. Those queries can be used to improve search relevance or to generate training data for other NLP models.

Q: What are the recommended use cases?

The model is suited to document expansion for search systems, training data generation for embedding models, and information retrieval more generally. Expanded documents can be indexed with standard BM25 engines such as Elasticsearch, OpenSearch, or Lucene.
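
As a rough sketch of the indexing step, expanded passages can be serialized into the newline-delimited JSON body accepted by the `_bulk` endpoint of Elasticsearch and OpenSearch. The index name, field names, and helper function below are hypothetical, and the example query is a hand-written stand-in for model output.

```python
import json


def to_bulk_ndjson(docs, index_name="passages"):
    """Serialize (doc_id, passage, queries) tuples into a _bulk request body."""
    lines = []
    for doc_id, passage, queries in docs:
        # Action line: index this document under the given ID.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: keep the original text and store the expansion separately.
        source = {
            "text": passage,
            "expanded_text": " ".join([passage, *queries]),
        }
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


body = to_bulk_ndjson(
    [("1", "Lisboa é a capital de Portugal.", ["qual é a capital de portugal"])]
)
```

Keeping `text` and `expanded_text` as separate fields lets the search side match against the expansion while still displaying the original passage.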
