Published: Sep 27, 2024
Updated: Oct 18, 2024

Beyond Words: The AI That Sees and Guides You

Show and Guide: Instructional-Plan Grounded Vision and Language Model
By Diogo Glória-Silva, David Semedo, and João Magalhães

Summary

Imagine an AI assistant that not only understands your questions but also *sees* your progress and offers guidance, much like a helpful friend in real life. That’s the groundbreaking concept behind the “Show and Guide” research, which introduces a new multimodal AI model called MM-PlanLLM. Traditional AI excels at text, but real-world tasks like cooking or assembling furniture are inherently visual. This new model bridges the gap, using both text instructions (like a recipe) and visual cues (like a photo of your current progress) to provide dynamic, context-aware assistance.

For instance, if you’re following a recipe and unsure what to do next, you could simply snap a picture of your dish. MM-PlanLLM would analyze the image, determine which step you're on, and provide the next instruction, even if you're not following the recipe precisely. This is possible thanks to MM-PlanLLM’s unique ability to learn cross-modal representations—linking text and visual information within the context of a task. The model is trained in stages, starting with basic image-caption matching and then progressively learning more complex tasks like video moment retrieval (showing you the relevant part of an instructional video) and visually-informed step generation (telling you what to do next based on an image).

While current AI assistants mostly interact through text, this research paves the way for more intuitive, truly helpful AI companions that understand both what you say and what you see. However, challenges remain. MM-PlanLLM’s current context window is limited, and broader multimodal support, like answering complex visual questions, is still in its early stages. Nevertheless, this research offers an exciting glimpse into the future of AI—one where your AI assistant not only answers but also sees, guides, and collaborates with you in the real world.
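To make that interaction loop concrete, here is a minimal, hypothetical Python sketch of the "which step am I on, and what comes next?" behavior described above. It replaces MM-PlanLLM with a trivial word-overlap heuristic over an image caption, so the `PlanStep` structure, the `next_step_from_image` function, and the matching logic are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlanStep:
    index: int
    text: str

def next_step_from_image(plan: List[PlanStep], image_caption: str) -> PlanStep:
    """Toy stand-in for visually-informed step generation: pick the plan step
    whose wording best overlaps the image description, then return the step
    that follows it."""
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))

    current = max(plan, key=lambda step: overlap(step.text, image_caption))
    return plan[min(current.index + 1, len(plan) - 1)]

recipe = [
    PlanStep(0, "Dice the onions and carrots"),
    PlanStep(1, "Saute the diced vegetables in olive oil"),
    PlanStep(2, "Add stock and simmer for 20 minutes"),
]

# In the real system this caption would come from the user's photo of the dish.
print(next_step_from_image(recipe, "a cutting board with diced onions and carrots"))
```

A real model grounds the photo itself against the plan; the point here is only the shape of the exchange: plan in, visual evidence in, next step out.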
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.

Questions & Answers

How does MM-PlanLLM's cross-modal representation learning work?
MM-PlanLLM learns to connect text and visual information through a staged training process. The model first masters basic image-caption matching, then progresses to more complex tasks like video moment retrieval and visually-informed step generation. This staged approach allows the system to build increasingly sophisticated connections between visual and textual data. For example, when cooking, the model can match a photo of diced vegetables with the corresponding recipe step, then determine if the cutting technique is correct and what needs to be done next. This progressive learning enables the model to provide context-aware guidance across different types of tasks.
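As a rough sketch of that staged curriculum, the skeleton below simply walks through the three stages named above and hands each one to a placeholder training function. The stage names mirror the answer, while `train_on_task` and its bookkeeping are invented for illustration and stand in for real fine-tuning runs.

```python
# Curriculum-style training skeleton; stage names follow the staged approach
# described above, while train_on_task is a placeholder for real gradient updates.
STAGES = [
    ("image_caption_matching", "align images with their caption text"),
    ("video_moment_retrieval", "locate the video segment that matches a plan step"),
    ("visual_step_generation", "generate the next step from an image and the plan"),
]

def train_on_task(model_state: dict, task: str) -> dict:
    # Placeholder: a real run would fine-tune the model on this task's dataset.
    model_state.setdefault("completed_stages", []).append(task)
    return model_state

model_state: dict = {}
for task, description in STAGES:
    print(f"Training stage '{task}': {description}")
    model_state = train_on_task(model_state, task)

print("Stages completed in order:", model_state["completed_stages"])
```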
What are the main benefits of AI assistants that can process both visual and text inputs?
AI assistants that handle both visual and text inputs offer more intuitive and natural interactions. They can understand context better by 'seeing' what you're doing, similar to having a human helper nearby. These systems can provide more accurate guidance in practical tasks like cooking, DIY projects, or learning new skills. For instance, instead of just reading instructions, you can show the AI your progress and get specific feedback. This dual-input capability makes AI assistance more accessible to users who might struggle with text-only instructions and helps prevent mistakes by confirming visual progress.
How will multimodal AI change the future of personal assistance?
Multimodal AI is set to revolutionize personal assistance by making interactions more natural and context-aware. Instead of purely text-based exchanges, future AI assistants will be able to see, understand, and guide users through complex tasks in real-time. This technology could transform everything from home cooking to education, where AI can provide visual feedback and personalized guidance. For example, an AI could help you learn a new instrument by watching your technique, assist with home repairs by identifying tools and parts, or guide you through complex assembly processes with real-time visual verification.

PromptLayer Features

  1. Testing & Evaluation
The model's staged training approach and multimodal capabilities require comprehensive testing across different input types and scenarios.
Implementation Details
Create test suites with image-text pairs, implement batch testing for different stages of instruction generation, and establish evaluation metrics for visual-textual accuracy (a minimal batch-evaluation sketch follows this section).
Key Benefits
• Systematic validation of multimodal responses
• Quality assurance across different task types
• Performance tracking across model versions
Potential Improvements
• Add specialized metrics for visual-textual alignment
• Implement automated regression testing for model updates
• Develop cross-modal evaluation frameworks
Business Value
Efficiency Gains
Reduced time in validating model responses across different input types
Cost Savings
Early detection of performance issues prevents costly deployment errors
Quality Improvement
Ensures consistent performance across visual and textual domains
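One way to approach the implementation details above is a small batch harness over image-text test cases. The `run_batch` helper, the keyword-match metric, and the stand-in model below are illustrative assumptions for this sketch, not PromptLayer or MM-PlanLLM APIs.

```python
from typing import Callable, Dict, List

TestCase = Dict[str, str]  # keys: "image_caption", "prompt", "expected_keyword"

def run_batch(model_under_test: Callable[[str, str], str],
              cases: List[TestCase]) -> float:
    """Return the fraction of cases whose response contains the expected keyword."""
    hits = 0
    for case in cases:
        response = model_under_test(case["image_caption"], case["prompt"])
        if case["expected_keyword"].lower() in response.lower():
            hits += 1
    return hits / len(cases)

# Toy stand-in for the model under test; a real run would call the deployed assistant.
def fake_model(image_caption: str, prompt: str) -> str:
    return f"({prompt}) Given {image_caption}, next you should saute the vegetables."

cases: List[TestCase] = [
    {"image_caption": "diced onions on a cutting board",
     "prompt": "What should I do next?",
     "expected_keyword": "saute"},
    {"image_caption": "vegetables simmering in stock",
     "prompt": "What should I do next?",
     "expected_keyword": "season"},
]
print(f"keyword accuracy: {run_batch(fake_model, cases):.2f}")
```

A production harness would swap the keyword check for richer metrics (for example, visual-textual alignment scores) and track results across model versions.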
  2. Workflow Management
Complex multimodal processing requires orchestrated workflows from image input to final instruction generation.
Implementation Details
Design reusable templates for different task types, implement version tracking for multimodal prompts, and create structured pipelines for visual-textual processing (a pipeline sketch follows this section).
Key Benefits
• Streamlined handling of multimodal inputs
• Consistent processing across different use cases
• Traceable model behavior and outputs
Potential Improvements
• Add specialized handlers for different visual input types
• Implement parallel processing for multiple modalities
• Create adaptive workflow templates
Business Value
Efficiency Gains
Streamlined deployment and management of multimodal AI systems
Cost Savings
Reduced operational overhead through automated workflows
Quality Improvement
Consistent handling of complex multimodal interactions
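A structured visual-to-text pipeline like the one described above could be sketched as a simple chain of stages. The stage functions below (`caption_image`, `ground_to_plan`, `generate_instruction`) are hypothetical placeholders meant only to show the orchestration pattern, not a real PromptLayer or MM-PlanLLM workflow.

```python
from typing import Callable, Dict, List

Payload = Dict[str, str]

def caption_image(payload: Payload) -> Payload:
    # Placeholder for an image-understanding step.
    payload["caption"] = f"caption of {payload['image_path']}"
    return payload

def ground_to_plan(payload: Payload) -> Payload:
    # Placeholder for matching the caption against the task plan.
    payload["matched_step"] = "Step 2: saute the diced vegetables"
    return payload

def generate_instruction(payload: Payload) -> Payload:
    # Placeholder for producing the user-facing next instruction.
    payload["instruction"] = f"You are at '{payload['matched_step']}'. Next, add the stock."
    return payload

PIPELINE: List[Callable[[Payload], Payload]] = [
    caption_image, ground_to_plan, generate_instruction,
]

def run_pipeline(payload: Payload) -> Payload:
    for stage in PIPELINE:
        payload = stage(payload)
    return payload

print(run_pipeline({"image_path": "pan.jpg", "user_query": "What do I do next?"})["instruction"])
```

Each stage takes and returns the same payload dictionary, which keeps the chain easy to reorder, version, and test stage by stage.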
