Published: Sep 30, 2024
Updated: Sep 30, 2024

Unlocking How-To Videos: AI Plans Your Next Steps

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
By Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, Xitong Yang

Summary

Imagine an AI assistant that not only understands how far along you are in, say, assembling a bookshelf, but also suggests what to do next. Researchers at Meta AI and UNC Chapel Hill are bringing this closer to reality with VidAssist, an AI system that generates step-by-step plans from instructional videos. Their research tackles the challenge of anticipating the sequence of actions that leads to a goal, such as finishing a recipe or a DIY project, directly from visual demonstrations. Traditional approaches struggle to generalize to new tasks; VidAssist instead leverages large language models (LLMs) for a more flexible approach.

It works like this: the system first translates video or images of a task into text descriptions. Then it enters a 'propose, assess, search' loop. The LLM proposes possible next steps, a set of specialized functions assesses how well they align with the ultimate objective, and a search algorithm dynamically explores the most promising action sequences. In effect, the LLM acts as both a knowledge base and a judge of its own suggestions. This approach avoids overfitting to limited training data and can perform well even with few or no task-specific examples.

VidAssist achieved state-of-the-art results, outperforming fully supervised models in predicting action sequences. The research has far-reaching implications for AI assistance in everyday tasks: imagine an app that provides dynamic guidance during cooking, or one that guides a robot through building furniture. Challenges remain, including better long-term planning and more accurate visual recognition, but VidAssist is an exciting step toward intelligent assistants that can understand and guide our actions in the real world.

Questions & Answers

How does VidAssist's 'propose, assess, search' loop work to generate step-by-step plans?
VidAssist's loop is a three-stage process that combines an LLM with specialized assessment functions. First, the LLM proposes potential next steps based on the current state and the goal. Then, the assessment functions evaluate each proposal by checking how well it aligns with the final objective. Finally, a search algorithm explores and ranks the most promising action sequences. For example, in a furniture assembly task, the system might propose 'attach leg A to base,' assess whether that step moves the build closer to the finished piece, and then search for the most efficient sequence of subsequent steps. This approach enables dynamic planning that adapts to varied scenarios without requiring extensive task-specific training data.
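The three stages described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: a small beam search is swapped in as one plausible search strategy, and `propose_assess_search`, `toy_propose`, `toy_assess`, and `RECIPE` are hypothetical names, with simple functions standing in for the LLM calls.

```python
from typing import Callable, List, Tuple

Plan = List[str]


def propose_assess_search(
    goal: str,
    history: Plan,
    propose: Callable[[str, Plan], List[str]],
    assess: Callable[[str, Plan, str], float],
    horizon: int,
    beam_width: int = 2,
) -> Plan:
    """Return the highest-scoring sequence of up to `horizon` future steps."""
    # Each beam entry is (cumulative score, steps planned so far).
    beams: List[Tuple[float, Plan]] = [(0.0, [])]
    for _ in range(horizon):
        candidates: List[Tuple[float, Plan]] = []
        for score, plan in beams:
            context = history + plan
            for action in propose(goal, context):      # propose
                value = assess(goal, context, action)  # assess
                candidates.append((score + value, plan + [action]))
        if not candidates:
            break
        # search: keep only the most promising partial plans
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# Toy stand-ins for the LLM (assumption: a real system would prompt a model).
RECIPE = ["crack eggs", "whisk eggs", "heat pan", "pour eggs", "serve"]


def toy_propose(goal: str, done: Plan) -> List[str]:
    # Offer every step that has not been performed yet.
    return [step for step in RECIPE if step not in done]


def toy_assess(goal: str, done: Plan, action: str) -> float:
    # Score 1.0 if the action is the next one in the reference order.
    return 1.0 if RECIPE.index(action) == len(done) else 0.0


plan = propose_assess_search(
    "make scrambled eggs", ["crack eggs"], toy_propose, toy_assess, horizon=3
)
print(plan)
```

Keeping `propose` and `assess` as plain callables makes it easy to swap the toy functions for real model calls without touching the search logic.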
What are the practical benefits of AI-powered instructional assistants in everyday life?
AI-powered instructional assistants make complex tasks more manageable by providing real-time, personalized guidance. They can break down complicated procedures into easy-to-follow steps, adapt to your pace, and offer corrections when needed. For example, while cooking, an AI assistant could track your progress, suggest ingredient preparations at the right time, and adjust instructions based on your skill level. This technology can help in various scenarios, from DIY projects to learning new skills, making traditionally challenging tasks more accessible to everyone. The key advantage is the ability to receive dynamic, contextual help without needing human expertise present.
How is AI changing the way we learn new skills and follow instructions?
AI is revolutionizing skill acquisition by providing personalized, adaptive learning experiences. Instead of following static instructions, AI systems can understand your current progress, adjust the difficulty level, and offer customized guidance based on your needs. This makes learning more efficient and less frustrating, as the AI can identify common mistakes and provide preventive advice. From cooking to crafting to technical skills, AI assistants can break down complex processes into manageable steps, offer real-time feedback, and adapt their teaching style to match your learning pace, making skill development more accessible and effective.

PromptLayer Features

  1. Workflow Management
     VidAssist's 'propose, assess, search' loop maps directly to multi-step prompt orchestration needs.
Implementation Details
Create sequential prompt templates for video-to-text conversion, action proposal, and assessment stages with version tracking
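As a rough illustration of sequential, versioned templates, here is a minimal sketch in plain Python. It does not use the PromptLayer API; `PromptTemplate`, `TemplateRegistry`, `run_pipeline`, `fake_llm`, and the stage names are all hypothetical, and the fake model only echoes the prompt length.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: int
    text: str  # str.format-style placeholders, e.g. "{goal}"

    def render(self, **values: str) -> str:
        return self.text.format(**values)


@dataclass
class TemplateRegistry:
    versions: Dict[str, List[PromptTemplate]] = field(default_factory=dict)

    def publish(self, name: str, text: str) -> PromptTemplate:
        # Every publish appends a new version; older ones stay available.
        history = self.versions.setdefault(name, [])
        template = PromptTemplate(name, len(history) + 1, text)
        history.append(template)
        return template

    def latest(self, name: str) -> PromptTemplate:
        return self.versions[name][-1]


def run_pipeline(
    registry: TemplateRegistry,
    stage_names: List[str],
    llm: Callable[[str], str],
    inputs: Dict[str, str],
) -> Tuple[Dict[str, str], List[Tuple[str, int]]]:
    """Run stages in order; each output is available to later templates."""
    values = dict(inputs)
    trace: List[Tuple[str, int]] = []
    for name in stage_names:
        template = registry.latest(name)
        values[name] = llm(template.render(**values))
        trace.append((name, template.version))  # record which version ran
    return values, trace


def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"reply({len(prompt)} chars)"


registry = TemplateRegistry()
registry.publish("describe_video", "Describe the frames: {frames}")
registry.publish("propose_action", "Goal: {goal}. Seen: {describe_video}. Next step?")
registry.publish("propose_action", "Goal: {goal}. Progress: {describe_video}. Name one next step.")
registry.publish("assess_action", "Goal: {goal}. Does '{propose_action}' help? Answer yes or no.")

outputs, trace = run_pipeline(
    registry,
    ["describe_video", "propose_action", "assess_action"],
    fake_llm,
    {"goal": "assemble a bookshelf", "frames": "frame_001, frame_002"},
)
print(trace)
```

Recording the template version next to each stage in `trace` is what makes a run reproducible: the same inputs can be replayed against the exact versions that produced an earlier result.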
Key Benefits
• Reproducible multi-stage prompt sequences
• Versioned tracking of each stage's performance
• Reusable templates for different instruction types
Potential Improvements
• Add branching logic for different task types
• Implement feedback loops for continuous improvement
• Integrate visual processing checkpoints
Business Value
Efficiency Gains
30-40% faster deployment of multi-stage AI workflows
Cost Savings
Reduced development time through reusable templates
Quality Improvement
Better consistency across complex prompt chains
  2. Testing & Evaluation
     VidAssist's performance comparison against supervised models requires robust testing infrastructure.
Implementation Details
Set up batch testing environments with ground truth data and implement A/B testing for different prompt variations
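A batch comparison of two prompt variants against ground truth might look like the sketch below. Exact-match success rate and per-step accuracy are used here as illustrative metrics; whether they match the paper's evaluation protocol is an assumption, and the data, `success_rate`, `mean_accuracy`, and `compare_variants` are hypothetical.

```python
from typing import Dict, List, Sequence

Plan = Sequence[str]


def success_rate(preds: Sequence[Plan], golds: Sequence[Plan]) -> float:
    """Fraction of predicted plans that match the ground truth exactly, in order."""
    return sum(list(p) == list(g) for p, g in zip(preds, golds)) / len(golds)


def mean_accuracy(preds: Sequence[Plan], golds: Sequence[Plan]) -> float:
    """Fraction of individual steps that are correct at their position."""
    correct = total = 0
    for p, g in zip(preds, golds):
        total += len(g)
        correct += sum(a == b for a, b in zip(p, g))
    return correct / total


def compare_variants(
    variants: Dict[str, List[List[str]]], golds: List[List[str]]
) -> Dict[str, Dict[str, float]]:
    """Score every prompt variant on the same ground-truth batch."""
    return {
        name: {
            "success_rate": success_rate(preds, golds),
            "mean_accuracy": mean_accuracy(preds, golds),
        }
        for name, preds in variants.items()
    }


# Made-up ground truth and outputs from two prompt variants, A and B.
golds = [
    ["whisk eggs", "heat pan", "pour eggs"],
    ["sand board", "apply primer", "paint"],
]
variants = {
    "A": [
        ["whisk eggs", "heat pan", "pour eggs"],
        ["apply primer", "sand board", "paint"],  # first two steps swapped
    ],
    "B": [
        ["whisk eggs", "heat pan", "pour eggs"],
        ["sand board", "apply primer", "paint"],
    ],
}

report = compare_variants(variants, golds)
for name, scores in report.items():
    print(name, scores)
```

Scoring both variants on the same batch keeps the comparison fair, and saving `report` after each prompt change turns it into a simple regression check.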
Key Benefits
• Systematic comparison of prompt versions
• Quantitative performance metrics tracking
• Automated regression testing
Potential Improvements
• Add specialized metrics for visual tasks
• Implement cross-validation frameworks
• Enhance error analysis capabilities
Business Value
Efficiency Gains
50% faster prompt optimization cycles
Cost Savings
Reduced errors through automated testing
Quality Improvement
More reliable and consistent model outputs
