Published: Oct 29, 2024
Updated: Oct 29, 2024

Querying Text Like a Database

Efficient Learned Query Execution over Text and Tables [Technical Report]
By Matthias Urban, Carsten Binnig

Summary

Imagine querying unstructured text data with the same ease and efficiency as a structured database. That's the promise of ELEET, a novel query execution engine designed to handle text and tables seamlessly. Currently, analyzing data spread across tables and text files requires complex extraction pipelines and manual transformations.

ELEET changes this by introducing *learned multi-modal operators* (MMOps). These operators, powered by a specialized small language model (SLM), act like traditional database operators (joins, unions, etc.) but can directly process text. For example, a multi-modal join can link a patient table directly with their textual medical reports, allowing analysts to correlate patient age with diagnoses extracted *directly from the text*.

Why not just use large language models (LLMs)? While LLMs like GPT-4 can technically convert text to tables, their massive size makes them computationally expensive for query processing. ELEET's smaller, purpose-built SLM dramatically speeds up queries (up to 575x faster than LLM-based approaches in experiments) without sacrificing accuracy. This efficiency comes from a novel model architecture that avoids the costly autoregressive decoding used in LLMs, opting for a faster, single-pass extraction method. The model is pre-trained on a massive new dataset of aligned Wikipedia text and Wikidata tables, teaching it the core skills needed for table extraction.

While this research primarily focuses on online query execution, ELEET can also pre-compute extractions offline, like a materialized view. However, online execution offers advantages for constantly updated text collections and ad-hoc queries. This work opens exciting possibilities for querying unstructured data efficiently. Future research includes optimizing the query compiler and extending ELEET to handle other data modalities, such as images, bringing us closer to truly unified data analysis.
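To make the multi-modal join idea concrete, here is a minimal Python sketch. It is not ELEET's implementation: `extract_diagnosis` is a hypothetical stand-in (a regular expression) for the learned extraction that the SLM would perform, `multimodal_join` is an illustrative name, and the patient rows and report texts are invented for the example.

```python
import re

# Hypothetical structured table: one row per patient.
patients = [
    {"patient_id": 1, "age": 64},
    {"patient_id": 2, "age": 37},
]

# Hypothetical text collection: free-text reports keyed by patient id.
reports = {
    1: "Patient was admitted with chest pain. Diagnosis: angina.",
    2: "Routine visit. Diagnosis: asthma. Inhaler prescribed.",
}

def extract_diagnosis(text):
    """Stand-in for the learned extractor; a real MMOp would call the SLM here."""
    match = re.search(r"Diagnosis:\s*([A-Za-z ]+?)\.", text)
    return match.group(1) if match else None

def multimodal_join(table, texts, key, extractor, new_column):
    """Join each table row with a value extracted from its matching text."""
    joined = []
    for row in table:
        text = texts.get(row[key])
        if text is None:
            continue  # inner-join semantics: skip rows without a document
        joined.append({**row, new_column: extractor(text)})
    return joined

result = multimodal_join(
    patients, reports, "patient_id", extract_diagnosis, "diagnosis"
)
for row in result:
    print(row)
```

The output rows pair each patient's age with a diagnosis that never existed as a column, which is the kind of correlation the summary describes. In a real system the extractor would be a model call rather than a pattern match.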

Questions & Answers

How does ELEET's small language model (SLM) architecture differ from traditional LLMs in terms of text processing efficiency?
ELEET's SLM uses a single-pass extraction method instead of the autoregressive decoding used in traditional LLMs. The SLM powers specialized multi-modal operators (MMOps) that process text and tabular data together. In the reported experiments, this makes queries up to 575x faster than LLM-based approaches. For example, when analyzing medical records, ELEET can extract patient information from text reports and correlate it with structured database tables in a single efficient pass, rather than running multiple transformation steps through a large language model.
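One common way to extract values in a single pass is to label every input token at once and then read the values off as spans. The Python sketch below illustrates that general idea only; the summary does not say that ELEET works exactly this way. `tag_tokens` is a hypothetical stand-in (a lexicon lookup) for one encoder forward pass, and `LEXICON`, `FORWARD_PASSES`, and the sample sentence are invented for illustration.

```python
# Toy label lexicon standing in for what a trained model would predict.
LEXICON = {"chest": "symptom", "pain": "symptom", "angina": "diagnosis"}

FORWARD_PASSES = {"count": 0}  # counts simulated model calls

def tag_tokens(tokens):
    """Hypothetical stand-in for ONE forward pass: one label per input token."""
    FORWARD_PASSES["count"] += 1
    return [LEXICON.get(tok.lower(), "O") for tok in tokens]

def spans_from_tags(tokens, tags):
    """Group consecutive tokens that share a non-'O' label into spans."""
    spans = []
    current_label, current_tokens = None, []
    for tok, tag in zip(tokens, tags):
        if tag == current_label and tag != "O":
            current_tokens.append(tok)
            continue
        if current_label not in (None, "O"):
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = tag, [tok]
    if current_label not in (None, "O"):
        spans.append((current_label, " ".join(current_tokens)))
    return spans

text = "Admitted with chest pain, diagnosed with angina."
tokens = [t.strip(".,") for t in text.split()]
spans = spans_from_tags(tokens, tag_tokens(tokens))
print(spans, "forward passes:", FORWARD_PASSES["count"])
```

All values come out of a single simulated model call, regardless of how many values the text contains. An autoregressive decoder would instead need one decoding step per generated output token, which is where the cost difference described above comes from.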
What are the main benefits of unified text and database querying for businesses?
Unified text and database querying allows businesses to seamlessly analyze both structured and unstructured data without complex integration processes. This approach saves time and resources by eliminating the need for separate analysis pipelines, enables real-time insights from diverse data sources, and supports better decision-making through comprehensive data analysis. For instance, a retail company could simultaneously analyze customer feedback comments and sales data to identify product improvement opportunities, or a healthcare provider could efficiently correlate patient records with medical notes for better treatment planning.
How is AI transforming the way we handle unstructured data in everyday applications?
AI is revolutionizing unstructured data handling by making it more accessible and analyzable for everyday use. Modern AI systems can automatically convert informal text, emails, social media posts, and documents into structured, searchable information. This transformation enables better organization, faster search capabilities, and more intelligent data-driven decisions across various applications. For example, email systems can now automatically categorize messages, extract important dates and tasks, and create calendar events, while customer service systems can analyze customer feedback across multiple channels to identify trends and issues automatically.

PromptLayer Features

  1. Testing & Evaluation
ELEET's comparison of small language models vs large language models for query processing aligns with PromptLayer's testing capabilities for comparing different model approaches
Implementation Details
Set up A/B tests comparing different model sizes and architectures for text extraction tasks, track performance metrics, and analyze speed vs accuracy tradeoffs
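A speed vs accuracy comparison of this kind can be sketched in plain Python. The code below is an assumption-laden illustration and does not use any PromptLayer API: `extractor_a` and `extractor_b` are hypothetical placeholders for calls to two different models, and the labeled examples are invented.

```python
import time

# Invented evaluation set: (input text, expected extracted value).
examples = [
    ("Diagnosis: angina", "angina"),
    ("Diagnosis: asthma", "asthma"),
    ("No diagnosis recorded", None),
]

def extractor_a(text):
    """Placeholder for model A: reads the value after the 'Diagnosis:' marker."""
    if "Diagnosis:" not in text:
        return None
    return text.split("Diagnosis:")[1].strip()

def extractor_b(text):
    """Placeholder for model B: naively returns the last word."""
    return text.split()[-1]

def evaluate(extractor, labeled_examples):
    """Return accuracy and mean wall-clock latency per query for one extractor."""
    correct = 0
    start = time.perf_counter()
    for text, expected in labeled_examples:
        if extractor(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(labeled_examples),
        "seconds_per_query": elapsed / len(labeled_examples),
    }

report = {
    name: evaluate(fn, examples)
    for name, fn in {"extractor_a": extractor_a, "extractor_b": extractor_b}.items()
}
for name, metrics in report.items():
    print(name, metrics)
```

With real models behind the two functions, the same report would show whether a smaller model's latency advantage comes at any cost in extraction accuracy.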
Key Benefits
• Quantifiable performance comparisons between different models
• Data-driven decision making for model selection
• Automated regression testing across model versions
Potential Improvements
• Add specialized metrics for text extraction accuracy
• Implement cost-per-query tracking
• Develop automated testing pipelines for multi-modal operations
Business Value
Efficiency Gains
Reduce model evaluation time by 40-60% through automated testing
Cost Savings
Optimize model selection to reduce compute costs by identifying most efficient model size
Quality Improvement
Ensure consistent extraction quality across different data types and queries
  2. Workflow Management
ELEET's multi-modal operators for combining text and structured data parallel PromptLayer's workflow orchestration capabilities
Implementation Details
Create reusable templates for text extraction workflows, version control the extraction patterns, and chain multiple operations
Key Benefits
• Standardized text processing pipelines
• Reproducible query execution flows
• Version-controlled extraction patterns
Potential Improvements
• Add support for custom operator definitions
• Implement parallel processing capabilities
• Create visual workflow designer for complex queries
Business Value
Efficiency Gains
Reduce workflow development time by 50% through reusable templates
Cost Savings
Minimize redundant processing through optimized workflows
Quality Improvement
Ensure consistent processing across different data sources and queries
