HTML-Pruner-Llama-1B
| Property | Value |
|---|---|
| Parameter Count | 1.24B |
| License | Apache 2.0 |
| Paper | HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems |
| Base Model | meta-llama/Llama-3.2-1B |
What is HTML-Pruner-Llama-1B?
HTML-Pruner-Llama-1B is a specialized language model designed to enhance RAG (Retrieval-Augmented Generation) systems by processing retrieved knowledge as HTML rather than plain text. This 1.24B-parameter model implements a two-step HTML pruning approach that reduces content length for efficient downstream generation while preserving the semantic structure of the original markup.
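For orientation, here is a minimal loading sketch that treats the pruner as a standard causal LM with the `transformers` library. It assumes the checkpoint is published under the Hugging Face repo id `zstanjj/HTML-Pruner-Llama-1B`; the full two-step pruning pipeline lives in the paper's accompanying code.

```python
# Minimal loading sketch (assumes the Hugging Face repo id
# "zstanjj/HTML-Pruner-Llama-1B"; adjust if the checkpoint lives elsewhere).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zstanjj/HTML-Pruner-Llama-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 weights noted below
)
model.eval()
```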
Implementation Details
The model implements the second stage of a two-step, block-tree-based HTML pruning strategy: an embedding model first scores blocks of the block tree, and this path generative model then refines the selection. Pruning is preceded by a lossless HTML cleaning step that removes redundant markup while preserving semantic information (a simplified sketch of the cleaning step follows the list below).
- Two-Step Block-Tree-Based HTML Pruning architecture
- Lossless HTML Cleaning capability
- Built on the Llama architecture with BF16 weights
- Optimized for context windows up to 60K tokens
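As a rough illustration of the cleaning step, the sketch below strips scripts, styles, comments, and tag attributes from raw HTML while keeping tags and their text. This is a simplified stand-in under our own assumptions, not the paper's implementation (which additionally merges redundant tag structures); the `clean_html` helper name is ours.

```python
# Simplified illustration of lossless HTML cleaning: remove content that
# carries no semantic text, keep the tag structure and visible text.
from bs4 import BeautifulSoup, Comment

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop non-content elements entirely.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Drop HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Drop attributes (class, style, id, ...) but keep the tags themselves.
    for tag in soup.find_all(True):
        tag.attrs = {}
    return str(soup)
```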
Core Capabilities
- Efficient HTML content pruning while maintaining semantic meaning
- Block-tree structure analysis and optimization
- Integration with lexical and embedding-based scorers (BM25, BGE, E5-Mistral); see the scoring sketch after this list
- Competitive performance across multiple QA datasets (ASQA, HotpotQA, NQ, TriviaQA)
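To make the first pruning step concrete, here is a hypothetical block-scoring sketch using BM25 via the `rank_bm25` package; in the actual pipeline, blocks come from the block tree, and dense embedders such as BGE or E5-Mistral can fill the same scoring role. The `rank_blocks` helper is illustrative, not part of the released code.

```python
# Hypothetical sketch of step one: rank text blocks against the question
# with BM25; low-ranked blocks are candidates for pruning.
from rank_bm25 import BM25Okapi

def rank_blocks(question: str, blocks: list[str]) -> list[int]:
    tokenized_blocks = [b.lower().split() for b in blocks]
    bm25 = BM25Okapi(tokenized_blocks)
    scores = bm25.get_scores(question.lower().split())
    # Block indices sorted by relevance, highest score first.
    return sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)

ranking = rank_blocks(
    "When was the Eiffel Tower built?",
    ["<p>The Eiffel Tower was completed in 1889.</p>",
     "<p>Site navigation and footer links.</p>"],
)
print(ranking)  # e.g. [0, 1]: the relevant block ranks first
```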
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its specialized HTML processing: its two-step pruning approach lets RAG systems consume HTML directly and, per the HtmlRAG paper, outperforms plain-text post-retrieval baselines across multiple QA benchmarks while keeping HTML structure intact.
Q: What are the recommended use cases?
The model is ideal for RAG systems requiring HTML document processing, question-answering systems, and applications needing efficient HTML content summarization while preserving semantic structure. It's particularly effective for scenarios where context length optimization is crucial.