ReaderLM-v2

jinaai

ReaderLM-v2: A 1.5B parameter LLM specialized in HTML-to-markdown/JSON conversion with 512K context window, supporting 29 languages and achieving 0.84 ROUGE-L score.

Property	Value
Parameter Count	1.54B
Model Type	Autoregressive, decoder-only transformer
Context Window	512K tokens
Languages Supported	29 languages
Model URL	https://huggingface.co/jinaai/ReaderLM-v2

What is ReaderLM-v2?

ReaderLM-v2 is an advanced language model specifically designed for converting HTML content into beautifully formatted markdown or structured JSON. Built on Qwen2.5-1.5B-Instruction, this model represents a significant advancement in document processing and content transformation, featuring superior accuracy and enhanced longer context handling capabilities.

Implementation Details

The model utilizes a sophisticated architecture with 28 layers, 12 query heads, and 2 KV heads, with a hidden size of 1536 and intermediate size of 8960. It employs a unique training pipeline including long-context pretraining, supervised fine-tuning, and direct preference optimization, resulting in state-of-the-art performance in HTML content transformation tasks.

Architecture: 1.54B parameters with 28 layers and advanced attention mechanism
Context Processing: Handles up to 512K tokens for input and output combined
Multilingual Support: Comprehensive coverage of 29 languages
Performance Metrics: Achieves 0.84 ROUGE-L score and 0.82 Jaro-Winkler Similarity

Core Capabilities

Advanced Markdown Generation with support for complex elements like code fences and LaTeX equations
Direct HTML-to-JSON conversion using predefined schemas
Enhanced stability for long-form content processing
Comprehensive multilingual support across major global languages
High-accuracy content extraction and transformation

Frequently Asked Questions

Q: What makes this model unique?

ReaderLM-v2 stands out for its specialized focus on HTML processing and transformation, combined with its impressive context window of 512K tokens and support for 29 languages. Its ability to generate both markdown and JSON outputs with high accuracy makes it particularly valuable for content processing pipelines.

Q: What are the recommended use cases?

The model excels in content extraction, web scraping, document transformation, and structured data extraction. It's particularly suitable for applications requiring accurate HTML parsing, content reformatting, and multilingual document processing.