We present ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. Our model processes documents of up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy, making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extractions; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20\% on carefully curated benchmarks, particularly excelling on documents exceeding 100K tokens, while maintaining significantly lower computational requirements.