This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
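To make the scalability claim concrete, the following back-of-the-envelope sketch turns the reported peak rate (up to 20 pages per second on one 8-GPU node) into wall-clock estimates. The function names, the one-billion-page corpus, and the 30-day deadline are illustrative assumptions, not figures from the report. The sketch also assumes that throughput scales linearly with the number of nodes and that the peak rate is sustained, so the results should be read as optimistic bounds.

```python
import math

# Figures taken from the report: peak throughput on a single 8-GPU node.
PEAK_PAGES_PER_SECOND = 20.0
GPUS_PER_NODE = 8
SECONDS_PER_DAY = 86_400


def per_gpu_rate(pages_per_second: float = PEAK_PAGES_PER_SECOND,
                 gpus: int = GPUS_PER_NODE) -> float:
    """Average pages per second contributed by each GPU."""
    return pages_per_second / gpus


def days_on_one_node(total_pages: int,
                     pages_per_second: float = PEAK_PAGES_PER_SECOND) -> float:
    """Days one node needs for the corpus at a sustained rate."""
    return total_pages / pages_per_second / SECONDS_PER_DAY


def nodes_for_deadline(total_pages: int, deadline_days: float,
                       pages_per_second: float = PEAK_PAGES_PER_SECOND) -> int:
    """Nodes needed to finish by the deadline (assumes linear scaling)."""
    return math.ceil(days_on_one_node(total_pages, pages_per_second) / deadline_days)


if __name__ == "__main__":
    corpus = 1_000_000_000  # hypothetical corpus size, for illustration only
    print(f"per-GPU rate: {per_gpu_rate():.2f} pages/s")
    print(f"one node: {days_on_one_node(corpus):.1f} days")
    print(f"nodes for a 30-day run: {nodes_for_deadline(corpus, 30)}")
```

Under these assumptions, a single node would need roughly 579 days for one billion pages, and about 20 such nodes would bring the run under 30 days. Real deployments would see lower sustained rates because of I/O, scheduling overhead, and variation in page complexity.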