Document content analysis has long been a crucial research area in computer vision. Despite significant advances in OCR, layout detection, and formula recognition, existing open-source solutions struggle to deliver consistently high-quality content extraction because of the diversity of document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU uses the PDF-Extract-Kit models to extract content from diverse documents and applies finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly improving the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.