Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.