Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained transformer models. While the opportunity for scientists outside of NLP to evaluate and apply such systems to their own domains has never been clearer, these models are difficult to compare: they accept different input formats, are often black-box and give little insight into processing failures, and rarely handle PDF documents, the most common format of scientific publication. In this work, we present Collage, a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. Collage allows the use and evaluation of any HuggingFace token classifier, several LLMs, and multiple other task-specific models out of the box, and provides extensible software interfaces to accelerate experimentation with new models. Further, we enable both developers and users of NLP-based tools to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.
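The abstract's claim that heterogeneous models (HuggingFace token classifiers, LLMs, task-specific extractors) can be swapped behind one extensible interface can be sketched as a minimal plugin abstraction. This is a hypothetical illustration, not Collage's actual API; the class names, the `predict` signature, and the toy `KeywordTagger` are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class TokenClassifier(ABC):
    """Common interface a tool like Collage might expose so that
    HuggingFace classifiers, LLM wrappers, and task-specific models
    become interchangeable. (Hypothetical sketch, not the real API.)"""

    @abstractmethod
    def predict(self, tokens: list[str]) -> list[str]:
        """Return one label per input token."""


class KeywordTagger(TokenClassifier):
    """Toy rule-based stand-in for a real model, mapping known
    keywords to entity labels and everything else to 'O'."""

    def __init__(self, vocab: dict[str, str]):
        self.vocab = vocab

    def predict(self, tokens: list[str]) -> list[str]:
        return [self.vocab.get(t.lower(), "O") for t in tokens]


# A materials-science-flavored toy vocabulary, per the paper's use case.
tagger = KeywordTagger({"graphene": "B-MATERIAL", "anneal": "B-PROCESS"})
print(tagger.predict(["We", "anneal", "graphene", "films"]))
# → ['O', 'B-PROCESS', 'B-MATERIAL', 'O']
```

Any model conforming to this interface could then be evaluated and visualized through the same pipeline, which is the kind of decoupling the abstract describes.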