Deep Learning (DL) techniques are increasingly applied in scientific studies across various domains to address complex research questions. However, the methodological details of these DL models are often hidden in unstructured text. As a result, critical information about how these models are designed, trained, and evaluated is difficult to access and comprehend. To address this issue, in this work, we use five different open-source Large Language Models (LLMs): Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B, in combination with a Retrieval-Augmented Generation (RAG) approach, to automatically extract and process DL methodological details from scientific publications. We built a voting classifier from the outputs of the five LLMs to accurately report DL methodological information. We tested our approach on biodiversity publications, building upon our previous research. To validate our pipeline, we employed two datasets of DL-related biodiversity publications: a curated set of 100 publications from our prior work and a set of 364 publications from the Ecological Informatics journal. Our results demonstrate that the multi-LLM, RAG-assisted pipeline enhances the retrieval of DL methodological information, achieving an accuracy of 69.5% (417 out of 600 comparisons) based solely on the textual content of publications. This performance was assessed against human annotators who had access to code, figures, tables, and other supplementary information. Although demonstrated on biodiversity publications, our methodology is not limited to this field; it can be applied across other scientific domains where detailed methodological reporting is essential for advancing knowledge and ensuring reproducibility. This study presents a scalable and reliable approach for automating information extraction, facilitating better reproducibility and knowledge transfer across studies.
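The voting classifier mentioned above can be illustrated with a minimal sketch: each of the five LLMs answers a methodological question about a publication, the answers are normalized, and the majority answer is reported. The function name, the normalization step, and the example outputs below are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the per-model outputs.

    `answers` is a list of answer strings, one per LLM; answers are
    lowercased and stripped so trivially different spellings agree.
    Empty or missing answers are ignored.
    """
    counts = Counter(a.strip().lower() for a in answers if a and a.strip())
    if not counts:
        return None  # no model produced a usable answer
    return counts.most_common(1)[0][0]

# Hypothetical outputs from the five models for one extraction question,
# e.g. "Which DL framework does the publication report?"
outputs = ["TensorFlow", "tensorflow", "PyTorch", "TensorFlow", "tensorflow"]
print(majority_vote(outputs))  # -> tensorflow
```

In practice, ties and abstentions would also need a policy (e.g. flagging the question for manual review), which this sketch omits.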