As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this problem: an automated pipeline that estimates the potential information gain of a text collection without requiring model training or fine-tuning. Our method generates multiple-choice questions (MCQs) from the texts and measures an LLM's performance both with and without access to the source material. The performance gap between these two conditions serves as a proxy for the collection's information potential. We validate our approach on five strategically selected datasets: EPFL PhD manuscripts, a private collection of Venetian historical records, two sets of Wikipedia articles on related topics, and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
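For concreteness, the pipeline described above can be sketched in a few lines of Python. This is an illustrative stub, not the paper's implementation: the helpers `generate_mcqs` and `answer` are hypothetical placeholders for LLM API calls, and the scoring logic simply realizes the open-book minus closed-book accuracy gap the abstract defines.

```python
# Minimal sketch of the information-gain pipeline (illustrative only).
# `generate_mcqs` and `answer` are hypothetical stubs standing in for
# LLM API calls; they are not part of any released codebase.

from dataclasses import dataclass


@dataclass
class MCQ:
    question: str
    choices: list[str]
    answer_idx: int  # index of the correct choice


def generate_mcqs(text: str, n: int) -> list[MCQ]:
    """Prompt an LLM to write n MCQs grounded in `text` (stub)."""
    raise NotImplementedError("replace with an LLM call")


def answer(question: str, choices: list[str], context: str | None = None) -> int:
    """Ask the LLM to pick a choice, optionally with the source text as context (stub)."""
    raise NotImplementedError("replace with an LLM call")


def information_gain(text: str, n_questions: int = 50) -> float:
    """Open-book accuracy minus closed-book accuracy: a proxy for novel information.

    A large gap suggests the collection contains information the model
    lacks; a near-zero gap suggests the content is already known.
    """
    mcqs = generate_mcqs(text, n_questions)
    closed = sum(answer(q.question, q.choices) == q.answer_idx for q in mcqs)
    opened = sum(answer(q.question, q.choices, context=text) == q.answer_idx for q in mcqs)
    return (opened - closed) / len(mcqs)
```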