Large language models (LLMs) have demonstrated remarkable capabilities in text analysis tasks, yet their evaluation on complex, real-world applications remains challenging. We define a set of tasks, Multi-Insight Multi-Document Extraction (MIMDE) tasks, which involve extracting an optimal set of insights from a document corpus and mapping these insights back to their source documents. This task is fundamental to many practical applications, from analyzing survey responses to processing medical records, where identifying and tracing key insights across documents is crucial. We develop an evaluation framework for MIMDE and introduce a novel set of complementary human and synthetic datasets to examine the potential of synthetic data for LLM evaluation. After establishing optimal metrics for comparing extracted insights, we benchmark 20 state-of-the-art LLMs on both datasets. Our analysis reveals a strong correlation (0.71) between the insight-extraction performance of LLMs on the two datasets, but the synthetic data fails to capture the complexity of document-level analysis. These findings offer crucial guidance for the use of synthetic data in evaluating text analysis systems, highlighting both its potential and limitations.