Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.