We introduce CUS-QA, a benchmark for evaluating open-ended regional question answering that encompasses both textual and visual modalities, and we provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and questions that require visual understanding. We evaluate state-of-the-art LLMs through prompting and collect human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only slightly above 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics correlate strongly with human judgment, while traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers.