We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.