The advancement of large language models (LLMs) has enhanced their ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet their effectiveness often diminishes in low-resource languages like Chinese, and evaluations biased by data leakage compound the problem, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate evaluation bias, we release only half of the dataset publicly and keep the remainder private. We also introduce diversified instructions to minimize score variance, bringing the total to 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work aims to uncover the current limitations of LLMs in handling Chinese tasks and, with the released data and benchmark (https://yizhilll.github.io/CIF-Bench/), to push towards the development of more culturally informed and linguistically diverse models.