The advancement of large language models (LLMs) has enhanced their ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet their effectiveness often diminishes in low-resource languages like Chinese, and evaluations biased by data leakage cast further doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly and keep the remainder private. We also introduce diversified instructions to minimize score variance, bringing the total to 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future research on LLM generalizability, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models.