CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

Yizhi Li, Ge Zhang, Xingwei Qu, Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang, Chenghua Lin, Jie Fu

From arXiv. Camera-ready version for ACL 2024. Project page: https://yizhilll.github.io/CIF-Bench/

The advancement of large language models (LLMs) has enhanced their ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction following. Yet their effectiveness often diminishes in low-resource languages like Chinese, a problem exacerbated by biased evaluations resulting from data leakage, which casts doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly and keep the remainder private. We also introduce diversified instructions to minimize score variance, for a total of 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future research on LLM generalizability, pushing toward more adaptable, culturally informed, and linguistically diverse models.

