Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

The performance and usability of Large Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than from problems with cohesion or word choice. The full dataset, rubric, and code will be made available upon acceptance.
