The performance and usability of Large Language Models (LLMs) are driving their adoption in explanation generation tasks. However, despite this widespread use, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good explanations from bad ones. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written by both humans and six open- and closed-source LLMs and subsequently quality-annotated using the rubric. The CUBE dataset covers two reasoning tasks and two language tasks, providing the diversity needed to effectively test our proposed rubric. Using Rubrik, we find that explanation quality is influenced by both the task and its perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than from poor cohesion or word choice. The full dataset, rubric, and code will be made available upon acceptance.