Text editing is the crucial task of modifying text to better align with user intent. However, existing text editing benchmark datasets contain only coarse-grained instructions and lack explainability, resulting in outputs that deviate from the intended changes outlined in the gold reference. To comprehensively investigate the text editing capabilities of large language models (LLMs), this paper introduces XATU, the first benchmark specifically designed for fine-grained instruction-based explainable text editing. XATU covers fine-grained text editing tasks of varying difficulty (simplification, grammar checking, fact-checking, etc.), incorporating lexical, syntactic, semantic, and knowledge-intensive edit aspects. To enhance interpretability, we combine LLM-based annotation and human annotation, resulting in a benchmark that includes fine-grained instructions and gold-standard edit explanations. By evaluating existing LLMs against our benchmark, we demonstrate the effectiveness of instruction tuning and the impact of underlying architecture across various editing tasks. Furthermore, extensive experimentation reveals the significant role of explanations in fine-tuning language models for text editing tasks. To support reproduction and facilitate future research, the benchmark will be open-sourced at~\url{https://github.com/megagonlabs/xatu}.