In this paper, we propose a novel, automatic, tree-based evaluation metric for LLM-generated step-by-step assembly instructions that reflects the spatiotemporal aspects of construction more accurately than traditional metrics such as BLEU and BERT similarity scores. We apply the proposed metric to the domain of sewing instructions and show that it correlates better than these traditional metrics with both manually annotated error counts and human quality ratings, demonstrating its superiority for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against artificially constructed counterfactual examples designed specifically to confound metrics that rely on textual similarity.