Evaluating Text Style Transfer (TST) is complex because the task is multifaceted. The quality of generated text is measured along several challenging dimensions: style transfer accuracy, content preservation, and overall fluency. While human evaluation is considered the gold standard in TST assessment, it is costly and often hard to reproduce. Automated metrics are therefore prevalent in this domain. Nevertheless, it remains unclear whether these automated metrics correlate with human evaluations. Recent advances in Large Language Models (LLMs) have shown that they can match and even exceed average human performance across diverse, unseen tasks. This suggests that LLMs could be a feasible alternative to human evaluation and to other automated metrics in TST evaluation. We compare the results of different LLMs in TST evaluation using multiple input prompts. Our findings show a strong correlation between LLM prompting (even zero-shot) and human evaluation, and LLMs often outperform traditional automated metrics. Furthermore, we introduce the concept of prompt ensembling and demonstrate its ability to enhance the robustness of TST evaluation. This research contributes to the ongoing evaluation of LLMs in diverse tasks, offering insights into successful outcomes and areas of limitation.