Text Style Transfer Evaluation Using Large Language Models

Evaluating Text Style Transfer (TST) is a complex task because output quality has several dimensions. Generated text is judged on style transfer accuracy, content preservation, and overall fluency, each of which is hard to measure. Human evaluation is considered the gold standard in TST assessment, but it is costly and often hard to reproduce, so automated metrics are prevalent in this domain. It nevertheless remains unclear whether these automated metrics correlate with human evaluations. Recent advances in Large Language Models (LLMs) have shown that they can match and even exceed average human performance across diverse, unseen tasks. This suggests that LLMs could be a feasible alternative to both human evaluation and other automated metrics in TST evaluation. We compare the results of different LLMs on TST evaluation using multiple input prompts. Our findings highlight a strong correlation between LLM prompting (even zero-shot) and human evaluation, and show that LLMs often outperform traditional automated metrics. Furthermore, we introduce the concept of prompt ensembling and demonstrate its ability to enhance the robustness of TST evaluation. This research contributes to the ongoing evaluation of LLMs in diverse tasks, offering insights into successful outcomes and areas of limitation.

