Low-Resource Authorship Style Transfer: Can Non-Famous Authors Be Imitated?

Authorship style transfer involves altering text to match the style of a target author whilst preserving the original meaning. Existing unsupervised approaches like STRAP have largely focused on style transfer to target authors with many examples of their writing style in books, speeches, or other published works. This high-resource training data requirement (often greater than 100,000 words) makes these approaches primarily useful for style transfer to published authors, politicians, or other well-known figures and authorship styles, while style transfer to non-famous authors has not been well studied. We introduce the \textit{low-resource authorship style transfer} task, a more challenging class of authorship style transfer where only a limited amount of text in the target author's style may exist. In our experiments, we specifically choose source and target authors from Reddit and transfer the style of their Reddit posts, limiting ourselves to just 16 posts (on average ~500 words) in the target author's style. Style transfer accuracy is typically measured by how often a classifier or human judge classifies an output as written by the target author. Recent authorship representation models excel at authorship identification even with just a few writing samples, making automatic evaluation of this task possible for the first time through evaluation metrics we propose. Our results establish an in-context learning technique we develop as the strongest baseline, though we find that current approaches do not yet achieve mastery of this challenging task. We release our data and implementations to encourage further investigation.
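As a rough illustration of the classifier-style accuracy described above, the following minimal sketch counts how often a transferred output sits closer to the target author than to the source author in an embedding space. It is not the set of metrics proposed in this work: the function names (`mean_vector`, `cosine`, `target_attribution_rate`) are hypothetical, the vectors are toy values, and the embeddings are assumed to come from some authorship representation model that is not shown here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a list of equal-length embeddings into one author-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def target_attribution_rate(
    output_embeddings: List[Vector],
    source_embeddings: List[Vector],
    target_embeddings: List[Vector],
) -> float:
    """Fraction of outputs that are more similar to the target author's
    mean embedding than to the source author's mean embedding.

    Assumption: all embeddings were produced by the same (hypothetical)
    authorship representation model.
    """
    source_center = mean_vector(source_embeddings)
    target_center = mean_vector(target_embeddings)
    hits = sum(
        1
        for vec in output_embeddings
        if cosine(vec, target_center) > cosine(vec, source_center)
    )
    return hits / len(output_embeddings)


# Toy usage with made-up 2-dimensional embeddings.
toy_outputs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_source = [[0.0, 1.0], [0.0, 1.0]]
toy_target = [[1.0, 0.0], [1.0, 0.0]]
toy_rate = target_attribution_rate(toy_outputs, toy_source, toy_target)
```

In the low-resource setting, `target_embeddings` would hold only the 16 available posts by the target author, which is why the sketch averages them into a single author-level vector before comparing.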
