In this paper, we introduce an authorship attribution method called Authorial Language Models (ALMs), which identifies the most likely author of a questioned document from the perplexity of that document under a set of causal language models, each fine-tuned on the writings of one candidate author. We benchmarked ALMs against state-of-the-art systems using the CCAT50 and Blogs50 datasets. We find that ALMs achieves a macro-average accuracy of 83.6% on Blogs50, outperforming all other methods, and 74.9% on CCAT50, matching the performance of the best method. To assess the performance of ALMs on shorter texts, we also conducted text ablation testing. We find that, to reach a macro-average accuracy of 70%, ALMs needs 40 tokens on Blogs50 and 400 tokens on CCAT50, while to reach 60%, ALMs requires 20 tokens on Blogs50 and 70 tokens on CCAT50.
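The decision rule described above (score the questioned document under one model per candidate author, then pick the author whose model gives the lowest perplexity) can be illustrated with a minimal sketch. This is not the ALMs implementation: a toy add-one-smoothed unigram model stands in for each fine-tuned causal language model, whitespace splitting stands in for the tokenizer, and the names `UnigramLM`, `perplexity`, `attribute`, and the two-author corpus are hypothetical and chosen only for illustration. Perplexity is taken here as the exponential of the mean negative log-likelihood per token.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model.

    Assumption: this is only a stand-in for a causal language model
    fine-tuned on one candidate author's writings.
    """

    def __init__(self, training_text, vocab):
        self.counts = Counter(training_text.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 slot shared by unseen tokens

    def log_prob(self, token):
        # Add-one smoothing keeps unseen tokens from getting zero probability.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))


def perplexity(model, text):
    """Exponential of the mean negative log-likelihood per token."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty text")
    nll = -sum(model.log_prob(t) for t in tokens)
    return math.exp(nll / len(tokens))


def attribute(questioned, models):
    """Return (predicted_author, per-author perplexities)."""
    scores = {author: perplexity(m, questioned) for author, m in models.items()}
    # The predicted author is the one whose model is least "surprised".
    return min(scores, key=scores.get), scores


# Hypothetical two-author corpus, for illustration only.
corpus = {
    "author_a": "the market rallied as traders priced in lower interest rates",
    "author_b": "my cat slept on the sofa all day and i baked some bread",
}
vocab = {w for text in corpus.values() for w in text.lower().split()}
models = {author: UnigramLM(text, vocab) for author, text in corpus.items()}

predicted, scores = attribute("traders expect interest rates to fall", models)
print(predicted, scores)
```

In this toy run the questioned text shares vocabulary only with `author_a`, so that author's model assigns it the lower perplexity and is selected. Swapping `UnigramLM` for per-author fine-tuned causal language models would leave the `attribute` step unchanged, under the assumption that each model exposes a per-token log-likelihood.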