In this paper, we introduce an authorship attribution method called Authorial Language Models (ALMs), which identifies the most likely author of a questioned document based on the perplexity of that document calculated for a set of causal language models fine-tuned on the writings of a set of candidate authors. We benchmarked ALMs against state-of-the-art systems using the CCAT50 and Blogs50 datasets. We found that ALMs achieves a macro-average accuracy of 83.6% on Blogs50, outperforming all other methods, and 74.9% on CCAT50, matching the performance of the best method. To assess the performance of ALMs on shorter texts, we also conducted text ablation testing. We found that to reach a macro-average accuracy of 70%, ALMs needs 40 tokens on Blogs50 and 400 tokens on CCAT50, while to reach 60%, ALMs needs 20 tokens on Blogs50 and 70 tokens on CCAT50.
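The perplexity-based decision rule can be illustrated with a minimal sketch. It assumes the standard definition of perplexity (the exponential of the negative mean token log-probability), one fine-tuned model per candidate author, and that the lowest perplexity indicates the most likely author; the abstract does not spell out these details. The function names `perplexity` and `attribute` and the author labels are hypothetical, and the per-token log-probabilities are made-up stand-ins for what fine-tuned causal language models would assign to the questioned document.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def attribute(log_probs_by_author):
    """Score the questioned document under each author's model.

    `log_probs_by_author` maps an author label to the per-token
    log-probabilities that author's model assigns to the document.
    Returns (predicted_author, {author: perplexity}); the author whose
    model yields the lowest perplexity is predicted (an assumption here).
    """
    scores = {
        author: perplexity(log_probs)
        for author, log_probs in log_probs_by_author.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical values: a two-token document scored by two authors' models.
example = {
    "author_a": [math.log(0.5), math.log(0.25)],
    "author_b": [math.log(0.1), math.log(0.1)],
}
predicted, scores = attribute(example)
print(predicted, scores)
```

In practice the log-probabilities would come from running each fine-tuned model over the tokenized questioned document; for text ablation, the same scoring would be applied to a truncated token sequence.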