This paper explores new methods for locating the sources used to write a text by fine-tuning a variety of language models to rerank candidate sources. After retrieving candidate sources using a baseline BM25 retrieval model, we test a variety of reranking methods to see how effective they are at the task of source attribution. We conduct experiments on two datasets, English Wikipedia and medieval Arabic historical writing, and employ a variety of retrieval-based and generation-based reranking models. In particular, we seek to understand how the degree of supervision required affects the performance of various reranking models. We find that semisupervised methods can be nearly as effective as fully supervised methods while avoiding potentially costly span-level annotation of the target and source documents.
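The retrieve-then-rerank setup described above can be illustrated with a minimal sketch. It is not the paper's implementation: the whitespace tokenizer, the BM25 parameters `k1` and `b`, the function names, and the toy documents are all assumptions made for illustration, and the `overlap_reranker` function is only a stand-in for a fine-tuned reranking model.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in a non-empty list against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def overlap_reranker(query, doc):
    # Hypothetical placeholder for a fine-tuned reranker: Jaccard token overlap.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query, docs, rerank_fn, top_k=2):
    """Take the top_k BM25 candidates, then reorder them with rerank_fn."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)


# Toy data, invented for this sketch.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
ranking = retrieve_then_rerank("cat on the mat", docs, overlap_reranker)
```

Here `ranking` holds the indices of the retained candidates in reranked order. In a real system the second stage would call a trained model in place of `overlap_reranker`, and the amount of supervision used to train that model is the variable the paper studies.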