Technology Assisted Review (TAR), which aims to reduce the effort required to screen collections of documents for relevance, is used to develop systematic reviews of medical evidence and to identify documents that must be disclosed in response to legal proceedings. Stopping methods are algorithms that determine when to stop screening documents during the TAR process, helping to ensure that workload is minimised while still achieving a high level of recall. This paper proposes a novel stopping method based on point processes, which are statistical models that can be used to represent the occurrence of random events. The approach uses rate functions to model the occurrence of relevant documents in the ranking and compares four candidate functions, including one that has not previously been used for this purpose (the hyperbolic function). Evaluation is carried out using standard datasets (CLEF e-Health, TREC Total Recall, TREC Legal). This work is also the first to explore the robustness of stopping methods by reporting performance on a range of rankings of varying effectiveness. Results show that, in the majority of cases, the proposed method achieves the desired level of recall without requiring an excessive number of documents to be examined, and that it compares well against multiple alternative approaches.
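The general idea of a rate-function-based stopping rule can be illustrated with a minimal sketch. It is not the procedure evaluated in the paper: it assumes an inhomogeneous Poisson process with an exponentially decaying rate a·exp(−k·x) over rank positions x (chosen here purely for illustration), fits a and k by a simple grid-search maximum likelihood, and uses a point estimate of the remaining relevant documents rather than any probabilistic bound. The function names `fit_exponential_rate`, `expected_relevant` and `should_stop` are hypothetical.

```python
import math


def fit_exponential_rate(positions, n_screened):
    """Fit rate(x) = a * exp(-k * x) to the ranks of relevant documents
    found in the first n_screened ranks (illustrative grid-search MLE)."""
    m = len(positions)
    if m == 0:
        raise ValueError("need at least one relevant document to fit the rate")
    s = sum(positions)
    best = None
    # Assumed search grid for k: 1e-6 ... 1, evenly spaced in log space.
    for e in range(-120, 1):
        k = 10 ** (e / 20.0)
        # Integral of exp(-k * x) over [0, n_screened].
        mass = -math.expm1(-k * n_screened) / k
        # For a fixed k, the likelihood is maximised by a = m / mass.
        a = m / mass
        # Poisson-process log-likelihood: sum(log rate) - integral(rate).
        loglik = m * math.log(a) - k * s - a * mass
        if best is None or loglik > best[0]:
            best = (loglik, a, k)
    return best[1], best[2]


def expected_relevant(a, k, start, end):
    """Expected number of relevant documents in ranks (start, end]."""
    return (a / k) * (math.exp(-k * start) - math.exp(-k * end))


def should_stop(labels, total_docs, target_recall=0.9):
    """Decide whether to stop after screening len(labels) documents.

    labels: 0/1 relevance judgements for the screened prefix of the ranking.
    Returns (stop, estimated_recall).
    """
    n = len(labels)
    positions = [i + 1 for i, rel in enumerate(labels) if rel]
    if not positions:
        # Nothing found yet, so no rate can be fitted; keep screening.
        return False, 0.0
    a, k = fit_exponential_rate(positions, n)
    remaining = expected_relevant(a, k, n, total_docs)
    found = len(positions)
    estimated_recall = found / (found + remaining)
    return estimated_recall >= target_recall, estimated_recall


# Usage: 20 relevant documents at the top, then a long run of non-relevant ones.
screened = [1] * 20 + [0] * 480
stop, recall = should_stop(screened, total_docs=1000)
print(stop, round(recall, 3))
```

In this sketch the decision rests on the estimated recall alone; a practical stopping method would also need to account for uncertainty in the fitted rate, and other rate functions could be substituted by changing the two helper functions.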