We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However, none of this is required for reproducibility and scalability, as TIRA can run any dockerized software locally or remotely in a cloud-native execution environment. Version control and caching ensure efficient (re)execution. TIRA allows for blind evaluation when an experiment runs on a remote server or cloud that is not under the control of the experimenter. The test data and ground truth are then hidden from public access, and the retrieval software has to process them in a sandbox that prevents data leaks. We currently host an instance of TIREx with 15 corpora (1.9 billion documents) on which 32 shared retrieval tasks are based. Using Docker images of 50 standard retrieval approaches, we automatically evaluated all approaches on all tasks (50 $\cdot$ 32 = 1,600 runs) in less than a week on a midsize cluster (1,620 CPU cores and 24 GPUs). This instance of TIREx is open for submissions; it will be integrated with the IR Anthology and released as open source.
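The run count and the role of caching in (re)execution can be illustrated with a small sketch. It is not TIRA's actual mechanism: the identifiers `approaches`, `tasks`, `plan_runs`, and `execute` are hypothetical, the approach and task names are placeholders (the text gives only the counts), and the cache is a plain in-memory dictionary standing in for whatever versioned storage a real deployment would use.

```python
from itertools import product

# Placeholder identifiers: only the counts (50 approaches, 32 tasks)
# come from the text; the names themselves are made up.
approaches = [f"approach-{i:02d}" for i in range(1, 51)]
tasks = [f"task-{j:02d}" for j in range(1, 33)]


def plan_runs(approaches, tasks, cache):
    """Return the (approach, task) pairs that have no cached result yet."""
    return [pair for pair in product(approaches, tasks) if pair not in cache]


def execute(approach, task):
    """Hypothetical stand-in for running one dockerized approach on one task."""
    return f"run:{approach}:{task}"


cache = {}  # assumed cache, keyed by (approach, task)

# First pass: nothing is cached, so the full grid has to be executed.
pending = plan_runs(approaches, tasks, cache)
total_runs = len(pending)
for approach, task in pending:
    cache[(approach, task)] = execute(approach, task)

# Re-execution: every pair is cached, so nothing is left to run.
remaining = plan_runs(approaches, tasks, cache)

# Adding one new approach only requires one run per task.
new_pending = plan_runs(approaches + ["approach-new"], tasks, cache)
```

Under these assumptions, the first pass schedules the full 50 by 32 grid, a repeated pass schedules nothing, and a newly submitted approach schedules only its own 32 runs.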