Deep learning algorithms are often trained and deployed on different datasets. Any systematic difference between the training and test datasets may degrade the algorithm's performance, a phenomenon known as the domain shift problem. This issue is prevalent in many scientific domains where algorithms are trained on simulated data but applied to real-world datasets. Typically, the domain shift problem is addressed with domain adaptation methods. However, these methods are often tailored to a specific downstream task and may not generalize easily to other tasks. This work explores the feasibility of an alternative approach to the domain shift problem that is not specific to any downstream algorithm. The proposed approach relies on modern Unpaired Image-to-Image (UI2I) translation techniques, which are designed to find translations between different image domains in a fully unsupervised fashion. In this study, the approach is applied to a domain shift problem commonly encountered in Liquid Argon Time Projection Chamber (LArTPC) detector research, where the goal is to deterministically translate samples between two differently distributed detector datasets. This translation allows real-world data to be mapped into the simulated data domain, where the downstream algorithms can be run with much less domain-shift-related degradation. Conversely, translating simulated data into the real-world domain can increase the realism of the simulated dataset and reduce the magnitude of any systematic uncertainties. We adapted several UI2I translation algorithms to work on scientific data and demonstrated the viability of these techniques for solving the domain shift problem with LArTPC detector data. To facilitate further development of domain adaptation techniques for scientific datasets, the "Simple Liquid-Argon Track Samples" dataset used in this study is also published.
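The task-agnostic idea described above can be illustrated with a minimal sketch. It is not the method used in this work: the two translators below are hypothetical affine stand-ins for learned networks, the function names and the `gain`/`offset` parameters are assumptions made for illustration, and the round-trip check reflects a cycle-consistency property that many UI2I methods encourage but that the abstract does not specify.

```python
import numpy as np


def translate_real_to_sim(image, gain=0.8, offset=-0.1):
    """Hypothetical stand-in for a learned real -> simulated translator."""
    return gain * image + offset


def translate_sim_to_real(image, gain=0.8, offset=-0.1):
    """Hypothetical stand-in for the reverse (simulated -> real) translator."""
    return (image - offset) / gain


def cycle_consistency_error(image):
    """Mean absolute error after a real -> sim -> real round trip."""
    round_trip = translate_sim_to_real(translate_real_to_sim(image))
    return float(np.mean(np.abs(round_trip - image)))


def run_in_sim_domain(downstream, real_image):
    """Map real data into the simulated domain, then run any downstream algorithm."""
    return downstream(translate_real_to_sim(real_image))


def total_charge(image):
    """Toy downstream algorithm (assumption): sum of all pixel values."""
    return float(image.sum())


# Synthetic 8x8 array standing in for a detector image; not real LArTPC data.
rng = np.random.default_rng(0)
real_image = rng.random((8, 8))

print(cycle_consistency_error(real_image))
print(run_in_sim_domain(total_charge, real_image))
```

Because `run_in_sim_domain` accepts any callable, the translator does not need to know which downstream algorithm consumes its output, which is the sense in which the approach is not tied to a particular task.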