The rapid growth of artificial intelligence (AI) in healthcare has brought a significant increase in the generation and storage of sensitive medical data, and this abundance of data has in turn propelled the advancement of medical AI technologies. However, concerns about unauthorized data exploitation, such as the training of commercial AI models, often deter researchers from making their valuable datasets publicly available. To protect this hard-to-collect data while still encouraging medical institutions to share it, one promising solution is to add imperceptible noise to the data. The noise safeguards the data against unauthorized training by degrading the generalization of models trained on it. Although existing methods protect data well in general domains, they tend to fall short on biomedical data, mainly because they fail to account for the sparse nature of medical images. To address this problem, we propose Sparsity-Aware Local Masking (SALM), a novel method that selectively perturbs significant pixel regions rather than the entire image, as previous strategies have done. This simple yet effective approach significantly reduces the perturbation search space by concentrating on local regions, improving both the efficiency and the effectiveness of data protection for biomedical datasets characterized by sparse features. We also show that SALM preserves the essential characteristics of the data, so its clinical utility remains uncompromised. Extensive experiments across various datasets and model architectures demonstrate that SALM effectively prevents unauthorized training of deep-learning models and outperforms previous state-of-the-art data protection methods.
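To make the idea of perturbing only significant pixel regions more concrete, the following is a minimal sketch and not the SALM implementation itself. The abstract does not say how significant pixels are chosen or how the noise is optimized, so several things here are assumptions: the function name `sparse_local_perturbation`, the use of per-pixel gradient magnitude as the significance score, the `keep_ratio` and `epsilon` parameters, and the single signed step against the gradient (an error-minimizing direction, as commonly used for unlearnable examples). The gradient is assumed to be supplied by the caller.

```python
import numpy as np


def sparse_local_perturbation(image, grad, epsilon=8 / 255, keep_ratio=0.1):
    """Perturb only the most salient pixels of an image (illustrative sketch).

    image      : float array with values in [0, 1].
    grad       : array of the same shape; assumed to be the gradient of a
                 training loss with respect to the image (hypothetical input).
    epsilon    : maximum absolute change per pixel (assumed L-infinity budget).
    keep_ratio : fraction of pixels allowed to change (assumed hyperparameter).
    """
    if image.shape != grad.shape:
        raise ValueError("image and grad must have the same shape")

    # Assumption: significance is scored by gradient magnitude.
    saliency = np.abs(grad).ravel()
    k = max(1, int(round(keep_ratio * saliency.size)))

    # Indices of the k highest-scoring pixels (order among them is irrelevant).
    top_idx = np.argpartition(saliency, -k)[-k:]
    mask = np.zeros(saliency.shape, dtype=bool)
    mask[top_idx] = True
    mask = mask.reshape(image.shape)

    # Signed step against the gradient, applied only inside the mask.
    delta = -epsilon * np.sign(grad) * mask
    protected = np.clip(image + delta, 0.0, 1.0)
    return protected, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))        # stand-in for a grayscale medical image
    grad = rng.normal(size=(8, 8))    # stand-in for a real loss gradient
    protected, mask = sparse_local_perturbation(image, grad, keep_ratio=0.25)
    print("pixels changed:", int(mask.sum()), "of", mask.size)
```

Restricting the update to a small mask is what shrinks the search space: with `keep_ratio=0.25`, only a quarter of the pixels are free variables, and every other pixel is left exactly as in the original image. A practical method would iterate this step against a trained or jointly trained model rather than apply it once.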