Active learning (AL) is a widely used technique for optimizing data labeling in machine learning by iteratively selecting, labeling, and training on the most informative data. However, its integration with formal privacy-preserving methods, particularly differential privacy (DP), remains largely underexplored. While some works have explored differentially private AL for specialized scenarios like online learning, the fundamental challenge of combining AL with DP in standard learning settings has remained unaddressed, severely limiting AL's applicability in privacy-sensitive domains. This work addresses this gap by introducing differentially private active learning (DP-AL) for standard learning settings. We demonstrate that naively integrating DP-SGD training into AL presents substantial challenges in privacy budget allocation and data utilization. To overcome these challenges, we propose step amplification, which leverages individual sampling probabilities in batch creation to maximize data point participation in training steps, thus optimizing data utilization. Additionally, we investigate the effectiveness of various acquisition functions for data selection under privacy constraints, revealing that many commonly used functions become impractical. Our experiments on vision and natural language processing tasks show that DP-AL can improve performance for specific datasets and model architectures. However, our findings also highlight the limitations of AL in privacy-constrained environments, emphasizing the trade-offs between privacy, model accuracy, and data selection accuracy.
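The step-amplification idea described above can be illustrated with a minimal sketch. The paper's actual algorithm is not reproduced here; this only shows the underlying mechanism of Poisson subsampling with *individual* (per-example) sampling probabilities, where a hypothetical per-point budget raises the probability of points that can still participate in more training steps. All names (`make_batch`, `remaining_steps`, `base_p`) are illustrative assumptions, not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(sample_probs, rng):
    """Poisson subsampling with per-example probabilities:
    each point joins the batch independently with its own probability."""
    mask = rng.random(len(sample_probs)) < sample_probs
    return np.flatnonzero(mask)

# Hypothetical illustration of step amplification: points that can
# still "afford" more training steps under their individual privacy
# budget get a proportionally higher sampling probability, so they
# participate in more steps before their budget is exhausted.
n = 1000
base_p = 0.01                                   # uniform baseline rate
remaining_steps = rng.integers(1, 5, size=n)    # toy per-point budgets
probs = np.minimum(base_p * remaining_steps, 1.0)

batch = make_batch(probs, rng)                  # indices selected this step
```

Under uniform Poisson sampling every point would share `base_p`; amplifying the rate for points with remaining budget increases their expected number of participations without changing the independent-Bernoulli batch construction that DP-SGD's privacy accounting assumes.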