In today's data-driven world, the proliferation of publicly available information raises security concerns due to the information leakage (IL) problem. IL involves unintentionally exposing sensitive information to unauthorized parties via observable system information. Conventional statistical approaches, which rely on estimating the mutual information (MI) between observable and secret information to detect ILs, face challenges including the curse of dimensionality, convergence, computational complexity, and MI misestimation. Emerging supervised machine learning based approaches to IL detection, though effective, are limited to binary system sensitive information and lack a comprehensive framework. To address these limitations, we establish a theoretical framework using statistical learning theory and information theory to quantify and detect IL accurately. Using automated machine learning, we demonstrate that MI can be accurately estimated by approximating the typically unknown Bayes predictor's log-loss and accuracy. Based on this, we show how MI can effectively be estimated to detect ILs. Our method outperforms state-of-the-art baselines in an empirical study on synthetic and real-world OpenSSL TLS server datasets.
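To make the log-loss route to MI concrete, the following is a minimal sketch, not the estimator used in this work. It relies on the identity I(X;Y) = H(Y) - H(Y|X) and on the fact that the Bayes predictor's expected log-loss equals H(Y|X), so a good classifier's held-out log-loss gives a plug-in MI estimate. The toy binary symmetric channel, the tabular count-based classifier (standing in for an automated machine learning model), and all function names are assumptions introduced purely for illustration.

```python
import numpy as np


def entropy_nats(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def mi_from_log_loss(y_true, proba):
    """Plug-in MI estimate: empirical H(Y) minus the mean log-loss of `proba`.

    `proba[i, k]` is the predicted probability of class k for sample i.
    The estimate is clipped at zero because MI is non-negative.
    """
    y_true = np.asarray(y_true)
    marginal = np.bincount(y_true, minlength=proba.shape[1]) / len(y_true)
    picked = np.clip(proba[np.arange(len(y_true)), y_true], 1e-12, 1.0)
    log_loss = float(-np.mean(np.log(picked)))
    return max(0.0, entropy_nats(marginal) - log_loss)


def simulate_channel(n, flip_prob, rng):
    """Hypothetical toy system: secret bit y, observable x = y flipped w.p. flip_prob."""
    y = rng.integers(0, 2, size=n)
    flips = rng.random(n) < flip_prob
    x = np.where(flips, 1 - y, y)
    return x, y


def fit_tabular_posterior(x, y, n_values=2, n_classes=2):
    """Count-based estimate of p(y | x) with Laplace smoothing (stand-in classifier)."""
    counts = np.ones((n_values, n_classes))
    np.add.at(counts, (x, y), 1)
    return counts / counts.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
flip_prob = 0.1
n = 20_000

# Leaky system: the observable depends on the secret.
x_train, y_train = simulate_channel(n, flip_prob, rng)
x_test, y_test = simulate_channel(n, flip_prob, rng)
posterior = fit_tabular_posterior(x_train, y_train)
mi_leaky = mi_from_log_loss(y_test, posterior[x_test])

# Analytic MI of a binary symmetric channel with a uniform input (about 0.368 nats).
true_mi = np.log(2) - entropy_nats([flip_prob, 1 - flip_prob])

# Non-leaky system: the observable is independent of the secret.
x_ind_train = rng.integers(0, 2, size=n)
x_ind_test = rng.integers(0, 2, size=n)
posterior_ind = fit_tabular_posterior(x_ind_train, y_train)
mi_indep = mi_from_log_loss(y_test, posterior_ind[x_ind_test])

print(f"estimated MI (leaky):       {mi_leaky:.4f} nats")
print(f"analytic MI (leaky):        {true_mi:.4f} nats")
print(f"estimated MI (independent): {mi_indep:.4f} nats")
```

In this toy setting the estimate for the leaky system should land close to the analytic value, while the independent system should yield an estimate near zero. Deciding whether a small positive estimate indicates a genuine leak would still require a statistical test, which this sketch does not attempt.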