We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines. Beyond immediate empirical gains, this framework yields trainable, fully differentiable estimators that can be embedded into larger learning pipelines. Moreover, exploiting MI's invariance to invertible transformations, meta-datasets can be adapted to arbitrary data modalities via normalizing flows, enabling flexible training for diverse target meta-distributions.
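The permutation-invariance requirement behind the two-dimensional attention scheme can be illustrated with a minimal sketch (NumPy; the projection matrices `w_k`, `w_v` and the learned query are hypothetical placeholders, not the actual MIST architecture): a pooled attention readout whose weights depend only on the set of sample rows is unchanged under any reordering of those rows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(samples, query, w_k, w_v):
    """Pool a variable-size set of sample points into one fixed vector.

    samples: (n, d) rows are i.i.d. sample points; query: (d_k,);
    w_k: (d, d_k) and w_v: (d, d_v) are (hypothetical) learned projections.
    The attention-weighted sum is invariant to permuting the n rows.
    """
    keys = samples @ w_k             # (n, d_k)
    values = samples @ w_v           # (n, d_v)
    weights = softmax(keys @ query)  # (n,) attention over sample points
    return weights @ values          # (d_v,) permutation-invariant readout
```

Because each weight is computed per row and the readout is a sum, shuffling the input sample leaves the output identical, which is what lets the estimator handle variable sample sizes.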
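The quantile regression objective mentioned above is commonly the pinball loss; a minimal NumPy sketch under that assumption (the exact loss form used by MIST is not specified here). Minimizing each output column of `y_pred` against this loss drives it toward the corresponding quantile of the target MI, yielding interval estimates rather than a single point.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantiles):
    """Average pinball (quantile regression) loss.

    y_true: (n,) ground-truth MI values; y_pred: (n, q) predicted
    quantiles, one column per level in `quantiles` (length q).
    """
    y_true = np.asarray(y_true, dtype=float)[:, None]  # (n, 1)
    y_pred = np.asarray(y_pred, dtype=float)           # (n, q)
    tau = np.asarray(quantiles, dtype=float)[None, :]  # (1, q)
    err = y_true - y_pred
    # Asymmetric penalty: under-prediction weighted by tau,
    # over-prediction by (1 - tau).
    return float(np.mean(np.maximum(tau * err, (tau - 1.0) * err)))
```

At level 0.5 this reduces to half the absolute error, and asymmetric levels such as 0.1/0.9 produce the quantile pairs from which the calibrated intervals in the abstract are formed.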