Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, which lets us run the diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For a fair comparison, we benchmark all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace setting of fixed data and parameters. Our investigations reveal that the highest-performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We recommend hyperparameters distinct from those of the previous models. We suggest two bounds on the mutual information as performance metrics for simulation-based inference methods; computing them does not require posterior samples. We provide experimental results for both. This version corrects a minor implementation error in $\gamma$, improving results.
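The bias problem of NRE-B can be illustrated with a minimal sketch. It assumes a toy Gaussian simulator and a hand-written logit `f(theta, x)` that stands in for a trained network; both are hypothetical and not taken from the paper. NRE-A is written as binary cross-entropy on jointly drawn versus marginal $(\theta, x)$ pairs. NRE-B is written as a $K$-way softmax over one jointly drawn $\theta$ and $K-1$ prior draws.

The sketch adds an arbitrary offset $c(x)$ to the logit. The NRE-B loss does not change, so that loss can only identify $\log r(\theta, x)$ up to an unknown function of $x$. The NRE-A loss does change. This illustrates the bias and does not implement the proposed framework.

```python
import math
import random


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def logsumexp(values):
    # Stable log(sum(exp(v))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def nre_a_loss(f, joint_pairs, marginal_pairs):
    # Binary cross-entropy: jointly drawn pairs have label 1, marginal pairs
    # have label 0. Uses log(1 - sigmoid(z)) == log_sigmoid(-z).
    pos = sum(log_sigmoid(f(t, x)) for t, x in joint_pairs) / len(joint_pairs)
    neg = sum(log_sigmoid(-f(t, x)) for t, x in marginal_pairs) / len(marginal_pairs)
    return -(pos + neg)


def nre_b_loss(f, xs, theta_sets):
    # K-way cross-entropy: for each x, theta_set[0] is the jointly drawn
    # parameter and the other K - 1 entries are fresh prior draws.
    total = 0.0
    for x, thetas in zip(xs, theta_sets):
        logits = [f(t, x) for t in thetas]
        total += logits[0] - logsumexp(logits)
    return -total / len(xs)


# Hypothetical toy simulator: theta ~ N(0, 1), x = theta + N(0, 1) noise.
random.seed(0)
K, N = 4, 200
thetas = [random.gauss(0.0, 1.0) for _ in range(N)]
xs = [t + random.gauss(0.0, 1.0) for t in thetas]
joint = list(zip(thetas, xs))
shuffled = thetas[:]
random.shuffle(shuffled)  # pairing shuffled thetas with xs approximates p(theta)p(x)
marginal = list(zip(shuffled, xs))
theta_sets = [[t] + [random.gauss(0.0, 1.0) for _ in range(K - 1)] for t in thetas]


def f(theta, x):
    # Hypothetical logit standing in for a trained network.
    return theta * x


def f_biased(theta, x):
    # Same logit plus an arbitrary x-dependent offset c(x) = 2x + 1.
    return f(theta, x) + 2.0 * x + 1.0


loss_a = nre_a_loss(f, joint, marginal)
loss_a_biased = nre_a_loss(f_biased, joint, marginal)
loss_b = nre_b_loss(f, xs, theta_sets)
loss_b_biased = nre_b_loss(f_biased, xs, theta_sets)
print(f"NRE-A: {loss_a:.4f} vs {loss_a_biased:.4f}")  # the offset changes the loss
print(f"NRE-B: {loss_b:.4f} vs {loss_b_biased:.4f}")  # equal up to rounding
```

The softmax in NRE-B normalizes over $\theta$ for a fixed $x$, so any offset $c(x)$ cancels. The binary loss has no such normalization, which is why its optimum pins the logit to $\log r(\theta, x)$ when the classes are balanced.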