We consider the problem of selecting an optimal subset of information sources for a hypothesis testing (classification) task, where the goal is to identify the true state of the world from a finite set of hypotheses, based on a finite number of observation samples from the sources. To characterize the learning performance, we propose a misclassification penalty framework, which enables nonuniform treatment of different misclassification errors. In a centralized Bayesian learning setting, we study two variants of the subset selection problem: (i) selecting a minimum-cost information set to ensure that the maximum penalty of misclassifying the true hypothesis is below a desired bound, and (ii) selecting an optimal information set under a limited budget to minimize the maximum penalty of misclassifying the true hypothesis. Under certain assumptions, we prove that the objective (or constraints) of these combinatorial optimization problems are weakly (or approximately) submodular, and establish high-probability performance guarantees for greedy algorithms. Further, we propose an alternate metric for information set selection, based on the total penalty of misclassification. We prove that this metric is submodular and establish near-optimal guarantees for greedy algorithms on both information set selection problems. Finally, we present numerical simulations that validate our theoretical results on several randomly generated instances.
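As a rough illustration of the budgeted variant (ii), the following is a minimal sketch of a cost-aware greedy rule applied to a monotone submodular set function. It is not the paper's algorithm or its penalty metric: the function names, the toy `PENALTY`, `DISTINGUISHES`, and `COST` tables, and the weighted-coverage objective are all assumptions made here so that the example is self-contained. Costs are assumed to be strictly positive.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set


def greedy_budgeted_selection(
    sources: Iterable[str],
    cost: Dict[str, float],
    budget: float,
    value: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Repeatedly add the affordable source with the best marginal
    value per unit cost, until nothing affordable improves the value."""
    selected: Set[str] = set()
    spent = 0.0
    remaining = set(sources)
    while remaining:
        base = value(frozenset(selected))
        best, best_ratio = None, 0.0
        for s in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + cost[s] > budget:
                continue  # this source no longer fits in the budget
            gain = value(frozenset(selected | {s})) - base
            ratio = gain / cost[s]  # assumes cost[s] > 0
            if ratio > best_ratio:
                best, best_ratio = s, ratio
        if best is None:
            break  # no affordable source gives a positive gain
        selected.add(best)
        spent += cost[best]
        remaining.remove(best)
    return selected


# Hypothetical toy instance (assumption, not taken from the paper):
# each hypothesis pair carries a misclassification penalty, and each
# source can tell some pairs apart.
PENALTY = {("h1", "h2"): 5.0, ("h1", "h3"): 1.0, ("h2", "h3"): 3.0}
DISTINGUISHES = {
    "a": {("h1", "h2")},
    "b": {("h1", "h3"), ("h2", "h3")},
    "c": {("h1", "h2"), ("h2", "h3")},
}
COST = {"a": 1.0, "b": 2.0, "c": 3.0}


def penalty_reduction(subset: FrozenSet[str]) -> float:
    """Total penalty of the pairs separated by at least one selected
    source. Weighted coverage is monotone and submodular."""
    covered = set().union(*(DISTINGUISHES[s] for s in subset))
    return sum(PENALTY[pair] for pair in covered)


chosen = greedy_budgeted_selection(DISTINGUISHES, COST, 3.0, penalty_reduction)
print(sorted(chosen), penalty_reduction(frozenset(chosen)))
```

With a budget of 3.0, this toy instance picks sources `a` and `b`, which together separate every hypothesis pair. The gain-per-cost rule is one common heuristic for budgeted submodular maximization; the guarantees stated in the abstract apply to the paper's own algorithms and metrics, not to this sketch.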