We consider the problem of selecting an optimal subset of information sources for a hypothesis testing/classification task, where the goal is to identify the true state of the world from a finite set of hypotheses based on finite observation samples from the sources. To characterize the learning performance, we propose a misclassification penalty framework, which enables non-uniform treatment of different misclassification errors. In a centralized Bayesian learning setting, we study two variants of the subset selection problem: (i) selecting a minimum-cost information set to ensure that the maximum penalty of misclassifying the true hypothesis remains bounded, and (ii) selecting an optimal information set under a limited budget to minimize the maximum penalty of misclassifying the true hypothesis. Under certain assumptions, we prove that the objectives (or constraints) of these combinatorial optimization problems are weakly (or approximately) submodular, and we establish high-probability performance guarantees for greedy algorithms. Further, we propose an alternative metric for information set selection based on the total penalty of misclassification. We prove that this metric is submodular and establish near-optimal guarantees for the greedy algorithms for both information set selection problems. Finally, we present numerical simulations that validate our theoretical results on several randomly generated instances.