This paper proposes an information-theoretic framework for analyzing the theoretical limits of pool-based active learning (AL), in which a subset of instances from an unlabeled pool is selectively labeled. The proposed framework reformulates pool-based AL as a noisy lossy compression problem by mapping pool observations to noisy symbol observations, data selection to compression, and learning to decoding. This correspondence enables a unified information-theoretic analysis of data selection and learning in pool-based AL. By applying finite-blocklength analysis of noisy lossy compression, we derive information-theoretic lower bounds on label complexity and generalization error that serve as theoretical limits for a given learning algorithm under its associated optimal data selection strategy. In particular, our bounds contain terms that capture the overfitting induced by the learning algorithm and the discrepancy between its inductive bias and the target task, and they are closely related to established information-theoretic bounds and stability theory, neither of which has previously been applied to the analysis of pool-based AL. Together, these properties yield a new theoretical perspective on pool-based AL.
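For orientation, a minimal sketch of the classical asymptotic benchmark behind this formulation may help: the indirect (noisy) rate-distortion function, due to Dobrushin and Tsybakov, is standard background rather than the paper's own finite-blocklength bound, and the symbols $X$, $Y$, $\hat{X}$, $d$, and $D$ below are generic placeholders rather than the paper's notation.

% Indirect (noisy) rate-distortion function: the hidden source X is
% observed only through a noisy channel output Y (pool observations),
% the encoder compresses Y (data selection), and the decoder produces
% \hat{X} (the learned output). Standard background, assuming the
% Markov chain X -> Y -> \hat{X}; the paper's finite-blocklength
% analysis refines this asymptotic limit.
\[
  R(D) \;=\; \min_{P_{\hat{X}\mid Y}\,:\;\mathbb{E}\left[d(X,\hat{X})\right]\le D} I(Y;\hat{X}),
  \qquad X \to Y \to \hat{X}.
\]

Read through the abstract's correspondence, the rate loosely tracks the labeling budget and the distortion the learning error; this interpretive reading is an assumption here, and the paper's precise definitions may differ.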