A machine learner trained on observations must cope with uncertainty and novelty, especially when it is expected to maintain performance as new information arrives and to select the hypothesis that best fits the information observed so far. In this context, some key questions arise: what is information, how much information did the observations provide, how much information is required to identify the data-generating process, how many further observations are needed to obtain that information, and how does a predictor determine that it has observed novel information? This paper strengthens existing answers to these questions by formalizing the notion of "identifiable information" that arises from the language used to express the relationship between distinct states. Model identifiability and sample complexity are defined via the computation of an indicator function over a set of hypotheses. Their properties and asymptotic statistics are described for data-generating processes ranging from deterministic processes to ergodic stationary stochastic processes. This connects the identification of information in finitely many steps with asymptotic statistics and PAC-learning. The computation of the indicator function naturally formalizes novel information and its identification from observations with respect to a hypothesis set. We also prove that the sample complexity distribution of a computable PAC-Bayes learner is determined by its moments, expressed in terms of the prior probability distribution over a fixed finite hypothesis set.
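To make the indicator-function view concrete, the following is a minimal sketch (not the paper's construction; the hypothesis class, the observations, and all function names are hypothetical) of how an indicator over a finite hypothesis set yields identifiability, remaining sample complexity, and novelty detection for a deterministic process:

```python
from typing import Callable, Sequence

# A hypothesis is modeled here as any predictor h(x) -> y.
Hypothesis = Callable[[int], int]

def indicator(h: Hypothesis, observations: Sequence[tuple[int, int]]) -> bool:
    """1 iff hypothesis h is consistent with every observation (x, y)."""
    return all(h(x) == y for x, y in observations)

def surviving(hypotheses: Sequence[Hypothesis],
              observations: Sequence[tuple[int, int]]) -> list[Hypothesis]:
    """Hypotheses whose indicator is still 1 after the observations."""
    return [h for h in hypotheses if indicator(h, observations)]

# A toy finite hypothesis set over a deterministic data-generating process.
H = [lambda x: x, lambda x: 2 * x, lambda x: x * x]
obs = [(2, 4), (3, 9)]  # generated by x -> x * x

alive = surviving(H, obs)
if len(alive) == 1:
    print("identified: exactly one consistent hypothesis remains")
elif len(alive) == 0:
    print("novel information: no hypothesis in H explains the observations")
else:
    print(f"{len(alive)} hypotheses remain; more observations are needed")
```

Here identifiability corresponds to the surviving set shrinking to a single hypothesis, the number of observations needed to reach that point plays the role of sample complexity, and an empty surviving set signals information that is novel with respect to the hypothesis set.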