When mining large datasets in order to predict new data, limitations of the principles behind statistical machine learning pose a serious challenge not only to the Big Data deluge, but also to the traditional assumption that data-generating processes are biased toward low algorithmic complexity. Even when one assumes an underlying algorithmic-informational bias toward simplicity in finite dataset generators, we show that current approaches to machine learning (including deep learning, or any formal-theoretic hybrid mix of top-down AI and statistical machine learning approaches) can always be deceived, naturally or artificially, by sufficiently large datasets. In particular, we demonstrate that, for every learning algorithm (with or without access to a formal theory), there is a sufficiently large dataset size above which the algorithmic probability of an unpredictable deceiver is an upper bound (up to a multiplicative constant that depends only on the learning algorithm) for the algorithmic probability of any other larger dataset. In other words, very large and complex datasets that deceive learning algorithms into a ``simplicity bubble'' are as likely as any other particular non-deceiving dataset. These deceiving datasets guarantee that any prediction made by the learning algorithm will unpredictably diverge from the high-algorithmic-complexity globally optimal solution while converging toward the low-algorithmic-complexity locally optimal solution, although the latter is deemed a global one by the learning algorithm. We discuss the framework and the additional empirical conditions to be met in order to circumvent this deceptive phenomenon, moving away from statistical machine learning toward a stronger type of machine learning based on, and motivated by, the intrinsic power of algorithmic information theory and computability theory.