Missing data are an unavoidable complication in many machine learning tasks. When data are `missing at random', a range of tools and techniques exists to deal with the issue. However, as machine learning studies become more ambitious and seek to learn from ever-larger volumes of heterogeneous data, a problem arises increasingly often: the missing values themselves exhibit an association or structure, either explicitly or implicitly. Such `structured missingness' raises a range of challenges that have not yet been systematically addressed, and presents a fundamental hindrance to machine learning at scale. Here, we outline the current literature and propose a set of grand challenges in learning from data with structured missingness.