We propose polar encoding, a representation of categorical and numerical $[0,1]$-valued attributes with missing values to be used in a classification context. We argue that this is a good baseline approach, because it can be used with any classification algorithm, preserves missingness information, is very simple to apply and offers good performance. In particular, unlike the existing missing-indicator approach, it does not require imputation, ensures that missing values are equidistant from non-missing values, and lets decision tree algorithms choose how to split missing values, thereby providing a practical realisation of the "missingness incorporated in attributes" (MIA) proposal. Furthermore, we show that categorical and $[0,1]$-valued attributes can be viewed as special cases of a single attribute type, corresponding to the classical concept of barycentric coordinates, and that this offers a natural interpretation of polar encoding as a fuzzified form of one-hot encoding. With an experiment based on twenty real-life datasets with missing values, we show that, in terms of the resulting classification performance, polar encoding performs better than the state-of-the-art strategies "multiple imputation by chained equations" (MICE) and "multiple imputation with denoising autoencoders" (MIDAS) and, depending on the classifier, about as well as or better than mean/mode imputation with missing-indicators.
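The abstract does not spell out the encoding itself, so the following is only an illustrative sketch under stated assumptions, not a definitive implementation. It assumes that a $[0,1]$-valued attribute $x$ is mapped to the pair $(1-x, x)$, that a categorical value is mapped to its one-hot vector, and that a missing value is mapped to the all-zero vector in both cases. The function names `polar_encode_numeric`, `polar_encode_categorical` and `manhattan` are hypothetical, and the equidistance check uses the Manhattan distance, which is likewise an assumption.

```python
import math


def polar_encode_numeric(x):
    """Encode a [0,1]-valued attribute as a pair (assumed form: (1 - x, x)).

    A missing value (None or NaN) is mapped to (0, 0).
    """
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return (0.0, 0.0)
    if not 0.0 <= x <= 1.0:
        raise ValueError("expected a value in [0, 1]")
    return (1.0 - x, float(x))


def polar_encode_categorical(value, categories):
    """One-hot encode a categorical value; a missing value (None) becomes all zeros."""
    if value is None:
        return tuple(0.0 for _ in categories)
    if value not in categories:
        raise ValueError("unknown category")
    return tuple(1.0 if value == c else 0.0 for c in categories)


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Under these assumptions, a missing value lies at the same Manhattan
# distance from every non-missing value of the same attribute.
missing = polar_encode_numeric(None)
distances = [manhattan(missing, polar_encode_numeric(x)) for x in (0.0, 0.25, 1.0)]
```

In this sketch, a decision tree that thresholds the second coordinate groups missing values with low values of $x$, whereas thresholding the first coordinate groups them with high values, which is one way to read the MIA behaviour described in the abstract. A crisp categorical value and a numerical value both yield non-negative coordinates that sum to one, consistent with the barycentric view.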