We propose polar encoding, a representation of categorical and numerical $[0,1]$-valued attributes with missing values to be used in a classification context. We argue that this is a good baseline approach, because it can be used with any classification algorithm, preserves missingness information, is very simple to apply and offers good performance. In particular, unlike the existing missing-indicator approach, it does not require imputation, ensures that missing values are equidistant from non-missing values, and lets decision tree algorithms choose how to split missing values, thereby providing a practical realisation of the "missingness incorporated in attributes" (MIA) proposal. Furthermore, we show that categorical and $[0,1]$-valued attributes can be viewed as special cases of a single attribute type, corresponding to the classical concept of barycentric coordinates, and that this offers a natural interpretation of polar encoding as a fuzzified form of one-hot encoding. With an experiment based on twenty real-life datasets with missing values, we show that, in terms of the resulting classification performance, polar encoding performs better than the state-of-the-art strategies \emph{multiple imputation by chained equations} (MICE) and \emph{multiple imputation with denoising autoencoders} (MIDAS) and, depending on the classifier, about as well as or better than mean/mode imputation with missing-indicators.
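The abstract does not spell out the encoding itself, so the following is only a minimal sketch under stated assumptions. It assumes that a $[0,1]$-valued attribute $x$ is represented by its two barycentric coordinates $(1 - x, x)$, that a categorical attribute is one-hot encoded, and that a missing value is represented by the all-zero vector in both cases. The function names `polar_encode_numeric` and `polar_encode_categorical` are hypothetical and chosen for illustration.

```python
import numpy as np


def polar_encode_numeric(x):
    """Encode [0,1]-valued values as two columns (1 - x, x).

    Assumption: missing values are given as NaN and are mapped to (0, 0).
    """
    x = np.asarray(x, dtype=float)
    encoded = np.stack([1.0 - x, x], axis=-1)
    # Rows whose input was NaN become the all-zero vector.
    encoded[np.isnan(x)] = 0.0
    return encoded


def polar_encode_categorical(values, categories):
    """One-hot encode categorical values.

    Assumption: missing values are given as None and are mapped to an
    all-zero row, so no imputation and no extra indicator column is needed.
    """
    index = {category: j for j, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for i, value in enumerate(values):
        if value is not None:
            encoded[i, index[value]] = 1.0
    return encoded


numeric = polar_encode_numeric([0.0, 0.25, np.nan, 1.0])
categorical = polar_encode_categorical(["a", None, "b"], ["a", "b", "c"])
```

Under these assumptions, every non-missing row sums to 1, so the all-zero row of a missing value lies at Manhattan distance 1 from each of them. A decision tree that thresholds the first numeric column groups missing values with large $x$, while thresholding the second column groups them with small $x$, which is one way to read the MIA property described above.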