Categorical data are common in key areas such as health and supply chain, and they require specific treatment. Applying recent machine learning models to such data requires an encoding step. For interpretable models, one-hot encoding remains a very good solution, but it produces sparse data. Standard gradient estimators are poorly suited to sparse data: the gradient is mostly treated as zero when in fact it does not always exist. We therefore introduce a novel gradient estimator. We show theoretically what this estimator minimizes and demonstrate its efficiency on several datasets with multiple model architectures. Under similar settings, the new estimator performs better than common estimators. We also release an anonymized real-world retail dataset. Overall, the aim of this paper is to consider categorical data thoroughly and to adapt models and optimizers to their key features.
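To make the sparsity issue concrete, the following is a minimal sketch in NumPy. It is not the estimator proposed in this paper; it only illustrates, under assumed toy values, that with one-hot inputs the usual mini-batch gradient of a linear model is exactly zero for every category absent from the batch. The helper names `one_hot` and `mse_weight_gradient`, the batch contents, and the number of categories are all hypothetical.

```python
import numpy as np


def one_hot(indices, num_categories):
    """Encode integer category indices as one-hot rows."""
    out = np.zeros((len(indices), num_categories))
    out[np.arange(len(indices)), indices] = 1.0
    return out


def mse_weight_gradient(x, y, w):
    """Gradient of the mean squared error of a linear model x @ w with respect to w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)


# Hypothetical mini-batch: only categories 0 and 2 occur, out of 5 possible.
categories = np.array([0, 2, 2, 0])
targets = np.array([1.0, 3.0, 5.0, 2.0])

x = one_hot(categories, num_categories=5)
w = np.ones(5)

grad = mse_weight_gradient(x, targets, w)
sparsity = 1.0 - x.mean()  # fraction of zero entries in the encoded batch

print("gradient:", grad)
print("sparsity:", sparsity)
```

In this toy batch, the gradient entries for categories 1, 3, and 4 are zero only because those categories were not observed, not because their weights are at an optimum. This is the behavior that motivates treating gradients on one-hot data with more care.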