This work examines how different encoding techniques affect entity and context embeddings, with the goal of challenging the ordinal encoding commonly used in tabular learning. Applying different preprocessing methods and network architectures to several datasets yields a benchmark of how the encoders influence the networks' learning outcomes. With the training, validation, and test data kept consistent across experiments, the results show that ordinal encoding is not the best-suited encoder for categorical data, either for preprocessing or for subsequently classifying the target variable correctly. Better results were achieved by encoding the features based on string similarities, computing a similarity matrix that serves as input to the network. This holds for both entity and context embeddings; the transformer architecture showed improved performance with both ordinal and similarity encoding on multi-label classification tasks.
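To make the contrast between the two encoders concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a Jaccard similarity over character 3-grams as the string similarity measure, which the abstract does not specify; the function names `ordinal_encode` and `similarity_encode` are hypothetical and chosen only for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the n-gram sets (an assumed similarity measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def ordinal_encode(values):
    """Map each category to an integer index (alphabetical order)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[value] for value in values]


def similarity_encode(values, n=3):
    """Encode each value as its similarity to every known category.

    The result is a matrix with one row per value and one column per
    distinct category; such a matrix could serve as network input.
    """
    categories = sorted(set(values))
    return [[ngram_similarity(value, category, n) for category in categories]
            for value in values]


# Hypothetical categorical column
jobs = ["engineer", "senior engineer", "nurse"]
ordinal = ordinal_encode(jobs)
matrix = similarity_encode(jobs)
```

In this sketch, ordinal encoding assigns arbitrary integers that imply an order between unrelated categories, whereas the similarity matrix places "engineer" closer to "senior engineer" than to "nurse", which is the kind of information a similarity-based encoder can pass on to the network.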