In recent years, representation learning that exploits the domain-specific underlying structure of data and its generative factors has succeeded in a range of use-case-agnostic applications. However, the diversity and complexity of tabular data make it challenging to represent these structures in a latent space as multi-dimensional vectors. We design an autoencoder-based framework for building general-purpose embeddings, assess the performance of different autoencoder architectures, and show that simpler models outperform complex ones when embedding highly complex tabular data. We apply our framework to produce plug-and-play, rich, and anonymized embeddings representing AWS customers for use in any model, saving up to 45% of development time, and we observe significant improvements in downstream models. Moreover, we propose a significant improvement to the calculation of the reconstruction loss for multi-layer contractive autoencoders (CAE): calculating the Jacobian of the entire encoder leads to a 15% improvement in reconstruction quality compared to a stacked CAE.
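The distinction between penalizing the Jacobian of the entire encoder and penalizing each layer separately (as in a stacked CAE) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration: sigmoid activations, a squared Frobenius-norm contractive term of the kind usually added to the reconstruction error in a CAE objective, and hypothetical function names (`full_encoder_jacobian`, `full_penalty`, `stacked_penalty`) and layer sizes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encoder_forward(x, weights, biases):
    """Return the activations of every encoder layer for one input vector x."""
    activations = []
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
        activations.append(h)
    return activations


def full_encoder_jacobian(x, weights, biases):
    """Jacobian of the final code with respect to the input, via the chain rule.

    For a sigmoid layer h = sigmoid(W a + b), dh/da = diag(h * (1 - h)) @ W.
    The Jacobian of the whole encoder is the product of these layer Jacobians.
    """
    activations = encoder_forward(x, weights, biases)
    J = np.eye(x.shape[0])
    for W, h in zip(weights, activations):
        layer_jacobian = (h * (1.0 - h))[:, None] * W
        J = layer_jacobian @ J
    return J


def full_penalty(x, weights, biases):
    """Squared Frobenius norm of the Jacobian of the entire encoder."""
    J = full_encoder_jacobian(x, weights, biases)
    return float(np.sum(J ** 2))


def stacked_penalty(x, weights, biases):
    """Sum of per-layer squared Frobenius norms (assumed stacked-CAE style).

    Each layer is penalized with respect to its own input only, so
    interactions between layers are ignored.
    """
    activations = encoder_forward(x, weights, biases)
    total = 0.0
    for W, h in zip(weights, activations):
        layer_jacobian = (h * (1.0 - h))[:, None] * W
        total += float(np.sum(layer_jacobian ** 2))
    return total


# Hypothetical 5 -> 3 -> 2 encoder with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 5)), rng.normal(size=(2, 3))]
biases = [rng.normal(size=3), rng.normal(size=2)]
x = rng.normal(size=5)

print("full-encoder penalty:", full_penalty(x, weights, biases))
print("stacked (per-layer) penalty:", stacked_penalty(x, weights, biases))
```

In a training loop, either penalty would be scaled by a coefficient and added to the reconstruction error. In practice an automatic-differentiation framework would likely compute the Jacobian; the closed form here only makes the chain-rule structure visible.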