Data encoding is a common and central operation in most data analysis tasks. The performance of downstream models in the computational pipeline depends heavily on the quality of the encoding. One of the most powerful ways to encode data is the neural network AutoEncoder (AE) architecture. However, developers of AE models cannot easily influence the embedding space an AE produces, because the AE is usually treated as a \textit{black box} technique; the embedding is therefore hard to control and does not necessarily have the properties desired for downstream tasks. In this paper, we introduce a novel approach for developing AE models that integrates external knowledge sources into the learning process, possibly leading to more accurate results. The proposed \methodNamefull{} (\methodName{}) model leverages domain-specific information to ensure that the desired distance and neighborhood properties between samples are preserved in the embedding space. The proposed model is evaluated on three large-scale datasets from three different scientific fields and is compared to nine existing encoding models. The results demonstrate that the \methodName{} model effectively captures the underlying structures and relationships between the input data and the external knowledge, generating a more useful representation. As a result, it outperforms the other models in terms of reconstruction accuracy.