Word embedding, a high-dimensional (HD) numerical representation of words generated by machine learning models, has been used for a range of natural language processing tasks, e.g., translation between two languages. Recently, there has been an increasing trend of transforming the HD embeddings into a latent space (e.g., via autoencoders) for further tasks, exploiting various merits the latent representations could bring. To preserve the embeddings' quality, these works often map the embeddings into an even higher-dimensional latent space, making the already complicated embeddings even less interpretable and more costly to store. In this work, we borrow the idea of $\beta$-VAE to regularize the HD latent space. Our regularization implicitly condenses information from the HD latent space into a much lower-dimensional space, thus compressing the embeddings. We also show that each dimension of our regularized latent space is more semantically salient, and validate this claim by interactively probing the encoding level of user-proposed semantics in the dimensions. To this end, we design a visual analytics system to monitor the regularization process, explore the HD latent space, and interpret latent dimensions' semantics. We validate the effectiveness of our embedding regularization and interpretation approach through both quantitative and qualitative evaluations.
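For readers unfamiliar with $\beta$-VAE, the standard objective from the general literature is sketched below. The abstract does not state the exact loss used in this work, so this is the textbook form and not necessarily the authors' formulation. The symbols are assumptions introduced here: $x$ is an input embedding, $z$ a latent code of dimension $d$, $q_\phi(z \mid x)$ a diagonal Gaussian encoder with mean $\mu_j(x)$ and variance $\sigma_j^2(x)$ in dimension $j$, $p_\theta(x \mid z)$ the decoder, and $p(z) = \mathcal{N}(0, I)$ the prior.

```latex
% Standard beta-VAE objective (to be maximized); beta > 1 strengthens the regularizer.
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)

% Closed form of the KL term for a diagonal Gaussian encoder and a standard normal prior:
D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right) =
  \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j(x)^2 + \sigma_j(x)^2 - \log \sigma_j(x)^2 - 1 \right)
```

Because the KL term decomposes over latent dimensions, a larger $\beta$ pushes dimensions that contribute little to reconstruction toward the prior, so they carry almost no information. This is one plausible reading of how the regularization could concentrate information in a small number of dimensions.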