We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., the universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity with the ground-truth vectors across model pairs that differ in architecture, parameter count, and training data. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases: an adversary with access only to embedding vectors can extract sensitive information about the underlying documents, enough to support classification and attribute inference.
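The attribute-inference risk can be sketched with a toy zero-shot example. Everything below is synthetic and hypothetical: `label_vecs` stands in for attribute embeddings that a real target-space encoder would produce, and `translated` simulates a high-fidelity translated document embedding; the point is only that cosine-nearest-label lookup suffices once translated vectors preserve the original geometry.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)

# Hypothetical target-space embeddings for two attribute labels
# (stand-ins for real encoder outputs; the dimension is arbitrary).
label_vecs = {
    "medical": rng.normal(size=64),
    "financial": rng.normal(size=64),
}

# Simulate a translated document embedding that, by construction,
# lies near the "medical" label vector -- mimicking a translation
# that preserved the document's semantic geometry.
translated = label_vecs["medical"] + 0.1 * rng.normal(size=64)

# Zero-shot attribute inference: pick the label whose embedding is
# most cosine-similar to the translated vector.
inferred = max(label_vecs, key=lambda k: cosine(translated, label_vecs[k]))
print(inferred)  # → medical
```

In high dimensions, independent random vectors are nearly orthogonal, so even a noisy translated embedding remains far closer to its true attribute vector than to any other, which is why this simple nearest-label lookup succeeds.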