We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that is highly robust and has a substantially lower computational cost. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative is orders of magnitude more efficient than the original instantiation of vec2vec, while matching or exceeding its results. The method's stability and interpretable algorithmic steps facilitate scaling and unlock opportunities for adoption in new domains and fields.
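To make the last two stages concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the linear map is fitted or how refinement proceeds, so the sketch assumes an orthogonal Procrustes fit and a nearest-neighbour re-matching loop, both common choices for aligning embedding spaces. The names `fit_orthogonal_map` and `refine` are hypothetical, and the tentative matching stage is omitted because the abstract gives no detail about it.

```python
import numpy as np


def fit_orthogonal_map(X, Y):
    """Orthogonal Procrustes fit (an assumed choice, not confirmed by the source).

    Returns the orthogonal W minimizing ||X @ W - Y||_F, where row i of X is
    paired with row i of Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def refine(A, B, W, n_iters=5):
    """Illustrative refinement: alternate re-matching and refitting.

    A and B hold unit-norm embeddings (one per row) from the two spaces;
    W is an initial map from A's space to B's space.
    """
    for _ in range(n_iters):
        sims = (A @ W) @ B.T         # cosine similarity for unit-norm rows
        match = sims.argmax(axis=1)  # current pseudo-parallel partner in B
        W = fit_orthogonal_map(A, B[match])
    return W
```

In this sketch, `refine` would be called with an initial `W` obtained by fitting on the tentative pseudo-parallel pairs. The quality of that starting point determines whether the loop converges to a useful alignment.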