The striking ability of unsupervised word translation has been demonstrated with the help of word vectors / pretraining; however, these methods require large amounts of data and usually fail if the data come from different domains. We propose coocmap, a method that can use either high-dimensional co-occurrence counts or their lower-dimensional approximations. Freed from the limits of low dimensions, we show that relying on low-dimensional vectors and their incidental properties misses out on better denoising methods and useful world knowledge in high dimensions, thus stunting the potential of the data. Our results show that unsupervised translation can be achieved more easily and robustly than previously thought: less than 80MB and minutes of CPU time are required to achieve over 50\% accuracy for English to Finnish, Hungarian, and Chinese translations when trained on similar data; even under domain mismatch, we show coocmap still works fully unsupervised on English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. These results challenge prevailing assumptions on the necessity and superiority of low-dimensional vectors, and suggest that similarly processed co-occurrences can outperform dense vectors on other tasks too.
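To make the distinction between high-dimensional co-occurrence counts and their lower-dimensional approximations concrete, the following is a minimal sketch on a toy corpus. It is not the coocmap implementation: the function names `cooccurrence_counts` and `low_rank_approximation`, the symmetric window size, and the use of a truncated SVD as the low-dimensional approximation are all assumptions made for illustration.

```python
import numpy as np


def cooccurrence_counts(sentences, window=2):
    """Count how often each word pair occurs within `window` tokens.

    Hypothetical helper: returns the sorted vocabulary and a dense
    (vocab x vocab) count matrix, one high-dimensional row per word.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo = max(0, i - window)
            hi = min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts


def low_rank_approximation(counts, dim):
    """Assumed stand-in for a 'lower-dimensional approximation':
    keep the top `dim` singular directions of the count matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dim] * s[:dim]


# Toy corpus, invented for this example.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, counts = cooccurrence_counts(sentences, window=2)
vectors = low_rank_approximation(counts, dim=2)
```

Here each row of `counts` has one entry per vocabulary word, while each row of `vectors` has only `dim` entries; the abstract's claim concerns what is lost when only the latter is kept.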