We present a variety of methods for training complex-valued word embeddings, based on the classical Skip-gram model, with a straightforward adaptation simply replacing the real-valued vectors with arbitrary vectors of complex numbers. In a more "physically inspired" approach, the vectors are produced by parameterised quantum circuits (PQCs): unitary transformations that yield normalised vectors with a probabilistic interpretation. We develop a complex-valued variant of the highly optimised C implementation of Skip-gram, which allows us to easily produce complex embeddings trained on a 3.8B-word corpus for a vocabulary of over 400k words, for each of which we are then able to train a separate PQC. We evaluate the complex embeddings on a set of standard similarity and relatedness datasets, with some models obtaining results competitive with the classical baseline. We find that, while training the PQCs directly tends to harm performance, the quantum word embeddings from the two-stage process perform as well as the classical Skip-gram embeddings with comparable numbers of parameters. This provides a highly scalable route to learning embeddings in complex spaces, one whose cost grows with the size of the vocabulary rather than the size of the training corpus. In summary, we demonstrate how to produce a large set of high-quality word embeddings for use in complex-valued and quantum-inspired NLP models, and for exploring potential advantage in quantum NLP models.
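To make the "straightforward adaptation" concrete, the following is a minimal sketch of a complex-valued skip-gram-with-negative-sampling update in Python/NumPy. The scoring function Re⟨u, v⟩ (the real part of the Hermitian inner product) is one natural way to map a pair of complex vectors to a real score, but it is an assumption here, as are the toy vocabulary size, dimension, initialisation, and learning rate; the paper's exact formulation may differ.

```python
import numpy as np

# Toy sizes for illustration only; the paper trains on a 3.8B-word corpus
# with a vocabulary of over 400k words.
VOCAB, DIM = 1000, 50

rng = np.random.default_rng(0)

def init(shape):
    # Small random complex initialisation (independent real and imaginary parts).
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2 * DIM)

W_in = init((VOCAB, DIM))   # "input" (centre-word) embeddings
W_out = init((VOCAB, DIM))  # "output" (context-word) embeddings

def sgns_step(center, context, negatives, lr=0.025):
    """One skip-gram-with-negative-sampling update for complex vectors.

    The score is Re<u, v>, the real part of the Hermitian inner product
    (np.vdot conjugates its first argument). Treating each complex
    parameter as a pair of reals, the gradient of Re<u, v> w.r.t. u is v
    and w.r.t. v is u, so the updates mirror the real-valued case.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    u = W_in[center].copy()
    grad_u = np.zeros(DIM, dtype=complex)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = lr * (label - sigmoid(np.real(np.vdot(u, W_out[w]))))
        grad_u += g * W_out[w]   # accumulate centre-word gradient
        W_out[w] += g * u        # update the context/negative vector
    W_in[center] += grad_u

# One positive (centre, context) pair with two negative samples.
sgns_step(center=3, context=17, negatives=[101, 555])
```

In the two-stage variant described above, a separate PQC would then be fit per word to (a normalised version of) its trained vector, which is why that stage scales with the vocabulary size rather than the corpus size.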