Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space in which similar tokens are mapped to vectors that are close to one another under some metric. A basic question is how well such embeddings can be learned. To study this problem, we consider a simple probability model for discrete data in which there is some "true" but unknown embedding, and the correlation between random variables is related to the similarity of their embeddings. Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the estimation accuracy in certain high-dimensional limits. In particular, the methodology provides insight into the relations among key parameters such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation on the probability distribution. Our theoretical findings are validated by simulations on both synthetic data and real text data.
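To make the idea of low-rank AMP concrete, the following is a minimal sketch on a generic rank-one toy problem. It is not the variant analyzed in this work, and it does not use the discrete-data model described above. It assumes a symmetric observation Y = (lam/n) x x^T + W with hidden entries x_i in {-1, +1} and Gaussian noise W, a tanh denoiser, and a spectral initialization. The names `rank_one_amp`, `simulate`, `lam`, and `n_iter` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_one_amp(Y, lam, n_iter=20):
    """Textbook-style AMP sketch for Y = (lam/n) x x^T + W with x_i in {-1, +1}.

    Assumption: lam > 1, so the top eigenvector of Y is correlated with x.
    """
    n = Y.shape[0]
    # Spectral initialization: eigh returns eigenvalues in ascending order,
    # so the last column is the top eigenvector. The scaling is a heuristic
    # chosen so the initial signal-to-noise ratio roughly suits tanh(lam * .).
    _, eigvecs = np.linalg.eigh(Y)
    x = np.sqrt(n * max(lam**2 - 1.0, 1e-3)) * eigvecs[:, -1]
    f_prev = np.zeros(n)
    for _ in range(n_iter):
        f = np.tanh(lam * x)              # entrywise denoiser for a +/-1 prior
        b = lam * np.mean(1.0 - f**2)     # Onsager coefficient: average derivative of the denoiser
        x, f_prev = Y @ f - b * f_prev, f  # matrix step with Onsager correction
    return np.tanh(lam * x)


def simulate(n=500, lam=2.0, seed=0):
    """Draw a synthetic instance and return the normalized overlap |<x_hat, x>|."""
    rng = np.random.default_rng(seed)
    x_true = rng.choice([-1.0, 1.0], size=n)
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2 * n)        # symmetric noise, off-diagonal variance 1/n
    Y = (lam / n) * np.outer(x_true, x_true) + W
    x_hat = rank_one_amp(Y, lam)
    # The sign of x is not identifiable from Y, so report the absolute overlap.
    return abs(x_hat @ x_true) / (np.linalg.norm(x_hat) * np.linalg.norm(x_true))


if __name__ == "__main__":
    print("normalized overlap:", simulate())
```

The Onsager term `b * f_prev` is what distinguishes AMP from a plain nonlinear power iteration. In the large-`n` limit it is this correction that lets the iterates be tracked by a scalar recursion, which is the kind of precise accuracy prediction the abstract refers to.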