Over the past decade, word-level analogies have played a significant role as an intrinsic measure for evaluating the quality of word embedding methods such as word2vec. Modern large language models (LLMs), however, are primarily evaluated with extrinsic measures on benchmarks such as GLUE and SuperGLUE, and only a few studies have investigated whether LLMs can draw analogies between long texts. In this paper, we present ANALOGICAL, a new benchmark for intrinsically evaluating LLMs across a taxonomy of long-text analogies with six levels of complexity: (i) word, (ii) word vs. sentence, (iii) syntactic, (iv) negation, (v) entailment, and (vi) metaphor. Using thirteen datasets and three different distance measures, we evaluate the ability of eight LLMs to identify analogical pairs in the semantic vector space. Our evaluation finds that identifying analogies becomes increasingly challenging for LLMs at higher levels of the analogy taxonomy.
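To make the idea of comparing analogical pairs in a vector space concrete, the following is a minimal sketch of scoring one pair of embeddings under several distance measures. The abstract does not name its three measures, so cosine, Euclidean, and Manhattan distance are assumptions chosen for illustration. The helper `score_pair` and the toy vectors are likewise hypothetical: they are hand-written numbers, not outputs of any LLM, and not the benchmark's actual pipeline.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u: Vector, v: Vector) -> float:
    """Return the straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u: Vector, v: Vector) -> float:
    """Return the sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Assumed set of measures; the source only says that three are used.
MEASURES: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}


def score_pair(u: Vector, v: Vector) -> Dict[str, float]:
    """Hypothetical helper: distance between two embeddings under each measure."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    return {name: measure(u, v) for name, measure in MEASURES.items()}


# Toy three-dimensional vectors standing in for text embeddings.
analogous = score_pair([1.0, 0.0, 1.0], [0.9, 0.1, 1.1])
unrelated = score_pair([1.0, 0.0, 1.0], [-1.0, 2.0, 0.0])

for name in MEASURES:
    print(f"{name}: analogous={analogous[name]:.3f}, unrelated={unrelated[name]:.3f}")
```

Under this setup, a smaller distance is read as a closer analogical relation between the two texts, so the nearby toy pair scores lower than the dissimilar pair under all three measures.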