Over the past decade, word-level analogies have played a significant role as an intrinsic measure for evaluating the quality of word embedding methods such as word2vec. Modern large language models (LLMs), however, are primarily evaluated on extrinsic measures based on benchmarks such as GLUE and SuperGLUE, and only a few studies have investigated whether LLMs can draw analogies between long texts. In this paper, we present ANALOGICAL, a new benchmark to intrinsically evaluate LLMs across a taxonomy of long-text analogies with six levels of complexity: (i) word, (ii) word vs. sentence, (iii) syntactic, (iv) negation, (v) entailment, and (vi) metaphor. Using thirteen datasets and three different distance measures, we evaluate the ability of eight LLMs to identify analogical pairs in the semantic vector space (e.g., "I can speak two languages" should be close to "I am bilingual", while "I like chocolate" and "I do not like chocolate" should be orthogonal). Our evaluation finds that LLMs have increasing difficulty identifying analogies at higher levels of the taxonomy.
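To make the "closer" versus "orthogonal" intuition concrete, the following minimal sketch compares sentence vectors using two common distance measures. The abstract does not name its three measures, so cosine and Euclidean distance are assumptions here. The three-dimensional vectors are hypothetical toy values, not embeddings produced by any of the evaluated LLMs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal."""
    return 1.0 - cosine_similarity(u, v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy embeddings (assumed values, for illustration only).
speak_two_languages = [0.9, 0.1, 0.3]  # "I can speak two languages"
bilingual = [0.8, 0.2, 0.3]            # "I am bilingual"
like_chocolate = [1.0, 0.0, 0.0]       # "I like chocolate"
not_like_chocolate = [0.0, 1.0, 0.0]   # "I do not like chocolate"

# An analogical pair should sit close together in the vector space.
print(cosine_distance(speak_two_languages, bilingual))
print(euclidean_distance(speak_two_languages, bilingual))

# A negated pair should be orthogonal, i.e. cosine similarity of zero.
print(cosine_similarity(like_chocolate, not_like_chocolate))
```

In a real evaluation the vectors would come from each model's sentence representation, and the same comparison would be repeated for every pair in a dataset.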