Large Language Models (LLMs) have proven effective at giving definition-type answers to user queries. While producing other answer types, such as examples and paraphrases, is an easy task for humans, LLMs struggle to answer queries other than definition-type ones correctly. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline for investigating the impact of pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset of 6,170 human-annotated medical terms paired with their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or its low frequency (tail) affects the language model's performance. We assessed the quality of the LLMs' output using syntactic and semantic similarity metrics, statistical correlations, and embeddings. Results showed that the LLMs' task performance is highest for definition-type questions and lowest for exemplification-type ones. Additionally, we showed that for definition-type questions, large language models tend to paraphrase more when handling popular, frequent knowledge and less when handling tail, technical knowledge, especially in expert texts.
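To illustrate the kind of comparison such an evaluation performs, the minimal sketch below scores a model answer against a reference along two axes: surface (syntactic) overlap via a sequence matcher, and semantic similarity via cosine distance between embedding vectors. The example sentences and the toy vectors standing in for real sentence embeddings are assumptions for illustration only, not TrackList's actual metrics or data.

```python
import math
from difflib import SequenceMatcher


def syntactic_similarity(a: str, b: str) -> float:
    # Surface-level string overlap between two answers, in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def cosine_similarity(u: list[float], v: list[float]) -> float:
    # Semantic similarity between two embedding vectors, in [-1, 1].
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical reference definition vs. a model's paraphrased answer.
reference = "Hypertension is abnormally high blood pressure."
answer = "Hypertension means blood pressure that is abnormally high."
print(f"syntactic: {syntactic_similarity(reference, answer):.2f}")

# Toy 3-d vectors standing in for real sentence embeddings.
ref_vec = [0.20, 0.80, 0.10]
ans_vec = [0.25, 0.75, 0.15]
print(f"semantic:  {cosine_similarity(ref_vec, ans_vec):.2f}")
```

In practice the embedding vectors would come from a sentence encoder, and scores would be aggregated per answer type (definition, exemplification, etc.) to expose the performance gaps the abstract describes.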