Since large language models have approached human-level performance on many tasks, it has become increasingly hard for researchers to find tasks that still challenge them. Failure cases usually come from the long-tail distribution: data to which an oracle language model would assign a probability at the lower end of its distribution. Current methodologies such as prompt engineering and crowdsourcing are insufficient for creating long-tail examples because humans are constrained by cognitive bias. We propose a Logic-Induced-Knowledge-Search (LINK) framework for systematically generating long-tail knowledge statements. Grounded in a symbolic rule, we search for long-tail values for each variable of the rule by first prompting an LLM, then verifying the correctness of the values with a critic, and finally pushing the values toward the long-tail distribution with a reranker. With this framework we construct a dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and 50K knowledge statements spanning four domains. Human annotation finds that 84% of the statements in LINT are factually correct. In contrast, ChatGPT and GPT4 struggle to generate long-tail statements directly under the guidance of logic rules, getting only 56% and 78% of their statements correct, respectively. Moreover, their "long-tail" generations in fact fall into the higher-likelihood range, and thus are not really long-tail. Our findings suggest that LINK is effective for generating data in the long-tail distribution while enforcing quality. LINT can be useful for systematically evaluating LLMs' capabilities in the long-tail distribution. We challenge the models with a simple entailment classification task using samples from LINT, and find that the ability of ChatGPT and GPT4 to identify incorrect knowledge drops by about 3% in the long-tail distribution compared to the head distribution.
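The three-stage search described in the abstract (prompt, verify, rerank) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the LINK implementation: the function `search_long_tail_values` and its parameters `propose`, `critic`, `likelihood`, `n_candidates`, and `top_k` are hypothetical names, and the toy scores in `demo` are made up purely to exercise the control flow. In practice `propose` would wrap an LLM prompt, `critic` a verification model, and `likelihood` a language-model score.

```python
from typing import Callable, List


def search_long_tail_values(
    propose: Callable[[int], List[str]],   # hypothetical: returns up to n candidate values
    critic: Callable[[str], bool],         # hypothetical: True if the value satisfies the rule
    likelihood: Callable[[str], float],    # hypothetical: higher means more "head", lower more "tail"
    n_candidates: int = 20,
    top_k: int = 5,
) -> List[str]:
    """Propose candidates, drop those the critic rejects, keep the least likely."""
    # Stage 1: ask the generator for candidates and de-duplicate, preserving order.
    candidates = list(dict.fromkeys(propose(n_candidates)))
    # Stage 2: keep only the values the critic accepts as correct.
    verified = [c for c in candidates if critic(c)]
    # Stage 3: rerank in ascending likelihood so the lowest-likelihood values come first.
    reranked = sorted(verified, key=likelihood)
    return reranked[:top_k]


def demo() -> List[str]:
    # Toy stand-ins for the three components (assumed values, not real model scores).
    toy_scores = {"apple": 0.9, "banana": 0.8, "durian": 0.1, "rambutan": 0.05, "rock": 0.5}
    propose = lambda n: list(toy_scores)[:n]
    critic = lambda value: value != "rock"      # pretend "rock" fails the rule
    likelihood = lambda value: toy_scores[value]
    return search_long_tail_values(propose, critic, likelihood, n_candidates=5, top_k=2)


if __name__ == "__main__":
    print(demo())
```

Running the critic before the reranker in this sketch means the reranker only orders values that have already been accepted, so picking low-likelihood values does not by itself admit incorrect ones.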