Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction

Taxonomies represent hierarchical relations between entities and are frequently used in software modeling and natural language processing (NLP). They are typically subject to a set of structural constraints that restrict their content. Manual taxonomy construction, however, is time-consuming, often incomplete, and costly to maintain. Recent studies of large language models (LLMs) have shown that appropriate user inputs (an approach called prompting) can effectively guide LLMs such as GPT-3 on diverse NLP tasks without explicit (re-)training. In contrast, existing approaches to automated taxonomy construction typically fine-tune a language model by adjusting its parameters. In this paper, we present a general framework for taxonomy construction that takes structural constraints into account. We then conduct a systematic comparison between the prompting and fine-tuning approaches on a hypernym taxonomy and a novel computer science taxonomy dataset. Our results reveal the following: (1) even without explicit training on the dataset, the prompting approach outperforms fine-tuning-based approaches, and the performance gap between the two widens when the training dataset is small; however, (2) taxonomies generated by the fine-tuning approach can easily be post-processed to satisfy all the constraints, whereas handling constraint violations in taxonomies produced by the prompting approach can be challenging. These findings provide guidance on selecting the appropriate method for taxonomy construction and highlight potential enhancements for both approaches.
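To make the notion of structural constraints concrete, the following is a minimal sketch of a constraint checker for a generated taxonomy. The abstract does not list the exact constraints, so this example assumes three common ones: a single root, at most one parent per node, and no cycles. The function name `find_violations` and the edge-list representation are hypothetical choices for illustration, not the framework described in the paper.

```python
from collections import defaultdict


def find_violations(edges):
    """Check (parent, child) edges against assumed taxonomy constraints.

    Assumed constraints (not taken from the paper): exactly one root,
    at most one parent per node, and no cycles.
    Returns a list of (kind, detail) tuples; an empty list means valid.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)
        nodes.update((parent, child))

    violations = []

    # Constraint 1: exactly one node without a parent.
    roots = sorted(n for n in nodes if n not in parents)
    if len(roots) != 1:
        violations.append(("root_count", roots))

    # Constraint 2: every node has at most one parent.
    for child in sorted(parents):
        if len(parents[child]) > 1:
            violations.append(("multiple_parents", child))

    # Constraint 3: no cycles, checked with Kahn's topological sort.
    indegree = {n: len(parents.get(n, ())) for n in nodes}
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(nodes):
        violations.append(("cycle", None))

    return violations


valid = [
    ("science", "computer science"),
    ("computer science", "machine learning"),
    ("computer science", "databases"),
]
cyclic = [("a", "b"), ("b", "a")]
two_parents = [("a", "c"), ("b", "c")]
```

A checker of this kind only reports violations. Repairing them, for instance by deciding which edge of a cycle to drop, is the post-processing step that the abstract describes as easy for fine-tuned models and challenging for prompted ones.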
