Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.
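The core idea behind LogitMatch, constraining decoding so the generated text must be a contiguous span of the input, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it works on plain token strings rather than real model vocabularies, and `logits_fn` stands in for an actual LLM's next-token scores. At each step, only tokens that keep the output a valid continuation of some input span are eligible.

```python
# Hedged sketch of span-constrained greedy decoding. Token IDs and the
# mock logit function are illustrative stand-ins, not the paper's method.
import math

def allowed_next(input_tokens, prefix):
    """Return tokens that keep `prefix` extendable as a contiguous input span."""
    if not prefix:
        return set(input_tokens)
    n, m = len(input_tokens), len(prefix)
    allowed = set()
    for i in range(n - m):  # i + m must index a following token
        if input_tokens[i:i + m] == prefix:
            allowed.add(input_tokens[i + m])
    return allowed

def constrained_greedy(input_tokens, logits_fn, max_len=5):
    """Greedy decoding, masking out tokens that would break span validity."""
    out = []
    for _ in range(max_len):
        allowed = allowed_next(input_tokens, out)
        if not allowed:  # span cannot be extended further
            break
        scores = logits_fn(out)
        out.append(max(allowed, key=lambda t: scores.get(t, -math.inf)))
    return out

# Toy example: the unconstrained argmax would happily emit "City City City",
# but the constraint forces a real span of the input.
tokens = ["New", "York", "City", "is", "big"]
span = constrained_greedy(tokens, lambda prefix: {"City": 3.0, "York": 2.0, "New": 1.0})
print(span)  # ['City', 'is', 'big'] -- a valid contiguous input span
```

In a real system the mask would be applied over vocabulary logits before sampling (e.g. setting disallowed entries to negative infinity), but the invariant is the same: every decoded prefix matches some span of the input, which eliminates the fuzzy post-hoc span matching that plain content-matching prompts require.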