Pre-trained language models based on the masked language modeling (MLM) objective excel at natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, the recent trend of scaling decoder models to multiple billions of parameters has produced large language models (LLMs) that are competitive with MLM-based encoders. Although scale amplifies their prowess in NLU tasks, LLMs fall short of state-of-the-art (SOTA) results in information extraction (IE) tasks, many of which are framed as sequence labeling (SL). Whether this is an intrinsic limitation of LLMs or whether their SL performance can be improved remains unclear. To address this, we explore strategies to enhance the SL performance of "open" LLMs (Llama2 and Mistral) on IE tasks. We investigate bidirectional information flow within groups of decoder blocks, applying layer-wise removal or enforcement of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with SOTA SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that "open" LLMs with layer-dependent CM removal outperform strong MLM-based encoders and instruction-tuned LLMs. However, we observe no effect from CM removal at a small scale when maintaining an equivalent model size, number of pre-training steps, and pre-training and fine-tuning data.
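To make layer-wise removal of the causal mask concrete, the following is a minimal sketch of how per-layer attention masks could be assembled. It is an illustration under stated assumptions, not the implementation used in this work: the function name `build_layer_masks`, the split of the decoder into equal contiguous groups, and the default of four groups are all hypothetical, and padding masks are left out.

```python
def build_layer_masks(num_layers, seq_len, unmasked_groups, num_groups=4):
    """Return one boolean attention mask per decoder layer (hypothetical helper).

    masks[l][i][j] is True when position i may attend to position j in layer l.
    Layers are split into `num_groups` equal contiguous groups (an assumption);
    groups listed in `unmasked_groups` have the causal mask removed.
    """
    if num_layers % num_groups != 0:
        raise ValueError("num_layers must be divisible by num_groups")
    group_size = num_layers // num_groups

    # Causal mask: each position attends only to itself and earlier positions.
    causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Bidirectional mask: every position attends to every position.
    full = [[True] * seq_len for _ in range(seq_len)]

    # The same mask object is shared across layers, so treat it as read-only.
    return [
        full if (layer // group_size) in unmasked_groups else causal
        for layer in range(num_layers)
    ]


# Example: 8 layers in 4 groups; causal mask removed from the last two groups.
masks = build_layer_masks(num_layers=8, seq_len=4, unmasked_groups={2, 3})
```

In this sketch, layers 0 to 3 keep the causal mask while layers 4 to 7 attend bidirectionally. Passing all group indices in `unmasked_groups` corresponds to removing the mask from every block, and passing an empty set corresponds to a standard causal decoder.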