We introduce "pointer-guided segment ordering" (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
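To make the segment-ordering objective concrete, the following is a minimal sketch of how a training instance and a pointer distribution could be formed. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names (`make_segment_ordering_example`, `restore_order`, `pointer_distribution`), the target encoding, and the use of scaled dot-product scores are all hypothetical choices made for the example.

```python
import math
import random


def make_segment_ordering_example(segments, rng):
    """Shuffle segments and build pointer targets (hypothetical encoding).

    targets[k] is the index, within the shuffled list, of the segment
    that originally stood at position k.
    """
    order = list(range(len(segments)))
    rng.shuffle(order)  # order[j] = original position of shuffled[j]
    shuffled = [segments[i] for i in order]
    targets = [0] * len(segments)
    for shuffled_idx, original_pos in enumerate(order):
        targets[original_pos] = shuffled_idx
    return shuffled, targets


def restore_order(shuffled, pointers):
    """Follow the pointers to rebuild the original sequence."""
    return [shuffled[j] for j in pointers]


def pointer_distribution(query, keys):
    """Softmax over scaled dot-product scores between one query vector
    and each candidate segment vector (assumed scoring function)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    segments = ["Background.", "Method.", "Results.", "Conclusion."]
    shuffled, targets = make_segment_ordering_example(segments, rng)
    print(shuffled, targets)
    print(restore_order(shuffled, targets))
```

In this sketch, a model would be trained so that, at each output step k, the pointer distribution over the shuffled segments places its mass on `targets[k]`; decoding the most probable index at each step then recovers the original order.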