Pre-trained models have achieved success in Chinese Short Text Matching (STM) tasks, but they often rely on superficial clues, which makes their predictions less robust. Addressing this issue requires analyzing and mitigating the influence of superficial clues on STM models. We investigate the over-reliance of STM models on the edit distance feature, which is commonly used to measure the semantic similarity of Chinese text pairs and can be regarded as a superficial clue. To mitigate this over-reliance, we propose a novel resampling training strategy called Gradually Learn Samples Containing Superficial Clue (GLS-CSC). Through comprehensive evaluations on In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets, we demonstrate that GLS-CSC outperforms existing methods in enhancing the robustness and generalization of Chinese STM models. Moreover, we conduct a detailed analysis of existing methods and reveal their commonality.
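To make the edit distance feature concrete, the following is a minimal sketch that assumes the feature is the character-level Levenshtein distance between the two texts of a pair. The abstract does not state the exact definition, so the function names `edit_distance` and `normalized_edit_distance`, the length normalization, and the example sentences are illustrative assumptions rather than the authors' implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed definition of the feature)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (hypothetical normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Illustrative pair: one inserted character reverses the meaning,
# yet the edit distance stays small.
pair = ("我喜欢你", "我不喜欢你")
print(edit_distance(*pair), normalized_edit_distance(*pair))
```

A pair such as the one above shows why this feature can act as a superficial clue: a model that treats a small edit distance as evidence of a match would label these two sentences as equivalent even though their meanings are opposite.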