We present a comprehensive approach to multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the previous best result on this dataset (Qwen-72B, 57.8% F1) by 12 points while using 165x fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling to address severe class imbalance in the training data. We confirm that our method generalizes to the STREUSLE dataset, achieving 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployment.
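To make the label reformulation in point (1) concrete, the following sketch converts span annotations into per-token binary START/END/INSIDE labels and decodes them back. This is an illustrative assumption, not the authors' code: the tag names, the contiguous-span simplification, and the greedy decoding heuristic are ours, and handling discontinuous MWEs would require a richer decoder.

```python
# Illustrative sketch (assumed, not the authors' implementation):
# recasting MWE detection as binary token-level START/END/INSIDE
# classification instead of span prediction.

def spans_to_tags(n_tokens, spans):
    """Encode MWE spans as per-token binary flags.

    spans: list of (start, end) token-index pairs, end inclusive.
    Simplified to contiguous spans for clarity.
    """
    tags = [{"START": 0, "END": 0, "INSIDE": 0} for _ in range(n_tokens)]
    for start, end in spans:
        tags[start]["START"] = 1
        tags[end]["END"] = 1
        for i in range(start, end + 1):
            tags[i]["INSIDE"] = 1
    return tags

def tags_to_spans(tags):
    """Greedy decoding: pair each START with the next END flag."""
    spans, open_start = [], None
    for i, t in enumerate(tags):
        if t["START"] and open_start is None:
            open_start = i
        if t["END"] and open_start is not None:
            spans.append((open_start, i))
            open_start = None
    return spans
```

Each token thus receives three independent binary decisions rather than one span prediction, which is what allows an ordinary token classifier head to be trained on the task.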