Compositionality in language models presents a problem when processing idiomatic expressions, because the meaning of such an expression often cannot be derived directly from its individual parts. Although fine-tuning and other optimization strategies can improve representations of idiomatic expressions, their effectiveness depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books (NCSSB) datasets, which are created by substituting synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but that quantity also plays a role for models without context inclusion strategies.