Self-training has proven effective for cross-domain tasks, and in this study we explore its application to cross-domain constituency parsing. Traditional self-training methods rely on raw corpora that are limited in size and potentially low in quality. To overcome this limitation, we propose enhancing self-training with a large language model (LLM) that iteratively generates domain-specific raw corpora. For constituency parsing, we introduce grammar rules that guide the LLM in generating raw corpora, and we establish criteria for selecting pseudo instances. Our experimental results demonstrate that self-training for constituency parsing, when equipped with an LLM, outperforms traditional methods regardless of the LLM's performance. Moreover, combining grammar rules and confidence as criteria for pseudo-data selection yields the highest performance in cross-domain constituency parsing.
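The iterative procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`grammar_rules`, `select_pseudo`, `self_train`), the callables `generate`, `parse`, and `retrain`, the thresholds, and the rule-overlap measure are all hypothetical. The abstract does not say how the grammar-rule and confidence criteria are combined, so a simple conjunction of two thresholds is assumed here.

```python
from typing import Callable, List, Set, Tuple

Example = Tuple[str, str]  # (sentence, bracketed tree)


def grammar_rules(tree: str) -> Set[str]:
    """Collect non-lexical rules such as 'S -> NP VP' from a bracketed tree."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    rules: Set[str] = set()
    stack: List[List[str]] = []  # each frame holds [label, child_label, ...]
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            stack.append([tokens[i + 1]])  # the label follows the bracket
            i += 2
        elif tokens[i] == ")":
            label, *children = stack.pop()
            if children:  # skip preterminals, which only dominate a word
                rules.add(f"{label} -> {' '.join(children)}")
            if stack:
                stack[-1].append(label)
            i += 1
        else:
            i += 1  # a word; lexical rules are ignored in this sketch
    return rules


def select_pseudo(candidates, known_rules, min_conf, min_overlap) -> List[Example]:
    """Keep parses that are confident AND share enough rules with the training set.

    Assumption: the overlap ratio is a stand-in for the grammar-rule criterion.
    """
    selected = []
    for sentence, tree, conf in candidates:
        rules = grammar_rules(tree)
        overlap = len(rules & known_rules) / len(rules) if rules else 0.0
        if conf >= min_conf and overlap >= min_overlap:
            selected.append((sentence, tree))
    return selected


def self_train(
    train: List[Example],
    generate: Callable[[List[str]], List[str]],   # hypothetical LLM call
    parse: Callable[[str], Tuple[str, float]],    # returns (tree, confidence)
    retrain: Callable[[List[Example]], Callable[[str], Tuple[str, float]]],
    rounds: int,
    min_conf: float = 0.9,
    min_overlap: float = 0.5,
):
    """Run `rounds` iterations of generate -> parse -> select -> retrain."""
    train = list(train)
    for _ in range(rounds):
        known = set().union(*(grammar_rules(t) for _, t in train))
        sentences = generate(sorted(known))  # grammar rules guide generation
        candidates = [(s, *parse(s)) for s in sentences]
        train.extend(select_pseudo(candidates, known, min_conf, min_overlap))
        parse = retrain(train)  # the updated parser is used in the next round
    return train, parse
```

In practice, `generate` would wrap a prompt to an LLM that includes the extracted grammar rules, and `retrain` would fit a constituency parser on the enlarged training set. Both are left abstract here because the abstract does not describe them in detail.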