Plagiarism detection involves finding similar items in two different sources. In this article, we propose a novel plagiarism detection method based on an attention-based long short-term memory (LSTM) network and bidirectional encoder representations from transformers (BERT) word embeddings, enhanced with an optimized differential evolution (DE) method for pre-training and a focal loss function for training. BERT can be included in a downstream task and fine-tuned as a task-specific structure, and the trained BERT model is capable of detecting various linguistic characteristics. Class imbalance is one of the primary issues in plagiarism detection. To address it, we propose a focal loss-based training technique that learns minority-class instances carefully. Another issue we tackle is the training phase itself, which typically relies on gradient-based methods such as back-propagation (BP) and therefore suffers from drawbacks including sensitivity to initialization. To initialize the BP process, we propose a novel DE algorithm that uses a clustering-based mutation operator: a winning cluster is identified in the current DE population, and a new updating strategy is used to generate candidate solutions. We evaluate the proposed approach on three benchmark datasets (MSRP, SNLI, and SemEval2014) and demonstrate that it performs well compared with both conventional and population-based methods.
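To illustrate why a focal loss helps with class imbalance, the sketch below implements the commonly used binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The abstract does not state the exact form used in this work, so the formulation, the function names, and the values gamma = 2.0 and alpha = 0.25 are assumptions chosen for illustration only.

```python
import math


def cross_entropy(p, y):
    """Plain binary cross-entropy for one example (p = predicted P(y = 1))."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    gamma and alpha are illustrative defaults, not values taken from the paper.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):  # easy, ambiguous, hard positive example
        print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With gamma = 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy. With gamma > 0, confidently correct examples contribute far less than misclassified ones, so examples the model already classifies well do not dominate the gradient.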