Detecting out-of-policy speech (OOPS) content is important but difficult. Machine learning is a powerful tool for this task, but its performance ceiling is hard to break because of limits on the quantity and quality of training data and inconsistencies in the OOPS definition and in data labeling. To make full use of the limited resources available, we propose a meta-learning technique (MLT) that combines individual models built with different text representations. We show analytically that the resulting technique is numerically stable and produces reasonable combining weights. We combine the MLT with a threshold-moving (TM) technique to further improve the performance of the combined predictor on highly imbalanced in-distribution and out-of-distribution datasets. We also provide computational results that show statistically significant advantages for the proposed MLT approach. All authors contributed equally to this work.
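The abstract names two ingredients: a weighted combination of individual models and a threshold-moving step for imbalanced data. The following is a minimal Python sketch of that general idea only, not the authors' MLT. It assumes fixed equal combining weights (the abstract does not say how the weights are computed), uses F1 on a validation set as the threshold-selection criterion (also an assumption), and runs on made-up toy scores. All function names are hypothetical.

```python
def combine(prob_lists, weights):
    """Weighted average of per-model positive-class probabilities.

    Assumption: weights are given; they are normalized to sum to 1 here.
    """
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, probs)) / total
        for probs in zip(*prob_lists)
    ]


def f1_at(scores, labels, threshold):
    """F1 score when predicting positive for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def move_threshold(scores, labels):
    """Pick the observed score that maximizes validation F1.

    Candidates are scanned in ascending order, so ties go to the lowest one.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


# Toy validation set (made up): 2 positives out of 8 examples.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0.05, 0.10, 0.20, 0.15, 0.30, 0.45, 0.40, 0.35]
model_b = [0.10, 0.05, 0.10, 0.25, 0.20, 0.15, 0.30, 0.45]

combined = combine([model_a, model_b], weights=[0.5, 0.5])
best_t = move_threshold(combined, labels)

print("F1 at default threshold 0.5:", f1_at(combined, labels, 0.5))
print("moved threshold:", best_t)
print("F1 at moved threshold:", f1_at(combined, labels, best_t))
```

In this toy case no combined score reaches 0.5, so the default threshold flags nothing, while a lower threshold chosen on the validation set separates the two positives. In practice the threshold would be selected on held-out data and then evaluated on a separate test set.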