Detecting out-of-policy speech (OOPS) content is important but difficult. Machine learning is a powerful tool for this task, but its performance ceiling is hard to break because training data are limited in quantity and quality and because OOPS definitions and data labeling are inconsistent. To realize the full potential of the limited resources available, we propose a meta-learning technique (MLT) that combines individual models built with different text representations. We show analytically that the resulting technique is numerically stable and produces reasonable combining weights. We combine the MLT with a threshold-moving (TM) technique to further improve the performance of the combined predictor on highly imbalanced in-distribution and out-of-distribution datasets. We also provide computational results that show statistically significant advantages of the proposed MLT approach. All authors contributed equally to this work.
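The abstract does not specify how the combining weights or the decision threshold are computed, so the following is only a minimal illustrative sketch of the general idea, not the proposed MLT itself. It assumes each base model outputs a probability-like score per example, fits non-negative combining weights that sum to one with a ridge-regularized least-squares step (a stand-in chosen for illustration), and then moves the decision threshold to maximize F1 on held-out data. The function names, the ridge value, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_combining_weights(scores, labels, ridge=1e-3):
    """Fit illustrative combining weights for base-model scores.

    scores: array of shape (n_samples, n_models), one column per base model.
    labels: array of shape (n_samples,) with values in {0, 1}.
    The ridge term keeps the linear solve well conditioned when base
    models are highly correlated (an assumption of this sketch).
    """
    n_models = scores.shape[1]
    gram = scores.T @ scores + ridge * np.eye(n_models)
    raw = np.linalg.solve(gram, scores.T @ labels)
    clipped = np.clip(raw, 0.0, None)  # drop negative weights
    total = clipped.sum()
    if total == 0.0:
        # Fall back to a plain average if every weight was clipped away.
        return np.full(n_models, 1.0 / n_models)
    return clipped / total  # normalize so the weights sum to one


def f1_score(labels, predictions):
    """F1 for binary labels and predictions; 0.0 when undefined."""
    tp = np.sum((predictions == 1) & (labels == 1))
    fp = np.sum((predictions == 1) & (labels == 0))
    fn = np.sum((predictions == 0) & (labels == 1))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def move_threshold(combined, labels):
    """Threshold moving: pick the cut-off that maximizes F1 on held-out data."""
    candidates = np.arange(1, 100) / 100.0  # 0.01, 0.02, ..., 0.99
    best = max(
        candidates,
        key=lambda t: f1_score(labels, (combined >= t).astype(int)),
    )
    return float(best)


# Synthetic, highly imbalanced data (about 5% positives), for illustration only.
rng = np.random.default_rng(0)
demo_labels = (rng.random(2000) < 0.05).astype(int)
demo_scores = np.column_stack([
    np.clip(0.20 + 0.30 * demo_labels + 0.15 * rng.standard_normal(2000), 0, 1),
    np.clip(0.20 + 0.25 * demo_labels + 0.20 * rng.standard_normal(2000), 0, 1),
])

demo_weights = fit_combining_weights(demo_scores, demo_labels)
demo_combined = demo_scores @ demo_weights
demo_threshold = move_threshold(demo_combined, demo_labels)
print("weights:", demo_weights)
print("threshold:", demo_threshold)
```

On imbalanced data the default cut-off of 0.5 is rarely the best operating point, which is why the threshold is tuned separately from the weights. In practice the weights and the threshold would be fitted on a validation split rather than on the data used to train the base models.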