Social media data is plagued by redundancy, a consequence of its noisy nature, which increases training time and introduces model bias. To address this issue, we propose a novel approach called generative deduplication. It aims to remove duplicate text from noisy social media data and thereby mitigate model bias, which in turn improves social media language understanding performance and saves training time. Extensive experiments demonstrate that generative deduplication effectively reduces the number of training samples while improving performance. These results indicate that generative deduplication is effective and that deduplication matters for social media language understanding.