With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, heightening the need for technical solutions to copyright and ownership issues. While existing work has mainly focused on short audio segments, full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. We then enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our approach outperforms the previous model and recent baselines, achieving state-of-the-art results in AI-generated music detection.
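The abstract does not specify the Gated Fusion Layer's exact formulation. A common way such gating is realized, and purely an illustrative assumption here, is a learned sigmoid gate that interpolates between a segment-level content embedding and a structural (long-term context) embedding. The function name `gated_fusion` and the single-matrix parameterization below are hypothetical, not taken from the paper:

```python
import numpy as np

def gated_fusion(content, structure, W, b):
    """Illustrative gated fusion (assumed form, not the paper's exact layer).

    A sigmoid gate, computed from the concatenated embeddings, decides
    element-wise how much of the content vs. structural embedding to keep.
    """
    z = np.concatenate([content, structure], axis=-1) @ W + b
    gate = 1.0 / (1.0 + np.exp(-z))  # sigmoid gate in (0, 1)
    return gate * content + (1.0 - gate) * structure

# Tiny demo: with zero weights the gate is 0.5, so the output is the
# element-wise average of the two embeddings.
d = 4
content = np.ones(d)
structure = np.zeros(d)
W = np.zeros((2 * d, d))
b = np.zeros(d)
fused = gated_fusion(content, structure, W, b)
```

In a trained model, `W` and `b` would be learned jointly with the rest of the network, letting the gate adaptively favor content or structural cues per dimension.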