Audio is one of the most common modes of human communication, but it can also be easily misused to deceive people. With the AI revolution, the relevant technologies are now accessible to almost everyone, making it easy for criminals to commit fraud and forgery. In this work, we introduce a deep learning method for building a classifier that blindly classifies an input audio clip as real or mimicked; here 'blindly' refers to the ability to detect mimicked audio without a reference or the real source. The proposed model was trained on a set of important features extracted from a large audio dataset, and the resulting classifier was tested on the same set of features extracted from different audio clips. The data were extracted from two raw datasets composed specifically for this work: an all-English dataset and a mixed dataset (Arabic plus English). These datasets have been made available in raw form on GitHub for use by the research community at https://github.com/SaSs7/Dataset. For comparison, the audio clips were also classified through human inspection, with native speakers as the subjects. The results were interesting and showed high accuracy.