Audio is one of the most common modes of human communication, but it is also easily misused to deceive people. With the AI revolution, the relevant technologies are now accessible to almost everyone, making it simple for criminals to commit crimes and forgery. In this work, we introduce a neural network method for building a classifier that blindly classifies an input audio clip as real or mimicked; here 'blindly' refers to the ability to detect mimicked audio without a reference or the real source. The proposed model was trained on a set of important features extracted from a large audio dataset, and the resulting classifier was tested on the same set of features extracted from different audio clips. The data were extracted from two raw datasets composed specifically for this work: an all-English dataset and a mixed dataset (Arabic plus English). These datasets have been made available in raw form to the research community on GitHub at https://github.com/SaSs7/Dataset. For comparison, the audio clips were also classified through human inspection, with native speakers as the subjects. The results were interesting and showed high accuracy.