Audio is one of the most widely used modes of human communication, but it can also be easily misused to deceive people. With the AI revolution, the relevant technologies are now accessible to almost everyone, making it easier for criminals to commit fraud and forgery. In this work, we introduce a neural network method for building a classifier that blindly classifies an input audio clip as real or mimicked; here, 'blindly' refers to the ability to detect mimicked audio without a reference or the original source. We propose a deep neural network based on a sequential model with three hidden layers, in which dense and dropout layers alternate. The model was trained on a set of 26 important features extracted from a large audio dataset, and the resulting classifier was tested on the same set of features extracted from different audio clips. The data were drawn from two raw datasets compiled specifically for this work: an all-English dataset and a mixed (Arabic plus English) dataset. (The raw datasets are available on request by email to the first author.) For comparison, the audio clips were also classified by human listeners who were native speakers. The results were encouraging: the classifier correctly classified at least 94% of the test cases, compared with 85% accuracy for the human observers.
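The architecture described in the abstract (26 input features, three hidden dense layers alternating with dropout, and a real-versus-mimicked decision) can be illustrated with a minimal NumPy sketch of the forward pass. The abstract does not state the layer widths, activation functions, dropout rate, output encoding, or training procedure, so `HIDDEN_UNITS`, `DROPOUT_RATE`, the ReLU and sigmoid choices, and the convention that the output is the probability of "mimicked" are all assumptions made for illustration only; the weights are random and untrained.

```python
import numpy as np

N_FEATURES = 26              # from the abstract: 26 extracted features per clip
HIDDEN_UNITS = (64, 32, 16)  # assumed widths; the abstract does not give them
DROPOUT_RATE = 0.2           # assumed; not stated in the abstract


def init_params(rng):
    """Create weights and biases for three hidden dense layers plus one output unit."""
    sizes = (N_FEATURES, *HIDDEN_UNITS, 1)
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
         np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x, rng=None):
    """Return P(mimicked) for each row of x; dropout is applied only if rng is given."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)       # dense layer with ReLU (assumed activation)
        if rng is not None:                  # inverted dropout, training mode only
            keep = rng.random(h.shape) >= DROPOUT_RATE
            h = h * keep / (1.0 - DROPOUT_RATE)
    w, b = params[-1]
    logits = h @ w + b
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))  # sigmoid output (assumed)


rng = np.random.default_rng(0)
params = init_params(rng)                    # random, untrained weights
batch = rng.normal(size=(4, N_FEATURES))     # stand-in for four real feature vectors
probs = forward(params, batch)               # inference mode: no dropout
labels = np.where(probs >= 0.5, "mimicked", "real")
```

Passing a generator as the third argument, as in `forward(params, batch, rng)`, switches the sketch to training mode, where each hidden activation is dropped with probability `DROPOUT_RATE` and the survivors are rescaled so the expected activation is unchanged.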