Faked Speech Detection with Zero Prior Knowledge

Audio is one of the most widely used channels of human communication, but it is also easily misused to deceive people. With the AI revolution, the relevant technologies are now accessible to almost everyone, making it simple for criminals to commit fraud and forgery. In this work, we introduce a neural network method for building a classifier that blindly classifies an input audio clip as real or mimicked; 'blindly' refers to the ability to detect mimicked audio without any reference or real source. We propose a deep neural network that follows a sequential model comprising three hidden layers, with alternating dense and dropout layers. The model was trained on a set of 26 important features extracted from a large dataset of audio clips, and the resulting classifier was tested on the same set of features extracted from different audio clips. The data were extracted from two raw datasets composed specifically for this work: an all-English dataset and a mixed dataset (Arabic plus English). The raw datasets are available on request by email to the first author. For comparison, the audio clips were also classified by human inspection, with native speakers as the subjects. The classifier correctly classified at least 94% of the test cases, compared with 85% accuracy for the human observers.
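The architecture described in the abstract can be illustrated with a minimal forward-pass sketch in plain Python. Only the 26 input features, the three hidden layers, and the dense/dropout alternation come from the abstract. The layer widths, dropout rate, ReLU activation, sigmoid output, and the convention that the output is the probability of "mimicked" are all assumptions made for illustration, and the weights here are random rather than trained, so this is not the authors' implementation.

```python
import math
import random

N_FEATURES = 26               # stated in the abstract
HIDDEN_SIZES = [64, 32, 16]   # assumed widths; the abstract gives none
DROPOUT_RATE = 0.2            # assumed rate; the abstract gives none


def init_dense(n_in, n_out, rng):
    """Create one dense layer as (weights, biases) with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine transform: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: active only in training, identity at inference."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def build_model(seed=0):
    """Three hidden dense layers plus one output layer (untrained)."""
    rng = random.Random(seed)
    sizes = [N_FEATURES] + HIDDEN_SIZES + [1]
    return [init_dense(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_proba(model, features, training=False, rng=None):
    """Return a value in (0, 1); read here as P(mimicked), by assumption."""
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    rng = rng or random.Random(0)
    x = list(features)
    for layer in model[:-1]:
        # Each hidden block is a dense layer followed by a dropout layer.
        x = dropout(relu(dense(x, layer)), DROPOUT_RATE, rng, training)
    logit = dense(x, model[-1])[0]
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    model = build_model()
    clip_features = [0.5] * N_FEATURES  # placeholder feature vector
    print(predict_proba(model, clip_features))
```

In practice such a network would be trained with a deep learning framework on the extracted feature vectors; the sketch only shows how a 26-dimensional input flows through alternating dense and dropout layers to a single real-versus-mimicked score.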

