EchoVest: Real-Time Sound Classification and Depth Perception Expressed through Transcutaneous Electrical Nerve Stimulation

Over 1.5 billion people worldwide live with some degree of hearing loss. Although many technologies have been developed for people with such disabilities, most are either extremely expensive or impractical for everyday use in low- and middle-income countries. To address this problem, we developed EchoVest, a new assistive device that helps blind or deaf people become more intuitively aware of their surroundings. EchoVest uses transcutaneous electrical nerve stimulation (TENS) to deliver vibration-like sensations to the user's body according to the source of each sound. It also provides sound localization, sound classification, noise reduction, and depth perception. We aimed to outperform convolutional neural network (CNN)-based models, the most commonly used models for classification tasks, in both accuracy and computational cost. To do so, we developed a novel audio pipeline that adapts the Audio Spectrogram Transformer (AST), an attention-based model, for sound classification and uses Fast Fourier Transforms (FFTs) for noise reduction. Applying Otsu's method allowed us to find optimal thresholds for filtering background noise, which substantially improved accuracy. To calculate direction and depth accurately, we applied complex time difference of arrival (TDOA) algorithms and state-of-the-art (SOTA) localization methods. Finally, we used blind source separation to make our algorithms applicable to inputs from multiple microphones. The final system achieved state-of-the-art results on several benchmarks, including 95.7% accuracy on the ESC-50 environmental sound classification dataset.
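
The abstract names Otsu thresholding and TDOA localization without giving details. The following is a minimal, illustrative sketch under stated assumptions, not EchoVest's implementation: `otsu_threshold` splits hypothetical per-frame energies (in dB) into background and sound-event classes, and the delay between two microphones is estimated with GCC-PHAT, a standard TDOA estimator swapped in here because the paper's "complex TDOA" algorithm is not specified. `bearing_deg` converts that delay into an angle using a far-field two-microphone model. All function names, the 343 m/s speed of sound, and the microphone spacing are hypothetical choices.

```python
from typing import Optional

import numpy as np


def otsu_threshold(values: np.ndarray, bins: int = 256) -> float:
    """Otsu's method on a 1-D histogram: return the threshold that maximizes
    the between-class variance (assumed here: background vs. sound-event frames)."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    p = hist / hist.sum()                  # normalized histogram
    w0 = np.cumsum(p)                      # weight of the lower (background) class
    w1 = 1.0 - w0                          # weight of the upper (event) class
    mu_cum = np.cumsum(p * centers)        # cumulative first moment
    mu_total = mu_cum[-1]
    sigma_b = np.zeros_like(w0)
    valid = (w0 > 0) & (w1 > 1e-12)        # skip splits that leave a class empty
    sigma_b[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / (w0[valid] * w1[valid])
    k = int(np.argmax(sigma_b))
    return float(edges[k + 1])             # values above this are treated as events


def gcc_phat_tdoa(sig: np.ndarray, ref: np.ndarray, fs: float,
                  max_tau: Optional[float] = None) -> float:
    """Delay in seconds by which `sig` lags `ref`, estimated with GCC-PHAT."""
    n = len(sig) + len(ref)                # zero-pad to avoid circular wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12         # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:                # physical limit, e.g. mic spacing / c
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so index 0 is lag -max_shift and index max_shift is lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs


def bearing_deg(tau: float, mic_spacing_m: float, c: float = 343.0) -> float:
    """Far-field angle of arrival in degrees from broadside for one mic pair.
    Positive angles mean the sound reached the reference microphone first."""
    s = np.clip(c * tau / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

For example, with `fs = 16000` and a broadband signal that reaches one microphone 7 samples later than the other, `gcc_phat_tdoa` returns 7/16000 s, and for a hypothetical 0.2 m spacing `bearing_deg` gives roughly 48.6 degrees. The PHAT weighting whitens the cross-spectrum, which sharpens the correlation peak and tends to make the delay estimate more robust in reverberant rooms.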