The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains. There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared with other ML areas, e.g., computer vision or natural language processing. A major limitation for audio is data availability: because audio is a time-dependent modality, high-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to larger, more generalizable datasets. In this short white paper, to encourage researchers with limited access to large datasets, the organizers first outline several open-source datasets that are available to the community and, for the duration of the workshop, make several proprietary datasets available. Namely, two vocal datasets, Hume-Prosody and Hume-VocalBurst; an acted emotional speech dataset, Modulate-Sonata; and an in-game streamer dataset, Modulate-Stream. We outline the current baselines on these datasets but encourage researchers from across audio to utilize them beyond the initial baseline tasks.