Research on Deep Learning applications in sound and music computing has attracted growing interest in recent years; however, a link is still missing between these new technologies and how they can be incorporated into real-world artistic practices. In this work, we explore a well-known Deep Learning architecture, the Variational Autoencoder (VAE). VAEs have been used in many areas to generate latent spaces in which data points are organized so that similar data points lie closer to each other. Previously, VAEs have been used to generate latent timbre spaces or latent spaces of symbolic music excerpts. Applying a VAE to audio features of timbre requires a vocoder to transform the timbre generated by the network into an audio signal, which is computationally expensive. In this work, we apply VAEs directly to raw audio data, bypassing audio feature extraction. This approach allows practitioners to use any audio recording while giving them flexibility and control over the aesthetics through dataset curation. The lower computation time of audio signal generation allows the raw audio approach to be incorporated into real-time applications. We propose three strategies for exploring latent spaces of audio and timbre in sound design applications. In doing so, we aim to initiate a conversation on artistic approaches and strategies for using latent audio spaces in sound and music practices.
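To make the raw-audio idea concrete, the following is a minimal sketch of a single VAE forward pass on one frame of raw audio samples. It is an illustration under stated assumptions, not the architecture used in this work: the single linear encoder and decoder layers, the frame size, the latent dimension, the unweighted sum of reconstruction error and KL term, and all function names are hypothetical choices made for brevity. Training (gradient updates) is omitted.

```python
import math
import random


def init_linear(n_out, n_in, rng, scale=0.1):
    """Return (weights, bias) for a small randomly initialised linear layer."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_divergence(mu, logvar):
    """KL divergence between N(mu, sigma^2) and N(0, I), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_forward(frame, params, rng):
    """Encode a raw audio frame, sample a latent vector, and decode it again."""
    mu = linear(frame, *params["enc_mu"])
    logvar = linear(frame, *params["enc_logvar"])
    z = reparameterize(mu, logvar, rng)
    # tanh keeps the reconstructed samples in the [-1, 1] audio range.
    recon = [math.tanh(v) for v in linear(z, *params["dec"])]
    mse = sum((r - x) ** 2 for r, x in zip(recon, frame)) / len(frame)
    return recon, z, mse + kl_divergence(mu, logvar)


# Hypothetical sizes, chosen only for illustration.
FRAME_SIZE, LATENT_DIM = 64, 4
rng = random.Random(0)
params = {
    "enc_mu": init_linear(LATENT_DIM, FRAME_SIZE, rng),
    "enc_logvar": init_linear(LATENT_DIM, FRAME_SIZE, rng),
    "dec": init_linear(FRAME_SIZE, LATENT_DIM, rng),
}

# A 440 Hz sine at an assumed 44.1 kHz sample rate stands in for recorded audio.
frame = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(FRAME_SIZE)]
recon, z, loss = vae_forward(frame, params, rng)
```

Because the decoder outputs audio samples directly, no vocoder stage is needed to turn a latent vector back into sound; in this sketch, decoding costs one matrix multiplication per frame.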