This paper describes an end-to-end (E2E) neural architecture for the audio rendering of small portions of display content on low-resource personal computing devices. It is intended to address the problem of accessibility for vision-impaired or vision-distracted users at the hardware level. Neural image-to-text (ITT) and text-to-speech (TTS) approaches are reviewed, and a new technique is introduced to integrate them in a way that is both efficient and back-propagatable, yielding a non-autoregressive E2E image-to-speech (ITS) neural network that can be trained end to end. Experimental results show that, compared with the non-E2E approach, the proposed E2E system is 29% faster and uses 19% fewer parameters, at the cost of a 2% reduction in phone accuracy. A future direction for addressing this accuracy loss is presented.