We present Echo, a proof-of-concept audio system built around a single 25M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised in stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Lightweight heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with an unknown source count K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. Echo does not aim for a new SOTA on any single task; its contribution is the coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall facing end-to-end ASR through the VQ bottleneck, which still bounds the PoC.