Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a modular system architecture with configuration-driven dataset composition and unified Python-based workflows. ESPnet3 introduces a DataOrganizer abstraction for flexible dataset integration and dataset sharding for memory-efficient large-scale training, while allowing recipe-specific logic through lightweight stage overrides. In OWSM pre-training experiments, ESPnet3 reduces per-epoch training time by \emph{21.1 minutes} compared to ESPnet2 and achieves \emph{>80\% GPU utilization} in multi-node training. Fine-tuning experiments show that new models and datasets can be integrated with around \emph{46 lines of additional code}. ESPnet3 will be publicly released with model checkpoints and training logs.
翻译:暂无翻译