Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric long-form audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, forming a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands and extending the paradigm from Text2Speech to borderless long speech synthesis.
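To make the layered control idea concrete, here is a minimal sketch of what a Global-Sentence-Token structured command might look like. All class and field names below are hypothetical illustrations of the three annotation levels (scene semantics, sentence-level paralinguistics, token-level detail), not the paper's actual interface:

```python
from dataclasses import dataclass, field
from typing import List

# Token level: fine-grained phonetic/prosodic detail on individual words.
@dataclass
class TokenCue:
    word: str
    emphasis: bool = False

# Sentence level: per-utterance speaker, emotion, and token cues.
@dataclass
class SentenceSpec:
    text: str
    speaker: str
    emotion: str = "neutral"
    tokens: List[TokenCue] = field(default_factory=list)

# Global level: scene-wide semantics such as acoustic environment
# and the participating speakers.
@dataclass
class GlobalScene:
    environment: str
    speakers: List[str]
    sentences: List[SentenceSpec] = field(default_factory=list)

# A front-end LLM agent would emit a structure like this as the
# "wide-band control channel" consumed by the synthesis engine.
scene = GlobalScene(
    environment="noisy cafe",
    speakers=["A", "B"],
    sentences=[
        SentenceSpec(
            text="Wait, what?",
            speaker="B",
            emotion="surprised",
            tokens=[TokenCue(word="what", emphasis=True)],
        ),
    ],
)
print(scene.environment, scene.sentences[0].emotion)
```

The point of the sketch is only the hierarchy: a single text-based structure carries everything from the acoustic scene down to word-level emphasis, so any input modality the agent understands can be compiled into it.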