Synthesizing high-quality instruction data from unsupervised text is a promising paradigm for training large language models (LLMs), yet automated methods for this task still exhibit significant limitations in the diversity and difficulty of the synthesized instructions. To address these challenges, we propose Self-Foveate, an LLM-driven method for instruction synthesis. Inspired by hierarchical human visual perception, Self-Foveate introduces a "Micro-Scatter-Macro" multi-level foveation methodology that guides the extraction of textual information at three complementary granularities, from fine-grained details through cross-region connections to holistic patterns, thereby enhancing both the diversity and difficulty of synthesized instructions. Furthermore, a re-synthesis module is incorporated to improve the fidelity of instructions to the source text and their overall quality. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures demonstrate that Self-Foveate consistently outperforms existing methods. We publicly release our code at https://github.com/Mubuky/Self-Foveate.
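To make the described pipeline concrete, below is a minimal sketch of the Micro-Scatter-Macro foveation flow with a re-synthesis pass, under stated assumptions: the prompts, function names, and the `llm` callable are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch of the Self-Foveate pipeline as summarized in the abstract.
# The prompts, function names, and the `llm` callable are hypothetical
# illustrations; see the repository for the authors' actual implementation.

from typing import Callable, List

# Hypothetical prompts for the three foveation granularities.
FOVEATION_PROMPTS = {
    "micro": "Extract fine-grained details (entities, facts, figures) from the text:\n{text}",
    "scatter": "Identify connections that link distant regions of the text:\n{text}",
    "macro": "Summarize the holistic patterns and main arguments of the text:\n{text}",
}

SYNTHESIS_PROMPT = (
    "Using the following extracted information, write one challenging "
    "instruction that is answerable from the source text.\nInformation: {info}"
)

RESYNTHESIS_PROMPT = (
    "Revise this instruction so it is faithful to the source text and "
    "well formed.\nSource text: {text}\nInstruction: {instruction}"
)


def self_foveate(text: str, llm: Callable[[str], str]) -> List[str]:
    """Synthesize instructions from unsupervised text at three granularities."""
    instructions = []
    for level, prompt in FOVEATION_PROMPTS.items():
        # 1. Foveate: extract information at this granularity.
        info = llm(prompt.format(text=text))
        # 2. Synthesize a candidate instruction from the extraction.
        candidate = llm(SYNTHESIS_PROMPT.format(info=info))
        # 3. Re-synthesize to improve fidelity to the source and overall quality.
        final = llm(RESYNTHESIS_PROMPT.format(text=text, instruction=candidate))
        instructions.append(final)
    return instructions
```

Given any text-in, text-out LLM client wrapped as `llm`, calling `self_foveate(document, llm)` would yield one instruction per granularity; the key design point is that each candidate is grounded in a different extraction level before the re-synthesis pass enforces fidelity to the source.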