This work introduces a self-optimizing virtual processor (VP) for numerical array programs that shifts parallelization from a manual developer task to a cooperative, agent-like runtime mechanism. Instead of relying on centralized task-graph scheduling, static compiler optimization, or explicitly annotated parallel constructs, the VP uses a decentralized network of cooperative execution segments, derived at runtime from the stream of numerical instructions and their data dependencies. Each segment makes only local decisions about when, where, and how to prepare and execute its computation, including task placement, kernel preparation, and data movement. No central scheduler or mapper determines the execution globally; instead, scheduling itself is parallelized and distributed over time, proceeding asynchronously and strictly dependency-driven. The overall execution strategy emerges from concurrently executing local segments that continuously respond to data availability, cost estimates, system state, hardware capabilities, and problem size. While preserving the sequential semantics of the program, the VP automatically exploits parallelism across large program regions rather than being limited to individual loop bodies, modules, or explicitly marked parallel sections; developers are not required to design or encode a parallelization strategy. The VP currently targets primarily low-latency strong scaling on local heterogeneous hardware, covering workloads from small, latency-sensitive array operations to large data-parallel computations. The present implementation supports the predefined array instruction set of the ILNumerics.ONAL domain-specific language, while the underlying concept is applicable to general array-based numerical programming models such as MATLAB and NumPy.
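To make the idea of dependency-driven segments with purely local decisions more concrete, the following is a minimal Python sketch. It is not the ILNumerics implementation: the names `Segment`-like helpers `run_segment`, `pick_device`, and `run_program`, the two devices, and the cost constants are all hypothetical, chosen only for illustration. Each instruction is turned into a segment as the program is read in sequential order; a segment waits only for its own inputs and then picks a device from a local cost estimate. The `asyncio` event loop merely stands in for the hardware workers and plays no role in placement or ordering decisions.

```python
import asyncio

# Hypothetical cost model per device: (fixed launch latency, cost per element).
# The numbers are illustrative assumptions, not measured values.
DEVICES = {"cpu": (1.0, 0.010), "gpu": (50.0, 0.001)}


def pick_device(n_elements):
    """Local decision: choose the device with the lowest estimated cost."""
    return min(DEVICES, key=lambda d: DEVICES[d][0] + DEVICES[d][1] * n_elements)


async def run_segment(name, op, deps, placement):
    """One execution segment: wait for own inputs, decide locally, execute."""
    # Strictly dependency-driven: the segment awaits only its direct producers.
    args = [await dep for dep in deps]
    # Problem size is taken from the inputs that just became available.
    size = max((len(a) for a in args), default=0)
    placement[name] = pick_device(size)
    return op(*args)


async def run_program(n):
    """Emit segments in sequential program order; execution order emerges."""
    placement = {}

    def emit(name, op, *deps):
        # Creating the task starts the segment; no global plan is computed.
        return asyncio.create_task(run_segment(name, op, deps, placement))

    a = emit("a", lambda: [float(i) for i in range(n)])
    b = emit("b", lambda: [2.0] * n)
    c = emit("c", lambda x, y: [u + v for u, v in zip(x, y)], a, b)
    d = emit("d", lambda x: [u * u for u in x], a)  # independent of b and c
    e = emit("e", lambda x, y: sum(u - v for u, v in zip(x, y)), c, d)
    return await e, placement


if __name__ == "__main__":
    for n in (4, 10_000):
        result, placement = asyncio.run(run_program(n))
        print(n, result, placement)
```

In this sketch, segments `c` and `d` become runnable independently as soon as their own inputs exist, and the same program text is placed differently for small and large `n` because each segment evaluates the cost model against the size of its actual inputs. The result is identical to that of the sequential program in both cases.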