This work introduces a self-optimizing virtual processor (VP) for numerical array programs that shifts parallelization from a manual developer task to a cooperative, agent-like runtime mechanism. Instead of relying on centralized task-graph scheduling, static compiler optimization, or explicitly annotated parallel constructs, the VP uses a decentralized network of cooperative execution segments, derived at runtime from the stream of numerical instructions and their data dependencies. Each segment makes only local decisions about when, where, and how to prepare and execute its computation, including task placement, kernel preparation, and data movement. No central scheduler or mapper determines the execution globally; instead, scheduling itself is parallelized and distributed over time, asynchronously and strictly dependency-driven. The overall execution strategy emerges from concurrently executing local segments that continuously respond to data availability, cost estimates, system state, hardware capabilities, and problem size. While preserving the sequential semantics of the program, the VP automatically exploits parallelism across large program regions rather than being limited to individual loop bodies, modules, or explicitly marked parallel sections; developers are not required to design or encode a parallelization strategy. The VP currently targets primarily low-latency strong scaling on local heterogeneous hardware, covering workloads from small, latency-sensitive array operations to large data-parallel computations. The current implementation targets the predefined array instruction set of the ILNumerics ONAL domain-specific language, available at https://github.com/ILNumerics/ILNumerics.ONAL, while the underlying concept is applicable to general array-based numerical programming models such as MATLAB and NumPy.
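The decentralized, dependency-driven execution described above can be illustrated with a minimal Python sketch. It is not the ILNumerics ONAL implementation or its API: the `Segment` class, the `choose_device` cost model, and all constants are hypothetical and chosen only for illustration. Each segment waits for its own inputs, picks a device from a local cost estimate, and signals completion; no component holds a global schedule.

```python
import asyncio


def choose_device(size):
    """Hypothetical local cost model: fixed launch latency plus per-element cost."""
    cost = {"cpu": 1.0 + 0.01 * size, "gpu": 50.0 + 0.0001 * size}
    return min(cost, key=cost.get)


class Segment:
    """Illustrative execution segment (assumed name, not part of any real API)."""

    def __init__(self, name, fn, deps=(), size=1):
        self.name = name
        self.fn = fn
        self.deps = list(deps)
        self.size = size
        self.done = asyncio.Event()  # set once the result is available
        self.result = None
        self.device = None

    async def run(self):
        # Strictly dependency driven: proceed only when every input is ready.
        for dep in self.deps:
            await dep.done.wait()
        # Local decision: placement depends only on this segment's own estimate.
        self.device = choose_device(self.size)
        self.result = self.fn(*(dep.result for dep in self.deps))
        self.done.set()


async def demo():
    # Segments are created inside the running event loop.
    a = Segment("a", lambda: [1, 2, 3], size=3)
    b = Segment("b", lambda: [4, 5, 6], size=3)
    c = Segment("c", lambda x, y: [p + q for p, q in zip(x, y)],
                deps=(a, b), size=3)
    # Launch order is deliberately reversed: correctness comes from the
    # dependency events alone, not from any central ordering.
    await asyncio.gather(c.run(), b.run(), a.run())
    return c.result, c.device


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Under the assumed cost model, small arrays stay on the CPU because the fixed launch latency dominates, while `choose_device(1_000_000)` selects the GPU. The sketch runs on a single thread and only mimics the coordination pattern; it shows neither real parallel speedup nor kernel preparation or data movement.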