A Lock-Free Work-Stealing Algorithm for Bulk Operations

Work-stealing is a widely used technique for balancing irregular parallel workloads, and most modern runtime systems adopt lock-free work-stealing deques to reduce contention and improve scalability. However, existing algorithms are designed for general-purpose parallel runtimes and often incur overheads that are unnecessary in specialized settings. In this paper, we present a new lock-free work-stealing queue tailored to a master-worker framework used to parallelize a mixed-integer programming solver based on decision diagrams. Our design supports native bulk operations, grows without bound, and assumes at most one owner and one concurrent stealer, thereby eliminating the need for heavy synchronization. We give an informal proof sketch that our queue is linearizable and lock-free under this restricted concurrency model. Benchmarks show that our implementation achieves constant push latency that remains stable as batch size increases, in contrast to the existing queues from C++ Taskflow, whose latencies grow sharply with batch size. Pop operations perform comparably across all implementations, while our steal operation maintains nearly flat latency across different steal proportions. We also explore an optimized steal variant that reduces latency by up to 3x in practice. Finally, a synthetic workload based on large-graph exploration confirms that all implementations scale linearly; nevertheless, we argue that solver workloads with irregular node processing times would further amplify the advantages of our algorithm.
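To make the interface described above concrete, the following is a minimal sequential reference model of a queue with bulk push, single-item pop, and proportional bulk steal. It is a sketch under stated assumptions and not the algorithm of this paper: it is neither thread-safe nor lock-free, and it only illustrates the observable behavior that a linearizable implementation would have to match. The names `BulkQueueModel`, `push_bulk`, `pop`, and `steal` are hypothetical, and the choice that the owner pops the newest item while the stealer takes the oldest items is an assumption borrowed from conventional work-stealing deques. The code requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Sequential reference model only: NOT thread-safe and NOT lock-free.
// All names here are hypothetical and chosen for illustration.
template <typename T>
class BulkQueueModel {
public:
    // Owner side: append a whole batch in a single call.
    void push_bulk(const std::vector<T>& batch) {
        items_.insert(items_.end(), batch.begin(), batch.end());
    }

    // Owner side: remove and return the most recently pushed item, if any.
    // (Assumption: the owner works at the newest end of the queue.)
    std::optional<T> pop() {
        if (items_.empty()) return std::nullopt;
        T value = items_.back();
        items_.pop_back();
        return value;
    }

    // Stealer side: remove floor(proportion * size) of the oldest items
    // in a single call. The proportion is clamped to [0, 1].
    std::vector<T> steal(double proportion) {
        if (proportion < 0.0) proportion = 0.0;
        if (proportion > 1.0) proportion = 1.0;
        const auto count = static_cast<std::size_t>(
            proportion * static_cast<double>(items_.size()));
        const auto last = items_.begin() + static_cast<std::ptrdiff_t>(count);
        std::vector<T> stolen(items_.begin(), last);
        items_.erase(items_.begin(), last);
        return stolen;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::deque<T> items_;  // grows on demand, so the model is unbounded
};
```

For example, after pushing the batch {1, 2, 3, 4, 5, 6}, a call to `steal(0.5)` in this model returns {1, 2, 3}, a subsequent `pop()` returns 6, and two items remain. A concurrent implementation would replace the `std::deque` with atomically published storage; that part is deliberately left out here.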
