We discuss three SYCL realisations of a simple Finite Volume scheme over multiple Cartesian patches. The realisation flavours differ in how they map the compute steps onto loops and tasks: we compare an implementation that relies exclusively on a cascade of for-loops with a version that uses nested parallelism, and benchmark both against a version that models the calculations as a task graph. Our work proposes idioms to realise these flavours within SYCL. The idioms translate to some degree to other GPGPU programming techniques, too. Our preliminary results suggest that SYCL's capability to model calculations via tasks or nested parallelism does not yet allow such realisations to outperform their counterparts that use data parallelism exclusively.