We study the impact of merging routines in merge-based sorting algorithms. More precisely, we focus on the galloping routine that TimSort uses to merge monotonic sub-arrays, hereafter called runs, and on how using this routine instead of a na\"ive merging routine affects the number of element comparisons performed. This routine was introduced in order to make TimSort more efficient on arrays with few distinct values. Alas, we prove that, although it makes TimSort sort arrays with two distinct values in linear time, it does not prevent TimSort from requiring up to $\Theta(n \log(n))$ element comparisons to sort arrays of length~$n$ with three distinct values. However, we also prove that a slight modification of TimSort's galloping routine brings this cost down to $\mathcal{O}(n + n \log(\sigma))$ element comparisons in the worst case, when sorting arrays of length~$n$ with $\sigma$ distinct values. We do so by focusing on the notion of dual runs, which was introduced in the 1990s, and on the associated dual run-length entropy. This notion is related both to the number of distinct values and to the number of runs in an array; the latter comes with its own run-length entropy, which was used to explain TimSort's otherwise "supernatural" efficiency. We also introduce new notions of fast- and middle-growth for natural merge sorts (i.e., algorithms based on merging runs), properties that are found in several merge sorting algorithms similar to TimSort. We prove that algorithms with the fast- or middle-growth property, provided that they use our variant of TimSort's galloping routine for merging runs, are as efficient as possible at sorting arrays with low run-induced or dual-run-induced complexities.
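To make the idea of galloping concrete, here is a minimal sketch in Python of a merge that always gallops: it locates the end of each block to copy by an exponential search followed by a binary search, and it counts element comparisons. It is neither TimSort's actual routine (which switches to galloping only after one run has won several consecutive comparisons) nor the variant studied in this work; the names `gallop` and `galloping_merge`, and the use of a one-element list as a comparison counter, are assumptions made purely for illustration.

```python
def gallop(run, lo, x, strict, counter):
    """Return the first index i >= lo such that run[i] > x (strict=True)
    or run[i] >= x (strict=False); return len(run) if there is none.
    `run` must be sorted; counter[0] accumulates element comparisons."""
    def beyond(i):
        counter[0] += 1
        return run[i] > x if strict else run[i] >= x

    # Exponential phase: probe offsets 0, 1, 3, 7, ... from lo.
    first, probe, step = lo, lo, 1
    while probe < len(run) and not beyond(probe):
        first = probe + 1
        probe += step
        step *= 2
    # Binary phase: the answer now lies in [first, last].
    last = min(probe, len(run))
    while first < last:
        mid = (first + last) // 2
        if beyond(mid):
            last = mid
        else:
            first = mid + 1
    return first


def galloping_merge(a, b):
    """Stably merge two sorted lists; return (merged list, comparison count)."""
    counter = [0]
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Copy every element of a that is <= b[j]; ties go to a (stability).
        k = gallop(a, i, b[j], True, counter)
        out.extend(a[i:k])
        i = k
        if i == len(a):
            break
        # Copy every element of b that is < a[i].
        k = gallop(b, j, a[i], False, counter)
        out.extend(b[j:k])
        j = k
    out.extend(a[i:])
    out.extend(b[j:])
    return out, counter[0]
```

For instance, `galloping_merge([1] * 8, [2] * 4)` needs only four comparisons, since the exponential search skips over the whole first run, whereas a na\"ive merge that compares the heads of both runs at each step would need eight.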