High Register-Count HotSort Kernels
Last week I returned to HotSort to add a few new features. One of the “free” features on my list was to implement high register-count merge kernels for the GK110 and GT200 architectures.
The merge kernels in HotSort minimize global loads and stores by maximizing the number of element comparisons performed per thread.
Until now, the same merging algorithm and register configurations were used across all CUDA architectures, and the resulting merge kernels were approaching the 63-registers-per-thread limit of Fermi and GK104.
Since GK110 and GT200 devices support high register-count kernels, two new merge kernels have been implemented to exploit this capability.
These new kernels further reduce the total number of global memory transactions, resulting in an ~8-12% performance increase when sorting large arrays of 32-bit and 64-bit elements.
Not a bad result for simply adjusting a few configuration files and rebuilding!
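To make the per-architecture choice described above concrete, here is a minimal C++ sketch of the same decision expressed as a host-side function. HotSort makes this choice through its configuration files at build time, so treat this purely as an illustration: the enum, the function name, and the conservative fallback for unlisted devices are all assumptions of mine for this example, not HotSort's API. The compute capabilities are the ones NVIDIA assigns to these chips (GT200 is 1.3, Fermi is 2.x, GK104 is 3.0, GK110 is 3.5).

```cpp
// Illustrative sketch only: these names are hypothetical, not HotSort's API.
enum class MergeVariant { Standard, HighRegisterCount };

// Pick a merge kernel variant from the device's compute capability.
// GT200 (1.3) and GK110 (3.5) allow more than 63 registers per thread;
// Fermi (2.x) and GK104 (3.0) are capped at 63.
// Anything else falls back to the standard variant (a conservative assumption).
constexpr MergeVariant select_merge_variant(int major, int minor) {
    return ((major == 1 && minor == 3) || (major == 3 && minor == 5))
               ? MergeVariant::HighRegisterCount
               : MergeVariant::Standard;
}
```

In a CUDA host program, `major` and `minor` would typically come from the `cudaDeviceProp` struct filled in by `cudaGetDeviceProperties`.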
You can see a comparison between the old and new Tesla K20c kernels here.
The updated HotSort Benchmarks doc for all architectures including GT200 is here.
Update:
One last note on performance: I can actually achieve an extra ~2% improvement on large arrays, as well as on large numbers of small arrays, if the new high register-count merge kernels are used as early as possible in the sorting process. However, this leaves the very small single-array benchmarks ~1% below their peak. My assumption is that this regression is entirely due to SMX under-utilization. The fix is straightforward: launch the smaller merging kernels when performing small sorts. Right now I’m mostly interested in small array performance, so I chose not to disturb the small array sorting kernel launch logic. I’ll save that work for later.
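The fix mentioned above could look something like the following C++ sketch. It is only a guess at the shape of the launch logic, not what HotSort does: the enum, the function, and especially the `min_keys_for_high_reg` cutoff are hypothetical, and a real threshold would have to come from benchmarking each device. The idea is to key the decision on the total number of keys in the launch, so that a large batch of small arrays still gets the high register-count kernels while a single very small array does not.

```cpp
#include <cstdint>

// Illustrative sketch only: these names are hypothetical, not HotSort's API.
enum class MergeKernel { Small, HighRegisterCount };

// Choose a merge kernel from the total amount of work in the launch.
// min_keys_for_high_reg is an assumed tuning parameter; its value is not
// taken from HotSort and would need to be measured per device.
constexpr MergeKernel choose_merge_kernel(bool device_has_high_reg_kernels,
                                          std::uint64_t array_count,
                                          std::uint64_t keys_per_array,
                                          std::uint64_t min_keys_for_high_reg) {
    return (device_has_high_reg_kernels &&
            array_count * keys_per_array >= min_keys_for_high_reg)
               ? MergeKernel::HighRegisterCount
               : MergeKernel::Small;
}
```

With a cutoff of, say, 65536 keys (an arbitrary number for the example), a single 1024-key array would take the small kernel, while 1024 arrays of 1024 keys each would take the high register-count kernel.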