We use the CTA-level prefix-sum routines of the CUDPP library to implement the reduction.
We call a CUDPP parallel prefix-sum to calculate the side array, then move all data of the segment into a buffer and perform the partition function.
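The scan-then-scatter pattern described here can be illustrated with a small sequential reference model. This is a minimal sketch, not the GPU implementation: it assumes the side array holds a 0/1 flag per element marking which side of a pivot the element belongs to, and the names `exclusive_scan`, `scan_partition`, and `pivot` are hypothetical, chosen only for illustration. On the GPU, the scan step would be carried out by the CUDPP prefix-sum rather than by a Python loop.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: result[i] is the sum of values[0..i-1]."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def scan_partition(segment, pivot):
    """Stable partition of `segment` around `pivot` using a prefix sum.

    Assumption: side[i] == 1 means element i goes to the left part
    (x < pivot); side[i] == 0 means it goes to the right part.
    """
    # Flag each element with the side it belongs to.
    side = [1 if x < pivot else 0 for x in segment]

    # The exclusive scan of the flags gives each left-side element
    # its destination index inside the left part.
    left_offsets = exclusive_scan(side)
    num_left = sum(side)

    # Copy the segment into a buffer so the scatter can write into
    # a separate output without overwriting unread elements.
    buffer = list(segment)
    out = [None] * len(segment)

    for i, x in enumerate(buffer):
        if side[i]:
            dest = left_offsets[i]
        else:
            # Right-side elements seen before i: i - left_offsets[i].
            dest = num_left + (i - left_offsets[i])
        out[dest] = x

    return out, num_left
```

Each iteration of the scatter loop depends only on its own index and the precomputed offsets, which is why the same step maps naturally onto one thread per element once the prefix sum is available.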
Harris, "CUDPP: CUDA Data-Parallel Primitives Library 1.1.1," NVIDIA and UC Davis, April 2010, http://code.google.com/p/cudpp/.