NVIDIA Developer Zone, "cuBLAS User Guide," January 2015.
Before calling cuBLAS to perform the matrix multiplication, we need to reorganize the transformed filter data and the transformed image data.
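The exact layout is not given in the excerpt, but a common pattern (e.g., in Winograd-style convolution) is to regroup the per-tile transformed data so that each transform position owns a contiguous matrix pair, which a (batched) cuBLAS GEMM can then consume. The sketch below assumes hypothetical shapes and names (K filters, C channels, P image tiles, T transform positions); it only illustrates the reorganization step in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, P, T = 8, 4, 32, 16   # filters, channels, image tiles, transform positions (assumed)

# Transformed data as produced tile-by-tile: last axis is the transform position.
U = rng.standard_normal((K, C, T))   # transformed filters
V = rng.standard_normal((C, P, T))   # transformed image tiles

# Reorganize so each transform position t owns contiguous (K, C) and (C, P)
# matrices -- the layout a batched GEMM call expects.
U_mats = np.ascontiguousarray(U.transpose(2, 0, 1))  # (T, K, C)
V_mats = np.ascontiguousarray(V.transpose(2, 0, 1))  # (T, C, P)

# One independent matrix multiplication per transform position.
M = np.einsum('tkc,tcp->tkp', U_mats, V_mats)        # (T, K, P)
```

On the GPU, the T independent products would typically map to a single batched cuBLAS GEMM rather than a loop of individual calls.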
For M = 64, the speedup of the CUDA implementation over the cuBLAS-based 2D-NUDFT is around 100.
As in the NED case, a reference GPU implementation of the 2D NER-NUDFT was developed using cuBLAS routines.
Therefore, MVPs with both square and nonsquare matrices are tested, and a significant speedup over the widely used cuBLAS library is observed, as shown in Tables 4 and 5.
Therefore, we do not implement our own GPU matrix-matrix multiplication kernel; instead, we use the optimized implementation in cuBLAS. To compute the matrix D, we then calculate the modulus of the reference features and add it to the proper positions in C.
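The excerpt does not define D precisely, but this reads like the standard trick of building squared Euclidean distances from a GEMM: the cross term C = -2 X Yᵀ comes from the matrix product (cuBLAS on the GPU), and the squared moduli of the reference features are then added along the matching axis. The NumPy sketch below assumes hypothetical query features X and reference features Y, and also folds in the query norms to close the identity ||x - y||² = ||x||² - 2x·y + ||y||².

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))   # query features (hypothetical)
Y = rng.standard_normal((7, 3))   # reference features (hypothetical)

C = -2.0 * X @ Y.T                        # cross term: the GEMM done by cuBLAS
ref_sq = np.sum(Y * Y, axis=1)            # squared modulus of each reference feature
qry_sq = np.sum(X * X, axis=1)            # query norms (assumed also needed)

# Add the moduli at the proper positions of C to obtain D.
D = C + ref_sq[None, :] + qry_sq[:, None]
```

This matches the direct element-wise distance computation while letting the dominant cost run as a single optimized GEMM.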
The FFT/IFFT modules have approximately the same performance because, similar to , both are implemented with the cuBLAS library.
It should be noted that the comparison regards the cases N = 2^i, i = 4, ..., 18, which are the only ones for which the cuBLAS routine runs successfully, owing to the memory limitations of the employed GPU.
For the matrix multiplications, we use cuBLAS. These products can be performed on the GPU with the cublasDgemm function from the cuBLAS library.
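cublasDgemm computes C = alpha * op(A) @ op(B) + beta * C on column-major matrices; a row-major array handed to it unchanged is seen as its transpose, which the transpose flags and leading-dimension arguments must account for. The NumPy sketch below mirrors only the call's semantics (handle, leading dimensions, and device pointers omitted); the helper name and argument subset are illustrative, not the real API.

```python
import numpy as np

def dgemm(transa, transb, alpha, A, B, beta, C):
    """Semantics of cublasDgemm: C = alpha * op(A) @ op(B) + beta * C."""
    opA = A.T if transa else A
    opB = B.T if transb else B
    return alpha * (opA @ opB) + beta * C

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(2, 4)
C = np.ones((3, 4))
C = dgemm(False, False, 2.0, A, B, 0.5, C)   # C <- 2*A@B + 0.5*C
```

In the real call, the scalars alpha and beta are passed by pointer, and each matrix argument is followed by its leading dimension.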