CUBLAS: Compute Unified Basic Linear Algebra Subprograms (NVIDIA)
References in periodicals archive:
Before we call cuBLAS to implement the matrix multiplication, we need to reorganize the transformed filter and transformed image data.
For a value of M = 64, the speedup of the CUDA implementation against the cuBLAS based 2D-NUDFT has been around 100.
As for the NED case, a GPU implementation of the 2D NER-NUDFT has been worked out based on cuBLAS routines for reference.
Therefore, both square and nonsquare matrix MVPs are tested, for which a significant speedup ratio is observed compared to the well-known CUBLAS, as shown in Tables 4 and 5.
Therefore, we do not implement our own GPU code for matrix-matrix multiplication, but simply use the optimized implementation in CUBLAS. In this way, to compute the matrix D, we just calculate the modulus of the reference features and add it to the proper position in C.
The FFT/IFFT modules have approximately the same performance because, as in [7], both are implemented with the CUBLAS function library.
It should be noted that the comparison regards the cases N = 2^i, i = 4, ..., 18, these being the only ones for which the cuBLAS routine runs successfully, owing to memory limitations of the employed GPU.
For the matrix multiplications, we use the cuBLAS routine cublasCgemm.
These products can be performed on a GPU with the function cublasDgemm from the CUBLAS library [28].
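The gemm routines mentioned in the excerpts above (cublasCgemm for single-precision complex, cublasDgemm for double precision) all share the same calling pattern. A minimal sketch of a cublasDgemm call follows; the matrix size and data values are illustrative only, not taken from the cited papers. Note that cuBLAS assumes column-major storage.

```cuda
// Hedged sketch: C = alpha*A*B + beta*C with cublasDgemm.
// All matrices are n x n, stored column-major on the device.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void) {
    const int n = 4;                       /* illustrative size */
    const double alpha = 1.0, beta = 0.0;
    double hA[16], hB[16], hC[16];
    for (int i = 0; i < 16; ++i) { hA[i] = 1.0; hB[i] = 1.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof hA);
    cudaMalloc((void **)&dB, sizeof hB);
    cudaMalloc((void **)&dC, sizeof hC);
    cudaMemcpy(dA, hA, sizeof hA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof hB, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    /* no transposition; leading dimensions equal n */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC, dC, sizeof hC, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);  /* each entry of C should equal n */

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

The complex-valued cublasCgemm variant differs only in taking cuComplex operands and pointers; the operation, dimension, and leading-dimension arguments are identical.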