[Split] PowerSpectrum Unit Test
Jason G:
OK,
While playing around, it is definitely the numdatapoints/fftlen transfers (each of size float3) that are killing performance here. I've tried variations on partial reductions, to be completed on the host, even with mapped/pinned memory. I find it's going to be more efficient to pass the thresholds into the kernel and transfer the flag/results only if necessary, solely because we have an upper limit of 30 signals... will keep playing.
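To put rough numbers on that, here is a small host-side C++ sketch (not the actual CUDA code) comparing the bytes copied back when every FFT returns one float3 against returning a single flag. The `Float3` struct, the function names and the 1,048,576-point example in the note below are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for CUDA's float3 (three floats, 12 bytes on common platforms).
struct Float3 { float x, y, z; };

// Bytes copied back if every FFT in the batch returns one float3 result.
// num_datapoints / fft_len is the number of FFTs in the batch.
inline std::size_t full_transfer_bytes(std::size_t num_datapoints, std::size_t fft_len) {
    assert(fft_len > 0);
    return (num_datapoints / fft_len) * sizeof(Float3);
}

// Bytes copied back if the kernel only reports a single flag.
inline std::size_t flag_transfer_bytes() {
    return sizeof(unsigned int);
}
```

With an assumed 1,048,576 data points and an FFT length of 8, `full_transfer_bytes` gives 131,072 results of 12 bytes each, about 1.5 MiB per call, against 4 bytes for the flag.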
Jason
Raistmer:
A flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.
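As a rough sketch of that switch-over in C++ (the enum, the function name and the inclusive treatment of 2048 are assumptions here, not taken from the PC_find_spike* kernels):

```cpp
#include <cstddef>

// Hypothetical names for the two ways of returning results to the host.
enum class ReturnMode { Flag, Values };

// Assumed cutoff from the post above; whether 2048 itself uses the flag
// path is a guess.
constexpr std::size_t kFlagCutoff = 2048;

// Short FFTs (many small arrays) report a flag; longer ones pass the values.
constexpr ReturnMode choose_mode(std::size_t fft_len) {
    return fft_len <= kFlagCutoff ? ReturnMode::Flag : ReturnMode::Values;
}
```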
Raistmer:
--- Quote from: Jason G on 27 Nov 2010, 09:38:00 am --- solely because we have an upper limit of 30 signals...
--- End quote ---
Per single kernel call it doesn't matter. You can always download the whole array back if needed, and this is a very rare case. In the common case only a 4-byte uint flag needs to be transferred, if the threshold/best comparison is done inside the kernel.
But again, it works well only while FFT sizes are small (many short arrays).
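A minimal CPU-side model of that idea in C++, just to show the control flow. It is a sketch under stated assumptions, not the real kernel: `SpikeResult`, `find_spikes` and `bytes_to_download` are hypothetical names, and the loop over FFTs stands in for what would be parallel work on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-FFT result, laid out like a float3: peak power,
// bin index of the peak, and one unused slot.
struct SpikeResult { float peak; float bin; float pad; };

// CPU stand-in for the kernel: finds the peak of each fft_len-sized chunk
// of the power spectrum, stores it in results, and returns a flag that is
// nonzero if any peak exceeds the threshold.
unsigned int find_spikes(const std::vector<float>& power, std::size_t fft_len,
                         float threshold, std::vector<SpikeResult>& results) {
    assert(fft_len > 0 && power.size() % fft_len == 0);
    const std::size_t num_ffts = power.size() / fft_len;
    results.assign(num_ffts, SpikeResult{0.0f, 0.0f, 0.0f});
    unsigned int flag = 0;
    for (std::size_t f = 0; f < num_ffts; ++f) {
        const std::size_t base = f * fft_len;
        std::size_t best = 0;
        for (std::size_t i = 1; i < fft_len; ++i) {
            if (power[base + i] > power[base + best]) best = i;
        }
        results[f] = SpikeResult{power[base + best], static_cast<float>(best), 0.0f};
        if (results[f].peak > threshold) flag = 1;
    }
    return flag;
}

// Host side: the flag always comes back; the full result array is only
// downloaded in the rare case that the flag is set.
std::size_t bytes_to_download(unsigned int flag, std::size_t num_ffts) {
    return sizeof(flag) + (flag ? num_ffts * sizeof(SpikeResult) : 0);
}
```

In the common case (no peak over threshold) the host reads back 4 bytes; only when the flag is set does it pay for the whole array.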
Jason G:
--- Quote from: Raistmer on 27 Nov 2010, 10:25:27 am ---A flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.
--- End quote ---
Yep, that gels with what I'm seeing in the CUDA calls. I'll avoid looking too hard at the OpenCL implementation, for the sake of strength in diversity, but the principles match up so far.
Jason G:
--- Quote from: Raistmer on 27 Nov 2010, 10:29:38 am ---But again, it works well only while FFT sizes are small (many short arrays).
--- End quote ---
Yep, same again. The shorter FFT lengths match with the longer pulsePoTs, so I want those as short as possible. I'll feed through flags to the kernels at least for the shorter FFT lengths.