Forum > GPU crunching

[Split] PowerSpectrum Unit Test


Jason G:
Yep, I mean exactly what you're getting at:

--- Quote from: Jason G on 25 Nov 2010, 04:22:15 pm ---Stock:
...
   SumMax (     8)    0.8 GFlops    3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
...
   SumMax (  8192)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
...
   SumMax (131072)    5.1 GFlops   22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
--- End quote ---

I think my old Pentium 3 could compute the average & peak for a 1 MiB-point dataset at similar GFlops, and would need much less power to do so (compared to an overclocked GTX 480).

Part of the waste is definitely the memory copies back to the host for the result reduction, but not all of it. I'll keep experimenting and see whether something like this should really be done as-is, with improved GPU code, partially on the CPU, or fully on the CPU.

Jason

Raistmer:
Doing it fully on the CPU will be slower; I started that way in the early stages of OpenCL MB.
The 4 MB transfer of the power matrix costs too much.
For low FFT sizes I use a flag transfer to check whether the mean/max/pos data need to be downloaded from the GPU at all, but for big FFT sizes (a small number of mean/max/pos elements) I found it easier to transfer the data itself than the flag.
A memory transaction (on ATI at least) has a threshold size (about 16 KB) below which the transfer time hardly changes. That is, whether a single byte or 16 KB is transferred, the overhead is the same. So there is no sense in downloading a flag instead of the 16 KB of original mean/max/pos data.
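The threshold behaviour described here can be sketched as a fixed-overhead-plus-bandwidth model of a host<->device copy. The constants below are assumptions picked only to show the shape of the curve, not measured ATI figures:

```python
# Illustrative model (not measured data): transfer time as a fixed
# per-call overhead plus payload/bandwidth. Below the threshold the
# overhead dominates and the time is essentially flat.

OVERHEAD_US = 120.0          # fixed per-transfer cost in microseconds (assumed)
BANDWIDTH_B_PER_US = 4000.0  # ~4 GB/s sustained bus bandwidth (assumed)

def transfer_time_us(size_bytes):
    """Estimated host<->device transfer time for a given payload size."""
    return OVERHEAD_US + size_bytes / BANDWIDTH_B_PER_US

for size in (1, 16, 16 * 1024, 256 * 1024, 4 * 1024 * 1024):
    print(f"{size:>8} B -> {transfer_time_us(size):8.1f} us")
```

With these assumed constants, 1 byte and 16 KB differ by only a few percent, while a 4 MB power-matrix copy is roughly an order of magnitude slower — which is the argument for not bothering to shrink small transfers below the threshold.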

Jason G:
OK, then I have a middle ground in mind that might restore some throughput and hopefully be friendly to the preceding PowerSpectrum thread-block layout. Will give it a go.

[Next Day:] The stock method is doomed by those memory transfers for the SumMax reduction results. I'll move on to FindSpikes to see whether all that data is really needed.

Raistmer:

--- Quote from: Jason G on 25 Nov 2010, 06:20:43 pm ---[Next Day:] The stock method is doomed by those memory transfers for the SumMax reduction results. I'll move on to FindSpikes to see whether all that data is really needed.

--- End quote ---
Three floats (12 bytes) per array. And when the FFT size is big enough (most of the time it's ~8-16K), it's not much to transfer: at an FFT size of 8K we have 1M/8K = 128 arrays => 128*12 = 1.5 KB for the memory transfer per kernel call. That's well within the threshold for constant-time memory transfer on ATI. So there's no sense in shrinking the transfer (and I see no way to eliminate it completely without doing the full reduction and spike bookkeeping on the GPU).
I'll look at the NV profiler data to see whether it shows a different transfer-time vs. transfer-size dependence...
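The per-call arithmetic above can be written out explicitly. This is just a sketch of the calculation in the post (a 1M-point dataset, three floats of mean/max/pos per FFT-sized array); the function name is illustrative, not from the actual codebase:

```python
# Result payload per kernel call: the dataset splits into
# DATASET_POINTS / fft_size arrays, each contributing mean, max
# and pos as three 4-byte floats.

DATASET_POINTS = 1024 * 1024   # 1M-point dataset
BYTES_PER_ARRAY = 3 * 4        # mean, max, pos as three floats

def result_bytes_per_call(fft_size):
    arrays = DATASET_POINTS // fft_size
    return arrays, arrays * BYTES_PER_ARRAY

for fft in (8, 8192, 131072):
    arrays, nbytes = result_bytes_per_call(fft)
    print(f"FFT {fft:>6}: {arrays:>6} arrays -> {nbytes} bytes per call")
# FFT 8192 gives 128 arrays -> 1536 bytes (~1.5 KB), as in the post.
```

It also shows the other end of the trade-off: at FFT size 8 the same formula gives 128K arrays (~1.5 MB per call), which is why the small-FFT path downloads a flag first instead of the full mean/max/pos data.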

Raistmer:
OK, this is data for the OpenCL build, but you'll probably get about the same with CUDA:

Transfer of the flag (in my case the flag is a uint4, so 16 bytes):
4 us GPU time and ~120 us CPU time (as the NV profiler shows).
Transfer of the full results array for big FFT sizes, for example 8 KB of data: GPU 14 us, CPU 128 us;
4 KB of data: GPU 9 us, CPU 117 us.

Looks like the same rule applies to NV GPUs, maybe with a slightly lower threshold value: if the transfer size is below some threshold, the transfer time no longer depends on the transfer size.
And that's to be expected, given the rather large quantum of data in a bus transfer.
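The two GPU-side timings quoted above (4 KB in 9 us, 8 KB in 14 us) are enough for a two-point fit that separates the fixed per-call overhead from the per-byte cost. This is only an illustration of reading those numbers, not profiler output:

```python
# Two-point linear fit: time = overhead + size * (1 / bandwidth),
# using the GPU-side timings from the NV profiler data above.

sizes = (4096, 8192)   # transfer sizes in bytes
times = (9.0, 14.0)    # GPU time in microseconds

slope = (times[1] - times[0]) / (sizes[1] - sizes[0])  # us per byte
overhead = times[0] - slope * sizes[0]                 # fixed cost, us
bandwidth_gb_s = 1e-3 / slope                          # bytes/us -> GB/s

print(f"fixed overhead ~{overhead:.0f} us, bandwidth ~{bandwidth_gb_s:.2f} GB/s")
# -> fixed overhead ~4 us, bandwidth ~0.82 GB/s
```

Notably, the fitted ~4 us intercept matches the measured 4 us for the 16-byte flag transfer, which is consistent with small transfers sitting entirely in the fixed-overhead regime.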
