Author Topic: [Split] PowerSpectrum Unit Test (Read 186759 times)

PatrickV2 · « **Reply #90 on:** 25 Nov 2010, 03:05:46 am »

Is it perhaps an option to put the latest version of your test program (the one with the fixed GB/s numbers) in the first post?

Of course, if you want to add/run more tests, I'm looking forward to providing you with the new results.

Regards, Patrick.

Jason G · « **Reply #91 on:** 25 Nov 2010, 04:59:06 am »

Hi Patrick,
I'm currently working on adding the next part of stock code to the reference set. It seems that the method used for the next part of processing in the stock CUDA code is very slow (though I'm still checking my numbers). Once I've done that, and come up with some suitable alternatives or refinements for that code, I'll probably replace the current test with a new one (with fixed memory throughput numbers).

Until then you can just multiply the memory throughput figures by ten in your head.

As part of the next refinements, whether they turn out to be replacing or integrating the summax reduction kernels, or something else if that proves unworkable as Raistmer suggests, I'll be trying to include the threads per block heuristic we work out for the powerspectrum Mod3. All going well, I should have something more worth testing in a day or so.

Jason

[A bit later:] Just to make things complicated, the performance of the next reduction (stock code) depends on what sizes are fed to it (Powerspectrum performance is constant):

Quote

Stock:
PowerSpectrum< 64 thrd/blk> 29.0 GFlops 115.9 GB/s 0.0ulps
SumMax ( 8 ) 4.3 GFlops 19.0 GB/s
SumMax ( 16) 3.8 GFlops 16.5 GB/s
SumMax ( 32) 1.8 GFlops 8.0 GB/s
SumMax ( 64) 3.1 GFlops 13.5 GB/s
SumMax ( 128) 4.7 GFlops 20.5 GB/s
SumMax ( 256) 6.3 GFlops 27.6 GB/s
SumMax ( 512) 11.2 GFlops 48.9 GB/s
SumMax ( 1024) 17.2 GFlops 75.1 GB/s
SumMax ( 2048) 20.3 GFlops 88.9 GB/s
SumMax ( 4096) 24.3 GFlops 106.3 GB/s
SumMax ( 8192) 25.2 GFlops 110.2 GB/s
SumMax ( 16384) 24.8 GFlops 108.7 GB/s
SumMax ( 32768) 28.3 GFlops 123.8 GB/s
SumMax ( 65536) 18.4 GFlops 80.4 GB/s
SumMax (131072) 10.1 GFlops 44.3 GB/s

Powerspectrum + SumMax ( 8 ) 12.0 GFlops 49.1 GB/s
Powerspectrum + SumMax ( 16) 10.8 GFlops 44.4 GB/s
Powerspectrum + SumMax ( 32) 6.2 GFlops 25.2 GB/s
Powerspectrum + SumMax ( 64) 9.3 GFlops 38.3 GB/s
Powerspectrum + SumMax ( 128) 12.6 GFlops 51.7 GB/s
Powerspectrum + SumMax ( 256) 15.3 GFlops 62.5 GB/s
Powerspectrum + SumMax ( 512) 20.8 GFlops 85.1 GB/s
Powerspectrum + SumMax ( 1024) 24.8 GFlops 101.5 GB/s
Powerspectrum + SumMax ( 2048) 26.3 GFlops 107.5 GB/s
Powerspectrum + SumMax ( 4096) 27.7 GFlops 113.5 GB/s
Powerspectrum + SumMax ( 8192) 28.0 GFlops 114.6 GB/s
Powerspectrum + SumMax ( 16384) 27.8 GFlops 113.8 GB/s
Powerspectrum + SumMax ( 32768) 28.8 GFlops 117.9 GB/s
Powerspectrum + SumMax ( 65536) 25.4 GFlops 104.0 GB/s
Powerspectrum + SumMax (131072) 19.8 GFlops 81.1 GB/s

Raistmer · « **Reply #92 on:** 25 Nov 2010, 10:39:19 am »

Yes, it should be so.
Different sizes mean different numbers of blocks, and so different memory latency hiding at least.
Whereas power spectrum always has a constant (1M) number of threads: each thread is mapped to just a single spectrum point, and there are always 1M points whatever the sizes are. X*Y==1024*1024 always, even if X varies.

Jason G · « **Reply #93 on:** 25 Nov 2010, 04:22:15 pm »

@Raistmer: I have now restored the stock memory transfers, and get this response from the stock code:

Quote

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0]( 8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0]( 16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0]( 32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0]( 64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0]( 128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0]( 256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0]( 512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0]( 1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0]( 2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0]( 4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0]( 8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251
PowerSpectrum+summax Unit test
Stock:
PowerSpectrum< 64 thrd/blk> 29.1 GFlops 116.3 GB/s 0.0ulps
SumMax ( 8) 0.8 GFlops 3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
SumMax ( 16) 1.1 GFlops 4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
SumMax ( 32) 1.1 GFlops 4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
SumMax ( 64) 1.8 GFlops 7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
SumMax ( 128) 2.6 GFlops 11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
SumMax ( 256) 3.5 GFlops 15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
SumMax ( 512) 5.0 GFlops 21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
SumMax ( 1024) 6.1 GFlops 26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
SumMax ( 2048) 6.7 GFlops 29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
SumMax ( 4096) 7.2 GFlops 31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
SumMax ( 8192) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
SumMax ( 16384) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
SumMax ( 32768) 7.3 GFlops 32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
SumMax ( 65536) 6.2 GFlops 27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
SumMax (131072) 5.1 GFlops 22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251

Did you also find that the stock reduction code prior to FindSpikes is a pile of poo? Or do you think my test is broken?

Raistmer · « **Reply #94 on:** 25 Nov 2010, 05:47:47 pm »

Do you mean that summax reaches much lower throughput than power spectrum?
If yes, that is expected; I described the reasons earlier.
For now I am converting CUDA summax32 into OpenCL for HD5xxx GPUs. Will see if it is better than my own reduction kernels, which don't use local memory at all.
Reduction in summax allows increasing the number of workitems involved. Without reduction, for long FFTs (and there are many long FFT calls, many more than short ones) find spike would have only a few workitems, each dealing with a big array of data, which means very poor memory latency hiding.
So some sort of reduction is essential here [and surely it will decrease throughput, but to a much lesser degree].
{BTW, summax32 starts from FFTsize==32. From your table it looks like the (template-based) codelets for sizes less than 32 are not very good ones: too low throughput. Good info, I'll think twice before using them in OpenCL now.}

Jason G · « **Reply #95 on:** 25 Nov 2010, 06:07:03 pm »

Yep, I mean exactly what you're getting at:

Quote from: Jason G on 25 Nov 2010, 04:22:15 pm

Stock:
...
SumMax ( 8) 0.8 GFlops 3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
...
SumMax ( 8192) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
...
SumMax (131072) 5.1 GFlops 22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251

I think my old Pentium 3 would calculate the average & peak for a 1Mi-point dataset at similar GFlops, and need much less power to do so (compared to an overclocked GTX 480).

Part of the waste is definitely the memory copies back to host for result reduction (OK) but not that much. I'll continue playing around and see if I can determine whether something like this should really be done as is, with improved GPU code, partially on CPU, or fully on CPU.

Jason
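For anyone following along, here is a minimal CPU sketch of the kind of per-array average & peak calculation being discussed. It is not the stock code: the function name `summax_reference` and the (mean, peak, position) result layout are assumptions for illustration, based on the mean/max/pos triple mentioned later in the thread.

```python
def summax_reference(power, fft_len):
    """Return one (mean, peak, peak_position) tuple per FFT array.

    `power` is a flat list of power values; it is treated as
    len(power) // fft_len consecutive arrays of length fft_len.
    Hypothetical helper, not the stock summax implementation.
    """
    if fft_len <= 0 or len(power) % fft_len != 0:
        raise ValueError("len(power) must be a positive multiple of fft_len")
    results = []
    for start in range(0, len(power), fft_len):
        row = power[start:start + fft_len]
        peak = max(row)
        # row.index() gives the first position of the peak within this array
        results.append((sum(row) / fft_len, peak, row.index(peak)))
    return results


# Two arrays of length 4
print(summax_reference([1.0, 3.0, 2.0, 2.0, 0.0, 0.0, 4.0, 0.0], 4))
```

A plain loop like this is only useful as a correctness reference to compare GPU output against, not as a performance baseline.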

Raistmer · « **Reply #96 on:** 25 Nov 2010, 06:13:55 pm »

A full transfer to the CPU will be slower; I started with that in the early stages of OpenCL MB.
The 4M memory transfer of the power matrix costs too much.
For low FFT sizes I use a flag transfer to see whether the mean/max/pos data need to be downloaded from the GPU or not, but for big FFT sizes (a low number of mean/max/pos elements) I found it's easier to transfer the data than the flag.
A memory transaction (for ATI at least) has some threshold size (about 16 kB) below which the transfer time almost doesn't change. That is, whether a single byte or 16 kB is transferred, the overhead will be the same. So there is no sense in downloading a flag instead of 16 kB of the original mean/max/pos data.

Jason G · « **Reply #97 on:** 25 Nov 2010, 06:20:43 pm »

OK, then I have a middle ground in mind that might restore some throughput & hopefully be friendly to the preceding powerspectrum threadblock layout. Will give it a go.

[Next Day:] Stock method is doomed by those memory transfers for the summax reduction results. I'll go onto FindSpikes to see if all that data is really needed.

Raistmer · « **Reply #98 on:** 27 Nov 2010, 07:20:53 am »

Quote from: Jason G on 25 Nov 2010, 06:20:43 pm

[Next Day:] Stock method is doomed by those memory transfers for the summax reduction results. I'll go onto FindSpikes to see if all that data is really needed.

3 float numbers (12 bytes) per array. And when the FFT size is big enough (most of the time it is ~8-16K) that's not too much to transfer: at an FFT size of 8K we have 1M/8K = 128 arrays => 128*12 = 1536 bytes (about 1.5 kB) of memory transfer per kernel call. That's well within the threshold for constant-time memory transfer on ATI. That is, there is no sense in reducing the size of the transfer (and I see no way to eliminate it completely without doing the full reduction and spike bookkeeping entirely on the GPU).
Will look at the NV profiler data to see whether it shows a different dependence of transfer time on transfer size...
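The arithmetic above can be checked for other FFT sizes with a small sketch. The names `TOTAL_POINTS`, `BYTES_PER_RESULT` and `result_bytes` are made up for illustration; the 1M point count and the 3 floats per array come from the posts in this thread.

```python
TOTAL_POINTS = 1024 * 1024   # 1M spectrum points, constant for every FFT size
BYTES_PER_RESULT = 3 * 4     # mean, max, pos as three 4-byte floats


def result_bytes(fft_len):
    """Bytes of summax results downloaded per kernel call for one FFT length."""
    if fft_len <= 0 or TOTAL_POINTS % fft_len != 0:
        raise ValueError("fft_len must divide the total point count")
    num_arrays = TOTAL_POINTS // fft_len
    return num_arrays * BYTES_PER_RESULT


for fft_len in (8, 2048, 8192, 131072):
    print(fft_len, result_bytes(fft_len))
```

The transfer shrinks as the FFT length grows, which is why the small FFT sizes are the ones where downloading every result hurts most.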

Raistmer · « **Reply #99 on:** 27 Nov 2010, 08:25:36 am »

OK, this is data for the OpenCL build, but you will probably get about the same with CUDA:

Transfer of the flag (in my case the flag is a uint4, so 16 bytes):
4us GPU time and ~120us CPU time (as the NV profiler shows).
Transfer of the full results array in the case of big FFT sizes, for example, transfer of 8k of data: GPU: 14us, CPU: 128us;
4k of data: GPU: 9us, CPU: 117us.

It looks like the same rule applies for NV GPUs, maybe with a slightly lower threshold value: if the transfer size is less than some threshold, the transfer time no longer depends on the transfer size.
And it should be so, because of the rather large quantum of data in a bus transfer.
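To make the point concrete, here is a rough sketch that compares how much the transfer size grows against how much the CPU-side time grows, using only the three profiler readings quoted above. It assumes "4k" and "8k" mean 4096 and 8192 bytes, which the post does not state explicitly.

```python
# (bytes transferred, GPU time in us, CPU time in us) as quoted above;
# the byte counts for "4k" and "8k" are an assumption.
measurements = [
    (16, 4, 120),
    (4096, 9, 117),
    (8192, 14, 128),
]

smallest = measurements[0]
largest = measurements[-1]

size_ratio = largest[0] / smallest[0]      # how much more data is moved
cpu_time_ratio = largest[2] / smallest[2]  # how much longer the call takes on the CPU side

print(f"size grows {size_ratio:.0f}x, CPU time grows {cpu_time_ratio:.2f}x")
```

Three data points are far too few to locate the threshold itself; the sketch only shows that the CPU-side cost is close to flat over this range.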

Jason G · « **Reply #100 on:** 27 Nov 2010, 09:38:00 am »

OK,
While playing, it is definitely the numdatapoints/fftlen transfers (of size float3) that are killing performance here. I've tried variations on partial reductions, to be completed on the host, even with mapped/pinned memory. I find it's going to be more efficient to pass the thresholds into the kernel, transferring flag/results only if necessary, solely because we have an upper limit of 30 signals... will keep playing.

Jason

Raistmer · « **Reply #101 on:** 27 Nov 2010, 10:25:27 am »

A flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; look at the OpenCL PC_find_spike* kernels.
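As a sketch, that rule amounts to a simple host-side switch on FFT length. The function name `transfer_strategy` is hypothetical, and treating 2048 itself as still on the "flag" side is an assumption; the real logic lives in the PC_find_spike* kernels and their host code.

```python
FLAG_THRESHOLD_FFT_LEN = 2048  # threshold size quoted in the post


def transfer_strategy(fft_len):
    """Pick what to download after the reduction for a given FFT length.

    Hypothetical helper: "flag" means download only a small flag first,
    "values" means download the mean/max/pos results directly.
    """
    if fft_len <= FLAG_THRESHOLD_FFT_LEN:  # boundary handling is an assumption
        return "flag"
    return "values"


for fft_len in (8, 2048, 4096, 131072):
    print(fft_len, transfer_strategy(fft_len))
```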

Raistmer · « **Reply #102 on:** 27 Nov 2010, 10:29:38 am »

Quote from: Jason G on 27 Nov 2010, 09:38:00 am

solely because we have an upper limit of 30 signals...

For a single kernel call it doesn't matter. You can always download the whole array if needed, and that is a very rare case. In the common case only a 4-byte uint flag needs to be transferred, if the threshold/best comparison is done inside the kernel.

But again, it works well only while FFT sizes are small (many short arrays).
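A host-side simulation of that flag scheme might look like the sketch below. Everything here is illustrative: `bytes_downloaded` is a made-up name, the comparison that would run inside the kernel is done in plain Python, and only the 4-byte flag and 12 bytes per array are taken from the thread.

```python
FLAG_BYTES = 4         # single uint flag
BYTES_PER_RESULT = 12  # mean, max, pos as three floats per array


def bytes_downloaded(peaks, threshold):
    """Bytes the host downloads for one kernel call under the flag scheme.

    `peaks` stands in for the per-array peak values; in the real design the
    threshold comparison happens inside the kernel, not on the host.
    """
    flag = any(peak > threshold for peak in peaks)
    total = FLAG_BYTES
    if flag:
        # Rare case: something crossed the threshold, fetch the whole array
        total += len(peaks) * BYTES_PER_RESULT
    return total


print(bytes_downloaded([1.0, 2.0], threshold=5.0))  # common case, flag only
print(bytes_downloaded([1.0, 6.0], threshold=5.0))  # rare case, flag plus results
```

When hits are rare, nearly every call costs only the flag download, and the benefit is largest for the short FFT lengths where the results array is big.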

Jason G · « **Reply #103 on:** 27 Nov 2010, 10:48:35 am »

Quote from: Raistmer on 27 Nov 2010, 10:25:27 am

A flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; look at the OpenCL PC_find_spike* kernels.

Yep, that gels with what I'm seeing in CUDA calls. I'll avoid looking too hard at the OpenCL implementation, for the sake of strength in diversity, but the principles match up so far.

Jason G · « **Reply #104 on:** 27 Nov 2010, 10:50:53 am »

Quote from: Raistmer on 27 Nov 2010, 10:29:38 am

But again, it works well only while FFT sizes are small (many short arrays).

Yep, same again. The shorter FFT lengths match with the longer pulsePoTs, so I want those as short as possible. I'll feed through flags to the kernels at least for the shorter FFT lengths.
