[Split] PowerSpectrum Unit Test
PatrickV2:
Is it perhaps an option to put the latest version of your test program (the one with the fixed GB/s numbers) in the first post?
Of course, if you want to add/run more tests, I'm looking forward to providing you with the new results. ;)
Regards, Patrick.
Jason G:
Hi Patrick,
I'm currently working on adding the next part of the stock code to the reference set. It seems the method used for the next part of processing in the stock CUDA code is very slow (though I'm busily checking my numbers). Once I've done that, and come up with some suitable alternatives or refinements for that code, I'll probably replace the current test with a new one (with fixed memory throughput numbers).
Until then you can just multiply the memory throughput figures by ten in your head ;).
As part of the next refinements, whether they turn out to replace or integrate the summax reduction kernels, or something else if that proves unworkable as Raistmer suggests, I'll be trying to include the threads-per-block heuristic we work out for the PowerSpectrum Mod3. All going well, I should have something more worth testing in a day or so.
Jason
[A bit later:] Just to make things complicated, the performance of the next reduction (stock code) depends on what sizes are fed to it ::) (Powerspectrum performance is constant)
--- Quote ---
Stock:
PowerSpectrum< 64 thrd/blk> 29.0 GFlops 115.9 GB/s 0.0ulps
SumMax ( 8) 4.3 GFlops 19.0 GB/s
SumMax ( 16) 3.8 GFlops 16.5 GB/s
SumMax ( 32) 1.8 GFlops 8.0 GB/s
SumMax ( 64) 3.1 GFlops 13.5 GB/s
SumMax ( 128) 4.7 GFlops 20.5 GB/s
SumMax ( 256) 6.3 GFlops 27.6 GB/s
SumMax ( 512) 11.2 GFlops 48.9 GB/s
SumMax ( 1024) 17.2 GFlops 75.1 GB/s
SumMax ( 2048) 20.3 GFlops 88.9 GB/s
SumMax ( 4096) 24.3 GFlops 106.3 GB/s
SumMax ( 8192) 25.2 GFlops 110.2 GB/s
SumMax ( 16384) 24.8 GFlops 108.7 GB/s
SumMax ( 32768) 28.3 GFlops 123.8 GB/s
SumMax ( 65536) 18.4 GFlops 80.4 GB/s
SumMax (131072) 10.1 GFlops 44.3 GB/s
Powerspectrum + SumMax ( 8) 12.0 GFlops 49.1 GB/s
Powerspectrum + SumMax ( 16) 10.8 GFlops 44.4 GB/s
Powerspectrum + SumMax ( 32) 6.2 GFlops 25.2 GB/s
Powerspectrum + SumMax ( 64) 9.3 GFlops 38.3 GB/s
Powerspectrum + SumMax ( 128) 12.6 GFlops 51.7 GB/s
Powerspectrum + SumMax ( 256) 15.3 GFlops 62.5 GB/s
Powerspectrum + SumMax ( 512) 20.8 GFlops 85.1 GB/s
Powerspectrum + SumMax ( 1024) 24.8 GFlops 101.5 GB/s
Powerspectrum + SumMax ( 2048) 26.3 GFlops 107.5 GB/s
Powerspectrum + SumMax ( 4096) 27.7 GFlops 113.5 GB/s
Powerspectrum + SumMax ( 8192) 28.0 GFlops 114.6 GB/s
Powerspectrum + SumMax ( 16384) 27.8 GFlops 113.8 GB/s
Powerspectrum + SumMax ( 32768) 28.8 GFlops 117.9 GB/s
Powerspectrum + SumMax ( 65536) 25.4 GFlops 104.0 GB/s
Powerspectrum + SumMax (131072) 19.8 GFlops 81.1 GB/s
--- End quote ---
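A quick way to see why the SumMax figures swing with size while the PowerSpectrum figure stays flat: the data set is fixed at 1024*1024 points (as Raistmer notes below), so the number of independent FFT rows shrinks as the FFT length grows. The arithmetic below is purely illustrative and not part of the test program; the idea that each row maps to one chunk of reduction work is an assumption for the sake of the example.

--- Code: ---
// Illustrative only: how many FFT rows each size implies for a fixed
// 1024*1024-point data set (X*Y == 1024*1024). Fewer rows at long FFT
// lengths means fewer independent chunks of reduction work in flight,
// so memory latency is hidden less well.
#include <cstdio>

int main()
{
    const int totalPoints = 1 << 20;                        // 1024 * 1024
    for (int fftLen = 8; fftLen <= 131072; fftLen *= 2)
        printf("fftLen %6d -> %6d rows\n", fftLen, totalPoints / fftLen);
    return 0;
}
--- End code ---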
Raistmer:
Yes, it should be so.
Different sizes mean different block numbers, and hence different memory latency hiding, at the very least.
Whereas the power spectrum always has a constant number of threads (1M): each thread is mapped to just one spectrum point, and there are always 1M points regardless of the sizes, since X*Y == 1024*1024 always, even if X varies.
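To picture that mapping, here is a minimal sketch of a one-thread-per-point power spectrum kernel. It is not the stock CUDA kernel; the kernel name and the 64-thread launch are illustrative only. The point is simply that the launch geometry never changes, because there are always 1M spectrum points whatever X and Y happen to be.

--- Code: ---
// Sketch only, not the stock kernel: one thread per spectrum point.
#include <cuda_runtime.h>

__global__ void PowerSpectrumSketch(const float2* __restrict__ freqData,
                                    float* __restrict__ power,
                                    int numPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPoints)
    {
        float2 c = freqData[i];
        power[i] = c.x * c.x + c.y * c.y;   // re^2 + im^2
    }
}

// Launch geometry is constant regardless of FFT length:
// 1M points / 64 threads per block = 16384 blocks every call, e.g.
//   PowerSpectrumSketch<<<(1 << 20) / 64, 64>>>(d_freq, d_power, 1 << 20);
--- End code ---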
Jason G:
@Raistmer: Now I've restored the stock memory transfers, and I find this response from the stock code:
--- Quote ---
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0]( 8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0]( 16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0]( 32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0]( 64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0]( 128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0]( 256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0]( 512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0]( 1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0]( 2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0]( 4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0]( 8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251
PowerSpectrum+summax Unit test
Stock:
PowerSpectrum< 64 thrd/blk> 29.1 GFlops 116.3 GB/s 0.0ulps
SumMax ( 8) 0.8 GFlops 3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
SumMax ( 16) 1.1 GFlops 4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
SumMax ( 32) 1.1 GFlops 4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
SumMax ( 64) 1.8 GFlops 7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
SumMax ( 128) 2.6 GFlops 11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
SumMax ( 256) 3.5 GFlops 15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
SumMax ( 512) 5.0 GFlops 21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
SumMax ( 1024) 6.1 GFlops 26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
SumMax ( 2048) 6.7 GFlops 29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
SumMax ( 4096) 7.2 GFlops 31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
SumMax ( 8192) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
SumMax ( 16384) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
SumMax ( 32768) 7.3 GFlops 32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
SumMax ( 65536) 6.2 GFlops 27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
SumMax (131072) 5.1 GFlops 22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
--- End quote ---
Did you also find that the stock reduction code prior to FindSpikes is a pile of poo? Or do you think my test is broken?
Raistmer:
Do you mean that summax reaches much lower throughput than the power spectrum?
If yes, it should be; I described the reasons earlier.
For now I'm converting the CUDA summax32 into OpenCL for HD5xxx GPUs. We'll see if it turns out better than my own reduction kernels, which don't use local memory at all.
The reduction in summax allows the number of work-items involved to be increased. Without reduction, for long FFTs (and there are many long-FFT calls, many more than small-FFT ones) FindSpikes would have only a few work-items, each dealing with a big array of data, which means very poor memory latency hiding.
So some sort of reduction is essential here [and surely it will decrease throughput, but to a much lesser degree].
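As a concrete illustration of that, here is a minimal sketch of the kind of shared-memory sum+max reduction being discussed: one block per FFT row, with many threads (work-items) cooperating on that row, so there is still plenty of parallelism in flight at long FFT lengths. This is illustrative only, not the stock summax32 nor Raistmer's OpenCL kernels; the names and the one-block-per-row layout are assumptions.

--- Code: ---
// Sketch only: per-row sum (for the mean) and max (for the peak) via a
// shared-memory tree reduction. One block handles one FFT row.
#include <cuda_runtime.h>

template <int BLOCK>
__global__ void SumMaxSketch(const float* __restrict__ power,
                             float* __restrict__ rowSum,
                             float* __restrict__ rowMax,
                             int fftLen)
{
    __shared__ float sSum[BLOCK];
    __shared__ float sMax[BLOCK];

    const float* row = power + (size_t)blockIdx.x * fftLen;

    // Each thread strides across the row, accumulating a partial sum and max.
    float sum = 0.0f, mx = 0.0f;
    for (int i = threadIdx.x; i < fftLen; i += BLOCK)
    {
        float v = row[i];
        sum += v;
        mx = fmaxf(mx, v);
    }
    sSum[threadIdx.x] = sum;
    sMax[threadIdx.x] = mx;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = BLOCK / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
        {
            sSum[threadIdx.x] += sSum[threadIdx.x + s];
            sMax[threadIdx.x]  = fmaxf(sMax[threadIdx.x], sMax[threadIdx.x + s]);
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        rowSum[blockIdx.x] = sSum[0];   // mean = rowSum[row] / fftLen on the host
        rowMax[blockIdx.x] = sMax[0];
    }
}

// e.g. SumMaxSketch<256><<<numRows, 256>>>(d_power, d_sum, d_max, fftLen);
--- End code ---

A real kernel would choose the block size from the FFT length (and perhaps pack several short rows into one block), which is essentially the threads-per-block heuristic Jason mentions above.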
{BTW, summax32 starts from FFTsize==32. From your table it looks like the template-based codelets for sizes less than 32 are not very good ones, with too low throughput. Good info, I'll think twice before using them in OpenCL now ;D}