
[Split] PowerSpectrum Unit Test


PatrickV2:
Is it perhaps an option to put the latest version of your test program (the one with the fixed GB/s numbers) in the first post?

Of course, if you want to add/run more tests, I'm looking forward to providing you with the new results. ;)

Regards, Patrick.

Jason G:
Hi Patrick,
   I'm currently working on adding the next part of the stock code to the reference set.  It seems that the method used for the next part of processing in the stock CUDA code is very slow (though I'm busily checking my numbers).  Once I've done that, and come up with some suitable alternatives or refinements for that code, I'll probably replace the current test with a new one (with fixed memory throughput numbers).

Until then, you can just multiply the memory throughput figures by ten in your head  ;).

As part of the next refinements (whether they turn out to be replacing or integrating the summax reduction kernels, or something else if that proves unworkable, as Raistmer suggests), I'll be trying to include the threads-per-block heuristic we work out for the PowerSpectrum Mod3.  All going well, I should have something more worth testing in a day or so.

Jason

[A bit later:] Just to make things complicated, the performance of the next reduction (stock code) depends on what sizes are fed to it  ::)  (PowerSpectrum performance is constant):

--- Quote ---Stock:
   PowerSpectrum<  64 thrd/blk>   29.0 GFlops  115.9 GB/s   0.0ulps
   SumMax (     8)    4.3 GFlops   19.0 GB/s
   SumMax (    16)    3.8 GFlops   16.5 GB/s
   SumMax (    32)    1.8 GFlops    8.0 GB/s
   SumMax (    64)    3.1 GFlops   13.5 GB/s
   SumMax (   128)    4.7 GFlops   20.5 GB/s
   SumMax (   256)    6.3 GFlops   27.6 GB/s
   SumMax (   512)   11.2 GFlops   48.9 GB/s
   SumMax (  1024)   17.2 GFlops   75.1 GB/s
   SumMax (  2048)   20.3 GFlops   88.9 GB/s
   SumMax (  4096)   24.3 GFlops  106.3 GB/s
   SumMax (  8192)   25.2 GFlops  110.2 GB/s
   SumMax ( 16384)   24.8 GFlops  108.7 GB/s
   SumMax ( 32768)   28.3 GFlops  123.8 GB/s
   SumMax ( 65536)   18.4 GFlops   80.4 GB/s
   SumMax (131072)   10.1 GFlops   44.3 GB/s

   Powerspectrum + SumMax (     8)   12.0 GFlops   49.1 GB/s
   Powerspectrum + SumMax (    16)   10.8 GFlops   44.4 GB/s
   Powerspectrum + SumMax (    32)    6.2 GFlops   25.2 GB/s
   Powerspectrum + SumMax (    64)    9.3 GFlops   38.3 GB/s
   Powerspectrum + SumMax (   128)   12.6 GFlops   51.7 GB/s
   Powerspectrum + SumMax (   256)   15.3 GFlops   62.5 GB/s
   Powerspectrum + SumMax (   512)   20.8 GFlops   85.1 GB/s
   Powerspectrum + SumMax (  1024)   24.8 GFlops  101.5 GB/s
   Powerspectrum + SumMax (  2048)   26.3 GFlops  107.5 GB/s
   Powerspectrum + SumMax (  4096)   27.7 GFlops  113.5 GB/s
   Powerspectrum + SumMax (  8192)   28.0 GFlops  114.6 GB/s
   Powerspectrum + SumMax ( 16384)   27.8 GFlops  113.8 GB/s
   Powerspectrum + SumMax ( 32768)   28.8 GFlops  117.9 GB/s
   Powerspectrum + SumMax ( 65536)   25.4 GFlops  104.0 GB/s
   Powerspectrum + SumMax (131072)   19.8 GFlops   81.1 GB/s

--- End quote ---
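
A rough sketch of why the reduction's launch geometry changes with FFT length, assuming the 1M-point buffer is split into 1M/fftlen FFTs with one block per FFT (the kernel name and the block-size clamp are hypothetical, not the stock code):

--- Code: ---
#include <algorithm>
#include <cstdio>

int main()
{
    const int totalPoints = 1 << 20;                // 1M spectrum points, as in the test
    for (int fftlen = 8; fftlen <= 131072; fftlen <<= 1) {
        int numFfts = totalPoints / fftlen;         // one reduction per FFT
        int threads = std::min(fftlen, 256);        // hypothetical clamp on block size
        // e.g. summax_kernel<<<numFfts, threads>>>(d_power, d_results, fftlen);
        std::printf("fftlen %6d -> %6d blocks of %3d threads\n", fftlen, numFfts, threads);
    }
    return 0;
}
--- End code ---

Small FFT lengths give many tiny blocks (poor per-block occupancy), while very large ones give only a handful of blocks, so there is little parallelism left to hide memory latency; either extreme hurts throughput.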

Raistmer:
Yes, it should be so.
Different sizes mean different numbers of blocks, and therefore different degrees of memory latency hiding, at least.
The power spectrum, on the other hand, always runs a constant number of threads (1M): each thread is mapped to a single spectrum point, and there are always 1M points regardless of the sizes, since X*Y == 1024*1024 always, even if X varies.
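
As a minimal sketch of that mapping (hypothetical names, not the stock kernel): one thread per spectrum point, so the launch shape never changes with FFT size:

--- Code: ---
// One thread per spectrum point: |z|^2 = re*re + im*im.
// The grid always covers the full 1M points, so it doesn't depend on fftlen.
__global__ void powerSpectrum(const float2 *freqData, float *power, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 z = freqData[i];
        power[i] = z.x * z.x + z.y * z.y;
    }
}

// Launch example: 1M points, 64 threads per block as in the test output.
// powerSpectrum<<<(1 << 20) / 64, 64>>>(d_freqData, d_power, 1 << 20);
--- End code ---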

Jason G:
@Raistmer:  Now I've restored the stock memory transfers, and I get this response from the stock code:


--- Quote ---Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0](     8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0](    16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0](    32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0](    64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0](   128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0](   256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0](   512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0](  1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0](  2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0](  4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0](  8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251
                PowerSpectrum+summax Unit test
Stock:
   PowerSpectrum<  64 thrd/blk>   29.1 GFlops  116.3 GB/s   0.0ulps
   SumMax (     8)    0.8 GFlops    3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
   SumMax (    16)    1.1 GFlops    4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
   SumMax (    32)    1.1 GFlops    4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
   SumMax (    64)    1.8 GFlops    7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
   SumMax (   128)    2.6 GFlops   11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
   SumMax (   256)    3.5 GFlops   15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
   SumMax (   512)    5.0 GFlops   21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
   SumMax (  1024)    6.1 GFlops   26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
   SumMax (  2048)    6.7 GFlops   29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
   SumMax (  4096)    7.2 GFlops   31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
   SumMax (  8192)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
   SumMax ( 16384)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
   SumMax ( 32768)    7.3 GFlops   32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
   SumMax ( 65536)    6.2 GFlops   27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
   SumMax (131072)    5.1 GFlops   22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
--- End quote ---

Did you also find the stock reduction code prior to FindSpikes is a pile of poo?  Or do you think my test is broken?

Raistmer:
Do you mean that summax reaches much lower throughput than the power spectrum?
If so, that's expected; I described the reasons earlier.
For now I'm converting the CUDA summax32 into OpenCL for HD5xxx GPUs. We'll see whether it turns out better than my own reduction kernels, which don't use local memory at all.
The reduction in summax allows more work-items to be involved. Without a reduction, for long FFTs (and there are many long-FFT calls, far more than small-FFT ones) find spike would have only a few work-items, each dealing with a big array of data, which means very poor memory latency hiding.
So some sort of reduction is essential here [and it will surely decrease throughput, but to a much smaller degree].
{BTW, summax32 starts from FFTsize==32. From your table it looks like the template-based codelets for sizes below 32 are not very good ones: too low throughput. Good info, I'll think twice before using them in OpenCL now ;D}
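
For illustration, a minimal sketch of the shared-memory (local-memory) sum+max reduction technique being discussed, with one block of fftlen threads per FFT. Names are hypothetical and fftlen is assumed to be a power of two no larger than the maximum block size; this shows the general pattern, not the stock summax32 code:

--- Code: ---
// One block per FFT: each thread loads one power value, then the block
// cooperatively reduces to a single sum and a single peak in shared memory.
__global__ void sumMaxPerFft(const float *power, float *sums, float *peaks, int fftlen)
{
    extern __shared__ float sh[];          // [0..fftlen) sums, [fftlen..2*fftlen) peaks
    float *ssum = sh;
    float *smax = sh + fftlen;

    int tid  = threadIdx.x;
    int base = blockIdx.x * fftlen;

    float v = power[base + tid];
    ssum[tid] = v;
    smax[tid] = v;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = fftlen / 2; s > 0; s >>= 1) {
        if (tid < s) {
            ssum[tid] += ssum[tid + s];
            smax[tid]  = fmaxf(smax[tid], smax[tid + s]);
        }
        __syncthreads();
    }

    if (tid == 0) {
        sums[blockIdx.x]  = ssum[0];
        peaks[blockIdx.x] = smax[0];
    }
}

// Hypothetical launch: one block per FFT, dynamic shared memory for both arrays.
// sumMaxPerFft<<<numFfts, fftlen, 2 * fftlen * sizeof(float)>>>(d_power, d_sums, d_peaks, fftlen);
--- End code ---

With the reduction, all fftlen points of an FFT are touched by fftlen threads at once, instead of a single work-item walking the whole array serially, which is exactly the latency-hiding argument above.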
