[Split] PowerSpectrum Unit Test

glennaxl:
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   26.5 GFlops  105.8 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.6 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.7 GFlops   26.9 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.7 GFlops  106.9 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   37.0 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.8 GFlops   43.8 GB/s 121.7ulps


-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   25.2 GFlops  100.7 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.5 GFlops   26.3 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.3 GFlops  105.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   36.9 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.4 GFlops   42.1 GB/s 121.7ulps

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   25.3 GFlops  101.2 GB/s 1183.3ulps

 SumMax (    64)    2.0 GFlops    8.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.4 GFlops   25.7 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       25.9 GFlops  103.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         8.8 GFlops   35.8 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.4 GFlops   42.1 GB/s 121.7ulps

Jason G:
Thanks! Compute capability 1.3, so that completes the basic heuristic functionality test  :)

GTX 295 (taking the lower & upper figures across both GPUs as a combined range)
    Average, peak calcs, thread-count heuristic: OK (both)
    worst case speedup: ~35%-40%
    best case speedup: ~60%-61%

GTX 260
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~37%
    best case speedup: ~62%

Still some headroom in the 2xx series yet  :)  With the 295s still pulling those kinds of relative performance numbers, they'll keep challenging the 480s for a while yet IMO.  Running several tasks on the same 480 GPU makes the picture less clear, so as some of the small refinements creep into future releases it'll be something fun to watch at least.

arkayn:
Here are the results from my 460:

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   12.9 GFlops   51.6 GB/s   0.0ulps

 SumMax (    64)    1.1 GFlops    4.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.4 GFlops   13.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       19.4 GFlops   77.4 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         5.5 GFlops   22.1 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         6.9 GFlops   28.1 GB/s 121.7ulps

Jason G:
Thanks! It's cooperating with cc 2.1 as well (after that rocky start  ;) )

GTX 460
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~61%  :o
    best case speedup: ~103%

Looking good.  I haven't worked out the worst case speedup for this kernel on my 480 yet; it should be similar. Doing that now...

Stock  PS+SuMx(    64)    5.9 GFlops   24.0 GB/s
...

Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         8.1 GFlops   32.7 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.0 GB/s 121.7ulps

So
GTX 480
worst:   (8.1-5.9)/5.9  ~= 37%
best:  (16.1-5.9)/5.9 ~= 173%

I guess I can live with the smaller improvement in the worst case, if I can manage to get a piece of the best case improvement in some code down the road.
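
For anyone wanting to re-check the percentages quoted in this thread, here is a minimal sketch of the same (opt - stock) / stock calculation, applied to the PS+SuMx GFlops figures copied from the logs posted above. The function name `speedup_pct` and the `results` table are just illustrative, not part of the unit test itself.

```python
def speedup_pct(stock_gflops, opt_gflops):
    """Relative improvement of the optimised kernel over stock, in percent."""
    return (opt_gflops - stock_gflops) / stock_gflops * 100.0


# (stock PS+SuMx, Opt1 worst case, Opt1 best case) in GFlops,
# copied by hand from the logs posted in this thread.
results = {
    "GTX 295 (device 0)": (6.7, 9.1, 10.8),
    "GTX 295 (device 1)": (6.5, 9.1, 10.4),
    "GTX 260": (6.4, 8.8, 10.4),
    "GTX 460": (3.4, 5.5, 6.9),
    "GTX 480": (5.9, 8.1, 16.1),
}

for name, (stock, worst, best) in results.items():
    print(f"{name}: worst ~{speedup_pct(stock, worst):.1f}%, "
          f"best ~{speedup_pct(stock, best):.1f}%")
```

The logged GFlops values are only given to one decimal place, so the resulting percentages are good to within a point or two at best.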

Jason G:
Thanks All!

From here I'll move to complete at least the 'worst case' operation for all sizes.  It will take some time to make a further test confirming which sizes will work, at least for worst case speedups (simple implementation), and which won't.  During that period, I'll also be seeking straightforward integration into the X series builds.  It would only amount to a very small speedup over the whole processing, but it will confirm certain techniques (as already mentioned).

The 'best case' optimisation will require extensive work to extract a reasonable portion of it.  That would be a further small speedup overall, and it looks like it'll help most GPUs, with Fermi benefiting the most.  Again, those techniques would carry over to other, more critical code areas in the long run, so your help here is very much appreciated.

I can start to apply some of the methods determined here toward more important areas with a lot more confidence.

Cheers, Jason
