[Split] PowerSpectrum Unit Test


glennaxl:
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 26.5 GFlops 105.8 GB/s 1183.3ulps

SumMax ( 64) 2.1 GFlops 8.6 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 6.7 GFlops 26.9 GB/s

GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 26.7 GFlops 106.9 GB/s 121.7ulps

Opt1 (PSmod3+SM): 128 thrds/block
128 threads, fftlen 64: (worst case: full summax copy)
9.1 GFlops 37.0 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
10.8 GFlops 43.8 GB/s 121.7ulps

-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 25.2 GFlops 100.7 GB/s 1183.3ulps

SumMax ( 64) 2.1 GFlops 8.7 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 6.5 GFlops 26.3 GB/s

GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 26.3 GFlops 105.1 GB/s 121.7ulps

Opt1 (PSmod3+SM): 128 thrds/block
128 threads, fftlen 64: (worst case: full summax copy)
9.1 GFlops 36.9 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
10.4 GFlops 42.1 GB/s 121.7ulps

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 25.3 GFlops 101.2 GB/s 1183.3ulps

SumMax ( 64) 2.0 GFlops 8.4 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 6.4 GFlops 25.7 GB/s

GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 25.9 GFlops 103.7 GB/s 121.7ulps

Opt1 (PSmod3+SM): 128 thrds/block
128 threads, fftlen 64: (worst case: full summax copy)
8.8 GFlops 35.8 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
10.4 GFlops 42.1 GB/s 121.7ulps

Jason G:
Thanks! That's compute capability 1.3, so this completes the basic heuristic functionality test :)

GTX 295 (taking lower & upper limits on each GPU as combined range)
Average, peak calcs, thread-count heuristic: OK (both)
worst case speedup: ~35%-40%
best case speedup: ~60%-61%

GTX 260
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~37%
best case speedup: ~62%

Still some headroom in those 2xx series cards yet :) With the 295s still pulling those kinds of relative performance numbers, they'll challenge the 480s for a while yet IMO. Running several tasks on the same 480 GPU makes the picture less clear, so as some of the small refinements creep into future releases it'll be something fun to watch at least.

arkayn:
Here are the results from my 460:

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 12.9 GFlops 51.6 GB/s 0.0ulps

SumMax ( 64) 1.1 GFlops 4.5 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 3.4 GFlops 13.8 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 19.4 GFlops 77.4 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
5.5 GFlops 22.1 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
6.9 GFlops 28.1 GB/s 121.7ulps

Jason G:
Thanks! It's cooperating with cc 2.1 as well (after that rocky start ;) )

GTX 460
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~61% :o
best case speedup: ~103%

Looking good. I haven't worked out the worst-case speedup for this kernel on my 480 yet; it should be similar. Doing that now...

Stock PS+SuMx( 64) 5.9 GFlops 24.0 GB/s
...

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.7 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.0 GB/s 121.7ulps

So
GTX 480
worst: (8.1-5.9)/5.9 ~= 37%
best: (16.1-5.9)/5.9 ~= 173%
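For anyone wanting to reproduce these percentages from their own logs, here is a minimal Python sketch of the arithmetic used above: speedup is (optimised - stock) / stock, taken on the PS+SuMx GFlops figures. The function name `speedup_pct` is only illustrative and is not part of the unit test program.

```python
def speedup_pct(stock_gflops: float, opt_gflops: float) -> float:
    """Relative speedup of the optimised kernel over stock, in percent."""
    if stock_gflops <= 0:
        raise ValueError("stock throughput must be positive")
    return (opt_gflops - stock_gflops) / stock_gflops * 100.0


# GTX 480 figures quoted in this post (GFlops).
stock = 5.9   # Stock PS+SuMx( 64)
worst = 8.1   # Opt1, worst case: full summax copy
best = 16.1   # Opt1, best case: nothing to update

print(f"worst: {speedup_pct(stock, worst):.0f}%")  # prints "worst: 37%"
print(f"best: {speedup_pct(stock, best):.0f}%")    # prints "best: 173%"
```

The same function applies to the other cards by plugging in their stock PS+SuMx and Opt1 GFlops lines.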

I guess I can live with the smaller improvement in the worst case, if I can manage to get a piece of the best case improvement in some code down the road.

Jason G:
Thanks All!

From here I'll move on to completing at least the 'worst case' operation for all sizes. It will take some time to put together a further test confirming which sizes work, at least for worst-case speedups (simple implementation), and which do not. During that period I'll also be looking for a straightforward integration into the X series builds. It would only amount to a very small speedup over the whole processing, but it will confirm certain techniques (as already mentioned).

The 'best case' optimisation will require extensive work before a reasonable portion of it can be extracted. That would be a further small speedup overall that looks like it'll help most GPUs, Fermi most of all. Again, those techniques would carry over to other, more critical code areas in the long run, so your help here has been most highly appreciated.

I can start to apply some of the methods determined here toward more important areas with a lot more confidence.

Cheers, Jason
