[Split] PowerSpectrum Unit Test

perryjay:
I changed over to Win 7 64-bit just before we came back up, so I decided to run test 6 again. Not sure how much of a difference it will make.

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrum4.exe > results.txt
'powerspectrum4.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrum6.exe
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1400 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.9 GFlops 11.4 GB/s 1183.3ulps

SumMax ( 64) 0.3 GFlops 1.5 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 1.0 GFlops 4.1 GB/s

GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.9 GFlops 11.5 GB/s 121.7ulps

Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.6 GFlops 6.6 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.8 GFlops 7.3 GB/s 121.7ulps

Leave it to me to mess up: EVGA Precision wasn't holding the overclock. I looked all over the place but couldn't find the little button to make it apply at startup until just now. Here's the corrected test:
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.9 GFlops 11.5 GB/s 1183.3ulps

SumMax ( 64) 0.4 GFlops 1.8 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 1.2 GFlops 4.7 GB/s

GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.9 GFlops 11.6 GB/s 121.7ulps

Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
0.7 GFlops 3.0 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
2.1 GFlops 8.3 GB/s 121.7ulps

C:\test>

Jason G:
Updated first post:

--- Quote ---Update: PowerSpectrum(+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little, should be a tad faster for size 64 that was in prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
--- End quote ---

Please test on all CUDA-capable cards.
Example output:
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.9 GFlops 12.9 GB/s
PS+SuMx( 16) [OK] 3.9 GFlops 16.2 GB/s
PS+SuMx( 32) [OK] 3.9 GFlops 15.8 GB/s
PS+SuMx( 64) [OK] 6.0 GFlops 24.2 GB/s

Opt1: 256 thrds/block
              worst case                 best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     4.3   18.6  121.7  [OK]   22.8   99.7  121.7
PS+SuMx( 16)    6.7   28.1  121.7  [OK]   21.4   89.7  121.7
PS+SuMx( 32)    9.4   38.6  121.7  [OK]   20.8   85.2  121.7
PS+SuMx( 64)   11.7   47.4  121.7  [OK]   20.4   82.6  121.7

Claggy:
My 9800GTX+ on Win 7 x64:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.0 GFlops 8.8 GB/s
PS+SuMx( 16) [OK] 2.6 GFlops 10.7 GB/s
PS+SuMx( 32) [OK] 2.8 GFlops 11.5 GB/s
PS+SuMx( 64) [OK] 4.5 GFlops 18.1 GB/s

Opt1: 64 thrds/block
              worst case                 best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     2.7   11.8  121.7  [OK]    7.1   31.0  121.7
PS+SuMx( 16)    4.0   16.5  121.7  [OK]    7.7   32.1  121.7
PS+SuMx( 32)    4.9   19.9  121.7  [OK]    7.3   29.7  121.7
PS+SuMx( 64)    6.6   26.7  121.7  [OK]    8.9   35.9  121.7

And on my 128 MB 8400M GS on Vista 32-bit:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.3 GFlops 1.3 GB/s
PS+SuMx( 16) [OK] 0.3 GFlops 1.2 GB/s
PS+SuMx( 32) [OK] 0.2 GFlops 0.9 GB/s
PS+SuMx( 64) [OK] 0.4 GFlops 1.5 GB/s

Opt1: 64 thrds/block
              worst case                 best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     0.4    1.9  121.7  [OK]    0.5    2.1  121.7
PS+SuMx( 16)    0.4    1.8  121.7  [OK]    0.5    1.9  121.7
PS+SuMx( 32)    0.4    1.7  121.7  [OK]    0.4    1.8  121.7
PS+SuMx( 64)    0.5    2.1  121.7  [OK]    0.5    2.2  121.7
Claggy

Jason G:
LoL, I thought stock code was already G80 optimised, guess I was WRONG.
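
To put a number on that, here is a rough sketch that divides the Opt1 GFlops by the stock GFlops for each size, using the figures Claggy posted above for the 9800 GTX+. The helper name `speedup` and the dictionary layout are just for illustration; they are not part of the test program.

```python
# GFlops figures copied from Claggy's 9800 GTX+ run above, keyed by size.
stock = {8: 2.0, 16: 2.6, 32: 2.8, 64: 4.5}
opt1_worst = {8: 2.7, 16: 4.0, 32: 4.9, 64: 6.6}
opt1_best = {8: 7.1, 16: 7.7, 32: 7.3, 64: 8.9}


def speedup(new, old):
    """Ratio new/old for every size present in `old` (illustrative helper)."""
    return {size: new[size] / old[size] for size in old}


worst = speedup(opt1_worst, stock)
best = speedup(opt1_best, stock)

for size in stock:
    print(f"size {size:2d}: worst case {worst[size]:.2f}x, best case {best[size]:.2f}x")
```

On that card the ratios come out at roughly 1.35x to 1.75x in the worst case and roughly 2x to 3.5x in the best case, so the stock kernel clearly had headroom there.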

Miep:
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.57 +- 0.048 GFlops 2.49 +- 0.24 GB/s
PS+SuMx( 16) [OK] 0.57 +- 0.048 GFlops 2.39 +- 0.19 GB/s
PS+SuMx( 32) [OK] 0.49 +- 0.031 GFlops 2.01 +- 0.11 GB/s
PS+SuMx( 64) [OK] 0.80 +- 0.105 GFlops 3.20 +- 0.41 GB/s

Opt1: 64 thrds/block
              worst case                                best case
              GFlps          GB/s          ulps         GFlps         GB/s          ulps
PS+SuMx( 8)   0.87 +- 0.048  3.92 +- 0.20  121.7  [OK]  1.21 +- 0.03  5.49 +- 0.03  121.7
PS+SuMx( 16)  0.89 +- 0.19   3.70 +- 0.78  121.7  [OK]  1.20 +- 0     5.00 +- 0     121.7
PS+SuMx( 32)  0.97 +- 0.048  3.92 +- 0.19  121.7  [OK]  1.10 +- 0     4.60 +- 0     121.7
PS+SuMx( 64)  1.24 +- 0.11   5.02 +- 0.42  121.7  [OK]  1.41 +- 0.03  5.85 +- 0.05  121.7
Average and standard deviation over 10 runs.
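
For anyone who wants to report results the same way, a minimal sketch of the averaging step follows. The readings in `runs` are made-up placeholders, not numbers from any card in this thread, and the sample (n-1) form of the standard deviation is an assumption since the post does not say which form was used.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of per-run readings."""
    mean = statistics.mean(samples)
    # stdev() uses the n-1 (sample) form; swap in pstdev() for the population form.
    spread = statistics.stdev(samples)
    return mean, spread


# Hypothetical GFlops readings from 10 runs of one test line (placeholders only).
runs = [0.55, 0.60, 0.52, 0.61, 0.58, 0.50, 0.63, 0.57, 0.54, 0.60]

mean, spread = summarize(runs)
print(f"{mean:.2f} +- {spread:.3f} GFlops")
```

Copy one figure per run into the list for each test line and repeat for the GB/s column.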
