Author Topic: [Split] PowerSpectrum Unit Test (Read 186520 times)

perryjay · « **Reply #195 on:** 10 Dec 2010, 09:59:00 am »

I changed over to Win 7 64-bit just before we came back up, so I decided to run test 6 again. Not sure how much of a difference it will make.

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrum4.exe > results.txt
'powerspectrum4.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrum6.exe
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1400 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.9 GFlops 11.4 GB/s 1183.3ulps

SumMax ( 64) 0.3 GFlops 1.5 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 1.0 GFlops 4.1 GB/s

GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.9 GFlops 11.5 GB/s 121.7ulps

Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.6 GFlops 6.6 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.8 GFlops 7.3 GB/s 121.7ulps

Leave it to me to mess up: EVGA Precision wasn't holding the o/c. I looked all over the place but couldn't find the little button to make it apply at startup until just now. Here's the corrected test...
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.9 GFlops 11.5 GB/s 1183.3ulps

SumMax ( 64) 0.4 GFlops 1.8 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 1.2 GFlops 4.7 GB/s

GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.9 GFlops 11.6 GB/s 121.7ulps

Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
0.7 GFlops 3.0 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
2.1 GFlops 8.3 GB/s 121.7ulps

C:\test>

Jason G · « **Reply #196 on:** 21 Dec 2010, 11:26:12 am »

Updated first post:

Quote

Update: PowerSpectrum(+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little, should be a tad faster for the size 64 case that was in the prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all CUDA-capable cards (including cc1.0)

Please test on all CUDA-capable cards.
example output:

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.9 GFlops   12.9 GB/s
 PS+SuMx(    16) [OK]    3.9 GFlops   16.2 GB/s
 PS+SuMx(    32) [OK]    3.9 GFlops   15.8 GB/s
 PS+SuMx(    64) [OK]    6.0 GFlops   24.2 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.3   18.6 121.7 [OK]   22.8   99.7 121.7
 PS+SuMx(    16)    6.7   28.1 121.7 [OK]   21.4   89.7 121.7
 PS+SuMx(    32)    9.4   38.6 121.7 [OK]   20.8   85.2 121.7
 PS+SuMx(    64)   11.7   47.4 121.7 [OK]   20.4   82.6 121.7
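
For readers unfamiliar with what the PS+SuMx rows measure, here is a minimal CPU-side sketch in Python. It assumes the power spectrum is the squared magnitude of each complex sample and that the "summax" reduction produces a sum, a maximum and the position of that maximum for each fftlen-sized chunk. The function names and the sample data are made up for illustration and are not taken from the test program, and fftlen is set to 4 only to keep the example short (the test uses 8 through 64).

```python
def power_spectrum(samples):
    """Power of each complex sample: re^2 + im^2."""
    return [z.real * z.real + z.imag * z.imag for z in samples]


def sum_max(power, fftlen):
    """Reduce each fftlen-sized chunk to (sum, max, index of max in chunk)."""
    results = []
    for start in range(0, len(power) - fftlen + 1, fftlen):
        chunk = power[start:start + fftlen]
        peak = max(chunk)
        results.append((sum(chunk), peak, chunk.index(peak)))
    return results


# Made-up input: two chunks of four complex samples each.
samples = [complex(1, 0), complex(0, 2), complex(3, 4), complex(0, 0),
           complex(1, 1), complex(2, 0), complex(0, 1), complex(1, 2)]

power = power_spectrum(samples)
reduced = sum_max(power, fftlen=4)
print(power)    # [1.0, 4.0, 25.0, 0.0, 2.0, 4.0, 1.0, 5.0]
print(reduced)  # [(30.0, 25.0, 2), (12.0, 5.0, 3)]
```

A GPU version would do the same work in parallel per chunk; this sketch only shows what is being computed, not how Opt1 does it.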

Claggy · « **Reply #197 on:** 21 Dec 2010, 11:47:05 am »

My 9800GTX+ on Win 7 x64:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.0 GFlops    8.8 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   10.7 GB/s
 PS+SuMx(    32) [OK]    2.8 GFlops   11.5 GB/s
 PS+SuMx(    64) [OK]    4.5 GFlops   18.1 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    2.7   11.8 121.7 [OK]    7.1   31.0 121.7
 PS+SuMx(    16)    4.0   16.5 121.7 [OK]    7.7   32.1 121.7
 PS+SuMx(    32)    4.9   19.9 121.7 [OK]    7.3   29.7 121.7
 PS+SuMx(    64)    6.6   26.7 121.7 [OK]    8.9   35.9 121.7

and on my 128MB 8400M GS on Vista 32-bit:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.3 GFlops    1.3 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.2 GB/s
 PS+SuMx(    32) [OK]    0.2 GFlops    0.9 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.4    1.9 121.7 [OK]    0.5    2.1 121.7
 PS+SuMx(    16)    0.4    1.8 121.7 [OK]    0.5    1.9 121.7
 PS+SuMx(    32)    0.4    1.7 121.7 [OK]    0.4    1.8 121.7
 PS+SuMx(    64)    0.5    2.1 121.7 [OK]    0.5    2.2 121.7

Claggy

Jason G · « **Reply #198 on:** 21 Dec 2010, 11:57:36 am »

LoL, I thought stock code was already G80 optimised, guess I was WRONG.

Miep · « **Reply #199 on:** 21 Dec 2010, 12:11:58 pm »

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.57 +- 0.048 GFlops    2.49 +- 0.24 GB/s
 PS+SuMx(    16) [OK]    0.57 +- 0.048 GFlops    2.39 +- 0.19 GB/s
 PS+SuMx(    32) [OK]    0.49 +- 0.031 GFlops    2.01 +- 0.11 GB/s
 PS+SuMx(    64) [OK]    0.80 +- 0.105 GFlops    3.20 +- 0.41 GB/s


Opt1: 64 thrds/block
                              worst case                                    best case
                        GFlps            GB/s       ulps             GFlps            GB/s       ulps
 PS+SuMx(     8)    0.87 +- 0.048    3.92 +- 0.20   121.7 [OK]    1.21 +- 0.03    5.49 +- 0.03   121.7
 PS+SuMx(    16)    0.89 +- 0.19     3.70 +- 0.78   121.7 [OK]    1.20 +- 0       5.00 +- 0      121.7
 PS+SuMx(    32)    0.97 +- 0.048    3.92 +- 0.19   121.7 [OK]    1.10 +- 0       4.60 +- 0      121.7
 PS+SuMx(    64)    1.24 +- 0.11     5.02 +- 0.42   121.7 [OK]    1.41 +- 0.03    5.85 +- 0.05   121.7

Average and standard deviation over 10 runs.
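
As a rough illustration of the "mean +- standard deviation" figures above, the same summary can be produced without a spreadsheet. This is only a sketch: the ten values below are invented, and it assumes the sample standard deviation (n-1 divisor), which may or may not be what was used for the table.

```python
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) for a list of readings."""
    mean = statistics.mean(runs)
    spread = statistics.stdev(runs)  # sample std dev, n-1 divisor (assumption)
    return mean, spread


# Invented GFlops readings from ten hypothetical runs of one test row.
gflops_runs = [0.8, 0.9, 0.8, 1.0, 0.9, 0.8, 0.9, 0.9, 0.8, 0.9]

mean, spread = summarize(gflops_runs)
print(f"{mean:.2f} +- {spread:.3f} GFlops")
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population standard deviation instead.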

Jason G · « **Reply #200 on:** 21 Dec 2010, 12:14:18 pm »

How did you do ten runs, while collecting data, on 'that thing' in that timeframe ? magic ?
[ Oh yeah I set the timer tolerances to do that, I forgot ]

Miep · « **Reply #201 on:** 21 Dec 2010, 12:23:27 pm »

Quote from: Jason G on 21 Dec 2010, 12:14:18 pm

How did you do ten runs, while collecting data, on 'that thing' in that timeframe ? magic ?
[ Oh yeah I set the timer tolerances to do that, I forgot ]

A run takes some 20 seconds - makes some 5 minutes with graceful rounding. Typing the data into Excel and the calculated values back into the post took about half an hour.

timer tolerances?

Jason G · « **Reply #202 on:** 21 Dec 2010, 12:27:14 pm »

Quote from: Miep on 21 Dec 2010, 12:23:27 pm

timer tolerances?

Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test

[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention in the worst cases randomising things a bit.
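
The "0.5 seconds per test" idea can be sketched as a loop that keeps repeating the work until a time budget is used up, so a faster card simply completes more repetitions. This is an assumption about the general approach, not the test program's actual code; `benchmark`, `fake_kernel` and the flop count per call are hypothetical.

```python
import time


def benchmark(work, budget_seconds=0.5, flops_per_call=1.0):
    """Repeat work() until the time budget is spent, then report throughput."""
    runs = 0
    start = time.perf_counter()
    while True:
        work()
        runs += 1
        elapsed = time.perf_counter() - start
        if elapsed >= budget_seconds:
            break
    gflops = runs * flops_per_call / elapsed / 1e9
    return runs, elapsed, gflops


def fake_kernel():
    """Stand-in workload; a real test would launch the GPU kernel here."""
    sum(i * i for i in range(1000))


# Short budget so the example finishes quickly; flops_per_call is a rough guess.
runs, elapsed, gflops = benchmark(fake_kernel, budget_seconds=0.05,
                                  flops_per_call=2000)
print(f"{runs} runs in {elapsed:.3f} s, {gflops:.4f} GFlops")
```

Because the number of repetitions depends on speed, slow devices average over fewer samples, which is one reason repeated manual runs can still be useful there.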

Miep · « **Reply #203 on:** 21 Dec 2010, 12:44:19 pm »

Quote from: Jason G on 21 Dec 2010, 12:27:14 pm

Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test

Well manual data collection works just as well, only more tedious.

Quote

[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention in the worst cases randomising things a bit.

I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.

Jason G · « **Reply #204 on:** 21 Dec 2010, 12:46:29 pm »

Quote from: Miep on 21 Dec 2010, 12:44:19 pm

I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.

Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)
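
To put numbers on the two cases, here is a small sketch of the transfer-size reasoning. It assumes one possible detection per FFT and uses the 30-detection limit mentioned above; the function names, the 1048576-point buffer and the placeholder results are hypothetical.

```python
MAX_DETECTIONS = 30  # reporting limit mentioned in the post above


def theoretical_worst_case(num_datapoints, fftlen):
    """One detection per FFT: numdatapoints / fftlen detections."""
    return num_datapoints // fftlen


def results_to_transfer(best, detections, limit=MAX_DETECTIONS):
    """Best signal plus at most `limit` detections to copy back to the host."""
    return [best] + detections[:limit]


# Hypothetical buffer of 1048576 points processed at fftlen 64.
possible = theoretical_worst_case(1048576, 64)
detections = [f"detection {i}" for i in range(possible)]

worst = results_to_transfer("best spike", detections)
best_case = results_to_transfer("best spike", [])
print(possible, len(worst), len(best_case))  # 16384 31 1
```

Under these assumptions the capped worst case copies 31 results rather than thousands, and the best case copies only the single best spike.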

Miep · « **Reply #205 on:** 21 Dec 2010, 01:04:35 pm »

Quote from: Jason G on 21 Dec 2010, 12:46:29 pm

Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)

Now he tells us

So normal data would perform somewhere in between - any info on the distribution between the two endpoints?

Jason G · « **Reply #206 on:** 21 Dec 2010, 01:08:17 pm »

Quote from: Miep on 21 Dec 2010, 01:04:35 pm

So normal data would perform somewhere in between - any info on the distribution between the two endpoints?

Yes. Actual performance will fall somewhere in between best & worst cases ...

... Though initially I'll be using 'worst case' code for rapid code improvements to working prototypes ( Size 64 already in field testing in x33 ), best case code is a glass ceiling to aim for with 'advanced coding'

[Edit:] size 64 (worst case implementation) provides ~3% performance improvement to 'shorties' on GTX 480

[Edit2:] oh, that was 'old' worst case code, nevermind

PatrickV2 · « **Reply #207 on:** 21 Dec 2010, 02:04:14 pm »

I re-ran the tests on my rig (Q6600/8GB/8800GTX) under both Win7-64 and WinXP32.

First WinXP32:

Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.6 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   11.1 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    4.3 GFlops   17.5 GB/s

Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.6   15.8 121.7 [OK]    6.2   27.2 121.7
 PS+SuMx(    16)    4.5   18.8 121.7 [OK]    6.1   25.5 121.7
 PS+SuMx(    32)    4.9   20.1 121.7 [OK]    5.8   23.8 121.7
 PS+SuMx(    64)    6.6   26.5 121.7 [OK]    7.4   30.0 121.7

Then Win7-64:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.1 GFlops    9.0 GB/s
 PS+SuMx(    16) [OK]    2.4 GFlops   10.2 GB/s
 PS+SuMx(    32) [OK]    2.4 GFlops    9.8 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.6 GB/s

Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.4   14.9 121.7 [OK]    6.1   26.8 121.7
 PS+SuMx(    16)    4.2   17.5 121.7 [OK]    6.0   25.3 121.7
 PS+SuMx(    32)    4.6   18.7 121.7 [OK]    5.8   23.7 121.7
 PS+SuMx(    64)    5.9   24.0 121.7 [OK]    7.4   29.8 121.7

As always, hope it helps. ;)

Regards, Patrick.

EDIT: Modified to use no smilies due to the 'cool' smilies in the test-results.

Ghost0210 · « **Reply #208 on:** 21 Dec 2010, 02:07:47 pm »

Win7x64 results:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.4 GFlops   10.7 GB/s
 PS+SuMx(    16) [OK]    3.1 GFlops   13.0 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.6 GB/s
 PS+SuMx(    64) [OK]    4.0 GFlops   16.1 GB/s

Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.9   21.4 121.7 [OK]   13.1   57.4 121.7
 PS+SuMx(    16)    6.5   27.2 121.7 [OK]   12.3   51.4 121.7
 PS+SuMx(    32)    7.8   31.8 121.7 [OK]   11.9   48.7 121.7
 PS+SuMx(    64)    8.6   34.8 121.7 [OK]   11.6   47.0 121.7

M_M · « **Reply #209 on:** 21 Dec 2010, 02:24:52 pm »

GTX460 1GB OC Core=880MHz Mem=2000MHz Win7-64bit

C:\Test>powerspectrumtest7

Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    3.4 GFlops   14.7 GB/s
 PS+SuMx(    16) [OK]    3.5 GFlops   14.7 GB/s
 PS+SuMx(    32) [OK]    2.3 GFlops    9.6 GB/s
 PS+SuMx(    64) [OK]    3.5 GFlops   14.3 GB/s

Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.5   28.4 121.7 [OK]   13.5   59.1 121.7
 PS+SuMx(    16)    7.7   32.3 121.7 [OK]   12.6   52.8 121.7
 PS+SuMx(    32)    8.5   34.8 121.7 [OK]   12.2   49.8 121.7
 PS+SuMx(    64)    9.0   36.3 121.7 [OK]   12.3   49.6 121.7
