Author Topic: [Split] PowerSpectrum Unit Test  (Read 162809 times)

Offline perryjay

  • Knight Templar
  • ****
  • Posts: 427
Re: [Split] PowerSpectrum Unit Test
« Reply #195 on: 10 Dec 2010, 09:59:00 am »
I changed over to Win 7 64-bit just before we came back up, so I decided to run test 6 again. Not sure how much of a difference it will make.

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrum4.exe > results.txt
'powerspectrum4.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrum6.exe
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1400 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    2.9 GFlops   11.4 GB/s 1183.3ulps

 SumMax (    64)    0.3 GFlops    1.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.0 GFlops    4.1 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        2.9 GFlops   11.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.6 GFlops    6.6 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.8 GFlops    7.3 GB/s 121.7ulps



Leave it to me to mess up: EVGA Precision wasn't holding the o/c. I looked all over the place but couldn't find the little button to make it apply at startup until just now. Here's the corrected test...
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    2.9 GFlops   11.5 GB/s 1183.3ulps

 SumMax (    64)    0.4 GFlops    1.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.2 GFlops    4.7 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        2.9 GFlops   11.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         0.7 GFlops    3.0 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         2.1 GFlops    8.3 GB/s 121.7ulps



C:\test>
« Last Edit: 10 Dec 2010, 10:13:13 am by perryjay »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #196 on: 21 Dec 2010, 11:26:12 am »
Updated first post:
Quote
Update: PowerSpectrum(+summax reduction) Test #7
 - completed summax reduction sizes 8 through 64
 - refined Opt1 a little, should be a tad faster for size 64 than it was in the prior test
 - tidied up test result layout
 - enabled pinned memory use for Opt1 on all CUDA-capable cards (including cc1.0)

Please test on all CUDA-capable cards.
example output:

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.9 GFlops   12.9 GB/s
 PS+SuMx(    16) [OK]    3.9 GFlops   16.2 GB/s
 PS+SuMx(    32) [OK]    3.9 GFlops   15.8 GB/s
 PS+SuMx(    64) [OK]    6.0 GFlops   24.2 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.3   18.6 121.7 [OK]   22.8   99.7 121.7
 PS+SuMx(    16)    6.7   28.1 121.7 [OK]   21.4   89.7 121.7
 PS+SuMx(    32)    9.4   38.6 121.7 [OK]   20.8   85.2 121.7
 PS+SuMx(    64)   11.7   47.4 121.7 [OK]   20.4   82.6 121.7
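
For anyone unsure what the test is exercising, here is a minimal CPU sketch in Python. It assumes that "power spectrum" means the squared magnitude of each complex FFT output bin, and that "summax" means a sum and a peak taken over each chunk of fftlen bins (sizes 8 through 64 above). The function names are hypothetical, and this is not the CUDA kernel code being benchmarked.

```python
def power_spectrum(bins):
    """Squared magnitude (re^2 + im^2) of each complex bin."""
    return [z.real * z.real + z.imag * z.imag for z in bins]


def sum_max(power, fftlen):
    """For each consecutive chunk of fftlen powers, return a tuple
    (sum, peak value, index of the peak within the chunk)."""
    if fftlen <= 0 or len(power) % fftlen != 0:
        raise ValueError("power length must be a positive multiple of fftlen")
    results = []
    for start in range(0, len(power), fftlen):
        chunk = power[start:start + fftlen]
        peak = max(chunk)
        results.append((sum(chunk), peak, chunk.index(peak)))
    return results


if __name__ == "__main__":
    # Illustrative input only: two chunks of 8 bins, one bin in the second is stronger.
    data = [complex(1, 0)] * 8 + [complex(0, 2)] + [complex(1, 0)] * 7
    for total, peak, where in sum_max(power_spectrum(data), 8):
        print(total, peak, where)
```

The average follows from the sum divided by fftlen, which is presumably what the "average & peak OK" check compares.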


Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: [Split] PowerSpectrum Unit Test
« Reply #197 on: 21 Dec 2010, 11:47:05 am »
My 9800GTX+ on Win 7 x64:


Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.0 GFlops    8.8 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   10.7 GB/s
 PS+SuMx(    32) [OK]    2.8 GFlops   11.5 GB/s
 PS+SuMx(    64) [OK]    4.5 GFlops   18.1 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    2.7   11.8 121.7 [OK]    7.1   31.0 121.7
 PS+SuMx(    16)    4.0   16.5 121.7 [OK]    7.7   32.1 121.7
 PS+SuMx(    32)    4.9   19.9 121.7 [OK]    7.3   29.7 121.7
 PS+SuMx(    64)    6.6   26.7 121.7 [OK]    8.9   35.9 121.7


and on my 128Mb 8400M GS on Vista 32bit:


Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.3 GFlops    1.3 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.2 GB/s
 PS+SuMx(    32) [OK]    0.2 GFlops    0.9 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.4    1.9 121.7 [OK]    0.5    2.1 121.7
 PS+SuMx(    16)    0.4    1.8 121.7 [OK]    0.5    1.9 121.7
 PS+SuMx(    32)    0.4    1.7 121.7 [OK]    0.4    1.8 121.7
 PS+SuMx(    64)    0.5    2.1 121.7 [OK]    0.5    2.2 121.7


Claggy
« Last Edit: 21 Dec 2010, 12:15:54 pm by Claggy »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #198 on: 21 Dec 2010, 11:57:36 am »
LoL, I thought stock code was already G80 optimised, guess I was WRONG.
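
To put a number on that, the ratio of Opt1 to stock throughput can be read straight off the GFlops columns. A small Python sketch using Claggy's 9800 GTX+ figures from reply #197 (the helper name is hypothetical):

```python
def speedup(opt_gflops, stock_gflops):
    """Ratio of optimised to stock throughput; above 1.0 means Opt1 is faster."""
    if stock_gflops <= 0:
        raise ValueError("stock throughput must be positive")
    return opt_gflops / stock_gflops


# size: (stock, Opt1 worst case, Opt1 best case) in GFlops, 9800 GTX+, reply #197
gtx9800 = {
    8: (2.0, 2.7, 7.1),
    16: (2.6, 4.0, 7.7),
    32: (2.8, 4.9, 7.3),
    64: (4.5, 6.6, 8.9),
}

for size, (stock, worst, best) in gtx9800.items():
    print(f"size {size:2d}: worst case {speedup(worst, stock):.2f}x, "
          f"best case {speedup(best, stock):.2f}x")
```

On that card even the worst case lands between about 1.35x and 1.75x stock.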

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #199 on: 21 Dec 2010, 12:11:58 pm »

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.57 +- 0.048 GFlops    2.49 +- 0.24 GB/s
 PS+SuMx(    16) [OK]    0.57 +- 0.048 GFlops    2.39 +- 0.19 GB/s
 PS+SuMx(    32) [OK]    0.49 +- 0.031 GFlops    2.01 +- 0.11 GB/s
 PS+SuMx(    64) [OK]    0.80 +- 0.105 GFlops    3.20 +- 0.41 GB/s


Opt1: 64 thrds/block
                              worst case                           best case
                    GFlps           GB/s          ulps         GFlps          GB/s          ulps
 PS+SuMx(     8)    0.87 +- 0.048   3.92 +- 0.20  121.7 [OK]   1.21 +- 0.03   5.49 +- 0.03  121.7
 PS+SuMx(    16)    0.89 +- 0.19    3.70 +- 0.78  121.7 [OK]   1.20 +- 0      5.00 +- 0     121.7
 PS+SuMx(    32)    0.97 +- 0.048   3.92 +- 0.19  121.7 [OK]   1.10 +- 0      4.60 +- 0     121.7
 PS+SuMx(    64)    1.24 +- 0.11    5.02 +- 0.42  121.7 [OK]   1.41 +- 0.03   5.85 +- 0.05  121.7


Average and standard deviation over 10 runs.
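
The same summary can be produced without a spreadsheet. A minimal Python sketch, assuming the sample (n-1) standard deviation, which is what Excel's STDEV computes; the post does not say which formula was used. The readings below are illustrative values, not the raw data (which was not posted), picked so that they format to the 0.57 +- 0.048 shown in the size 8 stock row.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of per-run readings."""
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical GFlops readings from 10 runs, at the tool's one-decimal resolution.
runs = [0.5, 0.6, 0.6, 0.5, 0.6, 0.6, 0.5, 0.6, 0.6, 0.6]
mean, sd = summarize(runs)
print(f"{mean:.2f} +- {sd:.3f} GFlops")
```

With readings rounded to one decimal, a "+- 0" in the best-case columns just means all ten runs printed the same value.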
« Last Edit: 21 Dec 2010, 12:13:16 pm by Jason G »
The road to hell is paved with good intentions

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #200 on: 21 Dec 2010, 12:14:18 pm »
How did you do ten runs, while collecting data, on 'that thing' in that timeframe ?  magic ?
[ Oh yeah I set the timer tolerances to do that, I forgot  ::)]
« Last Edit: 21 Dec 2010, 12:18:58 pm by Jason G »

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #201 on: 21 Dec 2010, 12:23:27 pm »
Quote
How did you do ten runs, while collecting data, on 'that thing' in that timeframe ?  magic ?
[ Oh yeah I set the timer tolerances to do that, I forgot ::)]

A run takes some 20 seconds - makes some 5 minutes with graceful rounding. Typing the data into Excel and the calculated values back into the post took about half an hour.  :P

timer tolerances?

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #202 on: 21 Dec 2010, 12:27:14 pm »
Quote
timer tolerances?

Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test  ;)
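
A time-budgeted loop of that kind might look like the Python sketch below. This is a guess at the general shape, not the test's actual code: the function name and the injectable clock are assumptions, and only the 0.5 second budget comes from the post.

```python
import time


def run_for_budget(work, budget_s=0.5, clock=time.perf_counter):
    """Call work() repeatedly until budget_s seconds have elapsed.

    Returns (iterations, elapsed_seconds). At least one iteration always runs,
    so a slow device still produces one measurement.
    """
    start = clock()
    iterations = 0
    while True:
        work()
        iterations += 1
        elapsed = clock() - start
        if elapsed >= budget_s:
            return iterations, elapsed


if __name__ == "__main__":
    # Stand-in workload; a real test would launch and synchronise the kernel here.
    n, t = run_for_budget(lambda: sum(range(10_000)))
    print(f"{n} iterations in {t:.3f} s")
```

Throughput is then iterations times work per iteration divided by elapsed time, so a faster card averages over more runs inside the same budget.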

[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention on the worst cases randomising things up a bit  :)
« Last Edit: 21 Dec 2010, 12:32:41 pm by Jason G »

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #203 on: 21 Dec 2010, 12:44:19 pm »
Quote
Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test  ;)

Well, manual data collection works just as well, it's only more tedious.

Quote
[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention on the worst cases randomising things up a bit  :)

I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #204 on: 21 Dec 2010, 12:46:29 pm »
Quote
I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.

Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)  ;)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)
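
The cap described here can be sketched in a few lines of Python. The limit of 30 and the numdatapoints/fftlen bound come from the post; the function name, the parameters, and the reading of that bound as one detection per FFT are assumptions for illustration.

```python
MAX_DETECTIONS = 30  # reporting limit mentioned in the post


def detections_to_transfer(num_datapoints, fftlen, found):
    """Number of detection records worth copying back to the host.

    found is how many detections the GPU flagged. It is bounded by one per FFT
    (num_datapoints // fftlen), and anything past MAX_DETECTIONS is not copied.
    """
    if fftlen <= 0 or num_datapoints < 0 or found < 0:
        raise ValueError("fftlen must be positive; counts must be non-negative")
    num_ffts = num_datapoints // fftlen
    return min(found, num_ffts, MAX_DETECTIONS)


if __name__ == "__main__":
    # Hypothetical sizes: 1048576 data points, fftlen 64.
    print(detections_to_transfer(1048576, 64, 0))      # best case, nothing to copy
    print(detections_to_transfer(1048576, 64, 16384))  # theoretical worst case, capped
```

That is why the best case barely touches the bus while the benchmark's "worst case" full copy overstates what real data would need.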
« Last Edit: 21 Dec 2010, 12:55:25 pm by Jason G »

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #205 on: 21 Dec 2010, 01:04:35 pm »
Quote
Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)  ;)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)

Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #206 on: 21 Dec 2010, 01:08:17 pm »
Quote
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?

Yes.  Actual performance will fall somewhere in between best & worst cases ...  :P ... Though initially I'll be using 'worst case' code for rapid improvements to working prototypes (size 64 is already in field testing in x33); best case code is a glass ceiling to aim for with 'advanced coding'.

[Edit:] size 64 (worst case implementation) provides ~3% performance improvement to 'shorties' on GTX 480

[Edit2:] oh, that was 'old' worst case code, nevermind  ::)
« Last Edit: 21 Dec 2010, 01:12:17 pm by Jason G »

Offline PatrickV2

  • Knight o' The Round Table
  • ***
  • Posts: 139
Re: [Split] PowerSpectrum Unit Test
« Reply #207 on: 21 Dec 2010, 02:04:14 pm »
I re-ran the tests on my rig (Q6600/8GB/8800GTX) under both Win7-64 and WinXP32.

First WinXP32:

Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.6 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   11.1 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    4.3 GFlops   17.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.6   15.8 121.7 [OK]    6.2   27.2 121.7
 PS+SuMx(    16)    4.5   18.8 121.7 [OK]    6.1   25.5 121.7
 PS+SuMx(    32)    4.9   20.1 121.7 [OK]    5.8   23.8 121.7
 PS+SuMx(    64)    6.6   26.5 121.7 [OK]    7.4   30.0 121.7


Then Win7-64:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.1 GFlops    9.0 GB/s
 PS+SuMx(    16) [OK]    2.4 GFlops   10.2 GB/s
 PS+SuMx(    32) [OK]    2.4 GFlops    9.8 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.6 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.4   14.9 121.7 [OK]    6.1   26.8 121.7
 PS+SuMx(    16)    4.2   17.5 121.7 [OK]    6.0   25.3 121.7
 PS+SuMx(    32)    4.6   18.7 121.7 [OK]    5.8   23.7 121.7
 PS+SuMx(    64)    5.9   24.0 121.7 [OK]    7.4   29.8 121.7

As always, hope it helps. ;)

Regards, Patrick.

EDIT: Modified to use no smilies due to the 'cool' smilies in the test-results.

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #208 on: 21 Dec 2010, 02:07:47 pm »
Win7x64 results:


Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.4 GFlops   10.7 GB/s
 PS+SuMx(    16) [OK]    3.1 GFlops   13.0 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.6 GB/s
 PS+SuMx(    64) [OK]    4.0 GFlops   16.1 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.9   21.4 121.7 [OK]   13.1   57.4 121.7
 PS+SuMx(    16)    6.5   27.2 121.7 [OK]   12.3   51.4 121.7
 PS+SuMx(    32)    7.8   31.8 121.7 [OK]   11.9   48.7 121.7
 PS+SuMx(    64)    8.6   34.8 121.7 [OK]   11.6   47.0 121.7
« Last Edit: 21 Dec 2010, 02:15:25 pm by Ghost »

Offline M_M

  • Squire
  • *
  • Posts: 32
Re: [Split] PowerSpectrum Unit Test
« Reply #209 on: 21 Dec 2010, 02:24:52 pm »
GTX460 1GB OC Core=880MHz Mem=2000MHz Win7-64bit

C:\Test>powerspectrumtest7

Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    3.4 GFlops   14.7 GB/s
 PS+SuMx(    16) [OK]    3.5 GFlops   14.7 GB/s
 PS+SuMx(    32) [OK]    2.3 GFlops    9.6 GB/s
 PS+SuMx(    64) [OK]    3.5 GFlops   14.3 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.5   28.4 121.7 [OK]   13.5   59.1 121.7
 PS+SuMx(    16)    7.7   32.3 121.7 [OK]   12.6   52.8 121.7
 PS+SuMx(    32)    8.5   34.8 121.7 [OK]   12.2   49.8 121.7
 PS+SuMx(    64)    9.0   36.3 121.7 [OK]   12.3   49.6 121.7

 
