Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: Jason G on 18 Nov 2010, 07:45:10 am

Title: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 07:45:10 am
.... Is there a CUDA 3.2 app available yet for alpha testing, just to see where the dividing line really is?

No, but I was just playing with a power spectrum kernel unit test built with 3.2 Release that could be sufficient to see which drivers work with 3.2 Release, and which don't (I expect min 260.99 is fine).   The kernels are all 'hard code', so no speed difference should be evident between driver changes.

[ PowerSpectrum Unit Test attached, the provided DLL must be present when executed at a command prompt. ]

Jason

[Edit:] Confirmed requires driver 260.89+ , [Mod] Split off driver thread

[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1:  Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-fermi precision
Mod2:  Fixed, but sadly is slow now, remains at stock accuracy
Mod3:  As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)

[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidy up & ironed out a bug that only manifests on Arkayn's card so far :o.  Could be a smidgen faster.

[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64)  1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
 - Opt1 best & worst cases likely to occur in real life tested.  Worst case should indicate ~same as stock to ~30% improvement (depending on GPU), best case ~1.3-2x stock throughput (depending on GPU etc).  Worst case results are checked for accuracy & flagged if there's a problem.
 - On Integrated GPUs, use mapped/pinned host memory, so on those  worst case should be ~= best case ( and hopefully some margin better than the stock reduction  :-\)
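
The thread-count selection by compute capability listed above can be sketched on the host side as follows. This is only an illustration of the stated mapping, not the test's actual code; the function name `threads_per_block_for` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: pick threads per block for the power spectrum kernel
// from the device's compute capability, following the mapping in the post:
//   1.0-1.2 -> 64 threads, 1.3 -> 128 threads, 2.0 and later -> 256 threads.
int threads_per_block_for(int major, int minor) {
    assert(major >= 1 && minor >= 0);   // CUDA compute capabilities start at 1.0
    if (major >= 2) return 256;         // Fermi and later
    if (minor >= 3) return 128;         // cc 1.3
    return 64;                          // cc 1.0, 1.1, 1.2
}
```

In a CUDA program the two arguments would typically come from the `major` and `minor` fields of `cudaDeviceProp`, as filled in by `cudaGetDeviceProperties`.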

Example output (important numbers: highlighted, Stock, Opt1 )

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.1 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    5.9 GFlops   24.1 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.3 GFlops  177.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.2 GB/s 121.7ulps

Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither, Test5 remains for comparison)

Update: PowerSpectrum(+summax reduction) Test #7
 - completed summax reduction sizes 8 through 64
 - refined Opt1 a little, should be a tad faster for size 64 that was in prior test
 - tidied up test result layout
 - enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
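
For readers unfamiliar with the term, here is a rough CPU sketch of what a 'summax reduction' could look like, assuming it means computing the sum and the peak of each consecutive run of fftlen power values (which is what the "average & peak OK" lines in the output suggest). The `SumMax` struct and `summax_reference` function are hypothetical names, not taken from the test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: one entry per FFT segment.
struct SumMax {
    float sum;        // sum of the powers in the segment
    float max;        // largest power in the segment (the "peak")
    std::size_t pos;  // index of the peak within the segment
};

// CPU reference: reduce each consecutive run of fftlen power values.
// Assumes power.size() is a multiple of fftlen.
std::vector<SumMax> summax_reference(const std::vector<float>& power,
                                     std::size_t fftlen) {
    assert(fftlen > 0 && power.size() % fftlen == 0);
    std::vector<SumMax> out;
    out.reserve(power.size() / fftlen);
    for (std::size_t base = 0; base < power.size(); base += fftlen) {
        SumMax r{0.0f, power[base], 0};
        for (std::size_t i = 0; i < fftlen; ++i) {
            const float p = power[base + i];
            r.sum += p;
            if (p > r.max) { r.max = p; r.pos = i; }
        }
        out.push_back(r);
    }
    return out;
}
```

The average per segment is then `sum / fftlen`. A GPU version would do the same work per segment in parallel, which is why the reduction has to be checked for every fftlen size.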

Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- minimal changes to larger sizes; anything larger than the selected thrds/blk is 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation all [OK] & no Opt1 'worst case' slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- having all sizes it's a longer run, and may require several runs to see if a '[BAD]' will manifest.

Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only

Temporary download location(s):
fast:  http://www.arkayn.us/seti/PowerSpectrumTest9.7z
slow: ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z


Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- can vary, so may need a few runs, just to check stability of result
- Please use DLLs provided with Test#9

Update: @ALL, Thanks! I'm closing this test for now.  It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular).

FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here.  Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come   :)

[Locking thread, Please stay tuned for further Unit Tests!]
Title: Re: Latest nVIDIA_driver and CUDA_Version
Post by: Frizz on 18 Nov 2010, 08:43:49 am
How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.
Title: Re: Latest nVIDIA_driver and CUDA_Version
Post by: Jason G on 18 Nov 2010, 08:44:56 am
What driver ?
Title: Re: Latest nVIDIA_driver and CUDA_Version
Post by: Miep on 18 Nov 2010, 08:48:29 am
How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.

updating from 258.96 to 260.99 solved that hiccup for me
Title: Re: Latest nVIDIA_driver and CUDA_Version
Post by: Richard Haselgrove on 18 Nov 2010, 08:52:28 am
And I've just checked that 260.89 is good enough, too.
Title: Re: PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 09:07:18 am
As a side effect, I'm accumulating a good collection of data that tells me a lot about the different GPU memory subsystems, on different generations (Powerspectrum is a 'memory bound' computation).  Will split this off into its own thread a bit later [Done]

[Edit] In our own thread now, feel free to post results here,  attach, or PM.  I'm getting some very handy information to use this weekend, toward optimisation strategies.

Will try and make some sort of table up once I make some sense of the data.
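
As a side note on why the power spectrum counts as 'memory bound', here is a minimal CPU sketch. It assumes the computation is power = re*re + im*im per complex point, with a cost model of 3 flops against 12 bytes of memory traffic per point. That model is an assumption rather than something taken from the test source, though it is consistent with the GFlops to GB/s ratio in the Test #5 example output in the first post (29.0 / 116.1, roughly 0.25).

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// CPU reference power spectrum: power[i] = re*re + im*im for each complex point.
std::vector<float> power_spectrum(const std::vector<std::complex<float>>& freq) {
    std::vector<float> power(freq.size());
    for (std::size_t i = 0; i < freq.size(); ++i) {
        const float re = freq[i].real();
        const float im = freq[i].imag();
        power[i] = re * re + im * im;   // 2 multiplies + 1 add = 3 flops
    }
    return power;
}

// Assumed cost model per point: 3 flops, one complex float read (8 bytes)
// and one float written (4 bytes).
double flops_per_byte() {
    const double flops = 3.0;
    const double bytes = sizeof(std::complex<float>) + sizeof(float);
    return flops / bytes;
}

// Effective bandwidth in GB/s for `points` processed in `seconds`,
// under the same assumed 12 bytes of traffic per point.
double effective_gb_per_s(std::size_t points, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(points) * 12.0 / seconds / 1e9;
}
```

With so little arithmetic per byte moved, throughput is set almost entirely by how well the kernel's access pattern suits each card's memory subsystem.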
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Frizz on 18 Nov 2010, 10:29:01 am
Device: GeForce GT 240, 1340 MHz clock, 512 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       10.1 GFlops    4.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:        8.6 GFlops    3.4 GB/s 1183.3ulps
     64 threads:       10.1 GFlops    4.1 GB/s 1183.3ulps
    128 threads:       10.1 GFlops    4.0 GB/s 1183.3ulps
    256 threads:       10.1 GFlops    4.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        3.4 GFlops    1.3 GB/s 1183.3ulps
     64 threads:        4.5 GFlops    1.8 GB/s 1183.3ulps
    128 threads:        4.5 GFlops    1.8 GB/s 1183.3ulps
    256 threads:        4.4 GFlops    1.8 GB/s 1183.3ulps


Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 18 Nov 2010, 10:36:49 am
A couple more datapoints, from Windows 7. The 'AMP' in the card model name says it's a factory overclock version.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 10:49:27 am
Thanks both.  Those are the 'stubborn' cards  ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Frizz on 18 Nov 2010, 11:23:24 am
I can't speak for Richard but my card is as stubborn as I am  ;D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 18 Nov 2010, 11:36:31 am
**********
-device 0
**********
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       18.6 GFlops    7.4 GB/s 1183.3ulps
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps
    128 threads:       26.7 GFlops   10.7 GB/s 1183.3ulps
    256 threads:       26.7 GFlops   10.7 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.3 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.2 GFlops    2.9 GB/s 1183.3ulps
    128 threads:       10.6 GFlops    4.2 GB/s 1183.3ulps
    256 threads:       10.7 GFlops    4.3 GB/s 1183.3ulps


**********
-device 1
**********
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.8 GFlops   10.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       17.9 GFlops    7.2 GB/s 1183.3ulps
     64 threads:       26.0 GFlops   10.4 GB/s 1183.3ulps
    128 threads:       26.1 GFlops   10.4 GB/s 1183.3ulps
    256 threads:       24.6 GFlops    9.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.2 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.1 GFlops    2.8 GB/s 1183.3ulps
    128 threads:       10.3 GFlops    4.1 GB/s 1183.3ulps
    256 threads:       10.6 GFlops    4.2 GB/s 1183.3ulps


**********
-device 2
**********
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.4 GFlops   10.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       18.7 GFlops    7.5 GB/s 1183.3ulps
     64 threads:       25.6 GFlops   10.2 GB/s 1183.3ulps
    128 threads:       25.9 GFlops   10.4 GB/s 1183.3ulps
    256 threads:       25.9 GFlops   10.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.2 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.0 GFlops    2.8 GB/s 1183.3ulps
    128 threads:       10.3 GFlops    4.1 GB/s 1183.3ulps
    256 threads:       10.4 GFlops    4.1 GB/s 1183.3ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 11:39:16 am
Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480.  That's something to investigate.  Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]

Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       29.1 GFlops   11.6 GB/s   0.0ulps


GetPowerSpectrum() mod 1:
     32 threads:       17.6 GFlops    7.1 GB/s   0.0ulps
     64 threads:       28.9 GFlops   11.6 GB/s   0.0ulps
    128 threads:       40.5 GFlops   16.2 GB/s   0.0ulps
    256 threads:       44.0 GFlops   17.6 GB/s   0.0ulps


GetPowerSpectrum() mod 2:
     32 threads:       19.3 GFlops    7.7 GB/s   0.0ulps
     64 threads:       38.0 GFlops   15.2 GB/s   0.0ulps
    128 threads:       61.1 GFlops   24.5 GB/s   0.0ulps
    256 threads:       61.4 GFlops   24.6 GB/s   0.0ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 18 Nov 2010, 11:44:39 am
Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480.  That's something to investigate.  Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]

My bad. FAH was running in background. Edited my post with new results.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 11:50:00 am
My bad. FAH was running in background. Edited my post with new results.

Ahh, cheers & LoL... I'm wondering why mod2 doesn't appear to work on those.  ([Later:] ah, probably some shared memory bank conflicts or such, will read into that. )
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 18 Nov 2010, 02:07:05 pm
And on my 465 with 260.99 drivers:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.0 GFlops    6.4 GB/s   0.0ulps


GetPowerSpectrum() mod 1:
     32 threads:        9.8 GFlops    3.9 GB/s   0.0ulps
     64 threads:       15.9 GFlops    6.3 GB/s   0.0ulps
    128 threads:       20.9 GFlops    8.3 GB/s   0.0ulps
    256 threads:       23.1 GFlops    9.2 GB/s   0.0ulps


GetPowerSpectrum() mod 2:
     32 threads:       14.4 GFlops    5.8 GB/s   0.0ulps
     64 threads:       28.4 GFlops   11.4 GB/s   0.0ulps
    128 threads:       33.5 GFlops   13.4 GB/s   0.0ulps
    256 threads:       32.8 GFlops   13.1 GB/s   0.0ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 18 Nov 2010, 02:36:34 pm
Not sure if you're looking for this, but below my results on my 8800GTX, 260.99 drivers:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       17.8 GFlops    7.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       14.2 GFlops    5.7 GB/s 1183.3ulps
     64 threads:       17.8 GFlops    7.1 GB/s 1183.3ulps
    128 threads:       17.8 GFlops    7.1 GB/s 1183.3ulps
    256 threads:       17.6 GFlops    7.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        6.8 GFlops    2.7 GB/s 1183.3ulps
     64 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        9.1 GFlops    3.7 GB/s 1183.3ulps
    256 threads:        8.0 GFlops    3.2 GB/s 1183.3ulps

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 18 Nov 2010, 03:37:36 pm
starting PowerSpectrum2
.
-device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       20.6 GFlops    8.2 GB/s   0.0ulps


GetPowerSpectrum() mod 1:
     32 threads:       12.5 GFlops    5.0 GB/s   0.0ulps
     64 threads:       20.5 GFlops    8.2 GB/s   0.0ulps
    128 threads:       27.6 GFlops   11.0 GB/s   0.0ulps
    256 threads:       29.9 GFlops   12.0 GB/s   0.0ulps


GetPowerSpectrum() mod 2:
     32 threads:       14.4 GFlops    5.8 GB/s   0.0ulps
     64 threads:       28.3 GFlops   11.3 GB/s   0.0ulps
    128 threads:       42.4 GFlops   16.9 GB/s   0.0ulps
    256 threads:       42.5 GFlops   17.0 GB/s   0.0ulps


-device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       20.6 GFlops    8.3 GB/s   0.0ulps


GetPowerSpectrum() mod 1:
     32 threads:       12.6 GFlops    5.0 GB/s   0.0ulps
     64 threads:       20.5 GFlops    8.2 GB/s   0.0ulps
    128 threads:       27.5 GFlops   11.0 GB/s   0.0ulps
    256 threads:       30.1 GFlops   12.0 GB/s   0.0ulps


GetPowerSpectrum() mod 2:
     32 threads:       14.4 GFlops    5.8 GB/s   0.0ulps
     64 threads:       28.4 GFlops   11.4 GB/s   0.0ulps
    128 threads:       42.2 GFlops   16.9 GB/s   0.0ulps
    256 threads:       41.1 GFlops   16.4 GB/s   0.0ulps


.
Done
modify:
@Jason, wondering why you get 20 GFlops more with 256 threads than my GTX470.
Do you have source for me to compile with the 2011XE Compiler?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 18 Nov 2010, 05:45:09 pm
I tried running it on my 460 but the program always crashes on the end of 128/beginning of 256 threads in mod 2.

Never see any results.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 18 Nov 2010, 05:47:34 pm
Here's my 9800GTX+ result, like Richard's 9800GTX+ it's a factory overclocked example, but by XFX:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.1 GFlops    6.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       15.1 GFlops    6.1 GB/s 1183.3ulps
     64 threads:       16.1 GFlops    6.5 GB/s 1183.3ulps
    128 threads:       16.0 GFlops    6.4 GB/s 1183.3ulps
    256 threads:       15.9 GFlops    6.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
     64 threads:        8.2 GFlops    3.3 GB/s 1183.3ulps
    128 threads:        8.3 GFlops    3.3 GB/s 1183.3ulps
    256 threads:        8.1 GFlops    3.2 GB/s 1183.3ulps

Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 18 Nov 2010, 06:11:25 pm
Here's my 128Mb 8400M GS's result, while it's not got enough RAM for Seti, it at least gives you some figures for very slow GPU's:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps
     64 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps
    128 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps
    256 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
     64 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    128 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    256 threads:        0.6 GFlops    0.2 GB/s 1183.3ulps

Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 18 Nov 2010, 06:27:05 pm
run it twice on the ION
~~~~~~~~~~~~~~~~~
starting PowerSpectrum2
.

Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:        1.3 GFlops    0.5 GB/s 1183.3ulps
     64 threads:        1.9 GFlops    0.7 GB/s 1183.3ulps
    128 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps
    256 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        1.0 GFlops    0.4 GB/s 1183.3ulps
     64 threads:        1.0 GFlops    0.4 GB/s 1183.3ulps
    128 threads:        0.9 GFlops    0.4 GB/s 1183.3ulps
    256 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps



Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:        1.3 GFlops    0.5 GB/s 1183.3ulps
     64 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps
    128 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps
    256 threads:        1.9 GFlops    0.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        1.0 GFlops    0.4 GB/s 1183.3ulps
     64 threads:        1.0 GFlops    0.4 GB/s 1183.3ulps
    128 threads:        0.9 GFlops    0.4 GB/s 1183.3ulps
    256 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps


.
Done
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 18 Nov 2010, 07:01:14 pm
This is what I got on my 480's with 260.99

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory
Compiled with CUDA 3020
Stock GetPowerSpectrum<>:
     64 threads:       27.6 GFlops   11.1 GB/s   0.0ulps

GetPowerSpectrum<> mod 1:
     32 threads:       17.5 GFlops   7.0 GB/s    0.0ulps
     64 threads:       27.5 GFlops  11.0 GB/s    0.0ulps
    128 threads:       36.4 GFlops  14.6 GB/s    0.0ulps
    256 threads:       39.6 GFlops  15.8 GB/s    0.0ulps

GetPowerSpectrum<> mod 2:
     32 threads:       20.2 GFlops   8.1 GB/s    0.0ulps
     64 threads:       39.7 GFlops  15.9 GB/s    0.0ulps
    128 threads:       64.1 GFlops  25.6 GB/s    0.0ulps
    256 threads:       64.3 GFlops  25.7 GB/s    0.0ulps

Steve

I edited the data as the first time I was crunching.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 08:15:21 pm
modify:
@Jason, wondering why you get 20 GFlops more with 256 threads than my GTX470.
Do you have source for me to compile with the 2011XE Compiler?

GTX480 has wider memory bus IIRC.  Also they're GPU Kernels Heinz, so CPU host side won't make any difference here (Unless Intel started messing with Cuda binaries  ;) ) After some work, this will lead to a set of optimisation strategies for other kernels throughout, rather than 1 specific piece of useful code

I'm looking at this (almost pure) memory bound computation (powerspectrum), as a way to see what optimisation strategies work on different cards with that type of operation.  This way I can learn to make kernels that choose the best memory access strategy internally by compute capability.

So far it looks like Mod2 is winning on Fermi (apart from whatever is causing arkayn's problems)  Prior Gen 200 series seem to like Mod1 better, so I suspect there is some memory pattern issue for me to look at in Mod2 with respect to prior gen cards.  Earlier G80-G92 cards could be even more memory subsystem constrained, or need even more special treatment of access patterns, by the looks of things.

@Arkayn, not sure what would cause that, but on my 480 that's where things start to get 'a bit warm'  ... Is there a possibility of temperature issues ? Try cranking the fan perhaps.  [Edit:]  Probably pushing the 2.1 (GTX 460) architecture limits in Mod2.  I'll look into that for mod3.

Steve's WINNING! (Just  ;) ) -

Plenty of data for me to chew on.  Will be thinking about mod3.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 08:18:39 pm
Here's my 128Mb 8400M GS's result, while it's not got enough RAM for Seti, it at least gives you some figures for very slow GPU's:

Nice!  Another stubborn GPU  :D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 08:37:59 pm
Not sure if you're looking for this, but below my results on my 8800GTX, 260.99 drivers:
Exactly what I'm looking for, thanks.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 18 Nov 2010, 08:38:12 pm
Device: GeForce 9800 GT, 1750 MHz clock, 500 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       13.6 GFlops    5.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       12.1 GFlops    4.9 GB/s 1183.3ulps
     64 threads:       13.7 GFlops    5.5 GB/s 1183.3ulps
    128 threads:       13.5 GFlops    5.4 GB/s 1183.3ulps
    256 threads:       13.4 GFlops    5.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.3 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.0 GFlops    2.8 GB/s 1183.3ulps
    128 threads:        7.1 GFlops    2.8 GB/s 1183.3ulps
    256 threads:        6.8 GFlops    2.7 GB/s 1183.3ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 08:48:52 pm
If anyone's wondering what this figure is:

... 1183.3ulps ...

It's a measure of the precision against a CPU double precision reference power spectrum.

Fermis get 0ulps total deviation (most accurate) because they default to IEEE-754 compliance, whereas earlier gen consistently get 1183.3 because they use a fast single precision implementation by default.

I can either use special intrinsic functions on the older cards to force compliance, at a speed penalty, or allow the Fermis to use the faster (less accurate) computation.  Will see.  1183.3 'Units of Least Precision' isn't much total deviation from double precision reference over the 1048576 point data set used in multibeam.

an ulp is defined here as:
Quote
const float ulp =  1.192092896e-07f;
... about  0.00000012 ... and there'd be some of that amount of variation from double precision CPU reference scattered throughout the dataset.
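
To make the figure more concrete, here is one way such a total could be computed. The exact metric used by the unit test is not shown in this thread, so the formula below (per-point error relative to the reference magnitude, expressed in ulps and summed over the data set) is an assumption, and `total_ulp_deviation` is a hypothetical name. Only the `ulp` constant comes from the post.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const float ulp = 1.192092896e-07f;  // value quoted in the post (equals FLT_EPSILON)

// Assumed metric: for each point, |result - reference| relative to the
// reference magnitude, expressed in ulps, summed over the whole data set.
double total_ulp_deviation(const std::vector<float>& result,
                           const std::vector<double>& reference) {
    assert(result.size() == reference.size());
    double total = 0.0;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const double err =
            std::fabs(static_cast<double>(result[i]) - reference[i]);
        // Fall back to an absolute scale of 1.0 when the reference is exactly zero.
        const double scale =
            (reference[i] != 0.0) ? std::fabs(reference[i]) : 1.0;
        total += err / (static_cast<double>(ulp) * scale);
    }
    return total;
}
```

Under this assumed metric, a total of 1183.3 over 1048576 points would average out to roughly a thousandth of an ulp per point.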

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 09:01:58 pm
@Arkayn:  I looked through some results I have, and I have a GTX460 set that ran to completion @ stock speeds (Using driver 263.06).  Might be pushing the memory OC a bit on yours ?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 18 Nov 2010, 10:05:35 pm
I think it is at 800/1600 right now, runs Collatz just fine at that speed.

I just took it down to stock speed as well as the lowest setting that Afterburner allowed and it still crashed the program.

This is on a XP-64 pro machine though.

Driver is the 263.06, do I need the toolkit installed as well?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 18 Nov 2010, 10:41:16 pm
Driver is the 263.06, do I need the toolkit installed as well?

Nope, It's definitely something weird.  Bear in mind that those upper kernels are pushing Fermi's memory subsystem harder than any boinc science app has to date that I know of, so I doubt Collatz or any other existing app would be a fair comparison ( except maybe Furmark, which is just a savage thing to do to a graphics card )

If it runs this at stock OK, but not at 800/1600, then it might be Collatz stable, but is unlikely to be future X series stable.  My current feeling is that the memory frequency is the culprit, rather than the core.

(If it doesn't run correctly at stock either, then more guessing to do  ;) )

[Later:]  At this stage I'm assuming some sort of bug in Mod2, so don't go pulling things to bits just yet  ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 19 Nov 2010, 06:06:32 am
A source of a subtle stock code precision variation on pre-Fermi cards found, will test patch mod1 & mod3 & leave stock alone,  probably fix mod2 but leave precision un-fixed as a test (fixing mod2 will make it slower anyway)

[A Bit Later:] Updated first post:

Quote
[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1:  Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-fermi precision
Mod2:  Fixed, but sadly is slow now, remains at stock accuracy
Mod3:  As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)

Some variation on #1 and/or #3 may need to end up contributing to a stock update down the road, due to the (very tiny) stock code precision mismatch on CPU Vs PreFermi Vs Fermi.  The issue could be a contributor to the 'dodgy Gaussians', time will tell whether that's the case or not.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 19 Nov 2010, 11:40:16 am
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.5 GFlops   10.6 GB/s 121.7ulps
    128 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    256 threads:       26.7 GFlops   10.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
     64 threads:        6.3 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    256 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.5 GFlops   10.6 GB/s 121.7ulps
    128 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    256 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    512 threads:       26.6 GFlops   10.7 GB/s 121.7ulps
   1024 threads: N/A

-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.1 GFlops   10.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.4 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.1 GFlops   10.4 GB/s 121.7ulps
    128 threads:       26.3 GFlops   10.5 GB/s 121.7ulps
    256 threads:       26.4 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.1 GFlops    2.4 GB/s 1183.3ulps
     64 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    256 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       25.9 GFlops   10.3 GB/s 121.7ulps
    128 threads:       26.0 GFlops   10.4 GB/s 121.7ulps
    256 threads:       26.4 GFlops   10.6 GB/s 121.7ulps
    512 threads:       26.4 GFlops   10.6 GB/s 121.7ulps
   1024 threads: N/A

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.5 GFlops   10.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
     64 threads:       25.6 GFlops   10.2 GB/s 121.7ulps
    128 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    256 threads:       25.9 GFlops   10.4 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        5.9 GFlops    2.4 GB/s 1183.3ulps
     64 threads:        6.1 GFlops    2.4 GB/s 1183.3ulps
    128 threads:        6.0 GFlops    2.4 GB/s 1183.3ulps
    256 threads:        5.9 GFlops    2.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
     64 threads:       25.6 GFlops   10.2 GB/s 121.7ulps
    128 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    256 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    512 threads:       25.8 GFlops   10.3 GB/s 121.7ulps
   1024 threads: N/A
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 19 Nov 2010, 11:42:05 am
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       14.7 GFlops    5.9 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in a
     32 threads:        8.1 GFlops    3.2 GB/s 121.7ulps
     64 threads:       14.5 GFlops    5.8 GB/s 121.7ulps
    128 threads:       22.2 GFlops    8.9 GB/s 121.7ulps
    256 threads:       26.2 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        9.4 GFlops    3.8 GB/s   0.0ulps
     64 threads:       12.2 GFlops    4.9 GB/s   0.0ulps
    128 threads:       14.7 GFlops    5.9 GB/s   0.0ulps
    256 threads:       14.3 GFlops    5.7 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split lo
     32 threads:        7.6 GFlops    3.0 GB/s 121.7ulps
     64 threads:       14.0 GFlops    5.6 GB/s 121.7ulps
    128 threads:       21.5 GFlops    8.6 GB/s 121.7ulps
    256 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    512 threads:       20.0 GFlops    8.0 GB/s 121.7ulps
   1024 threads:       17.5 GFlops    7.0 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 19 Nov 2010, 11:56:46 am
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        4.6 GFlops    1.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    128 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps
     64 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    128 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    256 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        3.0 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.4 GFlops    1.8 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    512 threads:        3.3 GFlops    1.3 GB/s 121.7ulps
   1024 threads: N/A


Oh, look, I'm faster than an ION...
And look how horrible even mod3 is a whole 5% slower than stock. That's 4 minutes on a 90' task. Which means I'd diminish throughput by one task per 4-5 days. Simply outrageous.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 19 Nov 2010, 12:08:27 pm
LoL, don't worry, we'll put a crappy stock codepath in just for you  ;)

[Edit:] I'm leaning toward the simpler Mod1 kernel for the rest of us.  On the Fermis at least there is some cache control to play with yet, but then the denser threadcount of Mod3, at little cost, may allow more active kernels to fit on the Fermi GPU concurrently... Hmmm....
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 19 Nov 2010, 01:22:41 pm
My 9800GTX+'s rerun:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.2 GFlops    6.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
     64 threads:       16.2 GFlops    6.5 GB/s 121.7ulps
    128 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    256 threads:       15.8 GFlops    6.3 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        2.7 GFlops    1.1 GB/s 1183.3ulps
     64 threads:        2.6 GFlops    1.1 GB/s 1183.3ulps
    128 threads:        2.6 GFlops    1.1 GB/s 1183.3ulps
    256 threads:        2.5 GFlops    1.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
     64 threads:       16.2 GFlops    6.5 GB/s 121.7ulps
    128 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    256 threads:       15.9 GFlops    6.3 GB/s 121.7ulps
    512 threads:       15.1 GFlops    6.0 GB/s 121.7ulps
   1024 threads: N/A

Claggy

Edit: and my 128Mb 8400M GS:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
     64 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    128 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    256 threads:        1.2 GFlops    0.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
     64 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
    128 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
    256 threads:        0.2 GFlops    0.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
     64 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    128 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    256 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    512 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
   1024 threads: N/A
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 19 Nov 2010, 01:30:12 pm
Very strange that both the 9800GTX+ and GTX260 seem to be faster than the GTX460, since in every game and benchmark the GTX460 wins... something's wrong...

Also, what is the "clock" measurement, as displayed in this test? Is it a shader clock? If it is, why is it showing just 810MHz for me, it should be much higher.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 19 Nov 2010, 01:36:13 pm
Looks like 256 gets best performance out of fermis with no or only little loss for smaller cards.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 19 Nov 2010, 01:46:12 pm
Very strange that both the 9800GTX+ and GTX260 seem to be faster than the GTX460, since in every game and benchmark the GTX460 wins... something's wrong...

Also, what is the "clock" measurement, as displayed in this test? Is it a shader clock? If it is, why is it showing just 810MHz for me, it should be much higher.

The clock rate is just what the driver/library reports, which is some fixed number & doesn't measure any hardware (or mean much other than some general indication of the original core spec).

As far as GTX 260 vs 9800GTX+ vs GTX 460 goes, quite right  ;) but not strange at all. This is a 'memory bound' kernel, almost purely, rather than 'compute bound'.  That makes it not overly dependent on the processing speed of the GPU, but instead on the specific memory implementation, the clocks & quality of the RAM chips used, as well as the kernel tweaks I've been trying out.

So for that reason, this should be taken as a comparison of memory bound operations on different cards, and relative memory subsystem performance of the cards with respect to kernel tweaking, not a guide to GPU compute performance .... as there simply is very little to compute in a powerspectrum at all.

The goals at this time involve isolating effective strategies at shovelling data in and out of the GPU, rather than what's going on inside.... That comes later with some more meaty (compute intensive) kernels.

Jason
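To illustrate the 'memory bound' point in the post above, here is a minimal Python sketch. It assumes the power spectrum of each complex FFT bin is simply re*re + im*im in single precision; the kernel source is not shown in this thread, so the function names and the flop/byte counts below are illustrative assumptions rather than the application's actual accounting.

```python
# Illustrative CPU reference, assuming power[i] = re[i]^2 + im[i]^2.

def power_spectrum(samples):
    """samples: list of (re, im) pairs -> list of re*re + im*im."""
    return [re * re + im * im for re, im in samples]

# Assumed cost model per output point (single precision):
FLOPS_PER_POINT = 3        # two multiplies + one add
BYTES_PER_POINT = 8 + 4    # read one complex (2 x 4 bytes), write one float

def flops_per_byte():
    """Arithmetic intensity under the assumed cost model."""
    return FLOPS_PER_POINT / BYTES_PER_POINT

if __name__ == "__main__":
    print(power_spectrum([(3.0, 4.0), (1.0, -1.0)]))
    print(flops_per_byte())
```

With so little arithmetic per byte moved, throughput would be expected to track how quickly the memory subsystem can feed the kernel, which fits cards ranking differently here than in games. The GFlops and GB/s columns printed by the unit test use whatever accounting its author chose, so they will not necessarily match this count.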
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 19 Nov 2010, 01:51:11 pm
Looks like 256 gets best performance out of fermis with no or only little loss for smaller cards.
  Yes, it's looking not bad.  I can readily embed a couple of codepaths now, as the drivers have their own built-in dispatch ( YaY ).  To me that means we can probably have our cake & eat it too, but it is just a matter of running around picking up all the crumbs & sticking them together first.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 19 Nov 2010, 01:54:14 pm
And on the 465:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.0 GFlops    6.4 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        9.8 GFlops    3.9 GB/s 121.7ulps
     64 threads:       15.8 GFlops    6.3 GB/s 121.7ulps
    128 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    256 threads:       23.1 GFlops    9.2 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:       10.8 GFlops    4.3 GB/s   0.0ulps
     64 threads:       13.2 GFlops    5.3 GB/s   0.0ulps
    128 threads:       13.3 GFlops    5.3 GB/s   0.0ulps
    256 threads:       12.1 GFlops    4.9 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        9.4 GFlops    3.7 GB/s 121.7ulps
     64 threads:       15.3 GFlops    6.1 GB/s 121.7ulps
    128 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    256 threads:       20.6 GFlops    8.3 GB/s 121.7ulps
    512 threads:       20.6 GFlops    8.2 GB/s 121.7ulps
   1024 threads:       18.6 GFlops    7.4 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 19 Nov 2010, 02:08:43 pm
Cheers,
   Will have to test the kernel concurrency next ( launch 2 - 16 powerspectrums at the same time ). No idea how much, if any, overall speed improvement might be achievable with that, but it needs testing.  I'll keep stock & all 3 mods in play for that, since one may 'pack' better than the others (smaller thread counts might pass the larger ones in performance if executing multiple on the same multiprocessor).
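One rough way to reason about how different thread counts 'pack' onto a multiprocessor is to count how many blocks can be resident at once. The sketch below is a simplification: it uses only the per-multiprocessor thread and block limits published in NVIDIA's CUDA Programming Guide for these compute capabilities, and it ignores register and shared-memory pressure, which real kernels are also subject to. The table and function names are made up for illustration.

```python
# Per compute capability:
# (max threads per block, max resident threads per SM, max resident blocks per SM)
LIMITS = {
    "1.0": (512, 768, 8), "1.1": (512, 768, 8),
    "1.2": (512, 1024, 8), "1.3": (512, 1024, 8),
    "2.0": (1024, 1536, 8), "2.1": (1024, 1536, 8),
}

def resident_threads(cc, threads_per_block):
    """Threads resident on one multiprocessor, or None if the block is too big."""
    max_block, max_threads, max_blocks = LIMITS[cc]
    if threads_per_block > max_block:
        return None  # such a launch is not possible on this device
    blocks = min(max_blocks, max_threads // threads_per_block)
    return blocks * threads_per_block

def occupancy(cc, threads_per_block):
    """Fraction of the multiprocessor's thread slots that can be filled."""
    resident = resident_threads(cc, threads_per_block)
    return None if resident is None else resident / LIMITS[cc][1]

if __name__ == "__main__":
    for tpb in (32, 64, 128, 256, 512, 1024):
        print(tpb, occupancy("1.1", tpb), occupancy("2.0", tpb))
```

Under these assumptions a 2.x device only fills all of its thread slots at 256 or 512 threads per block, and a 1024-thread block cannot launch at all on 1.x hardware. That is at least consistent with the `N/A` rows and the 256-thread peak on Fermi cards reported in this thread, though it does not by itself explain the measured numbers.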
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 19 Nov 2010, 02:09:58 pm
@Ghost: Didn't you post about 50% higher results for the GTX465 yesterday? Why the difference? Did you change something? Drivers?

Or are there 2 different versions of PowerSpectrum floating around?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 19 Nov 2010, 02:11:28 pm
Or are there 2 different versions of PowerSpectrum floating around?

Check the first post, for the updated build & notes.  The Mod2 kernel was doing suspect things, so I've knobbled it (for now).

[I see you used the newer build yourself, so yes, mod2 numbers will be lower than yesterday ]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 19 Nov 2010, 04:44:41 pm
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       27.7 GFlops  11.1 GB/s      0.0ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       17.4 GFlops   7.0 GB/s    121.7ulps
     64 threads:       27.5 GFlops  11.0 GB/s    121.7ulps
    128 threads:       36.4 GFlops  14.5 GB/s    121.7ulps
    256 threads:       39.6 GFlops  15.8 GB/s    121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:       18.9 GFlops   7.6 GB/s      0.0ulps
     64 threads:       23.1 GFlops   9.2 GB/s      0.0ulps
    128 threads:       24.1 GFlops   9.6 GB/s      0.0ulps
    256 threads:       22.7 GFlops   9.1 GB/s      0.0ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       16.7 GFlops   6.7 GB/s    121.7ulps
     64 threads:       26.9 GFlops  10.8 GB/s    121.7ulps
    128 threads:       36.0 GFlops  14.4 GB/s    121.7ulps
    256 threads:       34.9 GFlops  13.9 GB/s    121.7ulps
    512 threads:       34.7 GFlops  13.9 GB/s    121.7ulps
   1024 threads:       33.5 GFlops  13.4 GB/s    121.7ulps


Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 19 Nov 2010, 07:46:31 pm
Got through mod 2 just fine, now it crashes on mod 3 512 threads.

I even set the clocks to 505/1010/1350 just to check.

Also crashes at 800/1600/1800
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 12:49:52 am
mmm, don't know why, weird.  Will look at mod3's differences to mod2 (not much).  Maybe some sort of driver bug ? It runs on XP32 here, but that's only a 260, not a Fermi.
I'd try a 263.06 driver clean install & see if that helps.

Can anyone else report crashing out on Mod3 ?  Looks like Mod1 (256 thread) will be the useful technique on Fermi cards anyway, but if there is some issue with Mod3 it'd be nice to find & fix for a fair comparison.

[A bit later:] Might have found something, will try adjusting Mod3 & update later.    @arkayn:  :o why is your card the only one that tells me when I do something wrong?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 01:56:32 am
Updated first post:
Quote
[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidy up & ironed out a bug that only manifests on Arkayn's card so far :o.  Could be a smidgen faster.

Thanks Arkayn for picking up my bugs.  Still no idea why yours is extra fussy, but it's very handy at the moment.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 20 Nov 2010, 03:15:02 am
Mod3 performance improved in the latest PS build...

Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       14.7 GFlops    5.9 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        8.2 GFlops    3.3 GB/s 121.7ulps
     64 threads:       14.6 GFlops    5.8 GB/s 121.7ulps
    128 threads:       22.3 GFlops    8.9 GB/s 121.7ulps
    256 threads:       26.2 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        9.4 GFlops    3.8 GB/s   0.0ulps
     64 threads:       12.2 GFlops    4.9 GB/s   0.0ulps
    128 threads:       14.7 GFlops    5.9 GB/s   0.0ulps
    256 threads:       14.3 GFlops    5.7 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        8.2 GFlops    3.3 GB/s 121.7ulps
     64 threads:       14.7 GFlops    5.9 GB/s 121.7ulps
    128 threads:       22.3 GFlops    8.9 GB/s 121.7ulps
    256 threads:       26.1 GFlops   10.4 GB/s 121.7ulps
    512 threads:       25.7 GFlops   10.3 GB/s 121.7ulps
   1024 threads:       18.3 GFlops    7.3 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 03:21:06 am
hehe thanks. 460 with stock code is starting to look a bit anaemic, around all those 20+ figures
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 20 Nov 2010, 04:27:52 am
C:\ap_j>cd g_fft
Stopping Boinc...
starting PowerSpectrum4.exe
.

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       20.6 GFlops    8.2 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       12.5 GFlops    5.0 GB/s 121.7ulps
     64 threads:       20.5 GFlops    8.2 GB/s 121.7ulps
    128 threads:       27.6 GFlops   11.0 GB/s 121.7ulps
    256 threads:       29.9 GFlops   12.0 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:       13.5 GFlops    5.4 GB/s   0.0ulps
     64 threads:       16.7 GFlops    6.7 GB/s   0.0ulps
    128 threads:       17.2 GFlops    6.9 GB/s   0.0ulps
    256 threads:       15.7 GFlops    6.3 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       12.6 GFlops    5.0 GB/s 121.7ulps
     64 threads:       20.6 GFlops    8.2 GB/s 121.7ulps
    128 threads:       27.5 GFlops   11.0 GB/s 121.7ulps
    256 threads:       30.0 GFlops   12.0 GB/s 121.7ulps
    512 threads:       29.7 GFlops   11.9 GB/s 121.7ulps
   1024 threads:       25.6 GFlops   10.2 GB/s 121.7ulps


.
Done
Restarting Boinc...
Press any key to continue . . .

heinz
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 20 Nov 2010, 04:51:29 am
Test #4 results on my 465:


Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       16.0 GFlops    6.4 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        9.8 GFlops    3.9 GB/s 121.7ulps
     64 threads:       15.9 GFlops    6.3 GB/s 121.7ulps
    128 threads:       21.0 GFlops    8.4 GB/s 121.7ulps
    256 threads:       23.1 GFlops    9.2 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:       10.7 GFlops    4.3 GB/s   0.0ulps
     64 threads:       13.1 GFlops    5.2 GB/s   0.0ulps
    128 threads:       13.3 GFlops    5.3 GB/s   0.0ulps
    256 threads:       12.1 GFlops    4.8 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        9.8 GFlops    3.9 GB/s 121.7ulps
     64 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    128 threads:       21.0 GFlops    8.4 GB/s 121.7ulps
    256 threads:       23.1 GFlops    9.2 GB/s 121.7ulps
    512 threads:       22.9 GFlops    9.1 GB/s 121.7ulps
   1024 threads:       19.5 GFlops    7.8 GB/s 121.7ulps

Edit: Corrected figures. The card was running downclocked in the previous test (no tasks); stock 465 speeds are now shown.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 20 Nov 2010, 04:54:38 am
Mod1 & Mod3 256 threads seems to suit Fermi the best...
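To put a number on that, here is a small sketch that divides the 256-thread Mod3 figure by the stock figure for the three Test #4 runs posted just above (GTX 460, GTX 470, GTX 465). The dictionary layout and the function name are only for illustration; the GFlops values are copied from those posts.

```python
# card -> (stock GFlops at 64 threads, Mod3 GFlops at 256 threads)
RESULTS = {
    "GTX 460 (M_M)": (14.7, 26.1),
    "GTX 470 (_heinz)": (20.6, 30.0),
    "GTX 465 (Ghost0210)": (16.0, 23.1),
}

def speedup(stock, modded):
    """Throughput ratio of the modified kernel over stock."""
    return modded / stock

for card, (stock, mod3) in RESULTS.items():
    print(f"{card}: {speedup(stock, mod3):.2f}x stock")
```

On these three runs the ratio comes out between roughly 1.4x and 1.8x, for this one memory-bound kernel only.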
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Frizz on 20 Nov 2010, 06:19:24 am
Windows XP 32 seems to be faster than Windows 7 64.
I also noticed that for AP. For both Nvidia and AMD.



Windows 7 64

Device: GeForce GTX 460, 1451 MHz clock, 1024 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       12.7 GFlops    5.1 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        7.1 GFlops    2.8 GB/s 121.7ulps
     64 threads:       12.6 GFlops    5.0 GB/s 121.7ulps
    128 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
    256 threads:       22.4 GFlops    9.0 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        8.0 GFlops    3.2 GB/s   0.0ulps
     64 threads:       10.4 GFlops    4.2 GB/s   0.0ulps
    128 threads:       12.5 GFlops    5.0 GB/s   0.0ulps
    256 threads:       12.3 GFlops    4.9 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.2 GFlops    2.9 GB/s 121.7ulps
     64 threads:       12.7 GFlops    5.1 GB/s 121.7ulps
    128 threads:       18.8 GFlops    7.5 GB/s 121.7ulps
    256 threads:       22.4 GFlops    9.0 GB/s 121.7ulps
    512 threads:       21.9 GFlops    8.8 GB/s 121.7ulps
   1024 threads:       15.6 GFlops    6.2 GB/s 121.7ulps


================================================

Windows XP 32

Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       13.2 GFlops    5.3 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        7.3 GFlops    2.9 GB/s 121.7ulps
     64 threads:       13.1 GFlops    5.2 GB/s 121.7ulps
    128 threads:       19.8 GFlops    7.9 GB/s 121.7ulps
    256 threads:       23.5 GFlops    9.4 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        8.4 GFlops    3.3 GB/s   0.0ulps
     64 threads:       10.9 GFlops    4.4 GB/s   0.0ulps
    128 threads:       13.0 GFlops    5.2 GB/s   0.0ulps
    256 threads:       12.7 GFlops    5.1 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.4 GFlops    3.0 GB/s 121.7ulps
     64 threads:       13.2 GFlops    5.3 GB/s 121.7ulps
    128 threads:       19.9 GFlops    8.0 GB/s 121.7ulps
    256 threads:       23.6 GFlops    9.5 GB/s 121.7ulps
    512 threads:       23.2 GFlops    9.3 GB/s 121.7ulps
   1024 threads:       16.2 GFlops    6.5 GB/s 121.7ulps


Title: Re: [Split] PowerSpectrum Unit Test
Post by: MarkJ on 20 Nov 2010, 06:38:13 am
I ran on all the different cards on the farm:

First up, the GT240 machine (Win7 x64) has 3 cards of the GDDR5 variety. Device 0 is slightly slower than 1 and 2, although they are all the same brand/model. Output is from device 0.

Device: GeForce GT 240, 1340 MHz clock, 475 MB memory.
Compute capability 1.2
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:        9.9 GFlops    4.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        8.5 GFlops    3.4 GB/s 121.7ulps
     64 threads:       10.1 GFlops    4.0 GB/s 121.7ulps
    128 threads:       10.0 GFlops    4.0 GB/s 121.7ulps
    256 threads:       10.0 GFlops    4.0 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        2.1 GFlops    0.8 GB/s 1183.3ulps
     64 threads:        2.1 GFlops    0.8 GB/s 1183.3ulps
    128 threads:        2.1 GFlops    0.9 GB/s 1183.3ulps
    256 threads:        2.0 GFlops    0.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        8.8 GFlops    3.5 GB/s 121.7ulps
     64 threads:       10.1 GFlops    4.0 GB/s 121.7ulps
    128 threads:       10.0 GFlops    4.0 GB/s 121.7ulps
    256 threads:       10.0 GFlops    4.0 GB/s 121.7ulps
    512 threads:       10.0 GFlops    4.0 GB/s 121.7ulps
   1024 threads: N/A


*******************************************

Next we have a GTX275 (win7 x64):

Device: GeForce GTX 275, 1404 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       27.1 GFlops   10.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       17.1 GFlops    6.8 GB/s 121.7ulps
     64 threads:       27.1 GFlops   10.8 GB/s 121.7ulps
    128 threads:       27.3 GFlops   10.9 GB/s 121.7ulps
    256 threads:       27.3 GFlops   10.9 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
     64 threads:        6.3 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.0 GFlops    2.4 GB/s 1183.3ulps
    256 threads:        6.0 GFlops    2.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       17.1 GFlops    6.9 GB/s 121.7ulps
     64 threads:       27.1 GFlops   10.8 GB/s 121.7ulps
    128 threads:       27.4 GFlops   11.0 GB/s 121.7ulps
    256 threads:       27.2 GFlops   10.9 GB/s 121.7ulps
    512 threads:       27.3 GFlops   10.9 GB/s 121.7ulps
   1024 threads: N/A


*******************************************

Next a GTX295. Yeah, I know various people have run these. Win7 x64 again

Device: GeForce GTX 295, 1242 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       24.2 GFlops    9.7 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       15.6 GFlops    6.3 GB/s 121.7ulps
     64 threads:       24.6 GFlops    9.8 GB/s 121.7ulps
    128 threads:       24.8 GFlops    9.9 GB/s 121.7ulps
    256 threads:       24.7 GFlops    9.9 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        5.6 GFlops    2.2 GB/s 1183.3ulps
     64 threads:        5.7 GFlops    2.3 GB/s 1183.3ulps
    128 threads:        5.5 GFlops    2.2 GB/s 1183.3ulps
    256 threads:        5.4 GFlops    2.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       15.6 GFlops    6.3 GB/s 121.7ulps
     64 threads:       24.6 GFlops    9.8 GB/s 121.7ulps
    128 threads:       24.8 GFlops    9.9 GB/s 121.7ulps
    256 threads:       24.7 GFlops    9.9 GB/s 121.7ulps
    512 threads:       24.7 GFlops    9.9 GB/s 121.7ulps
   1024 threads: N/A


*******************************************

Then a GTX460 (factory OC'ed version from EVGA), once again under Win7 x64:

Device: GeForce GTX 460, 810 MHz clock, 738 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       12.0 GFlops    4.8 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        6.9 GFlops    2.8 GB/s 121.7ulps
     64 threads:       12.0 GFlops    4.8 GB/s 121.7ulps
    128 threads:       17.4 GFlops    6.9 GB/s 121.7ulps
    256 threads:       19.1 GFlops    7.6 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        7.6 GFlops    3.0 GB/s   0.0ulps
     64 threads:       10.0 GFlops    4.0 GB/s   0.0ulps
    128 threads:       11.9 GFlops    4.8 GB/s   0.0ulps
    256 threads:       11.7 GFlops    4.7 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.0 GFlops    2.8 GB/s 121.7ulps
     64 threads:       12.1 GFlops    4.8 GB/s 121.7ulps
    128 threads:       17.4 GFlops    6.9 GB/s 121.7ulps
    256 threads:       19.1 GFlops    7.7 GB/s 121.7ulps
    512 threads:       18.8 GFlops    7.5 GB/s 121.7ulps
   1024 threads:       14.3 GFlops    5.7 GB/s 121.7ulps


*******************************************

And lastly, just for comparison, the same brand/model factory OC'ed GTX460 but under WinXP:

Device: GeForce GTX 460, 1350 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       12.1 GFlops    4.8 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        6.9 GFlops    2.8 GB/s 121.7ulps
     64 threads:       12.0 GFlops    4.8 GB/s 121.7ulps
    128 threads:       17.4 GFlops    7.0 GB/s 121.7ulps
    256 threads:       19.1 GFlops    7.6 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        7.6 GFlops    3.0 GB/s   0.0ulps
     64 threads:       10.0 GFlops    4.0 GB/s   0.0ulps
    128 threads:       11.9 GFlops    4.8 GB/s   0.0ulps
    256 threads:       11.7 GFlops    4.7 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.0 GFlops    2.8 GB/s 121.7ulps
     64 threads:       12.1 GFlops    4.8 GB/s 121.7ulps
    128 threads:       17.4 GFlops    7.0 GB/s 121.7ulps
    256 threads:       19.1 GFlops    7.7 GB/s 121.7ulps
    512 threads:       18.9 GFlops    7.5 GB/s 121.7ulps
   1024 threads:       14.3 GFlops    5.7 GB/s 121.7ulps


Cheers,
MarkJ
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 20 Nov 2010, 07:27:55 am
Busy thread and a lot happening here. My respect. I re-ran the version 4 benchmark again on:

Win7-64/8GB/8800GTX/260.99 drivers:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       17.8 GFlops    7.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       14.0 GFlops    5.6 GB/s 121.7ulps
     64 threads:       17.8 GFlops    7.1 GB/s 121.7ulps
    128 threads:       17.8 GFlops    7.1 GB/s 121.7ulps
    256 threads:       17.6 GFlops    7.0 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        2.9 GFlops    1.1 GB/s 1183.3ulps
     64 threads:        2.9 GFlops    1.2 GB/s 1183.3ulps
    128 threads:        2.9 GFlops    1.1 GB/s 1183.3ulps
    256 threads:        2.9 GFlops    1.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       14.6 GFlops    5.8 GB/s 121.7ulps
     64 threads:       17.9 GFlops    7.2 GB/s 121.7ulps
    128 threads:       17.7 GFlops    7.1 GB/s 121.7ulps
    256 threads:       17.5 GFlops    7.0 GB/s 121.7ulps
    512 threads:       16.1 GFlops    6.4 GB/s 121.7ulps
   1024 threads: N/A


EDIT: I still have WinXP32 installed on another HD of this machine; are you interested in a run of your tool under that OS?

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 07:52:23 am
EDIT: I still have WinXP32 installed on another HD of this machine; are you interested in a run of your tool under that OS?

Yes please.  The difference picked up earlier (Thanks Frizz)  between XP32 & XP64 was interesting ( with stock, around 10% advantage to XP32, reduced to ~5% with Mod3 ) .    I've little doubt XP32 has a similar advantage over Win7x64, due to the simpler driver model, but it'd be nice to confirm if the mods close that gap a bit too.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 08:02:51 am
I ran on all the different cards on the farm:

First up, the GT240 machine (Win7 x64) has 3 cards of the GDDR5 variety. Device 0 is slightly slower than 1 and 2, although they are all the same brand/model. Output is from device 0.

Device: GeForce GT 240, 1340 MHz clock, 475 MB memory....

Nice to be edging out stock on that stubborn card.  With the rest of your results it's starting to paint a picture that might be easy to handle:

By compute capability:
  2.0 & 2.1: Mod3 with 256 threads wins (significant boost)
  1.3: Mod3 with 128 threads (very small boost)
  1.0-1.2: Mod3 with 64 threads (edges out stock by a slim margin sometimes, but seems consistent)

It should be fairly straightforward to follow rules like this for other, more important kernels, so I'll make sure I fully understand this behaviour & build kernels with that in mind.
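The rule of thumb above could be encoded as a small lookup. This is only a sketch: the function name and the (major, minor) convention are assumptions, and the thresholds are simply the ones listed in this post, not values taken from the application source.

```python
def threads_for_power_spectrum(major, minor):
    """Pick threads per block from the compute capability, per the list above."""
    cc = (major, minor)
    if cc >= (2, 0):
        return 256   # Fermi: 2.0 and 2.1
    if cc >= (1, 3):
        return 128   # 1.3
    return 64        # 1.0 to 1.2

if __name__ == "__main__":
    for cc in [(1, 0), (1, 1), (1, 2), (1, 3), (2, 0), (2, 1)]:
        print(cc, threads_for_power_spectrum(*cc))
```

In a CUDA host program the two numbers would typically come from the `major` and `minor` fields of `cudaDeviceProp`.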

Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 20 Nov 2010, 11:17:03 am
Test 4 Win 7 64 260.99

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       27.6 GFlops  11.0 GB/s      0.0ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       17.4 GFlops   7.0 GB/s    121.7ulps
     64 threads:       27.5 GFlops  11.0 GB/s    121.7ulps
    128 threads:       36.4 GFlops  14.5 GB/s    121.7ulps
    256 threads:       39.6 GFlops  15.8 GB/s    121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:       18.9 GFlops   7.6 GB/s      0.0ulps
     64 threads:       23.1 GFlops   9.2 GB/s      0.0ulps
    128 threads:       24.1 GFlops   9.6 GB/s      0.0ulps
    256 threads:       22.7 GFlops   9.1 GB/s      0.0ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       17.5 GFlops   7.0 GB/s    121.7ulps
     64 threads:       27.6 GFlops  11.0 GB/s    121.7ulps
    128 threads:       36.3 GFlops  14.5 GB/s    121.7ulps
    256 threads:       39.7 GFlops  15.9 GB/s    121.7ulps
    512 threads:       39.2 GFlops  15.7 GB/s    121.7ulps
   1024 threads:       34.7 GFlops  13.9 GB/s    121.7ulps

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 20 Nov 2010, 11:31:52 am
Me and my little 9500GT reporting for duty, sir, but it's time for a little hand holding. I downloaded the package from the first post and got a DLL and the executable. Where do I put the DLL before I open the EXE?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 20 Nov 2010, 11:32:41 am
EDIT: I still have WinXP32 installed on another HD of this machine; are you interested in a run of your tool under that OS?

Yes please.  The difference picked up earlier (Thanks Frizz)  between XP32 & XP64 was interesting ( with stock, around 10% advantage to XP32, reduced to ~5% with Mod3 ) .    I've little doubt XP32 has a similar advantage over Win7x64, due to the simpler driver model, but it'd be nice to confirm if the mods close that gap a bit too.

Sure, no problem. The results:

WinXP32-SP3/8GB/8800GTX/260.99 drivers:

Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       18.3 GFlops    7.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       14.1 GFlops    5.6 GB/s 121.7ulps
     64 threads:       18.2 GFlops    7.3 GB/s 121.7ulps
    128 threads:       18.2 GFlops    7.3 GB/s 121.7ulps
    256 threads:       17.9 GFlops    7.2 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        2.9 GFlops    1.2 GB/s 1183.3ulps
     64 threads:        2.9 GFlops    1.2 GB/s 1183.3ulps
    128 threads:        2.9 GFlops    1.2 GB/s 1183.3ulps
    256 threads:        2.9 GFlops    1.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       14.7 GFlops    5.9 GB/s 121.7ulps
     64 threads:       18.3 GFlops    7.3 GB/s 121.7ulps
    128 threads:       18.2 GFlops    7.3 GB/s 121.7ulps
    256 threads:       18.0 GFlops    7.2 GB/s 121.7ulps
    512 threads:       16.4 GFlops    6.6 GB/s 121.7ulps
   1024 threads: N/A

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 20 Nov 2010, 11:38:02 am
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 20 Nov 2010, 11:45:11 am
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.

Steve

Whoa, you can copy and paste from a CMD window.  :o Right click in the title bar and you will be enlightened. Saves you a LOT of typing!

Regards,

Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 20 Nov 2010, 11:48:57 am
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.

Steve

Even easier: use a redirect.

PowerSpectrum4 > results.txt

Always avoid rekeying as much as you possibly can. Apart from the time wasted, it's a prolific source of errors.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 20 Nov 2010, 11:51:36 am
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.

Steve

Whoa, you can copy and paste from a CMD window.  :o Right click in the title bar and you will be enlightened. Saves you a LOT of typing!

Regards,

Patrick.

Thank you! I was stumbling myself trying to figure out how to do it. I actually had to toss the cat out because he kept climbing on me while I was trying to type. My DOS is not as good as it once was. It was even trial and error just to get to the right directory.

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 12:02:39 pm
BTW: Steve, I think your card has downclocked or something.

Here's mine, 480 @ 820MHz (Win7x64):
Quote
...GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       17.7 GFlops    7.1 GB/s 121.7ulps
     64 threads:       29.1 GFlops   11.6 GB/s 121.7ulps
    128 threads:       40.3 GFlops   16.1 GB/s 121.7ulps
    256 threads:       44.2 GFlops   17.7 GB/s 121.7ulps
    512 threads:       43.4 GFlops   17.4 GB/s 121.7ulps
   1024 threads:       36.8 GFlops   14.7 GB/s 121.7ulps...

Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 20 Nov 2010, 12:08:24 pm
BTW: Steve, I think your card has downclocked or something.

Here's mine, 480 @ 820MHz (Win7x64):
Quote
...GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       17.7 GFlops    7.1 GB/s 121.7ulps
     64 threads:       29.1 GFlops   11.6 GB/s 121.7ulps
    128 threads:       40.3 GFlops   16.1 GB/s 121.7ulps
    256 threads:       44.2 GFlops   17.7 GB/s 121.7ulps
    512 threads:       43.4 GFlops   17.4 GB/s 121.7ulps
   1024 threads:       36.8 GFlops   14.7 GB/s 121.7ulps...



That's interesting. Normally I am running at 860 MHz, with the voltage at 1.05 VDC.

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 12:14:23 pm
Me and my little 9500GT reporting for duty sir, but it's time for a little hand holding. I downloaded the package from the first post. I got a DLL and the executable. Where do I put the DLL before I open the EXE?

9500GT would be a great double check of the theories so far (Mod3 64 thread should be the right choice & extremely close to stock for that one)

Just- 
   - chuck the exe & dll into a new folder somewhere easy to get to, such as C:\TEST
   - Open a command window (Start->Run->CMD.EXE),
   - change directory to that location ( cd \TEST )
   - run the test  ( powerspectrum4.exe > results.txt )
   - wait for it to finish & look at results.txt

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 20 Nov 2010, 12:18:25 pm
(http://i901.photobucket.com/albums/ac211/SciManStev/GPU.jpg)

At least they crunch fast.

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 12:19:50 pm
Ah Huh!  Memory clock

820/1640/2088 1.138V, that's about as hard as I can reasonably push it without going to water.

This particular test kernel is memory bound, so that'll be the difference.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 12:33:33 pm
Yes please.  The difference picked up earlier (Thanks Frizz)  between XP32 & XP64 was interesting ( with stock, around 10% advantage to XP32, reduced to ~5% with Mod3 ) .    I've little doubt XP32 has a similar advantage over Win7x64, due to the simpler driver model, but it'd be nice to confirm if the mods close that gap a bit too.

Sure, no problem. The results:
...
     64 threads:       18.3 GFlops    7.3 GB/s 121.7ulps
  Thanks! Not enough in it (~2-3%) for me to consider switching back to XP32  :).
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 20 Nov 2010, 12:44:37 pm


Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrum4.exe

Device: GeForce 9500 GT, 1840 MHz clock, 1008 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:        2.8 GFlops    1.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        2.7 GFlops    1.1 GB/s 121.7ulps
     64 threads:        2.9 GFlops    1.1 GB/s 121.7ulps
    128 threads:        2.9 GFlops    1.1 GB/s 121.7ulps
    256 threads:        2.9 GFlops    1.2 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.5 GFlops    0.2 GB/s 1183.3ulps
     64 threads:        0.5 GFlops    0.2 GB/s 1183.3ulps
    128 threads:        0.5 GFlops    0.2 GB/s 1183.3ulps
    256 threads:        0.5 GFlops    0.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        2.8 GFlops    1.1 GB/s 121.7ulps
     64 threads:        2.9 GFlops    1.1 GB/s 121.7ulps
    128 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
    256 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
    512 threads:        2.9 GFlops    1.1 GB/s 121.7ulps
   1024 threads: N/A



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 20 Nov 2010, 12:57:05 pm
Woohoo, I like the ones that say 1.2GB/s , might have to shift compute cap 1.1 cards into the Mod 3, 128 thread category  ( or add more digits next time,  to find out where within 0-9% that difference is.  9% would be good )
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 20 Nov 2010, 06:49:29 pm
I guess I was doing it wrong before as well, I was just running it straight

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:       12.8 GFlops    5.1 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        7.7 GFlops    3.1 GB/s 121.7ulps
     64 threads:       12.8 GFlops    5.1 GB/s 121.7ulps
    128 threads:       17.6 GFlops    7.0 GB/s 121.7ulps
    256 threads:       19.3 GFlops    7.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        8.7 GFlops    3.5 GB/s   0.0ulps
     64 threads:       11.2 GFlops    4.5 GB/s   0.0ulps
    128 threads:       13.2 GFlops    5.3 GB/s   0.0ulps
    256 threads:       12.8 GFlops    5.1 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.8 GFlops    3.1 GB/s 121.7ulps
     64 threads:       12.9 GFlops    5.1 GB/s 121.7ulps
    128 threads:       17.6 GFlops    7.0 GB/s 121.7ulps
    256 threads:       19.3 GFlops    7.7 GB/s 121.7ulps
    512 threads:       19.1 GFlops    7.6 GB/s 121.7ulps
   1024 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 20 Nov 2010, 06:55:14 pm
Just to add a little bit... I'm running Vista 32 on an E5400 dual core at 2.7GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840 and memory at 400, which gives me 118 GFLOPS peak.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Nov 2010, 05:32:37 am
Interesting.  The CUDA Visual Profiler reports the global memory throughput of Mod 3 with 256 threads as ~175GB/s.  That means the measurement in the UnitTest is out by a factor of 10  ::)  ( Powerspectrum4 reported ~17.7 GB/s )
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 21 Nov 2010, 05:42:36 am
Possible reason:
The profiler counts all memory transfers, including overhead. Your code probably counts only useful data transfers.
It could be a sign of a large amount of overhead.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Nov 2010, 05:53:24 am
Hmmm, yes I read that.  Whatever reason will pop up as I analyse the crap out of the 256 thread version to see why it's faster on Fermi.  I'm looking for a counter for uncoalesced global loads, but can't find it so far  :-\
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 21 Nov 2010, 05:56:58 am
It's in the memory operations section. It is present for NV.

Regarding workgroup size, quite a few factors can have an influence:
1) register pressure
2) local (shared in NV terms) memory amount
3) depth of call stack

All these factors can limit the number of warps in flight simultaneously on a single compute unit. That is, they influence the quality of memory latency hiding.
This adds to all the other issues with memory access patterns vs workgroup dimensions (at the same workgroup size).
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Nov 2010, 06:05:20 am
Ahh, found that the counter for uncoalesced reads & writes isn't supported on compute capability greater than 1.1... oh well.

Quote
1) register pressure
2) local (shared in NV terms) memory amount
3) depth of call stack

We're all good there with Mod3.  only 6 registers / thread, occupancy is 1, no shared mem usage in this variant, and only a single call.
So it looks like a clean memory bound kernel with no issues.

I did notice the Memcopy kernels use 192 threads, possibly to fit extra blocks per SM despite being memory bound, so I'm going to try that.

Mod3/256 threads fits 6 blocks/SM, and the max is 8, so it might be worth checking.
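
The blocks-per-SM figures above can be reproduced with a bit of host-side arithmetic. This is only a rough sketch: the per-SM limits (1536 resident threads, 8 resident blocks, 32768 registers) are the published compute capability 2.0 values rather than anything stated in this thread, the struct and function names are made up for illustration, and register allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits (assumed values for compute capability 2.0).
struct SmLimits {
    int max_threads;    // resident threads per SM
    int max_blocks;     // resident blocks per SM
    int max_registers;  // 32-bit registers per SM
};

constexpr SmLimits kFermi{1536, 8, 32768};

// Resident blocks per SM for a kernel that uses no shared memory.
// Ignores register allocation granularity, so treat it as an estimate.
int blocks_per_sm(const SmLimits& sm, int threads_per_block, int regs_per_thread) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_threads = sm.max_threads / threads_per_block;
    int by_regs = sm.max_registers / (threads_per_block * regs_per_thread);
    return std::min({sm.max_blocks, by_threads, by_regs});
}

// Occupancy = resident threads / maximum resident threads.
double occupancy(const SmLimits& sm, int threads_per_block, int regs_per_thread) {
    int blocks = blocks_per_sm(sm, threads_per_block, regs_per_thread);
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

With 6 registers per thread this gives 6 blocks/SM at 256 threads and 8 blocks/SM at 192 threads, both at full occupancy. Under the same assumptions, 64-thread blocks hit the 8-block limit first and reach only a third of the resident threads, which may be part of why the stock 64-thread geometry lags on Fermi.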

[Edit:] much the same:
Quote
    192 threads:       44.1 GFlops   17.6 GB/s 121.7ulps

Not much more to squeeze out of Kernels like this I think,  Will add concurrent kernels next (take my time doing so)

[Later:] Oops:
Quote
float GB =  ((n * sizeof(float2)) + ( n*sizeof(float) ))/10e9;
fixing:
Quote
float GB =  ((n * sizeof(float2)) + ( n*sizeof(float) ))/1e9;
That's better:
Quote
    256 threads:       44.2 GFlops  176.8 GB/s 121.7ulps
Near maximum I think, will have to calculate the theoretical.

Theoretical max of GTX480 @ 2088MHz memclock = 200.448 GB/s  , so 176.8 (effective) is pushing pretty hard.  Onto concurrency....
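
For anyone wanting to check the corrected figures, here is a small host-side sketch of both calculations. The Float2 struct only stands in for CUDA's float2 so it builds without the CUDA headers. The 384-bit bus width and 2 transfers per clock are assumptions on my part (not stated in the thread), chosen because they reproduce the 200.448 GB/s quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for CUDA's float2 (two packed floats).
struct Float2 { float x, y; };

// Bytes moved per point: one complex input read plus one float power written.
constexpr std::size_t kBytesPerPoint = sizeof(Float2) + sizeof(float);

// Effective throughput in GB/s for n points processed in `seconds`.
double effective_gb_per_s(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    // Divide by 1e9 (one billion). 10e9 is ten billion, hence the factor of 10 error.
    double gigabytes = static_cast<double>(n * kBytesPerPoint) / 1e9;
    return gigabytes / seconds;
}

// Theoretical peak in GB/s from memory clock, bus width and transfers per clock.
double peak_gb_per_s(double mem_clock_mhz, int bus_width_bits, int transfers_per_clock) {
    double bytes_per_transfer = bus_width_bits / 8.0;
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9;
}
```

peak_gb_per_s(2088.0, 384, 2) comes out at 200.448, so the 176.8 GB/s effective figure is roughly 88% of the theoretical peak.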
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 21 Nov 2010, 11:13:01 am
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
     64 threads:        4.6 GFlops    1.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps
     64 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    128 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    256 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        3.2 GFlops    1.3 GB/s 121.7ulps
     64 threads:        4.6 GFlops    1.8 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.2 GFlops    1.7 GB/s 121.7ulps
    512 threads:        3.5 GFlops    1.4 GB/s 121.7ulps
   1024 threads: N/A


HTH
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Nov 2010, 11:15:39 am
Fits with the theories so far  :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10  ::)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Josef W. Segur on 21 Nov 2010, 02:42:32 pm
Fits with the theories so far  :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10  ::)

And maybe consider whether the kernels might be memory bound on some cards?

Just to add a little bit... I'm running Vista 32 on an E5400 dual core at 2.7GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840 and memory at 400, which gives me 118 GFLOPS peak.

Yeah, 1840 shader gives 117.76 GFLOPS per the nVidia formula with 32 CUDA cores (aka shaders). What I find interesting is that they're trying to discourage use of Furmark and such which actually try to achieve the highest possible performance...
                                                                                Joe
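
As a quick sketch of that peak figure: the formula is shader clock times CUDA cores times flops per core per clock. The factor of 2 (one multiply-add per clock) is an assumption here, picked because it reproduces the 117.76 GFLOPS above, and the function name is purely illustrative.

```cpp
#include <cassert>

// Peak single-precision GFLOPS = shader clock (GHz) x cores x flops per core per clock.
// flops_per_clock defaults to 2 (one multiply-add), which is an assumption.
double peak_gflops(double shader_mhz, int cuda_cores, int flops_per_clock = 2) {
    assert(shader_mhz > 0.0 && cuda_cores > 0);
    return shader_mhz / 1000.0 * cuda_cores * flops_per_clock;
}
```

peak_gflops(1840.0, 32) gives 117.76. The unit test reports only ~2.9 GFlops on that card, which is a reminder that this kernel sits nowhere near the arithmetic peak.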
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 21 Nov 2010, 03:09:49 pm
Actually, Nvidia seems to be starting to differentiate its gaming GeForce and HPC Tesla products even more, by putting a limitation in the GTX580 (and probably future high-end gaming GeForce products) that downclocks the card when its usage reaches a very high level (as in FurMark or OCCT, for example). The reason they give is that games will never put such a high workload on the GPU, and they are probably right. However, some highly optimized real-life CUDA applications could reach it too. My guess is that Nvidia will respond with "buy a (much more expensive) Tesla if HPC is what you want"... :(
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Nov 2010, 11:16:03 pm
And maybe consider whether the kernels might be memory bound on some cards?,,,

This one is memory bound on all of them, with only 3 compute instructions, which are partially fused, in each thread.  Getting the right kernel geometry per compute capability does seem to let us push bandwidth upward from stock though, and it appears the stock code for that kernel was compute cap 1.0 optimised (reasonable).  There are now ways to switch kernels at run time that are automatic, in the driver, though.

With a memory bound computation like this, it does seem logical to increase the compute density, which the first freaky powerspectrum's inclusion of the power spectrum into the FFT output does do.  I will need to test & refine those implementations for extension to more sizes in the long run.

Next is probably to try to rearrange the FFT & powerspectrum into chunks, in order to better exploit the cache available on Fermi (~768k L2 ), which the FFT-> powerspectrum sequence appears to be thrashing solely due to dataset size. I'm hoping that the concurrent kernels mechanism is intelligent enough to discriminate cache hot data. 

In either case the next test will probably need some extra compute density, which should see the GFlops rise against a hopefully similar bandwidth figure.

I haven't yet explored whether any processing subsequent to the powerspectrum could also be embedded to further raise the compute density, finding spikes immediately, for example, but it's looking like a possibility.  Further on, dealing with individual PulsePoTs for the pulsefinding looks like an option, if the FFT sequence preceding it can be done in suitable blocks.
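
To put a number on "compute density", here is a rough sketch. It assumes the unit test counts 3 flops per point (two multiplies and an add for re*re + im*im) against 12 bytes of traffic (a float2 read and a float write). That is not stated outright, but it is consistent with the 44.2 GFlops at 176.8 GB/s reported earlier. Function names are illustrative only.

```cpp
#include <cassert>

// Arithmetic intensity in flops per byte of global memory traffic.
double flops_per_byte(double flops_per_point, double bytes_per_point) {
    assert(bytes_per_point > 0.0);
    return flops_per_point / bytes_per_point;
}

// For a memory bound kernel, achievable GFlops is capped by bandwidth x intensity.
double memory_bound_gflops(double bandwidth_gb_s, double intensity) {
    return bandwidth_gb_s * intensity;
}
```

At 0.25 flops per byte, 176.8 GB/s caps out at 44.2 GFlops, matching the Mod3 result. As a purely hypothetical example, fusing in two extra flops per point with no extra memory traffic would raise the cap to about 73.7 GFlops at the same bandwidth, which is the kind of rise the next test would be looking for.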
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 22 Nov 2010, 04:19:21 am
For spike finding the whole array has to be scanned. That means either a long loop inside a thread, or thread cooperation via shared memory and barriers.
Whereas power spectrum computation is inherently parallel, and each thread can be mapped to a separate matrix point.
I tried to fuse power spectrum computation with normalization; performance decreased because of a huge drop in the number of available separate threads (normalization requires mean computation, i.e., again, access to the whole PoT array).
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Nov 2010, 04:46:21 am
(normalization required mean computation, i.e., again, access to whole PoT array ).
  mmm, there may be a way partially around that by some sync barrier / reduction.  Averaging the large dataset *should* be parallelisable (By swapping local means).  [Edit:] That summax stuff seems to be doing that, but seems to be fairly generalised, with lots of 'TODO' and unnecessary stuff.  will work out how to reduce in powerspectrum kernel later, since it seems pointless rescanning the whole array when we just had it there for powerspectrum.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 22 Nov 2010, 06:47:49 am
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one you will bind N points to a thread instead of 1 point, where N>>1, or you will have too many reduce steps.
]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Nov 2010, 09:14:50 am
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one you will bind N points to a thread instead of 1 point, where N>>1, or you will have too many reduce steps.
]

I agree to some extent, except that we're already memory bound here, so pinching at least portions (say the first stage of the reduction of 256 points in the block) should be almost free via shared memory (compared to memory access time anyway).    If it doesn't work out then it all leads to better understanding of these complicated things anyway  ;)
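
To make the two-stage idea concrete, here is a CPU-only sketch of the pattern, not the actual kernel. Each "block" of points collapses to a partial sum and max using the same halving-stride loop a shared memory reduction would use, and the partials are combined afterwards. The names and the power-of-two block size requirement are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct SumMax { float sum; float max; };

// Stage 1: tree reduction of one block. On a GPU the scratch arrays would live
// in shared memory, each i would be a thread, and a barrier would follow each pass.
SumMax reduce_block(const float* data, std::size_t block_size) {
    std::vector<float> sum(data, data + block_size);
    std::vector<float> mx(data, data + block_size);
    for (std::size_t stride = block_size / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            sum[i] += sum[i + stride];
            mx[i] = std::max(mx[i], mx[i + stride]);
        }
    }
    return {sum[0], mx[0]};
}

// One partial result per block of block_size points (block_size a power of two).
std::vector<SumMax> partial_sum_max(const std::vector<float>& power, std::size_t block_size) {
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);
    assert(power.size() % block_size == 0);
    std::vector<SumMax> partials;
    partials.reserve(power.size() / block_size);
    for (std::size_t base = 0; base < power.size(); base += block_size) {
        partials.push_back(reduce_block(power.data() + base, block_size));
    }
    return partials;
}

// Stage 2: combine the per-block partials into one sum and max.
SumMax combine(const std::vector<SumMax>& partials) {
    assert(!partials.empty());
    SumMax total{0.0f, partials.front().max};
    for (const SumMax& p : partials) {
        total.sum += p.sum;
        total.max = std::max(total.max, p.max);
    }
    return total;
}
```

The mean then falls out as total.sum divided by the number of points. With 256-point blocks and fftlen 64, each block would cover four FFT arrays, so a real kernel would stop the tree at stride 64 rather than run it all the way down.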
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Nov 2010, 01:35:28 am
Further confirmation of this kernel being memory bound first.  I wound up the memory clock without changing the core clock.

At original OC (not stock ) 2088MHz memory clock:
Quote
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
    256 threads:       44.1 GFlops  176.4 GB/s 121.7ulps

At 2208 MHz:
Quote
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
    256 threads:       46.7 GFlops  186.8 GB/s 121.7ulps

So a ~5.9% increase in throughput for a ~5.7% increase in memory clock (roughly linear scaling with memory clock)
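
A tiny sketch of the arithmetic behind that comparison, using only the figures quoted above (the helper name is illustrative):

```cpp
#include <cassert>

// Percentage change going from `before` to `after`.
double percent_gain(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

percent_gain(2088.0, 2208.0) is about 5.7 for the memory clock, and percent_gain(176.4, 186.8) is about 5.9 for the measured throughput, so throughput tracks the memory clock almost one for one.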
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 25 Nov 2010, 03:05:46 am
Is it perhaps an option to put the latest version of your test program (the one with the fixed GB/s numbers) in the first post?

Of course, if you want to add/run more tests, I'm looking forward to providing you with the new results. ;)

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Nov 2010, 04:59:06 am
Hi Patrick,
   I'm currently working on adding the next part of stock code to the reference set.  It seems that the method used for the next part of processing in stock cuda code is very slow (Though I'm busily checking my numbers).  Once I've done that, and come up with some suitable alternatives or refinements for that code, I'll probably replace the current test with a new one (with fixed memory throughput numbers).

Until then you can just multiply the Memory throughput figures by ten in your head  ;).

As part of the next refinements, whether they turn out to be replacing or integrating the summax reduction kernels, or something else if that proves unworkable as Raistmer suggests, I'll be trying to include the threads per block heuristic we work out for the powerspectrum Mod3.  All going well, I should have something more worth testing in a day or so.

Jason
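
For reference, the threads per block heuristic as it stands in the Unit Test #5 notes in the first post (1.0-1.2 = 64, 1.3 = 128, 2.0+ = 256) can be sketched as a simple lookup. The function name is made up for illustration. In a real app the major/minor values would come from cudaGetDeviceProperties, and the cut-offs may well shift as more results come in.

```cpp
#include <cassert>

// Threads per block for the powerspectrum kernel, chosen by compute capability.
// Thresholds follow the Unit Test #5 notes; they are provisional.
int threads_per_block(int major, int minor) {
    assert(major >= 1 && minor >= 0);
    if (major >= 2) return 256;               // Fermi and later
    if (major == 1 && minor >= 3) return 128; // compute capability 1.3
    return 64;                                // compute capability 1.0 to 1.2
}
```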

[A bit later:] Just to make things complicated, the performance of the next reduction (stock code) depends on what sizes are fed to it  ::) (Powerspectrum performance is constant)
Quote
Stock:
   PowerSpectrum<  64 thrd/blk>   29.0 GFlops  115.9 GB/s   0.0ulps
   SumMax (     8 )    4.3 GFlops   19.0 GB/s
   SumMax (    16)    3.8 GFlops   16.5 GB/s
   SumMax (    32)    1.8 GFlops    8.0 GB/s
   SumMax (    64)    3.1 GFlops   13.5 GB/s
   SumMax (   128)    4.7 GFlops   20.5 GB/s
   SumMax (   256)    6.3 GFlops   27.6 GB/s
   SumMax (   512)   11.2 GFlops   48.9 GB/s
   SumMax (  1024)   17.2 GFlops   75.1 GB/s
   SumMax (  2048)   20.3 GFlops   88.9 GB/s
   SumMax (  4096)   24.3 GFlops  106.3 GB/s
   SumMax (  8192)   25.2 GFlops  110.2 GB/s
   SumMax ( 16384)   24.8 GFlops  108.7 GB/s
   SumMax ( 32768)   28.3 GFlops  123.8 GB/s
   SumMax ( 65536)   18.4 GFlops   80.4 GB/s
   SumMax (131072)   10.1 GFlops   44.3 GB/s

   Powerspectrum + SumMax (     8 )   12.0 GFlops   49.1 GB/s
   Powerspectrum + SumMax (    16)   10.8 GFlops   44.4 GB/s
   Powerspectrum + SumMax (    32)    6.2 GFlops   25.2 GB/s
   Powerspectrum + SumMax (    64)    9.3 GFlops   38.3 GB/s
   Powerspectrum + SumMax (   128)   12.6 GFlops   51.7 GB/s
   Powerspectrum + SumMax (   256)   15.3 GFlops   62.5 GB/s
   Powerspectrum + SumMax (   512)   20.8 GFlops   85.1 GB/s
   Powerspectrum + SumMax (  1024)   24.8 GFlops  101.5 GB/s
   Powerspectrum + SumMax (  2048)   26.3 GFlops  107.5 GB/s
   Powerspectrum + SumMax (  4096)   27.7 GFlops  113.5 GB/s
   Powerspectrum + SumMax (  8192)   28.0 GFlops  114.6 GB/s
   Powerspectrum + SumMax ( 16384)   27.8 GFlops  113.8 GB/s
   Powerspectrum + SumMax ( 32768)   28.8 GFlops  117.9 GB/s
   Powerspectrum + SumMax ( 65536)   25.4 GFlops  104.0 GB/s
   Powerspectrum + SumMax (131072)   19.8 GFlops   81.1 GB/s
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 25 Nov 2010, 10:39:19 am
Yes, it should be so.
Different sizes mean different block numbers, and so different memory latency hiding at least.
Whereas power spectrum always has a constant (1M) number of threads: each thread is mapped to just a single spectrum point, and there are always 1M points regardless of sizes, since X*Y==1024*1024 always, even if X varies.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Nov 2010, 04:22:15 pm
@Raistmer:  Now I restore the stock Memory transfers, and find this response from stock code:

Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0](     8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0](    16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0](    32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0](    64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0](   128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0](   256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0](   512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0](  1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0](  2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0](  4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0](  8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251
                PowerSpectrum+summax Unit test
Stock:
   PowerSpectrum<  64 thrd/blk>   29.1 GFlops  116.3 GB/s   0.0ulps
   SumMax (     8)    0.8 GFlops    3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
   SumMax (    16)    1.1 GFlops    4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
   SumMax (    32)    1.1 GFlops    4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
   SumMax (    64)    1.8 GFlops    7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
   SumMax (   128)    2.6 GFlops   11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
   SumMax (   256)    3.5 GFlops   15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
   SumMax (   512)    5.0 GFlops   21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
   SumMax (  1024)    6.1 GFlops   26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
   SumMax (  2048)    6.7 GFlops   29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
   SumMax (  4096)    7.2 GFlops   31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
   SumMax (  8192)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
   SumMax ( 16384)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
   SumMax ( 32768)    7.3 GFlops   32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
   SumMax ( 65536)    6.2 GFlops   27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
   SumMax (131072)    5.1 GFlops   22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251

Did you also find the stock reduction code prior to FindSpikes is a pile of poo also ?  or do you think my test is broken ?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 25 Nov 2010, 05:47:47 pm
Do you mean that summax reaches much lower throughput than power spectrum?
If yes, it should be so; I described the reasons earlier.
For now I'm converting CUDA summax32 into OpenCL for HD5xxx GPUs. Will see if it is better than my own reduction kernels, which don't use local memory at all.
Reduction in summax allows the number of workitems involved to be increased. Without reduction, for long FFTs (and there are many long FFT calls, many more than small FFT ones) find spike would have only a few workitems, each dealing with a big array of data, which means very poor memory latency hiding.
So some sort of reduction is essential here [ and surely it will decrease throughput, but to a much lesser degree ]
{BTW, summax32 starts from FFTsize==32. From your table it looks like the codelets (template-based) for sizes less than 32 are not very good ones, too low throughput. Good info, I'll think twice before using them in OpenCL now ;D }
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Nov 2010, 06:07:03 pm
Yep, I mean exactly what you're getting at:
Stock:
...
   SumMax (     8)    0.8 GFlops    3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
...
   SumMax (  8192)    7.3 GFlops   31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
...
   SumMax (131072)    5.1 GFlops   22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251

I think my old Pentium 3 will calculate average & peak for a 1MiB point dataset in similar GFlops speeds, and need much less power to do so. (compared to GTX 480 overclocked)

Part of the waste is definitely the memory copies back to host for result reduction (OK) but not that much.  I'll continue playing around and see if I can determine whether something like this should really be done as is, with improved GPU code, partially on CPU, or fully on CPU.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 25 Nov 2010, 06:13:55 pm
A full CPU transfer will be slower; I started with that in the early stage of OpenCL MB.
The 4M memory transfer of the power matrix costs too much.
For low FFT sizes I use a flag transfer to see whether mean/max/pos data need to be downloaded from the GPU or not, but for big FFT sizes (low number of mean/max/pos elements) I found it's easier to transfer them than a flag.
A memory transaction (for ATI at least) has some threshold size (about 16kb) below which the transfer time almost doesn't change. That is, whether a single byte or 16kb is transferred, the overhead will be the same. So there's no sense in downloading a flag instead of 16kb of the original mean/max/pos data.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Nov 2010, 06:20:43 pm
OK, then I have a middle ground in mind that might restore some throughput & hopefully be friendly to the preceding powerspectrum threadblock layout.  Will give it a go.

[Next Day:] Stock method is doomed by those memory transfers for the summax reduction results.  I'll go onto FindSpikes to see if all that data is really needed.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 27 Nov 2010, 07:20:53 am
[Next Day:] Stock method is doomed by those memory transfers for the summax reduction results.  I'll go onto FindSpikes to see if all that data is really needed.
3 float numbers (12 bytes) per array. And when the FFT size is big enough (most of the time it's ~8-16k) that's not too much to transfer: at an FFT size of 8K we have 1M/8K=128 arrays => 128*12=1.5kB of memory transfer per kernel call. That's well within the threshold for constant-time memory transfer on ATI. That is, there's no sense in reducing the size of the transfer (and I see no way to eliminate it completely without doing the full reduction and spike bookkeeping completely on the GPU).
Will look at the NV profiler data to see if it has a different transfer time vs transfer size dependence...
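
The transfer sizes being discussed are easy to tabulate. A small sketch, assuming the 1M point dataset and one (mean, max, position) triple of floats per FFT array as described above; the names are illustrative:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTotalPoints = 1024 * 1024;           // 1M point power spectrum
constexpr std::size_t kBytesPerResult = 3 * sizeof(float);  // mean, max, position

// Number of FFT arrays in the dataset for a given FFT length.
std::size_t num_arrays(std::size_t fft_len) {
    assert(fft_len > 0 && kTotalPoints % fft_len == 0);
    return kTotalPoints / fft_len;
}

// Bytes of summax results copied back to the host per kernel call.
std::size_t result_bytes(std::size_t fft_len) {
    return num_arrays(fft_len) * kBytesPerResult;
}
```

result_bytes(8192) is 1536 bytes (the 1.5kB above), while result_bytes(64) is 196608 bytes, well past the ~16kb region where transfer time was said to stay flat. That fits with the short FFT lengths being the ones that hurt.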
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 27 Nov 2010, 08:25:36 am
OK, this is data for the OpenCL build, but you will perhaps get ~ the same with CUDA:

Transfer of flag (in my case flag in uint4 so 16 bytes):
4us GPU time and ~120us CPU time (as NV profiler shows)
transfer of full results array in case of big FFT sizes, for example, transfer of 8k data: GPU: 14us, CPU:128us;
4k of data: GPU: 9us, CPU:117us

Looks like the same rule applies for NV GPUs, maybe with a slightly lower threshold value: if the transfer size is less than some threshold, the transfer time no longer depends on the transfer size.
And it should be so, because of the rather large quantum of data in a bus transfer.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Nov 2010, 09:38:00 am
OK,
    While playing, it is definitely numdatapoints/fftlen transfers (of size float3) that is killing performance here.  I've tried variations on partial reductions, to be completed on host, even with mapped/pinned memory. I find It's going to be more efficient to transmit the thresholds into the kernel, transferring flag/results only if necessary, solely because we have an upper limit of 30 signals... will keep playing.

Jason

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 27 Nov 2010, 10:25:27 am
Flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; look at the OpenCL PC_find_spike* kernels.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 27 Nov 2010, 10:29:38 am
solely because we have an upper limit of 30 signals...
Per single kernel call it doesn't matter. You can always download the whole array back if needed, and this is a very rare case. In the common case only a uint flag of 4 bytes needs to be transferred if the threshold/best comparison is done inside the kernel.

But again, it works good only while FFT sizes are small (many short arrays).
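
A host-side sketch of the flag idea, under stated assumptions. The 2048 cut-off is the one mentioned above, but whether 2048 itself falls on the flag side is a guess. The peak > threshold * mean test is only a stand-in for whatever threshold/best comparison the real kernels do, and all the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ArrayResult { float mean; float peak; float position; };

// FFT lengths up to this cut-off use the flag path (boundary handling is assumed).
constexpr std::size_t kFlagCutoffFftLen = 2048;

// True when a flag should be transferred instead of the full results array.
bool prefer_flag(std::size_t fft_len) {
    return fft_len <= kFlagCutoffFftLen;
}

// Stand-in for the in-kernel comparison: does any array hold a candidate signal?
// The host only needs to download the full results when this returns true.
bool any_over_threshold(const std::vector<ArrayResult>& results, float threshold) {
    for (const ArrayResult& r : results) {
        if (r.peak > threshold * r.mean) return true;
    }
    return false;
}
```

In the common case the flag stays clear and only a few bytes cross the bus; the full array is fetched only in the rare case that something trips the threshold.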
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Nov 2010, 10:48:35 am
Flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; look at the OpenCL PC_find_spike* kernels.

  Yep, that gels with what I'm seeing in cuda calls.  I'll avoid looking too hard at OpenCL implementation in respect of strength in diversity, but the principles match up so far.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Nov 2010, 10:50:53 am
But again, it works good only while FFT sizes are small (many short arrays).

Yep, same again.  The shorter FFT lengths match with the longer pulsePoTs, so I want those as short as possible.  I'll feed through flags to the kernels at least for the shorter FFT lengths.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 28 Nov 2010, 01:05:24 am
WoW,
    Now am completely brainfried & need to design a thorough test for the next part.  I'll take a good break before creating a new one.

I chose one size for the combined powerspectrum+summax optimisation (fftlen=64), and *think* I've got that working.  I want to be very sure though, so I can use the same techniques through templatisation of the kernel.

It turns out using the shared memory for speeding the reduction is STINKING DIFFICULT  :o....I really hope it gets easier with practice  :D

"Tentatively looking OK" result for some reductions... but the speed looks too fast to be 100% correct right through, hence the need for extreme caution & a break from coding for a little while (Stock = Yellow, Opt1 = Green though suspect speed ):

Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.0 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
      fft[0] avg 0.650947 Pk 3.050944 OK
      fft[1] avg 0.624826 Pk 2.995684 OK
      fft[2] avg 0.620340 Pk 2.418427 OK
      fft[3] avg 0.779598 Pk 2.243930 OK

 PS+SuMx(    64)    6.0 GFlops   24.2 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.1 GFlops  176.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax Array Mapped to pinned host memory.
  256 threads, fftlem 64:    33.2 GFlops  134.5 GB/s 121.7ulps
       fft[0] avg 0.650947 Pk 3.050944
       fft[1] avg 0.624826 Pk 2.995684
       fft[2] avg 0.620340 Pk 2.418427
       fft[3] avg 0.779598 Pk 2.243929


I'll post a thorough updated test when I'm a bit more confident of the result, but prior to templating the other sizes.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 10:02:05 am
Managed to slow it down some (by processing properly  ;)), but tests out OK here (so far):

First post updated (particularly looking for which cards show any net gain, and which none, in worst & best cases):
Quote
[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64)  1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
 - Opt1 best & worst cases likely to occur in real life are tested.  The worst case should show ~same as stock to ~30% improvement (depending on GPU); the best case ~1.3-2x stock throughput (depending on GPU etc).  Worst case results are checked for accuracy & flagged if there's a problem.
 - On Integrated GPUs, use mapped/pinned host memory, so on those  worst case should be ~= best case ( and hopefully some margin better than the stock reduction  :-\)

Example output (important numbers: highlighted, Stock, Opt1 )

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.1 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    5.9 GFlops   24.1 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.3 GFlops  177.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.2 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 10:37:07 am
BTW: Please test on unloaded system (keep forgetting to mention that  ;))

[Edit:]  Attached the wrong file  ::)  Fixing...  Nevermind, was correct file after all
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 29 Nov 2010, 12:18:52 pm
Testing on my shrubbery. Each file contains Result4 and Result5 (since I seem to have missed a testing cycle). Other machines will follow. Last one.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 12:20:12 pm
Cheers, analysing first one:...

On that 9800GTX+ on Win7 (compute cap 1.1), I make that ~29% worst case, ~63% best case speedup.  Looks like I'm getting the pre-Fermis to budge finally  ::), good (was worried about that).  Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/blk, cc1.1)

analysing second one (9800GT on XP): ...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/block, cc1.1)
worst ~44%, best ~83%. 

analysing third one (GTX 470 on XP):...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 256 thrds/block, cc2.0)
worst ~45%, best ~115%. 

Thanks for the test4 results, they were helpful to doublecheck that the threadcount heuristic was wise enough in all three cases.

This particular code portion has mostly low impact, but Raistmer tells me it has most impact for VHAR.  In any case, it's the compute capability based heuristics & optimisation techniques being used that should hopefully help in more significant areas.  Already starting to get much better armed than a week ago.  Thanks  ;D
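
The compute-capability rule quoted earlier in the thread (1.0-1.2 = 64 threads, 1.3 = 128, 2.0+ = 256) amounts to a small lookup. A minimal Python sketch follows; the function name is hypothetical, and the real test presumably reads the major/minor version from the CUDA runtime rather than taking them as arguments.

```python
def threads_per_block(major, minor):
    """Threads per block for the power spectrum, by compute capability."""
    cc = (major, minor)
    if cc >= (2, 0):      # Fermi and later
        return 256
    if cc >= (1, 3):      # GT200 class
        return 128
    return 64             # 1.0 to 1.2


# Cards reported in this thread and the thread counts their output shows.
for name, cc in [("8800 GTX", (1, 0)), ("9800 GTX+", (1, 1)),
                 ("GTX 295", (1, 3)), ("GTX 480", (2, 0)), ("GTX 460", (2, 1))]:
    print(f"{name:10s} cc {cc[0]}.{cc[1]} -> {threads_per_block(*cc)} thrds/block")
```

Comparing tuples rather than a float like 1.3 avoids rounding surprises if a later capability such as 1.10 ever needed distinguishing from 1.1.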
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 29 Nov 2010, 01:43:41 pm
Here's the results from my 9800GTX+ on Win 7 64bit:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   64.2 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.5 GFlops   18.3 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.3 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.9 GFlops   32.1 GB/s 121.7ulps

and from my 128Mb 8400M GS:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>    1.2 GFlops    4.8 GB/s 1183.3ulps

 SumMax (    64)    0.1 GFlops    0.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        1.2 GFlops    4.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         0.6 GFlops    2.4 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         0.6 GFlops    2.5 GB/s 121.7ulps

Claggy

Edit: Here's the results of my 9800GTX+ on Windows Vista 64bit:

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   64.1 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    5.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.8 GFlops   23.4 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.5 GFlops   30.4 GB/s 121.7ulps

Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 29 Nov 2010, 01:44:54 pm
Ran it on my rig (Q6600/8GB/8800GTX/Win7-64), results:


Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   18.1 GFlops   72.4 GB/s 1183.3ulps

 SumMax (    64)    1.2 GFlops    4.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.9 GFlops   15.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.2 GFlops   72.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.4 GFlops   22.0 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         6.6 GFlops   26.6 GB/s 121.7ulps



Are you also interested in a run under WinXP?

Regards,

Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 29 Nov 2010, 01:47:57 pm
Win7 x64 - GTX465:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   15.9 GFlops   63.8 GB/s   0.0ulps

 SumMax (    64)    1.3 GFlops    5.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.1 GFlops   16.6 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.4 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 01:49:17 pm
....Are you also interested in a run under WinXP? ...

Sure! it'll be interesting to see if I'm closing the gap, or making it wider  ;).

Analysing your first result....

8800GTX
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38%
    best case speedup: ~69%
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 01:49:59 pm
Win7 x64 - GTX465:

Thanks, analysing your result too....

GTX 465
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~46%
    best case speedup: ~112%
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 01:51:06 pm
...and from my 128Mb 8400M GS:

Analysing both ;)


9800GTX+
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~33%
    best case speedup: ~75%

8400M GS
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~50%  <-- nice
    best case speedup:  ~50%   <-- nice



Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 29 Nov 2010, 02:13:32 pm
....Are you also interested in a run under WinXP? ...

Sure! it'll be interesting to see if I'm closing the gap, or making it wider  ;).

Analysing your first result....

8800GTX
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38%
    best case speedup: ~69%

As requested (Q6600/8GB/8800GTX/WinXP32):


Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   18.3 GFlops   73.1 GB/s 1183.3ulps

 SumMax (    64)    1.3 GFlops    5.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.3 GFlops   73.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         6.4 GFlops   25.8 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.9 GFlops   32.2 GB/s 121.7ulps


Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 02:25:06 pm
As requested (Q6600/8GB/8800GTX/WinXP32):

8800GTX earlier Win7x64 result:
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38% --> 5.4 GFlops
    best case speedup: ~69%  --> 6.6 GFlops

8800GTX XP32 result
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~48%   --> 6.4 GFlops
    best case speedup: ~83%    --> 7.9 GFlops


Tentative conclusion: in both best and worst cases, with that particular card and these specific hard coded kernels (not overly driver/CUDA library dependent), XP32 performance is higher by some 18-19%
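
As a check on that figure, the comparison can be redone from the Opt1 numbers of the two runs. A small Python sketch; the helper name is made up for illustration, and the inputs are the rounded GFlops values printed by the test, so the last digit is only indicative.

```python
def advantage_pct(faster, slower):
    """How much higher `faster` is than `slower`, as a percentage of `slower`."""
    return 100.0 * (faster - slower) / slower


# Opt1 throughput in GFlops for the same 8800 GTX, from the two runs above.
win7 = {"worst": 5.4, "best": 6.6}
xp32 = {"worst": 6.4, "best": 7.9}

xp_advantage = {case: advantage_pct(xp32[case], win7[case]) for case in win7}
for case, pct in xp_advantage.items():
    print(f"{case:5s} case: XP32 ahead by {pct:.1f}%")
```

This gives about 18.5% for the worst case and about 19.7% for the best case, close to the 18-19% estimate.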

That's a lot of difference (more than I expected).  Could you let me know both driver versions involved, whether your win7 has aero active, and any other possible differences besides OS ?

(looks like I might end up widening the gap, rather than narrowing it  ::))
Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 29 Nov 2010, 02:34:11 pm
As requested (Q6600/8GB/8800GTX/WinXP32):

8800GTX earlier Win7x64 result:
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38% --> 5.4 GFlops
    best case speedup: ~69%  --> 6.6 GFlops

8800GTX XP32 result
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~48%   --> 6.4 GFlops
    best case speedup: ~83%    --> 7.9 GFlops


Tentative conclusion: in both best and worst cases, with that particular card and these specific hard coded kernels (not overly driver/CUDA library dependent), XP32 performance is higher by some 18-19%

That's a lot of difference (more than I expected).  Could you let me know both driver versions involved, whether your win7 has aero active, and any other possible differences besides OS ?

Jason

Ah, both OSes have the 260.99 driver installed. Aero was active on Win7-64. There was also a VMWare virtual machine idling on the Win7-machine.

Since I suppose you'd like me to re-run the test on the Win7 machine without Aero and without the VM active, I did :):


Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   18.1 GFlops   72.4 GB/s 1183.3ulps

 SumMax (    64)    1.2 GFlops    4.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.9 GFlops   15.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.2 GFlops   72.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.4 GFlops   21.9 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         6.6 GFlops   26.6 GB/s 121.7ulps


Hope this provides some insight.

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 02:43:24 pm
Hope this provides some insight.

Thanks it does  :).  Neither aero nor the idling VM appear to have noticeably altered the performance numbers there... so us Win7-adopters  appear to be paying the price for our shiny new WDDM driver model  ;).

The stock code numbers are interesting too.  XP32 @ 4.3GFlops, and Win7x64 @ 3.9-4.1 GFlops ..... looks like the more familiar reported ~10% advantage to XP we've heard about before.

Nice that my tweaking works even faster on XP, but I'm starting to hope MS include some sort of video subsystem fixes in SP1 for Win7x64  :D

[Edit:] Later in the week I'll look into whether the 64-bitness of the OS is a factor here, though it hasn't proved significant before.  The WoW64 layer could possibly be slowing things down there somehow, but it's best to know for sure.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 29 Nov 2010, 03:35:06 pm
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   26.5 GFlops  105.8 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.6 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.7 GFlops   26.9 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.7 GFlops  106.9 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   37.0 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.8 GFlops   43.8 GB/s 121.7ulps


-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   25.2 GFlops  100.7 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.5 GFlops   26.3 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.3 GFlops  105.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   36.9 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.4 GFlops   42.1 GB/s 121.7ulps

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   25.3 GFlops  101.2 GB/s 1183.3ulps

 SumMax (    64)    2.0 GFlops    8.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.4 GFlops   25.7 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       25.9 GFlops  103.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         8.8 GFlops   35.8 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.4 GFlops   42.1 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 03:41:52 pm
Thanks!  compute cap 1.3, so completes the basic heuristic functionality test  :)

GTX 295 (taking lower & upper limits on each GPU as combined range)
    Average, peak calcs, thread-count heuristic: OK (both)
    worst case speedup: ~35%-40%
    best case speedup: ~60%-61%

GTX 260
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~37%
    best case speedup: ~62%

Still some legroom in those 2xx series yet  :)  With the 295s still pulling those kind of relative performance numbers, they'll challenge the 480s for a while yet IMO.  Running several tasks on the same 480 GPU makes the picture less clear, so as some of the small refinements creep into future releases it'll be something fun to watch at least.
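
When several devices are reported in one post, the three numbers that matter can be pulled out of pasted output mechanically. The Python sketch below is a hypothetical helper, not part of the unit test; it assumes the test #5 layout shown in this thread, where the worst/best case label sits on the line before its GFlops figure.

```python
import re

GFLOPS = re.compile(r"([\d.]+)\s+GFlops")


def parse_unit_test(text):
    """Extract stock, worst-case and best-case GFlops from test #5 output."""
    out = {}
    pending = None                       # case label waiting for its numbers line
    for line in text.splitlines():
        m = GFLOPS.search(line)
        if "PS+SuMx" in line and m:
            out["stock"] = float(m.group(1))
        elif "worst case" in line:
            pending = "worst"
        elif "best case" in line:
            pending = "best"
        elif pending and m:
            out[pending] = float(m.group(1))
            pending = None
    return out


# Excerpt of the GTX 295 device 0 output posted above.
sample = """\
 PwrSpec<    64>   26.5 GFlops  105.8 GB/s 1183.3ulps
 SumMax (    64)    2.1 GFlops    8.6 GB/s
 PS+SuMx(    64)    6.7 GFlops   26.9 GB/s
    128 threads:       26.7 GFlops  106.9 GB/s 121.7ulps
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   37.0 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.8 GFlops   43.8 GB/s 121.7ulps
"""
parsed = parse_unit_test(sample)
print(parsed)
```

Running it once per device block and taking the minimum and maximum of the resulting speedups gives the combined range quoted for the two halves of the 295.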
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 29 Nov 2010, 05:24:58 pm
Here is the results from my 460

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   12.9 GFlops   51.6 GB/s   0.0ulps

 SumMax (    64)    1.1 GFlops    4.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.4 GFlops   13.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       19.4 GFlops   77.4 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         5.5 GFlops   22.1 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         6.9 GFlops   28.1 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 05:32:58 pm
Thanks!, cooperating with cc 2.1 as well (after that rocky start  ;) )

GTX 460
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~61%  :o
    best case speedup: ~103%

Looking good.  I haven't worked out the worst case speedup for this kernel on my 480 yet, should be similarish, doing...

Stock  PS+SuMx(    64)    5.9 GFlops   24.0 GB/s
...

Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         8.1 GFlops   32.7 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.0 GB/s 121.7ulps

So
GTX480
worst:   (8.1-5.9)/5.9  ~= 37%
best:  (16.1-5.9)/5.9 ~= 173%

I guess I can live with the smaller improvement in the worst case, if I can manage to get a piece of the best case improvement in some code down the road.
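
The same (opt - stock) / stock arithmetic applies to every result posted so far. A short Python sketch that tabulates a few of them; the function name is made up, and the GFlops values are copied from the outputs earlier in the thread, so rounding in the printed figures carries through to the percentages.

```python
def speedup_pct(stock_gflops, opt_gflops):
    """Speedup over stock as a percentage: (opt - stock) / stock."""
    return 100.0 * (opt_gflops - stock_gflops) / stock_gflops


# (stock PS+SuMx, Opt1 worst case, Opt1 best case) in GFlops, from posts above
results = {
    "GTX 480": (5.9, 8.1, 16.1),
    "GTX 460": (3.4, 5.5, 6.9),
    "GTX 295 (device 0)": (6.7, 9.1, 10.8),
    "8800 GTX (Win7 x64)": (3.9, 5.4, 6.6),
}

for card, (stock, worst, best) in results.items():
    print(f"{card:22s} worst {speedup_pct(stock, worst):6.1f}%  "
          f"best {speedup_pct(stock, best):6.1f}%")
```

Because the inputs only carry one decimal place, differences of a percentage point or so between cards are within the noise of the printout.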
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 29 Nov 2010, 06:14:54 pm
Thanks All!

From here I'll move to complete at least the 'worst case' operation for all sizes.  It will take some time to make a further test confirming which sizes will work, at least for worst case speedups (simple implementation), and which won't.  During that period I'll also be seeking straightforward integration into the X series builds.  It would only amount to a very small speedup over the whole processing, but will confirm certain techniques (as already mentioned).

The 'best case' optimisation will require extensive work to extract a reasonable portion of, which would be a further small speedup overall that looks like it'll help most GPUs, but Fermi most.  Again those techniques would reflect on other more critical code areas in the long run, so your help here has been appreciated most highly.

I can start to apply some of the methods determined here toward more important areas with a lot more confidence.

Cheers, Jason

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 30 Nov 2010, 10:59:44 am
For sake of completeness:

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>    4.5 GFlops   17.8 GB/s 1183.3ulps

 SumMax (    64)    0.2 GFlops    1.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.7 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        4.4 GFlops   17.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         1.3 GFlops    5.3 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.4 GFlops    5.8 GB/s 121.7ulps


Some 10% difference between the two bottom ones.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 30 Nov 2010, 11:03:32 am
Cheers,
  Analysing...
   
Average, peak calcs, thread-count heuristic: OK
    worst case speedup: (1.3-0.4)/0.4   ~225%  (3.25x).. Winner!  ;D
    best case speedup:  (1.4-0.4)/0.4    ~250%  (3.5x)


Double checking those ridiculous numbers: (mistakes always possible  ;) )

1.3 GFlops (optimised) / 0.4 GFlops (Stock) definitely = 3.25x  (325% of stock throughput)
The percentage of optimised throughput that is speedup is then 0.9 GFlops / 1.3 GFlops  ~= 69 percent of Opt throughput is Bonus.  Speedup component is 225% of the stock throughput.

#Stock is doing something that GPU doesn't like  :-\
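
The three ways of quoting the same gain are easy to mix up, so here they are side by side in a short Python sketch. The function and key names are hypothetical; the inputs are the 0.4 GFlops stock and 1.3 GFlops worst-case figures from the Quadro output above.

```python
def throughput_breakdown(stock, opt):
    """Express one stock/optimised pair three different ways."""
    gain = opt - stock
    return {
        "multiple_of_stock": opt / stock,        # the "Nx" figure
        "speedup_pct": 100.0 * gain / stock,     # gain relative to stock
        "bonus_share_pct": 100.0 * gain / opt,   # gain as a share of optimised throughput
    }


quadro = throughput_breakdown(stock=0.4, opt=1.3)
for name, value in quadro.items():
    print(f"{name:18s} {value:7.2f}")
```

For these inputs the multiple is about 3.25x, the speedup about 225%, and the bonus share about 69%, which agrees with the double check above.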
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 30 Nov 2010, 11:35:27 am
I reran a few times, getting 0.8-0.9 1.3 1.4-1.5 now.
i.e. higher baseline, optimization values stable. Can do some statistics tomorrow.

edit: that 0.4 seems to have been exceptionally low (and no, I didn't have the GPU crunching by accident :P )
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 30 Nov 2010, 11:37:35 am
OK, non-critical unless I make computation mistakes  ( I was mostly concerned here to not make code slower...).  Stock / x32f code there is doing something your GPU doesn't like IMO.

Is that Quadro integrated & using some portion of system memory, or does it use dedicated memory?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 30 Nov 2010, 06:33:23 pm

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   28.4 GFlops  113.7 GB/s   0.0ulps

 SumMax (    64)    2.3 GFlops    9.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    7.4 GFlops   29.9 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       41.4 GFlops  165.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
        10.9 GFlops   44.0 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.2 GFlops   65.4 GB/s 121.7ulps


This was much easier than typing it out. Thanks, Richard.

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 30 Nov 2010, 11:26:41 pm
Thanks Steve!,
   Now your increased Core speed is showing via the improved 'worst case' speedup over mine ( Your 10.9 Vs my 8.1 GFlops )

GTX480 (watercooled)
Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~47%   ( 1.47x )
    best case speedup:   ~119%  ( 2.19x )

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 01 Dec 2010, 02:38:29 pm

Nice that my tweaking works even faster on XP, but I'm starting to hope MS include some sort of video subsystem fixes in SP1 for Win7x64  :D


Just re-ran the Mod5 test on my GTX465 on Win7 x64 SP1 v.721 RC,
getting the same results as before:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   63.9 GB/s   0.0ulps

 SumMax (    64)    1.3 GFlops    5.2 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.1 GFlops   16.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.4 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 01 Dec 2010, 02:43:41 pm
Interesting.  In the meantime I also managed to verify that 32 bit versus 64 bit executable yielded no discernible performance difference here ( Since it's GPU hard coded anyway  ;) )

So we're left with WinXP32's simpler driver model, with no DirectX 10+ support or WDDM stuff going on, IMO.  I wonder if there's a way to turn off more stuff in Win7x64, video subsystem-wise.

[Edit:] Hmmm....
http://www.anandtech.com/show/3924/nvidia-announces-parallel-nsight-15-cuda-toolkit-32

"Compared to the old XPDM, WDDM was a big step up for GPU usage on Windows, but only for graphical purposes. With Windows’ iron-fisted control over the GPU and a focus on task scheduling for responsiveness over performance, it wasn’t ideal for GPGPU purposes. Case in point, with a WDDM driver NVIDIA was finding it took 30μs for a kernel to be launched, but if they had Windows treat the GPU as a generic device by using a Windows Driver Model (WDM) driver, that launch time dropped to 2.5μs. This coupled with the fact that a WDM driver is necessary to use Tesla cards in a Windows Remote Desktop Protocol environment (as any Folding @Home junkie can tell you, RDP sessions can’t access the GPU through WDDM) resulted in the birth of TCC mode."
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 01 Dec 2010, 02:59:02 pm
Looks good - a massive drop in time to launch kernels, shame it's only available for Tesla GPUs at the moment
Hopefully NV will release a similar driver for at least the Fermi cards if not all the current cards
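
To put the quoted launch latencies (30 microseconds under WDDM, 2.5 under WDM/TCC) in perspective, the rough Python estimate below multiplies them out. It is a back-of-envelope model only: it assumes launches are issued one after another with nothing hiding the latency, and the launch count is a made-up example rather than a measurement from any SETI@home build.

```python
WDDM_LAUNCH_US = 30.0   # per-launch latency quoted for a WDDM driver
WDM_LAUNCH_US = 2.5     # per-launch latency quoted for a WDM (TCC) driver


def launch_overhead_ms(num_launches, per_launch_us):
    """Total launch latency in milliseconds for serialized kernel launches."""
    return num_launches * per_launch_us / 1000.0


def overhead_saved_ms(num_launches):
    """Latency avoided by the lower per-launch cost, in milliseconds."""
    return (launch_overhead_ms(num_launches, WDDM_LAUNCH_US)
            - launch_overhead_ms(num_launches, WDM_LAUNCH_US))


# Hypothetical workload: 1000 short kernel launches.
print(launch_overhead_ms(1000, WDDM_LAUNCH_US), "ms under WDDM")
print(launch_overhead_ms(1000, WDM_LAUNCH_US), "ms under WDM/TCC")
```

The saving grows with the number of launches, so code built from many short kernels would stand to gain the most, while a few long-running kernels would barely notice.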
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 01 Dec 2010, 03:08:18 pm
Yeah, the OmegaDrivers.net guy looks to be broke & struggling to work out Win7 drivers too (none for Win7 available when you read further in).
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 01 Dec 2010, 03:17:13 pm
Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens

http://forums.nvidia.com/index.php?showtopic=159208

Quote
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"
Title: Re: [Split] PowerSpectrum Unit Test
Post by: kevin6912 on 01 Dec 2010, 05:18:11 pm
Test 5  output.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   20.3 GFlops   81.3 GB/s   0.0ulps

 SumMax (    64)    0.7 GFlops    2.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    2.3 GFlops    9.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       29.2 GFlops  117.0 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         2.3 GFlops    9.5 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.1 GFlops   44.9 GB/s 121.7ulps

Kevin
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 01 Dec 2010, 05:37:45 pm
PowerSpectrumTest5.exe -device 0
.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   20.6 GFlops   82.5 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.6 GFlops   18.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       30.0 GFlops  119.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.6 GFlops   26.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.2 GFlops   45.2 GB/s 121.7ulps


PowerSpectrumTest5.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   20.7 GFlops   82.6 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.6 GFlops   18.7 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       30.1 GFlops  120.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.6 GFlops   26.9 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.2 GFlops   45.3 GB/s 121.7ulps


.
Done
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 02 Dec 2010, 03:46:56 pm
Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens

http://forums.nvidia.com/index.php?showtopic=159208

Quote
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"


@Ghost: I did get the following so far:
- Made the modifications appropriate to the inf file, and successfully installed 263.06 TCC driver ( On 480 )
- Disabled the device as a 'normal' display (using mobo display instead)
- Merged the nSight registry key that disables WPF acceleration (for good measure, shouldn't be necessary with no active display on it)


Next step should be to switch the devices driver mode to TCC mode.  That's done via the command:
  nvidia-smi --driver-model=

however I get this response:
Quote
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe --driver-model=
GPU 0 is not a supported TCC device, skipping
[Edit:] Note that it doesn't say that the card/driver doesn't support it...
Confirming with DeviceQuery:
Quote
CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 480"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 1576468480 bytes
  Multiprocessors x Cores/MP = Cores:            15 (MP) x 32 (Cores/MP) = 480 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    0.81 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 480


PASSED

So I gather we're stuck for now  :(  [Edit:] unless you happen to be good with SoftIce or similar.... ::)

Going to try checking if I got the section number in the inf right etc...
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 02 Dec 2010, 05:31:33 pm
I got stuck @ the same point as well.
When I ran the nvidia-smi.exe -dm0 cmd I got the same message about the GPU not being supported on TCC.
I tried modifying the .inf file with limited success so I used this site http://laptopvideo2go.com
Basically they create a standard .inf file that allows all NV cards to use all drivers ;D Saved a lot of time and hassle - they also have unreleased drivers on their site. The latest I could see were 265.90. Not sure where they get them from so use at your own risk, but I've had no issues with them.
I also saw a slight increase in the worst case scenario with Mod5 with these drivers; off the top of my head it was about a 0.2 increase over the official release drivers.
Haven't had a look at SoftIce yet - I'll do a bit of research tomorrow as it looks like I may not be getting into the office again :D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 03 Dec 2010, 02:25:17 pm
Despite finding a couple more posts (after spending a few hours searching) saying that it is possible to enable TCC mode on non-Tesla cards, I haven't been able to get it running on my 465.
At the moment all I have managed is to change the compute mode ruleset through nvidia-smi. However, DeviceQuery still says it's running in Default mode, so whether this has had any real effect or not is up for debate >:(
Think it's about time to give up on this idea unfortunately. Shame though, it would have been nice to get it working, as all this card does is crunch Seti
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 03 Dec 2010, 02:45:34 pm
Yeah tried that compute mode thing too.  With Fermi's we want 'Normal' mode anyway, so as to allow multiple instances  ;).

I've thought about it, and to bypass the issue altogether I'll try get another hard drive sorted sometime soon, and use it to dualboot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave for extended period crunching under XPDM.  I have XPx64 as well, but since 64 bit Cuda apps yield a net small slowdown, it seems illogical to use that copy for that.

If it had been a matter of the current ~10% difference stock sees between the driver models, I wouldn't have gone to the trouble of DualBoot.  My optimisations, on the other hand, yielding ~30% in favour of XPDM, really force the issue for me (even though faster on both OSes/driver models), since a lot of the refinement achieved here with a small kernel is likely to apply through most of the application (after a lot of work).   That translates to a crapload of compute performance in my book, since the single 480 (on Wolfdale, x32f on Win7x64) was sustaining 25-26k RAC when there was work.  Much as I dislike RAC as a measure, ~10% extra there (current code) would only boost it to ~27-28k or so (within work-dependent variation anyway);  ~32k, though, seems more definite & well worth the added effort. ( [Edit:] then add optimisation benefit I suppose )

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 03 Dec 2010, 02:54:56 pm
OK, non-critical unless I make computation mistakes  ( I was mostly concerned here to not make code slower...).  Stock / x32f code there is doing something your GPU doesn't like IMO.

Was that quadro 'integrated' & using some portion of system memory ? or does it use dedicated memory ?


I dug out a 'shared memory: no' from a German comparison site. It's got 256M of its own as far as I know.

nvidia control panel system info comes up with
total available 1535 MB
dedicated 256 MB GDDR3
system video 0MB
shared system mem 1279MB
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 03 Dec 2010, 03:05:16 pm
Great, thanks.  Yes the dedicated number is the clincher, it's a discrete GPU then, explaining why it didn't trigger a special integrated GPU optimisation I made in the last test (that particular functionality remains untested/unverified).  The shared bit will just be system memory the driver's using for WDDM paging.   I only care about that because it looks like Santa might be bringing me an ION2 based netbook (I was a good boy all year, sort of), so poking around with that functionality early seemed a good idea.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 03 Dec 2010, 06:19:27 pm
Yeah tried that compute mode thing too.  With Fermi's we want 'Normal' mode anyway, so as to allow multiple instances  ;).

I've thought about it, and to bypass the issue altogether I'll try get another hard drive sorted sometime soon, and use it to dualboot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave for extended period crunching under XPDM.  I have XPx64 as well, but since 64 bit Cuda apps yield a net small slowdown, it seems illogical to use that copy for that.

Jason

I was thinking about something similar - just ordered a couple of 1TB drives and a raid controller, so after migrating my current data drives to those (and finally getting some internal redundancy  :)) I'll have a spare drive that I was planning on either loading with Linux, or XP if I can find the disk again. A 30% boost is definitely worth the extra effort of setting up the dualboot. Although I may just run a VM for Boinc: depending on whether I can get the GPUs to be seen by the VM, and how well Boinc operates, I'll create an XP VM and run it that way
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Dec 2010, 12:34:51 pm
Extra drives on order here ... hopefully will be able to find a floppy disk for XP raid driver install before they arrive  ::).

If you're able to verify the increased XPDM advantage with the heavily optimised kernels (over the stock ~10% advantage between driver models) before I get set up, I'll report the increased XPDM<->WDDM speed discrepancy with highly optimised kernels ... since they may not have factored as much as a 30% performance difference into decisions (related to TCC mode).
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 04 Dec 2010, 02:09:09 pm
I've managed to scavenge an old drive from an old machine for this test, so have now got a dual-boot machine for a short time ;)
Just downloading and installing the standard drivers to get a baseline for the test -

Stock results on XP Pro x32 260.99 drivers:

Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   63.8 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.4 GFlops   17.7 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.0 GFlops   91.9 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.7 GFlops   27.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 04 Dec 2010, 02:18:56 pm
PowerSpectrumxe2011Test5.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5)
Stock:
 PwrSpec<    64>   11.9 GFlops   47.6 GB/s   0.0ulps

 SumMax (    64)    0.4 GFlops    1.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.4 GFlops    5.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       18.5 GFlops   73.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         2.1 GFlops    8.3 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         2.4 GFlops    9.6 GB/s 121.7ulps


PowerSpectrumxe2011Test5.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5)
Stock:
 PwrSpec<    64>   11.9 GFlops   47.6 GB/s   0.0ulps

 SumMax (    64)    0.4 GFlops    1.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.4 GFlops    5.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       18.3 GFlops   73.3 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         2.1 GFlops    8.4 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         2.4 GFlops    9.6 GB/s 121.7ulps


.
Done

Remark:compiled with XE2011

modify:
something must be changed, last Test5 above shows
11.2 GFlops   45.3 GB/s 121.7ulps

in last line
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Dec 2010, 02:32:34 pm

something must be changed, last Test5 above shows
11.2 GFlops   45.3 GB/s 121.7ulps

Yeah, 11.2 is more like what that card should be doing heinz.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Dec 2010, 04:23:27 pm
Stock results on XP Pro x32 260.99 drivers:
...
 PS+SuMx(    64)    4.4 GFlops   17.7 GB/s
...
  256 threads, fftlen 64: (worst case: full summax copy)
         6.7 GFlops   27.2 GB/s 121.7ulps
...
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps

OK, so far against your previous results (assuming all else equal), we're back to our roughly 10% performance advantage to XP:

(XP32-Win7x64)/Win7x64
Stock case: (4.4-4.1)/4.1 = ~7.3 % advantage to XP (expected, not too annoying)
Worst case: (6.7-6.0)/6.0 = ~11.7% advantage to XP ( I can *almost* live with that)
Best case:  (8.7-8.7)/8.7 = ~0.0% advantage to XP (fine)

So there appears to be a greater advantage to XP with the worst case (lots of memory transfers), though not as great as feared... Phew!  ;D

Since the Memory numbers have more significant digits, and the worst case advantage indicates a memory issue of some sort, I'll compare the throughput figures also:
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case:  (35.3-35.4)/35.4 = ~0.3% advantage to Win7

Tentative analysis based on above:   Raw compute speed between the two OS/Driver models is roughly the same ('Best Case' has no memory transfer of results), however WDDM's memory paging schemes increase overheads for the worst case by up to ~14.2% on that system ( 1/(1-0.124) ).

So memory transfers will have to be minimised in critical kernels.  I can enable a pinned memory optimisation I implemented for integrated GPUs, which might just help the situation.  At least we're not looking at the ~30% difference that had me petrified.
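For anyone wanting to re-check the percentages above, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not part of the unit test: the function name `xp_advantage_pct` and the table layout are made up here, and the figures are the Test #5 memory throughputs (GB/s) quoted in this post.

```python
# Relative advantage of XP over Win7, computed as (XP - Win7) / Win7.
# Figures are the Test #5 memory throughputs (GB/s) quoted in this post.
# xp_advantage_pct and test5_gbps are illustrative names, not from the test app.

def xp_advantage_pct(xp, win7):
    """Percent advantage to XP; a negative value means Win7 is ahead."""
    return (xp - win7) / win7 * 100.0

test5_gbps = {
    "stock": (17.7, 16.5),   # (XP32, Win7x64)
    "worst": (27.2, 24.2),
    "best":  (35.3, 35.4),
}

advantage = {case: xp_advantage_pct(xp, win7)
             for case, (xp, win7) in test5_gbps.items()}

for case, pct in advantage.items():
    winner = "XP" if pct >= 0 else "Win7"
    print(f"{case:5s}: {abs(pct):5.2f}% advantage to {winner}")
```

The same function works on the GFlops columns; the GB/s columns are used here only because they carry more significant digits.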

Jason

 
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Dec 2010, 04:42:10 pm
@Heinz, something broke in that source you used, investigating.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 04 Dec 2010, 04:44:04 pm
I've been playing with a couple of other versions of drivers (263.xx & 256.xx) as well and there is no improvement over the current 260.99 WHQL release drivers figures.
Was worth doing this just to get an XP machine up and running again - although I'm struggling to remember where anything is.....
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Dec 2010, 04:54:29 pm
Was worth doing this just to get an XP machine up and running again - although I'm struggling to remember where anything is.....

Yep going back is a challenge after adapting.  Now that I'm pretty confident the memory transfers are the main factor, I'm hopeful a certain 'trick' may squash the difference.  We'll see.

[Edit:] Updated first post:
Quote
Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither,  Test5 remains for comparison)

Will use pinned memory, for Opt1, on GPUs that can do so.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 04 Dec 2010, 07:07:07 pm
Hi Jason,

Getting an error with the new build saying that cudart_32_32_7.dll isn't present - is this meant to be in the .7z file?

ghost
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 04 Dec 2010, 07:35:43 pm
Just to see if it would run, I made a copy of the cudart32_32_16.dll, renamed it to cudart32_32_7.dll and then ran the test

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   12.9 GFlops   51.4 GB/s   0.0ulps

 SumMax (    64)    1.0 GFlops    4.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.4 GFlops   13.6 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       19.4 GFlops   77.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.4 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         7.0 GFlops   28.2 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Dec 2010, 08:09:44 pm
Cheers both, will investigate.  Not sure where the build decided to use 32_7 from ::), probably from messing with drivers earlier.  Will rebuild shortly & reattach. [Done]

Jason 
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 05 Dec 2010, 05:15:18 am
Thanks Jason:
Here's results under XP:
Quote
Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   15.8 GFlops   63.3 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.5 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.4 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.6 GFlops   30.6 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps

and under Win 7:
Quote
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   17.3 GFlops   69.2 GB/s   0.0ulps

 SumMax (    64)    1.2 GFlops    5.2 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.0 GFlops   16.3 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       27.5 GFlops  110.0 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.2 GFlops   29.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         9.2 GFlops   37.3 GB/s 121.7ulps

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 06:01:54 am
Ghost's results before pinned memory usage (Test #5 memory throughput):

Quote
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case:  (35.3-35.4)/35.4 = ~0.3% advantage to Win7

with pinned memory (Test #6 Memory throughput )
Stock case*:  (17.5-16.3)/16.3 = ~7.36% advantage to XP (consistent with prior result)
Worst case: (30.6-29.2)/29.2 =  ~4.8% advantage to XP (Narrowed)
Best case:  (35.3-37.3)/37.3 = ~5.4% advantage to Win7 (!)  :o

*Stock code doesn't use pinned memory

Further tentative analysis:  Using pinned (non-pageable) memory for critical datasets, together with asynchronous Host<->Device transfers, helps hide the additional overheads experienced in the WDDM driver model.  Careful use of these latency hiding mechanisms, though complex, can yield improved performance on WDDM platforms when large transfers are needed (such as with 'worst case'), and completely hide costs when transfers are minimised (such as with 'best case').  The end result on WDDM platforms, with partial implementation of the optimisation strategies, will likely be performance that roughly approximates XPDM performance, or exceeds it by some small margin when costs can be totally hidden.  This is likely a function of the WDDM host memory paging scheme employed under the newer driver model already having effectively 'mirrored' some required data on the host & device.
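To put numbers on how much each driver model gained from pinning, here is a quick sketch in Python. It is plain arithmetic on Ghost's Opt1 memory throughputs from Test #5 and Test #6 as quoted above; `gain_pct` and the table layout are made-up names for illustration only.

```python
# Gain from Test #5 (no pinned memory) to Test #6 (pinned memory) for Opt1, per OS.
# Figures are Ghost's memory throughputs (GB/s) as quoted in the thread.
# gain_pct and opt1_gbps are illustrative names, not from the test app.

def gain_pct(before, after):
    """Percent change going from the Test #5 figure to the Test #6 figure."""
    return (after - before) / before * 100.0

opt1_gbps = {
    # case: {os: (test5, test6)}
    "worst": {"XP": (27.2, 30.6), "Win7": (24.2, 29.2)},
    "best":  {"XP": (35.3, 35.3), "Win7": (35.4, 37.3)},
}

gains = {case: {os_name: gain_pct(*pair) for os_name, pair in per_os.items()}
         for case, per_os in opt1_gbps.items()}

for case, per_os in gains.items():
    for os_name, pct in per_os.items():
        print(f"{case:5s} {os_name:4s}: {pct:+6.2f}%")
```

On these figures the worst case gains about 12.5% on XP and about 20.7% on Win7, which is where the narrowing comes from.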

Cheers All! Success!  ;D  More ammunition to go on with helps a lot.

Overall, it seems Windows 7/Vista WDDM driver model is not slower after all, but requires 'more careful' (& complex) programming to make the implementations efficient.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 05 Dec 2010, 06:36:13 am
Brilliant news  :D

Ghost
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 05 Dec 2010, 07:21:51 am
Here's the PowerSpectrum6 results on my 9800GTX+ on Win 7 64bit:

Quote
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   16.1 GFlops   64.6 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.5 GFlops   18.3 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         7.1 GFlops   28.7 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.9 GFlops   40.0 GB/s 121.7ulps

and on Win Vista 64bit:

Quote
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   16.1 GFlops   64.3 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.4 GFlops   17.8 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         6.9 GFlops   27.8 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.9 GFlops   39.9 GB/s 121.7ulps

and on my 128Mb 8400M GS on Vista 32bit:

Quote
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    1.2 GFlops    4.8 GB/s 1183.3ulps

 SumMax (    64)    0.1 GFlops    0.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        1.2 GFlops    4.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         0.6 GFlops    2.5 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         0.6 GFlops    2.6 GB/s 121.7ulps

Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 07:54:06 am
Hehe, those ( worst case Opt1) are up a bit ( apart from the 8400M, I suppose unsurprisingly ).  Looks like we found a WDDM display driver limitation, and should be able to work around it, with lots of effort.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 05 Dec 2010, 08:04:23 am
I also added PowerSpectrum5 results for my 9800GTX+ on Vista 64bit, on page Eight (http://lunatics.kwsn.net/12-gpu-crunching/split-powerspectrum-unit-test.msg33630.html#msg33630)

Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 08:21:58 am
Cheers, yep was looking back there, definitely confirms the use of pinned memory helped Opt1, a bit more than I expected too.

On the XPDM Vs WDDM issue, I've had further confirmation on 8800GTS, from a non-crunching friend, that test #5 Opt1 worst case is faster on XPDM over win7, but roughly same speed in Test #6 (using Pinned Memory).  The 'Best case' is also faster on Win7, so the numbers seem to match up.   Make the code a bit more sophisticated & Win7 performance is ~equal to a bit faster than XP.

I'll be stewing on these additional aspects we've worked out here for a little while, and apply the knowledge to expanded tests with more fft sizes ~end of week.  If that pans out well, it'll be time to start levering in these small improvements into the X series codebase.  After the powerspectrum+reduction is integrated, then will probably be refinement & expansion of the 'freaky powerspectrum' (custom FFT) kernels using the same knowledge.

All this, of course, is working towards 'fixing' the problematic pulsefinding down the road, and having enough strategies to do so effectively.
(Can't wait for the time when I can ask Berkeley to send VLARs back out to GPUs again  :P)

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 05 Dec 2010, 09:40:16 am
9800GTX+, Windows 7/32

Code: [Select]
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   15.8 GFlops   63.4 GB/s 1183.3ulps

 SumMax (    64)    1.3 GFlops    5.3 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.1 GFlops   16.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       15.9 GFlops   63.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         6.9 GFlops   28.1 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.8 GFlops   39.5 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 05 Dec 2010, 11:13:38 am
Took a couple of tries but I think I got it right....


Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd \test

C:\test>powerspectrum6.exe >results.txt
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1840 MHz clock, 1008 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    2.8 GFlops   11.3 GB/s 1183.3ulps

 SumMax (    64)    0.4 GFlops    1.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.2 GFlops    4.9 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        2.8 GFlops   11.4 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.9 GFlops    7.6 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         2.0 GFlops    8.2 GB/s 121.7ulps



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 05 Dec 2010, 11:31:33 am
Vista64
~~~~
Stopping Boinc...
PowerSpectrumTest6.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   20.4 GFlops   81.6 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.6 GFlops   18.7 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       30.0 GFlops  119.9 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.1 GFlops   28.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.1 GFlops   45.1 GB/s 121.7ulps


PowerSpectrumTest6.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   20.4 GFlops   81.8 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.6 GFlops   18.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       30.1 GFlops  120.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.3 GFlops   29.7 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.2 GFlops   45.2 GB/s 121.7ulps


.
Done
Restarting Boinc...
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 12:40:39 pm
Thanks Richard, perryjay & Heinz.

All fit with the models so far.

The compute capability 1.1 devices (Richard's & Perryjay's) are IMO doing their memory bound best with the powerspectrum, ~matching stock 'PwrSpec' speed for that, then 'magically' lifting with the reductions (summax) for Opt1 worst case.  I believe that must be purely a result of the memory transfer hiding, since the compute density of the reduction hasn't changed from O(log n).

@Heinz, glad to see your numbers back up to where they should be.  I reckon that's scaling well against my OC'd 480:
Stock (PS+Summax):  5.9 GFlops, 23.7 GB/s
worst (opt1):      10.0 GFlops, 40.4 GB/s
best  (opt1):      16.0 GFlops, 64.8 GB/s
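
As a rough check on that scaling, the GFlops ratios can be worked out as below. This is just a Python sketch; it takes Heinz's device 1 figures from his Test #6 post and the OC'd 480 figures listed above, and the variable names are made up for illustration.

```python
# Heinz's GTX 470 (device 1, Test #6) against the OC'd GTX 480 figures above, in GFlops.
# gflops and ratio are illustrative names only.
gflops = {
    "stock": (4.6, 5.9),    # (GTX 470, OC'd GTX 480)
    "worst": (7.3, 10.0),
    "best":  (11.2, 16.0),
}

ratio = {case: gtx470 / gtx480 for case, (gtx470, gtx480) in gflops.items()}

for case, r in ratio.items():
    print(f"{case:5s}: 470 runs at {r:.2f}x the 480")
```

That puts the 470 at roughly 0.70x to 0.78x of the 480 across the three cases.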

Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 05 Dec 2010, 02:43:36 pm

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   28.1 GFlops  112.5 GB/s   0.0ulps

 SumMax (    64)    2.3 GFlops    9.6 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    7.2 GFlops   29.2 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       41.4 GFlops  165.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
        12.7 GFlops   51.5 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.3 GB/s 121.7ulps


Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 04:14:37 pm
Ouch! 27% more throughput on worst case Opt1 than mine  (12.7 Vs 10 GFlops) ;D despite slower powerspectrum (memory), that can't be core (same 'best' case @16.1) .... PCIe Bus overclocked ? (ahh, faster host memory too I suppose)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 05 Dec 2010, 05:06:18 pm
My CPU memory is at 1774 MHz. My PCIe bus is slightly overclocked. I adjusted my GPU RAM to 1900 MHz. There is still room for more. I am on my last GPU wu for Einstein. There aren't any available at the moment. Piggy hit the #5 spot for the top rigs at Einstein with a RAC of over 14,000. There is nothing slow about Piggy. It does a fantastic job at running Starry Night Pro Plus astronomy software. I can't wait to get back to SETI crunching!

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 05:13:01 pm
My CPU memory is at 1774 MHz. My PCIe bus is slightly overclocked. ..

Whew! that's a relief. My host is only running dual channel DDR2 memory (corsair stuff though), so I'm due for some upgrades on the host if it's limiting the 480.  Will see if I can hold out 'till Sandy Bridge release & get decent CPU/RAM/Mobo to drive it :-\.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 05 Dec 2010, 06:25:39 pm
9800GT, Windows XP/32

Code: [Select]
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   12.1 GFlops   48.5 GB/s 1183.3ulps

 SumMax (    64)    1.1 GFlops    4.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.5 GFlops   14.2 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       12.1 GFlops   48.4 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         5.8 GFlops   23.4 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.0 GFlops   28.4 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 05 Dec 2010, 09:16:32 pm
Win7 x64
*********
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   26.5 GFlops  105.8 GB/s 1183.3ulps

 SumMax (    64)    2.2 GFlops    9.3 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.8 GFlops   27.3 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.7 GFlops  106.9 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  128 threads, fftlen 64: (worst case: full summax copy)
        11.4 GFlops   46.1 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        15.5 GFlops   62.8 GB/s 121.7ulps

-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   26.1 GFlops  104.3 GB/s 1183.3ulps

 SumMax (    64)    2.2 GFlops    9.2 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.9 GFlops   28.0 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.4 GFlops  105.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  128 threads, fftlen 64: (worst case: full summax copy)
        11.3 GFlops   45.9 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        15.4 GFlops   62.2 GB/s 121.7ulps

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   25.5 GFlops  101.9 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.6 GFlops   26.7 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       25.9 GFlops  103.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  128 threads, fftlen 64: (worst case: full summax copy)
        10.8 GFlops   43.5 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        14.4 GFlops   58.2 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 05 Dec 2010, 10:04:13 pm
Ahah, I wondered how the 200 series would respond (haven't had a chance to test on the 260 in the other room yet).  Looks like they appreciate the lifting of memory constraints as well.  That means we'll probably all start going up in GFlops as we pack in more computation (chirps, FFTs, findspikes, etc.).  This latest test appears to be capping out at host memory & PCIe bus speeds, so while faster, it has an artificial ceiling imposed by the current code designs & their communication costs (memory & bus bound), rather than by GPU compute performance.
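
As a rough cross-check of the memory-bound reading, the reported GFlops and GB/s pairs can be divided to get the flops done per byte moved. The sketch below uses the GTX 295 (device 0) figures from the test #6 output earlier in the thread; the function name and labels are made up for illustration and are not part of the unit test.

```python
# Figures copied from the GTX 295 (device 0) test #6 output in this thread.
# Each entry: (label, GFlops, GB/s)
reported = [
    ("Stock PwrSpec", 26.5, 105.8),
    ("Opt1 worst case", 11.4, 46.1),
    ("Opt1 best case", 15.5, 62.8),
]


def flops_per_byte(gflops, gbytes_per_s):
    """Arithmetic intensity implied by one reported throughput pair."""
    return gflops / gbytes_per_s


# Hypothetical helper output: one ratio per reported line.
intensities = {label: flops_per_byte(g, b) for label, g, b in reported}

for label, value in intensities.items():
    print(f"{label:16s} {value:.3f} flops/byte")
```

Every line works out to about a quarter of a flop per byte, i.e. roughly one floating point operation per 4-byte float touched. With so little arithmetic per byte, it is plausible that memory and bus traffic set the ceiling rather than the GPU's compute rate.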
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 06 Dec 2010, 05:36:57 am
and one small mobile GPU ;) :

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    4.5 GFlops   17.8 GB/s 1183.3ulps

 SumMax (    64)    0.2 GFlops    1.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.9 GFlops    3.4 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        4.5 GFlops   17.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.5 GFlops    5.9 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.6 GFlops    6.7 GB/s 121.7ulps

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Vyper on 06 Dec 2010, 07:38:30 am
Well here is one of my slightly overclocked GTX460.

Running Win7X64 & 260.99 version.

Kind regards Vyper
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 06 Dec 2010, 08:09:17 am
and one small mobile GPU ;) :
The worst case reduction is faster while the powerspectrum same speed, great ;D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 06 Dec 2010, 08:11:29 am
Well here is one of my slightly overclocked GTX460.

Thanks!  We Fermi users are going to need more computation packed in there to bring those GFlops up.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 06 Dec 2010, 09:30:38 am
ok, a bit of statistics then. average +- std dev over 15 runs

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    4.4 GFlops   17.5 GB/s 1183.3ulps

 SumMax (    64)    0.3 GFlops    1.1 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.82 +- 0.086 GFlops    3.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        4.37 +- 0.046 GFlops   17.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.37 +- 0.149 GFlops    6.0 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.61 +- 0.026 GFlops    6.6 GB/s 121.7ulps


now if the pink was better distinguishable from the white ::)
would you like that for the GB/s as well?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 06 Dec 2010, 09:57:58 am
now if the pink was better distinguishable from the white ::)
would you like that for the GB/s as well?

Thanks for the tolerances.  Being largely memory bound, the GFlops tolerances are more than enough, and indicate roughly +/- 10% variation in the worst case.  I presume that's driving a display, so that's reasonable.
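
For reference, the "+/- 10%" figure is just the standard deviation divided by the mean. A minimal sketch using the 15-run GFlops values from the Quadro FX 570M post above (the dictionary keys and function name are illustrative only):

```python
# Mean and standard deviation (GFlops) from the 15-run Quadro FX 570M post.
runs = {
    "Stock PS+SuMx": (0.82, 0.086),
    "GetPowerSpectrum 64 threads": (4.37, 0.046),
    "Opt1 worst case": (1.37, 0.149),
    "Opt1 best case": (1.61, 0.026),
}


def relative_spread(mean, std_dev):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * std_dev / mean


spread = {name: relative_spread(m, s) for name, (m, s) in runs.items()}

for name, pct in spread.items():
    print(f"{name:28s} +/- {pct:.1f}%")
```

The two cases that copy the summax array back to the host come out near 10%, while the power spectrum alone and the Opt1 best case stay under 2%.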
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 06 Dec 2010, 03:57:33 pm
Thanks for the tolerances.  Being largely memory bound, the GFlops tolerances are more than enough, and indicate roughly +/- 10% variation in the worst case.  I presume that's driving a display, so that's reasonable.

You're welcome - now what exactly makes you think the mobile GPU of a laptop might be driving a display? ;D
No bluescreens with the latest driver yet - touch wood...

I'll do statistics on all the numbers next time round then.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 06 Dec 2010, 05:36:07 pm
OK, I ran version 6 of the tool on my system (Q6600/8GB/8800GTX) under both WinXP32 as well as Win7-64. If you want me to (re-)run other versions of the tool, let me know. ;)

Both logs follow, one after the other, starting with the older OS, WinXP32:

Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   18.3 GFlops   73.1 GB/s 1183.3ulps

 SumMax (    64)    1.3 GFlops    5.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.3 GFlops   73.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         6.4 GFlops   26.1 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         8.1 GFlops   32.7 GB/s 121.7ulps



Then Win7-64:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   18.1 GFlops   72.5 GB/s 1183.3ulps

 SumMax (    64)    1.1 GFlops    4.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.8 GFlops   15.4 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.1 GFlops   72.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.4 GFlops   21.9 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         6.6 GFlops   26.8 GB/s 121.7ulps


Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 06 Dec 2010, 06:19:49 pm
Ahhh, hi Patrick.  Looks like your card  should still be able to use pinned host memory, but isn't  :( .  It indeed doesn't support mapped memory (a different kind), but didn't engage the pinned memory improvement because I need to change how I detect that feature.  I'm checking the wrong feature flags it looks like.... ooops  ::)

Will make a #7 end of week, and pay special attention to making sure that engages properly on compute capability 1.0 cards (that don't support mapped memory).

Cheers for finding the problem  ;)
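
A minimal sketch of the intended selection logic, assuming the decision is driven by the two CUDA device properties `canMapHostMemory` and `integrated`. The function name and the returned labels are hypothetical; this is not the unit test's actual detection code.

```python
def choose_host_buffer(can_map_host_memory: bool, integrated: bool) -> str:
    """Pick a host-buffer strategy from two device properties.

    Hypothetical helper: the arguments mirror the CUDA device properties
    canMapHostMemory and integrated; the return labels are made up.
    """
    if can_map_host_memory and integrated:
        # Mapped (zero-copy) memory: the GPU accesses host memory directly,
        # which mainly pays off when GPU and CPU share the same RAM.
        return "mapped"
    # Page-locked (pinned) host memory does not depend on the mapping flag,
    # so it is the fallback for every CUDA-capable card, cc 1.0 included.
    return "pinned"


# An 8800 GTX (cc 1.0) reports no mapped-memory support but can still pin.
print(choose_host_buffer(can_map_host_memory=False, integrated=False))
```

The point of the sketch is only that pinned memory should never be gated on the mapped-memory flag.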
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 06 Dec 2010, 08:21:04 pm
Ahhh, hi Patrick.  Looks like your card  should still be able to use pinned host memory, but isn't  :( .  It indeed doesn't support mapped memory (a different kind), but didn't engage the pinned memory improvement because I need to change how I detect that feature.  I'm checking the wrong feature flags it looks like.... ooops  ::)

Will make a #7 end of week, and pay special attention to making sure that engages properly on compute capability 1.0 cards (that don't support mapped memory).

Cheers for finding the problem  ;)

I have no idea what I did, but you're quite welcome. ;)

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 06 Dec 2010, 08:41:10 pm
Thanks,

    It's what you (the test #6 anyway) didn't do  :D

This line's missing:
Quote
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.5 GFlops    5.9 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.6 GFlops    6.7 GB/s 121.7ulps

When operational, that feature seems to add a touch of throughput to both XP & Vista/Win7, and seems to close the performance difference (that we've been so worried about).  You should get a boost when I fix that.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 07 Dec 2010, 03:49:52 am
Thanks,

    It's what you (the test #6 anyway) didn't do  :D

This line's missing:
Quote
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.5 GFlops    5.9 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.6 GFlops    6.7 GB/s 121.7ulps

When operational, that feature seems to add a touch of throughput to both XP & Vista/Win7, and seems to close the performance difference (that we've been so worried about).  You should get a boost when I fix that.

Jason

Ah, ok, thanks for the elaboration. Looking forward to test #7 then!

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Frizz on 08 Dec 2010, 07:10:12 am
Windows XP32. GTX 570. Nvidia Driver 263.09.


Device: GeForce GTX 570, 1464 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   25.6 GFlops  102.5 GB/s   0.0ulps

 SumMax (    64)    1.9 GFlops    7.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.2 GFlops   25.1 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       33.3 GFlops  133.3 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
        10.9 GFlops   44.0 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        13.5 GFlops   54.7 GB/s 121.7ulps
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 08 Dec 2010, 08:47:31 am
570 wooot!  ;D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Frizz on 08 Dec 2010, 08:50:00 am
570 wooot!  ;D

Borrowed it from a friend. It's hot, almost non-overclockable, and slightly slower than 480.

I am really looking forward to AMD HD6950/6970 !

[EDIT] Seems I got a bad sample. I've seen reports where the 570 has been overclocked to 840@4250 (stock: 732@3800) with air cooling.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 08 Dec 2010, 08:51:55 am
It's hot, almost non-overclockable

Why bother ?  harvesting faulty parts you think ?

[Edit:] that worst case is slightly better than my 480 worst case, but the best cases are inferior.   From the powerspectrum I see the constraint is memory ( again  ::) ) ... So indeed these may not be a good choice for seti in the short term ... probably do Batman really well though  ::)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Frizz on 08 Dec 2010, 09:16:19 am
probably do Batman really well though  ::)

LOL

Yeah ... memory. They chopped the memory interface. I guess they did this so it's not getting too close to the 580 - and not too far ahead of the 480.

GTX570: 320bit
GTX480 & GTX580: 384bit
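
To put a number on what the narrower interface costs, theoretical peak bandwidth is bus width in bytes times the effective memory data rate. The sketch below deliberately applies the same 3800 MHz effective rate (the stock 570 figure quoted earlier in the thread) to both widths, purely to isolate the effect of bus width; the real 480 and 580 run different memory clocks, so these are not their actual specifications.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, effective_mt_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: memory interface width in bits
    effective_mt_s: effective data rate in millions of transfers per second
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_mt_s * 1e6 / 1e9


# Assumption for illustration: same effective rate on both interfaces.
SAME_RATE = 3800.0

bw_320 = peak_bandwidth_gb_s(320, SAME_RATE)
bw_384 = peak_bandwidth_gb_s(384, SAME_RATE)

print(f"320-bit: {bw_320:.1f} GB/s")
print(f"384-bit: {bw_384:.1f} GB/s")
print(f"ratio:   {bw_320 / bw_384:.3f}")
```

At equal memory clocks the 320-bit interface delivers 320/384 of the 384-bit bandwidth, i.e. about a sixth less.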

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 08 Dec 2010, 08:35:46 pm
@All:  In the meantime, having identified the major issues at play with these code areas, along with appropriate techniques to use,  I have come up with some ideas for a major redesign of the FFT->Powerspectrum->Summax(reduction)->FindSpikes pipeline, which currently accounts for around 40%-60% of processing.

I'll change the format of the next test quite a bit, and spend time tomorrow to get things underway toward #7.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 09 Dec 2010, 06:17:07 am
I would suggest testing these code samples at different ratios of GPU frequency to memory frequency.
SubSpace's experiment with the beta OpenCL apps showed that this is a very informative approach.
(He established that HD5 wins over the usual OpenCL MB app if the GPU engine is relatively fast and the memory relatively slow, while the usual app wins if the GPU clock is lowered.)
I think that largely explains why other testers see longer execution times on VLAR tasks with HD5 than with the usual app: their GPUs are not as fast relative to their memory.
The influence of memory can be highlighted quite well this way.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 09 Dec 2010, 03:29:04 pm
Yes, in fact that's exactly what happened to confirm the memory-bound nature of what's going on ( from yet another angle ).

Steve's 480 core is clocked considerably higher than mine, yet he was initially achieving lower throughput than my card.  He tweaked his memory throughput for some improvement. 

After that, a discrepancy between throughput on XP vs Win7 was noted, somewhere around the familiar 10% difference.  I added use of pinned memory for the transfers, to try to hide them.  With Ghost's help, in the heavy transfer case ( worst case full summax array copy, as with stock code ) the XP-Win7 difference was narrowed to ~4% or less, while the WDDM performance proved more efficient with the raw processing in the best case ( no transfers needed ).

Now Steve's 480 achieves some 27% more throughput, in the worst case,  than mine does.  I take this as an indication that the transfer hiding is shifting the bottleneck around as intended, and that it's time to move on to more sophisticated code portions with the acquired tools & techniques.

Still learning stuff every day with these things.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 09 Dec 2010, 06:33:02 pm
I have kicked my 480 memory speed up to 1975 MHz, with plenty of room to go. The 480 cores are clocked at 860 MHz. I tried to increase my CPU memory, but 1774 MHz is as fast as I can get it. I was able to increase my CPU speed to 4.26 GHz with hyperthreading enabled, while maintaining about 57°C to 60°C core temps.

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 10 Dec 2010, 09:59:00 am
I've changed over to win-7 64 bit just before we came back up so I decided to run test 6 again. Not sure how much of a difference it will make.

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrum4.exe > results.txt
'powerspectrum4.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrum6.exe
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1400 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    2.9 GFlops   11.4 GB/s 1183.3ulps

 SumMax (    64)    0.3 GFlops    1.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.0 GFlops    4.1 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        2.9 GFlops   11.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.6 GFlops    6.6 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.8 GFlops    7.3 GB/s 121.7ulps



Leave it to me to mess up, EVGA precision wasn't holding the o/c. I looked all over the place but couldn't find the little button to make it apply at startup until just now. Here's the corrected test...
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    2.9 GFlops   11.5 GB/s 1183.3ulps

 SumMax (    64)    0.4 GFlops    1.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.2 GFlops    4.7 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        2.9 GFlops   11.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         0.7 GFlops    3.0 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         2.1 GFlops    8.3 GB/s 121.7ulps



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Dec 2010, 11:26:12 am
Updated first post:
Quote
Update: PowerSpectrum(+summax reduction) Test #7
 - completed summax reduction sizes 8 through 64
 - refined Opt1 a little, should be a tad faster for size 64 (the only size in the prior test)
 - tidied up test result layout
 - enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)

Please test on all cuda capable cards.
example output:

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.9 GFlops   12.9 GB/s
 PS+SuMx(    16) [OK]    3.9 GFlops   16.2 GB/s
 PS+SuMx(    32) [OK]    3.9 GFlops   15.8 GB/s
 PS+SuMx(    64) [OK]    6.0 GFlops   24.2 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.3   18.6 121.7 [OK]   22.8   99.7 121.7
 PS+SuMx(    16)    6.7   28.1 121.7 [OK]   21.4   89.7 121.7
 PS+SuMx(    32)    9.4   38.6 121.7 [OK]   20.8   85.2 121.7
 PS+SuMx(    64)   11.7   47.4 121.7 [OK]   20.4   82.6 121.7

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 21 Dec 2010, 11:47:05 am
My 9800GTX+ on Win 7 x64:


Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.0 GFlops    8.8 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   10.7 GB/s
 PS+SuMx(    32) [OK]    2.8 GFlops   11.5 GB/s
 PS+SuMx(    64) [OK]    4.5 GFlops   18.1 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    2.7   11.8 121.7 [OK]    7.1   31.0 121.7
 PS+SuMx(    16)    4.0   16.5 121.7 [OK]    7.7   32.1 121.7
 PS+SuMx(    32)    4.9   19.9 121.7 [OK]    7.3   29.7 121.7
 PS+SuMx(    64)    6.6   26.7 121.7 [OK]    8.9   35.9 121.7


and on my 128Mb 8400M GS on Vista 32bit:


Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.3 GFlops    1.3 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.2 GB/s
 PS+SuMx(    32) [OK]    0.2 GFlops    0.9 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.4    1.9 121.7 [OK]    0.5    2.1 121.7
 PS+SuMx(    16)    0.4    1.8 121.7 [OK]    0.5    1.9 121.7
 PS+SuMx(    32)    0.4    1.7 121.7 [OK]    0.4    1.8 121.7
 PS+SuMx(    64)    0.5    2.1 121.7 [OK]    0.5    2.2 121.7


Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Dec 2010, 11:57:36 am
LoL, I thought stock code was already G80 optimised, guess I was WRONG.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 21 Dec 2010, 12:11:58 pm

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.57 +- 0.048 GFlops    2.49 +- 0.24 GB/s
 PS+SuMx(    16) [OK]    0.57 +- 0.048 GFlops    2.39 +- 0.19 GB/s
 PS+SuMx(    32) [OK]    0.49 +- 0.031 GFlops    2.01 +- 0.11 GB/s
 PS+SuMx(    64) [OK]    0.80 +- 0.105 GFlops    3.20 +- 0.41 GB/s


Opt1: 64 thrds/block
                           worst case                                best case
                        GFlps           GB/s      ulps            GFlps          GB/s      ulps
 PS+SuMx(     8)    0.87 +- 0.048   3.92 +- 0.20  121.7 [OK]   1.21 +- 0.03   5.49 +- 0.03  121.7
 PS+SuMx(    16)    0.89 +- 0.19    3.70 +- 0.78  121.7 [OK]   1.20 +- 0      5.00 +- 0     121.7
 PS+SuMx(    32)    0.97 +- 0.048   3.92 +- 0.19  121.7 [OK]   1.10 +- 0      4.60 +- 0     121.7
 PS+SuMx(    64)    1.24 +- 0.11    5.02 +- 0.42  121.7 [OK]   1.41 +- 0.03   5.85 +- 0.05  121.7


Average and standard deviation over 10 runs.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Dec 2010, 12:14:18 pm
How did you do ten runs, while collecting data, on 'that thing' in that timeframe ?  magic ?
[ Oh yeah I set the timer tolerances to do that, I forgot  ::)]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 21 Dec 2010, 12:23:27 pm
How did you do ten runs, while collecting data, on 'that thing' in that timeframe ?  magic ?
[ Oh yeah I set the timer tolerances to do that, I forgot ::)]

A run takes some 20 seconds - makes some 5 minutes with graceful rounding. Typing the data into Excel and the calculated values back into the post took about half an hour.  :P

timer tolerances?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Dec 2010, 12:27:14 pm
timer tolerances?

Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test  ;)

[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases?  That's memory & bus contention on the worst cases randomising things up a bit  :)
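
The fixed time budget mentioned above can be sketched as a loop that keeps repeating the work until the budget is used up, then derives throughput from however many repetitions fit. This is a hypothetical harness written for illustration (names, defaults and the injectable clock are assumptions), not the unit test's actual timing code.

```python
import time


def run_for_budget(work, budget_s=0.5, clock=time.perf_counter):
    """Repeat work() until at least budget_s seconds have elapsed.

    Returns (iterations, elapsed_seconds). The clock is injectable so the
    loop can be exercised with a fake timer.
    """
    start = clock()
    iterations = 0
    while True:
        work()
        iterations += 1
        elapsed = clock() - start
        if elapsed >= budget_s:
            return iterations, elapsed


def throughput_gflops(flops_per_call, iterations, elapsed_s):
    """Average GFlops over all repetitions that fit in the budget."""
    return flops_per_call * iterations / elapsed_s / 1e9


# Stand-in workload; a real test would launch the GPU kernels here.
count, seconds = run_for_budget(lambda: sum(range(1000)), budget_s=0.05)
print(count, "repetitions in", round(seconds, 3), "s")
```

A faster card simply fits more repetitions into the same budget, so its average is taken over more samples.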
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 21 Dec 2010, 12:44:19 pm
Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test  ;)

Well manual data collection works just as well, only more tedious.

Quote
[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases?  That's memory & bus contention on the worst cases randomising things up a bit  :)

I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Dec 2010, 12:46:29 pm
I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.

Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)  ;)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)
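
A host-side illustration of the cap described above, assuming one boolean flag per FFT marks a detection. The names are hypothetical, and a real implementation would presumably do the selection on the GPU before any transfer; this only shows why the transfer size is bounded by the reporting limit rather than by numdatapoints/fftlen.

```python
MAX_REPORTED = 30  # detection limit mentioned in the post


def detections_to_transfer(flags, limit=MAX_REPORTED):
    """Return the indices of the first `limit` flagged FFTs.

    flags: one boolean per FFT, True where a spike exceeded the threshold.
    """
    picked = []
    for index, hit in enumerate(flags):
        if hit:
            picked.append(index)
            if len(picked) == limit:
                break  # nothing past the limit can be reported anyway
    return picked


# 1 meg points at fftlen 64 gives 16384 FFTs; pretend every one triggered.
num_ffts = (1024 * 1024) // 64
everything_hit = [True] * num_ffts
print(len(detections_to_transfer(everything_hit)), "of", num_ffts, "transferred")
```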
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 21 Dec 2010, 01:04:35 pm
Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)  ;)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)

Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 21 Dec 2010, 01:08:17 pm
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?

Yes.  Actual performance will fall somewhere in between best & worst cases ...  :P ... Though initially I'll be using 'worst case' code for rapid code  improvements to working prototypes ( Size 64 already in field testing in x33 ), best case code is a glass ceiling to aim for with 'advanced coding'

[Edit:] size 64 (worst case implementation) provides ~3% performance improvement to 'shorties' on GTX 480

[Edit2:] oh, that was 'old' worst case code, nevermind  ::)
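
One way to picture "somewhere in between": if a given fraction of calls take the worst-case path and each call does the same amount of work, the times add, so the combined rate is a weighted harmonic mean of the two throughputs. The sketch below rests on that simplifying assumption and uses the size 64 Opt1 figures from the GTX 480 example output; the function name is made up.

```python
def blended_gflops(worst, best, worst_fraction):
    """Effective GFlops when worst_fraction of equal-sized calls are worst case.

    Assumes every call does the same amount of work, so time per unit of
    work is the weighted sum of the two per-unit times.
    """
    if not 0.0 <= worst_fraction <= 1.0:
        raise ValueError("worst_fraction must be between 0 and 1")
    time_per_unit = worst_fraction / worst + (1.0 - worst_fraction) / best
    return 1.0 / time_per_unit


# Size 64 Opt1 figures from the GTX 480 example output for test #7.
WORST, BEST = 11.7, 20.4

for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"{fraction:4.2f} worst -> {blended_gflops(WORST, BEST, fraction):.1f} GFlops")
```

Because times add rather than rates, even a modest share of worst-case calls pulls the result noticeably below the simple average of the two figures.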
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 21 Dec 2010, 02:04:14 pm
I re-ran the tests on my rig (Q6600/8GB/8800GTX) under both Win764 as well as WinXP32.

First WinXP32:

Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.6 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   11.1 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    4.3 GFlops   17.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.6   15.8 121.7 [OK]    6.2   27.2 121.7
 PS+SuMx(    16)    4.5   18.8 121.7 [OK]    6.1   25.5 121.7
 PS+SuMx(    32)    4.9   20.1 121.7 [OK]    5.8   23.8 121.7
 PS+SuMx(    64)    6.6   26.5 121.7 [OK]    7.4   30.0 121.7


Then Win7-64:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.1 GFlops    9.0 GB/s
 PS+SuMx(    16) [OK]    2.4 GFlops   10.2 GB/s
 PS+SuMx(    32) [OK]    2.4 GFlops    9.8 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.6 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.4   14.9 121.7 [OK]    6.1   26.8 121.7
 PS+SuMx(    16)    4.2   17.5 121.7 [OK]    6.0   25.3 121.7
 PS+SuMx(    32)    4.6   18.7 121.7 [OK]    5.8   23.7 121.7
 PS+SuMx(    64)    5.9   24.0 121.7 [OK]    7.4   29.8 121.7

As always, hope it helps. ;)

Regards, Patrick.

EDIT: Modified to use no smilies due to the 'cool' smilies in the test-results.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 21 Dec 2010, 02:07:47 pm
Win7x64 results:


Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.4 GFlops   10.7 GB/s
 PS+SuMx(    16) [OK]    3.1 GFlops   13.0 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.6 GB/s
 PS+SuMx(    64) [OK]    4.0 GFlops   16.1 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.9   21.4 121.7 [OK]   13.1   57.4 121.7
 PS+SuMx(    16)    6.5   27.2 121.7 [OK]   12.3   51.4 121.7
 PS+SuMx(    32)    7.8   31.8 121.7 [OK]   11.9   48.7 121.7
 PS+SuMx(    64)    8.6   34.8 121.7 [OK]   11.6   47.0 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: M_M on 21 Dec 2010, 02:24:52 pm
GTX460 1GB OC Core=880MHz Mem=2000MHz Win7-64bit

C:\Test>powerspectrumtest7

Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    3.4 GFlops   14.7 GB/s
 PS+SuMx(    16) [OK]    3.5 GFlops   14.7 GB/s
 PS+SuMx(    32) [OK]    2.3 GFlops    9.6 GB/s
 PS+SuMx(    64) [OK]    3.5 GFlops   14.3 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.5   28.4 121.7 [OK]   13.5   59.1 121.7
 PS+SuMx(    16)    7.7   32.3 121.7 [OK]   12.6   52.8 121.7
 PS+SuMx(    32)    8.5   34.8 121.7 [OK]   12.2   49.8 121.7
 PS+SuMx(    64)    9.0   36.3 121.7 [OK]   12.3   49.6 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 21 Dec 2010, 02:48:19 pm
Preparing the usual three:

9800GTX+, Windows 7/32
Code: [Select]
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    1.7 GFlops    7.4 GB/s
 PS+SuMx(    16) [OK]    2.3 GFlops    9.6 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.9 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.5   15.4 121.7 [OK]    7.1   31.3 121.7
 PS+SuMx(    16)    4.0   16.5 121.7 [OK]    7.4   31.0 121.7
 PS+SuMx(    32)    4.9   20.0 121.7 [OK]    7.2   29.5 121.7
 PS+SuMx(    64)    6.3   25.4 121.7 [OK]    8.8   35.5 121.7

9800GT, Windows XP/32
Code: [Select]
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    1.7 GFlops    7.2 GB/s
 PS+SuMx(    16) [OK]    2.1 GFlops    8.9 GB/s
 PS+SuMx(    32) [OK]    2.2 GFlops    9.0 GB/s
 PS+SuMx(    64) [OK]    3.6 GFlops   14.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    2.5   11.1 121.7 [OK]    5.2   22.9 121.7
 PS+SuMx(    16)    3.5   14.7 121.7 [OK]    5.5   23.0 121.7
 PS+SuMx(    32)    4.1   16.7 121.7 [OK]    5.2   21.2 121.7
 PS+SuMx(    64)    5.4   21.7 121.7 [OK]    6.3   25.7 121.7

GTX 470, Windows XP/32
Code: [Select]
Device: GeForce GTX 470, 1215 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.3 GFlops    9.9 GB/s
 PS+SuMx(    16) [OK]    3.0 GFlops   12.6 GB/s
 PS+SuMx(    32) [OK]    3.0 GFlops   12.1 GB/s
 PS+SuMx(    64) [OK]    4.8 GFlops   19.3 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.7   16.0 121.7 [OK]   15.6   68.4 121.7
 PS+SuMx(    16)    5.7   23.9 121.7 [OK]   14.8   61.8 121.7
 PS+SuMx(    32)    7.9   32.5 121.7 [OK]   14.3   58.7 121.7
 PS+SuMx(    64)    9.9   39.9 121.7 [OK]   14.0   56.7 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 21 Dec 2010, 03:09:47 pm
Here's mine...


Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd/test

C:\test> powerspectrumtest7.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.7 GFlops    3.2 GB/s
 PS+SuMx(    16) [OK]    0.8 GFlops    3.5 GB/s
 PS+SuMx(    32) [OK]    0.8 GFlops    3.1 GB/s
 PS+SuMx(    64) [OK]    1.1 GFlops    4.4 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    1.2    5.4 121.7 [OK]    1.6    6.8 121.7
 PS+SuMx(    16)    0.7    3.0 121.7 [OK]    1.5    6.1 121.7
 PS+SuMx(    32)    1.4    5.6 121.7 [OK]    1.6    6.4 121.7
 PS+SuMx(    64)    1.7    6.7 121.7 [OK]    1.8    7.5 121.7



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Josef W. Segur on 21 Dec 2010, 03:44:24 pm
Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)  ;)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)

Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?

The lower graph on http://setiathome.berkeley.edu/sah_glossary/spike_graphs.php is related, note the log scale on the counts. S@H Enhanced does relatively more short FFT lengths, but there's still a very strong bias toward the long FFT lengths for both reportable and "best" spikes. A quick survey of 44 recent results from my P-M showed 35 best_spikes at fft_len 131072, 6 at fft_len 65536, 2 at fft_len 32768, and 1 at fft_len 16384.
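To put that quick survey into proportions, here is a small sketch in Python. The counts are the ones quoted above; the variable and function names are only for illustration.

```python
# Counts of best_spike FFT lengths from the 44-result survey quoted above.
best_spike_counts = {131072: 35, 65536: 6, 32768: 2, 16384: 1}

total = sum(best_spike_counts.values())

def share(fft_len):
    """Fraction of surveyed results whose best spike was at fft_len."""
    return best_spike_counts[fft_len] / total

# Print the longest FFT lengths first, since that is where the bias lies.
for fft_len, count in sorted(best_spike_counts.items(), reverse=True):
    print(f"fft_len {fft_len:>6}: {count:>2} of {total} ({share(fft_len):.1%})")
```

With only 44 results this is a rough indication, not a measured distribution.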

However, the processing order starts at FFT length 8 and works up, so there should be some "worst case" for short FFT lengths during that zero chirp sequence. Subsequent visits to the short FFT lengths are likely to be all "best case". At AR 0.42 FFT length 8 is done 13 times so overall there will be mostly "best case", but at AR 3.0 FFT length 8 is only done once so the probability of "worst case" will be higher.
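A rough way to see how much the visit count matters, as a sketch only. It assumes (a simplification not stated in the post) that just the first visit to a size is "worst case", every later visit is "best case", and each visit does the same amount of work. The rates are the size 8 Opt1 figures from the GTX 470 (Windows XP/32) run earlier in the thread; the visit counts 13 and 1 come from the paragraph above. The function name is hypothetical.

```python
def blended_rate(worst, best, passes, worst_passes=1):
    """Effective throughput over `passes` equal-sized visits to one FFT length.

    Time per visit is proportional to 1/rate, so the blend is a
    harmonic-style mean rather than an arithmetic one.
    """
    best_passes = passes - worst_passes
    total_time = worst_passes / worst + best_passes / best
    return passes / total_time

# Size 8 on the GTX 470: 3.7 GFlops worst case, 15.6 GFlops best case.
ar_042 = blended_rate(3.7, 15.6, passes=13)  # AR 0.42: FFT length 8 done 13 times
ar_30 = blended_rate(3.7, 15.6, passes=1)    # AR 3.0: FFT length 8 done once
print(f"AR 0.42: {ar_042:.1f} GFlops, AR 3.0: {ar_30:.1f} GFlops")
```

Under those assumptions the single-visit case sits at the worst case rate, while 13 visits land much closer to the best case figure.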

Note that our test WUs shortened by lowering chirp limits will have a higher proportion of the zero chirp worst cases than full length WUs. In general I think that's good, brief sloppy tests which slightly underestimate improvement from optimization are better than those which cause unwarranted enthusiasm. But it would also be possible to create a set of test WUs shortened by adjusting chirp resolution which would give better quick test timing.

Edit: Jason, result_overflow is triggered by the 31st found signal...
                                                                                           Joe
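The transfer cap discussed above could be sketched like this. It is host-side logic written in Python for illustration only; the names are hypothetical and it is not taken from the actual application. It assumes a running count of found signals and uses Joe's correction that the 31st signal is the one that triggers result_overflow.

```python
MAX_REPORTABLE = 30  # the 31st found signal triggers result_overflow

def plan_transfer(num_detections):
    """Return (signals_to_copy, overflow) for a given number of found signals."""
    overflow = num_detections > MAX_REPORTABLE
    # Copying one signal past the limit is enough to know overflow was hit;
    # anything beyond that never needs to come back to the host.
    signals_to_copy = min(num_detections, MAX_REPORTABLE + 1)
    return signals_to_copy, overflow

print(plan_transfer(5))
print(plan_transfer(1000))
```

The point of the cap is that the cost of the "worst case" is bounded no matter how many detections the GPU produces.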
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 21 Dec 2010, 07:07:52 pm
And now the GTX460-768 card,

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.7 GB/s
 PS+SuMx(    16) [OK]    2.8 GFlops   11.5 GB/s
 PS+SuMx(    32) [OK]    2.1 GFlops    8.7 GB/s
 PS+SuMx(    64) [OK]    3.4 GFlops   13.6 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.2   18.3 121.7 [OK]   11.1   48.5 121.7
 PS+SuMx(    16)    5.8   24.5 121.7 [OK]   10.5   44.1 121.7
 PS+SuMx(    32)    7.2   29.7 121.7 [OK]   10.2   41.7 121.7
 PS+SuMx(    64)    8.4   33.9 121.7 [OK]   10.2   41.5 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 21 Dec 2010, 07:19:50 pm

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    5.0 GFlops   22.0 GB/s
 PS+SuMx(    16) [OK]    6.0 GFlops   25.3 GB/s
 PS+SuMx(    32) [OK]    4.7 GFlops   19.2 GB/s
 PS+SuMx(    64) [OK]    7.2 GFlops   29.1 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    9.0   39.2 121.7 [OK]   23.0  100.7 121.7
 PS+SuMx(    16)   11.7   49.0 121.7 [OK]   21.7   90.8 121.7
 PS+SuMx(    32)   13.6   55.8 121.7 [OK]   21.1   86.4 121.7
 PS+SuMx(    64)   15.1   61.2 121.7 [OK]   20.7   83.7 121.7


Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Dec 2010, 01:11:14 am
Thanks all for the massive amount of data  ;D  I'll peruse it to see if anything's amiss, but I think I've found the sweet spot for 'worst case' at the moment, which is a straightforward implementation.  I'm delighted that nothing seems to be broken on any GPU tested so far.  There is a lot of work to do to add the remaining sizes into the test (remaining powers of 2 up to 128k or so, maybe some larger sizes for growing room), then to add FFTs & Findspikes on either side of this pipeline.  Once that's done, it looks like I can stripe the processing to fit Fermi's L2 cache, right through this pipeline, which should speed things up a lot for those cards.

@Joe, Thanks!, I keep forgetting it's 31 not 30  ::)  probably would have found it the hard way (again), but the heads up helps.

Jason

Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 22 Dec 2010, 01:11:31 am
-device 0
Code:
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    4.5 GFlops   19.6 GB/s
 PS+SuMx(    16) [OK]    5.0 GFlops   20.9 GB/s
 PS+SuMx(    32) [OK]    4.6 GFlops   18.7 GB/s
 PS+SuMx(    64) [OK]    7.0 GFlops   28.4 GB/s


Opt1: 128 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.1   26.7 121.7 [OK]   11.7   51.4 121.7
 PS+SuMx(    16)    7.5   31.2 121.7 [OK]   11.5   48.0 121.7
 PS+SuMx(    32)    8.7   35.6 121.7 [OK]   12.0   48.9 121.7
 PS+SuMx(    64)   10.9   44.1 121.7 [OK]   14.5   58.9 121.7

-device 1
Code:
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    4.4 GFlops   19.3 GB/s
 PS+SuMx(    16) [OK]    4.9 GFlops   20.6 GB/s
 PS+SuMx(    32) [OK]    4.5 GFlops   18.5 GB/s
 PS+SuMx(    64) [OK]    6.9 GFlops   27.9 GB/s


Opt1: 128 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.0   26.3 121.7 [OK]   11.6   50.8 121.7
 PS+SuMx(    16)    7.3   30.5 121.7 [OK]   11.4   47.7 121.7
 PS+SuMx(    32)    8.6   35.1 121.7 [OK]   11.7   48.1 121.7
 PS+SuMx(    64)   10.7   43.3 121.7 [OK]   14.4   58.2 121.7

-device 2
Code:
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    4.3 GFlops   18.7 GB/s
 PS+SuMx(    16) [OK]    4.8 GFlops   19.9 GB/s
 PS+SuMx(    32) [OK]    4.3 GFlops   17.6 GB/s
 PS+SuMx(    64) [OK]    6.6 GFlops   26.8 GB/s


Opt1: 128 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    5.8   25.5 121.7 [OK]   10.9   47.5 121.7
 PS+SuMx(    16)    7.1   29.7 121.7 [OK]   10.6   44.3 121.7
 PS+SuMx(    32)    8.2   33.7 121.7 [OK]   11.0   45.2 121.7
 PS+SuMx(    64)   10.4   42.0 121.7 [OK]   13.5   54.7 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Dec 2010, 01:26:47 am
Aha! We're finding the 2xx series limits at last.  'Best case' is tapering off sooner & is clearly compute bound, while the worst cases show the limit of GDDR3 against Fermi's GDDR5 memory.

Fermi best cases appear to be limited by the memory subsystem still, so down the road I'll be striping (streaming) this pipeline to fit in those cache levels.  That should lift the apparent ~20 GFlops limit a bit on Fermis.  Unfortunately the 2xx cards don't have the cache levels, so we might be reaching a limit with those in some respects.

@glennaxl: could you confirm that the 200 series cards are reaching near ~100% GPU utilisation during the Opt1 tests (higher than the stock portion) ?  I can lengthen the test sequence if needed.

[A bit Later:]  extending the tests from 0.5 to 5 seconds allowed me to see what the 480 is doing as a cross check.  Looks like the Opt1 best cases are reaching ~100%, and opt1 worst cases are bandwidth limited, all as expected, no surprises yet.

[Still later:] I've added the extended PowerSpectrumTest7 to the first post.  I don't need data for the extended test (results are more or less the same), but I'm providing it for those who want to see GPU utilisation differences between the test phases on their cards, like the attached image.

Moving onto larger sizes & FFT integration , after some beer   ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 22 Dec 2010, 02:42:25 am
@glennaxl: could you confirm that the 200 series cards are reaching near ~100% GPU utilisation during the Opt1 tests (higher than the stock portion) ?  I can lengthen the test sequence if needed.

Yes, Opt1 spikes to 99%.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Dec 2010, 02:43:36 am
Cheers!
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 22 Dec 2010, 04:45:42 am
7_extended
~~~~~~~
PowerSpectrumTest7_extended.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.7 GFlops   12.0 GB/s
 PS+SuMx(    16) [OK]    3.7 GFlops   15.6 GB/s
 PS+SuMx(    32) [OK]    3.3 GFlops   13.7 GB/s
 PS+SuMx(    64) [OK]    5.1 GFlops   20.7 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.9   21.5 121.7 [OK]   17.6   77.2 121.7
 PS+SuMx(    16)    7.1   29.7 121.7 [OK]   16.7   69.8 121.7
 PS+SuMx(    32)    8.3   34.1 121.7 [OK]   16.2   66.4 121.7
 PS+SuMx(    64)   10.2   41.3 121.7 [OK]   16.0   64.6 121.7


PowerSpectrumTest7_extended.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.7 GFlops   12.0 GB/s
 PS+SuMx(    16) [OK]    3.7 GFlops   15.4 GB/s
 PS+SuMx(    32) [OK]    3.4 GFlops   13.9 GB/s
 PS+SuMx(    64) [OK]    5.1 GFlops   20.7 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    5.0   21.8 121.7 [OK]   17.7   77.4 121.7
 PS+SuMx(    16)    7.1   29.9 121.7 [OK]   16.7   70.0 121.7
 PS+SuMx(    32)    8.9   36.5 121.7 [OK]   16.3   66.6 121.7
 PS+SuMx(    64)   10.5   42.4 121.7 [OK]   16.0   64.7 121.7


.
Done
gpuload (http://www.britta-d.de/images/powerspectrum/ps_gpuload_test7ext.jpg)
I had never seen this Memory Controller load spike before; by comparison, PrimeGrid shows nothing.
gpuload_prime (http://www.britta-d.de/images/primegrid/pg_gtx470_gpuz_stable_sensors.jpg)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 22 Dec 2010, 06:42:54 am
7 extended ION
~~~~~~~~~~
PowerSpectrumTest7_extended.exe -device 0

Device: ION, 1100 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.4 GFlops    1.5 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.4 GB/s
 PS+SuMx(    32) [OK]    0.3 GFlops    1.1 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.7 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.5    2.4 121.7 [OK]    0.6    2.8 121.7
 PS+SuMx(    16)    0.6    2.3 121.7 [OK]    0.6    2.6 121.7
 PS+SuMx(    32)    0.5    2.2 121.7 [OK]    0.6    2.3 121.7
 PS+SuMx(    64)    0.7    2.7 121.7 [OK]    0.7    2.9 121.7


.
Done
Hmm, how to interpret this?
The stock value of 1.7 GB/s looks much better on the ION.
I must look up the ION device properties.
CUDA: ION
Field   Value
Device Properties   
Device Name   ION
Clock Rate   1100 MHz
Multiprocessors / Cores   2 / 16
Max Threads Per Block   512
Max Registers Per Block   8192
Warp Size   32 threads
Max Block Size   512 x 512 x 64
Max Grid Size   65535 x 65535 x 1
Compute Capability   1.1
CUDA DLL   nvcuda.dll (8.17.12.6061 - nVIDIA ForceWare 260.61)
   
Memory Properties   
Total Memory   241 MB
Total Constant Memory   64 KB
Max Shared Memory Per Block   16 KB
Max Memory Pitch   2147483647 Bytes
Texture Alignment   256 Bytes
   
Device Features   
32-bit Floating-Point Atomic Addition   Not Supported
32-bit Integer Atomic Operations   Supported
64-bit Integer Atomic Operations   Not Supported
Concurrent Memory Copy & Execute   Not Supported
Double-Precision Floating-Point   Not Supported
Warp Vote Functions   Not Supported
__ballot()   Not Supported
__syncthreads_and()   Not Supported
__syncthreads_count()   Not Supported
__syncthreads_or()   Not Supported
__threadfence_system()   Not Supported
   
Device Manufacturer   
Company Name   NVIDIA Corporation
Product Information   http://www.nvidia.com/page/products.html
Driver Download   http://www.nvidia.com/content/drivers/drivers.asp
Driver Update   http://www.aida64.com/driver-updates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OPEN_CL
~~~~~~~
OpenCL: ION
Field   Value
OpenCL Properties   
Platform Name   NVIDIA CUDA
Platform Vendor   NVIDIA Corporation
Platform Version   OpenCL 1.0 CUDA 3.2.1
Platform Profile   Full
   
Device Properties   
Device Name   ION
Device Type   Graphics Processor (GPU)
Device Vendor   NVIDIA Corporation
Device Version   OpenCL 1.0 CUDA
Device Profile   Full
Clock Rate   1100 MHz
Multiprocessors   2
Max 2D Image Size   4096 x 32768
Max 3D Image Size   2048 x 2048 x 2048
Max Samplers   16
Max Work-Item Size   512 x 512 x 64
Max Work-Group Size   512
Max Argument Size   4352 Bytes
Max Constant Buffer Size   64 KB
Max Constant Arguments   9
Profiling Timer Resolution   1000 ns
OpenCL DLL   opencl.dll (1.0.0)
   
Memory Properties   
Global Memory   241 MB
Local Memory   16 KB
Memory Base Address Alignment   2048 Bit
Min Data Type Alignment   128 Bytes
   
Device Features   
Command-Queue Out Of Order Execution   Enabled
Command-Queue Profiling   Enabled
Compiler   Supported
Error Correction   Not Supported
Images   Supported
Kernel Execution   Supported
Native Kernel Execution   Not Supported
   
Device Extensions   
cl_amd_d3d10_interop   Not Supported
cl_amd_d3d9_interop   Not Supported
cl_amd_device_attribute_query   Not Supported
cl_amd_fp64   Not Supported
cl_amd_media_ops   Not Supported
cl_amd_printf   Not Supported
cl_khr_3d_image_writes   Not Supported
cl_khr_byte_addressable_store   Supported
cl_khr_d3d10_sharing   Supported
cl_khr_fp16   Not Supported
cl_khr_fp64   Not Supported
cl_khr_gl_sharing   Supported
cl_khr_global_int32_base_atomics   Supported
cl_khr_global_int32_extended_atomics   Supported
cl_khr_icd   Supported
cl_khr_int64_base_atomics   Not Supported
cl_khr_int64_extended_atomics   Not Supported
cl_khr_local_int32_base_atomics   Not Supported
cl_khr_local_int32_extended_atomics   Not Supported
cl_khr_select_fprounding_mode   Not Supported
cl_nv_compiler_options   Supported
cl_nv_d3d10_sharing   Supported
cl_nv_d3d11_sharing   Supported
cl_nv_d3d9_sharing   Supported
cl_nv_device_attribute_query   Supported
cl_nv_pragma_unroll   Supported
   
Device Manufacturer   
Company Name   NVIDIA Corporation
Product Information   http://www.nvidia.com/page/products.html
Driver Download   http://www.nvidia.com/content/drivers/drivers.asp
Driver Update   http://www.aida64.com/driver-updates
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Dec 2010, 08:59:03 am
Hmm, how to interpret this?
The stock value of 1.7 GB/s looks much better on the ION.
I must look up the ION device properties.

No, your labels are misaligned Heinz, will fix them for you ....[Done... 2.7GB/s is a bit better than 1.7GB/s ]

[Edit] Fixed it again, and fixed the 470 ones so you can read them properly  ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 22 Dec 2010, 09:24:15 am
Thanks Jason,
must clean my glasses  ::)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 22 Dec 2010, 05:53:11 pm
7 extended ION
~~~~~~~~~~
Rerun, now lightly overclocked from 450 / 800 / 1100 to 475 / 850 / 1161

PowerSpectrumTest7_extended.exe -device 0

Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.4 GFlops    1.6 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.4 GB/s
 PS+SuMx(    32) [OK]    0.3 GFlops    1.1 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.8 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.6    2.5 121.7 [OK]    0.7    2.9 121.7
 PS+SuMx(    16)    0.6    2.4 121.7 [OK]    0.6    2.7 121.7
 PS+SuMx(    32)    0.6    2.3 121.7 [OK]    0.6    2.4 121.7
 PS+SuMx(    64)    0.7    2.8 121.7 [OK]    0.8    3.1 121.7


.
Done
Edit: the latest GPU-Z 0.4.9 did not show any Memory Controller load (http://www.britta-d.de/images/powerspectrum/ps_7ext_ION_gpuz_no_memory_controller_load.jpg)
Looks like an issue?
Further, it shows 4 ROPs (http://www.britta-d.de/images/powerspectrum/ps_gpuz_049_show_ROPs_4.jpg) for the ION, but it has 2 multiprocessors (as far as I know).
Emailed to TechPowerUp.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Dec 2010, 06:12:50 pm
Well, to get that 30-50% speedup (1.5-2x) on the small GPU, we went a bit further than what the nVidia documentation specifies for efficient reductions, and the code 'looks nice' (a good sign in engineering)... Still the larger sizes to go, might have to send some notes back to nVidia after we finish this, to update the optimisation manual a bit  :o
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 22 Dec 2010, 06:29:34 pm
Looks like an issue?

Not 'our' problem  ;)  See what MSI Afterburner says (for memory).  Maybe they confuse ION & ION2, don't know.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 04:45:06 am
First post updated:
Quote
Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- minimal changes to larger sizes, larger than selected thrds/blk is 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation all [OK] & no Opt1 'worst case' slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- having all sizes it's a longer run, and may require several runs to see if a '[BAD]' will manifest.

Please test repeatedly on all CUDA-enabled GPUs... No posting of results please (too large for me to look through, I'll go crosseyed  ;)), just confirm all Opt1 [OK] & faster at all sizes, and alert me if you see any marked [BAD] or too slow.  You may need to run it several times to see whether a problem appears or not.

Jason
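For anyone who would rather not eyeball every line, the check requested above can be sketched in Python. This is a hypothetical helper, not part of the test package. The line shapes are copied from the outputs posted in this thread, and it assumes a failing line prints [BAD] in the same position as [OK]. It compares the Opt1 worst case GFlops against the stock GFlops for the same size.

```python
import re

# Line shapes copied from the test output posted in this thread.
STOCK_RE = re.compile(
    r"PS\+SuMx\(\s*(\d+)\)\s+\[(OK|BAD)\]\s+([\d.]+) GFlops\s+([\d.]+) GB/s")
OPT1_RE = re.compile(
    r"PS\+SuMx\(\s*(\d+)\)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+\[(OK|BAD)\]"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)")

def check_log(text):
    """Return a list of human-readable problems found in one test run."""
    stock, problems = {}, []
    for line in text.splitlines():
        m = STOCK_RE.search(line)
        if m:
            size, flag, gflops = int(m.group(1)), m.group(2), float(m.group(3))
            stock[size] = gflops
            if flag == "BAD":
                problems.append(f"stock size {size}: [BAD]")
            continue
        m = OPT1_RE.search(line)
        if m:
            size, worst, flag = int(m.group(1)), float(m.group(2)), m.group(5)
            if flag == "BAD":
                problems.append(f"Opt1 size {size}: [BAD]")
            # Strictly slower only; equal rounded values are not flagged.
            if size in stock and worst < stock[size]:
                problems.append(
                    f"Opt1 size {size}: worst case {worst} < stock {stock[size]}")
    return problems

sample = """Stock:
 PS+SuMx(     8) [OK]    0.7 GFlops    3.1 GB/s
 PS+SuMx(    64) [OK]    1.0 GFlops    4.2 GB/s
Opt1: 64 thrds/block
 PS+SuMx(     8)    1.1    4.8 121.7 [OK]    1.5    6.8 121.7
 PS+SuMx(    64)    0.5    1.9 121.7 [OK]    1.7    7.1 121.7
"""
for problem in check_log(sample):
    print(problem)
```

Redirect the test output to a file and pass its contents to `check_log`; an empty list means all [OK] and no Opt1 worst case below stock. Since the figures are printed to one decimal place, ties at small sizes are treated as fine.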
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 23 Dec 2010, 06:05:33 am
All systems are go except....

gtx 295
core 0 - 1 bad at test 1/5 under 128 size
core 1 - 1 slow at test 2/5 under 128 size

gtx 260 - 1 slow at test 4/5 under 256 size
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 06:09:48 am
All systems are go except....

gtx 295
core 0 - 1 bad at test 1/5 under 128 size
core 1 - 1 slow at test 2/5 under 128 size

gtx 260 - 1 slow at test 4/5 under 256 size

Thanks!  On the 295, is the video memory OC'd?  I found here that Opt1 around size = #thrds/block (256 on Fermi, 128 on 2xx) can be unstable if the VRAM OC is pushed.  I had to back off my video memory OC by 80 MHz for it to stabilise.

GTX260 - Please run that one a few times & see if that's consistently slower than stock at size 256.  Will be checking that code  in the meantime.
[Edit:] I see you did, & got one slow out of 5 ... OK


[Edit2:] Darn 128 still a little unstable here too  ???, will dial size 128 & 256 back & replace the test shortly (might be pushing a tad hard )
Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 06:23:47 am
@glennaxl: I have updated the PowerSpectrumTest8 archive attached to the first post, to dial back the borderline kernels a bit (for now, will dig deeper into those later if needed).

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 23 Dec 2010, 07:25:44 am
@glennaxl: I have updated the PowerSpectrumTest8 archive attached to the first post, to dial back the borderline kernels a bit (for now, will dig deeper into those later if needed).

Jason
Yah, my gtx 295 vram is oc'd to 1080 from 999


The new test8 are all good now. Perfect!  ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 07:27:50 am
The new test8 are all good now. Perfect!  ;)

Good, good.  will keep those ones dialled in a bit then, allowing some possible fine tuning later.  It seems cramming that much data through we're beginning to find weak spots, so will look at moving onto FFT integration.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 23 Dec 2010, 09:41:44 am
Hi Jason,
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Stock best result
 PS+SuMx( 32768) [OK]   12.7 GFlops   50.7 GB/s

Opt best result
 PS+SuMx( 32768)   16.4   65.7 121.7 [OK]   27.8  111.4 121.7

all others are ok
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 23 Dec 2010, 09:57:28 am
Hi Jason,
excellent performance on the ION
worth posting the full result
PowerSpectrumTest8.exe -device 0

Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    0.4 GFlops    1.6 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.5 GB/s
 PS+SuMx(    32) [OK]    0.3 GFlops    1.1 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.8 GB/s
 PS+SuMx(   128) [OK]    0.7 GFlops    2.7 GB/s
 PS+SuMx(   256) [OK]    0.8 GFlops    3.4 GB/s
 PS+SuMx(   512) [OK]    1.1 GFlops    4.3 GB/s
 PS+SuMx(  1024) [OK]    1.1 GFlops    4.4 GB/s
 PS+SuMx(  2048) [OK]    1.2 GFlops    4.9 GB/s
 PS+SuMx(  4096) [OK]    1.2 GFlops    4.8 GB/s
 PS+SuMx(  8192) [OK]    1.3 GFlops    5.2 GB/s
 PS+SuMx( 16384) [OK]    1.3 GFlops    5.1 GB/s
 PS+SuMx( 32768) [OK]    1.3 GFlops    5.4 GB/s
 PS+SuMx( 65536) [OK]    1.4 GFlops    5.4 GB/s
 PS+SuMx(131072) [OK]    1.4 GFlops    5.6 GB/s

Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.6    2.5 121.7 [OK]    0.7    2.9 121.7
 PS+SuMx(    16)    0.6    2.4 121.7 [OK]    0.6    2.7 121.7
 PS+SuMx(    32)    0.6    2.3 121.7 [OK]    0.6    2.4 121.7
 PS+SuMx(    64)    0.7    2.8 121.7 [OK]    0.7    3.0 121.7
 PS+SuMx(   128)    0.7    2.7 121.7 [OK]    0.7    3.0 121.7
 PS+SuMx(   256)    0.9    3.5 121.7 [OK]    1.0    3.9 121.7
 PS+SuMx(   512)    1.1    4.5 121.7 [OK]    1.2    5.0 121.7
 PS+SuMx(  1024)    1.2    4.6 121.7 [OK]    1.3    5.1 121.7
 PS+SuMx(  2048)    1.3    5.3 121.7 [OK]    1.5    5.9 121.7
 PS+SuMx(  4096)    1.3    5.0 121.7 [OK]    1.4    5.6 121.7
 PS+SuMx(  8192)    1.4    5.5 121.7 [OK]    1.5    6.1 121.7
 PS+SuMx( 16384)    1.3    5.4 121.7 [OK]    1.5    6.0 121.7
 PS+SuMx( 32768)    1.4    5.7 121.7 [OK]    1.6    6.4 121.7
 PS+SuMx( 65536)    1.4    5.8 121.7 [OK]    1.6    6.5 121.7
 PS+SuMx(131072)    1.2    4.8 121.7 [OK]    1.7    6.6 121.7

.
Done
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 10:18:54 am
Yes, size 128k drops off a bit on mine too, not sure why yet.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Raistmer on 23 Dec 2010, 11:13:47 am
Was able to get results for GSO9600 at last:


Device: GeForce 9600 GSO, 1700 MHz clock, 384 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    1.2 GFlops    5.4 GB/s
 PS+SuMx(    16) [OK]    1.6 GFlops    6.9 GB/s
 PS+SuMx(    32) [OK]    1.8 GFlops    7.3 GB/s
 PS+SuMx(    64) [OK]    2.9 GFlops   11.8 GB/s
 PS+SuMx(   128) [OK]    4.3 GFlops   17.1 GB/s
 PS+SuMx(   256) [OK]    5.5 GFlops   22.1 GB/s
 PS+SuMx(   512) [OK]    6.7 GFlops   27.0 GB/s
 PS+SuMx(  1024) [OK]    7.0 GFlops   28.1 GB/s
 PS+SuMx(  2048) [OK]    7.7 GFlops   30.8 GB/s
 PS+SuMx(  4096) [OK]    7.6 GFlops   30.4 GB/s
 PS+SuMx(  8192) [OK]    7.9 GFlops   31.6 GB/s
 PS+SuMx( 16384) [OK]    7.7 GFlops   31.0 GB/s
 PS+SuMx( 32768) [OK]    8.1 GFlops   32.5 GB/s
 PS+SuMx( 65536) [OK]    7.8 GFlops   31.3 GB/s
 PS+SuMx(131072) [OK]    8.0 GFlops   32.2 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    1.5    6.5 121.7 [OK]    4.5   19.6 121.7
 PS+SuMx(    16)    2.3    9.6 121.7 [OK]    4.8   20.0 121.7
 PS+SuMx(    32)    3.0   12.1 121.7 [OK]    4.5   18.5 121.7
 PS+SuMx(    64)    3.1   12.7 121.7 [OK]    5.4   21.7 121.7
 PS+SuMx(   128)    4.5   18.1 121.7 [OK]    5.3   21.3 121.7
 PS+SuMx(   256)    5.8   23.1 121.7 [OK]    6.5   25.9 121.7
 PS+SuMx(   512)    6.9   27.8 121.7 [OK]    7.5   30.0 121.7
 PS+SuMx(  1024)    7.3   29.1 121.7 [OK]    7.8   31.2 121.7
 PS+SuMx(  2048)    7.9   31.5 121.7 [OK]    8.4   33.6 121.7
 PS+SuMx(  4096)    7.8   31.1 121.7 [OK]    8.2   32.6 121.7
 PS+SuMx(  8192)    8.1   32.3 121.7 [OK]    8.5   33.9 121.7
 PS+SuMx( 16384)    7.9   31.5 121.7 [OK]    8.2   32.8 121.7
 PS+SuMx( 32768)    8.1   32.5 121.7 [OK]    8.6   34.6 121.7
 PS+SuMx( 65536)    5.7   22.7 121.7 [OK]    8.3   33.2 121.7
 PS+SuMx(131072)    8.2   32.6 121.7 [OK]    8.5   34.1 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 23 Dec 2010, 11:21:18 am
Okay, here's test 8. Figured it would be better for me to post it rather than try to explain what I don't understand.  :8

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test> powerspectrumtest8.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    0.7 GFlops    3.1 GB/s
 PS+SuMx(    16) [OK]    0.8 GFlops    3.2 GB/s
 PS+SuMx(    32) [OK]    0.7 GFlops    3.0 GB/s
 PS+SuMx(    64) [OK]    1.0 GFlops    4.2 GB/s
 PS+SuMx(   128) [OK]    0.8 GFlops    3.4 GB/s
 PS+SuMx(   256) [OK]    1.6 GFlops    6.6 GB/s
 PS+SuMx(   512) [OK]    2.0 GFlops    7.8 GB/s
 PS+SuMx(  1024) [OK]    2.1 GFlops    8.2 GB/s
 PS+SuMx(  2048) [OK]    2.1 GFlops    8.2 GB/s
 PS+SuMx(  4096) [OK]    2.0 GFlops    8.1 GB/s
 PS+SuMx(  8192) [OK]    2.1 GFlops    8.4 GB/s
 PS+SuMx( 16384) [OK]    2.1 GFlops    8.4 GB/s
 PS+SuMx( 32768) [OK]    0.5 GFlops    1.9 GB/s
 PS+SuMx( 65536) [OK]    0.4 GFlops    1.5 GB/s
 PS+SuMx(131072) [OK]    2.1 GFlops    8.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    1.1    4.8 121.7 [OK]    1.5    6.8 121.7
 PS+SuMx(    16)    1.2    5.0 121.7 [OK]    1.7    6.9 121.7
 PS+SuMx(    32)    1.2    5.0 121.7 [OK]    1.5    6.1 121.7
 PS+SuMx(    64)    0.5    1.9 121.7 [OK]    1.7    7.1 121.7
 PS+SuMx(   128)    0.6    2.5 121.7 [OK]    1.8    7.2 121.7
 PS+SuMx(   256)    0.6    2.3 121.7 [OK]    2.1    8.3 121.7
 PS+SuMx(   512)    2.0    8.1 121.7 [OK]    2.5   10.1 121.7
 PS+SuMx(  1024)    1.9    7.8 121.7 [OK]    2.6   10.3 121.7
 PS+SuMx(  2048)    2.1    8.6 121.7 [OK]    2.6   10.3 121.7
 PS+SuMx(  4096)    0.5    2.1 121.7 [OK]    2.5   10.0 121.7
 PS+SuMx(  8192)    2.2    8.7 121.7 [OK]    2.8   11.1 121.7
 PS+SuMx( 16384)    2.1    8.2 121.7 [OK]    2.7   10.9 121.7
 PS+SuMx( 32768)    2.2    8.8 121.7 [OK]    2.8   11.1 121.7
 PS+SuMx( 65536)    2.2    8.9 121.7 [OK]    2.8   11.2 121.7
 PS+SuMx(131072)    2.3    9.2 121.7 [OK]    2.8   11.3 121.7



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 12:12:55 pm
Was able to get results for GSO9600 at last:

Ouch, not much headroom between worst & best (fast GDDR3 memory  on 9600GSO IIRC).  I reckon the 64k size is an anomaly worth looking into, as with the 128k drop-off on other cards (like ION).  Thankfully that part (larger sizes) is mostly stock, so there should be plenty of tweaking possibilities.... Even if only for a GFlop here and there.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 12:22:38 pm
Okay, here's test 8. Figured it would be better for me to post it rather than try to explain what I don't understand.  :8

Thanks, A couple of sizes choking there for whatever reason.  I think I'm going to have to improve everything from size 64&128 upward before moving onto the FFTs ... Nice that it's working with all '[OK]'
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 23 Dec 2010, 01:23:33 pm
All OK (5 runs) on GTX 465:

Stock Best Result:
Quote
PS+SuMx( 32768) [OK]   12.2 GFlops   48.7 GB/s
Opt1 Best Result:
Quote
Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx( 32768)   17.7   71.0 121.7 [OK]   24.6   98.2 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 01:29:52 pm
Hey Ghost, what's the memory bus width & memory clock on that 465 ?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 23 Dec 2010, 01:48:51 pm
Hey Ghost, what's the memory bus width & memory clock on that 465 ?

Here's a GPU-Z image for the card
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 01:53:55 pm
PS+SuMx( 32768)   17.7   71.0 121.7 [OK]   24.6   98.2 121.7

Hmm this *could* be near max theoretical then ... checking
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 23 Dec 2010, 01:59:50 pm
PS+SuMx( 32768)   17.7   71.0 121.7 [OK]   24.6   98.2 121.7

Hmm this *could* be near max theoretical then ... checking

That's good  :D
Was getting a nice capacitor whine when running the tests, so knew it was being pushed hard!
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 23 Dec 2010, 02:02:14 pm
I calculate 122.24 GB/s theoretical max (matching GPU-z listing), so 98.2 seems pretty good.  I'll look at what that size is doing & see if I can spread some performance around up in that area.
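The arithmetic behind a figure like that is bus width in bytes times memory clock times transfers per clock. A minimal sketch follows; the GPU-Z image is not reproduced in this text, so the inputs below (256-bit bus, 955 MHz memory clock, 4 transfers per clock for GDDR5) are assumptions chosen because they reproduce 122.24 GB/s. Only the 122.24 and 98.2 figures come from the thread.

```python
def peak_bandwidth_gbs(bus_width_bits, mem_clock_mhz, transfers_per_clock):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9

# Assumed inputs, see the note above; replace with the values GPU-Z reports.
peak = peak_bandwidth_gbs(256, 955, 4)
measured = 98.2  # Opt1 best case at size 32768 on the GTX 465
print(f"peak {peak:.2f} GB/s, measured {measured} GB/s, "
      f"{measured / peak:.0%} of peak")
```

Measured against that peak, the best case is running at roughly four fifths of the theoretical limit.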

[Edit:] I get the impression we might be best seeing what streaming those kernels will do sometime soon  :-\  too many newfangled features in this stuff  ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 23 Dec 2010, 02:47:43 pm
OK here on 9800GTX+ (5 runs) GPU usage up from stock's ~80% to ~95% on Opt1:

Best Stock result:
Quote
PS+SuMx( 65536) [OK]   11.6 GFlops   46.5 GB/s

Opt1 Best Result:
Quote
Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx( 65536)   13.0   52.1 121.7 [OK]   15.6   62.5 121.7

and OK on the 128MB 8400M GS (5 runs):


Quote
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    0.3 GFlops    1.3 GB/s
 PS+SuMx(    16) [OK]    0.3 GFlops    1.2 GB/s
 PS+SuMx(    32) [OK]    0.2 GFlops    0.9 GB/s
 PS+SuMx(    64) [OK]    0.4 GFlops    1.5 GB/s
 PS+SuMx(   128) [OK]    0.5 GFlops    2.2 GB/s
 PS+SuMx(   256) [OK]    0.7 GFlops    2.8 GB/s
 PS+SuMx(   512) [OK]    0.8 GFlops    3.4 GB/s
 PS+SuMx(  1024) [OK]    0.9 GFlops    3.5 GB/s
 PS+SuMx(  2048) [OK]    1.0 GFlops    4.0 GB/s
 PS+SuMx(  4096) [OK]    0.9 GFlops    3.7 GB/s
 PS+SuMx(  8192) [OK]    1.0 GFlops    4.0 GB/s
 PS+SuMx( 16384) [OK]    1.0 GFlops    3.9 GB/s
 PS+SuMx( 32768) [OK]    1.0 GFlops    4.1 GB/s
 PS+SuMx( 65536) [OK]    1.1 GFlops    4.2 GB/s
 PS+SuMx(131072) [OK]    1.1 GFlops    4.3 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    0.4    1.9 121.7 [OK]    0.5    2.1 121.7
 PS+SuMx(    16)    0.4    1.8 121.7 [OK]    0.5    1.9 121.7
 PS+SuMx(    32)    0.4    1.7 121.7 [OK]    0.4    1.7 121.7
 PS+SuMx(    64)    0.5    2.1 121.7 [OK]    0.5    2.2 121.7
 PS+SuMx(   128)    0.6    2.2 121.7 [OK]    0.6    2.3 121.7
 PS+SuMx(   256)    0.7    2.9 121.7 [OK]    0.7    3.0 121.7
 PS+SuMx(   512)    0.9    3.5 121.7 [OK]    0.9    3.6 121.7
 PS+SuMx(  1024)    0.9    3.5 121.7 [OK]    0.9    3.7 121.7
 PS+SuMx(  2048)    1.0    4.0 121.7 [OK]    1.0    4.2 121.7
 PS+SuMx(  4096)    0.9    3.8 121.7 [OK]    1.0    3.9 121.7
 PS+SuMx(  8192)    1.0    4.0 121.7 [OK]    1.0    4.2 121.7
 PS+SuMx( 16384)    1.0    4.0 121.7 [OK]    1.0    4.1 121.7
 PS+SuMx( 32768)    1.1    4.2 121.7 [OK]    1.1    4.3 121.7
 PS+SuMx( 65536)    1.1    4.3 121.7 [OK]    1.1    4.5 121.7
 PS+SuMx(131072)    1.1    4.4 121.7 [OK]    1.1    4.5 121.7

Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 23 Dec 2010, 04:06:32 pm
I ran this on my usual rig (Q6600/8GB/8800GTX), but version 8 added something new: an error. Under WinXP it just shows the error, but under Win7-64 the screen turns black and I get a "driver stopped responding" error. Running driver 260.99.

First the WinXP-32 log:

Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.7 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   11.1 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    4.3 GFlops   17.6 GB/s
 PS+SuMx(   128) [OK]    6.7 GFlops   26.9 GB/s
 PS+SuMx(   256) [OK]    9.0 GFlops   36.0 GB/s
 PS+SuMx(   512) [OK]   11.2 GFlops   44.7 GB/s
 PS+SuMx(  1024) [OK]   11.8 GFlops   47.4 GB/s
 PS+SuMx(  2048) [OK]   13.5 GFlops   53.9 GB/s
 PS+SuMx(  4096) [OK]   13.2 GFlops   52.6 GB/s
 PS+SuMx(  8192) [OK]   14.4 GFlops   57.4 GB/s
 PS+SuMx( 16384) [OK]   14.1 GFlops   56.4 GB/s
 PS+SuMx( 32768) [OK]   14.9 GFlops   59.5 GB/s
 PS+SuMx( 65536) [OK]   15.3 GFlops   61.1 GB/s
 PS+SuMx(131072) [OK]   11.9 GFlops   47.7 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.6   15.8 121.7 [OK]    6.2   27.2 121.7
 PS+SuMx(    16)    4.5   18.8 121.7 [OK]    6.1   25.5 121.7
 PS+SuMx(    32)    4.9   20.1 121.7 [OK]    5.8   23.8 121.7
 PS+SuMx(    64)
 FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456

Then the Win7-64 log:

Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    2.0 GFlops    9.0 GB/s
 PS+SuMx(    16) [OK]    2.4 GFlops   10.2 GB/s
 PS+SuMx(    32) [OK]    2.4 GFlops    9.8 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.6 GB/s
 PS+SuMx(   128) [OK]    5.7 GFlops   22.8 GB/s
 PS+SuMx(   256) [OK]    7.2 GFlops   28.8 GB/s
 PS+SuMx(   512) [OK]    8.5 GFlops   34.1 GB/s
 PS+SuMx(  1024) [OK]    8.9 GFlops   35.8 GB/s
 PS+SuMx(  2048) [OK]    9.8 GFlops   39.3 GB/s
 PS+SuMx(  4096) [OK]    9.7 GFlops   38.8 GB/s
 PS+SuMx(  8192) [OK]   10.3 GFlops   41.3 GB/s
 PS+SuMx( 16384) [OK]   10.1 GFlops   40.5 GB/s
 PS+SuMx( 32768) [OK]   10.6 GFlops   42.2 GB/s
 PS+SuMx( 65536) [OK]   10.7 GFlops   43.0 GB/s
 PS+SuMx(131072) [OK]    9.0 GFlops   36.0 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.4   14.8 121.7 [OK]    6.1   26.8 121.7
 PS+SuMx(    16)    4.2   17.4 121.7 [OK]    6.0   25.3 121.7
 PS+SuMx(    32)    4.6   18.7 121.7 [OK]    5.8   23.7 121.7
 PS+SuMx(    64)
 FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 23 Dec 2010, 04:36:20 pm
All OK here with GPU RAM at 1975 MHz with 5 runs

Best Stock result
Quote
PS+SuMx( 32768) [OK]   18.7 GFlops   75.0 GB/s

Best Opt. 1 result
Quote
PS+SuMx( 32768)   26.8  107.4 121.7 [OK]   37.0  148.1 121.7

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 23 Dec 2010, 06:00:09 pm
Very interesting: test 8 shows 32768 as the best result for the GTX 470/480 cards,
but on low-end cards 131072 is best.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 23 Dec 2010, 06:37:37 pm
All OK. Worst case 0-0.4 faster than stock, best case another 0.1-0.4 faster than worst case.
About 5 runs.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 23 Dec 2010, 07:19:45 pm
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.

Code:
PS+SuMx( 65536) [OK]   12.4 GFlops   49.6 GB/s
Code:
PS+SuMx( 65536)   16.6   66.4 121.7 [OK]   17.7   70.7 121.7
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 12:47:51 am
Quote
PS+SuMx(    64)
 FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456

Wow Patrick, clearly something I'm doing at size 64 has changed (and it only shows up on cc 1.0  :o). Will check. We're going to need to fix that before moving on.

[Later:] @Patrick: when you can, please reboot & try the attached fix attempt (for compute capability 1.0)... If it's OK on that card I'll be able to avoid breaking that again...

[Removed attachment]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 24 Dec 2010, 05:03:00 am
Quote
PS+SuMx(    64)
 FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456

Wow Patrick, clearly something I'm doing at size 64 has changed (and it only shows up on cc 1.0  :o). Will check. We're going to need to fix that before moving on.

[Later:] @Patrick: when you can, please reboot & try the attached fix attempt (for compute capability 1.0)... If it's OK on that card I'll be able to avoid breaking that again...

It looks like you fixed it. Full logs for completeness' sake:

WinXP-32:

Code:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.7 GB/s
 PS+SuMx(    16) [OK]    2.6 GFlops   11.1 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    4.3 GFlops   17.6 GB/s
 PS+SuMx(   128) [OK]    6.7 GFlops   26.9 GB/s
 PS+SuMx(   256) [OK]    9.0 GFlops   36.0 GB/s
 PS+SuMx(   512) [OK]   11.2 GFlops   44.7 GB/s
 PS+SuMx(  1024) [OK]   11.8 GFlops   47.4 GB/s
 PS+SuMx(  2048) [OK]   13.5 GFlops   53.9 GB/s
 PS+SuMx(  4096) [OK]   13.2 GFlops   52.6 GB/s
 PS+SuMx(  8192) [OK]   14.4 GFlops   57.5 GB/s
 PS+SuMx( 16384) [OK]   14.1 GFlops   56.5 GB/s
 PS+SuMx( 32768) [OK]   14.9 GFlops   59.5 GB/s
 PS+SuMx( 65536) [OK]   15.3 GFlops   61.2 GB/s
 PS+SuMx(131072) [OK]   12.0 GFlops   47.8 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.6   15.8 121.7 [OK]    6.2   27.2 121.7
 PS+SuMx(    16)    4.5   18.8 121.7 [OK]    6.1   25.5 121.7
 PS+SuMx(    32)    4.9   20.1 121.7 [OK]    5.8   23.8 121.7
 PS+SuMx(    64)    6.5   26.5 121.7 [OK]    7.4   30.0 121.7
 PS+SuMx(   128)    7.2   28.8 121.7 [OK]    7.8   31.3 121.7
 PS+SuMx(   256)    9.4   37.8 121.7 [OK]   10.2   40.7 121.7
 PS+SuMx(   512)   11.6   46.3 121.7 [OK]   12.4   49.7 121.7
 PS+SuMx(  1024)   12.1   48.5 121.7 [OK]   12.9   51.6 121.7
 PS+SuMx(  2048)   13.7   54.9 121.7 [OK]   14.6   58.5 121.7
 PS+SuMx(  4096)   13.4   53.5 121.7 [OK]   14.2   56.8 121.7
 PS+SuMx(  8192)   14.5   58.2 121.7 [OK]   15.5   62.0 121.7
 PS+SuMx( 16384)   14.3   57.1 121.7 [OK]   15.2   60.9 121.7
 PS+SuMx( 32768)   15.1   60.3 121.7 [OK]   16.1   64.4 121.7
 PS+SuMx( 65536)   15.5   62.0 121.7 [OK]   16.5   66.2 121.7
 PS+SuMx(131072)   12.1   48.2 121.7 [OK]   12.7   50.8 121.7

Win7-64:

Code:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
 PS+SuMx(     8) [OK]    2.0 GFlops    8.7 GB/s
 PS+SuMx(    16) [OK]    2.4 GFlops   10.2 GB/s
 PS+SuMx(    32) [OK]    2.4 GFlops    9.7 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.8 GB/s
 PS+SuMx(   128) [OK]    5.6 GFlops   22.7 GB/s
 PS+SuMx(   256) [OK]    7.2 GFlops   29.0 GB/s
 PS+SuMx(   512) [OK]    8.7 GFlops   34.7 GB/s
 PS+SuMx(  1024) [OK]    9.0 GFlops   36.0 GB/s
 PS+SuMx(  2048) [OK]   10.0 GFlops   40.1 GB/s
 PS+SuMx(  4096) [OK]    9.8 GFlops   39.0 GB/s
 PS+SuMx(  8192) [OK]   10.4 GFlops   41.6 GB/s
 PS+SuMx( 16384) [OK]   10.2 GFlops   40.7 GB/s
 PS+SuMx( 32768) [OK]   10.8 GFlops   43.2 GB/s
 PS+SuMx( 65536) [OK]   10.9 GFlops   43.6 GB/s
 PS+SuMx(131072) [OK]    9.0 GFlops   36.1 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.4   14.9 121.7 [OK]    6.1   26.8 121.7
 PS+SuMx(    16)    4.2   17.6 121.7 [OK]    6.1   25.4 121.7
 PS+SuMx(    32)    4.6   18.7 121.7 [OK]    5.8   23.7 121.7
 PS+SuMx(    64)    6.0   24.2 121.7 [OK]    7.3   29.4 121.7
 PS+SuMx(   128)    6.5   26.0 121.7 [OK]    7.7   31.1 121.7
 PS+SuMx(   256)    8.3   33.3 121.7 [OK]   10.1   40.4 121.7
 PS+SuMx(   512)    9.9   39.8 121.7 [OK]   12.3   49.4 121.7
 PS+SuMx(  1024)   10.2   40.8 121.7 [OK]   12.8   51.3 121.7
 PS+SuMx(  2048)   11.3   45.2 121.7 [OK]   14.5   58.2 121.7
 PS+SuMx(  4096)   11.2   44.6 121.7 [OK]   14.1   56.3 121.7
 PS+SuMx(  8192)   12.1   48.3 121.7 [OK]   15.4   61.5 121.7
 PS+SuMx( 16384)   11.7   46.8 121.7 [OK]   15.1   60.4 121.7
 PS+SuMx( 32768)   12.2   48.8 121.7 [OK]   16.0   63.8 121.7
 PS+SuMx( 65536)   12.5   50.0 121.7 [OK]   16.4   65.8 121.7
 PS+SuMx(131072)   10.1   40.5 121.7 [OK]   12.6   50.5 121.7

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 06:47:56 am
Phew!  cool, thanks  ;D

Not much headroom on that chip either, but I'll be happy with that small fraction improvement on the oldest cards for now. 

Moving onto test #9 soon, will add in the FFTs, then will stream the test kernels after that, just to see what that does... Progress at last  ;D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 24 Dec 2010, 07:56:28 am
Quote
Phew!  cool, thanks  ;D

Not much headroom on that chip either, but I'll be happy with that small fraction improvement on the oldest cards for now. 

Moving onto test #9 soon, will add in the FFTs, then will stream the test kernels after that, just to see what that does... Progress at last  ;D

You're quite welcome. What exactly do you mean by 'not much headroom on that chip'?

Looking forward to the next test programs. ;)

Oh, and a Merry Christmas!

Regards,

Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 09:50:16 am
Quote
You're quite welcome. What exactly do you mean by 'not much headroom on that chip'?

Only that the best & worst case Opt1 results aren't as far apart as on the newer/bigger GPUs, which suggests we're getting close to the limit of what optimisations can be useful on the smaller chips (with this part of the code, anyway).
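
To put a number on that "headroom", one option is to take the best-case to worst-case GFlops ratio straight from an Opt1 log line. This is a rough Python sketch, not anything the test program does itself; the regular expression and the `headroom` name are my own, and the two rows are copied from the 8800 GTX and GTX 465 results earlier in the thread.

```python
import re

# Matches one Opt1 row: size, then worst-case (GFlps, GB/s, ulps),
# the [OK] flag, then best-case (GFlps, GB/s, ulps).
ROW = re.compile(
    r"PS\+SuMx\(\s*(\d+)\)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+\[OK\]"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)"
)


def headroom(line):
    """Return (fft size, best-case GFlops / worst-case GFlops) for one row."""
    m = ROW.search(line)
    if m is None:
        raise ValueError("not an Opt1 result row: %r" % line)
    size = int(m.group(1))
    worst_gflops = float(m.group(2))
    best_gflops = float(m.group(5))
    return size, best_gflops / worst_gflops


rows = {
    "8800 GTX (cc 1.0)": " PS+SuMx( 65536)   15.5   62.0 121.7 [OK]   16.5   66.2 121.7",
    "GTX 465 (cc 2.0)": " PS+SuMx( 32768)   17.7   71.0 121.7 [OK]   24.6   98.2 121.7",
}

for card, line in rows.items():
    size, ratio = headroom(line)
    print(f"{card}: size {size}, best/worst = {ratio:.2f}")
```

A ratio close to 1 means the best and worst cases are nearly the same, so there is little left to gain from this optimisation on that card.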

Onto combining the FFTs into the pipeline now, which will change the picture a lot.  Back later

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 11:56:29 am
Darn, the next test won't fit! Arggh! .... I'll post it once I find somewhere to put it. Net progress so far looks something like this, for ~40-60% of multibeam processing:

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   12.7 GFlops   22.4 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   20.6 GFlops   28.1 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   25.1 GFlops   28.0 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   43.1 GFlops   40.8 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   63.7 GFlops   52.4 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   85.6 GFlops   62.4 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)  114.2 GFlops   74.6 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)  136.7 GFlops   81.0 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)  149.3 GFlops   81.0 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  154.1 GFlops   77.1 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)  156.2 GFlops   72.4 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)  149.2 GFlops   64.5 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  155.5 GFlops   63.0 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  152.0 GFlops   57.9 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)  134.7 GFlops   48.4 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   19.2 GFlops   33.8 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   37.0 GFlops   50.5 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   61.1 GFlops   68.2 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   86.9 GFlops   82.2 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   93.4 GFlops   76.8 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  137.0 GFlops   99.8 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  174.8 GFlops  114.2 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  218.7 GFlops  129.6 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  231.2 GFlops  125.4 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  236.8 GFlops  118.4 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  229.0 GFlops  106.2 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  223.9 GFlops   96.8 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  216.0 GFlops   87.5 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  214.0 GFlops   81.5 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  205.0 GFlops   73.7 GB/s  ulps(fft  2.7,ps 5392.8) [OK]


Figuring out how to get it uploaded ...
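
For readers unfamiliar with the ulps columns in the output above: an ulp ("unit in the last place") is the spacing between adjacent floating-point values, so an error in ulps counts how many representable floats apart two results are. How the test aggregates these is not stated in the thread (the fractional values suggest some kind of average). The sketch below only shows a common way to measure the ulp distance between two single-precision values in Python; the function names are made up for illustration.

```python
import struct


def float32_bits(x):
    """Reinterpret x, rounded to single precision, as a signed 32-bit int."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def ordered(x):
    """Map a float32 to an integer that increases monotonically with x.

    Non-negative floats keep their bit pattern; negative floats are mapped
    to minus their magnitude bits, so -0.0 and +0.0 both map to 0.
    """
    i = float32_bits(x)
    return i if i >= 0 else -(i & 0x7FFFFFFF)


def ulp_distance(a, b):
    """Number of representable float32 steps between a and b."""
    return abs(ordered(a) - ordered(b))


# 1.0 + 2**-23 is the next single-precision value after 1.0.
print(ulp_distance(1.0, 1.0 + 2.0 ** -23))
print(ulp_distance(-0.0, 0.0))
```

NaN and infinity are not handled here; a real checker would need to treat those separately.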
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 12:12:20 pm
Unable to upload here, please try:

ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z

Updated first post:

Quote
Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only

Temporary download location:
ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z
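
As an illustration of what a 0.001% tolerance means in practice, here is a small Python sketch of a relative-tolerance check. It assumes the test compares a computed peak or average against a reference value relative to the reference's magnitude; the thread does not say exactly how the comparison is done, and the function name is hypothetical.

```python
def within_relative_tolerance(value, reference, tol_percent=0.001):
    """True if value is within tol_percent percent of reference.

    Assumption: the tolerance is relative to the reference magnitude.
    """
    tol = tol_percent / 100.0
    return abs(value - reference) <= tol * abs(reference)


# 0.0005 away from 100.0 is 0.0005%, inside the 0.001% band.
print(within_relative_tolerance(100.0005, 100.0))
# 0.01 away from 100.0 is 0.01%, outside the band.
print(within_relative_tolerance(100.01, 100.0))
```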

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 24 Dec 2010, 12:27:55 pm
GTX 465 results:

Quote
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   10.6 GFlops   18.7 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   16.5 GFlops   22.5 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   16.9 GFlops   18.9 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   29.0 GFlops   27.4 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   43.4 GFlops   35.7 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   57.8 GFlops   42.1 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)   77.4 GFlops   50.6 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)   92.9 GFlops   55.1 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)   99.7 GFlops   54.1 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  101.1 GFlops   50.6 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)  103.9 GFlops   48.2 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)  103.1 GFlops   44.6 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  104.6 GFlops   42.4 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  102.4 GFlops   39.0 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)   93.8 GFlops   33.7 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   20.5 GFlops   36.2 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   33.7 GFlops   45.9 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   47.3 GFlops   52.8 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   60.0 GFlops   56.8 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   59.0 GFlops   48.5 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)   85.8 GFlops   62.5 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  109.0 GFlops   71.2 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  133.7 GFlops   79.3 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  136.9 GFlops   74.3 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  141.5 GFlops   70.7 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  136.7 GFlops   63.4 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  141.3 GFlops   61.1 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  134.9 GFlops   54.6 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  132.6 GFlops   50.5 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  130.5 GFlops   46.9 GB/s  ulps(fft  2.7,ps 5392.8) [OK]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 12:32:42 pm
Thanks. That's a crazy speedup there too (1.3-2x). Will be checking thoroughly before moving on  ;)
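
If anyone wants to work out the per-size speedup from their own logs rather than eyeballing it, something along these lines should do. It is only a sketch: the regular expression and function names are mine, and the three rows per section are copied from the GTX 465 output above as sample input.

```python
import re

# Pulls the FFT size and the GFlops figure out of each "FFT+PS+SM" row.
ROW = re.compile(r"FFT\+PS\+SM\(\s*(\d+)\s*\)\s+([\d.]+) GFlops")


def gflops_by_size(log_text):
    """Map FFT size -> GFlops for every result row found in log_text."""
    return {int(size): float(gflops) for size, gflops in ROW.findall(log_text)}


def speedups(stock_log, opt_log):
    """Opt1 GFlops divided by stock GFlops, for sizes present in both logs."""
    stock = gflops_by_size(stock_log)
    opt = gflops_by_size(opt_log)
    return {size: opt[size] / stock[size] for size in stock if size in opt}


STOCK = """
 FFT+PS+SM(     8)   10.6 GFlops   18.7 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(  1024)   92.9 GFlops   55.1 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(131072)   93.8 GFlops   33.7 GB/s  ulps(fft  2.7,ps 5392.0) [OK]
"""

OPT1 = """
 FFT+PS+SM(     8)   20.5 GFlops   36.2 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(  1024)  133.7 GFlops   79.3 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(131072)  130.5 GFlops   46.9 GB/s  ulps(fft  2.7,ps 5392.8) [OK]
"""

for size, ratio in sorted(speedups(STOCK, OPT1).items()):
    print(f"size {size:6d}: {ratio:.2f}x")
```

Paste the full Stock and Opt1 sections of a run into the two strings to get the ratio for every size.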
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 24 Dec 2010, 12:49:33 pm
And the 460-768

Code:
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    9.5 GFlops   16.7 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   14.4 GFlops   19.7 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   13.8 GFlops   15.4 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   24.2 GFlops   22.9 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   36.9 GFlops   30.4 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   49.9 GFlops   36.3 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)   70.7 GFlops   46.2 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)   90.4 GFlops   53.6 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)  102.7 GFlops   55.7 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  111.2 GFlops   55.6 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)   97.5 GFlops   45.2 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)   93.4 GFlops   40.4 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  100.6 GFlops   40.7 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  106.9 GFlops   40.7 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)   86.9 GFlops   31.3 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   16.5 GFlops   29.1 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   27.2 GFlops   37.1 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   38.4 GFlops   42.9 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   49.9 GFlops   47.2 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   45.0 GFlops   37.0 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)   64.5 GFlops   47.0 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)   82.9 GFlops   54.2 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  108.0 GFlops   64.0 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  123.3 GFlops   66.9 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  132.9 GFlops   66.4 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  111.0 GFlops   51.5 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  107.2 GFlops   46.3 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  111.4 GFlops   45.1 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  117.4 GFlops   44.7 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)   95.6 GFlops   34.4 GB/s  ulps(fft  2.7,ps 5392.8) [OK]

Rehosting of the test on a faster connection.
http://www.arkayn.us/seti/PowerSpectrumTest9.7z
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 12:54:16 pm
Quote
And the 460-768...

We're pushing that narrower memory bus, I guess  ;). The totally different spread is interesting.

Quote
Rehosting of the test on a faster connection.
http://www.arkayn.us/seti/PowerSpectrumTest9.7z

Cheers! (adding link to first post..[done] )
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 24 Dec 2010, 01:23:40 pm
This is fun!


Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #9 (FFT pipeline)
            Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   21.2 GFlops   37.3 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   30.5 GFlops   41.6 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   30.7 GFlops   34.2 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   50.3 GFlops   47.6 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   73.0 GFlops   60.0 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   92.7 GFlops   67.5 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)  125.8 GFlops   82.2 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)  149.6 GFlops   88.7 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)  163.0 GFlops   88.4 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  168.5 GFlops   84.2 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)  170.0 GFlops   78.8 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)  157.2 GFlops   68.0 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  167.4 GFlops   67.8 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  164.6 GFlops   62.7 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)  141.9 GFlops   51.0 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   37.4 GFlops   65.9 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   58.9 GFlops   80.4 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   81.7 GFlops   91.2 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)  102.4 GFlops   96.9 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)  100.5 GFlops   82.7 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  142.2 GFlops  103.6 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  177.3 GFlops  115.9 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  218.1 GFlops  129.3 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  233.4 GFlops  126.6 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  238.4 GFlops  119.2 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  229.6 GFlops  106.5 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  217.5 GFlops   94.1 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  213.6 GFlops   86.5 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  213.2 GFlops   81.2 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  198.0 GFlops   71.2 GB/s  ulps(fft  2.7,ps 5392.8) [OK]

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Richard Haselgrove on 24 Dec 2010, 01:25:55 pm
The usual three:

9800GTX+, Windows 7/32
Code:
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    6.9 GFlops   12.2 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   11.8 GFlops   16.1 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   15.6 GFlops   17.4 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   26.2 GFlops   24.8 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   36.6 GFlops   30.1 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   48.7 GFlops   35.5 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   57.8 GFlops   37.8 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   62.9 GFlops   37.3 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   61.7 GFlops   33.5 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   57.6 GFlops   28.8 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   56.7 GFlops   26.3 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   52.5 GFlops   22.7 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   50.3 GFlops   20.4 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   55.3 GFlops   21.1 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   56.9 GFlops   20.5 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   14.9 GFlops   26.2 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   23.3 GFlops   31.8 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   30.5 GFlops   34.0 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   43.2 GFlops   40.9 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   49.8 GFlops   41.0 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   64.9 GFlops   47.3 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   79.3 GFlops   51.8 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   81.9 GFlops   48.6 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   78.1 GFlops   42.4 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   73.3 GFlops   36.7 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   70.5 GFlops   32.7 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   65.7 GFlops   28.4 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   60.7 GFlops   24.6 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   67.0 GFlops   25.5 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   68.5 GFlops   24.6 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

9800GT, Windows XP/32
Code:
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    6.6 GFlops   11.6 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   10.5 GFlops   14.3 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   13.0 GFlops   14.5 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   22.4 GFlops   21.2 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   33.8 GFlops   27.8 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   45.2 GFlops   32.9 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   56.0 GFlops   36.6 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   57.6 GFlops   34.1 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   57.4 GFlops   31.1 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   50.4 GFlops   25.2 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   48.9 GFlops   22.7 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   46.8 GFlops   20.3 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   42.4 GFlops   17.2 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   47.8 GFlops   18.2 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   50.5 GFlops   18.1 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)    9.7 GFlops   17.2 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   16.0 GFlops   21.9 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   21.5 GFlops   24.0 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   31.1 GFlops   29.4 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   36.3 GFlops   29.9 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   47.7 GFlops   34.8 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   58.6 GFlops   38.3 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   59.7 GFlops   35.4 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   59.0 GFlops   32.0 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   51.9 GFlops   26.0 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   50.0 GFlops   23.2 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   47.7 GFlops   20.6 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   43.2 GFlops   17.5 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   48.7 GFlops   18.6 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   51.6 GFlops   18.6 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

GTX 470, Windows XP/32
Code:
Device: GeForce GTX 470, 1215 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    7.9 GFlops   14.0 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   14.0 GFlops   19.1 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   17.7 GFlops   19.7 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   32.4 GFlops   30.7 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   51.7 GFlops   42.6 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   72.0 GFlops   52.5 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)  100.4 GFlops   65.6 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)  124.9 GFlops   74.1 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)  136.6 GFlops   74.1 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  139.1 GFlops   69.6 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)  141.0 GFlops   65.4 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)  132.7 GFlops   57.4 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  137.9 GFlops   55.9 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  134.5 GFlops   51.2 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)  116.0 GFlops   41.7 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   14.2 GFlops   25.1 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   27.2 GFlops   37.1 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   43.9 GFlops   49.0 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   61.3 GFlops   58.0 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   65.6 GFlops   54.0 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)   95.7 GFlops   69.7 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  121.1 GFlops   79.2 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  153.4 GFlops   91.0 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  161.9 GFlops   87.8 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  168.3 GFlops   84.2 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  157.7 GFlops   73.1 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  155.1 GFlops   67.1 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  151.9 GFlops   61.5 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  150.7 GFlops   57.4 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  137.2 GFlops   49.3 GB/s  ulps(fft  2.7,ps 5392.8) [OK]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 24 Dec 2010, 01:48:12 pm
My 128MB 8400M GS on Vista32, and Merry Christmas:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    1.2 GFlops    2.1 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)    1.4 GFlops    1.8 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)    1.4 GFlops    1.5 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)    2.3 GFlops    2.2 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)    3.6 GFlops    2.9 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)    4.7 GFlops    3.4 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)    5.6 GFlops    3.7 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)    5.5 GFlops    3.2 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)    5.5 GFlops    3.0 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)    5.3 GFlops    2.6 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)    4.7 GFlops    2.2 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)    4.4 GFlops    1.9 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)    5.0 GFlops    2.0 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)    5.2 GFlops    2.0 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)    5.5 GFlops    2.0 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)    1.6 GFlops    2.8 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)    1.9 GFlops    2.6 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)    2.3 GFlops    2.5 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)    3.1 GFlops    2.9 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)    3.6 GFlops    3.0 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)    4.8 GFlops    3.5 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)    5.8 GFlops    3.8 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)    5.6 GFlops    3.3 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)    5.7 GFlops    3.1 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)    5.3 GFlops    2.7 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)    4.8 GFlops    2.2 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)    4.4 GFlops    1.9 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)    5.0 GFlops    2.0 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)    5.2 GFlops    2.0 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)    5.5 GFlops    2.0 GB/s  ulps(fft  3.6,ps 6590.4) [OK]


and 9800GTX+ on Win 7 x64:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    8.1 GFlops   14.3 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   12.6 GFlops   17.2 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   16.6 GFlops   18.5 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   28.7 GFlops   27.1 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   42.1 GFlops   34.6 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   55.5 GFlops   40.4 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   68.2 GFlops   44.6 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   72.3 GFlops   42.9 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   70.7 GFlops   38.4 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   66.1 GFlops   33.1 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   64.2 GFlops   29.8 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   60.7 GFlops   26.2 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   56.1 GFlops   22.7 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   62.0 GFlops   23.6 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   63.2 GFlops   22.7 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   11.1 GFlops   19.6 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   19.4 GFlops   26.4 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   27.5 GFlops   30.7 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   40.8 GFlops   38.6 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   48.9 GFlops   40.2 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   64.2 GFlops   46.8 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   79.3 GFlops   51.8 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   82.7 GFlops   49.0 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   79.9 GFlops   43.3 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   74.3 GFlops   37.2 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   71.6 GFlops   33.2 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   66.9 GFlops   28.9 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   61.4 GFlops   24.9 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   68.0 GFlops   25.9 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   69.3 GFlops   24.9 GB/s  ulps(fft  3.6,ps 6590.4) [OK]


Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 24 Dec 2010, 01:55:43 pm
-device 0
Code:
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   17.3 GFlops   30.4 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   23.2 GFlops   31.7 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   27.2 GFlops   30.4 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   43.8 GFlops   41.5 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   60.7 GFlops   49.9 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   75.6 GFlops   55.1 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   91.6 GFlops   59.9 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   92.1 GFlops   54.6 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   96.9 GFlops   52.6 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   93.1 GFlops   46.6 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   98.7 GFlops   45.8 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   96.1 GFlops   41.6 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   96.5 GFlops   39.1 GB/s  ulps(fft  3.1,ps 6152.4) [OK]
 FFT+PS+SM( 65536)   88.2 GFlops   33.6 GB/s  ulps(fft  2.8,ps 5899.2) [OK]
 FFT+PS+SM(131072)   94.4 GFlops   33.9 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 128 thrds/block
 FFT+PS+SM(     8)   25.0 GFlops   44.0 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   37.1 GFlops   50.6 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   49.8 GFlops   55.6 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   68.5 GFlops   64.9 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   81.4 GFlops   67.0 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   94.6 GFlops   68.9 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)  115.9 GFlops   75.7 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)  122.4 GFlops   72.6 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)  124.9 GFlops   67.7 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)  113.9 GFlops   57.0 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)  120.5 GFlops   55.9 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)  121.6 GFlops   52.6 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)  120.1 GFlops   48.7 GB/s  ulps(fft  3.1,ps 6041.9) [OK]
 FFT+PS+SM( 65536)  103.7 GFlops   39.5 GB/s  ulps(fft  2.8,ps 5782.9) [OK]
 FFT+PS+SM(131072)  111.2 GFlops   40.0 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

-device 1
Code:
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   16.3 GFlops   28.7 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   22.9 GFlops   31.3 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   26.3 GFlops   29.3 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   42.1 GFlops   39.8 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   63.2 GFlops   52.0 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   75.0 GFlops   54.6 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   89.7 GFlops   58.6 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   92.9 GFlops   55.1 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   96.6 GFlops   52.4 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   87.3 GFlops   43.7 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   49.6 GFlops   23.0 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   98.6 GFlops   42.6 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   97.1 GFlops   39.3 GB/s  ulps(fft  3.1,ps 6152.4) [OK]
 FFT+PS+SM( 65536)   85.5 GFlops   32.6 GB/s  ulps(fft  2.8,ps 5899.2) [OK]
 FFT+PS+SM(131072)   91.4 GFlops   32.9 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 128 thrds/block
 FFT+PS+SM(     8)   24.5 GFlops   43.2 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   36.4 GFlops   49.7 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   48.8 GFlops   54.5 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   67.0 GFlops   63.4 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   79.6 GFlops   65.5 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   92.7 GFlops   67.5 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)  113.9 GFlops   74.4 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)  118.9 GFlops   70.5 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)  122.9 GFlops   66.7 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)  111.8 GFlops   55.9 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)  117.7 GFlops   54.6 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)  118.7 GFlops   51.3 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)  117.7 GFlops   47.7 GB/s  ulps(fft  3.1,ps 6041.9) [OK]
 FFT+PS+SM( 65536)  101.2 GFlops   38.5 GB/s  ulps(fft  2.8,ps 5782.9) [OK]
 FFT+PS+SM(131072)  108.6 GFlops   39.0 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

-device 2
Code:
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   16.5 GFlops   29.2 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   23.1 GFlops   31.5 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   25.3 GFlops   28.3 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   41.3 GFlops   39.1 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   61.6 GFlops   50.7 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   72.0 GFlops   52.5 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   87.7 GFlops   57.3 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   94.5 GFlops   56.0 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   96.7 GFlops   52.5 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   90.5 GFlops   45.2 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   95.0 GFlops   44.1 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   95.0 GFlops   41.1 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   91.2 GFlops   36.9 GB/s  ulps(fft  3.1,ps 6152.4) [OK]
 FFT+PS+SM( 65536)   83.6 GFlops   31.8 GB/s  ulps(fft  2.8,ps 5899.2) [OK]
 FFT+PS+SM(131072)   90.6 GFlops   32.6 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 128 thrds/block
 FFT+PS+SM(     8)   24.1 GFlops   42.4 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   35.3 GFlops   48.2 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   47.1 GFlops   52.6 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   64.9 GFlops   61.4 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   77.0 GFlops   63.3 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   89.2 GFlops   65.0 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)  110.0 GFlops   71.9 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)  118.1 GFlops   70.0 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)  118.8 GFlops   64.5 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)  110.6 GFlops   55.3 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)  116.2 GFlops   53.9 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)  116.1 GFlops   50.2 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)  108.7 GFlops   44.0 GB/s  ulps(fft  3.1,ps 6041.9) [OK]
 FFT+PS+SM( 65536)   97.8 GFlops   37.2 GB/s  ulps(fft  2.8,ps 5782.9) [OK]
 FFT+PS+SM(131072)  108.3 GFlops   38.9 GB/s  ulps(fft  3.6,ps 6590.4) [OK]
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 24 Dec 2010, 02:17:06 pm
oooh, now my eyes have gone funny  ;D

@All: Thanks very much and Merry Christmas!

Summary of what I can see:
- The newer & bigger the card, the more we seem to be able to extract
- Opt1 FFT (worst case) pipeline not slower than stock at any speed on any GPU so far (even the small GPUs)
- Seems stable [OK] on all
- 200 series holding in there
- Fermi peak starting to push unexpectedly high this early (but still ~50% of theoretical; will need to try streaming next, as planned)
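
For anyone who wants to check the "not slower than stock" point against their own run, here is a minimal sketch that reads a pasted test report and prints the Opt1/Stock GFlops ratio per FFT length. The function names (parse_report, speedups) are made up for illustration and are not part of the unit test; the two sample rows per section are copied from the 9800GTX+ run above.

```python
import re

# Matches one result row, e.g.
# " FFT+PS+SM(   512)   68.2 GFlops   44.6 GB/s  ulps(fft  2.1,ps 5128.1) [OK]"
ROW = re.compile(r"FFT\+PS\+SM\(\s*(\d+)\)\s+([\d.]+) GFlops\s+([\d.]+) GB/s.*\[(\w+)\]")

def parse_report(text):
    """Return {section: {fft_len: gflops}} for the 'Stock' and 'Opt1' sections."""
    sections, current = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Stock:"):
            current = "Stock"
        elif stripped.startswith("Opt1"):
            current = "Opt1"
        m = ROW.search(line)
        if m and current:
            sections.setdefault(current, {})[int(m.group(1))] = float(m.group(2))
    return sections

def speedups(sections):
    """Opt1/Stock GFlops ratio for every FFT length present in both sections."""
    stock, opt = sections["Stock"], sections["Opt1"]
    return {n: opt[n] / stock[n] for n in sorted(stock) if n in opt}

# Two rows per section, taken from the 9800GTX+ report in this thread.
SAMPLE = """\
Stock:
 FFT+PS+SM(     8)    8.1 GFlops   14.3 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(   512)   68.2 GFlops   44.6 GB/s  ulps(fft  2.1,ps 5128.1) [OK]

Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   11.1 GFlops   19.6 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(   512)   79.3 GFlops   51.8 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
"""

ratios = speedups(parse_report(SAMPLE))
for n, r in ratios.items():
    print(f"fftlen {n:6d}: {r:.2f}x")
```

A ratio of 1.0 or more at every length means Opt1 is at least as fast as stock for that card; paste a full report in place of SAMPLE to check all fifteen sizes.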

I reckon we're getting a good start toward optimising multibeam now.   With FFT, powerspectrum, & Summax reductions covered, we account for about 40-50% of processing (depending on angle range).  With a few more refinements to this area (mainly streaming & findspikes itself to try) we should be ready to tackle the more challenging areas that remain (& dominate).
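
For readers new to the thread, a rough CPU-side sketch of what the three stages compute may help. This is not the GPU code: it assumes "PS" means re^2 + im^2 per bin and "SM" means reducing each spectrum to its sum, maximum, and the position of the maximum. The naive DFT and all function names here are stand-ins chosen for illustration only.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; stands in for the GPU FFT purely for illustration."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def power_spectrum(spectrum):
    """Power of each complex bin: re^2 + im^2."""
    return [c.real * c.real + c.imag * c.imag for c in spectrum]

def summax(power):
    """Reduce one spectrum to (sum, max, index of max)."""
    peak = max(range(len(power)), key=power.__getitem__)
    return sum(power), power[peak], peak

def pipeline(samples, fftlen):
    """Split samples into fftlen-sized chunks and reduce each chunk."""
    results = []
    for start in range(0, len(samples) - fftlen + 1, fftlen):
        chunk = samples[start:start + fftlen]
        results.append(summax(power_spectrum(dft(chunk))))
    return results

# A pure tone sitting in bin 3 of an 8-point transform, two chunks long.
fftlen = 8
tone = [cmath.exp(2j * cmath.pi * 3 * k / fftlen) for k in range(2 * fftlen)]
results = pipeline(tone, fftlen)
for total, peak, where in results:
    print(f"sum={total:.1f} max={peak:.1f} bin={where}")
```

The point of fusing these stages on the GPU is that the sum/max reduction only needs a few numbers per spectrum, so the full power array never has to travel back to the host.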

Long road still to travel, but I reckon we've managed to nail a few key techniques that will help dramatically with certain problem areas down the road.

Cheers, off to give things a short Christmas break before going through all that with a fine-tooth comb.

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 24 Dec 2010, 02:37:20 pm
Merry Christmas!
Thank you for the Christmas 2010 edition  ;)
PowerSpectrumTest9.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   10.5 GFlops   18.6 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   16.6 GFlops   22.6 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   21.6 GFlops   24.2 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   36.0 GFlops   34.1 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   52.7 GFlops   43.3 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   69.5 GFlops   50.6 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)   94.6 GFlops   61.8 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)  107.8 GFlops   63.9 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)  118.0 GFlops   64.0 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  125.2 GFlops   62.6 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)  131.7 GFlops   61.1 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)  113.8 GFlops   49.2 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  121.3 GFlops   49.1 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  121.6 GFlops   46.3 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)  100.4 GFlops   36.1 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   21.7 GFlops   38.3 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   37.7 GFlops   51.4 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   55.7 GFlops   62.1 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   73.3 GFlops   69.4 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   75.4 GFlops   62.0 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  106.5 GFlops   77.6 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  132.7 GFlops   86.7 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  163.9 GFlops   97.2 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  179.4 GFlops   97.3 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  183.0 GFlops   91.5 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  179.3 GFlops   83.2 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  161.0 GFlops   69.6 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  163.6 GFlops   66.3 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  165.4 GFlops   63.0 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  146.7 GFlops   52.7 GB/s  ulps(fft  2.7,ps 5392.8) [OK]


PowerSpectrumTest9.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)   11.7 GFlops   20.6 GB/s  ulps(fft  1.2,ps 4389.0) [OK]
 FFT+PS+SM(    16)   19.0 GFlops   26.0 GB/s  ulps(fft  1.6,ps 4518.6) [OK]
 FFT+PS+SM(    32)   21.7 GFlops   24.2 GB/s  ulps(fft  1.3,ps 3977.6) [OK]
 FFT+PS+SM(    64)   36.1 GFlops   34.2 GB/s  ulps(fft  1.5,ps 4206.9) [OK]
 FFT+PS+SM(   128)   52.7 GFlops   43.3 GB/s  ulps(fft  1.7,ps 4351.9) [OK]
 FFT+PS+SM(   256)   69.7 GFlops   50.8 GB/s  ulps(fft  1.7,ps 4254.8) [OK]
 FFT+PS+SM(   512)   90.4 GFlops   59.1 GB/s  ulps(fft  1.8,ps 4305.7) [OK]
 FFT+PS+SM(  1024)   99.8 GFlops   59.2 GB/s  ulps(fft  2.1,ps 4725.7) [OK]
 FFT+PS+SM(  2048)  109.7 GFlops   59.5 GB/s  ulps(fft  2.2,ps 4918.4) [OK]
 FFT+PS+SM(  4096)  117.8 GFlops   58.9 GB/s  ulps(fft  2.2,ps 4762.0) [OK]
 FFT+PS+SM(  8192)  126.7 GFlops   58.8 GB/s  ulps(fft  2.6,ps 5275.5) [OK]
 FFT+PS+SM( 16384)  113.9 GFlops   49.2 GB/s  ulps(fft  2.6,ps 5355.0) [OK]
 FFT+PS+SM( 32768)  121.2 GFlops   49.1 GB/s  ulps(fft  2.3,ps 4987.7) [OK]
 FFT+PS+SM( 65536)  121.5 GFlops   46.3 GB/s  ulps(fft  2.0,ps 4601.3) [OK]
 FFT+PS+SM(131072)   99.9 GFlops   35.9 GB/s  ulps(fft  2.7,ps 5392.0) [OK]


Opt1 (worst case): 256 thrds/block
 FFT+PS+SM(     8)   21.8 GFlops   38.5 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   37.8 GFlops   51.6 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   55.9 GFlops   62.4 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   73.6 GFlops   69.7 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   75.7 GFlops   62.3 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  107.0 GFlops   77.9 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  133.3 GFlops   87.1 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  164.6 GFlops   97.6 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  180.0 GFlops   97.6 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  183.0 GFlops   91.5 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  179.7 GFlops   83.3 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  162.1 GFlops   70.1 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  164.3 GFlops   66.6 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  165.7 GFlops   63.1 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  147.5 GFlops   53.0 GB/s  ulps(fft  2.7,ps 5392.8) [OK]


.
Done
PowerSpectrumTest9.exe -device 0

Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    1.2 GFlops    2.0 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)    1.6 GFlops    2.2 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)    1.6 GFlops    1.8 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)    2.7 GFlops    2.6 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)    3.9 GFlops    3.2 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)    5.1 GFlops    3.7 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)    6.1 GFlops    4.0 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)    5.9 GFlops    3.5 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)    6.2 GFlops    3.4 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)    5.2 GFlops    2.6 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)    5.1 GFlops    2.4 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)    4.9 GFlops    2.1 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)    5.1 GFlops    2.1 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)    5.3 GFlops    2.0 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)    5.6 GFlops    2.0 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)    1.9 GFlops    3.3 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)    2.4 GFlops    3.2 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)    2.8 GFlops    3.1 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)    3.8 GFlops    3.6 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)    4.2 GFlops    3.5 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)    5.4 GFlops    3.9 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)    6.6 GFlops    4.3 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)    6.3 GFlops    3.7 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)    6.6 GFlops    3.6 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)    5.6 GFlops    2.8 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)    5.4 GFlops    2.5 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)    5.2 GFlops    2.2 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)    5.4 GFlops    2.2 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)    5.6 GFlops    2.1 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)    5.8 GFlops    2.1 GB/s  ulps(fft  3.6,ps 6590.4) [OK]


.
Done
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 24 Dec 2010, 02:40:49 pm
Here's mine, Merry Christmas

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest9.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    1.2 GFlops    2.2 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)    1.3 GFlops    1.8 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)    2.0 GFlops    2.2 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)    2.7 GFlops    2.5 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)    3.8 GFlops    3.2 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)    5.2 GFlops    3.8 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)    3.1 GFlops    2.0 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)    5.7 GFlops    3.3 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)    6.5 GFlops    3.5 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)    5.5 GFlops    2.8 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)    5.9 GFlops    2.7 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)    4.7 GFlops    2.0 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)    6.1 GFlops    2.5 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)    5.8 GFlops    2.2 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)    7.0 GFlops    2.5 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)    3.5 GFlops    6.1 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)    5.4 GFlops    7.4 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)    6.1 GFlops    6.8 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)    8.9 GFlops    8.4 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   10.2 GFlops    8.4 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   12.2 GFlops    8.9 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   15.5 GFlops   10.2 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   17.0 GFlops   10.1 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   18.1 GFlops    9.8 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   12.9 GFlops    6.5 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   14.3 GFlops    6.7 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   14.6 GFlops    6.3 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   12.4 GFlops    5.0 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   13.6 GFlops    5.2 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   13.9 GFlops    5.0 GB/s  ulps(fft  3.6,ps 6590.4) [OK]



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 24 Dec 2010, 04:14:26 pm
nr9

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.

 FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 254

ouch :)

ok, stopping boinc helps ::) result tomorrow... ok, result now:

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    1.8 GFlops    3.2 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)    2.9 GFlops    4.0 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)    2.7 GFlops    3.0 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)    5.3 GFlops    5.0 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)    7.9 GFlops    6.5 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   11.0 GFlops    8.0 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   13.3 GFlops    8.7 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   13.1 GFlops    7.8 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   13.2 GFlops    7.2 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   12.3 GFlops    6.1 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   11.5 GFlops    5.3 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   10.7 GFlops    4.6 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   12.2 GFlops    5.0 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   12.2 GFlops    4.7 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   12.5 GFlops    4.5 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)    3.7 GFlops    6.6 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)    4.7 GFlops    6.4 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)    5.6 GFlops    6.3 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)    7.9 GFlops    7.5 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)    9.4 GFlops    7.7 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   12.5 GFlops    9.1 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   15.3 GFlops   10.0 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   15.0 GFlops    8.9 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   14.6 GFlops    7.9 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   14.1 GFlops    7.0 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   12.8 GFlops    6.0 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   11.6 GFlops    5.0 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   13.1 GFlops    5.3 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   14.1 GFlops    5.4 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   14.0 GFlops    5.0 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

sorry, no time for averages atm
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Dec 2010, 01:35:50 am
Thanks Heinz, perryjay & Carola,

   Nice to see the stubborn chips (that Quadro & ION) edging forward a bit now.

@perryjay:  ~3x for 9500GT in some sizes? Don't completely know why that is, but I like it  ;D
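
To put a number on that, here is a quick sketch of the arithmetic using the GFlops columns from perryjay's 9500 GT report above. The figures are copied from that post; the variable names are just for illustration.

```python
from statistics import median

# FFT lengths 8 .. 131072, in the same order as the report rows.
SIZES = [2 ** p for p in range(3, 18)]

# GFlops columns copied from the 9500 GT report (Stock, then Opt1 worst case).
STOCK = [1.2, 1.3, 2.0, 2.7, 3.8, 5.2, 3.1, 5.7, 6.5, 5.5, 5.9, 4.7, 6.1, 5.8, 7.0]
OPT1 = [3.5, 5.4, 6.1, 8.9, 10.2, 12.2, 15.5, 17.0, 18.1, 12.9, 14.3, 14.6, 12.4, 13.6, 13.9]

ratios = {n: o / s for n, s, o in zip(SIZES, STOCK, OPT1)}
best = max(ratios, key=ratios.get)
worst = min(ratios, key=ratios.get)

print(f"median speedup: {median(ratios.values()):.2f}x")
print(f"largest:  {ratios[best]:.2f}x at fftlen {best}")
print(f"smallest: {ratios[worst]:.2f}x at fftlen {worst}")
```

The largest ratio comes from the 512 row, where the stock figure (3.1 GFlops) sits well below its neighbours, so that one may be a timing blip on the stock side and not a real gain; the median is the safer figure to quote.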

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 25 Dec 2010, 05:21:25 am
Hi there,

Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.

First, WinXP-32:

Code:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    9.3 GFlops   16.4 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   13.6 GFlops   18.5 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   16.0 GFlops   17.8 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   28.3 GFlops   26.8 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   44.4 GFlops   36.5 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   59.2 GFlops   43.1 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   72.6 GFlops   47.4 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   71.7 GFlops   42.5 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   72.1 GFlops   39.1 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   66.5 GFlops   33.3 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   63.3 GFlops   29.4 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   58.6 GFlops   25.3 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   62.9 GFlops   25.5 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   67.2 GFlops   25.6 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   66.0 GFlops   23.7 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   14.3 GFlops   25.2 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   21.2 GFlops   28.9 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   27.5 GFlops   30.7 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   39.1 GFlops   37.0 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   47.4 GFlops   39.0 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   62.5 GFlops   45.5 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   76.0 GFlops   49.7 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   74.1 GFlops   43.9 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   74.2 GFlops   40.3 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   67.3 GFlops   33.7 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   64.7 GFlops   30.0 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   59.8 GFlops   25.9 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   64.3 GFlops   26.0 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   68.6 GFlops   26.1 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   67.5 GFlops   24.3 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

Second, Win7-64:

Code:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    8.4 GFlops   14.9 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   12.1 GFlops   16.6 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   14.6 GFlops   16.3 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   25.9 GFlops   24.5 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   38.6 GFlops   31.8 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   50.3 GFlops   36.6 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   61.2 GFlops   40.0 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   61.6 GFlops   36.5 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   62.3 GFlops   33.8 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   57.5 GFlops   28.7 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   56.1 GFlops   26.0 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   52.4 GFlops   22.7 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   55.5 GFlops   22.5 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   59.2 GFlops   22.5 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   58.8 GFlops   21.1 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   14.2 GFlops   25.0 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   21.0 GFlops   28.6 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   27.5 GFlops   30.7 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   39.2 GFlops   37.1 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   46.8 GFlops   38.5 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   61.1 GFlops   44.5 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   75.2 GFlops   49.2 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   73.6 GFlops   43.6 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   73.4 GFlops   39.8 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   67.7 GFlops   33.9 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   64.4 GFlops   29.8 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   59.5 GFlops   25.7 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   64.0 GFlops   25.9 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   68.2 GFlops   26.0 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   67.1 GFlops   24.1 GB/s  ulps(fft  3.6,ps 6590.4) [OK]

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Dec 2010, 08:53:02 am
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.

Excellent, not broken on the 8800.  Last hurdle for that code area cleared & can move on  :D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 25 Dec 2010, 11:14:03 am
Carola just mentioned something I haven't been doing. I have been running the test without stopping BOINC. Should I run it with BOINC stopped?
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 25 Dec 2010, 12:56:38 pm
It's not necessary to completely stop BOINC, but at least the GPU should be snoozed.
Can't test GPU computing/memory transfers when you are crunching with it.
Else you will see reduced values on the test.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 25 Dec 2010, 01:05:33 pm
Okay, let's see how much of a difference this makes....
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest9.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #9 (FFT pipeline)
                                Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    3.0 GFlops    5.2 GB/s  ulps(fft  1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)    4.0 GFlops    5.5 GB/s  ulps(fft  1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)    4.4 GFlops    5.0 GB/s  ulps(fft  1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)    7.1 GFlops    6.7 GB/s  ulps(fft  1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)    9.8 GFlops    8.1 GB/s  ulps(fft  2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   11.9 GFlops    8.6 GB/s  ulps(fft  2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   15.0 GFlops    9.8 GB/s  ulps(fft  2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   16.2 GFlops    9.6 GB/s  ulps(fft  2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   17.5 GFlops    9.5 GB/s  ulps(fft  2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   13.4 GFlops    6.7 GB/s  ulps(fft  2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   14.2 GFlops    6.6 GB/s  ulps(fft  2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   13.7 GFlops    5.9 GB/s  ulps(fft  3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   12.1 GFlops    4.9 GB/s  ulps(fft  3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   13.0 GFlops    5.0 GB/s  ulps(fft  3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   13.9 GFlops    5.0 GB/s  ulps(fft  3.6,ps 6694.2) [OK]


Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)    4.1 GFlops    7.3 GB/s  ulps(fft  1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)    5.7 GFlops    7.7 GB/s  ulps(fft  1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)    7.0 GFlops    7.8 GB/s  ulps(fft  1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)    9.2 GFlops    8.7 GB/s  ulps(fft  1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   10.5 GFlops    8.6 GB/s  ulps(fft  2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   12.7 GFlops    9.2 GB/s  ulps(fft  2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   16.0 GFlops   10.5 GB/s  ulps(fft  2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   17.3 GFlops   10.2 GB/s  ulps(fft  2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   18.5 GFlops   10.0 GB/s  ulps(fft  2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   13.7 GFlops    6.9 GB/s  ulps(fft  2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   14.9 GFlops    6.9 GB/s  ulps(fft  2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   15.4 GFlops    6.6 GB/s  ulps(fft  3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   13.1 GFlops    5.3 GB/s  ulps(fft  3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   13.8 GFlops    5.3 GB/s  ulps(fft  3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   14.5 GFlops    5.2 GB/s  ulps(fft  3.6,ps 6590.4) [OK]



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Dec 2010, 01:12:13 pm
Ahah! That explains the inflated speedup on the previous run  :) .  In essence, (some of) the optimisations I'm trying out (namely, asynchronous transfers) should be less susceptible to slowdowns under load than the stock code (synchronous transfers)....

I wasn't looking to test/refine that aspect yet, but you managed to prove it already works... Thanks!  ;D

(Overlapped execution/transfers on Pre-Fermi, and concurrent kernels on Fermi next .... )
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 25 Dec 2010, 03:39:09 pm
Sorry bout that... hope I didn't mess you up too much. Glad it gave you some extra to think about.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 25 Dec 2010, 06:02:06 pm
Sorry bout that... hope I didn't mess you up too much. Glad it gave you some extra to think about.
  Not at all messed up, just had me wondering how 9500GT was managing to get 3x throughput at some sizes, and now we know it was under load ;).  That unexpected benefit does indeed give me some more things to consider for the next stage, and it looks like we might be able to push a bit harder than I thought.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 25 Dec 2010, 06:12:35 pm
Hey guys, I done something right for a change!!!  :)    ::)  Looking forward to the next test. This time I'll know to turn it off!
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 25 Dec 2010, 09:58:08 pm
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.

Excellent, not broken on the 8800.  Last hurdle for that code area cleared & can move on  :D

Wonderful to hear that. As always, looking forward to the next bit of execution-magic. ;)

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 04:36:06 am
Will take me some time to cook up the next test, working out this streaming stuff.
  Mixed results with kernel streaming so far: it appears to benefit my smaller, highly optimised kernels more than the stock-ish larger sizes (don't know why yet, and dividing further into additional streams seems to slow it down again ... tricky!):

As with test #9 (single stream)
Quote
Opt1 (worst case): 256 thrds/block, 1 x 1048576 element streams
 FFT+PS+SM(     8)   19.2 GFlops   33.8 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   36.8 GFlops   50.3 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   60.7 GFlops   67.8 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   86.2 GFlops   81.6 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   92.5 GFlops   76.1 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  135.0 GFlops   98.3 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  172.0 GFlops  112.4 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  214.7 GFlops  127.3 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  225.9 GFlops  122.6 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  232.3 GFlops  116.2 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  226.0 GFlops  104.8 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  221.5 GFlops   95.8 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  213.1 GFlops   86.3 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  210.5 GFlops   80.2 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  202.6 GFlops   72.8 GB/s  ulps(fft  2.7,ps 5392.8) [OK]

2x streams:
Quote
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
 FFT+PS+SM(     8)   26.7 GFlops   47.2 GB/s  ulps(fft  1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   66.9 GFlops   91.3 GB/s  ulps(fft  1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   90.9 GFlops  101.5 GB/s  ulps(fft  1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)  105.0 GFlops   99.4 GB/s  ulps(fft  1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   94.0 GFlops   77.3 GB/s  ulps(fft  1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  135.9 GFlops   98.9 GB/s  ulps(fft  1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  167.9 GFlops  109.7 GB/s  ulps(fft  1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  198.4 GFlops  117.6 GB/s  ulps(fft  2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  209.1 GFlops  113.4 GB/s  ulps(fft  2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  209.9 GFlops  105.0 GB/s  ulps(fft  2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  204.8 GFlops   95.0 GB/s  ulps(fft  2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  205.0 GFlops   88.6 GB/s  ulps(fft  2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  187.5 GFlops   75.9 GB/s  ulps(fft  2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  195.2 GFlops   74.4 GB/s  ulps(fft  2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  172.5 GFlops   62.0 GB/s  ulps(fft  2.7,ps 5392.8) [OK]
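
One rough way to read the two tables above side by side, as a sketch only. The GFlops figures are copied from the quoted runs; the helper names `stream_ratios` and `first_size_without_gain` are made up for illustration and are not part of the test program.

```python
# GFlops per FFT size, copied from the two quoted Opt1 runs above.
SIZES = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
         8192, 16384, 32768, 65536, 131072]
ONE_STREAM = [19.2, 36.8, 60.7, 86.2, 92.5, 135.0, 172.0, 214.7,
              225.9, 232.3, 226.0, 221.5, 213.1, 210.5, 202.6]
TWO_STREAMS = [26.7, 66.9, 90.9, 105.0, 94.0, 135.9, 167.9, 198.4,
               209.1, 209.9, 204.8, 205.0, 187.5, 195.2, 172.5]


def stream_ratios(one, two):
    """Two-stream throughput divided by single-stream throughput, per size."""
    return [t / o for o, t in zip(one, two)]


def first_size_without_gain(sizes, ratios):
    """Hypothetical helper: smallest FFT size where two streams are not faster."""
    for size, ratio in zip(sizes, ratios):
        if ratio <= 1.0:
            return size
    return None


if __name__ == "__main__":
    ratios = stream_ratios(ONE_STREAM, TWO_STREAMS)
    for size, ratio in zip(SIZES, ratios):
        print(f"FFT size {size:>6}: 2 streams / 1 stream = {ratio:.2f}x")
    print("two streams stop helping from size",
          first_size_without_gain(SIZES, ratios))
```

With these particular figures the gain is largest at the small sizes and the cross-over lands at size 512, which lines up with the "revert to single stream from size 512" line that test #10 prints on 256-thread cards.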

Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 12:13:36 pm
Updated first Post:
Quote
Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- can vary, so may need a few runs, just to check stability of result
- Please use DLLs provided with Test#9
Title: Re: [Split] PowerSpectrum Unit Test
Post by: arkayn on 26 Dec 2010, 12:33:43 pm
Code: [Select]
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   67.27) Peak(  111.28) Min(    9.42) [OK]
   Memory thoughput GB/s   Avg(   36.72) Peak(   55.70) Min(   15.41)


Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   84.36, 1.25x) Peak(  131.47, 1.18x) Min(   31.13, 3.30x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   51.22, 1.39x) Peak(   66.16, 1.19x) Min(   34.18, 2.22x)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 12:39:51 pm
Cheers,
  BTW: average roughly represents overall improvement, Peak represents speed change in the fastest Kernels, and Min is the speed change in the slowest Kernels ... So I regard 'Avg' & 'Min' as most important, with Peak being mostly just a possible indicator of remaining headroom.

[Edit:] Similarish looking deal with the 480
Code: [Select]
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(  104.34) Peak(  157.97) Min(   12.79) [OK]
   Memory thoughput GB/s   Avg(   57.34) Peak(   82.25) Min(   22.55)


Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(  162.15, 1.55x) Peak(  232.02, 1.47x) Min(   26.47, 2.07x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   95.38, 1.66x) Peak(  127.32, 1.55x) Min(   46.67, 2.07x)
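
For anyone wanting to check the multipliers by hand, here is a minimal sketch. It assumes the "x" figures are simply the Opt1 value divided by the stock value for the same statistic; that is inferred from the numbers in the GTX 480 run above, not taken from the test's source. The `headroom` helper is a hypothetical indicator, not something the test prints.

```python
# Compute throughput summary (GFlops), copied from the GTX 480 run above.
STOCK = {"Avg": 104.34, "Peak": 157.97, "Min": 12.79}
OPT1 = {"Avg": 162.15, "Peak": 232.02, "Min": 26.47}


def speedups(stock, opt):
    """Ratio of optimised to stock throughput for each summary statistic."""
    return {name: opt[name] / stock[name] for name in stock}


def headroom(summary):
    """Assumed indicator: fraction by which the average sits below the peak."""
    return 1.0 - summary["Avg"] / summary["Peak"]


if __name__ == "__main__":
    for name, ratio in speedups(STOCK, OPT1).items():
        print(f"{name}: {ratio:.2f}x")
    print(f"Opt1 average is {headroom(OPT1):.0%} below its peak")
```

The three ratios reproduce the 1.55x, 1.47x and 2.07x printed by the test, and the gap between Avg and Peak gives a rough number for the "remaining headroom" reading of Peak.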
Title: Re: [Split] PowerSpectrum Unit Test
Post by: _heinz on 26 Dec 2010, 01:03:59 pm
Hi Jason,
new results from Test10
~~~~~~~~~~~~~~~
PowerSpectrumTest10.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   82.93) Peak(  130.76) Min(   12.00) [OK]
   Memory thoughput GB/s   Avg(   46.20) Peak(   64.10) Min(   21.16)


Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(  125.13, 1.51x) Peak(  178.98, 1.37x) Min(   37.50, 3.12x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   75.48, 1.63x) Peak(   95.64, 1.49x) Min(   52.23, 2.47x)


PowerSpectrumTest10.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   80.74) Peak(  126.77) Min(   11.69) [OK]
   Memory thoughput GB/s   Avg(   44.99) Peak(   59.75) Min(   20.61)


Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(  125.57, 1.56x) Peak(  179.89, 1.42x) Min(   37.72, 3.23x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   75.75, 1.68x) Peak(   95.76, 1.60x) Min(   52.48, 2.55x)


.
Done
PowerSpectrumTest10.exe -device 0

Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(    4.38) Peak(    6.24) Min(    1.31) [OK]
   Memory thoughput GB/s   Avg(    2.66) Peak(    3.97) Min(    1.80)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(    4.86, 1.11x) Peak(    6.64, 1.06x) Min(    1.86, 1.41x) [OK]
   Memory thoughput [GB/s]   -
      Avg(    3.08, 1.16x) Peak(    4.29, 1.08x) Min(    2.10, 1.17x)


.
Done
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 01:07:00 pm
Works on ION, YaY!  :)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 26 Dec 2010, 01:20:11 pm
On my 128Mb 8400M GS:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(    4.07) Peak(    5.64) Min(    1.19) [OK]
   Memory thoughput GB/s   Avg(    2.44) Peak(    3.69) Min(    1.51)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(    4.30, 1.06x) Peak(    5.78, 1.03x) Min(    1.68, 1.41x) [OK]
   Memory thoughput [GB/s]   -
      Avg(    2.70, 1.11x) Peak(    3.78, 1.03x) Min(    1.90, 1.26x)


Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 01:25:11 pm
On my 128Mb 8400M GS:

Works on that too  :D,  looks like we've managed to max that one out  ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Ghost0210 on 26 Dec 2010, 01:25:30 pm
And My 465:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   69.41) Peak(  104.12) Min(   10.56) [OK]
   Memory thoughput GB/s   Avg(   38.49) Peak(   54.71) Min(   18.61)


Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(  101.54, 1.46x) Peak(  140.32, 1.35x) Min(   36.67, 3.47x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   61.36, 1.59x) Peak(   78.16, 1.43x) Min(   46.65, 2.51x)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: perryjay on 26 Dec 2010, 01:26:47 pm
Okay, I remembered to stop BOINC this time....
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd\test

C:\test>powerspectrumtest10.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   11.40) Peak(   17.48) Min(    2.91) [OK]
   Memory thoughput GB/s   Avg(    6.85) Peak(    9.86) Min(    4.95)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   12.35, 1.08x) Peak(   18.33, 1.05x) Min(    4.45, 1.53x) [OK]
   Memory thoughput [GB/s]   -
      Avg(    7.76, 1.13x) Peak(   10.33, 1.05x) Min(    5.14, 1.04x)



C:\test>
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 01:34:40 pm
And My 465:

and
Okay, I remembered to stop BOINC this time....
...
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
...

Thanks both! Still some breathing room between avg & peak on those.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 26 Dec 2010, 02:11:17 pm

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(  114.30) Peak(  169.79) Min(   21.35) [OK]
   Memory thoughput GB/s   Avg(   64.38) Peak(   89.45) Min(   34.20)


Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(  165.56, 1.45x) Peak(  234.17, 1.38x) Min(   61.06, 2.86x) [OK]
   Memory thoughput [GB/s]   -
      Avg(  100.82, 1.57x) Peak(  126.77, 1.42x) Min(   70.89, 2.07x)

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 26 Dec 2010, 02:21:14 pm
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
...
  Compute thoughput [GFlops] -
      Avg(  165.56, 1.45x) Peak(  234.17, 1.38x) Min(   61.06, 2.86x) [OK]

Winning! (just ;))  Glad you're on water cooling with those, My fan cranks up with that and creates a vortex in my room :D.

It made me think '1.21 GigaWatts!' (http://www.youtube.com/watch?v=mjCRUvX2D0E).   I'll be checking out & researching on water cooling the 480 here,  sometime in the new year.  Starting with the basics with guides like This one (http://www.clunk.org.uk/forums/water-cooling/33772-water-cooling-guide-beginners.html),  & doing my homework.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: SciManStev on 26 Dec 2010, 03:22:39 pm
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
...
  Compute thoughput [GFlops] -
      Avg(  165.56, 1.45x) Peak(  234.17, 1.38x) Min(   61.06, 2.86x) [OK]

Winning! (just ;))  Glad you're on water cooling with those, My fan cranks up with that and creates a vortex in my room :D.

It made me think '1.21 GigaWatts!' (http://www.youtube.com/watch?v=mjCRUvX2D0E).   I'll be checking out & researching on water cooling the 480 here,  sometime in the new year.  Starting with the basics with guides like This one (http://www.clunk.org.uk/forums/water-cooling/33772-water-cooling-guide-beginners.html),  & doing my homework.


With all the help you have given others, I would be happy to offer any assistance I could should you choose to go with water cooling. There is a lot in my system Tuning thread in NC you might find interesting. System Tuning (http://setiathome.berkeley.edu/forum_thread.php?id=62406&nowrap=true#1059367)

Steve
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 26 Dec 2010, 06:22:13 pm
Q6600/8GB/8800GTX.

One remark though: if you want to run a test multiple times, why not do that in the download-able executable? I don't mind if a benchmark of yours runs several minutes on my rig, so just do a few test-runs, determine the max/min and standard-deviation or something and output that?

I have in any case run the benchmark 3 times on both OS versions, before running a 4th one redirected to a text-file (and compared that one too). Results and speed-ups looked stable to my 'naked' eye.
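
A minimal sketch of the kind of summary suggested above (min, max and standard deviation over several results), done outside the test executable. The function name `summarise_runs` is hypothetical. For lack of per-run figures, the example feeds it the two Opt1 compute 'Avg' values from the WinXP-32 and Win7-64 runs below, so it shows spread across OS versions rather than true run-to-run variation.

```python
import statistics


def summarise_runs(values):
    """Return min, max, mean and sample standard deviation of repeated results."""
    if len(values) < 2:
        raise ValueError("need at least two results to estimate spread")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


if __name__ == "__main__":
    # Opt1 compute Avg (GFlops) on the 8800 GTX: WinXP-32, then Win7-64.
    opt1_avg = [55.01, 54.49]
    summary = summarise_runs(opt1_avg)
    print(", ".join(f"{key}={value:.2f}" for key, value in summary.items()))
```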

WinXP-32:

Code: [Select]
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   51.45) Peak(   72.63) Min(    9.33) [OK]
   Memory thoughput GB/s   Avg(   30.07) Peak(   47.47) Min(   16.45)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   55.01, 1.07x) Peak(   75.98, 1.05x) Min(   13.89, 1.49x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   33.46, 1.11x) Peak(   49.65, 1.05x) Min(   24.23, 1.47x)

Win7-64:

Code: [Select]
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   45.04) Peak(   62.72) Min(    8.62) [OK]
   Memory thoughput GB/s   Avg(   26.39) Peak(   40.07) Min(   15.21)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   54.49, 1.21x) Peak(   75.17, 1.20x) Min(   13.75, 1.59x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   33.12, 1.26x) Peak(   49.13, 1.23x) Min(   24.07, 1.58x)

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: MarkJ on 27 Dec 2010, 12:17:59 am
Did a few runs for test #10 on different cards/machines...

Cheers,
MarkJ

-------------------------------------------------

Device: GeForce GT 240, 1340 MHz clock, 475 MB memory.
Compute capability 1.2
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   32.78) Peak(   48.81) Min(    8.49) [OK]
   Memory thoughput GB/s   Avg(   19.49) Peak(   28.94) Min(   12.38)

Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   35.66, 1.09x) Peak(   51.41, 1.05x) Min(   12.84, 1.51x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   22.13, 1.14x) Peak(   30.48, 1.05x) Min(   15.22, 1.23x)

------------------------------------------------------------

Device: GeForce GTX 460, 1350 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   62.95) Peak(  102.88) Min(    8.18) [OK]
   Memory thoughput GB/s   Avg(   34.05) Peak(   52.16) Min(   13.33)

Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   79.87, 1.27x) Peak(  121.17, 1.18x) Min(   23.84, 2.91x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   47.79, 1.40x) Peak(   63.10, 1.21x) Min(   33.50, 2.51x)

-----------------------------------------------------------

Device: GeForce GTX 570, 1464 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(  101.46) Peak(  151.95) Min(   20.02) [OK]
   Memory thoughput GB/s   Avg(   57.48) Peak(   79.89) Min(   30.85)

Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
  revert to single stream from size 512
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(  139.93, 1.38x) Peak(  199.62, 1.31x) Min(   51.29, 2.56x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   85.24, 1.48x) Peak(  106.89, 1.34x) Min(   58.81, 1.91x)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Dec 2010, 02:32:59 am
Q6600/8GB/8800GTX.

One remark though: if you want to run a test multiple times, why not do that in the download-able executable? I don't mind if a benchmark of yours runs several minutes on my rig, so just do a few test-runs, determine the max/min and standard-deviation or something and output that?

I have in any case run the benchmark 3 times on both OS versions, before running a 4th one redirected to a text-file (and compared that one too). Results and speed-ups looked stable to my 'naked' eye.


Cheers & No worries Patrick,
     Just wasn't sure extending the test was going to be needed.  Naked eye judgement is plenty for the purposes of testing scientific repeatability here, and running multiple times in the same exe would make it one large test rather than several small ones for comparison (if that makes any sense).  I'm happy that the 8800 seems to have some headroom left, and the 'Min' numbers indicate the slowest kernels have received a nice boost.

Win7(WDDM) & XP(XPDM) driver model performance difference is 'gone'  ;D

Secondary confirmation from a friend's 8800GTS:
XP32
Code: [Select]
Device: GeForce 8800 GTS 512, 1625 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   44.40) Peak(   66.68) Min(    7.85) [OK]
   Memory thoughput GB/s   Avg(   26.26) Peak(   41.19) Min(   13.83)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   47.57, 1.07x) Peak(   67.80, 1.02x) Min(   17.37, 2.21x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   30.04, 1.14x) Peak(   41.89, 1.02x) Min(   19.00, 1.37x)
Win7-32
Code: [Select]
Device: GeForce 8800 GTS 512, 1625 MHz clock, 500 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   40.57) Peak(   57.91) Min(    7.32) [OK]
   Memory thoughput GB/s   Avg(   23.86) Peak(   35.82) Min(   12.91)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   48.43, 1.19x) Peak(   66.67, 1.15x) Min(   15.87, 2.17x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   30.30, 1.27x) Peak(   41.94, 1.17x) Min(   20.41, 1.58x)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Dec 2010, 02:45:28 am
Did a few runs for test #10 on different cards/machines...

Cheers,
MarkJ

Thanks Mark! Starting to make a dent with the stubborn 240, and the Fermi boosts looking healthy.
I will need to get to checking the 260 in the other room soon, then we should have 'the full set'

[Later:] Here 'tis
Quote
Device: GeForce GTX 260, 1242 MHz clock, 896 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   62.64) Peak(   93.36) Min(    4.48) [OK]
   Memory thoughput GB/s   Avg(   34.47) Peak(   52.71) Min(    7.89)


Opt1 (worst case): 128 thrds/block, 2 x 524288 element streams
  revert to single stream from size 256
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   67.78, 1.08x) Peak(   95.96, 1.03x) Min(    5.69, 1.27x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   38.80, 1.13x) Peak(   55.48, 1.05x) Min(   10.03, 1.27x)
Maybe still some headroom on 200 series as well.
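
With the 260 in, the thread counts reported across the set follow the rule from the first post (compute capability 1.0-1.2 = 64 threads, 1.3 = 128, 2.0+ = 256). A small sketch of that selection follows. The function names are hypothetical, and the "twice the thread count" rule for the single-stream fallback is only an observation from the outputs posted here (128, 256 and 512), not a documented behaviour.

```python
def threads_per_block(major, minor):
    """Thread count per block by compute capability, per the first post."""
    if (major, minor) >= (2, 0):
        return 256
    if (major, minor) >= (1, 3):
        return 128
    return 64


def single_stream_from(major, minor):
    """Observed in the posted outputs only: fallback size = 2 x thread count."""
    return 2 * threads_per_block(major, minor)


if __name__ == "__main__":
    # Cards and compute capabilities as reported in this thread.
    cards = {"8800 GTX": (1, 0), "GT 240": (1, 2),
             "GTX 260": (1, 3), "GTX 460": (2, 1)}
    for name, cc in cards.items():
        print(f"{name}: {threads_per_block(*cc)} thrds/block, "
              f"revert to single stream from size {single_stream_from(*cc)}")
```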
Title: Re: [Split] PowerSpectrum Unit Test
Post by: PatrickV2 on 27 Dec 2010, 08:04:57 am
Cheers & No worries Patrick,
     Just wasn't sure extending the test was going to be needed.  Naked eye judgement is plenty for the purposes of testing scientific repeatability here, and running multiple times in the same exe would make it one large test rather than several small ones for comparison (if that makes any sense).  I'm happy that the 8800 seems to have some headroom left, and the 'Min' numbers indicate the slowest kernels have received a nice boost.

Win7(WDDM) & XP(XPDM) driver model performance difference is 'gone'  ;D

Thanks for the extended explanation; my remark was merely prompted by curiosity (and probably a large lack of understanding of the underlying higher goals), but I feel more enlightened now. ;)

Regards, Patrick.
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 27 Dec 2010, 08:25:28 am
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(    9.58) Peak(   13.91) Min(    2.48) [OK]
   Memory thoughput GB/s   Avg(    5.70) Peak(    9.09) Min(    3.53)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   11.23, 1.17x) Peak(   15.13, 1.09x) Min(    4.27, 1.72x) [OK]
   Memory thoughput [GB/s]   -
      Avg(    6.99, 1.23x) Peak(    9.88, 1.09x) Min(    5.01, 1.42x)


values roughly +- .3 on stock and +- .1 on opt1

[edit] compute speedup 1.56x - 1.76x, memory speedup 1.22x - 1.47x
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Dec 2010, 08:27:47 am
Thanks for the extended explanation; my remark was merely prompted by curiosity (and probably a large lack of understanding of the underlying higher goals), but I feel more enlightened now. ;)

Yeah, a bit more info along those lines: the actual kernels under test run in timing loops set to roughly half a second, which is enough for ~thousands to millions of runs, so I was expecting 'fair' stability in the Avg, Peak & Min values.  So we are alright for discrete kernel performance measurements.
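
The idea of a fixed-time timing loop can be sketched as below. This is a host-side Python stand-in, not the test's actual CUDA code: `time_kernel`, `gflops` and the dummy workload are all assumptions made for illustration, and a real GPU measurement would also need to wait for the device to finish before reading the clock.

```python
import time


def time_kernel(kernel, budget_s=0.5):
    """Call `kernel` repeatedly until at least `budget_s` seconds have elapsed.

    Returns the number of runs completed and the measured elapsed time.
    """
    runs = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < budget_s:
        kernel()
        runs += 1
        elapsed = time.perf_counter() - start
    return runs, elapsed


def gflops(flops_per_run, runs, elapsed_s):
    """Average throughput in GFlops over all timed runs."""
    return flops_per_run * runs / elapsed_s / 1e9


if __name__ == "__main__":
    # Dummy CPU workload standing in for one kernel launch.
    runs, elapsed = time_kernel(lambda: sum(range(1000)), budget_s=0.5)
    print(f"{runs} runs in {elapsed:.3f} s")
```

Because the loop is bounded by time rather than by a run count, a fast kernel is simply averaged over more repetitions, which is what makes a single pass reasonably stable.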

I have however picked up an interesting thing on a friend's i7-860 w/GTX480 in comparing against mine (45 nm Core 2 w/GTX480)
-  His Peaks & Averages are ~same as mine for the same clockrate ... BUT ... the 'Min' (slowest kernels) figures are several times faster ... Better CPU & RAM does have a significant impact on the running of the toughest parts of the code, it seems

Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 27 Dec 2010, 08:29:00 am
values roughly +- .3 on stock and +- .1 on opt1

Hey that's decent! ... and there you were going to start a riot when initial mods yielded about 5% slowdown on yours ... tsk tsk tsk  ;D
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Miep on 27 Dec 2010, 08:51:17 am
Hey that's decent! ... and there you were going to start a riot when initial mods yielded about 5% slowdown on yours ... tsk tsk tsk  ;D

Oh I just learned how to complain when not suffering ;D did the trick didn't it? ;)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Claggy on 28 Dec 2010, 08:52:57 am
My 9800GTX+ on Win 7 x64:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   49.81) Peak(   71.73) Min(    8.11) [OK]
   Memory thoughput GB/s   Avg(   29.08) Peak(   44.80) Min(   14.31)


Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
  revert to single stream from size 128
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   57.66, 1.16x) Peak(   80.19, 1.12x) Min(   18.07, 2.23x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   35.80, 1.23x) Peak(   50.46, 1.13x) Min(   24.47, 1.71x)


Claggy
Title: Re: [Split] PowerSpectrum Unit Test
Post by: glennaxl on 04 Jan 2011, 01:53:38 am
Device: GeForce GTX 260, 1441 MHz clock, 869 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
  Processing... Done!
  Compute Thoughput GFlops Avg(   47.55) Peak(   65.20) Min(   10.16) [OK]
   Memory thoughput GB/s   Avg(   28.09) Peak(   37.12) Min(   17.92)


Opt1 (worst case): 128 thrds/block, 2 x 524288 element streams
  revert to single stream from size 256
  Processing... Done!
  Compute thoughput [GFlops] -
      Avg(   84.83, 1.78x) Peak(  111.50, 1.71x) Min(   31.57, 3.11x) [OK]
   Memory thoughput [GB/s]   -
      Avg(   52.63, 1.87x) Peak(   67.26, 1.81x) Min(   36.24, 2.02x)
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 04 Jan 2011, 03:14:52 pm
Thanks both!

@glennaxl: that's some impressive speedup on the GTX 260; I'll have to look at that carefully on mine when I get a chance.

@Claggy: average at roughly 3/4 of peak seems pretty good, but I think we can maybe get some more.
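For anyone wanting to check the ratios against the figures posted above, here is a small Python sketch. The numbers are copied from Claggy's 9800GTX+ compute results; the helper names `speedups` and `avg_to_peak` are made up for illustration and are not part of the unit test.

```python
# Compute throughput figures (GFlops) copied from Claggy's 9800GTX+ run above.
stock = {"avg": 49.81, "peak": 71.73, "min": 8.11}
opt1 = {"avg": 57.66, "peak": 80.19, "min": 18.07}

def speedups(base, new):
    """Ratio of new to base throughput per statistic, rounded to 2 places."""
    return {k: round(new[k] / base[k], 2) for k in base}

def avg_to_peak(run):
    """How close the average throughput sits to the peak, as a fraction."""
    return run["avg"] / run["peak"]

if __name__ == "__main__":
    print(speedups(stock, opt1))         # {'avg': 1.16, 'peak': 1.12, 'min': 2.23}
    print(round(avg_to_peak(opt1), 2))   # 0.72
```

The speedup factors agree with the "x" values the test printed, and the Opt1 average works out to about 0.72 of peak.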

@ALL, Thanks! I'm closing this test for now.  It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular). 

FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here.  Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come  :)

[Locking thread, Please stay tuned for further Unit Tests!]
Jason
Title: Re: [Split] PowerSpectrum Unit Test
Post by: Jason G on 10 Jan 2011, 06:28:14 am
@All:
   Just a note that the concerns that arose, and distracted me from testing & development along this line, have now been at least partially resolved, and don't require any immediate action on our part. I'm back to ruggedising & integrating what we've accomplished here into the X-builds, and plan to start devising tests for PoT (Power over Time) processing refinement soon, in similar fashion to this thread. PoT processing covers Gaussian searches and Triplet & Pulse finding, all of which have known issues to address in every CUDA release so far, so there'll be plenty of tests to devise & collect data for yet.

Cheers once again!  :)
Jason