-
.... Is there a CUDA 3.2 app available yet for alpha testing, just to see where the dividing line really is?
No, but I was just playing with a power spectrum kernel unit test built with the 3.2 Release that could be sufficient to see which drivers work with 3.2 Release and which don't (I expect a minimum of 260.99 is fine). The kernels are all hard-coded, so no speed difference should be evident between drivers.
[ PowerSpectrum Unit Test attached, the provided DLL must be present when executed at a command prompt. ]
Jason
[Edit:] Confirmed: requires driver 260.89+. [Mod] Split off driver thread
[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1: Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-fermi precision
Mod2: Fixed, but sadly is slow now, remains at stock accuracy
Mod3: As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)
[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidied up & ironed out a bug that only manifests on Arkayn's card so far :o. Could be a smidgen faster.
[Updated] to PowerSpectrum Unit Test #5
Single fftlen size (64), 1M-point powerspectrum with summax reduction, to test a number of experimental features (please check):
- Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
- Opt1 best & worst cases likely to occur in real life are tested; the worst case should indicate ~same as stock to ~30% improvement (depending on GPU), the best case ~1.3-2x stock throughput (depending on GPU etc). Worst case results are checked for accuracy & flagged if there's a problem.
- On Integrated GPUs, use mapped/pinned host memory, so on those worst case should be ~= best case ( and hopefully some margin better than the stock reduction :-\)
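As a rough host-side sketch of the capability-based threadcount selection described above (the function name and structure are my own illustration, not the unit test's internals; the real code would read major/minor from cudaGetDeviceProperties()):

```c
#include <assert.h>

/* Map CUDA compute capability (major.minor) to the powerspectrum
 * blocksize listed above: 1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256.
 * Hypothetical helper, not the actual unit-test code. */
static int ps_threads_per_block(int major, int minor)
{
    if (major >= 2)
        return 256;                 /* Fermi and later      */
    if (major == 1 && minor >= 3)
        return 128;                 /* GT200 class (cc 1.3) */
    return 64;                      /* cc 1.0 - 1.2         */
}
```

In the real test this choice would presumably be made once at startup, after querying the device.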
Example output (important numbers highlighted: Stock, Opt1)
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.1 GB/s 0.0ulps
SumMax ( 64) 1.8 GFlops 7.4 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 5.9 GFlops 24.1 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.3 GFlops 177.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.2 GB/s 121.7ulps
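For anyone wondering what the 'summax' stage above does: it reduces each ifft's power spectrum to an average and a peak, which the test then validates ('Every ifft average & peak OK'). A minimal serial CPU-side sketch of that reduction (my own illustration, not the GPU kernel, which does this per block with a parallel reduction):

```c
#include <assert.h>

/* Reduce one ifft's power spectrum (len points) to its mean and
 * peak. Serial form, just to show the arithmetic being checked. */
static void summax(const float *power, int len, float *avg, float *peak)
{
    float sum = 0.0f, max = power[0];
    for (int i = 0; i < len; i++) {
        sum += power[i];
        if (power[i] > max)
            max = power[i];
    }
    *avg  = sum / (float)len;
    *peak = max;
}
```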
Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither, Test5 remains for comparison)
Update: PowerSpectrum(+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little, should be a tad faster for size 64 than it was in the prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- minimal changes to larger sizes; sizes larger than the selected thrds/blk are 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation all [OK] & no Opt1 'worst case' slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- with all sizes included it's a longer run, and may require several runs to see if a '[BAD]' will manifest.
Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only
Temporary download location(s):
fast: http://www.arkayn.us/seti/PowerSpectrumTest9.7z
slow: ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z
Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- can vary, so may need a few runs, just to check stability of result
- Please use DLLs provided with Test#9
Update: @ALL, Thanks! I'm closing this test for now. It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular).
FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here. Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come :)
[Locking thread, Please stay tuned for further Unit Tests!]
-
How do I run this?
I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.
-
What driver ?
-
How do I run this?
I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.
updating from 258.96 to 260.99 solved that hiccup for me
-
And I've just checked that 260.89 is good enough, too.
-
As a side effect, I'm accumulating a good collection of data that tells me a lot about the different GPU memory subsystems, on different generations (Powerspectrum is a 'memory bound' computation). Will split this off into its own thread a bit later [Done]
[Edit] In our own thread now, feel free to post results here, attach, or PM. I'm getting some very handy information to use this weekend, toward optimisation strategies.
Will try and make some sort of table up once I make some sense of the data.
-
Device: GeForce GT 240, 1340 MHz clock, 512 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 10.1 GFlops 4.0 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 8.6 GFlops 3.4 GB/s 1183.3ulps
64 threads: 10.1 GFlops 4.1 GB/s 1183.3ulps
128 threads: 10.1 GFlops 4.0 GB/s 1183.3ulps
256 threads: 10.1 GFlops 4.0 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 3.4 GFlops 1.3 GB/s 1183.3ulps
64 threads: 4.5 GFlops 1.8 GB/s 1183.3ulps
128 threads: 4.5 GFlops 1.8 GB/s 1183.3ulps
256 threads: 4.4 GFlops 1.8 GB/s 1183.3ulps
-
A couple more datapoints, from Windows 7. The 'AMP' in the card model name says it's a factory overclock version.
-
Thanks both. Those are the 'stubborn' cards ;)
-
I can't speak for Richard but my card is as stubborn as I am ;D
-
**********
-device 0
**********
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 18.6 GFlops 7.4 GB/s 1183.3ulps
64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps
128 threads: 26.7 GFlops 10.7 GB/s 1183.3ulps
256 threads: 26.7 GFlops 10.7 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 5.3 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.2 GFlops 2.9 GB/s 1183.3ulps
128 threads: 10.6 GFlops 4.2 GB/s 1183.3ulps
256 threads: 10.7 GFlops 4.3 GB/s 1183.3ulps
**********
-device 1
**********
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 25.8 GFlops 10.3 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 17.9 GFlops 7.2 GB/s 1183.3ulps
64 threads: 26.0 GFlops 10.4 GB/s 1183.3ulps
128 threads: 26.1 GFlops 10.4 GB/s 1183.3ulps
256 threads: 24.6 GFlops 9.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 5.2 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.1 GFlops 2.8 GB/s 1183.3ulps
128 threads: 10.3 GFlops 4.1 GB/s 1183.3ulps
256 threads: 10.6 GFlops 4.2 GB/s 1183.3ulps
**********
-device 2
**********
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 25.4 GFlops 10.2 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 18.7 GFlops 7.5 GB/s 1183.3ulps
64 threads: 25.6 GFlops 10.2 GB/s 1183.3ulps
128 threads: 25.9 GFlops 10.4 GB/s 1183.3ulps
256 threads: 25.9 GFlops 10.4 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 5.2 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.0 GFlops 2.8 GB/s 1183.3ulps
128 threads: 10.3 GFlops 4.1 GB/s 1183.3ulps
256 threads: 10.4 GFlops 4.1 GB/s 1183.3ulps
-
Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480. That's something to investigate. Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 29.1 GFlops 11.6 GB/s 0.0ulps
GetPowerSpectrum() mod 1:
32 threads: 17.6 GFlops 7.1 GB/s 0.0ulps
64 threads: 28.9 GFlops 11.6 GB/s 0.0ulps
128 threads: 40.5 GFlops 16.2 GB/s 0.0ulps
256 threads: 44.0 GFlops 17.6 GB/s 0.0ulps
GetPowerSpectrum() mod 2:
32 threads: 19.3 GFlops 7.7 GB/s 0.0ulps
64 threads: 38.0 GFlops 15.2 GB/s 0.0ulps
128 threads: 61.1 GFlops 24.5 GB/s 0.0ulps
256 threads: 61.4 GFlops 24.6 GB/s 0.0ulps
-
Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480. That's something to investigate. Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]
My bad. FAH was running in background. Edited my post with new results.
-
My bad. FAH was running in background. Edited my post with new results.
Ahh, cheers & LoL... I'm wondering why mod2 doesn't appear to work on those. ([Later:] ah, probably some shared memory bank conflicts or such, will read into that. )
-
And on my 465 with 260.99 drivers:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 16.0 GFlops 6.4 GB/s 0.0ulps
GetPowerSpectrum() mod 1:
32 threads: 9.8 GFlops 3.9 GB/s 0.0ulps
64 threads: 15.9 GFlops 6.3 GB/s 0.0ulps
128 threads: 20.9 GFlops 8.3 GB/s 0.0ulps
256 threads: 23.1 GFlops 9.2 GB/s 0.0ulps
GetPowerSpectrum() mod 2:
32 threads: 14.4 GFlops 5.8 GB/s 0.0ulps
64 threads: 28.4 GFlops 11.4 GB/s 0.0ulps
128 threads: 33.5 GFlops 13.4 GB/s 0.0ulps
256 threads: 32.8 GFlops 13.1 GB/s 0.0ulps
-
Not sure if you're looking for this, but below my results on my 8800GTX, 260.99 drivers:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 17.8 GFlops 7.1 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 14.2 GFlops 5.7 GB/s 1183.3ulps
64 threads: 17.8 GFlops 7.1 GB/s 1183.3ulps
128 threads: 17.8 GFlops 7.1 GB/s 1183.3ulps
256 threads: 17.6 GFlops 7.0 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 6.8 GFlops 2.7 GB/s 1183.3ulps
64 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
128 threads: 9.1 GFlops 3.7 GB/s 1183.3ulps
256 threads: 8.0 GFlops 3.2 GB/s 1183.3ulps
Regards, Patrick.
-
starting PowerSpectrum2
.
-device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 20.6 GFlops 8.2 GB/s 0.0ulps
GetPowerSpectrum() mod 1:
32 threads: 12.5 GFlops 5.0 GB/s 0.0ulps
64 threads: 20.5 GFlops 8.2 GB/s 0.0ulps
128 threads: 27.6 GFlops 11.0 GB/s 0.0ulps
256 threads: 29.9 GFlops 12.0 GB/s 0.0ulps
GetPowerSpectrum() mod 2:
32 threads: 14.4 GFlops 5.8 GB/s 0.0ulps
64 threads: 28.3 GFlops 11.3 GB/s 0.0ulps
128 threads: 42.4 GFlops 16.9 GB/s 0.0ulps
256 threads: 42.5 GFlops 17.0 GB/s 0.0ulps
-device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 20.6 GFlops 8.3 GB/s 0.0ulps
GetPowerSpectrum() mod 1:
32 threads: 12.6 GFlops 5.0 GB/s 0.0ulps
64 threads: 20.5 GFlops 8.2 GB/s 0.0ulps
128 threads: 27.5 GFlops 11.0 GB/s 0.0ulps
256 threads: 30.1 GFlops 12.0 GB/s 0.0ulps
GetPowerSpectrum() mod 2:
32 threads: 14.4 GFlops 5.8 GB/s 0.0ulps
64 threads: 28.4 GFlops 11.4 GB/s 0.0ulps
128 threads: 42.2 GFlops 16.9 GB/s 0.0ulps
256 threads: 41.1 GFlops 16.4 GB/s 0.0ulps
.
Done
modify:
@Jason, wondering why you get 20 GFlops more with 256 threads than my GTX 470.
Do you have source for me to compile with the 2011XE compiler?
-
I tried running it on my 460 but the program always crashes on the end of 128/beginning of 256 threads in mod 2.
Never see any results.
-
Here's my 9800GTX+ result, like Richard's 9800GTX+ it's a factory overclocked example, but by XFX:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 16.1 GFlops 6.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 15.1 GFlops 6.1 GB/s 1183.3ulps
64 threads: 16.1 GFlops 6.5 GB/s 1183.3ulps
128 threads: 16.0 GFlops 6.4 GB/s 1183.3ulps
256 threads: 15.9 GFlops 6.3 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
64 threads: 8.2 GFlops 3.3 GB/s 1183.3ulps
128 threads: 8.3 GFlops 3.3 GB/s 1183.3ulps
256 threads: 8.1 GFlops 3.2 GB/s 1183.3ulps
Claggy
-
Here's my 128Mb 8400M GS's result, while it's not got enough RAM for Seti, it at least gives you some figures for very slow GPU's:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 1.2 GFlops 0.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 1.2 GFlops 0.5 GB/s 1183.3ulps
64 threads: 1.2 GFlops 0.5 GB/s 1183.3ulps
128 threads: 1.2 GFlops 0.5 GB/s 1183.3ulps
256 threads: 1.2 GFlops 0.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
64 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
128 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
256 threads: 0.6 GFlops 0.2 GB/s 1183.3ulps
Claggy
-
run it twice on the ION
~~~~~~~~~~~~~~~~~
starting PowerSpectrum2
.
Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 1.3 GFlops 0.5 GB/s 1183.3ulps
64 threads: 1.9 GFlops 0.7 GB/s 1183.3ulps
128 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
256 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 1.0 GFlops 0.4 GB/s 1183.3ulps
64 threads: 1.0 GFlops 0.4 GB/s 1183.3ulps
128 threads: 0.9 GFlops 0.4 GB/s 1183.3ulps
256 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
Device: ION, 1100 MHz clock, 242 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 1.3 GFlops 0.5 GB/s 1183.3ulps
64 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
128 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
256 threads: 1.9 GFlops 0.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 1.0 GFlops 0.4 GB/s 1183.3ulps
64 threads: 1.0 GFlops 0.4 GB/s 1183.3ulps
128 threads: 0.9 GFlops 0.4 GB/s 1183.3ulps
256 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
.
Done
-
This is what I got on my 480's with 260.99
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory
Compiled with CUDA 3020
Stock GetPowerSpectrum<>:
64 threads: 27.6 GFlops 11.1 GB/s 0.0ulps
GetPowerSpectrum<> mod 1:
32 threads: 17.5 GFlops 7.0 GB/s 0.0ulps
64 threads: 27.5 GFlops 11.0 GB/s 0.0ulps
128 threads: 36.4 GFlops 14.6 GB/s 0.0ulps
256 threads: 39.6 GFlops 15.8 GB/s 0.0ulps
GetPowerSpectrum<> mod 2:
32 threads: 20.2 GFlops 8.1 GB/s 0.0ulps
64 threads: 39.7 GFlops 15.9 GB/s 0.0ulps
128 threads: 64.1 GFlops 25.6 GB/s 0.0ulps
256 threads: 64.3 GFlops 25.7 GB/s 0.0ulps
Steve
I edited the data as the first time I was crunching.
-
modify:
@Jason, wondering why you get 20 GFlops more with 256 threads than my GTX 470.
Do you have source for me to compile with the 2011XE compiler?
GTX480 has a wider memory bus, IIRC. Also, they're GPU kernels, Heinz, so the CPU host side won't make any difference here (unless Intel started messing with Cuda binaries ;) ). After some work, this will lead to a set of optimisation strategies for other kernels throughout, rather than one specific piece of useful code.
I'm looking at this (almost pure) memory bound computation (powerspectrum), as a way to see what optimisation strategies work on different cards with that type of operation. This way I can learn to make kernels that choose the best memory access strategy internally by compute capability.
So far it looks like Mod2 is winning on Fermi (apart from whatever is causing arkayn's problems). The prior Gen 200 series seems to like Mod1 better, so I suspect there is some memory pattern issue for me to look at in Mod2 with respect to prior-gen cards. Earlier G80-G92 cards could be even more memory-subsystem constrained, or need even more special treatment of access patterns, by the looks of things.
@Arkayn, not sure what would cause that, but on my 480 that's where things start to get 'a bit warm' ... Is there a possibility of temperature issues ? Try cranking the fan perhaps. [Edit:] Probably pushing the 2.1 (GTX 460) architecture limits in Mod2. I'll look into that for mod3.
Steve's WINNING! (Just ;) ) -
Plenty of data for me to chew on. Will be thinking about mod3.
Jason
-
Here's my 128Mb 8400M GS's result, while it's not got enough RAM for Seti, it at least gives you some figures for very slow GPU's:
Nice! Another stubborn GPU :D
-
Not sure if you're looking for this, but below my results on my 8800GTX, 260.99 drivers:
Exactly what I'm looking for, thanks.
-
Device: GeForce 9800 GT, 1750 MHz clock, 500 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 13.6 GFlops 5.4 GB/s 1183.3ulps
GetPowerSpectrum() mod 1:
32 threads: 12.1 GFlops 4.9 GB/s 1183.3ulps
64 threads: 13.7 GFlops 5.5 GB/s 1183.3ulps
128 threads: 13.5 GFlops 5.4 GB/s 1183.3ulps
256 threads: 13.4 GFlops 5.3 GB/s 1183.3ulps
GetPowerSpectrum() mod 2:
32 threads: 5.3 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.0 GFlops 2.8 GB/s 1183.3ulps
128 threads: 7.1 GFlops 2.8 GB/s 1183.3ulps
256 threads: 6.8 GFlops 2.7 GB/s 1183.3ulps
-
If anyone's wondering what this figure is:
... 1183.3ulps ...
It's a measure of the precision against a CPU double precision reference power spectrum.
Fermis get 0ulps total deviation (most accurate) because they default to IEEE-754 compliance, whereas earlier gens consistently get 1183.3 because they use a fast single precision implementation by default.
I can either use special intrinsic functions on the older cards to force compliance, at a speed penalty, or allow the Fermis to use the faster (less accurate) computation. Will see. 1183.3 'Units of Least Precision' isn't much total deviation from the double precision reference over the 1048576-point data set used in multibeam.
an ulp is defined here as:
const float ulp = 1.192092896e-07f;
... about 0.00000012 ... and there'd be some of that amount of variation from double precision CPU reference scattered throughout the dataset.
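Reading that description, the metric could be computed roughly as follows; this is my interpretation for illustration, not the actual test code:

```c
#include <assert.h>
#include <math.h>

/* One ulp, as defined in the test above (FLT_EPSILON). */
static const float ulp = 1.192092896e-07f;

/* Total deviation, in ulps, of a single precision result from a
 * double precision reference, summed over the whole data set.
 * Illustrative sketch only. */
static double total_ulps(const float *gpu, const double *ref, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += fabs((double)gpu[i] - ref[i]) / (double)ulp;
    return total;
}
```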
Jason
-
@Arkayn: I looked through some results I have, and I have a GTX460 set that ran to completion @ stock speeds (Using driver 263.06). Might be pushing the memory OC a bit on yours ?
-
I think it is at 800/1600 right now, runs Collatz just fine at that speed.
I just took it down to stock speed as well as the lowest setting that Afterburner allowed and it still crashed the program.
This is on a XP-64 pro machine though.
Driver is the 263.06, do I need the toolkit installed as well?
-
Driver is the 263.06, do I need the toolkit installed as well?
Nope. It's definitely something weird. Bear in mind that those upper kernels are pushing Fermi's memory subsystem harder than any boinc science app has to date that I know of, so I doubt Collatz or any other existing app would be a fair comparison (except maybe Furmark, which is just a savage thing to do to a graphics card).
If it runs this at stock OK, but not at 800/1600, then it might be Collatz stable, but is unlikely to be future X series stable. My current feeling is that the memory frequency is the culprit, rather than the core.
(If it doesn't run correctly at stock either, then more guessing to do ;) )
[Later:] At this stage I'm assuming some sort of bug in Mod2, so don't go pulling things to bits just yet ;)
-
A source of a subtle stock-code precision variation on pre-Fermi cards has been found. Will test-patch mod1 & mod3 & leave stock alone; will probably fix mod2 but leave its precision un-fixed as a test (fixing mod2 will make it slower anyway).
[A Bit Later:] Updated first post:
[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1: Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-fermi precision
Mod2: Fixed, but sadly is slow now, remains at stock accuracy
Mod3: As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)
Some variation on #1 and/or #3 may need to end up contributing to a stock update down the road, due to the (very tiny) stock code precision mismatch between CPU vs pre-Fermi vs Fermi. The issue could be a contributor to the 'dodgy Gaussians'; time will tell whether that's the case or not.
-
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 18.5 GFlops 7.4 GB/s 121.7ulps
64 threads: 26.5 GFlops 10.6 GB/s 121.7ulps
128 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
256 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
64 threads: 6.3 GFlops 2.5 GB/s 1183.3ulps
128 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
256 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 18.5 GFlops 7.4 GB/s 121.7ulps
64 threads: 26.5 GFlops 10.6 GB/s 121.7ulps
128 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
256 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
512 threads: 26.6 GFlops 10.7 GB/s 121.7ulps
1024 threads: N/A
-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 26.1 GFlops 10.4 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 18.4 GFlops 7.4 GB/s 121.7ulps
64 threads: 26.1 GFlops 10.4 GB/s 121.7ulps
128 threads: 26.3 GFlops 10.5 GB/s 121.7ulps
256 threads: 26.4 GFlops 10.5 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 6.1 GFlops 2.4 GB/s 1183.3ulps
64 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
128 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
256 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 18.5 GFlops 7.4 GB/s 121.7ulps
64 threads: 25.9 GFlops 10.3 GB/s 121.7ulps
128 threads: 26.0 GFlops 10.4 GB/s 121.7ulps
256 threads: 26.4 GFlops 10.6 GB/s 121.7ulps
512 threads: 26.4 GFlops 10.6 GB/s 121.7ulps
1024 threads: N/A
-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 25.5 GFlops 10.2 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 18.7 GFlops 7.5 GB/s 121.7ulps
64 threads: 25.6 GFlops 10.2 GB/s 121.7ulps
128 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
256 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 5.9 GFlops 2.4 GB/s 1183.3ulps
64 threads: 6.1 GFlops 2.4 GB/s 1183.3ulps
128 threads: 6.0 GFlops 2.4 GB/s 1183.3ulps
256 threads: 5.9 GFlops 2.4 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 18.7 GFlops 7.5 GB/s 121.7ulps
64 threads: 25.6 GFlops 10.2 GB/s 121.7ulps
128 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
256 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
512 threads: 25.8 GFlops 10.3 GB/s 121.7ulps
1024 threads: N/A
-
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 14.7 GFlops 5.9 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 8.1 GFlops 3.2 GB/s 121.7ulps
64 threads: 14.5 GFlops 5.8 GB/s 121.7ulps
128 threads: 22.2 GFlops 8.9 GB/s 121.7ulps
256 threads: 26.2 GFlops 10.5 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 9.4 GFlops 3.8 GB/s 0.0ulps
64 threads: 12.2 GFlops 4.9 GB/s 0.0ulps
128 threads: 14.7 GFlops 5.9 GB/s 0.0ulps
256 threads: 14.3 GFlops 5.7 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 7.6 GFlops 3.0 GB/s 121.7ulps
64 threads: 14.0 GFlops 5.6 GB/s 121.7ulps
128 threads: 21.5 GFlops 8.6 GB/s 121.7ulps
256 threads: 20.8 GFlops 8.3 GB/s 121.7ulps
512 threads: 20.0 GFlops 8.0 GB/s 121.7ulps
1024 threads: 17.5 GFlops 7.0 GB/s 121.7ulps
-
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 4.6 GFlops 1.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
64 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
128 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
64 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
128 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
256 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 3.0 GFlops 1.2 GB/s 121.7ulps
64 threads: 4.4 GFlops 1.8 GB/s 121.7ulps
128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
512 threads: 3.3 GFlops 1.3 GB/s 121.7ulps
1024 threads: N/A
Oh, look, I'm faster than an ION...
And look how horrible: even mod3 is a whole 5% slower than stock. That's 4 minutes on a 90-minute task, which means I'd diminish throughput by one task per 4-5 days. Simply outrageous.
-
LoL, don't worry, we'll put a crappy stock codepath in just for you ;)
[Edit:] I'm leaning toward the simpler Mod1 Kernel for the rest of us. On the Fermi's at least there is some cache control to play with yet, but then the denser threadcount of Mod3, at little cost, may allow more active kernels to fit on the Fermi GPU concurrently... Hmmm....
-
My 9800GTX+'s rerun:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 16.2 GFlops 6.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 15.2 GFlops 6.1 GB/s 121.7ulps
64 threads: 16.2 GFlops 6.5 GB/s 121.7ulps
128 threads: 15.9 GFlops 6.4 GB/s 121.7ulps
256 threads: 15.8 GFlops 6.3 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 2.7 GFlops 1.1 GB/s 1183.3ulps
64 threads: 2.6 GFlops 1.1 GB/s 1183.3ulps
128 threads: 2.6 GFlops 1.1 GB/s 1183.3ulps
256 threads: 2.5 GFlops 1.0 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 15.2 GFlops 6.1 GB/s 121.7ulps
64 threads: 16.2 GFlops 6.5 GB/s 121.7ulps
128 threads: 15.9 GFlops 6.4 GB/s 121.7ulps
256 threads: 15.9 GFlops 6.3 GB/s 121.7ulps
512 threads: 15.1 GFlops 6.0 GB/s 121.7ulps
1024 threads: N/A
Claggy
Edit: and my 128Mb 8400M GS:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 1.2 GFlops 0.5 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
64 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
128 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
256 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 0.3 GFlops 0.1 GB/s 1183.3ulps
64 threads: 0.3 GFlops 0.1 GB/s 1183.3ulps
128 threads: 0.3 GFlops 0.1 GB/s 1183.3ulps
256 threads: 0.2 GFlops 0.1 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
64 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
128 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
256 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
512 threads: 1.2 GFlops 0.5 GB/s 121.7ulps
1024 threads: N/A
-
Very strange that both the 9800GTX+ and GTX260 seem to be faster than the GTX460, since in every game and benchmark the GTX460 wins... something's wrong...
Also, what is the "clock" measurement displayed in this test? Is it the shader clock? If it is, why is it showing just 810MHz for me? It should be much higher.
-
Looks like 256 gets best performance out of fermis with no or only little loss for smaller cards.
-
Very strange that both the 9800GTX+ and GTX260 seem to be faster than the GTX460, since in every game and benchmark the GTX460 wins... something's wrong...
Also, what is the "clock" measurement displayed in this test? Is it the shader clock? If it is, why is it showing just 810MHz for me? It should be much higher.
The clock rate is just what the driver/library reports, which is some fixed number & doesn't measure any hardware (or mean much, other than giving a general indication of the original core spec).
As far as GTX 260 vs 9800GTX+ vs GTX 460 goes: quite right ;) but not strange at all. This is an (almost purely) 'memory bound' kernel rather than a 'compute bound' one. That makes it not overly dependent on the processing speed of the GPU, but instead on the specific memory implementation, clocks & quality of the RAM chips used, as well as the kernel tweaking I've been trying out.
So for that reason, this should be taken as a comparison of memory bound operations on different cards, and of the relative memory subsystem performance of those cards with respect to kernel tweaking, not as a guide to GPU compute performance... as there simply is very little to compute in a powerspectrum at all.
The goals at this time involve isolating effective strategies for shovelling data in and out of the GPU, rather than what's going on inside... That comes later, with some more meaty (compute intensive) kernels.
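The memory-bound point can be made concrete with a simple roofline-style estimate: with only a handful of flops per byte moved, the bandwidth term dominates the predicted runtime no matter how fast the GPU's cores are. The per-point flop and byte counts used below are illustrative assumptions, not measurements of the actual kernel:

```c
#include <assert.h>
#include <math.h>

/* Roofline-style lower bound on kernel runtime: a kernel is memory
 * bound when moving its bytes at peak bandwidth takes longer than
 * executing its flops at peak compute rate. */
static double est_time_s(double flops, double bytes,
                         double peak_gflops, double peak_gbps)
{
    double t_compute = flops / (peak_gflops * 1e9);
    double t_memory  = bytes  / (peak_gbps  * 1e9);
    return t_compute > t_memory ? t_compute : t_memory;
}
```

On this model, a card with several times the raw compute of another but similar memory bandwidth would predict essentially the same powerspectrum runtime, which is the behaviour seen in the results above.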
Jason
-
Looks like 256 gets best performance out of fermis with no or only little loss for smaller cards.
Yes, it's looking not bad. I can readily embed a couple of codepaths now, as the drivers have their own built-in dispatch (YaY). To me that means we can probably have our cake & eat it too; it's just a matter of running around picking up all the crumbs & sticking them together first.
-
And on the 465:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 16.0 GFlops 6.4 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 9.8 GFlops 3.9 GB/s 121.7ulps
64 threads: 15.8 GFlops 6.3 GB/s 121.7ulps
128 threads: 20.8 GFlops 8.3 GB/s 121.7ulps
256 threads: 23.1 GFlops 9.2 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 10.8 GFlops 4.3 GB/s 0.0ulps
64 threads: 13.2 GFlops 5.3 GB/s 0.0ulps
128 threads: 13.3 GFlops 5.3 GB/s 0.0ulps
256 threads: 12.1 GFlops 4.9 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 9.4 GFlops 3.7 GB/s 121.7ulps
64 threads: 15.3 GFlops 6.1 GB/s 121.7ulps
128 threads: 20.8 GFlops 8.3 GB/s 121.7ulps
256 threads: 20.6 GFlops 8.3 GB/s 121.7ulps
512 threads: 20.6 GFlops 8.2 GB/s 121.7ulps
1024 threads: 18.6 GFlops 7.4 GB/s 121.7ulps
-
Cheers,
Will have to test the kernel concurrency next ( launch 2 - 16 powerspectrums at the same time ). No idea how much, if any, overall speed improvement might be achievable with that, but needs testing. I'll keep stock & all 3 mods in play for that, since one may 'pack' better than the others (smaller thread counts might pass the larger ones in performance if executing multiple on the same multiprocessor).
-
@Ghost: Didn't you post about 50% higher GTX465 results yesterday? Why the difference? Did you change something? Drivers?
Or are there 2 different versons of PowerSpectrum floating around?
-
Or are there 2 different versons of PowerSpectrum floating around?
Check the first post for the updated build & notes. The Mod2 kernel was doing suspect things, so I've knobbled it (for now).
[I see you used the newer build yourself, so yes, mod2 numbers will be lower than yesterday.]
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory
Compute capability 2.0
Compiled with CUDA 3020
Stock GetPowerSpectrum():
64 threads: 27.7 GFlops 11.1 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 17.4 GFlops 7.0 GB/s 121.7ulps
64 threads: 27.5 GFlops 11.0 GB/s 121.7ulps
128 threads: 36.4 GFlops 14.5 GB/s 121.7ulps
256 threads: 39.6 GFlops 15.8 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 18.9 GFlops 7.6 GB/s 0.0ulps
64 threads: 23.1 GFlops 9.2 GB/s 0.0ulps
128 threads: 24.1 GFlops 9.6 GB/s 0.0ulps
256 threads: 22.7 GFlops 9.1 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 16.7 GFlops 6.7 GB/s 121.7ulps
64 threads: 26.9 GFlops 10.8 GB/s 121.7ulps
128 threads: 36.0 GFlops 14.4 GB/s 121.7ulps
256 threads: 34.9 GFlops 13.9 GB/s 121.7ulps
512 threads: 34.7 GFlops 13.9 GB/s 121.7ulps
1024 threads: 33.5 GFlops 13.4 GB/s 121.7ulps
Steve
-
Got through mod 2 just fine, now it crashes on mod 3 512 threads.
I even set the clocks to 505/1010/1350 just to check.
Also crashes at 800/1600/1800
-
mmm, don't know why, weird. Will look at mod3's differences to mod2 (not much). Maybe some sort of driver bug? It runs on XP32 here, but that's only a 260, not a Fermi.
I'd try a 263.06 driver clean install & see if that helps.
Can anyone else report crashing out on Mod3? Looks like Mod1 (256 thread) will be the useful technique on Fermi cards anyway, but if there is some issue with Mod3 it'd be nice to find & fix it for a fair comparison.
[A bit later:] Might have found something, will try adjusting mod3 & update later. @arkayn: :o why is your card the only one that tells me when I do something wrong?
-
Updated first post:
[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidy up & ironed out a bug that only manifests on Arkayn's card so far :o. Could be a smidgen faster.
Thanks Arkayn for picking up my bugs. Still no idea why yours is extra fussy, but it's very handy at the moment.
-
Mod3 performance improved in the latest PS build...
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 14.7 GFlops 5.9 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 8.2 GFlops 3.3 GB/s 121.7ulps
64 threads: 14.6 GFlops 5.8 GB/s 121.7ulps
128 threads: 22.3 GFlops 8.9 GB/s 121.7ulps
256 threads: 26.2 GFlops 10.5 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 9.4 GFlops 3.8 GB/s 0.0ulps
64 threads: 12.2 GFlops 4.9 GB/s 0.0ulps
128 threads: 14.7 GFlops 5.9 GB/s 0.0ulps
256 threads: 14.3 GFlops 5.7 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 8.2 GFlops 3.3 GB/s 121.7ulps
64 threads: 14.7 GFlops 5.9 GB/s 121.7ulps
128 threads: 22.3 GFlops 8.9 GB/s 121.7ulps
256 threads: 26.1 GFlops 10.4 GB/s 121.7ulps
512 threads: 25.7 GFlops 10.3 GB/s 121.7ulps
1024 threads: 18.3 GFlops 7.3 GB/s 121.7ulps
-
hehe thanks. The 460 with stock code is starting to look a bit anaemic, around all those 20+ figures.
-
C:\ap_j>cd g_fft
Stopping Boinc...
starting PowerSpectrum4.exe
.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 20.6 GFlops 8.2 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 12.5 GFlops 5.0 GB/s 121.7ulps
64 threads: 20.5 GFlops 8.2 GB/s 121.7ulps
128 threads: 27.6 GFlops 11.0 GB/s 121.7ulps
256 threads: 29.9 GFlops 12.0 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 13.5 GFlops 5.4 GB/s 0.0ulps
64 threads: 16.7 GFlops 6.7 GB/s 0.0ulps
128 threads: 17.2 GFlops 6.9 GB/s 0.0ulps
256 threads: 15.7 GFlops 6.3 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 12.6 GFlops 5.0 GB/s 121.7ulps
64 threads: 20.6 GFlops 8.2 GB/s 121.7ulps
128 threads: 27.5 GFlops 11.0 GB/s 121.7ulps
256 threads: 30.0 GFlops 12.0 GB/s 121.7ulps
512 threads: 29.7 GFlops 11.9 GB/s 121.7ulps
1024 threads: 25.6 GFlops 10.2 GB/s 121.7ulps
.
Done
Restarting Boinc...
Press any key . . .
heinz
-
Unit Test #4 results on my 465:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 16.0 GFlops 6.4 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 9.8 GFlops 3.9 GB/s 121.7ulps
64 threads: 15.9 GFlops 6.3 GB/s 121.7ulps
128 threads: 21.0 GFlops 8.4 GB/s 121.7ulps
256 threads: 23.1 GFlops 9.2 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 10.7 GFlops 4.3 GB/s 0.0ulps
64 threads: 13.1 GFlops 5.2 GB/s 0.0ulps
128 threads: 13.3 GFlops 5.3 GB/s 0.0ulps
256 threads: 12.1 GFlops 4.8 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 9.8 GFlops 3.9 GB/s 121.7ulps
64 threads: 15.9 GFlops 6.4 GB/s 121.7ulps
128 threads: 21.0 GFlops 8.4 GB/s 121.7ulps
256 threads: 23.1 GFlops 9.2 GB/s 121.7ulps
512 threads: 22.9 GFlops 9.1 GB/s 121.7ulps
1024 threads: 19.5 GFlops 7.8 GB/s 121.7ulps
Edit: Corrected figures; the card was running downclocked in the previous test (no tasks). Stock 465 speeds now shown.
-
Mod1 & Mod3 at 256 threads seem to suit Fermi the best...
-
Windows XP 32 seems to be faster than Windows 7 64.
I also noticed that for AP. For both Nvidia and AMD.
Windows 7 64
Device: GeForce GTX 460, 1451 MHz clock, 1024 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 12.7 GFlops 5.1 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 7.1 GFlops 2.8 GB/s 121.7ulps
64 threads: 12.6 GFlops 5.0 GB/s 121.7ulps
128 threads: 18.7 GFlops 7.5 GB/s 121.7ulps
256 threads: 22.4 GFlops 9.0 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 8.0 GFlops 3.2 GB/s 0.0ulps
64 threads: 10.4 GFlops 4.2 GB/s 0.0ulps
128 threads: 12.5 GFlops 5.0 GB/s 0.0ulps
256 threads: 12.3 GFlops 4.9 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 7.2 GFlops 2.9 GB/s 121.7ulps
64 threads: 12.7 GFlops 5.1 GB/s 121.7ulps
128 threads: 18.8 GFlops 7.5 GB/s 121.7ulps
256 threads: 22.4 GFlops 9.0 GB/s 121.7ulps
512 threads: 21.9 GFlops 8.8 GB/s 121.7ulps
1024 threads: 15.6 GFlops 6.2 GB/s 121.7ulps
================================================
Windows XP 32
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 13.2 GFlops 5.3 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 7.3 GFlops 2.9 GB/s 121.7ulps
64 threads: 13.1 GFlops 5.2 GB/s 121.7ulps
128 threads: 19.8 GFlops 7.9 GB/s 121.7ulps
256 threads: 23.5 GFlops 9.4 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 8.4 GFlops 3.3 GB/s 0.0ulps
64 threads: 10.9 GFlops 4.4 GB/s 0.0ulps
128 threads: 13.0 GFlops 5.2 GB/s 0.0ulps
256 threads: 12.7 GFlops 5.1 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 7.4 GFlops 3.0 GB/s 121.7ulps
64 threads: 13.2 GFlops 5.3 GB/s 121.7ulps
128 threads: 19.9 GFlops 8.0 GB/s 121.7ulps
256 threads: 23.6 GFlops 9.5 GB/s 121.7ulps
512 threads: 23.2 GFlops 9.3 GB/s 121.7ulps
1024 threads: 16.2 GFlops 6.5 GB/s 121.7ulps
-
I ran on all the different cards on the farm:
First up, the GT240 (Win7 x64) box has 3 cards, the GDDR5 variety. Device 0 is slightly slower than 1 and 2, although they are all the same brand/model. Output is from device 0.
Device: GeForce GT 240, 1340 MHz clock, 475 MB memory.
Compute capability 1.2
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 9.9 GFlops 4.0 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 8.5 GFlops 3.4 GB/s 121.7ulps
64 threads: 10.1 GFlops 4.0 GB/s 121.7ulps
128 threads: 10.0 GFlops 4.0 GB/s 121.7ulps
256 threads: 10.0 GFlops 4.0 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 2.1 GFlops 0.8 GB/s 1183.3ulps
64 threads: 2.1 GFlops 0.8 GB/s 1183.3ulps
128 threads: 2.1 GFlops 0.9 GB/s 1183.3ulps
256 threads: 2.0 GFlops 0.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 8.8 GFlops 3.5 GB/s 121.7ulps
64 threads: 10.1 GFlops 4.0 GB/s 121.7ulps
128 threads: 10.0 GFlops 4.0 GB/s 121.7ulps
256 threads: 10.0 GFlops 4.0 GB/s 121.7ulps
512 threads: 10.0 GFlops 4.0 GB/s 121.7ulps
1024 threads: N/A
*******************************************
Next we have a GTX275 (win7 x64):
Device: GeForce GTX 275, 1404 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 27.1 GFlops 10.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 17.1 GFlops 6.8 GB/s 121.7ulps
64 threads: 27.1 GFlops 10.8 GB/s 121.7ulps
128 threads: 27.3 GFlops 10.9 GB/s 121.7ulps
256 threads: 27.3 GFlops 10.9 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
64 threads: 6.3 GFlops 2.5 GB/s 1183.3ulps
128 threads: 6.0 GFlops 2.4 GB/s 1183.3ulps
256 threads: 6.0 GFlops 2.4 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 17.1 GFlops 6.9 GB/s 121.7ulps
64 threads: 27.1 GFlops 10.8 GB/s 121.7ulps
128 threads: 27.4 GFlops 11.0 GB/s 121.7ulps
256 threads: 27.2 GFlops 10.9 GB/s 121.7ulps
512 threads: 27.3 GFlops 10.9 GB/s 121.7ulps
1024 threads: N/A
*******************************************
Next a GTX295. Yeah, I know various people have run these. Win7 x64 again:
Device: GeForce GTX 295, 1242 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 24.2 GFlops 9.7 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 15.6 GFlops 6.3 GB/s 121.7ulps
64 threads: 24.6 GFlops 9.8 GB/s 121.7ulps
128 threads: 24.8 GFlops 9.9 GB/s 121.7ulps
256 threads: 24.7 GFlops 9.9 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 5.6 GFlops 2.2 GB/s 1183.3ulps
64 threads: 5.7 GFlops 2.3 GB/s 1183.3ulps
128 threads: 5.5 GFlops 2.2 GB/s 1183.3ulps
256 threads: 5.4 GFlops 2.2 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 15.6 GFlops 6.3 GB/s 121.7ulps
64 threads: 24.6 GFlops 9.8 GB/s 121.7ulps
128 threads: 24.8 GFlops 9.9 GB/s 121.7ulps
256 threads: 24.7 GFlops 9.9 GB/s 121.7ulps
512 threads: 24.7 GFlops 9.9 GB/s 121.7ulps
1024 threads: N/A
*******************************************
Then a GTX460 (factory OC'ed version from EVGA). Once again under Win7 x64:
Device: GeForce GTX 460, 810 MHz clock, 738 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 12.0 GFlops 4.8 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 6.9 GFlops 2.8 GB/s 121.7ulps
64 threads: 12.0 GFlops 4.8 GB/s 121.7ulps
128 threads: 17.4 GFlops 6.9 GB/s 121.7ulps
256 threads: 19.1 GFlops 7.6 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 7.6 GFlops 3.0 GB/s 0.0ulps
64 threads: 10.0 GFlops 4.0 GB/s 0.0ulps
128 threads: 11.9 GFlops 4.8 GB/s 0.0ulps
256 threads: 11.7 GFlops 4.7 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 7.0 GFlops 2.8 GB/s 121.7ulps
64 threads: 12.1 GFlops 4.8 GB/s 121.7ulps
128 threads: 17.4 GFlops 6.9 GB/s 121.7ulps
256 threads: 19.1 GFlops 7.7 GB/s 121.7ulps
512 threads: 18.8 GFlops 7.5 GB/s 121.7ulps
1024 threads: 14.3 GFlops 5.7 GB/s 121.7ulps
*******************************************
And lastly, just for comparison, the same brand/model factory OC'ed GTX460, but under WinXP:
Device: GeForce GTX 460, 1350 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 12.1 GFlops 4.8 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 6.9 GFlops 2.8 GB/s 121.7ulps
64 threads: 12.0 GFlops 4.8 GB/s 121.7ulps
128 threads: 17.4 GFlops 7.0 GB/s 121.7ulps
256 threads: 19.1 GFlops 7.6 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 7.6 GFlops 3.0 GB/s 0.0ulps
64 threads: 10.0 GFlops 4.0 GB/s 0.0ulps
128 threads: 11.9 GFlops 4.8 GB/s 0.0ulps
256 threads: 11.7 GFlops 4.7 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 7.0 GFlops 2.8 GB/s 121.7ulps
64 threads: 12.1 GFlops 4.8 GB/s 121.7ulps
128 threads: 17.4 GFlops 7.0 GB/s 121.7ulps
256 threads: 19.1 GFlops 7.7 GB/s 121.7ulps
512 threads: 18.9 GFlops 7.5 GB/s 121.7ulps
1024 threads: 14.3 GFlops 5.7 GB/s 121.7ulps
Cheers,
MarkJ
-
Busy thread and a lot happening here. My respect. I re-ran the version 4 benchmark again on:
Win7-64/8GB/8800GTX/260.99 drivers:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 17.8 GFlops 7.1 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 14.0 GFlops 5.6 GB/s 121.7ulps
64 threads: 17.8 GFlops 7.1 GB/s 121.7ulps
128 threads: 17.8 GFlops 7.1 GB/s 121.7ulps
256 threads: 17.6 GFlops 7.0 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 2.9 GFlops 1.1 GB/s 1183.3ulps
64 threads: 2.9 GFlops 1.2 GB/s 1183.3ulps
128 threads: 2.9 GFlops 1.1 GB/s 1183.3ulps
256 threads: 2.9 GFlops 1.1 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 14.6 GFlops 5.8 GB/s 121.7ulps
64 threads: 17.9 GFlops 7.2 GB/s 121.7ulps
128 threads: 17.7 GFlops 7.1 GB/s 121.7ulps
256 threads: 17.5 GFlops 7.0 GB/s 121.7ulps
512 threads: 16.1 GFlops 6.4 GB/s 121.7ulps
1024 threads: N/A
EDIT: I still have WinXP32 installed on another HD of this machine; are you interested in a run of your tool under that OS?
Regards, Patrick.
-
EDIT: I still have WinXP32 installed on another HD of this machine; are you interested in a run of your tool under that OS?
Yes please. The difference picked up earlier (thanks Frizz) between XP32 & XP64 was interesting ( with stock, around a 10% advantage to XP32, reduced to ~5% with Mod3 ). I've little doubt XP32 has a similar advantage over Win7x64, due to the simpler driver model, but it'd be nice to confirm whether the mods close that gap a bit too.
-
I ran on all the different cards on the farm:
1st up the GT240 (Win7 x64) has 3 cards, the DDR5 variety. Device 0 is slightly slower than 1 and 2, although they are all the same brand/model. Output is from device 0.
Device: GeForce GT 240, 1340 MHz clock, 475 MB memory....
Nice to be edging out stock on that stubborn card. With the rest of your results it's starting to paint a picture that might be easy to handle:
by Compute Capability
2.0 & 2.1: Mod3 256 thread wins (Significant Boost )
1.3: Mod3 with 128 threads ( Very small boost )
1.0-1.2: Mod3 with 64 threads (edges out stock by a slim margin sometimes, but seems consistent)
It should be fairly straightforward to follow rules like this for the other, more important kernels, so I'll make sure I fully understand this behaviour & build kernels with that in mind.
-
Test 4 Win 7 64 260.99
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 27.6 GFlops 11.0 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 17.4 GFlops 7.0 GB/s 121.7ulps
64 threads: 27.5 GFlops 11.0 GB/s 121.7ulps
128 threads: 36.4 GFlops 14.5 GB/s 121.7ulps
256 threads: 39.6 GFlops 15.8 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 18.9 GFlops 7.6 GB/s 0.0ulps
64 threads: 23.1 GFlops 9.2 GB/s 0.0ulps
128 threads: 24.1 GFlops 9.6 GB/s 0.0ulps
256 threads: 22.7 GFlops 9.1 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 17.5 GFlops 7.0 GB/s 121.7ulps
64 threads: 27.6 GFlops 11.0 GB/s 121.7ulps
128 threads: 36.3 GFlops 14.5 GB/s 121.7ulps
256 threads: 39.7 GFlops 15.9 GB/s 121.7ulps
512 threads: 39.2 GFlops 15.7 GB/s 121.7ulps
1024 threads: 34.7 GFlops 13.9 GB/s 121.7ulps
Steve
-
Me and my little 9500GT reporting for duty, sir, but it's time for a little hand holding. I downloaded the package from the first post. I got a DLL and the executable. Where do I put the DLL before I open the EXE?
-
EDIT: I still have WinXP32 installed on another HD of this machine; are you interested in a run of your tool under that OS?
Yes please. The difference picked up earlier (thanks Frizz) between XP32 & XP64 was interesting ( with stock, around a 10% advantage to XP32, reduced to ~5% with Mod3 ). I've little doubt XP32 has a similar advantage over Win7x64, due to the simpler driver model, but it'd be nice to confirm whether the mods close that gap a bit too.
Sure, no problem. The results:
WinXP32-SP3/8GB/8800GTX/260.99 drivers:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 18.3 GFlops 7.3 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 14.1 GFlops 5.6 GB/s 121.7ulps
64 threads: 18.2 GFlops 7.3 GB/s 121.7ulps
128 threads: 18.2 GFlops 7.3 GB/s 121.7ulps
256 threads: 17.9 GFlops 7.2 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 2.9 GFlops 1.2 GB/s 1183.3ulps
64 threads: 2.9 GFlops 1.2 GB/s 1183.3ulps
128 threads: 2.9 GFlops 1.2 GB/s 1183.3ulps
256 threads: 2.9 GFlops 1.2 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 14.7 GFlops 5.9 GB/s 121.7ulps
64 threads: 18.3 GFlops 7.3 GB/s 121.7ulps
128 threads: 18.2 GFlops 7.3 GB/s 121.7ulps
256 threads: 18.0 GFlops 7.2 GB/s 121.7ulps
512 threads: 16.4 GFlops 6.6 GB/s 121.7ulps
1024 threads: N/A
Regards, Patrick.
-
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.
Steve
-
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.
Steve
Whoa, you can copy and paste from a CMD window. :o Right click in the title bar and you will be enlightened. Saves you a LOT of typing!
Regards,
Patrick.
-
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.
Steve
Even easier: use a redirect.
PowerSpectrum4 > results.txt
Always avoid rekeying as much as you possibly can. Apart from the time wasted, it's a prolific source of errors.
-
What I did was to copy the folder with the dll and executable into the root (C:) directory. Stop crunching. Go to the command line, and get yourself into the directory where the two files are. Run the PowerSpectrum4 file, and then hand type the results shown in the command window into notepad. From there they can be copied and posted here. There is probably an easier way to do it, but you will get results. I think you need to be using at least the 260.89 GPU driver.
Steve
Whoa, you can copy and paste from a CMD window. :o Right click in the title bar and you will be enlightened. Saves you a LOT of typing!
Regards,
Patrick.
Thank you! I was stumbling myself trying to figure out how to do it. I actually had to toss the cat out because he kept climbing on me while I was trying to type. My DOS is not as good as it once was. It was even trial and error just to get to the right directory.
Steve
-
BTW: Steve, I think your card has downclocked or something.
Here's mine, 480 @ 820MHz (Win7x64):
...GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 17.7 GFlops 7.1 GB/s 121.7ulps
64 threads: 29.1 GFlops 11.6 GB/s 121.7ulps
128 threads: 40.3 GFlops 16.1 GB/s 121.7ulps
256 threads: 44.2 GFlops 17.7 GB/s 121.7ulps
512 threads: 43.4 GFlops 17.4 GB/s 121.7ulps
1024 threads: 36.8 GFlops 14.7 GB/s 121.7ulps...
-
BTW: Steve, I think your card has downclocked or something.
Here's mine, 480 @ 820MHz (Win7x64):
...GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 17.7 GFlops 7.1 GB/s 121.7ulps
64 threads: 29.1 GFlops 11.6 GB/s 121.7ulps
128 threads: 40.3 GFlops 16.1 GB/s 121.7ulps
256 threads: 44.2 GFlops 17.7 GB/s 121.7ulps
512 threads: 43.4 GFlops 17.4 GB/s 121.7ulps
1024 threads: 36.8 GFlops 14.7 GB/s 121.7ulps...
That's interesting. Normally I am running at 860 MHz, with the voltage at 1.05 VDC.
Steve
-
Me and my little 9500GT reporting for duty sir but it's time for a little hand holding.I downloaded the package from the first post. I got a DLL and the executable. Where do I put the DLL before I open the EXE?
A 9500GT would be a great double check of the theories so far (Mod3 64 thread should be the right choice & extremely close to stock for that one).
Just:
- chuck the exe & dll into a new folder somewhere easy to get to, such as C:\TEST
- Open a command window (Start->Run->CMD.EXE),
- change directory to that location ( cd \TEST )
- run the test ( powerspectrum4.exe > results.txt )
- wait for it to finish & look at results.txt
Jason
-
(http://i901.photobucket.com/albums/ac211/SciManStev/GPU.jpg)
At least they crunch fast.
Steve
-
Ah huh! Memory clock.
820/1640/2088 @ 1.138V; that's about as hard as I can reasonably push it without going to water.
This particular test kernel is memory bound, so that'll be the difference.
-
Yes please. The difference picked up earlier (thanks Frizz) between XP32 & XP64 was interesting ( with stock, around a 10% advantage to XP32, reduced to ~5% with Mod3 ). I've little doubt XP32 has a similar advantage over Win7x64, due to the simpler driver model, but it'd be nice to confirm whether the mods close that gap a bit too.
Sure, no problem. The results:
...
64 threads: 18.3 GFlops 7.3 GB/s 121.7ulps
Thanks! Not enough in it (~2-3%) for me to consider switching back to XP32 :).
-
Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test>powerspectrum4.exe
Device: GeForce 9500 GT, 1840 MHz clock, 1008 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 2.8 GFlops 1.1 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 2.7 GFlops 1.1 GB/s 121.7ulps
64 threads: 2.9 GFlops 1.1 GB/s 121.7ulps
128 threads: 2.9 GFlops 1.1 GB/s 121.7ulps
256 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 0.5 GFlops 0.2 GB/s 1183.3ulps
64 threads: 0.5 GFlops 0.2 GB/s 1183.3ulps
128 threads: 0.5 GFlops 0.2 GB/s 1183.3ulps
256 threads: 0.5 GFlops 0.2 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 2.8 GFlops 1.1 GB/s 121.7ulps
64 threads: 2.9 GFlops 1.1 GB/s 121.7ulps
128 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
256 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
512 threads: 2.9 GFlops 1.1 GB/s 121.7ulps
1024 threads: N/A
C:\test>
-
Woohoo, I like the ones that say 1.2 GB/s; might have to shift compute capability 1.1 cards into the Mod 3, 128 thread category ( or add more digits next time, to find out where within 0-9% that difference is. 9% would be good ).
-
I guess I was doing it wrong before as well; I was just running it straight.
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 12.8 GFlops 5.1 GB/s 0.0ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 7.7 GFlops 3.1 GB/s 121.7ulps
64 threads: 12.8 GFlops 5.1 GB/s 121.7ulps
128 threads: 17.6 GFlops 7.0 GB/s 121.7ulps
256 threads: 19.3 GFlops 7.7 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 8.7 GFlops 3.5 GB/s 0.0ulps
64 threads: 11.2 GFlops 4.5 GB/s 0.0ulps
128 threads: 13.2 GFlops 5.3 GB/s 0.0ulps
256 threads: 12.8 GFlops 5.1 GB/s 0.0ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 7.8 GFlops 3.1 GB/s 121.7ulps
64 threads: 12.9 GFlops 5.1 GB/s 121.7ulps
128 threads: 17.6 GFlops 7.0 GB/s 121.7ulps
256 threads: 19.3 GFlops 7.7 GB/s 121.7ulps
512 threads: 19.1 GFlops 7.6 GB/s 121.7ulps
1024 threads: 15.2 GFlops 6.1 GB/s 121.7ulps
-
Just to add a little bit... I'm running Vista 32 on an E5400 dual-core 2.7GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840, with memory at 400, to give me 118 GFLOPS peak.
-
Interesting. The CUDA Visual Profiler reports the global memory throughput of Mod 3 with 256 threads as ~175 GB/s. That means the measurement in the UnitTest is a factor of 10 out ::) ( PowerSpectrum4 reported ~17.7 GB/s ).
-
Possible reason:
The profiler counts all memory transfers, including overhead; your code probably counts only useful data transfers.
A large gap can be a sign of a big amount of overhead.
-
Hmmm, yes I read that. Whatever the reason, it will pop up as I analyse the crap out of the 256 thread version to see why it's faster on Fermi. I'm looking for a counter for uncoalesced global loads, but can't find it so far :-\
-
It's in the memory operations section; it is present for NV.
Regarding workgroup size, quite a few factors can have an influence:
1) register pressure
2) local (shared, in NV terms) memory amount
3) depth of the call stack
All these factors can limit the number of warps in flight simultaneously on a single compute unit; that is, they influence the quality of memory latency hiding.
This adds to all the other issues with memory access patterns vs workgroup dimensions (at the same workgroup size).
-
Ahh, found that the counter for uncoalesced reads & writes isn't supported on greater than compute capability 1.1... oh well.
1) register pressure
2) local (shared in NV terms) memory amount
3) depth of the call stack
We're all good there with Mod3: only 6 registers / thread, occupancy is 1, no shared mem usage in this variant, and only a single call.
So it looks like a clean memory bound kernel with no issues.
I did notice the memcopy kernels use 192 threads, possibly to fit extra blocks per SM despite being memory bound, so I'm going to try that.
Mod3/256 threads fits 6 blocks/SM, and the max is 8, so it might be worth checking.
[Edit:] much the same:
192 threads: 44.1 GFlops 17.6 GB/s 121.7ulps
Not much more to squeeze out of kernels like this, I think. Will add concurrent kernels next (I'll take my time doing so).
[Later:] Oops:
float GB = ((n * sizeof(float2)) + ( n*sizeof(float) ))/10e9;
fixing:
float GB = ((n * sizeof(float2)) + ( n*sizeof(float) ))/1e9;
That's better:
256 threads: 44.2 GFlops 176.8 GB/s 121.7ulps
Near maximum I think, will have to calculate the theoretical.
Theoretical max of a GTX480 @ 2088MHz memclock = 200.448 GB/s, so 176.8 (effective) is pushing pretty hard. Onto concurrency....
-
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 4.6 GFlops 1.8 GB/s 1183.3ulps
GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
64 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
64 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
128 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
256 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 3.2 GFlops 1.3 GB/s 121.7ulps
64 threads: 4.6 GFlops 1.8 GB/s 121.7ulps
128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.2 GFlops 1.7 GB/s 121.7ulps
512 threads: 3.5 GFlops 1.4 GB/s 121.7ulps
1024 threads: N/A
HTH
-
Fits with the theories so far :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10 ::)
-
Fits with the theories so far :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10 ::)
And maybe consider whether the kernels might be memory bound on some cards?
Just to add a little bit... I'm running Vista 32 on an E5400 dual-core 2.7GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840, with memory at 400, to give me 118 GFLOPS peak.
Yeah, 1840 shader gives 117.76 GFLOPS per the nVidia formula with 32 CUDA cores (aka shaders). What I find interesting is that they're trying to discourage use of Furmark and such, which actually try to achieve the highest possible performance...
Joe
-
Actually, Nvidia seems to be differentiating the gaming GeForce and HPC Tesla products even further, by putting a limitation in the GTX580 (and probably future high-end gaming GeForce products) to downclock when its usage reaches a very high level (as in FurMark or OCCT, for example). The reason they give is that games will never put such a high workload on the GPU, and they are probably right. However, some highly optimized real-life CUDA applications could achieve it as well - my guess is that Nvidia will respond with "buy a (much more expensive) Tesla if HPC is what you want"... :(
-
And maybe consider whether the kernels might be memory bound on some cards? ...
This one is memory bound on all of them, with only 3 compute instructions, partially fused, in each thread. Getting the right kernel geometry per compute capability does seem to let us push bandwidth up from stock though, and it appears the stock code for that kernel was compute capability 1.0 optimised (reasonable). There are now automatic ways to switch kernels at run time, in the driver, though.
With a memory bound computation like this, it does seem logical to increase the compute density, which the first freaky powerspectrum's inclusion of the power spectrum into the FFT output does do; those implementations will need testing & refining for extension to more sizes in the long run.
Next is probably to try to rearrange the FFT & powerspectrum into chunks, in order to better exploit the cache available on Fermi (~768k L2), which the FFT -> powerspectrum sequence appears to be thrashing solely due to dataset size. I'm hoping the concurrent kernels mechanism is intelligent enough to discriminate cache-hot data.
In either case the next test will probably need some extra compute density, which should see the GFlops rise against a hopefully similar bandwidth figure.
I haven't yet explored whether any processing subsequent to the powerspectrum could also be embedded to further raise the compute density (finding spikes immediately, for example), but it's looking like a possibility. Further on, dealing with individual PulsePoTs for the pulsefinding looks like an option, if the FFT sequence preceding it can be done in suitable blocks.
-
For spike finding the whole array has to be scanned. That is, either a long loop inside a thread, or thread cooperation via shared memory and barriers.
Whereas the power spectrum computation is inherently parallel and each thread can be mapped to a separate matrix point.
I tried to fuse the power spectrum computation with normalization - performance decreased because of a huge drop in the number of available separate threads (normalization requires a mean computation, i.e., again, access to the whole PoT array).
-
(normalization required mean computation, i.e., again, access to whole PoT array ).
mmm, there may be a way partially around that via some sync barrier / reduction. Averaging the large dataset *should* be parallelisable (by swapping local means). [Edit:] That summax stuff seems to be doing that, but seems fairly generalised, with lots of 'TODO's and unnecessary stuff. Will work out how to reduce in the powerspectrum kernel later, since it seems pointless rescanning the whole array when we just had it there for the powerspectrum.
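The 'swapping local means' idea is just a two-stage reduction; a minimal plain-Python sketch (not the CUDA kernel, names illustrative):

```python
# Illustrative sketch: averaging a large array in parallel by reducing
# per-block partial sums first, then combining them, rather than
# rescanning the whole array serially.

def block_partial_sums(data, block=256):
    """Stage 1: each 'thread block' reduces its slice to one (sum, count)."""
    return [(sum(data[i:i + block]), len(data[i:i + block]))
            for i in range(0, len(data), block)]

def parallel_mean(data, block=256):
    """Stage 2: combine the partial sums into the global mean."""
    partials = block_partial_sums(data, block)
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Stage 1 is what a shared-memory reduction inside the powerspectrum kernel could do almost for free; stage 2 is the small combine step that remains.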
-
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one, you will bind N points to a thread instead of 1 point, where N>>1, or you will have too many reduce steps.
]
-
summax uses thread cooperation/synching and barriers.
[
I just wanna say that by merging such a kernel with the power spectrum one, you will bind N points to a thread instead of 1 point, where N>>1, or you will have too many reduce steps.
]
I agree to some extent, except that we're already memory bound here, so pinching at least portions (say the first stage of the reduction, 256 points per block) should be almost free via shared memory (compared to memory access time, anyway). If it doesn't work out, it'll all lead to a better understanding of these complicated things anyway ;)
-
Further confirmation of this kernel being memory bound first. I wound up the memory clock without changing the core clock.
At original OC (not stock ) 2088MHz memory clock:
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
256 threads: 44.1 GFlops 176.4 GB/s 121.7ulps
At 2208 MHz:
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
256 threads: 46.7 GFlops 186.8 GB/s 121.7ulps
So a ~5.7% increase in throughput for a similar increase in memory clock (linear scaling with memory clock).
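Quick arithmetic check on those numbers (plain Python):

```python
# Sanity check: the throughput gain tracks the memory-clock gain,
# using the figures quoted above.
clock_gain = 2208 / 2088            # memory clock ratio
throughput_gain = 46.7 / 44.1       # GFlops ratio
bandwidth_gain = 186.8 / 176.4      # GB/s ratio

print(f"clock +{(clock_gain - 1) * 100:.1f}%, "
      f"GFlops +{(throughput_gain - 1) * 100:.1f}%, "
      f"GB/s +{(bandwidth_gain - 1) * 100:.1f}%")
```

All three land at ~5.7-5.9%, which is what you'd expect of a memory-bound kernel.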
-
Is it perhaps an option to put the latest version of your test program (the one with the fixed GB/s numbers) in the first post?
Of course, if you want to add/run more tests, I'm looking forward to providing you with the new results. ;)
Regards, Patrick.
-
Hi Patrick,
I'm currently working on adding the next part of stock code to the reference set. It seems that the method used for the next part of processing in stock cuda code is very slow (Though I'm busily checking my numbers). Once I've done that, and come up with some suitable alternatives or refinements for that code, I'll probably replace the current test with a new one (with fixed memory throughput numbers).
Until then you can just multiply the Memory throughput figures by ten in your head ;).
As part of the next refinements, whether they turn out to be replacing or integrating the summax reduction kernels, or something else if that proves unworkable as Raistmer suggests, I'll be trying to include the threads per block heuristic we work out for the powerspectrum Mod3. All going well, I should have something more worth testing in a day or so.
Jason
[A bit later:] Just to make things complicated, the performance of the next reduction (stock code) depends on what sizes are fed to it ::) (Powerspectrum performance is constant)
Stock:
PowerSpectrum< 64 thrd/blk> 29.0 GFlops 115.9 GB/s 0.0ulps
SumMax ( 8 ) 4.3 GFlops 19.0 GB/s
SumMax ( 16) 3.8 GFlops 16.5 GB/s
SumMax ( 32) 1.8 GFlops 8.0 GB/s
SumMax ( 64) 3.1 GFlops 13.5 GB/s
SumMax ( 128) 4.7 GFlops 20.5 GB/s
SumMax ( 256) 6.3 GFlops 27.6 GB/s
SumMax ( 512) 11.2 GFlops 48.9 GB/s
SumMax ( 1024) 17.2 GFlops 75.1 GB/s
SumMax ( 2048) 20.3 GFlops 88.9 GB/s
SumMax ( 4096) 24.3 GFlops 106.3 GB/s
SumMax ( 8192) 25.2 GFlops 110.2 GB/s
SumMax ( 16384) 24.8 GFlops 108.7 GB/s
SumMax ( 32768) 28.3 GFlops 123.8 GB/s
SumMax ( 65536) 18.4 GFlops 80.4 GB/s
SumMax (131072) 10.1 GFlops 44.3 GB/s
Powerspectrum + SumMax ( 8 ) 12.0 GFlops 49.1 GB/s
Powerspectrum + SumMax ( 16) 10.8 GFlops 44.4 GB/s
Powerspectrum + SumMax ( 32) 6.2 GFlops 25.2 GB/s
Powerspectrum + SumMax ( 64) 9.3 GFlops 38.3 GB/s
Powerspectrum + SumMax ( 128) 12.6 GFlops 51.7 GB/s
Powerspectrum + SumMax ( 256) 15.3 GFlops 62.5 GB/s
Powerspectrum + SumMax ( 512) 20.8 GFlops 85.1 GB/s
Powerspectrum + SumMax ( 1024) 24.8 GFlops 101.5 GB/s
Powerspectrum + SumMax ( 2048) 26.3 GFlops 107.5 GB/s
Powerspectrum + SumMax ( 4096) 27.7 GFlops 113.5 GB/s
Powerspectrum + SumMax ( 8192) 28.0 GFlops 114.6 GB/s
Powerspectrum + SumMax ( 16384) 27.8 GFlops 113.8 GB/s
Powerspectrum + SumMax ( 32768) 28.8 GFlops 117.9 GB/s
Powerspectrum + SumMax ( 65536) 25.4 GFlops 104.0 GB/s
Powerspectrum + SumMax (131072) 19.8 GFlops 81.1 GB/s
-
Yes, it should be so.
Different sizes mean different block numbers, hence different memory-latency hiding at least.
Whereas the power spectrum always has a constant number of threads (1M): each thread is mapped to a single spectrum point, and there are always 1M points regardless of size (X*Y == 1024*1024 always, even if X varies).
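The constant-thread-count point can be sketched in plain Python (illustrative, not the actual CUDA indexing code):

```python
# There are always 1M threads/points, however the FFT length varies:
# only the (fft index, bin) decomposition of the thread id changes.

TOTAL = 1 << 20  # 1M points

def map_thread(tid, fftlen):
    """Global thread id -> (which FFT in the batch, which bin inside it)."""
    return tid // fftlen, tid % fftlen
```

For example, fftlen 64 gives 16384 FFTs of 64 bins, fftlen 131072 gives 8 FFTs of 131072 bins; either way every one of the 1M threads touches exactly one point.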
-
@Raistmer: Now I've restored the stock memory transfers, and find this response from stock code:
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0]( 8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0]( 16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0]( 32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0]( 64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0]( 128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0]( 256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0]( 512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0]( 1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0]( 2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0]( 4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0]( 8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251
PowerSpectrum+summax Unit test
Stock:
PowerSpectrum< 64 thrd/blk> 29.1 GFlops 116.3 GB/s 0.0ulps
SumMax ( 8) 0.8 GFlops 3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
SumMax ( 16) 1.1 GFlops 4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
SumMax ( 32) 1.1 GFlops 4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
SumMax ( 64) 1.8 GFlops 7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
SumMax ( 128) 2.6 GFlops 11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
SumMax ( 256) 3.5 GFlops 15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
SumMax ( 512) 5.0 GFlops 21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
SumMax ( 1024) 6.1 GFlops 26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
SumMax ( 2048) 6.7 GFlops 29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
SumMax ( 4096) 7.2 GFlops 31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
SumMax ( 8192) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
SumMax ( 16384) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
SumMax ( 32768) 7.3 GFlops 32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
SumMax ( 65536) 6.2 GFlops 27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
SumMax (131072) 5.1 GFlops 22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
Did you also find the stock reduction code prior to FindSpikes is a pile of poo? Or do you think my test is broken?
-
Do you mean that summax reaches much lower throughput than the power spectrum?
If yes, it should be; I described the reasons earlier.
For now I'm converting CUDA summax32 into OpenCL for HD5xxx GPUs. Will see if it's better than my own reduction kernels that don't use local memory at all.
The reduction in summax allows more workitems to be involved. Without reduction, for long FFTs (and there are many long FFT calls, far more than small FFT ones) find spike would have only a few workitems, each dealing with a big array of data, which means very poor memory latency hiding.
So some sort of reduction is essential here [ and surely it will decrease throughput, but to a much lesser degree ]
{BTW, summax32 starts from FFTsize==32. From your table it looks like the codelets (template-based) for sizes less than 32 are not very good ones, too low throughput. Good info, I'll think twice before using them in OpenCL now ;D }
-
Yep, I mean exactly what you're getting at:
Stock:
...
SumMax ( 8) 0.8 GFlops 3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
...
SumMax ( 8192) 7.3 GFlops 31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
...
SumMax (131072) 5.1 GFlops 22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
I think my old Pentium 3 would calculate the average & peak for a 1M-point dataset at similar GFlops speeds, and need much less power to do so (compared to an overclocked GTX 480).
Part of the waste is definitely the memory copies back to host for result reduction (OK), but not that much of it. I'll continue playing around and see if I can determine whether something like this should really be done as-is with improved GPU code, partially on CPU, or fully on CPU.
Jason
-
A full CPU transfer will be slower; I started with that in an early stage of OpenCL MB.
The 4M memory transfer of the power matrix costs too much.
For low FFT sizes I use a flag transfer to see if the mean/max/pos data needs to be downloaded from the GPU or not, but for big FFT sizes (a low number of mean/max/pos elements) I found it's easier to transfer the data than the flag.
A memory transaction (for ATI at least) has some threshold size (about 16 kB) below which the transfer time almost doesn't change. That is, no matter whether a single byte or 16 kB is transferred, the overhead is the same. So there's no sense in downloading a flag instead of 16 kB of the original mean/max/pos data.
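A rough sketch (plain Python, assumed names) of the flag-vs-data decision described above; the 16 kB threshold and the 12-byte triples come from the posts, while the exact crossover in the real code was tuned empirically:

```python
# Each FFT array yields one (mean, max, pos) triple (3 floats = 12 bytes).
# Below ~16 kB a transfer costs roughly the same regardless of size, so
# once the whole result set fits under that threshold it is simpler to
# copy the results than to copy a flag and maybe fetch them afterwards.

TOTAL_POINTS = 1 << 20       # 1M-point dataset
TRIPLE_BYTES = 12            # mean/max/pos, 3 floats
THRESHOLD = 16 * 1024        # ~constant-cost transfer size (ATI figure cited)

def results_bytes(fftlen):
    """Size of the full mean/max/pos result set for this FFT length."""
    return (TOTAL_POINTS // fftlen) * TRIPLE_BYTES

def transfer_strategy(fftlen):
    """Small FFTs -> many triples -> copy a flag; big FFTs -> copy results."""
    return "copy results" if results_bytes(fftlen) <= THRESHOLD else "copy flag"
```

For example fftlen 8192 gives 128 arrays, i.e. 1.5 kB of results, comfortably under the threshold, whereas fftlen 64 gives 16384 arrays (~192 kB), where a flag wins.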
-
OK, then I have a middle ground in mind that might restore some throughput & hopefully be friendly to the preceding powerspectrum threadblock layout. Will give it a go.
[Next Day:] The stock method is doomed by those memory transfers for the summax reduction results. I'll go on to FindSpikes to see if all that data is really needed.
-
[Next Day:] The stock method is doomed by those memory transfers for the summax reduction results. I'll go on to FindSpikes to see if all that data is really needed.
3 float numbers (12 bytes) per array. And when the FFT size is big enough (most of the time it's ~8-16k) that's not much to transfer (at an FFT size of 8K we have 1M/8k = 128 arrays => 128*12 = 1.5 kB of memory transfer per kernel call; that's well within the threshold for constant-time memory transfer on ATI). That is, there's no sense in reducing the size of the transfer (and I see no way to eliminate it completely without doing the full reduction and spike bookkeeping entirely on the GPU).
Will look at the NV profiler data to see if it shows a different transfer-time vs transfer-size dependence...
-
OK, it's data for the OpenCL build, but you will get ~ the same with CUDA perhaps:
Transfer of the flag (in my case the flag is a uint4, so 16 bytes):
4us GPU time and ~120us CPU time (as the NV profiler shows)
Transfer of the full results array in the case of big FFT sizes, for example a transfer of 8k of data: GPU: 14us, CPU: 128us;
4k of data: GPU: 9us, CPU: 117us
Looks like the same rule applies for NV GPUs, maybe with a slightly lower threshold value: if the transfer size is less than some threshold, transfer time no longer depends on transfer size.
And it should be so, because of the rather big quantum of data per bus transfer.
-
OK,
While playing, I've found it is definitely the numdatapoints/fftlen transfers (of size float3) that are killing performance here. I've tried variations on partial reductions, to be completed on the host, even with mapped/pinned memory. I find it's going to be more efficient to transmit the thresholds into the kernel, transferring flag/results only if necessary, solely because we have an upper limit of 30 signals... will keep playing.
Jason
-
The flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.
-
solely because we have an upper limit of 30 signals...
Per single kernel call it doesn't matter. You can always download the whole array back if needed, and that's a very rare case. In the common case only a uint flag of 4 bytes need be transferred, if the threshold/best comparison is done inside the kernel.
But again, it works well only while FFT sizes are small (many short arrays).
-
The flag is better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.
Yep, that gels with what I'm seeing in the cuda calls. I'll avoid looking too hard at the OpenCL implementation, in the interest of strength through diversity, but the principles match up so far.
-
But again, it works well only while FFT sizes are small (many short arrays).
Yep, same again. The shorter FFT lengths correspond to the longer PulsePoTs, so I want those as short as possible. I'll feed flags through to the kernels, at least for the shorter FFT lengths.
-
WoW,
Now I'm completely brainfried & need to design a thorough test for the next part. I'll take a good break before creating a new one.
I chose one size for the combined powerspectrum+summax optimisation (fftlen=64), and *think* I've got that working. I want to be very sure though, so I can use the same techniques through templatisation of the kernel.
It turns out using shared memory to speed up the reduction is STINKING DIFFICULT :o ....I really hope it gets easier with practice :D
A 'tentatively looking OK' result for some reductions... but the speed looks too fast to be 100% correct right through, hence the need for extreme caution & a break from coding for a little while (Stock = yellow, Opt1 = green, though the speed is suspect):
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.0 GB/s 0.0ulps
SumMax ( 64) 1.8 GFlops 7.4 GB/s
fft[0] avg 0.650947 Pk 3.050944 OK
fft[1] avg 0.624826 Pk 2.995684 OK
fft[2] avg 0.620340 Pk 2.418427 OK
fft[3] avg 0.779598 Pk 2.243930 OK
PS+SuMx( 64) 6.0 GFlops 24.2 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.1 GFlops 176.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax Array Mapped to pinned host memory.
256 threads, fftlem 64: 33.2 GFlops 134.5 GB/s 121.7ulps
fft[0] avg 0.650947 Pk 3.050944
fft[1] avg 0.624826 Pk 2.995684
fft[2] avg 0.620340 Pk 2.418427
fft[3] avg 0.779598 Pk 2.243929
I'll post a thorough updated test when I'm a bit more confident of the result, but prior to templating the other sizes.
-
Managed to slow it down some (by processing properly ;)), but it tests out OK here (so far):
First post updated (particularly looking for which cards show any net gain, and which show none, in worst & best cases):
[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64) 1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
- Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
- Opt1 best & worst cases likely to occur in real life tested; worst case should indicate ~same as stock to ~30% improvement (depending on GPU), best case ~1.3-2x stock throughput (depending on GPU etc). Worst-case results are checked for accuracy & flagged if there's a problem.
- On Integrated GPUs, use mapped/pinned host memory, so on those worst case should be ~= best case ( and hopefully some margin better than the stock reduction :-\)
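For reference, the threads-per-block heuristic in the list above boils down to something like this (illustrative Python, not the actual detection code):

```python
# Threads per block for the powerspectrum kernel, keyed on CUDA
# compute capability, per the mapping listed above:
# 1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256.

def powerspectrum_threads(major, minor):
    if major >= 2:
        return 256          # Fermi and later
    if (major, minor) >= (1, 3):
        return 128          # GT200 class
    return 64               # compute 1.0-1.2
```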
Example output (important numbers: highlighted, Stock, Opt1 )
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.1 GB/s 0.0ulps
SumMax ( 64) 1.8 GFlops 7.4 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 5.9 GFlops 24.1 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.3 GFlops 177.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.2 GB/s 121.7ulps
-
BTW: Please test on unloaded system (keep forgetting to mention that ;))
[Edit:] Attached the wrong file ::) Fixing... Never mind, it was the correct file after all
-
Testing on my shrubbery. Each file contains Result4 and Result5 (since I seem to have missed a testing cycle).
Other machines will follow. Last one.
-
Cheers, analysing first one:...
On that 9800GTX+ on Win7 (compute cap 1.1), I make that ~29% worst case, ~63% best case speedup. Looks like I'm getting the pre-Fermis to budge finally ::), good (was worried about that). Average & peak calculations say 'OK' (check), and the correct threadcount was issued automatically (check, 64 thrds/blk, cc1.1)
analysing second one (9800GT on XP): ...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/block, cc1.1)
worst ~44%, best ~83%.
analysing third one (GTX 470 on XP):...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 256 thrds/block, cc2.0)
worst ~45%, best ~115%.
Thanks for the test4 results, they were helpful to double-check that the threadcount heuristic was wise enough in all three cases.
This particular code portion mostly has low impact, but Raistmer tells me it has the most impact for VHAR. In any case, it's the compute-capability-based heuristics & optimisation techniques being used that should hopefully help in more significant areas. Already starting to get much better armed than a week ago. Thanks ;D
-
Here's the results from my 9800GTX+ on Win 7 64bit:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 16.0 GFlops 64.2 GB/s 1183.3ulps
SumMax ( 64) 1.4 GFlops 6.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.5 GFlops 18.3 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 16.2 GFlops 64.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
6.0 GFlops 24.3 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
7.9 GFlops 32.1 GB/s 121.7ulps
and from my 128Mb 8400M GS:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 1.2 GFlops 4.8 GB/s 1183.3ulps
SumMax ( 64) 0.1 GFlops 0.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 0.4 GFlops 1.5 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 1.2 GFlops 4.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
0.6 GFlops 2.4 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
0.6 GFlops 2.5 GB/s 121.7ulps
Claggy
Edit: Here's the results of my 9800GTX+ on Windows Vista 64bit:
Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 16.0 GFlops 64.1 GB/s 1183.3ulps
SumMax ( 64) 1.4 GFlops 5.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.3 GFlops 17.6 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 16.2 GFlops 64.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
5.8 GFlops 23.4 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
7.5 GFlops 30.4 GB/s 121.7ulps
-
Ran it on my rig (Q6600/8GB/8800GTX/Win7-64), results:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 18.1 GFlops 72.4 GB/s 1183.3ulps
SumMax ( 64) 1.2 GFlops 4.9 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 3.9 GFlops 15.6 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 18.2 GFlops 72.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
5.4 GFlops 22.0 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
6.6 GFlops 26.6 GB/s 121.7ulps
Are you also interested in a run under WinXP?
Regards,
Patrick.
-
Win7 x64 - GTX465:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 15.9 GFlops 63.8 GB/s 0.0ulps
SumMax ( 64) 1.3 GFlops 5.4 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.1 GFlops 16.6 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 23.1 GFlops 92.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.0 GFlops 24.2 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.4 GB/s 121.7ulps
-
....Are you also interested in a run under WinXP? ...
Sure! It'll be interesting to see if I'm closing the gap, or making it wider ;).
Analysing your first result....
8800GTX
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~38%
best case speedup: ~69%
-
Win7 x64 - GTX465:
Thanks, analysing your result too....
GTX 465
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~46%
best case speedup: ~112%
-
...and from my 128Mb 8400M GS:
Analysing both ;)
9800GTX+
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~33%
best case speedup: ~75%
8400M GS
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~50% <-- nice
best case speedup: ~50% <-- nice
-
....Are you also interested in a run under WinXP? ...
Sure! It'll be interesting to see if I'm closing the gap, or making it wider ;).
Analysing your first result....
8800GTX
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~38%
best case speedup: ~69%
As requested (Q6600/8GB/8800GTX/WinXP32):
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 18.3 GFlops 73.1 GB/s 1183.3ulps
SumMax ( 64) 1.3 GFlops 5.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.3 GFlops 17.5 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 18.3 GFlops 73.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
6.4 GFlops 25.8 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
7.9 GFlops 32.2 GB/s 121.7ulps
Regards, Patrick.
-
As requested (Q6600/8GB/8800GTX/WinXP32):
8800GTX earlier Win7x64 result:
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~38% --> 5.4 GFlops
best case speedup: ~69% --> 6.6Gflops
8800GTX XP32 result
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~48% --> 6.4 GFlops
best case speedup: ~83% --> 7.9 GFlops
Tentative conclusion: in both best and worst cases, with that particular card and these specific hard-coded kernels (not overly driver/cuda library dependent), XP32 performance is higher by some 18-19%
That's a lot of difference (more than I expected). Could you let me know both driver versions involved, whether your win7 has aero active, and any other possible differences besides OS ?
(looks like I might end up widening the gap, rather than narrowing it ::))
Jason
-
As requested (Q6600/8GB/8800GTX/WinXP32):
8800GTX earlier Win7x64 result:
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~38% --> 5.4 GFlops
best case speedup: ~69% --> 6.6Gflops
8800GTX XP32 result
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~48% --> 6.4 GFlops
best case speedup: ~83% --> 7.9 GFlops
Tentative conclusion: in both best and worst cases, with that particular card and these specific hard-coded kernels (not overly driver/cuda library dependent), XP32 performance is higher by some 18-19%
That's a lot of difference (more than I expected). Could you let me know both driver versions involved, whether your win7 has aero active, and any other possible differences besides OS ?
Jason
Ah, both OSes have the 260.99 driver installed. Aero was active on Win7-64. There was also a VMWare virtual machine idling on the Win7-machine.
Since I suppose you'd like me to re-run the test on the Win7 machine without Aero and without the VM active, I did :):
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 18.1 GFlops 72.4 GB/s 1183.3ulps
SumMax ( 64) 1.2 GFlops 4.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 3.9 GFlops 15.6 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 18.2 GFlops 72.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
5.4 GFlops 21.9 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
6.6 GFlops 26.6 GB/s 121.7ulps
Hope this provides some insight.
Regards, Patrick.
-
Hope this provides some insight.
Thanks, it does :). Neither Aero nor the idling VM appears to have noticeably altered the performance numbers there... so we Win7 adopters appear to be paying the price for our shiny new WDDM driver model ;).
The stock code numbers are interesting too: XP32 @ 4.3 GFlops, and Win7x64 @ 3.9-4.1 GFlops... looks like the more familiar reported ~10% advantage to XP we've heard about before.
Nice that my tweaking works even faster on XP, but I'm starting to hope MS includes some sort of video subsystem fixes in SP1 for Win7x64 :D
[Edit:] Later in the week I'll look into whether the 64-bitness of the OS is a factor now, though it hasn't shown itself to be significant before. The WoW64 layer could possibly be slowing things up there somehow, but best to know for sure.
-
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 26.5 GFlops 105.8 GB/s 1183.3ulps
SumMax ( 64) 2.1 GFlops 8.6 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.7 GFlops 26.9 GB/s
GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 26.7 GFlops 106.9 GB/s 121.7ulps
Opt1 (PSmod3+SM): 128 thrds/block
128 threads, fftlen 64: (worst case: full summax copy)
9.1 GFlops 37.0 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
10.8 GFlops 43.8 GB/s 121.7ulps
-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 25.2 GFlops 100.7 GB/s 1183.3ulps
SumMax ( 64) 2.1 GFlops 8.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.5 GFlops 26.3 GB/s
GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 26.3 GFlops 105.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 128 thrds/block
128 threads, fftlen 64: (worst case: full summax copy)
9.1 GFlops 36.9 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
10.4 GFlops 42.1 GB/s 121.7ulps
-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 25.3 GFlops 101.2 GB/s 1183.3ulps
SumMax ( 64) 2.0 GFlops 8.4 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.4 GFlops 25.7 GB/s
GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 25.9 GFlops 103.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 128 thrds/block
128 threads, fftlen 64: (worst case: full summax copy)
8.8 GFlops 35.8 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
10.4 GFlops 42.1 GB/s 121.7ulps
-
Thanks! Compute cap 1.3, so that completes the basic heuristic functionality test :)
GTX 295 (taking lower & upper limits on each GPU as combined range)
Average, peak calcs, thread-count heuristic: OK (both)
worst case speedup: ~35% / ~40%
best case speedup: ~61% / ~60%
GTX 260
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~37%
best case speedup: ~62%
Still some legroom in those 2xx series yet :) With the 295s still pulling those kinds of relative performance numbers, they'll challenge the 480s for a while yet IMO. Running several tasks on the same 480 GPU makes the picture less clear, so as some of the small refinements creep into future releases it'll be something fun to watch, at least.
-
Here is the results from my 460
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 12.9 GFlops 51.6 GB/s 0.0ulps
SumMax ( 64) 1.1 GFlops 4.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 3.4 GFlops 13.8 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 19.4 GFlops 77.4 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
5.5 GFlops 22.1 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
6.9 GFlops 28.1 GB/s 121.7ulps
-
Thanks! Cooperating with cc 2.1 as well (after that rocky start ;) )
GTX 460
Average, peak calcs, thread-count heuristic: OK
worst case speedup: ~61% :o
best case speedup: ~103%
Looking good. I haven't worked out the worst-case speedup for this kernel on my 480 yet; it should be similarish. Doing so...
Stock PS+SuMx( 64) 5.9 GFlops 24.0 GB/s
...
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.7 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.0 GB/s 121.7ulps
So
GTX480
worst: (8.1-5.9)/5.9 ~= 37%
best: (16.1-5.9)/5.9 ~= 173%
I guess I can live with the smaller improvement in the worst case, if I can manage to get a piece of the best-case improvement into some code down the road.
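For reference, the speedup percentages quoted through this thread all come from the same formula; spelled out for the GTX 480 numbers above (plain Python sketch):

```python
# Speedup over stock, as a percentage of stock throughput.
def speedup_pct(opt_gflops, stock_gflops):
    return (opt_gflops - stock_gflops) / stock_gflops * 100

# GTX 480 figures from the post above:
print(round(speedup_pct(8.1, 5.9)))    # worst case -> 37
print(round(speedup_pct(16.1, 5.9)))   # best case -> 173
```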
-
Thanks All!
From here I'll move on to completing at least the 'worst case' operation for all sizes. That will take some time: I'll make a further test confirming which sizes work, at least for worst-case speedups (simple implementation), and which don't. During that period I'll also be seeking straightforward integration into the X series builds. It would only amount to a very, very small speedup over the whole processing, but will confirm certain techniques (as already mentioned).
The 'best case' optimisation will require extensive work to extract a reasonable portion of; that would be a further small speedup overall, and it looks like it'll help most GPUs, Fermi most of all. Again, those techniques will reflect on other more critical code areas in the long run, so your help here has been most highly appreciated.
I can start to apply some of the methods determined here toward more important areas with a lot more confidence.
Cheers, Jason
-
For sake of completeness:
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 4.5 GFlops 17.8 GB/s 1183.3ulps
SumMax ( 64) 0.2 GFlops 1.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 0.4 GFlops 1.7 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 4.4 GFlops 17.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
1.3 GFlops 5.3 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.4 GFlops 5.8 GB/s 121.7ulps
Some 10% difference between the two bottom ones.
-
Cheers,
Analysing...
Average, peak calcs, thread-count heuristic: OK
worst case speedup: (1.3-0.4)/0.4 ~225% (3.25x).. Winner! ;D
best case speedup: (1.4-0.4)/0.4 ~250% (3.5x)
Double checking those ridiculous numbers (mistakes always possible ;) ):
1.3 GFlops (optimised) / 0.4 GFlops (stock) definitely = 3.25x (325% of stock throughput).
The percentage of optimised throughput that is speedup is then 0.9 GFlops / 1.3 GFlops ~= 69%, i.e. ~69% of Opt throughput is bonus; the speedup component is 225% of the stock throughput.
#Stock is doing something that GPU doesn't like :-\
-
I reran a few times, getting 0.8-0.9, 1.3, and 1.4-1.5 now.
i.e. a higher baseline, with the optimisation values stable. I can do some statistics tomorrow.
edit: that 0.4 seems to have been exceptionally low (and no, I didn't have the GPU crunching by accident :P )
-
OK, non-critical unless I make computation mistakes (I was mostly concerned here not to make code slower...). The stock / x32f code there is doing something your GPU doesn't like, IMO.
Was that Quadro integrated & using some portion of system memory? Or does it use dedicated memory?
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 28.4 GFlops 113.7 GB/s 0.0ulps
SumMax ( 64) 2.3 GFlops 9.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 7.4 GFlops 29.9 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 41.4 GFlops 165.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
10.9 GFlops 44.0 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.2 GFlops 65.4 GB/s 121.7ulps
This was much easier than typing it out. Thanks, Richard.
Steve
-
Thanks Steve!,
Now your increased core speed is showing via the improved 'worst case' speedup over mine (your 10.9 vs my 8.1 GFlops).
GTX480 (watercooled)
Average, peak calcs, thread-count heuristic: OK
worst case speedup: (10.9-7.4)/7.4 ~= 47% ( 1.47x )
best case speedup: (16.2-7.4)/7.4 ~= 119% ( 2.19x )
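For reference, the thread-count heuristic being checked here was described in the Test #5 notes (compute capability 1.0-1.2 → 64 threads, 1.3 → 128, 2.0+ → 256); a minimal sketch of that selection logic (my own paraphrase, not the actual test code):

```python
# Sketch of the automated thread-count heuristic described for the
# Opt1 powerspectrum kernel: pick threads/block from CUDA compute
# capability (1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256).
def threads_per_block(major, minor):
    if major >= 2:
        return 256
    if major == 1 and minor >= 3:
        return 128
    return 64

# GTX 480 / 470 / 465 are compute capability 2.0 -> 256 threads
print(threads_per_block(2, 0))  # 256
# Quadro FX 570M is compute capability 1.1 -> 64 threads
print(threads_per_block(1, 1))  # 64
```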
-
Nice that my tweaking works even faster on XP, but I'm starting to hope MS includes some sort of video subsystem fixes in SP1 for Win7 x64 :D
Just re-ran the Mod5 test on my GTX465 on Win7 x64 SP1 v.721 RC,
getting the same results as before:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 16.0 GFlops 63.9 GB/s 0.0ulps
SumMax ( 64) 1.3 GFlops 5.2 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.1 GFlops 16.5 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 23.1 GFlops 92.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.0 GFlops 24.2 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.4 GB/s 121.7ulps
-
Interesting. In the meantime I also managed to verify that 32-bit versus 64-bit executables yielded no discernible performance difference here (since it's GPU hard-coded anyway ;) ).
So we're left with WinXP32's simpler driver model, with no DirectX 10+ support or WDDM stuff going on, IMO. I wonder if there's a way to turn off more stuff in Win7 x64, video-subsystem-wise.
[Edit:] Hmmm....
http://www.anandtech.com/show/3924/nvidia-announces-parallel-nsight-15-cuda-toolkit-32
"Compared to the old XPDM, WDDM was a big step up for GPU usage on Windows, but only for graphical purposes. With Windows’ iron-fisted control over the GPU and a focus on task scheduling for responsiveness over performance, it wasn’t ideal for GPGPU purposes. Case in point, with a WDDM driver NVIDIA was finding it took 30μs for a kernel to be launched, but if they had Windows treat the GPU as a generic device by using a Windows Driver Model (WDM) driver, that launch time dropped to 2.5μs. This coupled with the fact that a WDM driver is necessary to use Tesla cards in a Windows Remote Desktop Protocol environment (as any Folding @Home junkie can tell you, RDP sessions can’t access the GPU through WDDM) resulted in the birth of TCC mode."
-
Looks good - a massive drop in time to launch kernels; shame it's only available for Tesla GPUs at the moment.
Hopefully NV will release a similar driver for at least the Fermi cards, if not all the current cards.
-
Yeah, the OmegaDrivers.net guy looks broke & struggling to work out Win7 drivers too (none for Win7 available when you read further in).
-
Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens
http://forums.nvidia.com/index.php?showtopic=159208
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"
-
Test 5 output.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 20.3 GFlops 81.3 GB/s 0.0ulps
SumMax ( 64) 0.7 GFlops 2.9 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 2.3 GFlops 9.5 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 29.2 GFlops 117.0 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
2.3 GFlops 9.5 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.1 GFlops 44.9 GB/s 121.7ulps
Kevin
-
PowerSpectrumTest5.exe -device 0
.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 20.6 GFlops 82.5 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 6.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.6 GFlops 18.5 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 30.0 GFlops 119.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.6 GFlops 26.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.2 GFlops 45.2 GB/s 121.7ulps
PowerSpectrumTest5.exe -device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 20.7 GFlops 82.6 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 5.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.6 GFlops 18.7 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 30.1 GFlops 120.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.6 GFlops 26.9 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.2 GFlops 45.3 GB/s 121.7ulps
.
Done
-
Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens
http://forums.nvidia.com/index.php?showtopic=159208
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"
@Ghost: I did get the following so far:
- Made the modifications appropriate to the inf file, and successfully installed 263.06 TCC driver ( On 480 )
- Disabled the device as a 'normal' display (using mobo display instead)
- Merged the nSight registry key that disables WPF acceleration (for good measure, shouldn't be necessary with no active display on it)
Next step should be to switch the devices driver mode to TCC mode. That's done via the command:
nvidia-smi --driver-model=
however I get this response:
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe --driver-model=
GPU 0 is not a supported TCC device, skipping
[Edit:] Note that it doesn't say that the card/driver doesn't support it...
Confirming with DeviceQuery:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce GTX 480"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1576468480 bytes
Multiprocessors x Cores/MP = Cores: 15 (MP) x 32 (Cores/MP) = 480 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 0.81 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads
can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 480
PASSED
So I gather we're stuck for now :( [Edit:] unless you happen to be good with SoftIce or similar.... ::)
Going to try checking if I got the section number in the inf right etc...
-
I got stuck @ the same point as well.
When I ran the nvidia-smi.exe -dm0 cmd I got the same message about the GPU not being supported on TCC.
I tried modifying the .inf file with limited success so I used this site http://laptopvideo2go.com
Basically they create a standard .inf file that allows all NV cards to use all drivers ;D Saved a lot of time and hassle - they also have unreleased drivers on their site; the latest I could see were 265.90. Not sure where they get them from, so use at your own risk, but I've had no issues with them.
I also saw a slight increase in the worst case scenario with Mod5 on these drivers; off the top of my head it was about a 0.2 increase over the official release drivers.
Haven't had a look at SoftIce yet - I'll do a bit of research tomorrow as it looks like I may not be getting into the office again :D
-
Despite finding a couple more posts (after spending a few hours searching) saying that it is possible to enable TCC mode on non-Tesla cards, I haven't been able to get it running on my 465.
At the moment all I have managed is to change the compute mode ruleset through nvidia-smi. Although DeviceQuery still says it's running in Default mode, so whether this has had any real effect or not is up for debate >:(
Think it's about time to give up on this idea, unfortunately. Shame though, it would have been nice to get it working, as all this card does is crunch Seti.
-
Yeah tried that compute mode thing too. With Fermi's we want 'Normal' mode anyway, so as to allow multiple instances ;).
I've thought about it, and to bypass the issue altogether I'll try get another hard drive sorted sometime soon, and use it to dualboot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave for extended period crunching under XPDM. I have XPx64 as well, but since 64 bit Cuda apps yield a net small slowdown, it seems illogical to use that copy for that.
If it had been a matter of the current ~10% difference stock sees between the driver models, I wouldn't have gone to the trouble of dual-booting. My optimisations, on the other hand, yielding ~30% in favour of XPDM, really force the issue for me (even though they're faster on both OS/driver models), since a lot of the refinement achieved here with a small kernel is likely to apply through most of the application (after a lot of work). That translates to a crapload of compute performance in my book, since the single 480 (on Wolfdale, x32f on Win7 x64) was sustaining 25-26k RAC when there was work. Much as I dislike RAC as a measure, ~10% extra there (current code) would only boost it to ~27-28k or so (within work-dependent variation anyway); ~32k, though, seems more definite & well worth the added effort. ([Edit:] then add optimisation benefit I suppose)
Jason
-
OK, non-critical unless I make computation mistakes ( I was mostly concerned here to not make code slower...). Stock / x32f code there is doing something your GPU doesn't like IMO.
Was that quadro 'integrated & using some portion of system memory ? or does it use dedicated memory ?
I dug out a 'shared memory: no' from a german comparison site. its got 256M of it's own as far as I know.
nvidia control panel system info comes up with
total available 1535 MB
dedicated 256 MB GDDR3
system video 0MB
shared system mem 1279MB
-
Great, thanks. Yes, the dedicated number is the clincher; it's a discrete GPU then, which explains why it didn't trigger a special integrated-GPU optimisation I made in the last test (that particular functionality remains untested/unverified). The shared bit will just be system memory the driver's using for WDDM paging. I only care about that because it looks like Santa might be bringing me an ION2-based netbook (I was a good boy all year, sort of), so poking around with that functionality early seemed a good idea.
-
Yeah tried that compute mode thing too. With Fermi's we want 'Normal' mode anyway, so as to allow multiple instances ;).
I've thought about it, and to bypass the issue altogether I'll try get another hard drive sorted sometime soon, and use it to dualboot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave for extended period crunching under XPDM. I have XPx64 as well, but since 64 bit Cuda apps yield a net small slowdown, it seems illogical to use that copy for that.
Jason
I was thinking about something similar - just ordered a couple of 1TB drives and a RAID controller, so after migrating my current data drives to those (and finally getting some internal redundancy :)) I'll have a spare drive that I was planning on loading with either Linux, or XP if I can find the disk again. A 30% boost is definitely worth the extra effort of setting up the dual-boot. Although I may just run a VM for BOINC (depending on whether I can get the GPUs to be seen by the VM, and how well BOINC operates there) - in that case I'll create an XP VM and run it that way.
-
Extra drives on order here ... hopefully will be able to find a floppy disk for XP raid driver install before they arrive ::).
If you're able to verify (an increased XPDM advantage with the heavily optimised kernels, over the stock ~10% advantage between driver models) before I get set up, I'll report the increased XPDM<->WDDM speed discrepancy with highly optimised kernels... since they may not have factored as much as a 30% performance difference into decisions (related to TCC mode).
-
I've managed to scavenge an old drive from an old machine for this test, so have now got a dual-boot machine for a short time ;)
Just downloading and installing the standard drivers to get a baseline for the test -
Stock results on XP Pro x32 260.99 drivers:
Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 16.0 GFlops 63.8 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 5.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.4 GFlops 17.7 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 23.0 GFlops 91.9 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.7 GFlops 27.2 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.3 GB/s 121.7ulps
-
PowerSpectrumxe2011Test5.exe -device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 11.9 GFlops 47.6 GB/s 0.0ulps
SumMax ( 64) 0.4 GFlops 1.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 1.4 GFlops 5.8 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 18.5 GFlops 73.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
2.1 GFlops 8.3 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
2.4 GFlops 9.6 GB/s 121.7ulps
PowerSpectrumxe2011Test5.exe -device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 11.9 GFlops 47.6 GB/s 0.0ulps
SumMax ( 64) 0.4 GFlops 1.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 1.4 GFlops 5.8 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 18.3 GFlops 73.3 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
2.1 GFlops 8.4 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
2.4 GFlops 9.6 GB/s 121.7ulps
.
Done
Remark:compiled with XE2011
modify:
something must have changed; the last Test5 run above shows
11.2 GFlops 45.3 GB/s 121.7ulps
in last line
-
something must be changed, last Test5 above shows
11.2 GFlops 45.3 GB/s 121.7ulps
Yeah, 11.2 is more like what that card should be doing heinz.
-
Stock results on XP Pro x32 260.99 drivers:
...
PS+SuMx( 64) 4.4 GFlops 17.7 GB/s
...
256 threads, fftlen 64: (worst case: full summax copy)
6.7 GFlops 27.2 GB/s 121.7ulps
...
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.3 GB/s 121.7ulps
OK, so far against your previous results (assuming all else equal), we're back to our roughly ~10% performance advantage to XP:
(XP32-Win7x64)/Win7x64
Stock case: (4.4-4.1)/4.1 = ~7.3 % advantage to XP (expected, not too annoying)
Worst case: (6.7-6.0)/6.0 = ~11.7% advantage to XP ( I can *almost* live with that)
Best case: (8.7-8.7)/8.7 = ~0.0% advantage to XP (fine)
So there appears to be a greater advantage to XP with the worst case (lots of memory transfers), though not as great as feared... Phew! ;D
Since the Memory numbers have more significant digits, and the worst case advantage indicates a memory issue of some sort, I'll compare the throughput figures also:
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case: (35.3-35.4)/35.4 = ~0.3% advantage to Win7
Tentative analysis based on the above: raw compute speed between the two OS/driver models is roughly the same ('best case' has no memory transfer of results); however, WDDM's memory paging schemes increase overheads for the worst case by up to ~14.2% on that system ( 1/(1-0.124) ).
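The arithmetic in this post can be reproduced directly (plain Python; the GB/s figures are the Test #5 memory-throughput numbers quoted above):

```python
# Mirrors the arithmetic in the analysis above: the Test #5 memory
# throughput figures (GB/s) give XP's advantage over Win7, and the
# worst-case advantage is re-expressed as a WDDM overhead factor.
def advantage(xp, win7):
    return (xp - win7) / win7

print(f"stock: {advantage(17.7, 16.5):+.1%}")  # ~+7.3% to XP
print(f"worst: {advantage(27.2, 24.2):+.1%}")  # ~+12.4% to XP
print(f"best:  {advantage(35.3, 35.4):+.1%}")  # ~-0.3% (Win7 ahead)

# The post folds the 12.4% worst-case advantage into an overhead
# factor of 1/(1 - 0.124), i.e. up to ~14.2% extra transfer cost.
overhead = 1 / (1 - 0.124)
print(f"{overhead:.3f}")  # 1.142
```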
So memory transfers will have to be minimised in critical kernels. I can enable a pinned memory optimisation I implemented for integrated GPUs, which might just help the situation. At least we're not looking at the ~30% difference that had me petrified.
Jason
-
@Heinz, something broke in that source you used, investigating.
-
I've been playing with a couple of other versions of drivers (263.xx & 256.xx) as well and there is no improvement over the current 260.99 WHQL release drivers figures.
Was worth doing this just to get an XP machine up and running again - although I'm struggling to remember where anything is.....
-
Was worth doing this just to get an XP machine up and running again - although I'm struggling to remember where anything is.....
Yep going back is a challenge after adapting. Now that I'm pretty confident the memory transfers are the main factor, I'm hopeful a certain 'trick' may squash the difference. We'll see.
[Edit:] Updated first post:
Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither, Test5 remains for comparison)
Will use pinned memory, for Opt1, on GPUs that can do so.
-
Hi Jason,
Getting an error with the new build saying that cudart32_32_7.dll isn't present - is this meant to be in the .7z file?
ghost
-
Just to see if it would run, I made a copy of the cudart32_32_16.dll, renamed it to cudart32_32_7.dll and then ran the test
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 12.9 GFlops 51.4 GB/s 0.0ulps
SumMax ( 64) 1.0 GFlops 4.4 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 3.4 GFlops 13.6 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 19.4 GFlops 77.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
6.0 GFlops 24.4 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
7.0 GFlops 28.2 GB/s 121.7ulps
-
Cheers both, will investigate. Not sure why the build decided to use 32_7 ::) - probably from messing with drivers earlier. Will rebuild shortly & reattach. [Done]
Jason
-
Thanks Jason:
Here's results under XP:
Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 15.8 GFlops 63.3 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 5.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.3 GFlops 17.5 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 23.1 GFlops 92.4 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
7.6 GFlops 30.6 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.3 GB/s 121.7ulps
and under Win 7:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 17.3 GFlops 69.2 GB/s 0.0ulps
SumMax ( 64) 1.2 GFlops 5.2 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.0 GFlops 16.3 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 27.5 GFlops 110.0 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
7.2 GFlops 29.2 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
9.2 GFlops 37.3 GB/s 121.7ulps
-
Ghosts' before Pinned memory usage ( Test #5 memory throughput) :
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case: (35.3-35.4)/35.4 = ~0.3% advantage to Win7
with pinned memory (Test #6 Memory throughput )
Stock case*: (17.5-16.3)/16.3 = ~7.36 advantage to XP (consistent with prior result)
Worst case: (30.6-29.2)/29.2 = ~4.8% advantage to XP (Narrowed)
Best case: (35.3-37.3)/37.3 = ~5.4% advantage to Win7 (!) :o
*Stock code doesn't use pinned memory
Further tentative analysis: hiding memory transfers through the use of pinned (non-pageable) memory for critical datasets, and asynchronous host<->device transfers, helps hide the additional overheads of the WDDM driver model. Careful use of these latency-hiding mechanisms, though complex, can yield improved performance on WDDM platforms when large transfers are needed (as in the 'worst case'), and completely hide costs when transfers are minimised (as in the 'best case'). The end result on WDDM platforms, with partial implementation of the optimisation strategies, will likely be performance that roughly matches XPDM, or exceeds it by some small margin when costs can be totally hidden. This is likely because the WDDM host-memory paging scheme has, in effect, already 'mirrored' some required data between host & device.
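To illustrate the latency-hiding argument with a toy model (entirely my own sketch, with made-up per-chunk timings, not measurements): a synchronous pageable transfer adds its full cost to the kernel time, while an asynchronous transfer from pinned memory can overlap with compute, leaving only the excess visible.

```python
# Toy model of why pinned memory + asynchronous copies help (not
# from the test code): serial = compute then transfer; overlapped =
# transfer hidden behind compute, so only the longer of the two shows.
def serial_time(compute_ms, transfer_ms):
    return compute_ms + transfer_ms

def overlapped_time(compute_ms, transfer_ms):
    return max(compute_ms, transfer_ms)

compute, transfer = 1.0, 0.4   # hypothetical per-chunk times (ms)
print(serial_time(compute, transfer))     # 1.4 ms visible cost
print(overlapped_time(compute, transfer)) # 1.0 ms: transfer hidden
```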
Cheers Alll! Success! ;D More ammunition to go on with helps a lot.
Overall, it seems the Windows 7/Vista WDDM driver model is not slower after all, but requires more careful (& complex) programming to make the implementation efficient.
Jason
-
Brilliant news :D
Ghost
-
Here's the PowerSpectrum6 results on my 9800GTX+ on Win 7 64bit:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 16.1 GFlops 64.6 GB/s 1183.3ulps
SumMax ( 64) 1.4 GFlops 6.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.5 GFlops 18.3 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 16.2 GFlops 64.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
7.1 GFlops 28.7 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
9.9 GFlops 40.0 GB/s 121.7ulps
and on Win Vista 64bit:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 16.1 GFlops 64.3 GB/s 1183.3ulps
SumMax ( 64) 1.4 GFlops 5.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.4 GFlops 17.8 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 16.2 GFlops 64.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
6.9 GFlops 27.8 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
9.9 GFlops 39.9 GB/s 121.7ulps
and on my 128Mb 8400M GS on Vista 32bit:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 1.2 GFlops 4.8 GB/s 1183.3ulps
SumMax ( 64) 0.1 GFlops 0.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 0.4 GFlops 1.5 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 1.2 GFlops 4.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
0.6 GFlops 2.5 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
0.6 GFlops 2.6 GB/s 121.7ulps
Claggy
-
Hehe, those (worst case Opt1) are up a bit (apart from the 8400M's, which I suppose is unsurprising). Looks like we've found the WDDM display driver limitation, and should be able to work around it, with lots of effort.
-
I also added PowerSpectrum5 results for my 9800GTX+ on Vista 64bit, on page Eight (http://lunatics.kwsn.net/12-gpu-crunching/split-powerspectrum-unit-test.msg33630.html#msg33630)
Claggy
-
Cheers, yep was looking back there, definitely confirms the use of pinned memory helped Opt1, a bit more than I expected too.
On the XPDM vs WDDM issue, I've had further confirmation on an 8800GTS, from a non-crunching friend, that Test #5 Opt1 worst case is faster on XPDM than Win7, but roughly the same speed in Test #6 (using pinned memory). The 'best case' is also faster on Win7, so the numbers seem to match up. Make the code a bit more sophisticated & Win7 performance is ~equal to, or a bit faster than, XP.
I'll be stewing on these additional aspects we've worked out here for a little while, and apply the knowledge to expanded tests with more fft sizes ~end of week. If that pans out well, it'll be time to start levering in these small improvements into the X series codebase. After the powerspectrum+reduction is integrated, then will probably be refinement & expansion of the 'freaky powerspectrum' (custom FFT) kernels using the same knowledge.
All this, of course, is working towards 'fixing' the problematic pulsefinding down the road, and having enough strategies to do so effectively.
(Can't wait for the time when I can ask Berkeley to send VLARs back out to GPUs again :P)
Jason
-
9800GTX+, Windows 7/32
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 15.8 GFlops 63.4 GB/s 1183.3ulps
SumMax ( 64) 1.3 GFlops 5.3 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.1 GFlops 16.5 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 15.9 GFlops 63.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
6.9 GFlops 28.1 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
9.8 GFlops 39.5 GB/s 121.7ulps
-
Took a couple of tries but I think I got it right....
Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd \test
C:\test>powerspectrum6.exe >results.txt
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.
C:\test>powerspectrumtest6.exe
Device: GeForce 9500 GT, 1840 MHz clock, 1008 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.8 GFlops 11.3 GB/s 1183.3ulps
SumMax ( 64) 0.4 GFlops 1.9 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 1.2 GFlops 4.9 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.8 GFlops 11.4 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.9 GFlops 7.6 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
2.0 GFlops 8.2 GB/s 121.7ulps
C:\test>
-
Vista64
~~~~
Stopping Boinc...
PowerSpectrumTest6.exe -device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 20.4 GFlops 81.6 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 6.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.6 GFlops 18.7 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 30.0 GFlops 119.9 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
7.1 GFlops 28.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.1 GFlops 45.1 GB/s 121.7ulps
PowerSpectrumTest6.exe -device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 20.4 GFlops 81.8 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 5.9 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.6 GFlops 18.5 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 30.1 GFlops 120.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
7.3 GFlops 29.7 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.2 GFlops 45.2 GB/s 121.7ulps
.
Done
Restarting Boinc...
-
Thanks Richard, perryjay & Heinz.
All fit with the models so far.
The compute capability 1.1 devices (Richard's & Perryjay's) are IMO doing their memory-bound best with the powerspectrum, ~matching stock 'PwrSpec' speed for that, then 'magically' lifting with the reductions (summax) for Opt1 worst case. I believe that must be purely a result of the memory-transfer hiding, since the compute density of the reduction hasn't changed from O(log n).
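For illustration, the O(log n) compute depth of a summax-style reduction can be seen in a generic tree reduction (a plain-Python sketch, not the actual kernel code): each pass halves the working set, so an n-point reduction takes ceil(log2(n)) passes regardless of memory speed.

```python
# Generic tree-style reduction: each pass combines pairs and halves
# the working set, so 64 points reduce in log2(64) = 6 passes.
# Works for any associative op (sum for averages, max for peaks).
def tree_reduce(values, op):
    vals = list(values)
    passes = 0
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        vals = [op(vals[i], vals[i + half]) if i + half < len(vals)
                else vals[i] for i in range(half)]
        passes += 1
    return vals[0], passes

total, steps = tree_reduce(range(64), lambda a, b: a + b)
print(total, steps)  # sum(0..63) = 2016, in 6 passes
```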
@Heinz, glad to see your numbers back up to where they should be. I reckon that's scaling well against my OC'd 480:
Stock (PS+Summax): 5.9 GFlops , 23.7 GB/s
worse (opt1): 10.0 GFlops , 40.4 GB/s
best (opt1): 16.0 GFlops , 64.8 GB/s
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 28.1 GFlops 112.5 GB/s 0.0ulps
SumMax ( 64) 2.3 GFlops 9.6 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 7.2 GFlops 29.2 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 41.4 GFlops 165.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
12.7 GFlops 51.5 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.3 GB/s 121.7ulps
Steve
-
Ouch! 27% more throughput on worst case Opt1 than mine (12.7 vs 10 GFlops) ;D despite a slower powerspectrum (memory), so it can't be the core (same 'best' case @16.1) .... PCIe bus overclocked? (ahh, faster host memory too I suppose)
-
My CPU memory is at 1774 MHz. My PCIe bus is slightly overclocked. I adjusted my GPU RAM to 1900 MHz. There is still room for more. I am on my last GPU wu for Einstein. There aren't any available at the moment. Piggy hit the #5 spot for the top rigs at Einstein with a RAC of over 14,000. There is nothing slow about Piggy. It does a fantastic job at running Starry Night Pro Plus astronomy software. I can't wait to get back to SETI crunching!
Steve
-
My CPU memory is at 1774 MHz. My PCIe bus is slightly overclocked. ..
Whew! That's a relief. My host is only running dual-channel DDR2 memory (Corsair stuff though), so I'm due for some upgrades on the host if it's limiting the 480. Will see if I can hold out till the Sandy Bridge release & get a decent CPU/RAM/mobo to drive it :-\.
-
9800GT, Windows XP/32
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 12.1 GFlops 48.5 GB/s 1183.3ulps
SumMax ( 64) 1.1 GFlops 4.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 3.5 GFlops 14.2 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 12.1 GFlops 48.4 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
5.8 GFlops 23.4 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
7.0 GFlops 28.4 GB/s 121.7ulps
-
Win7 x64
*********
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 26.5 GFlops 105.8 GB/s 1183.3ulps
SumMax ( 64) 2.2 GFlops 9.3 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.8 GFlops 27.3 GB/s
GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 26.7 GFlops 106.9 GB/s 121.7ulps
Opt1 (PSmod3+SM): 128 thrds/block
PowerSpectrumSumMax array pinned in host memory.
128 threads, fftlen 64: (worst case: full summax copy)
11.4 GFlops 46.1 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
15.5 GFlops 62.8 GB/s 121.7ulps
-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 26.1 GFlops 104.3 GB/s 1183.3ulps
SumMax ( 64) 2.2 GFlops 9.2 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.9 GFlops 28.0 GB/s
GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 26.4 GFlops 105.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 128 thrds/block
PowerSpectrumSumMax array pinned in host memory.
128 threads, fftlen 64: (worst case: full summax copy)
11.3 GFlops 45.9 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
15.4 GFlops 62.2 GB/s 121.7ulps
-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 25.5 GFlops 101.9 GB/s 1183.3ulps
SumMax ( 64) 2.1 GFlops 8.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.6 GFlops 26.7 GB/s
GetPowerSpectrum() choice for Opt1: 128 thrds/block
128 threads: 25.9 GFlops 103.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 128 thrds/block
PowerSpectrumSumMax array pinned in host memory.
128 threads, fftlen 64: (worst case: full summax copy)
10.8 GFlops 43.5 GB/s 121.7ulps
Every ifft average & peak OK
128 threads, fftlen 64: (best case, nothing to update)
14.4 GFlops 58.2 GB/s 121.7ulps
-
Ahah, I wondered how the 200 series would respond (haven't had a chance to test on the 260 in the other room yet). Looks like they appreciate the lifting of memory constraints as well. That means we'll probably all start going up in GFlops as we pack in more computation (chirps, FFTs, findspikes, etc.). This latest test appears to be capping out at host memory & PCIe bus speeds, so while faster, it has an artificial ceiling imposed by the current code design & its communication costs (memory & bus bound), rather than by GPU compute performance.
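That kind of ceiling can be reasoned about roofline-style; here's a hedged back-of-envelope in Python (the function name and the numbers are illustrative only, not taken from the test):

```python
def attainable_gflops(peak_compute_gflops, limiting_bw_gbs, flops_per_byte):
    """Roofline-style bound: achieved throughput is capped by whichever
    of raw compute or the narrowest memory/bus link runs out first."""
    return min(peak_compute_gflops, limiting_bw_gbs * flops_per_byte)

# A kernel doing ~1 flop per 4 bytes moved over an ~8 GB/s PCIe link is
# bus bound at ~2 GFlops no matter how fast the GPU core is:
bus_bound = attainable_gflops(1345.0, 8.0, 0.25)
```

Which is why hiding or avoiding the transfers, rather than faster kernels, is what moves the numbers here.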
-
and one small mobile GPU ;) :
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 4.5 GFlops 17.8 GB/s 1183.3ulps
SumMax ( 64) 0.2 GFlops 1.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 0.9 GFlops 3.4 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 4.5 GFlops 17.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.5 GFlops 5.9 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.6 GFlops 6.7 GB/s 121.7ulps
-
Well here is one of my slightly overclocked GTX460.
Running Win7X64 & 260.99 version.
Kind regards Vyper
-
and one small mobile GPU ;) :
The worst case reduction is faster while the powerspectrum stays the same speed, great ;D
-
Well here is one of my slightly overclocked GTX460.
Thanks! We Fermi users are going to need more computation packed in there to bring those GFlops up.
-
ok, a bit of statistics then. average +- std dev over 15 runs
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 4.4 GFlops 17.5 GB/s 1183.3ulps
SumMax ( 64) 0.3 GFlops 1.1 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 0.82 +- 0.086 GFlops 3.5 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 4.37 +- 0.046 GFlops 17.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.37 +- 0.149 GFlops 6.0 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.61 +- 0.026 GFlops 6.6 GB/s 121.7ulps
now if the pink were better distinguishable from the white ::)
would you like that for the GB/s as well?
-
now if the pink were better distinguishable from the white ::)
would you like that for the GB/s as well?
Thanks for the tolerances. Being largely memory bound, the GFlops tolerances are more than enough, and indicate +/- 10% variation on the worst case. I presume that's driving a display, so that's reasonable.
-
Thanks for the tolerances. Being largely memory bound, the GFlops tolerances are more than enough, and indicate +/- 10% variation on the worst case. I presume that's driving a display, so that's reasonable.
You're welcome - now what exactly makes you think the mobile GPU of a laptop might be driving a display? ;D
No bluescreens with the latest driver yet - touch wood...
I'll do statistics on all the numbers next time round then.
-
OK, I ran version 6 of the tool on my system (Q6600/8GB/8800GTX) under both WinXP32 as well as Win7-64. If you want me to (re-)run other versions of the tool, let me know. ;)
Both loggings below each-other, first the oldest, WinXP32:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 18.3 GFlops 73.1 GB/s 1183.3ulps
SumMax ( 64) 1.3 GFlops 5.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.3 GFlops 17.6 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 18.3 GFlops 73.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
6.4 GFlops 26.1 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
8.1 GFlops 32.7 GB/s 121.7ulps
Then Win7-64:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 18.1 GFlops 72.5 GB/s 1183.3ulps
SumMax ( 64) 1.1 GFlops 4.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 3.8 GFlops 15.4 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 18.1 GFlops 72.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
64 threads, fftlen 64: (worst case: full summax copy)
5.4 GFlops 21.9 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
6.6 GFlops 26.8 GB/s 121.7ulps
Regards, Patrick.
-
Ahhh, hi Patrick. Looks like your card should still be able to use pinned host memory, but isn't :( . It indeed doesn't support mapped memory (a different kind), but it didn't engage the pinned-memory improvement because I need to change how I detect that feature. It looks like I'm checking the wrong feature flag... oops ::)
Will make a #7 end of week, and pay special attention to making sure that engages properly on compute capability 1.0 cards (that don't support mapped memory).
Cheers for finding the problem ;)
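For the curious, the fix amounts to gating pinned allocation on nothing at all; only zero-copy needs a capability flag. A Python sketch of just the decision logic (hypothetical function names; the real code queries the CUDA device properties, where `canMapHostMemory` is the mapped-memory flag and pinned allocation via `cudaHostAlloc` needs no flag):

```python
def choice_in_test6(can_map_host_memory):
    # The #6 behaviour: pinned memory was wrongly gated on the
    # mapped-memory capability, so cc 1.0 cards (like the 8800 GTX)
    # silently fell back to pageable host memory.
    return "pinned" if can_map_host_memory else "pageable"

def choice_planned_for_test7(can_map_host_memory):
    # Pinned (page-locked) allocation works on every CUDA device;
    # only mapped/zero-copy access requires the capability check.
    return "mapped" if can_map_host_memory else "pinned"

# An 8800 GTX (cc 1.0, no mapped-memory support):
print(choice_in_test6(False))           # pageable, the bug
print(choice_planned_for_test7(False))  # pinned, the fix
```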
-
Ahhh, hi Patrick. Looks like your card should still be able to use pinned host memory, but isn't :( . It indeed doesn't support mapped memory (a different kind), but it didn't engage the pinned-memory improvement because I need to change how I detect that feature. It looks like I'm checking the wrong feature flag... oops ::)
Will make a #7 end of week, and pay special attention to making sure that engages properly on compute capability 1.0 cards (that don't support mapped memory).
Cheers for finding the problem ;)
I have no idea what I did, but you're quite welcome. ;)
Regards, Patrick.
-
Thanks,
It's what you (the test #6 anyway) didn't do :D
This line's missing:
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.5 GFlops 5.9 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.6 GFlops 6.7 GB/s 121.7ulps
When operational, that feature seems to add a touch of throughput to both XP & Vista/Win7, and seems to close the performance difference we've been so worried about. You should get a boost when I fix that.
Jason
-
Thanks,
It's what you (the test #6 anyway) didn't do :D
This line's missing:
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.5 GFlops 5.9 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.6 GFlops 6.7 GB/s 121.7ulps
When operational, that feature seems to add a touch of throughput to both XP & Vista/Win7, and seems to close the performance difference we've been so worried about. You should get a boost when I fix that.
Jason
Ah, ok, thanks for the elaboration. Looking forward to test #7 then!
Regards, Patrick.
-
Windows XP32. GTX 570. Nvidia Driver 263.09.
Device: GeForce GTX 570, 1464 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 25.6 GFlops 102.5 GB/s 0.0ulps
SumMax ( 64) 1.9 GFlops 7.9 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 6.2 GFlops 25.1 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 33.3 GFlops 133.3 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
10.9 GFlops 44.0 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
13.5 GFlops 54.7 GB/s 121.7ulps
-
570 wooot! ;D
-
570 wooot! ;D
Borrowed it from a friend. It's hot, almost non-overclockable, and slightly slower than 480.
I am really looking forward to AMD HD6950/6970 !
[EDIT] Seems I got a bad sample. I've seen reports where the 570 has been overclocked to 840@4250 (stock: 732@3800) with air cooling.
-
It's hot, almost non-overclockable
Why bother? Harvesting faulty parts, you think?
[Edit:] That worst case is slightly better than my 480 worst case, but the best cases are inferior. From the powerspectrum I see the constraint is memory ( again ::) ) ... so indeed these may not be a good choice for seti in the short term ... probably do Batman really well though ::)
-
probably do Batman really well though ::)
LOL
Yeah ... memory. They chopped the memory interface. I guess they did this so it's not getting too close to the 580, and not too far ahead of the 480.
GTX570: 320bit
GTX480 & GTX580: 384bit
-
@All: In the meantime, having identified the major issues at play in these code areas, along with appropriate techniques to use, I have come up with some ideas for a major redesign of the FFT->Powerspectrum->Summax(reduction)->FindSpikes pipeline, which currently accounts for around 40%-60% of processing.
I'll change the format of the next test quite a bit, and spend time tomorrow to get things underway toward #7.
Jason
-
I would suggest testing these code samples at different GPU-to-memory frequency ratios.
SubSpace's experiments with the beta OpenCL apps showed that this is a very informative approach.
(He established that the HD5 wins over the usual OpenCL MB app if the GPU engine is relatively fast and the memory relatively slow, while at lower GPU clocks the usual app wins.)
I think this largely explains why other testers see longer execution times on VLAR for the HD5 than for the usual app: their GPUs are not fast enough relative to their memory.
Memory influence can be highlighted quite well this way.
-
Yes, in fact that's exactly what happened to confirm the memory-bound nature of what's going on (from yet another angle).
Steve's 480 core is clocked considerably higher than mine, yet he was initially achieving lower throughput than my card. He tweaked his memory throughput for some improvement.
After that, a discrepancy between throughput on XP vs Win7 was noted, somewhere around the familiar 10% difference. I added use of pinned memory for the transfers, to try to hide them. With Ghost's help, in the heavy-transfer case (worst case, full summax array copy, as with stock code) the XP-Win7 difference was narrowed to ~4% or less, while WDDM proved more efficient with the raw processing in the best case (no transfers needed).
Now Steve's 480 achieves some 27% more throughput, in the worst case, than mine does. I take this as an indication that the transfer hiding is shifting the bottleneck around as intended, and that it's time to move on to more sophisticated code portions with the acquired tools & techniques.
Still learning stuff every day with these things.
Jason
-
I have kicked my 480 memory speed up to 1975 MHz, with plenty of room to go. The 480 cores are clocked at 860 MHz. I tried to increase my CPU memory, but 1774 MHz is as fast as I can get it. I was able to increase my CPU speed to 4.26 GHz with hyperthreading enabled, while maintaining about 57°C to 60°C core temps.
Steve
-
I've changed over to Win7 64-bit just before we came back up, so I decided to run test 6 again. Not sure how much of a difference it will make.
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test>powerspectrum4.exe > results.txt
'powerspectrum4.exe' is not recognized as an internal or external command,
operable program or batch file.
C:\test>powerspectrum6.exe
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.
C:\test>powerspectrumtest6.exe
Device: GeForce 9500 GT, 1400 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.9 GFlops 11.4 GB/s 1183.3ulps
SumMax ( 64) 0.3 GFlops 1.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 1.0 GFlops 4.1 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.9 GFlops 11.5 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
1.6 GFlops 6.6 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
1.8 GFlops 7.3 GB/s 121.7ulps
Leave it to me to mess up: EVGA Precision wasn't holding the o/c. I looked all over the place but couldn't find the little button to make it apply at startup until just now. Here's the corrected test...
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test>powerspectrumtest6.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 2.9 GFlops 11.5 GB/s 1183.3ulps
SumMax ( 64) 0.4 GFlops 1.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 1.2 GFlops 4.7 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 2.9 GFlops 11.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
0.7 GFlops 3.0 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
2.1 GFlops 8.3 GB/s 121.7ulps
C:\test>
-
Updated first post:
Update: PowerSpectrum(+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little; should be a tad faster for size 64 than it was in the prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
Please test on all cuda capable cards.
example output:
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.9 GFlops 12.9 GB/s
PS+SuMx( 16) [OK] 3.9 GFlops 16.2 GB/s
PS+SuMx( 32) [OK] 3.9 GFlops 15.8 GB/s
PS+SuMx( 64) [OK] 6.0 GFlops 24.2 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 4.3 18.6 121.7 [OK] 22.8 99.7 121.7
PS+SuMx( 16) 6.7 28.1 121.7 [OK] 21.4 89.7 121.7
PS+SuMx( 32) 9.4 38.6 121.7 [OK] 20.8 85.2 121.7
PS+SuMx( 64) 11.7 47.4 121.7 [OK] 20.4 82.6 121.7
-
My 9800GTX+ on Win 7 x64:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.0 GFlops 8.8 GB/s
PS+SuMx( 16) [OK] 2.6 GFlops 10.7 GB/s
PS+SuMx( 32) [OK] 2.8 GFlops 11.5 GB/s
PS+SuMx( 64) [OK] 4.5 GFlops 18.1 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 2.7 11.8 121.7 [OK] 7.1 31.0 121.7
PS+SuMx( 16) 4.0 16.5 121.7 [OK] 7.7 32.1 121.7
PS+SuMx( 32) 4.9 19.9 121.7 [OK] 7.3 29.7 121.7
PS+SuMx( 64) 6.6 26.7 121.7 [OK] 8.9 35.9 121.7
and on my 128Mb 8400M GS on Vista 32bit:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.3 GFlops 1.3 GB/s
PS+SuMx( 16) [OK] 0.3 GFlops 1.2 GB/s
PS+SuMx( 32) [OK] 0.2 GFlops 0.9 GB/s
PS+SuMx( 64) [OK] 0.4 GFlops 1.5 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 0.4 1.9 121.7 [OK] 0.5 2.1 121.7
PS+SuMx( 16) 0.4 1.8 121.7 [OK] 0.5 1.9 121.7
PS+SuMx( 32) 0.4 1.7 121.7 [OK] 0.4 1.8 121.7
PS+SuMx( 64) 0.5 2.1 121.7 [OK] 0.5 2.2 121.7
Claggy
-
LoL, I thought stock code was already G80 optimised, guess I was WRONG.
-
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.57 +- 0.048 GFlops 2.49 +- 0.24 GB/s
PS+SuMx( 16) [OK] 0.57 +- 0.048 GFlops 2.39 +- 0.19 GB/s
PS+SuMx( 32) [OK] 0.49 +- 0.031 GFlops 2.01 +- 0.11 GB/s
PS+SuMx( 64) [OK] 0.80 +- 0.105 GFlops 3.20 +- 0.41 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 0.87 +- 0.048 3.92 +- 0.20 121.7 [OK] 1.21 +- 0.03 5.49 +- 0.03 121.7
PS+SuMx( 16) 0.89 +- 0.19 3.70 +- 0.78 121.7 [OK] 1.20 +- 0 5.00 +- 0 121.7
PS+SuMx( 32) 0.97 +-0.048 3.92 +- 0.19 121.7 [OK] 1.10 +- 0 4.60 +- 0 121.7
PS+SuMx( 64) 1.24 +- 0.11 5.02 +- 0.42 121.7 [OK] 1.41 +- 0.03 5.85 +- 0.05 121.7
Average and standard deviation over 10 runs.
-
How did you do ten runs, while collecting data, on 'that thing' in that timeframe? Magic?
[ Oh yeah, I set the timer tolerances to do that, I forgot ::)]
-
How did you do ten runs, while collecting data, on 'that thing' in that timeframe? Magic?
[ Oh yeah, I set the timer tolerances to do that, I forgot ::)]
A run takes some 20 seconds, which makes some 5 minutes with generous rounding. Typing the data into Excel and the calculated values back into the post took about half an hour. :P
timer tolerances?
-
timer tolerances?
Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test ;)
[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention on the worst cases randomising things up a bit :)
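The 'timer tolerance' scheme is a fixed time budget per test rather than a fixed iteration count, roughly like this sketch (Python with a stand-in workload and hypothetical names; the real test times CUDA kernels):

```python
import time

def bench(kernel, budget_s=0.5, flops_per_call=1.0e6):
    """Repeat the kernel until the time budget is spent, then report
    iterations and GFlops; faster devices simply fit more runs in."""
    iters = 0
    start = time.perf_counter()
    while time.perf_counter() - start < budget_s:
        kernel()
        iters += 1
    elapsed = time.perf_counter() - start
    return iters, iters * flops_per_call / elapsed / 1e9

# stand-in workload; a short budget just for demonstration
iters, gflops = bench(lambda: sum(range(1000)), budget_s=0.05)
```

So a fast card averages over many more kernel launches per 0.5 s test, which also smooths its reported numbers.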
-
Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test ;)
Well manual data collection works just as well, only more tedious.
[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention on the worst cases randomising things up a bit :)
I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.
-
I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.
Best case requires few memory transfers back to the host CPU ( only one best spike & no detections) ;)
[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)
-
Best case requires few memory transfers back to the host CPU ( only one best spike & no detections) ;)
[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)
Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?
-
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?
Yes. Actual performance will fall somewhere in between the best & worst cases ... :P ... though initially I'll be using 'worst case' code for rapid improvements to working prototypes (size 64 is already in field testing in x33); the best case code is a glass ceiling to aim for with 'advanced coding'.
[Edit:] size 64 (worst case implementation) provides ~3% performance improvement to 'shorties' on GTX 480
[Edit2:] oh, that was 'old' worst case code, nevermind ::)
-
I re-ran the tests on my rig (Q6600/8GB/8800GTX) under both Win764 as well as WinXP32.
First WinXP32:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.2 GFlops 9.6 GB/s
PS+SuMx( 16) [OK] 2.6 GFlops 11.1 GB/s
PS+SuMx( 32) [OK] 2.6 GFlops 10.5 GB/s
PS+SuMx( 64) [OK] 4.3 GFlops 17.5 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.6 15.8 121.7 [OK] 6.2 27.2 121.7
PS+SuMx( 16) 4.5 18.8 121.7 [OK] 6.1 25.5 121.7
PS+SuMx( 32) 4.9 20.1 121.7 [OK] 5.8 23.8 121.7
PS+SuMx( 64) 6.6 26.5 121.7 [OK] 7.4 30.0 121.7
Then Win7-64:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.1 GFlops 9.0 GB/s
PS+SuMx( 16) [OK] 2.4 GFlops 10.2 GB/s
PS+SuMx( 32) [OK] 2.4 GFlops 9.8 GB/s
PS+SuMx( 64) [OK] 3.9 GFlops 15.6 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.4 14.9 121.7 [OK] 6.1 26.8 121.7
PS+SuMx( 16) 4.2 17.5 121.7 [OK] 6.0 25.3 121.7
PS+SuMx( 32) 4.6 18.7 121.7 [OK] 5.8 23.7 121.7
PS+SuMx( 64) 5.9 24.0 121.7 [OK] 7.4 29.8 121.7
As always, hope it helps. ;)
Regards, Patrick.
EDIT: Modified to use no smilies due to the 'cool' smilies in the test-results.
-
Win7x64 results:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8 ) [OK] 2.4 GFlops 10.7 GB/s
PS+SuMx( 16) [OK] 3.1 GFlops 13.0 GB/s
PS+SuMx( 32) [OK] 2.6 GFlops 10.6 GB/s
PS+SuMx( 64) [OK] 4.0 GFlops 16.1 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8 ) 4.9 21.4 121.7 [OK] 13.1 57.4 121.7
PS+SuMx( 16) 6.5 27.2 121.7 [OK] 12.3 51.4 121.7
PS+SuMx( 32) 7.8 31.8 121.7 [OK] 11.9 48.7 121.7
PS+SuMx( 64) 8.6 34.8 121.7 [OK] 11.6 47.0 121.7
-
GTX460 1GB OC Core=880MHz Mem=2000MHz Win7-64bit
C:\Test>powerspectrumtest7
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 3.4 GFlops 14.7 GB/s
PS+SuMx( 16) [OK] 3.5 GFlops 14.7 GB/s
PS+SuMx( 32) [OK] 2.3 GFlops 9.6 GB/s
PS+SuMx( 64) [OK] 3.5 GFlops 14.3 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 6.5 28.4 121.7 [OK] 13.5 59.1 121.7
PS+SuMx( 16) 7.7 32.3 121.7 [OK] 12.6 52.8 121.7
PS+SuMx( 32) 8.5 34.8 121.7 [OK] 12.2 49.8 121.7
PS+SuMx( 64) 9.0 36.3 121.7 [OK] 12.3 49.6 121.7
-
Preparing the usual three:
9800GTX+, Windows 7/32
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 1.7 GFlops 7.4 GB/s
PS+SuMx( 16) [OK] 2.3 GFlops 9.6 GB/s
PS+SuMx( 32) [OK] 2.6 GFlops 10.5 GB/s
PS+SuMx( 64) [OK] 3.9 GFlops 15.9 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.5 15.4 121.7 [OK] 7.1 31.3 121.7
PS+SuMx( 16) 4.0 16.5 121.7 [OK] 7.4 31.0 121.7
PS+SuMx( 32) 4.9 20.0 121.7 [OK] 7.2 29.5 121.7
PS+SuMx( 64) 6.3 25.4 121.7 [OK] 8.8 35.5 121.7
9800GT, Windows XP/32
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 1.7 GFlops 7.2 GB/s
PS+SuMx( 16) [OK] 2.1 GFlops 8.9 GB/s
PS+SuMx( 32) [OK] 2.2 GFlops 9.0 GB/s
PS+SuMx( 64) [OK] 3.6 GFlops 14.5 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 2.5 11.1 121.7 [OK] 5.2 22.9 121.7
PS+SuMx( 16) 3.5 14.7 121.7 [OK] 5.5 23.0 121.7
PS+SuMx( 32) 4.1 16.7 121.7 [OK] 5.2 21.2 121.7
PS+SuMx( 64) 5.4 21.7 121.7 [OK] 6.3 25.7 121.7
GTX 470, Windows XP/32
Device: GeForce GTX 470, 1215 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.3 GFlops 9.9 GB/s
PS+SuMx( 16) [OK] 3.0 GFlops 12.6 GB/s
PS+SuMx( 32) [OK] 3.0 GFlops 12.1 GB/s
PS+SuMx( 64) [OK] 4.8 GFlops 19.3 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.7 16.0 121.7 [OK] 15.6 68.4 121.7
PS+SuMx( 16) 5.7 23.9 121.7 [OK] 14.8 61.8 121.7
PS+SuMx( 32) 7.9 32.5 121.7 [OK] 14.3 58.7 121.7
PS+SuMx( 64) 9.9 39.9 121.7 [OK] 14.0 56.7 121.7
-
Here's mine...
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd/test
C:\test> powerspectrumtest7.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.7 GFlops 3.2 GB/s
PS+SuMx( 16) [OK] 0.8 GFlops 3.5 GB/s
PS+SuMx( 32) [OK] 0.8 GFlops 3.1 GB/s
PS+SuMx( 64) [OK] 1.1 GFlops 4.4 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 1.2 5.4 121.7 [OK] 1.6 6.8 121.7
PS+SuMx( 16) 0.7 3.0 121.7 [OK] 1.5 6.1 121.7
PS+SuMx( 32) 1.4 5.6 121.7 [OK] 1.6 6.4 121.7
PS+SuMx( 64) 1.7 6.7 121.7 [OK] 1.8 7.5 121.7
C:\test>
-
Best case requires few memory transfers back to the host CPU ( only one best spike & no detections) ;)
[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)
Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?
The lower graph on http://setiathome.berkeley.edu/sah_glossary/spike_graphs.php is related, note the log scale on the counts. S@H Enhanced does relatively more short FFT lengths, but there's still a very strong bias toward the long FFT lengths for both reportable and "best" spikes. A quick survey of 44 recent results from my P-M showed 35 best_spikes at fft_len 131072, 6 at fft_len 65536, 2 at fft_len 32768, and 1 at fft_len 16384.
However, the processing order starts at FFT length 8 and works up, so there should be some "worst case" for short FFT lengths during that zero chirp sequence. Subsequent visits to the short FFT lengths are likely to be all "best case". At AR 0.42 FFT length 8 is done 13 times so overall there will be mostly "best case", but at AR 3.0 FFT length 8 is only done once so the probability of "worst case" will be higher.
Note that our test WUs shortened by lowering chirp limits will have a higher proportion of the zero chirp worst cases than full length WUs. In general I think that's good, brief sloppy tests which slightly underestimate improvement from optimization are better than those which cause unwarranted enthusiasm. But it would also be possible to create a set of test WUs shortened by adjusting chirp resolution which would give better quick test timing.
Edit: Jason, result_overflow is triggered by the 31st found signal...
Joe
-
And now the GTX460-768 card,
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.2 GFlops 9.7 GB/s
PS+SuMx( 16) [OK] 2.8 GFlops 11.5 GB/s
PS+SuMx( 32) [OK] 2.1 GFlops 8.7 GB/s
PS+SuMx( 64) [OK] 3.4 GFlops 13.6 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 4.2 18.3 121.7 [OK] 11.1 48.5 121.7
PS+SuMx( 16) 5.8 24.5 121.7 [OK] 10.5 44.1 121.7
PS+SuMx( 32) 7.2 29.7 121.7 [OK] 10.2 41.7 121.7
PS+SuMx( 64) 8.4 33.9 121.7 [OK] 10.2 41.5 121.7
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 5.0 GFlops 22.0 GB/s
PS+SuMx( 16) [OK] 6.0 GFlops 25.3 GB/s
PS+SuMx( 32) [OK] 4.7 GFlops 19.2 GB/s
PS+SuMx( 64) [OK] 7.2 GFlops 29.1 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 9.0 39.2 121.7 [OK] 23.0 100.7 121.7
PS+SuMx( 16) 11.7 49.0 121.7 [OK] 21.7 90.8 121.7
PS+SuMx( 32) 13.6 55.8 121.7 [OK] 21.1 86.4 121.7
PS+SuMx( 64) 15.1 61.2 121.7 [OK] 20.7 83.7 121.7
Steve
-
Thanks all for the massive amount of data ;D , will peruse it to see if anything's amiss, but I think I've found the sweet spot for 'worst case' at the moment, which is a straightforward implementation. I'm delighted that nothing seems to be broken on any GPU tested so far. There is a lot of work to do to add the remaining sizes into the test (remaining powers of 2 up to 128k or so, maybe some larger sizes for growing room), then add FFTs & Findspikes on either side of this pipeline. Once that's done it looks like I can stripe the processing to fit Fermi's L2 cache, right through this pipeline, which should speed things up a lot for those cards.
@Joe, Thanks!, I keep forgetting it's 31 not 30 ::) probably would have found it the hard way (again), but the heads up helps.
Jason
-
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 4.5 GFlops 19.6 GB/s
PS+SuMx( 16) [OK] 5.0 GFlops 20.9 GB/s
PS+SuMx( 32) [OK] 4.6 GFlops 18.7 GB/s
PS+SuMx( 64) [OK] 7.0 GFlops 28.4 GB/s
Opt1: 128 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 6.1 26.7 121.7 [OK] 11.7 51.4 121.7
PS+SuMx( 16) 7.5 31.2 121.7 [OK] 11.5 48.0 121.7
PS+SuMx( 32) 8.7 35.6 121.7 [OK] 12.0 48.9 121.7
PS+SuMx( 64) 10.9 44.1 121.7 [OK] 14.5 58.9 121.7
-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 4.4 GFlops 19.3 GB/s
PS+SuMx( 16) [OK] 4.9 GFlops 20.6 GB/s
PS+SuMx( 32) [OK] 4.5 GFlops 18.5 GB/s
PS+SuMx( 64) [OK] 6.9 GFlops 27.9 GB/s
Opt1: 128 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 6.0 26.3 121.7 [OK] 11.6 50.8 121.7
PS+SuMx( 16) 7.3 30.5 121.7 [OK] 11.4 47.7 121.7
PS+SuMx( 32) 8.6 35.1 121.7 [OK] 11.7 48.1 121.7
PS+SuMx( 64) 10.7 43.3 121.7 [OK] 14.4 58.2 121.7
-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 4.3 GFlops 18.7 GB/s
PS+SuMx( 16) [OK] 4.8 GFlops 19.9 GB/s
PS+SuMx( 32) [OK] 4.3 GFlops 17.6 GB/s
PS+SuMx( 64) [OK] 6.6 GFlops 26.8 GB/s
Opt1: 128 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 5.8 25.5 121.7 [OK] 10.9 47.5 121.7
PS+SuMx( 16) 7.1 29.7 121.7 [OK] 10.6 44.3 121.7
PS+SuMx( 32) 8.2 33.7 121.7 [OK] 11.0 45.2 121.7
PS+SuMx( 64) 10.4 42.0 121.7 [OK] 13.5 54.7 121.7
-
Aha! We're finding the 2xx series limits at last. 'Best case' is tapering off sooner & is clearly compute bound, while the worst cases show the limit of GDDR3 against Fermi's GDDR5 memory.
Fermi best cases appear to be limited by the memory subsystem still, so down the road I'll be striping (streaming) this pipeline to fit in those cache levels. That should lift the apparent ~20 GFlops limit a bit on Fermis. Unfortunately the 2xx cards don't have those cache levels, so we might be reaching a limit with those in some respects.
@glennaxl: could you confirm that the 200 series cards are reaching near ~100% GPU utilisation during the Opt1 tests (higher than the stock portion) ? I can lengthen the test sequence if needed.
[A bit later:] Extending the tests from 0.5 to 5 seconds allowed me to see what the 480 is doing as a cross check. Looks like the Opt1 best cases are reaching ~100%, and Opt1 worst cases are bandwidth limited, all as expected, no surprises yet.
[Still later:] I've added the extended PowerSpectrumTest7 to the first post. I don't need data for the extended test (results are more or less the same), but provide it for those who want to see GPU utilisation differences between the test phases on their cards, like the attached image.
Moving on to larger sizes & FFT integration, after some beer ;)
-
@glennaxl: could you confirm that the 200 series cards are reaching near ~100% GPU utilisation during the Opt1 tests (higher than the stock portion) ? I can lengthen the test sequence if needed.
Yes, Opt1 spikes to 99%.
-
Cheers!
-
7_extended
~~~~~~~
PowerSpectrumTest7_extended.exe -device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.7 GFlops 12.0 GB/s
PS+SuMx( 16) [OK] 3.7 GFlops 15.6 GB/s
PS+SuMx( 32) [OK] 3.3 GFlops 13.7 GB/s
PS+SuMx( 64) [OK] 5.1 GFlops 20.7 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 4.9 21.5 121.7 [OK] 17.6 77.2 121.7
PS+SuMx( 16) 7.1 29.7 121.7 [OK] 16.7 69.8 121.7
PS+SuMx( 32) 8.3 34.1 121.7 [OK] 16.2 66.4 121.7
PS+SuMx( 64) 10.2 41.3 121.7 [OK] 16.0 64.6 121.7
PowerSpectrumTest7_extended.exe -device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.7 GFlops 12.0 GB/s
PS+SuMx( 16) [OK] 3.7 GFlops 15.4 GB/s
PS+SuMx( 32) [OK] 3.4 GFlops 13.9 GB/s
PS+SuMx( 64) [OK] 5.1 GFlops 20.7 GB/s
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 5.0 21.8 121.7 [OK] 17.7 77.4 121.7
PS+SuMx( 16) 7.1 29.9 121.7 [OK] 16.7 70.0 121.7
PS+SuMx( 32) 8.9 36.5 121.7 [OK] 16.3 66.6 121.7
PS+SuMx( 64) 10.5 42.4 121.7 [OK] 16.0 64.7 121.7
.
Done
gpuload (http://www.britta-d.de/images/powerspectrum/ps_gpuload_test7ext.jpg)
I had never seen this Memory Controller load spike before; comparing with PrimeGrid, it shows nothing.
gpuload_prime (http://www.britta-d.de/images/primegrid/pg_gtx470_gpuz_stable_sensors.jpg)
-
7 extended ION
~~~~~~~~~~
PowerSpectrumTest7_extended.exe -device 0
Device: ION, 1100 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.4 GFlops 1.5 GB/s
PS+SuMx( 16) [OK] 0.3 GFlops 1.4 GB/s
PS+SuMx( 32) [OK] 0.3 GFlops 1.1 GB/s
PS+SuMx( 64) [OK] 0.4 GFlops 1.7 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 0.5 2.4 121.7 [OK] 0.6 2.8 121.7
PS+SuMx( 16) 0.6 2.3 121.7 [OK] 0.6 2.6 121.7
PS+SuMx( 32) 0.5 2.2 121.7 [OK] 0.6 2.3 121.7
PS+SuMx( 64) 0.7 2.7 121.7 [OK] 0.7 2.9 121.7
.
Done
Hmm, how to interpret this? The stock value of 1.7 GB/s looks much better with the ION.
Must look up the ION device properties.
CUDA: ION
Field Value
Device properties
Device name ION
Clock rate 1100 MHz
Multiprocessors / Cores 2 / 16
Max Threads Per Block 512
Max Registers Per Block 8192
Warp Size 32 threads
Max Block Size 512 x 512 x 64
Max Grid Size 65535 x 65535 x 1
Compute Capability 1.1
CUDA DLL nvcuda.dll (8.17.12.6061 - nVIDIA ForceWare 260.61)
Memory properties
Total Memory 241 MB
Total Constant Memory 64 KB
Max Shared Memory Per Block 16 KB
Max Memory Pitch 2147483647 Bytes
Texture Alignment 256 Bytes
Device features
32-bit Floating-Point Atomic Addition Not supported
32-bit Integer Atomic Operations Supported
64-bit Integer Atomic Operations Not supported
Concurrent Memory Copy & Execute Not supported
Double-Precision Floating-Point Not supported
Warp Vote Functions Not supported
__ballot() Not supported
__syncthreads_and() Not supported
__syncthreads_count() Not supported
__syncthreads_or() Not supported
__threadfence_system() Not supported
Device manufacturer
Company name NVIDIA Corporation
Product information http://www.nvidia.com/page/products.html
Driver download http://www.nvidia.com/content/drivers/drivers.asp
Driver update http://www.aida64.com/driver-updates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OPEN_CL
~~~~~~~
OpenCL: ION
Field Value
OpenCL Properties
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.0 CUDA 3.2.1
Platform Profile Full
Device properties
Device name ION
Device type Graphics processor (GPU)
Device Vendor NVIDIA Corporation
Device Version OpenCL 1.0 CUDA
Device Profile Full
Clock rate 1100 MHz
Multiprocessors 2
Max 2D Image Size 4096 x 32768
Max 3D Image Size 2048 x 2048 x 2048
Max Samplers 16
Max Work-Item Size 512 x 512 x 64
Max Work-Group Size 512
Max Argument Size 4352 Bytes
Max Constant Buffer Size 64 KB
Max Constant Arguments 9
Profiling Timer Resolution 1000 ns
OpenCL DLL opencl.dll (1.0.0)
Memory properties
Global Memory 241 MB
Local Memory 16 KB
Memory Base Address Alignment 2048 Bit
Min Data Type Alignment 128 Bytes
Device features
Command-Queue Out Of Order Execution Enabled
Command-Queue Profiling Enabled
Compiler Supported
Error correction Not supported
Images Supported
Kernel Execution Supported
Native Kernel Execution Not supported
Device Extensions
cl_amd_d3d10_interop Not supported
cl_amd_d3d9_interop Not supported
cl_amd_device_attribute_query Not supported
cl_amd_fp64 Not supported
cl_amd_media_ops Not supported
cl_amd_printf Not supported
cl_khr_3d_image_writes Not supported
cl_khr_byte_addressable_store Supported
cl_khr_d3d10_sharing Supported
cl_khr_fp16 Not supported
cl_khr_fp64 Not supported
cl_khr_gl_sharing Supported
cl_khr_global_int32_base_atomics Supported
cl_khr_global_int32_extended_atomics Supported
cl_khr_icd Supported
cl_khr_int64_base_atomics Not supported
cl_khr_int64_extended_atomics Not supported
cl_khr_local_int32_base_atomics Not supported
cl_khr_local_int32_extended_atomics Not supported
cl_khr_select_fprounding_mode Not supported
cl_nv_compiler_options Supported
cl_nv_d3d10_sharing Supported
cl_nv_d3d11_sharing Supported
cl_nv_d3d9_sharing Supported
cl_nv_device_attribute_query Supported
cl_nv_pragma_unroll Supported
Device manufacturer
Company name NVIDIA Corporation
Product information http://www.nvidia.com/page/products.html
Driver download http://www.nvidia.com/content/drivers/drivers.asp
Driver update http://www.aida64.com/driver-updates
-
Hmm, how to interpret this? The stock value of 1.7 GB/s looks much better with the ION.
Must look up the ION device properties.
No, your labels are misaligned, Heinz, I'll fix them for you... [Done: 2.7 GB/s is a bit better than 1.7 GB/s.]
[Edit] Fixed it again, and fixed the 470 ones so you can read them properly ;)
-
Thanks Jason,
must clean my glasses ::)
-
7 extended ION
~~~~~~~~~~
Rerun, now lightly OC'd from 450 / 800 / 1100 to 475 / 850 / 1161:
PowerSpectrumTest7_extended.exe -device 0
Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.4 GFlops 1.6 GB/s
PS+SuMx( 16) [OK] 0.3 GFlops 1.4 GB/s
PS+SuMx( 32) [OK] 0.3 GFlops 1.1 GB/s
PS+SuMx( 64) [OK] 0.4 GFlops 1.8 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 0.6 2.5 121.7 [OK] 0.7 2.9 121.7
PS+SuMx( 16) 0.6 2.4 121.7 [OK] 0.6 2.7 121.7
PS+SuMx( 32) 0.6 2.3 121.7 [OK] 0.6 2.4 121.7
PS+SuMx( 64) 0.7 2.8 121.7 [OK] 0.8 3.1 121.7
.
Done
Modify: the latest GPU-Z 0.4.9 does not show any Memory Controller load (http://www.britta-d.de/images/powerspectrum/ps_7ext_ION_gpuz_no_memory_controller_load.jpg).
Looks like an issue?
Further, it shows 4 ROPs (http://www.britta-d.de/images/powerspectrum/ps_gpuz_049_show_ROPs_4.jpg) for the ION, but it has 2 multiprocessors (as far as I know).
Emailed TechPowerUp.
-
Well, to get that 1.5-2x speedup on the small GPU, we went a bit further than what the nVidia documentation specifies for efficient reductions, and the code 'looks nice' (a good sign in engineering)... Still the larger sizes to go; might have to send some notes back to nVidia after we finish this, to update the optimisation manual a bit :o
-
Looks like an issue?
Not 'our' problem ;) See what MSI Afterburner says (for memory). Maybe they confuse ION & ION2, don't know.
-
First post updated:
Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- Minimal changes to larger sizes; sizes larger than the selected thrds/blk are 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation all [OK] & no Opt1 'worst case' slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- Having all sizes makes it a longer run, and it may require several runs to see if a '[BAD]' will manifest.
Please test repeatedly on all CUDA enabled GPUs... No posting of results please (too large for me to look through, I'll go crosseyed ;)), just confirm all Opt1 [OK] & faster at all sizes, and alert me if you see any marked [BAD] or too slow; it may need several runs for a problem to appear.
Jason
-
All systems are go except....
gtx 295
core 0 - 1 bad at test 1/5 under 128 size
core 1 - 1 slow at test 2/5 under 128 size
gtx 260 - 1 slow at test 4/5 under 256 size
-
All systems are go except....
gtx 295
core 0 - 1 bad at test 1/5 under 128 size
core 1 - 1 slow at test 2/5 under 128 size
gtx 260 - 1 slow at test 4/5 under 256 size
Thanks! On the 295, is the video memory OC'd? I found here that Opt1 around size = #thrds/block (256 on Fermi, 128 on 2xx) can be unstable if the VRAM OC is pushed. I had to back off my video memory OC by 80MHz for it to stabilise.
GTX260 - Please run that one a few times & see if that's consistently slower than stock at size 256. Will be checking that code in the meantime.
[Edit:] I see you did, & got one slow out of 5 ... OK
[Edit2:] Darn, 128 is still a little unstable here too ???, will dial sizes 128 & 256 back & replace the test shortly (might be pushing a tad hard)
Jason
-
@glennaxl: I have updated the PowerSpectrumTest8 archive attached to the first post, to dial back the borderline kernels a bit (for now; will dig deeper into those later if needed).
Jason
-
@glennaxl: I have updated the PowerSpectrumTest8 archive attached to the first post, to dial back the borderline kernels a bit (for now; will dig deeper into those later if needed).
Jason
Yeah, my GTX 295 VRAM is OC'd to 1080 from 999.
The new test8 runs are all good now. Perfect! ;)
-
The new test8 runs are all good now. Perfect! ;)
Good, good. I'll keep those ones dialled back a bit then, allowing some possible fine tuning later. It seems that cramming this much data through, we're beginning to find weak spots, so I'll look at moving on to FFT integration.
-
Hi Jason,
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Stock best result
PS+SuMx( 32768) [OK] 12.7 GFlops 50.7 GB/s
Opt best result
PS+SuMx( 32768) 16.4 65.7 121.7 [OK] 27.8 111.4 121.7
all others are ok
-
Hi Jason,
Excellent performance on the ION, worth posting the full result:
PowerSpectrumTest8.exe -device 0
Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 0.4 GFlops 1.6 GB/s
PS+SuMx( 16) [OK] 0.3 GFlops 1.5 GB/s
PS+SuMx( 32) [OK] 0.3 GFlops 1.1 GB/s
PS+SuMx( 64) [OK] 0.4 GFlops 1.8 GB/s
PS+SuMx( 128) [OK] 0.7 GFlops 2.7 GB/s
PS+SuMx( 256) [OK] 0.8 GFlops 3.4 GB/s
PS+SuMx( 512) [OK] 1.1 GFlops 4.3 GB/s
PS+SuMx( 1024) [OK] 1.1 GFlops 4.4 GB/s
PS+SuMx( 2048) [OK] 1.2 GFlops 4.9 GB/s
PS+SuMx( 4096) [OK] 1.2 GFlops 4.8 GB/s
PS+SuMx( 8192) [OK] 1.3 GFlops 5.2 GB/s
PS+SuMx( 16384) [OK] 1.3 GFlops 5.1 GB/s
PS+SuMx( 32768) [OK] 1.3 GFlops 5.4 GB/s
PS+SuMx( 65536) [OK] 1.4 GFlops 5.4 GB/s
PS+SuMx(131072) [OK] 1.4 GFlops 5.6 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 0.6 2.5 121.7 [OK] 0.7 2.9 121.7
PS+SuMx( 16) 0.6 2.4 121.7 [OK] 0.6 2.7 121.7
PS+SuMx( 32) 0.6 2.3 121.7 [OK] 0.6 2.4 121.7
PS+SuMx( 64) 0.7 2.8 121.7 [OK] 0.7 3.0 121.7
PS+SuMx( 128) 0.7 2.7 121.7 [OK] 0.7 3.0 121.7
PS+SuMx( 256) 0.9 3.5 121.7 [OK] 1.0 3.9 121.7
PS+SuMx( 512) 1.1 4.5 121.7 [OK] 1.2 5.0 121.7
PS+SuMx( 1024) 1.2 4.6 121.7 [OK] 1.3 5.1 121.7
PS+SuMx( 2048) 1.3 5.3 121.7 [OK] 1.5 5.9 121.7
PS+SuMx( 4096) 1.3 5.0 121.7 [OK] 1.4 5.6 121.7
PS+SuMx( 8192) 1.4 5.5 121.7 [OK] 1.5 6.1 121.7
PS+SuMx( 16384) 1.3 5.4 121.7 [OK] 1.5 6.0 121.7
PS+SuMx( 32768) 1.4 5.7 121.7 [OK] 1.6 6.4 121.7
PS+SuMx( 65536) 1.4 5.8 121.7 [OK] 1.6 6.5 121.7
PS+SuMx(131072) 1.2 4.8 121.7 [OK] 1.7 6.6 121.7
.
Done
-
Yes, size 128k drops off a bit on mine too, not sure why yet.
-
Was able to get results for GSO9600 at last:
Device: GeForce 9600 GSO, 1700 MHz clock, 384 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 1.2 GFlops 5.4 GB/s
PS+SuMx( 16) [OK] 1.6 GFlops 6.9 GB/s
PS+SuMx( 32) [OK] 1.8 GFlops 7.3 GB/s
PS+SuMx( 64) [OK] 2.9 GFlops 11.8 GB/s
PS+SuMx( 128) [OK] 4.3 GFlops 17.1 GB/s
PS+SuMx( 256) [OK] 5.5 GFlops 22.1 GB/s
PS+SuMx( 512) [OK] 6.7 GFlops 27.0 GB/s
PS+SuMx( 1024) [OK] 7.0 GFlops 28.1 GB/s
PS+SuMx( 2048) [OK] 7.7 GFlops 30.8 GB/s
PS+SuMx( 4096) [OK] 7.6 GFlops 30.4 GB/s
PS+SuMx( 8192) [OK] 7.9 GFlops 31.6 GB/s
PS+SuMx( 16384) [OK] 7.7 GFlops 31.0 GB/s
PS+SuMx( 32768) [OK] 8.1 GFlops 32.5 GB/s
PS+SuMx( 65536) [OK] 7.8 GFlops 31.3 GB/s
PS+SuMx(131072) [OK] 8.0 GFlops 32.2 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 1.5 6.5 121.7 [OK] 4.5 19.6 121.7
PS+SuMx( 16) 2.3 9.6 121.7 [OK] 4.8 20.0 121.7
PS+SuMx( 32) 3.0 12.1 121.7 [OK] 4.5 18.5 121.7
PS+SuMx( 64) 3.1 12.7 121.7 [OK] 5.4 21.7 121.7
PS+SuMx( 128) 4.5 18.1 121.7 [OK] 5.3 21.3 121.7
PS+SuMx( 256) 5.8 23.1 121.7 [OK] 6.5 25.9 121.7
PS+SuMx( 512) 6.9 27.8 121.7 [OK] 7.5 30.0 121.7
PS+SuMx( 1024) 7.3 29.1 121.7 [OK] 7.8 31.2 121.7
PS+SuMx( 2048) 7.9 31.5 121.7 [OK] 8.4 33.6 121.7
PS+SuMx( 4096) 7.8 31.1 121.7 [OK] 8.2 32.6 121.7
PS+SuMx( 8192) 8.1 32.3 121.7 [OK] 8.5 33.9 121.7
PS+SuMx( 16384) 7.9 31.5 121.7 [OK] 8.2 32.8 121.7
PS+SuMx( 32768) 8.1 32.5 121.7 [OK] 8.6 34.6 121.7
PS+SuMx( 65536) 5.7 22.7 121.7 [OK] 8.3 33.2 121.7
PS+SuMx(131072) 8.2 32.6 121.7 [OK] 8.5 34.1 121.7
-
Okay, here's test 8. Figured it would be better for me to post it rather than try to explain what I don't understand. :8
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test> powerspectrumtest8.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 0.7 GFlops 3.1 GB/s
PS+SuMx( 16) [OK] 0.8 GFlops 3.2 GB/s
PS+SuMx( 32) [OK] 0.7 GFlops 3.0 GB/s
PS+SuMx( 64) [OK] 1.0 GFlops 4.2 GB/s
PS+SuMx( 128) [OK] 0.8 GFlops 3.4 GB/s
PS+SuMx( 256) [OK] 1.6 GFlops 6.6 GB/s
PS+SuMx( 512) [OK] 2.0 GFlops 7.8 GB/s
PS+SuMx( 1024) [OK] 2.1 GFlops 8.2 GB/s
PS+SuMx( 2048) [OK] 2.1 GFlops 8.2 GB/s
PS+SuMx( 4096) [OK] 2.0 GFlops 8.1 GB/s
PS+SuMx( 8192) [OK] 2.1 GFlops 8.4 GB/s
PS+SuMx( 16384) [OK] 2.1 GFlops 8.4 GB/s
PS+SuMx( 32768) [OK] 0.5 GFlops 1.9 GB/s
PS+SuMx( 65536) [OK] 0.4 GFlops 1.5 GB/s
PS+SuMx(131072) [OK] 2.1 GFlops 8.5 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 1.1 4.8 121.7 [OK] 1.5 6.8 121.7
PS+SuMx( 16) 1.2 5.0 121.7 [OK] 1.7 6.9 121.7
PS+SuMx( 32) 1.2 5.0 121.7 [OK] 1.5 6.1 121.7
PS+SuMx( 64) 0.5 1.9 121.7 [OK] 1.7 7.1 121.7
PS+SuMx( 128) 0.6 2.5 121.7 [OK] 1.8 7.2 121.7
PS+SuMx( 256) 0.6 2.3 121.7 [OK] 2.1 8.3 121.7
PS+SuMx( 512) 2.0 8.1 121.7 [OK] 2.5 10.1 121.7
PS+SuMx( 1024) 1.9 7.8 121.7 [OK] 2.6 10.3 121.7
PS+SuMx( 2048) 2.1 8.6 121.7 [OK] 2.6 10.3 121.7
PS+SuMx( 4096) 0.5 2.1 121.7 [OK] 2.5 10.0 121.7
PS+SuMx( 8192) 2.2 8.7 121.7 [OK] 2.8 11.1 121.7
PS+SuMx( 16384) 2.1 8.2 121.7 [OK] 2.7 10.9 121.7
PS+SuMx( 32768) 2.2 8.8 121.7 [OK] 2.8 11.1 121.7
PS+SuMx( 65536) 2.2 8.9 121.7 [OK] 2.8 11.2 121.7
PS+SuMx(131072) 2.3 9.2 121.7 [OK] 2.8 11.3 121.7
C:\test>
-
Was able to get results for GSO9600 at last:
Ouch, not much headroom between worst & best (fast GDDR3 memory on the 9600GSO, IIRC). I reckon the 64k size is an anomaly worth looking into, as with the 128k drop-off on other cards (like the ION). Thankfully that part (larger sizes) is mostly stock, so there should be plenty of tweaking possibilities... even if only for a GFlop here and there.
-
Okay, here's test 8. Figured it would be better for me to post it rather than try to explain what I don't understand. :8
Thanks. A couple of sizes are choking there for whatever reason. I think I'm going to have to improve everything from sizes 64 & 128 upward before moving on to the FFTs... Nice that it's working with all '[OK]'.
-
All OK (5 runs) on GTX 465:
Stock Best Result:
PS+SuMx( 32768) [OK] 12.2 GFlops 48.7 GB/s
Opt1 Best Result:
Opt1: 256 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 32768) 17.7 71.0 121.7 [OK] 24.6 98.2 121.7
-
Hey Ghost, what's the memory bus width & memory clock on that 465 ?
-
Hey Ghost, what's the memory bus width & memory clock on that 465 ?
Here's a GPU-Z image for the card
-
PS+SuMx( 32768) 17.7 71.0 121.7 [OK] 24.6 98.2 121.7
Hmm this *could* be near max theoretical then ... checking
-
PS+SuMx( 32768) 17.7 71.0 121.7 [OK] 24.6 98.2 121.7
Hmm this *could* be near max theoretical then ... checking
Thats good :D
Was getting a nice capacitor whine when running the tests, so knew it was being pushed hard!
-
I calculate 122.24 GB/s theoretical max (matching GPU-z listing), so 98.2 seems pretty good. I'll look at what that size is doing & see if I can spread some performance around up in that area.
[Edit:] I get the impression we might be best off seeing what streaming those kernels will do sometime soon :-\ too many new-fangled features in this stuff ;)
-
OK here on 9800GTX+ (5 runs) GPU usage up from stock's ~80% to ~95% on Opt1:
Best Stock result:
PS+SuMx( 65536) [OK] 11.6 GFlops 46.5 GB/s
Opt1 Best Result:
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 65536) 13.0 52.1 121.7 [OK] 15.6 62.5 121.7
and OK on the 128MB 8400M GS (5 runs):
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 0.3 GFlops 1.3 GB/s
PS+SuMx( 16) [OK] 0.3 GFlops 1.2 GB/s
PS+SuMx( 32) [OK] 0.2 GFlops 0.9 GB/s
PS+SuMx( 64) [OK] 0.4 GFlops 1.5 GB/s
PS+SuMx( 128) [OK] 0.5 GFlops 2.2 GB/s
PS+SuMx( 256) [OK] 0.7 GFlops 2.8 GB/s
PS+SuMx( 512) [OK] 0.8 GFlops 3.4 GB/s
PS+SuMx( 1024) [OK] 0.9 GFlops 3.5 GB/s
PS+SuMx( 2048) [OK] 1.0 GFlops 4.0 GB/s
PS+SuMx( 4096) [OK] 0.9 GFlops 3.7 GB/s
PS+SuMx( 8192) [OK] 1.0 GFlops 4.0 GB/s
PS+SuMx( 16384) [OK] 1.0 GFlops 3.9 GB/s
PS+SuMx( 32768) [OK] 1.0 GFlops 4.1 GB/s
PS+SuMx( 65536) [OK] 1.1 GFlops 4.2 GB/s
PS+SuMx(131072) [OK] 1.1 GFlops 4.3 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 0.4 1.9 121.7 [OK] 0.5 2.1 121.7
PS+SuMx( 16) 0.4 1.8 121.7 [OK] 0.5 1.9 121.7
PS+SuMx( 32) 0.4 1.7 121.7 [OK] 0.4 1.7 121.7
PS+SuMx( 64) 0.5 2.1 121.7 [OK] 0.5 2.2 121.7
PS+SuMx( 128) 0.6 2.2 121.7 [OK] 0.6 2.3 121.7
PS+SuMx( 256) 0.7 2.9 121.7 [OK] 0.7 3.0 121.7
PS+SuMx( 512) 0.9 3.5 121.7 [OK] 0.9 3.6 121.7
PS+SuMx( 1024) 0.9 3.5 121.7 [OK] 0.9 3.7 121.7
PS+SuMx( 2048) 1.0 4.0 121.7 [OK] 1.0 4.2 121.7
PS+SuMx( 4096) 0.9 3.8 121.7 [OK] 1.0 3.9 121.7
PS+SuMx( 8192) 1.0 4.0 121.7 [OK] 1.0 4.2 121.7
PS+SuMx( 16384) 1.0 4.0 121.7 [OK] 1.0 4.1 121.7
PS+SuMx( 32768) 1.1 4.2 121.7 [OK] 1.1 4.3 121.7
PS+SuMx( 65536) 1.1 4.3 121.7 [OK] 1.1 4.5 121.7
PS+SuMx(131072) 1.1 4.4 121.7 [OK] 1.1 4.5 121.7
Claggy
-
I ran this on my usual rig (Q6600/8GB/8800GTX), but version 8 added something new: an error. Under WinXP it just shows the error, but under Win7-64 the screen turns black and I get a "driver stopped responding" error. Running 260.99.
First the WinXP-32 log:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 2.2 GFlops 9.7 GB/s
PS+SuMx( 16) [OK] 2.6 GFlops 11.1 GB/s
PS+SuMx( 32) [OK] 2.6 GFlops 10.5 GB/s
PS+SuMx( 64) [OK] 4.3 GFlops 17.6 GB/s
PS+SuMx( 128) [OK] 6.7 GFlops 26.9 GB/s
PS+SuMx( 256) [OK] 9.0 GFlops 36.0 GB/s
PS+SuMx( 512) [OK] 11.2 GFlops 44.7 GB/s
PS+SuMx( 1024) [OK] 11.8 GFlops 47.4 GB/s
PS+SuMx( 2048) [OK] 13.5 GFlops 53.9 GB/s
PS+SuMx( 4096) [OK] 13.2 GFlops 52.6 GB/s
PS+SuMx( 8192) [OK] 14.4 GFlops 57.4 GB/s
PS+SuMx( 16384) [OK] 14.1 GFlops 56.4 GB/s
PS+SuMx( 32768) [OK] 14.9 GFlops 59.5 GB/s
PS+SuMx( 65536) [OK] 15.3 GFlops 61.1 GB/s
PS+SuMx(131072) [OK] 11.9 GFlops 47.7 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.6 15.8 121.7 [OK] 6.2 27.2 121.7
PS+SuMx( 16) 4.5 18.8 121.7 [OK] 6.1 25.5 121.7
PS+SuMx( 32) 4.9 20.1 121.7 [OK] 5.8 23.8 121.7
PS+SuMx( 64)
FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456
Then the Win7-64 log:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 2.0 GFlops 9.0 GB/s
PS+SuMx( 16) [OK] 2.4 GFlops 10.2 GB/s
PS+SuMx( 32) [OK] 2.4 GFlops 9.8 GB/s
PS+SuMx( 64) [OK] 3.9 GFlops 15.6 GB/s
PS+SuMx( 128) [OK] 5.7 GFlops 22.8 GB/s
PS+SuMx( 256) [OK] 7.2 GFlops 28.8 GB/s
PS+SuMx( 512) [OK] 8.5 GFlops 34.1 GB/s
PS+SuMx( 1024) [OK] 8.9 GFlops 35.8 GB/s
PS+SuMx( 2048) [OK] 9.8 GFlops 39.3 GB/s
PS+SuMx( 4096) [OK] 9.7 GFlops 38.8 GB/s
PS+SuMx( 8192) [OK] 10.3 GFlops 41.3 GB/s
PS+SuMx( 16384) [OK] 10.1 GFlops 40.5 GB/s
PS+SuMx( 32768) [OK] 10.6 GFlops 42.2 GB/s
PS+SuMx( 65536) [OK] 10.7 GFlops 43.0 GB/s
PS+SuMx(131072) [OK] 9.0 GFlops 36.0 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.4 14.8 121.7 [OK] 6.1 26.8 121.7
PS+SuMx( 16) 4.2 17.4 121.7 [OK] 6.0 25.3 121.7
PS+SuMx( 32) 4.6 18.7 121.7 [OK] 5.8 23.7 121.7
PS+SuMx( 64)
FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456
Regards, Patrick.
-
All OK here with GPU RAM at 1975 MHz with 5 runs
Best Stock result
PS+SuMx( 32768) [OK] 18.7 GFlops 75.0 GB/s
Best Opt. 1 result
PS+SuMx( 32768) 26.8 107.4 121.7 [OK] 37.0 148.1 121.7
Steve
-
Very interesting: test8 shows 32768 as the best result for the GTX 470/480 cards,
but with low-end cards 131072 is best.
-
All OK; worst case 0-0.4 faster than stock, best case another 0.1-0.4 faster than worst.
About 5 runs.
-
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PS+SuMx( 65536) [OK] 12.4 GFlops 49.6 GB/s
PS+SuMx( 65536) 16.6 66.4 121.7 [OK] 17.7 70.7 121.7
-
PS+SuMx( 64)
FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456
Wow Patrick, clearly something I'm doing in size 64 has changed (and it only appears on cc 1.0 :o), will check. We're going to need to fix that before moving on.
[Later:] @Patrick: when you can, please reboot & try the attached fix attempt (for compute cap 1.0)... If it's OK on that card I'll be able to avoid breaking that again...
[Removed attachment]
-
PS+SuMx( 64)
FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 456
Wow Patrick, clearly something I'm doing in size 64 has changed (and it only appears on cc 1.0 :o), will check. We're going to need to fix that before moving on.
[Later:] @Patrick: when you can, please reboot & try the attached fix attempt (for compute cap 1.0)... If it's OK on that card I'll be able to avoid breaking that again...
It looks like you fixed it; full logs for completeness' sake:
WinXP-32:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 2.2 GFlops 9.7 GB/s
PS+SuMx( 16) [OK] 2.6 GFlops 11.1 GB/s
PS+SuMx( 32) [OK] 2.6 GFlops 10.5 GB/s
PS+SuMx( 64) [OK] 4.3 GFlops 17.6 GB/s
PS+SuMx( 128) [OK] 6.7 GFlops 26.9 GB/s
PS+SuMx( 256) [OK] 9.0 GFlops 36.0 GB/s
PS+SuMx( 512) [OK] 11.2 GFlops 44.7 GB/s
PS+SuMx( 1024) [OK] 11.8 GFlops 47.4 GB/s
PS+SuMx( 2048) [OK] 13.5 GFlops 53.9 GB/s
PS+SuMx( 4096) [OK] 13.2 GFlops 52.6 GB/s
PS+SuMx( 8192) [OK] 14.4 GFlops 57.5 GB/s
PS+SuMx( 16384) [OK] 14.1 GFlops 56.5 GB/s
PS+SuMx( 32768) [OK] 14.9 GFlops 59.5 GB/s
PS+SuMx( 65536) [OK] 15.3 GFlops 61.2 GB/s
PS+SuMx(131072) [OK] 12.0 GFlops 47.8 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.6 15.8 121.7 [OK] 6.2 27.2 121.7
PS+SuMx( 16) 4.5 18.8 121.7 [OK] 6.1 25.5 121.7
PS+SuMx( 32) 4.9 20.1 121.7 [OK] 5.8 23.8 121.7
PS+SuMx( 64) 6.5 26.5 121.7 [OK] 7.4 30.0 121.7
PS+SuMx( 128) 7.2 28.8 121.7 [OK] 7.8 31.3 121.7
PS+SuMx( 256) 9.4 37.8 121.7 [OK] 10.2 40.7 121.7
PS+SuMx( 512) 11.6 46.3 121.7 [OK] 12.4 49.7 121.7
PS+SuMx( 1024) 12.1 48.5 121.7 [OK] 12.9 51.6 121.7
PS+SuMx( 2048) 13.7 54.9 121.7 [OK] 14.6 58.5 121.7
PS+SuMx( 4096) 13.4 53.5 121.7 [OK] 14.2 56.8 121.7
PS+SuMx( 8192) 14.5 58.2 121.7 [OK] 15.5 62.0 121.7
PS+SuMx( 16384) 14.3 57.1 121.7 [OK] 15.2 60.9 121.7
PS+SuMx( 32768) 15.1 60.3 121.7 [OK] 16.1 64.4 121.7
PS+SuMx( 65536) 15.5 62.0 121.7 [OK] 16.5 66.2 121.7
PS+SuMx(131072) 12.1 48.2 121.7 [OK] 12.7 50.8 121.7
Win7-64:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #8 (Sanity Check)
Stock:
PS+SuMx( 8) [OK] 2.0 GFlops 8.7 GB/s
PS+SuMx( 16) [OK] 2.4 GFlops 10.2 GB/s
PS+SuMx( 32) [OK] 2.4 GFlops 9.7 GB/s
PS+SuMx( 64) [OK] 3.9 GFlops 15.8 GB/s
PS+SuMx( 128) [OK] 5.6 GFlops 22.7 GB/s
PS+SuMx( 256) [OK] 7.2 GFlops 29.0 GB/s
PS+SuMx( 512) [OK] 8.7 GFlops 34.7 GB/s
PS+SuMx( 1024) [OK] 9.0 GFlops 36.0 GB/s
PS+SuMx( 2048) [OK] 10.0 GFlops 40.1 GB/s
PS+SuMx( 4096) [OK] 9.8 GFlops 39.0 GB/s
PS+SuMx( 8192) [OK] 10.4 GFlops 41.6 GB/s
PS+SuMx( 16384) [OK] 10.2 GFlops 40.7 GB/s
PS+SuMx( 32768) [OK] 10.8 GFlops 43.2 GB/s
PS+SuMx( 65536) [OK] 10.9 GFlops 43.6 GB/s
PS+SuMx(131072) [OK] 9.0 GFlops 36.1 GB/s
Opt1: 64 thrds/block
worst case best case
GFlps GB/s ulps GFlps GB/s ulps
PS+SuMx( 8) 3.4 14.9 121.7 [OK] 6.1 26.8 121.7
PS+SuMx( 16) 4.2 17.6 121.7 [OK] 6.1 25.4 121.7
PS+SuMx( 32) 4.6 18.7 121.7 [OK] 5.8 23.7 121.7
PS+SuMx( 64) 6.0 24.2 121.7 [OK] 7.3 29.4 121.7
PS+SuMx( 128) 6.5 26.0 121.7 [OK] 7.7 31.1 121.7
PS+SuMx( 256) 8.3 33.3 121.7 [OK] 10.1 40.4 121.7
PS+SuMx( 512) 9.9 39.8 121.7 [OK] 12.3 49.4 121.7
PS+SuMx( 1024) 10.2 40.8 121.7 [OK] 12.8 51.3 121.7
PS+SuMx( 2048) 11.3 45.2 121.7 [OK] 14.5 58.2 121.7
PS+SuMx( 4096) 11.2 44.6 121.7 [OK] 14.1 56.3 121.7
PS+SuMx( 8192) 12.1 48.3 121.7 [OK] 15.4 61.5 121.7
PS+SuMx( 16384) 11.7 46.8 121.7 [OK] 15.1 60.4 121.7
PS+SuMx( 32768) 12.2 48.8 121.7 [OK] 16.0 63.8 121.7
PS+SuMx( 65536) 12.5 50.0 121.7 [OK] 16.4 65.8 121.7
PS+SuMx(131072) 10.1 40.5 121.7 [OK] 12.6 50.5 121.7
Regards, Patrick.
-
Phew! cool, thanks ;D
Not much headroom on that chip either, but I'll be happy with that small fractional improvement on the oldest cards for now.
Moving on to test #9 soon; will add in the FFTs, then stream the test kernels after that, just to see what that does... Progress at last ;D
-
Phew! cool, thanks ;D
Not much headroom on that chip either, but I'll be happy with that small fraction improvement on the oldest cards for now.
Moving onto test #9 soon, will add in the FFTs, then will stream the test kernels after that, just to see what that does... Progress at last ;D
You're quite welcome. What exactly do you mean by 'not much headroom on that chip'?
Looking forward to the next test-programs. ;)
Oh, and a Merry Christmas!
Regards,
Patrick.
-
You're quite welcome. What exactly do you mean with 'not much headroom on that chip'?
Only that it seems the best and worst case Opt1 results aren't as far apart as on the newer/bigger GPUs, which means we're getting close to the limits of the smaller chips, as to what optimisations can be useful on those (with this part of the code, anyway).
Onto combining FFTs into the pipeline now, which will change the picture a lot. Back later.
Jason
-
Darn! Next test won't fit! Arggh! .... When I can get it posted somewhere, net progress so far looks something like this for ~40-60% of multibeam processing:
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 12.7 GFlops 22.4 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 20.6 GFlops 28.1 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 25.1 GFlops 28.0 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 43.1 GFlops 40.8 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 63.7 GFlops 52.4 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 85.6 GFlops 62.4 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 114.2 GFlops 74.6 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 136.7 GFlops 81.0 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 149.3 GFlops 81.0 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 154.1 GFlops 77.1 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 156.2 GFlops 72.4 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 149.2 GFlops 64.5 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 155.5 GFlops 63.0 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 152.0 GFlops 57.9 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 134.7 GFlops 48.4 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 19.2 GFlops 33.8 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 37.0 GFlops 50.5 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 61.1 GFlops 68.2 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 86.9 GFlops 82.2 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 93.4 GFlops 76.8 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 137.0 GFlops 99.8 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 174.8 GFlops 114.2 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 218.7 GFlops 129.6 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 231.2 GFlops 125.4 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 236.8 GFlops 118.4 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 229.0 GFlops 106.2 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 223.9 GFlops 96.8 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 216.0 GFlops 87.5 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 214.0 GFlops 81.5 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 205.0 GFlops 73.7 GB/s ulps(fft 2.7,ps 5392.8) [OK]
Figuring out how to get it uploaded ...
-
Unable to upload here; please try:
ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z
Updated first post:
Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only
Temporary download location:
ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z
-
GTX 465 results:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 10.6 GFlops 18.7 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 16.5 GFlops 22.5 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 16.9 GFlops 18.9 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 29.0 GFlops 27.4 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 43.4 GFlops 35.7 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 57.8 GFlops 42.1 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 77.4 GFlops 50.6 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 92.9 GFlops 55.1 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 99.7 GFlops 54.1 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 101.1 GFlops 50.6 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 103.9 GFlops 48.2 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 103.1 GFlops 44.6 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 104.6 GFlops 42.4 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 102.4 GFlops 39.0 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 93.8 GFlops 33.7 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 20.5 GFlops 36.2 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 33.7 GFlops 45.9 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 47.3 GFlops 52.8 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 60.0 GFlops 56.8 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 59.0 GFlops 48.5 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 85.8 GFlops 62.5 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 109.0 GFlops 71.2 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 133.7 GFlops 79.3 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 136.9 GFlops 74.3 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 141.5 GFlops 70.7 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 136.7 GFlops 63.4 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 141.3 GFlops 61.1 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 134.9 GFlops 54.6 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 132.6 GFlops 50.5 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 130.5 GFlops 46.9 GB/s ulps(fft 2.7,ps 5392.8) [OK]
-
Thanks, that's a crazy speedup there too (1.3-2x). Will be checking thoroughly before moving on ;)
-
And the 460-768
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 9.5 GFlops 16.7 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 14.4 GFlops 19.7 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 13.8 GFlops 15.4 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 24.2 GFlops 22.9 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 36.9 GFlops 30.4 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 49.9 GFlops 36.3 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 70.7 GFlops 46.2 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 90.4 GFlops 53.6 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 102.7 GFlops 55.7 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 111.2 GFlops 55.6 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 97.5 GFlops 45.2 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 93.4 GFlops 40.4 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 100.6 GFlops 40.7 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 106.9 GFlops 40.7 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 86.9 GFlops 31.3 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 16.5 GFlops 29.1 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 27.2 GFlops 37.1 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 38.4 GFlops 42.9 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 49.9 GFlops 47.2 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 45.0 GFlops 37.0 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 64.5 GFlops 47.0 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 82.9 GFlops 54.2 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 108.0 GFlops 64.0 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 123.3 GFlops 66.9 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 132.9 GFlops 66.4 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 111.0 GFlops 51.5 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 107.2 GFlops 46.3 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 111.4 GFlops 45.1 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 117.4 GFlops 44.7 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 95.6 GFlops 34.4 GB/s ulps(fft 2.7,ps 5392.8) [OK]
Rehosting of the test on a faster connection.
http://www.arkayn.us/seti/PowerSpectrumTest9.7z
-
And the 460-768...
We're pushing that narrower memory bus, I guess ;). The totally different spread is interesting.
Rehosting of the test on a faster connection.
http://www.arkayn.us/seti/PowerSpectrumTest9.7z
Cheers! (adding link to first post..[done] )
-
This is fun!
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 21.2 GFlops 37.3 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 30.5 GFlops 41.6 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 30.7 GFlops 34.2 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 50.3 GFlops 47.6 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 73.0 GFlops 60.0 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 92.7 GFlops 67.5 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 125.8 GFlops 82.2 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 149.6 GFlops 88.7 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 163.0 GFlops 88.4 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 168.5 GFlops 84.2 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 170.0 GFlops 78.8 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 157.2 GFlops 68.0 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 167.4 GFlops 67.8 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 164.6 GFlops 62.7 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 141.9 GFlops 51.0 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 37.4 GFlops 65.9 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 58.9 GFlops 80.4 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 81.7 GFlops 91.2 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 102.4 GFlops 96.9 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 100.5 GFlops 82.7 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 142.2 GFlops 103.6 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 177.3 GFlops 115.9 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 218.1 GFlops 129.3 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 233.4 GFlops 126.6 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 238.4 GFlops 119.2 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 229.6 GFlops 106.5 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 217.5 GFlops 94.1 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 213.6 GFlops 86.5 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 213.2 GFlops 81.2 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 198.0 GFlops 71.2 GB/s ulps(fft 2.7,ps 5392.8) [OK]
Steve
-
The usual three:
9800GTX+, Windows 7/32
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 6.9 GFlops 12.2 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 11.8 GFlops 16.1 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 15.6 GFlops 17.4 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 26.2 GFlops 24.8 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 36.6 GFlops 30.1 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 48.7 GFlops 35.5 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 57.8 GFlops 37.8 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 62.9 GFlops 37.3 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 61.7 GFlops 33.5 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 57.6 GFlops 28.8 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 56.7 GFlops 26.3 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 52.5 GFlops 22.7 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 50.3 GFlops 20.4 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 55.3 GFlops 21.1 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 56.9 GFlops 20.5 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 14.9 GFlops 26.2 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 23.3 GFlops 31.8 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 30.5 GFlops 34.0 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 43.2 GFlops 40.9 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 49.8 GFlops 41.0 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 64.9 GFlops 47.3 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 79.3 GFlops 51.8 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 81.9 GFlops 48.6 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 78.1 GFlops 42.4 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 73.3 GFlops 36.7 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 70.5 GFlops 32.7 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 65.7 GFlops 28.4 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 60.7 GFlops 24.6 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 67.0 GFlops 25.5 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 68.5 GFlops 24.6 GB/s ulps(fft 3.6,ps 6590.4) [OK]
9800GT, Windows XP/32
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 6.6 GFlops 11.6 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 10.5 GFlops 14.3 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 13.0 GFlops 14.5 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 22.4 GFlops 21.2 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 33.8 GFlops 27.8 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 45.2 GFlops 32.9 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 56.0 GFlops 36.6 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 57.6 GFlops 34.1 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 57.4 GFlops 31.1 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 50.4 GFlops 25.2 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 48.9 GFlops 22.7 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 46.8 GFlops 20.3 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 42.4 GFlops 17.2 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 47.8 GFlops 18.2 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 50.5 GFlops 18.1 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 9.7 GFlops 17.2 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 16.0 GFlops 21.9 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 21.5 GFlops 24.0 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 31.1 GFlops 29.4 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 36.3 GFlops 29.9 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 47.7 GFlops 34.8 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 58.6 GFlops 38.3 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 59.7 GFlops 35.4 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 59.0 GFlops 32.0 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 51.9 GFlops 26.0 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 50.0 GFlops 23.2 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 47.7 GFlops 20.6 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 43.2 GFlops 17.5 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 48.7 GFlops 18.6 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 51.6 GFlops 18.6 GB/s ulps(fft 3.6,ps 6590.4) [OK]
GTX 470, Windows XP/32
Device: GeForce GTX 470, 1215 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 7.9 GFlops 14.0 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 14.0 GFlops 19.1 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 17.7 GFlops 19.7 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 32.4 GFlops 30.7 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 51.7 GFlops 42.6 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 72.0 GFlops 52.5 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 100.4 GFlops 65.6 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 124.9 GFlops 74.1 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 136.6 GFlops 74.1 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 139.1 GFlops 69.6 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 141.0 GFlops 65.4 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 132.7 GFlops 57.4 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 137.9 GFlops 55.9 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 134.5 GFlops 51.2 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 116.0 GFlops 41.7 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 14.2 GFlops 25.1 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 27.2 GFlops 37.1 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 43.9 GFlops 49.0 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 61.3 GFlops 58.0 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 65.6 GFlops 54.0 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 95.7 GFlops 69.7 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 121.1 GFlops 79.2 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 153.4 GFlops 91.0 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 161.9 GFlops 87.8 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 168.3 GFlops 84.2 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 157.7 GFlops 73.1 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 155.1 GFlops 67.1 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 151.9 GFlops 61.5 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 150.7 GFlops 57.4 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 137.2 GFlops 49.3 GB/s ulps(fft 2.7,ps 5392.8) [OK]
-
My 128 MB 8400M GS on Vista/32, and Merry Christmas:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 1.2 GFlops 2.1 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 1.4 GFlops 1.8 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 1.4 GFlops 1.5 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 2.3 GFlops 2.2 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 3.6 GFlops 2.9 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 4.7 GFlops 3.4 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 5.6 GFlops 3.7 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 5.5 GFlops 3.2 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 5.5 GFlops 3.0 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 5.3 GFlops 2.6 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 4.7 GFlops 2.2 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 4.4 GFlops 1.9 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 5.0 GFlops 2.0 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 5.2 GFlops 2.0 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 5.5 GFlops 2.0 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 1.6 GFlops 2.8 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 1.9 GFlops 2.6 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 2.3 GFlops 2.5 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 3.1 GFlops 2.9 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 3.6 GFlops 3.0 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 4.8 GFlops 3.5 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 5.8 GFlops 3.8 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 5.6 GFlops 3.3 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 5.7 GFlops 3.1 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 5.3 GFlops 2.7 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 4.8 GFlops 2.2 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 4.4 GFlops 1.9 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 5.0 GFlops 2.0 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 5.2 GFlops 2.0 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 5.5 GFlops 2.0 GB/s ulps(fft 3.6,ps 6590.4) [OK]
and 9800GTX+ on Win 7 x64:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 8.1 GFlops 14.3 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 12.6 GFlops 17.2 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 16.6 GFlops 18.5 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 28.7 GFlops 27.1 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 42.1 GFlops 34.6 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 55.5 GFlops 40.4 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 68.2 GFlops 44.6 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 72.3 GFlops 42.9 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 70.7 GFlops 38.4 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 66.1 GFlops 33.1 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 64.2 GFlops 29.8 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 60.7 GFlops 26.2 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 56.1 GFlops 22.7 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 62.0 GFlops 23.6 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 63.2 GFlops 22.7 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 11.1 GFlops 19.6 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 19.4 GFlops 26.4 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 27.5 GFlops 30.7 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 40.8 GFlops 38.6 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 48.9 GFlops 40.2 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 64.2 GFlops 46.8 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 79.3 GFlops 51.8 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 82.7 GFlops 49.0 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 79.9 GFlops 43.3 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 74.3 GFlops 37.2 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 71.6 GFlops 33.2 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 66.9 GFlops 28.9 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 61.4 GFlops 24.9 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 68.0 GFlops 25.9 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 69.3 GFlops 24.9 GB/s ulps(fft 3.6,ps 6590.4) [OK]
Claggy
-
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 17.3 GFlops 30.4 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 23.2 GFlops 31.7 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 27.2 GFlops 30.4 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 43.8 GFlops 41.5 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 60.7 GFlops 49.9 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 75.6 GFlops 55.1 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 91.6 GFlops 59.9 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 92.1 GFlops 54.6 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 96.9 GFlops 52.6 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 93.1 GFlops 46.6 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 98.7 GFlops 45.8 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 96.1 GFlops 41.6 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 96.5 GFlops 39.1 GB/s ulps(fft 3.1,ps 6152.4) [OK]
FFT+PS+SM( 65536) 88.2 GFlops 33.6 GB/s ulps(fft 2.8,ps 5899.2) [OK]
FFT+PS+SM(131072) 94.4 GFlops 33.9 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 128 thrds/block
FFT+PS+SM( 8) 25.0 GFlops 44.0 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 37.1 GFlops 50.6 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 49.8 GFlops 55.6 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 68.5 GFlops 64.9 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 81.4 GFlops 67.0 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 94.6 GFlops 68.9 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 115.9 GFlops 75.7 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 122.4 GFlops 72.6 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 124.9 GFlops 67.7 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 113.9 GFlops 57.0 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 120.5 GFlops 55.9 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 121.6 GFlops 52.6 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 120.1 GFlops 48.7 GB/s ulps(fft 3.1,ps 6041.9) [OK]
FFT+PS+SM( 65536) 103.7 GFlops 39.5 GB/s ulps(fft 2.8,ps 5782.9) [OK]
FFT+PS+SM(131072) 111.2 GFlops 40.0 GB/s ulps(fft 3.6,ps 6590.4) [OK]
-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 16.3 GFlops 28.7 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 22.9 GFlops 31.3 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 26.3 GFlops 29.3 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 42.1 GFlops 39.8 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 63.2 GFlops 52.0 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 75.0 GFlops 54.6 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 89.7 GFlops 58.6 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 92.9 GFlops 55.1 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 96.6 GFlops 52.4 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 87.3 GFlops 43.7 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 49.6 GFlops 23.0 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 98.6 GFlops 42.6 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 97.1 GFlops 39.3 GB/s ulps(fft 3.1,ps 6152.4) [OK]
FFT+PS+SM( 65536) 85.5 GFlops 32.6 GB/s ulps(fft 2.8,ps 5899.2) [OK]
FFT+PS+SM(131072) 91.4 GFlops 32.9 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 128 thrds/block
FFT+PS+SM( 8) 24.5 GFlops 43.2 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 36.4 GFlops 49.7 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 48.8 GFlops 54.5 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 67.0 GFlops 63.4 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 79.6 GFlops 65.5 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 92.7 GFlops 67.5 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 113.9 GFlops 74.4 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 118.9 GFlops 70.5 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 122.9 GFlops 66.7 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 111.8 GFlops 55.9 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 117.7 GFlops 54.6 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 118.7 GFlops 51.3 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 117.7 GFlops 47.7 GB/s ulps(fft 3.1,ps 6041.9) [OK]
FFT+PS+SM( 65536) 101.2 GFlops 38.5 GB/s ulps(fft 2.8,ps 5782.9) [OK]
FFT+PS+SM(131072) 108.6 GFlops 39.0 GB/s ulps(fft 3.6,ps 6590.4) [OK]
-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 16.5 GFlops 29.2 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 23.1 GFlops 31.5 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 25.3 GFlops 28.3 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 41.3 GFlops 39.1 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 61.6 GFlops 50.7 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 72.0 GFlops 52.5 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 87.7 GFlops 57.3 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 94.5 GFlops 56.0 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 96.7 GFlops 52.5 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 90.5 GFlops 45.2 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 95.0 GFlops 44.1 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 95.0 GFlops 41.1 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 91.2 GFlops 36.9 GB/s ulps(fft 3.1,ps 6152.4) [OK]
FFT+PS+SM( 65536) 83.6 GFlops 31.8 GB/s ulps(fft 2.8,ps 5899.2) [OK]
FFT+PS+SM(131072) 90.6 GFlops 32.6 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 128 thrds/block
FFT+PS+SM( 8) 24.1 GFlops 42.4 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 35.3 GFlops 48.2 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 47.1 GFlops 52.6 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 64.9 GFlops 61.4 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 77.0 GFlops 63.3 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 89.2 GFlops 65.0 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 110.0 GFlops 71.9 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 118.1 GFlops 70.0 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 118.8 GFlops 64.5 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 110.6 GFlops 55.3 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 116.2 GFlops 53.9 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 116.1 GFlops 50.2 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 108.7 GFlops 44.0 GB/s ulps(fft 3.1,ps 6041.9) [OK]
FFT+PS+SM( 65536) 97.8 GFlops 37.2 GB/s ulps(fft 2.8,ps 5782.9) [OK]
FFT+PS+SM(131072) 108.3 GFlops 38.9 GB/s ulps(fft 3.6,ps 6590.4) [OK]
-
oooh, now my eyes have gone funny ;D
@All: Thanks very much and Merry Christmas!
Summary of what I can see:
- The newer & bigger the card, the more we seem to be able to extract
- Opt1 FFT (worst case) pipeline is not slower than stock at any FFT size on any GPU so far (even the small GPUs)
- Seems stable [OK] on all
- 200 series holding in there
- Fermi peak starting to push unexpectedly high this early (but still ~50% of theoretical; will need to try streaming next as planned)
I reckon we're getting a good start toward optimising multibeam now. With FFT, powerspectrum & summax reductions covered, we account for roughly 40-50% of processing (depending on angle range). With a few more refinements to this area (mainly streaming & findspikes itself to try) we should be ready to tackle the more challenging areas that remain (& dominate).
Long road still to travel, but I reckon we've managed to nail a few key techniques that will help dramatically with certain problem areas down the road.
Cheers, off to give things a short Christmas break before going through all that with a fine-tooth comb.
Jason
-
Merry Christmas!
Thank you for the Christmas 2010 edition ;)
PowerSpectrumTest9.exe -device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 10.5 GFlops 18.6 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 16.6 GFlops 22.6 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 21.6 GFlops 24.2 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 36.0 GFlops 34.1 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 52.7 GFlops 43.3 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 69.5 GFlops 50.6 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 94.6 GFlops 61.8 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 107.8 GFlops 63.9 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 118.0 GFlops 64.0 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 125.2 GFlops 62.6 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 131.7 GFlops 61.1 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 113.8 GFlops 49.2 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 121.3 GFlops 49.1 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 121.6 GFlops 46.3 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 100.4 GFlops 36.1 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 21.7 GFlops 38.3 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 37.7 GFlops 51.4 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 55.7 GFlops 62.1 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 73.3 GFlops 69.4 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 75.4 GFlops 62.0 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 106.5 GFlops 77.6 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 132.7 GFlops 86.7 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 163.9 GFlops 97.2 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 179.4 GFlops 97.3 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 183.0 GFlops 91.5 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 179.3 GFlops 83.2 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 161.0 GFlops 69.6 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 163.6 GFlops 66.3 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 165.4 GFlops 63.0 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 146.7 GFlops 52.7 GB/s ulps(fft 2.7,ps 5392.8) [OK]
PowerSpectrumTest9.exe -device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 11.7 GFlops 20.6 GB/s ulps(fft 1.2,ps 4389.0) [OK]
FFT+PS+SM( 16) 19.0 GFlops 26.0 GB/s ulps(fft 1.6,ps 4518.6) [OK]
FFT+PS+SM( 32) 21.7 GFlops 24.2 GB/s ulps(fft 1.3,ps 3977.6) [OK]
FFT+PS+SM( 64) 36.1 GFlops 34.2 GB/s ulps(fft 1.5,ps 4206.9) [OK]
FFT+PS+SM( 128) 52.7 GFlops 43.3 GB/s ulps(fft 1.7,ps 4351.9) [OK]
FFT+PS+SM( 256) 69.7 GFlops 50.8 GB/s ulps(fft 1.7,ps 4254.8) [OK]
FFT+PS+SM( 512) 90.4 GFlops 59.1 GB/s ulps(fft 1.8,ps 4305.7) [OK]
FFT+PS+SM( 1024) 99.8 GFlops 59.2 GB/s ulps(fft 2.1,ps 4725.7) [OK]
FFT+PS+SM( 2048) 109.7 GFlops 59.5 GB/s ulps(fft 2.2,ps 4918.4) [OK]
FFT+PS+SM( 4096) 117.8 GFlops 58.9 GB/s ulps(fft 2.2,ps 4762.0) [OK]
FFT+PS+SM( 8192) 126.7 GFlops 58.8 GB/s ulps(fft 2.6,ps 5275.5) [OK]
FFT+PS+SM( 16384) 113.9 GFlops 49.2 GB/s ulps(fft 2.6,ps 5355.0) [OK]
FFT+PS+SM( 32768) 121.2 GFlops 49.1 GB/s ulps(fft 2.3,ps 4987.7) [OK]
FFT+PS+SM( 65536) 121.5 GFlops 46.3 GB/s ulps(fft 2.0,ps 4601.3) [OK]
FFT+PS+SM(131072) 99.9 GFlops 35.9 GB/s ulps(fft 2.7,ps 5392.0) [OK]
Opt1 (worst case): 256 thrds/block
FFT+PS+SM( 8) 21.8 GFlops 38.5 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 37.8 GFlops 51.6 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 55.9 GFlops 62.4 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 73.6 GFlops 69.7 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 75.7 GFlops 62.3 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 107.0 GFlops 77.9 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 133.3 GFlops 87.1 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 164.6 GFlops 97.6 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 180.0 GFlops 97.6 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 183.0 GFlops 91.5 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 179.7 GFlops 83.3 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 162.1 GFlops 70.1 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 164.3 GFlops 66.6 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 165.7 GFlops 63.1 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 147.5 GFlops 53.0 GB/s ulps(fft 2.7,ps 5392.8) [OK]
.
Done
PowerSpectrumTest9.exe -device 0
Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 1.2 GFlops 2.0 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 1.6 GFlops 2.2 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 1.6 GFlops 1.8 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 2.7 GFlops 2.6 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 3.9 GFlops 3.2 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 5.1 GFlops 3.7 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 6.1 GFlops 4.0 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 5.9 GFlops 3.5 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 6.2 GFlops 3.4 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 5.2 GFlops 2.6 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 5.1 GFlops 2.4 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 4.9 GFlops 2.1 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 5.1 GFlops 2.1 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 5.3 GFlops 2.0 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 5.6 GFlops 2.0 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 1.9 GFlops 3.3 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 2.4 GFlops 3.2 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 2.8 GFlops 3.1 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 3.8 GFlops 3.6 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 4.2 GFlops 3.5 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 5.4 GFlops 3.9 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 6.6 GFlops 4.3 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 6.3 GFlops 3.7 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 6.6 GFlops 3.6 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 5.6 GFlops 2.8 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 5.4 GFlops 2.5 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 5.2 GFlops 2.2 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 5.4 GFlops 2.2 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 5.6 GFlops 2.1 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 5.8 GFlops 2.1 GB/s ulps(fft 3.6,ps 6590.4) [OK]
.
Done
-
Here's mine, Merry Christmas
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test>powerspectrumtest9.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 1.2 GFlops 2.2 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 1.3 GFlops 1.8 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 2.0 GFlops 2.2 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 2.7 GFlops 2.5 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 3.8 GFlops 3.2 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 5.2 GFlops 3.8 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 3.1 GFlops 2.0 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 5.7 GFlops 3.3 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 6.5 GFlops 3.5 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 5.5 GFlops 2.8 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 5.9 GFlops 2.7 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 4.7 GFlops 2.0 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 6.1 GFlops 2.5 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 5.8 GFlops 2.2 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 7.0 GFlops 2.5 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 3.5 GFlops 6.1 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 5.4 GFlops 7.4 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 6.1 GFlops 6.8 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 8.9 GFlops 8.4 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 10.2 GFlops 8.4 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 12.2 GFlops 8.9 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 15.5 GFlops 10.2 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 17.0 GFlops 10.1 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 18.1 GFlops 9.8 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 12.9 GFlops 6.5 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 14.3 GFlops 6.7 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 14.6 GFlops 6.3 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 12.4 GFlops 5.0 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 13.6 GFlops 5.2 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 13.9 GFlops 5.0 GB/s ulps(fft 3.6,ps 6590.4) [OK]
C:\test>
-
nr9
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
FAILURE in c:/[Projects]/LunaticsUnited/Tools/Tests/PowerSpectrum/main.cpp, line 254
ouch :)
OK, stopping BOINC helps ::) Result tomorrow... OK, result now:
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 1.8 GFlops 3.2 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 2.9 GFlops 4.0 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 2.7 GFlops 3.0 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 5.3 GFlops 5.0 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 7.9 GFlops 6.5 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 11.0 GFlops 8.0 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 13.3 GFlops 8.7 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 13.1 GFlops 7.8 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 13.2 GFlops 7.2 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 12.3 GFlops 6.1 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 11.5 GFlops 5.3 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 10.7 GFlops 4.6 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 12.2 GFlops 5.0 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 12.2 GFlops 4.7 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 12.5 GFlops 4.5 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 3.7 GFlops 6.6 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 4.7 GFlops 6.4 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 5.6 GFlops 6.3 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 7.9 GFlops 7.5 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 9.4 GFlops 7.7 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 12.5 GFlops 9.1 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 15.3 GFlops 10.0 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 15.0 GFlops 8.9 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 14.6 GFlops 7.9 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 14.1 GFlops 7.0 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 12.8 GFlops 6.0 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 11.6 GFlops 5.0 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 13.1 GFlops 5.3 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 14.1 GFlops 5.4 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 14.0 GFlops 5.0 GB/s ulps(fft 3.6,ps 6590.4) [OK]
Sorry, no time for averages atm.
-
Thanks Heinz, perrjay & Carola,
Nice to see the stubborn chips (that Quadro & the ION) edging forward a bit now.
@perryjay: ~3x on the 9500 GT at some sizes? I don't entirely know why yet, but I like it ;D
Jason
-
Hi there,
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.
First, WinXP-32:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 9.3 GFlops 16.4 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 13.6 GFlops 18.5 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 16.0 GFlops 17.8 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 28.3 GFlops 26.8 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 44.4 GFlops 36.5 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 59.2 GFlops 43.1 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 72.6 GFlops 47.4 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 71.7 GFlops 42.5 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 72.1 GFlops 39.1 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 66.5 GFlops 33.3 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 63.3 GFlops 29.4 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 58.6 GFlops 25.3 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 62.9 GFlops 25.5 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 67.2 GFlops 25.6 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 66.0 GFlops 23.7 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 14.3 GFlops 25.2 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 21.2 GFlops 28.9 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 27.5 GFlops 30.7 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 39.1 GFlops 37.0 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 47.4 GFlops 39.0 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 62.5 GFlops 45.5 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 76.0 GFlops 49.7 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 74.1 GFlops 43.9 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 74.2 GFlops 40.3 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 67.3 GFlops 33.7 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 64.7 GFlops 30.0 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 59.8 GFlops 25.9 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 64.3 GFlops 26.0 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 68.6 GFlops 26.1 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 67.5 GFlops 24.3 GB/s ulps(fft 3.6,ps 6590.4) [OK]
Second, Win7-64:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 8.4 GFlops 14.9 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 12.1 GFlops 16.6 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 14.6 GFlops 16.3 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 25.9 GFlops 24.5 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 38.6 GFlops 31.8 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 50.3 GFlops 36.6 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 61.2 GFlops 40.0 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 61.6 GFlops 36.5 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 62.3 GFlops 33.8 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 57.5 GFlops 28.7 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 56.1 GFlops 26.0 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 52.4 GFlops 22.7 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 55.5 GFlops 22.5 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 59.2 GFlops 22.5 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 58.8 GFlops 21.1 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 14.2 GFlops 25.0 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 21.0 GFlops 28.6 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 27.5 GFlops 30.7 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 39.2 GFlops 37.1 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 46.8 GFlops 38.5 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 61.1 GFlops 44.5 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 75.2 GFlops 49.2 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 73.6 GFlops 43.6 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 73.4 GFlops 39.8 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 67.7 GFlops 33.9 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 64.4 GFlops 29.8 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 59.5 GFlops 25.7 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 64.0 GFlops 25.9 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 68.2 GFlops 26.0 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 67.1 GFlops 24.1 GB/s ulps(fft 3.6,ps 6590.4) [OK]
Regards, Patrick.
-
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.
Excellent, not broken on the 8800. Last hurdle for that code area cleared & can move on :D
-
Carola just mentioned something I haven't been doing. I have been running the test without stopping BOINC. Should I run it with BOINC stopped?
-
It's not necessary to completely stop BOINC, but at the least the GPU should be snoozed.
You can't test GPU compute and memory transfers while you're crunching with the same card;
otherwise you will see reduced values in the test.
-
Okay, let's see how much of a difference this makes....
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test>powerspectrumtest9.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline)
Christmas 2010 edition.
Stock:
FFT+PS+SM( 8) 3.0 GFlops 5.2 GB/s ulps(fft 1.3,ps 4775.9) [OK]
FFT+PS+SM( 16) 4.0 GFlops 5.5 GB/s ulps(fft 1.6,ps 4817.4) [OK]
FFT+PS+SM( 32) 4.4 GFlops 5.0 GB/s ulps(fft 1.6,ps 4628.1) [OK]
FFT+PS+SM( 64) 7.1 GFlops 6.7 GB/s ulps(fft 1.6,ps 4557.6) [OK]
FFT+PS+SM( 128) 9.8 GFlops 8.1 GB/s ulps(fft 2.0,ps 4942.0) [OK]
FFT+PS+SM( 256) 11.9 GFlops 8.6 GB/s ulps(fft 2.0,ps 4967.8) [OK]
FFT+PS+SM( 512) 15.0 GFlops 9.8 GB/s ulps(fft 2.1,ps 5128.1) [OK]
FFT+PS+SM( 1024) 16.2 GFlops 9.6 GB/s ulps(fft 2.5,ps 5552.5) [OK]
FFT+PS+SM( 2048) 17.5 GFlops 9.5 GB/s ulps(fft 2.7,ps 5770.3) [OK]
FFT+PS+SM( 4096) 13.4 GFlops 6.7 GB/s ulps(fft 2.4,ps 5313.7) [OK]
FFT+PS+SM( 8192) 14.2 GFlops 6.6 GB/s ulps(fft 2.8,ps 5881.1) [OK]
FFT+PS+SM( 16384) 13.7 GFlops 5.9 GB/s ulps(fft 3.3,ps 6399.1) [OK]
FFT+PS+SM( 32768) 12.1 GFlops 4.9 GB/s ulps(fft 3.3,ps 6380.1) [OK]
FFT+PS+SM( 65536) 13.0 GFlops 5.0 GB/s ulps(fft 3.4,ps 6534.8) [OK]
FFT+PS+SM(131072) 13.9 GFlops 5.0 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
FFT+PS+SM( 8) 4.1 GFlops 7.3 GB/s ulps(fft 1.3,ps 4637.5) [OK]
FFT+PS+SM( 16) 5.7 GFlops 7.7 GB/s ulps(fft 1.6,ps 4589.2) [OK]
FFT+PS+SM( 32) 7.0 GFlops 7.8 GB/s ulps(fft 1.6,ps 4535.6) [OK]
FFT+PS+SM( 64) 9.2 GFlops 8.7 GB/s ulps(fft 1.6,ps 4426.7) [OK]
FFT+PS+SM( 128) 10.5 GFlops 8.6 GB/s ulps(fft 2.0,ps 4818.1) [OK]
FFT+PS+SM( 256) 12.7 GFlops 9.2 GB/s ulps(fft 2.0,ps 4831.0) [OK]
FFT+PS+SM( 512) 16.0 GFlops 10.5 GB/s ulps(fft 2.1,ps 4987.2) [OK]
FFT+PS+SM( 1024) 17.3 GFlops 10.2 GB/s ulps(fft 2.5,ps 5438.0) [OK]
FFT+PS+SM( 2048) 18.5 GFlops 10.0 GB/s ulps(fft 2.7,ps 5674.7) [OK]
FFT+PS+SM( 4096) 13.7 GFlops 6.9 GB/s ulps(fft 2.4,ps 5202.4) [OK]
FFT+PS+SM( 8192) 14.9 GFlops 6.9 GB/s ulps(fft 2.8,ps 5765.4) [OK]
FFT+PS+SM( 16384) 15.4 GFlops 6.6 GB/s ulps(fft 3.3,ps 6291.8) [OK]
FFT+PS+SM( 32768) 13.1 GFlops 5.3 GB/s ulps(fft 3.3,ps 6275.5) [OK]
FFT+PS+SM( 65536) 13.8 GFlops 5.3 GB/s ulps(fft 3.4,ps 6429.1) [OK]
FFT+PS+SM(131072) 14.5 GFlops 5.2 GB/s ulps(fft 3.6,ps 6590.4) [OK]
C:\test>
-
Aha! That explains the inflated speedup on the previous run :). In essence, (some of) the optimisations I'm trying out (namely, asynchronous transfers) should be less susceptible to slowdowns under load than the stock code (synchronous transfers)...
I wasn't looking to test/refine that aspect yet, but you managed to prove it already works... Thanks! ;D
(Overlapped execution/transfers on Pre-Fermi, and concurrent kernels on Fermi next .... )
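The overlap benefit described above can be sketched with a toy timing model (purely illustrative, with made-up millisecond figures; this is not the unit test's code): a synchronous pipeline pays transfer + kernel time for every chunk, while an async, streamed pipeline hides all but the first transfer behind compute.

```cpp
#include <algorithm>

// Toy model of synchronous vs overlapped (streamed) execution.
// With blocking copies, each chunk costs transfer + kernel time.
double sync_time(double xfer_ms, double kernel_ms, int chunks) {
    return chunks * (xfer_ms + kernel_ms);
}

// With cudaMemcpyAsync-style overlap, the copy of chunk i+1 runs while
// chunk i computes, so steady-state cost per chunk is max(xfer, kernel);
// only the first transfer cannot be hidden.
double overlapped_time(double xfer_ms, double kernel_ms, int chunks) {
    return xfer_ms + chunks * std::max(xfer_ms, kernel_ms);
}
```

With, say, 2 ms transfers and 3 ms kernels over 10 chunks, the model gives 50 ms synchronous vs 32 ms overlapped. It also suggests why a loaded GPU hurts the synchronous path more: any stall adds to every chunk instead of being absorbed by the overlap.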
-
Sorry bout that... hope I didn't mess you up too much. Glad it gave you some extra to think about.
-
Sorry bout that... hope I didn't mess you up too much. Glad it gave you some extra to think about.
Not at all messed up; it just had me wondering how the 9500 GT was managing 3x throughput at some sizes, and now we know: it was under load ;). That unexpected benefit does indeed give me some more things to consider for the next stage, and it looks like we might be able to push a bit harder than I thought.
-
Hey guys, I did something right for a change!!! :) ::) Looking forward to the next test. This time I'll know to turn it off!
-
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.
Excellent, not broken on the 8800. Last hurdle for that code area cleared & can move on :D
Wonderful to hear that. As always, looking forward to the next bit of execution-magic. ;)
Regards, Patrick.
-
It will take me some time to cook up the next test while I work out this streaming stuff.
Mixed results with kernel streaming so far: it appears to benefit my smaller, highly optimised kernels more than the stock-ish larger sizes (I don't know why yet), and dividing further into additional streams seems to slow things down again ... tricky!:
As with test #9 (single stream)
Opt1 (worst case): 256 thrds/block, 1 x 1048576 element streams
FFT+PS+SM( 8) 19.2 GFlops 33.8 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 36.8 GFlops 50.3 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 60.7 GFlops 67.8 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 86.2 GFlops 81.6 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 92.5 GFlops 76.1 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 135.0 GFlops 98.3 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 172.0 GFlops 112.4 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 214.7 GFlops 127.3 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 225.9 GFlops 122.6 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 232.3 GFlops 116.2 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 226.0 GFlops 104.8 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 221.5 GFlops 95.8 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 213.1 GFlops 86.3 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 210.5 GFlops 80.2 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 202.6 GFlops 72.8 GB/s ulps(fft 2.7,ps 5392.8) [OK]
2x streams:
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
FFT+PS+SM( 8) 26.7 GFlops 47.2 GB/s ulps(fft 1.2,ps 4324.2) [OK]
FFT+PS+SM( 16) 66.9 GFlops 91.3 GB/s ulps(fft 1.6,ps 4326.2) [OK]
FFT+PS+SM( 32) 90.9 GFlops 101.5 GB/s ulps(fft 1.3,ps 4003.6) [OK]
FFT+PS+SM( 64) 105.0 GFlops 99.4 GB/s ulps(fft 1.5,ps 4270.2) [OK]
FFT+PS+SM( 128) 94.0 GFlops 77.3 GB/s ulps(fft 1.7,ps 4347.9) [OK]
FFT+PS+SM( 256) 135.9 GFlops 98.9 GB/s ulps(fft 1.7,ps 4261.8) [OK]
FFT+PS+SM( 512) 167.9 GFlops 109.7 GB/s ulps(fft 1.8,ps 4327.4) [OK]
FFT+PS+SM( 1024) 198.4 GFlops 117.6 GB/s ulps(fft 2.1,ps 4727.6) [OK]
FFT+PS+SM( 2048) 209.1 GFlops 113.4 GB/s ulps(fft 2.2,ps 4921.2) [OK]
FFT+PS+SM( 4096) 209.9 GFlops 105.0 GB/s ulps(fft 2.2,ps 4764.3) [OK]
FFT+PS+SM( 8192) 204.8 GFlops 95.0 GB/s ulps(fft 2.6,ps 5278.8) [OK]
FFT+PS+SM( 16384) 205.0 GFlops 88.6 GB/s ulps(fft 2.6,ps 5357.5) [OK]
FFT+PS+SM( 32768) 187.5 GFlops 75.9 GB/s ulps(fft 2.3,ps 4992.8) [OK]
FFT+PS+SM( 65536) 195.2 GFlops 74.4 GB/s ulps(fft 2.0,ps 4604.3) [OK]
FFT+PS+SM(131072) 172.5 GFlops 62.0 GB/s ulps(fft 2.7,ps 5392.8) [OK]
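The "2 x 524288 element streams" header corresponds to host-side chunking along these lines (an illustrative sketch with invented names, not the test's source; the real code would issue an async copy plus kernel launch per slice, each in its own stream):

```cpp
#include <cstddef>
#include <vector>

// Split an N-element buffer into equal per-stream slices. Each slice
// would get its own async copy + kernel launch in a dedicated stream.
// More streams mean smaller slices, so fixed per-launch overhead eats a
// larger fraction of each slice's runtime -- one plausible reason further
// subdivision slows things down again.
struct Slice { std::size_t offset, count; };

std::vector<Slice> split_streams(std::size_t n, int streams) {
    std::vector<Slice> slices;
    std::size_t per = n / streams;  // assumes n is divisible by streams
    for (int i = 0; i < streams; ++i)
        slices.push_back({ i * per, per });
    return slices;
}
```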
-
Updated first Post:
Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- results can vary, so a few runs may be needed to check stability
- please use the DLLs provided with Test #9
-
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 67.27) Peak( 111.28) Min( 9.42) [OK]
Memory thoughput GB/s Avg( 36.72) Peak( 55.70) Min( 15.41)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 84.36, 1.25x) Peak( 131.47, 1.18x) Min( 31.13, 3.30x) [OK]
Memory thoughput [GB/s] -
Avg( 51.22, 1.39x) Peak( 66.16, 1.19x) Min( 34.18, 2.22x)
-
Cheers,
BTW: Avg roughly represents the overall improvement, Peak represents the speed change in the fastest kernels, and Min is the speed change in the slowest kernels ... so I regard 'Avg' & 'Min' as most important, with Peak being mostly just a possible indicator of remaining headroom.
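My reading of how those summary ratios fall out of the per-size figures (an assumption about the reporting, not the test's actual source): each ratio compares the same statistic across the stock and Opt1 throughput lists.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Summary { double avg, peak, min; };

// Ratios of mean/mean, max/max and min/min throughput between the Opt1
// and stock runs -- matching the Avg/Peak/Min "x" figures in the output.
Summary speedup(const std::vector<double>& stock,
                const std::vector<double>& opt) {
    auto mean = [](const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    };
    return { mean(opt) / mean(stock),
             *std::max_element(opt.begin(), opt.end()) /
                 *std::max_element(stock.begin(), stock.end()),
             *std::min_element(opt.begin(), opt.end()) /
                 *std::min_element(stock.begin(), stock.end()) };
}
```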
[Edit:] Similar-looking deal with the 480:
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 104.34) Peak( 157.97) Min( 12.79) [OK]
Memory thoughput GB/s Avg( 57.34) Peak( 82.25) Min( 22.55)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 162.15, 1.55x) Peak( 232.02, 1.47x) Min( 26.47, 2.07x) [OK]
Memory thoughput [GB/s] -
Avg( 95.38, 1.66x) Peak( 127.32, 1.55x) Min( 46.67, 2.07x)
-
Hi Jason,
new results from Test10
~~~~~~~~~~~~~~~
PowerSpectrumTest10.exe -device 0
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 82.93) Peak( 130.76) Min( 12.00) [OK]
Memory thoughput GB/s Avg( 46.20) Peak( 64.10) Min( 21.16)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 125.13, 1.51x) Peak( 178.98, 1.37x) Min( 37.50, 3.12x) [OK]
Memory thoughput [GB/s] -
Avg( 75.48, 1.63x) Peak( 95.64, 1.49x) Min( 52.23, 2.47x)
PowerSpectrumTest10.exe -device 1
Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 80.74) Peak( 126.77) Min( 11.69) [OK]
Memory thoughput GB/s Avg( 44.99) Peak( 59.75) Min( 20.61)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 125.57, 1.56x) Peak( 179.89, 1.42x) Min( 37.72, 3.23x) [OK]
Memory thoughput [GB/s] -
Avg( 75.75, 1.68x) Peak( 95.76, 1.60x) Min( 52.48, 2.55x)
.
Done
PowerSpectrumTest10.exe -device 0
Device: ION, 1161 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 4.38) Peak( 6.24) Min( 1.31) [OK]
Memory thoughput GB/s Avg( 2.66) Peak( 3.97) Min( 1.80)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 4.86, 1.11x) Peak( 6.64, 1.06x) Min( 1.86, 1.41x) [OK]
Memory thoughput [GB/s] -
Avg( 3.08, 1.16x) Peak( 4.29, 1.08x) Min( 2.10, 1.17x)
.
Done
-
Works on ION, YaY! :)
-
On my 128Mb 8400M GS:
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 4.07) Peak( 5.64) Min( 1.19) [OK]
Memory thoughput GB/s Avg( 2.44) Peak( 3.69) Min( 1.51)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 4.30, 1.06x) Peak( 5.78, 1.03x) Min( 1.68, 1.41x) [OK]
Memory thoughput [GB/s] -
Avg( 2.70, 1.11x) Peak( 3.78, 1.03x) Min( 1.90, 1.26x)
Claggy
-
On my 128Mb 8400M GS:
Works on that too :D, looks like we've managed to max that one out ;)
-
And My 465:
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 69.41) Peak( 104.12) Min( 10.56) [OK]
Memory thoughput GB/s Avg( 38.49) Peak( 54.71) Min( 18.61)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 101.54, 1.46x) Peak( 140.32, 1.35x) Min( 36.67, 3.47x) [OK]
Memory thoughput [GB/s] -
Avg( 61.36, 1.59x) Peak( 78.16, 1.43x) Min( 46.65, 2.51x)
-
Okay, I remembered to stop BOINC this time....
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd\test
C:\test>powerspectrumtest10.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 11.40) Peak( 17.48) Min( 2.91) [OK]
Memory thoughput GB/s Avg( 6.85) Peak( 9.86) Min( 4.95)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 12.35, 1.08x) Peak( 18.33, 1.05x) Min( 4.45, 1.53x) [OK]
Memory thoughput [GB/s] -
Avg( 7.76, 1.13x) Peak( 10.33, 1.05x) Min( 5.14, 1.04x)
C:\test>
-
And My 465:
and
Okay, I remembered to stop BOINC this time....
...
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
...
Thanks both! Still some breathing room between avg & peak on those.
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 114.30) Peak( 169.79) Min( 21.35) [OK]
Memory thoughput GB/s Avg( 64.38) Peak( 89.45) Min( 34.20)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 165.56, 1.45x) Peak( 234.17, 1.38x) Min( 61.06, 2.86x) [OK]
Memory thoughput [GB/s] -
Avg( 100.82, 1.57x) Peak( 126.77, 1.42x) Min( 70.89, 2.07x)
Steve
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
...
Compute thoughput [GFlops] -
Avg( 165.56, 1.45x) Peak( 234.17, 1.38x) Min( 61.06, 2.86x) [OK]
Winning! (just ;)) Glad you're on water cooling with those. My fan cranks up with that and creates a vortex in my room :D.
It made me think '1.21 GigaWatts!' (http://www.youtube.com/watch?v=mjCRUvX2D0E). I'll be checking out & researching water cooling for the 480 here, sometime in the new year. Starting with the basics, with guides like This one (http://www.clunk.org.uk/forums/water-cooling/33772-water-cooling-guide-beginners.html), & doing my homework.
-
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
...
Compute thoughput [GFlops] -
Avg( 165.56, 1.45x) Peak( 234.17, 1.38x) Min( 61.06, 2.86x) [OK]
Winning! (just ;)) Glad you're on water cooling with those, My fan cranks up with that and creates a vortex in my room :D.
It made me think '1.21 GigaWatts!' (http://www.youtube.com/watch?v=mjCRUvX2D0E). I'll be checking out & researching on water cooling the 480 here, sometime in the new year. Starting with the basics with guides like This one (http://www.clunk.org.uk/forums/water-cooling/33772-water-cooling-guide-beginners.html), & doing my homework.
With all the help you have given others, I would be happy to offer any assistance I can, should you choose to go with water cooling. There is a lot in my System Tuning thread in NC you might find interesting: System Tuning (http://setiathome.berkeley.edu/forum_thread.php?id=62406&nowrap=true#1059367)
Steve
-
Q6600/8GB/8800GTX.
One remark though: if you want to run a test multiple times, why not do that in the download-able executable? I don't mind if a benchmark of yours runs several minutes on my rig, so just do a few test-runs, determine the max/min and standard-deviation or something and output that?
I have in any case run the benchmark 3 times on both OS versions, before running a 4th one redirected to a text-file (and compared that one too). Results and speed-ups looked stable to my 'naked' eye.
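Patrick's suggestion amounts to looping the benchmark in-process and reporting mean and standard deviation; a minimal sketch of the statistics side (hypothetical helper names, nothing from the test itself):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Mean of repeated throughput measurements (e.g. GFlops per run).
double mean(const std::vector<double>& runs) {
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Sample standard deviation (n-1 denominator); a small value relative to
// the mean indicates the result is stable across runs.
double stddev(const std::vector<double>& runs) {
    double m = mean(runs), ss = 0.0;
    for (double x : runs) ss += (x - m) * (x - m);
    return std::sqrt(ss / (runs.size() - 1));
}
```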
WinXP-32:
Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 51.45) Peak( 72.63) Min( 9.33) [OK]
Memory thoughput GB/s Avg( 30.07) Peak( 47.47) Min( 16.45)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 55.01, 1.07x) Peak( 75.98, 1.05x) Min( 13.89, 1.49x) [OK]
Memory thoughput [GB/s] -
Avg( 33.46, 1.11x) Peak( 49.65, 1.05x) Min( 24.23, 1.47x)
Win7-64:
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 45.04) Peak( 62.72) Min( 8.62) [OK]
Memory thoughput GB/s Avg( 26.39) Peak( 40.07) Min( 15.21)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 54.49, 1.21x) Peak( 75.17, 1.20x) Min( 13.75, 1.59x) [OK]
Memory thoughput [GB/s] -
Avg( 33.12, 1.26x) Peak( 49.13, 1.23x) Min( 24.07, 1.58x)
Regards, Patrick.
-
Did a few runs for test #10 on different cards/machines...
Cheers,
MarkJ
-------------------------------------------------
Device: GeForce GT 240, 1340 MHz clock, 475 MB memory.
Compute capability 1.2
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 32.78) Peak( 48.81) Min( 8.49) [OK]
Memory thoughput GB/s Avg( 19.49) Peak( 28.94) Min( 12.38)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 35.66, 1.09x) Peak( 51.41, 1.05x) Min( 12.84, 1.51x) [OK]
Memory thoughput [GB/s] -
Avg( 22.13, 1.14x) Peak( 30.48, 1.05x) Min( 15.22, 1.23x)
------------------------------------------------------------
Device: GeForce GTX 460, 1350 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 62.95) Peak( 102.88) Min( 8.18) [OK]
Memory thoughput GB/s Avg( 34.05) Peak( 52.16) Min( 13.33)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 79.87, 1.27x) Peak( 121.17, 1.18x) Min( 23.84, 2.91x) [OK]
Memory thoughput [GB/s] -
Avg( 47.79, 1.40x) Peak( 63.10, 1.21x) Min( 33.50, 2.51x)
-----------------------------------------------------------
Device: GeForce GTX 570, 1464 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 101.46) Peak( 151.95) Min( 20.02) [OK]
Memory thoughput GB/s Avg( 57.48) Peak( 79.89) Min( 30.85)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
revert to single stream from size 512
Processing... Done!
Compute thoughput [GFlops] -
Avg( 139.93, 1.38x) Peak( 199.62, 1.31x) Min( 51.29, 2.56x) [OK]
Memory thoughput [GB/s] -
Avg( 85.24, 1.48x) Peak( 106.89, 1.34x) Min( 58.81, 1.91x)
-
Q6600/8GB/8800GTX.
One remark though: if you want a test run multiple times, why not do that in the downloadable executable itself? I don't mind if a benchmark of yours runs for several minutes on my rig, so why not do a few test runs, determine the max/min and standard deviation or something, and output that?
I have in any case run the benchmark 3 times on both OS versions, before running a 4th one redirected to a text-file (and compared that one too). Results and speed-ups looked stable to my 'naked' eye.
Cheers & No worries Patrick,
Just wasn't sure extending the test was going to be needed. Naked-eye judgement is plenty for the purposes of testing scientific repeatability here, and running multiple times in the same exe would make it one large test rather than several small ones for comparison (if that makes any sense). I'm happy that the 8800 seems to have some headroom left, and the 'Min' numbers indicate the slowest kernels have received a nice boost.
Win7(WDDM) & XP(XPDM) driver model performance difference is 'gone' ;D
Secondary confirmation from a friend's 8800GTS:
XP32
Device: GeForce 8800 GTS 512, 1625 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 44.40) Peak( 66.68) Min( 7.85) [OK]
Memory thoughput GB/s Avg( 26.26) Peak( 41.19) Min( 13.83)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 47.57, 1.07x) Peak( 67.80, 1.02x) Min( 17.37, 2.21x) [OK]
Memory thoughput [GB/s] -
Avg( 30.04, 1.14x) Peak( 41.89, 1.02x) Min( 19.00, 1.37x)
Win7-32
Device: GeForce 8800 GTS 512, 1625 MHz clock, 500 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 40.57) Peak( 57.91) Min( 7.32) [OK]
Memory thoughput GB/s Avg( 23.86) Peak( 35.82) Min( 12.91)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 48.43, 1.19x) Peak( 66.67, 1.15x) Min( 15.87, 2.17x) [OK]
Memory thoughput [GB/s] -
Avg( 30.30, 1.27x) Peak( 41.94, 1.17x) Min( 20.41, 1.58x)
-
Did a few runs for test #10 on different cards/machines...
Cheers,
MarkJ
Thanks Mark! Starting to make a dent in the stubborn 240, and the Fermi boosts are looking healthy.
I'll need to get to checking the 260 in the other room soon; then we should have 'the full set'.
[Later:] Here 'tis
Device: GeForce GTX 260, 1242 MHz clock, 896 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 62.64) Peak( 93.36) Min( 4.48) [OK]
Memory thoughput GB/s Avg( 34.47) Peak( 52.71) Min( 7.89)
Opt1 (worst case): 128 thrds/block, 2 x 524288 element streams
revert to single stream from size 256
Processing... Done!
Compute thoughput [GFlops] -
Avg( 67.78, 1.08x) Peak( 95.96, 1.03x) Min( 5.69, 1.27x) [OK]
Memory thoughput [GB/s] -
Avg( 38.80, 1.13x) Peak( 55.48, 1.05x) Min( 10.03, 1.27x)
Maybe still some headroom on 200 series as well.
-
Cheers & No worries Patrick,
Just wasn't sure extending the test was going to be needed. Naked-eye judgement is plenty for the purposes of testing scientific repeatability here, and running multiple times in the same exe would make it one large test rather than several small ones for comparison (if that makes any sense). I'm happy that the 8800 seems to have some headroom left, and the 'Min' numbers indicate the slowest kernels have received a nice boost.
Win7(WDDM) & XP(XPDM) driver model performance difference is 'gone' ;D
Thanks for the extended explanation; my remark was merely prompted by curiosity (and probably a large gap in my understanding of the underlying higher goals), but I feel more enlightened now. ;)
Regards, Patrick.
-
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 9.58) Peak( 13.91) Min( 2.48) [OK]
Memory thoughput GB/s Avg( 5.70) Peak( 9.09) Min( 3.53)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 11.23, 1.17x) Peak( 15.13, 1.09x) Min( 4.27, 1.72x) [OK]
Memory thoughput [GB/s] -
Avg( 6.99, 1.23x) Peak( 9.88, 1.09x) Min( 5.01, 1.42x)
values roughly ±0.3 on stock and ±0.1 on opt1
[edit] compute speedup 1.56x-1.76x, memory speedup 1.22x-1.47x
-
Thanks for the extended explanation; my remark was merely prompted by curiosity (and probably a large gap in my understanding of the underlying higher goals), but I feel more enlightened now. ;)
Yeah, a bit more info along those lines: the actual kernels under test run in timing loops set to roughly half a second each, which is enough for thousands to millions of runs, so I was expecting 'fair' stability in the Avg, Peak & Min values. We're alright for discrete kernel performance measurements.
I have however picked up an interesting thing on a friend's i7-860 w/GTX 480, comparing against mine (45nm Core 2 w/GTX 480):
- His Peaks & Averages are ~same as mine for the same clockrate ... BUT ... his 'Min' values (slowest kernels) are several times faster. Better CPU & RAM does have a significant impact on the running of the toughest parts of the code, it seems.
Jason
-
values roughly ±0.3 on stock and ±0.1 on opt1
Hey that's decent! ... and there you were going to start a riot when initial mods yielded about 5% slowdown on yours ... tsk tsk tsk ;D
-
Hey that's decent! ... and there you were going to start a riot when initial mods yielded about 5% slowdown on yours ... tsk tsk tsk ;D
Oh, I just learned how to complain when not suffering ;D Did the trick, didn't it? ;)
-
My 9800GTX+ on Win 7 x64:
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 49.81) Peak( 71.73) Min( 8.11) [OK]
Memory thoughput GB/s Avg( 29.08) Peak( 44.80) Min( 14.31)
Opt1 (worst case): 64 thrds/block, 2 x 524288 element streams
revert to single stream from size 128
Processing... Done!
Compute thoughput [GFlops] -
Avg( 57.66, 1.16x) Peak( 80.19, 1.12x) Min( 18.07, 2.23x) [OK]
Memory thoughput [GB/s] -
Avg( 35.80, 1.23x) Peak( 50.46, 1.13x) Min( 24.47, 1.71x)
Claggy
-
Device: GeForce GTX 260, 1441 MHz clock, 869 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
Processing... Done!
Compute Thoughput GFlops Avg( 47.55) Peak( 65.20) Min( 10.16) [OK]
Memory thoughput GB/s Avg( 28.09) Peak( 37.12) Min( 17.92)
Opt1 (worst case): 128 thrds/block, 2 x 524288 element streams
revert to single stream from size 256
Processing... Done!
Compute thoughput [GFlops] -
Avg( 84.83, 1.78x) Peak( 111.50, 1.71x) Min( 31.57, 3.11x) [OK]
Memory thoughput [GB/s] -
Avg( 52.63, 1.87x) Peak( 67.26, 1.81x) Min( 36.24, 2.02x)
-
Thanks both!
@glenaxl: that's some impressive speedup on GTX 260, I'll have to look at that here carefully on mine when I get a chance to do so.
@Claggy, average at 3/4 of peak seems pretty good, but I think we can squeeze out a bit more.
@ALL, Thanks! I'm closing this test for now. It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular).
FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here. Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come :)
[Locking thread, Please stay tuned for further Unit Tests!]
Jason
-
@All:
Just a note that the concerns that arose, and distracted me from testing & development along this line, have now been at least partially resolved and don't require any immediate action on our part. I'm back to ruggedising & integrating what we've accomplished here into the X-builds, and I plan to start devising tests for PoT (Power over Time) processing refinement soon, in a similar fashion to this thread. PoT processing covers Gaussian searches and Triplet & Pulse finding, for which all Cuda releases have known issues to address, so there'll be plenty of tests to devise & collect data for yet.
Cheers once again! :)
Jason