.... Is there a CUDA 3.2 app available yet for alpha testing, just to see where the dividing line really is?
No, but I was just playing with a power spectrum kernel unit test built with 3.2 Release that could be sufficient to see which drivers work with 3.2 Release and which don't (I expect min 260.99 is fine). The kernels are all hard-coded, so no speed difference should be evident when changing drivers.
[ PowerSpectrum Unit Test attached, the provided DLL must be present when executed at a command prompt. ]
Jason
[Edit:] Confirmed: requires driver 260.89+.
[Mod] Split off driver thread
[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1: Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-fermi precision
Mod2: Fixed, but sadly is slow now, remains at stock accuracy
Mod3: As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)
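Precision in these tests is reported in ulps (units in the last place). For anyone wanting to compare outputs themselves, here is a minimal sketch of an integer ULP distance between two single-precision values. The names `ordered_bits` and `ulp_distance` are made up for illustration, and this is not necessarily how the unit test derives its own (fractional) ulps figure.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: map a float's bit pattern onto an ordered integer
// line, so that adjacent representable floats differ by exactly 1.
static int64_t ordered_bits(float x) {
    int32_t i;
    std::memcpy(&i, &x, sizeof i);  // safe type pun
    // Floats are sign-magnitude: mirror negative values below zero.
    return (i < 0) ? static_cast<int64_t>(INT32_MIN) - i
                   : static_cast<int64_t>(i);
}

// Number of representable floats between a and b (0 means identical,
// and +0.0 / -0.0 count as the same value). NaN inputs are not handled.
int64_t ulp_distance(float a, float b) {
    return std::llabs(ordered_bits(a) - ordered_bits(b));
}
```

Averaging `ulp_distance` over every output point of two builds gives a rough single-number comparison between them.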
[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidy up & ironed out a bug that only manifests on Arkayn's card so far. Could be a smidgen faster.
[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64) 1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
- Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
- Opt1: the best & worst cases likely to occur in real life are tested. Worst case should show anywhere from ~same as stock to ~30% improvement (depending on GPU); best case ~1.3-2x stock throughput (depending on GPU etc). Worst case results are checked for accuracy & flagged if there's a problem.
- On integrated GPUs, mapped/pinned host memory is used, so on those the worst case should be ~= the best case (and hopefully some margin better than the stock reduction).
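The thread-count selection listed above can be sketched as a plain host-side function. The name `pick_threads_per_block` is hypothetical; in a real CUDA build the major/minor values would typically come from the `major` and `minor` fields of `cudaDeviceProp`, which is left out here so the sketch compiles without the CUDA toolkit.

```cpp
// Hypothetical helper mirroring the mapping in the list above:
//   cc 1.0-1.2 -> 64 threads, cc 1.3 -> 128 threads, cc 2.0+ -> 256 threads.
int pick_threads_per_block(int major, int minor) {
    if (major >= 2) {
        return 256;  // Fermi (cc 2.0) and later
    }
    if (major == 1 && minor >= 3) {
        return 128;  // cc 1.3
    }
    return 64;       // cc 1.0-1.2
}
```

For the GTX 480 in the sample output (compute capability 2.0) this returns 256, matching the "256 thrds/block" choice shown there.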
Example output (important numbers: Stock, Opt1):
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.1 GB/s 0.0ulps
SumMax ( 64) 1.8 GFlops 7.4 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 5.9 GFlops 24.1 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.3 GFlops 177.1 GB/s 121.7ulps
Opt1 (PSmod3+SM):
256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.2 GB/s 121.7ulps
Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither, Test5 remains for comparison)
Update: PowerSpectrum(+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little, should be a tad faster for size 64, which was in the prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
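For checking GPU results against something simple, here is a rough CPU reference for the power spectrum plus summax step. It assumes the power is re^2 + im^2 per point and that "summax" means a per-segment sum and maximum over each fftlen-sized chunk; the struct and function names are invented for this sketch and are not taken from the test source.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical result record for one fftlen-sized segment.
struct SumMax {
    float sum;        // sum of powers in the segment
    float max;        // largest power in the segment
    std::size_t pos;  // index of that maximum within the segment
};

// power[i] = re^2 + im^2 for every complex frequency-domain point.
std::vector<float> power_spectrum(const std::vector<std::complex<float>>& freq) {
    std::vector<float> power(freq.size());
    for (std::size_t i = 0; i < freq.size(); ++i) {
        const float re = freq[i].real();
        const float im = freq[i].imag();
        power[i] = re * re + im * im;
    }
    return power;
}

// Serial sum/max reduction over consecutive segments of length fftlen.
// A trailing partial segment is ignored.
std::vector<SumMax> sum_max(const std::vector<float>& power, std::size_t fftlen) {
    std::vector<SumMax> out;
    if (fftlen == 0) {
        return out;  // nothing sensible to reduce
    }
    for (std::size_t base = 0; base + fftlen <= power.size(); base += fftlen) {
        SumMax r{0.0f, power[base], 0};
        for (std::size_t k = 0; k < fftlen; ++k) {
            const float p = power[base + k];
            r.sum += p;
            if (p > r.max) {
                r.max = p;
                r.pos = k;
            }
        }
        out.push_back(r);
    }
    return out;
}
```

Calling `sum_max(power_spectrum(data), n)` for each n from 8 through 64 gives reference values per segment. A GPU reduction adds in a different order, so small floating-point differences in the sums are expected.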
Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- minimal changes to larger sizes; anything larger than the selected thrds/blk is 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation all [OK] & no Opt1 'worst case' slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- With all sizes included it's a longer run, and it may take several runs to see if a '[BAD]' will manifest.
Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only
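To make the 0.001% figure concrete: it is a relative tolerance of 1e-5. Below is a minimal sketch of that kind of check, assuming each average and peak is compared against a reference value; `within_tolerance` and `verdict` are hypothetical names, and the actual comparison inside the test may differ.

```cpp
#include <cmath>
#include <string>

// 0.001% expressed as a fraction (1e-5).
constexpr double kTolerance = 0.001 / 100.0;

// True if candidate is within tol (relative) of reference.
bool within_tolerance(double reference, double candidate,
                      double tol = kTolerance) {
    if (reference == 0.0) {
        return candidate == 0.0;  // relative error is undefined at zero
    }
    return std::fabs(candidate - reference) <= tol * std::fabs(reference);
}

// Labels one result the way the test output does: [OK] or [BAD].
std::string verdict(double ref_avg, double avg, double ref_peak, double peak) {
    const bool ok = within_tolerance(ref_avg, avg) &&
                    within_tolerance(ref_peak, peak);
    return ok ? "[OK]" : "[BAD]";
}
```

With this tolerance, a peak of 100.0005 against a reference of 100.0 passes, while 100.002 does not.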
Temporary download location(s):
fast: http://www.arkayn.us/seti/PowerSpectrumTest9.7z
slow: ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z
Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- can vary, so may need a few runs, just to check stability of result
- Please use DLLs provided with Test#9
Update: @ALL, Thanks! I'm closing this test for now. It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular). FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here. Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come.
[Locking thread, Please stay tuned for further Unit Tests!]