[Split] PowerSpectrum Unit Test

Jason G:

--- Quote from: Richard Haselgrove on 18 Nov 2010, 06:12:19 am ---.... Is there a CUDA 3.2 app available yet for alpha testing, just to see where the dividing line really is?

--- End quote ---

No, but I was just playing with a power spectrum kernel unit test built with the 3.2 release, which could be sufficient to see which drivers work with the 3.2 release and which don't (I expect a minimum of 260.99 is fine). The kernels are all 'hard coded', so no speed difference should be evident between driver versions.

[PowerSpectrum Unit Test attached; the provided DLL must be present when it is run from a command prompt.]

Jason

[Edit:] Confirmed: requires driver 260.89+. [Mod] Split off from the driver thread.
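
For anyone scripting a quick check before running the test, here is a minimal sketch of the driver threshold. It assumes you already have the installed driver version as a string such as "260.89"; how you obtain it is up to you, and the helper names are made up for illustration. The versions in the loop are the ones reported in this thread.

```python
def parse_driver_version(text):
    """Split a version string such as '260.89' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum driver reported as working in this thread (assumed threshold).
MIN_DRIVER = parse_driver_version("260.89")


def driver_is_new_enough(text):
    """True when the given driver version is at least MIN_DRIVER."""
    return parse_driver_version(text) >= MIN_DRIVER


if __name__ == "__main__":
    for version in ("258.96", "260.89", "260.99"):
        status = "ok" if driver_is_new_enough(version) else "too old"
        print(version, status)
```

Comparing tuples of integers rather than raw strings or floats keeps the ordering correct if a minor version ever has a different number of digits.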

[Updated] Mod3_UnitTest attached: changed both mods & added a third
Mod1:  Tuned precision so that non-Fermi & Fermi results match, and exceed stock pre-Fermi precision
Mod2:  Fixed, but sadly it is slow now; remains at stock accuracy
Mod3:  As Mod1, with extra threads & split loads added (may be suitable for some ranges of cards)

[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3:  Tidied up & ironed out a bug that so far only manifests on Arkayn's card :o.  Could be a smidgen faster.

[Updated] to PowerSpectrum Unit Test #5
Single fftlen size (64), 1-meg-point power spectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of the thread count for the power spectrum, by compute capability
( 1.0-1.2 = 64 threads, 1.3 = 128 threads, 2.0+ = 256 threads)
 - Opt1: the best & worst cases likely to occur in real life are tested. The worst case should show anywhere from ~same as stock to ~30% improvement (depending on GPU); the best case ~1.3-2x stock throughput (depending on GPU etc). Worst-case results are checked for accuracy & flagged if there's a problem.
 - On integrated GPUs, mapped/pinned host memory is used, so on those the worst case should be ~= the best case (and hopefully some margin better than the stock reduction  :-\)
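
The compute-capability mapping listed above can be sketched as a small lookup. This is only an illustration of the stated rule, not the test's actual source; the function name is hypothetical.

```python
def threads_for_capability(major, minor):
    """Map a CUDA compute capability to the thread count listed above.

    1.0-1.2 -> 64 threads, 1.3 -> 128 threads, 2.0 and later -> 256 threads.
    """
    if major >= 2:
        return 256
    if major == 1 and minor >= 3:
        return 128
    return 64


if __name__ == "__main__":
    for major, minor in [(1, 0), (1, 2), (1, 3), (2, 0)]:
        count = threads_for_capability(major, minor)
        print(f"cc {major}.{minor} -> {count} threads/block")
```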

Example output (the important numbers are the Stock and Opt1 figures):

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.1 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    5.9 GFlops   24.1 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.3 GFlops  177.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
        8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.2 GB/s 121.7ulps
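
The ulps column above reports error in units in the last place. The thread does not show how the test computes it, so the following is only a sketch under assumptions: the power of a complex sample is taken as re*re + im*im, values are single precision, and the distance between two results is counted in representable float32 steps. All function names are made up for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def power_spectrum(samples):
    """Power of each complex sample (re, im), rounded to float32."""
    return [to_f32(re * re + im * im) for re, im in samples]


def _ordered_bits(x):
    """Map a float32 bit pattern to an integer that sorts like the float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance(a, b):
    """Number of float32 steps between two finite values."""
    return abs(_ordered_bits(a) - _ordered_bits(b))


if __name__ == "__main__":
    reference = power_spectrum([(3.0, 4.0), (1.0, 0.0)])
    # Pretend result that is one float32 step off on the second sample.
    candidate = [reference[0], reference[1] + 2.0 ** -23]
    for ref, got in zip(reference, candidate):
        print(ref, got, ulp_distance(ref, got))
```

Averaging such distances over every output point would give one figure per run, which may or may not be how the number in the output above is produced.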

Update: PowerSpectrum Test #6, pinned memory
- Does it improve the 'worst case' optimisation on WDDM versus XPDM?
- Or does it improve both OSes equally? (Or neither; Test #5 remains for comparison.)

Update: PowerSpectrum(+summax reduction) Test #7
 - completed summax reduction sizes 8 through 64
 - refined Opt1 a little; should be a tad faster for size 64 than in the prior test
 - tidied up test result layout
 - enabled pinned memory use for Opt1 on all CUDA-capable cards (including cc1.0)

Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- Minimal changes to the larger sizes; anything larger than the selected thrds/blk is 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around the selected thrds/blk
- Don't need full results, just confirmation that all are [OK] & that no Opt1 'worst case' is slower than stock
- I intend to integrate FFTs next, so this is a critical sanity check.
- With all sizes included it's a longer run, and it may take several runs to see whether a '[BAD]' will manifest.

Update: PowerSpectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only
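
As a rough picture of what a 0.001% peak/average tolerance means, here is a sketch that compares the peak and the average of a result against a reference. It assumes the tolerance is a relative difference; the thread does not say exactly how the test applies it. The function names and sample values are hypothetical.

```python
def within_tolerance(value, reference, tol_percent=0.001):
    """True when value is within tol_percent percent of reference."""
    if reference == 0.0:
        return value == 0.0
    return abs(value - reference) <= abs(reference) * tol_percent / 100.0


def check_peak_and_average(result, reference, tol_percent=0.001):
    """Return '[OK]' if both peak and average agree, else '[BAD]'."""
    peak_ok = within_tolerance(max(result), max(reference), tol_percent)
    avg_ok = within_tolerance(
        sum(result) / len(result),
        sum(reference) / len(reference),
        tol_percent,
    )
    return "[OK]" if peak_ok and avg_ok else "[BAD]"


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    print(check_peak_and_average([1.0, 2.0, 3.0], reference))
    print(check_peak_and_average([1.0, 2.0, 3.001], reference))
```

At 0.001%, a peak of 3.0 may be off by at most 0.00003, so the second call above fails the check.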

Temporary download location(s):
fast:  http://www.arkayn.us/seti/PowerSpectrumTest9.7z
slow: ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z


Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- results can vary, so a few runs may be needed to check the stability of the result
- Please use the DLLs provided with Test #9

Update: @ALL, Thanks! I'm closing this test for now.  It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular).

FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here.  Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come   :)

[Locking thread, Please stay tuned for further Unit Tests!]

Frizz:
How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.

Jason G:
What driver?

Miep:

--- Quote from: Frizz on 18 Nov 2010, 08:43:49 am ---How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.

--- End quote ---

Updating from 258.96 to 260.99 solved that hiccup for me.

Richard Haselgrove:
And I've just checked that 260.89 is good enough, too.
