Author Topic: [Split] PowerSpectrum Unit Test (Read 188059 times)

Jason G · « **on:** 18 Nov 2010, 07:45:10 am »

Quote from: Richard Haselgrove on 18 Nov 2010, 06:12:19 am

.... Is there a CUDA 3.2 app available yet for alpha testing, just to see where the dividing line really is?

No, but I was just playing with a power spectrum kernel unit test built with 3.2 Release. It could be sufficient to see which drivers work with 3.2 Release and which don't (I expect min 260.99 is fine). The kernels are all hard-coded, so no speed difference should be evident between drivers.

[ PowerSpectrum Unit Test attached, the provided DLL must be present when executed at a command prompt. ]

Jason

[Edit:] Confirmed: requires driver 260.89+. [Mod] Split off from driver thread

[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1: Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-Fermi precision
Mod2: Fixed, but sadly is slow now, remains at stock accuracy
Mod3: As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)

[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidy up & ironed out a bug that only manifests on Arkayn's card so far

Could be a smidgen faster.

[Updated] to PowerSpectrum Unit Test #5
Single-size fftlen (64), 1M-point power spectrum with summax reduction, to test a number of experimental features (please check):
- Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
- Opt1 is tested on the best & worst cases likely to occur in real life. Worst case should show anywhere from ~same as stock to ~30% improvement (depending on GPU); best case ~1.3-2x stock throughput (depending on GPU etc). Worst-case results are checked for accuracy & flagged if there's a problem.
- On integrated GPUs, use mapped/pinned host memory, so on those worst case should be ~= best case (and hopefully some margin better than the stock reduction :-\ )
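
The thread-count selection in the first bullet can be sketched as a small lookup. This is only an illustration of the mapping listed above (1.0-1.2 = 64, 1.3 = 128, 2.0+ = 256); the function name `threads_per_block` is hypothetical and not taken from the test's source.

```python
def threads_per_block(major: int, minor: int) -> int:
    """Pick a power-spectrum block size from the CUDA compute capability.

    The thresholds come from the list in the post; the function name and
    signature are assumptions made for this sketch.
    """
    cc = (major, minor)
    if cc >= (2, 0):
        return 256  # compute capability 2.0 and later (Fermi onwards)
    if cc >= (1, 3):
        return 128  # compute capability 1.3
    return 64       # compute capability 1.0-1.2


if __name__ == "__main__":
    for cc in [(1, 0), (1, 2), (1, 3), (2, 0)]:
        print(cc, "->", threads_per_block(*cc), "threads")
```

Comparing `(major, minor)` tuples keeps the ordering correct without converting the version to a float.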

Example output (important numbers: the Stock and Opt1 figures)

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.1 GB/s 0.0ulps

SumMax ( 64) 1.8 GFlops 7.4 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 5.9 GFlops 24.1 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.3 GFlops 177.1 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.2 GB/s 121.7ulps
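
For readers unfamiliar with the ulps column: it reports error in units in the last place of a single-precision float. How the unit test itself measures it is not shown in this thread, so the following is only a minimal sketch of one common way to compute a ULP distance and its mean over a result set. The helper names are hypothetical.

```python
import struct


def float32_ordered_bits(x: float) -> int:
    """Map x (rounded to float32) to an integer whose ordering matches float ordering."""
    (bits,) = struct.unpack("<i", struct.pack("<f", x))
    # Negative floats have the sign bit set; mirror them below zero so that
    # adjacent floats always differ by exactly 1 (and -0.0 maps to 0).
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 values between a and b."""
    return abs(float32_ordered_bits(a) - float32_ordered_bits(b))


def mean_ulps(results, reference) -> float:
    """Average ULP distance between a result list and a reference list."""
    pairs = list(zip(results, reference))
    return sum(ulp_distance(r, ref) for r, ref in pairs) / len(pairs)


if __name__ == "__main__":
    next_after_one = 1.0 + 2.0 ** -23  # the float32 value just above 1.0
    print(ulp_distance(1.0, next_after_one))
    print(mean_ulps([1.0, next_after_one], [1.0, 1.0]))
```

A mean of 0.0 ulps on this definition would mean every value matched the reference bit for bit.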

Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither, Test5 remains for comparison)

Update: PowerSpectrum(+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little, should be a tad faster for size 64 that was in prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
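
As a reference for what the combined step computes, here is a plain CPU sketch. It assumes that "summax" reduces each FFT-length chunk of the power spectrum to its sum, its maximum and the position of that maximum; the exact fields the kernels return are not stated in this thread, and the function names are illustrative only.

```python
def power_spectrum(samples):
    """Power of each complex sample given as (re, im): re*re + im*im."""
    return [re * re + im * im for re, im in samples]


def summax(power, fftlen):
    """Reduce each fftlen-sized chunk to (sum, max, index of max within the chunk)."""
    results = []
    for start in range(0, len(power), fftlen):
        chunk = power[start:start + fftlen]
        peak = max(chunk)
        results.append((sum(chunk), peak, chunk.index(peak)))
    return results


if __name__ == "__main__":
    samples = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (0.5, 0.5)]
    power = power_spectrum(samples)
    print(power)
    # fftlen of 2 is used only to keep the example small;
    # the test covers sizes 8 through 64.
    print(summax(power, fftlen=2))
```

A GPU version would do the per-chunk reduction in parallel, but the values it produces can be compared against a serial reference like this one.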

Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- minimal changes to larger sizes; anything larger than the selected thrds/blk is 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation all [OK] & no Opt1 'worst case' slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- with all sizes included it's a longer run, and may require several runs to see if a '[BAD]' will manifest.

Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only

Temporary download location(s):
fast: http://www.arkayn.us/seti/PowerSpectrumTest9.7z
slow: ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z

Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- can vary, so may need a few runs, just to check stability of result
- Please use DLLs provided with Test#9

Update: @ALL, Thanks! I'm closing this test for now. It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular).

FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here. Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come.

[Locking thread, Please stay tuned for further Unit Tests!]

Frizz · « **Reply #1 on:** 18 Nov 2010, 08:43:49 am »

How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.

Jason G · « **Reply #2 on:** 18 Nov 2010, 08:44:56 am »

What driver ?

Miep · « **Reply #3 on:** 18 Nov 2010, 08:48:29 am »

Quote from: Frizz on 18 Nov 2010, 08:43:49 am

How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.

updating from 258.96 to 260.99 solved that hiccup for me

Richard Haselgrove · « **Reply #4 on:** 18 Nov 2010, 08:52:28 am »

And I've just checked that 260.89 is good enough, too.

Jason G · « **Reply #5 on:** 18 Nov 2010, 09:07:18 am »

As a side effect, I'm accumulating a good collection of data that tells me a lot about the different GPU memory subsystems, on different generations (Powerspectrum is a 'memory bound' computation). Will split this off into its own thread a bit later [Done]

[Edit] In our own thread now, feel free to post results here, attach, or PM. I'm getting some very handy information to use this weekend, toward optimisation strategies.

Will try and make some sort of table up once I make some sense of the data.
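
A rough way to see why the power spectrum is memory bound is to count the work per point. The sketch below assumes each point reads one single-precision complex value (8 bytes), writes one float (4 bytes) and costs 3 floating-point operations (`re*re + im*im`). Those byte and flop counts are assumptions, though they are consistent with the Test #5 sample output in the first post (29.0 GFlops alongside 116.1 GB/s). Results from the earlier test versions posted in this thread print a different ratio, so treat the numbers as illustrative.

```python
FLOPS_PER_POINT = 3   # two multiplies and one add: re*re + im*im
BYTES_PER_POINT = 12  # assumed: 8-byte complex float read + 4-byte float written


def arithmetic_intensity() -> float:
    """Floating-point operations performed per byte moved."""
    return FLOPS_PER_POINT / BYTES_PER_POINT


def implied_bandwidth_gbs(gflops: float) -> float:
    """Memory traffic in GB/s implied by a given GFlops rate."""
    return gflops * BYTES_PER_POINT / FLOPS_PER_POINT


if __name__ == "__main__":
    print("flops per byte:", arithmetic_intensity())
    print("GB/s at 29.0 GFlops:", implied_bandwidth_gbs(29.0))
```

At a quarter of a flop per byte, throughput is limited by how fast the card can move data rather than by its arithmetic units, which is why the results say so much about each generation's memory subsystem.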

Frizz · « **Reply #6 on:** 18 Nov 2010, 10:29:01 am »

Device: GeForce GT 240, 1340 MHz clock, 512 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 10.1 GFlops 4.0 GB/s 1183.3ulps

GetPowerSpectrum() mod 1:
32 threads: 8.6 GFlops 3.4 GB/s 1183.3ulps
64 threads: 10.1 GFlops 4.1 GB/s 1183.3ulps
128 threads: 10.1 GFlops 4.0 GB/s 1183.3ulps
256 threads: 10.1 GFlops 4.0 GB/s 1183.3ulps

GetPowerSpectrum() mod 2:
32 threads: 3.4 GFlops 1.3 GB/s 1183.3ulps
64 threads: 4.5 GFlops 1.8 GB/s 1183.3ulps
128 threads: 4.5 GFlops 1.8 GB/s 1183.3ulps
256 threads: 4.4 GFlops 1.8 GB/s 1183.3ulps

Richard Haselgrove · « **Reply #7 on:** 18 Nov 2010, 10:36:49 am »

A couple more datapoints, from Windows 7. The 'AMP' in the card model name says it's a factory overclock version.

Jason G · « **Reply #8 on:** 18 Nov 2010, 10:49:27 am »

Thanks both. Those are the 'stubborn' cards

Frizz · « **Reply #9 on:** 18 Nov 2010, 11:23:24 am »

I can't speak for Richard but my card is as stubborn as I am

glennaxl · « **Reply #10 on:** 18 Nov 2010, 11:36:31 am »

**********
-device 0
**********
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps

GetPowerSpectrum() mod 1:
32 threads: 18.6 GFlops 7.4 GB/s 1183.3ulps
64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps
128 threads: 26.7 GFlops 10.7 GB/s 1183.3ulps
256 threads: 26.7 GFlops 10.7 GB/s 1183.3ulps

GetPowerSpectrum() mod 2:
32 threads: 5.3 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.2 GFlops 2.9 GB/s 1183.3ulps
128 threads: 10.6 GFlops 4.2 GB/s 1183.3ulps
256 threads: 10.7 GFlops 4.3 GB/s 1183.3ulps

**********
-device 1
**********
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 25.8 GFlops 10.3 GB/s 1183.3ulps

GetPowerSpectrum() mod 1:
32 threads: 17.9 GFlops 7.2 GB/s 1183.3ulps
64 threads: 26.0 GFlops 10.4 GB/s 1183.3ulps
128 threads: 26.1 GFlops 10.4 GB/s 1183.3ulps
256 threads: 24.6 GFlops 9.8 GB/s 1183.3ulps

GetPowerSpectrum() mod 2:
32 threads: 5.2 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.1 GFlops 2.8 GB/s 1183.3ulps
128 threads: 10.3 GFlops 4.1 GB/s 1183.3ulps
256 threads: 10.6 GFlops 4.2 GB/s 1183.3ulps

**********
-device 2
**********
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 25.4 GFlops 10.2 GB/s 1183.3ulps

GetPowerSpectrum() mod 1:
32 threads: 18.7 GFlops 7.5 GB/s 1183.3ulps
64 threads: 25.6 GFlops 10.2 GB/s 1183.3ulps
128 threads: 25.9 GFlops 10.4 GB/s 1183.3ulps
256 threads: 25.9 GFlops 10.4 GB/s 1183.3ulps

GetPowerSpectrum() mod 2:
32 threads: 5.2 GFlops 2.1 GB/s 1183.3ulps
64 threads: 7.0 GFlops 2.8 GB/s 1183.3ulps
128 threads: 10.3 GFlops 4.1 GB/s 1183.3ulps
256 threads: 10.4 GFlops 4.1 GB/s 1183.3ulps

Jason G · « **Reply #11 on:** 18 Nov 2010, 11:39:16 am »

Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480. That's something to investigate. Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]

Quote

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 29.1 GFlops 11.6 GB/s 0.0ulps

GetPowerSpectrum() mod 1:
32 threads: 17.6 GFlops 7.1 GB/s 0.0ulps
64 threads: 28.9 GFlops 11.6 GB/s 0.0ulps
128 threads: 40.5 GFlops 16.2 GB/s 0.0ulps
256 threads: 44.0 GFlops 17.6 GB/s 0.0ulps

GetPowerSpectrum() mod 2:
32 threads: 19.3 GFlops 7.7 GB/s 0.0ulps
64 threads: 38.0 GFlops 15.2 GB/s 0.0ulps
128 threads: 61.1 GFlops 24.5 GB/s 0.0ulps
256 threads: 61.4 GFlops 24.6 GB/s 0.0ulps

glennaxl · « **Reply #12 on:** 18 Nov 2010, 11:44:39 am »

Quote from: Jason G on 18 Nov 2010, 11:39:16 am

Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480. That's something to investigate. Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]

My bad. FAH was running in background. Edited my post with new results.

Jason G · « **Reply #13 on:** 18 Nov 2010, 11:50:00 am »

Quote from: glennaxl on 18 Nov 2010, 11:44:39 am

My bad. FAH was running in background. Edited my post with new results.

Ahh, cheers & LoL... I'm wondering why mod2 doesn't appear to work on those. ([Later:] ah, probably some shared memory bank conflicts or such, will read up on that.)

Ghost0210 · « **Reply #14 on:** 18 Nov 2010, 02:07:05 pm »

And on my 465 with 260.99 drivers:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 16.0 GFlops 6.4 GB/s 0.0ulps

GetPowerSpectrum() mod 1:
32 threads: 9.8 GFlops 3.9 GB/s 0.0ulps
64 threads: 15.9 GFlops 6.3 GB/s 0.0ulps
128 threads: 20.9 GFlops 8.3 GB/s 0.0ulps
256 threads: 23.1 GFlops 9.2 GB/s 0.0ulps

GetPowerSpectrum() mod 2:
32 threads: 14.4 GFlops 5.8 GB/s 0.0ulps
64 threads: 28.4 GFlops 11.4 GB/s 0.0ulps
128 threads: 33.5 GFlops 13.4 GB/s 0.0ulps
256 threads: 32.8 GFlops 13.1 GB/s 0.0ulps
