Author Topic: [Split] PowerSpectrum Unit Test  (Read 137813 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
[Split] PowerSpectrum Unit Test
« on: 18 Nov 2010, 07:45:10 am »
.... Is there a CUDA 3.2 app available yet for alpha testing, just to see where the dividing line really is?

No, but I was just playing with a power spectrum kernel unit test built with 3.2 Release that could be sufficient to see which drivers work with 3.2 Release, and which don't (I expect min 260.99 is fine).  The kernels are all 'hard coded', so no speed difference should be evident between driver changes.

[ PowerSpectrum Unit Test attached, the provided DLL must be present when executed at a command prompt. ]

Jason

[Edit:] Confirmed: requires driver 260.89+. [Mod] Split off from driver thread

[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1:  Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-fermi precision
Mod2:  Fixed, but sadly is slow now, remains at stock accuracy
Mod3:  As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)
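For anyone reading the ulps column in the results: the sketch below shows one common way to measure the distance between two single-precision values in units in the last place, and to average it over a result buffer. It is an assumption about how such a figure could be computed, not the unit test's actual code; `ordered_bits`, `ulp_distance` and `mean_ulps` are hypothetical names, and IEEE-754 floats are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: map a float's bit pattern onto a signed integer line
// that increases monotonically with the float's value (assumes IEEE-754).
static std::int64_t ordered_bits(float f) {
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);  // well-defined way to read the bits
    if (i < 0) {
        // Sign-magnitude to ordered: -0.0f maps to 0, negatives go below 0.
        return static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()) - i;
    }
    return i;
}

// Number of representable-float steps from a to b (0 when equal).
// NaN inputs are not handled in this sketch.
std::int64_t ulp_distance(float a, float b) {
    const std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

// Average ulp distance between a reference buffer and a test buffer.
double mean_ulps(const float* reference, const float* test, std::size_t n) {
    assert(n > 0);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += static_cast<double>(ulp_distance(reference[i], test[i]));
    }
    return total / static_cast<double>(n);
}
```

An averaged figure like this would explain non-integer values such as those in the ulps column, but that reading is a guess.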

[Updated] to PowerSpectrum Unit Test #4
Mod1: no changes
Mod2: no changes
Mod3: Tidied up & ironed out a bug that so far only manifests on Arkayn's card :o.  Could be a smidgen faster.

[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64)  1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
 - Opt1 best & worst cases likely to occur in real life are tested. The worst case should range from ~same as stock to ~30% improvement (depending on GPU); the best case should give ~1.3-2x stock throughput (depending on GPU etc). Worst case results are checked for accuracy & flagged if there's a problem.
 - On Integrated GPUs, use mapped/pinned host memory, so on those  worst case should be ~= best case ( and hopefully some margin better than the stock reduction  :-\)
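The thread-count selection by compute capability listed above can be sketched as a small host-side function. This is a minimal illustration of the mapping only (1.0-1.2 = 64, 1.3 = 128, 2.0+ = 256); `threads_per_block` is a hypothetical name, not the test's own code. In a real CUDA program the major and minor numbers would normally come from the `major` and `minor` fields of `cudaDeviceProp`, as filled in by `cudaGetDeviceProperties`.

```cpp
#include <cassert>

// Hypothetical helper: pick the power spectrum block size from the
// device's compute capability, following the mapping given in the post.
int threads_per_block(int cc_major, int cc_minor) {
    assert(cc_major >= 1 && cc_minor >= 0);
    if (cc_major >= 2) {
        return 256;  // compute capability 2.0 and later (Fermi)
    }
    if (cc_minor >= 3) {
        return 128;  // compute capability 1.3
    }
    return 64;       // compute capability 1.0 to 1.2
}
```

Keeping the choice in one function means a new card generation only needs one extra branch.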

Example output (the important numbers are the Stock and Opt1 figures):

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.1 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    5.9 GFlops   24.1 GB/s



GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.3 GFlops  177.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
        8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.2 GB/s 121.7ulps
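
The GFlops and GB/s columns can both be derived from a single timing measurement. A minimal sketch, assuming 3 floating-point operations per point (two multiplies, one add) and 12 bytes of memory traffic per point (one 8-byte complex read, one 4-byte float write). Those per-point counts are assumptions, not taken from the test's source, although they would be consistent with the 4:1 ratio of GB/s to GFlops on the Stock PwrSpec line above. `Throughput` and `power_spectrum_throughput` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Throughput {
    double gflops;         // 1e9 floating-point operations per second
    double gbytes_per_s;   // 1e9 bytes per second
};

// Hypothetical helper: convert a measured kernel time into throughput figures.
Throughput power_spectrum_throughput(std::size_t points, double seconds) {
    assert(seconds > 0.0);
    const double flops_per_point = 3.0;   // assumption: re*re + im*im
    const double bytes_per_point = 12.0;  // assumption: float2 in, float out
    const double n = static_cast<double>(points);
    return Throughput{
        n * flops_per_point / seconds / 1e9,
        n * bytes_per_point / seconds / 1e9
    };
}
```

Because the kernel is memory bound, the GB/s figure is the one to compare against the card's rated memory bandwidth.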

Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither, Test5 remains for comparison)

Update: PowerSpectrum(+summax reduction) Test #7
 - completed summax reduction sizes 8 through 64
 - refined Opt1 a little, should be a tad faster for the size 64 that was in the prior test
 - tidied up test result layout
 - enabled pinned memory use for Opt1 on all Cuda Capable cards (including cc1.0)
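
For readers unfamiliar with the operation being timed, here is a plain CPU reference of a power spectrum followed by a per-FFT sum and max reduction. It is a sketch under assumptions: the power is taken as re² + im² per point, and the reduction is assumed to produce one sum, one peak and the peak's position for each block of `fftlen` points. `SumMax`, `power_spectrum` and `sum_max` are hypothetical names, not the GPU kernels from the test.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

struct SumMax {
    float sum;              // sum of powers in one FFT block
    float max;              // peak power in that block
    std::size_t max_index;  // position of the peak within the block
};

// Power of each complex frequency-domain point: re^2 + im^2.
std::vector<float> power_spectrum(const std::vector<std::complex<float>>& freq) {
    std::vector<float> power(freq.size());
    for (std::size_t i = 0; i < freq.size(); ++i) {
        const float re = freq[i].real();
        const float im = freq[i].imag();
        power[i] = re * re + im * im;
    }
    return power;
}

// Reduce each consecutive block of fftlen powers to its sum and maximum.
std::vector<SumMax> sum_max(const std::vector<float>& power, std::size_t fftlen) {
    assert(fftlen > 0 && power.size() % fftlen == 0);
    std::vector<SumMax> out;
    out.reserve(power.size() / fftlen);
    for (std::size_t base = 0; base < power.size(); base += fftlen) {
        SumMax r{0.0f, power[base], 0};
        for (std::size_t k = 0; k < fftlen; ++k) {
            const float v = power[base + k];
            r.sum += v;
            if (v > r.max) {
                r.max = v;
                r.max_index = k;
            }
        }
        out.push_back(r);
    }
    return out;
}
```

A reference like this is also what GPU results could be checked against when looking for [BAD] sizes.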

Update: PowerSpectrum(+summax reduction) Test #8 - 'Sanity check'
- Check of all needed reduction sizes
- minimal changes to larger sizes; anything larger than the selected thrds/blk is 'almost' stock (but a bit better)
- Looking for any hardware that could yield [BAD] instead of [OK] on some sizes, particularly around selected thrds/blk
- Don't need full results, just confirmation that all are [OK] & no Opt1 'worst case' is slower than stock
- Intend to integrate FFTs next, so this is a critical sanity check.
- with all sizes included it's a longer run, and may require several runs to see if a '[BAD]' will manifest.

Update: Powerspectrum Test #9 (Xmas edition)
- full FFT processing added
- Tightened peak/average tolerances to 0.001%
- worst case Opt1 only
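
A 0.001% tolerance is a relative error bound of 1e-5. The sketch below shows one way such a peak/average check could be written. How the test actually compares values is not stated here, so treat the comparison rule (relative to the reference value) and the name `within_tolerance` as assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical check: is `measured` within tolerance_percent of `reference`?
// The allowed error scales with the magnitude of the reference value.
bool within_tolerance(double reference, double measured,
                      double tolerance_percent = 0.001) {
    assert(tolerance_percent >= 0.0);
    const double allowed = std::fabs(reference) * tolerance_percent / 100.0;
    return std::fabs(measured - reference) <= allowed;
}
```

With a reference of exactly zero this rule only accepts an exact match, so a real check might add a small absolute floor.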

Temporary download location(s):
fast:  http://www.arkayn.us/seti/PowerSpectrumTest9.7z
slow: ftp://temp:temp@sinbadsvn.dyndns.org:31469/Jason_PowerSpectrum_Test/PowerSpectrumTest9.7z


Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- can vary, so may need a few runs, just to check stability of result
- Please use DLLs provided with Test#9

Update: @ALL, Thanks! I'm closing this test for now.  It's been an extremely valuable contribution from you all that has had a huge impact on the pace & quality of our progress (mine in particular).

FYI: Some urgent issues may have come to light from Raistmer's OpenCL development when combined with the refinements here.  Those will need some fairly close attention for a short while, to get some information back to Berkeley, but stay tuned as there are more tests to come   :)

[Locking thread, Please stay tuned for further Unit Tests!]
« Last Edit: 04 Jan 2011, 03:44:33 pm by Jason G »

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: Latest nVIDIA_driver and CUDA_Version
« Reply #1 on: 18 Nov 2010, 08:43:49 am »
How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.
Please stop using this 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: Latest nVIDIA_driver and CUDA_Version
« Reply #2 on: 18 Nov 2010, 08:44:56 am »
What driver ?

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: Latest nVIDIA_driver and CUDA_Version
« Reply #3 on: 18 Nov 2010, 08:48:29 am »
Quote
How do I run this?

I get "FAILURE in c:/[Projects]/PowerSpectrum/main.cpp, line 126" at the moment.

updating from 258.96 to 260.99 solved that hiccup for me
The road to hell is paved with good intentions

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: Latest nVIDIA_driver and CUDA_Version
« Reply #4 on: 18 Nov 2010, 08:52:28 am »
And I've just checked that 260.89 is good enough, too.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: PowerSpectrum Unit Test
« Reply #5 on: 18 Nov 2010, 09:07:18 am »
As a side effect, I'm accumulating a good collection of data that tells me a lot about the different GPU memory subsystems, on different generations (Powerspectrum is a 'memory bound' computation).  Will split this off into its own thread a bit later [Done]

[Edit] In our own thread now, feel free to post results here,  attach, or PM.  I'm getting some very handy information to use this weekend, toward optimisation strategies.

Will try and make some sort of table up once I make some sense of the data.
« Last Edit: 18 Nov 2010, 10:07:41 am by Jason G »

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: [Split] PowerSpectrum Unit Test
« Reply #6 on: 18 Nov 2010, 10:29:01 am »
Device: GeForce GT 240, 1340 MHz clock, 512 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       10.1 GFlops    4.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:        8.6 GFlops    3.4 GB/s 1183.3ulps
     64 threads:       10.1 GFlops    4.1 GB/s 1183.3ulps
    128 threads:       10.1 GFlops    4.0 GB/s 1183.3ulps
    256 threads:       10.1 GFlops    4.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        3.4 GFlops    1.3 GB/s 1183.3ulps
     64 threads:        4.5 GFlops    1.8 GB/s 1183.3ulps
    128 threads:        4.5 GFlops    1.8 GB/s 1183.3ulps
    256 threads:        4.4 GFlops    1.8 GB/s 1183.3ulps



Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: [Split] PowerSpectrum Unit Test
« Reply #7 on: 18 Nov 2010, 10:36:49 am »
A couple more datapoints, from Windows 7. The 'AMP' in the card model name says it's a factory overclock version.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #8 on: 18 Nov 2010, 10:49:27 am »
Thanks both.  Those are the 'stubborn' cards  ;)

Offline Frizz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 541
Re: [Split] PowerSpectrum Unit Test
« Reply #9 on: 18 Nov 2010, 11:23:24 am »
I can't speak for Richard but my card is as stubborn as I am  ;D

Offline glennaxl

  • Knight o' The Realm
  • **
  • Posts: 86
Re: [Split] PowerSpectrum Unit Test
« Reply #10 on: 18 Nov 2010, 11:36:31 am »
**********
-device 0
**********
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       18.6 GFlops    7.4 GB/s 1183.3ulps
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps
    128 threads:       26.7 GFlops   10.7 GB/s 1183.3ulps
    256 threads:       26.7 GFlops   10.7 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.3 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.2 GFlops    2.9 GB/s 1183.3ulps
    128 threads:       10.6 GFlops    4.2 GB/s 1183.3ulps
    256 threads:       10.7 GFlops    4.3 GB/s 1183.3ulps


**********
-device 1
**********
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.8 GFlops   10.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       17.9 GFlops    7.2 GB/s 1183.3ulps
     64 threads:       26.0 GFlops   10.4 GB/s 1183.3ulps
    128 threads:       26.1 GFlops   10.4 GB/s 1183.3ulps
    256 threads:       24.6 GFlops    9.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.2 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.1 GFlops    2.8 GB/s 1183.3ulps
    128 threads:       10.3 GFlops    4.1 GB/s 1183.3ulps
    256 threads:       10.6 GFlops    4.2 GB/s 1183.3ulps


**********
-device 2
**********
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.4 GFlops   10.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 1:
     32 threads:       18.7 GFlops    7.5 GB/s 1183.3ulps
     64 threads:       25.6 GFlops   10.2 GB/s 1183.3ulps
    128 threads:       25.9 GFlops   10.4 GB/s 1183.3ulps
    256 threads:       25.9 GFlops   10.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 2:
     32 threads:        5.2 GFlops    2.1 GB/s 1183.3ulps
     64 threads:        7.0 GFlops    2.8 GB/s 1183.3ulps
    128 threads:       10.3 GFlops    4.1 GB/s 1183.3ulps
    256 threads:       10.4 GFlops    4.1 GB/s 1183.3ulps
« Last Edit: 18 Nov 2010, 11:41:58 am by glennaxl »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #11 on: 18 Nov 2010, 11:39:16 am »
Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480.  That's something to investigate.  Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]

Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       29.1 GFlops   11.6 GB/s   0.0ulps


GetPowerSpectrum() mod 1:
     32 threads:       17.6 GFlops    7.1 GB/s   0.0ulps
     64 threads:       28.9 GFlops   11.6 GB/s   0.0ulps
    128 threads:       40.5 GFlops   16.2 GB/s   0.0ulps
    256 threads:       44.0 GFlops   17.6 GB/s   0.0ulps


GetPowerSpectrum() mod 2:
     32 threads:       19.3 GFlops    7.7 GB/s   0.0ulps
     64 threads:       38.0 GFlops   15.2 GB/s   0.0ulps
    128 threads:       61.1 GFlops   24.5 GB/s   0.0ulps
    256 threads:       61.4 GFlops   24.6 GB/s   0.0ulps
« Last Edit: 18 Nov 2010, 11:42:07 am by Jason G »

Offline glennaxl

  • Knight o' The Realm
  • **
  • Posts: 86
Re: [Split] PowerSpectrum Unit Test
« Reply #12 on: 18 Nov 2010, 11:44:39 am »
Quote
Hmm, I expected GTX 295 results, on each GPU to be closer to half GTX 480.  That's something to investigate.  Maybe the memory subsystem on those 295s isn't as good, or requires some different handling. [Edit: actually I suppose with stock code it is better than half a 480]

My bad. FAH was running in background. Edited my post with new results.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #13 on: 18 Nov 2010, 11:50:00 am »
Quote
My bad. FAH was running in background. Edited my post with new results.

Ahh, cheers & LoL... I'm wondering why mod2 doesn't appear to work on those.  ([Later:] ah, probably some shared memory bank conflicts or such, will read into that. )
« Last Edit: 18 Nov 2010, 12:07:46 pm by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #14 on: 18 Nov 2010, 02:07:05 pm »
And on my 465 with 260.99 drivers:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.0 GFlops    6.4 GB/s   0.0ulps


GetPowerSpectrum() mod 1:
     32 threads:        9.8 GFlops    3.9 GB/s   0.0ulps
     64 threads:       15.9 GFlops    6.3 GB/s   0.0ulps
    128 threads:       20.9 GFlops    8.3 GB/s   0.0ulps
    256 threads:       23.1 GFlops    9.2 GB/s   0.0ulps


GetPowerSpectrum() mod 2:
     32 threads:       14.4 GFlops    5.8 GB/s   0.0ulps
     64 threads:       28.4 GFlops   11.4 GB/s   0.0ulps
    128 threads:       33.5 GFlops   13.4 GB/s   0.0ulps
    256 threads:       32.8 GFlops   13.1 GB/s   0.0ulps

 
