Author Topic: [Split] PowerSpectrum Unit Test  (Read 164816 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #30 on: 19 Nov 2010, 06:06:32 am »
Found a source of subtle precision variation in the stock code on pre-Fermi cards. I will test a patch on mod1 & mod3 and leave stock alone. I'll probably fix mod2 but leave its precision unfixed as a test (fixing mod2 will make it slower anyway).

[A Bit Later:] Updated first post:

Quote
[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1:  Tuned precision such that non-Fermi & Fermi match, and both exceed stock pre-Fermi precision
Mod2:  Fixed, but sadly is slow now, remains at stock accuracy
Mod3:  As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)

Some variation on #1 and/or #3 may end up contributing to a stock update down the road, because of the (very tiny) precision mismatch in the stock code between CPU, pre-Fermi and Fermi.  The issue could be a contributor to the 'dodgy Gaussians'; time will tell whether that's the case or not.
« Last Edit: 19 Nov 2010, 10:51:49 am by Jason G »
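
For anyone wondering what a precision mismatch measured in ulps looks like, here is a minimal sketch in Python. How the unit test arrives at its own ulps figure is not shown in this thread, so the helper names (`to_f32`, `ordered_bits`, `ulp_distance`) and the sample values are assumptions for illustration only. It counts how many representable single-precision values lie between two finite results.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ordered_bits(x: float) -> int:
    """Map the float32 bit pattern of x onto an integer line that preserves ordering."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats are sign-magnitude; fold them below zero so subtraction works.
    return bits if bits < 0x80000000 else 0x80000000 - bits

def ulp_distance(a: float, b: float) -> int:
    """Count the representable float32 steps between two finite values."""
    return abs(ordered_bits(a) - ordered_bits(b))

# Hypothetical sample: the power of one complex bin, computed two ways.
re, im = to_f32(0.1), to_f32(0.2)
reference = to_f32(re * re + im * im)                  # double arithmetic, rounded once
stepwise = to_f32(to_f32(re * re) + to_f32(im * im))   # rounded after every operation
print(ulp_distance(reference, stepwise))
```

The point is only that the same formula can land on neighbouring float32 values depending on where rounding happens, which is the kind of difference the ulps column reports.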

Offline glennaxl

  • Knight o' The Realm
  • **
  • Posts: 86
Re: [Split] PowerSpectrum Unit Test
« Reply #31 on: 19 Nov 2010, 11:40:16 am »
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.5 GFlops   10.6 GB/s 121.7ulps
    128 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    256 threads:       26.7 GFlops   10.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
     64 threads:        6.3 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    256 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.5 GFlops   10.6 GB/s 121.7ulps
    128 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    256 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    512 threads:       26.6 GFlops   10.7 GB/s 121.7ulps
   1024 threads: N/A

-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.1 GFlops   10.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.4 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.1 GFlops   10.4 GB/s 121.7ulps
    128 threads:       26.3 GFlops   10.5 GB/s 121.7ulps
    256 threads:       26.4 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.1 GFlops    2.4 GB/s 1183.3ulps
     64 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    256 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       25.9 GFlops   10.3 GB/s 121.7ulps
    128 threads:       26.0 GFlops   10.4 GB/s 121.7ulps
    256 threads:       26.4 GFlops   10.6 GB/s 121.7ulps
    512 threads:       26.4 GFlops   10.6 GB/s 121.7ulps
   1024 threads: N/A

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.5 GFlops   10.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
     64 threads:       25.6 GFlops   10.2 GB/s 121.7ulps
    128 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    256 threads:       25.9 GFlops   10.4 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        5.9 GFlops    2.4 GB/s 1183.3ulps
     64 threads:        6.1 GFlops    2.4 GB/s 1183.3ulps
    128 threads:        6.0 GFlops    2.4 GB/s 1183.3ulps
    256 threads:        5.9 GFlops    2.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
     64 threads:       25.6 GFlops   10.2 GB/s 121.7ulps
    128 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    256 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    512 threads:       25.8 GFlops   10.3 GB/s 121.7ulps
   1024 threads: N/A

Offline M_M

  • Squire
  • *
  • Posts: 32
Re: [Split] PowerSpectrum Unit Test
« Reply #32 on: 19 Nov 2010, 11:42:05 am »
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       14.7 GFlops    5.9 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        8.1 GFlops    3.2 GB/s 121.7ulps
     64 threads:       14.5 GFlops    5.8 GB/s 121.7ulps
    128 threads:       22.2 GFlops    8.9 GB/s 121.7ulps
    256 threads:       26.2 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        9.4 GFlops    3.8 GB/s   0.0ulps
     64 threads:       12.2 GFlops    4.9 GB/s   0.0ulps
    128 threads:       14.7 GFlops    5.9 GB/s   0.0ulps
    256 threads:       14.3 GFlops    5.7 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.6 GFlops    3.0 GB/s 121.7ulps
     64 threads:       14.0 GFlops    5.6 GB/s 121.7ulps
    128 threads:       21.5 GFlops    8.6 GB/s 121.7ulps
    256 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    512 threads:       20.0 GFlops    8.0 GB/s 121.7ulps
   1024 threads:       17.5 GFlops    7.0 GB/s 121.7ulps

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #33 on: 19 Nov 2010, 11:56:46 am »
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        4.6 GFlops    1.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    128 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps
     64 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    128 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    256 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        3.0 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.4 GFlops    1.8 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    512 threads:        3.3 GFlops    1.3 GB/s 121.7ulps
   1024 threads: N/A


Oh, look, I'm faster than an ION...
And look how horrible: even mod3 is a whole 5% slower than stock. That's 4 minutes on a 90-minute task, which means I'd diminish throughput by one task per 4-5 days. Simply outrageous.
The road to hell is paved with good intentions

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #34 on: 19 Nov 2010, 12:08:27 pm »
LoL, don't worry, we'll put a crappy stock codepath in just for you  ;)

[Edit:] I'm leaning toward the simpler Mod1 kernel for the rest of us.  On the Fermis at least there is some cache control to play with yet, but then the denser thread count of Mod3, at little cost, may allow more active kernels to fit on the Fermi GPU concurrently... Hmmm....
« Last Edit: 19 Nov 2010, 12:13:49 pm by Jason G »

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: [Split] PowerSpectrum Unit Test
« Reply #35 on: 19 Nov 2010, 01:22:41 pm »
My 9800GTX+'s rerun:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.2 GFlops    6.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
     64 threads:       16.2 GFlops    6.5 GB/s 121.7ulps
    128 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    256 threads:       15.8 GFlops    6.3 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        2.7 GFlops    1.1 GB/s 1183.3ulps
     64 threads:        2.6 GFlops    1.1 GB/s 1183.3ulps
    128 threads:        2.6 GFlops    1.1 GB/s 1183.3ulps
    256 threads:        2.5 GFlops    1.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
     64 threads:       16.2 GFlops    6.5 GB/s 121.7ulps
    128 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    256 threads:       15.9 GFlops    6.3 GB/s 121.7ulps
    512 threads:       15.1 GFlops    6.0 GB/s 121.7ulps
   1024 threads: N/A

Claggy

Edit: and my 128Mb 8400M GS:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
     64 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    128 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    256 threads:        1.2 GFlops    0.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
     64 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
    128 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
    256 threads:        0.2 GFlops    0.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
     64 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    128 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    256 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    512 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
   1024 threads: N/A
« Last Edit: 19 Nov 2010, 03:11:45 pm by Claggy »

Offline M_M

  • Squire
  • *
  • Posts: 32
Re: [Split] PowerSpectrum Unit Test
« Reply #36 on: 19 Nov 2010, 01:30:12 pm »
Very strange that both the 9800GTX+ and GTX260 seem to be faster than the GTX460, since in every game and benchmark the GTX460 wins... something's wrong...

Also, what is the "clock" measurement, as displayed in this test? Is it a shader clock? If it is, why is it showing just 810MHz for me? It should be much higher.

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #37 on: 19 Nov 2010, 01:36:13 pm »
Looks like 256 gets best performance out of fermis with no or only little loss for smaller cards.
The road to hell is paved with good intentions
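
A quick way to check that reading against the numbers posted so far is sketched below. The GFlops figures are the mod 1 rows copied from the posts above (GTX 295 is device 0); the function names `best_threads` and `loss_at` are made up for this illustration.

```python
# mod 1 GFlops by threads per block, copied from the results posted above.
MOD1 = {
    "GeForce GTX 295": {32: 18.5, 64: 26.5, 128: 26.7, 256: 26.7},
    "GeForce GTX 460": {32: 8.1, 64: 14.5, 128: 22.2, 256: 26.2},
    "Quadro FX 570M": {32: 2.9, 64: 4.3, 128: 4.3, 256: 4.3},
    "GeForce 9800 GTX+": {32: 15.2, 64: 16.2, 128: 15.9, 256: 15.8},
}

def best_threads(gflops_by_threads):
    """Return the thread count with the highest GFlops (first one wins on a tie)."""
    return max(gflops_by_threads, key=gflops_by_threads.get)

def loss_at(gflops_by_threads, threads):
    """Fraction of the card's best GFlops given up by forcing this thread count."""
    best = max(gflops_by_threads.values())
    return 1.0 - gflops_by_threads[threads] / best

for card, results in MOD1.items():
    print(f"{card}: best at {best_threads(results)} threads, "
          f"loss at 256 threads = {loss_at(results, 256):.1%}")
```

On these four cards the largest penalty for picking 256 threads is on the 9800 GTX+, where 15.8 against a best of 16.2 GFlops is a loss of roughly 2.5%.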

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #38 on: 19 Nov 2010, 01:46:12 pm »
Quote
Very strange that both the 9800GTX+ and GTX260 seem to be faster than the GTX460, since in every game and benchmark the GTX460 wins... something's wrong...

Also, what is the "clock" measurement, as displayed in this test? Is it a shader clock? If it is, why is it showing just 810MHz for me? It should be much higher.

The clock rate is just what the driver/library reports, which is some fixed number & doesn't measure any hardware (or mean much other than some general indication of the original core spec).

As far as GTX 260 vs 9800GTX+ vs GTX 460 goes, quite right  ;) but not strange at all.  This is almost purely a 'memory bound' kernel rather than a 'compute bound' one.  That makes it depend very little on the processing speed of the GPU, and much more on the specific memory implementation, the clocks & quality of the RAM chips used, and the kernel tweaks I've been trying out.

So for that reason, this should be taken as a comparison of memory bound operations on different cards, and relative memory subsystem performance of the cards with respect to kernel tweaking, not a guide to GPU compute performance .... as there simply is very little to compute in a powerspectrum at all.

The goals at this time involve isolating effective strategies for shovelling data in and out of the GPU, rather than what's going on inside.... That comes later with some more meaty (compute intensive) kernels.

Jason
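
To make "very little to compute" concrete, here is a minimal reference sketch of a power spectrum in Python. It is not the stock or modded kernel; the function name, the sample input and the data layout (two float32 values read and one float32 written per bin) are assumptions for illustration. The unit test's own flop and byte accounting is not shown in this thread, so these counts are not meant to reproduce its GFlops or GB/s columns.

```python
def power_spectrum(bins):
    """Return re*re + im*im for each complex bin given as a (re, im) pair."""
    return [re * re + im * im for re, im in bins]

# Work per bin under the assumed layout.
FLOPS_PER_BIN = 3        # two multiplies and one add
BYTES_PER_BIN = 8 + 4    # read two float32 values, write one float32 value

print(power_spectrum([(3.0, 4.0), (0.0, 1.0)]))
print(FLOPS_PER_BIN / BYTES_PER_BIN, "flops per byte moved")
```

With only a fraction of a flop per byte moved, the arithmetic units spend most of their time waiting on memory, which is why the ranking of cards here follows their memory subsystems rather than their shader counts.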

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #39 on: 19 Nov 2010, 01:51:11 pm »
Quote
Looks like 256 gets best performance out of fermis with no or only little loss for smaller cards.

Yes, it's looking not bad.  I can readily embed a couple of codepaths now, as the drivers have their own built-in dispatch (yay).  To me that means we probably can have our cake & eat it too, but it is just a matter of running around picking up all the crumbs & sticking them together first.

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #40 on: 19 Nov 2010, 01:54:14 pm »
And on the 465:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.0 GFlops    6.4 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        9.8 GFlops    3.9 GB/s 121.7ulps
     64 threads:       15.8 GFlops    6.3 GB/s 121.7ulps
    128 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    256 threads:       23.1 GFlops    9.2 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:       10.8 GFlops    4.3 GB/s   0.0ulps
     64 threads:       13.2 GFlops    5.3 GB/s   0.0ulps
    128 threads:       13.3 GFlops    5.3 GB/s   0.0ulps
    256 threads:       12.1 GFlops    4.9 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        9.4 GFlops    3.7 GB/s 121.7ulps
     64 threads:       15.3 GFlops    6.1 GB/s 121.7ulps
    128 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    256 threads:       20.6 GFlops    8.3 GB/s 121.7ulps
    512 threads:       20.6 GFlops    8.2 GB/s 121.7ulps
   1024 threads:       18.6 GFlops    7.4 GB/s 121.7ulps

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #41 on: 19 Nov 2010, 02:08:43 pm »
Cheers,
   Will have to test the kernel concurrency next (launch 2 - 16 powerspectrums at the same time). No idea how much, if any, overall speed improvement might be achievable with that, but it needs testing.  I'll keep stock & all 3 mods in play for that, since one may 'pack' better than the others (smaller thread counts might pass the larger ones in performance if executing multiple on the same multiprocessor).

Offline M_M

  • Squire
  • *
  • Posts: 32
Re: [Split] PowerSpectrum Unit Test
« Reply #42 on: 19 Nov 2010, 02:09:58 pm »
@Ghost: Didn't you post about 50% higher results for the GTX465 yesterday? Why the difference? Did you change something? Drivers?

Or are there 2 different versions of PowerSpectrum floating around?

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #43 on: 19 Nov 2010, 02:11:28 pm »
Quote
Or are there 2 different versions of PowerSpectrum floating around?

Check the first post, for the updated build & notes.  The Mod2 kernel was doing suspect things, so I've knobbled it (for now).

[I see you used the newer build yourself, so yes, mod2 numbers will be lower than yesterday ]
« Last Edit: 19 Nov 2010, 02:17:08 pm by Jason G »

Offline SciManStev

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 263
Re: [Split] PowerSpectrum Unit Test
« Reply #44 on: 19 Nov 2010, 04:44:41 pm »
Device:  GeForce GTX 480, 810 MHz clock,  1503 MB memory
Compute capability 2.0
Compiled with CUDA 3020
Stock GetPowerSpectrum<>:
     64 threads:       27.7 GFlops  11.1 GB/s    0.0ulps

GetPowerSpectrum<> mod 1: <made Fermi & Pre-Fermi match in accuracy.>
     32 threads:       17.4 GFlops   7.0 GB/s    121.7ulps
     64 threads:       27.5 GFlops  11.0 GB/s    121.7ulps
    128 threads:       36.4 GFlops  14.5 GB/s    121.7ulps
    256 threads:       39.6 GFlops  15.8 GB/s    121.7ulps

GetPowerSpectrum<> mod 2 <fixed, but slow>:
     32 threads:       18.9 GFlops   7.6 GB/s      0.0ulps
     64 threads:       23.1 GFlops   9.2 GB/s      0.0ulps
    128 threads:       24.1 GFlops   9.6 GB/s      0.0ulps
    256 threads:       22.7 GFlops   9.1 GB/s      0.0ulps

GetPowerSpectrum<> mod 3: <As with mod1, +threads & split loads>
     32 threads:       16.7 GFlops   6.7 GB/s    121.7ulps
     64 threads:       26.9 GFlops  10.8 GB/s    121.7ulps
    128 threads:       36.0 GFlops  14.4 GB/s    121.7ulps
    256 threads:       34.9 GFlops  13.9 GB/s    121.7ulps
    512 threads:       34.7 GFlops  13.9 GB/s    121.7ulps
   1024 threads:       33.5 GFlops  13.4 GB/s    121.7ulps


Steve
