Author Topic: [Split] PowerSpectrum Unit Test  (Read 162835 times)

Offline glennaxl

  • Knight o' The Realm
  • **
  • Posts: 86
Re: [Split] PowerSpectrum Unit Test
« Reply #120 on: 29 Nov 2010, 03:35:06 pm »
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   26.5 GFlops  105.8 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.6 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.7 GFlops   26.9 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.7 GFlops  106.9 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   37.0 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.8 GFlops   43.8 GB/s 121.7ulps


-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   25.2 GFlops  100.7 GB/s 1183.3ulps

 SumMax (    64)    2.1 GFlops    8.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.5 GFlops   26.3 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       26.3 GFlops  105.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         9.1 GFlops   36.9 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.4 GFlops   42.1 GB/s 121.7ulps

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   25.3 GFlops  101.2 GB/s 1183.3ulps

 SumMax (    64)    2.0 GFlops    8.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    6.4 GFlops   25.7 GB/s


GetPowerSpectrum() choice for Opt1: 128 thrds/block
    128 threads:       25.9 GFlops  103.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 128 thrds/block
  128 threads, fftlen 64: (worst case: full summax copy)
         8.8 GFlops   35.8 GB/s 121.7ulps
Every ifft average & peak OK
  128 threads, fftlen 64: (best case, nothing to update)
        10.4 GFlops   42.1 GB/s 121.7ulps

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #121 on: 29 Nov 2010, 03:41:52 pm »
Thanks!  compute cap 1.3, so completes the basic heuristic functionality test  :)

GTX 295 (taking lower & upper limits on each GPU as combined range)
    Average, peak calcs, thread-count heuristic: OK (both)
    worst case speedup: ~35%-40%
    best case speedup: ~60%-61%

GTX 260
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~37%
    best case speedup: ~62%

Still some legroom in those 2xx series yet  :)  With the 295s still pulling those kind of relative performance numbers, they'll still challenge the 480s for a while yet IMO.  Running several tasks on the same 480 GPU makes the picture less clear, so as some of the small refinements creep into future releases it'll be something fun to watch at least.
« Last Edit: 29 Nov 2010, 03:56:33 pm by Jason G »

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: [Split] PowerSpectrum Unit Test
« Reply #122 on: 29 Nov 2010, 05:24:58 pm »
Here are the results from my 460

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   12.9 GFlops   51.6 GB/s   0.0ulps

 SumMax (    64)    1.1 GFlops    4.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.4 GFlops   13.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       19.4 GFlops   77.4 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         5.5 GFlops   22.1 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         6.9 GFlops   28.1 GB/s 121.7ulps

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #123 on: 29 Nov 2010, 05:32:58 pm »
Thanks!, cooperating with cc 2.1 as well (after that rocky start  ;) )

GTX 460
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~61%  :o
    best case speedup: ~103%

Looking good.  I haven't worked out the worst case speedup for this kernel on my 480 yet, should be similarish, doing...

Stock  PS+SuMx(    64)    5.9 GFlops   24.0 GB/s
...

Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         8.1 GFlops   32.7 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.0 GB/s 121.7ulps

So
GTX480
worst:   (8.1-5.9)/5.9  ~= 37%
best:  (16.1-5.9)/5.9 ~= 173%

I guess I can live with the smaller improvement in the worst case, if I can manage to get a piece of the best case improvement in some code down the road.
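
For anyone wanting to redo the percentages, here is a minimal sketch of the arithmetic used above. The function name `speedup_percent` is just made up for illustration, and the figures are the PS+SuMx stock and Opt1 GFlops numbers already posted in this thread.

```python
def speedup_percent(stock_gflops, opt_gflops):
    """Relative throughput gain of the optimised kernel over stock, in percent."""
    return (opt_gflops - stock_gflops) / stock_gflops * 100.0

# (stock PS+SuMx, Opt1 worst case, Opt1 best case) in GFlops, fftlen 64,
# copied from the posts above.
results = {
    "GTX 460": (3.4, 5.5, 6.9),
    "GTX 480": (5.9, 8.1, 16.1),
}

for name, (stock, worst, best) in results.items():
    print(f"{name}: worst ~{speedup_percent(stock, worst):.0f}%, "
          f"best ~{speedup_percent(stock, best):.0f}%")
```

The reported GFlops only carry one decimal place, so a point or so of rounding difference against the hand-worked figures is to be expected.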
« Last Edit: 29 Nov 2010, 05:43:46 pm by Jason G »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #124 on: 29 Nov 2010, 06:14:54 pm »
Thanks All!

From here I'll move to complete at least the 'worst case' operation for all sizes.  That will take some time to make a further test confirming which sizes will work, at least for worst case speedups (simple implementation), and which not.  During that period, I'll also be seeking straightforward integration into the X series builds.  It would only amount to a very very small speedup over the whole processing, but will confirm certain techniques (as already mentioned).

The 'best case' optimisation will require extensive work to extract a reasonable portion of, which would be a further small speedup overall that looks like it'll help most GPUs, but Fermi most.  Again those techniques would reflect on other more critical code areas in the long run, so your help here has been appreciated most highly.

I can start to apply some of the methods determined here toward more important areas with a lot more confidence.

Cheers, Jason


Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #125 on: 30 Nov 2010, 10:59:44 am »
For sake of completeness:

Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>    4.5 GFlops   17.8 GB/s 1183.3ulps

 SumMax (    64)    0.2 GFlops    1.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.7 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        4.4 GFlops   17.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         1.3 GFlops    5.3 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         1.4 GFlops    5.8 GB/s 121.7ulps


Some 10% difference between the two bottom ones.
The road to hell is paved with good intentions

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #126 on: 30 Nov 2010, 11:03:32 am »
Cheers,
  Analysing...
   
Average, peak calcs, thread-count heuristic: OK
    worst case speedup: (1.3-0.4)/0.4   ~225%  (3.25x).. Winner!  ;D
    best case speedup:  (1.4-0.4)/0.4    ~250%  (3.5x)


Double checking those ridiculous numbers: (mistakes always possible  ;) )

1.3 GFlops(optimised) / 0.4 GFlops(Stock) definitely = 3.25x  (325% of stock throughput)
The percentage of optimised throughput that is speedup is then 0.9 GFlops / 1.3 GFlops  ~= 69 percent of Opt throughput is Bonus.  Speedup component is 225% of the stock throughput.

#Stock is doing something that GPU doesn't like  :-\
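
The three ways of quoting the same result (ratio, percent speedup, share of optimised throughput that is bonus) are easy to mix up, so here is a small sketch that derives all of them from the two measured numbers. `describe_speedup` is a hypothetical helper name, not anything from the test program.

```python
def describe_speedup(stock_gflops, opt_gflops):
    """Return (ratio, gain as % of stock, bonus as % of optimised throughput)."""
    ratio = opt_gflops / stock_gflops                  # e.g. 3.25x
    gain_pct = (ratio - 1.0) * 100.0                   # speedup relative to stock
    bonus_pct = (opt_gflops - stock_gflops) / opt_gflops * 100.0
    return ratio, gain_pct, bonus_pct

# Quadro FX 570M worst case from the post above: 0.4 GFlops stock, 1.3 GFlops Opt1.
ratio, gain_pct, bonus_pct = describe_speedup(0.4, 1.3)
print(f"{ratio:.2f}x, +{gain_pct:.0f}% over stock, {bonus_pct:.0f}% of Opt is bonus")
```

Note that the bonus share can never reach 100%, while the gain over stock is unbounded, which is why the two percentages look so different for the same run.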
« Last Edit: 30 Nov 2010, 11:35:33 am by Jason G »

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #127 on: 30 Nov 2010, 11:35:27 am »
I reran a few times, getting 0.8-0.9 1.3 1.4-1.5 now.
i.e. higher baseline, optimization values stable. Can do some statistics tomorrow.

edit: that 0.4 seems to have been exceptionally low (and no, I didn't have the GPU crunching by accident :P )
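
A rough sketch of what the rerun numbers imply, assuming the three groups above are stock PS+SuMx (0.8-0.9), Opt1 worst case (1.3) and Opt1 best case (1.4-1.5) in GFlops. The helper name `speedup_range` is made up for this example.

```python
def speedup_range(stock_lo, stock_hi, opt_lo, opt_hi):
    """Smallest and largest percent speedup consistent with the observed ranges."""
    lowest = (opt_lo / stock_hi - 1.0) * 100.0    # slowest opt vs fastest stock
    highest = (opt_hi / stock_lo - 1.0) * 100.0   # fastest opt vs slowest stock
    return lowest, highest

worst = speedup_range(0.8, 0.9, 1.3, 1.3)
best = speedup_range(0.8, 0.9, 1.4, 1.5)
print(f"worst case: {worst[0]:.0f}% to {worst[1]:.0f}%")
print(f"best case:  {best[0]:.0f}% to {best[1]:.0f}%")
```

This only brackets the speedup from the extremes of a handful of runs; it is not a substitute for proper statistics over many runs.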
« Last Edit: 30 Nov 2010, 11:40:11 am by Miep »
The road to hell is paved with good intentions

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #128 on: 30 Nov 2010, 11:37:35 am »
OK, non-critical unless I make computation mistakes  ( I was mostly concerned here to not make code slower...).  Stock / x32f code there is doing something your GPU doesn't like IMO.

Was that Quadro integrated & using some portion of system memory, or does it use dedicated memory?
« Last Edit: 30 Nov 2010, 11:47:31 am by Jason G »

Offline SciManStev

  • Alpha Tester
  • Knight Templar
  • ***
  • Posts: 263
Re: [Split] PowerSpectrum Unit Test
« Reply #129 on: 30 Nov 2010, 06:33:23 pm »

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   28.4 GFlops  113.7 GB/s   0.0ulps

 SumMax (    64)    2.3 GFlops    9.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    7.4 GFlops   29.9 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       41.4 GFlops  165.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
        10.9 GFlops   44.0 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.2 GFlops   65.4 GB/s 121.7ulps


This was much easier than typing it out. Thanks, Richard.

Steve

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #130 on: 30 Nov 2010, 11:26:41 pm »
Thanks Steve!,
   Now your increased Core speed is showing via the improved 'worst case' speedup over mine ( Your 10.9 Vs my 8.1 GFlops )

GTX480 (watercooled)
Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~47%   ( 1.47x )
    best case speedup:   ~119%  ( 2.19x )


Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #131 on: 01 Dec 2010, 02:38:29 pm »

Nice that my tweaking works even faster on XP, but I'm starting to hope MS include some sort of video subsystem fixes in SP1 for Win7x64  :D


Just re-run the Mod5 test on my GTX465 on Win7 x64 SP1 v.721 RC
getting the same results as before:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   63.9 GB/s   0.0ulps

 SumMax (    64)    1.3 GFlops    5.2 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.1 GFlops   16.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.4 GB/s 121.7ulps

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #132 on: 01 Dec 2010, 02:43:41 pm »
Interesting.  In the meantime I also managed to verify that 32 bit versus 64 bit executable yielded no discernible performance difference here ( Since it's GPU hard coded anyway  ;) )

So we're left with WinXP32's simpler driver model with no DirectX 10+ support, or WDDM stuff going on IMO.  I wonder if there's a way to turn off more stuff in Win7x64, video subsystem-wise.

[Edit:] Hmmm....
http://www.anandtech.com/show/3924/nvidia-announces-parallel-nsight-15-cuda-toolkit-32

"Compared to the old XPDM, WDDM was a big step up for GPU usage on Windows, but only for graphical purposes. With Windows’ iron-fisted control over the GPU and a focus on task scheduling for responsiveness over performance, it wasn’t ideal for GPGPU purposes. Case in point, with a WDDM driver NVIDIA was finding it took 30μs for a kernel to be launched, but if they had Windows treat the GPU as a generic device by using a Windows Driver Model (WDM) driver, that launch time dropped to 2.5μs. This coupled with the fact that a WDM driver is necessary to use Tesla cards in a Windows Remote Desktop Protocol environment (as any Folding @Home junkie can tell you, RDP sessions can’t access the GPU through WDDM) resulted in the birth of TCC mode."
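
To get a feel for what those launch times could mean, here is a back-of-envelope sketch. It assumes every launch is fully serialised with the kernel it starts (no overlap or batching, which real drivers do to some extent), and the kernel run times are hypothetical values picked only for illustration. Only the 30μs and 2.5μs figures come from the quoted article.

```python
# Launch latencies quoted in the AnandTech excerpt above (microseconds).
WDDM_LAUNCH_US = 30.0
WDM_LAUNCH_US = 2.5

def launch_overhead_fraction(launch_us, kernel_us):
    """Share of wall time spent launching, if every launch is fully serialised."""
    return launch_us / (launch_us + kernel_us)

# Hypothetical kernel run times, not measured on any card in this thread.
for kernel_us in (10.0, 100.0, 1000.0):
    wddm = launch_overhead_fraction(WDDM_LAUNCH_US, kernel_us)
    wdm = launch_overhead_fraction(WDM_LAUNCH_US, kernel_us)
    print(f"kernel {kernel_us:7.1f} us: WDDM {wddm:5.1%} overhead, WDM {wdm:5.1%}")
```

Under these assumptions the launch cost matters most for many short kernels, which would fit with small-fftlen work being more sensitive to the driver model than long-running kernels.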
« Last Edit: 01 Dec 2010, 02:46:25 pm by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #133 on: 01 Dec 2010, 02:59:02 pm »
Looks good - a massive drop in time to launch kernels, shame it's only available for Tesla GPUs at the moment.
Hopefully NV will release a similar driver for at least the Fermi cards if not all the current cards.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #134 on: 01 Dec 2010, 03:08:18 pm »
Yeah, the OmegaDrivers.net guy looks like he's broke & struggling to work out Win7 drivers too (none for Win7 available when you read further in).

 
