Author Topic: [Split] PowerSpectrum Unit Test  (Read 162859 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #105 on: 28 Nov 2010, 01:05:24 am »
WoW,
    Now am completely brainfried & need to design a thorough test for the next part.  I'll take a good break before creating a new one.

I chose one size for the combined powerspectrum+summax optimisation (fftlen=64), and *think* I've got that working.  I want to be very sure though, so I can use the same techniques through templatisation of the kernel.

It turns out using the shared memory for speeding the reduction is STINKING DIFFICULT  :o....I really hope it gets easier with practice  :D
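
For anyone following along who hasn't seen this kind of reduction, here is a rough CPU-side sketch in Python of what a combined powerspectrum+summax pass computes for fftlen=64: the power of each complex point, then a per-FFT average and peak found by pairwise (tree) reduction, which is the access pattern a shared-memory reduction normally follows. This is not the actual kernel; the function names and the input layout are assumptions made purely for illustration.

```python
import random

FFTLEN = 64  # the single size used in this test


def power_spectrum(points):
    """Power of each complex sample: re^2 + im^2."""
    return [z.real * z.real + z.imag * z.imag for z in points]


def tree_sum_max(values):
    """Pairwise reduction, halving the active range each step.

    Mimics a shared-memory reduction: at each step, 'thread' i combines
    element i with element i + stride. Assumes len(values) is a power of two.
    """
    sums = list(values)
    maxs = list(values)
    stride = len(values) // 2
    while stride > 0:
        for i in range(stride):
            sums[i] += sums[i + stride]
            maxs[i] = max(maxs[i], maxs[i + stride])
        stride //= 2
    return sums[0], maxs[0]


def power_spectrum_summax(points, fftlen=FFTLEN):
    """Return one (average, peak) pair per FFT of length fftlen."""
    power = power_spectrum(points)
    results = []
    for start in range(0, len(power), fftlen):
        total, peak = tree_sum_max(power[start:start + fftlen])
        results.append((total / fftlen, peak))
    return results


if __name__ == "__main__":
    # Hypothetical input: four FFTs' worth of random complex points.
    rng = random.Random(0)
    data = [complex(rng.gauss(0, 0.5), rng.gauss(0, 0.5))
            for _ in range(4 * FFTLEN)]
    for n, (avg, peak) in enumerate(power_spectrum_summax(data)):
        print(f"fft[{n}] avg {avg:.6f} Pk {peak:.6f}")
```

On a GPU the inner `for i in range(stride)` loop is what runs in parallel across threads, with a synchronisation between strides; getting that synchronisation and the indexing right is where the difficulty tends to be.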

"Tentatively looking OK" result for some reductions... but the speed looks too fast to be 100% correct right through, hence the need for extreme caution & a break from coding for a little while (Stock = yellow, Opt1 = green, though the Opt1 speed is suspect):

Quote
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.0 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
      fft[0] avg 0.650947 Pk 3.050944 OK
      fft[1] avg 0.624826 Pk 2.995684 OK
      fft[2] avg 0.620340 Pk 2.418427 OK
      fft[3] avg 0.779598 Pk 2.243930 OK

PS+SuMx(    64)    6.0 GFlops   24.2 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.1 GFlops  176.6 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax Array Mapped to pinned host memory.
  256 threads, fftlem 64:    33.2 GFlops  134.5 GB/s 121.7ulps
       fft[0] avg 0.650947 Pk 3.050944
       fft[1] avg 0.624826 Pk 2.995684
       fft[2] avg 0.620340 Pk 2.418427
       fft[3] avg 0.779598 Pk 2.243929


I'll post a thorough updated test when I'm a bit more confident of the result, but prior to templating the other sizes.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #106 on: 29 Nov 2010, 10:02:05 am »
Managed to slow it down some (by processing properly  ;)), but tests out OK here (so far):

First post updated (particularly looking for which cards show any net gain, and which none, in worst & best cases):
Quote
[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64)  1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
 - Opt1 best & worst cases likely to occur in real life are tested.  Worst case should indicate anywhere from ~same as stock to ~30% improvement (depending on GPU); best case ~1.3-2x stock throughput (depending on GPU etc).  Worst case results are checked for accuracy & flagged if there's a problem.
 - On Integrated GPUs, use mapped/pinned host memory, so on those  worst case should be ~= best case ( and hopefully some margin better than the stock reduction  :-\)
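
As a rough illustration of the threadcount selection in the first point, the mapping from compute capability to threads per block could be sketched like this in Python. The function name and the (major, minor) tuple form are assumptions for the example; only the three thresholds come from the list above.

```python
def threads_per_block(major, minor):
    """Pick the powerspectrum thread count from the compute capability.

    Thresholds as listed in the post: 1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256.
    """
    cc = (major, minor)
    if cc >= (2, 0):
        return 256
    if cc >= (1, 3):
        return 128
    return 64


if __name__ == "__main__":
    # A few capabilities of the kind that show up in test results.
    for cc in [(1, 0), (1, 1), (1, 3), (2, 0)]:
        print(f"cc{cc[0]}.{cc[1]}: {threads_per_block(*cc)} thrds/block")
```

Comparing tuples keeps the version check correct without converting the capability to a float.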

Example output (important numbers: highlighted, Stock, Opt1 )

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   29.0 GFlops  116.1 GB/s   0.0ulps

 SumMax (    64)    1.8 GFlops    7.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    5.9 GFlops   24.1 GB/s



GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       44.3 GFlops  177.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
        8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        16.1 GFlops   65.2 GB/s 121.7ulps
« Last Edit: 29 Nov 2010, 10:11:11 am by Jason G »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #107 on: 29 Nov 2010, 10:37:07 am »
BTW: Please test on unloaded system (keep forgetting to mention that  ;))

[Edit:]  Attached the wrong file  ::)  Fixing...  Nevermind, was correct file after all
« Last Edit: 29 Nov 2010, 10:51:58 am by Jason G »

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: [Split] PowerSpectrum Unit Test
« Reply #108 on: 29 Nov 2010, 12:18:52 pm »
Testing on my shrubbery. Each file contains Result4 and Result5 (since I seem to have missed a testing cycle). Other machines will follow. Last one.
« Last Edit: 29 Nov 2010, 12:29:49 pm by Richard Haselgrove »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #109 on: 29 Nov 2010, 12:20:12 pm »
Cheers, analysing first one:...

On that 9800GTX+ on Win7 (compute cap 1.1), I make that ~29% worst case, ~63% best case speedup.  Looks like I'm getting the pre-Fermis to budge finally  ::), good (was worried about that).  Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/blk, cc1.1)

analysing second one (9800GT on XP): ...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/block, cc1.1)
worst ~44%, best ~83%. 

analysing third one (GTX 470 on XP):...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 256 thrds/block, cc2.0)
worst ~45%, best ~115%. 

Thanks for the test4 results, they were helpful to doublecheck the threadcount heuristic was wise enough in all three cases.

This particular code portion mostly has low impact, but Raistmer tells me it has the most impact for VHAR.  In any case, it's the compute capability based heuristics & optimisation techniques being used that should hopefully help in more significant areas.  Already starting to get much better armed than a week ago.  Thanks  ;D
« Last Edit: 29 Nov 2010, 12:47:23 pm by Jason G »

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: [Split] PowerSpectrum Unit Test
« Reply #110 on: 29 Nov 2010, 01:43:41 pm »
Here's the results from my 9800GTX+ on Win 7 64bit:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   64.2 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.5 GFlops   18.3 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.3 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.9 GFlops   32.1 GB/s 121.7ulps

and from my 128Mb 8400M GS:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>    1.2 GFlops    4.8 GB/s 1183.3ulps

 SumMax (    64)    0.1 GFlops    0.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        1.2 GFlops    4.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         0.6 GFlops    2.4 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         0.6 GFlops    2.5 GB/s 121.7ulps

Claggy

Edit: Here's the results of my 9800GTX+ on Windows Vista 64bit:

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   64.1 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    5.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.8 GFlops   23.4 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.5 GFlops   30.4 GB/s 121.7ulps

« Last Edit: 05 Dec 2010, 07:11:23 am by Claggy »

Offline PatrickV2

  • Knight o' The Round Table
  • ***
  • Posts: 139
Re: [Split] PowerSpectrum Unit Test
« Reply #111 on: 29 Nov 2010, 01:44:54 pm »
Ran it on my rig (Q6600/8GB/8800GTX/Win7-64), results:


Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   18.1 GFlops   72.4 GB/s 1183.3ulps

 SumMax (    64)    1.2 GFlops    4.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.9 GFlops   15.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.2 GFlops   72.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.4 GFlops   22.0 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         6.6 GFlops   26.6 GB/s 121.7ulps



Are you also interested in a run under WinXP?

Regards,

Patrick.

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #112 on: 29 Nov 2010, 01:47:57 pm »
Win7 x64 - GTX465:

Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   15.9 GFlops   63.8 GB/s   0.0ulps

 SumMax (    64)    1.3 GFlops    5.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.1 GFlops   16.6 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.4 GB/s 121.7ulps

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #113 on: 29 Nov 2010, 01:49:17 pm »
....Are you also interested in a run under WinXP? ...

Sure! it'll be interesting to see if I'm closing the gap, or making it wider  ;).

Analysing your first result....

8800GTX
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38%
    best case speedup: ~69%
« Last Edit: 29 Nov 2010, 02:03:48 pm by Jason G »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #114 on: 29 Nov 2010, 01:49:59 pm »
Win7 x64 - GTX465:

Thanks, analysing your result too....

GTX 465
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~46%
    best case speedup: ~112%
« Last Edit: 29 Nov 2010, 02:07:18 pm by Jason G »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #115 on: 29 Nov 2010, 01:51:06 pm »
...and from my 128Mb 8400M GS:

Analysing both ;)


9800GTX+
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~33%
    best case speedup: ~75%

8400M GS
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~50%  <-- nice
    best case speedup:  ~50%   <-- nice
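
For anyone wanting to check these percentages against their own runs, a minimal sketch of the arithmetic in Python: speedup is taken as the Opt1 GFlops over the stock PS+SuMx GFlops, minus one. The function name is made up for the example; the figures are the ones posted above for the 9800GTX+ (Win7) and the 8400M GS. Since the logs only print one decimal place, the results are approximate, especially on the slow card.

```python
def speedup_percent(stock_gflops, opt_gflops):
    """Percentage gain of the optimised path over the stock PS+SuMx figure."""
    return (opt_gflops / stock_gflops - 1.0) * 100.0


if __name__ == "__main__":
    # (stock PS+SuMx, Opt1 worst case, Opt1 best case) in GFlops, from the posted logs
    results = {
        "9800GTX+": (4.5, 6.0, 7.9),
        "8400M GS": (0.4, 0.6, 0.6),
    }
    for card, (stock, worst, best) in results.items():
        print(f"{card}: worst ~{speedup_percent(stock, worst):.0f}%, "
              f"best ~{speedup_percent(stock, best):.0f}%")
```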



« Last Edit: 29 Nov 2010, 01:59:50 pm by Jason G »

Offline PatrickV2

  • Knight o' The Round Table
  • ***
  • Posts: 139
Re: [Split] PowerSpectrum Unit Test
« Reply #116 on: 29 Nov 2010, 02:13:32 pm »
Quote
....Are you also interested in a run under WinXP? ...

Sure! it'll be interesting to see if I'm closing the gap, or making it wider  ;).

Analysing your first result....

8800GTX
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38%
    best case speedup: ~69%

As requested (Q6600/8GB/8800GTX/WinXP32):


Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   18.3 GFlops   73.1 GB/s 1183.3ulps

 SumMax (    64)    1.3 GFlops    5.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.3 GFlops   73.1 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         6.4 GFlops   25.8 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         7.9 GFlops   32.2 GB/s 121.7ulps


Regards, Patrick.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #117 on: 29 Nov 2010, 02:25:06 pm »
As requested (Q6600/8GB/8800GTX/WinXP32):

8800GTX earlier Win7x64 result:
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38% --> 5.4 GFlops
    best case speedup: ~69%  --> 6.6 GFlops

8800GTX XP32 result
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~48%   --> 6.4 GFlops
    best case speedup: ~83%    --> 7.9 GFlops


Tentative conclusion: in both best and worst cases, with that particular card and these specific hard coded kernels (not overly driver/cuda library dependent), XP32 performance is higher by some 18-19%.
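
The cross-OS comparison can be reproduced from the posted numbers with a few lines of Python. This is only a sketch: the dictionary layout and function name are assumptions, the GFlops values are the 8800GTX figures from the two runs above, and because the logs are rounded to one decimal the percentages are only good to about a point either way.

```python
def relative_gain_percent(baseline, other):
    """How much higher 'other' is than 'baseline', in percent."""
    return (other / baseline - 1.0) * 100.0


# 8800GTX throughput in GFlops, as posted for each OS
win7x64 = {"stock": 3.9, "opt1 worst": 5.4, "opt1 best": 6.6}
xp32 = {"stock": 4.3, "opt1 worst": 6.4, "opt1 best": 7.9}

if __name__ == "__main__":
    for case in win7x64:
        gain = relative_gain_percent(win7x64[case], xp32[case])
        print(f"{case}: XP32 higher by ~{gain:.1f}%")
```

With these inputs the Opt1 cases come out between roughly 18% and 20%, and the stock case near 10%.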

That's a lot of difference (more than I expected).  Could you let me know both driver versions involved, whether your win7 has aero active, and any other possible differences besides OS ?

(looks like I might end up widening the gap, rather than narrowing it  ::))
Jason
« Last Edit: 29 Nov 2010, 02:31:50 pm by Jason G »

Offline PatrickV2

  • Knight o' The Round Table
  • ***
  • Posts: 139
Re: [Split] PowerSpectrum Unit Test
« Reply #118 on: 29 Nov 2010, 02:34:11 pm »
Quote
As requested (Q6600/8GB/8800GTX/WinXP32):

8800GTX earlier Win7x64 result:
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~38% --> 5.4 GFlops
    best case speedup: ~69%  --> 6.6 GFlops

8800GTX XP32 result
    Average, peak calcs, thread-count heuristic: OK
    worst case speedup: ~48%   --> 6.4 GFlops
    best case speedup: ~83%    --> 7.9 GFlops


Tentative conclusion: in both best and worst cases, with that particular card and these specific hard coded kernels (not overly driver/cuda library dependent), XP32 performance is higher by some 18-19%.

That's a lot of difference (more than I expected).  Could you let me know both driver versions involved, whether your win7 has aero active, and any other possible differences besides OS ?

Jason

Ah, both OSes have the 260.99 driver installed. Aero was active on Win7-64. There was also a VMWare virtual machine idling on the Win7-machine.

Since I suppose you'd like me to re-run the test on the Win7 machine without Aero and without the VM active, I did :):


Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   18.1 GFlops   72.4 GB/s 1183.3ulps

 SumMax (    64)    1.2 GFlops    4.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.9 GFlops   15.6 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       18.2 GFlops   72.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
   64 threads, fftlen 64: (worst case: full summax copy)
         5.4 GFlops   21.9 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         6.6 GFlops   26.6 GB/s 121.7ulps


Hope this provides some insight.

Regards, Patrick.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #119 on: 29 Nov 2010, 02:43:24 pm »
Hope this provides some insight.

Thanks it does  :).  Neither aero nor the idling VM appear to have noticeably altered the performance numbers there... so us Win7-adopters  appear to be paying the price for our shiny new WDDM driver model  ;).

The stock code numbers are interesting too.  XP32 @ 4.3GFlops, and Win7x64 @ 3.9-4.1 GFlops ..... looks like the more familiar reported ~10% advantage to XP we've heard about before.

Nice that my tweaking works even faster on XP, but I'm starting to hope MS include some sort of video subsystem fixes in SP1 for Win7x64  :D

[Edit:] Later in the week, I'll look into whether the 64-bitness of the OS is a factor now, though it hasn't proven significant before.  The WoW64 layer could possibly be slowing things down there somehow, but it's best to know for sure.
« Last Edit: 29 Nov 2010, 02:55:19 pm by Jason G »

 
