Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
 PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec< 64>  29.0 GFlops  116.0 GB/s   0.0ulps
 SumMax ( 64)   1.8 GFlops    7.4 GB/s
 fft[0] avg 0.650947 Pk 3.050944 OK
 fft[1] avg 0.624826 Pk 2.995684 OK
 fft[2] avg 0.620340 Pk 2.418427 OK
 fft[3] avg 0.779598 Pk 2.243930 OK
 PS+SuMx( 64)   6.0 GFlops   24.2 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
 256 threads:  44.1 GFlops  176.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax Array Mapped to pinned host memory.
 256 threads, fftlen 64:  33.2 GFlops  134.5 GB/s 121.7ulps
 fft[0] avg 0.650947 Pk 3.050944
 fft[1] avg 0.624826 Pk 2.995684
 fft[2] avg 0.620340 Pk 2.418427
 fft[3] avg 0.779598 Pk 2.243929
[Updated] to PowerSpectrum Unit Test #5

Single size fftlen (64), 1 meg point power spectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of thread count for the power spectrum, by compute capability ( 1.0-1.2 = 64 threads, 1.3 = 128 threads, 2.0+ = 256 ); see the sketch after the example output below.
 - Opt1 best & worst cases likely to occur in real life are tested. Worst case should indicate ~same as stock to ~30% improvement (depending on GPU); best case ~1.3-2x stock throughput (depending on GPU etc). Worst case results are checked for accuracy & flagged if there's a problem.
 - On integrated GPUs, use mapped/pinned host memory, so on those the worst case should be ~= the best case (and hopefully some margin better than the stock reduction).

Example output (important numbers: Stock, Opt1):

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
 PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec< 64>  29.0 GFlops  116.1 GB/s   0.0ulps
 SumMax ( 64)   1.8 GFlops    7.4 GB/s
Every ifft average & peak OK
 PS+SuMx( 64)   5.9 GFlops   24.1 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
 256 threads:  44.3 GFlops  177.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
 256 threads, fftlen 64: (worst case: full summax copy)   8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
 256 threads, fftlen 64: (best case, nothing to update)  16.1 GFlops   65.2 GB/s 121.7ulps
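For illustration, here is a minimal CUDA sketch of how a compute-capability based thread-count choice and the integrated-GPU mapped/pinned memory path could look. The names (pick_ps_threads, host_buf, dev_view) and the buffer size are hypothetical, not the actual unit-test code; only the 64/128/256 thread split and the use of mapped host memory on integrated GPUs come from the description above.

#include <cuda_runtime.h>
#include <stdio.h>

// Threads/block for the power spectrum kernel, chosen from compute
// capability: 1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256.
static int pick_ps_threads(const cudaDeviceProp &prop)
{
    if (prop.major >= 2)                    return 256;
    if (prop.major == 1 && prop.minor >= 3) return 128;
    return 64;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads = pick_ps_threads(prop);
    printf("GetPowerSpectrum() choice for Opt1: %d thrds/block\n", threads);

    // On integrated GPUs the device shares system memory, so the summax
    // result array can live in mapped/pinned host memory and the copy back
    // to the host disappears (worst case ~= best case).
    float *host_buf = NULL, *dev_view = NULL;
    size_t bytes = 1024 * 1024 * sizeof(float);  /* illustrative size */

    if (prop.integrated && prop.canMapHostMemory) {
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void **)&host_buf, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&dev_view, host_buf, 0);
        // kernels write through dev_view; the host reads host_buf directly
    } else {
        cudaHostAlloc((void **)&host_buf, bytes, cudaHostAllocDefault);
        // kernels write to a device buffer; results come back with a
        // cudaMemcpy (the "full summax copy" worst case above)
    }

    cudaFreeHost(host_buf);
    return 0;
}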
....Are you also interested in a run under WinXP? ...
Win7 x64 - GTX465:
...and from my 128MB 8400M GS:
Quote from: PatrickV2 on 29 Nov 2010, 01:44:54 pm
....Are you also interested in a run under WinXP? ...

Sure! It'll be interesting to see if I'm closing the gap, or making it wider.

Analysing your first result....
8800GTX
 Average, peak calcs, thread-count heuristic: OK
 worst case speedup: ~38%
 best case speedup: ~69%
As requested (Q6600/8GB/8800GTX/WinXP32):
Quote from: PatrickV2 on 29 Nov 2010, 02:13:32 pm
As requested (Q6600/8GB/8800GTX/WinXP32):

8800GTX earlier Win7 x64 result:
 Average, peak calcs, thread-count heuristic: OK
 worst case speedup: ~38% --> 5.4 GFlops
 best case speedup: ~69% --> 6.6 GFlops

8800GTX XP32 result:
 Average, peak calcs, thread-count heuristic: OK
 worst case speedup: ~48% --> 6.4 GFlops
 best case speedup: ~83% --> 7.9 GFlops

Tentative conclusion: in both best and worst cases, with that particular card and these specific hard-coded kernels (not overly driver/CUDA library dependent), XP32 performance is higher by some 18-19%.

That's a lot of difference (more than I expected). Could you let me know both driver versions involved, whether your Win7 has Aero active, and any other possible differences besides the OS?

Jason
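For reference, the 18-19% figure follows directly from the reported GFlops numbers:

 worst case: 6.4 / 5.4 ≈ 1.185  (XP32 ~18.5% faster)
 best case:  7.9 / 6.6 ≈ 1.197  (XP32 ~19.7% faster)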
Hope this provides some insight.