[Split] PowerSpectrum Unit Test

Jason G:
WoW,
Now am completely brainfried & need to design a thorough test for the next part. I'll take a good break before creating a new one.

I chose one size for the combined powerspectrum+summax optimisation (fftlen=64), and *think* I've got that working. I want to be very sure though, so I can use the same techniques through templatisation of the kernel.

It turns out using the shared memory for speeding the reduction is STINKING DIFFICULT :o....I really hope it gets easier with practice :D
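For readers unfamiliar with the pattern, here is a minimal CPU-side sketch of the kind of tree reduction a thread block typically performs in shared memory. It is not the kernel discussed in this thread. It assumes one FFT's output arrives as interleaved (re, im) floats and that the length is a power of two; the names `SumMax` and `powerSumMax` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: total power and peak power of one FFT.
struct SumMax {
    float sum;
    float max;
};

// CPU emulation of a shared-memory style tree reduction.
// Input: interleaved (re, im) pairs for a single FFT (assumed layout).
SumMax powerSumMax(const std::vector<float>& interleaved) {
    assert(interleaved.size() % 2 == 0);
    const std::size_t n = interleaved.size() / 2;
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two length only

    // These two arrays stand in for per-block shared memory.
    std::vector<float> sum(n), peak(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float re = interleaved[2 * i];
        const float im = interleaved[2 * i + 1];
        sum[i] = peak[i] = re * re + im * im;  // power spectrum point
    }

    // Halve the active range each step. On a GPU each inner iteration
    // would be one thread, with a barrier between the outer steps.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            sum[i] += sum[i + stride];
            peak[i] = std::max(peak[i], peak[i + stride]);
        }
    }
    return {sum[0], peak[0]};
}
```

A reference like this is handy for checking a GPU result, though the summation order differs from a plain left-to-right loop, so small floating-point differences are to be expected.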

"Tentatively looking OK" result for some reductions, but the speed looks too fast to be 100% correct right through, hence the need for extreme caution & a break from coding for a little while (Stock = Yellow, Opt1 = Green, though the speed is suspect):

--- Quote ---Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.0 GB/s 0.0ulps

SumMax ( 64) 1.8 GFlops 7.4 GB/s
fft[0] avg 0.650947 Pk 3.050944 OK
fft[1] avg 0.624826 Pk 2.995684 OK
fft[2] avg 0.620340 Pk 2.418427 OK
fft[3] avg 0.779598 Pk 2.243930 OK

PS+SuMx( 64) 6.0 GFlops 24.2 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.1 GFlops 176.6 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax Array Mapped to pinned host memory.
256 threads, fftlem 64: 33.2 GFlops 134.5 GB/s 121.7ulps
fft[0] avg 0.650947 Pk 3.050944
fft[1] avg 0.624826 Pk 2.995684
fft[2] avg 0.620340 Pk 2.418427
fft[3] avg 0.779598 Pk 2.243929

--- End quote ---

I'll post a thorough updated test when I'm a bit more confident of the result, but prior to templating the other sizes.

Jason G:
Managed to slow it down some (by processing properly ;)), but tests out OK here (so far):

First post updated (particularly looking for which cards show any net gain, and which none, in worst & best cases):

--- Quote ---[Updated] to PowerSpectrum Unit Test #5
Single size fftlen (64) 1meg point powerspectrum with summax reduction, to test a number of experimental features (please check):
- Automated detection & handling of threadcount for the powerspectrum, by compute capability
( 1.0-1.2 = 64 thread, 1.3 = 128 thread, 2.0+ = 256)
- Opt1 best & worst cases likely to occur in real life are tested. Worst case should show anywhere from ~same as stock to ~30% improvement (depending on GPU); best case ~1.3-2x stock throughput (depending on GPU etc). Worst case results are checked for accuracy & flagged if there's a problem.
- On Integrated GPUs, use mapped/pinned host memory, so on those worst case should be ~= best case ( and hopefully some margin better than the stock reduction :-\)

Example output (important numbers: highlighted, Stock, Opt1 )

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 29.0 GFlops 116.1 GB/s 0.0ulps

SumMax ( 64) 1.8 GFlops 7.4 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 5.9 GFlops 24.1 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 44.3 GFlops 177.1 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
8.1 GFlops 32.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
16.1 GFlops 65.2 GB/s 121.7ulps
--- End quote ---
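The thread-count rule listed in the quote can be sketched as a small host-side function. This is only an illustration of the stated thresholds, not the test program's actual code; the name `threadsPerBlockFor` is hypothetical. In a real CUDA program the two inputs would normally come from the `major` and `minor` fields of `cudaDeviceProp`.

```cpp
#include <cassert>

// Hypothetical helper: choose threads per block from compute capability,
// using the thresholds quoted in the post:
//   1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256
int threadsPerBlockFor(int major, int minor) {
    assert(major >= 1 && minor >= 0);
    if (major >= 2) return 256;  // Fermi and later
    if (minor >= 3) return 128;  // compute capability 1.3
    return 64;                   // compute capability 1.0 - 1.2
}
```

For example, `threadsPerBlockFor(1, 1)` gives 64 and `threadsPerBlockFor(2, 0)` gives 256, matching the choice shown for the GTX 480 above.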

Jason G:
BTW: Please test on unloaded system (keep forgetting to mention that ;))

[Edit:] Attached the wrong file ::) Fixing... Nevermind, was correct file after all

Richard Haselgrove:
Testing on my shrubbery. Each file contains Result4 and Result5 (since I seem to have missed a testing cycle). Other machines will follow. Last one.

Jason G:
Cheers, analysing first one:...

On that 9800GTX+ on Win7 (compute cap 1.1), I make that ~29% speedup worst case, ~63% best case. Looks like I'm getting the pre-Fermis to budge finally ::), good (was worried about that). Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/blk, cc1.1)

analysing second one (9800GT on XP): ...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 64 thrds/block, cc1.1)
worst ~44%, best ~83%.

analysing third one (GTX 470 on XP):...
Average & peak calculations say 'OK' (Check), and correct threadcount was issued automatically (Check, 256 thrds/block, cc2.0)
worst ~45%, best ~115%.
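The percentages above read as throughput gains over stock. A minimal sketch of that arithmetic follows, assuming the gain is computed as Opt1 throughput divided by the stock combined PS+SuMx throughput, minus one; the function name `speedupPercent` is hypothetical, and the thread does not spell out the exact formula used.

```cpp
#include <cassert>

// Hypothetical helper: percentage throughput gain of an optimised run
// over the stock run (assumed definition: opt / stock - 1).
double speedupPercent(double stockGflops, double optGflops) {
    assert(stockGflops > 0.0);
    return (optGflops / stockGflops - 1.0) * 100.0;
}
```

The figures in Richard's attachments are not reproduced in the thread, but with the GTX 480 example output from the updated first post (stock PS+SuMx 5.9 GFlops, Opt1 worst case 8.1, best case 16.1) this gives roughly 37% for the worst case and roughly 173% for the best case.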

Thanks for the test4 results, they were helpful to doublecheck that the threadcount heuristic was wise enough in all three cases.

This particular code portion has mostly low impact, but Raistmer tells me it has most impact for VHAR. In any case, it's the compute capability based heuristics & the optimisation techniques being used that should hopefully help in more significant areas. Already starting to get much better armed than a week ago. Thanks ;D
