
[Split] PowerSpectrum Unit Test


Jason G:
Thanks all for the massive amount of data  ;D  . I'll peruse it to see if anything's amiss, but I think I've found the sweet spot for 'worst case' at the moment, which is the straightforward implementation.  I'm delighted that nothing seems to be broken on any GPU tested so far.  There is a lot of work to do to add the remaining sizes into the test (remaining powers of 2 up to 128k or so, maybe some larger sizes for growing room), then adding FFTs & Findspikes on either side of this pipeline.  Once that's done, it looks like I can stripe the processing to fit Fermi's L2 cache, right through this pipeline, which should speed things up a lot for those cards.
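For readers following along, here is a rough CPU sketch of what a "PowerSpectrum + summax" stage computes. It is not the test's actual code: the function names, the complex-sample input and the per-segment (sum, max, position) output are assumptions made for illustration only.

```python
# Hypothetical CPU reference for a "power spectrum + sum/max" stage.
# Assumption: the input is a flat list of complex samples that is split
# into consecutive segments of fft_len bins each.

def power_spectrum(samples):
    """Return |z|^2 for each complex sample."""
    return [z.real * z.real + z.imag * z.imag for z in samples]

def sum_max_per_segment(power, fft_len):
    """Return (sum, max, index_of_max) for each segment of fft_len bins."""
    results = []
    for start in range(0, len(power), fft_len):
        seg = power[start:start + fft_len]
        peak = max(seg)
        results.append((sum(seg), peak, seg.index(peak)))
    return results

samples = [complex(1, 0), complex(0, 2), complex(3, 4), complex(0, 0),
           complex(1, 1), complex(2, 0), complex(0, 1), complex(1, 2)]
power = power_spectrum(samples)
stats = sum_max_per_segment(power, 4)
print(stats)
```

A GPU version would do the same reduction per segment in parallel; a plain reference like this is mainly useful for checking results.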

@Joe, Thanks!, I keep forgetting it's 31 not 30  ::)  probably would have found it the hard way (again), but the heads up helps.

Jason

glennaxl:
-device 0

--- Code: ---Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    4.5 GFlops   19.6 GB/s
 PS+SuMx(    16) [OK]    5.0 GFlops   20.9 GB/s
 PS+SuMx(    32) [OK]    4.6 GFlops   18.7 GB/s
 PS+SuMx(    64) [OK]    7.0 GFlops   28.4 GB/s


Opt1: 128 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.1   26.7 121.7 [OK]   11.7   51.4 121.7
 PS+SuMx(    16)    7.5   31.2 121.7 [OK]   11.5   48.0 121.7
 PS+SuMx(    32)    8.7   35.6 121.7 [OK]   12.0   48.9 121.7
 PS+SuMx(    64)   10.9   44.1 121.7 [OK]   14.5   58.9 121.7
--- End code ---

-device 1

--- Code: ---Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    4.4 GFlops   19.3 GB/s
 PS+SuMx(    16) [OK]    4.9 GFlops   20.6 GB/s
 PS+SuMx(    32) [OK]    4.5 GFlops   18.5 GB/s
 PS+SuMx(    64) [OK]    6.9 GFlops   27.9 GB/s


Opt1: 128 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    6.0   26.3 121.7 [OK]   11.6   50.8 121.7
 PS+SuMx(    16)    7.3   30.5 121.7 [OK]   11.4   47.7 121.7
 PS+SuMx(    32)    8.6   35.1 121.7 [OK]   11.7   48.1 121.7
 PS+SuMx(    64)   10.7   43.3 121.7 [OK]   14.4   58.2 121.7
--- End code ---

-device 2

--- Code: ---Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    4.3 GFlops   18.7 GB/s
 PS+SuMx(    16) [OK]    4.8 GFlops   19.9 GB/s
 PS+SuMx(    32) [OK]    4.3 GFlops   17.6 GB/s
 PS+SuMx(    64) [OK]    6.6 GFlops   26.8 GB/s


Opt1: 128 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    5.8   25.5 121.7 [OK]   10.9   47.5 121.7
 PS+SuMx(    16)    7.1   29.7 121.7 [OK]   10.6   44.3 121.7
 PS+SuMx(    32)    8.2   33.7 121.7 [OK]   11.0   45.2 121.7
 PS+SuMx(    64)   10.4   42.0 121.7 [OK]   13.5   54.7 121.7
--- End code ---

Jason G:
Aha! We're finding the 2xx series limits at last.  'Best case' is tapering off sooner & is clearly compute bound, while the worst cases show the limit of GDDR3 against Fermi's GDDR5 memory.
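As a quick way to read the numbers, the sketch below computes Opt1-over-stock ratios and the bytes moved per flop from the GTX 295 (-device 0) figures posted earlier in this thread. The figures are copied from that post; the variable names and the choice of ratios are illustrative assumptions, not part of the test program.

```python
# Figures copied from the GTX 295 (-device 0) results in this thread.
# size: (stock GFlops, stock GB/s, Opt1 worst-case GFlops, Opt1 best-case GFlops)
results = {
    8:  (4.5, 19.6, 6.1, 11.7),
    16: (5.0, 20.9, 7.5, 11.5),
    32: (4.6, 18.7, 8.7, 12.0),
    64: (7.0, 28.4, 10.9, 14.5),
}

speedups = {}
for size, (stock_gf, stock_gbs, worst_gf, best_gf) in results.items():
    worst_speedup = worst_gf / stock_gf      # Opt1 worst case vs stock
    best_speedup = best_gf / stock_gf        # Opt1 best case vs stock
    bytes_per_flop = stock_gbs / stock_gf    # memory traffic per flop
    speedups[size] = (worst_speedup, best_speedup, bytes_per_flop)
    print(f"{size:3d}: worst x{worst_speedup:.2f}, "
          f"best x{best_speedup:.2f}, {bytes_per_flop:.2f} bytes/flop")
```

On these figures the best-case gain over stock is roughly 2.6x at size 8 and roughly 2.1x at size 64, which is one way to see the best case flattening out on this card.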

Fermi best cases still appear to be limited by the memory subsystem, so down the road I'll be striping (streaming) this pipeline to fit in those cache levels.  That should lift the apparent ~20 GFlops limit a bit on Fermis.  Unfortunately the 2xx cards don't have the cache levels, so we might be reaching a limit with those in some respects.
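To illustrate the striping idea in the abstract, here is a minimal CPU-side sketch that pushes one cache-sized stripe at a time through both stages rather than running each stage over the whole buffer. The stage layout, the function names and the 8-bytes-per-sample figure are assumptions; 768 KB is the L2 size of a GF100-class Fermi such as the GTX 480. A real budget would also have to leave room for intermediate results, which this sketch ignores.

```python
# Illustrative only: stage names and the stripe sizing rule are assumptions.
L2_BYTES = 768 * 1024      # L2 size of a GF100-class Fermi (e.g. GTX 480)
BYTES_PER_SAMPLE = 8       # assumed: one complex value as two 32-bit floats

def stripe_bounds(total, fft_len, budget_bytes=L2_BYTES):
    """Yield (start, stop) ranges of whole segments that fit the budget."""
    whole_segments = (budget_bytes // BYTES_PER_SAMPLE) // fft_len * fft_len
    per_stripe = max(fft_len, whole_segments)
    for start in range(0, total, per_stripe):
        yield start, min(start + per_stripe, total)

def process_striped(samples, fft_len, budget_bytes=L2_BYTES):
    """Run power spectrum then per-segment sum/max, one stripe at a time."""
    out = []
    for start, stop in stripe_bounds(len(samples), fft_len, budget_bytes):
        # Stage 1: power spectrum for this stripe only.
        power = [z.real ** 2 + z.imag ** 2 for z in samples[start:stop]]
        # Stage 2: reduce each segment while the stripe is still "hot".
        for s in range(0, len(power), fft_len):
            seg = power[s:s + fft_len]
            out.append((sum(seg), max(seg)))
    return out

samples = [complex(i, -i) for i in range(64)]
small = process_striped(samples, 8, budget_bytes=128)   # four stripes of 16
whole = process_striped(samples, 8)                     # one stripe
print(small == whole)
```

The point of the design is that each stripe's intermediate data is consumed before the next stripe is loaded, so the results match the unstriped run while the working set stays small.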

@glennaxl: could you confirm that the 200 series cards are reaching near ~100% GPU utilisation during the Opt1 tests (higher than the stock portion) ?  I can lengthen the test sequence if needed.

[A bit later:]  Extending the tests from 0.5 to 5 seconds allowed me to see what the 480 is doing as a cross check.  Looks like the Opt1 best cases are reaching ~100% utilisation, and the Opt1 worst cases are bandwidth limited, all as expected, no surprises yet.

[Still later:] I've added the extended PowerSpectrumTest7 to the first post.  I don't need data for the extended test (results are more or less the same), but I'm providing it for those who want to see GPU utilisation differences between the test phases on their cards, like the attached image.

Moving on to larger sizes & FFT integration, after some beer   ;)

glennaxl:

--- Quote from: Jason G on 22 Dec 2010, 01:26:47 am ---@glennaxl: could you confirm that the 200 series cards are reaching near ~100% GPU utilisation during the Opt1 tests (higher than the stock portion) ?  I can lengthen the test sequence if needed.
--- End quote ---

Yes, Opt1 spikes to 99%.

Jason G:
Cheers!
