[Split] PowerSpectrum Unit Test

Miep:
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum Unit Test #4
Stock GetPowerSpectrum():
64 threads: 4.6 GFlops 1.8 GB/s 1183.3ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
64 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
64 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
128 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
256 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 3.2 GFlops 1.3 GB/s 121.7ulps
64 threads: 4.6 GFlops 1.8 GB/s 121.7ulps
128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.2 GFlops 1.7 GB/s 121.7ulps
512 threads: 3.5 GFlops 1.4 GB/s 121.7ulps
1024 threads: N/A

HTH

Jason G:
Fits with the theories so far :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10 ::)

Josef W. Segur:

--- Quote from: Jason G on 21 Nov 2010, 11:15:39 am ---Fits with the theories so far :D ... and turns out we can multiply the memory throughput ( GB/s ) by 10 ::)

--- End quote ---

And maybe consider whether the kernels might be memory bound on some cards?

--- Quote from: perryjay on 20 Nov 2010, 06:55:14 pm ---Just to add a little bit... I'm running Vista 32 on an E5400 dual 2.7GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840 and memory at 400 to give me 118 GFLOPS peak.

--- End quote ---

Yeah, 1840 shader gives 117.76 GFLOPS per the nVidia formula with 32 CUDA cores (aka shaders). What I find interesting is that they're trying to discourage use of Furmark and such which actually try to achieve the highest possible performance...
Joe
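
The arithmetic behind the 117.76 GFLOPS figure can be sketched as follows. This is only an illustration: the function name `peak_gflops` and the assumption of 2 floating-point operations per core per shader clock (one multiply-add) are not stated in the thread, but they reproduce the number quoted above.

```python
def peak_gflops(shader_mhz, cuda_cores, flops_per_clock=2):
    """Theoretical peak throughput in GFLOPS.

    flops_per_clock=2 is an assumption: one multiply-add per core per
    shader clock, counted as two floating-point operations.
    """
    return shader_mhz * cuda_cores * flops_per_clock / 1000.0


# 9500GT from the quoted post: 32 CUDA cores, shader clock 1840 MHz.
print(peak_gflops(1840, 32))
```

At a stock shader clock the same function gives a proportionally lower figure, since the peak scales linearly with clock and core count.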

M_M:
Actually, Nvidia seems to be differentiating its gaming GeForce and HPC Tesla products even further, by putting a limitation in the GTX580 (and probably future high-end gaming GeForce products) that downclocks the card when its usage reaches a very high level (as in FurMark or OCCT, for example). The reason they give is that games will never put such a high workload on the GPU, and they are probably right. However, some highly optimized real-life CUDA applications could reach it as well - my guess is that Nvidia will respond with "buy a (much more expensive) Tesla if HPC is what you want"... :(

Jason G:

--- Quote from: Josef W. Segur on 21 Nov 2010, 02:42:32 pm ---And maybe consider whether the kernels might be memory bound on some cards?...
--- End quote ---

This one is memory bound on all of them, with only 3 compute instructions, partially fused, in each thread. Getting the right kernel geometry per compute capability does seem to let us push bandwidth above stock though, and it appears the stock code for that kernel was optimised for compute capability 1.0 (reasonable). There are now ways to switch kernels automatically at run time, in the driver, though.
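
A rough sketch of why so few compute instructions make the kernel memory bound. The per-element byte counts are assumptions, not something the thread states: one single-precision complex value read (8 bytes) and one single-precision power value written (4 bytes). The three operations are taken to be two multiplies and one add, with a multiply and the add able to fuse.

```python
FLOPS_PER_ELEMENT = 3        # re*re, im*im, and one add
BYTES_PER_ELEMENT = 8 + 4    # assumed: read one float2, write one float


def power_spectrum(samples):
    """Reference power spectrum: |z|^2 for each complex sample."""
    return [z.real * z.real + z.imag * z.imag for z in samples]


def arithmetic_intensity():
    """Floating-point operations performed per byte moved."""
    return FLOPS_PER_ELEMENT / BYTES_PER_ELEMENT


def implied_bandwidth_gbs(gflops):
    """Memory traffic (GB/s) needed to sustain a given GFlops rate."""
    return gflops / arithmetic_intensity()


print(power_spectrum([3 + 4j, 1 - 1j]))
print(arithmetic_intensity())
print(implied_bandwidth_gbs(4.6))
```

Under these assumptions the kernel does only 0.25 operations per byte, and the 4.6 GFlops reported above for the Quadro FX 570M would imply roughly 18.4 GB/s of traffic. That is about ten times the 1.8 GB/s the test printed, which seems consistent with the remark about multiplying the GB/s figure by 10.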

With a memory-bound computation like this, it does seem logical to increase the compute density, which the first freaky powerspectrum does by folding the power spectrum into the FFT output. Those implementations will need testing & refining for extension to more sizes in the long run.

Next is probably to try to rearrange the FFT & powerspectrum into chunks, in order to better exploit the cache available on Fermi (~768 KB L2), which the FFT -> powerspectrum sequence appears to be thrashing solely due to dataset size. I'm hoping that the concurrent kernels mechanism is intelligent enough to recognise cache-hot data.
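
As a back-of-envelope sketch of the chunking idea, the following estimates how many elements might fit in the L2 at once and walks a dataset in chunks of that size. Everything here is hypothetical: the function names, the 12 bytes per element (8 in, 4 out, ignoring any FFT scratch space or twiddle factors), and the rounding down to a power of two for FFT-friendly sizes. The 768 KB figure is the approximate one given above.

```python
L2_BYTES = 768 * 1024      # approximate Fermi L2 size from the post
BYTES_PER_ELEMENT = 8 + 4  # assumed: complex float in, float out


def max_chunk_elements(cache_bytes=L2_BYTES, bytes_per_element=BYTES_PER_ELEMENT):
    """Largest power-of-two element count whose working set fits in cache."""
    n = cache_bytes // bytes_per_element
    if n < 1:
        raise ValueError("cache too small for a single element")
    return 1 << (n.bit_length() - 1)


def chunk_ranges(total_elements, chunk_elements):
    """Yield (start, end) index pairs covering the dataset chunk by chunk."""
    for start in range(0, total_elements, chunk_elements):
        yield start, min(start + chunk_elements, total_elements)


chunk = max_chunk_elements()
print(chunk)
print(len(list(chunk_ranges(1024 * 1024, chunk))))
```

A real implementation would likely want a smaller chunk than this upper bound, since other data competes for the same cache.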

In either case the next test will probably need some extra compute density, which should see the GFlops rise against a hopefully similar bandwidth figure.

I haven't yet explored whether any processing subsequent to the powerspectrum could also be embedded to further raise the compute density (finding spikes immediately, for example), but it's looking like a possibility. Further on, dealing with individual PulsePoTs for the pulsefinding looks like an option, if the FFT sequence preceding it can be done in suitable blocks.
