[Updated] Mod3_UnitTest attached. Changed both mods and added a third:
Mod1: Tuned precision so that non-Fermi and Fermi match, and both exceed stock pre-Fermi precision.
Mod2: Fixed, but sadly it is slow now; accuracy remains at stock.
Mod3: As with Mod1, adding extra threads and split loads (may be suitable for some ranges of cards).
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.

Stock GetPowerSpectrum():
    64 threads: 4.6 GFlops 1.8 GB/s 1183.3ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
    32 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
    64 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
   128 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
   256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
    32 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
    64 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
   128 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
   256 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
    32 threads: 3.0 GFlops 1.2 GB/s 121.7ulps
    64 threads: 4.4 GFlops 1.8 GB/s 121.7ulps
   128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
   256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
   512 threads: 3.3 GFlops 1.3 GB/s 121.7ulps
  1024 threads: N/A
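For anyone wondering what an "ulps" figure like the ones above means: here is a rough sketch of one common way to measure ULP (units in the last place) error between single-precision results and a reference. This is not the unit test's actual code. The function names are made up for illustration, and averaging over all elements is an assumption; the test may well report a maximum or some other statistic.

```python
import struct

def float32_to_ordered_int(x):
    """Map a value's float32 bit pattern to an integer that grows with the value.

    Hypothetical helper: x is first rounded to float32 by struct.pack.
    """
    bits = struct.unpack("<i", struct.pack("<f", x))[0]  # signed 32-bit view
    # float32 is sign-magnitude, so mirror negative values below zero
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)

def ulp_distance(a, b):
    """Number of representable float32 steps between a and b."""
    return abs(float32_to_ordered_int(a) - float32_to_ordered_int(b))

def mean_ulp_error(result, reference):
    """Average ULP distance over paired values (assumed statistic)."""
    total = sum(ulp_distance(r, e) for r, e in zip(result, reference))
    return total / len(reference)

if __name__ == "__main__":
    # Power for a complex sample is re*re + im*im; compare two made-up outputs.
    reference = [1.0, 2.0, 4.0]
    result = [1.0, 2.0 + 2.0 ** -22, 4.0]  # middle value is one float32 step off
    print(mean_ulp_error(result, reference))
```

A lower figure means the kernel's output sits closer, in float32 steps, to the reference values.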
Very strange that both the 9800GTX+ and the GTX260 seem to be faster than the GTX460, since the GTX460 wins in every game and benchmark... something's wrong...
Also, what is the "clock" measurement displayed in this test? Is it the shader clock? If it is, why is it showing just 810MHz for me? It should be much higher.
Looks like 256 threads gets the best performance out of Fermis, with little or no loss on smaller cards.
Or are there 2 different versions of PowerSpectrum floating around?