1) register pressure
2) local (shared in NV terms) memory amount
3) depth of the call stack
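A quick way to see where a particular kernel stands on those three is cudaFuncGetAttributes(), which reports registers per thread, static shared memory per block, and local memory per thread (which roughly covers register spills and the per-thread stack). The kernel below is just a placeholder for illustration, not the actual GetPowerSpectrum source; compiling with nvcc --ptxas-options=-v prints the same numbers at build time.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: |X|^2 per point (float2 in, float out), for illustration only.
__global__ void GetPowerSpectrum_kernel(const float2* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i].x * in[i].x + in[i].y * in[i].y;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, GetPowerSpectrum_kernel);
    printf("registers per thread : %d\n",        attr.numRegs);
    printf("shared mem per block : %zu bytes\n", attr.sharedSizeBytes);
    printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
    return 0;
}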
192 threads: 44.1 GFlops 17.6 GB/s 121.7ulps
The GB calculation was dividing by 10e9 instead of 1e9, so the reported bandwidth was a factor of 10 too low:

float GB = ((n * sizeof(float2)) + ( n*sizeof(float) ))/10e9;

should be

float GB = ((n * sizeof(float2)) + ( n*sizeof(float) ))/1e9;
256 threads: 44.2 GFlops 176.8 GB/s 121.7ulps
Fits with the theories so far ... and it turns out we can multiply the earlier memory throughput figures ( GB/s ) by 10.
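For completeness, this is roughly how such a GB/s figure is produced: time the kernel with CUDA events, count the bytes it reads (n float2) and writes (n float), and divide by the elapsed time. This is only a sketch reusing the placeholder GetPowerSpectrum_kernel from above, with the corrected 1e9 divisor; measure() and the launch configuration are illustrative, not the benchmark's actual code.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void GetPowerSpectrum_kernel(const float2* in, float* out, int n);  // as sketched earlier

void measure(const float2* d_in, float* d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    GetPowerSpectrum_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);                          // milliseconds

    float GB = ((n * sizeof(float2)) + (n * sizeof(float))) / 1e9f;  // bytes moved, in GB
    printf("%.1f GB/s\n", GB / (ms / 1000.0f));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}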
Just to add a little bit... I'm running Vista 32 on an E5400 dual-core at 2.7 GHz. My 9500GT has driver 260.99 and is slightly overclocked at core 723 / shader 1840 / memory 400, which gives me 118 GFLOPS peak.
And maybe consider whether the kernels might be memory bound on some cards?...
(normalization requires mean computation, i.e., again, access to the whole PoT array).
summax uses thread cooperation/synching and barriers. [I just want to say that by merging such a kernel with the power spectrum one, you will bind N points to a thread instead of 1 point, where N>>1, or you will have too many reduce steps.]
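For what it's worth, the cooperative pattern being described looks roughly like the sketch below: each thread first folds several PoT points into a private partial sum and max (the "N points per thread" part), and the block then reduces those partials through shared memory and __syncthreads() barriers. This is only an illustration of the pattern, not the actual summax kernel; THREADS, POINTS_PER_THREAD and the names are made up.

#include <cfloat>

#define THREADS            256
#define POINTS_PER_THREAD  8      // N points bound to a thread instead of 1

__global__ void summax_sketch(const float* pot, float* block_sum, float* block_max, int n)
{
    __shared__ float s_sum[THREADS];
    __shared__ float s_max[THREADS];

    int tid  = threadIdx.x;
    int base = blockIdx.x * THREADS * POINTS_PER_THREAD + tid;

    // Serial part: each thread accumulates POINTS_PER_THREAD elements privately.
    float sum = 0.0f;
    float mx  = -FLT_MAX;
    for (int k = 0; k < POINTS_PER_THREAD; ++k) {
        int i = base + k * THREADS;          // consecutive threads read consecutive elements
        if (i < n) {
            float v = pot[i];
            sum += v;
            mx   = fmaxf(mx, v);
        }
    }
    s_sum[tid] = sum;
    s_max[tid] = mx;
    __syncthreads();

    // Cooperative part: classic tree reduction over the block, barrier after each step.
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_sum[tid] += s_sum[tid + stride];
            s_max[tid]  = fmaxf(s_max[tid], s_max[tid + stride]);
        }
        __syncthreads();
    }

    if (tid == 0) {
        block_sum[blockIdx.x] = s_sum[0];
        block_max[blockIdx.x] = s_max[0];
    }
}

A second, tiny pass (or a host-side loop over the per-block results) then gives the total sum for the mean and the overall maximum.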
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads) 256 threads: 44.1 GFlops 176.4 GB/s 121.7ulps
GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads) 256 threads: 46.7 GFlops 186.8 GB/s 121.7ulps