Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
 PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec< 64>  29.0 GFlops  116.0 GB/s   0.0ulps
 SumMax ( 64)   1.8 GFlops    7.4 GB/s
 fft[0] avg 0.650947 Pk 3.050944 OK
 fft[1] avg 0.624826 Pk 2.995684 OK
 fft[2] avg 0.620340 Pk 2.418427 OK
 fft[3] avg 0.779598 Pk 2.243930 OK
 PS+SuMx( 64)   6.0 GFlops   24.2 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
 256 threads:  44.1 GFlops  176.6 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax Array Mapped to pinned host memory.
 256 threads, fftlen 64:  33.2 GFlops  134.5 GB/s 121.7ulps
 fft[0] avg 0.650947 Pk 3.050944
 fft[1] avg 0.624826 Pk 2.995684
 fft[2] avg 0.620340 Pk 2.418427
 fft[3] avg 0.779598 Pk 2.243929
[Updated] to PowerSpectrum Unit Test #5

Single size fftlen (64), 1 meg point power spectrum with summax reduction, to test a number of experimental features (please check):
 - Automated detection & handling of thread count for the power spectrum, by compute capability ( 1.0-1.2 = 64 threads, 1.3 = 128 threads, 2.0+ = 256 ); see the sketch after the example output below.
 - Opt1 best & worst cases likely to occur in real life are tested. Worst case should indicate ~same as stock to ~30% improvement (depending on GPU); best case ~1.3-2x stock throughput (depending on GPU etc). Worst case results are checked for accuracy & flagged if there's a problem.
 - On integrated GPUs, use mapped/pinned host memory, so on those the worst case should be ~= the best case (and hopefully some margin better than the stock reduction).

Example output (important numbers: Stock, Opt1):

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
 PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec< 64>  29.0 GFlops  116.1 GB/s   0.0ulps
 SumMax ( 64)   1.8 GFlops    7.4 GB/s
Every ifft average & peak OK
 PS+SuMx( 64)   5.9 GFlops   24.1 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
 256 threads:  44.3 GFlops  177.1 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
 256 threads, fftlen 64: (worst case: full summax copy)   8.1 GFlops   32.8 GB/s 121.7ulps
Every ifft average & peak OK
 256 threads, fftlen 64: (best case, nothing to update)  16.1 GFlops   65.2 GB/s 121.7ulps
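For illustration, here is a minimal CUDA sketch of how a compute-capability based thread-count choice and the integrated-GPU mapped/pinned memory path could look. The names (pick_ps_threads, host_buf, dev_view) and the buffer size are hypothetical, not the actual unit-test code; only the 64/128/256 thread split and the use of mapped host memory on integrated GPUs come from the description above.

#include <cuda_runtime.h>
#include <stdio.h>

// Threads/block for the power spectrum kernel, chosen from compute
// capability: 1.0-1.2 -> 64, 1.3 -> 128, 2.0+ -> 256.
static int pick_ps_threads(const cudaDeviceProp &prop)
{
    if (prop.major >= 2)                    return 256;
    if (prop.major == 1 && prop.minor >= 3) return 128;
    return 64;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads = pick_ps_threads(prop);
    printf("GetPowerSpectrum() choice for Opt1: %d thrds/block\n", threads);

    // On integrated GPUs the device shares system memory, so the summax
    // result array can live in mapped/pinned host memory and the copy back
    // to the host disappears (worst case ~= best case).
    float *host_buf = NULL, *dev_view = NULL;
    size_t bytes = 1024 * 1024 * sizeof(float);  /* illustrative size */

    if (prop.integrated && prop.canMapHostMemory) {
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void **)&host_buf, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&dev_view, host_buf, 0);
        // kernels write through dev_view; the host reads host_buf directly
    } else {
        cudaHostAlloc((void **)&host_buf, bytes, cudaHostAllocDefault);
        // kernels write to a device buffer; results come back with a
        // cudaMemcpy (the "full summax copy" worst case above)
    }

    cudaFreeHost(host_buf);
    return 0;
}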
....Are you also interested in a run under WinXP? ...
Win7 x64 - GTX465:
...and from my 128MB 8400M GS:
Quote from: PatrickV2 on 29 Nov 2010, 01:44:54 pm
....Are you also interested in a run under WinXP? ...

Sure! It'll be interesting to see if I'm closing the gap, or making it wider.

Analysing your first result....
8800GTX
 Average, peak calcs, thread-count heuristic: OK
 worst case speedup: ~38%
 best case speedup: ~69%
As requested (Q6600/8GB/8800GTX/WinXP32):
Quote from: PatrickV2 on 29 Nov 2010, 02:13:32 pm
As requested (Q6600/8GB/8800GTX/WinXP32):

8800GTX earlier Win7 x64 result:
 Average, peak calcs, thread-count heuristic: OK
 worst case speedup: ~38% --> 5.4 GFlops
 best case speedup: ~69% --> 6.6 GFlops

8800GTX XP32 result:
 Average, peak calcs, thread-count heuristic: OK
 worst case speedup: ~48% --> 6.4 GFlops
 best case speedup: ~83% --> 7.9 GFlops

Tentative conclusion: in both best and worst cases, with that particular card and these specific hard-coded kernels (not overly driver/CUDA library dependent), XP32 performance is higher by some 18-19%.

That's a lot of difference (more than I expected). Could you let me know both driver versions involved, whether your Win7 has Aero active, and any other possible differences besides the OS?

Jason
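For reference, the 18-19% figure follows directly from the reported GFlops numbers:

 worst case: 6.4 / 5.4 ≈ 1.185  (XP32 ~18.5% faster)
 best case:  7.9 / 6.6 ≈ 1.197  (XP32 ~19.7% faster)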
Hope this provides some insight.