Update: PowerSpectrum (+summax reduction) Test #7
- completed summax reduction sizes 8 through 64
- refined Opt1 a little; should be a tad faster for size 64, which was in the prior test
- tidied up test result layout
- enabled pinned memory use for Opt1 on all CUDA-capable cards (including cc1.0)
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(  8) [OK]   2.9 GFlops  12.9 GB/s
 PS+SuMx( 16) [OK]   3.9 GFlops  16.2 GB/s
 PS+SuMx( 32) [OK]   3.9 GFlops  15.8 GB/s
 PS+SuMx( 64) [OK]   6.0 GFlops  24.2 GB/s
Opt1: 256 thrds/block
                    worst case                  best case
               GFlps   GB/s   ulps         GFlps   GB/s   ulps
 PS+SuMx(  8)    4.3   18.6  121.7  [OK]    22.8   99.7  121.7
 PS+SuMx( 16)    6.7   28.1  121.7  [OK]    21.4   89.7  121.7
 PS+SuMx( 32)    9.4   38.6  121.7  [OK]    20.8   85.2  121.7
 PS+SuMx( 64)   11.7   47.4  121.7  [OK]    20.4   82.6  121.7
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(  8) [OK]   2.0 GFlops   8.8 GB/s
 PS+SuMx( 16) [OK]   2.6 GFlops  10.7 GB/s
 PS+SuMx( 32) [OK]   2.8 GFlops  11.5 GB/s
 PS+SuMx( 64) [OK]   4.5 GFlops  18.1 GB/s
Opt1: 64 thrds/block
                    worst case                  best case
               GFlps   GB/s   ulps         GFlps   GB/s   ulps
 PS+SuMx(  8)    2.7   11.8  121.7  [OK]     7.1   31.0  121.7
 PS+SuMx( 16)    4.0   16.5  121.7  [OK]     7.7   32.1  121.7
 PS+SuMx( 32)    4.9   19.9  121.7  [OK]     7.3   29.7  121.7
 PS+SuMx( 64)    6.6   26.7  121.7  [OK]     8.9   35.9  121.7
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(  8) [OK]   0.3 GFlops   1.3 GB/s
 PS+SuMx( 16) [OK]   0.3 GFlops   1.2 GB/s
 PS+SuMx( 32) [OK]   0.2 GFlops   0.9 GB/s
 PS+SuMx( 64) [OK]   0.4 GFlops   1.5 GB/s
Opt1: 64 thrds/block
                    worst case                  best case
               GFlps   GB/s   ulps         GFlps   GB/s   ulps
 PS+SuMx(  8)    0.4    1.9  121.7  [OK]     0.5    2.1  121.7
 PS+SuMx( 16)    0.4    1.8  121.7  [OK]     0.5    1.9  121.7
 PS+SuMx( 32)    0.4    1.7  121.7  [OK]     0.4    1.8  121.7
 PS+SuMx( 64)    0.5    2.1  121.7  [OK]     0.5    2.2  121.7
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(  8) [OK]  0.57 +- 0.048 GFlops  2.49 +- 0.24 GB/s
 PS+SuMx( 16) [OK]  0.57 +- 0.048 GFlops  2.39 +- 0.19 GB/s
 PS+SuMx( 32) [OK]  0.49 +- 0.031 GFlops  2.01 +- 0.11 GB/s
 PS+SuMx( 64) [OK]  0.80 +- 0.105 GFlops  3.20 +- 0.41 GB/s
Opt1: 64 thrds/block
                          worst case                                  best case
               GFlps           GB/s           ulps         GFlps          GB/s           ulps
 PS+SuMx(  8)  0.87 +- 0.048   3.92 +- 0.20   121.7  [OK]  1.21 +- 0.03   5.49 +- 0.03   121.7
 PS+SuMx( 16)  0.89 +- 0.19    3.70 +- 0.78   121.7  [OK]  1.20 +- 0      5.00 +- 0      121.7
 PS+SuMx( 32)  0.97 +- 0.048   3.92 +- 0.19   121.7  [OK]  1.10 +- 0      4.60 +- 0      121.7
 PS+SuMx( 64)  1.24 +- 0.11    5.02 +- 0.42   121.7  [OK]  1.41 +- 0.03   5.85 +- 0.05   121.7
How did you do ten runs, while collecting data, on 'that thing' in that timeframe? Magic? [Oh yeah, I set the timer tolerances to do that, I forgot.]
timer tolerances?
Yeah, faster cards probably do 'a few more' runs within the allocated 0.5 seconds per test
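For anyone following along, here is a rough sketch of what a time-budgeted test loop like that could look like. It is not the actual test harness code: timed_runs, summarize, and the injectable clock parameter are made-up names for illustration, and only the 0.5 second budget comes from the post above. The point is just that the run count falls out of the budget, so a faster card ends up with more samples behind its mean and standard deviation.

```python
import statistics
import time


def timed_runs(kernel, budget_s=0.5, clock=time.perf_counter):
    """Call kernel() repeatedly until budget_s seconds have elapsed.

    Returns the list of per-run durations in seconds. At least one run is
    always made, even if that single run overshoots the budget.
    (Hypothetical helper, not taken from the real unit test.)
    """
    durations = []
    start = clock()
    while True:
        t0 = clock()
        kernel()
        t1 = clock()
        durations.append(t1 - t0)
        if t1 - start >= budget_s:
            return durations


def summarize(durations, flops_per_run):
    """Return (mean, sample standard deviation) of the rate in GFlops."""
    rates = [flops_per_run / d / 1e9 for d in durations]
    mean = statistics.mean(rates)
    # A single run has no spread to report.
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, spread
```

With a fixed budget, a card that finishes a run in 1/16 s gets 8 runs into 0.5 s, while one that needs 1/4 s per run gets only 2, which is the 'a few more runs' effect.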
[BTW:] On Opt1, see the difference in the standard deviations of the best & worst cases? That's memory & bus contention on the worst cases randomising things a bit.
I was wondering more about the apparent lack of variation on the best case. I would have expected a little more fluctuation.
Best case requires few memory transfers back to the host CPU (only one best spike & no detections). [Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible, since we're limited to 30 detections, so Opt1 wouldn't bother transferring more than the first 30 (... unlike stock ...).
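To put numbers on the two endpoints, here is a small back-of-envelope model of how many result records would come back to the host per pass. It is only a sketch of the counting described above, not the kernel or host code: results_to_transfer and its parameters are hypothetical names, and the array size in the usage note is an assumed example value. The only figure taken from the thread is the limit of 30 detections.

```python
MAX_DETECTIONS = 30  # detection limit mentioned in the thread


def results_to_transfer(num_datapoints, fft_len, detections_found):
    """Count result records copied back to the host for one pass.

    Hypothetical model: one best spike is always returned, plus the
    detections, which cannot exceed one per FFT and are capped at
    MAX_DETECTIONS.
    """
    num_ffts = num_datapoints // fft_len
    detections = min(detections_found, num_ffts, MAX_DETECTIONS)
    return 1 + detections
```

For an assumed 1048576 data points at fftlen 8 there are 131072 FFTs, so the uncapped worst case would be 131073 records, but results_to_transfer(1048576, 8, 131072) gives 31. The best case, results_to_transfer(1048576, 8, 0), gives 1.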
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?