Device: GeForce 8800 GTX, 1350 MHz clock, 768 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline) Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    9.3 GFlops   16.4 GB/s ulps(fft 1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   13.6 GFlops   18.5 GB/s ulps(fft 1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   16.0 GFlops   17.8 GB/s ulps(fft 1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   28.3 GFlops   26.8 GB/s ulps(fft 1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   44.4 GFlops   36.5 GB/s ulps(fft 2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   59.2 GFlops   43.1 GB/s ulps(fft 2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   72.6 GFlops   47.4 GB/s ulps(fft 2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   71.7 GFlops   42.5 GB/s ulps(fft 2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   72.1 GFlops   39.1 GB/s ulps(fft 2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   66.5 GFlops   33.3 GB/s ulps(fft 2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   63.3 GFlops   29.4 GB/s ulps(fft 2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   58.6 GFlops   25.3 GB/s ulps(fft 3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   62.9 GFlops   25.5 GB/s ulps(fft 3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   67.2 GFlops   25.6 GB/s ulps(fft 3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   66.0 GFlops   23.7 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   14.3 GFlops   25.2 GB/s ulps(fft 1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   21.2 GFlops   28.9 GB/s ulps(fft 1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   27.5 GFlops   30.7 GB/s ulps(fft 1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   39.1 GFlops   37.0 GB/s ulps(fft 1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   47.4 GFlops   39.0 GB/s ulps(fft 2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   62.5 GFlops   45.5 GB/s ulps(fft 2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   76.0 GFlops   49.7 GB/s ulps(fft 2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   74.1 GFlops   43.9 GB/s ulps(fft 2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   74.2 GFlops   40.3 GB/s ulps(fft 2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   67.3 GFlops   33.7 GB/s ulps(fft 2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   64.7 GFlops   30.0 GB/s ulps(fft 2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   59.8 GFlops   25.9 GB/s ulps(fft 3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   64.3 GFlops   26.0 GB/s ulps(fft 3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   68.6 GFlops   26.1 GB/s ulps(fft 3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   67.5 GFlops   24.3 GB/s ulps(fft 3.6,ps 6590.4) [OK]
Device: GeForce 8800 GTX, 1350 MHz clock, 731 MB memory.
Compute capability 1.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #9 (FFT pipeline) Christmas 2010 edition.
Stock:
 FFT+PS+SM(     8)    8.4 GFlops   14.9 GB/s ulps(fft 1.3,ps 4775.9) [OK]
 FFT+PS+SM(    16)   12.1 GFlops   16.6 GB/s ulps(fft 1.6,ps 4817.4) [OK]
 FFT+PS+SM(    32)   14.6 GFlops   16.3 GB/s ulps(fft 1.6,ps 4628.1) [OK]
 FFT+PS+SM(    64)   25.9 GFlops   24.5 GB/s ulps(fft 1.6,ps 4557.6) [OK]
 FFT+PS+SM(   128)   38.6 GFlops   31.8 GB/s ulps(fft 2.0,ps 4942.0) [OK]
 FFT+PS+SM(   256)   50.3 GFlops   36.6 GB/s ulps(fft 2.0,ps 4967.8) [OK]
 FFT+PS+SM(   512)   61.2 GFlops   40.0 GB/s ulps(fft 2.1,ps 5128.1) [OK]
 FFT+PS+SM(  1024)   61.6 GFlops   36.5 GB/s ulps(fft 2.5,ps 5552.5) [OK]
 FFT+PS+SM(  2048)   62.3 GFlops   33.8 GB/s ulps(fft 2.7,ps 5770.3) [OK]
 FFT+PS+SM(  4096)   57.5 GFlops   28.7 GB/s ulps(fft 2.4,ps 5313.7) [OK]
 FFT+PS+SM(  8192)   56.1 GFlops   26.0 GB/s ulps(fft 2.8,ps 5881.1) [OK]
 FFT+PS+SM( 16384)   52.4 GFlops   22.7 GB/s ulps(fft 3.3,ps 6399.1) [OK]
 FFT+PS+SM( 32768)   55.5 GFlops   22.5 GB/s ulps(fft 3.3,ps 6380.1) [OK]
 FFT+PS+SM( 65536)   59.2 GFlops   22.5 GB/s ulps(fft 3.4,ps 6534.8) [OK]
 FFT+PS+SM(131072)   58.8 GFlops   21.1 GB/s ulps(fft 3.6,ps 6694.2) [OK]
Opt1 (worst case): 64 thrds/block
 FFT+PS+SM(     8)   14.2 GFlops   25.0 GB/s ulps(fft 1.3,ps 4637.5) [OK]
 FFT+PS+SM(    16)   21.0 GFlops   28.6 GB/s ulps(fft 1.6,ps 4589.2) [OK]
 FFT+PS+SM(    32)   27.5 GFlops   30.7 GB/s ulps(fft 1.6,ps 4535.6) [OK]
 FFT+PS+SM(    64)   39.2 GFlops   37.1 GB/s ulps(fft 1.6,ps 4426.7) [OK]
 FFT+PS+SM(   128)   46.8 GFlops   38.5 GB/s ulps(fft 2.0,ps 4818.1) [OK]
 FFT+PS+SM(   256)   61.1 GFlops   44.5 GB/s ulps(fft 2.0,ps 4831.0) [OK]
 FFT+PS+SM(   512)   75.2 GFlops   49.2 GB/s ulps(fft 2.1,ps 4987.2) [OK]
 FFT+PS+SM(  1024)   73.6 GFlops   43.6 GB/s ulps(fft 2.5,ps 5438.0) [OK]
 FFT+PS+SM(  2048)   73.4 GFlops   39.8 GB/s ulps(fft 2.7,ps 5674.7) [OK]
 FFT+PS+SM(  4096)   67.7 GFlops   33.9 GB/s ulps(fft 2.4,ps 5202.4) [OK]
 FFT+PS+SM(  8192)   64.4 GFlops   29.8 GB/s ulps(fft 2.8,ps 5765.4) [OK]
 FFT+PS+SM( 16384)   59.5 GFlops   25.7 GB/s ulps(fft 3.3,ps 6291.8) [OK]
 FFT+PS+SM( 32768)   64.0 GFlops   25.9 GB/s ulps(fft 3.3,ps 6275.5) [OK]
 FFT+PS+SM( 65536)   68.2 GFlops   26.0 GB/s ulps(fft 3.4,ps 6429.1) [OK]
 FFT+PS+SM(131072)   67.1 GFlops   24.1 GB/s ulps(fft 3.6,ps 6590.4) [OK]
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.
Sorry bout that... hope I didn't mess you up too much. Glad it gave you some extra to think about.
Quote from: PatrickV2 on 25 Dec 2010, 05:21:25 am
Ran test #9 on my Q6600/8GB/8800GTX, under both WinXP-32 as well as Win7-64.

Excellent, not broken on the 8800. Last hurdle for that code area cleared & can move on
Opt1 (worst case): 256 thrds/block, 1 x 1048576 element streams
 FFT+PS+SM(     8)   19.2 GFlops   33.8 GB/s ulps(fft 1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   36.8 GFlops   50.3 GB/s ulps(fft 1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   60.7 GFlops   67.8 GB/s ulps(fft 1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)   86.2 GFlops   81.6 GB/s ulps(fft 1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   92.5 GFlops   76.1 GB/s ulps(fft 1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  135.0 GFlops   98.3 GB/s ulps(fft 1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  172.0 GFlops  112.4 GB/s ulps(fft 1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  214.7 GFlops  127.3 GB/s ulps(fft 2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  225.9 GFlops  122.6 GB/s ulps(fft 2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  232.3 GFlops  116.2 GB/s ulps(fft 2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  226.0 GFlops  104.8 GB/s ulps(fft 2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  221.5 GFlops   95.8 GB/s ulps(fft 2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  213.1 GFlops   86.3 GB/s ulps(fft 2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  210.5 GFlops   80.2 GB/s ulps(fft 2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  202.6 GFlops   72.8 GB/s ulps(fft 2.7,ps 5392.8) [OK]
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
 FFT+PS+SM(     8)   26.7 GFlops   47.2 GB/s ulps(fft 1.2,ps 4324.2) [OK]
 FFT+PS+SM(    16)   66.9 GFlops   91.3 GB/s ulps(fft 1.6,ps 4326.2) [OK]
 FFT+PS+SM(    32)   90.9 GFlops  101.5 GB/s ulps(fft 1.3,ps 4003.6) [OK]
 FFT+PS+SM(    64)  105.0 GFlops   99.4 GB/s ulps(fft 1.5,ps 4270.2) [OK]
 FFT+PS+SM(   128)   94.0 GFlops   77.3 GB/s ulps(fft 1.7,ps 4347.9) [OK]
 FFT+PS+SM(   256)  135.9 GFlops   98.9 GB/s ulps(fft 1.7,ps 4261.8) [OK]
 FFT+PS+SM(   512)  167.9 GFlops  109.7 GB/s ulps(fft 1.8,ps 4327.4) [OK]
 FFT+PS+SM(  1024)  198.4 GFlops  117.6 GB/s ulps(fft 2.1,ps 4727.6) [OK]
 FFT+PS+SM(  2048)  209.1 GFlops  113.4 GB/s ulps(fft 2.2,ps 4921.2) [OK]
 FFT+PS+SM(  4096)  209.9 GFlops  105.0 GB/s ulps(fft 2.2,ps 4764.3) [OK]
 FFT+PS+SM(  8192)  204.8 GFlops   95.0 GB/s ulps(fft 2.6,ps 5278.8) [OK]
 FFT+PS+SM( 16384)  205.0 GFlops   88.6 GB/s ulps(fft 2.6,ps 5357.5) [OK]
 FFT+PS+SM( 32768)  187.5 GFlops   75.9 GB/s ulps(fft 2.3,ps 4992.8) [OK]
 FFT+PS+SM( 65536)  195.2 GFlops   74.4 GB/s ulps(fft 2.0,ps 4604.3) [OK]
 FFT+PS+SM(131072)  172.5 GFlops   62.0 GB/s ulps(fft 2.7,ps 5392.8) [OK]
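For anyone wanting to compare the one-stream and two-stream runs above without eyeballing the columns, here is a rough Python sketch. The GFlops figures are copied from the two 256 thrds/block outputs; the function name `crossover_size` and the "single stream wins for this size and every larger one" rule are my own assumptions for illustration, not how the test program itself decides anything.

```python
# GFlops figures copied from the two "Opt1 (worst case): 256 thrds/block" runs above.
SIZES = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
         8192, 16384, 32768, 65536, 131072]
ONE_STREAM = [19.2, 36.8, 60.7, 86.2, 92.5, 135.0, 172.0, 214.7,
              225.9, 232.3, 226.0, 221.5, 213.1, 210.5, 202.6]
TWO_STREAMS = [26.7, 66.9, 90.9, 105.0, 94.0, 135.9, 167.9, 198.4,
               209.1, 209.9, 204.8, 205.0, 187.5, 195.2, 172.5]


def crossover_size(sizes, one, two):
    """Smallest FFT size from which one stream is at least as fast as two
    streams for that size and all larger sizes (hypothetical rule).
    Returns None if two streams win at the largest size."""
    crossover = None
    # Walk from the largest size down and stop at the first two-stream win.
    for size, g1, g2 in reversed(list(zip(sizes, one, two))):
        if g1 >= g2:
            crossover = size
        else:
            break
    return crossover


if __name__ == "__main__":
    print("single stream is faster from size", crossover_size(SIZES, ONE_STREAM, TWO_STREAMS))
```

With these numbers the two-stream run is ahead for sizes 8 through 256 and the single-stream run is ahead from 512 upward. These are single runs, so the small gaps around 128 and 256 may well be within run-to-run noise.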
Update: PowerSpectrum Test #10 (attached)
- summary performance of FFT pipeline improvements against stock, for assessing overall progress
- results can vary, so it may need a few runs, just to check stability of the result
- Please use the DLLs provided with Test #9
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #10 (FFT pipeline throughput)
Stock:
 Processing... Done!
 Compute Thoughput GFlops Avg( 67.27) Peak( 111.28) Min( 9.42) [OK]
 Memory thoughput GB/s Avg( 36.72) Peak( 55.70) Min( 15.41)
Opt1 (worst case): 256 thrds/block, 2 x 524288 element streams
 revert to single stream from size 512
 Processing... Done!
 Compute thoughput [GFlops] - Avg( 84.36, 1.25x) Peak( 131.47, 1.18x) Min( 31.13, 3.30x) [OK]
 Memory thoughput [GB/s] - Avg( 51.22, 1.39x) Peak( 66.16, 1.19x) Min( 34.18, 2.22x)
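If you want to sanity-check the "x" factors in the Test #10 output, they look like plain Opt1/Stock ratios rounded to two decimals. That is my reading of the numbers, not something taken from the test source. A small Python sketch with the values copied from the GTX 460 run above (the dictionary layout and the `speedups` helper are made up for illustration):

```python
# Values copied from the GTX 460 Test #10 output above.
STOCK = {
    "compute": {"avg": 67.27, "peak": 111.28, "min": 9.42},   # GFlops
    "memory":  {"avg": 36.72, "peak": 55.70,  "min": 15.41},  # GB/s
}
OPT1 = {
    "compute": {"avg": 84.36, "peak": 131.47, "min": 31.13},  # GFlops
    "memory":  {"avg": 51.22, "peak": 66.16,  "min": 34.18},  # GB/s
}


def speedups(opt, stock):
    """Ratio opt/stock for every metric and statistic, formatted like '1.25x'.
    Assumes the printed factor is a simple ratio rounded to two decimals."""
    return {
        metric: {stat: f"{opt[metric][stat] / stock[metric][stat]:.2f}x"
                 for stat in stock[metric]}
        for metric in stock
    }


if __name__ == "__main__":
    for metric, stats in speedups(OPT1, STOCK).items():
        print(metric, stats)
```

For this run the computed ratios match the printed ones (1.25x, 1.18x, 3.30x for compute and 1.39x, 1.19x, 2.22x for memory), so the same helper should work for comparing your own runs against stock.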