Stock:
PowerSpectrum< 64 thrd/blk>        29.0 GFlops  115.9 GB/s  0.0ulps
SumMax (     8)                     4.3 GFlops   19.0 GB/s
SumMax (    16)                     3.8 GFlops   16.5 GB/s
SumMax (    32)                     1.8 GFlops    8.0 GB/s
SumMax (    64)                     3.1 GFlops   13.5 GB/s
SumMax (   128)                     4.7 GFlops   20.5 GB/s
SumMax (   256)                     6.3 GFlops   27.6 GB/s
SumMax (   512)                    11.2 GFlops   48.9 GB/s
SumMax (  1024)                    17.2 GFlops   75.1 GB/s
SumMax (  2048)                    20.3 GFlops   88.9 GB/s
SumMax (  4096)                    24.3 GFlops  106.3 GB/s
SumMax (  8192)                    25.2 GFlops  110.2 GB/s
SumMax ( 16384)                    24.8 GFlops  108.7 GB/s
SumMax ( 32768)                    28.3 GFlops  123.8 GB/s
SumMax ( 65536)                    18.4 GFlops   80.4 GB/s
SumMax (131072)                    10.1 GFlops   44.3 GB/s
Powerspectrum + SumMax (     8)    12.0 GFlops   49.1 GB/s
Powerspectrum + SumMax (    16)    10.8 GFlops   44.4 GB/s
Powerspectrum + SumMax (    32)     6.2 GFlops   25.2 GB/s
Powerspectrum + SumMax (    64)     9.3 GFlops   38.3 GB/s
Powerspectrum + SumMax (   128)    12.6 GFlops   51.7 GB/s
Powerspectrum + SumMax (   256)    15.3 GFlops   62.5 GB/s
Powerspectrum + SumMax (   512)    20.8 GFlops   85.1 GB/s
Powerspectrum + SumMax (  1024)    24.8 GFlops  101.5 GB/s
Powerspectrum + SumMax (  2048)    26.3 GFlops  107.5 GB/s
Powerspectrum + SumMax (  4096)    27.7 GFlops  113.5 GB/s
Powerspectrum + SumMax (  8192)    28.0 GFlops  114.6 GB/s
Powerspectrum + SumMax ( 16384)    27.8 GFlops  113.8 GB/s
Powerspectrum + SumMax ( 32768)    28.8 GFlops  117.9 GB/s
Powerspectrum + SumMax ( 65536)    25.4 GFlops  104.0 GB/s
Powerspectrum + SumMax (131072)    19.8 GFlops   81.1 GB/s
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
reference summax[FFT#0](     8) mean - 0.673622, peak - 1.624994
reference summax[FFT#0](    16) mean - 0.705653, peak - 2.213269
reference summax[FFT#0](    32) mean - 0.728661, peak - 2.725552
reference summax[FFT#0](    64) mean - 0.650947, peak - 3.050944
reference summax[FFT#0](   128) mean - 0.637886, peak - 3.113411
reference summax[FFT#0](   256) mean - 0.668928, peak - 2.968936
reference summax[FFT#0](   512) mean - 0.666855, peak - 2.978162
reference summax[FFT#0](  1024) mean - 0.665324, peak - 2.985018
reference summax[FFT#0](  2048) mean - 0.661129, peak - 3.003958
reference summax[FFT#0](  4096) mean - 0.665850, peak - 2.982658
reference summax[FFT#0](  8192) mean - 0.667464, peak - 2.975447
reference summax[FFT#0]( 16384) mean - 0.666575, peak - 2.979414
reference summax[FFT#0]( 32768) mean - 0.665878, peak - 2.982532
reference summax[FFT#0]( 65536) mean - 0.665683, peak - 2.983408
reference summax[FFT#0](131072) mean - 0.665053, peak - 2.992251

PowerSpectrum+summax Unit test
Stock:
PowerSpectrum< 64 thrd/blk>  29.1 GFlops  116.3 GB/s  0.0ulps
SumMax (     8)   0.8 GFlops   3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
SumMax (    16)   1.1 GFlops   4.7 GB/s; fft[0] avg 0.705653 Pk 2.213270
SumMax (    32)   1.1 GFlops   4.7 GB/s; fft[0] avg 0.728661 Pk 2.725552
SumMax (    64)   1.8 GFlops   7.8 GB/s; fft[0] avg 0.650947 Pk 3.050944
SumMax (   128)   2.6 GFlops  11.5 GB/s; fft[0] avg 0.637887 Pk 3.113411
SumMax (   256)   3.5 GFlops  15.2 GB/s; fft[0] avg 0.668928 Pk 2.968936
SumMax (   512)   5.0 GFlops  21.7 GB/s; fft[0] avg 0.666855 Pk 2.978162
SumMax (  1024)   6.1 GFlops  26.7 GB/s; fft[0] avg 0.665324 Pk 2.985018
SumMax (  2048)   6.7 GFlops  29.4 GB/s; fft[0] avg 0.661129 Pk 3.003958
SumMax (  4096)   7.2 GFlops  31.3 GB/s; fft[0] avg 0.665850 Pk 2.982658
SumMax (  8192)   7.3 GFlops  31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
SumMax ( 16384)   7.3 GFlops  31.9 GB/s; fft[0] avg 0.666575 Pk 2.979414
SumMax ( 32768)   7.3 GFlops  32.1 GB/s; fft[0] avg 0.665878 Pk 2.982532
SumMax ( 65536)   6.2 GFlops  27.1 GB/s; fft[0] avg 0.665683 Pk 2.983408
SumMax (131072)   5.1 GFlops  22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
Stock:
...
SumMax (     8)   0.8 GFlops   3.5 GB/s; fft[0] avg 0.673622 Pk 1.624994
...
SumMax (  8192)   7.3 GFlops  31.9 GB/s; fft[0] avg 0.667464 Pk 2.975447
...
SumMax (131072)   5.1 GFlops  22.5 GB/s; fft[0] avg 0.665053 Pk 2.992251
[Next Day:] The stock method is doomed by the memory transfers needed for the summax reduction results. I'll move on to FindSpikes to see whether all that data is really needed.
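To put a rough number on those transfers, here is a small Python sketch of how many result bytes come back per pass when every FFT returns its own sum and max. The total of 1,048,576 points and the 8 bytes per result (two 32-bit floats) are assumptions for illustration only, and `summax_result_bytes` is a hypothetical helper, not something taken from the benchmark.

```python
def summax_result_bytes(total_points, fft_len, bytes_per_result=8):
    """Bytes of per-FFT (sum, max) results copied back for one pass.

    bytes_per_result=8 assumes two 32-bit floats per FFT (an assumption).
    """
    if fft_len <= 0 or total_points % fft_len != 0:
        raise ValueError("fft_len must be positive and divide total_points")
    num_ffts = total_points // fft_len  # one result per FFT
    return num_ffts * bytes_per_result


if __name__ == "__main__":
    TOTAL_POINTS = 1 << 20  # hypothetical work size, not from the logs
    for fft_len in (8, 2048, 131072):
        n_bytes = summax_result_bytes(TOTAL_POINTS, fft_len)
        print(f"fft_len={fft_len:6d}: {n_bytes} result bytes")
```

Under these assumptions the shortest FFTs return as many bytes as a sizeable fraction of the input itself, while the longest return almost nothing, which is why the question of whether all that data is needed matters most at small sizes.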
solely because we have an upper limit of 30 signals...
A flag works better while FFT sizes are small. I set the threshold size at 2048. Above that size it looks better to pass the values themselves rather than a flag; see the OpenCL PC_find_spike* kernels.
But again, it works well only while FFT sizes are small (many short arrays).
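As a host-side illustration of the flag-versus-values idea, here is a minimal Python sketch. It is not the PC_find_spike* kernel code: `find_spikes`, `FLAG_MAX_FFT_LEN` and the "peak greater than threshold times mean" test are hypothetical names and assumptions, and treating 2048 itself as still on the flag side is a guess at what the post means.

```python
FLAG_MAX_FFT_LEN = 2048  # threshold size from the post; inclusive bound is assumed


def find_spikes(power, fft_len, spike_threshold):
    """Summarise a power array that holds back-to-back FFTs of length fft_len.

    Short FFTs: return ("flag", bool), a single yes/no for the whole array.
    Long FFTs:  return ("values", [(mean, peak), ...]), one pair per FFT.
    """
    if fft_len <= 0 or len(power) % fft_len != 0:
        raise ValueError("fft_len must be positive and divide len(power)")
    stats = []
    for start in range(0, len(power), fft_len):
        chunk = power[start:start + fft_len]
        stats.append((sum(chunk) / fft_len, max(chunk)))
    if fft_len <= FLAG_MAX_FFT_LEN:
        # Many short arrays: only report whether anything needs a closer look.
        hit = any(peak > spike_threshold * mean for mean, peak in stats)
        return "flag", hit
    # Few long arrays: the per-FFT results are small enough to pass directly.
    return "values", stats


if __name__ == "__main__":
    quiet = [1.0] * 8
    spiky = [1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1.0]
    print(find_spikes(quiet + spiky, 8, 3.0))
    print(find_spikes([1.0] * 8192, 4096, 3.0))
```

With the flag path the caller gets one value back and only has to do more work when it is set; with the values path it gets one (mean, peak) pair per FFT, which stays cheap because long FFTs mean few pairs.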