[Split] PowerSpectrum Unit Test
Richard Haselgrove:
Preparing the usual three:
9800GTX+, Windows 7/32
--- Code: ---
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 1.7 GFlops 7.4 GB/s
PS+SuMx( 16) [OK] 2.3 GFlops 9.6 GB/s
PS+SuMx( 32) [OK] 2.6 GFlops 10.5 GB/s
PS+SuMx( 64) [OK] 3.9 GFlops 15.9 GB/s
Opt1: 64 thrds/block
                 worst case                  best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     3.5   15.4  121.7  [OK]    7.1   31.3  121.7
PS+SuMx( 16)    4.0   16.5  121.7  [OK]    7.4   31.0  121.7
PS+SuMx( 32)    4.9   20.0  121.7  [OK]    7.2   29.5  121.7
PS+SuMx( 64)    6.3   25.4  121.7  [OK]    8.8   35.5  121.7
--- End code ---
9800GT, Windows XP/32
--- Code: ---
Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 1.7 GFlops 7.2 GB/s
PS+SuMx( 16) [OK] 2.1 GFlops 8.9 GB/s
PS+SuMx( 32) [OK] 2.2 GFlops 9.0 GB/s
PS+SuMx( 64) [OK] 3.6 GFlops 14.5 GB/s
Opt1: 64 thrds/block
                 worst case                  best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     2.5   11.1  121.7  [OK]    5.2   22.9  121.7
PS+SuMx( 16)    3.5   14.7  121.7  [OK]    5.5   23.0  121.7
PS+SuMx( 32)    4.1   16.7  121.7  [OK]    5.2   21.2  121.7
PS+SuMx( 64)    5.4   21.7  121.7  [OK]    6.3   25.7  121.7
--- End code ---
GTX 470, Windows XP/32
--- Code: ---
Device: GeForce GTX 470, 1215 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.3 GFlops 9.9 GB/s
PS+SuMx( 16) [OK] 3.0 GFlops 12.6 GB/s
PS+SuMx( 32) [OK] 3.0 GFlops 12.1 GB/s
PS+SuMx( 64) [OK] 4.8 GFlops 19.3 GB/s
Opt1: 256 thrds/block
                 worst case                  best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     3.7   16.0  121.7  [OK]   15.6   68.4  121.7
PS+SuMx( 16)    5.7   23.9  121.7  [OK]   14.8   61.8  121.7
PS+SuMx( 32)    7.9   32.5  121.7  [OK]   14.3   58.7  121.7
PS+SuMx( 64)    9.9   39.9  121.7  [OK]   14.0   56.7  121.7
--- End code ---
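As a rough reading aid for the GTX 470 figures above, the speedup of Opt1 over stock can be worked out directly from the posted GFlops numbers. This is only a sketch: the dictionary layout and the name `speedups` are made up for illustration and are not part of the test program; the numbers are copied from the output above.

```python
# GFlops figures copied from the GTX 470 output above.
# Layout and names are illustrative only, not part of the test program.
gtx470 = {
    8:  {"stock": 2.3, "worst": 3.7, "best": 15.6},
    16: {"stock": 3.0, "worst": 5.7, "best": 14.8},
    32: {"stock": 3.0, "worst": 7.9, "best": 14.3},
    64: {"stock": 4.8, "worst": 9.9, "best": 14.0},
}

def speedups(row):
    """Return (worst-case, best-case) speedup of Opt1 relative to stock."""
    return row["worst"] / row["stock"], row["best"] / row["stock"]

for fft_len, row in gtx470.items():
    worst, best = speedups(row)
    print(f"PS+SuMx({fft_len:3d}): {worst:.1f}x worst case, {best:.1f}x best case")
```

Even the worst case comes out ahead of stock at every FFT length on this card; the same arithmetic applies to the other cards' tables.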
perryjay:
Here's mine...
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\perry>cd/test
C:\test> powerspectrumtest7.exe
Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 0.7 GFlops 3.2 GB/s
PS+SuMx( 16) [OK] 0.8 GFlops 3.5 GB/s
PS+SuMx( 32) [OK] 0.8 GFlops 3.1 GB/s
PS+SuMx( 64) [OK] 1.1 GFlops 4.4 GB/s
Opt1: 64 thrds/block
                 worst case                  best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     1.2    5.4  121.7  [OK]    1.6    6.8  121.7
PS+SuMx( 16)    0.7    3.0  121.7  [OK]    1.5    6.1  121.7
PS+SuMx( 32)    1.4    5.6  121.7  [OK]    1.6    6.4  121.7
PS+SuMx( 64)    1.7    6.7  121.7  [OK]    1.8    7.5  121.7
C:\test>
Josef W. Segur:
--- Quote from: Jason G on 21 Dec 2010, 12:46:29 pm ---
Best case requires few memory transfers back to the host CPU (only one best spike & no detections) ;)
[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so I wouldn't bother transferring more than the first 30 (... unlike stock ...)
--- End quote ---
--- Quote from: Miep on 21 Dec 2010, 01:04:35 pm ---
Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?
--- End quote ---
The lower graph on http://setiathome.berkeley.edu/sah_glossary/spike_graphs.php is related; note the log scale on the counts. S@H Enhanced does relatively more short FFT lengths, but there's still a very strong bias toward the long FFT lengths for both reportable and "best" spikes. A quick survey of 44 recent results from my P-M showed 35 best_spikes at fft_len 131072, 6 at fft_len 65536, 2 at fft_len 32768, and 1 at fft_len 16384.
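To put that small survey in proportion, the shares can be computed as below. This is just arithmetic on the 44 counts quoted above; the names `best_spike_counts` and `share` are made up for the example.

```python
# Counts from the 44-result survey quoted above (fft_len -> best_spike count).
best_spike_counts = {131072: 35, 65536: 6, 32768: 2, 16384: 1}

total = sum(best_spike_counts.values())  # 44 results in the survey

def share(fft_len):
    """Fraction of surveyed results whose best spike was at this FFT length."""
    return best_spike_counts[fft_len] / total

for fft_len in sorted(best_spike_counts, reverse=True):
    print(f"fft_len {fft_len:6d}: {share(fft_len):.1%}")
```

With only 44 results this is indicative at best, but it shows how heavily the best spikes lean toward the longest FFT length.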
However, the processing order starts at FFT length 8 and works up, so there should be some "worst case" for short FFT lengths during that zero chirp sequence. Subsequent visits to the short FFT lengths are likely to be all "best case". At angle range (AR) 0.42, FFT length 8 is done 13 times, so overall it will be mostly "best case"; at AR 3.0, FFT length 8 is done only once, so the probability of "worst case" will be higher.
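A back-of-envelope way to see the effect, as a sketch under loud assumptions: exactly one pass at FFT length 8 (the zero chirp one) runs at the worst-case rate, every later pass runs at the best-case rate, and all passes do the same amount of work. The function name `blended_gflops` is hypothetical, and the rates are the GTX 470 figures for FFT length 8 posted earlier in the thread.

```python
def blended_gflops(passes, worst_gflops, best_gflops, worst_passes=1):
    """Effective GFlops over `passes` equal-sized passes.

    Assumes `worst_passes` of them run at the worst-case rate and the rest
    at the best-case rate. Time per pass is work / rate, so the blend is a
    harmonic (time-weighted) mean rather than a plain average.
    """
    worst_passes = min(worst_passes, passes)
    best_passes = passes - worst_passes
    time = worst_passes / worst_gflops + best_passes / best_gflops
    return passes / time

# GTX 470 at FFT length 8, figures taken from the results earlier in the thread.
low_ar = blended_gflops(13, worst_gflops=3.7, best_gflops=15.6)   # AR 0.42: 13 passes
high_ar = blended_gflops(1, worst_gflops=3.7, best_gflops=15.6)   # AR 3.0: 1 pass
print(f"AR 0.42: {low_ar:.1f} GFlops, AR 3.0: {high_ar:.1f} GFlops")
```

Under these assumptions the low-AR case lands much closer to the best-case figure, while the single pass at high AR is stuck at the worst-case rate. Real data will fall somewhere else depending on how many passes actually produce detections.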
Note that our test WUs, shortened by lowering chirp limits, will have a higher proportion of the zero chirp worst cases than full-length WUs. In general I think that's good: brief, sloppy tests which slightly underestimate the improvement from optimization are better than those which cause unwarranted enthusiasm. But it would also be possible to create a set of test WUs shortened by adjusting chirp resolution instead, which would give better quick-test timing.
Edit: Jason, result_overflow is triggered by the 31st found signal...
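A minimal sketch of that rule, assuming a limit of 30 reportable signals with overflow triggered by the 31st as described here. `MAX_SIGNALS` and `record_signals` are hypothetical names for illustration, not the actual S@H code.

```python
MAX_SIGNALS = 30  # reportable-signal limit mentioned in the thread

def record_signals(candidates, max_signals=MAX_SIGNALS):
    """Keep at most `max_signals` detections; flag overflow on the next one.

    Returns (kept, overflow). Processing stops as soon as signal number
    max_signals + 1 is found, so nothing beyond the limit is kept.
    """
    kept = []
    for signal in candidates:
        if len(kept) == max_signals:
            return kept, True   # the 31st found signal triggers overflow
        kept.append(signal)
    return kept, False

kept, overflow = record_signals(range(30))
print(len(kept), overflow)
kept, overflow = record_signals(range(31))
print(len(kept), overflow)
```

The point for the transfer logic is that knowing about 30 signals is not quite enough: the code also has to learn whether a 31st exists, even if its details are never copied back.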
Joe
arkayn:
And now the GTX460-768 card,
Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 2.2 GFlops 9.7 GB/s
PS+SuMx( 16) [OK] 2.8 GFlops 11.5 GB/s
PS+SuMx( 32) [OK] 2.1 GFlops 8.7 GB/s
PS+SuMx( 64) [OK] 3.4 GFlops 13.6 GB/s
Opt1: 256 thrds/block
                 worst case                  best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     4.2   18.3  121.7  [OK]   11.1   48.5  121.7
PS+SuMx( 16)    5.8   24.5  121.7  [OK]   10.5   44.1  121.7
PS+SuMx( 32)    7.2   29.7  121.7  [OK]   10.2   41.7  121.7
PS+SuMx( 64)    8.4   33.9  121.7  [OK]   10.2   41.5  121.7
SciManStev:
Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
PS+SuMx( 8) [OK] 5.0 GFlops 22.0 GB/s
PS+SuMx( 16) [OK] 6.0 GFlops 25.3 GB/s
PS+SuMx( 32) [OK] 4.7 GFlops 19.2 GB/s
PS+SuMx( 64) [OK] 7.2 GFlops 29.1 GB/s
Opt1: 256 thrds/block
                 worst case                  best case
              GFlps   GB/s   ulps        GFlps   GB/s   ulps
PS+SuMx( 8)     9.0   39.2  121.7  [OK]   23.0  100.7  121.7
PS+SuMx( 16)   11.7   49.0  121.7  [OK]   21.7   90.8  121.7
PS+SuMx( 32)   13.6   55.8  121.7  [OK]   21.1   86.4  121.7
PS+SuMx( 64)   15.1   61.2  121.7  [OK]   20.7   83.7  121.7
Steve