[Split] PowerSpectrum Unit Test

Richard Haselgrove:
Preparing the usual three:

9800GTX+, Windows 7/32

--- Code: ---Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    1.7 GFlops    7.4 GB/s
 PS+SuMx(    16) [OK]    2.3 GFlops    9.6 GB/s
 PS+SuMx(    32) [OK]    2.6 GFlops   10.5 GB/s
 PS+SuMx(    64) [OK]    3.9 GFlops   15.9 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.5   15.4 121.7 [OK]    7.1   31.3 121.7
 PS+SuMx(    16)    4.0   16.5 121.7 [OK]    7.4   31.0 121.7
 PS+SuMx(    32)    4.9   20.0 121.7 [OK]    7.2   29.5 121.7
 PS+SuMx(    64)    6.3   25.4 121.7 [OK]    8.8   35.5 121.7
--- End code ---

9800GT, Windows XP/32

--- Code: ---Device: GeForce 9800 GT, 1500 MHz clock, 512 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    1.7 GFlops    7.2 GB/s
 PS+SuMx(    16) [OK]    2.1 GFlops    8.9 GB/s
 PS+SuMx(    32) [OK]    2.2 GFlops    9.0 GB/s
 PS+SuMx(    64) [OK]    3.6 GFlops   14.5 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    2.5   11.1 121.7 [OK]    5.2   22.9 121.7
 PS+SuMx(    16)    3.5   14.7 121.7 [OK]    5.5   23.0 121.7
 PS+SuMx(    32)    4.1   16.7 121.7 [OK]    5.2   21.2 121.7
 PS+SuMx(    64)    5.4   21.7 121.7 [OK]    6.3   25.7 121.7
--- End code ---

GTX 470, Windows XP/32

--- Code: ---Device: GeForce GTX 470, 1215 MHz clock, 1280 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.3 GFlops    9.9 GB/s
 PS+SuMx(    16) [OK]    3.0 GFlops   12.6 GB/s
 PS+SuMx(    32) [OK]    3.0 GFlops   12.1 GB/s
 PS+SuMx(    64) [OK]    4.8 GFlops   19.3 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    3.7   16.0 121.7 [OK]   15.6   68.4 121.7
 PS+SuMx(    16)    5.7   23.9 121.7 [OK]   14.8   61.8 121.7
 PS+SuMx(    32)    7.9   32.5 121.7 [OK]   14.3   58.7 121.7
 PS+SuMx(    64)    9.9   39.9 121.7 [OK]   14.0   56.7 121.7
--- End code ---

perryjay:
Here's mine...


Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd/test

C:\test> powerspectrumtest7.exe

Device: GeForce 9500 GT, 1848 MHz clock, 1006 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    0.7 GFlops    3.2 GB/s
 PS+SuMx(    16) [OK]    0.8 GFlops    3.5 GB/s
 PS+SuMx(    32) [OK]    0.8 GFlops    3.1 GB/s
 PS+SuMx(    64) [OK]    1.1 GFlops    4.4 GB/s


Opt1: 64 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    1.2    5.4 121.7 [OK]    1.6    6.8 121.7
 PS+SuMx(    16)    0.7    3.0 121.7 [OK]    1.5    6.1 121.7
 PS+SuMx(    32)    1.4    5.6 121.7 [OK]    1.6    6.4 121.7
 PS+SuMx(    64)    1.7    6.7 121.7 [OK]    1.8    7.5 121.7



C:\test>

Josef W. Segur:

--- Quote from: Jason G on 21 Dec 2010, 12:46:29 pm ---Best case requires few memory transfers back to the host CPU ( only one best spike & no detections)  ;)

[Edit:] Worst case would be a best signal + numdatapoints/fftlen detections, i.e. not really possible since we're limited to 30 detections, so wouldn't bother transferring more than the first 30 ( ... unlike stock...)

--- End quote ---


--- Quote from: Miep on 21 Dec 2010, 01:04:35 pm ---Now he tells us ::) ;)
So normal data would perform somewhere in between - any info on the distribution between the two endpoints?
--- End quote ---

The lower graph on http://setiathome.berkeley.edu/sah_glossary/spike_graphs.php is related; note the log scale on the counts. S@H Enhanced does relatively more short FFT lengths, but there's still a very strong bias toward the long FFT lengths for both reportable and "best" spikes. A quick survey of 44 recent results from my P-M showed 35 best_spikes at fft_len 131072, 6 at fft_len 65536, 2 at fft_len 32768, and 1 at fft_len 16384.
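
To put those survey counts in proportion, here is a minimal Python sketch that converts them to percentage shares. The counts are the ones quoted above; the variable names are only illustrative.

```python
# Best-spike counts by FFT length, from the 44-result survey quoted above.
best_spike_counts = {131072: 35, 65536: 6, 32768: 2, 16384: 1}

total = sum(best_spike_counts.values())  # number of results surveyed

# Share of best spikes found at each FFT length, as a percentage.
shares = {fft_len: 100.0 * n / total for fft_len, n in best_spike_counts.items()}

for fft_len, pct in shares.items():
    count = best_spike_counts[fft_len]
    print(f"fft_len {fft_len:6d}: {count:2d} of {total} ({pct:.1f}%)")
```

In that small sample about four in five best spikes sit at the longest FFT length, and none at the lengths 8 to 64 that this unit test exercises.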

However, the processing order starts at FFT length 8 and works up, so there should be some "worst case" passes at the short FFT lengths during that zero chirp sequence. Subsequent visits to the short FFT lengths are likely to be all "best case". At AR 0.42, FFT length 8 is done 13 times, so overall it will be mostly "best case"; at AR 3.0, FFT length 8 is done only once, so the probability of "worst case" will be higher.
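
As a rough illustration of how the two endpoints might blend, the sketch below assumes (an assumption, not a measurement) that only the first visit to a given FFT length runs at the worst-case rate, every later visit runs at the best-case rate, and each visit does the same amount of work. The function name `blended_gflops` is hypothetical; the example rates are the GTX 470 figures for FFT length 8 posted in this thread (3.7 GFlops worst case, 15.6 GFlops best case).

```python
def blended_gflops(worst, best, visits, worst_visits=1):
    """Time-weighted average rate over `visits` equal-sized passes.

    Assumption: `worst_visits` passes run at the worst-case rate and the
    rest at the best-case rate. Equal work per pass means the times add,
    so the overall rate is a harmonic mean, not an arithmetic one.
    """
    if visits < 1 or not 0 <= worst_visits <= visits:
        raise ValueError("need visits >= 1 and 0 <= worst_visits <= visits")
    # Time per pass is proportional to 1/rate; sum the times of all passes.
    time_units = worst_visits / worst + (visits - worst_visits) / best
    return visits / time_units

# GTX 470, FFT length 8: 3.7 GFlops worst case, 15.6 GFlops best case.
print(f"AR 0.42 (13 visits): {blended_gflops(3.7, 15.6, 13):.1f} GFlops")
print(f"AR 3.0  ( 1 visit ): {blended_gflops(3.7, 15.6, 1):.1f} GFlops")
```

Under that assumption the single slow first pass costs relatively little at AR 0.42 but sets the whole rate at AR 3.0.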

Note that our test WUs, shortened by lowering chirp limits, will have a higher proportion of the zero chirp worst cases than full-length WUs. In general I think that's good: brief, sloppy tests which slightly underestimate the improvement from optimization are better than those which cause unwarranted enthusiasm. But it would also be possible to create a set of test WUs shortened by adjusting chirp resolution instead, which would give better quick-test timing.

Edit: Jason, result_overflow is triggered by the 31st found signal...
Joe

arkayn:
And now the GTX460-768 card,

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    2.2 GFlops    9.7 GB/s
 PS+SuMx(    16) [OK]    2.8 GFlops   11.5 GB/s
 PS+SuMx(    32) [OK]    2.1 GFlops    8.7 GB/s
 PS+SuMx(    64) [OK]    3.4 GFlops   13.6 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    4.2   18.3 121.7 [OK]   11.1   48.5 121.7
 PS+SuMx(    16)    5.8   24.5 121.7 [OK]   10.5   44.1 121.7
 PS+SuMx(    32)    7.2   29.7 121.7 [OK]   10.2   41.7 121.7
 PS+SuMx(    64)    8.4   33.9 121.7 [OK]   10.2   41.5 121.7

SciManStev:

Device: GeForce GTX 480, 810 MHz clock, 1503 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #7 (Faster reductions)
Stock:
 PS+SuMx(     8) [OK]    5.0 GFlops   22.0 GB/s
 PS+SuMx(    16) [OK]    6.0 GFlops   25.3 GB/s
 PS+SuMx(    32) [OK]    4.7 GFlops   19.2 GB/s
 PS+SuMx(    64) [OK]    7.2 GFlops   29.1 GB/s


Opt1: 256 thrds/block
                        worst case              best case
                   GFlps  GB/s ulps         GFlps  GB/s ulps
 PS+SuMx(     8)    9.0   39.2 121.7 [OK]   23.0  100.7 121.7
 PS+SuMx(    16)   11.7   49.0 121.7 [OK]   21.7   90.8 121.7
 PS+SuMx(    32)   13.6   55.8 121.7 [OK]   21.1   86.4 121.7
 PS+SuMx(    64)   15.1   61.2 121.7 [OK]   20.7   83.7 121.7


Steve
