[Split] PowerSpectrum Unit Test

Claggy:
My 9800GTX+'s rerun:

Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       16.2 GFlops    6.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
     64 threads:       16.2 GFlops    6.5 GB/s 121.7ulps
    128 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    256 threads:       15.8 GFlops    6.3 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        2.7 GFlops    1.1 GB/s 1183.3ulps
     64 threads:        2.6 GFlops    1.1 GB/s 1183.3ulps
    128 threads:        2.6 GFlops    1.1 GB/s 1183.3ulps
    256 threads:        2.5 GFlops    1.0 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       15.2 GFlops    6.1 GB/s 121.7ulps
     64 threads:       16.2 GFlops    6.5 GB/s 121.7ulps
    128 threads:       15.9 GFlops    6.4 GB/s 121.7ulps
    256 threads:       15.9 GFlops    6.3 GB/s 121.7ulps
    512 threads:       15.1 GFlops    6.0 GB/s 121.7ulps
   1024 threads: N/A

Claggy

Edit: and my 128 MB 8400M GS:

Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        1.2 GFlops    0.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
     64 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    128 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    256 threads:        1.2 GFlops    0.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
     64 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
    128 threads:        0.3 GFlops    0.1 GB/s 1183.3ulps
    256 threads:        0.2 GFlops    0.1 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
     64 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    128 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    256 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
    512 threads:        1.2 GFlops    0.5 GB/s 121.7ulps
   1024 threads: N/A

M_M:
Very strange that both the 9800GTX+ and the GTX260 seem to be faster than the GTX460, since the GTX460 wins in every game and benchmark... something's wrong...

Also, what is the "clock" measurement displayed in this test? Is it the shader clock? If it is, why does it show just 810 MHz for me? It should be much higher.

Miep:
Looks like 256 threads gets the best performance out of Fermis, with little or no loss for smaller cards.

Jason G:

--- Quote from: M_M on 19 Nov 2010, 01:30:12 pm ---Very strange that both the 9800GTX+ and the GTX260 seem to be faster than the GTX460, since the GTX460 wins in every game and benchmark... something's wrong...

Also, what is the "clock" measurement displayed in this test? Is it the shader clock? If it is, why does it show just 810 MHz for me? It should be much higher.

--- End quote ---

The clock rate is just what the driver/library reports. It is a fixed number, not a measurement of the hardware, and it doesn't mean much beyond a general indication of the original core spec.

As far as GTX 260 vs 9800GTX+ vs GTX 460 goes, quite right  ;) but not strange at all. This is an almost purely 'memory bound' kernel rather than a 'compute bound' one. That makes it depend not so much on the processing speed of the GPU as on the specific memory implementation, the clocks & quality of the RAM chips used, and the kernel tweaks I've been trying out.

So this should be taken as a comparison of memory-bound operations on different cards, and of relative memory subsystem performance with respect to kernel tweaking, not as a guide to GPU compute performance, as there simply is very little to compute in a power spectrum at all.

The goal at this time is to isolate effective strategies for shovelling data in and out of the GPU, rather than what's going on inside. That comes later with some meatier (compute-intensive) kernels.
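To put a rough number on "very little to compute", here is a minimal sketch in Python. It assumes each element is one single-precision complex sample read (8 bytes), one float written (4 bytes), and three floating-point operations (two multiplies and an add). That accounting is an assumption for illustration and may not match how the unit test counts its GFlops and GB/s, and the function names are made up for this example.

```python
def power_spectrum(samples):
    """Power of each complex sample: re*re + im*im."""
    return [z.real * z.real + z.imag * z.imag for z in samples]

# Assumed per-element accounting (not taken from the unit test itself).
FLOPS_PER_ELEMENT = 3        # 2 multiplies + 1 add
BYTES_PER_ELEMENT = 8 + 4    # read one complex (2 x float32), write one float32

def arithmetic_intensity():
    """Floating-point operations per byte of memory traffic."""
    return FLOPS_PER_ELEMENT / BYTES_PER_ELEMENT

def attainable_gflops(peak_gflops, bandwidth_gbs):
    """Simple roofline-style bound: the lower of the compute and memory limits."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity())
```

With made-up figures of 500 GFlops peak compute and 100 GB/s of memory bandwidth, `attainable_gflops(500.0, 100.0)` is capped by the memory term, which is the sense in which the kernel is memory bound.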

Jason

Jason G:

--- Quote from: Miep on 19 Nov 2010, 01:36:13 pm ---Looks like 256 threads gets the best performance out of Fermis, with little or no loss for smaller cards.

--- End quote ---
Yes, it's looking not bad. I can readily embed a couple of codepaths now, as the drivers have their own built-in dispatch (yay). To me that means we can probably have our cake & eat it too, but it's just a matter of running around picking up all the crumbs & sticking them together first.
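One way to picture "a couple of codepaths" is a lookup that picks a launch configuration per device generation. The Python sketch below is only an illustration and is not how the driver's own dispatch works: the function name and the idea of keying on compute capability are assumptions, and the thread counts are simply read off the results posted in this thread so far (64 threads did best on the 9800 GTX+, 256 was suggested for Fermi).

```python
def pick_threads_per_block(cc_major, cc_minor=0):
    """Hypothetical selector: choose a block size from compute capability."""
    if cc_major >= 2:
        # Fermi and later: 256 threads looked best in the posted results.
        return 256
    # Pre-Fermi (compute capability 1.x): 64 threads was the fastest
    # setting in the 9800 GTX+ run above.
    return 64
```

For example, `pick_threads_per_block(1, 1)` returns 64 for the compute capability 1.1 cards above, and any 2.x device gets 256.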
