[Split] PowerSpectrum Unit Test


Jason G:
Found a source of subtle precision variation in the stock code on pre-Fermi cards. I'll test a patch on Mod1 & Mod3 and leave stock alone. I'll probably fix Mod2 too, but leave its precision unfixed as a test (fixing Mod2 will make it slower anyway).

[A Bit Later:] Updated first post:

--- Quote ---[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1: Tuned precision so that non-Fermi & Fermi match, and exceed stock pre-Fermi precision
Mod2: Fixed, but sadly is slow now, remains at stock accuracy
Mod3: As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)

--- End quote ---

Some variation on #1 and/or #3 may need to end up contributing to a stock update down the road, due to the stock code's (very tiny) precision mismatch between CPU, pre-Fermi and Fermi. The issue could be a contributor to the 'dodgy Gaussians'; time will tell whether that's the case or not.
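
For anyone wondering what an "ulps" figure is measuring, here is a minimal Python sketch of one common way to count units in the last place between a single-precision result and a reference. It is an illustration only: the function names are made up for this post, it assumes a power spectrum value is re*re + im*im, and the unit test may well aggregate its error figure differently.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_key(x: float) -> int:
    """Map x (rounded to float32) onto an integer line that preserves ordering."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # IEEE-754 floats are sign-magnitude; fold negatives below zero
    # so that adjacent floats always differ by exactly 1.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b."""
    return abs(float32_key(a) - float32_key(b))


def power_reference(re: float, im: float) -> float:
    """Reference power value computed in double precision."""
    return re * re + im * im


def power_float32_stepwise(re: float, im: float) -> float:
    """Same formula, rounding to float32 after every operation."""
    re2 = to_float32(re * re)
    im2 = to_float32(im * im)
    return to_float32(re2 + im2)


if __name__ == "__main__":
    re, im = to_float32(0.1), to_float32(0.3)  # hypothetical sample point
    got = power_float32_stepwise(re, im)
    want = power_reference(re, im)
    print("ulps off for this sample:", ulp_distance(got, want))
```

Different hardware can round the intermediate steps differently, which is the kind of mismatch a comparison like this makes visible.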

glennaxl:
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 18.5 GFlops 7.4 GB/s 121.7ulps
64 threads: 26.5 GFlops 10.6 GB/s 121.7ulps
128 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
256 threads: 26.7 GFlops 10.7 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
64 threads: 6.3 GFlops 2.5 GB/s 1183.3ulps
128 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
256 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 18.5 GFlops 7.4 GB/s 121.7ulps
64 threads: 26.5 GFlops 10.6 GB/s 121.7ulps
128 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
256 threads: 26.7 GFlops 10.7 GB/s 121.7ulps
512 threads: 26.6 GFlops 10.7 GB/s 121.7ulps
1024 threads: N/A

-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 26.1 GFlops 10.4 GB/s 1183.3ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 18.4 GFlops 7.4 GB/s 121.7ulps
64 threads: 26.1 GFlops 10.4 GB/s 121.7ulps
128 threads: 26.3 GFlops 10.5 GB/s 121.7ulps
256 threads: 26.4 GFlops 10.5 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 6.1 GFlops 2.4 GB/s 1183.3ulps
64 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
128 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps
256 threads: 6.2 GFlops 2.5 GB/s 1183.3ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 18.5 GFlops 7.4 GB/s 121.7ulps
64 threads: 25.9 GFlops 10.3 GB/s 121.7ulps
128 threads: 26.0 GFlops 10.4 GB/s 121.7ulps
256 threads: 26.4 GFlops 10.6 GB/s 121.7ulps
512 threads: 26.4 GFlops 10.6 GB/s 121.7ulps
1024 threads: N/A

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 25.5 GFlops 10.2 GB/s 1183.3ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 18.7 GFlops 7.5 GB/s 121.7ulps
64 threads: 25.6 GFlops 10.2 GB/s 121.7ulps
128 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
256 threads: 25.9 GFlops 10.4 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 5.9 GFlops 2.4 GB/s 1183.3ulps
64 threads: 6.1 GFlops 2.4 GB/s 1183.3ulps
128 threads: 6.0 GFlops 2.4 GB/s 1183.3ulps
256 threads: 5.9 GFlops 2.4 GB/s 1183.3ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 18.7 GFlops 7.5 GB/s 121.7ulps
64 threads: 25.6 GFlops 10.2 GB/s 121.7ulps
128 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
256 threads: 25.9 GFlops 10.4 GB/s 121.7ulps
512 threads: 25.8 GFlops 10.3 GB/s 121.7ulps
1024 threads: N/A
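
If you want to compare runs like the ones above without eyeballing them, a rough Python sketch along these lines could parse the result lines and report each mod relative to stock. The regular expression is inferred from the output format shown in this thread, and the helper names are hypothetical; nothing here is part of the unit test itself.

```python
import re

# Matches lines such as "64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps".
RESULT_LINE = re.compile(
    r"^\s*(\d+) threads:\s+([\d.]+) GFlops\s+([\d.]+) GB/s\s+([\d.]+)ulps\s*$"
)


def parse_result(line):
    """Return a dict of the figures on one result line, or None if it has none."""
    m = RESULT_LINE.match(line)
    if m is None:
        return None  # e.g. "1024 threads: N/A" or a header line
    threads, gflops, gbps, ulps = m.groups()
    return {
        "threads": int(threads),
        "gflops": float(gflops),
        "gbps": float(gbps),
        "ulps": float(ulps),
    }


def relative_to_stock(stock, candidate):
    """Speed as a ratio of stock, and the change in the reported ulps figure."""
    return {
        "speed_ratio": candidate["gflops"] / stock["gflops"],
        # A difference rather than a ratio, because stock reports 0.0ulps on Fermi.
        "ulps_delta": candidate["ulps"] - stock["ulps"],
    }


if __name__ == "__main__":
    # Figures copied from the GTX 295 (device 0) run above.
    stock = parse_result("64 threads: 26.5 GFlops 10.6 GB/s 1183.3ulps")
    mod1 = parse_result("128 threads: 26.7 GFlops 10.7 GB/s 121.7ulps")
    mod2 = parse_result("64 threads: 6.3 GFlops 2.5 GB/s 1183.3ulps")
    print("mod1 vs stock:", relative_to_stock(stock, mod1))
    print("mod2 vs stock:", relative_to_stock(stock, mod2))
```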

M_M:
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
   64 threads: 14.7 GFlops 5.9 GB/s 0.0ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in a
   32 threads: 8.1 GFlops 3.2 GB/s 121.7ulps
   64 threads: 14.5 GFlops 5.8 GB/s 121.7ulps
   128 threads: 22.2 GFlops 8.9 GB/s 121.7ulps
   256 threads: 26.2 GFlops 10.5 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
   32 threads: 9.4 GFlops 3.8 GB/s 0.0ulps
   64 threads: 12.2 GFlops 4.9 GB/s 0.0ulps
   128 threads: 14.7 GFlops 5.9 GB/s 0.0ulps
   256 threads: 14.3 GFlops 5.7 GB/s 0.0ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split lo
   32 threads: 7.6 GFlops 3.0 GB/s 121.7ulps
   64 threads: 14.0 GFlops 5.6 GB/s 121.7ulps
   128 threads: 21.5 GFlops 8.6 GB/s 121.7ulps
   256 threads: 20.8 GFlops 8.3 GB/s 121.7ulps
   512 threads: 20.0 GFlops 8.0 GB/s 121.7ulps
   1024 threads: 17.5 GFlops 7.0 GB/s 121.7ulps

Miep:
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
64 threads: 4.6 GFlops 1.8 GB/s 1183.3ulps

GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
32 threads: 2.9 GFlops 1.2 GB/s 121.7ulps
64 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
128 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps

GetPowerSpectrum() mod 2 (fixed, but slow):
32 threads: 0.8 GFlops 0.3 GB/s 1183.3ulps
64 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
128 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps
256 threads: 0.7 GFlops 0.3 GB/s 1183.3ulps

GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
32 threads: 3.0 GFlops 1.2 GB/s 121.7ulps
64 threads: 4.4 GFlops 1.8 GB/s 121.7ulps
128 threads: 4.4 GFlops 1.7 GB/s 121.7ulps
256 threads: 4.3 GFlops 1.7 GB/s 121.7ulps
512 threads: 3.3 GFlops 1.3 GB/s 121.7ulps
1024 threads: N/A

Oh, look, I'm faster than an ION...
And look how horrible: even mod3 is a whole 5% slower than stock. That's 4 minutes on a 90-minute task. Which means I'd diminish throughput by one task per 4-5 days. Simply outrageous.
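
The back-of-envelope arithmetic, as a small Python sketch. It takes the 64-thread figures from the run above (stock 4.6 GFlops, mod3 4.4 GFlops) and assumes, pessimistically, that the whole task slows down in proportion to this one kernel. The function names are just for illustration.

```python
def fraction_slower(stock_gflops: float, mod_gflops: float) -> float:
    """Throughput lost relative to stock, as a fraction of stock."""
    return (stock_gflops - mod_gflops) / stock_gflops


def extra_minutes(stock_gflops: float, mod_gflops: float, task_minutes: float) -> float:
    """Extra run time if the whole task scaled inversely with kernel throughput."""
    time_ratio = stock_gflops / mod_gflops
    return task_minutes * (time_ratio - 1.0)


if __name__ == "__main__":
    stock, mod3, task = 4.6, 4.4, 90.0
    print(f"mod3 is {100.0 * fraction_slower(stock, mod3):.1f}% slower than stock")
    print(f"roughly {extra_minutes(stock, mod3, task):.1f} extra minutes per task")
```

In a real task this kernel is only part of the run time, so the actual cost would be smaller than this estimate.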

Jason G:
LoL, don't worry, we'll put a crappy stock codepath in just for you ;)

[Edit:] I'm leaning toward the simpler Mod1 kernel for the rest of us. On the Fermis at least there is some cache control to play with yet, but then the denser thread count of Mod3, at little cost, may allow more active kernels to fit on the Fermi GPU concurrently... Hmmm....
