[Split] PowerSpectrum Unit Test

Jason G:
Found a source of a subtle precision variation in the stock code on pre-Fermi cards. I will test-patch mod1 & mod3 and leave stock alone. I'll probably fix mod2 but leave its precision unfixed as a test (fixing mod2 will make it slower anyway).

[A Bit Later:] Updated first post:


--- Quote ---[Updated] Mod3_UnitTest attached, changed both mods & added a third
Mod1:  Tuned precision such that non-Fermi & Fermi match, and exceed stock pre-Fermi precision
Mod2:  Fixed, but sadly is slow now, remains at stock accuracy
Mod3:  As with Mod1, adding extra threads & split loads (May be suitable for some ranges of cards)

--- End quote ---

Some variation on #1 and/or #3 may eventually need to feed into a stock update, because of the (very tiny) precision mismatch in the stock code between CPU, pre-Fermi and Fermi. The issue could be a contributor to the 'dodgy Gaussians'; time will tell whether that's the case.
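
For anyone reading the ulps column in the results that follow: an ulp (unit in the last place) is the gap between adjacent representable floats, so an ulp error counts how many representable single-precision values separate a result from a reference. The thread does not say how the unit test aggregates its figure, so this is only a minimal Python sketch of the general idea, not the test's actual code. The function names (`to_float32`, `ulp_distance`, `power_f32`, `power_reference`) are made up for illustration, and the formula re*re + im*im is the usual definition of a power spectrum bin, assumed here rather than taken from the kernel source.

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_distance(a, b):
    """Count the representable float32 steps between finite values a and b."""
    def ordered(x):
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        # Map the sign-magnitude bit pattern onto a monotonic integer line,
        # so that adjacent floats differ by exactly 1 (and -0.0 equals +0.0).
        return bits if bits < 0x80000000 else 0x80000000 - bits
    return abs(ordered(a) - ordered(b))

def power_f32(re, im):
    """re*re + im*im with every intermediate rounded to single precision."""
    re, im = to_float32(re), to_float32(im)
    return to_float32(to_float32(re * re) + to_float32(im * im))

def power_reference(re, im):
    """Same formula carried in double precision, rounded once at the end."""
    re, im = to_float32(re), to_float32(im)
    return to_float32(re * re + im * im)

if __name__ == "__main__":
    # Hypothetical sample inputs; prints the ulp gap between the two paths.
    for re, im in [(3.0, 4.0), (0.1, 0.7), (1234.5678, 0.001)]:
        print(re, im, ulp_distance(power_f32(re, im), power_reference(re, im)))
```

Two code paths that round their intermediates differently can disagree by an ulp or so per element even though both are "correct", which is the kind of small mismatch a comparison like this is meant to expose.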

glennaxl:
-device 0
Device: GeForce GTX 295, 1476 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.5 GFlops   10.6 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.5 GFlops   10.6 GB/s 121.7ulps
    128 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    256 threads:       26.7 GFlops   10.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
     64 threads:        6.3 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    256 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.5 GFlops   10.6 GB/s 121.7ulps
    128 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    256 threads:       26.7 GFlops   10.7 GB/s 121.7ulps
    512 threads:       26.6 GFlops   10.7 GB/s 121.7ulps
   1024 threads: N/A

-device 1
Device: GeForce GTX 295, 1476 MHz clock, 873 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       26.1 GFlops   10.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.4 GFlops    7.4 GB/s 121.7ulps
     64 threads:       26.1 GFlops   10.4 GB/s 121.7ulps
    128 threads:       26.3 GFlops   10.5 GB/s 121.7ulps
    256 threads:       26.4 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        6.1 GFlops    2.4 GB/s 1183.3ulps
     64 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    128 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps
    256 threads:        6.2 GFlops    2.5 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.5 GFlops    7.4 GB/s 121.7ulps
     64 threads:       25.9 GFlops   10.3 GB/s 121.7ulps
    128 threads:       26.0 GFlops   10.4 GB/s 121.7ulps
    256 threads:       26.4 GFlops   10.6 GB/s 121.7ulps
    512 threads:       26.4 GFlops   10.6 GB/s 121.7ulps
   1024 threads: N/A

-device 2
Device: GeForce GTX 260, 1487 MHz clock, 874 MB memory.
Compute capability 1.3
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       25.5 GFlops   10.2 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
     64 threads:       25.6 GFlops   10.2 GB/s 121.7ulps
    128 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    256 threads:       25.9 GFlops   10.4 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        5.9 GFlops    2.4 GB/s 1183.3ulps
     64 threads:        6.1 GFlops    2.4 GB/s 1183.3ulps
    128 threads:        6.0 GFlops    2.4 GB/s 1183.3ulps
    256 threads:        5.9 GFlops    2.4 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:       18.7 GFlops    7.5 GB/s 121.7ulps
     64 threads:       25.6 GFlops   10.2 GB/s 121.7ulps
    128 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    256 threads:       25.9 GFlops   10.4 GB/s 121.7ulps
    512 threads:       25.8 GFlops   10.3 GB/s 121.7ulps
   1024 threads: N/A

M_M:
Device: GeForce GTX 460, 810 MHz clock, 993 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:       14.7 GFlops    5.9 GB/s   0.0ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        8.1 GFlops    3.2 GB/s 121.7ulps
     64 threads:       14.5 GFlops    5.8 GB/s 121.7ulps
    128 threads:       22.2 GFlops    8.9 GB/s 121.7ulps
    256 threads:       26.2 GFlops   10.5 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        9.4 GFlops    3.8 GB/s   0.0ulps
     64 threads:       12.2 GFlops    4.9 GB/s   0.0ulps
    128 threads:       14.7 GFlops    5.9 GB/s   0.0ulps
    256 threads:       14.3 GFlops    5.7 GB/s   0.0ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        7.6 GFlops    3.0 GB/s 121.7ulps
     64 threads:       14.0 GFlops    5.6 GB/s 121.7ulps
    128 threads:       21.5 GFlops    8.6 GB/s 121.7ulps
    256 threads:       20.8 GFlops    8.3 GB/s 121.7ulps
    512 threads:       20.0 GFlops    8.0 GB/s 121.7ulps
   1024 threads:       17.5 GFlops    7.0 GB/s 121.7ulps

Miep:
Device: Quadro FX 570M, 950 MHz clock, 242 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
Stock GetPowerSpectrum():
     64 threads:        4.6 GFlops    1.8 GB/s 1183.3ulps


GetPowerSpectrum() mod 1: (made Fermi & Pre-Fermi match in accuracy.)
     32 threads:        2.9 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    128 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps


GetPowerSpectrum() mod 2 (fixed, but slow):
     32 threads:        0.8 GFlops    0.3 GB/s 1183.3ulps
     64 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    128 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps
    256 threads:        0.7 GFlops    0.3 GB/s 1183.3ulps


GetPowerSpectrum() mod 3: (As with mod1, +threads & split loads)
     32 threads:        3.0 GFlops    1.2 GB/s 121.7ulps
     64 threads:        4.4 GFlops    1.8 GB/s 121.7ulps
    128 threads:        4.4 GFlops    1.7 GB/s 121.7ulps
    256 threads:        4.3 GFlops    1.7 GB/s 121.7ulps
    512 threads:        3.3 GFlops    1.3 GB/s 121.7ulps
   1024 threads: N/A

Oh, look, I'm faster than an ION...
And look how horrible: even mod3 is a whole 5% slower than stock. That's 4 minutes on a 90-minute task, which means I'd diminish throughput by one task per 4-5 days. Simply outrageous.

Jason G:
LoL, don't worry, we'll put a crappy stock codepath in just for you  ;)

[Edit:] I'm leaning toward the simpler Mod1 kernel for the rest of us. On the Fermis, at least, there is still some cache control to play with, but then the denser thread count of Mod3, at little cost, may allow more active kernels to fit on the Fermi GPU concurrently... Hmmm....
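
The thread-count scaling in the results above, and the "1024 threads: N/A" rows on the pre-Fermi cards, line up with per-multiprocessor limits. Here is a rough Python sketch that estimates theoretical occupancy for each block size. It assumes the limits listed in NVIDIA's CUDA C Programming Guide for compute capability 1.1, 1.3 and 2.1, and it ignores register and shared-memory usage, which the thread does not give. `LIMITS` and `theoretical_occupancy` are illustrative names only.

```python
# Per-multiprocessor limits by compute capability (assumed from NVIDIA's
# CUDA C Programming Guide for these hardware generations).
LIMITS = {
    "1.1": {"threads_per_sm": 768,  "blocks_per_sm": 8, "threads_per_block": 512},
    "1.3": {"threads_per_sm": 1024, "blocks_per_sm": 8, "threads_per_block": 512},
    "2.1": {"threads_per_sm": 1536, "blocks_per_sm": 8, "threads_per_block": 1024},
}

def theoretical_occupancy(cc, threads_per_block):
    """Fraction of a multiprocessor's thread slots one block size can fill.

    Returns None when the block size exceeds the per-block limit, i.e. the
    launch is not possible at all. Registers and shared memory are ignored.
    """
    lim = LIMITS[cc]
    if threads_per_block > lim["threads_per_block"]:
        return None
    # Resident blocks are capped both by the block limit and by thread slots.
    blocks = min(lim["blocks_per_sm"], lim["threads_per_sm"] // threads_per_block)
    return blocks * threads_per_block / lim["threads_per_sm"]

if __name__ == "__main__":
    for cc in ("1.1", "1.3", "2.1"):
        for threads in (32, 64, 128, 256, 512, 1024):
            occ = theoretical_occupancy(cc, threads)
            label = "N/A" if occ is None else f"{occ:.0%}"
            print(f"CC {cc}  {threads:5d} threads/block: {label}")
```

Under these assumptions, 32 threads per block can fill at most a quarter of a compute capability 1.3 multiprocessor and a sixth of a 2.1 one, which is at least consistent with the 32-thread rows being the slowest in mod 1 and mod 3. The 512-thread per-block limit on compute capability 1.x would also explain the N/A rows. Occupancy is only one factor, so this does not by itself account for the remaining differences between the mods.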
