[Split] PowerSpectrum Unit Test

Jason G:
Cheers both, will investigate. Not sure why the build decided to use 32_7 from ::), probably from messing with drivers earlier. Will rebuild shortly & reattach. [Done]

Jason 

Ghost0210:
Thanks Jason,
Here are the results under XP:

--- Quote ---Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   15.8 GFlops   63.3 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.5 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.4 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.6 GFlops   30.6 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps

--- End quote ---

and under Win 7:

--- Quote ---Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   17.3 GFlops   69.2 GB/s   0.0ulps

 SumMax (    64)    1.2 GFlops    5.2 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.0 GFlops   16.3 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       27.5 GFlops  110.0 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.2 GFlops   29.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         9.2 GFlops   37.3 GB/s 121.7ulps

--- End quote ---

Jason G:
Ghost's results before pinned memory usage (Test #5 memory throughput):


--- Quote ---Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case:  (35.3-35.4)/35.4 = ~0.3% advantage to Win7

--- End quote ---

With pinned memory (Test #6 memory throughput):
Stock case*: (17.5-16.3)/16.3 = ~7.36% advantage to XP (consistent with prior result)
Worst case:  (30.6-29.2)/29.2 = ~4.8% advantage to XP (narrowed)
Best case:   (35.3-37.3)/37.3 = ~5.4% advantage to Win7 (!)  :o

*Stock code doesn't use pinned memory
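
For anyone wanting to re-check the percentages, here is a minimal sketch of the same arithmetic, assuming (as in the lines above) that the Win7 figure is used as the baseline. The function name `relative_advantage` and the `test6` dictionary are made up for illustration; the GB/s values are copied from the Test #6 results posted in this thread.

```python
def relative_advantage(xp_gbs, win7_gbs):
    """Return (winner, percent), using the Win7 throughput as the baseline."""
    diff_pct = (xp_gbs - win7_gbs) / win7_gbs * 100.0
    if diff_pct > 0:
        winner = "XP"
    elif diff_pct < 0:
        winner = "Win7"
    else:
        winner = "tie"
    return winner, round(abs(diff_pct), 2)


# Test #6 memory throughput in GB/s as (XP, Win7), taken from the posted logs
test6 = {
    "stock": (17.5, 16.3),
    "worst": (30.6, 29.2),
    "best": (35.3, 37.3),
}

for case, (xp, win7) in test6.items():
    winner, pct = relative_advantage(xp, win7)
    print(f"{case:5s}: ~{pct}% advantage to {winner}")
```

Swapping in the Test #5 pairs (17.7, 16.5), (27.2, 24.2) and (35.3, 35.4) should give the earlier quoted figures the same way.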

Further tentative analysis: Using pinned (non-pageable) host memory for critical datasets, together with asynchronous host<->device transfers, helps hide the additional overheads experienced in the WDDM driver model. Careful use of these latency-hiding mechanisms, though complex, can yield improved performance on WDDM platforms when large transfers are needed (as in the 'worst case'), and can completely hide the costs when transfers are minimised (as in the 'best case'). With only partial implementation of the optimisation strategies, the end result on WDDM platforms will likely be performance that roughly approximates XPDM performance, or exceeds it by some small margin when the costs can be totally hidden. This is likely a function of the host memory paging scheme employed under the newer driver model, which already has some of the required data effectively 'mirrored' on the host & device.

Cheers all! Success!  ;D  More ammunition to go on with helps a lot.

Overall, it seems the Windows 7/Vista WDDM driver model is not slower after all, but it requires 'more careful' (& complex) programming to make the implementations efficient.

Jason

Ghost0210:
Brilliant news  :D

Ghost

Claggy:
Here are the PowerSpectrum6 results on my 9800GTX+ on Win 7 64bit:


--- Quote ---Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   16.1 GFlops   64.6 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.5 GFlops   18.3 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         7.1 GFlops   28.7 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.9 GFlops   40.0 GB/s 121.7ulps
--- End quote ---

and on Win Vista 64bit:


--- Quote ---Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   16.1 GFlops   64.3 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.4 GFlops   17.8 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         6.9 GFlops   27.8 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.9 GFlops   39.9 GB/s 121.7ulps
--- End quote ---

and on my 128MB 8400M GS on Vista 32bit:


--- Quote ---Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    1.2 GFlops    4.8 GB/s 1183.3ulps

 SumMax (    64)    0.1 GFlops    0.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        1.2 GFlops    4.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         0.6 GFlops    2.5 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         0.6 GFlops    2.6 GB/s 121.7ulps
--- End quote ---

Claggy
