Forum > GPU crunching
[Split] PowerSpectrum Unit Test
Jason G:
Cheers both, will investigate. Not sure why the build decided to use 32_7 ::) — probably from messing with drivers earlier. Will rebuild shortly & reattach. [Done]
Jason
Ghost0210:
Thanks Jason:
Here's results under XP:
--- Quote ---Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 15.8 GFlops 63.3 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 5.7 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.3 GFlops 17.5 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 23.1 GFlops 92.4 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
7.6 GFlops 30.6 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.3 GB/s 121.7ulps
--- End quote ---
and under Win 7:
--- Quote ---Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 17.3 GFlops 69.2 GB/s 0.0ulps
SumMax ( 64) 1.2 GFlops 5.2 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.0 GFlops 16.3 GB/s
GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 27.5 GFlops 110.0 GB/s 121.7ulps
Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
256 threads, fftlen 64: (worst case: full summax copy)
7.2 GFlops 29.2 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
9.2 GFlops 37.3 GB/s 121.7ulps
--- End quote ---
Jason G:
Ghost's results before pinned memory usage (Test #5 memory throughput):
--- Quote ---Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case: (35.4-35.3)/35.3 = ~0.3% advantage to Win7
--- End quote ---
With pinned memory (Test #6 memory throughput):
Stock case*: (17.5-16.3)/16.3 = ~7.4% advantage to XP (consistent with prior result)
Worst case: (30.6-29.2)/29.2 = ~4.8% advantage to XP (Narrowed)
Best case: (37.3-35.3)/35.3 = ~5.7% advantage to Win7 (!) :o
*Stock code doesn't use pinned memory
Further tentative analysis: using pinned (non-pageable) memory for critical datasets, together with asynchronous host<->device transfers, helps hide the additional overheads of the WDDM driver model. Careful use of these latency-hiding mechanisms, though complex, can yield improved performance on WDDM platforms when large transfers are needed (as in the 'worst case'), and can completely hide transfer costs when transfers are minimised (as in the 'best case'). With only a partial implementation of these optimisation strategies, WDDM platforms will likely roughly approximate XPDM performance, or exceed it by a small margin when costs can be totally hidden. This is probably a function of the WDDM host memory paging scheme employed under the newer driver model, which already effectively 'mirrors' some required data between host and device.
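As a minimal sketch of the pattern described above — pinned host memory plus asynchronous copies on a CUDA stream — the following compiles as C against the CUDA runtime (link with -lcudart; needs a CUDA-capable device). It is illustrative only, not the actual PowerSpectrum code:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t n = 1 << 20;
    float *h_buf = NULL, *d_buf = NULL;
    cudaStream_t stream;

    /* Pinned (page-locked) host allocation: lets the DMA engine copy
     * directly, instead of staging through a pageable bounce buffer,
     * and is required for cudaMemcpyAsync to be truly asynchronous. */
    cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaStreamCreate(&stream);

    for (size_t i = 0; i < n; ++i) h_buf[i] = 1.0f;

    /* Queue the transfers and return immediately; kernels launched on
     * the same stream would overlap with other host-side work, hiding
     * the WDDM batching/paging cost. */
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    /* ... kernel launches on `stream` would go here ... */
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  /* block only when the result is needed */

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

The key design point is that the host thread never blocks between queuing work and the final synchronise, which is what gives the driver room to hide WDDM overheads.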
Cheers all! Success! ;D More ammunition to go on with helps a lot.
Overall, it seems Windows 7/Vista WDDM driver model is not slower after all, but requires 'more careful' (& complex) programming to make the implementations efficient.
Jason
Ghost0210:
Brilliant news :D
Ghost
Claggy:
Here's the PowerSpectrum6 results on my 9800GTX+ on Win 7 64bit:
--- Quote ---Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 16.1 GFlops 64.6 GB/s 1183.3ulps
SumMax ( 64) 1.4 GFlops 6.0 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.5 GFlops 18.3 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 16.2 GFlops 64.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
7.1 GFlops 28.7 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
9.9 GFlops 40.0 GB/s 121.7ulps
--- End quote ---
and on Win Vista 64bit:
--- Quote ---Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 16.1 GFlops 64.3 GB/s 1183.3ulps
SumMax ( 64) 1.4 GFlops 5.8 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 4.4 GFlops 17.8 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 16.2 GFlops 64.7 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
6.9 GFlops 27.8 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
9.9 GFlops 39.9 GB/s 121.7ulps
--- End quote ---
and on my 128MB 8400M GS on Vista 32bit:
--- Quote ---Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
PwrSpec< 64> 1.2 GFlops 4.8 GB/s 1183.3ulps
SumMax ( 64) 0.1 GFlops 0.5 GB/s
Every ifft average & peak OK
PS+SuMx( 64) 0.4 GFlops 1.5 GB/s
GetPowerSpectrum() choice for Opt1: 64 thrds/block
64 threads: 1.2 GFlops 4.8 GB/s 121.7ulps
Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
64 threads, fftlen 64: (worst case: full summax copy)
0.6 GFlops 2.5 GB/s 121.7ulps
Every ifft average & peak OK
64 threads, fftlen 64: (best case, nothing to update)
0.6 GFlops 2.6 GB/s 121.7ulps
--- End quote ---
Claggy