Author Topic: [Split] PowerSpectrum Unit Test  (Read 162830 times)

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #150 on: 04 Dec 2010, 04:42:10 pm »
@Heinz, something broke in that source you used, investigating.

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #151 on: 04 Dec 2010, 04:44:04 pm »
I've been playing with a couple of other versions of drivers (263.xx & 256.xx) as well and there is no improvement over the current 260.99 WHQL release drivers figures.
Was worth doing this just to get an XP machine up and running again - although I'm struggling to remember where anything is.....

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #152 on: 04 Dec 2010, 04:54:29 pm »
Quote
Was worth doing this just to get an XP machine up and running again - although I'm struggling to remember where anything is.....

Yep going back is a challenge after adapting.  Now that I'm pretty confident the memory transfers are the main factor, I'm hopeful a certain 'trick' may squash the difference.  We'll see.

[Edit:] Updated first post:
Quote
Update: powerspectrum Test 6, pinned memory
- does it improve 'worst case' optimisation on WDDM versus XPDM ?
- or does it improve on both OSes the same ? (or neither,  Test5 remains for comparison)

Will use pinned memory, for Opt1, on GPUs that can do so.
« Last Edit: 04 Dec 2010, 05:40:22 pm by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #153 on: 04 Dec 2010, 07:07:07 pm »
Hi Jason,

Getting an error with the new build saying that cudart_32_32_7.dll isn't present - is this meant to be in the .7z file?

ghost
« Last Edit: 04 Dec 2010, 07:09:40 pm by Ghost »

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: [Split] PowerSpectrum Unit Test
« Reply #154 on: 04 Dec 2010, 07:35:43 pm »
Just to see if it would run, I made a copy of the cudart32_32_16.dll, renamed it to cudart32_32_7.dll and then ran the test

Device: GeForce GTX 460, 1600 MHz clock, 768 MB memory.
Compute capability 2.1
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   12.9 GFlops   51.4 GB/s   0.0ulps

 SumMax (    64)    1.0 GFlops    4.4 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    3.4 GFlops   13.6 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       19.4 GFlops   77.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         6.0 GFlops   24.4 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         7.0 GFlops   28.2 GB/s 121.7ulps

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #155 on: 04 Dec 2010, 08:09:44 pm »
Cheers both, will investigate.  Not sure why the build decided to use 32_7 from ::), probably from messing with drivers earlier.  Will rebuild shortly & reattach. [Done]

Jason 
« Last Edit: 04 Dec 2010, 08:32:33 pm by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #156 on: 05 Dec 2010, 05:15:18 am »
Thanks Jason:
Here's results under XP:
Quote
Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   15.8 GFlops   63.3 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.3 GFlops   17.5 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.1 GFlops   92.4 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.6 GFlops   30.6 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps

and under Win 7:
Quote
Device: GeForce GTX 465, 1215 MHz clock, 994 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   17.3 GFlops   69.2 GB/s   0.0ulps

 SumMax (    64)    1.2 GFlops    5.2 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.0 GFlops   16.3 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       27.5 GFlops  110.0 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
PowerSpectrumSumMax array pinned in host memory.
  256 threads, fftlen 64: (worst case: full summax copy)
         7.2 GFlops   29.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         9.2 GFlops   37.3 GB/s 121.7ulps

« Last Edit: 05 Dec 2010, 05:24:08 am by Ghost »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #157 on: 05 Dec 2010, 06:01:54 am »
Ghost's results before pinned memory usage (Test #5 memory throughput):

Quote
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case:  (35.3-35.4)/35.4 = ~0.3% advantage to Win7

With pinned memory (Test #6 memory throughput):
Stock case*:  (17.5-16.3)/16.3 = ~7.36% advantage to XP (consistent with prior result)
Worst case: (30.6-29.2)/29.2 =  ~4.8% advantage to XP (Narrowed)
Best case:  (35.3-37.3)/37.3 = ~5.4% advantage to Win7 (!)  :o

*Stock code doesn't use pinned memory
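
For anyone wanting to redo the arithmetic above with their own numbers, here is a minimal sketch of the same calculation. The function names (`xpAdvantagePercent`, `leader`) are made up for illustration and are not part of the test program; the inputs are the GB/s figures each run prints.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Relative throughput difference of XP (XPDM) over Win7 (WDDM), in percent.
// Positive result: XP is ahead. Negative result: Win7 is ahead.
// Hypothetical helper, mirroring the (XP - Win7) / Win7 arithmetic quoted above.
double xpAdvantagePercent(double xpGBs, double win7GBs) {
    assert(win7GBs > 0.0);  // a zero baseline would make the ratio meaningless
    return (xpGBs - win7GBs) / win7GBs * 100.0;
}

// Names the platform that is ahead for a given signed advantage.
std::string leader(double advantagePercent) {
    return advantagePercent >= 0.0 ? "XP" : "Win7";
}
```

Calling `xpAdvantagePercent(30.6, 29.2)` corresponds to the Test #6 worst case line, and a negative return value (as for 35.3 versus 37.3) is what the list reports as an advantage to Win7.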

Further tentative analysis:  Using pinned (non-pageable) host memory for critical datasets, together with asynchronous Host<->Device transfers, helps hide the additional overheads experienced under the WDDM driver model.  Careful use of these latency-hiding mechanisms, though complex, can yield improved performance on WDDM platforms when large transfers are needed (as in the 'worst case'), and can completely hide the costs when transfers are minimised (as in the 'best case').  With the optimisation strategies only partially implemented, the end result on WDDM platforms will likely be performance that roughly matches XPDM, or exceeds it by a small margin when the costs can be hidden entirely.  This is likely a function of the host memory paging scheme employed under the newer driver model, which has already effectively 'mirrored' some of the required data on the host & device.
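
To make the pinned memory plus asynchronous transfer idea concrete, here is a rough sketch of the general pattern. It is not the code from the test program: the kernel `powerSpectrum`, the `CHECK` macro, the buffer names and the sizes are all stand-ins chosen for illustration, and it assumes a reasonably recent CUDA toolkit. Only the runtime calls themselves (`cudaHostAlloc`, `cudaMemcpyAsync`, `cudaStreamSynchronize` and friends) are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (assumption): power |z|^2 of each complex sample.
__global__ void powerSpectrum(const float2* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i].x * in[i].x + in[i].y * in[i].y;
}

// Illustrative error check: report the failing call and exit main.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e = (call);                                          \
        if (e != cudaSuccess) {                                          \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(e));                         \
            return 1;                                                    \
        }                                                                \
    } while (0)

int main() {
    const int n = 1 << 20;  // arbitrary sample count for the sketch
    float2* d_in = nullptr;
    float* d_out = nullptr;
    float* h_out = nullptr;  // will point at pinned (page-locked) host memory

    CHECK(cudaMalloc((void**)&d_in, n * sizeof(float2)));
    CHECK(cudaMalloc((void**)&d_out, n * sizeof(float)));
    CHECK(cudaMemset(d_in, 0, n * sizeof(float2)));

    // Pinned allocation: lets the copy below run asynchronously to the host.
    CHECK(cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocDefault));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Kernel and copy are queued on the same stream, so they run in order
    // on the device while the host thread carries on.
    powerSpectrum<<<blocks, threads, 0, stream>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                          cudaMemcpyDeviceToHost, stream));

    // ... independent host work could go here, overlapping the transfer ...

    // Wait for the kernel and copy before reading the results.
    CHECK(cudaStreamSynchronize(stream));
    std::printf("first output sample: %f\n", h_out[0]);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(h_out));
    CHECK(cudaFree(d_out));
    CHECK(cudaFree(d_in));
    return 0;
}
```

The point of the pattern is that the host thread is not blocked while the copy is in flight, which is where the transfer cost can be hidden. With an ordinary pageable buffer the same `cudaMemcpyAsync` call would generally not overlap with host work in this way.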

Cheers All! Success!  ;D  More ammunition to go on with helps a lot.

Overall, it seems the Windows 7/Vista WDDM driver model is not slower after all, but requires 'more careful' (& complex) programming to make the implementations efficient.

Jason
« Last Edit: 05 Dec 2010, 06:43:35 am by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #158 on: 05 Dec 2010, 06:36:13 am »
Brilliant news  :D

Ghost

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: [Split] PowerSpectrum Unit Test
« Reply #159 on: 05 Dec 2010, 07:21:51 am »
Here's the PowerSpectrum6 results on my 9800GTX+ on Win 7 64bit:

Quote
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   16.1 GFlops   64.6 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.5 GFlops   18.3 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         7.1 GFlops   28.7 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.9 GFlops   40.0 GB/s 121.7ulps

and on Win Vista 64bit:

Quote
Device: GeForce 9800 GTX/9800 GTX+, 1900 MHz clock, 496 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   16.1 GFlops   64.3 GB/s 1183.3ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.4 GFlops   17.8 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       16.2 GFlops   64.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         6.9 GFlops   27.8 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.9 GFlops   39.9 GB/s 121.7ulps

and on my 128MB 8400M GS on Vista 32bit:

Quote
Device: GeForce 8400M GS, 800 MHz clock, 114 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    1.2 GFlops    4.8 GB/s 1183.3ulps

 SumMax (    64)    0.1 GFlops    0.5 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    0.4 GFlops    1.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        1.2 GFlops    4.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         0.6 GFlops    2.5 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         0.6 GFlops    2.6 GB/s 121.7ulps

Claggy
« Last Edit: 05 Dec 2010, 07:32:49 am by Claggy »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #160 on: 05 Dec 2010, 07:54:06 am »
Hehe, those (worst case Opt1) are up a bit (apart from the 8400M, I suppose unsurprisingly).  Looks like we found a WDDM display driver limitation, and should be able to work around it, with lots of effort.
« Last Edit: 05 Dec 2010, 07:56:28 am by Jason G »

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: [Split] PowerSpectrum Unit Test
« Reply #161 on: 05 Dec 2010, 08:04:23 am »
I also added PowerSpectrum5 results for my 9800GTX+ on Vista 64bit, on page Eight

Claggy

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #162 on: 05 Dec 2010, 08:21:58 am »
Cheers, yep was looking back there, definitely confirms the use of pinned memory helped Opt1, a bit more than I expected too.

On the XPDM vs WDDM issue, I've had further confirmation on an 8800GTS, from a non-crunching friend, that Test #5 Opt1 worst case is faster on XPDM than on Win7, but roughly the same speed in Test #6 (using pinned memory).  The 'best case' is also faster on Win7, so the numbers seem to match up.  Make the code a bit more sophisticated & Win7 performance is ~equal to a bit faster than XP.

I'll be stewing on these additional aspects we've worked out here for a little while, and apply the knowledge to expanded tests with more fft sizes ~end of week.  If that pans out well, it'll be time to start levering in these small improvements into the X series codebase.  After the powerspectrum+reduction is integrated, then will probably be refinement & expansion of the 'freaky powerspectrum' (custom FFT) kernels using the same knowledge.

All this, of course, is working towards 'fixing' the problematic pulsefinding down the road, and having enough strategies to do so effectively.
(Can't wait for the time when I can ask Berkeley to send VLARs back out to GPUs again  :P)

Jason
« Last Edit: 05 Dec 2010, 08:30:38 am by Jason G »

Offline Richard Haselgrove

  • Messenger Pigeon
  • Knight who says 'Ni!'
  • *****
  • Posts: 2819
Re: [Split] PowerSpectrum Unit Test
« Reply #163 on: 05 Dec 2010, 09:40:16 am »
9800GTX+, Windows 7/32

Code:
Device: GeForce 9800 GTX/9800 GTX+, 1890 MHz clock, 498 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>   15.8 GFlops   63.4 GB/s 1183.3ulps

 SumMax (    64)    1.3 GFlops    5.3 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.1 GFlops   16.5 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:       15.9 GFlops   63.7 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         6.9 GFlops   28.1 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         9.8 GFlops   39.5 GB/s 121.7ulps

Offline perryjay

  • Knight Templar
  • ****
  • Posts: 427
Re: [Split] PowerSpectrum Unit Test
« Reply #164 on: 05 Dec 2010, 11:13:38 am »
Took a couple of tries but I think I got it right....


Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

C:\Users\perry>cd \test

C:\test>powerspectrum6.exe >results.txt
'powerspectrum6.exe' is not recognized as an internal or external command,
operable program or batch file.

C:\test>powerspectrumtest6.exe

Device: GeForce 9500 GT, 1840 MHz clock, 1008 MB memory.
Compute capability 1.1
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #6 (pinned mem)
Stock:
 PwrSpec<    64>    2.8 GFlops   11.3 GB/s 1183.3ulps

 SumMax (    64)    0.4 GFlops    1.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.2 GFlops    4.9 GB/s


GetPowerSpectrum() choice for Opt1: 64 thrds/block
     64 threads:        2.8 GFlops   11.4 GB/s 121.7ulps


Opt1 (PSmod3+SM): 64 thrds/block
PowerSpectrumSumMax array pinned in host memory.
   64 threads, fftlen 64: (worst case: full summax copy)
         1.9 GFlops    7.6 GB/s 121.7ulps
Every ifft average & peak OK
   64 threads, fftlen 64: (best case, nothing to update)
         2.0 GFlops    8.2 GB/s 121.7ulps



C:\test>

 
