Author Topic: [Split] PowerSpectrum Unit Test  (Read 165281 times)

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #135 on: 01 Dec 2010, 03:17:13 pm »
Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens

http://forums.nvidia.com/index.php?showtopic=159208

Quote
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"
« Last Edit: 01 Dec 2010, 05:51:46 pm by Ghost »

Offline kevin6912

  • Knave
  • Posts: 12
Re: [Split] PowerSpectrum Unit Test
« Reply #136 on: 01 Dec 2010, 05:18:11 pm »
Test 5  output.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   20.3 GFlops   81.3 GB/s   0.0ulps

 SumMax (    64)    0.7 GFlops    2.9 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    2.3 GFlops    9.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       29.2 GFlops  117.0 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         2.3 GFlops    9.5 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.1 GFlops   44.9 GB/s 121.7ulps

Kevin

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: [Split] PowerSpectrum Unit Test
« Reply #137 on: 01 Dec 2010, 05:37:45 pm »
PowerSpectrumTest5.exe -device 0
.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   20.6 GFlops   82.5 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    6.0 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.6 GFlops   18.5 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       30.0 GFlops  119.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.6 GFlops   26.8 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.2 GFlops   45.2 GB/s 121.7ulps


PowerSpectrumTest5.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   20.7 GFlops   82.6 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.6 GFlops   18.7 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       30.1 GFlops  120.5 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.6 GFlops   26.9 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
        11.2 GFlops   45.3 GB/s 121.7ulps


.
Done
« Last Edit: 01 Dec 2010, 07:47:41 pm by _heinz »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #138 on: 02 Dec 2010, 03:46:56 pm »
Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens

http://forums.nvidia.com/index.php?showtopic=159208

Quote
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"


@Ghost: I did get the following so far:
- Made the modifications appropriate to the inf file, and successfully installed 263.06 TCC driver ( On 480 )
- Disabled the device as a 'normal' display (using mobo display instead)
- Merged the nSight registry key that disables WPF acceleration (for good measure, shouldn't be necessary with no active display on it)


Next step should be to switch the device's driver mode to TCC mode.  That's done via the command:
  nvidia-smi --driver-model=

However, I get this response:
Quote
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe --driver-model=
GPU 0 is not a supported TCC device, skipping
[Edit:] Note that it doesn't say that the card/driver doesn't support it...
Confirming with DeviceQuery:
Quote
CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 480"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 1576468480 bytes
  Multiprocessors x Cores/MP = Cores:            15 (MP) x 32 (Cores/MP) = 480 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    0.81 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 480


PASSED

So I gather we're stuck for now  :(  [Edit:] unless you happen to be good with SoftIce or similar.... ::)
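For anyone repeating this check across several driver attempts, the TCC line can be pulled out of the deviceQuery text automatically. A minimal sketch, assuming the output has been saved as text with the console-wrapped lines rejoined; `parse_device_query` and `sample` are hypothetical names for illustration, not part of any NVIDIA tool:

```python
import re

def parse_device_query(text: str) -> dict:
    """Collect indented 'key:   value' pairs from deviceQuery-style output."""
    fields = {}
    for line in text.splitlines():
        # Property lines are indented and separate key and value with ':' plus padding.
        match = re.match(r"^\s+(.+?):\s{2,}(.+)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

# A few lines copied from the deviceQuery output above.
sample = """\
Device 0: "GeForce GTX 480"
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
"""

info = parse_device_query(sample)
tcc_enabled = info.get("Device is using TCC driver mode") == "Yes"
print("TCC mode active:", tcc_enabled)
```

With the output posted above this reports that TCC mode is not active, matching the nvidia-smi message.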

Going to try checking if I got the section number in the inf right etc...
« Last Edit: 02 Dec 2010, 04:34:24 pm by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #139 on: 02 Dec 2010, 05:31:33 pm »
I got stuck @ the same point as well.
When I ran the nvidia-smi.exe -dm0 cmd I got the same message about the GPU not being supported on TCC.
I tried modifying the .inf file with limited success so I used this site http://laptopvideo2go.com
Basically they create a standard .inf file that allows all NV cards to use all drivers ;D Saved a lot of time and hassle - they also have unreleased drivers on their site. The latest I could see were 265.90. Not sure where they get them from so use at your own risk, but I've had no issues with them.
Also saw a slight increase in the worst case scenario with Mod5 with these drivers; off the top of my head it was about a .2 increase over the official release drivers.
Haven't had a look at SoftIce yet - I'll do a bit of research tomorrow as it looks like I may not be getting into the office again :D

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #140 on: 03 Dec 2010, 02:25:17 pm »
Despite finding a couple more posts (after spending a few hours searching) saying that it is possible to enable TCC mode on non-Tesla cards, I haven't been able to get it running on my 465.
At the moment all I have managed is to change the compute mode ruleset through nvidia-smi. DeviceQuery still says it's running in Default mode though, so whether this has had any real effect or not is up for debate >:(
Think it's about time to give up on this idea unfortunately. Shame though, it would have been nice to get it working as all this card does is crunch Seti

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #141 on: 03 Dec 2010, 02:45:34 pm »
Yeah tried that compute mode thing too.  With Fermi's we want 'Normal' mode anyway, so as to allow multiple instances  ;).

I've thought about it, and to bypass the issue altogether I'll try get another hard drive sorted sometime soon, and use it to dualboot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave for extended period crunching under XPDM.  I have XPx64 as well, but since 64 bit Cuda apps yield a net small slowdown, it seems illogical to use that copy for that.

If it had been a matter of the current ~10% difference stock sees between the driver models, I wouldn't have gone to the trouble of dual-booting.  My optimisations, on the other hand, yield ~30% in favour of XPDM and really force the issue for me (even though they are faster on both OS/driver models), since a lot of the refinement achieved here with a small kernel is likely to apply through most of the application (after a lot of work).   That translates to a crapload of compute performance in my book, since the single 480 (on Wolfdale, x32f on Win7x64) was sustaining 25-26k RAC when there was work.  Much as I dislike RAC as a measure, ~10% extra there (current code) would only boost it to ~27-28k or so (within work-dependent variation anyway); ~32k, though, seems more definite & well worth the added effort. ( [Edit:] then add optimisation benefit I suppose )
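The RAC estimates above are a straight percentage scaling of the 25-26k baseline. A rough sketch of that arithmetic, assuming RAC scales linearly with throughput (a simplification); `projected_rac` is a hypothetical helper name:

```python
def projected_rac(baseline: float, gain_percent: float) -> float:
    """Scale a baseline RAC by a percentage throughput gain."""
    return baseline * (1.0 + gain_percent / 100.0)

# RAC range reported above for the single 480 under Win7x64.
baseline_range = (25_000, 26_000)

# ~10% is the stock driver-model gap, ~30% the gap seen with the optimised kernels.
projections = {
    gain: tuple(round(projected_rac(b, gain)) for b in baseline_range)
    for gain in (10, 30)
}

for gain, (low, high) in projections.items():
    print(f"+{gain}%: {low:,} to {high:,}")
```

This gives roughly 27.5-28.6k for +10% and 32.5-33.8k for +30%, in line with the rounded figures quoted in the post.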

Jason
« Last Edit: 03 Dec 2010, 02:50:42 pm by Jason G »

Offline Miep

  • Global Moderator
  • Knight who says 'Ni!'
  • *****
  • Posts: 964
Re: [Split] PowerSpectrum Unit Test
« Reply #142 on: 03 Dec 2010, 02:54:56 pm »
OK, non-critical unless I make computation mistakes  ( I was mostly concerned here to not make code slower...).  Stock / x32f code there is doing something your GPU doesn't like IMO.

Was that Quadro 'integrated' & using some portion of system memory, or does it use dedicated memory?


I dug out a 'shared memory: no' from a German comparison site. It's got 256M of its own as far as I know.

nvidia control panel system info comes up with
total available 1535 MB
dedicated 256 MB GDDR3
system video 0MB
shared system mem 1279MB
The road to hell is paved with good intentions

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #143 on: 03 Dec 2010, 03:05:16 pm »
Great, thanks.  Yes, the dedicated number is the clincher: it's a discrete GPU then, which explains why it didn't trigger a special integrated GPU optimisation I made in the last test (that particular functionality remains untested/unverified).  The shared bit will just be system memory the driver's using for WDDM paging.   I only care about that because it looks like Santa might be bringing me an ION2 based netbook (I was a good boy all year, sort of), so poking around with that functionality early seemed a good idea.

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #144 on: 03 Dec 2010, 06:19:27 pm »
Yeah tried that compute mode thing too.  With Fermi's we want 'Normal' mode anyway, so as to allow multiple instances  ;).

I've thought about it, and to bypass the issue altogether I'll try get another hard drive sorted sometime soon, and use it to dualboot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave for extended period crunching under XPDM.  I have XPx64 as well, but since 64 bit Cuda apps yield a net small slowdown, it seems illogical to use that copy for that.

Jason

I was thinking about something similar - just ordered a couple of 1TB drives and a raid controller, so after migrating my current data drives to those (and finally getting some internal redundancy  :)) I'll have a spare drive that I was planning on either loading with Linux, or XP if I can find the disk again. A 30% boost is definitely worth the extra effort of setting up the dualboot. Alternatively I may just run a VM for Boinc: depending on whether I can get the GPUs to be seen by the VM and how well Boinc operates, I'll create an XP VM and run it that way
« Last Edit: 03 Dec 2010, 06:26:38 pm by Ghost »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #145 on: 04 Dec 2010, 12:34:51 pm »
Extra drives on order here ... hopefully will be able to find a floppy disk for XP raid driver install before they arrive  ::).

If you're able to verify this (an increased XPDM advantage with the heavily optimised kernels, over the stock ~10% advantage between driver models) before I get set up, I'll report the increased XPDM<->WDDM speed discrepancy with highly optimised kernels ... since they may not have factored as much as a 30% performance difference into decisions (related to TCC mode).
« Last Edit: 04 Dec 2010, 12:38:39 pm by Jason G »

Ghost0210

  • Guest
Re: [Split] PowerSpectrum Unit Test
« Reply #146 on: 04 Dec 2010, 02:09:09 pm »
I've managed to scavenge an old drive from an old machine for this test, so have now got a dual-boot machine for a short time ;)
Just downloading and installing the standard drivers to get a baseline for the test -

Stock results on XP Pro x32 260.99 drivers:

Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
      PowerSpectrum+summax Unit test #5
Stock:
 PwrSpec<    64>   16.0 GFlops   63.8 GB/s   0.0ulps

 SumMax (    64)    1.4 GFlops    5.8 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    4.4 GFlops   17.7 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       23.0 GFlops   91.9 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         6.7 GFlops   27.2 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps
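To compare runs like this one across drivers and operating systems without retyping numbers, the result lines can be parsed mechanically. A minimal sketch, assuming the line layout seen in the test output above (a GFlops value, a GB/s value, and an optional ulps value); `RESULT` and `parse_results` are hypothetical names, not part of the test program:

```python
import re

# Matches e.g. "16.0 GFlops   63.8 GB/s   0.0ulps"; the ulps field is optional.
RESULT = re.compile(r"([\d.]+)\s+GFlops\s+([\d.]+)\s+GB/s(?:\s+([\d.]+)ulps)?")

def parse_results(text):
    """Return (label, gflops, gb_per_s, ulps_or_None) for each result line."""
    rows = []
    for line in text.splitlines():
        m = RESULT.search(line)
        if m:
            label = line[:m.start()].strip()  # whatever precedes the numbers
            ulps = float(m.group(3)) if m.group(3) else None
            rows.append((label, float(m.group(1)), float(m.group(2)), ulps))
    return rows

# Lines copied from the XP Pro x32 run above.
sample = """\
 PwrSpec<    64>   16.0 GFlops   63.8 GB/s   0.0ulps
 SumMax (    64)    1.4 GFlops    5.8 GB/s
 PS+SuMx(    64)    4.4 GFlops   17.7 GB/s
    256 threads:       23.0 GFlops   91.9 GB/s 121.7ulps
"""

rows = parse_results(sample)
for label, gflops, gbps, ulps in rows:
    print(label, gflops, gbps, ulps)
```

Result lines that carry no label of their own (the worst case / best case figures) come back with an empty label, so the caller would need to track the preceding description line.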
« Last Edit: 04 Dec 2010, 02:42:20 pm by Ghost »

Offline _heinz

  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 2117
Re: [Split] PowerSpectrum Unit Test
« Reply #147 on: 04 Dec 2010, 02:18:56 pm »
PowerSpectrumxe2011Test5.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5)
Stock:
 PwrSpec<    64>   11.9 GFlops   47.6 GB/s   0.0ulps

 SumMax (    64)    0.4 GFlops    1.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.4 GFlops    5.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       18.5 GFlops   73.8 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         2.1 GFlops    8.3 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         2.4 GFlops    9.6 GB/s 121.7ulps


PowerSpectrumxe2011Test5.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
                PowerSpectrum+summax Unit test #5)
Stock:
 PwrSpec<    64>   11.9 GFlops   47.6 GB/s   0.0ulps

 SumMax (    64)    0.4 GFlops    1.7 GB/s
Every ifft average & peak OK

 PS+SuMx(    64)    1.4 GFlops    5.8 GB/s


GetPowerSpectrum() choice for Opt1: 256 thrds/block
    256 threads:       18.3 GFlops   73.3 GB/s 121.7ulps


Opt1 (PSmod3+SM): 256 thrds/block
  256 threads, fftlen 64: (worst case: full summax copy)
         2.1 GFlops    8.4 GB/s 121.7ulps
Every ifft average & peak OK
  256 threads, fftlen 64: (best case, nothing to update)
         2.4 GFlops    9.6 GB/s 121.7ulps


.
Done

Remark:compiled with XE2011

modify:
something must have changed, last Test5 above shows
11.2 GFlops   45.3 GB/s 121.7ulps

in last line
« Last Edit: 04 Dec 2010, 02:23:45 pm by _heinz »

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #148 on: 04 Dec 2010, 02:32:34 pm »

something must be changed, last Test5 above shows
11.2 GFlops   45.3 GB/s 121.7ulps

Yeah, 11.2 is more like what that card should be doing heinz.

Offline Jason G

  • Construction Fraggle
  • Knight who says 'Ni!'
  • *****
  • Posts: 8980
Re: [Split] PowerSpectrum Unit Test
« Reply #149 on: 04 Dec 2010, 04:23:27 pm »
Stock results on XP Pro x32 260.99 drivers:
...
 PS+SuMx(    64)    4.4 GFlops   17.7 GB/s
...
  256 threads, fftlen 64: (worst case: full summax copy)
         6.7 GFlops   27.2 GB/s 121.7ulps
...
  256 threads, fftlen 64: (best case, nothing to update)
         8.7 GFlops   35.3 GB/s 121.7ulps

OK, so far against your previous results (assuming all else equal), we're back to our roughly ~10% performance advantage to XP:

(XP32-Win7x64)/Win7x64
Stock case: (4.4-4.1)/4.1 = ~7.3 % advantage to XP (expected, not too annoying)
Worst case: (6.7-6.0)/6.0 = ~11.7% advantage to XP ( I can *almost* live with that)
Best case:  (8.7-8.7)/8.7 = ~0.0% advantage to XP (fine)

So there appears to be a greater advantage to XP with the worst case (lots of memory transfers), though not as great as feared... Phew!  ;D

Since the Memory numbers have more significant digits, and the worst case advantage indicates a memory issue of some sort, I'll compare the throughput figures also:
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case:  (35.3-35.4)/35.4 = ~0.3% advantage to Win7

Tentative analysis based on above:   Raw compute speed between the two OS/Driver models is roughly the same ('Best Case' has no memory transfer of results), however WDDM's memory paging schemes increase overheads for the worst case by up to ~14.2% on that system ( 1/(1-0.124) ).
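The percentages above can be reproduced with a few lines. A minimal sketch using only the figures quoted in this post; `relative_advantage` and `summarize` are hypothetical helper names, and the last expression is simply the 1/(1-0.124) formula as written above:

```python
def relative_advantage(xp: float, win7: float) -> float:
    """(XP32 - Win7x64) / Win7x64, expressed as a percentage."""
    return (xp - win7) / win7 * 100.0

# (XP32 value, Win7x64 value) pairs copied from the comparison above.
gflops = {"stock": (4.4, 4.1), "worst": (6.7, 6.0), "best": (8.7, 8.7)}
gbps = {"stock": (17.7, 16.5), "worst": (27.2, 24.2), "best": (35.3, 35.4)}

def summarize(table):
    """Round each case's advantage to one decimal place."""
    return {case: round(relative_advantage(xp, w7), 1)
            for case, (xp, w7) in table.items()}

gflops_adv = summarize(gflops)
gbps_adv = summarize(gbps)

# Worst-case overhead using the expression given in the post: 1/(1 - 0.124) - 1.
wddm_overhead = round((1.0 / (1.0 - 0.124) - 1.0) * 100.0, 1)

print(gflops_adv)
print(gbps_adv)
print(wddm_overhead)
```

A negative value means the advantage goes to Win7, which is the case only for the best-case GB/s figure.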

So memory transfers will have to be minimised in critical kernels.  I can enable a pinned memory optimisation I implemented for integrated GPUs, which might just help the situation.  At least we're not looking at the ~30% difference that had me petrified.

Jason

 

 
