Author Topic: [Split] PowerSpectrum Unit Test (Read 186602 times)

Ghost0210 · « **Reply #135 on:** 01 Dec 2010, 03:17:13 pm »

Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens

http://forums.nvidia.com/index.php?showtopic=159208

Quote

Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"
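
For anyone wanting to try the same edit for a different card, here is a minimal sketch that reproduces the pattern of the quoted lines. The helper name `tcc_inf_entries` is made up for illustration, and the `Section001`/`Section002` values are copied from the quoted post; whether they are the right sections for another card or driver package is an assumption that would need checking against the actual NVWD.inf.

```python
def tcc_inf_entries(device_id: str, name: str) -> str:
    """Build NVWD.inf lines following the pattern in the quoted post.

    device_id is the four-digit PCI device ID in hex (0A20 for the GT 220).
    Section001/Section002 are taken verbatim from the quoted post and are
    an assumption for any other card.
    """
    dev = device_id.upper()
    key = f"NVIDIA_DEV.{dev}.01"
    hw_id = f"PCI\\VEN_10DE&DEV_{dev}"
    lines = [
        "[NVIDIA_SetA_Devices.NTamd64.6.0]",
        f"%{key}% = Section001, {hw_id}",
        "[NVIDIA_SetA_Devices.NTamd64.6.1]",
        f"%{key}% = Section002, {hw_id}",
        "[Strings]",
        f'{key} = "{name}"',
    ]
    return "\n".join(lines)


# Reproduces the six lines from the quoted post for the GT 220.
print(tcc_inf_entries("0A20", "NVIDIA GeForce GT 220"))
```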

kevin6912 · « **Reply #136 on:** 01 Dec 2010, 05:18:11 pm »

Test 5 output.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 20.3 GFlops 81.3 GB/s 0.0ulps

SumMax ( 64) 0.7 GFlops 2.9 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 2.3 GFlops 9.5 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 29.2 GFlops 117.0 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
2.3 GFlops 9.5 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.1 GFlops 44.9 GB/s 121.7ulps

Kevin

_heinz · « **Reply #137 on:** 01 Dec 2010, 05:37:45 pm »

PowerSpectrumTest5.exe -device 0
.
Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 20.6 GFlops 82.5 GB/s 0.0ulps

SumMax ( 64) 1.4 GFlops 6.0 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 4.6 GFlops 18.5 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 30.0 GFlops 119.8 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.6 GFlops 26.8 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.2 GFlops 45.2 GB/s 121.7ulps

PowerSpectrumTest5.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 20.7 GFlops 82.6 GB/s 0.0ulps

SumMax ( 64) 1.4 GFlops 5.8 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 4.6 GFlops 18.7 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 30.1 GFlops 120.5 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.6 GFlops 26.9 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
11.2 GFlops 45.3 GB/s 121.7ulps

.
Done
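
To compare runs like these without eyeballing them, the result lines can be pulled into tuples. This is only a rough sketch: the function name `parse_results` and the regular expression are my own, and it assumes the layout seen in the pasted logs (a label either on the same line as the numbers or on the line just before them).

```python
import re

# Matches e.g. "PwrSpec< 64> 20.6 GFlops 82.5 GB/s 0.0ulps"; the ulps part is optional.
RESULT = re.compile(
    r"(?P<label>.*?)\s*(?P<gflops>\d+(?:\.\d+)?) GFlops\s+"
    r"(?P<gbs>\d+(?:\.\d+)?) GB/s(?:\s+(?P<ulps>\d+(?:\.\d+)?)ulps)?"
)


def parse_results(log: str) -> list:
    """Return (label, GFlops, GB/s, ulps-or-None) for every result line."""
    rows = []
    previous = ""  # last non-result line, used as label when a result line has none
    for line in log.splitlines():
        line = line.strip()
        match = RESULT.fullmatch(line)
        if match:
            label = match.group("label") or previous
            ulps = match.group("ulps")
            rows.append((
                label,
                float(match.group("gflops")),
                float(match.group("gbs")),
                float(ulps) if ulps is not None else None,
            ))
        elif line:
            previous = line
    return rows


sample = """\
PwrSpec< 64> 20.6 GFlops 82.5 GB/s 0.0ulps
SumMax ( 64) 1.4 GFlops 6.0 GB/s
256 threads, fftlen 64: (worst case: full summax copy)
6.6 GFlops 26.8 GB/s 121.7ulps
"""
for row in parse_results(sample):
    print(row)
```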

Jason G · « **Reply #138 on:** 02 Dec 2010, 03:46:56 pm »

Quote from: Ghost on 01 Dec 2010, 03:17:13 pm

Looks like someone may have got the TCC model drivers to work with a GT220 card......
may give this a go on the 465 and see what happens

http://forums.nvidia.com/index.php?showtopic=159208

Quote
Ok, I revisited this problem and found out that I had incorrectly modified the INF file for the TCC driver. I now have the driver loading for my GT220 and CUDA programs running through Remote Desktop, which is fantastic.
In short, these are the modifications I had to do to NVWD.inf from the TCC package:
[NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.0A20.01% = Section001, PCI\VEN_10DE&DEV_0A20
[NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.0A20.01% = Section002, PCI\VEN_10DE&DEV_0A20
[Strings]
NVIDIA_DEV.0A20.01 = "NVIDIA GeForce GT 220"

@Ghost: I did get the following so far:
- Made the appropriate modifications to the inf file, and successfully installed the 263.06 TCC driver (on the 480)
- Disabled the device as a 'normal' display (using mobo display instead)
- Merged the nSight registry key that disables WPF acceleration (for good measure, shouldn't be necessary with no active display on it)

Next step should be to switch the device's driver mode to TCC mode. That's done via the command:
nvidia-smi --driver-model=

However, I get this response:

Quote

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe --driver-model=
GPU 0 is not a supported TCC device, skipping

[Edit:] Note that it doesn't say that the card/driver doesn't support it...
Confirming with DeviceQuery:

Quote

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 480"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1576468480 bytes
Multiprocessors x Cores/MP = Cores: 15 (MP) x 32 (Cores/MP) = 480 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 0.81 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 480

PASSED

So I gather we're stuck for now

[Edit:] unless you happen to be good with SoftIce or similar....

Going to try checking if I got the section number in the inf right etc...
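
A quick way to check the relevant field after each attempt, rather than scanning the whole dump by eye: a small sketch that turns deviceQuery-style "Key: value" lines into a dictionary. The function name `parse_device_query` is hypothetical, and the parsing assumes the output layout pasted above (one field per line).

```python
def parse_device_query(text: str) -> dict:
    """Collect 'Key: value' pairs from deviceQuery-style output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines with no colon or nothing after it (headings, blank lines).
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


sample = """\
Device 0: "GeForce GTX 480"
CUDA Capability Major/Minor version number: 2.0
Run time limit on kernels: No
Device is using TCC driver mode: No
"""
info = parse_device_query(sample)
uses_tcc = info.get("Device is using TCC driver mode") == "Yes"
print("TCC mode active:", uses_tcc)
```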

Ghost0210 · « **Reply #139 on:** 02 Dec 2010, 05:31:33 pm »

I got stuck @ the same point as well.
When I ran the nvidia-smi.exe -dm0 cmd I got the same message about the GPU not being supported on TCC.
I tried modifying the .inf file with limited success so I used this site http://laptopvideo2go.com
Basically they create a standard .inf file that allows all NV cards to use all drivers

Saved a lot of time and hassle - they also have unreleased drivers on their site. The latest I could see were 265.90. Not sure where they get them from so use at your own risk, but I've had no issues with them.
Also saw a slight increase in the worst case scenario with Mod5 with these drivers; off the top of my head it was about a .2 increase over the official release drivers.
Haven't had a look at SoftIce yet - I'll do a bit of research tomorrow as it looks like I may not be getting into the office again

Ghost0210 · « **Reply #140 on:** 03 Dec 2010, 02:25:17 pm »

Despite finding a couple more posts (after spending a few hours searching) saying that it is possible to enable TCC mode on non-Tesla cards, I haven't been able to get it running on my 465.
At the moment all I have managed is to change the compute mode ruleset through nvidia-smi. DeviceQuery still says it's running in Default mode though, so whether this has had any real effect or not is up for debate.

Think it's about time to give up on this idea unfortunately. Shame though, it would have been nice to get it working, as all this card does is crunch Seti.

Jason G · « **Reply #141 on:** 03 Dec 2010, 02:45:34 pm »

Yeah, tried that compute mode thing too. With Fermis we want 'Normal' mode anyway, so as to allow multiple instances.

I've thought about it, and to bypass the issue altogether I'll try to get another hard drive sorted sometime soon, and use it to dual-boot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave it crunching under XPDM for extended periods. I have XPx64 as well, but since 64-bit CUDA apps yield a small net slowdown, it seems illogical to use that copy for that.

If it had been a matter of the current ~10% difference stock sees between the driver models, I wouldn't have gone to the trouble of dual-booting. My optimisations, on the other hand, yield ~30% in favour of XPDM (even though they are faster on both OS/driver models), which really forces the issue for me, since a lot of the refinement achieved here with a small kernel is likely to apply through most of the application (after a lot of work). That translates to a crapload of compute performance in my book, since the single 480 (on Wolfdale, x32f on Win7x64) was sustaining 25-26k RAC when there was work. Much as I dislike RAC as a measure, ~10% extra there (current code) would only boost it to ~27-28k or so (within work-dependent variation anyway); ~32k, though, seems more definite & well worth the added effort. ([Edit:] then add optimisation benefit, I suppose.)

Jason
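
The RAC estimates above are simple scaling, which can be sketched as follows. This assumes RAC grows linearly with the speed gain, which is a simplification (work availability and task mix also matter); the function name `projected_rac` is made up for illustration.

```python
def projected_rac(base_rac: int, gain_pct: int) -> int:
    """Scale a RAC figure by a percentage gain, using integer arithmetic."""
    return base_rac * (100 + gain_pct) // 100


# 25-26k RAC baseline from the post, at the ~10% and ~30% gains discussed.
for gain in (10, 30):
    low = projected_rac(25_000, gain)
    high = projected_rac(26_000, gain)
    print(f"+{gain}%: {low}-{high} RAC")
```

The exact products land slightly above the rounded figures quoted in the post, which is within the work-dependent variation mentioned there.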

Miep · « **Reply #142 on:** 03 Dec 2010, 02:54:56 pm »

Quote from: Jason G on 30 Nov 2010, 11:37:35 am

OK, non-critical unless I make computation mistakes ( I was mostly concerned here to not make code slower...). Stock / x32f code there is doing something your GPU doesn't like IMO.

Was that Quadro integrated & using some portion of system memory, or does it use dedicated memory?

I dug out a 'shared memory: no' from a German comparison site. It's got 256M of its own as far as I know.

nvidia control panel system info comes up with
total available 1535 MB
dedicated 256 MB GDDR3
system video 0 MB
shared system mem 1279MB

Jason G · « **Reply #143 on:** 03 Dec 2010, 03:05:16 pm »

Great, thanks. Yes, the dedicated number is the clincher: it's a discrete GPU then, which explains why it didn't trigger a special integrated GPU optimisation I made in the last test (that particular functionality remains untested/unverified). The shared bit will just be system memory the driver's using for WDDM paging. I only care about that because it looks like Santa might be bringing me an ION2-based netbook (I was a good boy all year, sort of), so poking around with that functionality early seemed a good idea.

Ghost0210 · « **Reply #144 on:** 03 Dec 2010, 06:19:27 pm »

Quote from: Jason G on 03 Dec 2010, 02:45:34 pm

Yeah, tried that compute mode thing too. With Fermis we want 'Normal' mode anyway, so as to allow multiple instances.

I've thought about it, and to bypass the issue altogether I'll try to get another hard drive sorted sometime soon, and use it to dual-boot to WinXP32. That way I can keep my snazzy Win7 dev environment, yet leave it crunching under XPDM for extended periods. I have XPx64 as well, but since 64-bit CUDA apps yield a small net slowdown, it seems illogical to use that copy for that.

Jason

I was thinking about something similar - just ordered a couple of 1TB drives and a RAID controller, so after migrating my current data drives to those (and finally getting some internal redundancy) I'll have a spare drive that I was planning on either loading with Linux, or XP if I can find the disk again. A 30% boost is definitely worth the extra effort of setting up the dual boot. Although I may just run a VM for BOINC: depending on whether I can get the GPUs to be seen by the VM and how well BOINC operates, I'll create an XP VM and run it that way.

Jason G · « **Reply #145 on:** 04 Dec 2010, 12:34:51 pm »

Extra drives on order here ... hopefully I'll be able to find a floppy disk for the XP RAID driver install before they arrive.

If you're able to verify the increased XPDM advantage with the heavily optimised kernels (over the stock ~10% advantage between driver models) before I get set up, I'll report the increased XPDM<->WDDM speed discrepancy with highly optimised kernels, since they may not have factored as much as a 30% performance difference into decisions related to TCC mode.

Ghost0210 · « **Reply #146 on:** 04 Dec 2010, 02:09:09 pm »

I've managed to scavenge an old drive from an old machine for this test, so have now got a dual-boot machine for a short time

Just downloading and installing the standard drivers to get a baseline for the test -

Stock results on XP Pro x32 260.99 drivers:

Device: GeForce GTX 465, 1215 MHz clock, 1024 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5
Stock:
PwrSpec< 64> 16.0 GFlops 63.8 GB/s 0.0ulps

SumMax ( 64) 1.4 GFlops 5.8 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 4.4 GFlops 17.7 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 23.0 GFlops 91.9 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
6.7 GFlops 27.2 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.3 GB/s 121.7ulps

_heinz · « **Reply #147 on:** 04 Dec 2010, 02:18:56 pm »

PowerSpectrumxe2011Test5.exe -device 0

Device: GeForce GTX 470, 810 MHz clock, 1248 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5)
Stock:
PwrSpec< 64> 11.9 GFlops 47.6 GB/s 0.0ulps

SumMax ( 64) 0.4 GFlops 1.7 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 1.4 GFlops 5.8 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 18.5 GFlops 73.8 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
2.1 GFlops 8.3 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
2.4 GFlops 9.6 GB/s 121.7ulps

PowerSpectrumxe2011Test5.exe -device 1

Device: GeForce GTX 470, 810 MHz clock, 1249 MB memory.
Compute capability 2.0
Compiled with CUDA 3020.
PowerSpectrum+summax Unit test #5)
Stock:
PwrSpec< 64> 11.9 GFlops 47.6 GB/s 0.0ulps

SumMax ( 64) 0.4 GFlops 1.7 GB/s
Every ifft average & peak OK

PS+SuMx( 64) 1.4 GFlops 5.8 GB/s

GetPowerSpectrum() choice for Opt1: 256 thrds/block
256 threads: 18.3 GFlops 73.3 GB/s 121.7ulps

Opt1 (PSmod3+SM): 256 thrds/block
256 threads, fftlen 64: (worst case: full summax copy)
2.1 GFlops 8.4 GB/s 121.7ulps
Every ifft average & peak OK
256 threads, fftlen 64: (best case, nothing to update)
2.4 GFlops 9.6 GB/s 121.7ulps

.
Done

Remark: compiled with XE2011

Edit: something must have changed; the last Test5 run above shows
11.2 GFlops 45.3 GB/s 121.7ulps
in the last line.

Jason G · « **Reply #148 on:** 04 Dec 2010, 02:32:34 pm »

Quote from: _heinz on 04 Dec 2010, 02:18:56 pm

something must have changed; the last Test5 run above shows
11.2 GFlops 45.3 GB/s 121.7ulps

Yeah, 11.2 is more like what that card should be doing heinz.

Jason G · « **Reply #149 on:** 04 Dec 2010, 04:23:27 pm »

Quote from: Ghost on 04 Dec 2010, 02:09:09 pm

Stock results on XP Pro x32 260.99 drivers:
...
PS+SuMx( 64) 4.4 GFlops 17.7 GB/s
...
256 threads, fftlen 64: (worst case: full summax copy)
6.7 GFlops 27.2 GB/s 121.7ulps
...
256 threads, fftlen 64: (best case, nothing to update)
8.7 GFlops 35.3 GB/s 121.7ulps

OK, so far against your previous results (assuming all else equal), we're back to our roughly 10% performance advantage to XP:

(XP32-Win7x64)/Win7x64
Stock case: (4.4-4.1)/4.1 = ~7.3 % advantage to XP (expected, not too annoying)
Worst case: (6.7-6.0)/6.0 = ~11.7% advantage to XP ( I can *almost* live with that)
Best case: (8.7-8.7)/8.7 = ~0.0% advantage to XP (fine)

So there appears to be a greater advantage to XP with the worst case (lots of memory transfers), though not as great as feared... Phew!

Since the Memory numbers have more significant digits, and the worst case advantage indicates a memory issue of some sort, I'll compare the throughput figures also:
Stock case: (17.7-16.5)/16.5 = ~7.27% advantage to XP
Worst case: (27.2-24.2)/24.2 = ~12.4% advantage to XP
Best case: (35.3-35.4)/35.4 = ~0.3% advantage to Win7

Tentative analysis based on the above: raw compute speed between the two OS/driver models is roughly the same (the 'Best Case' has no memory transfer of results); however, WDDM's memory paging schemes increase overheads for the worst case by up to ~14.2% on that system ( 1/(1-0.124) ).

So memory transfers will have to be minimised in critical kernels. I can enable a pinned memory optimisation I implemented for integrated GPUs, which might just help the situation. At least we're not looking at the ~30% difference that had me petrified.

Jason
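
For anyone wanting to repeat the comparison with their own numbers, the percentages above come from the (XP32-Win7x64)/Win7x64 formula, sketched here. The function name `advantage_pct` is made up for illustration; the figures are the ones quoted in this post, with the Win7 values taken from the earlier results referred to above.

```python
def advantage_pct(xp: float, win7: float) -> float:
    """Relative advantage of XP over Win7 in percent: (XP - Win7) / Win7."""
    return (xp - win7) / win7 * 100.0


# (XP32, Win7x64) pairs as quoted in the post.
gflops = {"stock": (4.4, 4.1), "worst": (6.7, 6.0), "best": (8.7, 8.7)}
gbs = {"stock": (17.7, 16.5), "worst": (27.2, 24.2), "best": (35.3, 35.4)}

for case in ("stock", "worst", "best"):
    by_flops = advantage_pct(*gflops[case])
    by_bandwidth = advantage_pct(*gbs[case])
    # A negative value means the advantage goes to Win7 instead.
    print(f"{case:5s}  GFlops: {by_flops:+.1f}%  GB/s: {by_bandwidth:+.1f}%")
```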
