Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: Raistmer on 20 Dec 2009, 06:20:53 pm

Title: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 20 Dec 2009, 06:20:53 pm
The attached builds use an affinity lock and priority class NORMAL with thread priority HIGHEST to achieve better CPU response times for their needs.
On my own quad with a single low-end GPU I don't see any improvement, so try them on your own hosts and see whether these builds increase the host's RAC or not.
The x4 version is designed for use on a quad (or better) with more than 2 GPUs installed.
The other one is for a multicore (duo, quad or better) host with only 2 GPUs installed.
But again, try both and see what works better on your particular equipment (the apps were not tested on the target hardware; these are only assumptions).
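For illustration only: the core of the affinity-lock idea is giving each GPU-app instance a one-bit CPU mask of its own. This is my own Python sketch, not the app's actual code - the real builds do this natively through Win32 calls such as SetProcessAffinityMask and SetThreadPriority, and `affinity_mask`, `instance_index` and `n_cpus` are hypothetical names.

```python
def affinity_mask(instance_index, n_cpus):
    """Pick a distinct logical CPU for each GPU-app instance.

    The real apps would pass a mask like this to the Win32 call
    SetProcessAffinityMask(); the priority part corresponds to
    SetPriorityClass(NORMAL_PRIORITY_CLASS) plus
    SetThreadPriority(THREAD_PRIORITY_HIGHEST).
    """
    # One bit per logical CPU; each instance gets its own core, so
    # Windows can never pair two GPU feeder processes on one core.
    return 1 << (instance_index % n_cpus)

# Four instances on a quad: masks 0b0001, 0b0010, 0b0100, 0b1000.
masks = [affinity_mask(i, 4) for i in range(4)]
```

With four instances on a quad the masks tile the CPU set exactly, so no two instances can be scheduled onto the same core.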


[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 23 Dec 2009, 02:24:38 pm
> 20 downloads and still no feedback?  ???
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 23 Dec 2009, 09:52:41 pm
I have been running the previous build; I grabbed this one and will install it on the 8400GS...

I have to ask: is it the VLAR-kill or the non-kill version?

Edit: successfully transplanted into Main on these two hosts.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5133086 which is the 9800GT. I am noticing a bit of sluggishness on this machine doing a 0.40 AR, but then it is doing Aqua on the CPUs and SETI on the GPU. Still tolerable.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=2435134 which is the 8400GS (currently 4 hours in and 20% into a 0.008). So while the VLAR is sluggish, it is tolerable.

Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 24 Dec 2009, 03:18:38 am
1) It should be the VLAR-kill version (because everyone who is aware uses the rebranding tool, and for those who are unaware it's better not to meet VLARs at all ;) )
So I don't quite understand how you can do a VLAR with this build.
2) Unfortunately, it seems you have no target hardware (nor do I). On my 9400GT this build showed worse results in a standalone test than the previous build.
That is, it's probably not suitable for single-GPU configs. The affinity lock is implemented (and needed) solely for multi-GPU (and fast-GPU) hosts, where the initial CPU-based phase should be as small as possible...
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pepi on 24 Dec 2009, 09:40:10 am
Raistmer, I tried your new builds on both of my machines.
I know you didn't build them for my type of computer, but I tried them anyway and am now back to your "normal build".

Computer 1 is a Sempron 140 with a green 9800GT. With the new builds (both of them) every WU is about 50 sec slower out of 1600 sec (1650 sec versus 1600 sec with the "old build").
Computer 2 is an AMD quad with a GT240: both new builds are slower, but not as much as in the previous case.

I disabled network access, made a rar archive and tried a set of ten results, so I think it is a good comparative method.

Best regards for the holidays :)
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 24 Dec 2009, 11:29:19 am
Nothing new, it's for multi-GPU hardware.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 24 Dec 2009, 12:11:40 pm
On the 8400GS host it had 3 errors; not sure if it's because I have not rebooted in a couple of days (memory corruption).

AR 2.0 error in pulsefind

http://setiathome.berkeley.edu/result.php?resultid=1460163970
http://setiathome.berkeley.edu/result.php?resultid=1460163968
http://setiathome.berkeley.edu/result.php?resultid=1460163895
switching back
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 24 Dec 2009, 12:24:04 pm
Wow, time exceeded in pulsefind. That resembles my own 9400GT experiments, but I had an unspecified error...

And did you ever encounter the same error with previous builds?
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 24 Dec 2009, 03:40:45 pm
> Wow, time exceeded in pulsefind. That resembles my own 9400GT experiments, but I had an unspecified error...
>
> And did you ever encounter the same error with previous builds?

The 8400 was running the V12 nokill and was doing fine, other than a few inconclusives.
The 9800GT has no problems.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 25 Dec 2009, 03:36:37 am
Hm... interesting... I can't recall on whose PC the last released V12 was built - mine or Jason's? And what CUDA version was used at build time... Currently I build with the CUDA 2.3 SDK installed.
It would be good to separate the effect of the build environment from the priority/affinity changes themselves.
Since nothing in the data-processing path was changed, I see no other possible reason for such a change in behavior...
Ah, BTW, I made one more change in the last build (it reduced the executable size versus previous builds) - I dropped Volkov's FFT sources from the build since they are not used (it seems the CUDA compiler embeds kernel code into the executable even if the kernel isn't used by the program). This surely changed the alignment of the CUDA kernels. Whether that could be the reason for the timeout you've seen - no idea...
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Jason G on 25 Dec 2009, 04:17:14 am
The release V12 in the installer & the NoKill separate download? They were built on mine using 2.2 at the time (2.3 was not in common use), with that limit thing set to 2048 as directed by yourself & Joe.

No other changes from your sources at that time, around June 20th 2009 according to the build date on the exe at my end and the svn logs (though later experiments deviate quite a lot). I think it's possible the 2.3 SDK does build larger kernels, and the 2.3 DLLs are definitely larger, produce more stress, and use more video RAM. What effect this has on smaller cards I'm not entirely sure.

Later in the course of experimentation, as well as adding Joe's triplet kernel fixing stuff, I did introduce a constant definition in my experimental branch, called NUM_ITER, which reduces the length of the pulsefinding calls. But that definition isn't in those builds.

@Al, please tell me the creation date on the exe you used that worked well on the 8400GS, so I can pinpoint which parameters were used, and the corresponding svn revision.

Cheers, Jason
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 25 Dec 2009, 06:19:44 am
Thanks. Low-end GPUs are a borderline case (by the amount of memory available and by the length of the kernel calls), so they are especially sensitive to even the smallest changes between builds. I still have to understand why my own 9400GT works just fine in the Q9450 host but fails badly and often in the Core Duo and Athlon64 hosts...
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Jason G on 25 Dec 2009, 06:30:51 am
Hmm, yes, very confusing. Could you list the builds you've tried on the Athlon (I guess from what you say none work properly...)? At one time, before I went hybrid, I did lots of test builds with reduced pulse-finding blocks (NUM_ITER 5 IIRC); perhaps those would work in this case? While V13 would be interesting to try on that host, I don't think it'll help pinpoint the problem, since obviously the problem is in the CUDA code or the hardware somewhere. I'm thinking something to do with chipset/DMA transfers. I presume the mobo BIOS is up to date? Because there were some issues with PCIe on some mobos IIRC.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 25 Dec 2009, 11:59:34 am
> The release V12 in the installer & the NoKill separate download? They were built on mine using 2.2 at the time (2.3 was not in common use), with that limit thing set to 2048 as directed by yourself & Joe.
>
> No other changes from your sources at that time, around June 20th 2009 according to the build date on the exe at my end and the svn logs (though later experiments deviate quite a lot). I think it's possible the 2.3 SDK does build larger kernels, and the 2.3 DLLs are definitely larger, produce more stress, and use more video RAM. What effect this has on smaller cards I'm not entirely sure.
>
> Later in the course of experimentation, as well as adding Joe's triplet kernel fixing stuff, I did introduce a constant definition in my experimental branch, called NUM_ITER, which reduces the length of the pulsefinding calls. But that definition isn't in those builds.
>
> @Al, please tell me the creation date on the exe you used that worked well on the 8400GS, so I can pinpoint which parameters were used, and the corresponding svn revision.
>
> Cheers, Jason

6-19-2009

It was this message: SETI CUDA MB (http://lunatics.kwsn.net/gpu-crunching/seti-cuda-mb.msg18534.html#msg18534). So with this, the "Proposed 'Better?' medium term VLAR solution" started not long after that, around the 1st of July.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Jason G on 25 Dec 2009, 12:11:11 pm
> ....at around the 1st of July.

OK... that date matches up with my builds of 'bog standard V12' with FPLIM 2048 applied (amongst other values tested at the time), later committed on the 7th of July after proof with testing & mimo's profiling. (Prior assertions confirmed.)

@Raistmer: That corresponds to r93 in the CudaMB_exp branch, which you might like to compare to your r89, which it is based on. I don't see significant source changes amongst the experiments between those revisions, so I guess the SDK used might be the one remaining suspect.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 25 Dec 2009, 12:46:23 pm
On my GTX295+GTX260(216) rig, the x4 version makes the system sluggish.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Sutaru Tsureku on 25 Dec 2009, 03:07:03 pm
I made a test of CUDA_V12b_x4 on my AMD quad 940 BE & 4x OCed GTX260-216.

But the results don't look so good..  :-\
SETI@home/NC subforum/'eFMer Priority' - (CUDA) app priority change [Message 958712 (http://setiathome.berkeley.edu/forum_thread.php?id=56605&nowrap=true#958712)]


Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 25 Dec 2009, 06:42:23 pm
> On my GTX295+GTX260(216) rig, the x4 version makes the system sluggish.

What about the other one? And what about the CUDA MB performance itself?
Sluggishness can indeed be anticipated. The priority boost and the affinity lock can both negatively affect the GUI and other apps.
But what about performance?

@Jason
Can't say if all builds fail or not, because the "failed" build passed the KNA bench with PG* tasks just fine.
The host experienced random failures:
"-1" computational errors with an unspecified kernel launch failure (in pulsefind, always the same line 106), BSoDs, very sluggish behavior...
But all this is not directly related to the V12b build discussed in this thread, so I'd better continue my complaints about my own 9400GT in another thread :)
P.S. I very much agree with your suspicion about the PCI-E bus in that host.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 25 Dec 2009, 07:23:16 pm
@Raistmer It has been running on the 9800GT for over a day. I do see slight sluggishness while typing. Out of the ~120 results that have been returned, only 6-7 yielded "inconclusive." Fewer than half a dozen were VLAR, and about 6 were into HAR, which it handled nicely.

Without capturing all the data to run an average before and after, generally the numbers look about the same. As ARs are not "exactly the same," some small variance is seen from WU to WU. I attribute this to small variations in the AR and in which signals are processed. One would need several days' worth of data to get any real idea of improvement.

The only way I can think of to set up a proper test would be an MB bench processing a group of MB tasks to keep the CPUs busy, plus about 6 copies of the same CUDA WU (only one AR) in a CUDA bench, comparing stock, V12, V12b and this one. Then any variance is due to hardware conflicts/contention for the CPU.



Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 25 Dec 2009, 07:27:54 pm
Yes, I did such a test. But I have only 1 GPU installed now - no target hardware for this build.
And I still don't understand how you managed to process VLARs with it... Could you list some completed VLAR results?
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 25 Dec 2009, 07:47:14 pm
> Yes, I did such a test. But I have only 1 GPU installed now - no target hardware for this build.
> And I still don't understand how you managed to process VLARs with it... Could you list some completed VLAR results?

The 8400 had a VLAR in progress at the time I installed it; it did a fallback to CPU. And since it is running the nokill option, it is doing VLARs.

http://setiathome.berkeley.edu/results.php?hostid=2435134
and this task
http://setiathome.berkeley.edu/result.php?resultid=1458893819

So over the course of the week it has not done a lot of WUs, as only SETI CUDA is running on the GPU and NFS on the CPUs. It is also susceptible to errors.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 25 Dec 2009, 08:08:54 pm
Hm, are you talking about the V12b modification or the V12 one?
Which build did the VLARs? This thread is mainly about the V12b build for multi-GPU hosts.

[BTW:
"Device 1 : Device Emulation (CPU)"
That's already not a GPU but the CPU. So the app fell back from CPU-based GPU emulation to plain CPU code  :o  :o  :o Why was no GPU found?
]

EDIT: ah, I see now. That task was resumed by the V12b version...
After app init: total GPU memory 0    free GPU memory 0
VLAR WU (AR: 0.008794) detected, but task partially done already, continuing computations

Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 25 Dec 2009, 08:33:11 pm
In the next day or so I am going to set up a test on the 8400. I will set up an MB bench with a couple of identical WUs and launch it (both cores active), then a CUDA bench with a single AR copied about 6 times to run stock, V12 and V12b...

The numbers should be identical. What I expect to show up are changing times due to hardware conflicts/contention. This is what affects the low-end cards the most. The end result should give a truer picture of the speed improvement.

Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 26 Dec 2009, 03:13:42 am
GTX295+GTX260-216/i7 920

MB_6.08_mod_CUDA_V12b.exe
====================
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 26 seconds
MB_6.08_mod_CUDA_V12b.exe : 25 seconds
Speedup: 3.85%, Ratio: 1.04 x

WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 30 seconds
MB_6.08_mod_CUDA_V12b.exe : 32 seconds
Speedup: -6.67%, Ratio: 0.94 x

WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 32 seconds
MB_6.08_mod_CUDA_V12b.exe : 36 seconds
Speedup: -12.50%, Ratio: 0.89 x

WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 17 seconds
MB_6.08_mod_CUDA_V12b.exe : 17 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 31 seconds
MB_6.08_mod_CUDA_V12b.exe : 32 seconds
Speedup: -3.23%, Ratio: 0.97 x

WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 1 seconds
MB_6.08_mod_CUDA_V12b.exe : 1 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 19 seconds
MB_6.08_mod_CUDA_V12b.exe : 20 seconds
Speedup: -5.26%, Ratio: 0.95 x

MB_6.08_mod_CUDA_V12b_x4.exe
======================
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 25 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 25 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 30 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 31 seconds
Speedup: -3.33%, Ratio: 0.97 x

WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 32 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 36 seconds
Speedup: -12.50%, Ratio: 0.89 x

WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 18 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 17 seconds
Speedup: 5.56%, Ratio: 1.06 x

WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 31 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 32 seconds
Speedup: -3.23%, Ratio: 0.97 x

WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 1 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 1 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 19 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 20 seconds
Speedup: -5.26%, Ratio: 0.95 x
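For readers checking the tables above: the Speedup and Ratio columns follow directly from the two timings. A sketch of the arithmetic (my reconstruction; `compare` is a hypothetical name, not the bench script's actual function):

```python
def compare(base_secs, new_secs):
    """Return (speedup %, ratio) the way these bench tables report them:
    positive speedup means the new build is faster than the baseline."""
    speedup = (base_secs - new_secs) / base_secs * 100.0
    ratio = base_secs / new_secs
    return speedup, ratio

# testWU-1 above: 26 s (V12) vs 25 s (V12b) -> +3.85 %, 1.04x
s, r = compare(26, 25)
```

Note that when the new time rounds down to 0 seconds the ratio divides by zero, which is likely why one table below prints the odd `1.#J x` (apparently MSVC printf's rendering of an infinite value at two decimal places).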
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 26 Dec 2009, 03:23:51 am
Dual GTX260-216/i7 920

MB_6.08_mod_CUDA_V12b.exe
====================
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 23 seconds
MB_6.08_mod_CUDA_V12b.exe : 23 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 28 seconds
MB_6.08_mod_CUDA_V12b.exe : 28 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 30 seconds
MB_6.08_mod_CUDA_V12b.exe : 33 seconds
Speedup: -10.00%, Ratio: 0.91 x

WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 16 seconds
MB_6.08_mod_CUDA_V12b.exe : 16 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 28 seconds
MB_6.08_mod_CUDA_V12b.exe : 28 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 1 seconds
MB_6.08_mod_CUDA_V12b.exe : 0 seconds
Speedup: 100.00%, Ratio: 1.#J x

WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 18 seconds
MB_6.08_mod_CUDA_V12b.exe : 18 seconds
Speedup: 0.00%, Ratio: 1.00 x
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 26 Dec 2009, 04:35:57 am
@glennaxl
Unfortunately, the final bench output contains only CPU times; no elapsed times are provided.
Could you attach the log files from the TestDatas directory too, please?
And could you give some description of the environment used:
under what conditions was the bench run? Were some background CPU tasks enabled, or was only a single GPU working while the second GPU + the CPU cores sat idle during the test?

I gave some explanation of the idea behind the V12b mod on SETI main, but will try to explain it here again. A better understanding of the idea could lead to better benchmark configs.

Here is what we can see with previous versions on hosts running 2 or more CUDA MB processes plus 2, 4 or 8 (duo, quad or i7) CPU MB/AP processes (or CPU-based apps from other projects, it doesn't matter):
Windows can pair the 2 CUDA MB processes on a single core, leaving all the other cores to the CPU processes. This heavily increases CUDA MB initialization times and reduces performance.
One solution is to leave the CPU cores idle (i.e., not run CPU-based apps at all). But this reduces host performance.
Another one, implemented in V12b, restricts the cores available to the GPU apps, making them reside on different cores.
In the case of otherwise idle CPU cores it can/will reduce app performance (because when Windows uses a core for its own needs, it can't move the GPU process to another idle core). That's what we see in the standalone test when all the other cores/GPUs are idle.
The possible advantage of V12b can be highlighted (or it will be proved that there is no benefit at all ;) ) by measuring both elapsed and CPU times for the GPU apps in the following config:
for an i7 CPU:
8 KNA benches running with the same/different test tasks in separate directories, CPU MB app
+
2 KNA benches in separate directories running a) V12, b) V12b.
The difference between the timings of cases a) and b) could give valuable info.
But again, a fully loaded system is required for this test (!)

And one note: due to the variable nature of Windows scheduling decisions, case a) should be completed a few times (because sometimes the GPU apps (provided the number of GPUs is less than the number of cores) can be paired with each other on a single core, and sometimes not).

The order of the bench launches should be:
CPU apps first (!),
GPU apps second (GPU apps should be launched on an already busy system; this precisely emulates the usual BOINC state).
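As a back-of-envelope check on why case a) needs several repetitions (my own illustration, not from the thread): if the scheduler placed each GPU process on a uniformly random core, the chance of at least two landing on the same core is easy to compute.

```python
from fractions import Fraction

def pairing_probability(n_gpu_procs, n_cores):
    """Chance that an unconstrained scheduler puts at least two of the
    GPU processes on the same core, assuming uniform random placement
    (a simplification -- real Windows scheduling is load-driven)."""
    p_all_distinct = Fraction(1)
    for k in range(n_gpu_procs):
        # k cores are already taken by earlier GPU processes
        p_all_distinct *= Fraction(n_cores - k, n_cores)
    return 1 - p_all_distinct

# 2 GPU feeders on a quad: 1 - (4/4)*(3/4) = 1/4 of runs start paired.
p = pairing_probability(2, 4)
```

So on a quad, roughly one run in four would start with the two GPU feeders paired, which is exactly the variability the repeated runs are meant to average out.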
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 27 Dec 2009, 09:33:53 am
@Raistmer
As requested:
- created 11 KNA bench folders (8 CPU & 3 GPU)
- launched them, CPU first then GPU
Also:
- modified the script to run on a specific device (-device n, where n is the device number)
- tweaked the CPU-Z reporting, using the latest version for correct info

Code:
Quick timetable for GPU0 (gtx295)
 
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 29 seconds
MB_6.08_mod_CUDA_V12b.exe : 31 seconds
Speedup: -6.90%, Ratio: 0.94 x
MB_6.08_mod_CUDA_V12b_x4.exe : 37 seconds
Speedup: -27.59%, Ratio: 0.78 x
 
WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 32 seconds
MB_6.08_mod_CUDA_V12b.exe : 35 seconds
Speedup: -9.38%, Ratio: 0.91 x
MB_6.08_mod_CUDA_V12b_x4.exe : 42 seconds
Speedup: -31.25%, Ratio: 0.76 x
 
WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 34 seconds
MB_6.08_mod_CUDA_V12b.exe : 39 seconds
Speedup: -14.71%, Ratio: 0.87 x
MB_6.08_mod_CUDA_V12b_x4.exe : 39 seconds
Speedup: -14.71%, Ratio: 0.87 x
 
WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 30 seconds
MB_6.08_mod_CUDA_V12b.exe : 26 seconds
Speedup: 13.33%, Ratio: 1.15 x
MB_6.08_mod_CUDA_V12b_x4.exe : 24 seconds
Speedup: 20.00%, Ratio: 1.25 x
 
WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 34 seconds
MB_6.08_mod_CUDA_V12b.exe : 38 seconds
Speedup: -11.76%, Ratio: 0.89 x
MB_6.08_mod_CUDA_V12b_x4.exe : 32 seconds
Speedup: 5.88%, Ratio: 1.06 x
 
WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 4 seconds
MB_6.08_mod_CUDA_V12b.exe : 2 seconds
Speedup: 50.00%, Ratio: 2.00 x
MB_6.08_mod_CUDA_V12b_x4.exe : 3 seconds
Speedup: 25.00%, Ratio: 1.33 x
 
WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 23 seconds
MB_6.08_mod_CUDA_V12b.exe : 23 seconds
Speedup: 0.00%, Ratio: 1.00 x
MB_6.08_mod_CUDA_V12b_x4.exe : 26 seconds
Speedup: -13.04%, Ratio: 0.88 x

[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 27 Dec 2009, 01:40:52 pm
@glennaxl

When you go to post, you will notice a yellow "Additional Options" link; that is where you click to upload the file. When it opens you should see "Attach" and a "Browse" button that lets you find the file.

Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 27 Dec 2009, 01:55:02 pm
Okay, I have Aqua running on the X2 6000 using both cores...

Quick timetable
 
WU : 01-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.578 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.047 secs CPU
Speedup     : 20.09%
Ratio       : 1.25 x
MB_6.08_mod_CUDA_V12b.exe : 14.391 secs CPU
Speedup     : 18.13%
Ratio       : 1.22 x
 
WU : 02-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.703 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.781 secs CPU
Speedup     : 22.15%
Ratio       : 1.28 x
MB_6.08_mod_CUDA_V12b.exe : 14.250 secs CPU
Speedup     : 19.51%
Ratio       : 1.24 x
 
WU : 03-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.156 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.422 secs CPU
Speedup     : 15.94%
Ratio       : 1.19 x
MB_6.08_mod_CUDA_V12b.exe : 14.594 secs CPU
Speedup     : 14.93%
Ratio       : 1.18 x
 
WU : 04-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.375 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.422 secs CPU
Speedup     : 22.75%
Ratio       : 1.29 x
MB_6.08_mod_CUDA_V12b.exe : 14.953 secs CPU
Speedup     : 13.94%
Ratio       : 1.16 x
 
WU : 05-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.438 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.703 secs CPU
Speedup     : 16.64%
Ratio       : 1.20 x
MB_6.08_mod_CUDA_V12b.exe : 14.203 secs CPU
Speedup     : 13.60%
Ratio       : 1.16 x
 


[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Pappa on 27 Dec 2009, 02:23:26 pm
Actually, here is the more complete test on the X2 6000,

adding in the "CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe"

Quick timetable
 
WU : 01-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.938 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.297 secs CPU
Speedup     : -43.45%
Ratio       : 0.70 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.547 secs CPU
Speedup     : 20.02%
Ratio       : 1.25 x
MB_6.08_mod_CUDA_V12b.exe : 14.344 secs CPU
Speedup     : 15.31%
Ratio       : 1.18 x
 
WU : 02-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.703 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.938 secs CPU
Speedup     : -49.30%
Ratio       : 0.67 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.859 secs CPU
Speedup     : 17.03%
Ratio       : 1.21 x
MB_6.08_mod_CUDA_V12b.exe : 14.125 secs CPU
Speedup     : 15.43%
Ratio       : 1.18 x
 
WU : 03-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.984 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.359 secs CPU
Speedup     : -43.42%
Ratio       : 0.70 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.031 secs CPU
Speedup     : 17.39%
Ratio       : 1.21 x
MB_6.08_mod_CUDA_V12b.exe : 14.281 secs CPU
Speedup     : 15.91%
Ratio       : 1.19 x
 
WU : 04-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.609 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 25.094 secs CPU
Speedup     : -51.09%
Ratio       : 0.66 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.016 secs CPU
Speedup     : 15.61%
Ratio       : 1.19 x
MB_6.08_mod_CUDA_V12b.exe : 14.500 secs CPU
Speedup     : 12.70%
Ratio       : 1.15 x
 
WU : 05-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.094 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.984 secs CPU
Speedup     : -46.16%
Ratio       : 0.68 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.609 secs CPU
Speedup     : 20.39%
Ratio       : 1.26 x
MB_6.08_mod_CUDA_V12b.exe : 14.078 secs CPU
Speedup     : 17.64%
Ratio       : 1.21 x
 


[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 27 Dec 2009, 02:51:56 pm
OK, my own and Pappa's tests show that V12b is unneeded for single-GPU hosts.
That is, there is no advantage to using it on non-target hardware.
(http://img24.imageshack.us/img24/6010/v12bsinglegpu.png)
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 27 Dec 2009, 02:55:17 pm
@glennaxl
PM sent.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Sutaru Tsureku on 30 Dec 2009, 08:39:08 pm

Raistmer, I have one or two questions..  :-[

SETI@home/NC subforum/'eFMer Priority' - (CUDA) app priority change [Message 959115 (http://setiathome.berkeley.edu/forum_thread.php?id=56605&nowrap=true#959115)]

Which forum do you prefer, here or there?
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 31 Dec 2009, 04:46:15 am

> Raistmer, I have one or two questions..  :-[
>
> SETI@home/NC subforum/'eFMer Priority' - (CUDA) app priority change [Message 959115 (http://setiathome.berkeley.edu/forum_thread.php?id=56605&nowrap=true#959115)]
>
> Which forum do you prefer, here or there?

I have answered most of these questions a few times already.
Once more - no, Windows is not smart enough to do priority-based scheduling across different cores.
About the changes between V12b and V12 - most probably the small speed variations you see come just from the different binaries.
If you pay attention to the other tests published, you will see that this difference is quite variable.
Also, limiting the number of cores available to an app will inherently lead to some small slowdown (or not so small - it strongly depends on the system load pattern).
This was explained in some thread too. It is a kind of tradeoff: losing some speed under one set of conditions to gain a boost under another.
And once more: V12b is intended to fight possible x2 (or x3, if 3 GPU processes are launched on the same core) slowdowns. If you don't see such slowdowns with V12 now, there is no reason to use V12b.
About priority - I still haven't seen your tests without eFMer's priority tool for V12b. Do you see a difference in speed or not?
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 16 Jan 2010, 09:31:00 am
@glennaxl
could you attach the (zipped) logs from those 8 CPU benchmarks you ran together with the 3 or 2 GPUs, for the results you posted earlier? It will give the full picture; thanks in advance.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 16 Jan 2010, 10:03:32 am
> @glennaxl
> could you attach the (zipped) logs from those 8 CPU benchmarks you ran together with the 3 or 2 GPUs, for the results you posted earlier? It will give the full picture; thanks in advance.

Here it is. It's good that I still have those logs; I don't have to re-run them.

[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 16 Jan 2010, 10:27:06 am
Thanks a lot, I will see what I can learn from them :)
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 16 Jan 2010, 02:10:16 pm
> @glennaxl
> could you attach the (zipped) logs from those 8 CPU benchmarks you ran together with the 3 or 2 GPUs, for the results you posted earlier? It will give the full picture; thanks in advance.
>
> Here it is. It's good that I still have those logs; I don't have to re-run them.

Unfortunately, a very old version of the KNA bench was used. It only reports elapsed time, without CPU time, and only in whole seconds.
But let's see what picture we get even with such data...
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 16 Jan 2010, 03:06:37 pm
OK, results from one of glennaxl's hosts, the one with 2 GPUs:
(http://img687.imageshack.us/img687/223/glennax2602gpu8cpu.png)

What was expected: higher load on the first 4 CPUs. What was unexpected: sometimes a bigger load on the higher-numbered CPUs. Both groups are sometimes overloaded and sometimes not - it's strange.
EDIT:
Actually, because the CPU tasks run without affinity, the CPU assigned to a particular bench number can change from task to task. So the CPU results are completely expected! 4 cores always have a higher load than the other 4.
It would be interesting to test the x4 build on the same host. There I would expect only 2 overloaded cores instead of 4.

The elapsed times for the GPU apps don't allow choosing the best app, IMO.

If additional tests on that host are possible, here is what I would love to have:

1) The benchmark script replaced with something newer; possible samples are attached to this post. The lack of CPU times for the GPU apps is very sad.
2) test-wu6 can be excluded completely. It is VLAR-killed anyway.
3) No need for so much CPU work now. The GPU is loaded for only ~350 seconds while the CPU is loaded for ~1600 seconds total. If the CPU were loaded just slightly longer than the GPU, it would be fine for my purposes and save time for productive crunching :) (although there is nothing wrong with doing all the test WUs on the CPU too).
4) Slightly changed experiment conditions:
a) single GPU0 run, with no CPU load at all.
b) single GPU0 run with the CPU fully loaded, as here.
c) a separate (this is important) run for V12 with both GPUs + all CPUs loaded.
d) again, a separate run for V12b, both GPUs, all CPUs.
e) a separate run for V12b x4, both GPUs, all CPUs.

Is it possible to perform these additional tests?


[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 16 Jan 2010, 04:07:56 pm
And here is the data for the other host.
Looks like the third GPU likes the x4 build :)

(http://img690.imageshack.us/img690/9369/glennaxl2953gpu8cpu.png)

If possible, the same new set of tests would be very nice to have for this host too.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 23 Jan 2010, 10:08:14 am
Another round of tests, same rigs as before. A 3rd-rig test is coming, but somehow it's crashing on test B. The 3rd rig is a Q6600 with dual GTX 260s on a P55 chipset board.


Test Cases:
TEST A: 1 GPU, No CPU
TEST B: 1 GPU, 100% CPU
TEST C: v12 ALL GPU, 100% CPU
TEST D: v12b ALL GPU, 100% CPU
TEST E: v12 x4 ALL GPU, 100% CPU

TEST A:
1) a. GPU0 @ GTX295-CORE0 (v12 vs v12b)
   b. GPU0 @ GTX295-CORE0 (v12 vs v12b x4)
2) a. GPU1 @ GTX260 (v12 vs v12b)
   b. GPU1 @ GTX260 (v12 vs v12b x4)
3) a. GPU2 @ GTX295-CORE1 (v12 vs v12b)
   b. GPU2 @ GTX295-CORE1 (v12 vs v12b x4)

TEST B:
CPU0-7 @i7 920 (AKv8 vs AKv8b)
1) a. GPU0   @ GTX295-CORE0 (v12 vs v12b)
   b. GPU0   @ GTX295-CORE0 (v12 vs v12b x4)
2) a. GPU1   @ GTX260 (v12 vs v12b)
   b. GPU1   @ GTX260 (v12 vs v12b x4)
3) a. GPU2   @ GTX295-CORE1 (v12 vs v12b)
   b. GPU2   @ GTX295-CORE1 (v12 vs v12b x4)

TEST C:
1) GPU0 @ GTX295-CORE0 (stock609 vs v12)
   GPU1 @ GTX260 (stock609 vs v12)
   GPU2 @ GTX295-CORE1 (stock609 vs v12)
   CPU0-7 @i7 920 (AKv8 vs AKv8b)

TEST D:
1) GPU0 @ GTX295-CORE0 (v12 vs v12b)
   GPU1 @ GTX260 (v12 vs v12b)
   GPU2 @ GTX295-CORE1 (v12 vs v12b)
   CPU0-7 @i7 920 (AKv8 vs AKv8b)

TEST E:
1) GPU0 @ GTX295-CORE0 (v12 vs v12b x4)
   GPU1 @ GTX260 (v12 vs v12b x4)
   GPU2 @ GTX295-CORE1 (v12 vs v12b x4)
   CPU0-7 @i7 920 (AKv8 vs AKv8b)

[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 23 Jan 2010, 02:12:32 pm
The 3rd rig I mentioned - I upgraded from 195.62 to 196.21 and it fixed the issue.

It seems the speedup is smaller on the Q6600 than on the i7 920.

[attachment deleted by admin]
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 23 Jan 2010, 02:15:47 pm
Thanks a lot!
Will look at results.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: glennaxl on 23 Jan 2010, 06:25:39 pm
(http://img.techpowerup.org/100123/Capture.jpg)
Results are from Test D and E.
Title: Re: CUDA MB V12b for multi-GPU multicore hosts.
Post by: Raistmer on 23 Jan 2010, 06:36:25 pm
Looks like V12b makes some sense for hosts with 3 GPUs but not for hosts with only 2 GPUs. [Rig 3 has only 2 GPUs too...]
V12b x4 takes only 1 CPU, and on an i7 that means 2 instances can sit on the same physical core because of HyperThreading. That's sub-optimal of course, so the x4 results are almost always worse.
V12b takes 2 CPUs per instance, that is, always a full i7 core, but it uses only the first 4 CPUs, so again 3 instances will use 2 i7 cores instead of 3.
I will try to do some i7-related tuning, and maybe the results will become clearer...
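A hypothetical sketch of what such HyperThreading-aware tuning could look like (my own illustration, not a committed change; the numbering assumes logical CPUs 2k and 2k+1 share a physical core, which matches the pairing implied above but can vary by system):

```python
def ht_aware_mask(instance_index, n_logical, siblings_per_core=2):
    """Give each GPU-app instance a whole physical core (both HT
    siblings), so two instances never share one. Hypothetical helper:
    assumes logical CPUs 2k and 2k+1 are siblings."""
    n_cores = n_logical // siblings_per_core
    core = instance_index % n_cores
    mask = 0
    for s in range(siblings_per_core):
        # set the bits of every logical CPU belonging to this core
        mask |= 1 << (core * siblings_per_core + s)
    return mask

# 3 instances on an 8-thread i7: 0b00000011, 0b00001100, 0b00110000
masks = [ht_aware_mask(i, 8) for i in range(3)]
```

Three instances then land on three distinct physical cores, instead of being squeezed into the first four logical CPUs as the current V12b lock does.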