Author Topic: CUDA MB V12b for multi-GPU multicore hosts. (Read 37670 times)

glennaxl · « **Reply #15 on:** 25 Dec 2009, 12:46:23 pm »

On my GTX295+GTX260(216) Rig, x4 version makes the system sluggish.

Sutaru Tsureku · « **Reply #16 on:** 25 Dec 2009, 03:07:03 pm »

I made a test of CUDA_V12b_x4 on my AMD QUAD 940 BE & 4x OCed GTX260-216.

But the results don't look so good. :-\
SETI@home/NC subforum/'eFMer Priority' - (CUDA) app priority change [Message 958712]

Raistmer · « **Reply #17 on:** 25 Dec 2009, 06:42:23 pm »

Quote from: glennaxl on 25 Dec 2009, 12:46:23 pm

On my GTX295+GTX260(216) Rig, x4 version makes the system sluggish.

What about the other one? And what about the performance of CUDA MB itself?
Sluggishness can be anticipated indeed. Priority boost and affinity lock: both of these changes can negatively affect the GUI and other apps.
But what about performance?

@Jason
I can't say whether all builds fail or not, because the "failed" build passed the KNA bench with PG* tasks just fine.
The host experienced random failures:
"-1" computational errors with unspecified kernel launch failure (in pulsefind, always the same line 106), BSoDs, very sluggish behavior...
But all of this is not directly related to the V12b build discussed in this thread, so I had better continue my complaints about my own 9400GT in another thread.

P.S. I very much agree with your suspicion about the PCI-E bus in that host.

Pappa · « **Reply #18 on:** 25 Dec 2009, 07:23:16 pm »

@Raistmer It has been running on the 9800GT for over a day. I do see a slight sluggishness while typing. Out of the ~120 results that have been returned, only 6-7 yielded "inconclusive." Fewer than half a dozen were VLAR, and about 6 were into HAR, which it handled nicely.

Without capturing all the data to run an average before and after, the numbers generally look about the same. As the ARs are not "exactly the same," some small variance is seen from WU to WU. I attribute this to small variations in the AR and in which signals are processed. One would need several days' worth of data to get any real idea of improvement.

The only way I can really think of to set up a proper test would be to set up an MB Bench processing a group of MB tasks to keep the CPUs busy, then make about 6 copies (only one AR) of the same CUDA WU in the CUDA Bench, and compare Stock, V12, V12b and this one. Then any variance is due to hardware conflicts/contention for the CPU.

Raistmer · « **Reply #19 on:** 25 Dec 2009, 07:27:54 pm »

Yes, I did such a test. But I have only 1 GPU installed now, so I have no target hardware for this build.
And I still don't understand how you manage to process VLARs with it... Could you list some completed VLAR results?

Pappa · « **Reply #20 on:** 25 Dec 2009, 07:47:14 pm »

Quote from: Raistmer on 25 Dec 2009, 07:27:54 pm

Yes, I did such a test. But I have only 1 GPU installed now, so I have no target hardware for this build.
And I still don't understand how you manage to process VLARs with it... Could you list some completed VLAR results?

The 8400 had a VLAR in progress at the time I installed it, and it did a fallback to CPU. As it is running the nokill option, it is doing VLARs.

http://setiathome.berkeley.edu/results.php?hostid=2435134
and this task
http://setiathome.berkeley.edu/result.php?resultid=1458893819

So over the course of the week it has not done a lot of WUs, as only SETI CUDA is running on the GPU and NFS on the CPUs. It is also susceptible to errors.

Raistmer · « **Reply #21 on:** 25 Dec 2009, 08:08:54 pm »

Hm, are you talking about the V12b modification or the V12 one?
Which build did the VLARs? This thread is mainly about the V12b build for multi-GPU hosts.

[BTW,
"Device 1 : Device Emulation (CPU)
"
That is already not a GPU but the CPU. The app then fell back from CPU-based GPU emulation to plain CPU code.

Why was no GPU found?
]

EDIT: ah, I see now. That task was resumed by V12b version...
After app init: total GPU memory 0 free GPU memory 0
VLAR WU (AR: 0.008794 )detected, but task partially done already, continuing computations

Pappa · « **Reply #22 on:** 25 Dec 2009, 08:33:11 pm »

In the next day or so I am going to set up a test on the 8400. I will set up an MB Bench with a couple of the same WUs and launch it (both cores active). Then a CUDA Bench with a single AR copied about 6 times to run Stock, V12 and V12b...

The numbers should be identical. What I expect to show up is changing times due to hardware conflicts/contention. This is what would affect the low end cards mostly. The end result should show a truer picture of speed improvement.

glennaxl · « **Reply #23 on:** 26 Dec 2009, 03:13:42 am »

GTX295+GTX260-216/i7 920

MB_6.08_mod_CUDA_V12b.exe
====================
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 26 seconds
MB_6.08_mod_CUDA_V12b.exe : 25 seconds
Speedup: 3.85%, Ratio: 1.04 x

WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 30 seconds
MB_6.08_mod_CUDA_V12b.exe : 32 seconds
Speedup: -6.67%, Ratio: 0.94 x

WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 32 seconds
MB_6.08_mod_CUDA_V12b.exe : 36 seconds
Speedup: -12.50%, Ratio: 0.89 x

WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 17 seconds
MB_6.08_mod_CUDA_V12b.exe : 17 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 31 seconds
MB_6.08_mod_CUDA_V12b.exe : 32 seconds
Speedup: -3.23%, Ratio: 0.97 x

WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 1 seconds
MB_6.08_mod_CUDA_V12b.exe : 1 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 19 seconds
MB_6.08_mod_CUDA_V12b.exe : 20 seconds
Speedup: -5.26%, Ratio: 0.95 x

MB_6.08_mod_CUDA_V12b_x4.exe
======================
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 25 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 25 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 30 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 31 seconds
Speedup: -3.33%, Ratio: 0.97 x

WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 32 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 36 seconds
Speedup: -12.50%, Ratio: 0.89 x

WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 18 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 17 seconds
Speedup: 5.56%, Ratio: 1.06 x

WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 31 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 32 seconds
Speedup: -3.23%, Ratio: 0.97 x

WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 1 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 1 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 19 seconds
MB_6.08_mod_CUDA_V12b_x4.exe : 20 seconds
Speedup: -5.26%, Ratio: 0.95 x

glennaxl · « **Reply #24 on:** 26 Dec 2009, 03:23:51 am »

Dual GTX260-216/i7 920

MB_6.08_mod_CUDA_V12b.exe
====================
WU : testWU-1.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 23 seconds
MB_6.08_mod_CUDA_V12b.exe : 23 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-2.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 28 seconds
MB_6.08_mod_CUDA_V12b.exe : 28 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-3.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 30 seconds
MB_6.08_mod_CUDA_V12b.exe : 33 seconds
Speedup: -10.00%, Ratio: 0.91 x

WU : testWU-4.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 16 seconds
MB_6.08_mod_CUDA_V12b.exe : 16 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-5.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 28 seconds
MB_6.08_mod_CUDA_V12b.exe : 28 seconds
Speedup: 0.00%, Ratio: 1.00 x

WU : testWU-6.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 1 seconds
MB_6.08_mod_CUDA_V12b.exe : 0 seconds
Speedup: 100.00%, Ratio: 1.#J x

WU : testWU-7.wu
MB_6.08_CUDA_V12_VLARKill_FPLim2048.exe : 18 seconds
MB_6.08_mod_CUDA_V12b.exe : 18 seconds
Speedup: 0.00%, Ratio: 1.00 x
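
The Speedup and Ratio figures in these tables are consistent with a simple pair of formulas: speedup is the time saved relative to the reference build, and ratio is reference time divided by candidate time. The formulas below are inferred from the posted numbers, not taken from the bench script itself, and the function name is hypothetical. The sketch also shows why a 0-second result gives a 100% speedup and an infinite ratio; the "1.#J" in the testWU-6 row appears to be how the Microsoft C runtime of that era printed infinity when rounded to two decimals.

```python
import math

def speedup_and_ratio(reference_s: float, candidate_s: float) -> tuple[float, float]:
    """Return (speedup in percent, ratio) of a candidate time versus a reference time."""
    # Assumed formula: positive speedup means the candidate used less time.
    speedup_pct = (reference_s - candidate_s) / reference_s * 100.0
    # A 0-second candidate time would divide by zero, so report infinity instead.
    ratio = reference_s / candidate_s if candidate_s > 0 else math.inf
    return speedup_pct, ratio

# Rows taken from the tables above: testWU-1, testWU-2, and the 1 s vs 0 s case.
for ref, cand in [(26, 25), (30, 32), (1, 0)]:
    pct, ratio = speedup_and_ratio(ref, cand)
    print(f"Speedup: {pct:.2f}%, Ratio: {ratio:.2f} x")
```

With whole-second resolution, a 1-second difference on a 17 to 30 second task already moves the speedup by several percent, so single rows should be read with that granularity in mind.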

Raistmer · « **Reply #25 on:** 26 Dec 2009, 04:35:57 am »

@glennaxl
Unfortunately, the final bench output contains only CPU times; no elapsed times are provided.
Could you attach the log files from the TestDatas directory too, please?
And could you describe the environment used:
under what conditions was the bench running? Were some background CPU tasks enabled, or was only a single GPU working while the second GPU and the CPU cores sat idle during the test?

I gave some explanation of the idea behind the V12b mod on SETI main, but will try to explain it here again. A better understanding of the idea could lead to better benchmark configs.

What we can see with previous versions on hosts running 2 or more CUDA MB processes and 2, 4 or 8 (duo, quad or i7) CPU MB/AP processes (or CPU-based apps from other projects, it doesn't matter):
Windows can pair 2 CUDA MB processes on a single core, leaving all other cores for CPU processes. This will heavily increase CUDA MB initialization times and reduce performance.
One solution is to leave CPU cores idle (i.e., not running CPU-based apps at all). But this will reduce host performance.
Another one, implemented in V12b, is to restrict the cores available to each GPU app so that they reside on different cores.
With otherwise idle CPU cores this can/will reduce app performance (because when Windows uses that core for its own needs, it can't move the GPU process to another idle core). That's what we see in a standalone test when all other cores/GPUs are idle.
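
The core-restriction idea can be illustrated with affinity bitmasks. This is only a minimal sketch: the thread does not show how V12b actually picks cores, so mapping device index to core index is an assumption, and `assign_affinity_masks` is a hypothetical name. On Windows, a mask of this shape is the kind of value `SetProcessAffinityMask` accepts; whether V12b uses that call is not stated here.

```python
def assign_affinity_masks(gpu_count: int, core_count: int) -> list[int]:
    """Give each GPU process a single-core bitmask, spread over distinct cores."""
    if gpu_count < 1 or core_count < 1:
        raise ValueError("gpu_count and core_count must be positive")
    # Bit i of a mask means "allowed to run on logical core i".
    # Assumed policy: device n goes to core n; with more GPUs than cores
    # the assignment wraps around and sharing becomes unavoidable.
    return [1 << (device % core_count) for device in range(gpu_count)]

# Three GPU processes on an 8-thread i7: each one is pinned to its own core.
masks = assign_affinity_masks(gpu_count=3, core_count=8)
print([f"{m:#010b}" for m in masks])
```

Pinning guarantees that two GPU feeders never share a core, at the cost that the scheduler can no longer move a pinned process away when something else grabs its core.
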
The possible advantage of V12b can be highlighted (or it will be proved that there is no benefit at all) by measuring both elapsed and CPU times of the GPU app in the following config:
for an i7 CPU:
8 KNA benches running the CPU MB app with the same/different test tasks in separate directories
+
2 KNA benches in separate directories running a) V12 b) V12b.
A possible difference between the timings in cases a) and b) could give valuable info.
But again, a fully loaded system is required for this test (!)

And a note: due to the variable nature of Windows scheduling decisions, a) should be run a few times, because sometimes the GPU apps (provided the number of GPUs is less than the number of cores) get paired with each other on a single core and sometimes not.

The order of bench launches should be:
CPU apps first (!),
GPU apps second (GPU apps should be launched on an already busy system; this precisely emulates the usual BOINC state).
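
The launch order described above can be written down as a small plan builder. This is a sketch only: the directory names and the `cpu_mb_app` label are hypothetical placeholders, and nothing is actually started; the point is just that every CPU bench precedes every GPU bench, and that the same plan is built once per build under test.

```python
def build_launch_plan(cpu_benches: int, gpu_benches: int, gpu_app: str) -> list[tuple[str, str]]:
    """Return (directory, app) pairs in launch order: CPU benches first, GPU benches last."""
    # Hypothetical directory names; each bench gets its own separate directory.
    plan = [(f"bench_cpu_{i}", "cpu_mb_app") for i in range(cpu_benches)]
    plan += [(f"bench_gpu_{i}", gpu_app) for i in range(gpu_benches)]
    return plan

# i7 case from the post: 8 CPU benches plus 2 GPU benches, for case a) and case b).
for build in ("V12", "V12b"):
    plan = build_launch_plan(cpu_benches=8, gpu_benches=2, gpu_app=build)
    print(build, [directory for directory, _ in plan])
```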

glennaxl · « **Reply #26 on:** 27 Dec 2009, 09:33:53 am »

@Raistmer
As requested:
-created 11 KNA bench folders (8 CPU & 3 GPU)
-launched them, CPU first then GPU
also
-modified the script to run on a specific device (-device n, where n is the device number)
-tweaked the CPU-Z reporting using the latest version for correct info

Code:

Quick timetable for GPU0 (gtx295)
 
WU : testWU-1.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 29 seconds 
MB_6.08_mod_CUDA_V12b.exe : 31 seconds 
Speedup: -6.90%, Ratio: 0.94 x
MB_6.08_mod_CUDA_V12b_x4.exe : 37 seconds 
Speedup: -27.59%, Ratio: 0.78 x
 
WU : testWU-2.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 32 seconds 
MB_6.08_mod_CUDA_V12b.exe : 35 seconds 
Speedup: -9.38%, Ratio: 0.91 x
MB_6.08_mod_CUDA_V12b_x4.exe : 42 seconds 
Speedup: -31.25%, Ratio: 0.76 x
 
WU : testWU-3.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 34 seconds 
MB_6.08_mod_CUDA_V12b.exe : 39 seconds 
Speedup: -14.71%, Ratio: 0.87 x
MB_6.08_mod_CUDA_V12b_x4.exe : 39 seconds 
Speedup: -14.71%, Ratio: 0.87 x
 
WU : testWU-4.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 30 seconds 
MB_6.08_mod_CUDA_V12b.exe : 26 seconds 
Speedup: 13.33%, Ratio: 1.15 x
MB_6.08_mod_CUDA_V12b_x4.exe : 24 seconds 
Speedup: 20.00%, Ratio: 1.25 x
 
WU : testWU-5.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 34 seconds 
MB_6.08_mod_CUDA_V12b.exe : 38 seconds 
Speedup: -11.76%, Ratio: 0.89 x
MB_6.08_mod_CUDA_V12b_x4.exe : 32 seconds 
Speedup: 5.88%, Ratio: 1.06 x
 
WU : testWU-6.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 4 seconds 
MB_6.08_mod_CUDA_V12b.exe : 2 seconds 
Speedup: 50.00%, Ratio: 2.00 x
MB_6.08_mod_CUDA_V12b_x4.exe : 3 seconds 
Speedup: 25.00%, Ratio: 1.33 x
 
WU : testWU-7.wu 
MB_6.08_CUDA_V12_VLARKill_FPLim2048_test.exe : 23 seconds 
MB_6.08_mod_CUDA_V12b.exe : 23 seconds 
Speedup: 0.00%, Ratio: 1.00 x
MB_6.08_mod_CUDA_V12b_x4.exe : 26 seconds 
Speedup: -13.04%, Ratio: 0.88 x

[attachment deleted by admin]
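
Because the per-WU rows swing in both directions, summing the seven times per build gives a rougher but steadier comparison. The sketch below uses only the GPU0 figures posted above; the helper name is hypothetical. Each build was run once per WU here, and the bench summary may report CPU time rather than elapsed time, as noted earlier in the thread, so treat the totals as indicative only.

```python
# Times in seconds, copied from the GPU0 (GTX295) timetable, testWU-1 to testWU-7.
times = {
    "V12":     [29, 32, 34, 30, 34, 4, 23],
    "V12b":    [31, 35, 39, 26, 38, 2, 23],
    "V12b_x4": [37, 42, 39, 24, 32, 3, 26],
}

def total_change_pct(reference: list[int], candidate: list[int]) -> float:
    """Percent change in summed time versus the reference (negative means slower)."""
    return (sum(reference) - sum(candidate)) / sum(reference) * 100.0

for name in ("V12b", "V12b_x4"):
    change = total_change_pct(times["V12"], times[name])
    print(f"{name}: total {sum(times[name])} s, {change:+.2f}% vs V12 ({sum(times['V12'])} s)")
```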

Pappa · « **Reply #27 on:** 27 Dec 2009, 01:40:52 pm »

@glennaxl

When you go to post, you will notice a yellow "Additional Options"; that is where you click to upload the file. When it opens you should see Attach and a "Browse" button that lets you find the file.

Pappa · « **Reply #28 on:** 27 Dec 2009, 01:55:02 pm »

Okay, I have Aqua running on the X2 6000 using both cores...

Quick timetable

WU : 01-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.578 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.047 secs CPU
Speedup : 20.09%
Ratio : 1.25 x
MB_6.08_mod_CUDA_V12b.exe : 14.391 secs CPU
Speedup : 18.13%
Ratio : 1.22 x

WU : 02-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.703 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.781 secs CPU
Speedup : 22.15%
Ratio : 1.28 x
MB_6.08_mod_CUDA_V12b.exe : 14.250 secs CPU
Speedup : 19.51%
Ratio : 1.24 x

WU : 03-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.156 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.422 secs CPU
Speedup : 15.94%
Ratio : 1.19 x
MB_6.08_mod_CUDA_V12b.exe : 14.594 secs CPU
Speedup : 14.93%
Ratio : 1.18 x

WU : 04-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.375 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.422 secs CPU
Speedup : 22.75%
Ratio : 1.29 x
MB_6.08_mod_CUDA_V12b.exe : 14.953 secs CPU
Speedup : 13.94%
Ratio : 1.16 x

WU : 05-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.438 secs CPU
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.703 secs CPU
Speedup : 16.64%
Ratio : 1.20 x
MB_6.08_mod_CUDA_V12b.exe : 14.203 secs CPU
Speedup : 13.60%
Ratio : 1.16 x

[attachment deleted by admin]

Pappa · « **Reply #29 on:** 27 Dec 2009, 02:23:26 pm »

Actually the more complete test on the X2 6000

adding in the "CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe"

Quick timetable

WU : 01-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.938 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.297 secs CPU
Speedup : -43.45%
Ratio : 0.70 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.547 secs CPU
Speedup : 20.02%
Ratio : 1.25 x
MB_6.08_mod_CUDA_V12b.exe : 14.344 secs CPU
Speedup : 15.31%
Ratio : 1.18 x

WU : 02-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.703 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.938 secs CPU
Speedup : -49.30%
Ratio : 0.67 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.859 secs CPU
Speedup : 17.03%
Ratio : 1.21 x
MB_6.08_mod_CUDA_V12b.exe : 14.125 secs CPU
Speedup : 15.43%
Ratio : 1.18 x

WU : 03-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.984 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.359 secs CPU
Speedup : -43.42%
Ratio : 0.70 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.031 secs CPU
Speedup : 17.39%
Ratio : 1.21 x
MB_6.08_mod_CUDA_V12b.exe : 14.281 secs CPU
Speedup : 15.91%
Ratio : 1.19 x

WU : 04-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 16.609 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 25.094 secs CPU
Speedup : -51.09%
Ratio : 0.66 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 14.016 secs CPU
Speedup : 15.61%
Ratio : 1.19 x
MB_6.08_mod_CUDA_V12b.exe : 14.500 secs CPU
Speedup : 12.70%
Ratio : 1.15 x

WU : 05-FMN0446.wu
setiathome_6.08_windows_intelx86__cuda.exe : 17.094 secs CPU
CUDAMB_V13noKill_ICCIPP_SSE3_AKPFTest_TK4.exe : 24.984 secs CPU
Speedup : -46.16%
Ratio : 0.68 x
MB_6.08_CUDA_V12_noKill_FPLim2048.exe : 13.609 secs CPU
Speedup : 20.39%
Ratio : 1.26 x
MB_6.08_mod_CUDA_V12b.exe : 14.078 secs CPU
Speedup : 17.64%
Ratio : 1.21 x

[attachment deleted by admin]
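
If the five WUs are copies of the same task, as the file names and the earlier test plan suggest (an assumption, since the post does not say so explicitly), then the spread across copies is a rough measure of run-to-run noise from contention. A minimal sketch using only the CPU times posted above; the short app labels and the `summarize` helper are hypothetical. These are CPU seconds, not elapsed seconds, so they describe CPU cost rather than throughput.

```python
from statistics import mean, stdev

# CPU seconds for WUs 01..05-FMN0446, copied from the table above.
cpu_secs = {
    "stock 6.08": [16.938, 16.703, 16.984, 16.609, 17.094],
    "V13 noKill": [24.297, 24.938, 24.359, 25.094, 24.984],
    "V12 noKill": [13.547, 13.859, 14.031, 14.016, 13.609],
    "V12b":       [14.344, 14.125, 14.281, 14.500, 14.078],
}

def summarize(samples: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for one app's repeated runs."""
    return mean(samples), stdev(samples)

for app, samples in cpu_secs.items():
    avg, spread = summarize(samples)
    print(f"{app:12s} mean {avg:.3f} s  stdev {spread:.3f} s")
```

Comparing the gap between two means with the standard deviations gives a quick sense of whether a difference such as V12 versus V12b is larger than the noise between identical runs.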
