It would be good to be more specific about what CPU and GPU are being compared. I believe that the general shape of the curves might apply to all such pairs, but a slow CPU and fast GPU or vice versa would make a significant difference in the ratios. IOW, I think there could be systems where it is always better to do the work on GPU even if that requires most of the CPU power to support, and also systems where the GPU provides only some small productivity increment because both are used but GPU takes a negligible amount of CPU support. Joe
...Joe, if you're watching, is there any way of knowing how many times these big pulse PoT arrays are processed during a run at different ARs? If there's a massive array at 40960, but it's only run once, the problem is much less than the arrays at 32768 which seem to run forever at VLAR....
FFTLen    Stepsize   NumCfft
------  ----------   -------
     8   17.072753        11
    16    8.536377        23
    32    4.268188        47
    64    2.134094        93
   128    1.067047       187
   256    0.533524       375
   512    0.266762       749
  1024    0.133381      1499
  2048    0.066690      2999
  4096    0.033345      5997
  8192    0.016673     11995
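The Stepsize column appears to scale exactly as 1/FFTLen: each doubling of the FFT length halves the step, and NumCfft roughly doubles to cover the same span. A minimal Python sketch (my reading of the table, taking only the first row as input) reproduces the column:

```python
# Reconstructing the Stepsize column under the assumption (mine, from
# inspecting the table) that stepsize scales exactly as 1/FFTLen.
BASE_FFTLEN = 8
BASE_STEPSIZE = 17.072753  # first row of the table above

def stepsize(fftlen: int) -> float:
    """Stepsize for a given FFT length, assuming exact 1/FFTLen scaling."""
    return BASE_STEPSIZE * BASE_FFTLEN / fftlen

for n in range(11):  # FFTLen 8 .. 8192
    fftlen = BASE_FFTLEN << n
    print(f"{fftlen:6d}  {stepsize(fftlen):10.6f}")
```

The printed values match the table to the six decimal places shown, which suggests the scaling really is a strict inverse relationship rather than something measured per run.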
FFTLen  AR<=0.05       AR=0.08         AR=0.16
------  -------------  --------------  -------------
     8  462(@40960)
    16  1035(@40960)   2070(@20480)
    32  1457(@32768)   4371(@20480)    8742(@10240)
    64  5859(@16384)   17577(@10240)   35154(@5120)
   128  23749(@8192)   71247(@5120)    142494(@2560)
   256  95625(@4096)   286875(@2560)   573750(@1280)
   512  382739(@2048)  1148217(@1280)  2296434(@640)
  1024  1.53e6(@1024)  4.6e6(@640)     9.2e6(@320)
  2048  6.14e6(@512)   1.84e7(@320)    3.68e7(@160)
  4096  2.46e7(@256)   7.37e7(@160)    1.47e8(@80)
  8192  9.83e7(@128)   2.95e8(@80)     5.9e8(@40)
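This table bears on the earlier question about how often the big pulse PoT arrays actually get touched. Taking the AR<=0.05 column at face value (my interpretation, which may be wrong: the first number is the total volume of pulse-PoT data processed at that FFT length, and the @ value is the PoT array length), a quick sketch shows where the VLAR pulse work concentrates:

```python
# AR<=0.05 column from the table above, as (FFTLen, elements, PoT_len).
# "elements" = total pulse-PoT data volume at that FFT length and
# "@" = PoT array length -- both interpretations are assumptions of mine.
vlar = [
    (8,        462, 40960),
    (16,      1035, 40960),
    (32,      1457, 32768),
    (64,      5859, 16384),
    (128,    23749,  8192),
    (256,    95625,  4096),
    (512,   382739,  2048),
    (1024,  1.53e6,  1024),
    (2048,  6.14e6,   512),
    (4096,  2.46e7,   256),
    (8192,  9.83e7,   128),
]

total = sum(n for _, n, _ in vlar)
for fftlen, n, potlen in vlar:
    print(f"FFTLen {fftlen:5d} (PoT @{potlen:5d}): {100 * n / total:8.4f}% of total")
```

If that reading is right, the massive @32768 and @40960 arrays account for only a tiny fraction of the total VLAR data volume, while the short @128 PoTs at FFTLen 8192 dominate; the big arrays may still be expensive per invocation, but they are not where the bulk of the data goes.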
Also, another thought. Several people (myself included) have observed and reported that VHAR tasks don't scale well on multi-core hosts: there's a big penalty for running 8 x VHAR on my dual Xeon (memory bus saturation, even with quad-channel FB-DIMMs, we think). Your efficiency cross-over point may be different if you measure with all CPU cores saturated with VHAR work - which would likely be the case after re-branding.
Raistmer, could you provide the raw data from which your chart was derived? I'd like to be able to correlate things better. Joe
That is, no such action should be taken for a 10-day cache at all. Better would be for such rebranding to happen on a regular basis, fairly often but in pretty small chunks, so as not to skew BOINC's work-amount estimate too much each time.
...In order to re-brand tasks, you have to shut down and restart the BOINC core client: and that's one of the most inefficient operations around....
The first one I tried, BOINC (as expected) immediately fetched more work to re-fill the gaps in the CUDA queue caused by the re-branding. I got mostly VLAR. So I re-branded that, and BOINC replaced it - with VHAR. So I re-branded that, and .... you get the picture.