Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: Raistmer on 17 Apr 2009, 03:28:54 pm

Title: GPU/CPU performance dependence on AR value
Post by: Raistmer on 17 Apr 2009, 03:28:54 pm
I did some extensive benchmarking of artificial tasks that differ only in AR value.
The GPU/CPU performance ratio depends on AR non-monotonically, as shown in the next picture:
(http://img136.imageshack.us/img136/7830/gpucpuperformance.th.png) (http://img136.imageshack.us/my.php?image=gpucpuperformance.png)

That is, not only VLARs but all tasks with AR greater than 1.127 (to the precision of this experiment) experience a performance drop when executed on the GPU (relative to their execution on the CPU).
Also, the VLAR limit can be refined from this graph.

It's worth doing GPU->CPU "rebranding" for low and high ARs while doing CPU->GPU "rebranding" for midrange ARs.
This can substantially increase host performance.

(For the performance ratio estimation two metrics were used:
1) the ratio of CPU elapsed time to GPU elapsed time
2) the ratio of CPU elapsed time to (GPU elapsed time + CPU time of the GPU task).
I think the second metric is more accurate, because when a GPU task actively uses the CPU it leaves the GPU idle and interferes with the task being processed on the CPU, lowering not only its own performance but the performance of the other task too. That's why I sum the GPU task's total elapsed time and its CPU time.)
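
(As a worked sketch of the two metrics - the timings here are hypothetical examples, not measured data:)
Code: [Select]
use strict;
use warnings;

my $cpu_elapsed  = 3600;  # task run entirely on the CPU (s), hypothetical
my $gpu_elapsed  = 900;   # same task run on the GPU (s), hypothetical
my $gpu_cpu_time = 120;   # CPU time consumed supporting the GPU run (s)

# Metric 1 ignores the CPU time the GPU task consumed.
printf "metric 1: %.2f\n", $cpu_elapsed / $gpu_elapsed;                    # 4.00
# Metric 2 charges the GPU task for the CPU it took from the CPU task.
printf "metric 2: %.2f\n", $cpu_elapsed / ($gpu_elapsed + $gpu_cpu_time);  # 3.53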
Title: Re: GPU/CPU performance dependence on AR value
Post by: Gecko_R7 on 17 Apr 2009, 04:00:21 pm
Your results show a valid reason for better task management and WU allocation/assignment between CPU and GPU.
A "legitimate" use for cherry-picking, so to speak, that would be a net (+) for the project as well as the user.
This is a good study and a direction worth exploring further.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Josef W. Segur on 17 Apr 2009, 04:47:11 pm
It would be good to be more specific about what CPU and GPU are being compared. I believe that the general shape of the curves might apply to all such pairs, but a slow CPU and fast GPU or vice versa would make a significant difference in the ratios. IOW, I think there could be systems where it is always better to do the work on GPU even if that requires most of the CPU power to support, and also systems where the GPU provides only some small productivity increment because both are used but GPU takes a negligible amount of CPU support.
                                                                         Joe
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 17 Apr 2009, 04:52:49 pm
Quote
It would be good to be more specific about what CPU and GPU are being compared. I believe that the general shape of the curves might apply to all such pairs, but a slow CPU and fast GPU or vice versa would make a significant difference in the ratios. IOW, I think there could be systems where it is always better to do the work on GPU even if that requires most of the CPU power to support, and also systems where the GPU provides only some small productivity increment because both are used but GPU takes a negligible amount of CPU support.
                                                                         Joe
The CPU and GPU are listed in the picture footnotes.
Just duplicating here:
Host: CPU Q9450 at stock frequency, GPU 9600GSO at 705/1900/2220 MHz

Completely agreed, absolute numbers will be different for different GPU/CPU pairs, but I think for any given CPU/GPU pair one range of ARs will be better to do on the GPU and another on the CPU.
Moreover, these AR ranges will be the same as for my config. That is, though the Y-axis values depend on the system used in the test, the X-axis depends only on the GPU and CPU architecture differences themselves and the current CPU and CUDA implementations of the SETI MB algorithm. One could renormalize the Y-axis to 1 and get a normalized GPU/CPU performance ratio picture.

About the extreme cases:
1) The GPU is very slow and the CPU is very fast. One has the choice not to use the GPU at all, but if it is used, it's better to use it in its most effective range of ARs. This range separation will still apply.
2) The GPU(s) are very fast and the CPU is very slow - well, this scenario can lead to a situation where no tasks are assigned to the CPU at all; all the CPU does is feed the GPU(s). No rebranding is needed, but the CPU doesn't process any tasks itself.

In all other cases this AR separation between CPU and GPU should increase host performance.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 18 Apr 2009, 06:21:16 am
Can I urge a note of caution in re-branding VHAR tasks from GPU to CPU? You'll need to put some alternative means of cache size limitation in place too.

Everybody who's been around for more than a few months will know, by experience, that Matt is quite capable of serving up a 'shorty storm' - injudicious choice of tapes can mean that all, or a very high proportion, of tasks come into the VHAR category.

Consider what would happen. BOINC requests CUDA work: gets mainly VHAR. Re-brand those to the CPU queue: the CUDA queue is now light on work, and will be re-filled. Rinse and repeat.

You'll end up with a huge number of VHAR queuing for the CPU(s). BOINC will run them in High Priority, of course, but that's no guarantee that they can all be processed within 7 days. I'm not exactly certain whether CPU High Priority would inhibit CUDA work fetch for the project - I suspect not, and I think that most people here would argue that it shouldn't. So there would be no inhibition on runaway cache filling except splitter capacity and host daily quota. Anyone got a CPU that can process 1,000 tasks a day, even shorties?

I'm seeing this in a limited way with VLARs. I was away over the holiday weekend, so I set my cache high so I could re-brand any VLARs before I left. 8 days later, one box still has over 50 VLARs from that initial batch slowly chugging through the CPU. They won't be a problem (03 May deadline, and I have plenty of spare firepower on the Quad if things ever get tight), but it's an example of the tight squeezes you can get into with re-branding.

Given the propensity we've seen here for SETIzens to download anything in sight with go-faster stripes on, and turn all the knobs up to 11 without reading the instruction manual first, some sort of limitation - especially for the slow CPU / fast GPU mis-matched hosts - would avoid a lot of disappointment.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 18 Apr 2009, 06:39:56 am
Sure, any external task move between CPU and GPU will break BOINC's expectations about how much work it has and needs.
That is, no such action should be taken with a 10-day cache at all. And it's better if such rebranding occurs on a regular basis, pretty often but in pretty small chunks, so as not to throw off BOINC's work amount estimate too much each time.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 18 Apr 2009, 09:33:08 am
You probably need to add some theoretical underpinning to the 'WU selection for re-branding', too. I've just finished an AR=0.127238 task, which, while exhibiting some of the sluggishness of the 'true' VLARs, ran for about 30 minutes on a 9800GT - compared to 90 minutes for a ~0.01, or 20 minutes for a ~0.4.

This WU would have come from the first peak of Joe's graph recently published at Main (http://setiathome.berkeley.edu/forum_thread.php?id=51712&nowrap=true#884715):

(http://users.westelcom.com/jsegur/SAHenh/ARvsMaxPoT.png)

I've also had (rare) tasks from the second peak, around AR=0.2, with similar runtimes. Joe, if you're watching, is there any way of knowing how many times these big pulse PoT arrays are processed during a run at different ARs? If there's a massive array at 40960, but it's only run once, the problem is much less than the arrays at 32768 which seem to run forever at VLAR.

Also, another thought. Several people (myself included) have observed and reported that VHAR tasks don't scale well on multi-core hosts: there's a big penalty for running 8 x VHAR on my dual Xeon (memory bus saturation, even with quad-channel FB-DIMMs, we think). Your efficiency cross-over point may be different if you measure with all CPU cores saturated with VHAR work - which would likely be the case after re-branding.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Josef W. Segur on 18 Apr 2009, 05:08:33 pm
...
Joe, if you're watching, is there any way of knowing how many times these big pulse PoT arrays are processed during a run at different ARs? If there's a massive array at 40960, but it's only run once, the problem is much less than the arrays at 32768 which seem to run forever at VLAR.
...
Yes, it is possible. First, here's a table which applies to all LAR pulse finding:

Code: [Select]
FFTLen  Stepsize  NumCfft
     8 17.072753       11
    16  8.536377       23
    32  4.268188       47
    64  2.134094       93
   128  1.067047      187
   256  0.533524      375
   512  0.266762      749
  1024  0.133381     1499
  2048  0.066690     2999
  4096  0.033345     5997
  8192  0.016673    11995

The Stepsize is in chirp values, and the work goes to a final chirp limit of +/- 100, so the NumCfft (number of chirp/fft pairs) is simply derived by dividing 100 by Stepsize, truncating to an integer, doubling (for positive and negative), and adding 1 (for processing at zero chirp). The reason the same set of values applies to all LAR work is that the code uses pot_min_slew to establish the Stepsize for any work below AR 0.225485775. FFTLen 8 and 16 are only used if there's sufficient motion that PoTlen will be below the 40960 limit.
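
(A small sketch of that derivation, which reproduces the NumCfft column above:)
Code: [Select]
use strict;
use warnings;
use POSIX qw(floor);

# NumCfft = 2 * int(100 / Stepsize) + 1: chirps out to +/-100 in
# Stepsize increments, positive and negative, plus the zero-chirp pass.
sub num_cfft { my ($stepsize) = @_; return 2 * floor(100 / $stepsize) + 1 }

printf "FFTLen 32:   %d\n", num_cfft(4.268188);   # 47
printf "FFTLen 8192: %d\n", num_cfft(0.016673);   # 11995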

When an FFTLen 32 is done it produces 32 arrays 32768 long, but only 31 are searched for signals. For true VLAR, motion is less than one beamwidth for the full WU duration, so each of the 31 is done in one gulp. Because there are 47 chirp/fft pairs to process, we get 31*47=1457 of them.

For anything above one beamwidth the data is processed with PoTs overlapped by at least 50%, so for 1 to 1.5 beamwidths two PoTs are needed, then three for 1.5 to 2 beamwidths, four for 2 to 2.5, etc. Here's a table for true VLAR plus the two peaks which reach 40960 length (a sketch reproducing these counts follows the table):

Code: [Select]
FFTLen  AR<=0.05      AR=0.08        AR=0.16
        ------------- -------------- -------------
     8                               462(@40960)
    16                1035(@40960)   2070(@20480)
    32  1457(@32768)  4371(@20480)   8742(@10240)
    64  5859(@16384)  17577(@10240)  35154(@5120)
   128  23749(@8192)  71247(@5120)   142494(@2560)
   256  95625(@4096)  286875(@2560)  573750(@1280)
   512  382739(@2048) 1148217(@1280) 2296434(@640)
  1024  1.53e6(@1024) 4.6e6(@640)    9.2e6(@320)
  2048  6.14e6(@512)  1.84e7(@320)   3.68e7(@160)
  4096  2.46e7(@256)  7.37e7(@160)   1.47e8(@80)
  8192  9.83e7(@128)  2.95e8(@80)    5.9e8(@40)
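
A sketch that reproduces those counts (my reconstruction, not the actual application code; the assumptions are mine: beamwidths traversed b = AR / 0.05, and the 50%-overlapped PoT count is 1 for b <= 1 and ceil(2b - 2) + 1 above that, following the half-beamwidth steps described above; one of the FFTLen arrays is never searched):
Code: [Select]
use strict;
use warnings;
use POSIX qw(ceil floor);

sub num_cfft { my ($step) = @_; return 2 * floor(100 / $step) + 1 }

sub pulse_pot_count {
    my ($fftlen, $numcfft, $ar) = @_;
    my $b     = $ar / 0.05;                          # assumed beamwidth scaling
    my $npots = $b <= 1 ? 1 : ceil(2 * $b - 2) + 1;  # 50%-overlapped PoTs
    return ($fftlen - 1) * $numcfft * $npots;        # one array per FFT is skipped
}

printf "FFTLen 32, AR 0.05: %d\n", pulse_pot_count(32, num_cfft(4.268188), 0.05);  # 1457
printf "FFTLen 32, AR 0.08: %d\n", pulse_pot_count(32, num_cfft(4.268188), 0.08);  # 4371
printf "FFTLen 64, AR 0.16: %d\n", pulse_pot_count(64, num_cfft(2.134094), 0.16);  # 35154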

In another thread there's the testing Raistmer did at my request showing how much time is saved doing VLAR work with the 32768 and 16384 PoT lengths not done, and with the 8192 length skipped also. Although the counts for those lengths are relatively small, they seem to account for most of the GPU crunch time. What I don't know is whether the relationship is relatively smooth or time increases in steps as certain sizes are exceeded.

Raistmer, could you provide the raw data from which your chart was derived? I'd like to be able to correlate things better.
                                                                            Joe
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 08:12:27 am

Quote
Also, another thought. Several people (myself included) have observed and reported that VHAR tasks don't scale well on multi-core hosts: there's a big penalty for running 8 x VHAR on my dual Xeon (memory bus saturation, even with quad-channel FB-DIMMs, we think). Your efficiency cross-over point may be different if you measure with all CPU cores saturated with VHAR work - which would likely be the case after re-branding.
Yes, multicore (as well as multi-GPU) considerations can complicate this picture a lot.
I consider it a first-level approach, while the current one (any AR goes anywhere) is a zero-level one. Moreover, the situation for an HT-capable host can be different too (compared with a true multicore host).
I will try to measure CPU and elapsed times while all cores are busy with VHARs.
For the record: all measurements in the first post were made with BOINC disabled. All cores were idle except the one running the test app.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 08:14:24 am

Quote
Raistmer, could you provide the raw data from which your chart was derived? I'd like to be able to correlate things better.
                                                                            Joe
Sure, will mail it to you.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 19 Apr 2009, 08:36:16 am
Just for fun, I adapted Fred's script to re-brand VHAR as well as VLAR - as it happens, the strict 'V' cases are easy to handle, at both LAR and HAR. The ones in the middle are more difficult to pick out.

The first one I tried, BOINC (as expected) immediately fetched more work to re-fill the gaps in the CUDA queue caused by the re-branding. I got mostly VLAR. So I re-branded that, and BOINC replaced it - with VHAR. So I re-branded that, and .... you get the picture. Note that I am doing this under controlled, observed conditions on machines with plenty of horsepower and spare capacity I can 'borrow' from other projects. Even so, I'm going to have to keep an eye on VHAR deadlines for the next few days. If I was working on a SETI-only box with a huge cache setting, I could have got myself into serious bother.


Quote
That is, no such action should be taken with a 10-day cache at all. And it's better if such rebranding occurs on a regular basis, pretty often but in pretty small chunks, so as not to throw off BOINC's work amount estimate too much each time.


Difficult one. I can see your point about moving in small steps, but the objective you're trying to achieve is to shave a few seconds off crunch times. In order to re-brand tasks, you have to shut down and restart the BOINC core client: and that's one of the most inefficient operations around. What I'm doing (except when experimenting) is waiting until the last possible moment before one of the target ARs reaches the head of the CUDA queue, and then doing the whole lot in one big batch. That gives me the longest possible uninterrupted run before the next batch needs doing. This approach would only work for VLAR with v6.6.20, because VHARs run immediately on arrival: I'm running v6.6.23, so I can use it for VHAR as well.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Josef W. Segur on 19 Apr 2009, 10:21:56 am
...
In order to re-brand tasks, you have to shut down and restart the BOINC core client: and that's one of the most inefficient operations around.
...

Perhaps using a rewriting proxy to rebrand tasks in the Scheduler reply would be a better approach. I think the Proxomitron (http://www.proxomitron.info/) might be able to handle that.
                                                                                 Joe
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 10:57:00 am
Well, the script could be written in Perl. If that proxy supports external program calls, then maybe yes - such a proxy would be the best way to go.
For now, by "small step" I mean a 1-2 day cache, not a 10-day one: small enough to avoid deadline misses but big enough to compensate for the BOINC shutdown/restart.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 11:04:08 am
Quote
The first one I tried, BOINC (as expected) immediately fetched more work to re-fill the gaps in the CUDA queue caused by the re-branding. I got mostly VLAR. So I re-branded that, and BOINC replaced it - with VHAR. So I re-branded that, and .... you get the picture.
It's not necessary to rebrand all tasks. Some limit can be set up. In short, within deadline boundaries such a script would improve performance. That is, our goal is to write such a script and ensure that its work will not lead to deadline misses (for a more or less conscientious user; even app_info can be a devastating weapon in the hands of a "no-clue" user...)
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 11:24:12 am
Code: [Select]
<result>
    <name>27dc08ab.16638.387984.5.8.30_0</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>1</state>
    <platform>windows_intelx86</platform>
    <version_num>608</version_num>
    <plan_class>cuda</plan_class>
    <wu_name>27dc08ab.16638.387984.5.8.30</wu_name>
    <report_deadline>1240759087.000000</report_deadline>
    <file_ref>
        <file_name>27dc08ab.16638.387984.5.8.30_0_0</file_name>
        <open_name>result.sah</open_name>
    </file_ref>
</result>
Do I understand right that only the highlighted lines (<version_num> and <plan_class>) should be changed for CPU<->GPU rebranding?
Title: Re: GPU/CPU performance dependence on AR value
Post by: Jason G on 19 Apr 2009, 11:39:17 am
Quote
Do I understand right that only the highlighted lines (<version_num> and <plan_class>) should be changed for CPU<->GPU rebranding?

Yes, pretty much. What I found easier (for me, while I was doing it manually) was to add an extra application name for the CPU app, with no plan class, to app_info - say setiathome_enhanced_AK, version 608 (plan class "CPU" optional). This way all rebranded work shows as the new version but CPU in the manager. That was helpful for quickly and visually identifying work that was already rebranded.

This way only the application name and plan class lines needed changing, rather than the version, and work already allocated to the CPU via 6.03 was untouched. New work for the CPU went to 6.03, new CUDA work went to CUDA 6.08, and only rebranded work was 6.08 (CPU), meaning it was easy to recognise and undo if needed.
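
(A sketch of what such an extra app_info.xml entry might look like - reconstructed from the description above, not Jason's actual file; the executable file name is a placeholder:)
Code: [Select]
<app>
    <name>setiathome_enhanced_AK</name>
</app>
<app_version>
    <app_name>setiathome_enhanced_AK</app_name>
    <version_num>608</version_num>
    <file_ref>
        <file_name>AK_cpu_app.exe</file_name>  <!-- placeholder name for the optimized CPU binary -->
        <main_program/>
    </file_ref>
</app_version>
Rebranded workunits then point their <app_name> at setiathome_enhanced_AK (with the result's <plan_class>cuda</plan_class> line removed), while <version_num> stays 608.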


Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 12:44:32 pm
I'm going to rebrand not only GPU to CPU but CPU to GPU too ;)
Now I'm finishing the current results run (for a clean first test) and then will try to do the rebranding. The produced file looks OK to my eye, though I'm not very experienced with client_state.xml :)
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 01:58:18 pm
Unfortunately, it was not the only place:
Code: [Select]
<workunit>
    <name>01fe09aa.11349.15205.4.8.69</name>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <rsc_fpops_est>78856403871557.094000</rsc_fpops_est>
    <rsc_fpops_bound>788564038715571.000000</rsc_fpops_bound>
    <rsc_memory_bound>33554432.000000</rsc_memory_bound>
    <rsc_disk_bound>33554432.000000</rsc_disk_bound>
    <file_ref>
        <file_name>01fe09aa.11349.15205.4.8.69</file_name>
        <open_name>work_unit.sah</open_name>
    </file_ref>
</workunit>

Here the version number is mentioned too.
I got errors on all rebranded tasks, something like "can't link" on the results:
19/04/2009 21:56:31   SETI@home   [error] State file error: missing task
19/04/2009 21:56:31   SETI@home   [error] Can't link task 01fe09aa.11349.15205.4.8.75_0 in state file
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 19 Apr 2009, 02:39:28 pm
Fred's script just replaces "608" with "603" in both the <workunit> and <result> sections (matching ones, of course), and deletes the <plan_class>cuda</plan_class> line completely - making it look exactly like a 603 directly allocated to the CPU by the server. That seems the simplest solution: but I'm intrigued by Josef's suggestion. That might be worth a look.
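
As a minimal sketch of that surgery (a reconstruction, not Fred's actual script - and unlike his, it blindly rebrands every CUDA entry rather than selected ones; run it only while BOINC is stopped, and keep a backup):
Code: [Select]
use strict;
use warnings;

my $file = 'client_state.xml';
open my $in, '<', $file or die "open $file: $!";
my @lines = <$in>;
close $in;

my @out;
for my $line (@lines) {
    # delete the cuda plan class line completely
    next if $line =~ m{<plan_class>cuda</plan_class>};
    # flip the version in both <workunit> and <result> sections
    $line =~ s{<version_num>608</version_num>}{<version_num>603</version_num>};
    push @out, $line;
}

open my $outfh, '>', $file or die "write $file: $!";
print {$outfh} @out;
close $outfh;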
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 19 Apr 2009, 04:25:38 pm
Well, I've had a look through the Proxomitron docs, and I'm a bit daunted by the match/replace language. Does anyone have any experience of writing that sort of meta-character based filter? Here's an example of what we need to do.

Quote
<rubbish>
...
</rubbish>
<file_info>
    <name>27dc08ab.32733.481.6.8.9</name>
    <url>http://boinc2.ssl.berkeley.edu/sah/download_fanout/f1/27dc08ab.32733.481.6.8.9</url>
    <md5_cksum>d8bf53ae5251691603446976bd9e757d</md5_cksum>
    <nbytes>375323</nbytes>
</file_info>
<workunit>
    <rsc_fpops_est>23780000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>237800000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>33554432.000000</rsc_memory_bound>
    <rsc_disk_bound>33554432.000000</rsc_disk_bound>
    <name>27dc08ab.32733.481.6.8.9</name>
    <app_name>setiathome_enhanced</app_name>
<file_ref>
    <file_name>27dc08ab.32733.481.6.8.9</file_name>
    <open_name>work_unit.sah</open_name>
</file_ref>
</workunit>
...
<file>
<WU>
<file>
<WU>
...
<file_info>
  <name>27dc08ab.32733.481.6.8.9_1_0</name>
  <generated_locally/>
  <upload_when_present/>
  <max_nbytes>65536</max_nbytes>
  <url>http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<xml_signature>
b849d6e0adcc332ad1601a97d75f4d073ea633ae16b00663e6aa98bac0477c08
3742e0330aae2deee62f2406ddcd1020b3ff02e3cf6f7f77482a97dbc453a489
21fe18199095dda88f172da2d97b1d1cddff23272c832be8e44ba10b38212700
0e5ff950052f3a870c850bb3efa7cefcee57ce02ddcb6473d55526a34ba2dc4f
.
</xml_signature>
</file_info>
<result>
<report_deadline>1240773971</report_deadline>
<wu_name>27dc08ab.32733.481.6.8.9</wu_name>
<name>27dc08ab.32733.481.6.8.9_1</name>
  <file_ref>
    <file_name>27dc08ab.32733.481.6.8.9_1_0</file_name>
    <open_name>result.sah</open_name>
  </file_ref>
    <platform>windows_intelx86</platform>
    <version_num>608</version_num>
    <plan_class>cuda</plan_class>
</result>
...
<file>
<result>
<file>
<result>
...
<rubbish>
...
</rubbish>

The <rubbish> and all the intervening file/workunit/file/result stuff needs preserving, of course.

But the process is:

1) Identify a WU for re-branding by <rsc_fpops_est> - this is a VHAR
2) Remember the next <name> for later use
3) Look for a matching <wu_name> (it only occurs in a <result> section)
4) Change the next following <version_num> from 608 to 603, or vice versa as desired
5) Delete or insert the <plan_class> line to match

allowing for up to 20 names to be matched and re-branded in a single sched_reply file. The process is very similar to Fred's surgery on client_state, so it could easily be scripted: but I'm not sure it could be pattern-matched on the fly.
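
Those five steps could also be done as a Perl pass over the saved file rather than an on-the-fly filter. A sketch under my own assumptions (regex-based XML surgery, the VHAR <rsc_fpops_est> signature from the example above, and a hypothetical file name):
Code: [Select]
use strict;
use warnings;

my $file = 'sched_reply.xml';   # hypothetical name for a saved scheduler reply
open my $in, '<', $file or die "open $file: $!";
my $xml = do { local $/; <$in> };
close $in;

# Steps 1+2: find VHAR <workunit> blocks by <rsc_fpops_est>, remember their names.
my $vhar = '23780000000000.000000';
my %wanted;
while ($xml =~ m{<workunit>(.*?)</workunit>}sg) {
    my $wu = $1;
    next unless $wu =~ m{<rsc_fpops_est>\Q$vhar\E</rsc_fpops_est>};
    $wanted{$1} = 1 if $wu =~ m{<name>([^<]+)</name>};
}

# Steps 3-5: in each matching <result>, flip the version and drop the plan class.
$xml =~ s{(<result>.*?</result>)}{
    my $r = $1;
    if ($r =~ m{<wu_name>([^<]+)</wu_name>} && $wanted{$1}) {
        $r =~ s{<version_num>608</version_num>}{<version_num>603</version_num>};
        $r =~ s{\s*<plan_class>cuda</plan_class>}{};
    }
    $r;
}sge;

open my $out, '>', $file or die "write $file: $!";
print {$out} $xml;
close $out;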
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 04:26:02 pm
Quote
Fred's script just replaces "608" with "603" in both the <workunit> and <result> sections (matching ones, of course), and deletes the <plan_class>cuda</plan_class> line completely - making it look exactly like a 603 directly allocated to the CPU by the server. That seems the simplest solution: but I'm intrigued by Josef's suggestion. That might be worth a look.
Yes, mine too.
But mine can do the CPU->GPU move as well.
A beta version is in testing.

BOINC can be stopped via "boinccmd --quit" or "net stop boinc", and restarted after patching via "start" / "net start boinc".
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 04:28:38 pm

Quote
Identify a WU for re-branding by <rsc_fpops_est> - this is a VHAR
How can the AR be matched with this field's value? Does any table or formula exist?
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 19 Apr 2009, 04:42:13 pm
Quote
Identify a WU for re-branding by <rsc_fpops_est> - this is a VHAR

How can the AR be matched with this field's value? Does any table or formula exist?

That's why I said earlier that the 'V' cases are easy - 80360000000000.000000 is a VLAR (true VLAR - AR<0.05) and 23780000000000.000000 is a VHAR.

For the formula in between, ask Josef, and remind him of http://setiathome.berkeley.edu/forum_thread.php?id=44178&nowrap=true#698744 - there will be a linear scaling factor to the formula in that post. But note that the formula will not be single-valued in converting from fpops to AR - any intermediate value could be on either of the curves.
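
So at best a script can hard-code the two unambiguous 'V' signatures and treat everything else as midrange - a sketch (string comparison against the exact values quoted above; the midrange mapping would need Josef's formula):
Code: [Select]
# Classify a task by its <rsc_fpops_est> string. Only the two 'V'
# signatures are unambiguous; fpops -> AR is not single-valued in between.
sub classify_fpops {
    my ($fpops) = @_;
    return 'true VLAR (AR < 0.05)' if $fpops eq '80360000000000.000000';
    return 'VHAR'                  if $fpops eq '23780000000000.000000';
    return 'midrange - ambiguous';
}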
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 04:46:46 pm
I see. I will stay with the rebranding script for now then.
Running once per 1-2 days, it can increase net host performance, IMHO.
~50% task speedup for VHAR and even more for VLAR, and no need to kill tasks - enough advantages for the next step toward perfection ;)
Title: Re: GPU/CPU performance dependence on AR value
Post by: Richard Haselgrove on 19 Apr 2009, 05:23:45 pm
I've split Fred's script into an information-only part (which can run while BOINC is active), and an action part which shuts down BOINC, does the necessary, and restarts BOINC.

When I get to within 50 CUDA tasks of the first one I want to re-brand, the default button switches from 'No' to 'Yes', and the action script runs automatically unless I intervene to stop it.

[attachment deleted by admin]
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 19 Apr 2009, 05:32:13 pm
I switched to the 6.6.20 CUDA management from the V10 pack to use rebranding, and it seems I didn't edit the app_info flop estimates right. So I received 500 tasks at once. I did the rebranding on the whole queue (lost only a few tasks during debugging). Sorted by deadline, the queue looks similar to the GPU/CPU performance graph: 603, then 608 CUDA, then 603 again. It seems the script works.
If other successful reports arrive I will put them in this thread. Maybe it could be ported to VBScript, to avoid the need for a Perl interpreter.
Title: Re: GPU/CPU performance dependence on AR value
Post by: Raistmer on 25 Apr 2009, 01:20:09 pm
To evaluate the dependence of the performance curve on the type of GPU, I did the same measurements on the same host with a different GPU.
This time it was an 8500GT, one of the slowest CUDA-capable GPUs. The first experiment was with the 9600GSO, which can be viewed as a midrange GPU.
The results are plotted on this graph.
(http://img411.imageshack.us/img411/9305/relativegpuperformance.th.png) (http://img411.imageshack.us/my.php?image=relativegpuperformance.png)
Performance values are normalized to 100% to simplify GPU comparison.
One can see that the slower the GPU, the less sensitive it is to the performance differences between ARs. Nevertheless, the curve has the same peculiarities as for the 9600GSO.
IMHO, extrapolation to faster GPUs will emphasize the differences in performance across ARs. That is, the faster the GPU, the more it matters that it gets tasks in the right AR range.

EDIT: the current results in the VHAR area were obtained with a single core busy and the other CPU cores idle. CPU VHAR performance drops significantly when all CPU cores are busy with similar VHAR tasks.
This effect needs more investigation (and will influence the VHAR part of the provided curves).