Seti@Home optimized science apps and information

Optimized Seti@Home apps => Windows => GPU crunching => Topic started by: Raistmer on 05 Aug 2012, 06:14:08 pm

Title: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 05 Aug 2012, 06:14:08 pm
Feature of this set: the tasks are zero-blanked and contain no signals, which is the ideal case for the GPU part.
Preferred usage: tuning GPU AP parameters.
They can also be used in the testing phase to check for false positives.

I personally prefer a fairly long task (around 300 secs mean elapsed time) for tuning and performance measurement, because its total run time is much longer than the startup time but still short enough to do a whole bunch of runs with different params.

The longest one is attached here. Execution time on my HD6950 is around 300 secs with good (but not necessarily the best) params:
-unroll 10 -ffa_block 4096 -ffa_block_fetch 4096
on idle CPU with Cat 12.1 driver, Win7 x64 OS.
WU : Clean_20LC.wu
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 304.541 secs
      CPU 91.573 secs

    single pulses: 0
repetitive pulses: 0
  percent blanked: 0.00

For slower GPUs I will attach smaller tasks. You can create your own Clean_*LC task by editing the <dm_high>3455</dm_high> field.

EDIT: Perl-based extraction script added

{edit} JWS: Removing the Clean_20LC.rar attachment, now included in the Clean_xxLC_WUs.7z download for AP test tools.
Title: Re: New set of test tasks for GPU AP
Post by: Raistmer on 05 Aug 2012, 06:57:38 pm
And here is the smallest task, for low-end GPUs.

My C-60 does it in ~300 secs

Running app : AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -unroll 2
with WU     : Clean_01LC.wu
Started at  : 02:48:48.977
Ended at    : 02:54:12.341
    323.181 secs Elapsed
    233.237 secs CPU time

{edit} JWS: Removing the Clean_01LC.wu.7z attachment, now included in the Clean_xxLC_WUs.7z download for AP test tools.
Title: Re: New set of test tasks for GPU AP
Post by: Claggy on 06 Aug 2012, 12:25:06 am
Here's a bench of the Clean_20LC.wu task on my GTX460 (Win 7 x64, 304.79)

AP6_win_x86_SSE2_OpenCL_NV_r1363.exe  / Clean_20LC.wu :
AppName: AP6_win_x86_SSE2_OpenCL_NV_r1363.exe
AppArgs: 
TaskName: Clean_20LC.wu
Started at  : 01:37:12.156
Ended at    : 01:45:11.080
    478.900 secs Elapsed
    464.197 secs CPU time
Speedup     : 93.34%
Ratio       : 15.02x
 
ref-astropulse_6.01_windows_intelx86.exe-Clean_20LC.wu.res: <ap_signal>10,<pulses>0,<best_pulses>10
result-AP6_win_x86_SSE2_OpenCL_NV_r1363.exe-Clean_20LC.wu.res: <ap_signal>10,<pulses>0,<best_pulses>10
             All Signals: Weakly similar or Different.
                  Pulses: Checked   0,  0 , Strongly Similar
             Best Pulses: Weakly similar or Different.

Claggy
Title: Re: New set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 02:41:26 am
That took quite a bit more than 300 secs, so if you want to tune params on your GPU the task could be reduced a little. Apart from the fixed startup time, execution time for an AP clean task should scale linearly with the number of large DM chunks involved. Each large DM chunk consists of 128 DMs. The lowest one is 896, so the expression 896+N*128-1 can be used for the high DM field, where N is the number of large DM chunks the task will contain. In your case I would use a 12LC or 13LC task instead of 20LC.
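
A minimal Python sketch of the expression above, for anyone who wants to generate their own Clean_*LC task. The function names are hypothetical, and treating the work unit as raw bytes is an assumption made here so that anything after the header passes through untouched. N=20 gives back the 3455 quoted in the first post.

```python
import re

DM_LOW = 896        # lowest DM, as stated in the post
DM_PER_CHUNK = 128  # DMs per large DM chunk, as stated in the post


def dm_high_for_chunks(n_chunks: int) -> int:
    """Return the <dm_high> value for a task with n_chunks large DM chunks."""
    if n_chunks < 1:
        raise ValueError("a task needs at least one large DM chunk")
    return DM_LOW + n_chunks * DM_PER_CHUNK - 1


def set_dm_high(wu_bytes: bytes, n_chunks: int) -> bytes:
    """Rewrite the first <dm_high> field found in the work unit bytes."""
    new_field = b"<dm_high>%d</dm_high>" % dm_high_for_chunks(n_chunks)
    patched, count = re.subn(
        rb"<dm_high>\s*\d+\s*</dm_high>", new_field, wu_bytes, count=1
    )
    if count != 1:
        raise ValueError("no <dm_high> field found")
    return patched
```

Usage would be along the lines of reading Clean_20LC.wu in binary mode, calling set_dm_high(data, 12), and writing the result to a new file. Keep the original file as a backup.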
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 04:58:10 am
To get roughly the same execution time on the C-60 as on the HD6950, the task was reduced 20 times.
But the C-60 has 2 CUs while the HD6950 has 22 CUs: an 11 times difference, not 20.
It looks like the remaining slowdown (20/11, a factor of about 1.8) comes from the memory used. The HD6950 is a discrete GPU with its own dedicated fast memory, while the C-60 is an APU that uses system memory.

It would be interesting to know how much of the slowdown comes from the different architectures of these two devices...
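
The same comparison can be redone as ratios, using the elapsed times quoted earlier in the thread. This is only a rough sketch: it assumes Clean_NNLC contains NN large DM chunks, ignores the fixed startup time, and ignores that the two runs used different unroll settings.

```python
# Elapsed times and CU counts quoted earlier in the thread.
HD6950_ELAPSED, HD6950_CHUNKS, HD6950_CUS = 304.541, 20, 22  # Clean_20LC.wu
C60_ELAPSED, C60_CHUNKS, C60_CUS = 323.181, 1, 2             # Clean_01LC.wu


def secs_per_chunk(elapsed: float, chunks: int) -> float:
    """Elapsed seconds spent per large DM chunk (startup time not removed)."""
    return elapsed / chunks


# How much slower the C-60 is per chunk of work.
per_chunk_ratio = secs_per_chunk(C60_ELAPSED, C60_CHUNKS) / secs_per_chunk(
    HD6950_ELAPSED, HD6950_CHUNKS
)
# How much of that the CU count alone would explain.
cu_ratio = HD6950_CUS / C60_CUS
# What is left over for memory, architecture and everything else.
residual = per_chunk_ratio / cu_ratio

print(f"per-chunk slowdown: {per_chunk_ratio:.1f}x")
print(f"CU count ratio:     {cu_ratio:.1f}x")
print(f"unexplained factor: {residual:.2f}x")
```

The leftover factor comes out at a little under 2, which is what memory, architecture and anything else would have to account for between them.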
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 05:20:11 am
Here are the first results of unroll param tuning for the HD6950. CPU idle, Cat 12.1, OS Vista x86.

As one can see, performance saturates once the unroll factor reaches a certain threshold. Unrolls that are too low are considerably inefficient. Take into account that a real-world task will have signals in it plus some % of blanking. Each of these factors adds slowdown as unroll increases (the memory requirement increases with unroll too), so I would recommend staying at the minimal effective unroll and not raising it too much without need.

Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 06:14:50 am
And the same run for the C-60 APU. CPU idle, Cat 11.12 mobile, OS Win7 x64.
Quite a different picture. The bench run was aborted due to a system shutdown (I left the netbook on a soft surface and it overheated). I will repeat it with good cooling to see whether this performance decrease at higher unroll is an overheating effect (an APU frequency drop, maybe?) or is inherent to the device.
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 06:50:37 am
I want to share the picture TThrottle provided.
Stages:
1. Netbook idle with screen off
2. Screen comes on
3. Benchmark started, many ap_genwiz tasks are done
4. First test Clean_01LC.wu task running.
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 07:33:55 am
And another picture:

1: Screen was switched off during the task run
2: External air cooling applied
3: Screen goes off during the task run, then wakes up.

It looks like there is a performance drop when the screen goes off. Not only does the temperature decrease, but execution time increases too.

This could explain the erratic times I get on the netbook and the monotonic dependence on the discrete GPU, where the screen is always ON. To check this I will do a run with the display always ON and then one with the display turned off after 1 min of keyboard idle.
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Fredericx51 on 06 Aug 2012, 08:08:54 am
Is it useful to do a test with an I7-2600 + 2x HD5870 GPUs, using the AP rev.1316 app with unroll 15;
ffa_block  10240 ffa_block_fetch 5120
?  These give the lowest runtime and CPU time.
{Cat 12.4;  AMD-APP (SDK) 2.4; OpenCL 1.2}

Or try this on GTX470 or 480?


Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 08:19:43 am
I find it useful to get the dependence curve for a param, not just a single dot. This is not a test for validity, it's tuning. I see no sense in a single dot here; it says nothing about whether good or bad params were chosen.
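
As an illustration of collecting a curve rather than a single dot, here is a minimal Python sketch that builds one command line per unroll value and times repeated runs. The flags -unroll, -ffa_block and -ffa_block_fetch and the app name are the ones quoted in this thread; the function names are hypothetical, and how the work unit is handed to the app is not covered here (the sketch assumes the app finds it in its working directory, as a bench setup would arrange).

```python
import subprocess
import time

APP = "AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe"  # app name quoted in the thread


def sweep_commands(unrolls, ffa_block=None, ffa_block_fetch=None, app=APP):
    """Build one argument list per unroll value."""
    commands = []
    for unroll in unrolls:
        cmd = [app, "-unroll", str(unroll)]
        if ffa_block is not None:
            cmd += ["-ffa_block", str(ffa_block)]
        if ffa_block_fetch is not None:
            cmd += ["-ffa_block_fetch", str(ffa_block_fetch)]
        commands.append(cmd)
    return commands


def time_command(cmd, run=subprocess.run):
    """Return wall-clock seconds for one run; `run` can be swapped for a dry run."""
    start = time.perf_counter()
    run(cmd, check=True)
    return time.perf_counter() - start


def sweep(unrolls, repeats=3, run=subprocess.run, **params):
    """Map each unroll value to a list of elapsed times (the dots of the curve)."""
    return {
        int(cmd[2]): [time_command(cmd, run) for _ in range(repeats)]
        for cmd in sweep_commands(unrolls, **params)
    }
```

For example, sweep(range(2, 23, 2), ffa_block=4096, ffa_block_fetch=4096) would give several timings per unroll value, which also shows how much the dots scatter between repeats.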
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 01:44:48 pm
C-60 picture updated, extraction script added to first post.
It looks like additional cooling and keeping the display ON can indeed make results more stable (yellow dots).
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 06 Aug 2012, 02:54:02 pm
Being curious, I decided to go through the whole possible range of unrolls.
In short, it breaks at 65 for this GPU: errors (-61, invalid buffer size) and then a driver restart.
There is a very interesting dot at unroll 64 (I will repeat it after a reboot; on driver restart the host lost the mouse cursor completely): high CPU usage. We see high CPU usage for the new FFA PC kernel sequence, where the total kernel sequence run time (w/o a sync point with the host) is quite big. I supposed that the ATi driver switches from interrupts to a busy-wait loop after some waiting threshold, hence if the kernel sequence is too long we get an increase in CPU time (in contradiction with all GPU optimization manuals, btw).
Here, as unroll increases, a single kernel becomes longer and longer, so at some point the same driver switch should occur, if it exists. This preliminary data shows that yes, it happens. It needs to be repeated a few times, of course, to be sure.

And another conclusion: an unroll of half the number of CUs is a good guess but slightly suboptimal, while going beyond an unroll equal to the number of CUs is pointless.

EDIT: added the missing dots and repeated the last one a few times. It's reproducible: very high CPU usage at unroll 64 indeed! (Blue dots were obtained after the reboot; the vertical line is the number of CUs for this GPU.)
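
The rule of thumb above (start around half the CU count, do not go past the CU count) can be turned into a search range for a tuning run. This is a sketch with a hypothetical helper name, and the rule itself comes from one GPU only, so treat the range as a starting point rather than a limit that holds everywhere.

```python
def unroll_candidates(num_cus: int) -> range:
    """Unroll values from half the CU count up to the CU count, inclusive."""
    if num_cus < 1:
        raise ValueError("a GPU has at least one compute unit")
    start = max(1, num_cus // 2)
    return range(start, num_cus + 1)


# CU counts quoted earlier in the thread.
print(list(unroll_candidates(22)))  # HD6950
print(list(unroll_candidates(2)))   # C-60
```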
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: arkayn on 06 Aug 2012, 03:55:04 pm
Ran both WUs on my HD-7750 and GTX-670

Quick timetable
 
WU : #ap_genwis.dat
astropulse_6.01_windows_intelx86.exe -verbose :
  Elapsed 4.561 secs
      CPU 2.527 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 53.743 secs, speedup: -1078.32%  ratio: 0.08
      CPU 51.574 secs, speedup: -1940.92%  ratio: 0.05
AP6_win_x86_SSE2_OpenCL_NV_r1363.exe -verbose  :
  Elapsed 3.401 secs, speedup: 25.43%  ratio: 1.34
      CPU 1.420 secs, speedup: 43.81%  ratio: 1.78
 
WU : Clean_01LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
  Elapsed 718.923 secs
      CPU 715.717 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 40.220 secs, speedup: 94.41%  ratio: 17.87
      CPU 8.408 secs, speedup: 98.83%  ratio: 85.12
AP6_win_x86_SSE2_OpenCL_NV_r1363.exe -verbose  :
  Elapsed 23.584 secs, speedup: 96.72%  ratio: 30.48
      CPU 20.967 secs, speedup: 97.07%  ratio: 34.14
 
WU : Clean_20LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
  Elapsed 14193.554 secs
      CPU 14184.235 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 730.437 secs, speedup: 94.85%  ratio: 19.43
      CPU 122.820 secs, speedup: 99.13%  ratio: 115.49
AP6_win_x86_SSE2_OpenCL_NV_r1363.exe -verbose  :
  Elapsed 402.683 secs, speedup: 97.16%  ratio: 35.25
      CPU 385.572 secs, speedup: 97.28%  ratio: 36.79
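
For reading the speedup and ratio columns: the formulas in the Python sketch below reproduce the figures in the table, so they appear to be what the bench script computes. That is an inference from the numbers, not something taken from the script itself. The reference time is the stock astropulse_6.01 run.

```python
def speedup_percent(ref_secs: float, test_secs: float) -> float:
    """Share of the reference time saved, in percent (negative if slower)."""
    return (ref_secs - test_secs) / ref_secs * 100.0


def ratio(ref_secs: float, test_secs: float) -> float:
    """How many times faster the tested app is than the reference."""
    return ref_secs / test_secs


# Clean_01LC.wu elapsed times from the table: CPU reference vs ATI OpenCL app.
print(f"{speedup_percent(718.923, 40.220):.2f}%  {ratio(718.923, 40.220):.2f}x")
```

A negative speedup, as in the #ap_genwis.dat rows, simply means the tested app took longer than the reference on that task.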
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 07 Aug 2012, 03:21:01 pm
Here is the full range of unrolls for the C-60.
As expected, display ON and display OFF constitute very different modes of operation.
Although the netbook's power plans differ only in display behavior (both PCIe settings and CPU settings were exactly the same), GPU performance was considerably different with display ON and display OFF.
It's an annoying feature for GPGPU computing, because hardly anyone will keep a netbook display always ON just for crunching. I will check whether manually turning off the display (not via the power plan but via the Fn+display off key) results in the same slowdown...

Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 07 Aug 2012, 03:23:37 pm
And here is the initial stage of the HD6950 ffa_block=ffa_block_fetch parameter curve. Again, it's clear that the minimal default params are not good for this GPU; a larger domain size for the FFA kernels is required. The red vertical line marks the point where each CU will have 1 wavefront. Actually, because the workgroup size of all kernels != 64, at that point some CUs will have no wavefronts at all and some will have 4 waves.
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Mike on 07 Aug 2012, 05:27:30 pm
I think those tests are a little iffy because real-life circumstances are much different.
Especially on high-end cards.
Most users want to be able to do everything on the computer whilst crunching.
With the current driver design it's hard enough to get a stable configuration.
Especially with a multiple-GPU setup.

That's the reason I rarely give highly optimized values officially on the forums.
What's possible on your (my) host is not necessarily possible on another.
So we should be careful with such stuff.

kindest regards
Mike
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Raistmer on 08 Aug 2012, 08:48:07 am
Definitely. That's why I do tests on 2 quite different GPUs: to show how different the behavior can be for different hardware (and software, btw) configs.
It's a so-called "case study" as I understand it. And it's in progress. Further phases will show how well params chosen from such tests behave in a real-world situation. But it has to start from something. As with any "real" study, I am trying to exclude as many different factors as I can to better understand the influence of the remaining ones.
For example, when I collected real-life data from my netbook I saw great deviation in run times without any real clue what caused it.
Now I have a good illustration of the influence of display state (in the real world I can't control that, because sometimes I am at the keyboard and sometimes not).
Moreover, I expect that different param configs will react differently to additional CPU load. So I need a baseline with the CPU idle to understand what comes from the loaded CPU and what comes from the change of params...

EDIT: I would say this thread is not about direct recommendations of which params to use, but about better understanding how those params can influence performance and what additional factors to consider when tuning. I think such additional knowledge can be useful too.
Title: Re: GPU AP tuning: new set of test tasks for GPU AP
Post by: Mike on 08 Aug 2012, 05:46:05 pm
I totally understand that.
But my experience, especially from the last few weeks, is that most users draw the wrong conclusions from those suggestions.
Some users change params within such a short timeframe that they will never get an idea of what caused what in particular.

Also keep in mind that not everybody has the technical understanding you have.

Just my 2 cents' worth.

Mike