Author Topic: GPU AP tuning: new set of test tasks for GPU AP  (Read 27911 times)

Offline Raistmer

  • Working Code Wizard
  • Volunteer Developer
  • Knight who says 'Ni!'
  • *****
  • Posts: 14349
GPU AP tuning: new set of test tasks for GPU AP
« on: 05 Aug 2012, 06:14:08 pm »
Feature of this set: the tasks are zero-blanked and contain no signals. This is the ideal case for the GPU part.
Preferred usage: tuning GPU AP parameters.
They can also be used in the testing phase to check for false positives.

I personally prefer a reasonably long task (around 300 secs mean elapsed time) for tuning and performance measurement: its total run time is much longer than the startup time, but it is still short enough to do a whole bunch of runs with different params.

The longest one is attached here. Execution time on my HD6950 is around 300 secs with good (but not necessarily the best) params:
-unroll 10 -ffa_block 4096 -ffa_block_fetch 4096
on idle CPU with Cat 12.1 driver, Win7 x64 OS.
WU : Clean_20LC.wu
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 304.541 secs
      CPU 91.573 secs

    single pulses: 0
repetitive pulses: 0
  percent blanked: 0.00

For slower GPUs I will attach smaller tasks. You can create your own Clean_*LC task by editing the <dm_high>3455</dm_high> field.
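A minimal sketch of such an edit, assuming the <dm_high> field appears as plain text in the WU file exactly as quoted above. The function name and the sample value 1023 are only illustrative. It works on raw bytes so that any other payload in the file is left untouched:

```python
import re

def set_dm_high(wu_bytes: bytes, new_dm_high: int) -> bytes:
    """Return a copy of the WU content with the <dm_high> value replaced.

    Assumption: the field is stored as plain text, e.g. <dm_high>3455</dm_high>.
    """
    pattern = re.compile(rb"(<dm_high>)\s*\d+\s*(</dm_high>)")
    # A callable replacement avoids escape-sequence surprises in a bytes template.
    new_bytes, count = pattern.subn(
        lambda m: m.group(1) + str(new_dm_high).encode("ascii") + m.group(2),
        wu_bytes,
        count=1,
    )
    if count != 1:
        raise ValueError("no <dm_high> field found")
    return new_bytes

# Example on an in-memory header fragment (not a real WU file).
header = b"    <dm_high>3455</dm_high>\n"
print(set_dm_high(header, 1023).decode("ascii"), end="")
```

To apply it to a file, read the original in binary mode, pass the bytes through set_dm_high, and write the result under a new name so the original task stays intact.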

EDIT: Perl-based extraction script added

{edit} JWS: Removing the Clean_20LC.rar attachment, now included in the Clean_xxLC_WUs.7z download for AP test tools.
« Last Edit: 24 Apr 2015, 03:48:48 pm by Josef W. Segur »

Offline Raistmer

Re: New set of test tasks for GPU AP
« Reply #1 on: 05 Aug 2012, 06:57:38 pm »
And here is the smallest task, for low-end GPUs.

My C-60 does it in ~300 secs

Running app : AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -unroll 2
with WU     : Clean_01LC.wu
Started at  : 02:48:48.977
Ended at    : 02:54:12.341
    323.181 secs Elapsed
    233.237 secs CPU time

{edit} JWS: Removing the Clean_01LC.wu.7z attachment, now included in the Clean_xxLC_WUs.7z download for AP test tools.
« Last Edit: 24 Apr 2015, 03:50:10 pm by Josef W. Segur »

Offline Claggy

  • Alpha Tester
  • Knight who says 'Ni!'
  • ***
  • Posts: 3111
    • My computers at Seti Beta
Re: New set of test tasks for GPU AP
« Reply #2 on: 06 Aug 2012, 12:25:06 am »
Here's a bench of the Clean_20LC.wu task on my GTX460 (Win 7 x64, driver 304.79)

AP6_win_x86_SSE2_OpenCL_NV_r1363.exe  / Clean_20LC.wu :
AppName: AP6_win_x86_SSE2_OpenCL_NV_r1363.exe
AppArgs: 
TaskName: Clean_20LC.wu
Started at  : 01:37:12.156
Ended at    : 01:45:11.080
    478.900 secs Elapsed
    464.197 secs CPU time
Speedup     : 93.34%
Ratio       : 15.02x
 
ref-astropulse_6.01_windows_intelx86.exe-Clean_20LC.wu.res: <ap_signal>10,<pulses>0,<best_pulses>10
result-AP6_win_x86_SSE2_OpenCL_NV_r1363.exe-Clean_20LC.wu.res: <ap_signal>10,<pulses>0,<best_pulses>10
             All Signals: Weakly similar or Different.
                  Pulses: Checked   0,  0 , Strongly Similar
             Best Pulses: Weakly similar or Different.

Claggy
« Last Edit: 06 Aug 2012, 12:29:25 am by Claggy »

Offline Raistmer

Re: New set of test tasks for GPU AP
« Reply #3 on: 06 Aug 2012, 02:41:26 am »
It took quite a bit more than 300 secs, so if you attempt to tune params on your GPU the task could be reduced a little. Apart from the fixed startup time, execution time for an AP clean task should scale linearly with the number of large DM chunks involved. Each large DM chunk consists of 128 DMs. The lowest DM is 896, so the expression 896+N*128-1 can be used for the high DM field, where N is the number of large DM chunks the task will contain. In your case I would use a 12LC or 13LC task instead of the 20LC one.
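As a quick check of the expression, here is a small sketch that evaluates it for a few chunk counts. The function and constant names are made up for illustration; the numbers 896 and 128 come from the post. For N = 20 it gives 3455, the <dm_high> value quoted for Clean_20LC in the first post.

```python
LOWEST_DM = 896       # lowest DM, as stated in the post
DMS_PER_CHUNK = 128   # DMs in one large DM chunk, as stated in the post

def dm_high_for_chunks(n_chunks: int) -> int:
    """High DM field value for a task with n_chunks large DM chunks."""
    if n_chunks < 1:
        raise ValueError("a task needs at least one large DM chunk")
    return LOWEST_DM + n_chunks * DMS_PER_CHUNK - 1

for n in (1, 12, 13, 20):
    print(f"Clean_{n:02d}LC: dm_high = {dm_high_for_chunks(n)}")
```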
« Last Edit: 06 Aug 2012, 03:18:44 am by Raistmer »

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #4 on: 06 Aug 2012, 04:58:10 am »
To get ~the same execution time on the C-60 as on the HD6950, the task was reduced 20 times.
But the C-60 has 2 CUs while the HD6950 has 22 CUs: an 11 times difference, not 20.
Looks like the remaining slowdown (20/11, a factor of about 1.8) came from the memory used. The HD6950 is a discrete GPU with its own dedicated fast memory, while the C-60 is an APU that uses system memory.

It would be interesting to know how much of the slowdown comes from the different architectures of these 2 devices...
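One way to separate the contributions is to treat them as multiplicative factors. The sketch below is only a rough estimate: it uses the elapsed times posted earlier in this thread, ignores the fixed startup time, and ignores that the two runs used different unroll params and drivers. All variable names are illustrative.

```python
# Numbers taken from the posts above.
task_ratio = 20                 # Clean_20LC vs Clean_01LC (large DM chunks)
cu_ratio = 22 / 2               # HD6950 CUs / C-60 CUs
hd6950_elapsed = 304.541        # secs, Clean_20LC
c60_elapsed = 323.181           # secs, Clean_01LC

# Slowdown per unit of work, corrected for the slightly different run times.
per_chunk_slowdown = task_ratio * (c60_elapsed / hd6950_elapsed)

# What is left after accounting for the CU count alone.
residual = per_chunk_slowdown / cu_ratio

print(f"per-chunk slowdown: {per_chunk_slowdown:.1f}x")
print(f"explained by CU count: {cu_ratio:.0f}x")
print(f"residual (memory, architecture, params): {residual:.2f}x")
```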

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #5 on: 06 Aug 2012, 05:20:11 am »
Here are the first results of unroll param tuning for the HD6950. CPU idle, Cat 12.1, OS Vista x86.

As one can see, performance saturates once the unroll factor reaches a certain threshold. Unrolls that are too low are considerably inefficient. Take into account that a real-world task will have signals in it plus some % of blanking. Each of these factors adds more slowdown as unroll increases (the memory requirement increases with unroll too), so I would recommend staying at the minimal effective unroll and not raising it further without need.
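A minimal sketch of how a "minimal effective unroll" could be picked from a measured curve, assuming you have elapsed times per unroll value. The 3% tolerance is an arbitrary choice, and the timings in the example are made-up placeholders in arbitrary units, not measurements from this thread.

```python
from typing import Mapping

def minimal_effective_unroll(timings: Mapping[int, float],
                             tolerance: float = 0.03) -> int:
    """Smallest unroll whose elapsed time is within `tolerance` of the best one."""
    if not timings:
        raise ValueError("no timings given")
    best = min(timings.values())
    limit = best * (1.0 + tolerance)
    return min(u for u, t in timings.items() if t <= limit)

# Placeholder curve (NOT real data): fast drop, then saturation.
placeholder = {1: 100.0, 2: 60.0, 4: 45.0, 8: 41.0, 16: 40.0, 32: 40.5}
print(minimal_effective_unroll(placeholder))
```

With these placeholder values the function returns 8: unroll 16 is marginally faster, but within the tolerance the lower unroll is preferred.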


Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #6 on: 06 Aug 2012, 06:14:50 am »
And the same run for the C-60 APU. CPU idle, Cat 11.12 mobile, OS Win7 x64.
Quite a different picture. The bench run was aborted by a system shutdown (I left the netbook on a soft surface and it overheated). I will repeat it with good cooling to see whether the performance decrease at higher unroll is an overheating effect (APU freq drop maybe?) or is inherent to the device.
« Last Edit: 06 Aug 2012, 01:41:59 pm by Raistmer »

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #7 on: 06 Aug 2012, 06:50:37 am »
I want to share the picture TThrottle provided.
Stages:
1. Netbook idle with screen off
2. Screen comes on
3. Benchmark started, many ap_genwiz tasks are done
4. First test Clean_01LC.wu task running.
« Last Edit: 06 Aug 2012, 06:59:07 am by Raistmer »

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #8 on: 06 Aug 2012, 07:33:55 am »
And another picture:

1: Screen was switched off during the task run
2: External air cooling applied
3: Screen went off during the task run, then was woken up.

Looks like there is a performance drop when the screen goes off. Not only does the temperature decrease, the execution time increases too.

This could explain the erratic times I get on the netbook versus the monotonic dependence on the discrete GPU, where the screen is always ON. To check this I will do one run with the display always ON and another with the display turning off after 1 min of keyboard idle.
« Last Edit: 06 Aug 2012, 07:37:20 am by Raistmer »

Offline Fredericx51

  • Knight o' The Round Table
  • ***
  • Posts: 207
  • Knight Who Says Ni N!
Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #9 on: 06 Aug 2012, 08:08:54 am »
Would it be useful to do a test with an i7-2600 + 2x HD5870 GPUs, using the AP rev. 1316 app with unroll 15, ffa_block 10240, ffa_block_fetch 5120?
These give the lowest runtime and CPU time.
{Cat 12.4; AMD-APP (SDK) 2.4; OpenCL 1.2}

Or try this on a GTX470 or 480?



Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #10 on: 06 Aug 2012, 08:19:43 am »
I find it useful to get the dependence curve on a param, not just a single dot. This is not a validity test, it's tuning. I see no sense in a single dot here: it says nothing about whether good or bad params were chosen.
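A minimal sketch of a sweep that produces such a curve, using the command-line flags quoted in the first post. Everything else is an assumption: the helper names are hypothetical, and how the app locates its WU in the working directory is not described in this thread, so the run directory must be prepared the same way your bench tool does it.

```python
import subprocess
import time
from typing import Iterable, List

def build_command(exe: str, unroll: int,
                  ffa_block: int = 4096, ffa_block_fetch: int = 4096) -> List[str]:
    """Command line for one run, using the flags quoted in the first post."""
    return [exe,
            "-unroll", str(unroll),
            "-ffa_block", str(ffa_block),
            "-ffa_block_fetch", str(ffa_block_fetch)]

def sweep_commands(exe: str, unrolls: Iterable[int]) -> List[List[str]]:
    """One command per unroll value, other params held fixed."""
    return [build_command(exe, u) for u in unrolls]

def time_run(cmd: List[str], cwd: str) -> float:
    """Wall-clock seconds for one run; cwd must already contain the test WU."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start

# Only print the planned runs here; call time_run() on each to collect the curve.
for cmd in sweep_commands("AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe", range(2, 23, 4)):
    print(" ".join(cmd))
```

Repeating each point a few times and keeping the machine state fixed (display, cooling, idle CPU) makes the resulting curve easier to trust.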

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #11 on: 06 Aug 2012, 01:44:48 pm »
C-60 picture updated, extraction script added to the first post.
Looks like additional cooling and keeping the display ON can indeed make results more stable (yellow dots).

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #12 on: 06 Aug 2012, 02:54:02 pm »
Being curious, I decided to go through the whole possible range of unrolls.
In short, it breaks at 65 for this GPU: errors (-61, invalid buffer size) and then a driver restart.
Very interesting dot at unroll 64 (I will repeat it after a reboot; on driver restart the host lost its mouse cursor completely): high CPU usage. We see high CPU usage for the new FFA PC kernel sequence, where the total kernel sequence run time (w/o a sync point with the host) is quite big. I supposed that the ATi driver switches from interrupts to a busy-wait loop after some waiting threshold, hence if the kernel sequence is too long we get an increase in CPU time (in contradiction with all GPU optimization manuals, btw).
Here, as unroll increases, a single kernel becomes longer and longer, so at some point the same driver switch should occur, if it exists. This preliminary data shows that yes, it happens. It needs to be repeated a few times, of course, to be sure.

And another conclusion: an unroll of half the number of CUs is a good guess but slightly suboptimal, while going beyond an unroll equal to the number of CUs is pointless.

EDIT: added missed dots and repeated the last one a few times - it's reproducible, very high CPU usage at unroll 64 indeed! (blue dots were received after reboot, vertical line is the number of CUs for this GPU).
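A simple indicator for this kind of effect is the share of elapsed time that the host CPU spends busy. The sketch below computes it for runs already posted in this thread. The function name is illustrative, and a high value only hints at busy-waiting; it does not prove what the driver does.

```python
def cpu_fraction(cpu_secs: float, elapsed_secs: float) -> float:
    """CPU time as a fraction of elapsed time for one run."""
    if elapsed_secs <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_secs / elapsed_secs

# (CPU secs, elapsed secs) copied from the logs posted earlier in the thread.
runs = {
    "HD6950, ATI r1363, Clean_20LC": (91.573, 304.541),
    "C-60, ATI r1363, Clean_01LC, -unroll 2": (233.237, 323.181),
    "GTX460, NV r1363, Clean_20LC": (464.197, 478.900),
}

for name, (cpu, elapsed) in runs.items():
    print(f"{name}: {cpu_fraction(cpu, elapsed):.2f}")
```

Plotting this fraction against unroll, next to the elapsed-time curve, would show at which unroll the CPU usage jumps.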
« Last Edit: 06 Aug 2012, 06:57:18 pm by Raistmer »

Offline arkayn

  • Janitor o' the Board
  • Knight who says 'Ni!'
  • *****
  • Posts: 1230
  • Aaaarrrrgggghhhh
    • My Little Place On The Internet
Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #13 on: 06 Aug 2012, 03:55:04 pm »
Ran both WUs on my HD-7750 and GTX-670

Quick timetable
 
WU : #ap_genwis.dat
astropulse_6.01_windows_intelx86.exe -verbose :
  Elapsed 4.561 secs
      CPU 2.527 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 53.743 secs, speedup: -1078.32%  ratio: 0.08
      CPU 51.574 secs, speedup: -1940.92%  ratio: 0.05
AP6_win_x86_SSE2_OpenCL_NV_r1363.exe -verbose  :
  Elapsed 3.401 secs, speedup: 25.43%  ratio: 1.34
      CPU 1.420 secs, speedup: 43.81%  ratio: 1.78
 
WU : Clean_01LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
  Elapsed 718.923 secs
      CPU 715.717 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 40.220 secs, speedup: 94.41%  ratio: 17.87
      CPU 8.408 secs, speedup: 98.83%  ratio: 85.12
AP6_win_x86_SSE2_OpenCL_NV_r1363.exe -verbose  :
  Elapsed 23.584 secs, speedup: 96.72%  ratio: 30.48
      CPU 20.967 secs, speedup: 97.07%  ratio: 34.14
 
WU : Clean_20LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
  Elapsed 14193.554 secs
      CPU 14184.235 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1363.exe -verbose  :
  Elapsed 730.437 secs, speedup: 94.85%  ratio: 19.43
      CPU 122.820 secs, speedup: 99.13%  ratio: 115.49
AP6_win_x86_SSE2_OpenCL_NV_r1363.exe -verbose  :
  Elapsed 402.683 secs, speedup: 97.16%  ratio: 35.25
      CPU 385.572 secs, speedup: 97.28%  ratio: 36.79
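The speedup and ratio columns in these logs appear consistent with two simple formulas relative to the reference CPU app: ratio = reference time / test time, and speedup = (1 - test time / reference time) * 100. This is inferred from the posted numbers, not from the bench tool's source. A short sketch with illustrative function names:

```python
def ratio(ref_secs: float, test_secs: float) -> float:
    """How many times faster the tested app is than the reference."""
    return ref_secs / test_secs

def speedup_percent(ref_secs: float, test_secs: float) -> float:
    """Percentage of the reference time saved (negative means slower)."""
    return (1.0 - test_secs / ref_secs) * 100.0

# Elapsed times copied from the Clean_01LC.wu and ap_genwis rows above.
print(f"{ratio(718.923, 40.220):.2f}")            # ATI, Clean_01LC
print(f"{speedup_percent(718.923, 40.220):.2f}")  # ATI, Clean_01LC
print(f"{speedup_percent(4.561, 53.743):.2f}")    # ATI, ap_genwis (slower than CPU)
```

The negative speedup on the tiny ap_genwis task simply reflects that the run is dominated by startup time, so it says little about GPU throughput.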

Offline Raistmer

Re: GPU AP tuning: new set of test tasks for GPU AP
« Reply #14 on: 07 Aug 2012, 03:21:01 pm »
Here is the full range of unrolls for the C-60.
As expected, display ON and display OFF constitute very different modes of operation.
Although the power plans for the netbook differ only in display behavior (both PCIe settings and CPU settings were exactly the same), GPU performance was considerably different with display ON and display OFF.
It's an annoying feature for GPGPU computing, because hardly anyone will keep a netbook display always ON just for crunching. I will check whether manually turning off the display (not via the power plan but via the Fn+display-off key) results in the same slowdown...


 
