Author Topic: AstroPulse v7 performance illustration (Read 30875 times)

Raistmer · « **on:** 24 Aug 2014, 04:25:28 am »

Here is a very early graph of APv7 performance on CPU and ATi GPU.

The upper part shows how CPU APv7 run times depend on blanking. As one can see, the dependence is still linear, but now run time decreases as blanking % increases, rather than increasing as it did for APv6. With more blanking there is actually less data to search, so the new version reflects this better.

The lower part of the graph shows ATi GPU performance in a purely default config, with all CPU cores busy and w/o any additional params in config files.
As one can see, it's not the best way to run the GPU app: elapsed time fluctuates a lot, which decreases host performance.
A better way is to either free a CPU core or use the -cpu_lock switch, which will be tested later.

EDIT: to describe the test configuration more precisely: results were collected while the host was running ONLY APv7 tasks on both CPU and GPU. No APv6 or MB7 tasks were in progress during data collection. Some non-BOINC activity was still possible, but nothing too heavy.

Mike · « **Reply #1 on:** 24 Aug 2014, 04:44:04 am »

I experienced the same.
On CPU, higher blankings result in shorter run times, whilst on GPU processing times don't differ much on my fast GPU.
But it depends on what's running on the CPU.
On my FX, running APs on CPU and GPU at the same time slows the GPU down a little bit, by ~500 seconds.

Richard Haselgrove · « **Reply #2 on:** 24 Aug 2014, 04:58:26 am »

That's really good news on the blanked task runtimes - worth all the effort.

One of the ways we could approach the CPU core release generically, rather than by expecting people to make manual adjustments post-installation, would be to increase the CPU usage value in the AI stub. Remembering that BOINC will allow over-commitment of the CPU by any fractional amount up to 0.99, I'd suggest we use a fraction above 0.50, but below 0.99.

For example, I tend to use 0.67. That means that if two AP tasks run together, BOINC automatically releases one extra core - three tasks, and it's an extra two, but it's still two with four tasks running. There's no exact science behind that choice, just a gut feeling (and I don't usually run AP anyway), but it's an approach we have time to consider and fine tune before the live release (I hope).
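To make the arithmetic above concrete, here is a minimal sketch. It assumes a simplified model in which only the whole cores in the summed CPU budget count as released; the function name `cores_released` is made up for illustration, and this is not BOINC's actual scheduler code.

```python
import math

def cores_released(n_tasks: int, cpu_fraction: float) -> int:
    """Whole CPU cores set aside for n_tasks GPU tasks (simplified model).

    Assumption: the client sums the per-task CPU fractions and only the
    whole part of that sum takes cores away from CPU tasks.
    """
    # round() guards against floating-point noise just below a whole number
    return math.floor(round(n_tasks * cpu_fraction, 9))

# Compare the 0.67 setting with a full 1.00 per task
for n in range(1, 5):
    print(n, "task(s):", cores_released(n, 0.67), "core(s) at 0.67,",
          cores_released(n, 1.00), "core(s) at 1.00")
```

Under this model a single task at 0.67 releases no core, two tasks release one, and both three and four tasks release two, which matches the description above.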

Raistmer · « **Reply #3 on:** 24 Aug 2014, 05:54:25 am »

Quote from: Richard Haselgrove on 24 Aug 2014, 04:58:26 am

That's really good news on the blanked task runtimes - worth all the effort.

One of the ways we could approach the CPU core release generically, rather than by expecting people to make manual adjustments post-installation, would be to increase the CPU usage value in the AI stub. Remembering that BOINC will allow over-commitment of the CPU by any fractional amount up to 0.99, I'd suggest we use a fraction above 0.50, but below 0.99.

For example, I tend to use 0.67. That means that if two AP tasks run together, BOINC automatically releases one extra core - three tasks, and it's an extra two, but it's still two with four tasks running. There's no exact science behind that choice, just a gut feeling (and I don't usually run AP anyway), but it's an approach we have time to consider and fine tune before the live release (I hope).

Will it free a core when a single instance is running?

Richard Haselgrove · « **Reply #4 on:** 24 Aug 2014, 06:04:17 am »

Quote from: Raistmer on 24 Aug 2014, 05:54:25 am

Will it free a core when a single instance is running?

Not unless we go all the way up to 1.00 - which I think is possibly (but open to discussion) overkill for general users. And if we do that, we consume a whole thread of CPU for each and every AP in progress - which, again, I think is a bit over the top. Not every user is a hardcore optimizer! (and those who are, can look after themselves, as always).

Mike · « **Reply #5 on:** 24 Aug 2014, 06:09:50 am »

Quote from: Richard Haselgrove on 24 Aug 2014, 06:04:17 am

Quote from: Raistmer on 24 Aug 2014, 05:54:25 am

Will it free a core when a single instance is running?

Not unless we go all the way up to 1.00 - which I think is possibly (but open to discussion) overkill for general users. And if we do that, we consume a whole thread of CPU for each and every AP in progress - which, again, I think is a bit over the top. Not every user is a hardcore optimizer! (and those who are, can look after themselves, as always).

You mean like Mark and Juan?

Richard Haselgrove · « **Reply #6 on:** 24 Aug 2014, 06:11:34 am »

LOL.

We'll teach them software yet!

Raistmer · « **Reply #7 on:** 24 Aug 2014, 06:11:53 am »

Then I would suggest enabling the -cpu_lock option by default in the aistub and installer instead.
I'll test this suggestion in the next few days. Maybe this should be the default mode for stock too, we'll see.

Richard Haselgrove · « **Reply #8 on:** 24 Aug 2014, 06:14:37 am »

Quote from: Raistmer on 24 Aug 2014, 06:11:53 am

Then I would suggest enabling the -cpu_lock option by default in the aistub and installer instead.
I'll test this suggestion in the next few days. Maybe this should be the default mode for stock too, we'll see.

Makes sense. Like you, I want to make the minimal changes possible for the interim v0.42 installer, so there'll be no behavioural change in v0.42c for people to get used to, but I think it's a discussion worth having between v0.42 and v0.43.

Raistmer · « **Reply #9 on:** 28 Aug 2014, 10:48:56 am »

And here is an update of the initial graph with -cpu_lock data added.

As one can see, using -cpu_lock on the GPU app leads to some increase in elapsed time for CPU apps (and reported CPU time remains the same). Maybe it reflects the OS's inability to move the GPU app process between cores, so it either leaves a core idle for some time or starts to move CPU app processes more actively, which increases overhead. This decreases performance on the CPU side.
But there is a much more noticeable effect on the GPU side. Now elapsed times for GPU tasks have much less deviation and are grouped toward their bottom boundary (especially for low-blanked tasks). This leads to a big increase in productivity on the GPU side that will offset some loss of performance on the CPU side. Also, CPU side performance can be affected by non-BOINC activity on this host, so a larger data set is required to make sure CPU performance really has decreased. But the increase in GPU performance is obvious. Compare the blue and green dots. The number of dots is comparable, so the data sets can be directly compared.
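For anyone who wants to put numbers on "less deviation and grouped toward the bottom boundary" with their own task logs, here is a rough sketch. The helper name `spread_summary` and the sample values are placeholders made up for illustration; they are not measurements from this host or from the graph.

```python
import statistics

def spread_summary(elapsed_seconds):
    """Summarize how tightly elapsed times cluster near the fastest run."""
    fastest = min(elapsed_seconds)
    median = statistics.median(elapsed_seconds)
    return {
        "min": fastest,
        "median": median,
        "stdev": statistics.stdev(elapsed_seconds),
        # close to 1.0 means most runs sit near the bottom boundary
        "median_over_min": median / fastest,
    }

# Placeholder values only (NOT real data from this thread)
default_runs = [1000, 1400, 1200, 1800]
locked_runs = [1000, 1050, 1020, 1030]

print("default :", spread_summary(default_runs))
print("cpu_lock:", spread_summary(locked_runs))
```

A lower stdev together with a median_over_min ratio close to 1.0 would correspond to the tighter grouping described above. The comparison is only meaningful for tasks with similar blanking.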

After a little more data collection in the -cpu_lock regime I plan to show how freeing a CPU core (w/o -cpu_lock) affects host performance.

EDIT: more data added, graph updated. Now in "1 core idle" mode for collecting new data.

Raistmer · « **Reply #10 on:** 29 Aug 2014, 05:13:14 pm »

And very preliminary data (a tiny dataset for now) with 1 CPU core freed and defaults for the GPU app.
As one can see, there is no noticeable speedup for CPU apps in this mode (perhaps because the host has real cores and a big enough cache, it's a Q9450) and no improvement for the GPU app compared with -cpu_lock and full CPU load.
All these preliminary conclusions mean that freeing 1 core in the tested case gives worse performance than the -cpu_lock option (the host loses 1 core's production w/o improvements on other devices).
We'll see if this holds true as more data is collected.

Raistmer · « **Reply #11 on:** 13 Sep 2014, 05:46:06 pm »

And another update. Due to the long mess with CPU app selection on beta, I have now collected many GPU tasks paired with the plain 7.00 CPU app.
1 core was still idle. But as one can see, sometimes leaving 1 core free is not enough. Maybe there were moments of non-BOINC host activity, maybe just an unlucky OS decision, but elapsed times in this config are definitely sometimes bigger than with all cores busy and -cpu_lock enabled.

The data sets are not comparable now, though: many more points were acquired for the idle-core case.

Next will be 2 GPU tasks per HD6950 with -cpu_lock enabled and all CPU cores busy again.

Raistmer · « **Reply #12 on:** 12 Oct 2014, 12:39:17 pm »

Here is the long-promised graph for 2 instances with -cpu_lock on a fully loaded CPU.
As one can see, there is an additional benefit from running 2 instances in my config. Also, -cpu_lock works well in this case too.
As a side note, x86 SSE3 AP works better than SSE x86 on Core2 Quad at full load (orange dots vs black and grey ones).

Now I am switching configuration quite drastically (APv7 was released on main), so the next tests will be in my "production environment" conditions. That is, all CPU cores occupied with the latest MB (AKv8) app while the GPU runs AP (the initial run will be with the same scaled defaults and -cpu_lock x2, to merge dots between all previous and new ones).
