Recent Posts

1
Discussion Forum / Re: Seti is down again
« Last post by Mike on 09 Aug 2017, 10:02:44 am »
And it's back up.

Jeff posted in the Panic mode thread that it was his fault.
No big deal, we are all human.
2
Discussion Forum / Re: Seti is down again
« Last post by Mike on 09 Aug 2017, 06:38:25 am »
Yes, it is.

This happened before, not too long ago.
We will have to wait until 5 PM or so, I guess.
3
Discussion Forum / Re: Seti is down again
« Last post by Dirk on 09 Aug 2017, 03:45:45 am »
Is SETI@home really still in 'maintenance mode'? Or is something broken at my end? :-\

If I read the messages in BOINC correctly, it has been down since yesterday ~17:00 until now ~09:30 (MEST, UTC+02:00), 16 1/2 hrs...

I switched on my dual Xeon with quad R9FuryX ((maybe?) just for the upcoming 'SETI.Germany Wow!-Event 2017', which starts in ~5d:20hrs), and now the GPUs have been idle since yesterday ~23:30 and the CPUs since ~05:00.

How can my PCs build up many pending credits for results?  :(
4
Discussion Forum / Re: Seti is down again
« Last post by mr.mac52 on 25 Mar 2017, 09:59:27 pm »
I just checked my main cruncher and the queues are full at this point in time.  So I'm not sure what is really down aside from the main site.

John
5
Discussion Forum / Re: Seti is down again
« Last post by arkayn on 25 Mar 2017, 07:52:47 pm »
After only SETI being down last night, both SETI and BOINC are down currently.
6
Since summer, the PulseFind algorithm has been greatly improved in how work is split between separate kernel calls.
This splitting can now vary during a task run and takes real device performance into account.
As a result, the influence of -period_iterations_num N has changed considerably.

Here I would like to describe these changes, the new tuning options they introduce, and how to use them for guided optimization.

As described earlier, -period_iterations_num N splits a single PulseFind into N separate kernel calls, allocating M/N different folding periods to try in each call, where M is the total number of different folding periods to try for the particular input data.
This works OK if the goal is just to limit the maximum kernel length to reduce possible lags.
But each PulseFind geometry has its own M value. Moreover, even the same M values don't mean the same execution times, because different task ARs and different FFT sizes all provide different amounts of data to fold/process.
So, if M/N provides a reasonable amount of work and kernel length for the longest search, the same split will provide too little work for smaller M values and for other geometries, especially on modern fast GPU devices.
To solve this issue I developed an adaptation algorithm that profiles PulseFind kernels and monitors their lengths to decide whether the number of periods per particular call should be increased or decreased.
This makes it possible both to reduce lags and to keep good overall performance.
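To illustrate the idea only (this is not the app's actual code), here is a minimal Python sketch of such an adaptation step. The function names, the 25% tolerance and the proportional scaling rule are all assumptions made for the example; the real algorithm may differ in every detail.

```python
def initial_periods(total_periods_m, period_iterations_num_n):
    """Starting point: M/N folding periods per kernel call, never below 1."""
    return max(1, total_periods_m // period_iterations_num_n)


def adapt_periods(periods_per_call, recent_ms, target_ms=60.0, tolerance=0.25):
    """Return a new periods-per-call value from recent kernel lengths.

    recent_ms  -- lengths (ms) of the last few kernel calls (hypothetical input)
    target_ms  -- desired kernel length, 60 ms assumed as the default goal
    tolerance  -- assumed dead band around the target; not from the app
    """
    if not recent_ms:
        return periods_per_call
    sliding_mean = sum(recent_ms) / len(recent_ms)
    # Close enough to the goal: leave the split unchanged.
    if abs(sliding_mean - target_ms) <= tolerance * target_ms:
        return periods_per_call
    # Assume kernel length grows roughly linearly with periods per call.
    scaled = int(periods_per_call * target_ms / sliding_mean)
    return max(1, scaled)


if __name__ == "__main__":
    start = initial_periods(5000, 50)
    print(start, adapt_periods(start, [120.0, 120.0]))
```

Calls that run too long shrink the amount of work per call, calls that run too short grow it, and the value never drops below one period per call.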

The adaptation algorithm is guided by a few tunable command line options. The most important of those is -tt N.
It sets the desired length in milliseconds (ms) of a single PulseFind kernel call. Its default value is 60 ms. As of 2016, GPU devices are not preemptive (unlike CPUs). That is, a GPU has to finish a piece of work before it can respond to the next request. That's why it is so important to limit the length of a single kernel call to avoid GUI lags. 60 ms seems a reasonable compromise between GUI responsiveness and performance (each kernel call incurs substantial overhead, so for better performance one should try to keep the number of calls to a minimum). But it's tunable. If you feel the GUI is too laggy, try setting the -tt N option to a lower value, for example -tt 15.

How does all this influence -period_iterations_num N behavior?
The adaptation algorithm takes that M/N as the initial value but starts to change it after a few initial iterations to meet the -tt N goal. So, one can set -period_iterations_num to a very high value and reduce the first few PulseFind kernel calls to very short lengths, but soon afterwards the call length will start to increase and GUI lags (if any) will reappear. The same applies to an optimization attempt with -period_iterations_num 1 (for example). This will disable PulseFind kernel splitting only for the first few calls. After that the adaptation algorithm will start to split a single call into several again to meet the default goal of 60 ms per call (and performance may drop).

So, if you want to change the default behavior, you need to use both the -period_iterations_num N and -tt n options.

To reduce lags, set N high and n low (for example 500 and 15).
To improve performance, set N low and n high (for example 1-3 for N and 300 for n, especially if GUI lags are not important).

To learn how the adaptation algorithm interacts with your particular setup, and as an aid in optimization, look at the set of PulseFind profiling counters the app prints to its stderr after a task finishes.
Here is an example of those counters, followed by a description of how to decipher the info they provide:


Fftlength=32,pass=3:Tune: sum=106916(ms); min=15.21(ms); max=84.31(ms); mean=58.08(ms); s_mean=54.2; sleep=45(ms); delta=289; N=1841; usual
Fftlength=32,pass=4:Tune: sum=152665(ms); min=12.25(ms); max=89.75(ms); mean=59.5(ms); s_mean=48.54; sleep=45(ms); delta=185; N=2566; usual
Fftlength=32,pass=5:Tune: sum=38946.1(ms); min=9.851(ms); max=68.61(ms); mean=46.75(ms); s_mean=57.65; sleep=60(ms); delta=643; N=833; usual
Fftlength=64,pass=3:Tune: sum=67767.6(ms); min=7.595(ms); max=70.24(ms); mean=53.11(ms); s_mean=58.06; sleep=60(ms); delta=457; N=1276; usual
Fftlength=64,pass=4:Tune: sum=82488.8(ms); min=5.996(ms); max=78.9(ms); mean=58.63(ms); s_mean=60.21; sleep=60(ms); delta=354; N=1407; usual
Fftlength=64,pass=5:Tune: sum=33600.6(ms); min=5.2(ms); max=64.29(ms); mean=38.31(ms); s_mean=59.64; sleep=60(ms); delta=732; N=877; usual
Fftlength=128,pass=3:Tune: sum=39667.5(ms); min=3.837(ms); max=80.19(ms); mean=37.28(ms); s_mean=49.33; sleep=45(ms); delta=997; N=1064; usual
Fftlength=128,pass=4:Tune: sum=32580.9(ms); min=3.373(ms); max=78.17(ms); mean=31.79(ms); s_mean=38.77; sleep=30(ms); delta=1112; N=1025; usual
Fftlength=128,pass=5:Tune: sum=17835.2(ms); min=2.578(ms); max=47.37(ms); mean=18.75(ms); s_mean=21.72; sleep=15(ms); delta=1082; N=951; usual
Fftlength=256,pass=3:Tune: sum=32437(ms); min=1.931(ms); max=43.89(ms); mean=29.3(ms); s_mean=43.32; sleep=45(ms); delta=1194; N=1107; usual
Fftlength=256,pass=4:Tune: sum=23303.3(ms); min=1.54(ms); max=31.68(ms); mean=21.9(ms); s_mean=31.03; sleep=30(ms); delta=1151; N=1064; usual
Fftlength=256,pass=5:Tune: sum=17114.3(ms); min=1.318(ms); max=23.98(ms); mean=16.76(ms); s_mean=22.92; sleep=15(ms); delta=1108; N=1021; usual
Fftlength=512,pass=3:Tune: sum=32597(ms); min=0.9826(ms); max=22.52(ms); mean=19.43(ms); s_mean=21.76; sleep=15(ms); delta=1721; N=1678; usual
Fftlength=512,pass=4:Tune: sum=23698.2(ms); min=0.7869(ms); max=17.35(ms); mean=14.3(ms); s_mean=16.03; sleep=15(ms); delta=1700; N=1657; usual
Fftlength=512,pass=5:Tune: sum=17414(ms); min=0.6734(ms); max=12.65(ms); mean=10.65(ms); s_mean=11.65; sleep=0(ms); delta=1678; N=1635; usual
Fftlength=1024,pass=3:Tune: sum=72685.6(ms); min=0.5037(ms); max=25.85(ms); mean=23.54(ms); s_mean=24.48; sleep=15(ms); delta=3109; N=3088; high_perf
Fftlength=1024,pass=4:Tune: sum=452.15(ms); min=0.4042(ms); max=8.098(ms); mean=3.3(ms); s_mean=6.856; sleep=0(ms); delta=3098; N=137; usual
Fftlength=1024,pass=5:Tune: sum=326.168(ms); min=0.3532(ms); max=6.036(ms); mean=2.589(ms); s_mean=5.704; sleep=0(ms); delta=3087; N=126; usual
Fftlength=2048,pass=3:Tune: sum=71583.2(ms); min=5.29(ms); max=12.58(ms); mean=11.94(ms); s_mean=12.02; sleep=15(ms); delta=1; N=5997; high_perf
Fftlength=4096,pass=3:Tune: sum=78586.3(ms); min=2.574(ms); max=6.829(ms); mean=6.553(ms); s_mean=6.538; sleep=0(ms); delta=1; N=11993; high_perf
Fftlength=8192,pass=3:Tune: sum=91514.2(ms); min=1.448(ms); max=4.018(ms); mean=3.815(ms); s_mean=3.809; sleep=0(ms); delta=1; N=23987; high_perf


As was mentioned, each FFT size for a task with a particular AR provides its own amount of work, so the algorithm (and the counters) keeps separate tracks for all possible FFT sizes/lengths.

pass=3,4,5 designates the particular folding procedure, which also influences kernel call length.
The info about the particular case follows Tune:.
sum is the sum of the lengths of all such PulseFind kernel calls. From it one can estimate how long the GPU spent processing this particular type of work inside this particular task.
min/max/mean - the corresponding kernel call lengths in ms.
s_mean is the so-called sliding mean, computed not over all calls but only over the last few. The algorithm looks at this value to decide if and how it should change work splitting to meet the target time (tt) goal.

sleep: if -use_sleep is active, this is the argument of the Sleep() call. The algorithm takes the sleep quantum into account and tries to optimize the number of such quanta so as to sleep only as long as needed, keeping both performance and CPU usage at the optimum. If -use_sleep is not used (Sleep disabled), ignore it.

delta: the M/N value mentioned earlier.
It can't be less than 1. If you see delta=1 and still have GUI lags, the -tt/-period_iterations_num options can't help in such a case.

N: the total number of calls of this kind in this particular task.

usual/high_perf: Sometimes it's possible to merge the 3 separate passes (3, 4, 5) into a single kernel sequence that completes all 3 in a fraction of the target time. In such a case the passes are merged and the high_perf modifier appears. The useful info is then contained in the pass=3 counter; the pass=4/5 counters stop updating after the merge.
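If you want to go through many stderr files, the counters are easy to pull apart with a small script. The Python sketch below is my own helper, not part of the app; the function name and the dictionary layout are assumptions, and it only relies on the line format shown in the example above.

```python
import re

# Matches the prefix of a counter line, e.g. "Fftlength=32,pass=3:Tune: ..."
LINE_RE = re.compile(r"Fftlength=(\d+),pass=(\d+):Tune:\s*(.*)")


def parse_tune_line(line):
    """Parse one PulseFind counter line into a dict, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {"fftlength": int(match.group(1)), "pass": int(match.group(2))}
    for field in match.group(3).split(";"):
        field = field.strip()
        if not field:
            continue
        if "=" in field:
            key, value = field.split("=", 1)
            value = value.replace("(ms)", "").strip()
            # delta and N are counts; everything else is a time in ms.
            if key.strip() in ("delta", "N"):
                record[key.strip()] = int(value)
            else:
                record[key.strip()] = float(value)
        else:
            # Trailing modifier: "usual" or "high_perf".
            record["mode"] = field
    return record


if __name__ == "__main__":
    sample = ("Fftlength=32,pass=3:Tune: sum=106916(ms); min=15.21(ms); "
              "max=84.31(ms); mean=58.08(ms); s_mean=54.2; sleep=45(ms); "
              "delta=289; N=1841; usual")
    print(parse_tune_line(sample))
```

Lines that do not look like counter lines return None, so the function can be applied to every line of a stderr dump.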

And now, how to use all this info for guided optimization?
1) Look at the max values. If they are significantly higher than the mean ones, increase the -period_iterations_num value to reduce lags. That is, the starting point requires correction.

2) Look at the mean and s_mean values. The first few lines correspond to the hardest cases for the GPU, so concentrate on them. If you see mean times of ~60 ms (at default settings), your setup will respond to a -tt value increase (there is optimization potential here). If s_mean is lower than the default 60 (or lower than the provided -tt value) even for the first few lines, there isn't enough work to load the GPU. Increasing the -tt value further will not help; consider increasing the -sbs value instead, or just try running a few tasks at once on such a GPU.

On the other hand, if the listed s_mean times are already considerably lower than the -tt value but lags are still present, playing with -period_iterations_num/-tt will hardly help. Try describing your config in the forum support thread and maybe some specific solution can be found.

There can be a case (on slow devices) where, with -tt set to low values, the first few lines show s_mean ~ tt value, but lines closer to the bottom (with bigger FFT sizes) still have s_mean greater than the desired tt value and GUI lags can't be reduced.
If you experience such a situation, please describe your config in the forum support thread (for 8.19 it's https://setiathome.berkeley.edu/forum_thread.php?id=80381 ; a similar one will be created for new releases too).
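The rules above can be roughly sketched in Python as follows. This is only an illustration: the function name, the record layout (a dict with max, mean, s_mean and delta) and the thresholds (max more than 2x mean counts as "significantly higher", within 15% counts as "about tt") are my assumptions, not values used by the app. Rule 2 is meant for the first few counter lines.

```python
def advise(record, tt_ms=60.0, max_ratio=2.0, near=0.15):
    """Return a list of tuning hints for one counter line.

    record    -- dict with "max", "mean", "s_mean" and "delta" keys
    tt_ms     -- the -tt value in effect (60 ms default)
    max_ratio -- assumed threshold for "max significantly higher than mean"
    near      -- assumed relative band for "mean is about tt"
    """
    hints = []
    # Rule 1: long outliers mean the starting split needs correction.
    if record["max"] > max_ratio * record["mean"]:
        hints.append("raise -period_iterations_num")
    # Rule 2: mean at the target means a higher -tt may pay off;
    # s_mean well below the target means the GPU lacks work.
    if abs(record["mean"] - tt_ms) <= near * tt_ms:
        hints.append("try a higher -tt")
    elif record["s_mean"] < (1 - near) * tt_ms:
        hints.append("GPU underloaded: try -sbs or more tasks")
    # delta cannot go below 1, so calls cannot be split any further.
    if record["delta"] == 1:
        hints.append("delta=1: -tt/-period_iterations_num cannot shorten calls")
    return hints


if __name__ == "__main__":
    first_line = {"max": 84.31, "mean": 58.08, "s_mean": 54.2, "delta": 289}
    print(advise(first_line))
```

Treat the output as a starting point for experiments, not as a verdict; the thresholds should be adjusted to taste.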

EDIT: it seems things have started to change regarding preemption on GPU devices.
The NV Pascal architecture has adequate preemption mechanisms according to the architecture description: http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
I still have to verify that experimentally, though.
7
Discussion Forum / Re: Loading APU to the limit: performance considerations
« Last post by Mike on 05 Nov 2016, 06:49:26 am »

Don't forget you have a first-generation APU based on Bulldozer.
Kaveri and Carrizo APUs have better management.
Same for FX CPUs.
Steamroller-based CPUs are better than Bulldozer.
8
Discussion Forum / Re: Loading APU to the limit: performance considerations
« Last post by Raistmer on 02 Nov 2016, 07:14:11 pm »
I have finally processed all the data.
It's amazing how repeatable those benches were.
There are 3 clearly separated groups of results:
pinned to different modules, pinned to the same Bulldozer module, and not pinned (a superposition of those 2 states).
The size of the performance drop when pinned to a single module means the CPUs in the same module can't really be called "cores" - they are too interrelated for that. The overall picture resembles a 2-core CPU with a kind of hyperthreading much more than a 4-core one.

So, for the SETI project, Bulldozer-based AMD CPUs have very impaired performance.
9
Discussion Forum / Re: Loading APU to the limit: performance considerations
« Last post by Raistmer on 24 Oct 2016, 10:19:18 am »
Thanks to EdwardPF's hint about his own config, I'll continue exploring AMD's Bulldozer CPU architecture (with FPU and L1 caches shared between a pair of "cores") using the Trinity A10-5700 APU (its CPU part) as an example.
To explore the influence of affinity, I will now return to the older methodology where the additional load comes not from background BOINC processes but from multiple bench instances.
Each bench instance will be pinned to a single logical CPU (different CPUs will be used for different runs).
To avoid overloading the whole system, I'll test with only 2 simultaneous tasks first. So, in some runs the tasks will share the same "module" of 2 "cores", and in others they will be placed on really different cores.
Another change is Marco Franceschini's FFTW 3.3.5 x64 AVX DLL binary (thanks for it!), which was tested to generate AVX codelets on this hardware (the "stock" x64 DLL generates only SSE2/register ones).
To allow continuous load, each benchmark ended with a pair of renamed full BLC tasks (so the last PG VHAR could be partially paired with that BLC). As before, CPU throughput will be measured in PS_set/s units.
10
Discussion Forum / Re: Loading APU to the limit: performance considerations
« Last post by Raistmer on 23 Oct 2016, 02:39:44 am »
And here is the C-60 Loveland data.

The picture is quite different here (vs Trinity). CPU load scaling is almost linear, and the GPU almost doubles total device throughput/performance.
Load on the CPU part has practically no effect on the GPU part, and vice versa.
Though the C-60 is quite a weak device per se (it's a netbook APU), its CPU and GPU parts truly augment each other.