Recent Posts


Discussion Forum / Re: Seti is down again

« Last post by arkayn on 25 Mar 2017, 07:52:47 pm »

After only SETI being down last night, both SETI and BOINC are down currently.

Discussion Forum / Re: Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

« Last post by Raistmer on 11 Dec 2016, 06:30:56 am »

Since the summer, the PulseFind algorithm has been greatly improved in how work is split between separate kernel calls.
This splitting can now vary during a task run and takes real device performance into account.
As a result, the influence of -period_iterations_num N has changed considerably.

Here I would like to describe these changes, the new tuning options they introduce, and how to use them for guided optimization.

As described earlier, -period_iterations_num N splits a single PulseFind into N separate kernel calls, allocating M/N different folding periods to try in each call, where M is the total number of different folding periods to try for the particular input data.
This works OK if the goal is just to limit the maximum kernel length to reduce possible lags.
But each PulseFind geometry has its own M value. Moreover, even equal M values don't mean equal execution times, because different task ARs and different FFT sizes all provide different amounts of data to fold/process.
So, if M/N provides a reasonable amount of work and kernel length for the longest search, the same split will provide too little work for smaller M values and for other geometries, especially on modern fast GPU devices.
To solve this issue I developed an adaptation algorithm that profiles PulseFind kernels and monitors their lengths to decide whether the number of periods per particular call should be increased or decreased.
This makes it possible to both reduce lags and keep good overall performance.
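
A minimal Python sketch of the idea behind such an adaptation step, not the actual app code. The function name and the proportional update rule are assumptions made for illustration; only the target time (60 ms by default) and the lower bound of one period per call come from the description in this post.

```python
def adapt_periods_per_call(periods_per_call, recent_ms, target_ms=60.0):
    """Return a new periods-per-call value that moves kernel length toward target_ms.

    Hypothetical rule: scale the current value by target time / mean of the
    recently measured kernel lengths (recent_ms, in milliseconds).
    """
    if not recent_ms:
        return periods_per_call  # nothing measured yet, keep the initial M/N
    sliding_mean = sum(recent_ms) / len(recent_ms)
    scaled = int(periods_per_call * target_ms / sliding_mean)
    return max(1, scaled)  # never fewer than one period per call


# Kernels ran ~200 ms with 100 periods per call: cut the work per call.
shorter = adapt_periods_per_call(100, [200.0])
# Kernels ran ~30 ms with 100 periods per call: give each call more work.
longer = adapt_periods_per_call(100, [30.0, 30.0])
```

With a lower target_ms the same rule shrinks the work per call further, which is the lag/performance trade-off discussed in this post.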

The adaptation algorithm is guided by a few tunable command line options. The most important of these is -tt N.
It sets the desired length in milliseconds (ms) of a single PulseFind kernel call. Its default value is 60 ms. As of 2016, GPU devices are not preemptive (unlike CPUs). That is, the GPU has to finish its current piece of work before it can respond to the next request. That's why it is so important to limit the length of a single kernel call to avoid GUI lags. 60 ms seems a reasonable compromise between GUI responsiveness and performance (each kernel call incurs substantial overhead, so for better performance one should keep the number of calls to a minimum). But it's tunable. If you feel the GUI is too laggy, try setting the -tt N option to a lower value, for example -tt 15.

How does all this influence -period_iterations_num N behavior?
The adaptation algorithm takes that M/N as the initial value but starts to change it after a few initial iterations to meet the -tt N goal. So, one can set -period_iterations_num to a very high value and make the first few PulseFind kernel calls very short, but soon afterwards the call length will start to increase and GUI lags (if any) will reappear. The same applies to an optimization attempt with -period_iterations_num 1 (for example). This will disable PulseFind kernel splitting only for the first few calls. After that the adaptation algorithm will start to split a single call into several again to meet the default goal of 60 ms per call (and performance may drop).

So, if you want to change the default behavior, you need to use both the -period_iterations_num N and -tt n options.

To reduce lags, set N high and n low (for example, 500 and 15).
To improve performance, set N low and n high (for example, 1-3 for N and 300 for n, especially if GUI lags are not important).

To learn how the adaptation algorithm interacts with your particular setup, and as an aid in optimization, look at the set of PulseFind profiling counters the app prints to its stderr after the task finishes.
Here is an example of those counters and a description of how to decipher the info they provide:

Fftlength=32,pass=3:Tune: sum=106916(ms); min=15.21(ms); max=84.31(ms); mean=58.08(ms); s_mean=54.2; sleep=45(ms); delta=289; N=1841; usual
Fftlength=32,pass=4:Tune: sum=152665(ms); min=12.25(ms); max=89.75(ms); mean=59.5(ms); s_mean=48.54; sleep=45(ms); delta=185; N=2566; usual
Fftlength=32,pass=5:Tune: sum=38946.1(ms); min=9.851(ms); max=68.61(ms); mean=46.75(ms); s_mean=57.65; sleep=60(ms); delta=643; N=833; usual
Fftlength=64,pass=3:Tune: sum=67767.6(ms); min=7.595(ms); max=70.24(ms); mean=53.11(ms); s_mean=58.06; sleep=60(ms); delta=457; N=1276; usual
Fftlength=64,pass=4:Tune: sum=82488.8(ms); min=5.996(ms); max=78.9(ms); mean=58.63(ms); s_mean=60.21; sleep=60(ms); delta=354; N=1407; usual
Fftlength=64,pass=5:Tune: sum=33600.6(ms); min=5.2(ms); max=64.29(ms); mean=38.31(ms); s_mean=59.64; sleep=60(ms); delta=732; N=877; usual
Fftlength=128,pass=3:Tune: sum=39667.5(ms); min=3.837(ms); max=80.19(ms); mean=37.28(ms); s_mean=49.33; sleep=45(ms); delta=997; N=1064; usual
Fftlength=128,pass=4:Tune: sum=32580.9(ms); min=3.373(ms); max=78.17(ms); mean=31.79(ms); s_mean=38.77; sleep=30(ms); delta=1112; N=1025; usual
Fftlength=128,pass=5:Tune: sum=17835.2(ms); min=2.578(ms); max=47.37(ms); mean=18.75(ms); s_mean=21.72; sleep=15(ms); delta=1082; N=951; usual
Fftlength=256,pass=3:Tune: sum=32437(ms); min=1.931(ms); max=43.89(ms); mean=29.3(ms); s_mean=43.32; sleep=45(ms); delta=1194; N=1107; usual
Fftlength=256,pass=4:Tune: sum=23303.3(ms); min=1.54(ms); max=31.68(ms); mean=21.9(ms); s_mean=31.03; sleep=30(ms); delta=1151; N=1064; usual
Fftlength=256,pass=5:Tune: sum=17114.3(ms); min=1.318(ms); max=23.98(ms); mean=16.76(ms); s_mean=22.92; sleep=15(ms); delta=1108; N=1021; usual
Fftlength=512,pass=3:Tune: sum=32597(ms); min=0.9826(ms); max=22.52(ms); mean=19.43(ms); s_mean=21.76; sleep=15(ms); delta=1721; N=1678; usual
Fftlength=512,pass=4:Tune: sum=23698.2(ms); min=0.7869(ms); max=17.35(ms); mean=14.3(ms); s_mean=16.03; sleep=15(ms); delta=1700; N=1657; usual
Fftlength=512,pass=5:Tune: sum=17414(ms); min=0.6734(ms); max=12.65(ms); mean=10.65(ms); s_mean=11.65; sleep=0(ms); delta=1678; N=1635; usual
Fftlength=1024,pass=3:Tune: sum=72685.6(ms); min=0.5037(ms); max=25.85(ms); mean=23.54(ms); s_mean=24.48; sleep=15(ms); delta=3109; N=3088; high_perf
Fftlength=1024,pass=4:Tune: sum=452.15(ms); min=0.4042(ms); max=8.098(ms); mean=3.3(ms); s_mean=6.856; sleep=0(ms); delta=3098; N=137; usual
Fftlength=1024,pass=5:Tune: sum=326.168(ms); min=0.3532(ms); max=6.036(ms); mean=2.589(ms); s_mean=5.704; sleep=0(ms); delta=3087; N=126; usual
Fftlength=2048,pass=3:Tune: sum=71583.2(ms); min=5.29(ms); max=12.58(ms); mean=11.94(ms); s_mean=12.02; sleep=15(ms); delta=1; N=5997; high_perf
Fftlength=4096,pass=3:Tune: sum=78586.3(ms); min=2.574(ms); max=6.829(ms); mean=6.553(ms); s_mean=6.538; sleep=0(ms); delta=1; N=11993; high_perf
Fftlength=8192,pass=3:Tune: sum=91514.2(ms); min=1.448(ms); max=4.018(ms); mean=3.815(ms); s_mean=3.809; sleep=0(ms); delta=1; N=23987; high_perf

As was mentioned, each FFT size for a task with a particular AR provides its own amount of work, so the algorithm (and the counters) keep separate track of every possible FFT size/length.

pass=3,4,5 designates the particular folding procedure, which also influences kernel call length.
After Tune: comes the info about the particular case.
sum: the sum of the lengths of all such PulseFind kernel calls. From it one can estimate how long the GPU spent processing this particular type of work inside this particular task.
min/max/mean: the corresponding kernel call lengths in ms.
s_mean: the so-called sliding mean, computed over only the last few calls rather than all of them. The algorithm looks at this value to decide whether and how it should change work splitting to meet the target time (tt) goal.

sleep: if -use_sleep is active, this is the argument of the Sleep() call. The algorithm takes the sleep quantum into account and tries to optimize the number of such quanta so as to sleep only as long as needed, keeping both performance and CPU usage at an optimum. If -use_sleep is not used (Sleep disabled), ignore it.

delta: the M/N value mentioned earlier.
It can't be less than 1. If you see delta=1 and still have GUI lags, the -tt/-period_iterations_num options can't help in that case.

N: the total number of calls of this kind in this particular task.

usual/high_perf: Sometimes it's possible to merge the 3 separate passes (3, 4, 5) into a single kernel sequence and complete all 3 in a fraction of the target time. In that case the passes are merged and the high_perf modifier appears. The useful info is then contained in the pass=3 counter; the pass=4/5 counters stop updating after the merge.
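
The counters described above can be pulled out of a saved stderr with a short script. This is only a sketch: the regular expression is written against the sample lines quoted in this post, and the function name and dictionary keys are my own, not something the app provides.

```python
import re

# Pattern derived from the counter lines quoted in the post (an assumption
# about the format, not a documented specification).
TUNE_RE = re.compile(
    r"Fftlength=(?P<fft>\d+),pass=(?P<pass_num>\d+):Tune: "
    r"sum=(?P<sum>[\d.]+)\(ms\); min=(?P<min>[\d.]+)\(ms\); "
    r"max=(?P<max>[\d.]+)\(ms\); mean=(?P<mean>[\d.]+)\(ms\); "
    r"s_mean=(?P<s_mean>[\d.]+); sleep=(?P<sleep>[\d.]+)\(ms\); "
    r"delta=(?P<delta>\d+); N=(?P<n>\d+); (?P<mode>usual|high_perf)"
)


def parse_tune_counters(text):
    """Return one dict per Tune counter found anywhere in text."""
    rows = []
    for m in TUNE_RE.finditer(text):
        d = m.groupdict()
        rows.append({
            "fft": int(d["fft"]),
            "pass": int(d["pass_num"]),
            "sum_ms": float(d["sum"]),
            "min_ms": float(d["min"]),
            "max_ms": float(d["max"]),
            "mean_ms": float(d["mean"]),
            "s_mean_ms": float(d["s_mean"]),
            "sleep_ms": float(d["sleep"]),
            "delta": int(d["delta"]),
            "n_calls": int(d["n"]),
            "mode": d["mode"],
        })
    return rows


sample = (
    "Fftlength=32,pass=3:Tune: sum=106916(ms); min=15.21(ms); max=84.31(ms); "
    "mean=58.08(ms); s_mean=54.2; sleep=45(ms); delta=289; N=1841; usual "
    "Fftlength=2048,pass=3:Tune: sum=71583.2(ms); min=5.29(ms); max=12.58(ms); "
    "mean=11.94(ms); s_mean=12.02; sleep=15(ms); delta=1; N=5997; high_perf"
)
rows = parse_tune_counters(sample)
```

Because it scans with finditer, it does not matter whether the counters arrive one per line or run together on one line.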

And now, how to use all this info for guided optimization?
1) Look at the max values. If they are significantly higher than the mean ones, increase the -period_iterations_num value to reduce lags. That is, the starting point requires correction.

2) Look at the mean and s_mean values. The first few lines correspond to the hardest cases for the GPU, so concentrate on them. If you see mean times of ~60 ms (at default settings), your setup will respond to an increase of the -tt value (there is optimization potential here). If s_mean is lower than the default 60 (or lower than the provided -tt value) even for the first few lines, there isn't enough work to load the GPU. Increasing the -tt value further will not help; consider increasing the -sbs value instead, or just try running a few tasks at once on such a GPU.

On the other hand, if the listed s_mean times are already considerably lower than the -tt value but lags are still present, playing with -period_iterations_num/-tt will hardly help. Try describing your config in the forum support thread and maybe some specific solution can be found.

There can be a case (on slow devices) where, with -tt set to low values, the first few lines show s_mean ~ tt, but lines closer to the bottom (with bigger FFT sizes) still have s_mean greater than the desired tt value and GUI lags can't be reduced.
If you experience such a situation, please describe your config in the forum support thread (for 8.19 it's https://setiathome.berkeley.edu/forum_thread.php?id=80381 ; similar threads will be created for new releases too).
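
The two rules of thumb above can be written down as a rough checker. This is a sketch only: the thresholds (what counts as "significantly higher" or "~60 ms") and how many leading lines count as "the first few" are my assumptions for illustration, not values used by the app.

```python
def tuning_hints(rows, tt_ms=60.0, max_ratio=1.5, near=0.9, hardest=3):
    """Suggest options to try, given counters in the order they appear in stderr.

    rows: list of dicts with 'max_ms', 'mean_ms' and 's_mean_ms' keys.
    max_ratio, near and hardest are hypothetical thresholds.
    """
    hints = []
    # Rule 1: max far above mean means the starting split is too coarse.
    if any(r["max_ms"] > max_ratio * r["mean_ms"] for r in rows):
        hints.append("increase -period_iterations_num")
    # Rule 2: look only at the first few (hardest) lines.
    head = rows[:hardest]
    if head and all(r["mean_ms"] >= near * tt_ms for r in head):
        hints.append("try a higher -tt")
    elif head and all(r["s_mean_ms"] < near * tt_ms for r in head):
        hints.append("try a higher -sbs or more tasks per GPU")
    return hints


busy = tuning_hints([{"max_ms": 100.0, "mean_ms": 59.0, "s_mean_ms": 58.0}])
idle = tuning_hints([{"max_ms": 20.0, "mean_ms": 18.0, "s_mean_ms": 19.0}])
```

Treat the output as a starting point for experiments, not as a verdict.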

EDIT: it seems things have started to change regarding preemption on GPU devices.
The NV Pascal architecture has adequate preemption mechanisms according to the architecture description: http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
I still have to verify that experimentally, though.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Mike on 05 Nov 2016, 06:49:26 am »

Don't forget you have a first-generation APU based on Bulldozer.
Kaveri and Carrizo APUs have better management.
Same for FX CPUs.
Steamroller-based CPUs are better than Bulldozer.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 02 Nov 2016, 07:14:11 pm »

I have finally processed all the data.
It's amazing how repeatable those benches were.
There are 3 clearly separated groups of results:
pinned to different modules, pinned to the same Bulldozer module, and not pinned (a superposition of those 2 states).
The size of the performance drop when pinned to a single module doesn't allow calling the CPUs in the same module "cores"; they are too interrelated for that. The overall picture much more resembles a 2-core CPU with a kind of hyperthreading than a 4-core one.

So, for the SETI project Bulldozer-based AMD CPUs have very impaired performance.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 24 Oct 2016, 10:19:18 am »

Thanks to EdwardPF's hint about his own config, I'll continue exploring AMD's Bulldozer CPU architecture (with the FPU and L1 caches shared between a pair of "cores") using the Trinity A10-5700 APU (its CPU part) as an example.
To explore the influence of affinity, I will return to the older methodology where the additional load comes not from background BOINC processes but from multiple bench instances.
Each bench instance will be pinned to a single logical CPU (different CPUs will be used for different runs).
To avoid overloading the whole system I'll first test with only 2 simultaneous tasks. So, in some runs the tasks will share the same "module" of 2 "cores", and in others they will be placed on truly different cores.
Another change is Marco Franceschini's FFTW 3.3.5 x64 AVX DLL binary (thanks for it!), which was tested to generate AVX codelets on this hardware (the "stock" x64 DLL generates only SSE2/register ones).
To allow continuous load, each benchmark ended with a pair of renamed full BLC tasks (so the last PG VHAR could be partially paired with that BLC). As before, CPU throughput will be measured in PS_set/s units.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 23 Oct 2016, 02:39:44 am »

And the C-60 Loveland data.

The picture is quite different here (vs Trinity). CPU load scaling is almost linear, and the GPU almost doubles total device throughput/performance.
The CPU part's load practically doesn't affect the GPU part and vice versa.
Though the C-60 is quite a weak device per se (it's a netbook APU), its CPU and GPU parts truly augment each other.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 17 Oct 2016, 04:59:04 pm »

The same on a single graph for direct comparison.

On full CPU load the A10-5700 is more(!) than twice as slow as the i5-3470 for the PG MultiBeam set :-\
It's hardly half the price, nor does it consume half the power... Having 2 such APUs and only a single IvyBridge, I feel disappointed.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 17 Oct 2016, 02:52:43 pm »

And here is similar data for the Trinity AMD APU.
There is one additional dot: 3 CPU tasks + busy GPU, so one can see how strong the GPU influence is.

The situation is much worse here.
Even the CPU part alone scales very badly. Just 2 busy cores show a considerable deviation from linear scaling.
And 4 perform only slightly better than 3.
With the GPU added to the equation the APU seems overloaded. Maybe this is the result of the particular drivers (CPU load from the GPU app is unexpectedly high, much higher than for a discrete ATi GPU with the same app).
So the difference between CPU time and elapsed time becomes non-negligible (red dots are computed from elapsed time, black dots from CPU time).
Of course, dot 5 (just as in the IvyBridge case) doesn't fully reflect device performance: the GPU part's throughput is not accounted for here, only its negative influence on the CPU part is shown.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 17 Oct 2016, 02:15:28 pm »

And here are the first results for IvyBridge.
So far only the CPU part has been explored under different loads.
The performance increase is quite linear, declining only at full CPU load.
With a busy GPU part, CPU part performance drops more strongly, but not fatally.
To estimate complete device throughput in this condition, an additional measurement of GPU throughput is required.
In general, load scaling for MultiBeam is quite good.

Discussion Forum / Re: Loading APU to the limit: performance considerations

« Last post by Raistmer on 15 Oct 2016, 07:43:43 am »

Quite long ago I did some exploration of this topic based on the AstroPulse application.
Nowadays AP is a rare beast, so some refreshed data with MultiBeam is required.
So I decided to revive this thread.
Also, there will be some changes in methodology to make this test less invasive for crunching.

So, here is how APU performance tuning versus load can be done now:

1) Acquire the PGv8 set of shortened tasks. With the advance of GBT data this set is biased, but a separate adjustment can be made if needed by running some shortened GBT/blc task.
2) Multiply each task. I prefer 3 tasks for each AR to have some statistics and an error estimate.
3) Download the KWSN 2.13 benchmark.
4) Configure it not to suspend BOINC (it's important!). BOINC will provide the background load for this type of test.
5) Configure BOINC for the particular background load.
6) Run the test.
7) Sum all times, divide by 3 and take the reciprocal. This represents a mean "PGset-throughput" per second for the particular config.

Repeat this for all configs of interest. A bigger value designates a better load configuration for the particular device.
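
Step 7 above amounts to the following arithmetic, shown here as a small Python sketch. The function name and the sample times are made up for illustration; the only thing taken from the procedure is "sum all times, divide by the number of copies per AR, take the reciprocal".

```python
def pgset_throughput(task_times_s, copies_per_ar=3):
    """Mean PG sets processed per second for one bench run.

    task_times_s: elapsed (or CPU) time in seconds of every task in the run,
    where each AR was run copies_per_ar times.
    """
    total = sum(task_times_s)
    if total <= 0:
        raise ValueError("task times must sum to a positive value")
    mean_set_time = total / copies_per_ar  # time for one pass over the set
    return 1.0 / mean_set_time


# Made-up numbers: three copies of a one-task "set" taking 100, 110 and 90 s.
example = pgset_throughput([100.0, 110.0, 90.0])
```

Comparing this single number across configs is what makes the runs directly comparable.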

Now, what does "configure BOINC" mean?
For example, suppose one wants to test how the APU performs with 3 cores loaded + the GPU part.
In the current methodology such an estimate can be acquired in 2 bench runs:
1) Make only 3 CPUs available to BOINC (check that GPU computations are disabled and that only 3 CPU tasks are active in BOINC, by reducing the number of CPUs available to BOINC).
2) Run the bench with the GPU build (it can be ATi or iGPU depending on the device under investigation) and compute the GPU part's throughput (GPU_throughput).
3) Make only 2 CPUs available to BOINC (by reserving more cores) but unsuspend GPU computations. Check that BOINC runs 2 CPU + 1 GPU task.
4) Run the bench with the opt CPU app. Compute its throughput (CPU_throughput).
5) Device throughput for such a config will be APU_throughput(1_core_reserved)=3*CPU_throughput+GPU_throughput.
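
The formula in step 5 can be used to rank configs, as in this short Python sketch. The function name and all throughput numbers are invented for illustration and are not measurements.

```python
def apu_throughput(cpu_throughput, gpu_throughput, loaded_cpu_cores=3):
    """Whole-device throughput: per-core CPU throughput times the number of
    loaded cores, plus the GPU part's throughput (same units for both)."""
    return loaded_cpu_cores * cpu_throughput + gpu_throughput


# Invented per-core and GPU throughputs (PG sets per second) for two configs.
three_plus_gpu = apu_throughput(0.002, 0.010, loaded_cpu_cores=3)
four_cpu_only = apu_throughput(0.0018, 0.0, loaded_cpu_cores=4)
best = max(
    [("3 CPU + GPU", three_plus_gpu), ("4 CPU, no GPU", four_cpu_only)],
    key=lambda item: item[1],
)[0]
```

Note that the per-core CPU throughput has to be measured separately for each config, since it changes with the load on the other cores and on the GPU.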

All other configs can be checked similarly.
This approach sacrifices minimal crunching performance of the host, but it implies some loss of precision if the GPU app's CPU consumption depends strongly on AR.
To solve this, one can replace BOINC's GPU load with a run of some cloned standard task in a separate bench instance (preferably with 2 estimates then: one with high CPU load and one with low CPU load).

Fortunately, the CPU consumption of both the ATi and iGPU apps is low enough to skip this enhancement, at least as a first approach (it is relevant mostly for NV OpenCL builds).
