Author Topic: Some considerations regarding OpenCL MultiBeam app tuning from algorithm view (Read 69345 times)

Raistmer · « **on:** 06 Jun 2016, 08:41:11 am »

Because the same questions about MB tuning come up again and again, I'll try to give some insight into what the options actually do, to turn the anonymous tuning parameter space into something more understandable. Knowledge of the app machinery could help in guided tuning, IMHO.

A short description of each option is given in the ReadMe file that comes with the app. Here I'll try to explain some of the options in more detail. I did the same many times in different threads on different forums, but the info gets scattered and lost. So here is a separate thread that is publicly readable but quite restricted for commenting, so it has a good chance of not getting diluted.

Well, the first quite important option (especially in view of the arrival of GBT data, which is mostly VLAR):

-period_iterations_num N

The MultiBeam app consists of a few different searches for artificial signals. This option belongs to the so-called Pulse search, that is, the search for short bursts of energy. Because we know neither the interval between bursts nor their duration, we have to guess some possible values (that is, search/scan the parameter space for both duration and period). So one needs to sum up the powers of adjacent points through time in different sequences to see whether, for some pattern, the resulting energy exceeds some threshold (the threshold roughly designates the background level that could be expected from random white noise). We should do such summation over all the time we receive signal from the same point of the sky.
MultiBeam is mostly a piggy-back algorithm (that is, it collects all the data it can get, not the data we would want to have). Sometimes the telescope moves fast between points... but the SETI recorder still works - this restricts the available length of the data array for summation.
Also, what frequency range should one use? Again, we don't know the frequency width of the signal, so we check different ones.
All this results in many different patterns arising from the same initial data (there is also Doppler shift, which leads to dechirping, but this just increases the number of possible patterns even more). The initial data (as of v8, v7, v6) consist of 1024x1024 points. Depending on what frequency width we analyse, they can be split into matrices of different shapes.
For example, 8 x 128k (k == 1024 here) or 16k x 64.
One dimension is the number of separate data arrays, while the other is the length of each particular array.
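
To make the splitting concrete, here is a rough Python sketch (an illustration only, not the app's code) that lists how 1024x1024 points can be arranged as (number of arrays) x (array length). The function name and the restriction to power-of-two dimensions are assumptions made for the example.

```python
TOTAL_POINTS = 1024 * 1024  # initial data size as of v6/v7/v8


def possible_shapes(total=TOTAL_POINTS):
    """Yield (num_arrays, array_length) pairs whose product is `total`.

    Assumption: both dimensions are powers of two.
    """
    num_arrays = 1
    while num_arrays <= total:
        yield num_arrays, total // num_arrays
        num_arrays *= 2


shapes = list(possible_shapes())
k = 1024
print((8, 128 * k) in shapes)   # the 8 x 128k case: few, very long arrays
print((16 * k, 64) in shapes)   # the 16k x 64 case: many short arrays
```

The first shape gives the GPU only 8 independent pieces of work, the second gives it 16k of them.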
Of course, the more separate independent arrays we have, the easier the GPU's work (because by design a GPU is a massively parallel device).
So, obviously, there are extreme cases where the number of separate arrays is very low - that's the case of PulseFind (in particular) in a VLAR task.
Each array can be very long (because with VLAR we stare at almost the same point of the sky for the whole duration of the task data), but this results in a low number of such separate arrays.
Here we have an issue for modern GPUs, because there are only 8 such fully separate arrays (recall that a modern GPU can have as many as 40 or even more independent CUs (compute units) that can't be synchronised with each other, again by GPU hardware design).
This essentially leads to very low computational load (don't confuse real load with the load reported by tools like GPU-Z - even a single busy CU will be reported as a busy GPU by such tools) and bad processing speed. The GPU essentially becomes an 8-core CPU (but with a much lower core frequency, as you know).
So, something should be done to increase parallelization. Luckily, we have another parameter that needs to be scanned - pulse periods. There are plenty of them.
So, actually, those 8 arrays generate an enormous number of new arrays to fold, each folded a little differently (with a different period) from the others. Remember this fact, because it will be used in another option's description.
So, if we try to process all possible periods in a single kernel call (especially with each iteration loading only, let's say, 8 CUs of the GPU; actually it can be even fewer, because a few arrays can be processed on a single CU), we can get a very long kernel call even on the fastest GPUs. Hence lags and driver restarts.
Such a call should be split into many separate kernel calls to allow the GPU to switch context to other tasks like GUI redraw.
This option governs this process. It splits the single call into N subsequent calls, each of which processes only 1/Nth of all available periods.
So, summarizing: this option can help with lags and it can change the longest kernel duration (so it is useful when attempting to balance GPU computations with CPU sleeping), but it doesn't really help with real GPU load. Some speedup/slowdown can be observed because of the additional overhead of making separate calls and because of a signal found early in the sequence: if a signal is detected, the remaining iterations are omitted in some versions.
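
The splitting itself can be pictured with a small Python sketch. It is only a model of the idea (one period range per kernel call); the function name and the way the remainder is distributed are assumptions, not taken from the app's source.

```python
def split_periods(total_periods, n_calls):
    """Return a list of (start, stop) period ranges, one per kernel call.

    Models -period_iterations_num N: each call gets about 1/Nth of the periods.
    """
    n_calls = max(1, min(n_calls, total_periods))
    base, extra = divmod(total_periods, n_calls)
    ranges, start = [], 0
    for i in range(n_calls):
        # The first `extra` calls take one additional period each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


print(split_periods(10, 3))         # three short calls
print(len(split_periods(1000, 50)))  # the default N=50 gives 50 calls
```

With N=1 the whole period range lands in one long call, which is exactly the case that produces lags.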

Current defaults (as of 6 June 2016) are 50 for most GPUs and 500 for so-called "low performance path" GPUs with only 3 or fewer CUs per device.

Raistmer · « **Reply #1 on:** 06 Jun 2016, 10:13:13 am »

Next important option to consider is

-sbs N

It stands for Single Buffer Size (N is in MB).

As you may recall, there are plenty of periods to process for each data array. That can be used to increase coarse GPU load (that is, to increase the number of active CUs participating in data processing). 8 separate entities to process is too few for modern devices.
So, what if we subdivide the different period sets into separate entities? Each period is independent from the others, so they can be parallelized! No sooner said than done. But each array for a particular period set now represents separate input data. To be processed simultaneously, this data needs some storage space. It can't be retrieved from the original data any more, because we want to change (fold) all arrays simultaneously. So the storage requirements (used GPU memory) increase.
Here comes the -sbs N option. In some sense it's similar to my AstroPulse's -unroll N and -ffa_block N options. All of them govern some data unrolling to increase coarse parallelism. But unlike AstroPulse's options, where the number of unrolls is specified directly, here we can only set the upper bound for the size of the memory buffer used.
Why so? Because, unlike AstroPulse, where we deal with mostly pre-determined and unchanging data sizes and dimensions (the single exception is FFA, not for lack of determinism but just because there are too many variations to account for; nevertheless, exactly the same variations (periods) repeat in each AP task), in MultiBeam we have a recording hardware parameter (recall telescope movement!) - AR, the angular range - that governs all data splitting for a particular task. So much more agility is required to deal with all possible patterns.
In early app versions, periods were unrolled right up to the buffer size to fill all available space. For some ARs this resulted in too much overhead (because making arrays fully independent also means repeating some computations that can no longer be shared - everything has its price). So, in recent versions a new algorithm was devised that produces as many separate data entities as needed to fully load the GPU's CUs (within the constraints of the available buffer size, of course).
So, summarizing: this option governs, though indirectly, the amount of data unrolling in the PulseFind algorithm (currently it affects only PulseFind) to increase the algorithm's degree of coarse parallelism.
With the recent algorithm change, a few new options were introduced that work together with -sbs N.
They are:
-pref_wg_size N
and
-pref_wg_num_per_cu N

See the ReadMe for a short description. Together these 3 options allow you to control the distribution of computational load inside the GPU device.
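
The trade-off between "enough entities to load all CUs" and "whatever fits into the buffer" can be sketched as below. This is a hypothetical model only: the function, its parameters and the formula are assumptions for illustration, and the real app may combine -sbs, -pref_wg_size and -pref_wg_num_per_cu quite differently.

```python
def period_sets_per_array(sbs_mb, bytes_per_set, base_arrays, num_cus, wg_per_cu):
    """Pick how many period sets to unroll per base array (hypothetical model).

    sbs_mb        - buffer size limit in MB (the -sbs value)
    bytes_per_set - memory needed for one unrolled copy of one array
    base_arrays   - number of independent arrays before unrolling (8 for VLAR)
    num_cus       - compute units on the device
    wg_per_cu     - how many work entities we want per CU
    """
    buffer_bytes = sbs_mb * 1024 * 1024
    max_sets = buffer_bytes // (bytes_per_set * base_arrays)  # memory limit
    wanted_entities = num_cus * wg_per_cu                     # load target
    wanted_sets = -(-wanted_entities // base_arrays)          # ceiling division
    return max(1, min(wanted_sets, max_sets))


one_array = 128 * 1024 * 4  # 128k points of 4 bytes each (assumed element size)
print(period_sets_per_array(128, one_array, 8, 40, 1))  # load target decides
print(period_sets_per_array(128, one_array, 8, 40, 8))  # buffer size decides
print(period_sets_per_array(256, one_array, 8, 40, 8))  # bigger -sbs lifts the cap
```

In this model a bigger -sbs value only matters once the load target asks for more data than the buffer can hold.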

As of 6 June 2016 the default value for -sbs N is 128.
Where did it come from? Some of the first OpenCL devices (AMD HD4xxx ones) had only 128MB of available buffer, no matter how much GPU RAM the device had. Also, some of them allowed a big buffer but produced invalid results with it. So the initial default was as low as 64MB. Currently GPUs in general have plenty of memory on board, and the runtimes have evolved to support ~1/4 of installed memory in a single allocation (that's what the memory buffer represents - a single memory allocation).
But on some low-end GPUs a bigger buffer leads to driver restarts, so I decided to follow the "better safe than sorry" approach once again and restrict the default to a pretty universal value. For most GPUs it can be increased to 256. Maybe future builds will have a bigger or floating default.

Raistmer · « **Reply #2 on:** 06 Jun 2016, 11:07:37 am »

Now let's discuss

-use_sleep

and

-use_sleep_ex N

options.

They are quite important for the modern nVidia OpenCL runtime implementation, much less important for AMD's, and hardly of any use for Intel's implementation.
To understand why they are needed, let's talk about the ways a CPU interacts with auxiliary hardware like a GPU.
To work together these devices need some way to communicate with each other. They also need some mechanism to order their memory accesses so that they work cooperatively and produce correct results.
At the very core of a typical PC system, the CPU can "talk" to the outer world via memory writes (where addresses can be mapped to hardware other than real RAM) and port writes (and where there are writes, there are reads too). And the external world can communicate with the CPU by setting signals on some of its lines that represent so-called interrupt requests.
From these abilities come 2 basically different ways of synchronisation: asynchronous communication (via interrupts that occur when the external hardware needs attention and the CPU responds) and so-called polling, that is, constantly reading some port or memory range until the hardware sets some flag.
Of course, polling requires the CPU to constantly execute the corresponding I/O instruction - the CPU will be busy.
On the other hand, modern GPUs are so fast that if they sent an interrupt to the CPU on each required action, the interrupt controller would just be overwhelmed and, again, the CPU would do nothing but switch to and from interrupt processing (which is costly). So some balance between all those sync methods is required to make the CPU less busy and the GPU device more busy instead. Bus mastering allows the GPU device to access memory directly without disturbing the CPU. But anyway, once the CPU needs the data, it has to know somehow that the data is ready and all external writes have finished, so that the data can be read in its final form.
All this low-level machinery is isolated from the application at the driver and framework runtime levels. And different vendors use a different mix of possible syncing primitives to achieve their goals. Thus we have such noticeable differences in OpenCL MultiBeam behavior on different vendors' runtimes.
It seems that the sync algorithm used by both vendors depends on the estimated (measured?) length of the kernel call(s). But the defaults are different.
This is most vividly exposed by the SoG (Signals on GPU) flavour of the MultiBeam app.
For so-called VHAR tasks (telescope movement so fast and wide that some searches, like Pulse find or Gaussian find [see the previous posts for Pulse find], just can't collect enough data points to form an adequate data array) some searches are disabled. With the remaining searches processed completely on the GPU (including intermediate signal logging), there is no need to disturb the CPU at all after the initial kernel enqueuing. The CPU just sends everything to the GPU and then waits for the final results. This allows maximal decoupling between the devices. But once the independent GPU work time exceeded some threshold, the usually low-CPU-consuming AMD build suddenly started to consume a full CPU core. WTF? Well, some limit hardwired into the AMD runtime was exceeded and (perhaps) the runtime started to think that the kernel would finish "soon enough". But no, it continues - and the CPU enters polling mode for a prolonged amount of time. Unfortunately, even -use_sleep can't help in this AMD case, because polling is done not in the app's process but inside the driver process. So the solution was to restrict (!) the length of CPU-GPU decoupling and to issue synchronizing clFinish() calls every N iterations (btw, the app has an as yet undocumented switch that governs this period). This way the runtime stays on its default syncing method (interrupts w/o heavy polling?) and CPU consumption stays low enough.
With nVidia we have the diametrically opposite situation. The default behavior here is polling, and the switch away from polling occurs on quite prolonged loads. So on NV, SoG results in a great decrease of CPU usage for VHAR.
Luckily, the NV runtime executes polling from one of the app's process threads, and hence it can be switched off the CPU by a standard system Sleep() call.

Here we come to the -use_sleep* options.
-use_sleep enables a 1ms Sleep() call (on Windows 1 ms is the lowest possible sleep duration, apart from a plain yield with a value of 0) in the loop that waits for completion of an event marker specially inserted into the GPU processing queue.
-use_sleep_ex N just allows passing a particular number to that Sleep() call (NB: try -use_sleep_ex 0 to decrease sleep as much as possible - with external CPU load it could also decrease CPU consumption in the waiting loop without causing big GPU starvation).
What are the implications of such an approach? Only one, but a big one: typical kernel times and the minimal possible sleep time differ by an order of magnitude.
That is, by sleeping even 1 ms we will starve the GPU, especially a high-end one.
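
The waiting loop described above has roughly the following shape. This Python sketch only mimics it: a threading.Event stands in for the event marker in the GPU queue and a timer stands in for the GPU, so all names here are assumptions rather than the app's real code (which would query an OpenCL event instead).

```python
import threading
import time


def wait_for_gpu(event_done, sleep_ms=1):
    """Wait for a completion flag, giving up the CPU between checks.

    sleep_ms=0 corresponds to a plain yield, larger values to Sleep(N).
    Returns how many times the loop went to sleep.
    """
    sleeps = 0
    while not event_done.is_set():   # is the marker already reached?
        sleeps += 1
        time.sleep(sleep_ms / 1000.0)
    return sleeps


gpu_done = threading.Event()
threading.Timer(0.02, gpu_done.set).start()  # pretend the GPU needs ~20 ms
polls = wait_for_gpu(gpu_done, sleep_ms=1)
print("slept", polls, "times while waiting")
```

Without the sleep call the same loop would spin at full speed and keep one CPU core busy, which is the polling behavior the options are meant to tame.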
What is a possible solution to overcome this limitation? To somehow increase the amount of work pre-scheduled to the GPU so that it can last through the app's sleep.
What methods are currently possible for this:
1) to increase the number of independent kernel queues (streams in CUDA terminology).
2) to increase the length of the scheduled kernel call.

What options could help?
1) Multitasking. Try to run a few simultaneous tasks per GPU: while one process is sleeping, another will do kernel scheduling or even (if you are lucky and VLAR gets mixed with VHAR) schedule so many kernel calls at once that the GPU will be totally busy. Unfortunately, multitasking comes with its own overheads. For pre-Fermi architectures it's so costly that the overhead offsets any possible gain. Fermi and up made multitasking much more feasible.
The corresponding app option to use is -instances_per_device N.
Consult elsewhere for the corresponding app_config.xml content.

2) To make the kernel bigger, one could prevent its splitting in the first place.
So, -period_iterations_num 1
But beware: this will expose all the problems described in the first post, so approach the lowest possible value of 1 with care.
And -sbs 256 or higher. Though it will not increase the amount of work (that is, kernel length) per se, it will improve the efficiency of the kernel's work.

Because of that Windows limitation of 1ms as the minimal sleep (in Linux, AFAIK, the situation can be improved by using a special nano-sleep library), this method of saving CPU time spent on syncing is quite clumsy. So it is currently implemented only for the longest kernels (that is, PulseFind).
Maybe in future builds its area of effect will be restricted even more, to only the longest of those PulseFind calls (they differ hugely in length through the task). This would allow bigger GPU load but would also increase the CPU time wasted on sync polling.

As of 6 June 2016 the low-performance path enables sleep automatically (this can be disabled via the -no_defaults_scaling option) with a sleep time of 5ms.
In the current SVN head this has already been changed to 1ms.

27.06.2016 addon
Tests on a C-60 APU based host (raw data for this experiment can be seen here: https://drive.google.com/file/d/0BwjTLNvsJmLBbHB0d2hXTU4waEE/view?usp=sharing ) revealed that the real sleep time increases in quite coarse steps.
No matter what the Sleep() argument is, the real sleeping time increases in steps of 15ms.
So, instead of user-supplied values, starting from rev 3476 an adaptive sleeping algorithm is implemented that attempts to keep the partial PulseFind kernel length near 15ms per call and adjusts the sleep time to the real kernel execution time.
Hence, in rev 3476 the value of -use_sleep_ex is ignored, and both -use_sleep and -use_sleep_ex N just enable this new sleep-adjustment algorithm.
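
Why kernels near 15ms make sense can be seen from a tiny model of the observed timer behavior. The sketch assumes that any non-zero Sleep() request is rounded up to the next 15ms step (a simplification of the measurement above); the function names are made up for the illustration and this is not the adaptive algorithm from rev 3476 itself.

```python
import math

QUANTUM_MS = 15  # step size observed on the C-60 host


def effective_sleep_ms(requested_ms, quantum_ms=QUANTUM_MS):
    """Model: a non-zero request sleeps for a whole number of quanta."""
    if requested_ms <= 0:
        return 0  # Sleep(0) is just a yield
    return math.ceil(requested_ms / quantum_ms) * quantum_ms


def gpu_idle_ms(kernel_ms, requested_ms):
    """Time the GPU sits idle because the app is still asleep."""
    return max(0, effective_sleep_ms(requested_ms) - kernel_ms)


print(effective_sleep_ms(1))   # asking for 1 ms costs a full quantum
print(gpu_idle_ms(3, 1))       # a 3 ms kernel leaves the GPU waiting
print(gpu_idle_ms(15, 1))      # a 15 ms kernel is covered exactly
```

In this model a kernel shorter than one quantum always leaves the GPU starving, while a kernel of about one quantum does not.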

Raistmer · « **Reply #3 on:** 11 Dec 2016, 06:30:56 am »

Since the summer, the PulseFind algorithm has been greatly improved in the way work is split between separate kernel calls.
Now this splitting can vary during the task run and takes real device performance into account.
So the influence of -period_iterations_num N has changed considerably.

Here I would like to describe these changes, the new tuning options they introduced and how to use them for guided optimization.

As described earlier, -period_iterations_num N splits a single PulseFind into N separate kernel calls, allocating M/N different folding periods to try in each call, where M is the total number of different folding periods to try for the particular input data.
This works OK if the goal is just to limit the maximal kernel length to reduce possible lags.
But each PulseFind geometry has its own M value. Moreover, even the same M values don't mean the same execution times, because different task ARs and different FFT sizes all provide different amounts of data to fold/process.
So, if for the longest search M/N provides a reasonable amount of work and kernel length, for smaller M values and for other geometries the same M/N will provide too little work, especially for modern fast GPU devices.
To solve this issue I developed an adaptation algorithm that profiles PulseFind kernels and monitors their lengths to decide whether the number of periods per particular call should be increased or decreased.
This allows both reducing lags and keeping good overall performance.

The adaptation algorithm is guided by a few tunable command line options. The most important of those is -tt N.
It provides the desired length in milliseconds (ms) of a single PulseFind kernel call. Its default value is 60ms. As of 2016, GPU devices are not preemptive (unlike CPUs). That is, the GPU has to finish a piece of work before it can respond to the next request. That's why it is so important to limit the length of a single kernel call to avoid GUI lags. 60ms seems a reasonable compromise between GUI responsiveness and performance (each kernel call incurs substantial overhead, so for better performance one should try to keep the number of calls to a minimum). But it's tunable. If you feel the GUI is too laggy, try to set the -tt N option to a lower value, like -tt 15 for example.

How does all this influence -period_iterations_num N behavior?
The adaptation algorithm takes that M/N as the initial value but starts to change it after a few initial iterations to meet the -tt N goal. So, one can set -period_iterations_num to a very high value and reduce the first few PulseFind kernel calls to very short lengths, but soon afterwards the call length will start to increase and GUI lags (if any) will reappear. The same goes for an optimization attempt with -period_iterations_num 1 (for example). This will disable PulseFind kernel splitting only for the first few calls. After that the adaptation algorithm will start to split the single call into several again to meet the default 60ms per call goal (and performance may drop).
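
A toy model of this behavior is sketched below. It is not the app's adaptation algorithm: the class, the window of 8 calls and the proportional correction rule are all assumptions chosen to show the effect, namely that the starting M/N is forgotten after a few calls and the call length drifts to the -tt target.

```python
from collections import deque


class PulseFindTuner:
    """Toy model: adjust periods per call so kernel time approaches tt_ms."""

    def __init__(self, total_periods, period_iterations_num, tt_ms=60.0, window=8):
        self.total = total_periods
        self.delta = max(1, total_periods // period_iterations_num)  # initial M/N
        self.tt_ms = tt_ms
        self.recent = deque(maxlen=window)  # last few kernel times

    def record(self, kernel_ms):
        """Feed one measured kernel time, return periods for the next call."""
        self.recent.append(kernel_ms)
        if len(self.recent) < self.recent.maxlen:
            return self.delta  # still warming up, keep the initial split
        s_mean = sum(self.recent) / len(self.recent)  # sliding mean
        if s_mean > 0:
            scaled = int(self.delta * self.tt_ms / s_mean)
            self.delta = max(1, min(self.total, scaled))
        self.recent.clear()
        return self.delta


MS_PER_PERIOD = 0.5  # pretend each folding period costs 0.5 ms on this GPU
for n in (500, 1):
    tuner = PulseFindTuner(total_periods=10000, period_iterations_num=n)
    for _ in range(8):
        tuner.record(tuner.delta * MS_PER_PERIOD)
    print(n, "->", tuner.delta, "periods per call")
```

Both the very high and the very low starting value end up at the same split, which is why -period_iterations_num alone no longer fixes the kernel length.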

So, if you want to change the default behavior, you need to use both the -period_iterations_num N and -tt n options.

To reduce lags, set N high and n low (like 500 and 15, for example).
To improve performance, set N low and n high (like 1-3 for N and 300 for n, especially if GUI lags are not important).

To learn how the adaptation algorithm interacts with your particular setup, and as an aid in optimization, look at the set of PulseFind profiling counters that the app prints to its stderr after the task finishes.
Here is an example of those counters and a description of how to decipher the info they provide:

Fftlength=32,pass=3:Tune: sum=106916(ms); min=15.21(ms); max=84.31(ms); mean=58.08(ms); s_mean=54.2; sleep=45(ms); delta=289; N=1841; usual
Fftlength=32,pass=4:Tune: sum=152665(ms); min=12.25(ms); max=89.75(ms); mean=59.5(ms); s_mean=48.54; sleep=45(ms); delta=185; N=2566; usual
Fftlength=32,pass=5:Tune: sum=38946.1(ms); min=9.851(ms); max=68.61(ms); mean=46.75(ms); s_mean=57.65; sleep=60(ms); delta=643; N=833; usual
Fftlength=64,pass=3:Tune: sum=67767.6(ms); min=7.595(ms); max=70.24(ms); mean=53.11(ms); s_mean=58.06; sleep=60(ms); delta=457; N=1276; usual
Fftlength=64,pass=4:Tune: sum=82488.8(ms); min=5.996(ms); max=78.9(ms); mean=58.63(ms); s_mean=60.21; sleep=60(ms); delta=354; N=1407; usual
Fftlength=64,pass=5:Tune: sum=33600.6(ms); min=5.2(ms); max=64.29(ms); mean=38.31(ms); s_mean=59.64; sleep=60(ms); delta=732; N=877; usual
Fftlength=128,pass=3:Tune: sum=39667.5(ms); min=3.837(ms); max=80.19(ms); mean=37.28(ms); s_mean=49.33; sleep=45(ms); delta=997; N=1064; usual
Fftlength=128,pass=4:Tune: sum=32580.9(ms); min=3.373(ms); max=78.17(ms); mean=31.79(ms); s_mean=38.77; sleep=30(ms); delta=1112; N=1025; usual
Fftlength=128,pass=5:Tune: sum=17835.2(ms); min=2.578(ms); max=47.37(ms); mean=18.75(ms); s_mean=21.72; sleep=15(ms); delta=1082; N=951; usual
Fftlength=256,pass=3:Tune: sum=32437(ms); min=1.931(ms); max=43.89(ms); mean=29.3(ms); s_mean=43.32; sleep=45(ms); delta=1194; N=1107; usual
Fftlength=256,pass=4:Tune: sum=23303.3(ms); min=1.54(ms); max=31.68(ms); mean=21.9(ms); s_mean=31.03; sleep=30(ms); delta=1151; N=1064; usual
Fftlength=256,pass=5:Tune: sum=17114.3(ms); min=1.318(ms); max=23.98(ms); mean=16.76(ms); s_mean=22.92; sleep=15(ms); delta=1108; N=1021; usual
Fftlength=512,pass=3:Tune: sum=32597(ms); min=0.9826(ms); max=22.52(ms); mean=19.43(ms); s_mean=21.76; sleep=15(ms); delta=1721; N=1678; usual
Fftlength=512,pass=4:Tune: sum=23698.2(ms); min=0.7869(ms); max=17.35(ms); mean=14.3(ms); s_mean=16.03; sleep=15(ms); delta=1700; N=1657; usual
Fftlength=512,pass=5:Tune: sum=17414(ms); min=0.6734(ms); max=12.65(ms); mean=10.65(ms); s_mean=11.65; sleep=0(ms); delta=1678; N=1635; usual
Fftlength=1024,pass=3:Tune: sum=72685.6(ms); min=0.5037(ms); max=25.85(ms); mean=23.54(ms); s_mean=24.48; sleep=15(ms); delta=3109; N=3088; high_perf
Fftlength=1024,pass=4:Tune: sum=452.15(ms); min=0.4042(ms); max=8.098(ms); mean=3.3(ms); s_mean=6.856; sleep=0(ms); delta=3098; N=137; usual
Fftlength=1024,pass=5:Tune: sum=326.168(ms); min=0.3532(ms); max=6.036(ms); mean=2.589(ms); s_mean=5.704; sleep=0(ms); delta=3087; N=126; usual
Fftlength=2048,pass=3:Tune: sum=71583.2(ms); min=5.29(ms); max=12.58(ms); mean=11.94(ms); s_mean=12.02; sleep=15(ms); delta=1; N=5997; high_perf
Fftlength=4096,pass=3:Tune: sum=78586.3(ms); min=2.574(ms); max=6.829(ms); mean=6.553(ms); s_mean=6.538; sleep=0(ms); delta=1; N=11993; high_perf
Fftlength=8192,pass=3:Tune: sum=91514.2(ms); min=1.448(ms); max=4.018(ms); mean=3.815(ms); s_mean=3.809; sleep=0(ms); delta=1; N=23987; high_perf

As was mentioned, each FFT size for a task with a particular AR provides its own amount of work, so the algorithm (and the counters) keep separate track of all possible FFT sizes/lengths.

pass=3,4,5 designates the particular folding procedure, which also influences kernel call length.
After Tune: comes the info about the particular case.
sum is the sum of the lengths of all such PulseFind kernel calls. One can estimate how long the GPU was processing this particular type of work inside this particular task.
min/max/mean - the corresponding kernel call lengths in ms.
s_mean is the so-called sliding mean, computed over only the last few calls rather than all of them. The algorithm looks at this value to decide whether and how it should change work splitting to meet the target time (tt) goal.

sleep: if -use_sleep is active, this is the argument of the Sleep() call. The algorithm takes the sleep quantum into account and tries to optimize the number of such quanta so that it sleeps only as long as needed to keep both performance and CPU usage at the optimum. If -use_sleep is not used (sleep disabled), ignore it.

delta: the M/N value mentioned earlier.
It can't be less than 1. If you see delta=1 and still have GUI lags, the -tt/-period_iterations_num options can't help in such a case.

N: the total number of calls of this kind in this particular task.

usual/high_perf: Sometimes it's possible to merge the 3 separate passes (3,4,5) into a single kernel sequence and do all 3 in a fraction of the target time. In such a case the passes are merged and the high_perf modifier appears. The useful info is then contained in the pass=3 counter; the pass=4/5 counters stop updating after the merge.
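
If you want to look at these counters in a script rather than by eye, a small parser is enough. The regular expression below is written only against the sample lines shown in this post, so treat the exact format as an assumption; the function and key names are made up for the example.

```python
import re

RECORD_RE = re.compile(
    r"Fftlength=(?P<fft>\d+),pass=(?P<pass_no>\d+):Tune:\s*"
    r"sum=(?P<sum>[\d.]+)\(ms\);\s*min=(?P<min>[\d.]+)\(ms\);\s*"
    r"max=(?P<max>[\d.]+)\(ms\);\s*mean=(?P<mean>[\d.]+)\(ms\);\s*"
    r"s_mean=(?P<s_mean>[\d.]+);\s*sleep=(?P<sleep>\d+)\(ms\);\s*"
    r"delta=(?P<delta>\d+);\s*N=(?P<calls>\d+);\s*(?P<mode>usual|high_perf)"
)


def parse_counters(text):
    """Return one dict per 'Tune:' record found in the stderr text."""
    records = []
    for match in RECORD_RE.finditer(text):
        rec = {k: float(v) for k, v in match.groupdict().items() if k != "mode"}
        rec["mode"] = match.group("mode")
        records.append(rec)
    return records


sample = (
    "Fftlength=32,pass=3:Tune: sum=106916(ms); min=15.21(ms); max=84.31(ms); "
    "mean=58.08(ms); s_mean=54.2; sleep=45(ms); delta=289; N=1841; usual "
    "Fftlength=2048,pass=3:Tune: sum=71583.2(ms); min=5.29(ms); max=12.58(ms); "
    "mean=11.94(ms); s_mean=12.02; sleep=15(ms); delta=1; N=5997; high_perf"
)
records = parse_counters(sample)
for rec in records:
    print(int(rec["fft"]), int(rec["pass_no"]), rec["mean"], rec["s_mean"], rec["mode"])
```

Since the pattern does not depend on line breaks, it also copes with stderr output where several records run together on one line.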

And now, how to use all this info for guided optimization?
1) Look at the max values. If they are significantly higher than the mean ones, increase the -period_iterations_num value to reduce lags. That is, the starting point requires correction.

2) Look at the mean and s_mean values. The first few lines correspond to the hardest cases for the GPU, so concentrate on them. If you see mean times of ~60ms (at default settings), that means your setup will respond to an increase of the -tt value (there is optimization potential here). If even for the first few lines s_mean is lower than the default 60 (or lower than the provided -tt value), it means there isn't enough work to load the GPU. Further increasing the -tt value will not help; consider increasing the -sbs value instead, or just try to run a few tasks at once on such a GPU.

On the other hand, if the listed s_mean times are already considerably lower than the -tt value but lags are still present, playing with -period_iterations_num/-tt will hardly help. Try to describe your config in the forum support thread and maybe some specific solution could be found.

There can be a case (on slow devices) where, with -tt set to low values, the first few lines show s_mean ~ tt value, but lines closer to the bottom (with bigger FFT sizes) still have s_mean greater than the desired tt value and GUI lags can't be reduced.
If you experience such a situation, please describe your config in the forum support thread (for 8.19 it's https://setiathome.berkeley.edu/forum_thread.php?id=80381 ; similar ones will be created for new releases too).
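
The rules of thumb above can be written down as a small checklist function. This is only a sketch: the thresholds (what counts as "significantly higher" or "~60ms") are assumptions picked for the example, and the -sbs hint is meant for the first few counter lines only, as explained above.

```python
def advise(rec, tt_ms=60.0, spike_ratio=1.5, near=0.9):
    """Turn one counter record (max, mean, s_mean, delta) into tuning hints.

    spike_ratio and near are assumed thresholds, not values used by the app.
    """
    hints = []
    if rec["max"] > spike_ratio * rec["mean"]:
        hints.append("raise -period_iterations_num")
    if rec["mean"] >= near * tt_ms:
        hints.append("try a higher -tt")
    elif rec["s_mean"] < near * tt_ms:
        hints.append("raise -sbs or run several tasks")
    if rec["delta"] == 1:
        hints.append("-tt/-period_iterations_num cannot shorten calls further")
    return hints


# Values taken from the Fftlength=32,pass=3 and Fftlength=128,pass=3 lines above.
print(advise({"max": 84.31, "mean": 58.08, "s_mean": 54.2, "delta": 289}))
print(advise({"max": 80.19, "mean": 37.28, "s_mean": 49.33, "delta": 997}))
```

Treat the output as a starting point for experiments, not as a verdict: the real check is still how the GUI feels and how long the task runs.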

EDIT: it seems things have started to change regarding preemption on GPU devices.
The NV Pascal architecture has adequate preemption mechanisms according to the architecture description: http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
I still have to prove that experimentally, though.
