Now let's discuss
-use_sleep
-use_sleep_ex N
They are quite important for the modern nVidia OpenCL runtime implementation, much less important for AMD's, and hardly of any use for Intel's implementation.
To understand why they are needed, let's talk about how the CPU interacts with auxiliary hardware such as a GPU.
To work together, these devices need some way to communicate with each other. They also need some mechanism to order their memory accesses so that they can cooperate and produce correct results.
At the very core of a typical PC system, the CPU can "talk" to the outer world via memory writes (where addresses can be mapped to hardware other than real RAM) and port writes (and where there are writes, there are reads too). The external world, in turn, can communicate with the CPU by setting signals on some of its lines, which represent so-called interrupt requests.
From these abilities come two fundamentally different ways of synchronisation: asynchronous communication (via interrupts, raised when the external hardware needs attention and answered by the CPU) and so-called polling, that is, constantly reading some port or memory range until the hardware sets some flag.
Of course, polling requires the CPU to keep executing the corresponding I/O instruction, so the CPU stays busy.
On the other hand, modern GPUs are so fast that if they sent an interrupt to the CPU on each required action, the interrupt controller would simply be overwhelmed and, again, the CPU would do nothing but switch to and from interrupt processing (which is costly). So some balance between all those sync methods is required to make the CPU less busy and the GPU more busy instead. Bus mastering allows the GPU to access memory directly without disturbing the CPU. But anyway, once the CPU needs the data, it has to know somehow that the data is ready and all external writes have finished, so that the data can be read in its final form.
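To make the difference between the two waiting styles concrete, here is a minimal sketch in plain Python. Nothing in it talks to real hardware: a timer thread stands in for the device, and the names wait_by_polling, wait_by_notification and run are made up for this illustration only.

```python
import threading

def wait_by_polling(flag):
    """Busy-wait: re-check the flag until the 'device' sets it."""
    checks = 0
    while not flag.is_set():
        checks += 1  # every check burns CPU time
    return checks

def wait_by_notification(flag):
    """Block until notified; the waiting thread is descheduled meanwhile."""
    flag.wait()
    return 1  # a single wake-up

def run(waiter, device_time_s=0.05):
    """Start a fake 'device' that finishes after device_time_s, then wait for it."""
    flag = threading.Event()
    # The timer thread is only a stand-in for hardware raising a completion signal.
    threading.Timer(device_time_s, flag.set).start()
    return waiter(flag)

if __name__ == "__main__":
    print("polling checks:", run(wait_by_polling))
    print("blocking wake-ups:", run(wait_by_notification))
```

The polling variant typically reports a very large number of checks for the same 50 ms wait, which is exactly the CPU time the blocking variant does not spend.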
All this low-level machinery is isolated from the application at the driver and framework runtime levels. And different vendors use a different mix of the possible synching primitives to achieve their goals. Thus we see such noticeable differences in OpenCL MultiBeam behavior on different vendors' runtimes.
It seems that the sync algorithm used by both vendors depends on the estimated (measured?) length of the kernel call(s). But the defaults are different.
This is most vividly exposed by the SoG (Signals on GPU) flavour of the MultiBeam app.
For so-called VHAR tasks (where the telescope moves so fast and wide that some searches, like Pulse find or Gaussian find [see previous posts for Pulse find], just can't collect enough data points to form an adequate array of data), some searches are disabled. With the remaining searches processed completely on the GPU (including intermediate signal logging), there is no need to disturb the CPU at all after the initial kernels are enqueued. The CPU just sends everything to the GPU and then waits for the final results. This allows maximal decoupling between the devices. But once the independent GPU work time exceeded some threshold, the usually low CPU-consuming AMD build suddenly started to consume a full CPU core. WTF? Well, some limit hardwired into the AMD runtime was exceeded and (perhaps) the runtime started to think that the kernel would finish "soon enough". But no, it continues, and the CPU enters polling mode for a prolonged amount of time. Unfortunately, even -use_sleep can't help in this AMD case, because the polling is done not in the app's process but inside the driver process. So the solution was to restrict (!) the length of CPU-GPU decoupling and to issue a synchronizing clFinish() call every N iterations (btw, the app has an as yet undocumented switch that governs this period). This way the runtime stays on its default synching method (interrupts w/o heavy polling?) and CPU consumption stays low enough.
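The "sync every N iterations" idea can be sketched as below. This is not the app's code and uses no real OpenCL binding: FakeQueue is a hypothetical stand-in whose finish() plays the role of clFinish(), and process and sync_period are names invented for the sketch.

```python
class FakeQueue:
    """Stand-in for a command queue; NOT a real OpenCL binding."""

    def __init__(self):
        self.pending = 0        # kernels enqueued but not yet synchronized
        self.max_pending = 0    # longest stretch of uninterrupted GPU work
        self.finish_calls = 0

    def enqueue(self):
        self.pending += 1
        self.max_pending = max(self.max_pending, self.pending)

    def finish(self):
        # Plays the role of clFinish(): wait until everything queued has run.
        self.pending = 0
        self.finish_calls += 1


def process(queue, iterations, sync_period):
    """Enqueue one kernel per iteration, synchronizing every sync_period iterations."""
    for i in range(1, iterations + 1):
        queue.enqueue()
        if i % sync_period == 0:
            queue.finish()
    if queue.pending:
        queue.finish()  # final sync before the results are read
    return queue


if __name__ == "__main__":
    q = process(FakeQueue(), iterations=10, sync_period=4)
    print(q.finish_calls, q.max_pending)
```

A smaller sync_period caps how long the GPU runs without the host checking in (max_pending), at the price of more synchronization calls.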
With nVidia we have a diametrically opposite situation. The default behavior here is polling, and switching away from polling occurs only on quite prolonged loads. So, on NV, SoG results in a great decrease in CPU usage for VHAR.
Luckily, the NV runtime executes polling from one of the app's own process threads. Hence it can be switched off the CPU by a standard system Sleep() call.
Here we come to the -use_sleep* options.
-use_sleep enables a 1 ms Sleep() call (on Windows, 1 ms is the lowest possible sleep duration apart from a plain yield with a value of 0) in the loop that awaits completion of an event marker specially inserted into the GPU processing queue.
-use_sleep_ex N just allows passing a particular number to that Sleep() call (NB: try -use_sleep_ex 0 to decrease the sleep as much as possible; under external CPU load it could decrease CPU consumption in the waiting loop too, while not causing big GPU starvation).
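The waiting loop described above has roughly the following shape. This is only a sketch under assumptions: is_complete is a hypothetical callback standing in for "has the event marker in the GPU queue completed?", and time.sleep stands in for the Windows Sleep() call; the real app is not written in Python.

```python
import time

def wait_with_sleep(is_complete, sleep_ms=1, sleep=time.sleep):
    """Poll a completion check, sleeping between polls; return the number of polls."""
    polls = 1
    while not is_complete():
        # sleep_ms=0 mimics Sleep(0): give up the time slice without a timed wait.
        sleep(sleep_ms / 1000.0)
        polls += 1
    return polls


if __name__ == "__main__":
    deadline = time.perf_counter() + 0.02  # pretend the GPU needs 20 ms
    polls = wait_with_sleep(lambda: time.perf_counter() >= deadline, sleep_ms=1)
    print("polls needed:", polls)
```

The sleep parameter is injectable only so the loop can be exercised without real waiting; the default behaves like the plain sleeping loop.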
What are the implications of such an approach? Only a single one, but a big one: typical kernel times and the minimal possible sleep time differ by an order of magnitude.
That is, sleeping for even 1 ms will starve the GPU, especially a high-end one.
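A bit of arithmetic shows how bad the starvation can get. The model below is a deliberate simplification and an assumption of this sketch: the host looks at the GPU only when it wakes up, so the GPU sits idle from the moment its queued work ends until the next wake-up. The kernel lengths used are illustrative numbers, not measurements.

```python
import math

def gpu_utilisation(kernel_ms, sleep_ms):
    """Fraction of time the GPU is busy if the host checks it only between sleeps."""
    if sleep_ms <= 0:
        return 1.0  # no sleeping: new work is issued as soon as the old work ends
    # The host notices completion at the first wake-up after the work has finished.
    wakeups = max(1, math.ceil(kernel_ms / sleep_ms))
    return kernel_ms / (wakeups * sleep_ms)


if __name__ == "__main__":
    for kernel_ms in (0.1, 2.5, 15.0):
        print(kernel_ms, "ms of queued work ->", gpu_utilisation(kernel_ms, 1.0))
```

Under this model, 0.1 ms of queued work against a 1 ms sleep keeps the GPU busy only a tenth of the time, while work that lasts many sleep periods loses very little.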
What possible solution is there to overcome this limitation? To somehow increase the amount of work pre-scheduled to the GPU, so that it can sustain the app's sleep.
The methods currently available for this:
1) increase the number of independent kernel queues (streams in CUDA terminology).
2) increase the length of the scheduled kernel call.
Which options could help?
1) Multitasking. Try to run a few simultaneous tasks per GPU: while one process is sleeping, another will do kernel scheduling or even (if you are lucky and VLAR gets mixed with VHAR) schedule so many kernel calls at once that the GPU will be totally busy. Unfortunately, multitasking comes with its own overheads. For pre-Fermi architectures it's so costly that the overhead offsets any possible gain. Fermi and up made multitasking much more feasible.
The corresponding app option is -instances_per_device N.
Consult elsewhere for the corresponding app_config.xml content.
2) To make the kernel bigger, one could prevent it from being split in the first place.
So, -period_iterations_num 1
But beware: this will expose all the problems described in the first post, so approach the lowest possible value of 1 with care.
And -sbs 256 or higher. Though it will not increase the amount of work (that is, kernel length) per se, it will improve the efficiency of the kernel's work.
Because of that Windows limitation of 1 ms as the minimal sleep (on Linux, AFAIK, the situation can be improved by using a special nano-sleep library), this method of saving CPU time spent on synching is quite clumsy. So it is currently implemented only for the longest kernels (that is, PulseFind).
Maybe in future builds its area of effect will be restricted even further, to only the longest of those PulseFind calls (they differ hugely in length through the task). This will allow a bigger GPU load but will also increase the CPU time wasted on sync polling.
As of 6 June 2016 the low-performance path enables sleep automatically (this can be disabled via the -no_defaults_scaling option) with a sleep time of 5 ms.
In the current SVN head this has already been changed to 1 ms.
27.06.2016 addon
Tests on a C-60 APU based host (here one can see raw data for this experiment: ) revealed that the real sleep time increases quite coarsely.
No matter what the Sleep() argument is, the real sleeping time increases in steps of 15 ms.
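Anyone can repeat this kind of check on their own host. The sketch below is not the test used for the numbers above; it simply times Python's time.sleep(), which is assumed here to be a reasonable proxy for the system sleep call, and measure_sleep is a name made up for the illustration.

```python
import time

def measure_sleep(requested_ms, repeats=5):
    """Return the real duration in ms of each sleep call of requested_ms."""
    actual = []
    for _ in range(repeats):
        start = time.perf_counter()
        time.sleep(requested_ms / 1000.0)
        actual.append((time.perf_counter() - start) * 1000.0)
    return actual


if __name__ == "__main__":
    for requested in (1, 5, 10, 20):
        real = measure_sleep(requested)
        print(requested, "ms requested ->", [round(x, 2) for x in real])
```

If the printed durations cluster around multiples of the same step regardless of the requested value, the host shows the coarse granularity described above.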
So, instead of user-supplied values, starting from rev 3476 an adaptive sleeping algorithm is implemented that attempts to keep the partial PulseFind kernel size near 15 ms per call and adjusts the sleep time to the real kernel execution time.
Hence, in rev 3476 the value of -use_sleep_ex is ignored, and both -use_sleep and -use_sleep_ex N just enable this new sleep-adjustment algorithm.
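The actual rev 3476 code is not reproduced here. As a purely hypothetical illustration of what "keep the kernel near 15 ms and fit the sleep to the measured time" could look like, consider the sketch below; next_chunk, sleep_ms_for and the proportional scaling rule are all assumptions of this sketch, not the app's algorithm.

```python
import math

TARGET_MS = 15.0  # the sleep step observed in the tests described above

def next_chunk(chunk, measured_ms, min_chunk=1, max_chunk=1 << 20):
    """Scale the work per kernel call so that the next call lasts about TARGET_MS."""
    if measured_ms <= 0:
        return min(chunk * 2, max_chunk)  # no usable timing yet: just try more work
    scaled = int(round(chunk * TARGET_MS / measured_ms))
    return max(min_chunk, min(scaled, max_chunk))

def sleep_ms_for(measured_ms):
    """Sleep only for the whole timer steps that fit inside the measured kernel time."""
    return math.floor(measured_ms / TARGET_MS) * TARGET_MS


if __name__ == "__main__":
    chunk = 100
    for measured in (30.0, 5.0, 16.0):
        print("measured", measured, "ms -> chunk", next_chunk(chunk, measured),
              "sleep", sleep_ms_for(measured), "ms")
```

The point of such a scheme is that a kernel shorter than one timer step gets no sleep at all (the host would oversleep and starve the GPU), while longer kernels are slept through in whole steps.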