Recent Posts


Discussion Forum / Re: Better sleep on Windows - new round

« Last post by Raistmer on 17 Aug 2016, 09:29:00 am »

1. Tight loop (w/o any sleep attempts).

CPU idle:
typical sleep quantum is 9.57e-5ms, that is, ~100ns - quite low overhead inside spin-loop (and, of course, full core CPU usage).

CPU busy with MB (idle priority processes):
roughly same 100ns per loop and 100% core load by GPU app.

2. Sleep(0) inside loop.

CPU idle:
quantum size ~850ns and full CPU core consumption by GPU app.

CPU busy with MB:
quantum size ~2ms and ~2-3% CPU usage by GPU app - good mode.

Conclusion from this part:
Sleep(0) yields to lower-priority processes (!). The GPU app process runs at below-normal priority while CPU MB runs at idle priority (the lowest possible), and CPU MB still gets almost the full CPU to run.

3. Sleep(1) inside loop.

CPU idle:
quantum size 1.0ms, CPU consumption <~2%

CPU busy with CPU MB:
quantum size varies from 1.0 to 1.5ms, but most readings are still 1.0ms; CPU consumption by GPU app <2%.

Conclusion for this part:
Sleep(1) with the high-precision multimedia timer is more stable than Sleep(0) in both CPU idle and CPU busy modes: it saves CPU cycles and gives quite stable yield intervals.

4. SwitchToThread inside loop.

CPU idle:
quantum size ~660ns (less overhead than Sleep(0)); full core CPU consumption (same as Sleep(0) on idle CPU).

CPU busy:
quantum size varies from 0.01ms to ~2.7ms with most readings near 2.3ms; CPU consumption <2%.

Summary for this part: in idle CPU mode SwitchToThread is as useless as Sleep(0), with slightly less overhead. In busy CPU mode SwitchToThread and Sleep(0) behave very similarly. Full task benchmarks are needed to see which is better. But currently neither seems better than Sleep(1).

Next post will compare Sleep(0), Sleep(1) and SwitchToThread() for PG-VHAR task on fully loaded CPU. It will take some time to conduct.

Discussion Forum / Better sleep on Windows - new round

« Last post by Raistmer on 17 Aug 2016, 09:12:15 am »

Here I'll collect all new attempts to free the CPU for time intervals shorter than a single millisecond.

The spin loop includes a check for the GPU event ready state (so some overhead is implied). The code section for this experiment is:

Code:

			if(use_sleep){//R: yields the CPU in a loop until readback has finished
				cl_event ev;
				clEnqueueMarker(cq,&ev);//R: marker completes once all previously queued commands are done
				clFlush(cq);
				size_t wait_time=0;//R: number of loop iterations, not a time value
				cl_int ret=CL_QUEUED;//R: initialized in case the first status query fails
				do{
					SwitchToThread();/*nanosleep(100);*//*Sleep(use_sleep_ex);*/
					wait_time++;
					err=clGetEventInfo(ev,CL_EVENT_COMMAND_EXECUTION_STATUS,sizeof(ret),&ret,NULL);
				}while(err==CL_SUCCESS && ret>CL_COMPLETE);//R: stop on completion, abnormal termination (ret<0) or query error
				cl_ulong start=0,end=0;//R: profiling timestamps, in ns
				err=clGetEventProfilingInfo(ev,CL_PROFILING_COMMAND_QUEUED,sizeof(cl_ulong),&start,NULL);
				err|=clGetEventProfilingInfo(ev,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
				OCL_LOG_ERR("clGetEventProfilingInfo");
				float cur_quantum=(float)((end-start)/(wait_time*1e6));//R: average ms per loop iteration
				clReleaseEvent(ev);
				if(use_sleep_ex==1 && wait_time>7)SleepQuantumCounter::update(cur_quantum);
				if(verbose==6){
					if(use_sleep_ex==1)fprintf(stderr,"current sleep quantum %2.4gms\t",cur_quantum);
					//R: size_t must not be printed with %d; cast and print as unsigned long
					fprintf(stderr,"Sleep before triplet result map: Awaited %lu iterations for completion; elapsed %2.4gms\n",
						(unsigned long)wait_time,(end-start)/1e6);
				}
			}

While the counter provides the average sleep quantum, -v 6 gives per-instance results and allows manual averaging "by sight".
So I'll use a VHAR (AR=0.75) task with the SoG flavour, where spins more than a second long can occur on C-60 hardware.
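The quantum arithmetic from the code section above can be checked in isolation. This is only a minimal sketch, assuming the profiling timestamps are in nanoseconds (as OpenCL reports them); sleep_quantum_ms is a hypothetical helper name, not something from the app:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the app): average time per wait-loop
// iteration in milliseconds, from profiling timestamps given in ns.
double sleep_quantum_ms(std::uint64_t queued_ns, std::uint64_t end_ns,
                        std::uint64_t iterations) {
    assert(iterations > 0 && end_ns >= queued_ns);
    // ns -> ms is a division by 1e6; then divide by the iteration count.
    return static_cast<double>(end_ns - queued_ns) / (iterations * 1e6);
}
```

For instance, 10 iterations spread over 10,000,000 ns give a quantum of 1.0 ms.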
OS is Win7 x64.

Previous results were:
Sleep(1) can be 15ms long on C-60 - too big a quantum for many kernels.
Adding -high_prec_timer makes it ~1ms long - good enough, but changing the system-wide multimedia timer could negatively affect whole host performance.
Using a nanosleep() implementation for Windows based on a waitable timer (https://gist.github.com/Youka/4153f12cf2e17a77314c) gave the same ~1ms quantum (though the overhead of such a function call is expected to be higher than Sleep(1), so no advantage here).

Per Shaggie76's suggestion (http://setiathome.berkeley.edu/forum_thread.php?id=79954&postid=1809886) I'll explore SwitchToThread behavior in different host load modes.

For this experiment the CPU frequency of the C-60 is fixed at 1GHz, even in P0 & P2 states, by the BrazosTweaker app.
The exact tune line is: -period_iterations_num 4 -v 6 -use_sleep -high_prec_timer

Discussion Forum / Re: MB/AP Bench Test Instruction

« Last post by William on 17 Jun 2016, 04:13:45 am »

The principle behind AP and MB bench is the same - you should be able to directly transfer what you learned for AP.

Discussion Forum / Re: MB/AP Bench Test Instruction

« Last post by Dirk on 16 Jun 2016, 02:07:32 pm »

It looks like the more I write in my poor English, the less I am understood.

I was disappointed that guppi.vlar's are sent to GPUs.
From my experience at SETI-Beta (default settings), they take three times as long as a mid-AR task (on my FuryX's). On CPU they are shorter than mid-AR tasks.

I optimized (just played around with) the cmdline settings and now the guppi.vlar's take as long as mid-AR tasks (on my FuryX's).
Not as much performance loss as I thought.

I did this on my own, no one told me how to do it.

Because of this I would like to know how to optimize the cmdline settings to get max performance, specifically on my own PC system.

For some time, Josef W. Segur guided me through the AstroPulse bench test run (on my Intel J1900 iGPU).
After this I understood the procedure and I did the same on my NV GT730 alone.
I can do this also on my FuryX's PC.

It would be very nice if someone could guide me through the MultiBeam bench test run, e.g. on my FuryX's.
Then I can do this also on my other GPUs.

Others could also follow it and optimize their GPUs.

Maybe afterwards I could write an instruction on how to do bench test runs.
From the point of view of a user, that could be helpful.

Thanks.

Windows / SETI@home v8.12 Windows GPU applications support thread

« Last post by Raistmer on 09 Jun 2016, 10:52:44 pm »

Known issues:

-GUI lags/driver restarts on some GPU/driver combos:

To reduce/eliminate GUI lags while using v8.12 GPU applications, try adding the following parameters to the application's command line (there are a few different ways to supply a command line to the app; please consult the BOINC documentation and the app's ReadMe):
-sbs 256 -period_iterations_num 100

Increase the value of the -period_iterations_num option if needed.

- When running together with the GPU AstroPulse application and using -total_GPU_instances_num N, process instances are incorrectly bound to the same CPU instead of different CPUs in CPUlock mode.
For single-vendor GPUs use the older
-instances_per_device N
option for both apps.

For nVidia builds:
- -use_sleep stops processing with the "SoG" 8.12 version.
Please use this build instead if you need the -use_sleep option: https://cloud.mail.ru/public/3X4g/HwEhUWHCE

There can be excessive CPU usage for some tasks. If lower CPU usage is desirable, add the
-use_sleep
option to the command line.

Detailed descriptions of the mentioned options can be found here (for a brief description see the application's ReadMe file): http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931

Discussion Forum / Re: Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

« Last post by Raistmer on 06 Jun 2016, 11:07:37 am »

Now let's discuss

-use_sleep

and

-use_sleep_ex N

options.

They are quite important for the modern nVidia OpenCL runtime implementation, much less important for AMD's, and hardly have any use for Intel's implementation.
To understand why they are needed, let's talk about the ways a CPU interacts with auxiliary hardware like a GPU.
To work together these devices need some way to communicate with each other. They also need some mechanism to order their memory accesses so they can cooperate and produce correct results.
At the very core of a typical PC system the CPU can "talk" to the outer world via memory writes (addresses can be mapped to hardware other than real RAM) and port writes (and where there are writes there are reads too). And the external world can communicate with the CPU by setting signals on some of its lines that represent so-called interrupt requests.
From these abilities come 2 basically different ways of synchronisation: asynchronous communication (via interrupts that occur when the external hardware needs attention, and the CPU responds) and so-called polling, that is, constantly reading from some port or memory range until the hardware sets some flag.
Of course, polling requires the CPU to constantly execute the corresponding I/O instruction - the CPU will be busy.
On the other hand, modern GPUs are so fast that if they sent an interrupt to the CPU on each required action, the interrupt controller would simply be overwhelmed and, again, the CPU would do nothing but switch to/from interrupt processing (which is costly). So some balance between these sync methods is required to make the CPU less busy and the GPU more busy instead. Bus mastering allows the GPU to access memory directly without disturbing the CPU. But anyway, once the CPU needs the data, it has to know somehow that the data is ready and all external writes have finished, so the data can be read in its final form.
All this low-level machinery is isolated from the application at the driver and framework runtime levels. And different vendors use a different mix of possible sync primitives to achieve their goals. Thus we see such noticeable differences in OpenCL MultiBeam behavior on different vendors' runtimes.
It seems the sync algorithm used by both vendors depends on the estimated (measured?) length of the kernel call(s). But the defaults are different.
This is most vividly exposed by the SoG (Signals on GPU) flavour of the MultiBeam app.
For so-called VHAR tasks (telescope movement is so fast and wide that some searches, like Pulse find or Gaussian find [see previous posts for Pulse find], just can't collect enough data points to form an adequate array of data) some searches are disabled. With the remaining searches processed completely on the GPU (including intermediate signal logging) there is no need to disturb the CPU at all after the initial kernel enqueuing. The CPU just sends everything to the GPU and then waits for the final results. This allows maximal decoupling between the devices.
But once independent GPU work time exceeded some threshold, the usually low-CPU AMD build suddenly started to consume a full CPU core. WTF? Well, some limit hardwired into the AMD runtime was exceeded and (perhaps) the runtime started to think the kernel would finish "soon enough". But no, it continues - and the CPU enters polling mode for a prolonged amount of time. Unfortunately, even -use_sleep can't help in this AMD case, because the polling is done not in the app's process but inside the driver process.
So the solution was to restrict (!) the length of CPU-GPU decoupling and to issue synchronizing clFinish() calls every N iterations (btw, the app has an as yet undocumented switch that governs this period). This way the runtime stays on its default sync method (interrupts w/o heavy polling?) and CPU consumption stays low enough.
With nVidia we have the diametrically opposite situation. The default behavior here is polling, and switching away from polling occurs on quite prolonged loads. So on NV, SoG results in a great decrease in CPU usage for VHAR.
Luckily, the NV runtime executes its polling from one of the app's process threads. Hence it can be switched off the CPU by a standard system Sleep() call.

Here we come to the -use_sleep* options.
-use_sleep enables a 1ms Sleep() call (on Windows 1ms is the lowest possible sleep duration, apart from a plain yield with a value of 0) in the loop that waits for completion of an event marker specially inserted into the GPU processing queue.
-use_sleep_ex N just allows passing a particular number to that Sleep() call (NB, try -use_sleep_ex 0 to decrease the sleep as much as possible - with external CPU load it could decrease CPU consumption in the waiting loop too, while not causing big GPU starvation).
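The shape of such a waiting loop can be sketched in portable C++. This is not the app's code: wait_with_sleep and the is_done predicate are hypothetical names, std::this_thread::sleep_for stands in for Sleep(N) and std::this_thread::yield for Sleep(0), and a real build would query the OpenCL marker event instead of calling a predicate.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Sketch only: poll `is_done` and give up the CPU between checks.
// sleep_ms == 0 stands in for a plain yield, like Sleep(0) on Windows.
// Returns the number of loop iterations spent waiting.
template <typename DonePredicate>
std::size_t wait_with_sleep(DonePredicate is_done, unsigned sleep_ms) {
    std::size_t iterations = 0;
    do {
        if (sleep_ms == 0)
            std::this_thread::yield();
        else
            std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms));
        ++iterations;
    } while (!is_done());
    return iterations;
}
```

The returned iteration count plays the same role as the loop counter used to estimate the average sleep quantum.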
What are the implications of such an approach? Only a single but big one: typical kernel times and the minimal possible sleep time differ by an order of magnitude.
That is, by sleeping even 1ms we will starve the GPU. Especially a high-end one.
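To put a number on that starvation, here is a rough sketch under a simplifying assumption that is mine, not measured: the host wakes only on whole multiples of the sleep quantum and nothing else is queued, so the GPU idles from the end of the kernel until the next wake-up. gpu_idle_fraction is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Sketch under a simplifying assumption: the host wakes only on multiples of
// the sleep quantum, so the GPU sits idle from kernel end to the next wake-up.
// Returns the idle share of the whole wait (0.0 = never idle).
double gpu_idle_fraction(double kernel_ms, double quantum_ms) {
    assert(kernel_ms > 0.0 && quantum_ms > 0.0);
    const double wake_ms = std::ceil(kernel_ms / quantum_ms) * quantum_ms;
    return (wake_ms - kernel_ms) / wake_ms;
}
```

With a 0.1ms kernel and a 1ms quantum this gives about 0.9, that is, the GPU would idle roughly 90% of the time unless more work is queued ahead of the sleep.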
What is a possible way to overcome this limitation? To somehow increase the amount of work pre-scheduled to the GPU so that it can ride out the app's sleep.
The methods currently available to do this:
1) increase the number of independent kernel queues (streams in CUDA terminology).
2) increase the length of the scheduled kernel call.

Which options could help?
1) Multitasking. Try to run a few simultaneous tasks per GPU: while one process sleeps, another will do kernel scheduling or even (if you are lucky and VLAR gets mixed with VHAR) schedule so many kernel calls at once that the GPU will be totally busy. Unfortunately, multitasking comes with its own overheads. For pre-Fermi architectures it is so costly that the overhead offsets any possible gain. Fermi and up made multitasking much more feasible.
The corresponding app option is -instances_per_device N .
Consult elsewhere for the corresponding app_config.xml content.

2) To make the kernel bigger, one could prevent it from being split in the first place.
So, -period_iterations_num 1
But beware, this will expose all the problems described in the first post, so approach the lowest possible value of 1 with care.
And -sbs 256 or higher. Though it will not increase the amount of work (that is, kernel length) per se, it will improve the efficiency of the kernel's work.

Because of that Windows limitation of 1ms as the minimal sleep (on Linux, AFAIK, the situation can be improved by using a special nano-sleep library) this method of saving CPU time spent on synching is quite clumsy. So it's currently implemented only for the longest kernels (that is, PulseFind).
Maybe in future builds its area of effect will be restricted even more, to only the longest of those PulseFind calls (they differ hugely in length through the task). This would allow bigger GPU load but would also increase the CPU time wasted on sync polling.

As of 6 June 2016 the low-performance path enables sleep automatically (can be disabled via the -no_defaults_scaling option) with a sleep time of 5ms.
In the current SVN head this has already been changed to 1ms.

27.06.2016 addon
Tests on a C-60 APU based host (raw data for this experiment can be seen here: https://drive.google.com/file/d/0BwjTLNvsJmLBbHB0d2hXTU4waEE/view?usp=sharing ) revealed that real sleep time increases quite coarsely.
No matter what the Sleep() argument is, real sleeping time increases in steps of 15ms.
So, instead of user-supplied values, starting from rev 3476 an adaptive sleeping algorithm is implemented that attempts to keep the partial PulseFind kernel size near 15ms per call and adjusts sleep time to the real kernel execution time.
Hence, in rev 3476 the value of -use_sleep_ex is ignored and both -use_sleep and -use_sleep_ex N just enable this new sleep-adjustment algorithm.
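One way to picture "keep the partial kernel size near 15ms" is the sketch below. It is a guess at the general idea only, not the rev 3476 code: rescale_iterations and its parameters are hypothetical, and it assumes the total PulseFind work stays constant while the number of partial calls changes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch, NOT the rev 3476 algorithm: rescale the number of
// partial PulseFind calls so that each call lasts roughly target_ms.
// Assumes total work = current_iterations * measured_call_ms stays constant.
unsigned rescale_iterations(unsigned current_iterations,
                            double measured_call_ms,
                            double target_ms = 15.0) {
    assert(current_iterations > 0 && measured_call_ms > 0.0 && target_ms > 0.0);
    const double total_ms = current_iterations * measured_call_ms;
    const long n = std::lround(total_ms / target_ms);
    return n < 1 ? 1u : static_cast<unsigned>(n);  // never fewer than one call
}
```

For example, 50 calls measured at 3ms each would be merged into 10 calls of about 15ms.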

Discussion Forum / Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

« Last post by Raistmer on 06 Jun 2016, 10:13:13 am »

Next important option to consider is

-sbs N

That stands for Single Buffer Size (N is in MB).

As you may recall, there are plenty of periods per each data array to process. That can be used to increase coarse GPU load (that is, to increase the number of active CUs participating in data processing). 8 separate entities to process is too few for modern devices.
So, what if we subdivide different period sets into separate entities? Each period is independent of the others, so it can be parallelized! No sooner said than done. But each array for a particular period set now represents separate input data. To process them simultaneously we need some storage space to keep this data. It can't be retrieved from the original data now because we want simultaneous changing (folding) of all arrays. So the requirements for storage (GPU memory used) increase.
Here comes the -sbs N option. In some sense it's similar to my AstroPulse -unroll N and -ffa_block N options. All of them govern some data unrolling to increase coarse parallelism. But unlike the AstroPulse options, where the number of unrolls is specified directly, here we can only set the upper boundary for the memory buffer size used.
Why so? Because unlike AstroPulse, where we deal with mostly pre-determined and unchanging data sizes and dimensions (the single exception is FFA, not for lack of determinism but just because there are too many variations to account for; nevertheless, exactly the same variations (periods) repeat in each AP task), in MultiBeam we have a recording hardware parameter (recall telescope movement!) - AR, angular range - that governs all data splitting for a particular task. So much more agility is required to deal with all possible patterns.
In early app versions periods were unrolled just up to the buffer size, to fill all available space. For some ARs this resulted in too much overhead (making arrays fully independent also means repeating some computations that now can't be shared - everything has its price). So in recent versions a new algorithm was devised to produce as many separate data entities as needed to fully load the GPU's CUs (within the constraints of available buffer size, of course).
So, summarizing, this option governs, though indirectly, the amount of data unroll in the PulseFind algorithm (currently it affects only PulseFind) to increase the degree of coarse parallelism.
With the recent algorithm change a few new options were introduced that work in cooperation with -sbs N.
They are:
-pref_wg_size N
and
-pref_wg_num_per_cu N

See the ReadMe for a short description. Together these 3 options allow control over the distribution of computational load inside the GPU device.
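The "as many entities as needed to load the CUs, but no more than the buffer holds" rule can be sketched as a simple min(). Everything here is an illustration under my own assumptions: unroll_count, array_bytes and wg_per_cu are hypothetical names, and the real app may count work differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: how many independent period-set arrays to unroll.
// `fits`   = how many arrays of array_bytes fit into the -sbs buffer (in MB);
// `wanted` = how many are needed to occupy every CU wg_per_cu times over.
std::size_t unroll_count(std::size_t sbs_mb, std::size_t array_bytes,
                         std::size_t compute_units, std::size_t wg_per_cu) {
    assert(array_bytes > 0);
    const std::size_t fits = sbs_mb * 1024 * 1024 / array_bytes;
    const std::size_t wanted = compute_units * wg_per_cu;
    return std::min(fits, wanted);
}
```

In this model a 128MB buffer with 4MB arrays caps the unroll at 32 entities even if a 40-CU device could use 80, which is the kind of case where raising -sbs would matter.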

As of 6 June 2016 the default value for -sbs N is 128.
Where did it come from? Some of the first OpenCL devices (AMD HD4xxx ones) had only 128MB of available buffer no matter how much GPU RAM the device had. Also, some of them, though they allowed a big buffer, produced invalid results. So the initial default was as low as 64MB. Currently GPUs in general have plenty of memory onboard and runtimes have evolved to support ~1/4 of installed memory in a single allocation (that's what a memory buffer represents - a single allocation of memory).
But on some low-end GPUs a bigger buffer leads to driver restarts, so I decided to follow the "better safe than sorry" approach once again and restrict the default to a pretty universal value. For most GPUs it can be increased to 256. Maybe future builds will have a bigger or floating default.

Discussion Forum / Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

« Last post by Raistmer on 06 Jun 2016, 08:41:11 am »

Because the same questions about MB tuning arise again and again, I'll try to give some insight into what the options do, to turn the anonymous tuning parameter space into something more understandable. Knowledge of the app's machinery could help with guided tuning IMHO.

A short description of the options is given in the ReadMe file that comes with the app. Here I'll try to explain some of the options in more detail. I did the same many times in different threads on different forums, but the info gets scattered and lost. So, here is a separate thread that is publicly readable but quite restricted in commenting, so it has a good chance of not getting diluted.

Well, the first quite important option (especially in view of the arrival of GBT data, which is mostly VLAR):

-period_iterations_num N

The MultiBeam app consists of a few different searches for artificial signals. This option belongs to the so-called Pulse signal search. That is, a short burst of energy. Because we know neither the interval between bursts nor their duration, we have to guess some possible ones (that is, search/scan the parameter space for both duration and period). So one needs to sum up the powers of adjacent points through time in different sequences to see if for some pattern the resulting energy exceeds some threshold (the threshold roughly designates the background level that could be expected from random white noise). We should do such summation over all the time we receive the signal from the same point of the sky.
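The summation idea can be shown with a deliberately simplified fold. This is a sketch of the general technique, not the app's PulseFind kernel: fold and exceeds_threshold are hypothetical names, and the real search also scans pulse durations and uses proper statistical thresholds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch (not the app's kernel): fold a power array at a trial
// period by summing points that are `period` samples apart.
std::vector<float> fold(const std::vector<float>& power, std::size_t period) {
    assert(period > 0);
    std::vector<float> sums(std::min(period, power.size()), 0.0f);
    for (std::size_t i = 0; i < power.size(); ++i)
        sums[i % period] += power[i];
    return sums;
}

// A trial period is interesting if any folded bin rises above the threshold.
bool exceeds_threshold(const std::vector<float>& sums, float threshold) {
    return std::any_of(sums.begin(), sums.end(),
                       [threshold](float s) { return s > threshold; });
}
```

Folding {1, 2, 3, 4, 5, 6} at period 2 gives {9, 12}; every trial period needs its own fold, which is where the large number of patterns comes from.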
MultiBeam is mostly a piggy-back algorithm (that is, we collect all the data we can have, not the data we want to have). Sometimes the telescope moves fast between points... but the SETI recorder still works - this restricts the available length of the data array for summation.
Also, what frequency range should one use? Again, we don't know the frequency width of the signal, so we check different ones.
All this results in many different patterns arising from the same initial data (there is also Doppler shift, which leads to dechirping, but this just increases the number of possible patterns even more). The initial data (as of v8, v7, v6) consists of 1024x1024 points. Depending on what frequency width we analyse, it can be split into matrices of different shapes.
For example 8 x 128k (k==1024 here) or 16k x 64.
One dimension will be the number of separate data arrays while the other is the length of each particular array.
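The split arithmetic is easy to check. A minimal sketch, with array_length as a hypothetical helper name: the 1024x1024 points are viewed as a given number of equal-length arrays.

```cpp
#include <cassert>
#include <cstddef>

// Total number of input points per task, as stated in the post (1024x1024).
constexpr std::size_t kTotalPoints = 1024u * 1024u;

// Sketch: length of each array when the data is viewed as `arrays` rows.
std::size_t array_length(std::size_t arrays) {
    assert(arrays > 0 && kTotalPoints % arrays == 0);
    return kTotalPoints / arrays;
}
```

With 8 arrays each one is 131072 (128k) points long; with 16k arrays each one has only 64 points.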
Of course, the more separate independent arrays we have, the easier the GPU's work (because by design a GPU is a massively parallel device).
So, obviously, there are extreme cases where the number of separate arrays is very low - that's the case of PulseFind (in particular) in a VLAR task.
Each array can be very long (because with VLAR we stare at almost the same point of the sky for the whole task data duration) but this results in a low number of such separate arrays.
Here we have issues for modern GPUs because there are only 8 such fully separate arrays (recall that a modern GPU can have as many as 40 or even more independent CUs (compute units) that can't be synchronised with each other (again, by GPU hardware design)).
This essentially leads to very low computational load (don't confuse real load with the load reported by tools like GPU-Z - even a single busy CU will be reported as a busy GPU by such tools) and bad processing speed. The GPU essentially becomes an 8-core CPU (but with much lower core frequency, as you know).
So something should be done to increase parallelization. Luckily, we have another parameter that needs to be scanned - pulse periods. There are plenty of them.
So, actually, those 8 arrays generate an enormous number of new arrays to fold, each folded a little differently (different periods) from the others. Memorize this fact because it will be used in another option's description.
So, if we try to process all possible periods in a single kernel call (especially with each iteration loading only, let's say, 8 CUs of the GPU (actually it can be even fewer, because a few arrays can be processed on a single CU)) we can get a very long kernel call even on the fastest GPUs. Hence lags and driver restarts.
Such a call should be split into many separate kernel calls to allow the GPU to switch context to other tasks like GUI redraw.
This option governs this process. It splits the single call into N subsequent calls, each of which processes only 1/Nth of all available periods.
So, summarizing, this option can help with lags and it can change the longest kernel duration (so it's useful when attempting to balance GPU computations with CPU sleeping), but it doesn't really help with real GPU load. Some speedup/slowdown can be observed because of the additional overhead of making separate calls and because of signals found early in the sequence: if a signal is detected, the remaining iterations are omitted in some versions.
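The splitting itself can be pictured as cutting the list of periods into N consecutive ranges, one per kernel call. A sketch only: split_periods is a hypothetical name, and how the app distributes a remainder is my assumption (here the first calls take one extra period).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: split `total_periods` into `n` consecutive kernel calls.
// Returns half-open [begin, end) ranges; the first calls absorb the remainder.
std::vector<std::pair<std::size_t, std::size_t>>
split_periods(std::size_t total_periods, std::size_t n) {
    assert(n > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base = total_periods / n;
    const std::size_t extra = total_periods % n;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

Splitting 10 periods into 4 calls gives the ranges [0,3), [3,6), [6,8) and [8,10). An early exit on a detected signal would simply stop iterating over these ranges.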

Current (as of 6 June 2016) defaults are 50 for most GPUs and 500 for so-called "low performance path" GPUs with only 3 or fewer CUs per device.

Discussion Forum / Re: MultiBeam v8 processing performance in pictures

« Last post by Raistmer on 05 Jun 2016, 03:17:33 pm »

And a picture for another host, with an nVidia GPU.
Much less data (the host is unstable and in a different location, and was mostly used for AstroPulse statistics exploration at the time).
But again, VLAR is hard, for CPU too.

Discussion Forum / Re: MultiBeam v8 processing performance in pictures

« Last post by Raistmer on 05 Jun 2016, 03:05:33 pm »

And with the range changed to zoom in on the GPU data.

The dots cover a large area - elapsed times for the GPU application vary a lot (the CPUlock mechanism allows more deterministic results these days).
But the best-case scenario (orange dots, for example) is very similar in behavior to the CPU data.

Once again, VLAR is hard, indeed. Even for GPU.
