The AstroPulse OpenCL application is currently available in 3 editions: for AMD/ATi, nVidia and Intel GPUs. It is intended to process SETI@home AstroPulse v7 tasks.

Source code repository: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
Build revision: 2737
Date of revision commit: 2014/10/31 00:12:41

****Available command line switches****

-v N : Sets the verbosity level of the app. N is an integer. -v 0 disables almost all output. The default corresponds to -v 1. Levels 2 to 5 are reserved for increasing verbosity; higher levels are reserved for specific usage.
  -v 2 enables output of all signals.
  -v 3, in addition to level 2, enables output of simulated signals corresponding to the current threshold level (to easily detect near-threshold validation issues).
  -v 6 enables printing of delays where sleep loops are used.
  -v 7 enables oclFFT config printing for oclFFT fine-tuning.

-ffa_block N : Sets how many different FFA period iterations will be processed per kernel call. N should be an even integer less than 32768. Increasing this value increases the app's GPU memory consumption.

-ffa_block_fetch N : Sets how many different FFA period iterations will be processed per "fetch" kernel call (the longest kernel in FFA). N should be a positive integer and a divisor of the -ffa_block value.

-unroll N : Sets the number of data chunks processed per kernel call in the main application loop. N should be an integer; the minimal possible value is 2. Increasing this value increases the app's GPU memory consumption.

-skip_ffa_precompute : Skips the FFA pre-compute kernel call. Affects performance; experimentation is required to see whether it increases or decreases performance on a particular GPU/CPU combo.

-exit_check : Checks more often for exit requests from BOINC. If you experience problems with long app suspends/exits, use this option. Can decrease performance though.

-use_sleep : Adds additional Sleep() calls to yield the CPU to other processes. Can affect performance. Experimentation required.

-initial_ffa_sleep N M : In PC-FFA, sleeps N ms for the short FFA and M ms for the large one before looking for results. Can decrease CPU usage. Affects performance; experimentation is required for a particular CPU/GPU/GPU driver combo. N and M should be non-negative integers. An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.

-initial_single_pulse_sleep N : In the SingleFind search, sleeps N ms before looking for results. Can decrease CPU usage. Affects performance; experimentation is required for a particular CPU/GPU/GPU driver combo. N should be a positive integer. An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.

-sbs N : Sets the maximum single buffer size for GPU memory allocations. N should be a positive integer giving the size in Mbytes. For now, if other options require a bigger buffer than this option allows, a warning will be issued but the memory allocation attempt will still be made.

-hp : Runs the application process at higher priority (normal priority class and above-normal thread priority). Can be used to increase GPU load; experimentation is required for a particular GPU/CPU/GPU driver combo.

-cpu_lock : Enables the CPUlock feature, which limits the number of CPUs available to a particular app instance. An attempt will also be made to bind different instances to different CPU cores. Can be used to increase performance under some specific conditions, but can decrease performance in other cases; experimentation required. Currently this option allows the GPU app to use only a single logical CPU. Different instances will use different CPUs as long as there are enough CPUs in the system. To use CPUlock in round-robin mode, the GPUlock feature will be enabled. Use the -instances_per_device N option if a few instances per GPU device are needed.

-cpu_lock_fixed_cpu N : Enables CPUlock too, but binds all app instances to the same N-th CPU (N = 0, 1, ..., number of CPUs - 1).

-gpu_lock : Enables the old-style GPU lock. Use the -instances_per_device N switch to provide the number of instances to run.

-instances_per_device N : Sets the allowed number of simultaneously executing GPU app instances per GPU device (shared with MultiBeam app instances). N is the integer number of allowed instances. These two options used together provide a BOINC-independent way to limit the number of simultaneously executing GPU apps. Each SETI OpenCL GPU application with these switches enabled will create/check global mutexes and suspend its process execution if the limit is reached. A waiting process consumes zero CPU/GPU time and a rather low amount of memory while it waits to continue execution.

-disable_slot N : Can be used to exclude the N-th GPU (starting from zero) from usage. An untested and obsolete feature; use BOINC's own abilities to exclude GPUs instead.
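As an illustration, several of the switches above can be combined on one command line. A hypothetical example (the values here are purely illustrative, not recommendations; see the best usage tips below for per-GPU values):

-v 2 -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -instances_per_device 2 -cpu_lock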
Advanced-level options for developers (some app code reading and an understanding of the algorithms used are recommended before use; these are not fool-proof even to the same degree as the options above):

-tune N Mx My Mz : To make the app more tunable, this parameter allows the user to fine-tune the kernel launch sizes of the most important kernels. N is the kernel ID (see below); Mx, My, Mz are the workgroup dimensions of the kernel. For 1D workgroups Mx will be the size of the first dimension and the other two should be My = Mz = 1. N should be one of the values from this list: FFA_FETCH_WG = 1, FFA_COMPARE_WG = 2. For best tuning results it is recommended to launch the app under a profiler to see how a particular workgroup-size choice affects a particular kernel. This option is mostly for developers and hardcore optimization enthusiasts who want the absolute max from their setups. No big changes in speed are expected, but if you see a big positive change over the default, please report it. Usage example: -tune 2 32 1 1 (sets a workgroup size of 32 for the 1D FFA comparison kernel).

-oclFFT_plan A B C : Overrides the defaults for 32k FFT plan generation. Read the oclFFT code and the explanations in its comments before any tweaking. A is the global radix, B the local radix, and C the maximum size of the workgroup used by the oclFFT kernel generation algorithm. Usage examples: -oclFFT_plan 64 8 256 (corresponds to the old defaults); -oclFFT_plan 0 0 0 (effectively means the option is not used and the hardwired defaults are in play).

These switches can also be placed into ap_cmdline_win_x86_SSE2_OpenCL_NV.txt. For examples of app_info.xml entries, look into the text file with the .aistub extension provided in the corresponding package.

For device-specific settings in multi-GPU systems it is possible to override some of the command-line options via an application config file. The name of this config file is AstroPulse_<vendor>_config.xml, where <vendor> can be ATi, NV or iGPU. The file consists of <deviceN> sections, where deviceN refers to the particular OpenCL device N, starting with 0; multiple sections are allowed, one per device.
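A minimal sketch of what such a config file could look like, assuming the per-device tags mirror the tuning switches listed above (the tag names and values here are an assumption for illustration; check the files shipped with your package for the exact supported set):

AstroPulse_NV_config.xml:

<device0>
    <!-- settings for OpenCL device 0 (assumed tags, mirroring the command-line switches) -->
    <unroll>12</unroll>
    <ffa_block>12288</ffa_block>
    <ffa_block_fetch>6144</ffa_block_fetch>
</device0>
<device1>
    <!-- more conservative settings for a weaker OpenCL device 1 -->
    <unroll>4</unroll>
    <ffa_block>2048</ffa_block>
    <ffa_block_fetch>1024</ffa_block_fetch>
</device1>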
****Known issues****

- With 12.x Catalyst drivers, GPU usage can be low if the CPU is fully loaded with other work. The same applies to nVidia drivers past 267.xx and to Intel SDK drivers. If you see low GPU usage on zero-blanked tasks, try to free one or more CPU cores.
- For overflowed tasks, the found signal sequence does not always match the CPU version.
- If you experience problems with time-to-completion estimates from BOINC, you could try this advice by Terror Australis (http://setiathome.berkeley.edu/forum_thread.php?id=71301&postid=1354911): for AstroPulse the flops entry sometimes has to be in scientific notation format for BOINC to understand it right, i.e. XXXXe0x, where x is the number of zeros after the integer, e.g. 9 for gigaflops, 8 for hundreds of megaflops, etc. Thus the entry for a GTX 470 (1120 GF) is 1120e09. For a GTX 550 Ti (486 GF) it would be 486e09; for a GTX 580 (1679 GF), 1679e09. For a low-powered card such as a GTS 250 (756 MF IIRC; not recommended for AP work) the entry would be something like 756e08, and so on. The flops value can be found at the top of the BOINC messages tab, where the boot-up details are. (See the example app_info.xml fragment at the end of this file.)

****Best usage tips****

For best performance when running multiple instances it is important to free 2 CPU cores. Freeing at least 1 CPU core is a necessity to get enough GPU usage.*

* As an alternative solution, try the -cpu_lock / -cpu_lock_fixed_cpu N options. This might only work on fast multi-core CPUs. Further testing required.

Command line parameters
_______________________

Command line switches can be used either in app_info.xml or in ap_cmdline_win_x86_SSE2_OpenCL_NV.txt. Params in ap_cmdline*.txt will override switches in the <cmdline> tag of app_info.xml.

NV Titan and 780 Ti:
-unroll 16 -ffa_block 16384 -ffa_block_fetch 8192
* Bigger unroll values (< 20) do not necessarily result in better run times. Further testing required.

NV x70/x80:
-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144

NV x50/x60 and x60 Ti:
-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536

NV x30/x40:
-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024

-tune switch
____________

Tune values must be equal to or less than the max workgroup size. Most modern nVidia cards have a max workgroup size of 1024. Possible values:
-tune 1 256 4 1
-tune 1 128 8 1
-tune 1 64 16 1
-tune 1 32 32 1
-tune 1 16 64 1

Intensive testing highlighted -tune 1 64 8 1 -tune 2 64 8 1 as the fastest on mid-range and high-end GPUs. On entry-level cards -tune 1 128 8 1 -tune 2 128 8 1 should be fastest. Further testing is required for other GPUs.

-oclFFT_plan switch (use at your own risk!)
___________________________________________

FFTs are processed with 8-point FFT kernels by default. Using a different FFT kernel plan can speed up processing significantly. In most cases 16-point FFT kernels are fastest for AstroPulse v7:
-oclFFT_plan 256 16 256
On entry-level cards like the NV 610, using the full workgroup size is faster:
-oclFFT_plan 256 16 1024

For those suffering from high CPU usage, please use the -use_sleep switch. Note: using the -use_sleep switch requires bigger unroll values to get better GPU utilization. Examples:

NV x80/x70:
-use_sleep -unroll 18 -oclFFT_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 8 1 -tune 2 64 8 1

NV 750 Ti:
-use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 12288 -ffa_block_fetch 6144

NV 610:
-use_sleep -unroll 4 -oclFFT_plan 256 16 1024

Your mileage might vary.

App instances
_____________

On most cards running 2 instances should be best. If you experience screen lags, reduce the unroll factor and the ffa_block_fetch value.

Addendum
________

Running multiple cards in a system requires freeing another CPU core.
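For reference, a minimal sketch of an app_info.xml <app_version> fragment showing where the <cmdline> and <flops> entries go (the app name, version number, ncpus/coproc values and file name below are placeholders; take the real values from the .aistub file in your package):

<app_version>
    <app_name>astropulse_v7</app_name>
    <version_num>704</version_num>
    <avg_ncpus>0.05</avg_ncpus>
    <max_ncpus>0.2</max_ncpus>
    <coproc>
        <type>NVIDIA</type>
        <count>0.5</count>
    </coproc>
    <!-- command-line switches from this ReadMe go here -->
    <cmdline>-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144</cmdline>
    <!-- flops in scientific notation, per the Known issues advice above -->
    <flops>1120e09</flops>
    <file_ref>
        <file_name>AP7_win_x86_SSE2_OpenCL_NV_r2737.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>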