AstroPulse v7
-------------

Linux, 64bit, AMD/ATI OpenCL :
This executable works on kernel version 2.6.32 or newer.
The AMD Catalyst fglrx driver for Linux, which ships with an OpenCL
runtime component, must be installed on your host to run this
application.


The AstroPulse OpenCL application is currently available in 3 editions: for AMD/ATi, nVidia and Intel GPUs. (MacOSX 64bit)
The AstroPulse OpenCL application is currently available in 2 editions: for AMD/ATi and nVidia GPUs.        (Linux 64bit)
It is intended to process SETI@home AstroPulse v7 tasks.

Source code repository: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
Build revision: 2751
Date of revision commit: April 2015

Available command line switches:

-v N :sets the app's verbosity level. N is an integer. -v 0 disables almost all output. The default corresponds to -v 1.
	Levels 2 to 5 are reserved for increasing verbosity; higher levels are reserved for specific usage.
	-v 2 enables output of all signals.
	-v 3 enables, in addition to level 2, output of simulated signals at the current threshold level (to easily detect near-threshold validation issues).
	-v 6 enables printing of delays where sleep loops are used.
	-v 7 enables oclFFT config printing for oclFFT fine tuning.
	-v 8 enables printing of allocated GPU memory for different parts of the algorithm.

-ffa_block N :sets how many of the FFA's different period iterations will be processed per kernel call. N should be an even integer less than 32768.
        Increasing this parameter's value increases the app's GPU memory consumption.

-ffa_block_fetch N: sets how many of the FFA's different period iterations will be processed per "fetch" kernel call (the longest kernel in the FFA).
	N should be a positive integer and a divisor of the -ffa_block value.
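The constraints on these two switches can be summarized in a short sketch (Python, for illustration only; the function name is made up here, and the app performs its own validation):

```python
def valid_ffa_params(ffa_block, ffa_block_fetch):
    """Check the documented constraints on -ffa_block N and -ffa_block_fetch N."""
    if ffa_block % 2 != 0 or ffa_block >= 32768:
        return False  # -ffa_block must be an even integer below 32768
    if ffa_block_fetch <= 0 or ffa_block % ffa_block_fetch != 0:
        return False  # -ffa_block_fetch must be a positive divisor of -ffa_block
    return True

print(valid_ffa_params(16384, 8192))  # True: a documented high-end preset
print(valid_ffa_params(16384, 6000))  # False: 6000 does not divide 16384
```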

-unroll N :sets the number of data chunks processed per kernel call in the main application loop. N should be an integer; the minimal possible value is 2.
	Increasing this parameter's value increases the app's GPU memory consumption.

-skip_ffa_precompute : Skips the FFA pre-compute kernel call. Affects performance; experimentation is required to see whether it increases or decreases performance on a particular GPU/CPU combo.

-exit_check :Results in more frequent checks for exit requests from BOINC. If you experience problems with long app suspends/exits, use this option.
	It can decrease performance, though.
   (not available on 64bit linux or MacOSX)

-use_sleep :Results in additional Sleep() calls to yield CPU to other processes. Can affect performance. Experimentation required.

-initial_ffa_sleep N M: In PC-FFA, sleeps N ms for the short FFA and M ms for the large one before looking for results. Can decrease CPU usage.
	Affects performance; experimentation is required for a particular CPU/GPU/GPU driver combo. N and M should be non-negative integers.
	An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.
   (not available on 64bit linux or MacOSX)

-initial_single_pulse_sleep N : In the SingleFind search, sleeps N ms before looking for results. Can decrease CPU usage.
	Affects performance; experimentation is required for a particular CPU/GPU/GPU driver combo. N should be a positive integer.
	An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.
   (not available on 64bit linux or MacOSX)

-sbs N :Sets the maximum single buffer size for GPU memory allocations. N should be a positive integer giving the size in MB.
	Currently, if other options require a bigger buffer than this option allows, a warning will be issued but the memory allocation attempt will still be made.

-hp : Gives the application process higher priority (normal priority class and above-normal thread priority).
	Can be used to increase GPU load; experimentation is required for a particular GPU/CPU/GPU driver combo.
   (not available on 64bit linux or MacOSX)

  On Linux and MacOSX :
  Due to OS permission rules, you can achieve normal priority by setting <no_priority_change>1</no_priority_change> in the <options>
  section of your BOINC's "cc_config.xml" file. Check the BOINC manuals/wiki for details on where to find this file and how to set it up.
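  As an illustration, a minimal cc_config.xml fragment with this flag set (see the BOINC documentation for the full file layout):

```xml
<cc_config>
  <options>
    <no_priority_change>1</no_priority_change>
  </options>
</cc_config>
```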

-cpu_lock : Enables the CPUlock feature. Limits the number of CPUs available to a particular app instance. An attempt will also be made to bind different instances to different CPU cores.
	Can be used to increase performance under some specific conditions, but can decrease performance in other cases. Experimentation is required.
	Currently this option allows the GPU app to use only a single logical CPU.
	Different instances will use different CPUs as long as there are enough CPUs in the system.
	To use CPUlock in round-robin mode, the GPUlock feature will be enabled. Use the -instances_per_device N option if a few instances per GPU device are needed.
  (not available on 64bit linux or MacOSX)

-cpu_lock_fixed_cpu N : Also enables CPUlock, but binds all app instances to the same N-th CPU (N=0,1,.., number of CPUs-1).
  (not available on 64bit linux or MacOSX)

-gpu_lock :Enables the old-style GPU lock. Use the -instances_per_device N switch to provide the number of instances to run.
  (not available on 64bit linux or MacOSX)

-instances_per_device N :Sets the allowed number of simultaneously executing GPU app instances per GPU device (shared with MultiBeam app instances).
	N - integer number of allowed instances.
  (not available on 64bit linux or MacOSX)

  On Linux and MacOSX :
  Check BOINC manuals/wiki for "app_config.xml" setup to achieve multiple tasks running on one GPU.
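  As an illustration, a hypothetical app_config.xml entry that runs two tasks per GPU might look like the fragment below. The <name> value must match the app name BOINC uses (astropulse_v7 is an assumption here; check your client_state.xml), and the usage fractions are placeholders:

```xml
<app_config>
  <app>
    <name>astropulse_v7</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```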


Used together, the -gpu_lock and -instances_per_device options provide a BOINC-independent way to limit the number of simultaneously
executing GPU apps. Each SETI OpenCL GPU application with these switches enabled will create/check global mutexes and suspend its process
execution if the limit is reached. A waiting process consumes zero CPU/GPU time and a rather small amount of memory while waiting to continue execution.

-disable_slot N: Can be used to exclude the N-th GPU (starting from zero) from usage.
	An untested and obsolete feature; use BOINC's abilities to exclude GPUs instead.
  (not available on 64bit linux or MacOSX)

Advanced-level options for developers (some reading of the app code and an understanding of the algorithms used are recommended before use; these are not fool-proof even to the same degree as
the options above):
-tune N Mx My Mz : to make the app more tunable, this parameter allows the user to fine-tune the kernel launch sizes of the most important kernels.
	N - kernel ID (see below)
	Mxyz - workgroup size of the kernel. For 1D workgroups, Mx is the size of the first dimension and the other two should be My=Mz=1.
	N should be one of the values from this list:
	FFA_FETCH_WG=1,
	FFA_COMPARE_WG=2
	For best tuning results it's recommended to launch the app under a profiler to see how a particular WG size choice affects a particular kernel.
	This option is mostly for developers and hardcore optimization enthusiasts who want the absolute max from their setups.
	No big changes in speed are expected, but if you see a big positive change over the default, please report it.
	Usage example: -tune 2 32 1 1  (sets a workgroup size of 32 for the 1D FFA comparison kernel).
-oclFFT_plan A B C : overrides the defaults for FFT 32k plan generation. Read the oclFFT code and the explanations in its comments before any tweaking.
	A - global radix
	B - local radix
	C - max size of the workgroup used by the oclFFT kernel generation algorithm
	Usage example: 	-oclFFT_plan 64 8 256 (this corresponds to the old defaults);
			-oclFFT_plan 0 0 0 (this effectively means the option is not used; the hardwired defaults are in play).

These switches can also be placed into a file called ap_cmdline.txt.
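For example, an ap_cmdline.txt applying the mid-range preset quoted later in this file would contain a single line:

```
-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144
```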

For device-specific settings in multi-GPU systems, it's possible to override some of the command-line options via an application config file.
The name of this config file is:
AstroPulse_<vendor>_config.xml, where vendor can be ATi, NV or iGPU.
File structure:
<deviceN>
	<unroll>N</unroll>
	<ffa_block>N</ffa_block>
	<ffa_block_fetch>N</ffa_block_fetch>
	<oclfft_plan>
	        <size>32768</size>
		<global_radix>N</global_radix>
		<local_radix>N</local_radix>
		<workgroup_size>N</workgroup_size>
	</oclfft_plan>
	<tune>
		<tune_kernel_index>N</tune_kernel_index>
		<tune_workgroup_size_x>N</tune_workgroup_size_x>
		<tune_workgroup_size_y>N</tune_workgroup_size_y>
		<tune_workgroup_size_z>N</tune_workgroup_size_z>
	</tune>
	<sbs>N</sbs>
	<skip_ffa_precompute>
	<no_defaults_scaling>
</deviceN>
where deviceN is the particular OpenCL device N, starting with 0; multiple sections are allowed, one per device.
The other fields are the corresponding command-line options to override for this particular device.
All or some sections can be omitted.
AstroPulse uses only one FFT size in its GPU-based calculations, so the <size> field in <oclfft_plan> is fixed at 32768 and can be omitted.
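As an illustration, a hypothetical AstroPulse_ATi_config.xml that overrides only -unroll and -sbs for the second OpenCL device (the values are placeholders, not tuned recommendations):

```xml
<device1>
	<unroll>12</unroll>
	<sbs>256</sbs>
</device1>
```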


For examples of app_info.xml entries, look into the text file with the .aistub extension provided in the corresponding package.
  (not available on 64bit linux or MacOSX)
  On Linux and MacOSX :
  See example_app_info-files directory for some examples.


Known issues:
- With 12.x Catalyst drivers, GPU usage can be low if the CPU is fully used by other loads.
  The same applies to NV drivers past 267.xx and to Intel SDK drivers.
- If you see low GPU usage on zero-blanked tasks, try to free one or more CPU cores. *
- For overflowed tasks, the found signal sequence does not always match the CPU version.
- On Linux : If OpenCL reports your GPU as having only half the GPU RAM it actually has,
  try adding the following to the ".profile" (hidden text file) of the account that BOINC runs under :
  export GPU_MAX_ALLOC_PERCENT=90
  export GPU_MAX_HEAP_SIZE=90
  Log out and back into that account for the Catalyst driver to pick up the new settings, or reboot the host.


Best usage tips:

For best performance when running multiple instances, it is important to free 2 CPU cores.
Freeing at least 1 CPU core is necessary to get enough GPU usage.*

* As an alternative solution, try the -cpu_lock / -cpu_lock_fixed_cpu N options.
   This might only work on fast multicore CPUs.  (not available on 64bit linux or MacOSX)

Command line parameters.
Command line switches can be used either in app_info.xml or in ap_cmdline_x86_64-pc-linux-gnu_SSE2_OpenCL_ATI.txt.
Params in ap_cmdline*.txt will override switches in the <cmdline> tag of app_info.xml.
_______________________


You do not have to set any command-line parameters to run this app.
It will autoconfigure to some default settings depending on your GPU device.


High end cards (more than 30 compute units)

-unroll 18 -ffa_block 16384 -ffa_block_fetch 8192

* Bigger unroll values (up to 20) don't necessarily result in better run times.

Mid range cards (12 - 24 compute units)

-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144

Entry level GPUs (less than 6 compute units)

-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024


-tune switch

possible values:

-tune 1 256 1 1
-tune 1 128 2 1
-tune 1 64 4 1
-tune 1 32 8 1
-tune 1 16 16 1

Intensive testing highlighted -tune 1 64 4 1 and -tune 1 32 8 1 as the fastest on the HD 7970 and R9 280X.
Further testing is required for other GPUs.

-oclFFT_plan switch (Use at your own risk !)
------------------------

FFT kernels are processed as 8-point FFT kernels by default.
Using a different FFT kernel plan can speed up processing significantly.
In most cases 16-point FFT kernels are fastest for AstroPulse v7.

-oclFFT_plan 256 16 256

Example:

High end cards
-unroll 18 -oclFFT_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 4 1 -tune 2 64 4 1

Mid range cards
-unroll 12 -oclFFT_plan 256 16 256 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 -tune 2 64 4 1


Your mileage might vary.
-----------------------------------------------------

App instances.
______________

On the HD 7950/7970 and R9 280X, running 2 instances should be fastest.
On the R9 290X, running 3 instances should be easily possible. Further testing is required.

On mid-range cards (HD 5770, 6850/6870, 7850/7870 and R7), best performance should come from running 2 instances.

If you experience screen lags, reduce the unroll factor and the ffa_block_fetch value.

On 64bit Linux (using any Catalyst fglrx driver before 13.4) or MacOSX, you can also try 3 instances on mid-range cards with at least 2GB of GPU RAM.
Examples:
Using a HD7750 (1GB) on 64bit Linux set  -unroll 10 -ffa_block 2048 -ffa_block_fetch 2048 -sbs 256  and run 1 instance.
Using a HD7850 (2GB, PitcairnPro) on 64bit Linux set -unroll 27 -ffa_block 960 -ffa_block_fetch 960 -sbs 256 and run 2 instances.
Using a HD7870 (2GB, Tahiti LE) on 64bit Linux set -unroll 30 -ffa_block 1024 -ffa_block_fetch 1024 -sbs 256 and run 2 instances.
Using a R9 290 (4GB, Hawaii Pro) on 64bit Linux set -unroll 50 -ffa_block 1024 -ffa_block_fetch 1024 -sbs 480 and run 1 instance!*

*Due to a driver restriction/bug on Linux, Hawaii GPUs can run only a single instance without too high an invalid rate.

Addendum:
_________

Running multiple cards in one system requires freeing another CPU core.

