Seti@Home optimized science apps and information
Topic started by: Raistmer on 18 Mar 2011, 07:18:25 am
-
This is an update of the ATI OpenCL build, intended to replace the rev 456 OpenCL build and the rev 449 OpenCL+Brook build.
Hardware requirements are the same as for the rev 456 OpenCL build (see the old release notes here: http://lunatics.kwsn.net/12-gpu-crunching/astropulse-for-ati-gpus-released.msg31201.html#msg31201 )
New command line switches:
-hp - sets high priority class
-no_cpu_lock - disables affinity setting
-instances_per_device N - allows running N copies on each supported GPU device (don't forget to set the <count> field in app_info to 1/N to instruct BOINC to launch N tasks per GPU).
-unroll N - sets the DATA_CHUNK_UNROLL variable to N. This processes N data chunks per FindSinglePulse kernel call, which improves performance in most cases but increases GPU memory requirements. On low-end GPUs it may be worth using lower values. The default is 10, as in r456 (where it was hardwired to 10).
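The 1/N rule for -instances_per_device can be sketched in a few lines of Python. This is only an illustration, not part of the app or of BOINC: the helper name gpu_sharing_settings and its unroll parameter are made up here, and rounding the fraction to four significant digits is an assumption.

```python
def gpu_sharing_settings(n_instances, unroll=10):
    """Return (count, cmdline) strings for running n_instances tasks per GPU.

    Hypothetical helper: it only mirrors the "set <count> to 1/N" rule from
    the release notes; it does not read or write app_info.xml.
    """
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    # <count> is the fraction of one GPU each task claims, i.e. 1/N.
    count = f"{1.0 / n_instances:.4g}"
    cmdline = f"-instances_per_device {n_instances} -unroll {unroll}"
    return count, cmdline


count, cmdline = gpu_sharing_settings(2)
print(f"<count>{count}</count>")
print(f"<cmdline>{cmdline}</cmdline>")
```

The two printed lines are meant to be pasted by hand into the <coproc> and <app_version> parts of the app_info section; any other switches you use still have to be added to the cmdline yourself.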
Known issues:
Don't forget to finish the current AP task before upgrading. Otherwise you will need to manually update the CL file not only in the SETI project directory but also in the corresponding slot directory. BOINC doesn't do this, a design flaw IMHO.
If you have a Mobility GPU, double-check that your configuration (GPU + driver) supports OpenCL. If it does not, ask ATI for OpenCL support.
app_info section for this app:
<app>
<name>astropulse_v505</name>
</app>
<file_info>
<name>ap_5.06_win_x86_SSE2_OpenCL_ATI_r516.exe</name>
<executable/>
</file_info>
<file_info>
<name>AstroPulse_Kernels.cl</name>
<executable/>
</file_info>
<app_version>
<app_name>astropulse_v505</app_name>
<version_num>506</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.04</avg_ncpus>
<max_ncpus>0.20</max_ncpus>
<plan_class>ati13ati</plan_class>
<cmdline>-instances_per_device 1 -hp -unroll 10 -ffa_block 4096 -ffa_block_fetch 2048</cmdline>
<flops>30987654321</flops>
<file_ref>
<file_name>ap_5.06_win_x86_SSE2_OpenCL_ATI_r516.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>AstroPulse_Kernels.cl</file_name>
<copy_file/>
</file_ref>
<coproc>
<type>ATI</type>
<count>1</count>
</coproc>
</app_version>
-
I'll be sure to test and report back.
Waiting on current AP to finish.
-
Hi Raistmer, only a quick question - I thought in Beta this was version_num 506 in your app_info? The release version_num in this thread is 505?
Makes no difference, but thought I'd ask ;)
-
It doesn't matter. I'll check what was in the previous release and correct it if it was 5.06.
[EDIT: changed to 506]
-
I didn't think it mattered, I was just wondering if there was a reason.
(And it broke my task monitoring app :o)
-
No reason besides some inaccuracy :) I took the app_info section straight from my production PC, and there I don't care much about meaningless numbers :)
Also, I think the flops section can (should?) be omitted now too. Any opinions on flops?
-
I've only got them in because I'm currently running 6.12.18 and the v7 app at Beta, and I didn't want that app screwing up the times for my other tasks there.
If it weren't for that I wouldn't bother anymore either.
Edit: another quick question: with app_info, the plan_class doesn't get used either, does it? (I can't use the standard plan_class, as I need a way to differentiate between Seti and Beta tasks.)
-
What do you want to use instead of ati13ati, then?
-
I use ati13ati for Seti tasks and ati13ati_beta for Beta tasks (and yes, I've also given CPU tasks a plan_class at Beta as well).
I got fed up with BOINC Manager and wrote a small app that gives me practically the same information with half the overhead of BM.
The only reason I ever open BM now is to start and stop processing and do network comms.
The problem is that Beta and Seti Main tasks are identical in client_state, so to separate them out in the app and get accurate info I had to change either the version_num or the plan_class values. I chose to change the plan_class.
As far as I can tell it has made no difference to work fetch or scheduling, as I think this is all done from the coproc value.
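For anyone wanting to try the same trick, here is a minimal sketch of grouping tasks by plan_class. It is not the monitoring app described above. The SAMPLE fragment is a simplified, made-up stand-in for client_state (the task names are hypothetical and a real file has many more fields), and group_by_plan_class is a name invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified, hypothetical stand-in for the <result> entries in client_state.
SAMPLE = """<client_state>
  <result><name>ap_task_a</name><plan_class>ati13ati</plan_class></result>
  <result><name>ap_task_b</name><plan_class>ati13ati_beta</plan_class></result>
  <result><name>ap_task_c</name><plan_class>ati13ati</plan_class></result>
</client_state>"""


def group_by_plan_class(xml_text):
    """Map each plan_class value to the list of task names that carry it."""
    groups = defaultdict(list)
    for result in ET.fromstring(xml_text).iter("result"):
        plan_class = result.findtext("plan_class", default="")
        groups[plan_class].append(result.findtext("name", default=""))
    return dict(groups)


for plan_class, names in group_by_plan_class(SAMPLE).items():
    print(plan_class, names)
```

With distinct plan_class values for Main and Beta, the two sets of tasks end up in separate groups even when everything else about them looks the same.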
-
Seems to work great on my ATI 4550. :-*
I had to set -unroll 5 instead of 10 otherwise I get computation errors.
-
Link to host?
-
http://setiathome.berkeley.edu/results.php?hostid=4876884
-
If you want to debug this problem you could install the APP SDK from AMD and check kernel execution times under the profiler. It's possible that they just take too long on this GPU (only 2 compute devices, 128 threads instead of 256, lower frequency than on the HD4870). The first reported error appears inside the FFT call.
-
Okay, I got the SDK installed and figured out how to launch the app with the profiler.
Is there any particular option I should add to sprofile or are the defaults what you are looking for?
I'll do a pass with unroll at 10 and another with 5.
-
The defaults will be OK.
-
Here you go.
3 CSV files with the defaults from sprofile:
"ATI4550_unroll_5.csv" is from about an hour of runtime.
"ATI4550_unroll_10.csv" and "ATI4550_unroll_10_2ndrun.csv" are from 2 attempts to run with unroll 10. I included both because they seem quite different. The application terminates in both cases (I erase all ap_state, fold.dat, pulse.out etc. between each run).
-
Gonna give it another try on a new host: i7-2600, 2x EAH5870, Win7 64-bit Pro, BOINC 6.10.60 64-bit.
What unroll value is OK to try on these cards?
-
My cards seem to like unroll at 10, but you'll need to adjust yours to your own liking. 10 is a good starting place.
-
Hello, I started testing the ATI AP app rev. 516 on this rig. (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5884985)
I tried some different unroll values, like 12, 13 and 14; even at 12 the screen lag becomes too heavy, so I put it back to 10.
But I did double ffa_block and ffa_block_fetch, and I run 2 AP tasks per 5870 (2 cards) plus 8 MB WUs using the SSSE3x flavor. Memory use is quite high (and temps are quite high for 4x 2 GB DDR3 @ 1333 MHz), but the computer is stable and can (and will) be used for other things, except playing MPEG-2 or a game ;)
If the AP WUs validate, I can start using the MB app (rev. 177). Trying to learn the ins and outs of OpenCL... :o
-
I couldn't help playing with ffa_block_fetch and unroll while running WUs, but the first 5 AP WUs with rev. 516 have validated.
The last of the 5 AP WUs: (http://setiathome.berkeley.edu/workunit.php?wuid=728581935)
Well, no harm done. ::)
B.t.w., I still had some 100 Collatz C. WUs with deadlines from 10 minutes to 2 or 3 days, so it runs a few at night, when it is cooler ;)
But the GPUs are almost trashed by the C.C. load: fans at max, temps at max. When I go to sleep I have this one in my bedroom, and it is quite noisy under such TREATMENT. That is not good for the average life span and safe use of the host, because it gets really hot.
Also have some MW, but is it still active?
Back on topic, though.
-
Some more validations with rev. 516. GPU use is almost 100%, and CPU use depends heavily on the blanking %. Here are the latest results:
This one. (http://setiathome.berkeley.edu/workunit.php?wuid=728375448)
On this host. (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5884985)
Another one. (http://setiathome.berkeley.edu/workunit.php?wuid=728356762)
-
Last validated AP WU. (http://setiathome.berkeley.edu/workunit.php?wuid=728421332)
<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 3; AstroPulse v. 5.06
Non-graphics FFTW USE_CONVERSION_OPT
Windows x86 rev 516, 5.06 match, by Raistmer with support of Lunatics.kwsn.net team. SSE2
OpenCL version by Raistmer
oclFFT fix for ATI GPUs by Urs Echternacht
ffa threshold mod, by Joe Segur.
static fftw lib, built by Jason G.
SSE3 dechirping by JDWhale
Build features: Non-graphics OpenCL COMBINED_DECHIRP_KERNEL FFTW USE_INCREASED_PRECISION USE_SSE2 x86
CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Cache: L1=64K L2=256K
CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Number of OpenCL platforms: 1
OpenCL Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Max compute units: 20
Max work group size: 256
Max clock frequency: 875Mhz
Max memory allocation: 134217728
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Queue properties:
Out-of-Order: No
Name: Cypress
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1332
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Max compute units: 20
Max work group size: 256
Max clock frequency: 875Mhz
Max memory allocation: 134217728
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Queue properties:
Out-of-Order: No
Name: Cypress
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1332
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Info : Building Program (clBuildProgram):main kernels: OK code 0
single pulses: 1
repetitive pulses: 0
percent blanked: 0.00
class T_remove_radar: total=3.71e+009, N=1, <>=3.71e+009, min=3.71e+009, max=3.71e+009
class T_main_loop_L1: total=3.35e+013, N=111, <>=3.02e+011, min=2.20e+011, max=3.70e+011
class T_FFT_forward: total=8.62e+009, N=182040, <>=4.73e+004, min=1.16e+004, max=3.20e+008
class T_remove_radar_randomize: total=2.20e+009, N=1817736, <>=1.21e+003, min=3.50e+002, max=1.22e+008
class T_build_chirp_table: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_DataWrite: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_DataWrite_ns: total=0, N=0, <>=0, min=0 max=0
class T_oclReadBuf: total=6.70e+006, N=182040, <>=3.60e+001, min=1.80e+001, max=2.11e+003
class T_ChirpWrite: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_ChirpWrite_ns: total=0, N=0, <>=0, min=0 max=0
class T_dechirp: total=7.42e+009, N=182040, <>=4.07e+004, min=1.60e+004, max=1.21e+008
class Dechirp_ns: total=0, N=0, <>=0, min=0 max=0
class Half_ns: total=0, N=0, <>=0, min=0 max=0
class T_PC_single_pulse_kernel_FFA_update: total=1.22e+013, N=182040, <>=6.70e+007, min=2.15e+007, max=6.12e+008
class PC_ns: total=0, N=0, <>=0, min=0 max=0
class T_oclReadBuf: total=6.70e+006, N=182040, <>=3.60e+001, min=1.80e+001, max=2.11e+003
class T_oclWriteBuf: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_FFT_inverse: total=3.22e+009, N=182040, <>=1.77e+004, min=9.08e+003, max=1.21e+008
class T_ffa: total=2.13e+013, N=1998, <>=1.06e+010, min=1.15e+009, max=5.62e+010
class T_GPU_buffer_read_backs: total=2, N=2, <>=1, min=1 max=1
USE_OPENCL OPENCL_WRITE USE_INCREASED_PRECISION SMALL_CHIRP_TABLE
rev 516
19:25:24 (3200): called boinc_finish
</stderr_txt>
]]>
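The "class ..." timer lines at the end of a log like this can be pulled apart with a small script when comparing runs with different switches. The sketch below is not a tool from the app: parse_timers and TIMER_RE are names made up here, the sample lines are copied from the log above, and since the log does not state the units of the totals, the script only prints ratios against T_main_loop_L1.

```python
import re

# Matches e.g. "class T_ffa: total=2.13e+013, N=1998, <>=1.06e+010, ..."
TIMER_RE = re.compile(r"class\s+(\w+):\s+total=([0-9.eE+-]+),\s+N=(\d+)")


def parse_timers(stderr_text):
    """Return {timer_name: (total, call_count)} for every timer line found."""
    timers = {}
    for line in stderr_text.splitlines():
        match = TIMER_RE.search(line)
        if match:
            name, total, calls = match.groups()
            timers[name] = (float(total), int(calls))
    return timers


SAMPLE = """\
class T_main_loop_L1: total=3.35e+013, N=111, <>=3.02e+011, min=2.20e+011, max=3.70e+011
class T_PC_single_pulse_kernel_FFA_update: total=1.22e+013, N=182040, <>=6.70e+007, min=2.15e+007, max=6.12e+008
class T_ffa: total=2.13e+013, N=1998, <>=1.06e+010, min=1.15e+009, max=5.62e+010
class T_DataWrite_ns: total=0, N=0, <>=0, min=0 max=0
"""

timers = parse_timers(SAMPLE)
main_total = timers["T_main_loop_L1"][0]
for name, (total, calls) in timers.items():
    print(f"{name}: {total / main_total:.1%} of T_main_loop_L1, {calls} calls")
```

Whether the individual timers nest inside T_main_loop_L1 or overlap each other is not stated in the log, so the percentages are best read as a rough comparison between runs rather than an exact breakdown.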
-
It's quiet in here, but I'm still trying different settings: DATA_CHUNK_UNROLL=16, ffa_block=10240 and ffa_block_fetch=2048 (5:1), which gives an almost constant 48%-58% GPU load while doing 2 at a time on 2 EAH5870s. It starts to look like a sweet spot, so I'll let these run, since I still have AP WUs on this host. (http://setiathome.berkeley.edu/show_host_detail.php?hostid=5884985)
There is also almost no screen lag.
-
Two validated AP WUs with rev. 516 and a few changes in ffa_block and ffa_block_fetch, unroll=16:
Both ATI AP tasks. (http://setiathome.berkeley.edu/workunit.php?wuid=730487052)
ATI and stock app. (http://setiathome.berkeley.edu/workunit.php?wuid=730485662)
<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:4096
FFA thread fetchblock override value:2048
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 3; AstroPulse v. 5.06
Non-graphics FFTW USE_CONVERSION_OPT
Windows x86 rev 516, 5.06 match, by Raistmer with support of Lunatics.kwsn.net team. SSE2
OpenCL version by Raistmer
oclFFT fix for ATI GPUs by Urs Echternacht
ffa threshold mod, by Joe Segur.
static fftw lib, built by Jason G.
SSE3 dechirping by JDWhale
Build features: Non-graphics OpenCL COMBINED_DECHIRP_KERNEL FFTW USE_INCREASED_PRECISION USE_SSE2 x86
CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Cache: L1=64K L2=256K
CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Number of OpenCL platforms: 1
OpenCL Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Max compute units: 20
Max work group size: 256
Max clock frequency: 890Mhz
Max memory allocation: 134217728
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Queue properties:
Out-of-Order: No
Name: Cypress
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1332
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Max compute units: 20
Max work group size: 256
Max clock frequency: 890Mhz
Max memory allocation: 134217728
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Queue properties:
Out-of-Order: No
Name: Cypress
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1332
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Info : Building Program (clBuildProgram):main kernels: OK code 0
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:6144
FFA thread fetchblock override value:2048
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2; ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 3; ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:2048
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2; ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:12
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2; ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:15
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2; ### Restart at 92.79 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0
single pulses: 3
repetitive pulses: 1
percent blanked: 8.89
class T_remove_radar: total=4.45e+009, N=1, <>=4.45e+009, min=4.45e+009, max=4.45e+009
class T_main_loop_L1: total=1.04e+012, N=7, <>=1.48e+011, min=1.36e+011, max=1.98e+011
class T_FFT_forward: total=9.72e+009, N=7672, <>=1.27e+006, min=1.60e+004, max=9.38e+009
class T_remove_radar_randomize: total=1.20e+011, N=114632, <>=1.05e+006, min=3.59e+002, max=1.31e+008
class T_build_chirp_table: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_DataWrite: total=4.60e+007, N=840, <>=5.47e+004, min=1.96e+004, max=2.54e+005
class T_DataWrite_ns: total=0, N=0, <>=0, min=0 max=0
class T_oclReadBuf: total=2.91e+005, N=7672, <>=3.70e+001, min=1.80e+001, max=1.21e+003
class T_ChirpWrite: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_ChirpWrite_ns: total=0, N=0, <>=0, min=0 max=0
class T_dechirp: total=2.81e+008, N=7672, <>=3.66e+004, min=2.06e+004, max=1.66e+006
class Dechirp_ns: total=0, N=0, <>=0, min=0 max=0
class Half_ns: total=0, N=0, <>=0, min=0 max=0
class T_PC_single_pulse_kernel_FFA_update: total=4.37e+011, N=7672, <>=5.69e+007, min=3.07e+007, max=1.30e+010
class PC_ns: total=0, N=0, <>=0, min=0 max=0
class T_oclReadBuf: total=2.91e+005, N=7672, <>=3.70e+001, min=1.80e+001, max=1.21e+003
class T_oclWriteBuf: total=4.68e+007, N=840, <>=5.57e+004, min=2.00e+004, max=2.56e+005
class T_FFT_inverse: total=1.18e+008, N=7672, <>=1.54e+004, min=1.05e+004, max=4.21e+005
class T_ffa: total=4.49e+011, N=126, <>=3.56e+009, min=1.34e+009, max=2.01e+010
class T_GPU_buffer_read_backs: total=1, N=1, <>=1, min=1 max=1
USE_OPENCL OPENCL_WRITE USE_INCREASED_PRECISION SMALL_CHIRP_TABLE
rev 516
14:40:55 (3104): called boinc_finish
</stderr_txt>
]]>
I'll start MB (rev. 177) too. Is it possible to run AP and MB on the GPU at the same time?
-
If both are configured appropriately (for a 2-instance run), it should be possible.
-
I just downloaded from your Russian site, or at least tried to, like the previous time (rev. 516), but got rev. 521 and installed it.
Since I can't use an AC, over the last days (Friday, Saturday, Sunday and today, Tuesday) temps were 25C to 31C and I had to shut down all but 1 rig (X9650 @ 3.51 GHz + 1x GTX480), which runs without a case and has no heat problems. (Computer cases, 9 out of 10, aren't up to this job; 1, 2 or more GPUs produce so much heat that they should have their own separate casing, in or out of the case!)
I've got them up and running now. They appeared to have some MW WUs (deadline 1 to 2 days); after those I can try your latest rev. 521 for AP work.
I saw 2 AP WUs running on 1 HD5870 that looked like they'd crashed!
Is it better to try 1 at a time, with cmd line options similar to those used with rev. 516?
-
Yes, the options should be the same.