ATI OpenCL AstroPulse (rev516) released

Fredericx51:
Some more validations with rev. 516. GPU use is almost 100%; CPU use depends heavily on the blanking percentage. Here are the latest
results: this one.

On this host.

Another one.



Fredericx51:
Last validated AP WU.

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 3;   AstroPulse v. 5.06
Non-graphics   FFTW   USE_CONVERSION_OPT   
Windows x86 rev 516, 5.06 match, by Raistmer with support of Lunatics.kwsn.net team.   SSE2

OpenCL version by Raistmer

oclFFT fix for ATI GPUs by Urs Echternacht
ffa threshold mod, by Joe Segur.
static fftw lib, built by Jason G.
SSE3 dechirping by JDWhale

Build features: Non-graphics   OpenCL   COMBINED_DECHIRP_KERNEL   FFTW   USE_INCREASED_PRECISION   USE_SSE2   x86   
     CPUID:         Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

     Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Number of OpenCL platforms:             1


 OpenCL Platform Name:                AMD Accelerated Parallel Processing
Number of devices:             2
  Max compute units:             20
  Max work group size:             256
  Max clock frequency:             875Mhz
  Max memory allocation:          134217728
  Cache type:                None
  Cache line size:             0
  Cache size:                0
  Global memory size:             536870912
  Constant buffer size:             65536
  Max number of constant args:          8
  Local memory type:             Scratchpad
  Local memory size:             32768
  Queue properties:            
    Out-of-Order:             No
  Name:                   Cypress
  Vendor:                Advanced Micro Devices, Inc.
  Driver version:             CAL 1.4.1332
  Version:                OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
  Extensions:                cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
  Max compute units:             20
  Max work group size:             256
  Max clock frequency:             875Mhz
  Max memory allocation:          134217728
  Cache type:                None
  Cache line size:             0
  Cache size:                0
  Global memory size:             536870912
  Constant buffer size:             65536
  Max number of constant args:          8
  Local memory type:             Scratchpad
  Local memory size:             32768
  Queue properties:            
    Out-of-Order:             No
  Name:                   Cypress
  Vendor:                Advanced Micro Devices, Inc.
  Driver version:             CAL 1.4.1332
  Version:                OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
  Extensions:                cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing


Info : Building Program (clBuildProgram):main kernels: OK code 0


    single pulses: 1
repetitive pulses: 0
  percent blanked: 0.00
class T_remove_radar:   total=3.71e+009,   N=1,   <>=3.71e+009,   min=3.71e+009,   max=3.71e+009
class T_main_loop_L1:   total=3.35e+013,   N=111,   <>=3.02e+011,   min=2.20e+011,   max=3.70e+011
 class T_FFT_forward:   total=8.62e+009,   N=182040,   <>=4.73e+004,   min=1.16e+004,   max=3.20e+008
 class T_remove_radar_randomize:   total=2.20e+009,   N=1817736,   <>=1.21e+003,   min=3.50e+002,   max=1.22e+008
 class T_build_chirp_table:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
 class T_DataWrite:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
  class T_DataWrite_ns:   total=0,   N=0,   <>=0,   min=0   max=0
 class T_oclReadBuf:   total=6.70e+006,   N=182040,   <>=3.60e+001,   min=1.80e+001,   max=2.11e+003
   class T_ChirpWrite:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
    class T_ChirpWrite_ns:   total=0,   N=0,   <>=0,   min=0   max=0
 class T_dechirp:   total=7.42e+009,   N=182040,   <>=4.07e+004,   min=1.60e+004,   max=1.21e+008
  class Dechirp_ns:   total=0,   N=0,   <>=0,   min=0   max=0
  class Half_ns:   total=0,   N=0,   <>=0,   min=0   max=0
 class T_PC_single_pulse_kernel_FFA_update:   total=1.22e+013,   N=182040,   <>=6.70e+007,   min=2.15e+007,   max=6.12e+008
  class PC_ns:   total=0,   N=0,   <>=0,   min=0   max=0
class T_oclReadBuf:   total=6.70e+006,   N=182040,   <>=3.60e+001,   min=1.80e+001,   max=2.11e+003
class T_oclWriteBuf:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
  class T_FFT_inverse:   total=3.22e+009,   N=182040,   <>=1.77e+004,   min=9.08e+003,   max=1.21e+008
 class T_ffa:   total=2.13e+013,   N=1998,   <>=1.06e+010,   min=1.15e+009,   max=5.62e+010
class T_GPU_buffer_read_backs:   total=2,   N=2,   <>=1,   min=1   max=1
USE_OPENCL   OPENCL_WRITE   USE_INCREASED_PRECISION   SMALL_CHIRP_TABLE   
rev 516
19:25:24 (3200): called boinc_finish

</stderr_txt>
]]>
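
For anyone who wants to compare runs, here is a minimal sketch that reads the "class T_...:" timer lines from the stderr above and reports each timer's total as a fraction of T_main_loop_L1. The regular expression and the helper names (parse_timers, shares) are my own assumptions, not part of the app. The units of the totals are not stated in the log, so only ratios are computed. Timers appear to be nested, so the shares need not add up to 100%. Rows with N=0 are skipped as unused.

```python
import re

# Timer lines copied from the rev 516 stderr output above (first task).
LOG = """\
class T_main_loop_L1:   total=3.35e+013,   N=111,   <>=3.02e+011,   min=2.20e+011,   max=3.70e+011
 class T_FFT_forward:   total=8.62e+009,   N=182040,   <>=4.73e+004,   min=1.16e+004,   max=3.20e+008
 class T_build_chirp_table:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
 class T_dechirp:   total=7.42e+009,   N=182040,   <>=4.07e+004,   min=1.60e+004,   max=1.21e+008
 class T_PC_single_pulse_kernel_FFA_update:   total=1.22e+013,   N=182040,   <>=6.70e+007,   min=2.15e+007,   max=6.12e+008
  class T_FFT_inverse:   total=3.22e+009,   N=182040,   <>=1.77e+004,   min=9.08e+003,   max=1.21e+008
 class T_ffa:   total=2.13e+013,   N=1998,   <>=1.06e+010,   min=1.15e+009,   max=5.62e+010
"""

# Hypothetical pattern: timer name, total, and call count N.
LINE_RE = re.compile(r"class\s+(\w+):\s+total=([0-9.e+\-]+),\s+N=(\d+)")


def parse_timers(text):
    """Return {timer_name: total} for every timer that was hit at least once."""
    timers = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m and int(m.group(3)) > 0:  # N=0 means the timer was never used
            timers[m.group(1)] = float(m.group(2))
    return timers


def shares(timers, parent="T_main_loop_L1"):
    """Express each timer's total as a fraction of the parent timer's total."""
    base = timers[parent]
    return {name: total / base for name, total in timers.items() if name != parent}


# Print the timers from largest to smallest share of the main loop.
for name, frac in sorted(shares(parse_timers(LOG)).items(), key=lambda kv: -kv[1]):
    print(f"{name:40s} {frac:6.1%}")
```

On this task the FFA stage and the single-pulse/FFA-update kernel dominate the main loop, which fits with the FFA block settings being the ones worth tuning.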

Fredericx51:
It's quiet in here, but I'm still trying different settings: unroll_data_chunk=16, ffa_block=10240 and ffa_block_fetch=2048 (5:1). This gives
an almost constant 48%-58% GPU load while running 2 at a time on 2 EAH5870s. It's starting to look like a sweet spot, so I'll let these run, since I still
have AP WUs on this host.

Also almost no screen lag.
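
The "(5:1)" above is just ffa_block divided by ffa_block_fetch. A small sketch for checking the ratio of other combinations tried in this thread follows. The helper name block_ratio is made up for illustration, and nothing here says which ratio the app prefers.

```python
from math import gcd


def block_ratio(ffa_block, ffa_block_fetch):
    """Reduce ffa_block : ffa_block_fetch to lowest terms."""
    g = gcd(ffa_block, ffa_block_fetch)
    return ffa_block // g, ffa_block_fetch // g


# (ffa_block, ffa_block_fetch) pairs mentioned in this thread.
SETTINGS = [(10240, 2048), (8192, 4096), (6144, 2048), (5120, 1024)]

for block, fetch in SETTINGS:
    a, b = block_ratio(block, fetch)
    print(f"ffa_block={block:5d} ffa_block_fetch={fetch:4d} -> {a}:{b}")
```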


Fredericx51:
Two validated AP WUs with rev. 516, after a few changes to ffa_block and ffa_block_fetch, with unroll=16:

Both ATI AP tasks.

ATI and stock app.

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:4096
FFA thread fetchblock override value:2048
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 3;   AstroPulse v. 5.06
Non-graphics   FFTW   USE_CONVERSION_OPT   
Windows x86 rev 516, 5.06 match, by Raistmer with support of Lunatics.kwsn.net team.   SSE2

OpenCL version by Raistmer

oclFFT fix for ATI GPUs by Urs Echternacht
ffa threshold mod, by Joe Segur.
static fftw lib, built by Jason G.
SSE3 dechirping by JDWhale

Build features: Non-graphics   OpenCL   COMBINED_DECHIRP_KERNEL   FFTW   USE_INCREASED_PRECISION   USE_SSE2   x86   
     CPUID:         Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

     Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
Number of OpenCL platforms:             1


 OpenCL Platform Name:                AMD Accelerated Parallel Processing
Number of devices:             2
  Max compute units:             20
  Max work group size:             256
  Max clock frequency:             890Mhz
  Max memory allocation:          134217728
  Cache type:                None
  Cache line size:             0
  Cache size:                0
  Global memory size:             536870912
  Constant buffer size:             65536
  Max number of constant args:          8
  Local memory type:             Scratchpad
  Local memory size:             32768
  Queue properties:            
    Out-of-Order:             No
  Name:                   Cypress
  Vendor:                Advanced Micro Devices, Inc.
  Driver version:             CAL 1.4.1332
  Version:                OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
  Extensions:                cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
  Max compute units:             20
  Max work group size:             256
  Max clock frequency:             890Mhz
  Max memory allocation:          134217728
  Cache type:                None
  Cache line size:             0
  Cache size:                0
  Global memory size:             536870912
  Constant buffer size:             65536
  Max number of constant args:          8
  Local memory type:             Scratchpad
  Local memory size:             32768
  Queue properties:            
    Out-of-Order:             No
  Name:                   Cypress
  Vendor:                Advanced Micro Devices, Inc.
  Driver version:             CAL 1.4.1332
  Version:                OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
  Extensions:                cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing


Info : Building Program (clBuildProgram):main kernels: OK code 0

Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:6144
FFA thread fetchblock override value:2048
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2;   ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0

Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 3;   ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0

Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:2048
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2;   ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0

Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:12
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2;   ### Restart at 78.38 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0

Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:15
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 1 device, slots 2 to 3 (including) will be checked
Used slot is 2;   ### Restart at 92.79 percent.
Info : Building Program (clBuildProgram):main kernels: OK code 0


    single pulses: 3
repetitive pulses: 1
  percent blanked: 8.89
class T_remove_radar:   total=4.45e+009,   N=1,   <>=4.45e+009,   min=4.45e+009,   max=4.45e+009
class T_main_loop_L1:   total=1.04e+012,   N=7,   <>=1.48e+011,   min=1.36e+011,   max=1.98e+011
 class T_FFT_forward:   total=9.72e+009,   N=7672,   <>=1.27e+006,   min=1.60e+004,   max=9.38e+009
 class T_remove_radar_randomize:   total=1.20e+011,   N=114632,   <>=1.05e+006,   min=3.59e+002,   max=1.31e+008
 class T_build_chirp_table:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
 class T_DataWrite:   total=4.60e+007,   N=840,   <>=5.47e+004,   min=1.96e+004,   max=2.54e+005
  class T_DataWrite_ns:   total=0,   N=0,   <>=0,   min=0   max=0
 class T_oclReadBuf:   total=2.91e+005,   N=7672,   <>=3.70e+001,   min=1.80e+001,   max=1.21e+003
   class T_ChirpWrite:   total=0.00e+000,   N=0,   <>=0.00e+000,   min=1.84e+019,   max=0.00e+000
    class T_ChirpWrite_ns:   total=0,   N=0,   <>=0,   min=0   max=0
 class T_dechirp:   total=2.81e+008,   N=7672,   <>=3.66e+004,   min=2.06e+004,   max=1.66e+006
  class Dechirp_ns:   total=0,   N=0,   <>=0,   min=0   max=0
  class Half_ns:   total=0,   N=0,   <>=0,   min=0   max=0
 class T_PC_single_pulse_kernel_FFA_update:   total=4.37e+011,   N=7672,   <>=5.69e+007,   min=3.07e+007,   max=1.30e+010
  class PC_ns:   total=0,   N=0,   <>=0,   min=0   max=0
class T_oclReadBuf:   total=2.91e+005,   N=7672,   <>=3.70e+001,   min=1.80e+001,   max=1.21e+003
class T_oclWriteBuf:   total=4.68e+007,   N=840,   <>=5.57e+004,   min=2.00e+004,   max=2.56e+005
  class T_FFT_inverse:   total=1.18e+008,   N=7672,   <>=1.54e+004,   min=1.05e+004,   max=4.21e+005
 class T_ffa:   total=4.49e+011,   N=126,   <>=3.56e+009,   min=1.34e+009,   max=2.01e+010
class T_GPU_buffer_read_backs:   total=1,   N=1,   <>=1,   min=1   max=1
USE_OPENCL   OPENCL_WRITE   USE_INCREASED_PRECISION   SMALL_CHIRP_TABLE   
rev 516
14:40:55 (3104): called boinc_finish

</stderr_txt>
]]>
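
Since this task was restarted several times with different settings, a rough sketch for listing the settings used in each run may help when comparing. It assumes each run in the stderr ends with a "Used slot is" line, as in the output above. The regular expressions and the parse_runs helper are illustrative assumptions, not part of the app. Only three of the runs are copied into the sample to keep it short.

```python
import re

# Excerpt of the stderr above: the first run plus two of the restarts.
STDERR = """\
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:4096
FFA thread fetchblock override value:2048
Used slot is 3;   AstroPulse v. 5.06
DATA_CHUNK_UNROLL setted to:16
FFA thread block override value:6144
FFA thread fetchblock override value:2048
Used slot is 2;   ### Restart at 78.38 percent.
DATA_CHUNK_UNROLL setted to:15
FFA thread block override value:5120
FFA thread fetchblock override value:1024
Used slot is 2;   ### Restart at 92.79 percent.
"""

# Hypothetical patterns matching the setting lines exactly as logged.
PATTERNS = {
    "unroll": re.compile(r"DATA_CHUNK_UNROLL setted to:(\d+)"),
    "ffa_block": re.compile(r"FFA thread block override value:(\d+)"),
    "ffa_block_fetch": re.compile(r"FFA thread fetchblock override value:(\d+)"),
}
SLOT_RE = re.compile(r"Used slot is (\d+)")
RESTART_RE = re.compile(r"Restart at ([\d.]+) percent")


def parse_runs(text):
    """Return one dict of settings per run, in the order they appear."""
    runs, current = [], {}
    for line in text.splitlines():
        for key, pattern in PATTERNS.items():
            m = pattern.search(line)
            if m:
                current[key] = int(m.group(1))
        slot = SLOT_RE.search(line)
        if slot:  # assumed to mark the end of one run's header
            restart = RESTART_RE.search(line)
            current["slot"] = int(slot.group(1))
            # None means no restart marker was logged (a fresh start).
            current["restart_pct"] = float(restart.group(1)) if restart else None
            runs.append(current)
            current = {}
    return runs


for run in parse_runs(STDERR):
    print(run)
```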

I'll start MB (rev. 177), too. Is it possible to run AP and MB on the GPU at the same time?


Raistmer:

--- Quote from: Fredericx51 on 24 Apr 2011, 06:59:35 pm ---
I'll start MB (rev.177), too, is it possible to run AP & MB on GPU, at the same time?

--- End quote ---

If both are configured appropriately (for running 2 instances), it should be possible.
